Can MongoDB store and manage UTF-8 strings with code points outside the Basic Multilingual Plane?

In MongoDB 2.0.6, when I try to store or query documents whose string fields contain characters outside the BMP, I get a lot of errors, such as "Invalid UTF-16: 55357" or "buffer too small".

What settings, changes, or recommendations allow you to store and query multilingual strings in Mongo, especially those that include these characters above 0xFFFF?

Thanks.

+7
1 answer

There are a few issues here:

1) Keep in mind that MongoDB stores all documents in BSON format. Also note that the BSON specification refers to a UTF-8 string encoding, not a UTF-16 encoding.

Link: http://bsonspec.org/#/specification
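The number in the "Invalid UTF-16: 55357" error from the question is 0xD83D, which is a UTF-16 high surrogate. As a minimal sketch (Python 3 assumed, and U+1F600 chosen only as an arbitrary example of a code point above 0xFFFF, not taken from the question), the same character looks like this in UTF-8 versus UTF-16:

```python
# U+1F600 is an arbitrary example code point above 0xFFFF (an assumption for illustration).
ch = "\U0001F600"

utf8_bytes = ch.encode("utf-8")       # the encoding the BSON spec uses for strings
utf16_bytes = ch.encode("utf-16-be")  # the form a UTF-16 based runtime works with

# Split the UTF-16 bytes into two 16-bit code units (a surrogate pair).
high = int.from_bytes(utf16_bytes[0:2], "big")
low = int.from_bytes(utf16_bytes[2:4], "big")

print(len(utf8_bytes))      # 4 (one code point, four UTF-8 bytes)
print(high, low)            # 55357 56832
print(hex(high), hex(low))  # 0xd83d 0xde00
```

So a value such as 55357 appearing on its own suggests that only one half of a surrogate pair reached the code that reported the error.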

2) All drivers, including the JavaScript driver in the mongo shell, should correctly handle strings that are encoded as UTF-8. (If they do not, that is a bug!) Many of the drivers happen to handle UTF-16 correctly as well, although, as far as I know, UTF-16 is not officially supported.

3) When I tested this with the Python driver, MongoDB was able to load and return a string value containing a broken UTF-16 surrogate pair. However, I could not load a broken surrogate pair using the mongo shell, nor could I store a string containing a broken surrogate pair in a JavaScript variable in the shell.

4) mapReduce() works correctly on string data that uses a well-formed UTF-16 surrogate pair, but running mapReduce() on string data that contains a broken surrogate pair produces an error.

It appears that mapReduce() fails at the point where MongoDB tries to convert the BSON into a JavaScript variable for use by the JavaScript engine.

5) I filed Jira issue SERVER-6747 for this problem. Feel free to follow it and vote for it.
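Until that issue is resolved, one possible workaround is to check strings for broken surrogate pairs before they reach MongoDB. The following is only a sketch under stated assumptions: it targets Python 3 (where a well-formed non-BMP character is a single code point, so any surrogate found in a str is unpaired data), it does not call the MongoDB driver at all, and the helper names find_lone_surrogates and is_strict_utf8 are hypothetical.

```python
def find_lone_surrogates(text):
    """Return the indexes of surrogate code points in a Python 3 str.

    Hypothetical helper: in Python 3 a valid non-BMP character is one
    code point, so anything in the range U+D800..U+DFFF is suspect.
    """
    return [i for i, c in enumerate(text) if 0xD800 <= ord(c) <= 0xDFFF]


def is_strict_utf8(text):
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


good = "ok \U0001F600"   # well-formed non-BMP character
broken = "bad \ud83d"    # high surrogate with no low surrogate after it

print(find_lone_surrogates(good))    # []
print(find_lone_surrogates(broken))  # [4]
print(is_strict_utf8(good), is_strict_utf8(broken))  # True False
```

Strings that fail this check could be rejected or repaired before insertion, so that they never end up in documents later processed by mapReduce() or the shell.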

+6