German MongoDB Text Index

Question

In my article collection, I have a text index:

{ "v" : 1, "key" : { "_fts" : "text", "_ftsx" : 1 }, "name" : "title_text_abstract_text_body_text", "ns" : "foo.articles", "weights" : { "abstract" : 1, "body" : 1, "title" : 1 }, "default_language" : "english", "language_override" : "language", "textIndexVersion" : 2 }

In my article collection, I have an entry like this:

 { "_id" : ObjectId("5477c28c807a9cd660ccd567"), "title" : "Hallo Welt!", "author" : "foo", "publishDate" : ISODate("2014-11-28T17:00:00Z"), "language" : "de", "abstract" : "Mein erster Artikel!", "body" : "Dieser Artikel ist in deutscher Sprache.", "__v" : 0 }

(In reality, abstract and body contain different values; for brevity, assume they look like the ones above.)

When I try to find this article:

 db.articles.find({$text: {$search: 'Welt'}})

It is found.

But when I try to find the article like this:

 db.articles.find({$text: {$search: 'Sprache'}})

I don't get any results. But after I change language to en or none, the same query returns the article.

What am I doing wrong?

Edit: As requested in the comments, here are the exact commands that reproduce the behavior described above. I should have posted it this way from the start, sorry.

 > db.test.drop()
 true
 > db.test.insert({language: "de", body: "vermutlich", title: "Artikel"})
 WriteResult({ "nInserted" : 1 })
 > db.test.ensureIndex({body: "text", title: "text"})
 { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
 > db.test.find({$text: {$search: 'vermutlich'}})
 > db.test.find({$text: {$search: 'Artikel'}})
 { "_id" : ObjectId("54ea86d6c9ec98269e022c67"), "language" : "de", "body" : "vermutlich", "title" : "Artikel" }
 > db.version()
 2.6.5

I also tried changing the language again:

 > db.test.update({}, {$set: {language: "en"}})
 WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
 > db.test.find({$text: {$search: 'Artikel'}})
 { "_id" : ObjectId("54ea86d6c9ec98269e022c67"), "language" : "en", "body" : "vermutlich", "title" : "Artikel" }
 > db.test.find({$text: {$search: 'vermutlich'}})
 { "_id" : ObjectId("54ea86d6c9ec98269e022c67"), "language" : "en", "body" : "vermutlich", "title" : "Artikel" }

Edit: Okay, I just tried to reproduce this example, but I also added a German quote. This is what I did:

 > db.test.drop()
 true
 > db.test.insert({ language: "portuguese", original: "A sorte protege os audazes.", translation: [{ language: "english", quote: "Fortune favors the bold."},{ language: "spanish", quote: "La suerte rotege a los audaces."}]})
 WriteResult({ "nInserted" : 1 })
 > db.test.insert({ language: "spanish", original: "Nada hay más surrealista que la realidad.", translation:[{language: "english",quote: "There is nothing more surreal than reality."},{language: "french",quote: "Il n'y a rien de plus surréaliste que la réalité."}]})
 WriteResult({ "nInserted" : 1 })
 > db.test.insert({ original: "is this dagger which I see before me.", translation: {language: "spanish",quote: "Es este un puñal que veo delante de mí." }})
 WriteResult({ "nInserted" : 1 })
 > db.test.insert({original: "Die Geister, die ich rief...", language: "german", translation: {language: "english", quote: "The spirits that I've cited..."}})
 WriteResult({ "nInserted" : 1 })
 > db.test.ensureIndex( { original: "text", "translation.quote": "text" } )
 { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }

Then I tried a few queries:

 > db.test.count({$text: {$search: "delante"}})
 1
 > db.test.count({$text: {$search: "spirits"}})
 1
 > db.test.count({$text: {$search: "Geister"}})
 0

Conclusion: MongoDB text search does not work with German? This is really frustrating.

+8

mongodb mongodb-query

dave Feb 21 '15 at 2:00 p.m.

3 answers

mnemosyn · Answer 1 · 2015-03-01T20:38:51+0000

wdberkeley is right, but I'd like to add a quick explanation of stemming, because I doubt that users without experience in this area will get the gist. I would also like to point out some alternatives and general limitations.

In many languages, words change considerably due to grammar rules, e.g. the German word "Geist" (mind / ghost / spirit):

 "Geist" (singular) -> "Geister" (plural) -> "Geistern" (plural dative)

This effect also exists in English, but it is less pronounced. Examples:

 "house" -> "houses" // "mouse" -> "mice" // "notable" -> "notably"

Usually we want searches to ignore this kind of grammatical variation, so a search for "Geist" should find any of the forms above. Getting this right is extremely difficult, because language rules are complex and the correct answer cannot always be determined without context.

Suffix stripping is a common and relatively simple approach: it assumes that certain endings are most likely just inflections and can be removed to obtain the stem. Sometimes stemmers deliberately remove letters that actually belong to the stem, e.g. "notable" -> "notabl".
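
To make the idea concrete, here is a minimal sketch of suffix stripping in plain JavaScript. It is not the Snowball algorithm that MongoDB uses; the function name toyStem and the suffix lists are made up for illustration and are far too small for real use.

```javascript
// Toy suffix stripper, for illustration only. This is NOT Snowball;
// the suffix lists are hypothetical and deliberately tiny.
const SUFFIXES = {
  german: ["ern", "er", "en", "e"],
  english: ["es", "s", "ly", "e"],
};

function toyStem(word, language) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES[language] || []) {
    // Keep at least three characters so short words are left alone.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, lower.length - suffix.length);
    }
  }
  return lower;
}

console.log(toyStem("Geistern", "german")); // geist
console.log(toyStem("Geister", "german"));  // geist
console.log(toyStem("Geist", "german"));    // geist
console.log(toyStem("notable", "english")); // notabl
```

All three German forms collapse to the same stem, which is what lets an index match them against each other. Real stemmers add many more rules and exceptions on top of this.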

Since the language of each quote is known, the correct stemmer is used for that quote. This works with your data:

 > db.test.find({$text: {$search: 'Geist'}}).count()
 1

Now the problem is that the user might not search for the stem but for a derived form, so we need to apply the same transformation to the input. The main difficulty is that we do not know which transformation was applied in the first place. So you are trying to solve something that is already hard, with an additional unknown.

The good news is that Snowball, the stemmer used by MongoDB and by other search engines such as Solr, is available under a BSD license and has been ported to many languages, so you can do the same stemming in client code that the database does. That way we no longer treat the database as a black box, but we also couple our client code to an implementation detail of the database... Choose your poison.

We could, for example, simply run the search term through all the stemmers and see which one yields results, but this can lead to false positives, because the word may already be a stem and a stemmer for another language shortens it anyway (German stemmer: 'mice' -> 'mic').

At the very least, we significantly reduce the number of queries we need to run if we only keep the set of distinct stems that the stemmers return.
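
A rough sketch of that deduplication step, in plain JavaScript. The three entries in stemmers are hypothetical stand-ins for real per-language stemmers (for example Snowball ports); only the grouping logic is the point here.

```javascript
// Hypothetical stand-ins for real per-language stemmers.
const stemmers = {
  german: (w) => w.toLowerCase().replace(/(ern|er|en|e)$/, ""),
  english: (w) => w.toLowerCase().replace(/(es|s)$/, ""),
  none: (w) => w.toLowerCase(),
};

// Returns each distinct stem once, together with the languages that produced it.
function distinctStems(term) {
  const byStem = new Map();
  for (const [language, stem] of Object.entries(stemmers)) {
    const key = stem(term);
    if (!byStem.has(key)) byStem.set(key, []);
    byStem.get(key).push(language);
  }
  return byStem;
}

const stems = distinctStems("Geister");
for (const [stem, languages] of stems) {
  console.log(stem, "<-", languages.join(", "));
}
```

For "Geister" the three toy stemmers produce only two distinct stems, so two queries suffice instead of three. The same toy German rule also turns 'mice' into 'mic', which shows how the false positives mentioned above can arise.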

Alternatively, you could consult word lists to guess which language the query is in.
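
One simple way to do that is to count stop words per language. The sketch below assumes tiny, made-up stop-word lists and a hypothetical function name guessLanguages; real lists are much longer, and short queries often contain no stop word at all.

```javascript
// Tiny, hypothetical stop-word lists; real lists are much longer.
const STOP_WORDS = {
  german: new Set(["der", "die", "das", "und", "ist", "ich", "in", "nicht"]),
  english: new Set(["the", "and", "is", "that", "which", "in", "not", "of"]),
  spanish: new Set(["el", "la", "los", "que", "de", "un", "es", "en"]),
};

function guessLanguages(query) {
  // Split on anything that is not a letter (Unicode-aware).
  const words = query.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  const scores = Object.entries(STOP_WORDS).map(([language, stopWords]) => ({
    language,
    hits: words.filter((w) => stopWords.has(w)).length,
  }));
  const best = Math.max(...scores.map((s) => s.hits));
  // If no stop word was seen, every language ties at 0 and all are returned.
  return scores.filter((s) => s.hits === best).map((s) => s.language);
}

console.log(guessLanguages("Die Geister, die ich rief")); // [ 'german' ]
console.log(guessLanguages("Geister")); // all three candidates
```

The second call shows the weakness: a single-word query gives the guesser nothing to work with, so you are back to trying several languages.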

Even with this extra effort, it is important to understand the limitations of simple suffix stripping. For example, "mice" will not be found when searching for "mouse", not even with the English stemmer, because the stemmer assumes that the stem is a shortened form of the word. Things get much worse if the texts do not actually conform to their supposed language (Ulysses...).

In other words: a really good free-text search requires much more than just stemming, and multi-language queries add to the difficulty. Switching to another search database is not a panacea; the difficulty is rooted in the problem domain itself...

EDIT: Elasticsearch has a very comprehensive explanation of stemming (I keep finding these things after writing the answer).

EDIT2:

Why doesn't MongoDB just use different words?

The transformations are applied only when text is inserted or updated in the database. The query then simply looks for matching stems. Essentially, the index is made up of the stemmed words. What you want would require walking through the entire collection on every query, which would be very inefficient and defeat the purpose of an index. What you can do is perform this step in client code.

Why can't I just use $or to search the text index twice?

AFAIK, this limitation of the query engine is probably related to ranking, because a ranking of results based on two different inputs does not make much sense. But you can simply run two queries and merge the results on the client side.
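
A minimal sketch of the client-side merge, in plain JavaScript. The two arrays stand in for the results of two separate $text queries (one per language); no database is contacted here, and mergeById is a hypothetical helper name. It assumes that String(doc._id) yields a stable, unique string for each document.

```javascript
// Merge result lists from several queries, keeping one copy per _id.
// Earlier lists win, so pass the most relevant result list first.
function mergeById(...resultLists) {
  const merged = new Map();
  for (const results of resultLists) {
    for (const doc of results) {
      const key = String(doc._id);
      if (!merged.has(key)) merged.set(key, doc);
    }
  }
  return [...merged.values()];
}

// Stand-ins for the documents returned by two separate queries.
const germanHits = [{ _id: "a1", original: "Die Geister, die ich rief..." }];
const englishHits = [
  { _id: "a1", original: "Die Geister, die ich rief..." },
  { _id: "b2", original: "is this dagger which I see before me." },
];

console.log(mergeById(germanHits, englishHits).map((d) => d._id)); // [ 'a1', 'b2' ]
```

Note that the text scores of the two queries are not comparable, so this sketch makes no attempt to produce a combined ranking.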

wdberkeley · Answer 2 · 2015-02-24T21:23:57+0000

Sorry, I was stupid. The problem is simple: we are trying to match the search text "vermutlich" with the document text "vermutlich", and for that both need to be analyzed with the same language rules. If you do the following:

 > db.test.drop()
 > db.test.insert({ "language" : "de", "body" : "vermutlich", "title" : "Artikel"})
 > db.test.ensureIndex({ "$**" : "text" })
 > db.test.count({ "$text" : { "$search" : "vermutlich" } })
 0
 > db.test.count({ "$text" : { "$search" : "vermutlich", "$language" : "de" } })
 1

The first query searches a document that was indexed as German (because of its language field) using "vermutlich" treated as an English word.
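
The mismatch can be modeled in a few lines of plain JavaScript. This is only a toy model under stated assumptions: the two regex-based functions in toyStemmers are hypothetical stand-ins for the real German and English stemmers, and buildIndex and search are made-up names, not MongoDB APIs.

```javascript
// Toy model of a text index: maps each stem to the ids of matching documents.
// The two "stemmers" are hypothetical stand-ins, not MongoDB's stemmers.
const toyStemmers = {
  de: (w) => w.toLowerCase().replace(/(lich|ern|er|en|e)$/, ""),
  en: (w) => w.toLowerCase().replace(/(ing|ly|s)$/, ""),
};

function buildIndex(docs, defaultLanguage) {
  const index = new Map();
  for (const doc of docs) {
    // Like language_override: a per-document field picks the stemmer.
    const stem = toyStemmers[doc.language || defaultLanguage];
    for (const word of doc.body.split(/\s+/)) {
      const key = stem(word);
      if (!index.has(key)) index.set(key, new Set());
      index.get(key).add(doc.id);
    }
  }
  return index;
}

function search(index, term, queryLanguage) {
  // The query term is stemmed with the query's language, not the document's.
  const key = toyStemmers[queryLanguage](term);
  return [...(index.get(key) || [])];
}

const index = buildIndex([{ id: 1, language: "de", body: "vermutlich" }], "en");
console.log(search(index, "vermutlich", "en")); // []
console.log(search(index, "vermutlich", "de")); // [ 1 ]
```

The document is stored under its German stem, so an English-stemmed query term never reaches it, even though the raw strings are identical.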

You can set the default language for the text index so that $language does not have to be specified in each query:

 > db.test.drop()
 > db.test.insert({ "language" : "de", "body" : "vermutlich", "title" : "Artikel"})
 > db.test.ensureIndex({ "$**" : "text" }, { "default_language" : "de" })
 > db.test.count({ "$text" : { "$search" : "vermutlich" } })
 1

dave · Answer 3 · 2015-02-26T13:31:34+0000

As a workaround, I created my text index with default_language: "none" and language_override: "none". That way, words are not stemmed and no stop words are removed. But at least I find direct matches regardless of language.

 > db.test.drop()
 true
 > db.test.insert({ language: "portuguese", original: "A sorte protege os audazes.", translation: [{ language: "english", quote: "Fortune favors the bold."},{ language: "spanish", quote: "La suerte rotege a los audaces."}]})
 WriteResult({ "nInserted" : 1 })
 > db.test.insert({ language: "spanish", original: "Nada hay más surrealista que la realidad.", translation:[{language: "english",quote: "There is nothing more surreal than reality."},{language: "french",quote: "Il n'y a rien de plus surréaliste que la réalité."}]})
 WriteResult({ "nInserted" : 1 })
 > db.test.insert({ original: "is this dagger which I see before me.", translation: {language: "spanish",quote: "Es este un puñal que veo delante de mí." }})
 WriteResult({ "nInserted" : 1 })
 > db.test.insert({original: "Die Geister, die ich rief...", language: "german", translation: {language: "english", quote: "The spirits that I've cited..."}})
 WriteResult({ "nInserted" : 1 })
 > db.test.ensureIndex( { original: "text", "translation.quote": "text" }, {default_language: 'none', language_override: 'none'} )
 { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
 > db.test.find({$text: {$search: 'Geister'}})
 { "_id" : ObjectId("54ed31b2e7ac93c32c760809"), "original" : "Die Geister, die ich rief...", "language" : "german", "translation" : { "language" : "english", "quote" : "The spirits that I've cited..." } }

Unless someone finds an actual solution, I have to assume that the MongoDB text index is broken.
