When adding a multi-valued string field to a Lucene document, do you need a comma?

I create a Lucene index and add documents.

I have a multi-valued field; for this example I will use categories.

An item can have many categories, for example, jeans can fall under clothes, pants, men's, women's, etc.

When adding fields to a document, do commas make a difference? Will Lucene just ignore them? If I change the commas to spaces, will there be a difference? Does this automatically make the field multi-valued?

    String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
    categoriesForItem = categoriesForItem.replaceAll(",", " ").trim(); // not sure whether to remove the commas
    doc.add(new StringField("categories", categoriesForItem, Field.Store.YES)); // doc is a Document

Am I doing it right, or is there another way to create multi-valued fields?

Any help / advice is appreciated.

+6
2 answers

This is the best way to index fields with multiple values per document.

    String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
    String[] categoriesForItems = categoriesForItem.split(",");
    for (String cat : categoriesForItems) {
        // trim() drops the space that follows each comma, so the indexed term is "category2", not " category2"
        doc.add(new StringField("categories", cat.trim(), Field.Store.YES)); // doc is a Document
    }

When several fields with the same name appear in the same document, both the inverted index and the term vectors logically append the tokens of those fields to one another, in the order the fields were added.

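Also, at the analysis stage, two different values of the same field are separated by a position increment gap, which the analyzer reports through getPositionIncrementGap(). Note that Lucene's base Analyzer returns 0 there, so the gap has to be set to a positive value to get the separation. Let me explain why this is necessary.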

Suppose the "category" field in document D1 has two values, "foo bar" and "foo baz". If you query for the phrase "bar foo", D1 should not match. This is ensured by adding an extra position increment between two values of the same field.

If you concatenate the field values yourself and rely on the analyzer to split them into multiple values, the phrase "bar foo" will match D1, which would be incorrect.
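To make the position argument concrete, here is a toy model in plain Java. It does not call Lucene at all: the class `PositionGapSketch` and its methods `positions` and `matchesPhrase` are hypothetical names invented for this illustration, tokens are split on whitespace only, and a phrase matches when its terms sit at consecutive positions. It is a sketch of the idea, not of Lucene's actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: mimics how token positions are assigned across
// several values of one field. Not Lucene code.
class PositionGapSketch {

    // Gives every token a position; `gap` extra positions are skipped
    // between consecutive values of the same field.
    static Map<Integer, String> positions(List<String> values, int gap) {
        Map<Integer, String> byPosition = new TreeMap<>();
        int position = 0;
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                position += gap; // the position increment gap between values
            }
            for (String token : values.get(v).trim().split("\\s+")) {
                byPosition.put(position++, token);
            }
        }
        return byPosition;
    }

    // A phrase matches when its terms occupy consecutive positions.
    static boolean matchesPhrase(Map<Integer, String> byPosition, String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        for (int start : byPosition.keySet()) {
            boolean allMatch = true;
            for (int i = 0; i < terms.length; i++) {
                if (!terms[i].equals(byPosition.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> d1 = List.of("foo bar", "foo baz");
        System.out.println(matchesPhrase(positions(d1, 0), "bar foo"));
        System.out.println(matchesPhrase(positions(d1, 100), "bar foo"));
    }
}
```

In this model, a gap of 0 places "bar" and the second "foo" at adjacent positions, so the phrase "bar foo" matches across the value boundary; with a gap of 100 it no longer does, while "foo bar" and "foo baz" still match in both cases.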

+14

If you are using StandardAnalyzer, it makes no difference whether you have commas or spaces. With another analyzer, it depends.
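The following plain-Java sketch shows why commas and spaces can end up equivalent once a field is tokenized. It is a crude stand-in, not StandardAnalyzer itself (which follows Unicode text segmentation rules): the class `CommaVersusSpaceSketch` and the method `roughTokens` are hypothetical names, and the assumed behavior is simply "lowercase, then split on anything that is not a letter or digit".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough approximation of a word tokenizer; NOT Lucene's StandardAnalyzer.
class CommaVersusSpaceSketch {

    // Lowercases the text and splits it on runs of non-letter, non-digit characters.
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(roughTokens("category1, category2, cat3"));
        System.out.println(roughTokens("category1 category2 cat3"));
    }
}
```

Under this assumption both inputs produce the same three tokens. Keep in mind that the question's code uses a StringField, which Lucene indexes as a single unanalyzed token, so no splitting of this kind happens there and the whole string, commas included, becomes one term.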

Another way: you can add the same field several times, each with a different category in it. In that case I would recommend using KeywordAnalyzer or leaving the field unanalyzed, so that it matches your category name exactly.

+1
