Indexing n-word expressions as a single token in Lucene

I want to index a compound word such as "New York" as a single term in Lucene, not as "new" and "york". That way, a search for "new place" will not match documents containing "New York".

I don't think n-grams (i.e. NGramTokenizer) are what I need, because I don't want to index every n-gram, only a few specific ones.

I did some research, and I know I should write my own analyzer and possibly my own tokenizer, but I'm a bit lost among TokenStream, TokenFilter and Tokenizer.

Thanks.

+4
2 answers

I assume you already have a way to detect the multi-word units (MWUs) you want to keep. Then you can replace the spaces inside them with underscores and use WhitespaceAnalyzer instead of StandardAnalyzer (which discards punctuation), possibly combined with a LowerCaseFilter.
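A minimal sketch of the pre-processing step this answer suggests (the MWU list and the class/method names below are my own illustration, not from the answer): join each known multi-word unit with underscores before the text reaches the analyzer, so a whitespace-based tokenizer emits it as one token.

```java
import java.util.List;

public class MwuJoiner {
    // Hypothetical list of multi-word units to keep as single tokens.
    private static final List<String> MWUS = List.of("New York", "Los Angeles");

    // Replace the spaces inside each known MWU with underscores so that a
    // whitespace-based tokenizer (e.g. WhitespaceAnalyzer) treats the whole
    // unit as a single token.
    public static String join(String text) {
        for (String mwu : MWUS) {
            text = text.replace(mwu, mwu.replace(' ', '_'));
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(join("I moved to New York last year"));
        // -> I moved to New_York last year
    }
}
```

Remember to apply the same transformation to queries, so that a search for "New York" also becomes the single token "New_York".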

Writing your own Tokenizer requires quite some Lucene black magic. I never managed to wrap my head around the Lucene 2.9+ API, but look at TokenStream if you really want to try it.

+1

I did this by creating a field that is indexed but not analyzed, using Field.Index.NOT_ANALYZED:

    doc.add(new Field("fieldName", "value", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES));

The rest of the document was still analyzed with StandardAnalyzer.

This worked for me on Lucene 3.0.2.

0
