Separate numbers from letters in Lucene

Question

In many of the documents I index with Lucene, people accidentally run words and numbers together. For example, someone might write "I was born in2000" instead of "I was born in 2000."

Is there any Lucene tokenizer that can split words with numbers (e.g. in2000and) into multiple words (e.g. in 2000 and)?

+4

java lucene

mossaab Jun 04 '12 at 20:00

2 answers

I do not use Solr, so I downloaded WordDelimiterFilter and WordDelimiterIterator from the Solr source and added this code to my custom analyzer:

    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // Split tokens at letter/digit boundaries and emit both the word and the number parts.
    int flags = WordDelimiterFilter.SPLIT_ON_NUMERICS
              | WordDelimiterFilter.GENERATE_NUMBER_PARTS
              | WordDelimiterFilter.GENERATE_WORD_PARTS;
    // The third argument is the set of protected words; null means none.
    result = new WordDelimiterFilter(result, flags, null);
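To make the intended effect concrete, here is a minimal plain-Java sketch of splitting a token at letter/digit boundaries. It is not how WordDelimiterFilter is implemented and it does not use Lucene at all; the class name LetterDigitSplitter and the method split are hypothetical, chosen only for illustration. It assumes that "letter" means any Unicode letter and "digit" means any Unicode decimal digit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Lucene or Solr.
final class LetterDigitSplitter {

    // Zero-width split points: a letter followed by a digit, or a digit followed by a letter.
    private static final String BOUNDARY =
            "(?<=\\p{L})(?=\\p{Nd})|(?<=\\p{Nd})(?=\\p{L})";

    // Splits one token at every letter/digit boundary and drops empty pieces.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        for (String part : token.split(BOUNDARY)) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("in2000and")); // prints [in, 2000, and]
    }
}
```

A token without a letter/digit boundary, such as "born", comes back unchanged as a single part. The real filter does more than this sketch (for example, it also handles punctuation delimiters and case changes, depending on the flags), so treat this only as a picture of the splitting rule.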

+1

mossaab Jun 05 '12 at 14:09

Marko bonaci · Accepted Answer · 2012-06-05T07:29:09+0000

If you use Solr, you can add WordDelimiterFilterFactory to the analyzer of the field type in your schema and set splitOnNumerics="1" on it.
