Separate numbers from letters in Lucene

In many of the documents I index with Lucene, people accidentally run words and numbers together. For example, someone might write "I was born in2000" instead of "I was born in 2000".

Is there any Lucene tokenizer that can split words with numbers (e.g. in2000and) into multiple words (e.g. in 2000 and)?

2 answers

You can use Solr's WordDelimiterFilterFactory and set splitOnNumerics="1" on it in your schema.
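To make it concrete what splitting on numerics does to a single token, here is a minimal sketch in plain Java with no Lucene or Solr dependency. The class and method names (NumericSplitSketch, splitOnNumerics) are made up for this illustration and are not part of either library. It only handles letter/digit transitions; the real filter also deals with case changes, punctuation and other delimiters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene/Solr class: illustrates the idea only.
class NumericSplitSketch {

    // Splits a token wherever a letter is directly followed by a digit,
    // or a digit is directly followed by a letter.
    static List<String> splitOnNumerics(String token) {
        List<String> parts = new ArrayList<>();
        if (token.isEmpty()) {
            return parts;
        }
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            char prev = token.charAt(i - 1);
            char curr = token.charAt(i);
            boolean boundary =
                    (Character.isLetter(prev) && Character.isDigit(curr))
                    || (Character.isDigit(prev) && Character.isLetter(curr));
            if (boundary) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        // Add the trailing part after the last boundary.
        parts.add(token.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnNumerics("in2000and")); // prints [in, 2000, and]
    }
}
```

With the token from the question, "in2000and", this yields the three parts in, 2000 and and, which is the result the question asks for.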


I do not use Solr, so I took WordDelimiterFilter and WordDelimiterIterator from the Solr sources and added this code to my custom analyzer:

    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    int flags = WordDelimiterFilter.SPLIT_ON_NUMERICS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.GENERATE_WORD_PARTS;
    result = new WordDelimiterFilter(result, flags, null);

Source: https://habr.com/ru/post/1416041/