I use Apache Tika to parse an XML document before indexing it with Apache Lucene.
This is the Tika part:
// Limit the extracted text to 10*1024*1024 characters.
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
XMLParser xmlparser = new XMLParser();
// try-with-resources closes the stream even if parsing fails.
try (FileInputStream inputstream = new FileInputStream(f)) {
    xmlparser.parse(inputstream, handler, metadata, pcontext);
}
return handler.toString();
I use StandardAnalyzer with a list of stop words to tokenize my document:
analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);
Is it possible to drop numeric terms, since I do not need them in the index?
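To make clear what I mean by dropping numeric terms, here is a rough sketch in plain Java of the behavior I am after. It does not use Lucene at all; it just splits the Tika output on whitespace and discards tokens that are purely numeric. The class name `NumericTermFilter`, the method `dropNumericTerms`, and the exact definition of "numeric" (digits, optionally separated by `.` or `,`) are only my assumptions for illustration. Ideally the same thing would happen inside the analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or Lucene.
class NumericTermFilter {
    // Assumption: a "numeric term" is digits only, optionally with '.' or ','
    // separators, e.g. "42", "3.14", "1,000". Mixed tokens like "A42" are kept.
    private static final Pattern NUMERIC = Pattern.compile("\\d+([.,]\\d+)*");

    // Returns the text with purely numeric whitespace-separated tokens removed.
    static String dropNumericTerms(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !NUMERIC.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // prints "Invoice total for item A42"
        System.out.println(dropNumericTerms("Invoice 2024 total 3.14 for item A42"));
    }
}
```

Pre-filtering the string like this would work on the value returned by `handler.toString()`, but it ignores punctuation attached to numbers and duplicates tokenization work, which is why I would prefer a solution at the analyzer level.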
Thank you for your help.
tommy