I would like to extract common phrases with Lucene. I index text from TXT files, and I lose a lot of context because the index holds no information about phrases. For example, "Information Search" is indexed as two separate words.
What is the way to get such phrases? I cannot find anything useful on the Internet. Any tips, links, and especially examples are appreciated!
EDIT: I only store each document's name and content:
Document doc = new Document();
doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));
because for what I do, the most important thing is the content of the file. The titles are often not descriptive at all (for example, I have many academic documents in PDF format whose names are codes or numbers).
I desperately need to index the most frequent phrases in the text content. I now see how ineffective this simple "bag of words" approach is.
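To make the goal concrete, here is a minimal sketch in plain Java (no Lucene involved) of the kind of result I am after: counting adjacent word pairs and listing the most frequent ones. The class and method names (`BigramCounter`, `countBigrams`, `topBigrams`) are hypothetical, the tokenization is deliberately naive, and it assumes Java 8 or later. It only illustrates the desired output, not how this should be done inside a Lucene index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts two-word phrases in raw text.
public class BigramCounter {

    // Counts adjacent word pairs (bigrams), case-insensitively.
    // Note: pairs are also formed across sentence boundaries, which a real
    // solution would probably want to avoid.
    static Map<String, Integer> countBigrams(String text) {
        // Split on any run of characters that are neither letters nor digits.
        String[] raw = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        List<String> words = new ArrayList<>();
        for (String w : raw) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            counts.merge(words.get(i) + " " + words.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    // Returns up to n bigrams, most frequent first; ties are broken alphabetically.
    static List<String> topBigrams(String text, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countBigrams(text).entrySet());
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        String text = "Information search is hard. Information search needs context.";
        System.out.println(topBigrams(text, 1)); // prints [information search]
    }
}
```

What I am looking for is a way to get this kind of phrase information from the index itself, since the "text" field above already stores term vectors with positions and offsets.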