How to get frequently occurring phrases with Lucene

I would like to extract frequently occurring phrases with Lucene. I am indexing text from TXT files, and I lose a lot of context because I have no information about phrases. For example, "Information Search" is indexed as two separate words.

What is the way to get such phrases? I cannot find anything useful on the Internet. Any advice, links, hints, and especially examples are appreciated!

EDIT: I only store documents by title and content:

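 Document doc = new Document();
 // File name: stored and indexed verbatim as a single term (not analyzed)
 doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
 // File contents: tokenized from a Reader, not stored, with term vectors holding positions and offsets
 doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));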

because for what I do, the most important thing is the content of the file. The titles are very often not descriptive at all (for example, I have many academic documents in PDF format whose names are codes or numbers).

I desperately need to index the most frequently occurring phrases from the text content. I now see how ineffective this simple "bag of words" approach is.

+5
3 answers

Julia, it seems that you are looking for n-grams, in particular bigrams (also called collocations).

Here is a chapter on finding collocations (PDF) from Manning and Schütze's Foundations of Statistical Natural Language Processing.

To do this with Lucene, I suggest using Solr with a ShingleFilterFactory. See this discussion for more details.
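
To make the idea concrete, here is a minimal plain-Java sketch of what counting bigrams (shingles of size 2) amounts to. It does not use the Lucene or Solr API; the class name `BigramCounter`, its methods, and the simple tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of word bigrams (shingles of size 2); not Lucene code. */
public class BigramCounter {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Counts how often each pair of adjacent words occurs in the text. */
    public static Map<String, Integer> countBigrams(String text) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the k most frequent bigrams, most frequent first (ties in alphabetical order). */
    public static List<String> topBigrams(String text, int k) {
        Map<String, Integer> counts = countBigrams(text);
        List<String> bigrams = new ArrayList<>(counts.keySet());
        bigrams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return bigrams.subList(0, Math.min(k, bigrams.size()));
    }

    public static void main(String[] args) {
        String text = "Information search is hard. Information search needs context.";
        // Prints the three most frequent adjacent word pairs.
        System.out.println(topBigrams(text, 3));
    }
}
```

Note that this sketch pairs words across sentence boundaries and keeps stop words, so in practice the raw counts would probably need filtering before they are useful as "phrases".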

+7

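Could you post the code you have written?

A lot depends on how your fields are created and how the documents are stored in Lucene.

Consider a case with two fields, ID and Comments: the ID field holds values such as "finding nemo", that is, strings containing spaces, while Comments is a free-form text field.

In that case it makes no sense to split the ID "finding nemo" into two separately searchable words, whereas everything in Comments should be indexed.

So, when creating the document (org.apache.lucene.document.Document), you could do something like this: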

Document doc = new Document();
doc.add(new Field("comments","Finding nemo was a very tough job for a clown fish ...", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED));

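So, essentially, the fields are indexed as follows:

  • comments: analyzed, that is, split into individual terms by the analyzer's Tokenizer, because of Field.Index.ANALYZED
  • id: not analyzed, so Lucene indexes the whole value as a single term, because of Field.Index.NOT_ANALYZED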
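
The difference between the two settings can be illustrated with a small plain-Java sketch. This is not Lucene code: the class `FieldTerms` and its two methods are hypothetical stand-ins, and a real analyzer may apply further steps (such as stop-word removal) that are not modeled here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java illustration of analyzed vs. non-analyzed field values; not Lucene code. */
public class FieldTerms {

    /** Rough stand-in for an analyzed field: lowercase, then split into word terms. */
    public static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Stand-in for a non-analyzed field: the whole value is one term, unchanged. */
    public static List<String> notAnalyzedTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        // Several terms: each word can be searched on its own.
        System.out.println(analyzedTerms("Finding nemo was a very tough job"));
        // One term: only the exact value "finding nemo" matches.
        System.out.println(notAnalyzedTerms("finding nemo"));
    }
}
```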

Link(s): http://darksleep.com/lucene/

0

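For searching by phrase, take a look at PhraseQuery.

Also check the Boolean omitTermFreqAndPositions on the field. PhraseQuery needs term positions, so it will not find anything in a field where they are omitted.

A PhraseQuery matches documents in which the given terms occur as a sequence, such as "information search". Unlike a TermQuery, which matches a single term, it compares the positions of the terms. The slop setting controls how far apart the terms may be and still match; a slop of 0 requires the exact phrase.

See the Lucene JavaDoc for PhraseQuery.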
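
Why positions matter can be shown with a rough plain-Java sketch of exact phrase matching (the equivalent of a slop of 0). It is not the Lucene implementation; `PhraseMatcher`, `positions`, and `phraseFrequency` are hypothetical names, and the tokenization is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of exact phrase matching via term positions; not Lucene code. */
public class PhraseMatcher {

    /** Maps each lowercased term of the text to the positions at which it occurs. */
    public static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int pos = 0;
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (t.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(t, k -> new ArrayList<>()).add(pos++);
        }
        return index;
    }

    /**
     * Counts exact occurrences of the phrase (terms must be given in lowercase):
     * term i of the phrase has to sit at position start + i.
     */
    public static int phraseFrequency(Map<String, List<Integer>> index, String... phrase) {
        if (phrase.length == 0) {
            return 0;
        }
        List<Integer> starts = index.get(phrase[0]);
        if (starts == null) {
            return 0;
        }
        int count = 0;
        for (int start : starts) {
            boolean match = true;
            for (int i = 1; i < phrase.length && match; i++) {
                List<Integer> p = index.get(phrase[i]);
                match = p != null && p.contains(start + i);
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
                positions("Information search is hard. Information search needs context.");
        System.out.println(phraseFrequency(index, "information", "search"));
        System.out.println(phraseFrequency(index, "search", "information"));
    }
}
```

Both words occur in the sample text in either case, so a bag-of-words lookup cannot tell the two phrases apart; only the recorded positions can.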

See this sample code for a demonstration of how to work with various query objects.

You can also try to combine different types of queries using the BooleanQuery class.

As for the frequency of phrases, I believe Lucene's scoring takes into account the frequency of the terms occurring in documents.

0