How can I get a list of unique terms from a specific field in Lucene?

I have an index built from a large corpus, with multiple fields. Only one of these fields contains text. I need to extract the unique words from the entire index based on this field. Does anyone know how I can do this with Lucene in Java?

+7
java lucene
4 answers

You are looking for term vectors (the set of all words that were in the field and the number of times each word was used, excluding stop words). Call IndexReader.getTermFreqVector(docId, field) for each document in the index and add the returned terms to a HashSet. This only works if term vectors were stored for the field at index time.

An alternative is to use terms() and keep only the terms of the field you are interested in:

    IndexReader reader = IndexReader.open(index);
    TermEnum terms = reader.terms();
    Set<String> uniqueTerms = new HashSet<String>();
    // terms() starts before the first term, so next() is called first
    while (terms.next()) {
        final Term term = terms.term();
        if (term.field().equals("field_name")) {
            uniqueTerms.add(term.text());
        }
    }
    terms.close();

This is not optimal, because you read and then discard the terms of all other fields. Lucene 4 has a Fields class whose terms(field) method returns the terms of a single field only.

+8

If you are using the Lucene 4.0 API, you need to get the Fields from the index reader. Fields then offers a way to get the terms for each field in the index. Here is an example of how to do this:

    Fields fields = MultiFields.getFields(indexReader);
    Terms terms = fields.terms("field");
    TermsEnum iterator = terms.iterator(null);
    BytesRef byteRef = null;
    while ((byteRef = iterator.next()) != null) {
        String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);
    }

In newer versions of Lucene, you can get the string from the BytesRef by calling:

  byteRef.utf8ToString(); 

instead of

  new String(byteRef.bytes, byteRef.offset, byteRef.length); 
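One reason to prefer an explicit UTF-8 decode is that the three-argument String constructor uses the platform default charset. The sketch below uses only the JDK and no Lucene classes; it assumes the term bytes are UTF-8 and sit in a larger shared buffer addressed by offset and length, which is how the snippet above treats a BytesRef. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8SliceDemo {
    // Decodes `length` bytes starting at `offset` as UTF-8,
    // independent of the platform default charset.
    static String decodeUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical shared buffer holding two terms back to back.
        byte[] buffer = "caf\u00e9index".getBytes(StandardCharsets.UTF_8);
        // "caf\u00e9" takes 5 bytes in UTF-8 because the accented letter needs two.
        System.out.println(decodeUtf8(buffer, 0, 5));
        System.out.println(decodeUtf8(buffer, 5, 5));
    }
}
```

Passing the charset explicitly keeps the result the same on every machine, whatever its default encoding is.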

If you want to get the document frequency of the current term, you can do:

  int docFreq = iterator.docFreq(); 
+25

The same result, only a little cleaner, can be had by using LuceneDictionary from the lucene-suggest module. It takes care of a field that does not contain any terms by returning BytesRefIterator.EMPTY, which saves you from an NPE :)

    LuceneDictionary ld = new LuceneDictionary(indexReader, "field");
    BytesRefIterator iterator = ld.getWordsIterator();
    BytesRef byteRef = null;
    while ((byteRef = iterator.next()) != null) {
        String term = byteRef.utf8ToString();
    }
+3

There is one pitfall in answers that use TermEnum and terms.next(). The enumeration returned by reader.terms(Term) already points to the first matching term, so while (terms.next()) will skip that first term. (The no-argument reader.terms() is documented as starting before the first term, so there next() does have to be called first.)

Use a for loop instead:

    TermEnum terms = reader.terms(new Term("field_name", ""));
    for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
        // do something with the term
    }

To change the code from the accepted answer:

    IndexReader reader = IndexReader.open(index);
    // seek straight to the first term of the field
    TermEnum terms = reader.terms(new Term("field_name", ""));
    Set<String> uniqueTerms = new HashSet<String>();
    for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
        if (!term.field().equals("field_name")) {
            break; // terms are sorted by field, so this field is finished
        }
        uniqueTerms.add(term.text());
    }
    terms.close();
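The difference between the two loop shapes can be seen without Lucene. The sketch below uses a hypothetical Cursor class (not a Lucene type) that, like an already positioned enumeration, points at its first element as soon as it is created; everything in it is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionedCursorDemo {
    // Hypothetical stand-in for an enumeration that starts ON its first element.
    static final class Cursor {
        private final List<String> items;
        private int pos = 0;

        Cursor(List<String> items) { this.items = items; }

        // Current element, or null once the cursor is exhausted.
        String term() { return pos < items.size() ? items.get(pos) : null; }

        // Advances by one; returns true if there is a current element afterwards.
        boolean next() { pos++; return pos < items.size(); }
    }

    // Advances before reading, so the first element is never seen.
    static List<String> withWhile(List<String> items) {
        Cursor c = new Cursor(items);
        List<String> seen = new ArrayList<>();
        while (c.next()) {
            seen.add(c.term());
        }
        return seen;
    }

    // Reads the current element first, then advances.
    static List<String> withFor(List<String> items) {
        Cursor c = new Cursor(items);
        List<String> seen = new ArrayList<>();
        for (String t = c.term(); t != null; c.next(), t = c.term()) {
            seen.add(t);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c");
        System.out.println(withWhile(items)); // [b, c]
        System.out.println(withFor(items));   // [a, b, c]
    }
}
```

Which shape is right depends on where the enumeration starts: before the first element, the while loop is correct; on the first element, the for loop is.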
0