How to calculate the number of terms for each document in a Lucene index?

I want to know the number of terms in each document of a Lucene index. I searched the API and the web without finding anything. Can you help me?

2 answers

Lucene is built to answer the opposite question, that is, which documents contain a given term. Therefore, to get the number of terms in a document, you need a workaround.

The first method is to store the term vector for each field whose terms you want to count. A term vector is the list of the terms in a field, together with their frequencies. At search time you can retrieve it with the getTermFreqVector method of IndexReader (provided term vectors were stored at index time). The length of the vector gives the number of distinct terms in that field; if you need the total number of term occurrences, sum the frequencies.
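As a rough sketch of the counting step only: the class below is a hypothetical helper, not part of Lucene, and it takes two plain arrays standing in for what a term vector exposes (in Lucene 3.x, TermFreqVector offers getTerms() and getTermFrequencies() as parallel arrays). It shows the difference between the number of distinct terms and the total number of occurrences.

```java
// Hypothetical helper, not a Lucene class. The arrays are assumed to be
// parallel: freqs[i] is how often terms[i] occurs in the field.
final class TermVectorCounts {

    // Number of distinct terms: one vector entry per unique term.
    static int distinctTerms(String[] terms) {
        return terms.length;
    }

    // Total number of term occurrences: the sum of all frequencies.
    static int totalTerms(int[] freqs) {
        int sum = 0;
        for (int f : freqs) {
            sum += f;
        }
        return sum;
    }
}
```

For a field containing "to be or not to be", the vector would hold four distinct terms with frequencies adding up to six, so which number you want depends on what you mean by "number of terms".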

Another method, if you stored the fields of your documents, is to retrieve the text of those fields and count the terms yourself by analyzing it (splitting the text into words).
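A minimal sketch of that idea, assuming the stored text is already in hand as a String. The class name and the tokenization rule (lowercase, split on anything that is not a letter or digit) are illustrative assumptions; to match what is actually in the index, the text would have to be run through the same analyzer that was used at index time.

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene class. The splitting rule is a
// simplification and may not match your analyzer (no stop words, no stemming).
final class StoredTextTermCounter {

    // Counts tokens in the text, splitting on runs of non-letter, non-digit characters.
    static int countTerms(String text) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split can yield an empty leading token
                count++;
            }
        }
        return count;
    }
}
```

This counts every occurrence, so repeated words are counted each time they appear.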

Finally, if an approximation of the number of field terms is enough for you, and you stored the norms at index time, you can compute the inverse of the function used to calculate the field norms. If you look closely at the lengthNorm method of the Similarity class, you will notice that it takes the number of field terms as input. The result of this method is stored in the index using the encodeNorm method. You can retrieve the norms at search time using the norms method of IndexReader. With the norm in hand, apply the inverse of the function used in lengthNorm to recover the number of terms. As I said, this is only an approximation: some precision is lost when the norm is encoded, so you may not get exactly the number that was originally used.
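To make the inversion concrete, here is a sketch under two assumptions that are not stated above: that the index was built with DefaultSimilarity, whose lengthNorm in Lucene 3.x is 1 / sqrt(numTerms), and that no document or field boosts were applied (boosts are multiplied into the stored norm). The class and method names are hypothetical, and the one-byte encoding done by encodeNorm is not simulated here.

```java
// Hypothetical helper, not a Lucene class.
// Assumes lengthNorm(numTerms) = 1 / sqrt(numTerms) and no boosts.
final class NormInverse {

    // The forward function assumed for the default similarity.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Inverse of the function above: numTerms = 1 / norm^2, rounded.
    static long estimateNumTerms(float norm) {
        if (norm <= 0f) {
            throw new IllegalArgumentException("norm must be positive");
        }
        return Math.round(1.0 / ((double) norm * norm));
    }
}
```

With a norm decoded from the index rather than computed directly, the estimate will usually land near the true value rather than exactly on it, because of the precision lost in encoding.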


This is actually quite difficult to do in Lucene unless you store term vectors at index time. The core data structure of Lucene is an inverted index that stores terms as keys and lists of document identifiers as values. This is why there is no getNumTerms() method in the API: the fundamental data structures that Lucene uses do not support it.
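A toy model may help show why. The class below is an illustrative in-memory inverted index, not Lucene's actual implementation; all names are made up. Looking up the documents for a term is a single map lookup, while counting the terms of one document means walking every term in the index.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index for illustration only: term -> set of document ids.
final class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Records that the document contains each of the given terms.
    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // The question the structure is built for: one lookup by term.
    Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    // The opposite question: requires scanning the postings of every term.
    int distinctTermsIn(int docId) {
        int count = 0;
        for (Set<Integer> docs : postings.values()) {
            if (docs.contains(docId)) {
                count++;
            }
        }
        return count;
    }
}
```

Note that this toy version only remembers which documents contain a term, so it can report distinct terms per document but not how often each one occurs.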

That said, you can store term vectors in the index and look them up by document identifier at search time. These vectors are essentially a complete list of the terms in the document, which you can count to get the number of terms.

See:

http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/document/Field.TermVector.html

Alternatively, you can pre-calculate the number of terms for each document and store it as a field in your index.
