Lucene 4.4: How to get term frequency over the entire index?

Question


I am trying to calculate the tf-idf value for each term in a document. To do that, I iterate over the terms in the document, and for each term I want to find its frequency in the entire corpus and the number of documents in which it appears. Below is my code:

    // @param index path to index directory
    // @param docNbr the document number in the index
    public void readingIndex(String index, int docNbr) {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));

        Document doc = reader.document(docNbr);
        System.out.println("Processing file: " + doc.get("id"));

        Terms termVector = reader.getTermVector(docNbr, "contents");
        TermsEnum itr = termVector.iterator(null);
        BytesRef term = null;

        while ((term = itr.next()) != null) {
            String termText = term.utf8ToString();
            long termFreq = itr.totalTermFreq(); // FIXME: this only return frequency in this doc
            long docCount = itr.docFreq();       // FIXME: docCount = 1 in all cases

            System.out.println("term: " + termText + ", termFreq = " + termFreq + ", docCount = " + docCount);
        }

        reader.close();
    }

Although the documentation states that totalTermFreq() returns the total number of occurrences of this term across all documents, in testing I found that it only returns the frequency of the term in the document specified by docNbr, and docFreq() always returns 1.

How can I get the term frequency across the entire index?

Update: Of course, I can create a map from each term to its frequency, then iterate over every document to count the total number of times the term occurs. However, I thought Lucene should have a built-in method for this purpose. Thanks,
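For illustration, the manual approach mentioned in the update might look roughly like the sketch below. It is plain Java with no Lucene calls: the class name ManualTermStats and its methods are hypothetical, and each document is assumed to be available as an already-tokenized list of terms.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: collects corpus-wide term statistics by hand.
class ManualTermStats {
    // term -> total number of occurrences across all documents added so far
    final Map<String, Long> totalTermFreq = new HashMap<>();
    // term -> number of documents that contain the term at least once
    final Map<String, Integer> docFreq = new HashMap<>();

    // Assumption: the caller has already split the document into terms.
    void addDocument(List<String> terms) {
        Set<String> seenInThisDoc = new HashSet<>();
        for (String term : terms) {
            // Every occurrence counts toward the total frequency.
            totalTermFreq.merge(term, 1L, Long::sum);
            // Only the first occurrence in a document counts toward the document frequency.
            if (seenInThisDoc.add(term)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        ManualTermStats stats = new ManualTermStats();
        stats.addDocument(Arrays.asList("lucene", "index", "lucene"));
        stats.addDocument(Arrays.asList("index", "reader"));
        for (String term : stats.totalTermFreq.keySet()) {
            System.out.println("term: " + term
                    + ", termFreq = " + stats.totalTermFreq.get(term)
                    + ", docCount = " + stats.docFreq.get(term));
        }
    }
}
```

This needs a full pass over every document, which is the cost a built-in index statistic would avoid.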

+6

indexing lucene tf-idf

chepukha Dec 13 '13 at 20:17


1 answer

femtoRgon · Accepted Answer · 2013-12-13T21:16:39+0000

IndexReader.totalTermFreq(Term) will provide you with this. Your calls to the similar methods on TermsEnum do provide statistics for all documents in the enumeration, but an enumeration obtained from a term vector covers only that one document. Using the reader gives you statistics for all documents in the index itself. Something like:

    String termText = term.utf8ToString();
    Term termInstance = new Term("contents", term);
    long termFreq = reader.totalTermFreq(termInstance);
    long docCount = reader.docFreq(termInstance);

    System.out.println("term: " + termText + ", termFreq = " + termFreq + ", docCount = " + docCount);
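Once those statistics are available, the tf-idf weight the question is after could be computed along the lines of the sketch below. It assumes the plain tf × ln(N / df) variant, which is only one common weighting and is not Lucene's own scoring formula. The class TfIdfSketch and the method tfIdf are hypothetical names. In the question's loop, the in-document frequency would come from itr.totalTermFreq() on the term vector, the document frequency from reader.docFreq(termInstance), and the document count from reader.numDocs().

```java
// Hypothetical helper, not part of Lucene: one simple tf-idf variant.
class TfIdfSketch {
    /**
     * @param termFreqInDoc occurrences of the term in the current document
     * @param docFreq       number of documents in the index that contain the term
     * @param numDocs       total number of documents in the index
     * @return termFreqInDoc * ln(numDocs / docFreq), or 0.0 if the inputs are not positive
     */
    static double tfIdf(long termFreqInDoc, int docFreq, int numDocs) {
        // Guard against division by zero for terms or indexes with no documents.
        if (docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        double idf = Math.log((double) numDocs / docFreq);
        return termFreqInDoc * idf;
    }

    public static void main(String[] args) {
        // Example values only; they do not come from a real index.
        System.out.println("tf-idf = " + tfIdf(3, 10, 1000));
    }
}
```

With this variant, a term that appears in every document gets a weight of zero, since ln(1) = 0. Smoothed forms of the idf factor avoid that if it is not wanted.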
