Term Vector Frequency in Lucene 4.0

I am upgrading from Lucene 3.6 to Lucene 4.0 beta. In Lucene 3.x, IndexReader has the IndexReader.getTermFreqVectors() method, which I can use to extract the frequency of each term in a given document and field.

This method has now been replaced by IndexReader.getTermVectors(), which returns Terms. How can I use this (or possibly other methods) to extract the term frequencies for a given document and field?

+7
4 answers

Perhaps this will help you:

    // Get the term vector for one document and one field
    Terms terms = reader.getTermVector(docID, "fieldName");
    if (terms != null && terms.size() > 0) {
        // Access the terms for this field
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term = null;
        // Walk through the terms of this field
        while ((term = termsEnum.next()) != null) {
            // Enumerate the documents; in this case there is only one
            DocsEnum docsEnum = termsEnum.docs(null, null);
            int docIdEnum;
            while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                // Print the term, the document id, and the term frequency in the document
                System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
            }
        }
    }
+12

See this related question, specifically:

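    Terms vector = reader.getTermVector(docId, CONTENT);
    TermsEnum termsEnum = null;
    termsEnum = vector.iterator(termsEnum);
    Map<String, Integer> frequencies = new HashMap<>();
    BytesRef text = null;
    while ((text = termsEnum.next()) != null) {
        String term = text.utf8ToString();
        // For a term vector, totalTermFreq() is the frequency within this one document
        int freq = (int) termsEnum.totalTermFreq();
        frequencies.put(term, freq);
        // "terms" is not declared in this excerpt; it is presumably a
        // collection of term strings defined in the linked question
        terms.add(term);
    }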
+3

There is plenty of documentation on how to use the flexible indexing APIs:

Accessing the fields and terms of a document's term vectors uses the same API that you use to access the postings lists, since term vectors are really just a miniature inverted index for a single document.

So it is perfectly fine to use any of those examples as is, although you can take a few shortcuts, since you know there is only ever one document in this "miniature inverted index". For example, if you just want the frequency of a term, you can seek to it and use the aggregate statistics such as totalTermFreq (see https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/index/package-summary.html#stats), instead of actually opening a DocsEnum that will enumerate only a single document.
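To make the "miniature inverted index" idea concrete, here is a minimal plain-Java sketch. It does not use the Lucene API at all: TermVectorSketch, buildTermVector and termFrequency are hypothetical names, and whitespace splitting plus lowercasing stands in for a real analyzer. It only illustrates why, with a single document, a direct lookup of a term's total frequency gives the same number as enumerating the postings for that term.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a term vector; none of these names are Lucene classes.
class TermVectorSketch {

    // Builds a sorted term -> frequency map for ONE document's field text.
    // Assumption: whitespace tokenization and lowercasing replace a real analyzer.
    static TreeMap<String, Integer> buildTermVector(String fieldText) {
        TreeMap<String, Integer> vector = new TreeMap<>();
        for (String token : fieldText.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Direct lookup: the analogue of seeking to a term and reading its
    // aggregate frequency instead of iterating over its (single) document.
    static int termFrequency(Map<String, Integer> vector, String term) {
        return vector.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> vector = buildTermVector("to be or not to be");
        // Full enumeration in sorted term order, analogous to looping over every term
        for (Map.Entry<String, Integer> entry : vector.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
        // Shortcut for a single term
        System.out.println("freq(to) = " + termFrequency(vector, "to"));
    }
}
```

In real Lucene code the map is replaced by the Terms object returned for the document, and the exact seek method signature differs between 4.x releases, so check the javadoc for your version.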

+1

This small test program works for me on my Lucene 4.2 index.

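    // Assumed to be declared elsewhere: the arrays directory and directoryReader
    // (length 3) and the index paths test1, test2, test3
    try {
        directory[0] = new SimpleFSDirectory(new File(test1));
        directory[1] = new SimpleFSDirectory(new File(test2));
        directory[2] = new SimpleFSDirectory(new File(test3));
        directoryReader[0] = DirectoryReader.open(directory[0]);
        directoryReader[1] = DirectoryReader.open(directory[1]);
        directoryReader[2] = DirectoryReader.open(directory[2]);
        // Reopen the third reader if its index has changed
        if (!directoryReader[2].isCurrent()) {
            directoryReader[2] = DirectoryReader.openIfChanged(directoryReader[2]);
        }
        // Treat the three indexes as one
        MultiReader mr = new MultiReader(directoryReader);
        TermStats[] stats = null;
        try {
            // Top 100 terms by document frequency; the third argument is the field name
            stats = HighFreqTerms.getHighFreqTerms(mr, 100, "My Term");
        } catch (Exception e1) {
            e1.printStackTrace();
            return;
        }
        for (TermStats termstat : stats) {
            System.out.println("IBI_body: " + termstat.termtext.utf8ToString()
                    + ", docFrequency: " + termstat.docFreq);
        }
    } catch (IOException e) {
        // Opening a directory or a reader can throw IOException
        e.printStackTrace();
    }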
0