Term Vector Frequency in Lucene 4.0

Question

I am upgrading from Lucene 3.6 to Lucene 4.0 beta. In Lucene 3.x, IndexReader contains the IndexReader.getTermFreqVectors() method, which I can use to extract the frequency of each term in a given document and field.

This method has now been replaced by IndexReader.getTermVectors(), which returns Fields. How can I use this (or possibly other methods) to extract the term frequencies for a given document and field?

lucene

mossaab Aug 23 '12 at 18:41

4 answers

Beddy · Answer 1 · 2013-04-24T15:35:09+0000

Perhaps this will help you:

    // get the term vector for one document and one field
    Terms terms = reader.getTermVector(docID, "fieldName");

    if (terms != null && terms.size() > 0) {
        // access the terms for this field
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term = null;

        // explore the terms for this field
        while ((term = termsEnum.next()) != null) {
            // enumerate through documents, in this case only one
            DocsEnum docsEnum = termsEnum.docs(null, null);
            int docIdEnum;
            while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                // get the term frequency in the document
                System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
            }
        }
    }

Mark butler · Answer 2 · 2013-01-22T00:21:53+0000

See this related question, specifically:

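    // term vector for one document and one field (CONTENT is the field name)
    Terms vector = reader.getTermVector(docId, CONTENT);
    TermsEnum termsEnum = null;
    termsEnum = vector.iterator(termsEnum);

    Map<String, Integer> frequencies = new HashMap<>();
    BytesRef text = null;
    while ((text = termsEnum.next()) != null) {
        String term = text.utf8ToString();
        // a term vector covers a single document, so this is the in-document frequency
        int freq = (int) termsEnum.totalTermFreq();
        frequencies.put(term, freq);
        // 'terms' is not defined in this excerpt; presumably a collection of all terms seen, declared elsewhere
        terms.add(term);
    }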

Robert Muir · Answer 3 · 2012-08-29T04:46:34+0000

There is various documentation on how to use the flexible indexing APIs.

Accessing the fields and terms of a document's term vectors uses the same API that you use to access the postings lists, since term vectors are really just a miniature inverted index for a single document.

So you can use any of those examples as-is, although you can take several shortcuts, since you know there is only ever one document in this "miniature inverted index". For example, if you only want the frequency of a term, you can seek to it and use the aggregate statistics such as totalTermFreq (see https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/index/package-summary.html#stats), rather than actually opening a DocsEnum that will only enumerate a single document.
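To make the "miniature inverted index" idea concrete, here is a plain-Java sketch that deliberately does not use Lucene. It models one document's term vector as a sorted map from term to frequency. The names TermVectorSketch, buildTermVector and termFrequency are hypothetical, and lowercasing plus whitespace splitting is only a stand-in for a real analyzer. Because the structure covers a single document, looking a term up directly already yields its frequency in that document, which is the reason an aggregate statistic is enough on a real term vector.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a single-document term vector; this is not Lucene API.
final class TermVectorSketch {

    // Builds a sorted term -> frequency map for the text of one field of one document.
    // Lowercasing and whitespace splitting stand in for a real analyzer (assumption).
    static TreeMap<String, Integer> buildTermVector(String fieldText) {
        TreeMap<String, Integer> vector = new TreeMap<>();
        for (String token : fieldText.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Direct lookup: with only one document, the term's total frequency
    // is the same thing as its frequency in that document.
    static int termFrequency(Map<String, Integer> vector, String term) {
        return vector.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> vector =
                buildTermVector("the quick fox jumps over the lazy dog the end");

        // Enumerate every term, analogous to walking a TermsEnum.
        for (Map.Entry<String, Integer> entry : vector.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }

        // Or jump straight to one term, analogous to seeking instead of enumerating.
        System.out.println("freq(the) = " + termFrequency(vector, "the"));
    }
}
```

Enumerating the map corresponds to the loop-over-all-terms approach in the other answers; the direct lookup corresponds to the shortcut described here.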

Freddy who · Answer 4 · 2013-03-28T19:34:51+0000

This is a small test program that works for me on my Lucene 4.2 index.

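    // Fragment: the 'directory' and 'directoryReader' arrays (three entries each)
    // and the index paths test1, test2, test3 are declared elsewhere.
    try {
        directory[0] = new SimpleFSDirectory(new File(test1));
        directory[1] = new SimpleFSDirectory(new File(test2));
        directory[2] = new SimpleFSDirectory(new File(test3));

        directoryReader[0] = DirectoryReader.open(directory[0]);
        directoryReader[1] = DirectoryReader.open(directory[1]);
        directoryReader[2] = DirectoryReader.open(directory[2]);

        // reopen the third reader if its index has changed since it was opened
        if (!directoryReader[2].isCurrent()) {
            directoryReader[2] = DirectoryReader.openIfChanged(directoryReader[2]);
        }

        // treat the three indexes as a single one
        MultiReader mr = new MultiReader(directoryReader);

        TermStats[] stats = null;
        try {
            // up to 100 high-frequency terms; the third argument is the field to inspect
            stats = HighFreqTerms.getHighFreqTerms(mr, 100, "My Term");
        } catch (Exception e1) {
            e1.printStackTrace();
            return;
        }

        for (TermStats termstat : stats) {
            System.out.println("IBI_body: " + termstat.termtext.utf8ToString()
                    + ", docFrequency: " + termstat.docFreq);
        }
    } catch (IOException e) {
        // opening a directory or a reader can throw IOException
        e.printStackTrace();
    }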
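Note that docFreq, as printed above, is the number of documents that contain a term, whereas the question asks how often a term occurs within one document. The plain-Java sketch below (no Lucene involved; the class FrequencyKinds, its methods and the toy corpus are all hypothetical) illustrates the difference between the two numbers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of term frequency versus document frequency; not Lucene API.
final class FrequencyKinds {

    // Lowercasing and whitespace splitting stand in for a real analyzer (assumption).
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    // Term frequency: how many times the term occurs within one document.
    static int termFrequency(String document, String term) {
        int count = 0;
        for (String token : tokenize(document)) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: how many documents contain the term at least once.
    static int docFrequency(List<String> documents, String term) {
        int count = 0;
        for (String document : documents) {
            if (termFrequency(document, term) > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "lucene lucene index",
                "index reader",
                "lucene term vector");

        System.out.println("tf(lucene, doc 0) = " + termFrequency(corpus.get(0), "lucene"));
        System.out.println("df(lucene)        = " + docFrequency(corpus, "lucene"));
    }
}
```

In this toy corpus the two values for "lucene" happen to coincide even though they count different things, so it is worth checking which statistic a given API call reports.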
