How to calculate term frequencies for a set of documents?

I have a Lucene index with the following documents:

 doc1 := { caldari, jita, shield, planet }
 doc2 := { gallente, dodixie, armor, planet }
 doc3 := { amarr, laser, armor, planet }
 doc4 := { minmatar, rens, space }
 doc5 := { jove, space, secret, planet }

So these 5 documents use 14 different terms:

 [ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ] 

The frequency of each term:

 [ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ] 

Or, for readability:

 [ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ] 

What I want to know now is: how do I get the term frequency vector for a *set* of documents?

eg:

 Set<Document> docs := [ doc2, doc3 ]
 termFrequencies = magicFunction(docs);
 System.out.println( termFrequencies );

which should yield:

 [ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ] 

or, with all zeros removed:

 [ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ] 

Note that the result only contains the term frequencies within the given document set, NOT the overall frequencies of the entire index! The term "planet" occurs 4 times in the whole index, but the given document set contains it only 2 times.

A naive implementation would be to simply iterate over all documents in the given set, create a map, and count each term. But I need a solution that also works with document sets of 100,000 or 500,000 documents.
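To make the naive approach concrete, here is a minimal sketch in plain Java. The documents are modeled as hard-coded lists of terms purely for illustration; in practice they would come from the index:

```java
import java.util.*;

public class TermFreqMerge {

    // Sum term counts across a set of documents; each document is
    // modeled here as a plain list of its terms.
    static Map<String, Integer> termFrequencies(List<List<String>> docs) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                freq.merge(term, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // doc2 and doc3 from the example above
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        System.out.println(termFrequencies(Arrays.asList(doc2, doc3)));
    }
}
```

This is exactly the O(total terms) counting loop described above; the question is how to avoid paying that cost per query for very large document sets.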

Is there a function in Lucene that I can use to get this vector? If there is no such function, what would a data structure look like that one could build at index time so that such a vector can be obtained easily and quickly?

I'm not a Lucene specialist, so I'm sorry if the solution is obvious or trivial.

Perhaps worth mentioning: the solution should be fast enough for a web application, applied to the result sets of customer search queries.

+6
java lucene
2 answers

Go here: http://lucene.apache.org/java/3_0_1/api/core/index.html and look at this method:

 org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno); 

You will need to know the document id. This is Lucene's internal identifier, and it typically changes with each index update that involves deletes :-).

I believe there is a similar method in Lucene 2.x.
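Assuming term vectors were enabled at indexing time, aggregating them over a document set could look roughly like this. Lucene 3.x's `TermFreqVector` exposes each document's terms and counts as parallel arrays via `getTerms()` and `getTermFrequencies()`; the arrays below are hard-coded stand-ins for what the reader would actually return:

```java
import java.util.*;

public class VectorAggregator {

    // Merge one document's term vector (the parallel arrays that Lucene 3.x
    // TermFreqVector.getTerms() / getTermFrequencies() return) into a total.
    static void addVector(Map<String, Integer> total, String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            total.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> total = new HashMap<>();
        // Stand-ins for what getTermFreqVector(docNo) would return
        // for doc2 and doc3 of the question:
        addVector(total, new String[] {"gallente", "dodixie", "armor", "planet"},
                  new int[] {1, 1, 1, 1});
        addVector(total, new String[] {"amarr", "laser", "armor", "planet"},
                  new int[] {1, 1, 1, 1});
        System.out.println(total);
    }
}
```

In a real setup you would loop over the doc ids of the set, call `reader.getTermFreqVector(docNo, field)` for each, and feed the resulting arrays into `addVector`.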

+5

I do not know Lucene, but your naive implementation should scale as long as you do not read entire documents into memory at once (e.g., use a streaming parser). English text is approximately 83% redundant, so even your largest document would yield a map with roughly 85,000 entries. Use one map per thread (and one thread per file, pooled, obviously) and you will scale just fine.

Update: If your list of terms does not change frequently, you could try building a search trie over the characters of the term list, or generating a perfect hash function ( http://www.gnu.org/software/gperf/ ) to speed up parsing the files (mapping from search terms to target terms). Probably a plain big HashMap would work just as well.
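The "one map per thread, merged at the end" idea from this answer could be sketched as follows. The chunking scheme and thread count are illustrative choices, not prescribed by the answer; each worker counts into its own private map, so no locking is needed during counting:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelTermCount {

    // One private map per worker thread, merged at the end.
    static Map<String, Integer> count(List<List<String>> docs, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Map<String, Integer>>> parts = new ArrayList<>();
        int chunk = (docs.size() + threads - 1) / threads;
        for (int start = 0; start < docs.size(); start += chunk) {
            List<List<String>> slice =
                docs.subList(start, Math.min(start + chunk, docs.size()));
            parts.add(pool.submit(() -> {
                // Thread-local map: no synchronization needed here.
                Map<String, Integer> local = new HashMap<>();
                for (List<String> doc : slice)
                    for (String term : doc)
                        local.merge(term, 1, Integer::sum);
                return local;
            }));
        }
        // Single-threaded merge of the per-thread maps.
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> part : parts)
            part.get().forEach((t, c) -> total.merge(t, c, Integer::sum));
        pool.shutdown();
        return total;
    }
}
```

The merge step is cheap relative to the counting as long as the vocabulary is much smaller than the total number of term occurrences.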

0
