I have a Lucene index with the following documents:
doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }
So these 5 documents use 14 distinct terms:
[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]
The frequency of each term is:
[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]
Or, for readability:
[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]
What I want to know is how to get the term frequency vector for a set of documents.
For example:
Set<Documents> docs := [ doc2, doc3 ]
termFrequencies = magicFunction(docs);
System.out.println( termFrequencies );
would produce the output:
[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]
Or, with all zeros removed:
[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]
Note that the result vector contains only the term frequencies within the given set of documents, NOT the overall frequencies across the entire index! The term "planet" occurs 4 times in the entire index, but only 2 times in the given set of documents.
A naive implementation would be to simply iterate over all the documents in the docs set, create a map, and count each term. But I need a solution that also works with a set of 100,000 or 500,000 documents.
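To make the naive approach concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: documents are modeled as plain sets of strings, and the class name NaiveTermFrequencies and its count method are hypothetical names chosen only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: documents are modeled as plain
// sets of terms instead of real index entries.
class NaiveTermFrequencies {

    // Counts, for each term, how many documents in the given subset contain it.
    static Map<String, Integer> count(List<Set<String>> docs) {
        Map<String, Integer> frequencies = new TreeMap<>(); // sorted for stable output
        for (Set<String> doc : docs) {
            for (String term : doc) {
                // Insert 1 for a new term, otherwise add 1 to the existing count.
                frequencies.merge(term, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Set<String> doc2 = Set.of("gallente", "dodixie", "armor", "planet");
        Set<String> doc3 = Set.of("amarr", "laser", "armor", "planet");
        System.out.println(count(List.of(doc2, doc3)));
        // prints {amarr=1, armor=2, dodixie=1, gallente=1, laser=1, planet=2}
    }
}
```

The running time of this loop grows with the total number of terms across all selected documents, which is exactly what worries me for sets of several hundred thousand documents.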
Is there a function in Lucene that I can use to get this term vector? If there is no such function, what would a data structure look like that one could build at index time to get such a vector easily and quickly?
I'm not a Lucene specialist, so I'm sorry if the solution is obvious or trivial.
Perhaps worth mentioning: the solution should be fast enough for a web application, where it is applied to the results of client search queries.