I use Lucene to get the frequency of terms in documents, i.e. the number of occurrences of a certain term in each document. I use IndexReader.termDocs() for this purpose, and it works fine for single-word terms, but since all words are stored separately in the index, it doesn't work for multi-word terms.
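For reference, this is roughly how I read the counts for single-word terms (Lucene 3.x; the field name “text” is just an example):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;

// Print the per-document frequency of a single-word term.
static void printTermFrequencies(Directory directory, String word) throws Exception {
    IndexReader reader = IndexReader.open(directory);
    TermDocs termDocs = reader.termDocs(new Term("text", word));
    while (termDocs.next()) {
        // doc() is the internal document id, freq() is the number of
        // occurrences of the term in that document
        System.out.println("doc " + termDocs.doc() + ": " + termDocs.freq());
    }
    termDocs.close();
    reader.close();
}
```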
Example (taken from this question): I'm interested in the frequency of the term “basketball” (or even “basket ball”), but after tokenization there will be two words, so I can get the frequency of the term “basket” and of the term “ball”, but not of the term “basketball”.
I know in advance all the multi-word terms whose frequency I want to get, and I'm not interested in storing the source text; I only need the statistics. So my first approach was to simply concatenate the words of each such term in the source text. For instance, “Yesterday I played basket ball” becomes “Yesterday I played basketball”, and “My favorite writer is Kurt Vonnegut” becomes “My favorite writer is KurtVonnegut”. This works: concatenated terms are treated like any other single word, so I can easily get their frequency. But this approach is ugly and, more importantly, very slow, so I came up with another one.
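In code, the first approach boils down to something like the following sketch (the phrase table is hard-coded purely for illustration). Its slowness is easy to see: every known phrase triggers another full scan of the text:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Naive pre-processing: replace every known multi-word term in the raw
// text with its concatenated form before the text reaches the analyzer.
public class PhraseConcatenator {
    private final Map<String, String> phrases = new LinkedHashMap<String, String>();

    public PhraseConcatenator() {
        phrases.put("basket ball", "basketball");     // illustrative entries
        phrases.put("Kurt Vonnegut", "KurtVonnegut");
    }

    public String concatenate(String text) {
        for (Map.Entry<String, String> e : phrases.entrySet()) {
            text = text.replace(e.getKey(), e.getValue()); // one full scan per phrase
        }
        return text;
    }
}
```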
My second approach is to write a special token filter that captures tokens and checks whether they are part of a term to be replaced (something like the SynonymFilter from Lucene in Action). In our case, when the filter sees the word “basket”, it reads one more token, and if that token is “ball”, the filter puts a single term (“basketball”) into the output token stream instead of the two terms (“basket” and “ball”). The advantage of this approach over the previous one is that it matches whole tokens rather than scanning the full text for substrings: most tokens will have different lengths, so they can be discarded without comparing a single character. But such a filter is not easy to write, and moreover, I'm not sure it will be fast enough for my needs.
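A rough skeleton of such a filter might look like this (a sketch against the Lucene 3.x attribute API, limited to two-word phrases; the phrase table is again hard-coded for illustration, and token offsets are not adjusted):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// When the current token starts a known two-word phrase, peek at the next
// token; if it completes the phrase, emit the concatenated term instead.
public final class PhraseMergingFilter extends TokenFilter {
    // first word -> (second word -> merged term)
    private final Map<String, Map<String, String>> phrases = new HashMap<String, Map<String, String>>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private State pendingState; // buffered lookahead token when the merge fails

    public PhraseMergingFilter(TokenStream input) {
        super(input);
        Map<String, String> completions = new HashMap<String, String>();
        completions.put("ball", "basketball");
        phrases.put("basket", completions);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingState != null) {            // emit the buffered token first
            restoreState(pendingState);
            pendingState = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        Map<String, String> completions = phrases.get(termAtt.toString());
        if (completions == null) {
            return true;                        // not a phrase start, pass through
        }
        State firstState = captureState();      // remember the first word
        if (!input.incrementToken()) {
            restoreState(firstState);           // stream ended, emit first word as-is
            return true;
        }
        String merged = completions.get(termAtt.toString());
        if (merged != null) {
            termAtt.setEmpty().append(merged);  // replace two tokens with one
            return true;
        }
        pendingState = captureState();          // no match: emit the first word now,
        restoreState(firstState);               // the buffered second word on the next call
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingState = null;
    }
}
```

A real implementation would also have to handle phrases longer than two words and the case where the buffered second token itself starts another phrase.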
Both of these approaches seem to me like workarounds rather than proper solutions. Perhaps there is also a way to get the frequency of multi-word terms directly, through TermDocs or some other mechanism, that I am simply not aware of.
So, my question is: what is the best way to count the frequency of multi-word terms in Lucene?