I started work on a resume retrieval (document search) component based on the Lucene.Net engine. It works well: it retrieves documents and scores them using the Vector Space Model (VSM).
The idea underlying VSM: the more often a query term appears in a document, relative to how often that term appears across all documents in the collection, the more relevant that document is to the query.
Lucene's practical scoring function is given below.
score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]
where:
- tf(t in d) is the term frequency: the number of times term t appears in the currently scored document d. Documents with more occurrences of a term receive a higher score.
- idf(t) is the inverse document frequency. It correlates with the inverse of docFreq (the number of documents in which term t appears), meaning rarer terms contribute more to the overall score.
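For reference, these factors can be sketched in Python. This is a rough rendering of the formulas used by Lucene's classic similarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(fieldLength)), not the actual library code:

```python
import math

def tf(freq):
    # Term frequency factor: square root of the raw count.
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # Inverse document frequency: rarer terms score higher.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Field length norm: shorter fields get a larger factor.
    return 1.0 / math.sqrt(num_terms)

# A term appearing 4 times in a 100-term field, in 3,000 of 10,000 docs:
print(tf(4))                       # 2.0
print(round(idf(10000, 3000), 3))
print(length_norm(100))            # 0.1
```

These three factors, multiplied together per query term (along with boosts), drive the ranking.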
In most cases this works very well, but for our use case the fieldNorm factor makes the results inaccurate.
fieldNorm, a.k.a. the "field length norm", reflects the length of this field in this document: shorter fields automatically receive a higher norm, and therefore a higher score.
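To see how strongly the length norm favors short fields, here is a quick sketch assuming the classic 1/sqrt(length) formula:

```python
import math

def length_norm(num_terms):
    # Classic Lucene field length norm: 1 / sqrt(number of terms).
    return 1.0 / math.sqrt(num_terms)

for n in (50, 100, 1000):
    print(n, round(length_norm(n), 4))
# A 50-term field gets a norm about 4.5x larger than a 1000-term field.
```

Note that Lucene additionally encodes this norm into a single byte at index time, which loses further precision.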
Because of this, we did not get accurate results. Say, for example, the index contains 10,000 documents, of which 3,000 contain the keywords java and oracle, with the number of occurrences varying per document.
- Suppose doc A contains 10 "java" and 20 "oracle" in 1,000 words, while doc B contains 2 "java" and 2 "oracle" in 50 words.
- "java oracle", lucene doc B
- .
- , , , .
- , .
, . - ?
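The doc A vs. doc B comparison can be checked numerically. The sketch below computes only the per-term tf × lengthNorm part of the classic formula: idf, coord, queryNorm, and boosts are identical for both documents here, so they cancel out of the comparison:

```python
import math

def partial_score(freqs, field_len):
    # Sum of tf(t) * lengthNorm over the query terms; the factors
    # shared by both documents (idf, boosts, queryNorm) are omitted.
    norm = 1.0 / math.sqrt(field_len)
    return sum(math.sqrt(f) * norm for f in freqs)

doc_a = partial_score([10, 20], 1000)  # 10x java, 20x oracle in 1000 words
doc_b = partial_score([2, 2], 50)      # 2x java, 2x oracle in 50 words

print(round(doc_a, 3), round(doc_b, 3))
# Doc B outscores doc A despite having far fewer matches.
```

Under these assumptions doc B scores roughly 0.4 against doc A's roughly 0.24, so the short document wins by a wide margin.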
I opened the index in Luke (the Lucene index inspection tool) to check the stored values.
For example, one document in which java appears 50 times is ranked 11th, while another in which java appears only 24 times is ranked 5th, purely because its fieldNorm is higher.

In short, for this kind of search the field length norm works against us, so the next step is to neutralize its influence on the score.