Lucene fieldNorm discrepancy between similarity calculation and query-time value

I am trying to understand how the fieldNorm is computed (at index time) and then used (and apparently recomputed) at query time.

In all the examples, I use the StandardAnalyzer with no stop words.

Stepping through the DefaultSimilarity computeNorm method at index time, I noticed that for two particular documents it returns:

  • 0.5 for document A (which has 4 tokens in its field)
  • 0.70710677 for document B (which has 2 tokens in its field)

It does this using the formula:

 state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))); 

where the boost is always 1.
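
Plugging the two token counts from the documents above into that formula reproduces exactly the values seen while debugging; here is a minimal standalone sketch (the class and method names are mine, purely for illustration):

 // Minimal sketch of the DefaultSimilarity length normalization, with boost = 1.
 public class LengthNormDemo {
     static float lengthNorm(float boost, int numTerms) {
         return boost * ((float) (1.0 / Math.sqrt(numTerms)));
     }

     public static void main(String[] args) {
         System.out.println(lengthNorm(1.0f, 4)); // document A, 4 tokens: 0.5
         System.out.println(lengthNorm(1.0f, 2)); // document B, 2 tokens: 0.70710677
     }
 }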

Subsequently, when I query for these documents, I see in the query explain output that I get:

  • 0.5 = fieldNorm(field=titre, doc=0) for document A
  • 0.625 = fieldNorm(field=titre, doc=1) for document B

This is already strange (to me; I am sure it is me who is missing something). Why don't I get the same fieldNorm values at query time as the ones computed at index time? Is this "query normalization" in action? If so, how does it work?

This is, however, more or less acceptable, since the two query-time fieldNorms preserve the same ordering as the ones computed at index time (the field with the shorter value has the higher fieldNorm in both cases).

Then I implemented my own Similarity class, in which I overrode the computeNorm method:

 public float computeNorm(String pField, FieldInvertState state) {
     // note the "+": boost (always 1) plus 1/sqrt(length)
     float norm = (float) (state.getBoost() + (1.0d / Math.sqrt(state.getLength())));
     return norm;
 }
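
For context, here is a rough sketch of how such a custom Similarity would typically be wired in at both index and search time, assuming Lucene 3.x APIs (the same era as the computeNorm(String, FieldInvertState) signature above); MySimilarity and directory are illustrative names, not part of my actual code:

 // Hypothetical wiring sketch (Lucene 3.x); MySimilarity is the custom class above.
 Similarity sim = new MySimilarity();

 // Index time: register the similarity on the writer configuration.
 IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_36,
         new StandardAnalyzer(Version.LUCENE_36));
 conf.setSimilarity(sim);
 IndexWriter writer = new IndexWriter(directory, conf);

 // Search time: give the searcher the same similarity so scoring matches.
 IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
 searcher.setSimilarity(sim);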

At index time, I now get:

  • 1.5 for document A (which has 4 tokens in its field)
  • 1.7071068 for document B (which has 2 tokens in its field)

However, when I now query for these documents, I see in the explain output that they both have the same fieldNorm:

  • 1.5 = fieldNorm(field=titre, doc=0) for document A
  • 1.5 = fieldNorm(field=titre, doc=1) for document B

Now this is really puzzling to me: even though I use an apparently good Similarity that computes the fieldNorm at index time and gives me values proportional to the number of tokens, at query time all of this seems lost and the explain output says both documents have the same fieldNorm. How can that be?

So my questions are:

  • why doesn't the index-time fieldNorm, as computed by the Similarity's computeNorm method, stay the same as the one reported in the query explain output?
  • why, for two different fieldNorm values computed at index time (via computeNorm), do I get identical fieldNorm values at query time?

== UPDATE

OK, I found something in the Lucene docs that clarifies some of my questions, but not all of them:

However the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.

How much is the precision loss? Is there a minimum gap between different values that guarantees they remain distinct even after the lossy encode/decode round trip?

1 answer

The documentation of encodeNormValue describes the encoding step (which is where the precision is lost) and, in particular, the final representation of the value:

The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.

The most important part to understand is that the mantissa is only 3 bits, which means that the accuracy is around one significant decimal digit.
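
To see the effect concretely, here is a small sketch that round-trips the norm values from the question through the one-byte encoding, assuming a Lucene 3.x-era API where Similarity exposes encodeNormValue/decodeNormValue; the exact decoded values depend on the Lucene version, but the point is that nearby inputs can collapse onto the same representable value:

 import org.apache.lucene.search.DefaultSimilarity;
 import org.apache.lucene.search.Similarity;

 // Sketch (assumes Lucene 3.x): round-trip a few norm values through the
 // single-byte encoding to observe the precision loss described in the docs.
 public class NormRoundTrip {
     public static void main(String[] args) {
         Similarity sim = new DefaultSimilarity();
         float[] norms = { 0.5f, 0.70710677f, 1.5f, 1.7071068f };
         for (float n : norms) {
             byte encoded = sim.encodeNormValue(n);
             float decoded = sim.decodeNormValue(encoded);
             // With only a 3-bit mantissa, distinct inputs may decode to the same float.
             System.out.println(n + " -> byte " + encoded + " -> " + decoded);
         }
     }
 }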

An important note on the rationale comes a few sentences after where your quote ends, where the Lucene docs say:

The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
