How is field length determined in Solr / Lucene?

As I understand it, the length of a field in a document is the number of terms indexed in that field for that document. However, the field length never seems to be an integer. For example, I saw a document with two terms in its content field, but the fieldLength calculated by Solr is actually 2.56, not 2 as I expected. How is the field length actually calculated in Solr / Lucene?

To be specific, I mean the field length as it is used in computing the score with the BM25 similarity function, though I assume field lengths are also computed for other ranking schemes.

2 answers

As I see in the code for BM25Similarity:

public final long computeNorm(FieldInvertState state) {
  final int numTerms = discountOverlaps
      ? state.getLength() - state.getNumOverlap()
      : state.getLength();
  return encodeNormValue(state.getBoost(), numTerms);
}

where state.getLength() is:

/**
 * Get total number of terms in this field.
 * @return the length
 */
public int getLength() {
  return length;
}

So this actually is an integer. Could you tell me where you are seeing non-integer values? The Solr Admin UI? Somewhere else?
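
As a side note on the computeNorm code above: getNumOverlap() counts tokens with a position increment of zero (for example, synonyms injected at the same position), and discountOverlaps (which defaults to true in BM25Similarity) excludes them from the length. A minimal sketch of that arithmetic, with hypothetical token counts:

public class OverlapDiscount {
  public static void main(String[] args) {
    // Hypothetical example: field text "fast car" analyzed with a synonym
    // filter that injects "quick" at the same position as "fast".
    int length = 3;      // tokens indexed: "fast", "quick" (overlap), "car"
    int numOverlap = 1;  // "quick" has positionIncrement == 0
    boolean discountOverlaps = true; // BM25Similarity's default

    int numTerms = discountOverlaps ? length - numOverlap : length;
    System.out.println(numTerms); // 2 -- this is the length that gets encoded
  }
}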

Now that you have posted the explain output, I have found the place it comes from: source

Take a look at this:

private Explanation explainTFNorm(int doc, Explanation freq,
                                  BM25Stats stats, NumericDocValues norms) {
  List<Explanation> subs = new ArrayList<>();
  subs.add(freq);
  subs.add(Explanation.match(k1, "parameter k1"));
  if (norms == null) {
    subs.add(Explanation.match(0, "parameter b (norms omitted for field)"));
    return Explanation.match(
        (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1),
        "tfNorm, computed from:", subs);
  } else {
    float doclen = decodeNormValue((byte) norms.get(doc));
    subs.add(Explanation.match(b, "parameter b"));
    subs.add(Explanation.match(stats.avgdl, "avgFieldLength"));
    subs.add(Explanation.match(doclen, "fieldLength"));
    return Explanation.match(
        (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1 * (1 - b + b * doclen / stats.avgdl)),
        "tfNorm, computed from:", subs);
  }
}

So the fieldLength they output is produced by float doclen = decodeNormValue((byte) norms.get(doc)); and decodeNormValue looks like this:

/** The default implementation returns <code>1 / f<sup>2</sup></code>
 *  where <code>f</code> is {@link SmallFloat#byte315ToFloat(byte)}. */
protected float decodeNormValue(byte b) {
  return NORM_TABLE[b & 0xFF];
}

/** Cache of decoded bytes. */
private static final float[] NORM_TABLE = new float[256];

static {
  for (int i = 1; i < 256; i++) {
    float f = SmallFloat.byte315ToFloat((byte) i);
    NORM_TABLE[i] = 1.0f / (f * f);
  }
  NORM_TABLE[0] = 1.0f / NORM_TABLE[255]; // otherwise inf
}
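
This lossy round trip is exactly where the 2.56 in the question comes from. Here is a minimal sketch, assuming a Lucene 4.x-6.x classpath (where SmallFloat.floatToByte315 / byte315ToFloat still exist and BM25Similarity#encodeNormValue stores boost / sqrt(fieldLength), as in the computeNorm code above):

import org.apache.lucene.util.SmallFloat;

public class NormRoundTrip {
  public static void main(String[] args) {
    float boost = 1f;
    int fieldLength = 2; // two terms in the content field

    // Index time: encodeNormValue compresses boost / sqrt(fieldLength)
    // into a single byte.
    byte norm = SmallFloat.floatToByte315(boost / (float) Math.sqrt(fieldLength));

    // Search time: decodeNormValue computes 1 / f^2.
    float f = SmallFloat.byte315ToFloat(norm); // 0.7071 was truncated to 0.625
    float doclen = 1.0f / (f * f);

    System.out.println(doclen); // 2.56 -- the non-integer length from the question
  }
}

The encoding keeps only a few mantissa bits (between 0.5 and 1.0 the representable values are 0.125 apart), so 1/sqrt(2) = 0.7071 is truncated down to 0.625, and the decode table returns 1 / 0.625^2 = 2.56.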

In fact, looking at the Wikipedia article on Okapi BM25, this doclen corresponds to

|D|, the length of the document D in words
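
For reference, the BM25 formula from that Wikipedia article is

score(D, Q) = sum over q_i in Q of
    IDF(q_i) * f(q_i, D) * (k1 + 1) / ( f(q_i, D) + k1 * (1 - b + b * |D| / avgdl) )

and the else branch of explainTFNorm above computes exactly that right-hand fraction, with doclen in the role of |D|. A small sketch of the arithmetic, using Lucene's default k1 = 1.2 and b = 0.75, the decoded doclen of 2.56, and a hypothetical average field length:

public class TfNormSketch {
  public static void main(String[] args) {
    float k1 = 1.2f, b = 0.75f; // Lucene's BM25 defaults
    float freq = 1f;            // term frequency in this document
    float doclen = 2.56f;       // decoded fieldLength, as above
    float avgdl = 10f;          // hypothetical average field length

    float tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * doclen / avgdl));
    System.out.println(tfNorm); // ~1.4375
  }
}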


To expand on the previous answer: fieldLength is calculated through a lossy normalization (encoding / decoding) scheme in the SmallFloat.java class, which basically compresses a 32-bit float into a single byte to save disk space when storing the norm.

Here is the Javadoc description of the norm encode / decode cycle that produces the fieldLength used in BM25:

The default implementation {@link #encodeNormValue(float) encodes} norm values as a single byte before being stored. At search time, the norm byte value is read from the index {@link org.apache.lucene.store.Directory directory} and {@link #decodeNormValue(long) decoded} back to a float norm value. This encoding / decoding, while reducing index size, comes with the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For example, decode(encode(0.89)) = 0.875.
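
You can reproduce that precision loss directly. A minimal sketch, assuming a Lucene version (4.x-6.x) that still ships SmallFloat.floatToByte315 / byte315ToFloat:

import org.apache.lucene.util.SmallFloat;

public class PrecisionLoss {
  public static void main(String[] args) {
    float x = 0.89f;
    byte encoded = SmallFloat.floatToByte315(x); // lossy 8-bit encoding
    float decoded = SmallFloat.byte315ToFloat(encoded);
    System.out.println(decoded); // 0.875 -- decode(encode(0.89)) != 0.89
  }
}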

Hope this helps.

