A problem with Lucene scoring

I have a problem with Lucene's scoring that I cannot understand. I have reduced it to the following code.

package lucenebug;

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Test {
    private static final String TMP_LUCENEBUG_INDEX = "/tmp/lucenebug_index";

    public static void main(String[] args) throws Throwable {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        IndexWriter w = new IndexWriter(TMP_LUCENEBUG_INDEX, analyzer, true);
        List<String> names = Arrays.asList(new String[] {
                "the rolling stones", "rolling stones (karaoke)",
                "the rolling stones tribute", "rolling stones tribute band",
                "karaoke - the rolling stones" });
        try {
            for (String name : names) {
                System.out.println("#name: " + name);
                Document doc = new Document();
                doc.add(new Field("name", name, Field.Store.YES,
                        Field.Index.TOKENIZED));
                w.addDocument(doc);
            }
            System.out.println("finished adding docs, total size: "
                    + w.docCount());
        } finally {
            w.close();
        }
        IndexSearcher s = new IndexSearcher(TMP_LUCENEBUG_INDEX);
        QueryParser p = new QueryParser("name", analyzer);
        Query q = p.parse("name:(rolling stones)");
        System.out.println("--------\nquery: " + q);
        TopDocs topdocs = s.search(q, null, 10);
        for (ScoreDoc sd : topdocs.scoreDocs) {
            System.out.println("" + sd.score + "\t"
                    + s.doc(sd.doc).getField("name").stringValue());
        }
    }
}

The output I get when I run it is:

 finished adding docs, total size: 5
 --------
 query: name:rolling name:stones
 0.578186	the rolling stones
 0.578186	rolling stones (karaoke)
 0.578186	the rolling stones tribute
 0.578186	rolling stones tribute band
 0.578186	karaoke - the rolling stones

I just don't understand why "the rolling stones" gets the same relevance score as "the rolling stones tribute". According to the Lucene documentation, the more tokens a field has, the lower its normalization factor should be, so "the rolling stones tribute" should score lower than "the rolling stones".

Any ideas?

2 answers

The length normalization factor is calculated as 1 / sqrt(numTerms); you can see this in DefaultSimilarity.lengthNorm().

This result is not stored directly in the index. It is first multiplied by the boost value of the field, and the product is then encoded into a single byte (8 bits), as described in Similarity.encodeNorm(). This is a lossy encoding, which means that small differences are lost.

If you want to see length normalization in action, try adding a document with the following field value.

 the rolling stones tribute abcdefghijk 

This creates a large enough difference in the length normalization values for it to survive the encoding and show up in the scores.
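The quantization can be sketched with a toy model (my own illustration, not Lucene's exact Similarity.encodeNorm/SmallFloat code): keep the binary exponent plus only a couple of mantissa bits and truncate the rest. At that granularity, 1/sqrt(3) and 1/sqrt(4) land in the same bucket, which would explain why the 3-token and 4-token documents above all score identically:

```java
// Toy model of Lucene's lossy one-byte norm encoding (an approximation,
// not the real SmallFloat code): keep the binary exponent and two
// explicit mantissa bits, truncating everything else.
public class NormQuantizer {
    static double quantize(double f) {
        int exp = Math.getExponent(f);           // exact binary exponent of f
        double scale = Math.scalb(1.0, exp);     // 2^exp
        double frac = f / scale;                 // mantissa, in [1, 2)
        return Math.floor(frac * 4) / 4 * scale; // truncate mantissa to 1/4 steps
    }

    public static void main(String[] args) {
        for (int terms : new int[] { 3, 4, 5 }) {
            // DefaultSimilarity.lengthNorm is 1 / sqrt(numTerms)
            double norm = 1.0 / Math.sqrt(terms);
            System.out.printf("%d terms: lengthNorm=%.4f stored=%.4f%n",
                    terms, norm, quantize(norm));
        }
    }
}
```

Under this model a 3-term and a 4-term field both store 0.5, while a 5-term field stores 0.4375; that is why adding one more token, as in the sentence above, makes the difference visible.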

Now, if your fields contain very few tokens, as in the examples you used, you can set boost values for documents or fields based on your own formula, one that is significantly higher for short fields. Alternatively, you can create a custom Similarity and override the lengthNorm() method.
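As a sketch of why overriding lengthNorm() can help (my own illustration; the 1/numTerms formula is hypothetical, not Lucene's default), a norm that decays faster than 1/sqrt(numTerms) puts 3- and 4-term fields into different buckets of the same coarse encoding:

```java
// Hypothetical steeper length norm: 1/numTerms instead of the default
// 1/sqrt(numTerms). The quantizer mimics the coarse one-byte norm
// encoding (an approximation, not Lucene's exact code).
public class SteeperLengthNorm {
    static double lengthNorm(int numTerms) {
        return 1.0 / numTerms; // hypothetical custom formula
    }

    static double quantize(double f) {
        int exp = Math.getExponent(f);           // exact binary exponent of f
        double scale = Math.scalb(1.0, exp);     // 2^exp
        double frac = f / scale;                 // mantissa, in [1, 2)
        return Math.floor(frac * 4) / 4 * scale; // truncate mantissa to 1/4 steps
    }

    public static void main(String[] args) {
        for (int terms : new int[] { 3, 4, 5 }) {
            System.out.printf("%d terms: norm=%.4f stored=%.4f%n",
                    terms, lengthNorm(terms), quantize(lengthNorm(terms)));
        }
    }
}
```

With this formula, 3 terms store as 0.3125 and 4 terms as 0.25, so the shorter field keeps a visibly higher norm even after quantization.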


I can reproduce it on Lucene 2.3.1, but I don't know why it happens.

