I have a problem with Lucene's scoring function that I cannot understand. So far, I have been able to reproduce it with the following code:
package lucenebug;

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Test {

    private static final String TMP_LUCENEBUG_INDEX = "/tmp/lucenebug_index";

    public static void main(String[] args) throws Throwable {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        IndexWriter w = new IndexWriter(TMP_LUCENEBUG_INDEX, analyzer, true);
        List<String> names = Arrays.asList(new String[] {
                "the rolling stones",
                "rolling stones (karaoke)",
                "the rolling stones tribute",
                "rolling stones tribute band",
                "karaoke - the rolling stones" });
        try {
            for (String name : names) {
                System.out.println("#name: " + name);
                Document doc = new Document();
                doc.add(new Field("name", name, Field.Store.YES, Field.Index.TOKENIZED));
                w.addDocument(doc);
            }
            System.out.println("finished adding docs, total size: " + w.docCount());
        } finally {
            w.close();
        }

        IndexSearcher s = new IndexSearcher(TMP_LUCENEBUG_INDEX);
        QueryParser p = new QueryParser("name", analyzer);
        Query q = p.parse("name:(rolling stones)");
        System.out.println("--------\nquery: " + q);
        TopDocs topdocs = s.search(q, null, 10);
        for (ScoreDoc sd : topdocs.scoreDocs) {
            System.out.println("" + sd.score + "\t" + s.doc(sd.doc).getField("name").stringValue());
        }
    }
}
Running it produces the following output:
finished adding docs, total size: 5
--------
query: name:rolling name:stones
0.578186	the rolling stones
0.578186	rolling stones (karaoke)
0.578186	the rolling stones tribute
0.578186	rolling stones tribute band
0.578186	karaoke - the rolling stones
I just don't understand why "the rolling stones" gets the same relevance score as "the rolling stones tribute". According to the Lucene documentation, the more tokens a field has, the smaller its length normalization factor should be, so "the rolling stones tribute" should score lower than "the rolling stones".
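To make that expectation concrete, here is a minimal sketch of the numbers I would expect. It assumes the default similarity computes the length norm as 1/sqrt(number of tokens in the field), and it only approximates SimpleAnalyzer by lowercasing and splitting on non-letter characters. The class and method names (`LengthNormSketch`, `countTokens`, `expectedLengthNorm`) are hypothetical and not part of Lucene, and the sketch says nothing about how Lucene stores the norm in the index.

```java
import java.util.Arrays;
import java.util.List;

public class LengthNormSketch {

    // Rough stand-in for SimpleAnalyzer (an assumption, not Lucene code):
    // lowercase the text and split it on runs of non-letter characters.
    static int countTokens(String text) {
        int count = 0;
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Length norm as I understand the documentation: 1 / sqrt(numTokens).
    static double expectedLengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "the rolling stones",
                "rolling stones (karaoke)",
                "the rolling stones tribute",
                "rolling stones tribute band",
                "karaoke - the rolling stones");
        for (String name : names) {
            int n = countTokens(name);
            // Prints: token count, expected norm, field value.
            System.out.println(n + "\t" + expectedLengthNorm(n) + "\t" + name);
        }
    }
}
```

Under these assumptions, the three-token names should get a norm of about 0.577 and the four-token names a norm of 0.5, so I would expect two distinct scores rather than one.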
Any ideas?