Solving awful performance after upgrading from Lucene 4.0 to 4.1

After upgrading from Lucene 4.0 to 4.1, my solution's performance deteriorated by more than an order of magnitude. The immediate cause is the unconditional compression of stored fields. For now I have reverted to 4.0, but that is clearly not the way forward; I am hoping to find a different approach altogether.

I use Lucene as a database index, meaning my stored fields are quite short: just a few words.

I use a CustomScoreQuery where, in CustomScoreProvider#customScore, I end up loading all the candidate documents and doing detailed word-by-word scoring on demand. I use two levels of heuristics to narrow down the set of candidate documents (based on the Dice coefficient), but in the last step I need to match each query word against each word of the document (the words may be in a different order) and calculate the total score from the sum of the best word matches.
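Roughly, that last step looks like the sketch below (the identifiers are illustrative placeholders, not my actual code):

    // Final scoring step (illustrative): match every query word against every
    // document word and sum the best match found for each query word.
    static double similarityScore(String[] queryWords, String[] docWords) {
        double total = 0;
        for (String q : queryWords) {
            double best = 0;
            for (String d : docWords) {
                double sim = wordMatch(q, d); // stand-in for my real per-word measure
                if (sim > best) best = sim;
            }
            total += best;
        }
        return total;
    }

    // Placeholder: the real measure is fuzzier than exact equality.
    static double wordMatch(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }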

How can I approach this differently and perform my calculation in a way that avoids the pitfall of loading compressed stored fields during query evaluation?

+4
2 answers

With Lucene 3.x, I had this:

    new CustomScoreQuery(bigramQuery, new FieldScoreQuery("bigram-count", Type.BYTE)) {
        protected CustomScoreProvider getCustomScoreProvider(IndexReader ir) {
            return new CustomScoreProvider(ir) {
                public double customScore(int docnum, float bigramFreq, float docBigramCount) {
                    ... calculate Dice coefficient using bigramFreq and docBigramCount ...
                    if (diceCoeff >= threshold) {
                        String[] stems = ir.document(docnum).getValues("stems");
                        ... calculate document similarity score using stems ...
                    }
                }
            };
        }
    }

This approach made it efficient to retrieve cached float values from stored fields, which I used to get the document's bigram count; it didn't allow strings to be retrieved, though, so I had to load the document to get what I needed to calculate its similarity score. This worked reasonably well until Lucene 4.1 switched to compressed stored fields.

The proper way to leverage Lucene 4's enhancements is to involve DocValues, like this:

    new CustomScoreQuery(bigramQuery) {
        protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext rc) {
            final AtomicReader ir = rc.reader();
            final DocValues.Source bgCountSrc = ir.docValues("bigram-count").getSource(),
                                   stemSrc = ir.docValues("stems").getSource();
            return new CustomScoreProvider(rc) {
                public float customScore(int docnum, float bgFreq, float... fScores) {
                    final long bgCount = bgCountSrc.getInt(docnum);
                    ... calculate Dice coefficient using bgFreq and bgCount ...
                    if (diceCoeff >= threshold) {
                        final String stems = stemSrc.getBytes(docnum, new BytesRef()).utf8ToString();
                        ... calculate document similarity score using stems ...
                    }
                }
            };
        }
    }
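For this to work, the DocValues fields must of course be written at index time. A sketch of the indexing side, assuming the Lucene 4.0-era DocValues field classes; countBigrams and stemsOf are hypothetical helpers standing in for my analysis code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.IntDocValuesField;
    import org.apache.lucene.document.StraightBytesDocValuesField;
    import org.apache.lucene.util.BytesRef;

    // Index-time side (sketch): write the bigram count and the stems as DocValues
    // so customScore can read them without touching (compressed) stored fields.
    Document doc = new Document();
    doc.add(new IntDocValuesField("bigram-count", countBigrams(text)));             // hypothetical helper
    doc.add(new StraightBytesDocValuesField("stems", new BytesRef(stemsOf(text)))); // hypothetical helper
    // ... add the regular searchable fields as before ...
    writer.addDocument(doc);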

This improved performance, from 16 ms per query (Lucene 3.x) down to 10 ms (Lucene 4.x).

0

In IndexWriterConfig you can pass in a Codec, which defines the storage format the index will use. This only takes effect when the IndexWriter is constructed (i.e. changing the config after construction will have no effect). You will want to use Lucene40Codec.

Something like:

    // You could also simply pass in Version.LUCENE_40 here, and not worry about the Codec
    // (though that will likely affect other things as well)
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
    config.setCodec(new Lucene40Codec());
    IndexWriter writer = new IndexWriter(directory, config);

You could also use Lucene40StoredFieldsFormat to get the old, uncompressed stored fields format, and return it from a custom Codec implementation. You could probably take most of the code from Lucene41Codec and simply replace the storedFieldsFormat() method. That might be the more targeted approach, but it is also more complex, and I don't know for certain whether you might run into other problems.

One more note on creating a custom codec: the API indicates the way to do this is to extend FilterCodec. Slightly modifying their example:

    public final class CustomCodec extends FilterCodec {

        public CustomCodec() {
            super("CustomCodec", new Lucene41Codec());
        }

        @Override
        public StoredFieldsFormat storedFieldsFormat() {
            return new Lucene40StoredFieldsFormat();
        }
    }
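One wiring note: when reading an index, Lucene resolves codecs by name through Java's SPI, so a custom codec also needs a service registration in addition to being set on the writer. A sketch (the package name here is just an example):

    // Write with the custom codec:
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
    config.setCodec(new CustomCodec());
    IndexWriter writer = new IndexWriter(directory, config);

    // Register it for readers by listing the fully-qualified class name in
    // META-INF/services/org.apache.lucene.codecs.Codec, e.g.:
    //   com.example.CustomCodec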


Of course, the other fix that comes to mind:

It seems apparent to you as well that the problem is that you "end up loading all the candidate documents." I won't critique a scoring implementation I don't have full details on or a full understanding of, but it sounds like you are fighting Lucene's architecture to make it do what you want. Generally, stored fields should not be used for scoring, and you can expect performance to suffer very noticeably when doing so even with the 4.0 stored fields format, though to a somewhat lesser extent. Might there be a better implementation, either in terms of the scoring algorithm or in terms of document structure, that would remove the requirement to score documents based on stored fields?
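For instance (purely a sketch, since I don't know your data or scoring in detail): if the stems were indexed with term vectors instead of being read back from stored fields, the per-document words could be pulled through the term vector API. Term vectors have per-document access costs of their own, but it illustrates the kind of restructuring I mean:

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // Sketch: read a document's stems from its term vector rather than a stored
    // field. Assumes the "stems" field was indexed with term vectors enabled.
    Terms terms = ir.getTermVector(docnum, "stems");
    if (terms != null) {
        TermsEnum te = terms.iterator(null); // Lucene 4.x iterator(reuse) signature
        BytesRef term;
        while ((term = te.next()) != null) {
            String stem = term.utf8ToString();
            // ... feed each stem into the similarity calculation ...
        }
    }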

+2
