What is the most efficient way to get all matching documents from a query in Lucene, unsorted?

Question

I want to run a query for index-maintenance purposes, for example removing all traces of a specific field/value from an index. It is therefore important that I find all matching documents (not just the top n docs), but the order in which they are returned does not matter.

According to the docs, it seems like I need to use the Searcher.Search( Query, Collector ) method, but there is no built-in Collector class that does what I need.

Should I write my own Collector for this purpose? What do I need to keep in mind when doing so?

+6

c# .net lucene lucene.net

devios1 Mar 25 '11 at 15:58

2 answers

You don't need to write a hit collector if you just want every Document object in the index. Loop from 0 to maxDoc() and call reader.document() for each document ID, remembering to skip deleted documents:

    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) continue; // skip deleted documents
        results[i] = reader.document(i);
    }

0

bajafresh4life Mar 26 '11 at 0:58

devios1 · Accepted Answer · 2011-03-25T18:06:54+0000

Turns out it was a lot easier than I expected. I just used the sample implementation at http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Collector.html and recorded the document numbers passed to the Collect() method in a list, exposing it as a public Docs property.

Then I iterate over that property, passing each number back to the Searcher to get the corresponding Document:

    var searcher = new IndexSearcher( reader );
    var collector = new IntegralCollector(); // my custom Collector
    searcher.Search( query, collector );
    var result = new Document[ collector.Docs.Count ];
    for ( int i = 0; i < collector.Docs.Count; i++ )
        result[ i ] = searcher.Doc( collector.Docs[ i ] );
    searcher.Close(); // this is probably not needed
    reader.Close();

So far, it seems to work just fine in preliminary tests.

Update: Here's the IntegralCollector code:

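    using System.Collections.Generic;

    internal class IntegralCollector : Lucene.Net.Search.Collector {
        private int _docBase;                 // offset of the current segment reader
        private List<int> _docs = new List<int>();

        public List<int> Docs {
            get { return _docs; }
        }

        // Order does not matter here, so out-of-order collection is fine.
        public override bool AcceptsDocsOutOfOrder() {
            return true;
        }

        // doc is relative to the current reader; add the base to get the index-wide number.
        public override void Collect( int doc ) {
            _docs.Add( _docBase + doc );
        }

        public override void SetNextReader( Lucene.Net.Index.IndexReader reader, int docBase ) {
            _docBase = docBase;
        }

        // Scores are not needed, so the scorer is ignored.
        public override void SetScorer( Lucene.Net.Search.Scorer scorer ) {
        }
    }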
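A note on why the collector adds _docBase: with the Collector API introduced in Lucene 2.9, Collect() is handed document numbers relative to the current segment reader, and SetNextReader() supplies that segment's offset into the whole index. The sketch below is only a toy model in plain Java to show the arithmetic. It does not use Lucene at all, and every name in it (DocBaseDemo, ListCollector, collectAll) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: no Lucene classes are used, and all names are hypothetical.
public class DocBaseDemo {

    // Mimics the shape of the collector above: remembers the current
    // segment's offset and stores index-wide document numbers.
    static class ListCollector {
        private int docBase;
        final List<Integer> docs = new ArrayList<>();

        void setNextReader(int docBase) {
            this.docBase = docBase;
        }

        void collect(int doc) {
            docs.add(docBase + doc); // rebase the segment-relative id
        }
    }

    // Feeds segment-relative hits to a collector.
    // segmentSizes[i] is the number of document slots in segment i;
    // hits[i] lists the matching ids within segment i.
    static List<Integer> collectAll(int[] segmentSizes, int[][] hits) {
        ListCollector collector = new ListCollector();
        int docBase = 0;
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            collector.setNextReader(docBase);
            for (int doc : hits[seg]) {
                collector.collect(doc);
            }
            docBase += segmentSizes[seg]; // the next segment starts after this one
        }
        return collector.docs;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 4};
        int[][] hits = {{0, 2}, {1, 3}};
        System.out.println(collectAll(sizes, hits)); // prints [0, 2, 4, 6]
    }
}
```

Without the offset, the hits from the second segment would be stored as 1 and 3 and would later resolve to the wrong documents in the first segment.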
