Writing to a Lucene index one document at a time slows down over time

We have a program that runs constantly, performs various actions, and modifies certain entries in our database. These entries are indexed using Lucene. So every time we change an entity, we do something like:

  • open a DB transaction and open a Lucene IndexWriter
  • make the changes to the DB within the transaction, and update that object in Lucene by calling indexWriter.deleteDocuments(..) followed by indexWriter.addDocument(..)
  • if everything went well, commit the DB transaction and commit the IndexWriter (a minimal sketch of this pattern follows the list)
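For reference, here is a minimal sketch of that per-entity pattern, assuming JDBC on the database side and Lucene 5.x; the class name and the updateEntityRow(..) placeholder are illustrative, not part of the actual program:

import java.sql.Connection;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;

// Sketch of the per-entity update pattern described above (Lucene 5.x assumed).
// "updateEntityRow" stands in for whatever JDBC update is actually performed.
class EntityIndexUpdater {
    void updateEntity(Connection conn, IndexWriter indexWriter, String id, String text) throws Exception {
        conn.setAutoCommit(false);
        try {
            updateEntityRow(conn, id, text);                 // change the entry in the DB

            indexWriter.deleteDocuments(new Term("id", id)); // remove the stale Lucene document
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new TextField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);                    // add the replacement

            conn.commit();                                   // commit the DB transaction
            indexWriter.commit();                            // the call that slows down over time
        } catch (Exception e) {
            conn.rollback();
            indexWriter.rollback();                          // note: rollback() also closes the writer
            throw e;
        }
    }

    void updateEntityRow(Connection conn, String id, String text) {
        // placeholder for the actual JDBC update
    }
}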

This works fine, but over time indexWriter.commit() takes longer and longer. Initially it takes about 0.5 seconds, but after a few hundred such transactions it takes more than 3 seconds, and I have no doubt it would take even longer if the process kept running.

My solution so far has been to comment out indexWriter.addDocument(..) and indexWriter.commit(), and to rebuild the entire index every once in a while by first calling indexWriter.deleteAll() and then re-adding all the documents within a single Lucene transaction / IndexWriter (about 250,000 documents in about 14 seconds, as sketched below). But this obviously runs counter to the transactional approach offered by the database and Lucene, which keeps the two in sync and keeps database updates visible to users of our tools who search with Lucene.
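A minimal sketch of that full-reindex workaround, under the same Lucene 5.x assumption; the Map parameter is just an illustrative stand-in for however the 250,000 entities are actually fetched:

import java.util.Map;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;

// Sketch of the full-reindex workaround: wipe the index and re-add every
// document inside a single IndexWriter session with one commit at the end.
class FullReindexer {
    void reindexAll(IndexWriter indexWriter, Map<String, String> allEntities) throws Exception {
        indexWriter.deleteAll();                       // drop every existing document
        for (Map.Entry<String, String> e : allEntities.entrySet()) {
            Document doc = new Document();
            doc.add(new StringField("id", e.getKey(), Field.Store.YES));
            doc.add(new TextField("text", e.getValue(), Field.Store.YES));
            indexWriter.addDocument(doc);
        }
        indexWriter.commit();                          // one commit for ~250k documents
    }
}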

It seems strange that I can add 250,000 documents in 14 seconds, yet adding a single document takes 3 seconds. What am I doing wrong? How can I improve the situation?

+8
java performance indexing lucene
2 answers

What you are doing wrong is assuming that Lucene's built-in transactional capabilities have performance and guarantees comparable to those of a typical relational database, when they really don't. More specifically, in your case a commit syncs all index files with the disk, making commit time proportional to the size of the index. That is why your indexWriter.commit() takes more and more time. The Javadoc for IndexWriter.commit() even warns that:

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

Can you imagine the documentation of a relational database telling you to avoid committing?

Since your main goal is to make database updates visible promptly through Lucene searches, you can improve the situation as follows:

  • Call indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) after the database commit has succeeded, instead of before
  • Call indexWriter.commit() periodically instead of on every transaction, just to make sure your changes are eventually written to disk
  • Use a SearcherManager for searching and call maybeRefresh() periodically to see updated documents within a reasonable time

Below is an example program which demonstrates how document updates can be made visible by periodically performing commit() and maybeRefresh(). It builds an index of 100,000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update one document, and then repeatedly searches until the update is visible. All resources are properly cleaned up when the program exits. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is performed, not commit().

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LucenePeriodicCommitRefreshExample {
    ScheduledExecutorService scheduledExecutor;
    MyIndexer indexer;
    MySearcher searcher;

    void init() throws IOException {
        scheduledExecutor = Executors.newScheduledThreadPool(3);
        indexer = new MyIndexer();
        indexer.init();
        searcher = new MySearcher(indexer.indexWriter);
        searcher.init();
    }

    void destroy() throws IOException {
        searcher.destroy();
        indexer.destroy();
        scheduledExecutor.shutdown();
    }

    class MyIndexer {
        IndexWriter indexWriter;
        Future<?> commitFuture;

        void init() throws IOException {
            indexWriter = new IndexWriter(
                    FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")),
                    new IndexWriterConfig(new StandardAnalyzer()));
            indexWriter.deleteAll();
            for (int i = 1; i <= 100000; i++) {
                add(String.valueOf(i), "whatever " + i);
            }
            indexWriter.commit();
            // commit periodically (every 5 minutes) instead of on every update
            commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    indexWriter.commit();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void add(String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }

        void update(String id, String text) throws IOException {
            indexWriter.deleteDocuments(new Term("id", id));
            add(id, text);
        }

        void destroy() throws IOException {
            commitFuture.cancel(false);
            indexWriter.close();
        }
    }

    class MySearcher {
        IndexWriter indexWriter;
        SearcherManager searcherManager;
        Future<?> maybeRefreshFuture;

        public MySearcher(IndexWriter indexWriter) {
            this.indexWriter = indexWriter;
        }

        void init() throws IOException {
            searcherManager = new SearcherManager(indexWriter, true, null);
            // refresh the searcher every 5 seconds so recent updates become visible
            maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    searcherManager.maybeRefresh();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        String findText(String id) throws IOException {
            IndexSearcher searcher = null;
            try {
                searcher = searcherManager.acquire();
                TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
                return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
            } finally {
                if (searcher != null) {
                    searcherManager.release(searcher);
                }
            }
        }

        void destroy() throws IOException {
            maybeRefreshFuture.cancel(false);
            searcherManager.close();
        }
    }

    public static void main(String[] args) throws IOException {
        LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
        example.init();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    example.destroy();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("Enter a document id to update (from 1 to 100000): ");
            String id = scanner.nextLine();
            System.out.print("Enter what you want the document text to be: ");
            String text = scanner.nextLine();
            example.indexer.update(id, text);
            long startTime = System.nanoTime();
            String foundText;
            do {
                foundText = example.searcher.findText(id);
            } while (!text.equals(foundText));
            long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
            System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n",
                    elapsedTimeMillis, id, text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}

This example has been successfully tested using Lucene 5.3.1 and JDK 1.8.0_66.

+11

My first suggestion: do not commit so often. When you delete and re-add a document, you probably trigger a merge, and merges are somewhat slow.

If you use a near-real-time IndexReader, you can still search as before (it does not show the deleted documents), but you do not pay the commit penalty. You can always commit later to make sure the file system is in sync with your index, and you can do that while the index is still in use, so you do not need to block all other operations. See the sketch below.
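A minimal sketch of that near-real-time approach, assuming Lucene 5.x (the same version the other answer tested against); the class and method names are illustrative:

import org.apache.lucene.index.*;
import org.apache.lucene.search.IndexSearcher;

// Sketch of near-real-time search: open a reader directly from the
// IndexWriter so uncommitted changes are searchable, and reopen it cheaply
// when the index has changed. commit() can happen later, on its own
// schedule, without blocking searches.
class NrtSearchExample {
    DirectoryReader reader;

    void init(IndexWriter writer) throws Exception {
        reader = DirectoryReader.open(writer, true);   // true = apply deletes
    }

    IndexSearcher searcher(IndexWriter writer) throws Exception {
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
        if (newReader != null) {                       // null means nothing changed
            reader.close();
            reader = newReader;
        }
        return new IndexSearcher(reader);
    }
}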

See also this interesting blog post (and read the other posts there too; they provide excellent information).

+3
