Index a MySQL database with Apache Lucene and keep them in sync

  • When a new item is added to MySQL, it must also be indexed by Lucene.
  • When an existing item is deleted from MySQL, it must also be removed from the Lucene index.

The idea is to write a script that is called every x minutes by a scheduler (for example, a cron job) to keep MySQL and Lucene synchronized. What I have managed so far:

  • For each newly added item in MySQL, Lucene also indexes it.
  • For every item already added in MySQL, Lucene does not reindex it (there are no duplicated items).

This is what I need help with:

  • For every previously added item that has been removed from MySQL, Lucene must also remove it from its index.

Here is the code I use to index the MySQL table tag (id [PK] | name):

public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "root", "");

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), config);

    String query = "SELECT id, name FROM tag";
    Statement statement = connection.createStatement();
    ResultSet result = statement.executeQuery(query);

    while (result.next()) {
        Document document = new Document();
        document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
        // updateDocument replaces any existing document with the same id,
        // so rows that are already indexed are not duplicated.
        writer.updateDocument(new Term("id", result.getString("id")), document);
    }

    writer.close();
}

PS: this code is for testing purposes only, no need to tell me how awful it is :)

EDIT:

One solution may be to delete all previously added documents and reindex the entire database:

writer.deleteAll(); // wipe the whole index, then rebuild it from MySQL
while (result.next()) {
    Document document = new Document();
    document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(document);
}

I'm not sure this is the most efficient solution, though. Is it?
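For comparison, here is a rough sketch of an incremental alternative (untested; it assumes the same tag (id | name) schema, Lucene 3.6 setup, and INDEX_DIR constant as the code above): add or update every row still present in MySQL while remembering its id, then delete any indexed document whose id is no longer in that set.

import java.io.File;
import java.sql.*;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IncrementalSync {

    // Same index location as in the question (placeholder path).
    private static final File INDEX_DIR = new File("lucene-index");

    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver").newInstance();
        Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "root", "");

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), config);

        // 1) Add or update every row that is still in MySQL, remembering its id.
        Set<String> liveIds = new HashSet<String>();
        Statement statement = connection.createStatement();
        ResultSet result = statement.executeQuery("SELECT id, name FROM tag");
        while (result.next()) {
            String id = result.getString("id");
            liveIds.add(id);
            Document document = new Document();
            document.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
            writer.updateDocument(new Term("id", id), document);
        }
        writer.commit();

        // 2) Delete indexed documents whose id no longer exists in MySQL.
        IndexReader reader = IndexReader.open(FSDirectory.open(INDEX_DIR));
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) {
                    continue;
                }
                String indexedId = reader.document(i).get("id");
                if (indexedId != null && !liveIds.contains(indexedId)) {
                    writer.deleteDocuments(new Term("id", indexedId));
                }
            }
        } finally {
            reader.close();
        }

        writer.close();
        connection.close();
    }
}

For large indexes, walking the term enumeration of the id field would be cheaper than loading every stored document, but the idea is the same: only stale entries get deleted instead of rebuilding the whole index.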

java synchronization mysql indexing lucene
2 answers

If you let the indexing / reindexing run separately from your application, you will have synchronization problems. Depending on your field of work this may not be an issue, but for many concurrent-user applications it is.

We had the same problem with a job system that performed asynchronous indexing every few minutes. A user would find a product through the search engine, and even after an administrator had removed the product from the actual product stock, it still showed up in the frontend until the next reindexing job ran. This leads to very confusing and rarely reproducible bugs being reported to first-level support.

We saw two possibilities: either couple the business logic directly to the updates of the search index, or implement a tighter asynchronous update task. We did the latter.

In the background, a class running in a dedicated thread inside the Tomcat application receives update requests and processes them asynchronously. The delay before a back-office update becomes visible in the frontend dropped to 0.5-2 seconds, which greatly reduced the issues reported to first-level support. And it is about as loosely coupled as it can be; we could even plug in a different indexing engine.
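A minimal sketch of that kind of in-process update worker, assuming a Lucene IndexWriter keyed on an id field as in the question (all class and method names here are illustrative, not taken from the system described above):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexUpdateWorker implements Runnable {

    /** A single pending change: either a document to (re)index or an id to delete. */
    public static final class Update {
        final String id;
        final Document document; // null means "delete this id"

        public Update(String id, Document document) {
            this.id = id;
            this.document = document;
        }
    }

    private final BlockingQueue<Update> queue = new LinkedBlockingQueue<Update>();
    private final IndexWriter writer;

    public IndexUpdateWorker(IndexWriter writer) {
        this.writer = writer;
    }

    /** Called from the business logic right after the MySQL change is committed. */
    public void submit(Update update) {
        queue.add(update);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Update update = queue.take(); // blocks until an update arrives
                if (update.document == null) {
                    writer.deleteDocuments(new Term("id", update.id));
                } else {
                    writer.updateDocument(new Term("id", update.id), update.document);
                }
                writer.commit(); // make the change visible to reopened readers
            }
        } catch (InterruptedException stop) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            e.printStackTrace(); // real code would log and possibly retry
        }
    }
}

The worker thread would be started once at application startup, and the business logic would call submit(...) right after committing its MySQL change, so searches reflect the change within seconds instead of waiting for the next cron run.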


Take a look at the Solr DataImportScheduler approach.
Basically, when the web application starts up, it spawns a separate Timer thread that periodically fires an HTTP POST at Solr, which then uses the DataImportHandler you configured to pull data from the RDB (and other data sources).

Since you are not using Solr, only Lucene, you should take a look at the DataImportHandler source code for ideas.
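With plain Lucene, the core of that idea is simply a scheduled task inside the web application instead of an external cron job plus HTTP call. A minimal sketch (the ReindexScheduler class and the resyncTask Runnable are hypothetical names; the Runnable would wrap the MySQL-to-Lucene sync code from the question):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReindexScheduler {

    /** Starts the periodic resync and returns the scheduler so the app can shut it down later. */
    public static ScheduledExecutorService start(final Runnable resyncTask, long periodMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(new Runnable() {
            @Override
            public void run() {
                try {
                    resyncTask.run(); // e.g. the MySQL -> Lucene sync from the question
                } catch (RuntimeException e) {
                    e.printStackTrace(); // swallow failures so the schedule keeps running
                }
            }
        }, 0, periodMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}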

