Using Lucene as a Relational Database

Question


I'm wondering whether we can get some RDBMS-like capabilities out of Lucene.

Example:
1) I have 10,000 project documents (PDF files) that need to be indexed with their content to make them searchable.
2) Each document is associated with exactly one project. A project has details such as project name, number, start date, end date, location, type, etc.

I need to search the contents of the PDF files for a keyword, but when displaying the results I want to show the project metadata described in item (2).

My idea is to store a field called projectId with every PDF file when indexing. Once a search hit gives us that projectId, we would run a second search to fetch the project metadata.

This way we could avoid data duplication. In addition, if we want to update the project metadata, we only have to update it in one place. Otherwise, if we store this metadata with every PDF document in the index, we end up having to update all of those documents, which is what I want to avoid.
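The two-step idea above can be sketched roughly as follows. This is only an illustration in plain Python: `ToyIndex` is a hypothetical in-memory stand-in for a Lucene index, and the project names and file names are made up. The point is that each indexed document stores nothing about its project except the id.

```python
from collections import defaultdict

# Toy stand-in for a full-text index (Lucene would fill this role).
# All class, field, and file names here are hypothetical.
class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.stored = {}                  # doc id -> stored fields

    def add(self, doc_id, text, project_id):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        # Only the project id is stored with the document, not the metadata.
        self.stored[doc_id] = {"projectId": project_id}

    def search(self, keyword):
        return sorted(self.postings.get(keyword.lower(), set()))

# Project metadata lives in exactly one place, keyed by project id.
projects = {
    "P1": {"name": "Bridge Repair", "location": "Austin"},
    "P2": {"name": "Harbor Survey", "location": "Oslo"},
}

index = ToyIndex()
index.add("a.pdf", "concrete load test report", "P1")
index.add("b.pdf", "tidal load measurements", "P2")
index.add("c.pdf", "budget summary", "P1")

def search_with_metadata(keyword):
    """Step 1: full-text search. Step 2: look up metadata by project id."""
    results = []
    for doc_id in index.search(keyword):
        project_id = index.stored[doc_id]["projectId"]
        results.append((doc_id, projects[project_id]["name"]))
    return results

hits = search_with_metadata("load")
```

Renaming a project then means changing one entry in `projects`; nothing in the index needs to be touched.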

Please advise.


join search indexing rdbms lucene

KP. May 06 '09 at 9:00


5 answers

Yuval F · Answer 1 · 2009-05-06T12:28:27+0000

If I understand correctly, you have two questions:

1) Is it possible to store the project identifier in Lucene and use it for a follow-up search? Yes, you can. This is common practice.
2) Can I use this project identifier to search Lucene for the project metadata? Yes, you can, but I do not know whether it is a good idea. It depends on how often the metadata is updated and on your access pattern. If the metadata is relatively static and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project identifier as the primary key of a database table, which could be the better option.
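As a rough sketch of the database option, assuming an SQLite table named `project` (the table name, columns, and rows are invented for illustration), the project ids collected from the search hits can be resolved in a single query:

```python
import sqlite3

# Hypothetical schema: the project id is the primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project ("
    " id TEXT PRIMARY KEY, name TEXT, location TEXT, start_date TEXT)"
)
conn.executemany(
    "INSERT INTO project VALUES (?, ?, ?, ?)",
    [
        ("P1", "Bridge Repair", "Austin", "2009-01-15"),
        ("P2", "Harbor Survey", "Oslo", "2009-03-01"),
    ],
)

def load_projects(project_ids):
    """Fetch metadata for the distinct project ids found in search hits."""
    ids = sorted(set(project_ids))
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name, location FROM project WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    return {row[0]: {"name": row[1], "location": row[2]} for row in rows}

# Project ids as they might come back from the stored fields of search hits.
hit_project_ids = ["P1", "P2", "P1"]
metadata = load_projects(hit_project_ids)

# A metadata change is a single-row update; the search index is untouched.
conn.execute("UPDATE project SET name = ? WHERE id = ?", ("Bridge Renewal", "P1"))
updated = load_projects(["P1"])
```

Deduplicating the ids before the query keeps the lookup to one round trip per result page, however many hits share a project.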

bajafresh4life · Answer 2 · 2009-05-08T16:00:00+0000

That sounds good. The only limitation you have (by storing just a reference to the project in Lucene, and not the project data itself) is that you will not be able to query the document text and the project metadata at the same time, for example "documentText: foo OR project_name: bar". If you don't have this requirement, then storing an identifier in Lucene that refers to a database row sounds like what you need to do.
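To make the limitation concrete: with the metadata kept outside the index, a query like the one above has to be evaluated in application code, as two separate lookups whose results are merged. A minimal sketch, using hypothetical in-memory data in place of both Lucene and the database:

```python
# Hypothetical data: document -> (text, project id); project id -> name.
documents = {
    "a.pdf": ("foo analysis", "P1"),
    "b.pdf": ("site photos", "P2"),
    "c.pdf": ("meeting notes", "P3"),
}
project_names = {"P1": "alpha", "P2": "bar", "P3": "gamma"}

def text_matches(term):
    """Stand-in for the full-text query against the index."""
    return {d for d, (text, _) in documents.items() if term in text.split()}

def project_matches(name):
    """Stand-in for the metadata query against the database."""
    return {p for p, n in project_names.items() if n == name}

def docs_in_projects(project_ids):
    """Map matching projects back to their documents via the stored id."""
    return {d for d, (_, p) in documents.items() if p in project_ids}

def text_or_project(term, name):
    # The OR is computed by the application, not by a single index query.
    return sorted(text_matches(term) | docs_in_projects(project_matches(name)))

result = text_or_project("foo", "bar")
```

Note that a merge like this gives up a single relevance ranking across both conditions, which is part of why the combined query is called out as a limitation.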

jeremyalan · Answer 3 · 2009-08-26T03:06:17+0000

It is definitely possible. But always remember that you are using Lucene for something it was not designed for. In general, Lucene is intended for full-text search, not for modeling relational content. So the more complex your system gets and the more relational your content becomes, the more performance will degrade.

In particular, there are a couple of areas to watch closely:

Storing the value of every field in your index will hurt performance. If you are not too concerned about how fast the second query returns, or if your index is relatively small, this may not be a problem.
Also, keep in mind that if you do not use the default ranking algorithm and your custom algorithm needs project information to calculate the score of each document, this will also have a significant impact on search performance.

If you need a more powerful index designed for relational content, there are hierarchical indexing tools worth looking at (one of them, Jackrabbit, is developed by Apache).

As your project continues to grow, you can also check out Solr, also developed by Apache, which provides some advanced features such as faceted search.

pvoosten · Answer 4 · 2009-08-28T20:31:33+0000

You can use Lucene this way.

Pros:

Full-text search is easy to implement, which a plain RDBMS does not offer out of the box.

Cons:

Referential integrity: you get it for free in an RDBMS, but with Lucene you have to implement it yourself.
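As one illustration of what "implement it yourself" can mean, a periodic check could look for indexed documents whose project id no longer exists. This is a sketch with hypothetical names and data, not a Lucene API:

```python
def find_orphans(indexed_docs, known_project_ids):
    """Return ids of indexed documents that point at a missing project.

    indexed_docs: mapping of document id -> stored project id
    known_project_ids: set of project ids that currently exist
    """
    return sorted(
        doc_id
        for doc_id, project_id in indexed_docs.items()
        if project_id not in known_project_ids
    )

# Hypothetical state: project P2 was deleted, but b.pdf still refers to it.
indexed_docs = {"a.pdf": "P1", "b.pdf": "P2", "c.pdf": "P1"}
known_project_ids = {"P1"}

orphans = find_orphans(indexed_docs, known_project_ids)
```

A database foreign key would reject the deletion (or cascade it) automatically; here the application has to decide what to do with each orphan, for example re-index or remove it.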

Hardy · Answer 5 · 2009-06-16T14:33:59+0000

I'm not sure about your overall setup, but maybe Hibernate Search is for you. It lets you combine the benefits of a relational database with the power of a full-text search engine such as Lucene. The metadata can live in the database, possibly along with the original PDF documents, while the Lucene documents contain only the searchable data.
