How to maintain Lucene indices in an Azure cloud application

I just started playing with the Azure Library for Lucene.NET ( http://code.msdn.microsoft.com/AzureDirectory ). Until now I have used my own code to store Lucene indices in an Azure blob: I copied the blob into the local storage of the Azure web/worker role and read/wrote documents to the index there, with my own locking mechanism to make sure there were no conflicts between reads and writes to the blob. I hope the Azure Library takes care of these issues for me.
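
For reference, here is a minimal sketch of how the library is typically wired up so that it handles the blob sync and the write lock itself. The catalog name, connection string, Lucene.NET version and the exact AzureDirectory constructor signature below are assumptions, so check them against the samples that ship with the library:

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Lucene.Net.Store.Azure;          // AzureDirectory; the namespace varies by library version
    using Microsoft.WindowsAzure.Storage;  // or Microsoft.WindowsAzure with the old StorageClient SDK

    // AzureDirectory keeps the master copy of the index in blob storage, syncs it
    // to a local cache directory, and takes a blob-based write lock, so no
    // hand-rolled blob copying or locking should be needed.
    var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true"); // or your real connection string
    var azureDirectory = new AzureDirectory(account, "lucene-catalog", new RAMDirectory());

    var writer = new IndexWriter(azureDirectory,
        new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
        true /* create a new index */, IndexWriter.MaxFieldLength.UNLIMITED);

    var doc = new Document();
    doc.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("body", "hello azure", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);

    writer.Commit();  // pushes the changed segment files back up to blob storage
    writer.Close();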

However, while testing my sample application I changed the code to use the compound file option, and now every time I write to the index a new file is created. My question is about maintaining the index: I want to keep a snapshot of the index files and use it if the main index gets corrupted. How should I do that? Should I keep a backup of every .cfs file that gets created, or is it enough to keep only the latest one? Also, are there any API calls to clean up the blob so that only the latest files are kept after each write to the index?
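
For reference, the compound file setting mentioned above is a standard IndexWriter option in Lucene.NET; switching it on looks roughly like this (Lucene.NET 2.x method names):

    // Pack each segment into a single .cfs file instead of several per-segment files.
    writer.SetUseCompoundFile(true);

    // Optionally merge everything down to one segment, so the live index is a
    // single .cfs plus the segments_N file that references it.
    writer.Optimize();
    writer.Commit();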

Thanks Kapil

+4
2 answers

Since I first answered this, we eventually changed our search infrastructure and used Windows Azure Drive. We had a worker role that mounted a VHD stored in blob storage and hosted the Lucene.NET index on it. The code checked that the VHD was mounted and that the index directory existed. If the worker role died, the VHD was automatically unmounted after 60 seconds and a second worker role could pick it up.
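
A rough sketch of that mount-and-check step using the Azure SDK 1.x CloudDrive API (the cache resource name, blob path and drive size are illustrative assumptions, and the exact signatures should be checked against the SDK documentation):

    using System.IO;
    using Microsoft.WindowsAzure;                 // CloudStorageAccount (old SDK)
    using Microsoft.WindowsAzure.ServiceRuntime;  // RoleEnvironment
    using Microsoft.WindowsAzure.StorageClient;   // CloudDrive lives in the CloudDrive assembly

    // Local cache used by Azure Drive; "DriveCache" must be declared as a
    // LocalStorage resource in the service definition.
    LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
    CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

    var account = CloudStorageAccount.Parse(
        RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"));

    // The page blob backing the VHD; the "drives" container must already exist.
    CloudDrive drive = account.CreateCloudDrive(account.BlobEndpoint + "drives/lucene-index.vhd");
    try { drive.Create(1024); }                   // size in MB; throws if the VHD already exists
    catch (CloudDriveException) { /* created on a previous run */ }

    string driveLetter = drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);

    // Make sure the index directory exists before handing the path to Lucene.NET.
    string indexPath = Path.Combine(driveLetter, "index");
    if (!Directory.Exists(indexPath))
        Directory.CreateDirectory(indexPath);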

Since then we have changed our infrastructure again and moved to Amazon, with a Solr instance for search, but the VHD option worked well during development. It could have worked well in test and production too, but our requirements meant we needed to move to EC2.

+2

I'm using AzureDirectory for full-text indexing on Azure, and I've been getting some odd results... but hopefully this answer will be of some use to you...

Firstly, the compound file option: from what I have read and figured out, the compound file is one large file with all the index data inside it. The alternative is a large number of smaller files (configured with the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with that is that once you get to a large number of files (I stupidly set it to about 5000), it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box... well, it has been running for 20 minutes and still hasn't finished...).
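
As an illustration, these are the IndexWriter knobs involved (Lucene.NET 2.x method names; the values are examples, not recommendations):

    // Pack each segment into a single .cfs instead of several separate files.
    writer.SetUseCompoundFile(true);

    // Cap on documents per merged segment: a small value (like the 5000 above)
    // produces lots of small segments; leaving it at the default avoids that.
    writer.SetMaxMergeDocs(int.MaxValue);

    // How many segments accumulate before being merged; a lower value keeps the
    // file count down at the cost of more frequent merge work.
    writer.SetMergeFactor(10);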

As for backing up the indexes, I haven't come across this yet, but given that we currently have about 5 million records and that number will grow, I'm interested in it too. If you're using a single compound file, maybe pull the index files down to a worker role, zip them and upload them with today's date. If you have a smaller set of documents you might get away with just re-indexing the data if something goes wrong... but again, that depends on the numbers...
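
A rough sketch of that zip-and-upload idea (the container name, local paths and storage SDK namespaces are assumptions, not anything AzureDirectory itself provides):

    using System;
    using System.IO;
    using System.IO.Compression;               // ZipFile, from System.IO.Compression.FileSystem
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    // Zip the worker role's local copy of the index, stamped with today's date.
    string localIndexPath = @"C:\Resources\index";   // wherever the local index copy lives
    string zipPath = Path.Combine(Path.GetTempPath(),
        "lucene-index-" + DateTime.UtcNow.ToString("yyyy-MM-dd") + ".zip");
    ZipFile.CreateFromDirectory(localIndexPath, zipPath);

    // Upload the archive to a dedicated backup container.
    var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true"); // or your real connection string
    var container = account.CreateCloudBlobClient().GetContainerReference("index-backups");
    container.CreateIfNotExists();

    var blob = container.GetBlockBlobReference(Path.GetFileName(zipPath));
    using (var stream = File.OpenRead(zipPath))
    {
        blob.UploadFromStream(stream);
    }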

0
