Text indexing 100,000 PDF files containing 15 million pages

I have an interesting problem and I'm looking for the right solution. We have about 100,000 PDF documents of various sizes, averaging 150 pages each. They are currently stored on a RAID6 server and are also backed up off site. The PDF files alone total 6.5 TB.

We are currently converting the PDF files to text files and saving them in a similar folder structure on the server. They will then need to be indexed and made searchable, with links back to the source folder. The text files use the same name as the PDF, plus an additional naming convention. If my estimates are correct, that comes to about 4 billion words that need to be indexed.

What would be the appropriate solution to index these files?

+4
4 answers

I would look at Solr. We are currently evaluating it as a full-text search engine for documents. It is widely used and well maintained.

+1

That works out to roughly 400 KB per page, if I did my math correctly. That is a large page size.
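The arithmetic behind that figure, as a quick sketch. It uses the numbers from the question and assumes decimal units (1 TB = 10**12 bytes):

```python
# Figures taken from the question; decimal terabytes are an assumption.
total_bytes = 6.5 * 10**12
num_files = 100_000
pages_per_file = 150  # stated average

total_pages = num_files * pages_per_file    # 15,000,000 pages
bytes_per_page = total_bytes / total_pages  # about 433,000 bytes

print(f"{total_pages:,} pages, about {bytes_per_page / 1000:.0f} KB per page")
# prints: 15,000,000 pages, about 433 KB per page
```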

What is the index for?

If you want proximity and phrase search, you need to index every word, and you need a product such as Solr. Through Tika, I think you can index the PDFs directly.
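To illustrate why phrase and proximity search need every word indexed, here is a minimal sketch of a positional index in Python. It is not how Solr is implemented; the function names and sample documents are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} so word order is kept."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents containing the exact token sequence."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return set()
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The i-th term must sit exactly i positions after the first.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def near_search(index, a, b, max_distance):
    """Return ids of documents where a and b occur within max_distance words."""
    a, b = a.lower(), b.lower()
    if a not in index or b not in index:
        return set()
    hits = set()
    for doc_id in set(index[a]) & set(index[b]):
        if any(abs(pa - pb) <= max_distance
               for pa in index[a][doc_id] for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents keyed by file name.
docs = {
    "a.txt": "The quick brown fox jumps",
    "b.txt": "A brown and quick fox",
}
index = build_positional_index(docs)
print(phrase_search(index, "quick brown"))    # {'a.txt'}
print(near_search(index, "brown", "fox", 1))  # {'a.txt'}
```

The cost is one position entry per word occurrence, which is why this kind of index grows with the total word count rather than the vocabulary size.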

Another option is SQL full-text search. But you would need to build a front-end application for it, whereas Solr is both an application and an engine.

Do you need to index every word occurrence or just the unique words? If you only need basic search, English has only about 200,000 unique words. If you stem them, for example with a Porter stemmer, that number goes down. Then throw away stop words like "the". On top of that you have proper names, email addresses, and other words not in the dictionary. I index documents manually, and even a very large collection ends up with about 300,000 unique words (if they are real words; OCR errors will blow up that number). If a document has 2,000 unique words, the cross-index for 100,000 documents is only 200,000,000 entries. You can parse the words using regex. I know this sounds ugly, but I do it manually in SQL and .NET. There is no proximity or phrase search, but it has a small footprint and is fast. (SQL Azure does not have full-text search.)
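A rough sketch of that unique-word approach, written in Python rather than the SQL and .NET I actually use. The stop-word list, function names, and sample documents are illustrative assumptions, and stemming is left out to keep it short.

```python
import re
from collections import defaultdict

# Illustrative stop-word list, not a complete one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def unique_terms(text):
    """Return the distinct non-stop-word tokens found by a regex."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS

def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in unique_terms(text):
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Return documents containing every non-stop-word query term."""
    terms = unique_terms(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

# Hypothetical sample documents keyed by source file name.
docs = {
    "a.pdf": "The cat sat on the mat",
    "b.pdf": "The dog sat",
}
index = build_index(docs)

# One entry per (term, document) pair: this is the cross-index size.
cross_index_entries = sum(len(doc_ids) for doc_ids in index.values())
print(cross_index_entries)               # 6
print(search_all(index, "cat sat"))      # {'a.pdf'}
```

Because each document contributes one entry per unique word, not per occurrence, the index stays small, but word positions are lost, so phrase and proximity queries are not possible.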

+1

Check out the Google Search Appliance. Why reinvent the wheel?

0

Unless there is a reason to use an SQL database for this, I would consider a dedicated search engine.

Most full-text search engines can read PDF files without having to convert them to text files. I have used dtSearch in the past.

0
