How can I index many txt files? (Java / C / C++)

I need to index a lot of text. The search results should give me the names of the files that contain the query, and all the positions where the query matches in each file, so that I do not need to load the entire file to find the matching part. What libraries can you recommend for this?

Update: Lucene has been suggested. Can you give me some information on how to use Lucene to achieve this? (The examples I have seen had the search query return only the matching files.)

+4
8 answers

I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store the position information of each word in order to get the fragments you are looking for. The Token API can help.

+2

For Java, try Lucene.

+8

It all depends on how you are going to access it and, of course, how many users are going to access it. Read up on MapReduce.

If you are going to roll your own, you will need to create an index file, which is a mapping from each unique word to a list of (file, line, offset) tuples. Of course, you can also consider other in-memory data structures, such as a trie (prefix tree), a Judy array and the like ...
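
As a rough illustration of that word to (file, line, offset) mapping, here is a minimal in-memory sketch in plain Java. The class and method names (`PositionalIndex`, `Posting`, `addFile`, `lookup`) are made up for this example, and it assumes that a "word" is a run of letters or digits and that matching is case-insensitive. A real index would also have to be written to disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical names throughout: an illustration, not a library API.
class PositionalIndex {
    // One occurrence of a word: file name, 1-based line number,
    // and 0-based character offset within that line.
    static final class Posting {
        final String file;
        final int line;
        final int offset;

        Posting(String file, int line, int offset) {
            this.file = file;
            this.line = line;
            this.offset = offset;
        }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Index the lines of one file. Words are runs of letters or digits,
    // lower-cased (an assumption of this sketch).
    void addFile(String file, List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            String text = lines.get(i);
            int start = -1; // start of the current word, or -1 if not inside a word
            for (int j = 0; j <= text.length(); j++) {
                boolean inWord = j < text.length()
                        && Character.isLetterOrDigit(text.charAt(j));
                if (inWord && start < 0) {
                    start = j;
                } else if (!inWord && start >= 0) {
                    String word = text.substring(start, j).toLowerCase(Locale.ROOT);
                    index.computeIfAbsent(word, k -> new ArrayList<>())
                         .add(new Posting(file, i + 1, start));
                    start = -1;
                }
            }
        }
    }

    // All recorded occurrences of a word, in the order they were indexed.
    List<Posting> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        PositionalIndex idx = new PositionalIndex();
        idx.addFile("a.txt", List.of("the quick fox", "a lazy fox"));
        idx.addFile("b.txt", List.of("Fox news"));
        for (Posting p : idx.lookup("fox")) {
            System.out.println(p.file + ":" + p.line + ":" + p.offset);
        }
    }
}
```

With real files, the lines could come from `java.nio.file.Files.readAllLines`. Because each posting carries the line and offset, a result can point straight at the matching spot without rescanning the file.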

Some third-party solutions are listed here .

+2

Lucene - Java

It is open source, so you are free to use it and deploy it in your application.

As far as I know, the Eclipse IDE help system is powered by Lucene, so it has been tested by millions of users.

+2

Take a look at http://www.compass-project.org/. Compass can be thought of as a wrapper on top of Lucene. It simplifies common Lucene usage patterns such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding. Compass also has built-in optimizations for concurrent commits and merges.

The overview gives more information: http://www.compass-project.org/overview.html

I integrated this into a Spring project right away. It is really easy to use and gives your users results that look like Google's.

+2

Also see the Lemur Toolkit .

+2

Why not try building a state machine as you read all the files? Transitions between states would be letters, and each state would be either final (some files contain the word read so far, in which case the list of those files is stored there) or intermediate.

To search for multiple words, you would have to look each one up independently and then intersect the results.

I find the Boost.Statechart library useful.
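
To make the idea concrete, here is a minimal sketch of such a letter-by-letter state machine written as a plain trie in Java. It does not use Boost.Statechart, and all names (`WordTrie`, `add`, `filesContaining`, `filesContainingAll`) are hypothetical. A state counts as final when its file set is non-empty, and a multi-word query intersects the per-word results.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a trie whose final states carry the files containing the word.
class WordTrie {
    private static final class Node {
        // Transitions: one outgoing edge per letter.
        final Map<Character, Node> next = new HashMap<>();
        // Files containing the word spelled by the path to this node.
        // An empty set means the state is intermediate.
        final Set<String> files = new TreeSet<>();
    }

    private final Node root = new Node();

    // Record that the given file contains the given word.
    void add(String word, String file) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.files.add(file);
    }

    // Follow the transitions for the word; return a copy of the file set
    // at the state reached, or an empty set if a transition is missing.
    Set<String> filesContaining(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.get(c);
            if (node == null) {
                return new TreeSet<>();
            }
        }
        return new TreeSet<>(node.files);
    }

    // Look up each word independently, then intersect the results.
    Set<String> filesContainingAll(String... words) {
        Set<String> result = null;
        for (String word : words) {
            Set<String> hits = filesContaining(word);
            if (result == null) {
                result = hits;
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Storing positions instead of bare file names at the final states would also answer the "where in the file" part of the question, at the cost of more memory per word.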

0

I know that you asked for a library; I just wanted to point you to the basic concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).
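
In its simplest form, an inverted index maps each term to a sorted list of document IDs, and a two-term AND query is answered by walking both lists in step. The sketch below shows that merge in Java. The class name `PostingsMerge` and the sample IDs are only illustrative, and it assumes both input lists are sorted in ascending order without duplicates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: intersects two sorted lists of document IDs.
class PostingsMerge {
    // Both inputs must be sorted ascending; runs in O(a.size() + b.size()).
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                out.add(x); // document contains both terms
                i++;
                j++;
            } else if (x < y) {
                i++; // advance the list with the smaller ID
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> first = List.of(1, 2, 4, 11, 31, 45, 173, 174);
        List<Integer> second = List.of(2, 31, 54, 101);
        System.out.println(intersect(first, second)); // prints [2, 31]
    }
}
```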

0
