How can I index many txt files? (Java / C / C ++)

Question

I need to index a lot of text. The search results should give me the names of the files that contain the query, and all the positions where the query matches in each file, so that I do not have to load the entire file to find the matching part. What libraries can you recommend for this?

Update: Lucene has been suggested. Can you give me some information on how to use Lucene to achieve this? (I have only seen examples in which the search query returned just the matching files.)

+4

java c++ c full-text-search

George Feb 23 '09 at 13:29

8 answers

For Java, try Lucene.

+8

Jared Feb 23 '09 at 13:37

It all depends on how you are going to access it and, of course, on how many users are going to access it. Read up on MapReduce.

If you are going to roll your own, you will need to create an index file, which is a mapping from each unique word to a list of (file, line, offset) tuples. Of course, you can also consider other in-memory data structures, such as a trie (prefix tree), a Judy array and the like.

Some third-party solutions are listed here.
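A minimal Java sketch of such a roll-your-own index, under a few assumptions: words are runs of letters and digits, matching is case-insensitive, and everything is kept in memory rather than in an index file. The names `InvertedIndex`, `Posting`, `addDocument` and `search` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory index: word -> list of (file, line, offset). */
class InvertedIndex {

    /** One occurrence of a word. Lines are 1-based, offsets are 0-based within the line. */
    static final class Posting {
        final String file;
        final int line;
        final int offset;

        Posting(String file, int line, int offset) {
            this.file = file;
            this.line = line;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return file + ":" + line + ":" + offset;
        }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    /** Tokenizes the text line by line and records where every word occurs. */
    void addDocument(String file, String text) {
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            int pos = 0;
            while (pos < line.length()) {
                // Skip separators until the next word starts.
                while (pos < line.length() && !Character.isLetterOrDigit(line.charAt(pos))) {
                    pos++;
                }
                int start = pos;
                // Consume the word itself.
                while (pos < line.length() && Character.isLetterOrDigit(line.charAt(pos))) {
                    pos++;
                }
                if (pos > start) {
                    String word = line.substring(start, pos).toLowerCase(Locale.ROOT);
                    index.computeIfAbsent(word, k -> new ArrayList<>())
                         .add(new Posting(file, i + 1, start));
                }
            }
        }
    }

    /** Returns every recorded occurrence of the word, or an empty list. */
    List<Posting> search(String word) {
        List<Posting> hits = index.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Posting>emptyList() : hits;
    }
}
```

Each hit carries the file name together with the line and offset, so a caller can jump straight to the match instead of scanning the whole file. A real index would have to be persisted to disk and would likely need a smarter tokenizer.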

+2

dirkgently Feb 23 '09 at 13:37

Lucene - Java

It is open source, so you are free to use it and deploy it in your application.

As far as I know, the Eclipse IDE help system is powered by Lucene, so it has been tested by millions of users.

+2

Sung Feb 23 '09 at 13:37

Take a look at http://www.compass-project.org/. It can be thought of as a wrapper on top of Lucene: Compass simplifies common Lucene usage patterns such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding. Compass also uses built-in optimizations for concurrent commits and merges.

The overview can give you more information: http://www.compass-project.org/overview.html

I integrated it into a Spring project in no time. It is really easy to use and gives your users what they will see as Google-like results.

+2

Paul whelan Feb 23 '09 at 14:09

Also see the Lemur Toolkit.

+2

Nemanja trifunovic Feb 23 '09 at 15:44

Why don't you try building a state machine by reading all the files? Transitions between states are letters, and each state is either final (some files contain the word read so far, in which case the list of those files is available there) or intermediate.

As for multi-word searches, you will have to look up each word independently and then intersect the results.

I find the Boost.Statechart library useful.
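A minimal Java sketch of this idea, written as a plain trie rather than with Boost.Statechart: each node is a state, each child edge is a letter transition, and a node with a non-empty file set is a final state. The names `WordTrie`, `add`, `filesContaining` and `filesContainingAll` are hypothetical, and case-insensitive matching is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical letter-by-letter state machine mapping words to the files that contain them. */
class WordTrie {

    private static final class Node {
        // Transitions: one outgoing edge per letter.
        final Map<Character, Node> next = new HashMap<>();
        // Non-empty only in final states: files containing the word spelled by the path so far.
        final Set<String> files = new TreeSet<>();
    }

    private final Node root = new Node();

    /** Records that the given file contains the given word. */
    void add(String word, String file) {
        Node node = root;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.files.add(file);
    }

    /** Follows the transitions for one word; intermediate or missing states yield an empty set. */
    Set<String> filesContaining(String word) {
        Node node = root;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.next.get(c);
            if (node == null) {
                return Collections.emptySet();
            }
        }
        return Collections.unmodifiableSet(node.files);
    }

    /** Looks up each word independently, then intersects the per-word results. */
    Set<String> filesContainingAll(String... words) {
        Set<String> result = null;
        for (String word : words) {
            Set<String> hits = filesContaining(word);
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }
}
```

This version only answers which files contain a word. To report positions as well, each final state could hold (file, offset) pairs instead of bare file names.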

0

Benoît Feb 23 '09 at 13:45

I know you asked for a library; I just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).

0

Fabian steeg Feb 23 '09 at 15:45

Yuval F · Accepted Answer · 2009-02-23T14:11:27+0000

I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store the position information of each word in order to get the fragments you are looking for. The Token API can help.
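To illustrate why stored positions matter, here is a plain-Java sketch that is not Lucene's highlighter API. It assumes the index already returned a byte offset and length for a match, and that the file uses a single-byte encoding such as ASCII. The class `SnippetReader` and its `snippet` method are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: fetches the text around a known match without reading the whole file. */
class SnippetReader {

    /**
     * Returns the match at [offset, offset + length) plus up to `context` bytes on each side.
     * Assumes byte offsets and a single-byte encoding; multi-byte encodings need extra care.
     */
    static String snippet(String path, long offset, int length, int context) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long size = file.length();
            // Clamp the window to the file boundaries.
            long start = Math.max(0, Math.min(offset - context, size));
            long end = Math.max(start, Math.min(size, offset + length + context));
            byte[] buffer = new byte[(int) (end - start)];
            file.seek(start);       // jump directly to the window
            file.readFully(buffer); // read only the window
            return new String(buffer, StandardCharsets.US_ASCII);
        }
    }
}
```

For a file containing "the quick brown fox jumps over the lazy dog", calling `snippet(path, 16, 3, 6)` returns "brown fox jumps". Only those 15 bytes are read from disk.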
