Quick text search through log files

Here's the problem I have: I have a set of log files that can grow pretty quickly. They are rolled over into separate files each day, and the files can easily grow to a gigantic size. To keep the size down, entries older than 30 days are deleted.

The problem is when I want to search these files for a specific string. A Boyer-Moore search is unreasonably slow right now. I know that applications like dtSearch can provide fast searches using indexing, but I'm not sure how to implement that without taking up twice as much space as the logs already do.

Are there any resources I could look at that might help? I'm really looking for a standard algorithm that explains what I need to do to build an index and use it for searching.

Edit:
Grep won't work, as this search needs to be integrated into a cross-platform application. There's no way I can include it, or any other external program, with it.

The way this works is that there is a web interface that includes a log browser, backed by a custom C++ web server. That server needs to be able to search the logs in a reasonable amount of time. Searching through multiple gigs of logs is currently very time-consuming.

Edit 2: Some of these suggestions are great, but I have to reiterate that I can't integrate another application; that's part of the contract. But to answer some of the questions: the data in the logs is a mix of messages received in a particular health-care messaging format and messages relating to them. I'm looking to rely on an index because, while rebuilding the index may take up to a minute, the search currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data is discarded before it is ever recorded: unless certain debug logging options are enabled, more than half of the log messages are ignored.

The search basically works like this: the user is presented, on a web form, with a list of the most recent messages (streamed from disk as they scroll through them, yay for ajax). Usually they'll be searching for a message with some piece of information in it, perhaps a patient ID or some string they sent, so they enter that string into the search box. The search is sent asynchronously, and the web server linearly scans through the logs 1 MB at a time for matches. This process can take a very long time once the logs get large, and that's what I'm trying to optimize.
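For reference, the scan loop is roughly like the sketch below (simplified from memory; `scan_log` and the collected-offsets return are placeholders, since the real server streams results back as it finds them). The carried-over tail keeps a match that straddles a 1 MB boundary from being missed:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Linearly scan one log file in ~1 MB chunks, collecting the byte offset of
// every occurrence of `needle`. A (needle.size() - 1)-byte tail is carried
// over between chunks so a match spanning a chunk boundary is still found.
std::vector<std::size_t> scan_log(const std::string& path,
                                  const std::string& needle,
                                  std::size_t chunk_size = 1 << 20) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;

    std::ifstream in(path, std::ios::binary);
    std::string buf;                        // carried-over tail + new chunk
    std::size_t base = 0;                   // file offset of buf[0]
    const std::size_t overlap = needle.size() - 1;

    while (in) {
        std::string chunk(chunk_size, '\0');
        in.read(&chunk[0], static_cast<std::streamsize>(chunk_size));
        chunk.resize(static_cast<std::size_t>(in.gcount()));
        if (chunk.empty()) break;
        buf += chunk;

        for (std::size_t pos = buf.find(needle); pos != std::string::npos;
             pos = buf.find(needle, pos + 1))
            hits.push_back(base + pos);

        if (buf.size() > overlap) {         // keep only the tail for next round
            base += buf.size() - overlap;
            buf.erase(0, buf.size() - overlap);
        }
    }
    return hits;
}
```

Every query re-reads every file from the front, which is why it degrades as the logs grow.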

+8
algorithm search full-text-search scalability
Oct 02 '08 at 18:16
6 answers

Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of those algorithms at one point, and some of them are quite sophisticated.

If you can identify the "words" in the text you want to index, just build a big hash table of the words that maps each word's hash to its occurrences in each file. If users repeat the same searches frequently, cache the search results. When a search runs, you can then check each location to confirm that the search term actually occurs there, rather than just a word with a matching hash. A sketch of this idea is below.
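A minimal sketch, assuming alphanumeric tokens and whole-file text in memory (the names and the tokenizer are mine, not anything prescribed):

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Occurrence {
    int file_id;          // which daily log file
    std::size_t offset;   // byte offset of the word within that file
};

// Hash of a word -> every place a word with that hash appears. Storing only
// the hash keeps the index small; the trade-off is that hits have to be
// verified against the raw text (see candidates() below).
std::unordered_map<std::uint64_t, std::vector<Occurrence>> word_index;

std::uint64_t hash_word(const std::string& w) {
    return std::hash<std::string>{}(w);   // any decent string hash will do
}

// Tokenize one file's contents into alphanumeric "words", recording where
// each one starts. Rerun this over all files when the logs roll over.
void index_file(int file_id, const std::string& text) {
    std::size_t i = 0;
    while (i < text.size()) {
        while (i < text.size() && !std::isalnum((unsigned char)text[i])) ++i;
        std::size_t start = i;
        while (i < text.size() && std::isalnum((unsigned char)text[i])) ++i;
        if (i > start)
            word_index[hash_word(text.substr(start, i - start))]
                .push_back({file_id, start});
    }
}

// Candidate locations for one query word. Different words can share a hash,
// so the caller must re-read each location and confirm the query text really
// is there before showing it as a result.
const std::vector<Occurrence>* candidates(const std::string& word) {
    auto it = word_index.find(hash_word(word));
    return it == word_index.end() ? nullptr : &it->second;
}
```

For a multi-word query, look up the rarest word's candidates first, then verify the full query string around each hit.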

Also, who really cares if the index is larger than the files themselves? If your system really is that big, with that much activity, is a few dozen gigs for an index really doomsday?

+2
Oct 02 '08 at 19:19

grep usually works very well for me with large logs (sometimes 12 GB+). You can find a Windows version here.

+5
Oct 02 '08 at 18:21

Most likely you will want to integrate some kind of indexing search engine into your application. There are dozens out there; Lucene seems to be very popular. Check out these two questions for some hints:

The best text search engine for integrating with a custom web application

How do you implement a search function for a website?

+2
Oct 02 '08 at 18:34

More information on how you're performing the search would definitely help. Why, in particular, do you want to rely on an index, given that you have to rebuild it every day when the logs roll over? What kind of information is in these logs? Can some of it be discarded before it is ever recorded?

How long do these searches take right now?

0
Oct 02 '08 at 18:29

You could check out the source for BSD grep. You may not be able to count on grep being there for you, but nothing says you can't recreate similar functionality yourself, right?
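For instance, in modern C++ (C++17) the standard library already ships a Boyer-Moore-Horspool searcher, which gets you a good part of grep's inner loop without any external dependency. A minimal sketch (the function name is mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Find the offset of every occurrence of `needle` in one in-memory buffer,
// e.g. one of the 1 MB chunks the server already reads.
std::vector<std::size_t> find_all(const std::string& haystack,
                                  const std::string& needle) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;

    // The pattern is preprocessed here, once per call; hoist this out of the
    // loop over chunks/files so the skip table is built once per query.
    std::boyer_moore_horspool_searcher searcher(needle.begin(), needle.end());

    auto it = haystack.begin();
    while (true) {
        it = std::search(it, haystack.end(), searcher);
        if (it == haystack.end()) break;
        hits.push_back(static_cast<std::size_t>(it - haystack.begin()));
        ++it;   // resume just past this match, so overlapping hits count too
    }
    return hits;
}
```

Building the searcher once per query (not once per chunk) and reading the files in big blocks gets you most of the way to grep's behavior.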

0
Oct 02 '08 at 20:08

Splunk is great for searching through lots of logs. It may be overkill for your purpose. You pay based on the amount of data (log volume) you want to process. I'm pretty sure they have an API, so you don't have to use their front end if you don't want to.

-2
Oct 02 '08 at 18:34


