Quick search to see if a string exists in large files using Delphi

I have a FindFile routine in my program that lists files, but if the "Contains text" field is filled in, it should only list files containing that text.


If the "Contains text" field is filled in, I check each file found for that text. My current approach:

    var
      FileContents: TStringList;
    begin
      FileContents.LoadFromFile(Filepath);
      if Pos(TextToFind, FileContents.Text) = 0 then
        Found := false
      else
        Found := true;

The above code is simple and usually works fine, but it has two problems:

  • It crashes on very large files (e.g. 300 MB).

  • It feels like it could be faster. It isn't bad, but why wait 10 minutes searching 1000 files if there might be an easy way to speed it up a bit?

I need this to work in Delphi 2009 and to search text files that may or may not be Unicode. It only needs to handle text files.

So, how can I speed up this search and also make it work with very large files?


Bonus: I would also like to allow an "ignore case" option. That one is harder to make efficient. Any ideas?


Solution:

Well, mghie pointed me back to my previous question How can I read the first few lines of many files in Delphi, and as I said, it was a different problem and didn't provide the solution.

But it made me realize I had done this sort of thing before. I had built a block-reading routine for large files that breaks them into 32 MB blocks. I use it to read the input file of my program, which can be huge, and it works fine and fast. So the first step is to do the same for the files I am searching.

So now the question became how to search those blocks efficiently. Well, I had a previous question on that topic too: Is there an efficient whole word search function in Delphi? and RRUZ pointed me to the SearchBuf routine.

It also takes care of the "bonus", because SearchBuf has options that include whole-word searching (the answer to that question) and MatchCase / no MatchCase (the answer to the bonus).
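
In outline, the block-reading plus SearchBuf approach looks something like this. This is a simplified sketch, not my exact code: the function name is mine, it assumes the PAnsiChar overload of SearchBuf from StrUtils, and it treats the file as plain ANSI bytes (a UTF-16 file would need a second, widechar pass or a conversion):

    uses
      Classes, SysUtils, StrUtils;

    const
      BlockSize = 32 * 1024 * 1024;  // 32 MB blocks, as described above

    function FileContainsText(const FileName: string; const TextToFind: AnsiString;
      MatchCase: Boolean): Boolean;
    var
      Stream: TFileStream;
      Buf: array of AnsiChar;
      BytesRead, NewBytes, Overlap: Integer;
      Options: TStringSearchOptions;
    begin
      Result := False;
      if TextToFind = '' then
        Exit;

      Options := [soDown];
      if MatchCase then
        Include(Options, soMatchCase);   // leave this out for the "ignore case" bonus

      Overlap := Length(TextToFind) - 1; // carry this much over so a match can't be split
      SetLength(Buf, BlockSize + Overlap);

      Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
      try
        BytesRead := Stream.Read(Buf[0], BlockSize + Overlap);
        while BytesRead > 0 do
        begin
          if SearchBuf(PAnsiChar(@Buf[0]), BytesRead, 0, 0, TextToFind, Options) <> nil then
          begin
            Result := True;              // stop at the first match
            Exit;
          end;
          if BytesRead <= Overlap then
            Break;                       // everything has been searched
          // Move the tail of this block to the front and read the next block after it
          if Overlap > 0 then
            Move(Buf[BytesRead - Overlap], Buf[0], Overlap);
          NewBytes := Stream.Read(Buf[Overlap], BlockSize);
          if NewBytes = 0 then
            Break;
          BytesRead := Overlap + NewBytes;
        end;
      finally
        Stream.Free;
      end;
    end;

The same loop with the PWideChar overload of SearchBuf (or a conversion of each block) would cover UTF-16 files.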

So I'm all set. Thanks again to the SO community.

+4
6 answers

This problem is related to your previous question How can I read the first few lines of many files in Delphi, and the same answers apply. If you don't read the files completely but in blocks, then large files won't cause a problem. There is also a big speed gain for files that do contain the text, because you should cancel the search at the first match. Currently you read entire files even when the text to be found is in the first few lines.

+2

The best approach here is probably to use memory mapped files.

First you need a file handle; to get one, use the CreateFile API function.

Then call CreateFileMapping to get a file mapping handle. Finally, use MapViewOfFile to map the file into memory.

To handle large files, MapViewOfFile can map just a specific range of the file into memory, so you can, for example, map the first 32 MB, then use UnmapViewOfFile to unmap it, then MapViewOfFile the next 32 MB, and so on. (EDIT: as pointed out below, make sure the blocks you map this way overlap by a multiple of 4 KB, and by at least the length of the text you are searching for, so that you don't miss any text that happens to be split across a block boundary.)

To perform the actual search while (part of) a file is mapped into memory, you can make a copy of the source of StrPosLen from SysUtils.pas (unfortunately it is defined only in the implementation section and not exposed in the interface). Keep one copy as is, and make another copy with Wide replaced by Ansi throughout. Also, if you want to be able to search binary files that may contain embedded #0 characters, you can remove the "(Str1[I] <> #0) and" part of the condition.

Either find a way to determine whether the file is ANSI or Unicode, or simply run both the Ansi and Unicode versions over each mapped part of the file.

Once you are done with each file, make sure you call CloseHandle, first on the file mapping handle and then on the file handle. (And don't forget to call UnmapViewOfFile before either of those.)
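
A bare-bones sketch of that sequence might look like the following. It is only meant to show the CreateFile / CreateFileMapping / MapViewOfFile / UnmapViewOfFile / CloseHandle order and the 32 MB windowing; the function name is illustrative, the search inside each view is a naive ANSI byte scan rather than the StrPosLen copies described above, and error handling is minimal:

    uses
      Windows, SysUtils;

    function MappedFileContains(const FileName: string; const Needle: AnsiString): Boolean;
    const
      ViewSize = 32 * 1024 * 1024;   // view offsets stay multiples of the 64 KB allocation granularity
    var
      hFile, hMap: THandle;
      SizeLow, SizeHigh: DWORD;
      FileSize, Offset, Remaining: Int64;
      ChunkLen: Cardinal;
      View: PAnsiChar;
      I, J: Integer;
    begin
      Result := False;
      if Needle = '' then
        Exit;

      hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
      if hFile = INVALID_HANDLE_VALUE then
        Exit;
      try
        SizeLow := GetFileSize(hFile, @SizeHigh);
        FileSize := (Int64(SizeHigh) shl 32) or SizeLow;
        if FileSize = 0 then
          Exit;

        hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
        if hMap = 0 then
          Exit;
        try
          Offset := 0;
          while (Offset < FileSize) and not Result do
          begin
            Remaining := FileSize - Offset;
            if Remaining > ViewSize + Length(Needle) - 1 then
              ChunkLen := ViewSize + Length(Needle) - 1   // overlap so a match can't straddle two views
            else
              ChunkLen := Cardinal(Remaining);

            View := MapViewOfFile(hMap, FILE_MAP_READ,
              Int64Rec(Offset).Hi, Int64Rec(Offset).Lo, ChunkLen);
            if View = nil then
              Exit;
            try
              // Naive byte-by-byte scan of the mapped view
              for I := 0 to Integer(ChunkLen) - Length(Needle) do
              begin
                J := 1;
                while (J <= Length(Needle)) and (View[I + J - 1] = Needle[J]) do
                  Inc(J);
                if J > Length(Needle) then
                begin
                  Result := True;
                  Break;
                end;
              end;
            finally
              UnmapViewOfFile(View);     // unmap the view first ...
            end;
            Inc(Offset, ViewSize);
          end;
        finally
          CloseHandle(hMap);             // ... then close the mapping handle ...
        end;
      finally
        CloseHandle(hFile);              // ... and the file handle last
      end;
    end;

A real implementation would also run a widechar version of the scan (or the StrPosLen copies) over the same views to cover Unicode files.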

EDIT:

The big advantage of using memory mapped files, instead of using for example a TFileStream to read the file into memory in blocks, is that the bytes end up in memory only once.

Normally, when accessing a file, Windows reads the bytes into the OS file cache first, and then copies them from there into the application's memory.

If you use memory mapped files, the OS can map the physical pages of the OS file cache directly into the application's address space without making another copy (reducing both copying time and memory use).

Bonus answer: By calling StrLIComp instead of StrLComp, you can do a case insensitive search.
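
In other words, wherever the comparison happens, the only change needed is swapping the compare routine. A tiny illustration (the helper function is mine, not from SysUtils; PChar shown here, the PAnsiChar overloads work the same way):

    uses
      SysUtils;

    // Compare Len characters at P and Q; the IgnoreCase flag just switches
    // between StrLComp (case-sensitive) and StrLIComp (case-insensitive).
    function SamePrefix(P, Q: PChar; Len: Cardinal; IgnoreCase: Boolean): Boolean;
    begin
      if IgnoreCase then
        Result := StrLIComp(P, Q, Len) = 0
      else
        Result := StrLComp(P, Q, Len) = 0;
    end;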

+12

If you are looking at text string searching, look at the Boyer-Moore search algorithm. It uses memory mapped files and a very fast search engine. There are some Delphi components around that contain implementations of this algorithm.

To give you an idea of the speed: I am currently scanning 10-20 MB files and it takes a matter of milliseconds.

Oh, I just read that it could be Unicode - not sure whether it supports that, but definitely look down this path.
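
For reference, the core of a Boyer-Moore-style search is quite small. The following is my own sketch of the simplified Horspool variant over raw ANSI bytes (case-sensitive only; the libraries mentioned in these answers add Unicode and case handling on top of this idea):

    // Boyer-Moore-Horspool: returns the 1-based position of Pattern in Text, or 0.
    function BMHPos(const Pattern, Text: AnsiString): Integer;
    var
      Shift: array[0..255] of Integer;
      PatLen, TextLen, I, J, K: Integer;
    begin
      Result := 0;
      PatLen := Length(Pattern);
      TextLen := Length(Text);
      if (PatLen = 0) or (PatLen > TextLen) then
        Exit;

      // Default shift: skip the whole pattern length on a mismatch
      for I := 0 to 255 do
        Shift[I] := PatLen;
      // Characters that occur in the pattern (except its last one) get smaller shifts
      for I := 1 to PatLen - 1 do
        Shift[Ord(Pattern[I])] := PatLen - I;

      K := PatLen;                       // K is the text position aligned with the pattern's end
      while K <= TextLen do
      begin
        J := PatLen;
        I := K;
        while (J > 0) and (Text[I] = Pattern[J]) do
        begin
          Dec(J);
          Dec(I);
        end;
        if J = 0 then
        begin
          Result := I + 1;               // full match found
          Exit;
        end;
        Inc(K, Shift[Ord(Text[K])]);     // jump ahead using the bad-character shift
      end;
    end;

Called as BMHPos(TextToFind, Block), it skips ahead by up to the pattern length on each mismatch, which is where the speed comes from.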

+3

May I suggest a component? If so, I would recommend ATStreamSearch. It handles ANSI and UNICODE (and even EBCDIC and Korean and more).

Or the TUTBMSearch class from JclUnicode (Jedi-JCL). It was written mainly by Mike Lischke (of VirtualTreeView). It uses a tuned Boyer-Moore algorithm, which is what gives it its speed. The weak point in your case is that it works entirely in Unicode (WideStrings), so the conversion from String to WideString could carry a penalty.

+2

It depends on what data you are going to search. To get really efficient results you would let your program analyze the directories of interest, including all the files in them, and store the data in a database that you can then query each time for a specific word against a specific list of files (which can be narrowed down by path before the search). The database engine can give you results in milliseconds.

The drawback is that you have to run through and analyze all the files once after installation, which may take an hour or more depending on the amount of data you want to analyze.

The database should be updated each time your program starts; this can be done by comparing the MD5 value of each file to see whether it has changed, so you don't need to re-analyze all your files every time.
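
Sketched in code, that start-up check could look something like this; MD5OfFile and the three index routines are hypothetical placeholders for whatever hash function and database layer you actually use:

    // Re-index a file only when its MD5 has changed since the last run.
    // All four called routines are placeholders, not library functions.
    procedure UpdateIndexIfChanged(const FileName: string);
    var
      NewHash, StoredHash: string;
    begin
      NewHash := MD5OfFile(FileName);            // hypothetical: hash of the file's contents
      StoredHash := IndexLookupHash(FileName);   // hypothetical: hash stored at the last run ('' if none)
      if NewHash <> StoredHash then
      begin
        ReindexFile(FileName);                   // hypothetical: parse the file, update word -> file records
        IndexStoreHash(FileName, NewHash);       // hypothetical: remember the new hash
      end;
    end;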

This way of working is interesting if you have all your data in a fixed place and you analyze the same files repeatedly rather than completely new files each time; some code analyzers work that way, and they are really efficient. You invest some time in parsing and storing the interesting data, and afterwards you can jump to the places where a search word appears and produce a list of all its occurrences in a very short time.

0

If you need to search the files several times, it could be a good idea to use a word index.

This is called "Full Text Search."

The first time it will be slower (the text must be parsed and the indexes created), but any later search will be immediate: in short, it will use only the indexes and not read the whole text again.

You have exactly the parser you need in The Delphi Magazine, Issue 78, February 2002: "Algorithms Alfresco: Ask A Thousand Times" - Julian Bucknall discusses word indexing and document searches: "If you want to know how Google works its magic, this is the page to turn to."

There are several FTS implementations available for Delphi.

I would add that most databases have a built-in FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and so on. We provide direct access from Delphi, with ORM classes, to this full-text search engine, called FTS3/FTS4.
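
To show how small the SQLite side of that is, these are the two statements at the core of an FTS index, written here as Delphi string constants (the table and parameter names are illustrative, and how they get executed depends on the SQLite wrapper you use):

    const
      // One-off: create the virtual full-text table holding file name and contents
      CreateIndexSQL =
        'CREATE VIRTUAL TABLE docs USING fts4(filename, content);';
      // Per search: list the files whose indexed content matches the word
      SearchSQL =
        'SELECT filename FROM docs WHERE content MATCH :word;';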

0
