Which is faster: grepping through files, or running a SQL LIKE '%x%' query over blobs?

Let's say I'm developing a tool that stores code snippets either in a PostgreSQL/MySQL database or on the file system, and I want to search those snippets later. Using a full-text search engine such as Sphinx doesn't seem practical, because searching code requires exact text matches.

grep and ack have always worked fine for me, but storing the material in a database makes a large collection more manageable in other ways. So I'm wondering how the performance of running grep recursively over a directory tree compares to running a query such as SQL LIKE or MySQL REGEXP over an equivalent number of records with TEXT columns.
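For concreteness, the database side of the comparison would be something like the following (just a sketch; the snippets table and body column are placeholder names):

    SELECT id, path FROM snippets WHERE body LIKE '%connect_timeout%';
    -- or, with MySQL's regular-expression operator:
    SELECT id, path FROM snippets WHERE body REGEXP 'connect_timeout *=';

versus grep -r over the equivalent directory tree.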

4 answers

If you have 1M files to grep through, you will (as far as I know) have to go through each of them with a regex.

For all intents and purposes, you end up doing the same thing over table rows if you query them in bulk with the LIKE operator or a regular expression.

My own experience with grep is that I rarely search for something that doesn't contain at least one full word, so you can use the database to narrow down the set of rows you actually have to scan.

MySQL has built-in full-text search functions, but I would recommend against them, because using them means you are not using InnoDB.

You can read about the Postgres equivalent here:

http://www.postgresql.org/docs/current/static/textsearch.html

After creating an index on the tsvector column, you can do your "grep" in two steps: one that quickly finds the rows that might loosely qualify, followed by another that applies your true criteria:

 select * from docs where tsvcol @@ :tsquery and (regexp at will); 

This will be significantly faster than anything grep can do.
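A fuller sketch of that two-step approach could look like this (assuming a docs table with a body column and a tsvcol tsvector column kept in sync with body, for example via a trigger; the names and the regular expression are only illustrative):

    -- One-time setup: a GIN index on the tsvector column.
    CREATE INDEX docs_tsvcol_idx ON docs USING gin (tsvcol);

    -- Step 1: the index quickly narrows the result to rows containing the
    -- full words of the query.
    -- Step 2: the expensive exact pattern runs only on those candidates.
    SELECT *
    FROM docs
    WHERE tsvcol @@ to_tsquery('simple', 'connect & timeout')
      AND body ~ 'connect_timeout *= *[0-9]+';

The point is that the index does the coarse filtering, so the regular expression only has to be evaluated on a small fraction of the rows.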


I cannot compare them directly, but both will take a long time. My guess is that grep will be faster.

But MySQL supports full-text indexing and search, which should be faster than grep (again, I think).
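For reference, MySQL's full-text search looks roughly like this (a sketch; note that FULLTEXT indexes historically required MyISAM, though newer MySQL versions also support them on InnoDB):

    CREATE TABLE snippets (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        path VARCHAR(255),
        body TEXT,
        FULLTEXT KEY ft_body (body)
    );

    -- Word-based search that uses the full-text index:
    SELECT id, path
    FROM snippets
    WHERE MATCH(body) AGAINST('connect timeout');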

Also, I didn't understand what the problem was with Sphinx or Lucene. Anyway, here is the benchmark for MySQL, Sphinx and Lucene.


The internet seems to suggest that grep uses Boyer-Moore, which makes the search time depend favourably (not multiplicatively) on the size of the query string. However, this is not that relevant here.

I think grep is close to optimal for a one-off search. But in your case you can do better, because you have repeated searches and can exploit that structure (for example, by indexing certain common substrings in your queries), as bpgergo suggests.
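One concrete way to exploit that structure in Postgres is a trigram index, which can speed up unanchored LIKE and regular-expression matches (a sketch using the pg_trgm extension; table and column names are placeholders):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- Index the text by trigrams; the planner can use this to prune candidate
    -- rows before applying the actual pattern.
    CREATE INDEX snippets_body_trgm_idx ON snippets USING gin (body gin_trgm_ops);

    SELECT id, path FROM snippets WHERE body LIKE '%connect_timeout%';
    SELECT id, path FROM snippets WHERE body ~ 'connect_timeout *=';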

Also, I'm not sure that the regular-expression engine you plan to use is optimized for this kind of query; you can try it and see.

You may want to keep all the files you are searching in memory to avoid being slowed down by the hard drive. This should work unless you are searching a staggering amount of text.


If you need a full-text index over code, I would recommend Russ Cox's codesearch tools: https://code.google.com/p/codesearch/

The approach behind them is explained in his article on how Google Code Search worked: http://swtch.com/~rsc/regexp/regexp4.html

