Is the database / data source optimized for string matching?

Question

Is the database / data source optimized for string matching?

I want to store a large number of (~ thousand) lines and be able to match using wildcards.

For example, here is an example of content:

Folder1
Folder1/Folder2
Folder1/*
Folder1/Folder2/Folder3
Folder2/Folder*
*/Folder4
*/Fo*4

(each line also has additional data, for example tags, but the match is only against this key)

Here is an example of what I would like to map to data:

Folder1
Folder1/Folder2/Folder3
Folder3

( * is a wildcard here, it may be a different character)

I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes will only work for characters to the left of the template, and in my case it could be anywhere (i.e. %/Folder3 ).

So, I am looking for a quick solution that can be used with PHP . And I'm open: it could be a separate server, a PHP library using regular expression files, ...

+6

string-matching database php datasource database-performance

Matthieu napoli Feb 22 '13 at 12:55

source share

9 answers

Brickwall10 · Answer 1 · 2013-03-14T04:49:17+0000

Have you considered using the regex mechanism of MySQL? Try something like this:

SELECT * FROM your_table WHERE your_query_string REGEXP pattern_column

This will return rows with regex keys that match the query string. I expect it to work better than running a query to pull all the data and do the mapping in PHP.

Further information here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html

Ricardo Ortega Magaña · Answer 2 · 2013-02-26T01:20:03+0000

You might want to use a multi-core approach to solve this search in a fraction of the time, I would recommend searching and matching using FPGAs, but this is probably the most difficult way to do this, consider THIS ARTICLE using CUDA, you can search in 16 normal time, in multi-core CPU systems, you can use posix or a cluster of computers to perform a task (for example, MPI), you can call the Gearman service to start a search using advanced algorithms.

Don dickinson · Answer 3 · 2013-02-26T01:26:59+0000

If it were me, I would save the key field twice ... once forward and once backward (see mysql callback function). you can search the index with left (main_field) and left (reverseed_field). this will not help you when you have a wildcard in the middle of the line And the beginning (for example, "* Folder1 * Folder2), but it will be when you have a wildcard at the beginning or at the end.

eg. if you want to search * / Folder1, then find where on the left (reverse_field, 8) = '1redloF /'; for Folder1 / * / FolderX, where on the left (reverse_field, 8) = 'XredloF /' and left (main_field, 8) = 'Folder1 /'

orzechow · Answer 4 · 2013-02-27T15:38:30+0000

If your lines are some kind of hierarchical structure (as it looks in your example content), they are not really “real” files, but you say that you are open to alternative solutions - why not consider something like a file index

Select a new directory, for example myindex
Create an empty file for each entry using the string key as the location and file name in myindex

Now you can find matches with glob - thanks to the hierarchical structure of the file, glob searches should be much faster than finding all records in the database. If necessary, you can compare the results with MySQL data - thanks to your MySQL index by key, this action will be very fast.

But remember to update the myindex structure to INSERT , UPDATE or DELETE in the MySQL database.

This solution will compete only with a huge data set (but not too large, as mentioned in @Kyle) with a rather deep hierarchical structure.

EDIT Sorry, this will only work if the wildcards in your search terms are not in the stored strings themselves.

kraftb · Answer 5 · 2013-02-28T10:40:27+0000

Since wildcards (*) are in your data, not in your queries, I think you should start by breaking your data apart. You must create an index table with columns such as:

 dataGroup INT(11), exactString varchar(100), wildcardEnd varchar(100), wildcardStart varchar(100),

If you have a value of type "Folder1 / Folder2", save it in "exactString" and assign the identifier to the value in the master data table "dataGroup" in the above index table.

If you have a value of type "Folder1 / *", save the value of "Folder1 /" in "wildcardEnd" and again assign the identifier of the value in the main table to the "dataGroup" field in the above table.

Then you can match in your query using:

 indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))

This will truncate the search string ("Folder1 / WhatAmILookingFor / Data") to "Folder1 /" and then match it with the wildcardEnd field. I assume mysql is smart enough not to trim for each row, but start with the first character and match it with each row (using B-Tree indices).

A value like "* / Folder4" will go into the "wildcardStart" field, but it will be canceled. To quote Missy Elliot: “Is it worth it, let me work out? I put down my business, flipped it and canceled it” ( http://www.youtube.com/watch?v=Ke1MoSkanS4 ). Save the value "4redloF /" to "wildcardStart" ,. Then WHERE, as shown below, will match the lines:

 indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))

Of course, you can do "REVERSE" already in your application logic.

Now about the difficult part. Something like "* / Fo * 4" should be split into two entries:

 # Record 1 dataGroup ==> id of "*/Fo*4" in data table wildcardStart ==> oF/ wildcardEnd ==> /Fo # Record 2 dataGroup ==> id of "*/Fo*4" in data table wildcardStart ==> 4

Now, if you agree with something, you need to make sure that each index record in the dataGroup is returned for full compliance and that there is no overlap. This can also be solved in SQL, but is beyond the scope of this question.

Alon catz · Answer 6 · 2013-03-02T17:13:58+0000

A database is not a suitable tool for performing these types of searches. You can still use a database (any database and any structure) to store strings, but you need to write code to execute all the queries in memory. Download all the rows from the database (several thousand rows are not really large), cache them and run the search \ match algorithm on them.

You probably have to code your algorithm yourself, because standard tools will be redundant for what you are trying to achieve, and there is no guarantee that they will be able to achieve exactly what you need.

I would build a regular expression of your wildcards and run these regular expressions at your input. Your probabaly will have to do some work until you get the correct expression, but this will be the fastest way.

bitfiddler · Answer 7 · 2013-03-04T20:07:57+0000

I suggest reading the keys and the useful data associated with them in a binary representation of a tree, ordered by an alphanumeric key. If your keys are not terribly “knocked down,” you can avoid the (small extra) riser building of a balanced tree. You can also avoid any tree maintenance code, because if I understand your problem correctly, the data will change frequently and it would be easier to restore the tree rather than adding / deleting / updating nodes in place. The overhead of reading into a tree is similar to doing the initial sort, and traversing the tree to find your value is direct and much more efficient than just running a regular expression against multiple lines. You can even find while working that your wildcards in the tree will lead to some shortcuts to crop the search space. A quick search reveals many resources and PHP snippets to get you started.

nubeiro · Answer 8 · 2013-02-22T15:03:45+0000

If you run SELECT folder_col, count(*) FROM your_sample_table group by folder_col , will you get duplicate values of folder_col (i.e. count (*) greater than 1)?

If not, it means that you can create SQL that will generate a valid sphinx index (see http://sphinxsearch.com/ ).

Lukasz Kujawa · Answer 9 · 2013-02-27T18:03:23+0000

I would not recommend doing a text search on a large dataset in MySQL. You need a database to store data, but that would be so. To search, use a search engine, for example:

Solr ( http://lucene.apache.org/solr/ )
Elastic Search ( http://www.elasticsearch.org/ )
Sphinx ( http://sphinxsearch.com/ )

These services allow you to instantly browse text search (including wildcards), -)

Is the database / data source optimized for string matching?

More articles: