Improving file search in MySQL

I have several million file names that I need to find. They look like this:

LG_MARGINCALL_HD2CH_127879834_EN.mov 

If someone searches for any of the following, this file should match:

  • Margin
  • margin call
  • margin call mov
  • margin call hd ru
  • margin call hd en mov

I am currently using a MySQL LIKE '%...%' search. Something like:

 SELECT filename FROM path WHERE filename LIKE '%margin%' AND filename LIKE '%mov%' 

This is deadly slow (a search may take up to ten seconds). Note that it does work.

What would be the best way to do the above search, either using MySQL or another program?

+7
sql unix mysql search full-text-search
6 answers

Your search strategy, as you have noticed, is slow. It is slow because

  LIKE '%something%' 

must scan the whole table to find matches. A leading % wildcard in a LIKE pattern is a great way to wreck performance, because an index on the column cannot be used to narrow the search.

I do not know how many columns are in your path table. If there are many columns, you can do two quick things to improve performance:

  • get rid of SELECT * and list the names of the desired columns in your result set.
  • create a composite index consisting of a filename column followed by other columns you need to get.

(This will not help if there are only a few columns in the table.)

You cannot use MySQL's FULLTEXT search directly for this material, because it is designed for natural-language text.

If I had to quickly make this work for production, I would do the following:

First create a new table called "searchterm" containing

  filename_id INT          the id number of a row in your path table
  term        VARCHAR(20)  a fragment of a filename

Secondly, write a program that reads the values of id and filename from your path table and inserts a bunch of different rows for each of them into searchterm. For the item you specified, the values should be:

  LG_MARGINCALL_HD2CH_127879834_EN.mov    (original)
  LG MARGINCALL HD2CH 127879834 EN mov    (split on punctuation)
  HD 2 CH                                 (split on embedded numerics)
  MARGIN CALL                             (split on an app-specific list of words)

So, you will have many entries in your searchterm table, all with the same filename_id value and many different small pieces of text.
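
The splitting step could be sketched in Python roughly as follows. This is only an illustration of the three rules listed above, not the answerer's actual program: the function names, the tiny KNOWN_WORDS list and the decision to lowercase everything are all assumptions made for the example.

```python
import re

# Hypothetical app-specific vocabulary used to split run-together words
# such as MARGINCALL; a real list would come from your own catalogue.
KNOWN_WORDS = ["margin", "call"]


def split_known_words(token, vocabulary=KNOWN_WORDS):
    """Return the vocabulary words that exactly tile `token`, else []."""
    if not token:
        return []
    for word in vocabulary:
        if token == word:
            return [word]
        if token.startswith(word):
            rest = split_known_words(token[len(word):], vocabulary)
            if rest:
                return [word] + rest
    return []


def search_terms(filename):
    """Produce the set of lowercase fragments to store for one filename."""
    name = filename.lower()          # assumption: store terms in lowercase
    terms = {name}                   # the original name
    # Split on punctuation: anything that is not a letter or a digit.
    for piece in re.split(r"[^a-z0-9]+", name):
        if not piece:
            continue
        terms.add(piece)
        # Split on embedded numerics: "hd2ch" -> "hd", "2", "ch".
        terms.update(re.findall(r"[a-z]+|[0-9]+", piece))
        # Split on the word list: "margincall" -> "margin", "call".
        terms.update(split_known_words(piece))
    return terms


terms = search_terms("LG_MARGINCALL_HD2CH_127879834_EN.mov")
```

Each element of the returned set would become one row in searchterm, paired with the file's id. Note that the full original name is longer than 20 characters, so storing it as a term would need a wider column than the VARCHAR(20) suggested above.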

Finally, you can do this when searching.

  SELECT path.id, path.filename, path.whatever,
         COUNT(DISTINCT searchterm.term) AS termcount
    FROM path
    JOIN searchterm ON path.id = searchterm.filename_id
   WHERE searchterm.term IN ('margin', 'call', 'hd', 'en', 'mov')
   GROUP BY path.id, path.filename, path.whatever
   ORDER BY termcount DESC, path.filename

This little query finds all the fragments that match what you are looking for. It returns multiple file names, ordered so that the ones matching the most terms come first.
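
To see the ranking in action, here is a small self-contained sketch. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; the answer itself is about MySQL. The second and third file names are made up for the example, and the term lists are typed in by hand rather than generated.

```python
import sqlite3

# sqlite3 is only a stand-in here so the sketch runs anywhere;
# the answer itself targets MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE path (id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE searchterm (filename_id INTEGER, term TEXT);
    CREATE INDEX searchterm_term ON searchterm (term, filename_id);
""")

# Hypothetical sample rows; the second and third filenames are made up.
files = {
    1: ("LG_MARGINCALL_HD2CH_127879834_EN.mov",
        ["lg", "margin", "call", "hd", "2", "ch", "127879834", "en", "mov"]),
    2: ("LG_MARGINCALL_SD_555_RU.mp4",
        ["lg", "margin", "call", "sd", "555", "ru", "mp4"]),
    3: ("XX_OTHERFILM_HD_777_EN.mov",
        ["xx", "otherfilm", "hd", "777", "en", "mov"]),
}
for file_id, (filename, terms) in files.items():
    conn.execute("INSERT INTO path VALUES (?, ?)", (file_id, filename))
    conn.executemany("INSERT INTO searchterm VALUES (?, ?)",
                     [(file_id, t) for t in terms])


def search(words):
    """Return (filename, termcount) rows, best match first."""
    # Only "?" placeholders are interpolated; the words are bound values.
    placeholders = ", ".join("?" for _ in words)
    sql = f"""
        SELECT path.filename, COUNT(DISTINCT searchterm.term) AS termcount
        FROM path
        JOIN searchterm ON path.id = searchterm.filename_id
        WHERE searchterm.term IN ({placeholders})
        GROUP BY path.id, path.filename
        ORDER BY termcount DESC, path.filename
    """
    return conn.execute(sql, [w.lower() for w in words]).fetchall()


results = search(["margin", "call", "hd", "en", "mov"])
```

The equality lookup on term can use an ordinary index, which is the whole point of the exercise: no leading-wildcard scan is involved. A file that matches only some of the words still shows up, just lower in the list; add a HAVING clause on termcount if you want to require every word.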

What I am suggesting is that you build your own sort-of full-text search engine tailored to your file names. If you really have several million multimedia files, this is definitely worth your effort.

+13

It seems obvious that you need full-text search.

There are several solutions for this; one of the best at the moment is Elasticsearch.

It has all the features needed for real-time full-text search, and it goes well beyond that, providing automatic suggestions, autocomplete, etc.

And it is open source.

+3

Stop using the LIKE operator. Use MATCH() instead, with a FULLTEXT index on the column you search. Your table should be MyISAM (I don't know whether it is or not).

+1

I suggest two things to try to improve performance. The first is to put the EXPLAIN keyword before SELECT. This may give you some insight into why the query is slow, although I suspect it will not help much here. The second is to use REGEXP. An example combining both:

 EXPLAIN SELECT filename FROM path WHERE filename REGEXP '^.*MAR{1}.*mov{1}' 

But you will need to do a bit more work to optimize the regex.

0

Try using SPHINX for full-text search. http://sphinxsearch.com/

0

This may be faster than using AND:

 SELECT filename FROM path WHERE filename LIKE '%margin%call%hd%en%mov%' 

But a "%" at the beginning of the pattern will always slow the query down.

You should put a full-text index on the column and then use something like:

 SELECT filename FROM path WHERE MATCH(filename) AGAINST('+margin +call +hd +en +mov' IN BOOLEAN MODE); 
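
If the search words come from user input, the '+word +word' string has to be built somewhere. A minimal Python sketch, assuming a hypothetical helper named boolean_query and a driver that uses %s placeholders (the placeholder style depends on the driver):

```python
import re


def boolean_query(user_input):
    """Turn free text into a '+word +word' string for BOOLEAN MODE."""
    # Keep only runs of letters and digits, so characters that act as
    # boolean-mode operators (+ - < > ( ) ~ * ") cannot leak into the query.
    words = re.findall(r"[A-Za-z0-9]+", user_input)
    return " ".join("+" + word.lower() for word in words)


query = boolean_query("margin call HD en mov")
# The string would then be passed as a bound parameter, for example:
#   SELECT filename FROM path
#   WHERE MATCH(filename) AGAINST (%s IN BOOLEAN MODE)
```

Be aware that MySQL's built-in full-text parser treats underscores as word characters and skips words below a minimum length (ft_min_word_len, default 4, for MyISAM; innodb_ft_min_token_size, default 3, for InnoDB). With file names like the one in the question, this query may therefore match nothing unless the names are split into separate words before indexing and the minimum length is lowered.
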
0
