How to find similar results and sort by similarity?

Question

How to find similar results and sort by similarity?

How to request records sorted by similarity?

Eg. a search for "Stock Overflow" will return

Stack overflow
SharePoint overflow
Math Overflow
Political overflow
VFX overflow

Eg. a search for “LO” will return:

pabLO picasso
Michelangelo
jackson polLOck

I need help:

Using a search engine to index and search a MySQL table for best results
- Using Sphinx Search Engine With PHP
- Using Lucene with PHP
Using full-text indexing to find similar / containing strings

What doesn't work well

Levenshtein distance is very unstable. ( UDF Request )
A search for “dogs” gives me:
- dog
- swampy
- back
- big
- echo
LIKE returns better results, but returns nothing for long queries, although similar rows exist
- dog
- dogid
- dogaral
- dogma

+62

string sorting sql mysql similarity

Robin Rodricks Jul 26 '10 at 20:49

source share

3 answers

1. Similarity

For Levenshtein in MySQL, I found this, from ~~www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function~~

 SELECT column, LEVENSHTEIN(column, 'search_string') AS distance FROM table WHERE LEVENSHTEIN(column, 'search_string') < distance_limit ORDER BY distance DESC

2. Containing, case insensitive

Use the MySQL LIKE statement, which is case insensitive by default. % is a wildcard, so there can be any line before and after search_string .

 SELECT * FROM table WHERE column_name LIKE "%search_string%"

3. Containing case sensitive

The MySQL manual helps:

The default character set and collation is latin1 and latin1_swedish_ci, so unimportant string comparisons are case insensitive by default. This means that when you search with col_name LIKE 'a%', you get all column values starting with A or a. To make this case sensitive, make sure one of the operands is case sensitive or binary sorted. For example, if you are comparing a column and a row that has the latin1 character set, you can use the COLLATE statement to force the operand to have latin1_general_cs or latin1_bin sorted ...

My MySQL setup does not support latin1_general_cs or latin1_bin , but it was good for me to use utf8_bin sorting, since binary utf8 is case sensitive:

 SELECT * FROM table WHERE column_name LIKE "%search_string%" COLLATE utf8_bin

2./3. sorted by Levenshtein distance

 SELECT column, LEVENSHTEIN(column, 'search_string') AS distance // for sorting FROM table WHERE column_name LIKE "%search_string%" COLLATE utf8_bin // for case sensitivity, just leave out for CI ORDER BY distance DESC

+18

opatut Jul 26 '10 at 21:11

source share

It seems your definition of similarity is a semantic similarity. Therefore, to build such a similarity function, you must use semantic similarities. Please note that the amount of work on this issue can vary from several hours to several years, so it is recommended that you decide on the amount before starting work. I have not figured out what data you have in order to build a similarity relationship. I assume that you have access to a dataset of documents and a dataset of queries. You can start by matching words (for example, conditional probability). You will quickly find that you get a list of stop words because it is associated with most words simply because they are very popular. Using a conditional probability elevator will take care of the stop words, but will make the error prone attitude a small number (most of your cases). You can try Jacard , but since it is symmetrical, there will be many relationships that he will not find. Then you can consider relationships that appear only a short distance from the base word. You can (and should) consider a relationship base on a common building (for example, Wikipedia) and user-specific (for example, his emails).

Soon you will have many measures of similarity, when all measures will be good and will have an advantage over others.

To combine such measures, I would like to reduce the problem to the classification problem.

You must create a data set of paris words and mark them as “related”. To create a large tagged dataset, you can:

Use sources of well-known related words (like the good old Wikipedia categories) for good results.
Most of the word, not known as connected, is not connected.

Then use all the measures that you have as a function of pairs. You are now in the field of controlled classification problems. Create a classifier in the dataset that is tailored to your needs and get a similarity measure that fits your needs.

+2

DaL Nov 06 '15 at 10:22

source share

Yanick Rochon · Accepted Answer · 2010-07-26 21:08

I found that Levenshtein's distance can be good when you search for a complete string against another full string, but when you search for keywords in a string, this method does not return (sometimes) the desired results. In addition, the SOUNDEX function is not suitable for languages other than English, so it is very limited. You can leave with LIKE, but this is valid for basic searches. You may want to look at other methods of finding what you want to achieve. For example:

You can use Lucene as a search base for your projects. It is implemented in most major programming languages, and it will be quite fast and versatile. This method is probably the best, as it not only searches for substrings, but also transposes, prefixes, and suffixes of letters (all together). However, you need to keep a separate index (using CRON to update it from an independent script once in a while works).

Or, if you need a MySQL solution, full-text functionality is pretty good and, of course, faster than a stored procedure. If your tables are not MyISAM, you can create a temporary table and then do a full-text search:

 CREATE TABLE IF NOT EXISTS `tests`.`data_table` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `title` varchar(2000) CHARACTER SET latin1 NOT NULL, `description` text CHARACTER SET latin1 NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;

Use the data generator to generate some random data if you do not want to create it yourself ...

** NOTE **: The column type must be latin1_bin in order to perform case-sensitive searches, not case-insensitive latin1 . For unicode strings, I would recommend utf8_bin for case sensitivity and utf8_general_ci for case insensitive search.

 DROP TABLE IF EXISTS `tests`.`data_table_temp`; CREATE TEMPORARY TABLE `tests`.`data_table_temp` SELECT * FROM `tests`.`data_table`; ALTER TABLE `tests`.`data_table_temp` ENGINE = MYISAM; ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` ( `title` , `description` ); SELECT *, MATCH (`title`,`description`) AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score` FROM `tests`.`data_table_temp` WHERE MATCH (`title`,`description`) AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) ORDER BY `score` DESC; DROP TABLE `tests`.`data_table_temp`;

For more information, see the MySQL API man page.

The disadvantage of this is that he will not look for transposing letters or “similar, sound words”.

** UPDATE **

Using Lucene to search, you just need to create a cron job (all web hosts have this “function”), where that job will just execute a PHP script (ig "cd / path / to / script; php searchindexer.php"), which will update indexes. The reason is that indexing thousands of “documents” (lines, data, etc.) may take several seconds, even minutes, but this is necessary so that all search queries are executed as quickly as possible. Therefore, you may need to create a delay job that will be executed by the server. It may be at night, or in the next hour, it is up to you. The PHP script should look something like this:

 $indexer = Zend_Search_Lucene::create('/path/to/lucene/data'); Zend_Search_Lucene_Analysis_Analyzer::setDefault( // change this option for your need new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive() ); $rowSet = getDataRowSet(); // perform your SQL query to fetch whatever you need to index foreach ($rowSet as $row) { $doc = new Zend_Search_Lucene_Document(); $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8')) ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8')) ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable)) ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8')) ; $indexer->addDocument($doc); } // ... you can get as many $rowSet as you want and create as many documents // as you wish... each document doesn't necessarily need the same fields... // Lucene is pretty flexible on this $indexer->optimize(); // do this every time you add more data to you indexer... $indexer->commit(); // finalize the process

Then this is basically what you are looking for (basic search):

 $index = Zend_Search_Lucene::open('/path/to/lucene/data'); // same search options Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive() ); Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8'); $query = 'php +field1:foo'; // search for the word 'php' in any field, // +search for 'foo' in field 'field1' $hits = $index->find($query); $numHits = count($hits); foreach ($hits as $hit) { $score = $hit->score; // the hit weight $field1 = $hit->field1; // etc. }

Here are some great sites about Lucene in Java , and PHP . Net .

In conclusion , each search method has its pros and cons:

You mentioned Sphinx search , and it looks very good if you can do a deamon on your web host.
Zend Lucene requires a cron job to re-index the database. Although this is completely transparent to the user, it means that any new data (or deleted data!) Does not always synchronize with the data in your database and therefore will not be displayed immediately when searching for users.
MySQL FULLTEXT search is good and fast, but it won’t give you all the power and flexibility of the first two.

Please feel free to comment if I forgot / missed anything.

How to find similar results and sort by similarity?

I need help:

What doesn't work well

More articles: