Fuzzy search in PHP

If you have ever submitted a story to Digg, you know it checks whether the story has already been submitted. I assume this is done with a fuzzy search.

I would like to implement something similar and want to know whether they are using an open source PHP class.

Soundex won't do it: the sentences / lines can be up to 250 characters long.

+6
php mysql full-text-search
3 answers

Unfortunately, doing this in PHP is overly expensive (high CPU and memory usage). However, you can apply the algorithm to small data sets.

Here is, in detail, how to create a server crisis. Two PHP built-in functions, levenshtein and similar_text, will determine the "distance" between two strings.
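As a quick illustration of what those two built-ins report, here is a minimal sketch comparing one pair of strings. The sample strings and variable names are only for demonstration.

```php
<?php
$a = 'Apple';
$b = 'Apples';

// levenshtein() counts the single-character inserts, deletes and
// replacements needed to turn one string into the other (lower = closer).
$distance = levenshtein($a, $b);

// similar_text() returns the number of matching characters and, through
// the optional third argument, a similarity percentage (higher = closer).
$common = similar_text($a, $b, $percent);

printf("levenshtein: %d, similar_text: %d (%.1f%%)\n", $distance, $common, $percent);
```

Note that the two functions point in opposite directions: a small levenshtein value and a large similar_text value both mean "similar".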

Dummy data: (pretend these are news headlines)

    $titles = <<<EOF
    Apple
    Apples
    Orange
    Oranges
    Banana
    EOF;

    // split the block of text into one headline per array element
    $titles = explode("\n", $titles);

At this point, $titles should just be an array of strings. Now create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines you get a 5 x 5 matrix (25 entries). This is where the CPU and memory load comes from.

This is why this method (via PHP) cannot be applied to thousands of records. But if you want:

    $matches = array();
    foreach ($titles as $title) {
        $matches[$title] = array();
        foreach ($titles as $compare_to) {
            // edit distance between the two headlines
            $matches[$title][$compare_to] = levenshtein($compare_to, $title);
        }
        // sort each row so the closest matches come first
        asort($matches[$title], SORT_NUMERIC);
    }

At this point you basically have a matrix of "text distances". Conceptually (this is not the real data structure) it looks something like the table below. Note the 0 values running down the diagonal: that is where the loop compared a headline with itself, and two identical strings are, well, identical.

             Apple  Apples  Orange  Oranges  Banana
    Apple        0       1       5        6       6
    Apples       1       0       6        5       6
    Orange       5       6       0        1       5
    Oranges      6       5       1        0       5
    Banana       6       6       5        5       0

The actual $matches array looks like this (truncated):

  Array
 (
     [Apple] => Array
         (
             [Apple] => 0
             [Apples] => 1
             [Orange] => 5
             [Banana] => 6
             [Oranges] => 6
         )

     [Apples] => Array
         (
       ...

In any case, it is up to you to determine (by experimenting) what numerical distance limit should count as a match, and then apply it. Otherwise, read up on sphinx-search and use it, since it has PHP libraries.
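Applying such a limit could look roughly like the sketch below. The function name findSimilarTitles and the default cut-off of 2 are assumptions for illustration, not values from any real system; the lowercasing step is also an addition, so that case differences do not count as edits.

```php
<?php
/**
 * Return the titles within $maxDistance edits of $candidate, closest first.
 * $maxDistance is a hypothetical cut-off; tune it against real data.
 */
function findSimilarTitles(string $candidate, array $titles, int $maxDistance = 2): array
{
    $hits = [];
    foreach ($titles as $title) {
        // Compare case-insensitively so "apple" and "Apple" count as identical.
        $distance = levenshtein(strtolower($candidate), strtolower($title));
        if ($distance <= $maxDistance) {
            $hits[$title] = $distance;
        }
    }
    asort($hits, SORT_NUMERIC); // smallest distance first, keys preserved
    return $hits;
}

$titles = ['Apple', 'Apples', 'Orange', 'Oranges', 'Banana'];
print_r(findSimilarTitles('Aple', $titles));
```

This still compares the new headline against every stored one, so it only makes sense for small sets or for a shortlist that something cheaper (an index lookup, a full-text query) has already narrowed down.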

Orange you glad you asked about this?

+5

I would suggest taking the URLs users submit and storing them in several parts: domain name, path and query string. Use PHP's parse_url() function to get the parts of the submitted URL.
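A minimal sketch of that splitting step follows. The helper name splitUrl and the array keys domain, path and query are illustrative choices, and lowercasing the host is an added assumption so that "Example.com" and "example.com" end up in the same bucket.

```php
<?php
/**
 * Split a submitted URL into the parts worth storing separately.
 * The returned keys (domain, path, query) are illustrative names.
 */
function splitUrl(string $url): ?array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // malformed or relative URL: nothing to index
    }
    return [
        'domain' => strtolower($parts['host']),
        'path'   => $parts['path'] ?? '/',
        'query'  => $parts['query'] ?? '',
    ];
}

print_r(splitUrl('http://Example.com/news/story.php?id=42'));
```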

Index at least the domain name and path. Then, when a new user submits a URL, you search your database for a record matching the domain and path. Since the columns are indexed, you first filter out all records that are not on the same domain and then search only the remaining records. Depending on your dataset, this should be faster than simply indexing the entire URL. Make sure the conditions in your WHERE clause are in the right order.
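The lookup could be sketched with PDO as below. The table and column names (stories, domain, path, title) and the function name findExistingStory are hypothetical, and the composite index in the comment is an assumption about how the table would be set up.

```php
<?php
/**
 * Look for an already-submitted story with the same domain and path.
 * Table and column names (stories, domain, path, title) are hypothetical.
 */
function findExistingStory(PDO $pdo, string $url): ?array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // cannot match a URL without a host
    }

    // Assumes an index such as:
    //   CREATE INDEX idx_domain_path ON stories (domain, path);
    // The WHERE conditions follow the same column order as the index.
    $stmt = $pdo->prepare(
        'SELECT id, title FROM stories WHERE domain = ? AND path = ? LIMIT 1'
    );
    $stmt->execute([strtolower($parts['host']), $parts['path'] ?? '/']);

    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}
```

Call it with an open connection before inserting a new submission; a non-null result means a story with the same domain and path already exists. The query string is deliberately ignored here, which may or may not suit your data.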

If this does not meet your needs, I suggest trying Sphinx. Sphinx is an open source full-text search engine that is much faster than MySQL's built-in full-text search. It supports stemming and some other nice features.

http://sphinxsearch.com/

You can also take the title or text content of the user's submission, run it through a function that generates keywords, and search the database for existing entries with those or similar keywords.
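One possible shape for such a keyword function is sketched below. The name extractKeywords, the minimum word length and the tiny stop-word list are all assumptions for illustration; a real list would be much longer, and the ASCII-only split would need adjusting for non-English text.

```php
<?php
/**
 * Reduce a title or text to a list of distinct lowercase keywords.
 * The stop-word list and $minLength default are hypothetical.
 */
function extractKeywords(string $text, int $minLength = 4): array
{
    $stopWords = ['this', 'that', 'what', 'with', 'from', 'have', 'about'];

    // Lowercase, then split on anything that is not an ASCII letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $keywords = [];
    foreach ($words as $word) {
        if (strlen($word) < $minLength) {
            continue; // skip short filler words such as "the" or "is"
        }
        if (in_array($word, $stopWords, true) || in_array($word, $keywords, true)) {
            continue; // skip stop words and duplicates
        }
        $keywords[] = $word;
    }
    return $keywords;
}

print_r(extractKeywords('Apple unveils the new iPhone: this is what Apple changed'));
```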

+2

You can (depending on the size of your dataset) use MySQL's FULLTEXT search, look for items that have a high score and fall within a specific timeframe, and suggest them to the user.

More information about scoring can be found in the MySQL documentation on full-text search.

0