Comparing words while also matching plurals and -ing forms?

I have two word lists, LIST1 and LIST2. I want to compare LIST1 with LIST2 to find duplicates, but the comparison also has to match the plural of each word as well as its -ing form. For instance:

Suppose LIST1 has the word "account", and LIST2 has the words "accounts" and "accounting". When I do the comparison, the result should show two matches for the word "account".

I am doing this in PHP and have the lists in MySQL tables.

5 answers

You can use a technique called Porter stemming to reduce each entry in both lists to its stem, and then compare the stems. Implementations of the Porter stemming algorithm are available for PHP.
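As a rough sketch of the stem-comparison idea, the snippet below groups the LIST2 words by stem and then looks up each LIST1 word's stem. The names `naiveStem` and `matchByStem` are hypothetical, and `naiveStem` is only a crude suffix stripper standing in for a real Porter stemmer, so swap in an actual implementation for real use.

```php
<?php
// Hypothetical stand-in for a real Porter stemmer: strips a few common suffixes.
function naiveStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'es', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Keep at least three characters so very short words are left alone.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// Returns LIST1 word => LIST2 words that share its stem.
function matchByStem(array $list1, array $list2, callable $stem): array
{
    // Index LIST2 by stem so each LIST1 word needs only one lookup.
    $byStem = [];
    foreach ($list2 as $word) {
        $byStem[$stem($word)][] = $word;
    }

    $matches = [];
    foreach ($list1 as $word) {
        $matches[$word] = $byStem[$stem($word)] ?? [];
    }
    return $matches;
}

$result = matchByStem(['account'], ['accounts', 'accounting', 'bank'], 'naiveStem');
print_r($result); // account => [accounts, accounting]
```

Passing the stemmer in as a callable keeps the matching logic independent of whichever stemming library ends up being used.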


What I would do is take your word and compare it directly with LIST2, and at the same time remove your word from every word you are comparing against, looking for a leftover "ing", "s" or "es" that denotes a plural or an -ing word (this should be accurate enough). If it is not, you will have to write an algorithm that builds plurals from words, since that is not as simple as adding an "s".

Duplicate Ending List
s
es
ing

LIST1
Gas
Test

LIST2
Gases
Tests
Testing
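A minimal sketch of that leftover check, using the example lists above. The function name `matchByLeftover` and the `LEFTOVERS` constant are made up for illustration, and the comparison is assumed to be case-insensitive.

```php
<?php
// Leftovers that count as a match; '' covers an exact duplicate.
const LEFTOVERS = ['', 's', 'es', 'ing'];

// Returns LIST1 word => LIST2 words that equal it or extend it by a listed ending.
function matchByLeftover(array $list1, array $list2): array
{
    $matches = [];
    foreach ($list1 as $base) {
        $matches[$base] = [];
        $needle = strtolower($base);
        foreach ($list2 as $candidate) {
            $lower = strtolower($candidate);
            // The candidate must start with the base word...
            if (strncmp($lower, $needle, strlen($needle)) !== 0) {
                continue;
            }
            // ...and whatever is left over must be one of the known endings.
            $leftover = (string) substr($lower, strlen($needle));
            if (in_array($leftover, LEFTOVERS, true)) {
                $matches[$base][] = $candidate;
            }
        }
    }
    return $matches;
}

$matches = matchByLeftover(['Gas', 'Test'], ['Gases', 'Tests', 'Testing']);
print_r($matches); // Gas => [Gases], Test => [Tests, Testing]
```

Irregular plurals are out of reach for this approach: "supply" and "supplies" share only "suppl", so they would not match.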


Compare each word in LIST1 against LIST2 both as it is and with 'ing' or 's' appended.

In MySQL, that could look like this:

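-- MySQL concatenates strings with CONCAT(); the + operator performs numeric addition.
SELECT DISTINCT l2.word
  FROM LIST1 l1, LIST2 l2
  WHERE l1.word = l2.word
     OR CONCAT(l1.word, 's') = l2.word
     OR CONCAT(l1.word, 'ing') = l2.word;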

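Another option is to combine Doctrine Inflector with a stemmer:

  • Split the input string into words,
  • Singularize each word and replace the part that changed with a wildcard ('%')
  • Stem each word and replace the part that changed with a wildcard ('%')

For example,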

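/**
 * Use inflection and stemming to produce a good search string to match subtle
 * differences in a MySQL table.
 *
 * Assumes Doctrine's static Inflector class is imported and that stem_english()
 * is provided by a stemming extension.
 *
 * @param string $sInputString The string you want to base the search on
 * @param string $sSearchTable The table you want to search in
 * @param string $sSearchField The field you want to search
 * @return string The generated SELECT statement
 */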
function getMySqlSearchQuery($sInputString, $sSearchTable, $sSearchField)
{
    $aInput  = explode(' ', strtolower($sInputString));
    $aSearch = [];
    foreach($aInput as $sInput) {
        $sInput = str_replace("'", '', $sInput);

        //--------------------
        // Inflect
        //--------------------
        $sInflected = Inflector::singularize($sInput);

        // Where the singular form differs from the input, keep the common prefix
        // and replace the rest with a % (wildcard) for the MySQL query.
        // XOR-ing the strings yields "\0" wherever they agree, so strspn()
        // returns the length of the common prefix.
        $iPosition = strspn($sInput ^ $sInflected, "\0");

        if ($iPosition < strlen($sInput)) {
            $sInput = substr($sInflected, 0, $iPosition) . '%';
        }

        //--------------------
        // Stem
        //--------------------
        $sStemmed = stem_english($sInput);

        // Where the stem differs from the current string, keep the common prefix
        // and replace the rest with a % (wildcard) for the MySQL query
        $iPosition = strspn($sInput ^ $sStemmed, "\0");

        if ($iPosition < strlen($sInput)) {
            $aSearch[] = substr($sStemmed, 0, $iPosition) . '%';
        } else {
            $aSearch[] = $sInput;
        }
    }

    $sSearch = implode(' ', $aSearch);
    return "SELECT * FROM $sSearchTable WHERE LOWER($sSearchField) LIKE '$sSearch';";
}

Input String: Mary Hamburgers
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'mary% hamburger%';

Input String: Office Supplies
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'offic% suppl%';

Input String: Accounting department
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'account% depart%';
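The wildcard step in the function relies on a string XOR trick, which can be tried in isolation without Doctrine or a stemmer. In the sketch below, `commonPrefixLength` and `likePattern` are hypothetical helper names, and the singular forms are typed in by hand rather than produced by an inflector.

```php
<?php
// Length of the prefix shared by two strings. XOR-ing them gives "\0" at every
// position where the bytes agree; PHP truncates the result to the shorter string.
function commonPrefixLength(string $a, string $b): int
{
    return strspn($a ^ $b, "\0");
}

// Hypothetical helper: turn a word and its singular or stemmed variant into a LIKE pattern.
function likePattern(string $word, string $variant): string
{
    $length = commonPrefixLength($word, $variant);
    // Only add a wildcard when the variant stops agreeing before the word ends.
    return $length < strlen($word) ? substr($variant, 0, $length) . '%' : $word;
}

echo likePattern('supplies', 'supply'), "\n";      // suppl%
echo likePattern('hamburgers', 'hamburger'), "\n"; // hamburger%
echo likePattern('office', 'office'), "\n";        // office
```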

Probably not perfect, but it's a good start! Where it falls down is when several matches are returned: there is no logic to determine the best fit. That is where things like MySQL full-text search and Lucene come in. Thinking about this a bit more, you could use Levenshtein distance to rank multiple results with this approach!
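One possible way to do that ranking with PHP's built-in `levenshtein()` is sketched below. The function name `rankByLevenshtein` and the candidate list are illustrative only; in practice the candidates would be the rows returned by the LIKE query.

```php
<?php
// Sorts candidate matches by edit distance to the search term, closest first.
function rankByLevenshtein(string $term, array $candidates): array
{
    $term = strtolower($term);
    usort($candidates, function (string $a, string $b) use ($term): int {
        return levenshtein($term, strtolower($a)) <=> levenshtein($term, strtolower($b));
    });
    return $candidates;
}

$ranked = rankByLevenshtein('account', ['accounting', 'accounts', 'account']);
print_r($ranked); // account, accounts, accounting
```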
