Searching for names in a MySQL database that probably contain typos

I am writing a script that goes through tens of thousands of lines of account information and cleans up unclear addresses, as well as printing reports on how each address was cleaned. The biggest source of unclean addresses is ambiguous street names (it's amazing how many ways you can write a street name). Currently my script takes the input street name, performs a series of edits specific to the Norwegian language (v. becomes vegen, gt. becomes gata, etc.) and looks the street name up in an address database of 2 million rows. If it does not find a match, it cuts off the second half of the street name and replaces it with a wildcard. It then tries various other wildcard search variations.
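For reference, the normalization and wildcard fallback described above could be sketched roughly like this. Only the two abbreviations come from the question; the function names, the lower-casing and the "first half plus %" pattern are assumptions for illustration, not the actual script.

```python
import re

# Expansions mentioned in the question; a real list would be longer.
ABBREVIATIONS = {"v.": "vegen", "gt.": "gata"}

def normalize_street(name):
    """Lower-case, collapse whitespace and expand a trailing abbreviation."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    for abbr, full in ABBREVIATIONS.items():
        if name.endswith(abbr):
            name = name[: -len(abbr)] + full
            break
    return name

def like_patterns(name):
    """Patterns to try in order: the exact name, then its first half plus a wildcard."""
    half = max(1, len(name) // 2)
    return [name, name[:half] + "%"]

print(like_patterns(normalize_street("Storgt.")))
```

Each pattern would then be tried in turn with a LIKE query until one returns rows.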

Anyway, my question is:

Does MySQL include anything that could make this easier for me? I remember hearing about a MySQL search function that finds the cells in a column with the most matching characters, or something like that. In cases where my wildcard search fails, this would be a great tool.

Anything that would help find matches for misspelled addresses would be great.

+4
3 answers

One option might be to try SOUNDEX. It matches on pronunciation, so it can get you closer to what you want when people misspell a street name phonetically.
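To show what phonetic matching does, here is a minimal Python sketch of the classic four-character Soundex code. It is not MySQL's exact implementation: MySQL's SOUNDEX() can return longer strings, and the MySQL manual notes it is intended for English, so results on Norwegian names may be unreliable. The helper name phonetic_matches and the sample street names are made up.

```python
def soundex(word):
    """Classic four-character Soundex code (letter followed by three digits)."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = digit
    return (result + "000")[:4]

def phonetic_matches(query, candidates):
    """Candidates whose Soundex code equals that of the query."""
    target = soundex(query)
    return [c for c in candidates if soundex(c) == target]

print(phonetic_matches("Storgatta", ["Storgata", "Kirkevegen"]))
```

In MySQL itself the equivalent comparison is SOUNDEX(a) = SOUNDEX(b), which can also be written a SOUNDS LIKE b.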

You can also try the Levenshtein distance algorithm. This is probably closer to what you are looking for. Basically, it measures how close one word is to another by counting the single-character edits needed to turn one into the other. It is used for spell checking and the like, and it can be useful when looking for bad data in address fields. Here is a link to it:

http://www.merriampark.com/ld.htm

If you want a function that implements the Levenshtein distance algorithm in MySQL, you can see an example here:

http://www.artfulsoftware.com/infotree/queries.php#552
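If it is easier to do the comparison in the script rather than in a stored function, a minimal Python sketch of the same distance could look like this. The helper closest and its max_distance cutoff of 2 are assumptions chosen for illustration, not something taken from the linked pages.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def closest(name, candidates, max_distance=2):
    """Candidate with the smallest distance, or None if even the best is too far away."""
    best = min(candidates, key=lambda c: levenshtein(name, c), default=None)
    if best is None or levenshtein(name, best) > max_distance:
        return None
    return best

print(closest("storgatta", ["storgata", "kirkevegen"]))
```

Comparing against all 2 million rows this way would be slow, so it probably makes sense to narrow the candidates first, for example with the wildcard search that is already in place.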

+2

You might want to play with FULLTEXT indexes and fuzzy MATCH ... AGAINST queries. Keep in mind that words with fewer than 4 letters are excluded from the index by default (for MyISAM tables, ft_min_word_len defaults to 4).
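One way this might be used from a script is boolean mode with the * truncation operator, so that only the beginning of each word has to match. The sketch below only builds the search string. The table and column names (addresses, street_name), the helper name and the prefix length of 4 are hypothetical, and a FULLTEXT index on the column is assumed to exist.

```python
import re

# Hypothetical table and column names; assumes a FULLTEXT index on street_name.
SEARCH_SQL = (
    "SELECT street_name FROM addresses "
    "WHERE MATCH(street_name) AGAINST (%s IN BOOLEAN MODE)"
)

def boolean_prefix_query(name, prefix_len=4):
    """Turn each sufficiently long word into a required prefix term, e.g. '+stor*'."""
    words = re.findall(r"\w+", name.lower())
    return " ".join(
        "+" + w[:prefix_len] + "*" for w in words if len(w) >= prefix_len
    )

print(boolean_prefix_query("Store Kirkevegen 12"))
```

With a DB-API driver that uses %s placeholders, the call would be something like cursor.execute(SEARCH_SQL, (boolean_prefix_query(raw_name),)). This tolerates typos near the end of a word, but not in its first few letters.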

+2

This is a bit more work, but:

  • Create a word table with fields

    • word

    • num_appeared

  • And a pivot table between words and addresses

    • address_id

    • word_id

Loop over the address table, split each address into words, then insert each word into the word table and create an entry in the pivot table. When everything is done, sort the word table by num_appeared ASC, and there you have the words with the highest chance of being misspelled. Then you can write a script that searches Google for these words; Google's suggestion may be the correct form of the word.
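The table setup above can be sketched as follows. The sketch uses an in-memory SQLite database from the Python standard library as a stand-in for MySQL so it runs on its own. The pivot table name address_word, the id column and the sample addresses are assumptions; house numbers are skipped so only words are counted.

```python
import re
import sqlite3

def build_word_index(addresses):
    """addresses maps address_id to address text; returns a DB with both tables filled."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE word (
            id INTEGER PRIMARY KEY,
            word TEXT UNIQUE,
            num_appeared INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE address_word (address_id INTEGER, word_id INTEGER);
    """)
    for address_id, text in addresses.items():
        # Letters only, so house numbers do not end up in the word table.
        for w in re.findall(r"[^\W\d_]+", text.lower()):
            db.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (w,))
            db.execute(
                "UPDATE word SET num_appeared = num_appeared + 1 WHERE word = ?", (w,)
            )
            (word_id,) = db.execute(
                "SELECT id FROM word WHERE word = ?", (w,)
            ).fetchone()
            db.execute("INSERT INTO address_word VALUES (?, ?)", (address_id, word_id))
    return db

def rarest_words(db, limit=10):
    """Words ordered by how seldom they appear; rare ones are typo suspects."""
    return db.execute(
        "SELECT word, num_appeared FROM word "
        "ORDER BY num_appeared ASC, word ASC LIMIT ?", (limit,)
    ).fetchall()

db = build_word_index({1: "Storgata 1", 2: "Storgata 5", 3: "Storgatta 7", 4: "Storgata 9"})
print(rarest_words(db, 1))
```

The pivot table then lets you find which addresses contain a suspicious word once you have decided what its correct form is.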

+2