In the end, both levenshtein and similar_text were too slow for the number of rows they had to go through, even with a lot of cheap checks beforehand and with only one of them used as a last resort.
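To illustrate the "cheap checks first, edit distance as a last resort" idea, here is a minimal sketch. It is written in Python rather than the original PHP, and the function names, the normalisation step, and the threshold of 2 are assumptions made for the example, not the code that was actually used. In PHP, the built-in levenshtein() would take the place of the hand-written function.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_probable_duplicate(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical helper: run cheap checks before the expensive one."""
    # Assumed normalisation: ignore case and surrounding whitespace.
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True  # exact match, no distance computation needed
    if abs(len(a) - len(b)) > max_distance:
        return False  # edit distance is at least the length difference
    return levenshtein(a, b) <= max_distance  # last resort
```

Even with these shortcuts, comparing every row against every other row is quadratic in the number of rows, which is consistent with this approach being too slow on a large table.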
As an experiment, I ported part of the code to C# to see how much faster it would be than the PHP version. It took about 3 minutes with the same dataset.
Then I added an extra field to the table and used the PECL double metaphone extension to generate keys for each row. The results were good, although some rows included numbers, which caused duplicates. I think I could then have run each of those through the functions above, but decided not to.
In the end, I chose the simplest approach, MySQL full-text search, which worked very well. Occasionally there are mistakes, but they are easy to detect and correct. It is also very fast, finishing in about 3-4 seconds.
Dancake