Regex Errors

I have a regular expression, generated from a list in the database, that matches the names of the building types in a game. The problem is typos: people writing instructions for their team in the game sometimes misspell a building name, and obviously the regular expression will not pick it up (for example, a misspelling of "University").

Are there any suggestions for making a regular expression tolerate matches that are off by 1 or 2 letters?

The regular expression is regenerated periodically and run on a local machine that can handle a much heavier workload, so in the worst case I could algorithmically generate versions of each word with a letter missing, and then others with letters added.

I use PHP, but I hope that any solution to this problem will not be specific to PHP.

+6
regex
6 answers

Let me introduce you to Levenshtein distance, a measure of the difference between two strings: the number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other.

It is also built into PHP as levenshtein().

So I would split the input on non-word characters and measure the distance between each word and your target list of buildings. If the distance is below a certain threshold, assume it was a misspelling.

I think you will have more luck with this method than with trying to write a regular expression for every special case.
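Since the question asks for something not tied to PHP, here is a minimal sketch of that idea in Python. The building list, the function names, and the threshold of 2 are assumptions made for illustration; they do not come from the question.

```python
import re

def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical building names; in practice these come from the database.
BUILDINGS = ["University", "Barracks", "Farm"]

def find_buildings(text, buildings=BUILDINGS, max_distance=2):
    """Return (word, building) pairs for words close to a known building name."""
    matches = []
    for word in re.split(r"\W+", text):  # split on non-word characters
        if not word:
            continue
        for name in buildings:
            if levenshtein(word.lower(), name.lower()) <= max_distance:
                matches.append((word, name))
                break
    return matches
```

A fixed threshold of 2 is generous for short names such as "Farm", so you would probably want to scale it with the length of the name. This sketch also only handles single-word building names.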

+13

A Google-style "did you mean" implementation, which looks at previous results, may also help:

How do you implement "did you mean"?

+3

What is Soundex()? - Teifion (28 minutes ago)

Soundex serves a similar purpose to the Levenshtein function mentioned by Triptych. It is a string comparison tool that encodes a word by how it sounds. See: http://us3.php.net/soundex

You can also look at metaphone and similar_text. I would have put this in a comment, but I do not have enough reputation yet. :D
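To show what Soundex does, here is a rough Python sketch of the standard American Soundex rules. It is not PHP's implementation, and the building list and guess_building() helper are made up for illustration.

```python
# Standard American Soundex digit groups.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}

def soundex(word):
    """Encode a word as its first letter plus three digits, e.g. Robert -> R163."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept
    return result.ljust(4, "0")

# Hypothetical building names; in practice these come from the database.
BUILDINGS = ["University", "Barracks", "Farm"]
BY_CODE = {soundex(name): name for name in BUILDINGS}

def guess_building(word):
    """Return the building whose Soundex code matches the word, or None."""
    return BY_CODE.get(soundex(word))
```

Keep in mind that Soundex always keeps the first letter, so a typo in the first letter will not match, and two different buildings can end up sharing a code.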

+3

Back in the day, we sometimes used Soundex() for these problems.

+2

You are in luck; the algorithms community has done a lot of work on approximate regular expression matching. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a good open-source version. You simply tell agrep how many errors you are willing to tolerate, and it matches from there. It can also match blocks of text other than lines. The link above leads to a newer GPL-licensed version of agrep, as well as a number of language libraries for approximate regular expression matching.

+2

This might be overkill, but Peter Norvig of Google has written about writing a spelling corrector in Python. It is definitely worth a read and may apply to your case.

At the end of the article, he also lists implementations of the algorithm in other languages.
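To give a flavour of the approach, here is a small sketch in the spirit of that article, not a copy of it: generate every string one edit away from the input and keep those that are known building names. The KNOWN set and the candidates() helper are assumptions for illustration, and the word-frequency model the article uses to rank candidates is left out.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical building names, lowercased; in practice these come from the database.
KNOWN = {"university", "barracks", "farm"}

def candidates(word, known=KNOWN):
    """Known names at edit distance 0, then 1, then 2 from word (may be empty)."""
    word = word.lower()
    if word in known:
        return {word}
    one_edit = edits1(word) & known
    if one_edit:
        return one_edit
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)} & known
```

This is essentially the "generate versions with missing and added letters" idea from the question, except that the variants are generated from the typed word and looked up in a set instead of being baked into the regular expression.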

+1
