The place where the word is spelled

I am creating a PHP web application where people can try to translate the words they need to learn at school.

For example, someone needs to translate the Dutch word “weer” to “weather” in English, but unfortunately he calls it “lee”. Since he almost typed the correct word, I want to give him another try with dots " . " In the places where he made a mistake:

 Language A: weer Language B: weather Input: whether Output: w..ther 

Or, for example,

 Language A: fout Language B: mistake Input: mitake Output: mi.take 

Or:

 Language A: echt Language B: genuine Input: genuinely Output: genuinely (almost good, shorten the word a little bit) 

But if the input is too much different from the desired translation, I do not want to receive output, for example ........

I heard about Levenshtein distance, and it seems to me that I need an algorithm that is very similar to this, but I do not know how to place the points in the right place, and not repeat how many operations should be performed.

So, how can I return a misspelled word with dots in places where someone made a mistake?

+4
source share
3 answers

First, take a look at the Levenshtein algorithm on wikipedia . Then go to the examples and the resulting matrix d on the article page:

  *k* *i* *t* *t* *e* *n* >0 1 2 3 4 5 6 *s* 1 >1 2 3 4 5 6 *i* 2 2 >1 2 3 4 5 *t* 3 3 2 >1 2 3 4 *t* 4 4 3 2 >1 2 3 *i* 5 5 4 3 2 >2 3 *n* 6 6 5 4 3 3 >2 *g* 7 7 6 5 4 4 >3 

The distance is in the lower right corner of the matrix, d [m, n]. But from there it is now possible to follow the minimum steps in the upper left corner of the matrix, d [1, 1]. You just go left, up-left or up at every step, depending on what minimizes the path. In the above example, you will find the path indicated by the signs '>':

  sittingkitten 0 1 1 1 1 2 2 3 0 1 1 1 1 2 2 3 ^ ^ ^ ^ ^ ^ changes in the distance, replace by dots 

Now you can find on the minimum path where the locations d [i, j] are located, the distance change (marked with the ^ symbol in the example above), and for those letters that you put in the first (or second) word, the point at the point is the position i (or j respectively).

Result:

  sittingkitten ^ ^ ^ ^ ^ ^ . itt . n . . itt . n . 
+4
source

The terminology you are looking for is called "edit distance". Using something like Levenshtein distance , it will tell you the number of operations needed to convert one line to another (insert, delete, replace, etc.).

Here is a list of other distance editing algorithms .

Once you decide that the word “close enough” (that is, it does not exceed a certain threshold of necessary changes), you can show where the changes should occur (by displaying dots).

So how do you know where to put the dots?

The interesting thing about the “Levenshtein distance” is that it uses an M x N matrix with one word on each axis (see the sample matrix in Levenshtein's article ). When you create the matrix, you can determine which letters require “extra rights” in order to be correct. Where you put the dots. If the letter requires "additional changes", you simply print the letter. Pretty cool.

+3
source

I think you need to make the multi-step process not only Levenshtein. First, I would check if the input word is a form of the target word. This will catch your third example, and also don't worry about adding points. You can also use this step to catch synonyms. The next step is to check the difference in the length of the two lines.

If the difference is 0, you can make a letter to compare letters to place points. If you do not want to show all points, you must save the number of points and display some error message once above the limit. (Sorry it was wrong)

If the difference indicates that the input is longer, you need to check that the message will be deleted to fix the problem. Here you can use Levenshtein to see how close they are, if they are too far, show your error message, if it is within range, you will need to take Levenshtein steps in reverse order and mark the changes in some way. Not sure how you want to show that the letter needs to be deleted.

If the difference shows that the input is shorter, you can use the Levenshtein distance to see if the two words are close enough or show an error. Then follow the steps back, adding points for insertion and points for substitutions.

In fact, the last two steps can be combined into one function that goes through the algorithm and remembers the insert delete or substitution and changes the result accordingly.

0
source

All Articles