Text analysis (lemmalization, editing distance)

I need to analyze the text for the presence of forbidden words in it. Suppose there is a word in the blacklist: Deny. The word has many forms. In the text, a word can be, for example: “prohibiting”, “forbidden”, “forbidden”. To bring the word back to its original form, I use the lemmatization process. Your suggestions?

What about typos?
For example: "F0rb1d". I am thinking of using damerau - Levenshtein or something else. Your suggestions?

But what if the text is written like this :
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceoftccmpany." (yes, no spaces)

How to solve this problem?
A quick algorithm is desirable, because the text is processed in real time.
And maybe some tips for improving performance (how to store, etc.)?

+7
c # nlp similarity lemmatization
source share
2 answers

There are two possible solutions, as far as I know the algorithms.

You can try using dynamic programming, LCS (the longest common subsequence). It will look for the source text for the desired word as a template, I believe that O (mn):

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html

Although it would be easier to use a text search algorithm. The best I know is KMP , and that is O (n). To compare characters, you can group them into such sets as {i i l (L) 1}, {o O 0}, etc. However, you can change this to not match all letters (prohibit → prohibit).

http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

So now you can compare the benefits of these two and your offers.

+2
source share
+1
source share

All Articles