Text analysis (lemmalization, editing distance)

Question

Text analysis (lemmalization, editing distance)

I need to analyze the text for the presence of forbidden words in it. Suppose there is a word in the blacklist: Deny. The word has many forms. In the text, a word can be, for example: “prohibiting”, “forbidden”, “forbidden”. To bring the word back to its original form, I use the lemmatization process. Your suggestions?

What about typos?
For example: "F0rb1d". I am thinking of using damerau - Levenshtein or something else. Your suggestions?

But what if the text is written like this :
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceoftccmpany." (yes, no spaces)

How to solve this problem?
A quick algorithm is desirable, because the text is processed in real time.
And maybe some tips for improving performance (how to store, etc.)?

+7

c # nlp similarity lemmatization

user348173 Apr 3 '11 at 12:45

source share

2 answers

You can also use RegEx matches to check words. http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx

+1

kojoma Apr 05 '11 at 9:54

source share

Aurimas neverauskas · Accepted Answer · 2011-04-05T09:37:12+0000

There are two possible solutions, as far as I know the algorithms.

You can try using dynamic programming, LCS (the longest common subsequence). It will look for the source text for the desired word as a template, I believe that O (mn):

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html

Although it would be easier to use a text search algorithm. The best I know is KMP , and that is O (n). To compare characters, you can group them into such sets as {i i l (L) 1}, {o O 0}, etc. However, you can change this to not match all letters (prohibit → prohibit).

http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

So now you can compare the benefits of these two and your offers.

Text analysis (lemmalization, editing distance)

More articles: