I need to check text for forbidden words. Suppose the blacklist contains the word "forbid". The word has many forms, so in the text it can appear as, for example, "forbidding" or "forbidden". To reduce each word to its base form, I use lemmatization. Is there a better approach?
What about typos and deliberate misspellings?
For example: "F0rb1d". I am thinking of using the Damerau-Levenshtein distance, or something similar. Any suggestions?
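To make the typo case concrete, this is roughly what I have in mind. It is only a minimal sketch: digit-for-letter substitutions are mapped back first, and the result is compared using the restricted Damerau-Levenshtein (optimal string alignment) distance. The class name `FuzzyMatch`, the methods `Normalize`, `OsaDistance` and `IsMatch`, the substitution table, and the default threshold of 1 are all my own placeholders, not taken from any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FuzzyMatch
{
    // Hypothetical substitution table (an assumption); '1' could equally mean 'l'.
    private static readonly Dictionary<char, char> Leet = new Dictionary<char, char>
    {
        { '0', 'o' }, { '1', 'i' }, { '3', 'e' }, { '4', 'a' }, { '5', 's' }, { '7', 't' }
    };

    // Lowercase the token and undo digit-for-letter substitutions.
    public static string Normalize(string token)
    {
        var sb = new StringBuilder(token.Length);
        foreach (char c in token.ToLowerInvariant())
            sb.Append(Leet.TryGetValue(c, out char mapped) ? mapped : c);
        return sb.ToString();
    }

    // Optimal string alignment distance: insertions, deletions, substitutions
    // and transpositions of two adjacent characters each cost 1.
    public static int OsaDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int best = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), // deletion, insertion
                    d[i - 1, j - 1] + cost);                    // substitution or match
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    best = Math.Min(best, d[i - 2, j - 2] + 1); // adjacent transposition
                d[i, j] = best;
            }
        }
        return d[a.Length, b.Length];
    }

    // True if the normalized token is within maxDistance edits of the
    // forbidden word, which is expected to be lowercase already.
    public static bool IsMatch(string token, string forbidden, int maxDistance = 1)
        => OsaDistance(Normalize(token), forbidden) <= maxDistance;
}
```

With this, `FuzzyMatch.IsMatch("F0rb1d", "forbid")` returns true, because normalization already yields "forbid". The sketch compares one token against one blacklist entry and builds a full matrix for each comparison, so I doubt it is fast enough as written.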
But what if the text is written like this:
"ForbiddenInformation.Privatecorrespondenceofthecompany." or "F0rb1dden1nformation.Privatecorresp0ndenceoftccmpany." (yes, with no spaces)
How can I handle this case?
A fast algorithm is preferable, because the text is processed in real time.
Any tips for improving performance (for example, how to store the blacklist) would also be welcome.
c# nlp similarity lemmatization
user348173