Creating lemmatizer: speed optimization

I am creating a lemmatizer in Python. Since it needs to run in real time and process a fairly large amount of data, processing speed is of the essence.

Data: I have all the possible suffixes, each linked to the wordtypes it can combine with. In addition, I have lemmaforms, each linked both to its wordtype(s) and to its lemma(s). The program takes a word as input and outputs its lemma:

word = lemmaform + suffix

For example (Note: although the example is in English, I am not building a lemmatizer for English):

word: forbidding

lemmaform: forbidd

suffix: ing

lemma: forbid

My solution:

I converted the data to (nested) dicts:

    suffixdict : {suffix1: [type1, type2, ..., type(n)], suffix2: [type1, type2, ..., type(n)]}
    lemmaformdict : {lemmaform: {type1: lemma}}
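
To make the layout concrete, the forbidding example could be encoded like this (a toy sketch; the wordtype label verb and the second suffix are my own placeholders, not from the original data):

    # Hypothetical toy data for the example above.
    # "verb" is a made-up wordtype label.
    suffixdict = {"ing": ["verb"], "ed": ["verb"]}
    lemmaformdict = {"forbidd": {"verb": "forbid"}}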

1) I find all candidate suffixes of the word. With a maximum suffix length of 3, I check whether "ing", "ng" and "g" are in suffixdict, and for every hit I collect the associated wordtypes (the dict values).

2) I then look up the remaining part of the word (the candidate lemmaform) in the other dict and collect its wordtypes as well.

3) Finally, I intersect the wordtype sets from steps 1) and 2); if the suffix and the lemmaform share a wordtype, the split is valid and the lemma can be read off (see the sketch after this list).
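
Put together, the three steps might look like the following sketch (the function name, the max_suffix_len parameter and the early-exit structure are mine; it assumes the toy dicts above):

    def lemmatize(word, suffixdict, lemmaformdict, max_suffix_len=3):
        # Step 1: try candidate suffixes, longest first.
        for n in range(min(max_suffix_len, len(word) - 1), 0, -1):
            suffix_types = suffixdict.get(word[-n:])
            if not suffix_types:
                continue
            # Step 2: look up the remaining part of the word as a lemmaform.
            type_to_lemma = lemmaformdict.get(word[:-n])
            if not type_to_lemma:
                continue
            # Step 3: a wordtype shared by suffix and lemmaform validates the split.
            for wordtype in suffix_types:
                if wordtype in type_to_lemma:
                    return type_to_lemma[wordtype]
        return None  # no valid split found

    print(lemmatize("forbidding", suffixdict, lemmaformdict))  # -> forbid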

My question: what would be a more elegant or faster solution? (I didn't find an answer in related topics.) Any comment appreciated.


Why not reverse the words and store the data in a trie? (Everything here hinges on suffixes, so matching from the end of the word is the natural direction.) Something like this:

[image: example trie]

You start at the root (node 0) and walk towards the leaves (nodes 10, 12 and 17 in the picture), consuming one character of the reversed word per step. Wherever a node marks the end of a known suffix (say, ing), you have a candidate split and can read the allowed wordtypes off that node.

This checks every candidate suffix in a single pass instead of one dict lookup per suffix length. :) Tries are also easy to implement, and ready-made trie implementations exist for Python.
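
A minimal sketch of that idea, assuming plain nested dicts as trie nodes (the node layout and the _types terminal key are my own choices):

    def build_suffix_trie(suffixdict):
        # Insert each suffix reversed, so matching can start at the end of a word.
        root = {}
        for suffix, wordtypes in suffixdict.items():
            node = root
            for char in reversed(suffix):
                node = node.setdefault(char, {})
            node["_types"] = wordtypes  # marks the end of a suffix, with payload
        return root

    def matching_suffixes(word, trie):
        # Walk the word right to left and report every suffix found on the way.
        node, suffix = trie, ""
        for char in reversed(word):
            if char not in node:
                break
            node = node[char]
            suffix = char + suffix
            if "_types" in node:
                yield suffix, node["_types"]

One right-to-left pass then yields every matching suffix together with its wordtypes, instead of one dict lookup per candidate suffix length.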


A trie is itself an automaton, so this is close to the trie suggestion. As there, the trick is to process the word from its last character backwards.

Reverse every string before inserting it (the lemma be, for instance, goes in as ('e', 'b')). Reversed entries such as ('e', 'z', 'i'), ('e', 'd', 'a') and ('e', 'v', 'o') all begin with 'e', so they can share a single 'e' transition in the NFA.

"" . , , lemmaformdict . , ( ).

, , , ( ).

Building the NFA is the easy part. You can then convert the NFA to a DFA, which follows exactly one transition per input character, so matching is fast and needs no backtracking. Be aware, though, that an automaton interpreted in pure Python carries noticeable overhead. (If speed really is critical, consider implementing it in C++.)
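
The NFA-to-DFA step is the textbook subset construction; here is a generic sketch (the state numbering, the transition-table encoding and the example NFA for the reversed strings ezi, eda and evo are all mine):

    def nfa_to_dfa(nfa, start_state):
        # Classic subset construction (no epsilon transitions).
        # `nfa` maps (state, char) -> set of next states.
        alphabet = {char for _, char in nfa}
        start = frozenset([start_state])
        dfa, seen, queue = {}, {start}, [start]
        while queue:
            current = queue.pop()
            for char in alphabet:
                target = frozenset(t for s in current
                                   for t in nfa.get((s, char), ()))
                if not target:
                    continue
                dfa[(current, char)] = target  # one deterministic transition
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return dfa

    # The three reversed strings share their initial 'e' (going to states 1, 4, 7).
    nfa = {(0, 'e'): {1, 4, 7},
           (1, 'z'): {2}, (2, 'i'): {3},
           (4, 'd'): {5}, (5, 'a'): {6},
           (7, 'v'): {8}, (8, 'o'): {9}}
    dfa = nfa_to_dfa(nfa, 0)

In the resulting DFA the nondeterministic choice after 'e' collapses into the single subset state {1, 4, 7}, so matching consumes each character with one table lookup.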
