I am building a lemmatizer in Python. Since it needs to run in real time and process a fairly large amount of data, processing speed is of the essence.
Data: I have all the possible suffixes, each linked to all the word types it can combine with. In addition, I have lemma forms, each linked to its word type(s) and lemma(s). The program takes a word as input and outputs its lemma, where word = lemmaform + suffix.
For example (Note: although the example is in English, I am not building a lemmatizer for English):
word: forbidding
lemmaform: forbidd
suffix: ing
lemma: forbid
My solution:
I converted the data to (nested) dicts:
suffixdict : {suffix1: [type1, type2, ..., type(n)], suffix2: [type1, type2, ..., type(n)]}
lemmaformdict : {lemmaform: {type1: lemma}}
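1) Find all possible suffixes of the word and the word types they are linked to. If the longest possible suffix is 3 characters long, the program tries to match "ing", "ng", "g" against the keys of suffixdict. If a key exists, it returns the value (a list of word types).
2) For each matching suffix, look up the remaining lemma form in lemmaformdict. If the lemma form exists, it returns its word types.
3) Finally, the program intersects the word types produced in steps 1) and 2), and if the intersection is non-empty, it returns the lemma of the word.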
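The three steps can be sketched roughly as follows. This is only an illustration of the lookup described above: the sample entries, the word-type names ("verb", "noun") and the function name `lemmatize` are made up, and the sketch assumes every word carries a non-empty suffix.

```python
from typing import Optional

# Hypothetical toy data in the layout described above.
suffixdict = {
    "ing": ["verb"],
    "s": ["verb", "noun"],
}
lemmaformdict = {
    "forbidd": {"verb": "forbid"},
    "cat": {"noun": "cat"},
}
MAX_SUFFIX_LEN = max(len(s) for s in suffixdict)


def lemmatize(word: str) -> Optional[str]:
    """Return the lemma of `word`, or None if no analysis is found."""
    # Step 1: try the candidate suffixes, longest first.
    for n in range(min(MAX_SUFFIX_LEN, len(word) - 1), 0, -1):
        suffix_types = suffixdict.get(word[-n:])
        if suffix_types is None:
            continue
        # Step 2: look up what remains of the word as a lemma form.
        lemma_by_type = lemmaformdict.get(word[:-n])
        if lemma_by_type is None:
            continue
        # Step 3: intersect the word types from both lookups.
        for word_type in suffix_types:
            if word_type in lemma_by_type:
                return lemma_by_type[word_type]
    return None


print(lemmatize("forbidding"))
```

Each word costs at most MAX_SUFFIX_LEN pairs of dict lookups, so the per-word work is bounded by the longest suffix rather than by the size of the dictionaries.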
My question: could there be a better solution to this problem in terms of speed? Any help is much appreciated.
. ? ( , ). :
, ( 0) (10, 12 17 ), . , , . (, ing), "".
, - . , - .:) Tries - ( - trie), .
trie automaton, , . , , , .
, ( be). (), ('e', 'z', 'i'), ('e', 'd', 'a') ('e', 'v', 'o'), , , 'e' NFA.
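To illustrate the trie idea, here is a minimal sketch that assumes the suffixes are stored reversed, so a word can be read from its last character backwards and every matching suffix is found in a single pass. The suffix list, the `END` marker and both function names are hypothetical; this is a plain deterministic trie, not the NFA construction itself.

```python
END = ""  # marker key; a single character can never equal the empty string


def build_suffix_trie(suffixes):
    """Build a nested-dict trie over the reversed suffixes."""
    root = {}
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[END] = suffix  # remember which suffix ends at this node
    return root


def matching_suffixes(trie, word):
    """Walk the word from its last character and collect every stored suffix."""
    found = []
    node = trie
    for ch in reversed(word):
        node = node.get(ch)
        if node is None:
            break
        if END in node:
            found.append(node[END])
    return found


trie = build_suffix_trie(["ing", "g", "ed", "s"])
print(matching_suffixes(trie, "forbidding"))
```

Compared with probing a dict once per candidate suffix length, the walk stops as soon as no stored suffix can match, and it avoids building a new substring for each probe.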
"" . , , lemmaformdict . , ( ).
, , , ( ).
, NFA, . NFA DFA, , , , . , , Python . ( , , ++.)