Arabic Lemmatization and Stanford NLP

I am trying to do lemmatization, i.e. identify the lemma and possibly the root of an Arabic verb. For example: يتصل ==> lemma (infinitive of the verb) ==> اتصل ==> root (triliteral root / Jidr thoulathi) ==> و ص ل

Do you think Stanford NLP can do this?

Regards,

+4
2 answers

The Stanford Arabic segmenter cannot do true lemmatization. However, you can train a new model to do something like stemming:

  • تكتبون ← ت + كتب + ون
  • يتصل ← ي + تصل

If you really need the true lemma ("تصل" is not a true lemma), you may be better off with a tool such as MADAMIRA (http://nlp.ldeo.columbia.edu/madamira/).

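In more detail: the Stanford Arabic segmenter builds its output by labeling the characters of the input, and the only rewrites it allows are the following (from edu.stanford.nlp.international.arabic.process.IOBUtils):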

  • lil- (لل) becomes li + al- (ل + ال)
  • ta (ت) or ha (ه) becomes ta marbuta (ة)
  • ya (ي) or alif (ا) becomes alif maqsura (ى)
  • alif maqsura (ى) becomes ya (ي)
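The rewrite list above can be pictured as a small lookup table. The following Python sketch is only an illustration and is not the code in IOBUtils: the names `REWRITES`, `rewrite_last`, and `split_lil` are hypothetical, and the direction of each rule (surface character to restored character) is an assumption. The real segmenter applies a rewrite only where its model predicts one, whereas this toy applies it whenever it is called.

```python
# Illustrative rewrite table: surface character -> restored character.
# Names and rule directions are assumptions made for this sketch.
REWRITES = {
    "ت": "ة",  # ta -> ta marbuta
    "ه": "ة",  # ha -> ta marbuta
    "ي": "ى",  # ya -> alif maqsura
    "ا": "ى",  # alif -> alif maqsura
    "ى": "ي",  # alif maqsura -> ya
}


def rewrite_last(segment):
    """Rewrite the final character of a segment if the table has a rule for it."""
    if not segment or segment[-1] not in REWRITES:
        return segment
    return segment[:-1] + REWRITES[segment[-1]]


def split_lil(token):
    """Split word-initial lil- into li + al-, leaving other tokens untouched."""
    if token.startswith("لل"):
        return ["ل", "ال" + token[2:]]
    return [token]


# The stem of مدرستها once the pronoun ها has been split off:
print(rewrite_last("مدرست"))
print(split_lil("للكتاب"))
# Every rule swaps one character for another; none of them inserts an alif.
print("ا" in REWRITES.values())
```

Because no rule in such a table adds a character, a segment like تصل cannot be turned into اتصل without an additional rule.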

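So lemmatizing يتصل to ي + اتصل would require an extra rewrite rule, for example one that inserts an alif after ya or ta. Some irregular forms could not be handled this way at all (for example, نساء ← امرأة).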

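Out of the box, the segmenter also only splits off clitics such as conjunctions, particles, and pronouns: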

وسيكتشفونه ← و + س + يكتشفون + ه

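However, if you have access to the LDC Arabic Treebank or a similar corpus annotated with morphological segmentation, you can train your own model that splits off all morphological affixes, which is closer to lemmatization: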

وسيكتشفونه ← و + س + ي + كتشف + ون + ه
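Splitting off every affix leaves a stem that stays the same across inflected forms of a verb. The toy Python sketch below illustrates that idea only. It is not the Stanford segmenter, and the affix lists `PREFIXES` and `SUFFIXES`, the `MIN_STEM` threshold, and both function names are hypothetical choices made for this example.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny affix lists (assumptions for this sketch).
PREFIXES = ("ت", "ي", "أ", "ن")   # imperfect-verb person prefixes
SUFFIXES = ("ون", "ين", "ان")     # a few subject suffixes
MIN_STEM = 3                      # never cut a stem below three letters


def strip_affixes(word):
    """Strip at most one listed prefix and one listed suffix from a verb form."""
    stem = word
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
            stem = stem[:-len(suffix)]
            break
    return stem


def group_by_stem(words):
    """Group surface forms under the stem left after affix stripping."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_affixes(word)].append(word)
    return dict(groups)


forms = ["تكتشفين", "أكتشف", "يكتشف"]
print(group_by_stem(forms))
```

A trained model learns where to cut from annotated data rather than from fixed lists like these, so it can handle many more cases, but the grouping effect is the same.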

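Note that "كتشف" is not a real Arabic word, but the segmenter should at least produce "كتشف" consistently for تكتشفين, أكتشف, يكتشف, and so on. If that is acceptable, you need to change the ATB preprocessing script so that it segments all morphological affixes. One way to do this is to replace the script named parse_integrated with a modified version such as this one: https://gist.github.com/futurulus/38307d98992e7fdeec0d

Then follow the segmenter training instructions in the README.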

+10

Right, Stanford NLP cannot do this.

Farasa does better than MADAMIRA at lemmatization. It reportedly reaches 97.23% accuracy, about 7% more than MADAMIRA.

Farasa Lemmatizer: https://arxiv.org/pdf/1710.06700.pdf

+1
