Is there an implementation of a Croatian stemming algorithm?

I am looking for an implementation of a stemming algorithm for Croatian words. Ideally in Java, but I would also accept any other language.

Is there a community of English-speaking developers who build search applications for the Croatian language?

Thanks,

+4
2 answers

Slavic languages are highly inflected. The most accurate and fastest approach would be a combination of rules and large mappings/dictionaries.

Work has been done on this, but it has been held back. The Croatian morphological lexicon would help, but it sits behind a slow API. More work can be found by looking across Bosnian, Serbian and Croatian together, not just Croatian.

Large mappings are not always convenient (and you can effectively build a better rule-based transformer from mappings/dictionaries/corpora).

An implementation based on Hunspell and its affix files can be a good way to get community and Java support. For example, search Google for: hr_hr.aff

Untested: you should be able to reverse all the words, build a trie of the ending characters, run some rules over it (for example, LCS) and build an accurate statistical transformer using a text corpus.
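
One possible reading of that idea, as a rough sketch only: walk each word form back to front, store its last few characters in a trie, and at every node count the "strip this suffix, add that suffix" rule that turns the form into its lemma. The word pairs below are a tiny hypothetical sample, and the names learn_rule, train and stem are made up for this illustration; a real system would train on a large lexicon or lemmatized corpus.

```python
from collections import Counter


def learn_rule(form, lemma):
    """Return (suffix_to_strip, suffix_to_add) that turns form into lemma."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


class Node:
    """One trie node; the path from the root spells a word ending backwards."""

    def __init__(self):
        self.children = {}
        self.rules = Counter()  # how often each rule was seen for this ending


def train(pairs, max_depth=4):
    """Build a trie over the last max_depth characters of every word form."""
    root = Node()
    for form, lemma in pairs:
        rule = learn_rule(form, lemma)
        node = root
        for ch in reversed(form[-max_depth:]):
            node = node.children.setdefault(ch, Node())
            node.rules[rule] += 1
    return root


def stem(root, word):
    """Apply the most frequent applicable rule of the longest matching ending."""
    node, best = root, ("", "")
    for ch in reversed(word):
        node = node.children.get(ch)
        if node is None:
            break
        for (strip, add), _count in node.rules.most_common():
            if word.endswith(strip) and len(strip) < len(word):
                best = (strip, add)
                break
    strip, add = best
    return word[:len(word) - len(strip)] + add


# Hypothetical toy training data: (inflected form, lemma).
PAIRS = [
    ("knjige", "knjiga"),
    ("knjigu", "knjiga"),
    ("žene", "žena"),
    ("ženu", "žena"),
    ("gradu", "grad"),
]

trie = train(PAIRS)
for w in ("škole", "sobu", "gradu"):
    print(w, "->", stem(trie, w))
```

With this sample, an unseen form such as "škole" is mapped to "škola" because the ending "-e" was only ever seen with the rule "strip e, add a". The ending "-u" is ambiguous ("knjigu" vs. "gradu"), which is why deeper trie nodes and rule counts from a real corpus matter.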

The best I can do, in Python:

import hunspell

hs = hunspell.HunSpell('/usr/share/myspell/hr_HR.dic',
                       '/usr/share/myspell/hr_HR.aff')
# The following should return ['hrvatska']:
print(hs.stem('hrvatski'))
+6

Here you can find a recent implementation made at FFZG, in Python: a stemmer for Croatian.

We ran a basic evaluation of the stemmer against a lemmatized newspaper corpus as the gold standard and obtained a precision of 0.986 and a recall of 0.961 (F1 0.973) for adjectives and nouns. Across all parts of speech, we obtained a precision of 0.98 and a recall of 0.92 (F1 0.947).
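
For readers who want to check the numbers: reading the two figures in each pair as precision and recall, F1 is their harmonic mean. The small helper below (f1_score is just an illustrative name) recomputes it from the rounded values quoted above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Adjectives and nouns only.
print(round(f1_score(0.986, 0.961), 3))
# All parts of speech.
print(round(f1_score(0.98, 0.92), 3))
```

The first call reproduces the quoted 0.973. The second gives 0.949 from the two-decimal inputs, slightly above the quoted 0.947; a gap of that size is within what rounding of the precision and recall values allows.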

It is released under the GNU license, and do not hesitate to contact the authors if you need further help (I only know the original author, Nikola, not his student).

0
