Is there an existing library or API that I can use to separate words in character-based languages?

I am working on a small hobby project in Python that involves building dictionaries for different languages from large texts written in those languages. For most languages this is relatively simple, because I can use the space between words to tokenize a paragraph into words for the dictionary, but Chinese, for example, does not put spaces between words. How can I break a paragraph of Chinese text into words?
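
To make the problem concrete, the whitespace approach I'm using looks roughly like the sketch below (simplified for illustration; the `build_word_counts` helper and the sample sentences are made up for this question, not my actual code). The same call on Chinese text just returns the whole sentence as one "word".

```python
from collections import Counter

def build_word_counts(paragraph):
    # Naive approach: split on whitespace, strip surrounding punctuation, count.
    words = (w.strip('.,!?;:"()') for w in paragraph.split())
    return Counter(w for w in words if w)

# Works reasonably well for space-delimited languages such as English:
print(build_word_counts("the cat sat on the mat. the mat was red."))
# Counter({'the': 3, 'mat': 2, 'cat': 1, 'sat': 1, 'on': 1, 'was': 1, 'red': 1})

# Fails for Chinese: there are no spaces, so the whole sentence
# comes back as a single "word".
print(build_word_counts("我喜欢学习自然语言处理"))
# Counter({'我喜欢学习自然语言处理': 1})
```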

My searches so far suggest that this is a fairly complex problem, so I wonder whether there are off-the-shelf solutions for it, in Python or elsewhere, via an API or in any other language. It must be a common problem, because any search engine built for Asian languages has to solve it in order to return relevant results.

I tried searching Google, but I'm not even sure what this type of tokenization is called, so my searches haven't turned up anything useful. Even just a push in the right direction would help.

1 answer

Language tokenization is a key aspect of natural language processing (NLP). It is a major area of work at large corporations and universities, and the subject of numerous doctoral dissertations.

I have just edited your question to add the 'nlp' tag. I suggest you take a look at the "about" page for that tag; you will find links to resources such as the Natural Language Toolkit (NLTK), which includes a Python-based tokenizer.
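
As a quick illustration (a sketch, not the toolkit's documentation), calling NLTK's stock tokenizer looks roughly like this; it assumes NLTK is installed via pip and that the Punkt tokenizer models have been downloaded (the exact resource name can vary between NLTK versions):

```python
import nltk
from nltk.tokenize import word_tokenize

# One-time download of the Punkt sentence/word tokenizer models.
nltk.download('punkt')

print(word_tokenize("This is a simple sentence, isn't it?"))
# ['This', 'is', 'a', 'simple', 'sentence', ',', 'is', "n't", 'it', '?']
```

Note that this tokenizer targets space-delimited languages; for Chinese, the task is usually called word segmentation, and it needs a dedicated segmenter rather than a whitespace-based tokenizer.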

You can also search Google for terms such as language tokenization and NLP.



