I am working on a small hobby project in Python that involves building dictionaries for different languages from large texts written in those languages. For most languages this is relatively simple, because I can split a paragraph into words on the whitespace between them, but Chinese, for example, does not put spaces between words. How can I break a paragraph of Chinese text into words?
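Here is a minimal sketch of what I currently do for space-separated languages (the sample sentences are just placeholders), and why it breaks down for Chinese:

```python
import re

def tokenize(paragraph):
    # Pull out runs of word characters; works fine when words are
    # separated by spaces and punctuation.
    return re.findall(r"\w+", paragraph)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# The same approach fails for Chinese, where there is no whitespace
# between words -- the whole run comes back as a single token:
print(tokenize("我明天去北京"))
# ['我明天去北京'] instead of something like ['我', '明天', '去', '北京']
```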
My searches so far suggest that this is a complex problem, so I wonder whether there are off-the-shelf solutions for it, in Python or in another language, or via an API. It seems like it should be a common problem, because any search engine built for Asian languages has to solve it in order to return relevant results.
I tried searching Google, but I'm not even sure what this kind of tokenization is called, so my searches aren't turning up anything useful. Maybe just a nudge in the right direction would help.