Is there an existing library or API that I can use to separate words in character-based languages?

I am working on a small hobby project in Python that involves building dictionaries for different languages from large texts written in those languages. For most languages this is relatively simple, because I can use the space between words to tokenize a paragraph into words for the dictionary, but Chinese, for example, does not put spaces between words. How can I break a paragraph of Chinese text into words?
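
To make the problem concrete, the whitespace approach I'm using looks roughly like the sketch below (simplified for illustration; the `build_word_counts` helper and the sample sentences are made up for this question, not my actual code). The same call on Chinese text just returns the whole sentence as one "word".

```python
from collections import Counter

def build_word_counts(paragraph):
    # Naive approach: split on whitespace, strip surrounding punctuation, count.
    words = (w.strip('.,!?;:"()') for w in paragraph.split())
    return Counter(w for w in words if w)

# Works reasonably well for space-delimited languages such as English:
print(build_word_counts("the cat sat on the mat. the mat was red."))
# Counter({'the': 3, 'mat': 2, 'cat': 1, 'sat': 1, 'on': 1, 'was': 1, 'red': 1})

# Fails for Chinese: there are no spaces, so the whole sentence
# comes back as a single "word".
print(build_word_counts("我喜欢学习自然语言处理"))
# Counter({'我喜欢学习自然语言处理': 1})
```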

My searches so far suggest that this is a fairly complex problem, so I wonder whether there are off-the-shelf solutions for it, in Python or elsewhere, via an API or in any other language. It must be a common problem, because any search engine built for Asian languages has to solve it in order to return relevant results.

I tried searching Google, but I'm not even sure what this type of tokenization is called, so my searches haven't turned up anything useful. Even just a push in the right direction would help.

1 answer

Language tokenization is a key aspect of natural language processing (NLP). It is a major area of work at large corporations and universities, and the subject of numerous doctoral dissertations.

I have just edited your question to add the 'nlp' tag. I suggest you take a look at the "about" page for that tag; you will find links to resources such as the Natural Language Toolkit (NLTK), which includes a Python-based tokenizer.
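
As a quick illustration (a sketch, not the toolkit's documentation), calling NLTK's stock tokenizer looks roughly like this; it assumes NLTK is installed via pip and that the Punkt tokenizer models have been downloaded (the exact resource name can vary between NLTK versions):

```python
import nltk
from nltk.tokenize import word_tokenize

# One-time download of the Punkt sentence/word tokenizer models.
nltk.download('punkt')

print(word_tokenize("This is a simple sentence, isn't it?"))
# ['This', 'is', 'a', 'simple', 'sentence', ',', 'is', "n't", 'it', '?']
```

Note that this tokenizer targets space-delimited languages; for Chinese, the task is usually called word segmentation, and it needs a dedicated segmenter rather than a whitespace-based tokenizer.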

You can also search Google for terms such as language tokenization and NLP.



