Extract words using nltk from German text

I am trying to extract words from a German document. When I use the following method described in the NLTK book, I do not get the words that contain the language's special characters correctly.

 ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
 words = nltk.Text(ptcr.words(DocumentName))

What should I do to get a list of words in a document?

An example with nltk.tokenize.WordPunctTokenizer() for the German phrase Veränderungen über einen Walzer looks like this:

 In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")
 Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']

In this example, ä is split off as if it were a delimiter, while ü is not.
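The byte escapes in the output look like the raw UTF-8 encoding of the umlauts; for comparison (assuming a UTF-8 terminal, so this may differ on other setups):

 >>> "Veränderungen über einen Walzer"                  # plain str: raw UTF-8 bytes
 'Ver\xc3\xa4nderungen \xc3\xbcber einen Walzer'
 >>> "Veränderungen über einen Walzer".decode("utf-8")  # decoded unicode string
 u'Ver\xe4nderungen \xfcber einen Walzer'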

3 answers

Call PlaintextCorpusReader with the encoding='utf-8' parameter:

 ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8') 
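For example (Corpus and DocumentName are placeholders as in the question, and this assumes the file really is saved as UTF-8), the words then come back as unicode objects with the umlauts intact:

 ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
 print ptcr.words(DocumentName)[:5]  # unicode tokens, e.g. u'Ver\xe4nderungen'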

Edit: I see ... you have two separate problems:

a) Tokenization problem: when you test with a literal German string, you think you are passing Unicode input. In fact you are telling Python to take the bytes between the quotes and convert them to a unicode string, and your bytes are being misinterpreted. Fix: add the following line to the very top of your source file.

 # -*- coding: utf-8 -*- 

Suddenly your string constants will be read and tokenized correctly:

 german = u"Veränderungen über einen Walzer"
 print nltk.tokenize.WordPunctTokenizer().tokenize(german)
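With the coding declaration in place, the output should look roughly like this (a sketch of the expected result; exact escapes may vary with your NLTK version):

 [u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']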

Second problem: It turns out that Text() does not handle unicode! If you pass it a unicode string, it will try to convert it to a pure-ascii string, which of course fails on non-ascii input. Ugh.

Solution: my recommendation is to avoid nltk.Text entirely and work with the corpus readers directly. (This is generally a good idea: see nltk.Text's own documentation.)

But if you must use nltk.Text with German data, here's how: read your data properly so it gets tokenized right, but then "encode" your unicode back into a list of str. For German it's probably safest to use Latin-1 encoding, but utf-8 seems to work too.

 ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
 # Convert unicode tokens to utf-8 encoded str
 coded = [tok.encode('utf-8') for tok in ptcr.words(DocumentName)]
 words = nltk.Text(coded)
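The usual nltk.Text methods should then work on the byte-encoded tokens, for example (the search term here is just an illustration; for a word with an umlaut you would pass its UTF-8 encoded str):

 words.concordance('Walzer')
 words.collocations()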

Take a look at http://text-processing.com/demo/tokenize/ . I'm not sure your text is getting the correct encoding, because the WordPunctTokenizer in the demo handles the words fine, and so does the PunktWordTokenizer.


You can try a simple regex. It is enough if you only want the words; it will swallow all punctuation marks:

 >>> import re
 >>> re.findall("\w+", "Veränderungen über einen Walzer.".decode("utf-8"), re.U)
 [u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']

Note that re.U changes the meaning of \w in the regex based on the current locale, so make sure it is set correctly. I have mine set to en_US.UTF-8, which is apparently good enough for your example.
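To see what re.U changes (a quick comparison, assuming a Python 2 interpreter as in the example above), the same call without the flag breaks the words at every non-ASCII character:

 >>> re.findall("\w+", "Veränderungen über einen Walzer.".decode("utf-8"))
 [u'Ver', u'nderungen', u'ber', u'einen', u'Walzer']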

Also note that "Veränderungen über einen Walzer".decode("utf-8") and u"Veränderungen über einen Walzer" are different strings.

