Call PlaintextCorpusReader with the encoding='utf-8' parameter:
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
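For context, a minimal sketch of putting that to use (Python 2, like the rest of this answer; Corpus and DocumentName are hypothetical placeholders for your corpus directory and file name), showing that the reader now hands back unicode tokens:

import nltk

Corpus = '/path/to/your/german/corpus'   # hypothetical directory of plain-text files
DocumentName = 'walzer.txt'              # hypothetical file in that directory

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')

# With encoding='utf-8' the reader decodes the raw bytes for you,
# so words() yields unicode objects rather than mangled byte strings.
for tok in ptcr.words(DocumentName)[:10]:
    print repr(tok)    # e.g. u'Ver\xe4nderungen'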
Edit: I see ... you have two separate problems:
First problem, tokenization: when you test with a literal German string, you think you are entering Unicode input. Actually you are telling Python to take the bytes between the quotes and convert them to a unicode string, but your bytes are being misinterpreted. Fix: add the following line at the very top of your source file:

# -*- coding: utf-8 -*-
Suddenly your string constants will be seen, and tokenized, correctly:
german = u"Veränderungen über einen Walzer"
print nltk.tokenize.WordPunctTokenizer().tokenize(german)
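If you prefer not to rely on the u"..." literal, an equivalent spelling (my addition, with the coding declaration still at the top of the file) is to decode the byte string explicitly:

# -*- coding: utf-8 -*-
import nltk

# Equivalent to the u"..." literal: decode the utf-8 bytes explicitly.
german = "Veränderungen über einen Walzer".decode('utf-8')
print nltk.tokenize.WordPunctTokenizer().tokenize(german)
# should give something like [u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']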
Second problem: it turns out that Text() does not handle unicode! If you pass it a unicode string, it will try to convert it to a pure-ASCII string, which of course fails on non-ASCII input. Ugh.
Solution: my recommendation would be to avoid nltk.Text entirely and to work with the corpus readers directly. (This is generally a good idea: see nltk.Text's own documentation.)
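To illustrate that recommendation, here is a sketch (mine, not part of the original answer; same hypothetical Corpus and DocumentName as above) of doing a frequency count straight from the reader's unicode tokens, with no nltk.Text involved:

import nltk

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')

# Work on the unicode tokens directly instead of wrapping them in nltk.Text.
fdist = nltk.FreqDist(ptcr.words(DocumentName))
for word, count in sorted(fdist.items(), key=lambda wc: wc[1], reverse=True)[:10]:
    print word.encode('utf-8'), count    # encode only for printing to the terminal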
But if you must use nltk.Text with German data, here's how: read in your data so that it gets tokenized correctly, but then "encode" your unicode tokens back into a list of str. For German, it is probably safer to use Latin-1 encoding, but utf-8 seems to work too.
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')

# Convert unicode to utf8-encoded str
coded = [ tok.encode('utf-8') for tok in ptcr.words(DocumentName) ]
words = nltk.Text(coded)
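One consequence worth noting (my addition): the Text now holds utf-8-encoded strs, so anything you pass to its methods, e.g. count() or concordance(), must be encoded the same way:

# -*- coding: utf-8 -*-
# Queries must match the encoding of the tokens stored in the Text.
query = u"Veränderungen".encode('utf-8')
print words.count(query)
words.concordance(query)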