- Stick to Unicode and UTF-8 everywhere.
- Stay away from the Japan-specific encodings (euc-jp, shiftjis, iso-2022-jp), but keep in mind that you are likely to run into them at some point if you keep working with Japanese text.
- Learn to use a segmenter for the complex tasks such as POS analysis, word segmentation, etc. The standard tools used by most people who do NLP (natural language processing) on Japanese are, in order of popularity / power:
MeCab (originally on SourceForge) is awesome: it lets you take text like
「日本語は、とても難しいです。」
and get all kinds of useful information back:
    kettle:~$ echo 日本語は、難しいです | mecab
    日本語	名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
    は	助詞,係助詞,*,*,*,*,は,ハ,ワ
    、	記号,読点,*,*,*,*,、,、,、
    難しい	形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
    です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
    EOS
which is basically a detailed run-down of the parts of speech, readings, pronunciations, etc. It will also analyze verb tenses for you:
    kettle:~$ echo メキシコ料理が食べたい | mecab
    メキシコ	名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
    料理	名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
    が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
    食べ	動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
    たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
    EOS
However, the documentation is in Japanese, and it is a little tricky to set up and to figure out how to format the output the way you want. Packages are available for Ubuntu / Debian, and there are bindings for a bunch of languages, including Perl, Python, Ruby ...
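Since the output format is the part that trips people up, here is a minimal Python sketch of turning the default output shown above into something easier to work with. It assumes the IPADIC field layout seen in those examples (part of speech first, base form seventh, reading eighth); other dictionaries may order the fields differently. The names `Token`, `parse_mecab_output` and `run_mecab` are made up for this illustration and are not part of MeCab or its bindings.

```python
import subprocess
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str    # the token as it appears in the text
    pos: str        # top-level part of speech, e.g. 名詞 (noun)
    base_form: str  # dictionary form, e.g. 食べる for 食べ
    reading: str    # katakana reading, "" if the dictionary gives none


def _field(features: List[str], index: int) -> str:
    """Return one feature field, treating missing or '*' entries as empty."""
    if index < len(features) and features[index] != "*":
        return features[index]
    return ""


def parse_mecab_output(text: str) -> List[Token]:
    """Parse MeCab's default output: surface, a TAB, then comma-separated features."""
    tokens = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, feature_str = line.partition("\t")
        features = feature_str.split(",")
        # Assumed IPADIC layout: 0 = POS, 6 = base form, 7 = reading.
        # Unknown words can have fewer fields, hence the defensive lookup.
        base_form = _field(features, 6) or surface
        tokens.append(Token(surface, _field(features, 0), base_form, _field(features, 7)))
    return tokens


def run_mecab(sentence: str) -> List[Token]:
    """Pipe one sentence through the `mecab` command (must be installed and on PATH)."""
    result = subprocess.run(
        ["mecab"],
        input=sentence + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return parse_mecab_output(result.stdout)


# The second example from above, as MeCab prints it.
SAMPLE = "\n".join([
    "メキシコ\t名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ",
    "料理\t名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ",
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ",
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ",
    "たい\t助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ",
    "EOS",
])

if __name__ == "__main__":
    for token in parse_mecab_output(SAMPLE):
        print(token.surface, token.pos, token.base_form, token.reading)
```

With this, pulling the dictionary form out of a conjugated verb is just `token.base_form`, which is usually what you want for lookups or counting.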
Apt repositories for Ubuntu:
    deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
    deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
Installing the packages:
    $ apt-get install mecab-ipadic-utf8 mecab python-mecab
should do the trick, I think.
Alternatives to mecab are ChaSen, which was written many years ago by the author of MeCab (who, by the way, works at Google now), and Kakasi, which is much less powerful.
I would definitely try to avoid writing your own conjugation routines. The problem is that this requires tons and tons of work that others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.
MeCab is statistically driven and trained on a lot of data. It uses a sophisticated machine learning technique called conditional random fields (CRFs), and the results are really quite good.
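To give a rough feel for what "statistically driven" means in practice, here is a toy Python sketch of the general idea: consider every way of cutting the text into dictionary words and pick the cheapest path with dynamic programming (Viterbi). This is not MeCab's implementation. The dictionary and the costs are invented for the example, and it leaves out the costs for connecting adjacent words that a real model learns from training data.

```python
from typing import Dict, List

# Hypothetical word costs (lower = more plausible). A real system learns
# these from annotated text rather than having them written by hand.
TOY_COSTS: Dict[str, int] = {
    "日本": 5, "日本語": 4, "語": 6,
    "は": 1,
    "難": 8, "難しい": 3, "しい": 9,
    "で": 3, "す": 7, "です": 1,
}


def segment(text: str, costs: Dict[str, int] = TOY_COSTS, unknown_cost: int = 20) -> List[str]:
    """Return the minimum-cost segmentation of `text` into dictionary words."""
    n = len(text)
    max_len = max(len(word) for word in costs)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost of segmenting text[:i]
    back = [0] * (n + 1)             # back[i]: where the last word of that path starts
    best[0] = 0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in costs:
                cost = costs[word]
            elif end - start == 1:
                cost = unknown_cost  # fallback: treat a single character as an unknown word
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


if __name__ == "__main__":
    print(segment("日本語は難しいです"))  # ['日本語', 'は', '難しい', 'です']
```

Note that no rule says "prefer 日本語 over 日本 + 語"; the split falls out of the costs. That is why a trained model copes with edge cases that hand-written rules keep missing.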
Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the mecab docs or anything else, feel free to ask about that too. Kanji can be quite intimidating at the beginning.