Japanese / Character Programming Tips

Question


I have an idea for a few web apps to write to help me, and maybe others, learn Japanese better, since I am studying the language.

My problem is that the site will be mostly in English, so it needs to mix in Japanese characters, usually hiragana and katakana, but later kanji. I am getting closer on this; I have found that the pages and source files need to be Unicode and UTF-8.

However, my problem is with the actual coding. I need to manipulate strings of text that are kana. For example:

けす: I need to take this verb and convert it to its te-form, けして. I would prefer to do this in JavaScript, as it will help me do more manipulation down the road, but if I need to I can just make database calls and hold everything in a database.

My question is not only how to do this in JavaScript, but also what some tips and strategies are for doing this sort of thing in other languages too. I am hoping to get more into language-learning applications, but I am lost when it comes to this.

+7

javascript language-agnostic unicode nlp cjk

percent20 May 02, '09 at 18:00


7 answers

Stick to Unicode and UTF-8 everywhere.
Stay away from the native Japanese encodings (EUC-JP, Shift_JIS, ISO-2022-JP), but be aware that you will probably run into them at some point if you continue.
Get familiar with a segmenter for complicated tasks such as POS analysis, word segmentation, etc. The standard tools used by most people who do NLP (natural language processing) on Japanese are listed below, in order of popularity / power.

MeCab (originally on SourceForge) is awesome: it lets you take text like

  「日本語は、とても難しいです。」

and get all kinds of useful information back

 kettle:~$ echo 日本語は、難しいです | mecab
 日本語    名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
 は        助詞,係助詞,*,*,*,*,は,ハ,ワ
 、        記号,読点,*,*,*,*,、,、,、
 難しい    形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
 です      助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
 EOS

which is basically a detailed run-down of the parts of speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses:

 kettle:~$ echo メキシコ料理が食べたい | mecab
 メキシコ  名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
 料理      名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
 が        助詞,格助詞,一般,*,*,*,が,ガ,ガ
 食べ      動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
 たい      助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
 EOS

However, the documentation is in Japanese, and it is a bit complicated to set up and to figure out how to format the output the way you want it. Packages for Ubuntu / Debian and bindings for a bunch of languages are available, including Perl, Python, Ruby ...

Apt repos for Ubuntu:

 deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
 deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all

Installing these packages:

 $ apt-get install mecab-ipadic-utf8 mecab python-mecab

should do the trick, I think.
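Once mecab is running, its default output (as in the two examples above) is one token per line: the surface form, a tab, then a comma-separated feature list, with EOS closing each sentence. Here is a minimal JavaScript sketch for turning that text into objects. It assumes the IPADIC field order seen in the examples; `parseMecabOutput` and the English property names are made up for illustration, and it makes no attempt to handle features that themselves contain commas.

```javascript
// Parse MeCab's default output into an array of token objects.
// Field positions follow the IPADIC layout shown in the examples above;
// the property names are illustrative labels, not part of MeCab.
function parseMecabOutput(output) {
  const tokens = [];
  for (const line of output.split("\n")) {
    // Skip the end-of-sentence marker and blank lines.
    if (line === "EOS" || line.trim() === "") continue;
    const tabIndex = line.indexOf("\t");
    if (tabIndex === -1) continue; // not a "surface<TAB>features" line
    const features = line.slice(tabIndex + 1).split(",");
    tokens.push({
      surface: line.slice(0, tabIndex),
      partOfSpeech: features[0],
      conjugationType: features[4],
      conjugationForm: features[5],
      baseForm: features[6],
      reading: features[7],
    });
  }
  return tokens;
}

const sample =
  "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n" +
  "たい\t助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ\n" +
  "EOS\n";
const parsedTokens = parseMecabOutput(sample);
console.log(parsedTokens.map((t) => t.baseForm).join(" "));
```

The base form and conjugation fields are the ones most useful for a verb-drilling app, since they tell you which dictionary entry an inflected form came from.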

Other alternatives to mecab are ChaSen, which was written years ago by the author of MeCab (who, by the way, works at Google now), and Kakasi, which is much less powerful.

I would definitely try to avoid rolling your own conjugation routines. The problem is that it will require tons and tons of work that others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.

MeCab is statistically driven and trained on loads of data. It uses a sophisticated machine learning technique called conditional random fields (CRFs), and the results are really quite good.

Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for mecab or anything else, feel free to ask about that as well. Kanji can be quite intimidating at the beginning.

+26

si28719e May 03, '09 at 17:05


What you need to do is look at the rules of grammar. Have an array of rules for each conjugation. Take the ~て form, for example. Pseudocode:

 def te_form(verb)
   stem = verb[0...-1]            # verb minus its last character
   case verb[-1]
   when "る" then stem + "て"     # (verb - る) + て
   when "す" then stem + "して"   # (verb - す) + して
   # ... one branch per remaining ending
   end
 end

etc. Basically, break it down into Type I, II and III verbs.

+2

soycamo May 04 '09 at 4:55


Your question is completely unclear to me.

However, I have had some experience working with Japanese text, so I'll give my 2 cents.

Since Japanese text has no word separators (for example, space characters), the most important tool we had to acquire was a dictionary-based word recognizer.

Once the text is split into words, it is easier to manipulate it with "normal" tools.

There were only 2 tools that did the above, and as a by-product they also worked as taggers (i.e. noun, verb, etc.).
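To make the idea of a dictionary-based word recognizer concrete, here is a toy JavaScript sketch that uses greedy longest-match lookup. The function name `segment` and the four-word dictionary are hypothetical, and real segmenters rely on large dictionaries plus statistical models rather than greedy matching, so treat this purely as an illustration of the principle.

```javascript
// Greedy longest-match segmentation against a Set of known words.
// Characters not covered by any dictionary entry become one-character tokens.
function segment(text, dictionary) {
  // Length of the longest dictionary entry bounds how far ahead we look.
  let maxLength = 1;
  for (const word of dictionary) {
    maxLength = Math.max(maxLength, word.length);
  }

  const words = [];
  let position = 0;
  while (position < text.length) {
    let match = text[position]; // fallback: a single character
    const longest = Math.min(maxLength, text.length - position);
    // Try the longest candidate first, then shrink.
    for (let length = longest; length > 1; length--) {
      const candidate = text.slice(position, position + length);
      if (dictionary.has(candidate)) {
        match = candidate;
        break;
      }
    }
    words.push(match);
    position += match.length;
  }
  return words;
}

// Tiny illustrative dictionary, not a real lexicon.
const toyDictionary = new Set(["日本語", "は", "難しい", "です"]);
console.log(segment("日本語は難しいです", toyDictionary).join(" | "));
// prints "日本語 | は | 難しい | です"
```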

Edit: always use Unicode when working with languages.

+1

Berry tsakala May 02, '09 at 20:00


If I remember correctly (and I slacked off a lot the year I took Japanese, so I could be wrong), the replacements you want to make are determined by the last character or two of the word. Taking your first example, any verb ending in "す" will always end in "して" when conjugated this way. Similarly for む → んで. You could set up a mapping of last character(s) → conjugated form. You may need to account for exceptions, such as anything that conjugates to xx って.

As for portability between languages, you will have to implement the logic differently depending on how each language works. This solution would be fairly straightforward to implement for Spanish as well, since conjugations depend on whether the verb ends in -ar, -er or -ir (with some verbs requiring exceptions in your logic). Unfortunately, that is the limit of my multilingual skills, so I don't know how well it would do outside of those two.
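A rough JavaScript sketch of such a mapping, for kana input like the けす example in the question. It assumes the caller already knows the verb class, because a trailing る alone cannot tell the two regular classes apart. The names `teForm`, `GODAN_TE_ENDINGS` and `IRREGULAR_TE_FORMS` are invented for this example, and the irregular list is deliberately short (compound する verbs, for instance, are not handled).

```javascript
// Last-character replacements for godan (type I) verbs.
const GODAN_TE_ENDINGS = new Map([
  ["う", "って"], ["つ", "って"], ["る", "って"],
  ["む", "んで"], ["ぶ", "んで"], ["ぬ", "んで"],
  ["く", "いて"], ["ぐ", "いで"],
  ["す", "して"],
]);

// Whole-word exceptions, checked before any ending rule.
const IRREGULAR_TE_FORMS = new Map([
  ["する", "して"],
  ["くる", "きて"],
  ["いく", "いって"],
]);

// verbClass is "godan" (type I) or "ichidan" (type II).
function teForm(verb, verbClass) {
  if (IRREGULAR_TE_FORMS.has(verb)) return IRREGULAR_TE_FORMS.get(verb);

  const stem = verb.slice(0, -1);
  const ending = verb.slice(-1);

  if (verbClass === "ichidan") {
    if (ending !== "る") throw new Error("ichidan verbs must end in る: " + verb);
    return stem + "て";
  }

  const replacement = GODAN_TE_ENDINGS.get(ending);
  if (replacement === undefined) {
    throw new Error("no te-form rule for ending: " + ending);
  }
  return stem + replacement;
}

console.log(teForm("けす", "godan"));     // けして
console.log(teForm("たべる", "ichidan")); // たべて
```

Keeping the rules in data rather than in branching code makes it easier to add another form later by adding another table.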

0

Jimmy May 02, '09 at 18:39


Since most Japanese verbs follow one of a small set of predictable patterns, the easiest and most extensible way to generate all forms of a given verb is to have the verb know which conjugation it follows, and then write functions that generate each form depending on the conjugation.

Pseudocode:

 generateDictionaryForm(verb)
   case Ru-Verb: verb.stem + る
   case Su-Verb: verb.stem + す
   case Ku-Verb: verb.stem + く
   ...etc.

 generatePoliteForm(verb)
   case Ru-Verb: verb.stem + ります
   case Su-Verb: verb.stem + します
   case Ku-Verb: verb.stem + きます
   ...etc.

Irregular verbs would, of course, be special-cased.

Some variant of this would work for any other fairly regular language (i.e. not English).
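The pseudocode above translates fairly directly into JavaScript if the per-class suffixes are kept in a table. This is only a sketch: the class keys ("ru", "su", "ku") mirror the pseudocode's Ru-Verb, Su-Verb and Ku-Verb, while `CONJUGATION_TABLE`, `generateForm` and the `{ stem, verbClass }` object shape are assumptions made for the example.

```javascript
// Suffix added to the stem, per conjugation class and per form.
// Only the three classes from the pseudocode are filled in.
const CONJUGATION_TABLE = {
  ru: { dictionary: "る", polite: "ります" },
  su: { dictionary: "す", polite: "します" },
  ku: { dictionary: "く", polite: "きます" },
};

// A verb is represented as { stem, verbClass }, so it "knows" its conjugation.
function generateForm(verb, form) {
  const suffixes = CONJUGATION_TABLE[verb.verbClass];
  if (suffixes === undefined) {
    throw new Error("unknown conjugation class: " + verb.verbClass);
  }
  const suffix = suffixes[form];
  if (suffix === undefined) {
    throw new Error("unknown form: " + form);
  }
  return verb.stem + suffix;
}

const kesu = { stem: "け", verbClass: "su" };
console.log(generateForm(kesu, "dictionary")); // けす
console.log(generateForm(kesu, "polite"));     // けします
```

Adding a new form then means adding one key to each class entry, and irregular verbs can be looked up in a separate table before this function is called.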

0

Amanda s May 07, '09 at 5:14


Try installing my gem (rom2jap). It is written in Ruby.

 gem install rom2jap

Open a terminal and enter:

 require 'rom2jap'

-2

user5849542 Jan 28 '16 at 0:17


Michael borgwardt · Accepted Answer · 2009-05-03T17:58:45+0000

My question is not only how to do this in javascript, but what tips and strategies for doing these kinds of things in other languages too.

What you want to do is fairly basic string manipulation, apart from the missing word separators, as Berry notes, though that is not a technical problem.

Basically, for a modern programming language with Unicode support (which JavaScript has had since version 1.3, I believe) there is no real difference between a Japanese kana or kanji and a Latin letter - they are all just characters. And a string is just, well, a string of characters.

Where it gets difficult is when you have to convert between strings and bytes, because then you need to pay attention to which encoding you are using. Unfortunately, many programmers, especially native English speakers, tend to gloss over this problem because ASCII is the de facto standard encoding for Latin letters, and other encodings usually try to be compatible with it. If Latin letters are all you need, you can get along in blissful ignorance of character encodings, believe that bytes and characters are basically the same thing - and write programs that mangle anything that is not ASCII.

So the "secret" of Unicode-aware programming is this: learn to recognize when and where strings / characters are converted to and from bytes, and make sure that all of those places use the correct encoding, that is, the same one that will be used for the reverse conversion, and one that can encode all the characters you are using. UTF-8 is gradually becoming the de facto standard and should normally be used wherever you have a choice.

Typical examples (non-exhaustive):

When writing source code with non-ASCII string literals (encoding setting in the editor / IDE)
When compiling or interpreting such source code (the compiler / interpreter must know the encoding)
When reading / writing strings to a file (the encoding must be specified somewhere in the API or in the file metadata)
When writing strings to a database (the encoding must be specified in the database or table configuration)
When delivering HTML pages via a web server (the encoding must be specified in the HTTP headers or in the pages' meta tags; forms can be even trickier)
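The string / byte boundary described above can be seen directly in JavaScript. The following sketch uses the standard TextEncoder and TextDecoder globals (available in current browsers and in Node.js); the variable names are only illustrative, and "latin1" is picked simply as an example of decoding with the wrong encoding.

```javascript
const text = "けして";

// Characters -> bytes. TextEncoder always produces UTF-8.
const utf8Bytes = new TextEncoder().encode(text);

// Bytes -> characters with the SAME encoding: the round trip is lossless.
const roundTripped = new TextDecoder("utf-8").decode(utf8Bytes);

// Bytes -> characters with a DIFFERENT encoding: the text is mangled.
const mangled = new TextDecoder("latin1").decode(utf8Bytes);

console.log(text.length);           // 3 (characters)
console.log(utf8Bytes.length);      // 9 (bytes, three per kana in UTF-8)
console.log(roundTripped === text); // true
console.log(mangled === text);      // false
```

Every item in the list above is a place where one of these two conversions happens, explicitly or behind the scenes.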
