Using JavaScript, how can I count a combination of Asian characters and English words?

I need to take a string of mixed Asian characters (for now, assume only Chinese characters or Japanese kanji / hiragana / katakana) and alphanumeric text (e.g. English, French) and count it as follows:

1) count every Asian character as 1;
2) count each alphanumeric word as 1.

A few examples:

株式会社 myCompany = 4 characters + 1 word = 5 total
株式会社 マ イ コ = 7 characters


My only idea so far is to use:

var wordArray=val.split(/\w+/); 

and then check each element to see whether its contents are alphanumeric (so count it as 1) or not (so take its length). But I don't feel that this is very smart, and the text being counted can be up to 10,000 words, so it would not be very fast.

Ideas?

+6
javascript text character counting
3 answers

Unfortunately, JavaScript RegExp does not support Unicode character classes; \w matches only ASCII characters (modulo some browser bugs).

You can use Unicode characters inside character classes, though, so you can do this if you can express each character set that interests you as a range. For example:

    var r = new RegExp(
        '[A-Za-z0-9_]+|' +                            // ASCII letters, digits, underscore (no accents)
        '[\u3040-\u309F]+|' +                         // Hiragana
        '[\u30A0-\u30FF]+|' +                         // Katakana
        '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',  // Single CJK ideographs
        'g');
    // match() returns null when nothing matches, so fall back to an empty array.
    var nwords = (str.match(r) || []).length;

(This tries to give a more realistic word count for Japanese by counting each run of one type of kana as a word. It's still not accurate, of course, but it's probably closer than treating each syllable as one word.)

Obviously, there are many more characters to consider if you want to "do it right." Let's hope you don't have characters outside the Basic Multilingual Plane, for one!
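
As a rough sketch of the counting rule from the question (each CJK character counts as 1, each alphanumeric word counts as 1), the following assumes an engine that supports ES2018 Unicode property escapes with the `u` flag, which did not exist when the answer above was written. The name `countMixed` is made up for illustration, and the choice of scripts is an assumption about what "Asian character" should cover.

```javascript
// Hypothetical helper; assumes ES2018 Unicode property escapes ("u" flag).
function countMixed(str) {
  // Scripts treated as "one character = one count" (an assumption).
  var cjk = '\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}';
  var token = new RegExp(
    '[' + cjk + ']' +                          // a single CJK character
    '|(?:(?![' + cjk + '])[\\p{L}\\p{N}_])+',  // or a run of other letters/digits
    'gu');
  // match() returns null when nothing matches.
  return (str.match(token) || []).length;
}

console.log(countMixed('株式会社 myCompany'));
console.log(countMixed('株式会社 マイコ'));
```

The negative lookahead keeps a Latin word from swallowing CJK characters that follow it without a space. With the `u` flag the pattern works on code points, so ideographs outside the Basic Multilingual Plane are matched as single characters.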

+3

You can iterate over each character in the text, examining each one to look for word breaks. The following example does this by counting every Chinese / Japanese / Korean (CJK) ideograph as one word and treating each run of alphanumeric characters as a single word.

Some notes about my implementation:

  • It probably doesn’t handle accented characters correctly. They are likely to cause word breaks. You can modify wordBreakRegEx to fix this.

  • cjkRegEx does not include some of the more esoteric ranges of code points, since they require 5 hexadecimal digits to reference, and the JavaScript regex engine does not seem to allow that. But you probably don't need to worry about this, since I don't think most fonts even include them.

  • I intentionally left Japanese hiragana and katakana out of cjkRegEx, as I am not sure how you want to deal with them. Depending on the type of text you are dealing with, it may make sense to treat runs of them as single words. In that case, you need to add logic to distinguish being in a "kana word" from being in an alphanumeric word. If you don't care, you just need to add their code point ranges to cjkRegEx. Of course, you could try to recognize word breaks within runs of kana, but that quickly becomes very difficult.

Implementation Example:

    function getWordCount(text) {
        // This matches all CJK ideographs.
        var cjkRegEx = /[\u3400-\u4db5\u4e00-\u9fa5\uf900-\ufa2d]/;

        // This matches all characters that "break up" words.
        var wordBreakRegEx = /\W/;

        var wordCount = 0;
        var inWord = false;
        var length = text.length;

        for (var i = 0; i < length; i++) {
            var curChar = text.charAt(i);
            if (cjkRegEx.test(curChar)) {
                // Character is a CJK ideograph.
                // Count it as a word, plus any word it terminates.
                wordCount += inWord ? 2 : 1;
                inWord = false;
            } else if (wordBreakRegEx.test(curChar)) {
                // Character is a "word-breaking" character.
                // If a word was started, increment the word count.
                if (inWord) {
                    wordCount += 1;
                    inWord = false;
                }
            } else {
                // All other characters are "word" characters.
                // Indicate that a word has begun.
                inWord = true;
            }
        }

        // If the text ended while in a word, make sure to count it.
        if (inWord) {
            wordCount += 1;
        }

        return wordCount;
    }
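
The note about kana can be made concrete with a small variant. This is only a sketch under one assumption: that every hiragana or katakana character should count as 1, which is the rule stated in the question. The name `getWordCountWithKana` is made up for illustration. The added range U+3040 to U+30FF covers the Hiragana and Katakana blocks, which also contain a few punctuation-like characters such as the middle dot.

```javascript
// Hypothetical variant: kana are counted one by one, like ideographs.
function getWordCountWithKana(text) {
  // Hiragana and Katakana blocks (U+3040-U+30FF) plus the CJK ideograph ranges.
  var cjkOrKanaRegEx = /[\u3040-\u30ff\u3400-\u4db5\u4e00-\u9fa5\uf900-\ufa2d]/;
  // Anything outside [A-Za-z0-9_] ends an alphanumeric word.
  var wordBreakRegEx = /\W/;
  var wordCount = 0;
  var inWord = false;

  for (var i = 0; i < text.length; i++) {
    var curChar = text.charAt(i);
    if (cjkOrKanaRegEx.test(curChar)) {
      // Count the character itself, plus the word it interrupts, if any.
      wordCount += inWord ? 2 : 1;
      inWord = false;
    } else if (wordBreakRegEx.test(curChar)) {
      if (inWord) {
        wordCount += 1;
        inWord = false;
      }
    } else {
      inWord = true;
    }
  }
  // Count a word that runs to the end of the text.
  return inWord ? wordCount + 1 : wordCount;
}

console.log(getWordCountWithKana('株式会社 マイコ'));
```

The CJK test has to come before the word-break test, because \W also matches every non-ASCII character.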

The Unihan Database is very useful for learning about CJK in Unicode. The Unicode home page also contains a lot of information.

-1

I think you want to iterate over all the characters and increase the counter every time the current character is in a different word (according to your definition) than the previous one.
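
One possible reading of this suggestion, as a minimal sketch: classify each character, add 1 for every CJK character, and add 1 whenever an alphanumeric run starts. The names `classify` and `countByTransitions` and the chosen code point ranges are assumptions for illustration, not part of the answer.

```javascript
// Hypothetical classifier: 'cjk', 'word', or 'break'.
function classify(ch) {
  // Kana plus common CJK ideograph ranges (BMP only).
  if (/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]/.test(ch)) {
    return 'cjk';
  }
  // ASCII letters, digits and underscore.
  if (/\w/.test(ch)) {
    return 'word';
  }
  return 'break';
}

function countByTransitions(text) {
  var count = 0;
  var prev = 'break';
  for (var i = 0; i < text.length; i++) {
    var cur = classify(text.charAt(i));
    if (cur === 'cjk') {
      count += 1;               // every CJK character counts on its own
    } else if (cur === 'word' && prev !== 'word') {
      count += 1;               // a new alphanumeric word has started
    }
    prev = cur;
  }
  return count;
}

console.log(countByTransitions('株式会社 myCompany'));
```

Counting at the start of a word instead of at its end means no extra check is needed once the loop finishes.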

-2
