How to count words in a mixture of English and Chinese in JavaScript

I want to count the words in a passage that contains both English and Chinese. For English, it's easy: each space-separated word counts as one word. For Chinese, each character counts as a word, so 香港人 is three words.

So, for example, "I am a 香港人" should have a word count of 6.

Any idea how I can count it in JavaScript / jQuery?

Thanks!

3 answers

Try regex:

/[\u00ff-\uffff]|\S+/g 

For example, "I am a 香港人".match(/[\u00ff-\uffff]|\S+/g) gives:

 ["I", "am", "a", "香", "港", "人"] 

Then you can simply check the length of the resulting array.

The range \u00ff-\uffff covers a very large span of Unicode characters; you probably want to narrow it down to the characters that should each count as a word. For example, CJK Unified Ideographs would be \u4e00-\u9fcc .

    function countWords(str) {
        var matches = str.match(/[\u00ff-\uffff]|\S+/g);
        return matches ? matches.length : 0;
    }
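As a rough illustration of narrowing the range, the sketch below swaps in the \u4e00-\u9fcc range mentioned above. The name countWordsCjk is hypothetical, and the range is only an assumption about which characters should count as one word each; extend it if you need other scripts.

```javascript
// Hypothetical variant of countWords with the character class narrowed
// to the CJK Unified range mentioned above (an assumption, adjust as needed).
function countWordsCjk(str) {
  var matches = str.match(/[\u4e00-\u9fcc]|\S+/g);
  return matches ? matches.length : 0;
}

console.log(countWordsCjk("I am a 香港人")); // 6
console.log(countWordsCjk("Łódź"));          // 1 (the wide range would split this after "Ł")
```

With the narrower class, accented Latin letters at or above \u00ff are no longer treated as standalone words.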

It cannot be 6, because the length of a string also counts the spaces. So:

    var d = "I am a 香港人";
    d.length;                     // returns 10
    d.replace(/\s+/g, "").length; // returns 7, excluding spaces

FYI: your page must be served with the correct character encoding.

I think I found what you need. "I am a 香港人" contains the letter "a" twice, so, building on @PSL's answer, I count only the unique characters:

    var d = "I am a 香港人";
    var uniqueList = d.replace(/\s+/g, '').split('').filter(function (item, i, allItems) {
        return i == allItems.indexOf(item);
    }).join('');
    console.log(uniqueList.length); // returns 6


Following your comment, I assume you mean the string has a space between each word, as in "I am a 香 港 人". I have changed the code accordingly:

    var d = "I am a 香 港 人";
    var uniqueList = d.split(' ').filter(function (item, i, allItems) {
        return i == allItems.indexOf(item);
    });
    console.log(uniqueList.length); // returns 6



I tried the script, but it sometimes miscounts. For example, some people type Chinese and English with no spaces in between, as in "香港人computing都不錯的". That should be 8 words, but the following script reads it as 4:

    <script>
    var str = "香港人computing都不錯的";
    var matches = str.match(/[\u00ff-\uffff]|\S+/g);
    var x = matches ? matches.length : 0;
    alert(x);
    </script>

To fix the problem, I changed the code to:

    <script>
    var str = "香港人computing都不錯的";
    // Treat special characters such as the middle dot as separators.
    str = str.replace(/[\u007F-\u00FE]/g, ' ');
    // Replace every run of non-ASCII characters with a space, leaving only the English words.
    var str1 = str.replace(/[^!-~\d\s]+/gi, ' ');
    // Remove all ASCII characters and whitespace, leaving only the Chinese characters.
    var str2 = str.replace(/[!-~\d\s]+/gi, '');
    var matches1 = str1.match(/[\u00ff-\uffff]|\S+/g);
    var matches2 = str2.match(/[\u00ff-\uffff]|\S+/g);
    var count1 = matches1 ? matches1.length : 0;
    var count2 = matches2 ? matches2.length : 0;
    // Total for the mixture.
    var lvar1 = count1 + count2;
    alert(lvar1);
    </script>

Now the script counts the words in a mixture of Chinese and English correctly. Enjoy.
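A single-pass alternative may also work here. The sketch below is not the script above: it assumes that only characters in \u4e00-\u9fcc count as one word each, and it excludes that range from the second alternative so that an English run stops where the Chinese text begins. countMixedWords is a hypothetical name.

```javascript
// Hypothetical single-regex variant: one token per character in the assumed
// CJK range, or one token per run of other non-whitespace characters.
function countMixedWords(str) {
  var matches = str.match(/[\u4e00-\u9fcc]|[^\s\u4e00-\u9fcc]+/g);
  return matches ? matches.length : 0;
}

console.log(countMixedWords("香港人computing都不錯的")); // 8
console.log(countMixedWords("I am a 香港人"));           // 6
```

Under this sketch, a standalone full-width punctuation mark such as "，" still counts as a token of its own, so strip punctuation first if that matters.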
