R: Creating n-grams in R with Asian / Chinese characters?

I'm trying to create bigrams and trigrams from a set of text that consists entirely of Chinese characters. At first glance, the tau package seems almost perfect for this. With the following setup, I get close to what I want:

    library(tau)
    q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
    textcnt(q, method="ngram", n=3L, decreasing=TRUE)

The only problem is that the n-grams are built from the escaped Unicode representation of each character (such as <U+5929>) rather than from the characters themselves. So I get something like:

  _ + < <U <U+ > U U+ 9 +5 5 U+5 >_ _< _<U +59 59 2 29 29> 592 7 92 22 19 19 19 19 19 19 19 17 14 14 14 11 11 11 9 9 8 8 8 8 8 8 929 9> >< ><U 9>_ E +5E 3 3> 3>_ 5E 5E7 6 73 73> A E7 E73 4 8 9>< A> +6 8 8 8 8 5 5 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 2 +7 4> 4>< 7A A>< C U+6 U+7 +4 +4E +5F +66 +6C +76 +7A 0 0A 0A> 1 14 14> 4E 4EC 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 597 5F 5F8 60 60A 66 660 68 684 6C 6C1 76 768 7A7 7A> 7D 7D> 84 84> 88 88> 8> 8>< 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 97 97D A7 A7A A>_ C1 C14 CA CA> D D> D>_ EC ECA F F8 F88 U+4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

I tried to write something that performs a similar function, but I can't work out how to extend the code beyond unigrams (sorry if the code is inefficient or ugly, I'm doing my best here). The advantage of this method is that I can get term counts per "document" just by inspecting the DTM, which is nice.

    data <- c(NA, NA, NA)
    names(data) <- c("doc", "term", "freq")
    terms <- NA
    for(i in 1:length(q)){
      temp <- data.frame(i, table(strsplit(q[i], "")))
      names(temp) <- c("doc", "term", "freq")
      data <- rbind(data, temp)
    }
    data <- data[-1,]
    DTM <- xtabs(freq ~ doc + term, data)
    colSums(DTM)

This gives a good result:

    天 平 空 昊 今 好 很 气 的
     8  4  1  1  1  1  1  1  1

Does anyone have suggestions for using tau, or for changing my own code, to get bigrams and trigrams of my Chinese characters?

Edit:

As requested in the comments, here is my sessionInfo() output:

    R version 3.0.0 (2013-04-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] tau_0.0-15

    loaded via a namespace (and not attached):
    [1] tools_3.0.0
1 answer

The stringdist package will do this for you:

    > library(stringdist)
    > q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
    > v1 <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
    > t(qgrams(v1, q=1))
       V1
    天  8
    平  4
    空  1
    昊  1
    ...
    > v2 <- c("天气气","平","很好平","天空天空天空","昊天","今天的天天气很好")
    > t(qgrams(v2, q=2))
         V1
    天气  2
    气气  1
    空天  2
    天空  3
    天的  1
    天天  3
    今天  1
    ...

I transpose the returned matrices because R does not display them well otherwise: each column is sized to the escaped Unicode form of its name (e.g. "<U+6C14><U+6C14>") rather than to the characters themselves.

If you want more detail on the stringdist package, I recommend this article: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms ;)
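If you would rather stay in base R and keep the per-document counts from the question's DTM approach, the same strsplit idea can be extended to n-grams. The following is only a minimal sketch: `char_ngrams` is a hypothetical helper written for this illustration (it is not part of tau or stringdist), and the sketch assumes that `strsplit(x, "")` splits the strings into whole characters in your locale, as it does in the question's own code.

```r
# Hypothetical helper (not from tau or stringdist): returns all
# overlapping character n-grams of a single string x.
char_ngrams <- function(x, n) {
  chars <- strsplit(x, "")[[1]]
  # Strings shorter than n have no n-grams
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1)
  vapply(starts,
         function(i) paste(chars[i:(i + n - 1)], collapse = ""),
         character(1))
}

docs <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")

# One character vector of bigrams per document (use n = 3 for trigrams)
grams <- lapply(docs, char_ngrams, n = 2)

# Long format: one row per n-gram occurrence, tagged with its document index
data <- data.frame(
  doc  = rep(seq_along(docs), vapply(grams, length, integer(1))),
  term = unlist(grams)
)

# Document-term matrix and overall n-gram counts, as in the question
DTM <- xtabs(~ doc + term, data)
colSums(DTM)
```

Documents shorter than n contribute no rows, so they do not appear in the DTM at all. If you need a row for every document, one option is to turn `doc` into a factor with levels `seq_along(docs)` before calling `xtabs`.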
