N-gram counter with the tm package in R

Question

I wrote a script that counts word frequencies in a document using a DocumentTermMatrix with a dictionary in R. The script works for single words but not for compound terms, e.g. "foo", "bar", "foo bar".

This is the code

    require(tm)
    my.docs <- c("foo bar word1 word2")
    myCorpus <- Corpus(VectorSource(my.docs))
    inspect(DocumentTermMatrix(myCorpus, list(dictionary = c("foo", "bar", "foo bar"))))

But the result is:

        Terms
    Docs bar foo foo bar
       1   1   1       0

I need "foo bar" to be counted as well, i.e. "foo bar" = 1.

How can I fix this?

+4

dictionary r frequency text-mining tm

Rocco Nov 05 '14 at 18:13

1 answer

jlhoward · Accepted Answer · 2014-11-05T19:53:08+0000

The problem is that DocumentTermMatrix(...) tokenizes at word breaks by default, so the two-word term "foo bar" never appears as a single token. You need at least bigrams.

Credit to this post for the basic approach.

    library(tm)
    library(RWeka)
    my.docs <- c("foo bar word1 word2")
    myCorpus <- Corpus(VectorSource(my.docs))
    myDict <- c("foo", "bar", "foo bar")
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
    inspect(DocumentTermMatrix(myCorpus, control = list(tokenize = BigramTokenizer,
                                                        dictionary = myDict)))
    # <<DocumentTermMatrix (documents: 1, terms: 3)>>
    # ...
    #     Terms
    # Docs bar foo foo bar
    #    1   1   1       1
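If installing RWeka (which depends on Java) is not an option, the same idea can be sketched with a small base-R tokenizer. This is only an illustration, assuming words are separated by whitespace; `ngram_tokens` is a hypothetical helper, not part of tm or RWeka, and the dictionary count is done here with `table()` rather than with tm.

```r
# Hypothetical helper (not part of tm or RWeka): returns every n-gram of
# length min_n..max_n from a text, splitting on whitespace only.
ngram_tokens <- function(x, min_n = 1, max_n = 2) {
  text  <- paste(as.character(x), collapse = " ")
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq(min_n, max_n)) {
    if (length(words) < n) next            # text too short for this n
    starts <- seq_len(length(words) - n + 1)
    grams  <- vapply(starts,
                     function(i) paste(words[i:(i + n - 1)], collapse = " "),
                     character(1))
    out <- c(out, grams)
  }
  out
}

tokens <- ngram_tokens("foo bar word1 word2")
dict   <- c("foo", "bar", "foo bar")

# Count only the dictionary terms; each of foo, bar and "foo bar" occurs once.
counts <- table(factor(tokens[tokens %in% dict], levels = dict))
print(counts)

# Assumption: the helper could also be passed to tm in place of
# BigramTokenizer, e.g. control = list(tokenize = ngram_tokens, dictionary = dict).
# In newer tm releases it may be necessary to build the corpus with VCorpus(),
# because custom tokenizers can be ignored for the simple corpus type.
```

The unigrams and bigrams are produced in one pass, which mirrors the `min = 1, max = 2` setting in the RWeka version above.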
