N-gram counter with the tm package in R

I wrote a script that counts word frequencies in a document using DocumentTermMatrix with a dictionary in R. The script works for single words, but not for compound terms, e.g. "foo", "bar", "foo bar".

This is the code:

    require(tm)
    my.docs <- c("foo bar word1 word2")
    myCorpus <- Corpus(VectorSource(my.docs))
    inspect(DocumentTermMatrix(myCorpus, list(dictionary = c("foo", "bar", "foo bar"))))

But this is the result:

        Terms
    Docs bar foo foo bar
       1   1   1       0

I need the count for "foo bar" to be 1.

How can I fix this?

1 answer

The problem is that DocumentTermMatrix(...) tokenizes on word boundaries by default, so the two-word term "foo bar" never appears as a single token. You need at least bigrams.

Credit to this post for the basic approach.

    library(tm)
    library(RWeka)
    my.docs <- c("foo bar word1 word2")
    myCorpus <- Corpus(VectorSource(my.docs))
    myDict <- c("foo", "bar", "foo bar")
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
    inspect(DocumentTermMatrix(myCorpus, control = list(tokenize = BigramTokenizer,
                                                        dictionary = myDict)))
    # <<DocumentTermMatrix (documents: 1, terms: 3)>>
    # ...
    #     Terms
    # Docs bar foo foo bar
    #    1   1   1       1
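To see why min = 1, max = 2 makes "foo bar" countable, here is a rough base-R sketch of what a 1-to-2-gram tokenizer emits for the example document. The helper ngram_tokens is hypothetical and written only for illustration; it is not part of tm or RWeka, and it splits on whitespace only, which is simpler than what NGramTokenizer really does.

```r
# Hypothetical helper (not from tm or RWeka): return every n-gram of
# length min_n..max_n from a whitespace-separated string.
ngram_tokens <- function(text, min_n = 1, max_n = 2) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  out <- character(0)
  for (n in min_n:max_n) {
    if (length(words) < n) next  # document too short for this n
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- ngram_tokens("foo bar word1 word2")
my_dict <- c("foo", "bar", "foo bar")

# Keep only dictionary terms and count them, roughly what the
# dictionary option does to the columns of the matrix.
counts <- table(factor(tokens[tokens %in% my_dict], levels = my_dict))
print(counts)
```

Because the token list now contains "foo bar" alongside "foo" and "bar", the dictionary lookup can match it. One caveat, stated as an assumption about newer tm releases: Corpus(VectorSource(...)) may return a SimpleCorpus there, for which custom tokenizers are reportedly ignored with a warning; building the corpus with VCorpus(VectorSource(...)) is the usual workaround.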
