The problem is that DocumentTermMatrix(...) tokenizes on word boundaries by default, so the terms are single words. You need at least bigrams.
See this post for a basic approach.
library(tm)
library(RWeka)

my.docs  <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
myDict   <- c("foo", "bar", "foo bar")

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

inspect(DocumentTermMatrix(myCorpus,
                           control = list(tokenize   = BigramTokenizer,
                                          dictionary = myDict)))
# <<DocumentTermMatrix (documents: 1, terms: 3)>>
# ...
#     Terms
# Docs bar foo foo bar
#    1   1   1       1
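If pulling in the RWeka/Java dependency is a concern, a similar unigram-plus-bigram tokenizer can be built with the NLP package that tm already depends on. This is only a minimal sketch of that alternative (the UniBigramTokenizer name is mine, not part of the original answer):

library(tm)
library(NLP)

my.docs  <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
myDict   <- c("foo", "bar", "foo bar")

# Return the single words plus all 2-grams pasted back into strings
UniBigramTokenizer <- function(x) {
  w <- words(x)
  c(w, vapply(ngrams(w, 2L), paste, "", collapse = " "))
}

inspect(DocumentTermMatrix(myCorpus,
                           control = list(tokenize   = UniBigramTokenizer,
                                          dictionary = myDict)))

The resulting matrix should again contain the three dictionary terms "bar", "foo", and "foo bar" for the single document.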