Here is a small snippet that computes all n-grams built from consecutive target words in a text. To cope with large data sets it stores the counts in a hash table (via the hash package), although it is probably still quite slow ...
require(hash)

get.ngrams <- function(text, target.words) {
  text <- tolower(text)
  # Split the text into words on any run of non-word characters.
  split.text <- strsplit(text, "\\W+")[[1]]
  ngrams <- hash()
  current.ngram <- ""
  for (i in seq_along(split.text)) {
    word <- split.text[i]
    word_i <- i
    # Starting at each position, extend the n-gram for as long as
    # consecutive words are target words, counting every prefix.
    while (word %in% target.words) {
      if (current.ngram == "") {
        current.ngram <- word
      } else {
        current.ngram <- paste(current.ngram, word)
      }
      if (has.key(current.ngram, ngrams)) {
        ngrams[[current.ngram]] <- ngrams[[current.ngram]] + 1
      } else {
        ngrams[[current.ngram]] <- 1
      }
      word_i <- word_i + 1
      word <- split.text[word_i]  # NA past the end of the text, which exits the loop
    }
    current.ngram <- ""
  }
  ngrams
}
So the next input ...
some.text <- "He states that he loves the United States of America, and I agree it is nice in the United States." some.target.words <- c("united", "states", "of", "america") usa.ngrams <- get.ngrams(some.text, some.target.words)
... will result in the following hash:
> usa.ngrams
<hash> containing 10 key-value pair(s).
  america : 1
  of : 1
  of america : 1
  states : 3
  states of : 1
  states of america : 1
  united : 2
  united states : 2
  united states of : 1
  united states of america : 1
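If you would rather work with the counts as an ordinary R object, the hash package's keys() and values() functions can be used to build a data frame sorted by frequency. A minimal sketch (the ngram.counts name and column names are of course arbitrary):

# Turn the hash into a data frame of n-grams and counts, sorted by count.
# keys() and values() come from the hash package and return the entries
# in the same (key-sorted) order.
ngram.counts <- data.frame(
  ngram = keys(usa.ngrams),
  count = values(usa.ngrams),
  row.names = NULL
)
ngram.counts <- ngram.counts[order(-ngram.counts$count), ]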
Note that this function is case insensitive and that it registers any permutation of the target words, for example:
some.text <- "States of united America are states" some.target.words <- c("united", "states", "of", "america") usa.ngrams <- get.ngrams(some.text, some.target.words)
... leads to:
> usa.ngrams
<hash> containing 10 key-value pair(s).
  america : 1
  of : 1
  of united : 1
  of united america : 1
  states : 2
  states of : 1
  states of united : 1
  states of united america : 1
  united : 1
  united america : 1
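Individual counts can then be looked up with the usual [[ indexing on the hash, and has.key() tells you whether a given run of words occurred at all. For instance, with the result above:

usa.ngrams[["states of united"]]        # 1
usa.ngrams[["states"]]                  # 2
has.key("america states", usa.ngrams)   # FALSE; this run never occurs in the text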
Rasmus Bååth