Topic modeling in R using phrases rather than single words

I am trying to do some topic modeling, but I want to use phrases where they exist rather than single words. That is:

    library(topicmodels)
    library(tm)

    my.docs <- c('the sky is blue, hot sun',
                 'flowers,hot sun',
                 'black cats, bees, rats and mice')
    my.corpus <- Corpus(VectorSource(my.docs))
    my.dtm <- DocumentTermMatrix(my.corpus)
    inspect(my.dtm)

When I inspect my dtm, it breaks everything up into single words, but I need the phrases kept together; that is, there should be a column for each of: "sky is blue", "hot sun", "flowers", "black cats", "bees", "rats and mice".

How can I make the document-term matrix recognize phrases as well as single words? In my documents the phrases are separated by commas.

The solution needs to be efficient, because I want to run it on a large amount of data.
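Since the phrases in these sample documents are comma-delimited, one minimal approach is a tokenizer that simply splits each document on commas — a sketch, with commaTokenizer as a hypothetical helper; note it keeps whole chunks such as "the sky is blue" intact, article included:

    commaTokenizer <- function(x) {
      # split the raw text on commas, then trim surrounding whitespace,
      # so each comma-delimited chunk becomes a single token
      tokens <- trimws(unlist(strsplit(as.character(x), ",", fixed = TRUE)))
      tokens[tokens != ""]
    }

    commaTokenizer("black cats, bees, rats and mice")
    # expected: "black cats" "bees" "rats and mice"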

2 answers

You can try a custom-tokenizer approach. First you define the multi-word terms that you want treated as phrases (I am not aware of algorithmic code for this step):

    tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")

Note that no stemming is done, so if you want both "black cats" and "black cat", you will need to list both variants. Case is ignored.
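So if you wanted both surface forms from the example above, the list would have to spell them out, something like:

    # both plural and singular forms listed explicitly, since no stemming is done
    tokenizing.phrases <- c("sky is blue", "hot sun", "black cats", "black cat")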

Then you need to create a function:

    phraseTokenizer <- function(x) {
      require(stringr)

      x <- as.character(x) # extract the plain text from the tm TextDocument object
      x <- str_trim(x)
      if (is.na(x)) return("")
      # warning(paste("doing:", x))

      # regex(..., ignore_case = TRUE) replaces ignore.case(), which has been
      # removed from recent versions of stringr
      phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))

      if (any(phrase.hits)) {
        # only split once, on the first hit, so you don't have to worry about
        # multiple occurrences of the same phrase
        split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
        # warning(paste("split phrase:", split.phrase))
        temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
        # recurse on the text before and after the matched phrase
        out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
      } else {
        out <- MC_tokenizer(x) # fall back to tm's standard single-word tokenizer
      }

      out[out != ""]
    }
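As a quick sanity check (assuming the tokenizing.phrases vector defined above), running the tokenizer on the first sample document from the question should return the phrases as single tokens:

    phraseTokenizer("the sky is blue, hot sun")
    # expected output, roughly: "the" "sky is blue" "hot sun"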

Then you create the term-document matrix as usual, but this time you pass the phrase tokenizer via the control argument:

    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
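Tying this back to the corpus from the question, the same control argument works for a DocumentTermMatrix. One caveat (an assumption about newer package versions, worth checking): recent versions of tm may return a SimpleCorpus from Corpus(VectorSource(...)), which ignores custom tokenizers, so building the corpus with VCorpus is the safer route:

    # VCorpus rather than Corpus: a SimpleCorpus ignores custom tokenizers
    my.corpus <- VCorpus(VectorSource(my.docs))
    my.dtm <- DocumentTermMatrix(my.corpus,
                                 control = list(tokenize = phraseTokenizer))
    inspect(my.dtm) # "sky is blue", "hot sun", "black cats" should now be single terms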

Perhaps take a look at this relatively recent publication on the topic:

http://web.engr.illinois.edu/~hanj/pdf/kdd13_cwang.pdf

They present an algorithm for identifying phrases and for segmenting/tagging a document into those phrases.
