Creating a "word" cloud of phrases rather than single words in R

I am trying to make a word cloud from a list of phrases, many of which are repeated, rather than from individual words. My data looks something like this: one column of my data frame is a list of phrases.

df$names <- c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul HC", "Paul HC") 

I would like to make a word cloud, where all these names are considered as separate phrases whose frequency is displayed, and not the words that make them up. The code I used looks like this:

    library(tm)
    library(wordcloud)
    library(RColorBrewer)

    df.corpus <- Corpus(DataframeSource(data.frame(df$names)))
    df.corpus <- tm_map(df.corpus, function(x) removeWords(x, stopwords("english")))

    # turning that corpus into a TDM
    tdm <- TermDocumentMatrix(df.corpus)
    m <- as.matrix(tdm)
    v <- sort(rowSums(m), decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    pal <- brewer.pal(9, "BuGn")
    pal <- pal[-(1:2)]

    # making a word cloud
    png("wordcloud.png", width = 1280, height = 800)
    wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100,
              random.order = TRUE, rot.per = .15, colors = "black",
              vfont = c("sans serif", "plain"))
    dev.off()

This creates a word cloud, but it is built from every component word, not from the phrases. So I see the relative frequency of "A", "H", "John", etc., instead of the relative frequency of "Joseph A", "Mary A", etc., which is what I want.

I am sure this is not difficult to fix, but I cannot figure it out! I would appreciate any help.

+7
r word-cloud
2 answers

The difficulty is that each element of df$names is treated as a "document" by the tm functions, and each document is then split into words. For example, "Joseph A" contains the words "Joseph" and "A". It looks like you want to keep the names as they are and just count their occurrences; you can simply use table to do this.

    library(wordcloud)

    df <- data.frame(theNames = c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul HC", "Paul HC"))
    tb <- table(df$theNames)
    wordcloud(names(tb), as.numeric(tb), scale = c(8, .3), min.freq = 1, max.words = 100,
              random.order = TRUE, rot.per = .15, colors = "black",
              vfont = c("sans serif", "plain"))
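To see what the cloud is being fed, it can help to look at the counts on their own. The following is a minimal base R sketch using the names from the question; the variable names `phrases`, `counts`, and `freq` are only illustrative.

```r
# Phrases taken from the question; each element is counted as one unit.
phrases <- c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul HC", "Paul HC")

# table() counts identical strings as a whole, without splitting on spaces.
counts <- table(phrases)

# Convert to a data frame and sort by descending frequency.
freq <- data.frame(phrase = names(counts),
                   freq = as.integer(counts),
                   stringsAsFactors = FALSE)
freq <- freq[order(-freq$freq), ]
print(freq)
```

The `phrase` and `freq` columns can then be passed to wordcloud() in place of names(tb) and as.numeric(tb).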


+8

Install RWeka and its dependencies, then try the following:

    library(RWeka)

    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    # ... other tokenizers
    tok <- BigramTokenizer
    tdmgram <- TermDocumentMatrix(df.corpus, control = list(tokenize = tok))
    # ... create wordcloud

The tokenizer line above splits text into phrases of length 2.
More specifically, it creates phrases with a minimum length of 2 and a maximum length of 2.
Using Weka's general NGramTokenizer algorithm, you can create different tokenizers (for example, minimum length 1, maximum length 2), and you will probably want to experiment with different lengths. You can also give them shorter names such as tok1 and tok2 instead of the verbose "BigramTokenizer" that I used above.
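As a rough illustration of a tokenizer with minimum length 1 and maximum length 2, here is a small sketch. It assumes RWeka and a working Java installation are available; the name `UniBigramTokenizer` is hypothetical and not part of RWeka.

```r
library(RWeka)  # assumption: RWeka and Java are installed

# Hypothetical helper: emits every one-word and two-word phrase in x.
UniBigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# Try it on one of the names from the question.
tokens <- UniBigramTokenizer("Joseph A")
print(tokens)
```

A tokenizer like this can be passed as control = list(tokenize = UniBigramTokenizer) in TermDocumentMatrix. With recent versions of tm, custom tokenizers may only take effect when the corpus is built with VCorpus rather than Corpus, so that is worth checking if the result still shows only single words.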

+3
