Creating a word cloud of phrases rather than single words in R

Question

I am trying to make a word cloud from a list of phrases, many of which are repeated, rather than from individual words. My data looks something like this: one column of my data frame is a list of phrases.

df$names <- c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul HC", "Paul HC")

I would like to make a word cloud in which each of these names is treated as a single phrase and sized by its frequency, rather than being split into the words that make it up. The code I used looks like this:

    library(tm)
    library(wordcloud)
    library(RColorBrewer)

    df.corpus <- Corpus(DataframeSource(data.frame(df$names)))
    df.corpus <- tm_map(df.corpus, function(x) removeWords(x, stopwords("english")))

    # turning that corpus into a TDM
    tdm <- TermDocumentMatrix(df.corpus)
    m <- as.matrix(tdm)
    v <- sort(rowSums(m), decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    pal <- brewer.pal(9, "BuGn")
    pal <- pal[-(1:2)]

    # making a word cloud
    png("wordcloud.png", width = 1280, height = 800)
    wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100,
              random.order = T, rot.per = .15, colors = "black",
              vfont = c("sans serif", "plain"))
    dev.off()

This creates a word cloud, but it is built from every component word, not from the phrases. So I see the relative frequency of "A", "H", "John", etc., instead of the relative frequency of "Joseph A", "Mary A", etc., which is what I want.

I am sure this is not difficult to fix, but I can't figure it out! I would appreciate any help.

+7

r word-cloud

verybadatthis Nov 14 '14 at 20:03

2 answers

Install RWeka and its dependencies, then try the following:

    library(RWeka)
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    # ... other tokenizers
    tok <- BigramTokenizer
    tdmgram <- TermDocumentMatrix(df.corpus, control = list(tokenize = tok))
    # ... create wordcloud

The tokenizer line above chops the text into phrases of length 2.
More specifically, it creates phrases with a minimum length of 2 and a maximum length of 2.
Using Weka's general NGramTokenizer, you can create different tokenizers (for example, minimum length 1, maximum length 2), and you will probably want to experiment with different lengths. You can also call them tok1, tok2 instead of the verbose "BigramTokenizer" that I used above.
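
To get a feel for what the minimum and maximum lengths control, here is a rough base-R sketch. The helper simple_ngrams is hypothetical (it is not part of RWeka or tm) and only mimics the idea of an n-gram tokenizer by splitting on whitespace; the real NGramTokenizer may treat punctuation and delimiters differently.

```r
# Hypothetical helper, for illustration only: return every run of n
# consecutive whitespace-separated words, for n from min_n to max_n.
simple_ngrams <- function(x, min_n = 1, max_n = 2) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  out <- character(0)
  for (n in min_n:max_n) {
    if (n > length(words)) next  # not enough words for an n-gram of this size
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

one_and_two <- simple_ngrams("Paul HC Smith", min_n = 1, max_n = 2)
only_two    <- simple_ngrams("John", min_n = 2, max_n = 2)
print(one_and_two)
print(only_two)
```

In this sketch a one-word string such as "John" yields no phrases at all when both lengths are 2, which is one reason to try a minimum length of 1 if some of your names are single words.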

+3

knb Nov 14 '14 at 20:19

keegan · Accepted Answer · 2014-11-14T20:34:16+0000

The difficulty is that each element of df$names is treated as a "document" by the tm functions, and each document is then split into words. For example, "John A" contains the words "John" and "A". It looks like you want to keep the names as they are and just count how often each one appears; you can simply use table to do this.

    library(wordcloud)
    df <- data.frame(theNames = c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul HC", "Paul HC"))
    tb <- table(df$theNames)
    wordcloud(names(tb), as.numeric(tb), scale = c(8, .3), min.freq = 1, max.words = 100,
              random.order = T, rot.per = .15, colors = "black",
              vfont = c("sans serif", "plain"))
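
To see the difference between the two ways of counting without drawing anything, a small base-R sketch like the following may help. The variable names (phrases, word_counts, phrase_counts) are made up for this illustration, and the whitespace split only approximates what tm does when it breaks a document into terms.

```r
phrases <- c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul HC", "Paul HC")

# Word-level counts: splitting on whitespace breaks "Mary A" into "Mary" and "A".
word_counts <- table(unlist(strsplit(phrases, "\\s+")))

# Phrase-level counts: each element is counted whole.
phrase_counts <- table(phrases)

# Data frame in the word/freq shape used in the question, most frequent first.
d <- data.frame(word = names(phrase_counts),
                freq = as.integer(phrase_counts),
                stringsAsFactors = FALSE)
d <- d[order(-d$freq), ]
print(word_counts)
print(d)
```

Here "Joseph A" appears only once, so it would be left out of the cloud with min.freq = 2 as in the question; that is why the call above uses min.freq = 1.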
