Creating a contingency table using multiple columns in a data frame in R

Question

I have a data frame that looks like this:

structure(list(ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1), bc = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1), de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1), cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)), .Names = c("ab", "bc", "de", "cl"), row.names = c(NA, -10L), class = "data.frame")

The cl column indicates cluster membership, and the variables ab, bc and de carry binary answers, where 1 indicates yes and 0 indicates no.

I am trying to cross-tabulate the cluster column against all the other columns in the data frame (ab, bc and de), so that the clusters become the columns and the variables become the rows. The desired result looks like this:

    1 2 3
 ab 1 3 2
 bc 2 3 1
 de 2 3 1

I tried the following code:

 with(newdf, tapply(newdf[,c(3)], cl, sum))

This gives me the cross-tabulated values for only one column at a time. My real data frame has 1600+ columns plus one cluster column. Can anyone help?

+6

r contingency

Apricot Oct 31 '15 at 19:10

4 answers

One way, using dplyr:

 library(dplyr)
 df %>%
   # group by the variable cl
   group_by(cl) %>%
   # sum every column
   summarize_each(funs(sum)) %>%
   # select the three needed columns
   select(ab, bc, de) %>%
   # transpose the df
   t

Output:

    [,1] [,2] [,3]
 ab    1    3    2
 bc    2    3    1
 de    2    3    1
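In current dplyr releases, summarize_each() and funs() are deprecated in favour of across(). A minimal sketch of the same idea, assuming dplyr 1.0.0 or later is installed and that the data frame is stored in a variable named df (the question does not name it); select(-cl) is used here instead of listing columns so that it scales to many variables:

```r
library(dplyr)

# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1),
  bc = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1),
  de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1),
  cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)
)

res <- df %>%
  group_by(cl) %>%                          # one group per cluster
  summarise(across(everything(), sum)) %>%  # sum every non-grouping column
  select(-cl) %>%                           # drop the cluster id itself
  t()                                       # variables as rows, clusters as columns

res
```

As in the original answer, the transposed result is a plain matrix whose columns are in cluster order (1, 2, 3) but carry no cluster labels.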

+7

LyzandeR Oct 31 '15 at 19:23

In base R:

 t(sapply(data[,1:3], function(x) tapply(x, data[,4], sum)))
 #   1 2 3
 #ab 1 3 2
 #bc 2 3 1
 #de 2 3 1
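Another base R option is rowsum(), which sums all columns by group in one call and may be convenient with 1600+ columns. This is a sketch rather than part of the original answer; the names df and value_cols are assumptions introduced for illustration:

```r
# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1),
  bc = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1),
  de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1),
  cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)
)

# Every column except the cluster column, so nothing is hard-coded.
value_cols <- setdiff(names(df), "cl")

# rowsum() returns one row per cluster; t() puts clusters in columns.
tab <- t(rowsum(df[value_cols], group = df$cl))
tab
#    1 2 3
# ab 1 3 2
# bc 2 3 1
# de 2 3 1
```

Unlike the tapply() version, the variable columns are selected by name rather than by position, so the cluster column does not have to be the last one.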

+4

nicola Oct 31 '15 at 19:29

You can also combine tidyr::gather (or reshape2::melt) with xtabs to get a contingency table:

 library(tidyr)
 xtabs(value ~ key + cl, data = gather(df, key, value, -cl))
 ##     cl
 ## key  1 2 3
 ##   ab 1 3 2
 ##   bc 2 3 1
 ##   de 2 3 1

If you prefer to use the pipe:

 df %>% gather(key, value, -cl) %>% xtabs(value ~ key + cl, data = .)
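In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A hedged sketch of the same approach with the newer function, assuming a recent tidyr is installed and the data frame is named df; the column names key and value are kept only to match the answer above:

```r
library(tidyr)

# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1),
  bc = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1),
  de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1),
  cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)
)

# Long format: one row per (cluster, variable, answer) triple.
df_long <- pivot_longer(df, cols = -cl, names_to = "key", values_to = "value")

# Sum the answers for each variable/cluster combination.
tab <- xtabs(value ~ key + cl, data = df_long)
tab
```

The result is an xtabs table with variables as rows and clusters as columns, holding the same sums as the gather() version.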

+2

dickoa Oct 31 '15 at 19:37

Gregor · Accepted Answer · 2015-10-31T19:24:04+0000

Your data is in a semi-long, semi-wide format, and you want to get it into a completely wide format. This is easiest if we first convert it to a completely long format:

 library(reshape2)
 df_long = melt(df, id.vars = "cl")
 head(df_long)
 #   cl variable value
 # 1  1       ab     0
 # 2  2       ab     1
 # 3  3       ab     1
 # 4  1       ab     1
 # 5  2       ab     1
 # 6  3       ab     0

Then we can turn it into a wide format, using sum as an aggregate function:

 dcast(df_long, variable ~ cl, fun.aggregate = sum)
 #   variable 1 2 3
 # 1       ab 1 3 2
 # 2       bc 2 3 1
 # 3       de 2 3 1
