Correlation between groups in R data.table

Question


Is there an elegant way to calculate correlations between groups of values when the groups are stored in the same column of a data.table (other than converting the data.table to a matrix)?

    library(data.table)
    set.seed(1) # reproducibility
    dt <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
    setkey(dt, group)
    #    id group      value
    # 1:  1     a -0.6264538
    # 2:  2     a  0.1836433
    # 3:  3     a -0.8356286
    # 4:  4     a  1.5952808
    # 5:  1     b  0.3295078
    # 6:  2     b -0.8204684
    # 7:  3     b  0.4874291
    # 8:  4     b  0.7383247

Something that works, but requires entering group names:

    cor(dt["a"]$value, dt["b"]$value)
    # [1] 0.1556371

I am looking more for something like:

 dt[, cor(value, value), by="group"]

But this does not give me the correlation(s) I am after.
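For reference, a minimal sketch of why this attempt falls short, reusing the example data from above: inside a grouped `j` expression, `value` only holds the rows of the current group, so each group ends up correlated with itself. The name `self_cor` is made up for this illustration.

```r
library(data.table)

set.seed(1) # same data as in the question
dt <- data.table(id = rep(1:4, 2),
                 group = rep(letters[1:2], each = 4),
                 value = rnorm(8))

# Within each group, `value` is only that group's values, so this computes
# cor(a, a) and cor(b, b): always 1, never the between-group correlation.
self_cor <- dt[, .(V1 = cor(value, value)), by = "group"]
print(self_cor)
```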

Here is the same problem for a matrix with correct results.

    set.seed(1) # reproducibility
    m <- matrix(rnorm(8), ncol=2)
    dimnames(m) <- list(id=1:4, group=letters[1:2])
    #    group
    # id           a          b
    #   1 -0.6264538  0.3295078
    #   2  0.1836433 -0.8204684
    #   3 -0.8356286  0.4874291
    #   4  1.5952808  0.7383247
    cor(m) # correlations between groups
    #           a         b
    # a 1.0000000 0.1556371
    # b 0.1556371 1.0000000

Any comments or help are greatly appreciated.

+8

r data.table correlation

Bram visser Mar 15 '14 at 8:38


3 answers

I don’t know how to get it in matrix form right away, but I find this solution useful:

    dt[, {x = value; dt[, cor(x, value), by = group]}, by=group]
       group group        V1
    1:     a     a 1.0000000
    2:     a     b 0.1556371
    3:     b     a 0.1556371
    4:     b     b 1.0000000

This is fitting, since you started with a molten (long-format) dataset and you get back a molten representation of the correlations.

With this form you can also calculate only certain pairs. In particular, it is a waste of time to calculate both off-diagonal triangles of a symmetric correlation matrix. For example:

    dt[, {x = value; g = group; dt[group <= g, list(cor(x, value)), by = group]}, by=group]
       group group        V1
    1:     a     a 1.0000000
    2:     b     a 0.1556371
    3:     b     b 1.0000000
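The same idea of computing each pair only once can also be written with `combn`, which enumerates the unordered pairs of distinct group names. This is only a sketch, assuming every group holds the same ids in the same order; `pair_cor` is a made-up helper name, not part of data.table.

```r
library(data.table)

set.seed(1) # same data as in the question
dt <- data.table(id = rep(1:4, 2),
                 group = rep(letters[1:2], each = 4),
                 value = rnorm(8))

# Hypothetical helper: correlation for one pair of group names.
pair_cor <- function(g) {
  data.table(g1 = g[1], g2 = g[2],
             V1 = cor(dt[group == g[1], value], dt[group == g[2], value]))
}

# combn() yields each unordered pair of distinct groups exactly once,
# so neither the diagonal nor the mirrored pairs are computed.
pairs <- rbindlist(combn(unique(dt$group), 2, FUN = pair_cor, simplify = FALSE))
print(pairs)
```

With more than two groups this produces one row per pair, which can then be cast to wide form if needed.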

This form also works just as well for cross-correlation between two sets of groups (i.e. an off-diagonal block):

    library(data.table)
    set.seed(1) # reproducibility
    dt1 <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
    dt2 <- data.table(id=1:4, group=rep(letters[3:4], c(4,4)), value=rnorm(8))
    setkey(dt1, group)
    setkey(dt2, group)
    dt1[, {x = value; g = group; dt2[, list(cor(x, value)), by = group]}, by=group]
       group group          V1
    1:     a     c -0.39499814
    2:     a     d  0.74234458
    3:     b     c  0.96088312
    4:     b     d  0.08016723

If you ultimately want the result in matrix form, you can use dcast or dcast.data.table. Note, however, that the examples above produce two columns with the same name; to fix this, rename them inside the j expression. For the original problem:

    dcast.data.table(
      dt[, {x = value; g1 = group
            dt[, list(g1, g2 = group, c = cor(x, value)), by = group]},
         by = group],
      g1 ~ g2, value.var = "c")
       g1         a         b
    1:  a 1.0000000 0.1556371
    2:  b 0.1556371 1.0000000
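Note that the cast result is still a table with a label column, not a numeric matrix. A minimal sketch of the last step, assuming a long table with the column names `g1`, `g2` and `Cor` (chosen for this illustration) and the correlation values copied from the output above:

```r
library(data.table)

# Long-format correlations, copied from the output shown above.
long <- data.table(g1 = c("a", "a", "b", "b"),
                   g2 = c("a", "b", "a", "b"),
                   Cor = c(1, 0.1556371, 0.1556371, 1))

wide <- dcast(long, g1 ~ g2, value.var = "Cor")

# Drop the label column and move it into the row names of a base matrix.
m <- as.matrix(as.data.frame(wide)[, -1, drop = FALSE])
rownames(m) <- wide$g1
print(m)
```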

+4

Corone Jun 12 '14 at 14:25


I have since found another alternative. You were really close with your approach dt[, cor(value, value), by="group"]. What you really need is to first do a Cartesian self-join on the dates (the id column in this example), and then group by both group columns, i.e.

    dt[dt, allow.cartesian=TRUE][, cor(value, value.1), by=list(group, group.1)]

This has the advantage that the series are aligned by the join (rather than assumed to have the same length). You can then convert the result to matrix form, or leave it as it is, ready to be plotted as a heat map in ggplot, etc.

Full example

    setkey(dt, id)
    c <- dt[dt, allow.cartesian=T][, list(Cor = cor(value, value.1)), by = list(group, group.1)]
    c
       group group.1       Cor
    1:     a       a 1.0000000
    2:     b       a 0.1556371
    3:     a       b 0.1556371
    4:     b       b 1.0000000
    dcast(c, group~group.1, value.var = "Cor")
      group         a         b
    1     a 1.0000000 0.1556371
    2     b 0.1556371 1.0000000
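The `.1` suffixes are what the self-join produced when this was written; as far as I know, newer data.table releases name the duplicated columns with an `i.` prefix instead, so the column names may need adjusting. A sketch that sidesteps the question by renaming a copy before joining; `other`, `group2` and `value2` are names invented for this example:

```r
library(data.table)

set.seed(1) # same data as in the question
dt <- data.table(id = rep(1:4, 2),
                 group = rep(letters[1:2], each = 4),
                 value = rnorm(8))

# Copy with explicit names, so the join result has no duplicated columns.
other <- dt[, .(id, group2 = group, value2 = value)]

# Every row is matched with every row sharing its id (a Cartesian join per id).
pairs <- merge(dt, other, by = "id", allow.cartesian = TRUE)

# One correlation per pair of groups, over the ids the two groups share.
cors <- pairs[, .(Cor = cor(value, value2)), by = .(group, group2)]
print(dcast(cors, group ~ group2, value.var = "Cor"))
```

Because the pairing is done by `id`, groups with missing ids are correlated only over the ids they have in common.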

+4

Corone Oct 14 '14 at 9:48


Scott ritchie · Accepted Answer · 2014-03-15T09:00:53+0000

There is no easy way to do this with data.table. The first method you provided:

 cor(dt["a"]$value, dt["b"]$value)

is perhaps the easiest.

An alternative is to reshape your data from "long" to "wide" format:

    > dtw <- reshape(dt, timevar="group", idvar="id", direction="wide")
    > dtw
       id    value.a    value.b
    1:  1 -0.6264538  0.3295078
    2:  2  0.1836433 -0.8204684
    3:  3 -0.8356286  0.4874291
    4:  4  1.5952808  0.7383247
    > cor(dtw[,list(value.a, value.b)])
              value.a   value.b
    value.a 1.0000000 0.1556371
    value.b 0.1556371 1.0000000

Update: If you use data.table version >= 1.9.0, you can use dcast.data.table, which will be much faster. For more details, see this post.

 dcast.data.table(dt, id ~ group)
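The cast only produces the wide table; the correlation still has to be computed from it. A minimal sketch of the complete step, assuming a data.table version where the `dcast` generic is available and passing `value.var` explicitly rather than letting it be guessed:

```r
library(data.table)

set.seed(1) # same data as in the question
dt <- data.table(id = rep(1:4, 2),
                 group = rep(letters[1:2], each = 4),
                 value = rnorm(8))

# One row per id, one column per group.
wide <- dcast(dt, id ~ group, value.var = "value")

# Correlate every group column; `id` is excluded via .SDcols.
cm <- cor(wide[, .SD, .SDcols = setdiff(names(wide), "id")])
print(cm)
```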
