Calculating correlation by group in a data frame in R

I have a data frame d with three columns, s, n, and id, and I need to calculate the correlation between "s" and "n" for each "id". For example, given the data frame:

"s"   "n"   "id"
1.6    0.5   2
2.5    0.8   2
4.8    0.7   3
2.6    0.4   3
3.5    0.66  3
1.2    0.1   4
2.5    0.45  4

So I want to calculate the correlation for ids 2, 3, and 4 and return the results as a vector, something like:

cor
0.18 0.45 0.65

My problem is how to select the rows for each id, calculate the correlation, and return the results as a vector.

Thanks.

4 answers
tab_split<-split(mydf,mydf$id) # get a list where each element is a subset of your data.frame with the same id

unlist(lapply(tab_split,function(tab) cor(tab[,1],tab[,2]))) # get a vector of correlation coefficients

With the sample data you gave:

mydf<-structure(list(s = c(1.6, 2.5, 4.8, 2.6, 3.5, 1.2, 2.5), 
                     n = c(0.5,0.8, 0.7, 0.4, 0.66, 0.1, 0.45), 
                     id = c(2L, 2L, 3L, 3L, 3L, 4L,4L)), 
                .Names = c("s", "n", "id"), 
                class = "data.frame", 
                row.names = c(NA, -7L))

> unlist(lapply(tab_split,function(tab) cor(tab[,1],tab[,2])))
       2        3        4 
1.000000 0.875128 1.000000

NB: if your column names are always "n" and "s", you can also do

unlist(lapply(tab_split,function(tab) cor(tab$s,tab$n)))
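A slightly stricter variant of the same split-and-apply idea, as a minimal sketch: vapply() replaces unlist(lapply(...)) so that R raises an error if any group does not produce exactly one number. The helper name cor_by_id is made up for this example and is not part of base R.

```r
# Sample data from the question.
mydf <- data.frame(
  s  = c(1.6, 2.5, 4.8, 2.6, 3.5, 1.2, 2.5),
  n  = c(0.5, 0.8, 0.7, 0.4, 0.66, 0.1, 0.45),
  id = c(2L, 2L, 3L, 3L, 3L, 4L, 4L)
)

# cor_by_id is a hypothetical helper name, not a base R function.
cor_by_id <- function(data) {
  # One data frame per distinct id.
  groups <- split(data, data$id)
  # numeric(1) makes vapply check that each group yields a single number.
  vapply(groups, function(g) cor(g$s, g$n), numeric(1))
}

cors <- cor_by_id(mydf)
print(cors)          # named by id
print(unname(cors))  # plain vector without names
```

Note that ids 2 and 4 have only two rows each, and a correlation computed from two distinct points is always exactly 1 or -1, which is why the results differ from the illustrative values in the question.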

Here's the dplyr approach:

library(dplyr)
group_by(df, id) %>% summarise(corel = cor(s, n)) %>% .$corel
#[1] 1.000000 0.875128 1.000000

Some other options, using by() from base R or data.table:

unname(c(by(df[,-3], list(df$id), FUN=function(x) cor(x)[2])))
#[1] 1.000000 0.875128 1.000000

unname(sapply(by(df[,-3], list(df$id), FUN=cor),`[`,2))
#[1] 1.000000 0.875128 1.000000

library(data.table)
setDT(df)[,cor(s,n) , by=id]$V1
#[1] 1.000000 0.875128 1.000000

df <-  structure(list(s = c(1.6, 2.5, 4.8, 2.6, 3.5, 1.2, 2.5), n = c(0.5, 
0.8, 0.7, 0.4, 0.66, 0.1, 0.45), id = c(2L, 2L, 3L, 3L, 3L, 4L, 
4L)), .Names = c("s", "n", "id"), class = "data.frame", row.names = c(NA, 
-7L))

A loop is another option (although it is probably slower than the other solutions). If you want to include only certain ids, adapt the vector d accordingly. The correlations are returned in the vector v.

d <- unique(mydf$id)
v <- vector("numeric", length = length(d))

for(i in seq_along(d)) {
  dat <- mydf[ which(mydf$id == d[i]), ]
  v[i] <- cor(dat$s, dat$n)
}
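
To illustrate the point about adapting d, here is a minimal sketch of the same loop restricted to a subset of ids. The choice of ids 2 and 3 is arbitrary and only for illustration.

```r
# Sample data from the question.
mydf <- data.frame(
  s  = c(1.6, 2.5, 4.8, 2.6, 3.5, 1.2, 2.5),
  n  = c(0.5, 0.8, 0.7, 0.4, 0.66, 0.1, 0.45),
  id = c(2L, 2L, 3L, 3L, 3L, 4L, 4L)
)

# Hypothetical selection: only ids 2 and 3 instead of unique(mydf$id).
d <- c(2, 3)
v <- numeric(length(d))

for (i in seq_along(d)) {
  # Rows belonging to the current id.
  dat <- mydf[mydf$id == d[i], ]
  v[i] <- cor(dat$s, dat$n)
}

# Label each correlation with its id so the result is easier to read.
names(v) <- d
print(v)
```

The result has one entry per id listed in d, in the same order.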