Select the nth observation and summarize by groups using data.table

Question

Select the nth observation and summarize by groups using data.table

I would like to turn the first table into the second one, selecting the last observation of the group for a and b , the first observation for c , summarizes each observation for the group for d and e , and for f check if a valid date exists and use this date.

Table 1:

 ID abcdef 1 10 100 1000 10000 100000 ? 1 10 100 1001 10010 100100 5/07/1977 1 11 111 1002 10020 100200 5/07/1977 2 22 222 2000 20000 200000 6/02/1980 3 33 333 3000 30000 300000 20/12/1978 3 33 333 3001 30010 300100 ? 4 40 400 4000 40000 400000 ? 4 40 400 4001 40010 400100 ? 4 40 400 4002 40020 400200 7/06/1944 4 44 444 4003 40030 400300 ? 4 44 444 4004 40040 400400 ? 4 44 444 4005 40050 400500 ? 5 55 555 5000 50000 500000 31/05/1976 5 55 555 5001 50010 500100 31/05/1976

Table 2:

 ID abcdef 1 11 111 1000 30030 300300 5/07/1977 2 22 222 2000 20000 200000 6/02/1980 3 33 333 3000 60010 600100 20/12/1978 4 44 444 4000 240150 2401500 7/06/1944 5 55 555 5000 100010 1000100 31/05/1976

I looked at the StackOverflow questions and I only saw elements of this. I can go through the next step.

 library(data.table) setwd('D:/Work/BRB/StackOverflow') DT = data.table(fread('datatable.csv', header=TRUE)) AB = DT[ , .SD[.N], ID ] AB = AB[ , c('a', 'b') ] C = DT[ , .SD[1], ID ] C = C[ , 'c' ] DE = DT[ , .(d = sum(d), e = sum(e)) , by = ID ] Final = cbind(AB, C, DE) Final

My question is: can I perform operations with variables a , b , c , d , e in one transformation without the need to split it into 3?

Also, I have no idea how to do this f . Any suggestions?

Finally, I'm new to R. Is there anything else I can improve on my code?

+7

r data.table

apotheosied Apr 08 '17 at 12:09

source share

2 answers

We can use tidyverse . After grouping by 'ID', we summarise columns based on the first or last observation

 library(dplyr) DT %>% group_by(ID) %>% summarise(a = last(a), b = last(b), c = first(c), d = sum(d), e = sum(e), f = f[f!="?"][1]) # A tibble: 5 × 7 # ID abcdef # <int> <int> <int> <int> <int> <int> <chr> #1 1 11 111 1000 30030 300300 5/07/1977 #2 2 22 222 2000 20000 200000 6/02/1980 #3 3 33 333 3000 60010 600100 20/12/1978 #4 4 44 444 4000 240150 2401500 7/06/1944 #5 5 55 555 5000 100010 1000100 31/05/1976

+3

akrun Apr 08 '17 at 12:29

source share

Jaap · Accepted Answer · 2017-04-08T12:31:37+0000

There are several things you can improve:

fread will return the data.table, so there is no need to wrap it in data.table . You can check with class(DT) .
Use the na.strings parameter when reading data. The following is an example.

Summarize:

 DT[, .(a = a[.N], b = b[.N], c = c[1], d = sum(d), e = sum(e), f = unique(na.omit(f))), by = ID]

You'll get:

  ID abcdef 1: 1 11 111 1000 30030 300300 5/07/1977 2: 2 22 222 2000 20000 200000 6/02/1980 3: 3 33 333 3000 60010 600100 20/12/1978 4: 4 44 444 4000 240150 2401500 7/06/1944 5: 5 55 555 5000 100010 1000100 31/05/1976

Some explanations and other notes:

Substitution with [1] will give you the first value of the group. You can also use the first function, which is optimized in data.table and therefore faster.
A subset with [.N] will give you the last value of the group. You can also use the last function, which is optimized in data.table and therefore faster.
Do not use variable names, which are also functions in R (in this case, do not use c as the variable name). See Also ?c for an explanation of what the c function does.
To summarize f -variable, I used unique in combination with na.omit . If there is more than one unique date for an ID , you can also use, for example, na.omit(f)[1] .

If speed is a problem, you can optimize the above value (from thanks to @Frank):

 DT[order(f), .(a = last(a), b = last(b), c = first(c), d = sum(d), e = sum(e), f = first(f)), by = ID]

Order f will contain the NA values of last. As a result, GForce internal optimization is now used for all calculations.

Used data:

 DT <- fread("ID abcdef 1 10 100 1000 10000 100000 ? 1 10 100 1001 10010 100100 5/07/1977 1 11 111 1002 10020 100200 5/07/1977 2 22 222 2000 20000 200000 6/02/1980 3 33 333 3000 30000 300000 20/12/1978 3 33 333 3001 30010 300100 ? 4 40 400 4000 40000 400000 ? 4 40 400 4001 40010 400100 ? 4 40 400 4002 40020 400200 7/06/1944 4 44 444 4003 40030 400300 ? 4 44 444 4004 40040 400400 ? 4 44 444 4005 40050 400500 ? 5 55 555 5000 50000 500000 31/05/1976 5 55 555 5001 50010 500100 31/05/1976", na.strings='?')

Select the nth observation and summarize by groups using data.table

More articles: