Select the nth observation and summarize by groups using data.table

I would like to turn the first table into the second one, selecting the last observation of the group for a and b , the first observation for c , summarizes each observation for the group for d and e , and for f check if a valid date exists and use this date.

Table 1:

 ID abcdef 1 10 100 1000 10000 100000 ? 1 10 100 1001 10010 100100 5/07/1977 1 11 111 1002 10020 100200 5/07/1977 2 22 222 2000 20000 200000 6/02/1980 3 33 333 3000 30000 300000 20/12/1978 3 33 333 3001 30010 300100 ? 4 40 400 4000 40000 400000 ? 4 40 400 4001 40010 400100 ? 4 40 400 4002 40020 400200 7/06/1944 4 44 444 4003 40030 400300 ? 4 44 444 4004 40040 400400 ? 4 44 444 4005 40050 400500 ? 5 55 555 5000 50000 500000 31/05/1976 5 55 555 5001 50010 500100 31/05/1976 

Table 2:

 ID abcdef 1 11 111 1000 30030 300300 5/07/1977 2 22 222 2000 20000 200000 6/02/1980 3 33 333 3000 60010 600100 20/12/1978 4 44 444 4000 240150 2401500 7/06/1944 5 55 555 5000 100010 1000100 31/05/1976 

I looked at the StackOverflow questions and I only saw elements of this. I can go through the next step.

 library(data.table) setwd('D:/Work/BRB/StackOverflow') DT = data.table(fread('datatable.csv', header=TRUE)) AB = DT[ , .SD[.N], ID ] AB = AB[ , c('a', 'b') ] C = DT[ , .SD[1], ID ] C = C[ , 'c' ] DE = DT[ , .(d = sum(d), e = sum(e)) , by = ID ] Final = cbind(AB, C, DE) Final 

My question is: can I perform operations with variables a , b , c , d , e in one transformation without the need to split it into 3?

Also, I have no idea how to do this f . Any suggestions?

Finally, I'm new to R. Is there anything else I can improve on my code?

+7
r data.table
source share
2 answers

There are several things you can improve:

  • fread will return the data.table, so there is no need to wrap it in data.table . You can check with class(DT) .
  • Use the na.strings parameter when reading data. The following is an example.
  • Summarize:

     DT[, .(a = a[.N], b = b[.N], c = c[1], d = sum(d), e = sum(e), f = unique(na.omit(f))), by = ID] 

You'll get:

  ID abcdef 1: 1 11 111 1000 30030 300300 5/07/1977 2: 2 22 222 2000 20000 200000 6/02/1980 3: 3 33 333 3000 60010 600100 20/12/1978 4: 4 44 444 4000 240150 2401500 7/06/1944 5: 5 55 555 5000 100010 1000100 31/05/1976 

Some explanations and other notes:

  • Substitution with [1] will give you the first value of the group. You can also use the first function, which is optimized in data.table and therefore faster.
  • A subset with [.N] will give you the last value of the group. You can also use the last function, which is optimized in data.table and therefore faster.
  • Do not use variable names, which are also functions in R (in this case, do not use c as the variable name). See Also ?c for an explanation of what the c function does.
  • To summarize f -variable, I used unique in combination with na.omit . If there is more than one unique date for an ID , you can also use, for example, na.omit(f)[1] .

If speed is a problem, you can optimize the above value (from thanks to @Frank):

 DT[order(f), .(a = last(a), b = last(b), c = first(c), d = sum(d), e = sum(e), f = first(f)), by = ID] 

Order f will contain the NA values โ€‹โ€‹of last. As a result, GForce internal optimization is now used for all calculations.


Used data:

 DT <- fread("ID abcdef 1 10 100 1000 10000 100000 ? 1 10 100 1001 10010 100100 5/07/1977 1 11 111 1002 10020 100200 5/07/1977 2 22 222 2000 20000 200000 6/02/1980 3 33 333 3000 30000 300000 20/12/1978 3 33 333 3001 30010 300100 ? 4 40 400 4000 40000 400000 ? 4 40 400 4001 40010 400100 ? 4 40 400 4002 40020 400200 7/06/1944 4 44 444 4003 40030 400300 ? 4 44 444 4004 40040 400400 ? 4 44 444 4005 40050 400500 ? 5 55 555 5000 50000 500000 31/05/1976 5 55 555 5001 50010 500100 31/05/1976", na.strings='?') 
+9
source share

We can use tidyverse . After grouping by 'ID', we summarise columns based on the first or last observation

 library(dplyr) DT %>% group_by(ID) %>% summarise(a = last(a), b = last(b), c = first(c), d = sum(d), e = sum(e), f = f[f!="?"][1]) # A tibble: 5 ร— 7 # ID abcdef # <int> <int> <int> <int> <int> <int> <chr> #1 1 11 111 1000 30030 300300 5/07/1977 #2 2 22 222 2000 20000 200000 6/02/1980 #3 3 33 333 3000 60010 600100 20/12/1978 #4 4 44 444 4000 240150 2401500 7/06/1944 #5 5 55 555 5000 100010 1000100 31/05/1976 
+3
source share

All Articles