A subset of data.table (.SD) with two variables

Question


I am trying to rewrite a function that I have been using for a while. A simplified version:

    dat = data.table(dataframe)

    getRecentRow <- function(data) {
      # Get most recent row (with highest time)
      row = data[order(-Time)][1]
      return(row)
    }

    # Run getRecentRow on each chunk given an ID
    output = dat[, getRecentRow(.SD), by=ID]

This function gives me the most recent record (the one with the highest Time) for each ID. However, an ID can have multiple entries, which are distinguished by SUBID. I would like to go one level deeper: instead of the most recent entry per ID, I need the most recent entry per SUBID. Since SUBIDs are not unique across IDs, the ID must also be taken into account. So I need the most recent entry for each SUBID within each ID.

Summing up: the input for the getRecentRow() function should not be a subset by ID, but a subset by ID and SUBID.

I tried:

    dat = data.table(dataframe)

    getRecentRow <- function(data) {
      # Get most recent row (with highest time)
      row = data[order(-Time)][1]
      return(row)
    }

    # Run getRecentRow on each chunk given an ID and SUBID
    output = dat[, getRecentRow(.SD), by=list(ID, SUBID)]

But this returns incorrect output, with more rows than expected. This should be an easy fix. I think by=list(ID, SUBID) needs to be reformulated, but I cannot figure out how to do this.


r data.table

Max van der heijden Feb 13 '13 at 14:41


1 answer

Max van der heijden · Accepted Answer · 2013-02-14T08:41:58+0000

The problem was not the function. The function had actually been doing its job all along. The problem was the data type. The ID sometimes took on very large values, and as a result the grouping, for some reason, failed. After converting the ID to character, the problem was resolved and the function worked as intended.
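
A minimal sketch of that fix, under stated assumptions: the column names ID, SUBID and Time come from the question, while the sample values and the Value column are invented for illustration. The answer does not say how the conversion was done; sprintf("%.0f", ...) is used here only as one way to turn a large whole number into a string without scientific notation.

```r
library(data.table)

# Hypothetical sample data; only the column names ID, SUBID and Time
# come from the question, the values are made up.
dat <- data.table(
  ID    = c(1234567890123, 1234567890123, 1234567890123, 1234567890124),
  SUBID = c(1, 1, 2, 1),
  Time  = c(10, 20, 5, 7),
  Value = c("a", "b", "c", "d")
)

# Convert the large numeric ID to character so groups are formed on
# exact string matches (assumed conversion method, not from the answer).
dat[, ID := sprintf("%.0f", ID)]

getRecentRow <- function(data) {
  # Sort by descending Time and keep the first (most recent) row
  data[order(-Time)][1]
}

# One most recent row per (ID, SUBID) combination
output <- dat[, getRecentRow(.SD), by = list(ID, SUBID)]
print(output)
```

With this sample data there are three distinct (ID, SUBID) combinations, so output has three rows, each holding the row with the highest Time in its group.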
