data.table and "by must evaluate to list" error

Question

I would like to use the data.table package in R to create aggregates dynamically, but I am running into an error. Below, my.dt is a data.table.

 sex <- c("M","F","M","F")
 age <- c(19, 23, 26, 21)
 dependent.variable <- c(1400, 1500, 1250, 1100)
 my.dt <- data.table(sex, age, dependent.variable)
 grouping.vars <- c("sex", "age")
 for (i in 1:2) {
   my.dt[,sum(dependent.variable), by=grouping.vars[i]]
 }

When I run this, I get an error:

 Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i] : by must evaluate to list

However, the following works without errors:

 my.dt[,sum(dependent.variable), by=sex]

I see why the error occurs, but I do not see how to use a vector with the by parameter.

r data.table

Ryan R. Rosario Jul 15 '10 at 2:21

2 answers

[UPDATE] 2 years after the question was asked ...

When running the code in the question, data.table is now more helpful and returns this (using 1.8.2):

 Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) : 'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...) if you can. Otherwise, by=eval(grouping.vars[i]) should work. This is for efficiency so data.table can detect which columns are needed.

and following the recommendations in the second sentence of the error:

 my.dt[,sum(dependent.variable), by=eval(grouping.vars[i])]
    sex   V1
 1:   M 2650
 2:   F 2600
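A minimal sketch of how the whole loop from the question might look with this fix. The list name `results` is a hypothetical helper that is not in the original code, and the explicit `print()` is there because expressions inside a `for` loop are not auto-printed.

```r
library(data.table)

my.dt <- data.table(
  sex = c("M", "F", "M", "F"),
  age = c(19, 23, 26, 21),
  dependent.variable = c(1400, 1500, 1250, 1100)
)
grouping.vars <- c("sex", "age")

# 'results' is a hypothetical name: one aggregate table per grouping variable.
results <- list()
for (i in seq_along(grouping.vars)) {
  # eval() tells data.table that the character value names a column.
  results[[grouping.vars[i]]] <- my.dt[, sum(dependent.variable),
                                       by = eval(grouping.vars[i])]
}

# Nothing is auto-printed inside or after a loop, so print explicitly.
print(results[["sex"]])
print(results[["age"]])
```

Collecting the tables in a named list keeps each aggregate available after the loop instead of only printing it.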

Old answer from July 2010 (by can now be double and character, though):

Strictly speaking, by needs to evaluate to a list of vectors, each with storage mode integer. So the numeric vector age could also be coerced to integer using as.integer(). This is because data.table uses radix sorting (very fast), but the radix algorithm works only on integers (see the Wikipedia entry for "radix sort"). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is, of course, an integer lookup into a set of unique strings.
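The coercion described above can be sketched as follows, reusing the data from the question. The name `by.age` is a hypothetical one chosen for illustration, and the last line only checks that a factor is stored as integer codes.

```r
library(data.table)

my.dt <- data.table(
  sex = c("M", "F", "M", "F"),
  age = c(19, 23, 26, 21),
  dependent.variable = c(1400, 1500, 1250, 1100)
)

# Coerce the double column to integer storage, by reference.
my.dt[, age := as.integer(age)]

# Group on the integer column; 'by.age' is a hypothetical name.
by.age <- my.dt[, sum(dependent.variable), by = list(age)]
print(by.age)

# A factor is stored as integer codes that index its unique levels.
print(typeof(factor(c("M", "F", "M", "F"))))
```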

The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write column names directly in by. A common case is aggregating by month, for example:

 DT[,sum(col1), by=list(region,month(datecol))]

or a very fast way to group by year and month is to store dates in a non-epoch form such as the integer yyyymmddL, as seen in some of the examples in the package, like this:

 DT[,sum(col1), by=list(region,month=datecol%/%100L)]

Notice how you can name the columns inside the list() like that.
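A small sketch of the named grouping column, assuming dates are stored as yyyymmdd integers. The table contents and the name `monthly` are made up for illustration; only the by expression comes from the answer.

```r
library(data.table)

# Hypothetical data: dates stored as yyyymmdd integers.
DT <- data.table(
  region  = c("east", "east", "east", "west"),
  datecol = c(20100105L, 20100120L, 20100210L, 20100115L),
  col1    = c(10, 20, 30, 40)
)

# Integer division by 100 drops the day, leaving yyyymm.
# The grouping column is named 'month' inside list().
monthly <- DT[, sum(col1), by = list(region, month = datecol %/% 100L)]
print(monthly)
```

Because the division keeps the year, January 2010 and January 2011 fall into different groups, which is not the case with month(datecol).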

To define and reuse complex grouping expressions:

 e = quote(list(region,month(datecol)))
 DT[,sum(col1),by=eval(e)]
 DT[,sum(col2*col3/col4),by=eval(e)]

Or, if you don't want to re-evaluate the by expressions each time, you can save the result once and reuse it for efficiency. This helps when the by expressions themselves take a long time to compute or allocate, or when you need to reuse the result many times:

 byval = DT[,list(region,month(datecol))]
 DT[,sum(col1),by=byval]
 DT[,sum(col2*col3/col4),by=byval]
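The quote()/eval() pattern for reusing a grouping expression can be sketched on a small table. The data and the names `total1` and `total2` are hypothetical, and the grouping expression uses the integer-date form rather than month(datecol) so that the example needs no Date column.

```r
library(data.table)

# Hypothetical data: dates stored as yyyymmdd integers.
DT <- data.table(
  region  = c("east", "east", "east", "west"),
  datecol = c(20100105L, 20100120L, 20100210L, 20100115L),
  col1    = c(10, 20, 30, 40)
)

# Define the grouping expression once, unevaluated.
e <- quote(list(region, month = datecol %/% 100L))

# Reuse it in several aggregations.
total1 <- DT[, sum(col1), by = eval(e)]
total2 <- DT[, sum(col1 * 2), by = eval(e)]
print(total1)
print(total2)
```

Keeping the expression in one place means a change to the grouping logic only has to be made once.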

For the latest information and status, see http://datatable.r-forge.r-project.org/ . A new presentation will be available there soon, and we hope to release v1.5 to CRAN soon too. It contains several bug fixes and new features, detailed in the NEWS file. The datatable-help list has about 30-40 posts a month, which may also be of interest.

Matt Dowle Jul 27 '10 at 12:56

Vulpecula · Accepted Answer · 2010-07-15T04:40:52+0000

I made two changes to your source code:

 sex <- c("M","F","M","F")
 age <- c(19, 23, 26, 21)
 age <- as.factor(age)
 dependent.variable <- c(1400, 1500, 1250, 1100)
 my.dt <- data.table(sex, age, dependent.variable)
 for (a in 1:2) {
   print(my.dt[,sum(dependent.variable), by=list(sex,age)[a]])
 }

The numeric vector age should be coerced to a factor. As for the by parameter, do not quote the column names; group them in a list(...) instead. At least, this is what the author has suggested.
