Using data.table in R, I am trying to perform an operation over a subset that excludes the selected item. I use the `by` argument, but I don't know if this is the right approach.
Here is an example. The value for Delta in IAH:SNA is (3 + 3) / 2 = 3, which is the average of Stops in IAH:SNA after the Delta rows are excluded.
    library(data.table)
    s1 <- "Market Carrier Stops
    IAH:SNA Delta 1
    IAH:SNA Delta 1
    IAH:SNA Southwest 3
    IAH:SNA Southwest 3
    MSP:CLE Southwest 2
    MSP:CLE Southwest 2
    MSP:CLE American 2
    MSP:CLE JetBlue 1"
    d <- data.table(read.table(textConnection(s1), header=TRUE))
    setkey(d, Carrier, Market)
    f <- function(x, y){
      subset(d, !(Carrier %in% x) & Market == y, Stops)[, mean(Stops)]}
    d[, s := f(.BY[[1]], .BY[[2]]), by=list(Carrier, Market)]
The above solution performs very poorly on large datasets (it is essentially an `mapply` that re-subsets the whole table for every group), and I'm not sure how to do this quickly in data.table.
Perhaps it is possible to (dynamically) generate a factor that does this? I just don't know how.
Is there any way to improve it?
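One direction that might avoid the per-group subsetting, as a minimal sketch only: the mean of Stops over the other carriers in a market equals (market sum minus carrier sum) divided by (market count minus carrier count), so two grouped passes could replace the call to `f`. The helper column names `m.sum`, `m.n`, `c.sum` and `c.n` are made up for this illustration, and the toy table is rebuilt inline so the snippet runs on its own. I have not timed this on the full dataset.

```r
library(data.table)

# Toy data from the example above, rebuilt inline
d <- data.table(
  Market  = c(rep("IAH:SNA", 4), rep("MSP:CLE", 4)),
  Carrier = c("Delta", "Delta", "Southwest", "Southwest",
              "Southwest", "Southwest", "American", "JetBlue"),
  Stops   = c(1, 1, 3, 3, 2, 2, 2, 1)
)

# Sum and row count of Stops per market (hypothetical helper columns)
d[, c("m.sum", "m.n") := list(sum(Stops), .N), by = Market]
# Sum and row count of Stops per (carrier, market) pair
d[, c("c.sum", "c.n") := list(sum(Stops), .N), by = list(Carrier, Market)]
# Mean of Stops over all other carriers in the same market;
# a market served by a single carrier gives 0/0 = NaN, like mean() of an empty vector
d[, s := (m.sum - c.sum) / (m.n - c.n)]
# Drop the helper columns
d[, c("m.sum", "m.n", "c.sum", "c.n") := NULL]
```

On the toy data this should give the same `s` as the `f`-based version, for example 3 for the Delta rows in IAH:SNA.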
Edit: for reference, here is a way to get a larger version of the data above:
    library(data.table)

    dl.dta <- function(...){
      ## input years ..
      years <- gsub("\\.", "_", c(...))
      baseurl <- "http://www.transtats.bts.gov/Download/"
      names <- paste("Origin_and_Destination_Survey_DB1BMarket", years, sep="_")
      info <- t(sapply(names, function(x) file.exists(paste(x, c("zip", "csv"), sep="."))))
      to.download <- paste(baseurl, names, ".zip", sep="")[!apply(info, 1, any)]
      if (length(to.download) > 0){
        message("starting download...")
        sapply(to.download, function(x) download.file(x, rev(strsplit(x, "/")[[1]])[1]))}
      to.unzip <- paste(names, "zip", sep=".")[!info[, 2]]
      if (length(to.unzip) > 0){
        message("starting to unzip...")
        sapply(to.unzip, unzip)}
      paste(names, "csv", sep=".")}

    countWords.split <- function(x, s=":"){
      ## Faster on my machine than grep for some reason
      sapply(strsplit(as.character(x), s), length)}

    countWords.grep <- function(x){
      sapply(gregexpr("\\W+", x), length)+1}

    fname <- dl.dta(2013.1)
    cols <- rep("NULL", 41)
    ## Columns to keep: 9 is Origin, 18 is Dest, 24 is groups of airports in travel
    ## 30 is RPcarrier (reporting carrier).
    ## For more columns: 35 is market fare and 36 is distance.
    cols[9] <- cols[18] <- cols[24] <- cols[30] <- NA
    d <- data.table(read.csv(file=fname, colClasses=cols))
    d[, Market := paste(Origin, Dest, sep=":")]
    ## should probably
    d[, Stops := -2 + countWords.split(AirportGroup)]
    d[, Carrier := RPCarrier]
    d[, c("RPCarrier", "Origin", "Dest", "AirportGroup") := NULL]