R: Data table for the subset excluding the value

Using data.table in R , I am trying to perform an operation on a subset, excluding the selected item. I use the by operator, but I don't know if this is correct.

Here is an example. For example. the Delta value in IAH:SNA is (3 + 3) / 2, which is the average Stops in IAH:SNA after Delta exclusion.

 library(data.table) s1 <- "Market Carrier Stops IAH:SNA Delta 1 IAH:SNA Delta 1 IAH:SNA Southwest 3 IAH:SNA Southwest 3 MSP:CLE Southwest 2 MSP:CLE Southwest 2 MSP:CLE American 2 MSP:CLE JetBlue 1" d <- data.table(read.table(textConnection(s1), header=TRUE)) setkey(d, Carrier, Market) f <- function(x, y){ subset(d, !(Carrier %in% x) & Market == y, Stops)[, mean(Stops)]} d[, s := f(.BY[[1]], .BY[[2]]), by=list(Carrier, Market)] ## Market Carrier Stops s ## 1: MSP:CLE American 2 1.666667 ## 2: IAH:SNA Delta 1 3.000000 ## 3: IAH:SNA Delta 1 3.000000 ## 5: IAH:SNA Southwest 3 1.000000 ## 6: IAH:SNA Southwest 3 1.000000 ## 7: MSP:CLE Southwest 2 1.500000 ## 8: MSP:CLE Southwest 2 1.500000 

The above solution works very poorly on large datasets (this is essentially mapply ), but I'm not sure how to do it in a quick way to data.table .

Perhaps it is possible to (dynamically) generate a factor that does this? I just don't know how to do this.,.

Is there any way to improve it?

Edit: just for that, here is a way to get a larger version above

 library(data.table) dl.dta <- function(...){ ## input years .. years <- gsub("\\.", "_", c(...)) baseurl <- "http://www.transtats.bts.gov/Download/" names <- paste("Origin_and_Destination_Survey_DB1BMarket", years, sep="_") info <- t(sapply(names, function(x) file.exists(paste(x, c("zip", "csv"), sep=".")))) to.download <- paste(baseurl, names, ".zip", sep="")[!apply(info, 1, any)] if (length(to.download) > 0){ message("starting download...") sapply(to.download, function(x) download.file(x, rev(strsplit(x, "/")[[1]])[1]))} to.unzip <- paste(names, "zip", sep=".")[!info[, 2]] if (length(to.unzip > 0)){ message("starting to unzip...") sapply(to.unzip, unzip)} paste(names, "csv", sep=".")} countWords.split <- function(x, s=":"){ ## Faster on my machine than grep for some reanon sapply(strsplit(as.character(x), s), length)} countWords.grep <- function(x){ sapply(gregexpr("\\W+", x), length)+1} fname <- dl.dta(2013.1) cols <- rep("NULL", 41) ## Columns to keep: 9 is Origin, 18 is Dest, 24 is groups of airports in travel ## 30 is RPcarrier (reporting carrier). ## For more columns: 35 is market fare and 36 is distance. cols[9] <- cols[18] <- cols[24] <- cols[30] <- NA d <- data.table(read.csv(file=fname, colClasses=cols)) d[, Market := paste(Origin, Dest, sep=":")] ## should probably d[, Stops := -2 + countWords.split(AirportGroup)] d[, Carrier := RPCarrier] d[, c("RPCarrier", "Origin", "Dest", "AirportGroup") := NULL] 
+6
r data.table
source share
2 answers

@ The answer to Roland will work for some features (and when it will be better), but not in general. Unfortunately, you cannot apply the split-apply-comb strategy to data, nor can you complete the task, but you can if you make the data large. Let's start with a simpler example:

 dt = data.table(a = c(1,1,2,2,3,3), b = c(1:6), key = 'a') # now let extend this table the following way # take the unique a and construct all the combinations excluding one element combinations = dt[, combn(unique(a), 2)] # now combine this into a data.table with the excluded element as the index # and merge it back into the original data.table extension = rbindlist(apply(combinations, 2, function(x) data.table(a = x, index = setdiff(c(1,2,3), x)))) setkey(extension, a) dt.extended = extension[dt, allow.cartesian = TRUE] dt.extended[order(index)] # a index b # 1: 2 1 3 # 2: 2 1 4 # 3: 3 1 5 # 4: 3 1 6 # 5: 1 2 1 # 6: 1 2 2 # 7: 3 2 5 # 8: 3 2 6 # 9: 1 3 1 #10: 1 3 2 #11: 2 3 3 #12: 2 3 4 # Now we have everything we need: dt.extended[, mean(b), by = list(a = index)] # a V1 #1: 3 2.5 #2: 2 3.5 #3: 1 4.5 

Returning to the original data (and performing some operations in a slightly different way to simplify the expressions):

 extension = d[, {Carrier.uniq = unique(Carrier); .SD[, rbindlist(combn(Carrier.uniq, length(Carrier.uniq)-1, function(x) data.table(Carrier = x, index = setdiff(Carrier.uniq, x)), simplify = FALSE))]}, by = Market] setkey(extension, Market, Carrier) extension[d, allow.cartesian = TRUE][, mean(Stops), by = list(Market, Carrier = index)] # Market Carrier V1 #1: IAH:SNA Southwest 1.000000 #2: IAH:SNA Delta 3.000000 #3: MSP:CLE JetBlue 2.000000 #4: MSP:CLE Southwest 1.500000 #5: MSP:CLE American 1.666667 
+3
source share

Use the tiny bit of elementary math:

 d[, c("tmp.mean", "N") := list(mean(Stops), .N), by = Market] d[, exep.mean := (tmp.mean * N - sum(Stops)) / (N - .N), by = list(Market,Carrier)] # Market Carrier Stops tmp.mean N exep.mean # 1: IAH:SNA Delta 1 2.00 4 3.000000 # 2: IAH:SNA Delta 1 2.00 4 3.000000 # 3: IAH:SNA Southwest 3 2.00 4 1.000000 # 4: IAH:SNA Southwest 3 2.00 4 1.000000 # 5: MSP:CLE Southwest 2 1.75 4 1.500000 # 6: MSP:CLE Southwest 2 1.75 4 1.500000 # 7: MSP:CLE American 2 1.75 4 1.666667 # 8: MSP:CLE JetBlue 1 1.75 4 2.000000 
+4
source share

All Articles