Reshaping an unusual (NYC MTA turnstile) dataset in R

I have an unusual-looking dataset (New York MTA turnstile data) that I need to reshape in a particular way for some analysis. I wrote code that works, but it is not very efficient, and this is a very large dataset. I hope someone can suggest a better way.

There are 43 columns in this dataset. Columns 1-3 are unique identifiers (i.e. the turnstile at a particular station). Columns 4-8 then give the date, time, description, entries, and exits for one reading. Columns 9-13, and every subsequent group of five up to column 43, follow the same pattern. The dataset is ugly, so I don't want to paste it here, but you can find it at the link below. You will want the files from before 10/18/14 (the older format).

http://web.mta.info/developers/turnstile.html
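To make the layout concrete: the eight 5-column reading blocks start at columns 4, 9, ..., 39. A quick sanity check of those offsets in R:

```r
# Each reading block holds 5 fields (date, time, description, entries, exits).
# Blocks start right after the 3 identifier columns and repeat up to column 43.
starts <- seq(from = 4, to = 43, by = 5)   # 4, 9, 14, 19, 24, 29, 34, 39
ends   <- starts + 4                       # 8, 13, 18, 23, 28, 33, 38, 43
```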

    # Vector of column numbers that identifies the start of each block
    a <- c(4, 9, 14, 19, 24, 29, 34, 39)

    # The actual loop to reshape the data
    for (i in 1:nrow(data)) {
      for (j in 1:length(a)) {
        if (j == 8) {
          all <- rbind(all, cbind(data[i, 1:3], data[i, a[j]:43]))
        } else {
          all <- rbind(all, cbind(data[i, 1:3], data[i, a[j]:(a[j+1]-1)]))
        }
      }
    }

The end result of all this looks like the following:

             1    2        3        1        2       3       4        5
    5083  H026 R137 00-00-00 10-04-14 00:00:00 REGULAR 4072851 10491832
    50831 H026 R137 00-00-00 10-04-14 04:00:00 REGULAR 4072918 10492356
    50832 H026 R137 00-00-00 10-04-14 08:00:00 REGULAR 4073125 10492613
    50833 H026 R137 00-00-00 10-04-14 12:00:00 REGULAR 4073511 10493116
    50834 H026 R137 00-00-00 10-04-14 16:00:00 REGULAR 4073820 10493877
    50835 H026 R137 00-00-00 10-04-14 20:00:00 REGULAR 4074140 10494817

It works, but I know there is a much more efficient way to do this. Any help would be much appreciated!
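As an aside, most of the cost above comes from growing `all` with `rbind` inside the loop, which copies the whole accumulated result on every iteration. A minimal sketch of the same reshape (using a toy 43-column data frame, since the real file isn't reproduced here) that collects the pieces in a list and binds them once:

```r
a    <- c(4, 9, 14, 19, 24, 29, 34, 39)  # block starts, as in the question
ends <- c(a[-1] - 1, 43)                 # matching block ends

# Toy stand-in for the real data: 2 rows, 43 columns
data <- as.data.frame(matrix(seq_len(2 * 43), nrow = 2))

pieces <- vector("list", nrow(data) * length(a))
k <- 0
for (i in seq_len(nrow(data))) {
  for (j in seq_along(a)) {
    k <- k + 1
    block <- data[i, a[j]:ends[j]]
    names(block) <- paste0("V", 4:8)  # align names so rbind() can stack blocks
    pieces[[k]] <- cbind(data[i, 1:3], block)
  }
}
all_rows <- do.call(rbind, pieces)    # one rbind instead of one per block
```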

edit:

I have to add a little more, as I left out some critical snippets that may change the approach. After I read the data with read.csv, I subset it to only a couple of stations (column 2). Since I liked the accepted answer, I converted the subset of the data to strings, as shown below. It really does work pretty well, but any further suggestions will be appreciated!

    library(data.table)

    ## `name` (the CSV file path) and `station_names` are assumed to exist
    ## in the calling environment
    out1 <- function() {
      data <- read.csv(name, header = FALSE)

      ## Isolate data for stations included in network area
      station <- subset(data, V2 %in% station_names)
      data <- apply(station, 1, paste, collapse = ",")

      starts <- seq(from = 4, to = 43, by = 5)
      new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
        rbindlist(lapply(starts, function(y) {
          as.list(x[c(1:3, y:(y + 4))])
        }))
      }))
      setnames(new_data, colnames(new_data),
               c("CA", "UNIT", "SCP", "DATE", "TIME", "DESC", "ENTRIES", "EXIT"))
      new_data <- as.data.frame(new_data)
    }
2 answers

If you don't mind doing the processing as you load the data:

    # data via http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description_pre-10-18-2014.txt
    data <- readLines(textConnection("A002,R051,02-00-00,03-21-10,00:00:00,REGULAR,002670738,000917107,03-21-10,04:00:00,REGULAR,002670738,000917107,03-21-10,08:00:00,REGULAR,002670746,000917117,03-21-10,12:00:00,REGULAR,002670790,000917166,03-21-10,16:00:00,REGULAR,002670932,000917204,03-21-10,20:00:00,REGULAR,002671164,000917230,03-22-10,00:00:00,REGULAR,002671181,000917231,03-22-10,04:00:00,REGULAR,002671181,000917231
    A002,R051,02-00-00,03-22-10,08:00:00,REGULAR,002671220,000917324,03-22-10,12:00:00,REGULAR,002671364,000917640,03-22-10,16:00:00,REGULAR,002671651,000917719,03-22-10,20:00:00,REGULAR,002672430,000917789,03-23-10,00:00:00,REGULAR,002672473,000917795,03-23-10,04:00:00,REGULAR,002672474,000917795,03-23-10,08:00:00,REGULAR,002672516,000917876,03-23-10,12:00:00,REGULAR,002672652,000917934
    A002,R051,02-00-00,03-23-10,16:00:00,REGULAR,002672879,000917996,03-23-10,20:00:00,REGULAR,002673636,000918073,03-24-10,00:00:00,REGULAR,002673683,000918079,03-24-10,04:00:00,REGULAR,002673683,000918079,03-24-10,08:00:00,REGULAR,002673722,000918171,03-24-10,12:00:00,REGULAR,002673876,000918514,03-24-10,16:00:00,REGULAR,002674221,000918594,03-24-10,20:00:00,REGULAR,002675082,000918671
    A002,R051,02-00-00,03-25-10,00:00:00,REGULAR,002675153,000918675,03-25-10,04:00:00,REGULAR,002675153,000918675,03-25-10,08:00:00,REGULAR,002675190,000918752,03-25-10,12:00:00,REGULAR,002675345,000919053,03-25-10,16:00:00,REGULAR,002675676,000919118,03-25-10,20:00:00,REGULAR,002676557,000919179,03-26-10,00:00:00,REGULAR,002676688,000919207,03-26-10,04:00:00,REGULAR,002676694,000919208
    A002,R051,02-00-00,03-26-10,08:00:00,REGULAR,002676735,000919287,03-26-10,12:00:00,REGULAR,002676887,000919607,03-26-10,16:00:00,REGULAR,002677213,000919680,03-26-10,20:00:00,REGULAR,002678039,000919743,03-27-10,00:00:00,REGULAR,002678144,000919756,03-27-10,04:00:00,REGULAR,002678145,000919756,03-27-10,08:00:00,REGULAR,002678155,000919777,03-27-10,12:00:00,REGULAR,002678247,000919859
    A002,R051,02-00-00,03-27-10,16:00:00,REGULAR,002678531,000919908,03-27-10,20:00:00,REGULAR,002678892,000919964,03-28-10,00:00:00,REGULAR,002678929,000919966,03-28-10,04:00:00,REGULAR,002678929,000919966,03-28-10,08:00:00,REGULAR,002678935,000919982,03-28-10,12:00:00,REGULAR,002679003,000920006,03-28-10,16:00:00,REGULAR,002679231,000920059,03-28-10,20:00:00,REGULAR,002679475,000920098"))

    library(data.table)

    starts <- seq(from = 4, to = 43, by = 5)
    new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
      rbindlist(lapply(starts, function(y) {
        as.list(x[c(1:3, y:(y + 4))])
      }))
    }))
    setnames(new_data, colnames(new_data),
             c("control_area", "unit", "scp", "date", "time",
               "description", "entries", "exits"))

    dplyr::glimpse(new_data)
    ## Observations: 48
    ## Variables:
    ## $ control_area (fctr) A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A0...
    ## $ unit         (fctr) R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R0...
    ## $ scp          (fctr) 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, ...
    ## $ date         (fctr) 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-22-10, 03-22-10, ...
    ## $ time         (fctr) 00:00:00, 04:00:00, 08:00:00, 12:00:00, 16:00:00, 20:00:00, 00:00:00, 04:00:00, ...
    ## $ description  (fctr) REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR,...
    ## $ entries      (fctr) 002670738, 002670738, 002670746, 002670790, 002670932, 002671164, 002671181, 002...
    ## $ exits        (fctr) 000917107, 000917107, 000917117, 000917166, 000917204, 000917230, 000917231, 000...
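One caveat with this result: the `glimpse` output shows that every column, including `entries` and `exits`, comes back as a factor, so the counters need converting before any arithmetic. A small base-R sketch of the safe conversion (going through `as.character` first, since `as.numeric` on a factor returns the internal level codes, not the printed values):

```r
entries <- factor(c("002670738", "002670746", "002670790"))

wrong <- as.numeric(entries)                # level codes: 1, 2, 3
right <- as.numeric(as.character(entries))  # the actual counter values
```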

Here is an alternative approach to consider. It uses the stringi package and my splitstackshape package.

    library(splitstackshape)
    library(stringi)

We will use the names from the field descriptions at the URL shared by @hrbmstr:

    Names <- scan(what = "character", sep = ",", text = paste0(
      "C/A,UNIT,SCP,DATE1,TIME1,DESC1,ENTRIES1,EXITS1,",
      "DATE2,TIME2,DESC2,ENTRIES2,EXITS2,DATE3,TIME3,DESC3,",
      "ENTRIES3,EXITS3,DATE4,TIME4,DESC4,ENTRIES4,EXITS4,",
      "DATE5,TIME5,DESC5,ENTRIES5,EXITS5,DATE6,TIME6,DESC6,",
      "ENTRIES6,EXITS6,DATE7,TIME7,DESC7,ENTRIES7,EXITS7,",
      "DATE8,TIME8,DESC8,ENTRIES8,EXITS8"))

    ## What are the unique variable "stubs"?
    isRepeated <- unique(gsub("\\d", "", Names[4:length(Names)]))

Then we will write a function using the above constants. The function performs the following actions:

  • Uses stri_split_fixed to split the vector into a matrix.
  • Trims any extra whitespace.
  • Converts the matrix to a data.table and assigns the appropriate names.
  • Uses merged.stack to go from the "wide" form to a "semi-long" form.

Here's the function:

    funAM <- function(invec) {
      temp <- stri_split_fixed(invec, ",", simplify = TRUE)
      temp <- `dim<-`(stri_trim_both(temp), dim(temp))
      DT <- setnames(as.data.table(temp), Names)
      merged.stack(getanID(DT, 1:3), var.stubs = isRepeated, sep = "var.stubs")
    }

Try:

    ## Try a dataset where we know there are unbalanced numbers of observations...
    data <- readLines("http://web.mta.info/developers/data/nyct/turnstile/turnstile_130615.txt")

We will apply the function to the newly created data object:

    system.time(out <- funAM(data))  ## Reasonably fast
    #    user  system elapsed
    #    1.25    0.02    1.29

    out
    #           C/A UNIT      SCP .id .time_1     DATE     TIME    DESC   ENTRIES     EXITS
    #      1:  A002 R051 02-00-00   1       1 06-08-13 00:00:00 REGULAR 004153504 001427135
    #      2:  A002 R051 02-00-00   1       2 06-08-13 04:00:00 REGULAR 004153535 001427138
    #      3:  A002 R051 02-00-00   1       3 06-08-13 08:00:00 REGULAR 004153559 001427177
    #      4:  A002 R051 02-00-00   1       4 06-08-13 12:00:00 REGULAR 004153683 001427255
    #      5:  A002 R051 02-00-00   1       5 06-08-13 16:00:00 REGULAR 004153959 001427320
    #     ---
    # 241492: TRAM2 R469 00-05-01   6       4
    # 241493: TRAM2 R469 00-05-01   6       5
    # 241494: TRAM2 R469 00-05-01   6       6
    # 241495: TRAM2 R469 00-05-01   6       7
    # 241496: TRAM2 R469 00-05-01   6       8
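Note the trailing TRAM2 rows: with unbalanced blocks, `merged.stack` emits rows whose reading fields are empty. Assuming those fields come back as empty strings (depending on version they might also be `NA`), a simple filter drops them; a sketch on a toy data frame standing in for the stacked result:

```r
# Toy stand-in for the stacked result: the last row came from an unfilled block.
out <- data.frame(
  SCP     = c("02-00-00", "02-00-00", "00-05-01"),
  DATE    = c("06-08-13", "06-08-13", ""),
  ENTRIES = c("004153504", "004153535", ""),
  stringsAsFactors = FALSE
)

# Keep only rows whose DATE field was actually populated.
out_clean <- out[!is.na(out$DATE) & out$DATE != "", ]
```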

For comparison, here is the timing of @hrbmstr's approach:

    funHRB <- function() {
      starts <- seq(from = 4, to = 43, by = 5)
      new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
        rbindlist(lapply(starts, function(y) {
          as.list(x[c(1:3, y:(y + 4))])
        }))
      }))
      setnames(new_data, colnames(new_data),
               c("control_area", "unit", "scp", "date", "time",
                 "description", "entries", "exits"))
      new_data
    }

    system.time(out2 <- funHRB())
    #    user  system elapsed
    #   23.59    0.03   23.77

In addition, both of these approaches are much faster than what I assume is the OP's first step: using read.csv or something similar to get the data into R in the first place. For me, that alone takes about a minute with the same dataset:

    system.time(DF <- read.csv(
      header = FALSE, col.names = Names, strip.white = TRUE,
      colClasses = rep("character", length(Names)),
      text = data))
    #    user  system elapsed
    #   66.01    0.07   66.91
