Fast, efficient way to loop over millions of rows and column matches

Question

I am working with eye-tracking data, so I have a huge data set (I think millions of rows) and would like to complete this task quickly. Here is a simplified version.

The data tell me where the eye is looking at each moment in time, for each file being viewed. X1 and Y1 are the coordinates of the point being looked at. For each file there are several time points (representing the eye looking at different places in the file over time).

    Filename Time X1 Y1
    1        1    10 10
    1        2    12 10

I also have a second data frame that says where the items are located in each file. Each file contains (in this simplified case) two items. X1, Y1 are the lower-left coordinates and X2, Y2 are the upper-right coordinates. You can think of this as a bounding box around each item in each file. For example:

    Filename Item X1 Y1 X2 Y2
    1        Dog  11 10 20 20

What I would like to do is add another column to the first data frame that tells me which item the person is looking at at each time point, for each file. If they are not looking at any of the items, I would like the column to say "none". Points on the border of a box count as looking at the item. For example:

    Filename Time X1 Y1 LookingAt
    1        1    10 10 none
    1        2    12 11 Dog
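To make the rule concrete, here is a minimal base R sketch of the check for a single gaze point. The helper name `item_at` and the one-row `boxes` data frame are made up for illustration, and the sketch assumes the coordinates are already numeric and that X1 <= X2 and Y1 <= Y2. It is only meant to pin down the inclusive-border rule, not to be fast.

```r
# Hypothetical helper: return the item whose box contains (x, y), else "none".
# Borders count as inside, so all four comparisons are inclusive.
item_at <- function(x, y, boxes) {
  hit <- x >= boxes$X1 & x <= boxes$X2 & y >= boxes$Y1 & y <= boxes$Y2
  if (any(hit)) as.character(boxes$Item[hit][1]) else "none"
}

# The Dog box from the example above
boxes <- data.frame(Item = "Dog", X1 = 11, Y1 = 10, X2 = 20, Y2 = 20,
                    stringsAsFactors = FALSE)

item_at(10, 10, boxes)  # "none": x = 10 is left of the box
item_at(12, 11, boxes)  # "Dog"
item_at(11, 10, boxes)  # "Dog": the lower-left corner is on the border
```

If several boxes contain the point, this sketch returns the first one; the question does not say what should happen in that case.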

I know how to do this with a for loop, but it takes forever (and crashes my RStudio). I am wondering if there is a faster, more efficient way that I am missing.

Here's the dput for the first data frame (it contains more rows than I showed above):

 structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25" ), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename", "Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")

And here is the dput for the second:

 structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"), Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"), X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")), .Names = c("Filename", "Item", "X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")

r

pomegranate Dec 27 '15 at 17:59

1 answer

Jaap · Accepted Answer · 2015-12-27T21:23:41+0000

Using the data.table package and the provided example data, I would approach it as follows:

    library(data.table)

    # getting the data in the right format
    datcols <- c("X", "Y")
    lucols <- c("X1", "X2", "Y1", "Y2")

    # convert the factor columns of 'dat' to numeric / character
    setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
               ][, Filename := as.character(Filename)]

    # same for 'lu', and order the box corners
    setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
              ][, `:=` (Filename = as.character(Filename),
                        X1 = pmin(X1, X2), X2 = pmax(X1, X2),   # make sure that 'X1' is always the lowest value
                        Y1 = pmin(Y1, Y2), Y2 = pmax(Y1, Y2))]  # make sure that 'Y1' is always the lowest value

    # matching the 'Items' to the correct rows
    dat[, looked_at := lu$Item[Filename == lu$Filename &
                                 between(X, lu$X1, lu$X2) &
                                 between(Y, lu$Y1, lu$Y2)],
        by = .(Filename, Time)]

which gives:

    > dat
       Filename Time  X  Y looked_at
    1:        1    1 10 10       Cat
    2:        1    2 15 20        NA
    3:        1    3 12 25        NA
    4:        2    1 11 15        NA
    5:        2    2 10 10        NA
    6:        3    1 15 11        NA
    7:        3    2 25 12        NA
    8:        3    5 20 15     House
    9:        3    6 10 10     Mouse
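The question asks for "none" rather than NA where nothing is being looked at. A minimal sketch of that last step, assuming the result column is a factor as in the output above (the small table `res` is made up for illustration):

```r
library(data.table)

# Hypothetical stand-in for the result above: a factor column with NA for "no match"
res <- data.table(Filename = c("1", "1"),
                  Time = c(1, 2),
                  looked_at = factor(c("Cat", NA)))

# "none" is not an existing factor level, so convert to character first ...
res[, looked_at := as.character(looked_at)]
# ... then fill the unmatched rows by reference
res[is.na(looked_at), looked_at := "none"]
```

Converting to character first avoids having to add "none" as a new factor level.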

Used data:

    dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"),
                          Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"),
                          X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"),
                          Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")),
                     .Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")

    lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"),
                         Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"),
                         X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
                         X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
                         Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"),
                         Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")),
                    .Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
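For millions of rows, grouping by every (Filename, Time) pair may still be slow. One possible alternative, which is not part of the answer above, is a non-equi update join; this assumes data.table 1.9.8 or later, where `on` accepts inequality conditions. The sketch below rebuilds the example data as plain character and numeric columns, with the box corners already ordered so that X1 <= X2 and Y1 <= Y2.

```r
library(data.table)

# Example data from the answer, already converted to character / numeric
dat <- data.table(
  Filename = c("1", "1", "1", "2", "2", "3", "3", "3", "3"),
  Time     = c(1, 2, 3, 1, 2, 1, 2, 5, 6),
  X        = c(10, 15, 12, 11, 10, 15, 25, 20, 10),
  Y        = c(10, 20, 25, 15, 10, 11, 12, 15, 10)
)
lu <- data.table(
  Filename = c("1", "1", "3", "3"),
  Item     = c("Cat", "Dog", "House", "Mouse"),
  X1 = c(10, 20, 20, 10), X2 = c(11, 35, 35, 11),
  Y1 = c(10, 13, 13, 10), Y2 = c(11, 35, 35, 11)
)

# Non-equi update join: for every box in 'lu', find the rows of 'dat' with the
# same Filename whose point lies inside the box (borders included) and write
# the item name into those rows by reference.
dat[lu, on = .(Filename, X >= X1, X <= X2, Y >= Y1, Y <= Y2),
    looked_at := i.Item]
```

Rows that fall in no box are left as NA, as in the output above. If boxes overlap, a point inside more than one box ends up with only one of the item names, so this sketch assumes the boxes within a file do not overlap.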
