Calculate the distance between two rows in a data.table

Summary of the problem: I am cleaning a telemetry dataset (i.e. spatial coordinates over time) using the data.table package (version 1.9.5) in R (version) on a Windows 7 PC. Some of the data points are wrong (for example, the telemetry equipment picked up an echo). We can tell these points are erroneous because the fish would have moved farther than is biologically possible, so they stand out as outliers. The actual dataset contains more than 2,000,000 rows of data from 30 individual fish, hence the use of the data.table package.

I remove points that are too far apart (i.e. the distance traveled is greater than a maximum distance). However, I need to recalculate the distance traveled between points after removing a point, because 2-3 erroneous data points are sometimes clustered together. I currently have a for loop that gets the job done, but it is probably far from optimal, and I know I am probably missing some of the powerful tools in the data.table package.

As a technical note, my spatial scale is small enough that Euclidean distance works, and my maximum-distance criterion is biologically reasonable.

Where I have looked for help: I searched through SO and found several helpful answers, but none exactly matches my problem. Specifically, all of the other answers compare only one column of data between rows.

  • This answer compares two rows using data.table , but looks at only one variable.

  • This answer looks promising and uses Reduce , but I could not figure out how to use Reduce with two columns.

  • This answer uses the index function from data.table , but I could not figure out how to use it with the distance function.

  • Finally, this answer demonstrates the roll feature of data.table . However, I could not figure out how to use two variables with it.

Here is my MVCE:

    library(data.table)

    ## Create dummy data.table
    dt <- data.table(fish = 1,
                     time = 1:6,
                     easting = c(1, 2, 10, 11, 3, 4),
                     northing = c(1, 2, 10, 11, 3, 4))
    dt[ , dist := 0]

    maxDist = 5

    ## First pass of calculating distances
    for(index in 2:dim(dt)[1]){
        dt[ index,
            dist := as.numeric(dist(dt[c(index -1, index),
                                       list(easting, northing)]))]
    }

    ## Loop through and remove points until all of the outliers have been
    ## removed for the data.table.
    while(all(dt[ , dist < maxDist]) == FALSE){
        dt <- copy(dt[ - dt[ , min(which(dist > maxDist))], ])
        ## Loops through and recalculates distance after removing outlier
        for(index in 2:dim(dt)[1]){
            dt[ index,
                dist := as.numeric(dist(dt[c(index -1, index),
                                           list(easting, northing)]))]
        }
    }
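For reference, the first pass on its own probably does not need a loop. The following is only a minimal sketch, assuming the rows are already ordered by time within each fish; it computes the Euclidean step distance from coordinate differences with base R's diff(). The column name stepDist is illustrative, chosen so that it does not mask the dist() function.

```r
library(data.table)

## Same dummy data as in the example
dt <- data.table(fish = 1,
                 time = 1:6,
                 easting = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))

## Euclidean distance from the previous row, computed separately per fish.
## The first row of each fish has no predecessor, so it is assigned 0.
dt[ , stepDist := c(0, sqrt(diff(easting)^2 + diff(northing)^2)), by = fish]
```

This only replaces the first pass; the distances still have to be recomputed after each removal, which is the part that makes the loop slow.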
1 answer

I am a little confused about why you keep recalculating the distance (and needlessly copying the data) instead of doing a single pass:

    last = 1
    idx = rep(0, nrow(dt))
    for (curr in 1:nrow(dt)) {
      if (dist(dt[c(curr, last), .(easting, northing)]) <= maxDist) {
        idx[curr] = curr
        last = curr
      }
    }

    dt[idx]
    #   fish time easting northing
    #1:    1    1       1        1
    #2:    1    2       2        2
    #3:    1    5       3        3
    #4:    1    6       4        4
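With 30 fish in the real data, the same single pass would presumably have to run per fish. The sketch below is one possible way to do that, not a definitive implementation: keepWithinDist is a hypothetical helper (not part of data.table) that works on plain coordinate vectors rather than subsetting the data.table on every iteration, and the two-fish dummy data is made up for illustration. Like the loop above, it assumes the first point of each fish is valid and that rows are ordered by time.

```r
library(data.table)

## Hypothetical helper: returns a logical vector marking the points to keep.
## Each point is compared with the last point that was kept.
keepWithinDist <- function(x, y, maxDist) {
    keep <- logical(length(x))
    last <- 1L
    for (curr in seq_along(x)) {
        if (sqrt((x[curr] - x[last])^2 + (y[curr] - y[last])^2) <= maxDist) {
            keep[curr] <- TRUE
            last <- curr
        }
    }
    keep
}

## Illustrative data: two fish with the same track as the example
dt <- data.table(fish = rep(1:2, each = 6),
                 time = rep(1:6, 2),
                 easting = rep(c(1, 2, 10, 11, 3, 4), 2),
                 northing = rep(c(1, 2, 10, 11, 3, 4), 2))
maxDist <- 5

## .I holds the row numbers of dt within each group, so this collects
## the rows to keep for every fish in one grouped call.
keepRows <- dt[ , .I[keepWithinDist(easting, northing, maxDist)], by = fish]$V1
cleaned <- dt[keepRows]
```

For this dummy data, cleaned keeps times 1, 2, 5 and 6 for each fish, matching the single-fish result above.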