Creating an ID variable based on 3 common variables in R

I have the following data frame (below), which is a subset of my complete data set. I need to define the same identifier for each record with the same LAT / LONG and landing date. I previously ordered LANDEDDATE from the earliest and latest dates.

> dput(df2) structure(list(LATITUDE = c(43.35, 43.35, 43.35, 43.35, 43.35, 43.35, 43.35, 43.35, 43.5166, 43.5166, 43.5166, 43.5166, 43.5166, 43.5166, 43.5166, 42.9833, 42.9833, 42.9833, 42.9666, 42.9666 ), LONGITUDE = c(-60.6163, -60.6165, -60.7167, -60.7166, -60.7163, -60.716, -60.7169, -60.7166, -59.9169, -59.9168, -59.9169, -59.9166, -59.9166, -59.916, -59.916, -61.8333, -61.8333, -61.8333, -61.9161, -61.9161), LANDEDDATE = structure(c(11171, 11171, 11183, 11183, 11183, 11183, 11183, 11183, 11192, 11192, 11192, 11192, 11192, 11192, 11192, 11210, 11210, 11210, 11210, 11210), class = "Date"), sppCODE = c(251L, 251L, 251L, 251L, 251L, 251L, 251L, 251L, 251L, 251L, 251L, 251L, 251L, 256L, 251L, 256L, 252L, 251L, 251L, 252L), LIVEW = c(0.337, 0.471, 0.238, 0.772, 0.178, 0.416, 0.535, 0.356, 0.442, 0.663, 0.442, 0.497, 0.276, 0.032, 0.828, 0.035, 0.011, 1.224, 1.025, 0.072), SPECIES = structure(c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("Albacore Tuna", "Bigeye Tuna", "Bluefin Tuna", "Mako", "Porbeagle", "Shark, UNSP", "Swordfish", "Tuna, UNSP", "White Marlin", "Yellowfin Tuna"), class = "factor")), .Names = c("LATITUDE", "LONGITUDE", "LANDEDDATE", "sppCODE", "LIVEW", "SPECIES"), row.names = c(19L, 20L, 13L, 14L, 15L, 16L, 17L, 18L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L), class = "data.frame") 

This is what I have done so far, but I cannot understand the rest.

  df2$setID<-"NA" #I created an empty setID. > head(df2) LATITUDE LONGITUDE LANDEDDATE sppCODE LIVEW SPECIES setID 19 43.35 -60.6163 2000-08-02 251 0.337 Swordfish NA 20 43.35 -60.6165 2000-08-02 251 0.471 Swordfish NA 13 43.35 -60.7167 2000-08-14 251 0.238 Swordfish NA 14 43.35 -60.7166 2000-08-14 251 0.772 Swordfish NA 15 43.35 -60.7163 2000-08-14 251 0.178 Swordfish NA 16 43.35 -60.7160 2000-08-14 251 0.416 Swordfish NA unique<-df2[which(!duplicated(df2[,1:3])),] #This is each entry that are NOT duplicate unique2$setID<-1:13 # Ranked from 1:13 > head(unique) #looks like that LATITUDE LONGITUDE LANDEDDATE sppCODE LIVEW SPECIES setID 19 43.35 -60.6163 2000-08-02 251 0.337 Swordfish 1 20 43.35 -60.6165 2000-08-02 251 0.471 Swordfish 2 13 43.35 -60.7167 2000-08-14 251 0.238 Swordfish 3 14 43.35 -60.7166 2000-08-14 251 0.772 Swordfish 4 15 43.35 -60.7163 2000-08-14 251 0.178 Swordfish 5 16 43.35 -60.7160 2000-08-14 251 0.416 Swordfish 6 rep<-df2[which(duplicated(df2[,1:3])),] #This is all my replicates 

I need to assign the setID of my unique data frame to the corresponding sets (replicated with the same LAT / LONG and LANDEDDATE) in the rep data frame. Any advice would be appreciated!

+4
source share
3 answers

That sounds like a lot of controversy!

Why not create a “key” with something like paste

 paste(df2$LONGITUDE, df2$LATITUDE, df2$LANDEDDATE) 

then you can use factor and force it back to your main goals:

 df2$setID <- as.integer(factor(paste(df2$LONGITUDE, df2$LATITUDE, df2$LANDEDDATE))) 

You can also do this with data.table if your data is large (or even if there is none!).

 library(data.table) df2 <- data.table(df2, key=c('LATITUDE', 'LONGITUDE', 'LANDEDDATE')) df2[, setID := (.GRP), by=c('LATITUDE', 'LONGITUDE', 'LANDEDDATE')] 
+6
source

You can use interaction

 within(df2, {id = interaction(LATITUDE,LONGITUDE,LANDEDDATE)}) LATITUDE LONGITUDE LANDEDDATE sppCODE LIVEW SPECIES id 19 43.35 -60.6163 2000-08-02 251 0.337 Swordfish 43.35.-60.6163.2000-08-02 20 43.35 -60.6165 2000-08-02 251 0.471 Swordfish 43.35.-60.6165.2000-08-02 13 43.35 -60.7167 2000-08-14 251 0.238 Swordfish 43.35.-60.7167.2000-08-14 14 43.35 -60.7166 2000-08-14 251 0.772 Swordfish 43.35.-60.7166.2000-08-14 15 43.35 -60.7163 2000-08-14 251 0.178 Swordfish 43.35.-60.7163.2000-08-14 ....... 

EDIT add drop = TRUE (more efficient) (@Matthew comment) and change separator (smarter)

 res1 <- within(df2, {id = interaction( sprintf("%5.2f", LATITUDE), sprintf("%5.2f", LONGITUDE), LANDEDDATE,drop=TRUE,sep=':')}) res1 LATITUDE LONGITUDE LANDEDDATE sppCODE LIVEW SPECIES id 19 43.3500 -60.6163 2000-08-02 251 0.337 Swordfish 43.35:-60.62:2000-08-02 20 43.3500 -60.6165 2000-08-02 251 0.471 Swordfish 43.35:-60.62:2000-08-02 13 43.3500 -60.7167 2000-08-14 251 0.238 Swordfish 43.35:-60.72:2000-08-14 14 43.3500 -60.7166 2000-08-14 251 0.772 Swordfish 43.35:-60.72:2000-08-14 15 43.3500 -60.7163 2000-08-14 251 0.178 Swordfish 43.35:-60.72:2000-08-14 16 43.3500 -60.7160 2000-08-14 251 0.416 Swordfish 43.35:-60.72:2000-08-14 17 43.3500 -60.7169 2000-08-14 251 0.535 Swordfish 43.35:-60.72:2000-08-14 18 43.3500 -60.7166 2000-08-14 251 0.356 Swordfish 43.35:-60.72:2000-08-14 6 43.5166 -59.9169 2000-08-23 251 0.442 Swordfish 43.52:-59.92:2000-08-23 7 43.5166 -59.9168 2000-08-23 251 0.663 Swordfish 43.52:-59.92:2000-08-23 
+7
source

The id function in plyr designed to do this as efficiently as possible:

 library(plyr) id(df2[c("LATITUDE", "LONGITUDE", "LANDEDDATE")]) id(df2[c("LATITUDE", "LONGITUDE", "LANDEDDATE")], drop = TRUE) 

It has a property, if you then order the data frame of this variable, it will be in the same order if you ordered it by lat, log and date.

Here is a small guideline, and not that it can make a big difference if your real data is much more than that:

 library(microbenchmark) df3 <- df2[c("LATITUDE", "LONGITUDE", "LANDEDDATE")] microbenchmark( id(df3), id(df3, drop = TRUE), interaction(df3), interaction(df3, drop = TRUE)) # Unit: microseconds # expr min lq median uq max neval # id(df3) 366 386 397 412 575 100 # id(df3, drop = TRUE) 421 443 460 476 1823 100 # interaction(df3) 622 657 672 698 987 100 # interaction(df3, drop = TRUE) 1006 1053 1079 1108 2210 100 
+2
source

All Articles