Purrr-fection: in search of an elegant solution for conditional data frame operations with purrr

Background

I have a problem for which a number of solutions are possible, but I am convinced there is an elegant solution using purrr that I have yet to find.

Code example

I have a large data frame; a small example is shown below:

 library(tibble)
 library(ggmap)
 library(purrr)
 library(dplyr)

 # Define example data
 df <- frame_data(
   ~Street,                ~City,        ~State,     ~Zip,  ~lon,      ~lat,
   "226 W 46th St",        "New York",   "New York", 10036, -73.9867,  40.75902,
   "5th Ave",              "New York",   "New York", 10022, NA,        NA,
   "75 Broadway",          "New York",   "New York", 10006, -74.01205, 40.70814,
   "350 5th Ave",          "New York",   "New York", 10118, -73.98566, 40.74871,
   "20 Sagamore Hill Rd",  "Oyster Bay", "New York", 11771, NA,        NA,
   "45 Rockefeller Plaza", "New York",   "New York", 10111, -73.97771, 40.75915
 )

Task

I would like to geotag all locations for which the lon and lat columns are currently NA. There are many ways I could do this, one of which is shown below:

 # Safe code is great code
 safe_geocode <- safely(geocode)

 # Identify data to be geotagged by the absence of lon and lat
 data_to_be_geotagged <- df %>% filter(is.na(lon) | is.na(lat))

 # Geotag addresses of the missing data points
 fullAddress <- paste(data_to_be_geotagged$Street,
                      data_to_be_geotagged$City,
                      data_to_be_geotagged$State,
                      data_to_be_geotagged$Zip,
                      sep = ", ")

 fullAddress %>%
   map(safe_geocode) %>%
   map("result") %>%
   plyr::ldply()

Question

While I can make the above work, and even merge the newly identified lon and lat coordinates back into the original data frame, the whole routine feels dirty. I am convinced there is an elegant way to use pipes and purrr to walk through the data frame and conditionally geotag locations based on the absence of lon and lat.

I have gone down several rabbit holes, including purrr::pmap in an attempt to iterate over several columns in parallel while constructing the full address (as well as rowwise() and by_row()). However, I cannot come up with anything that qualifies as an elegant solution.

Any insight would be appreciated.

+7
r geocoding purrr ggmap
3 answers

Realistically, you want to avoid calling geocode more than necessary because it is slow, and if you use Google, you only have 2500 requests per day. Thus, it is best to fill both columns from the same call, which can be done with a list column, by building a new version of the data.frame with do, or with a self-join.


1. With a list column

With a list column, you create a new version of lon and lat with ifelse, geocoding if there are NAs and otherwise just copying the existing values. Afterwards, get rid of the old versions of the columns and unnest the new ones:

 library(dplyr)
 library(ggmap)
 library(tidyr)    # for `unnest`

 # Evaluate each row separately
 df %>% rowwise() %>%
   # Add a list column. If lon or lat are NA,
   mutate(data = ifelse(any(is.na(c(lon, lat))),
                        # return a data.frame of the geocoded results,
                        list(geocode(paste(Street, City, State, Zip))),
                        # else return a data.frame of the existing columns.
                        list(data_frame(lon = lon, lat = lat)))) %>%
   # Remove old columns
   select(-lon, -lat) %>%
   # Unnest the newly created ones from the list column
   unnest(data)

 ## # A tibble: 6 × 6
 ##                  Street       City    State   Zip       lon      lat
 ##                   <chr>      <chr>    <chr> <dbl>     <dbl>    <dbl>
 ## 1         226 W 46th St   New York New York 10036 -73.98670 40.75902
 ## 2               5th Ave   New York New York 10022 -73.97491 40.76167
 ## 3           75 Broadway   New York New York 10006 -74.01205 40.70814
 ## 4           350 5th Ave   New York New York 10118 -73.98566 40.74871
 ## 5   20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
 ## 6  45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915

2. Using do

do, on the other hand, creates a wholly new data.frame from pieces of the old one. It requires slightly clunky $ notation, with . representing the grouped data.frame piped in. Using if and else instead of ifelse avoids having to nest the results in lists (which they would have had to be above, anyway).

 # Evaluate each row separately
 df %>% rowwise() %>%
   # Make a new data.frame from the first four columns
   # and the geocode results or the existing lon/lat
   do(bind_cols(.[1:4], if(any(is.na(c(.$lon, .$lat)))){
     geocode(paste(.[1:4], collapse = ' '))
   } else {
     .[5:6]
   }))

which returns the same as the first version.


3. On a subset, recombining with a self-join

If the ifelse is too confusing, you can simply geocode a subset and then recombine by row-binding it to the anti_join of itself, i.e. all the rows that are in df but not in the subset:

 df %>%
   filter(is.na(lon) | is.na(lat)) %>%
   select(1:4) %>%
   bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>%
   bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

which returns the same, but with the newly geocoded rows at the top. The same approach works with a list column or do, but since there is no need to combine two sets of columns, just bind_cols will do the trick.


4. On a subset with mutate_geocode

ggmap actually includes a mutate_geocode function that will add lon and lat columns when passed a data.frame and an address column. There is one catch: it cannot take more than a column name for the address, and therefore requires a single column containing the entire address. Thus, although this version could be quite neat, having to create and then delete an extra full-address column makes it clunky:

 df %>%
   filter(is.na(lon) | is.na(lat)) %>%
   select(1:4) %>%
   mutate(address = paste(Street, City, State, Zip)) %>%    # make an address column
   mutate_geocode(address) %>%
   select(-address) %>%                                     # get rid of address column
   bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

 ##                 Street       City    State   Zip       lon      lat
 ## 1              5th Ave   New York New York 10022 -73.97491 40.76167
 ## 2  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
 ## 3 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915
 ## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
 ## 5          75 Broadway   New York New York 10006 -74.01205 40.70814
 ## 6        226 W 46th St   New York New York 10036 -73.98670 40.75902

5. Base R

Base R can assign directly to a subset, which makes the idiom much simpler here, even if it requires a lot of subsetting:

 df[is.na(df$lon) | is.na(df$lat), c('lon', 'lat')] <-
   geocode(paste(df$Street, df$City, df$State, df$Zip)[is.na(df$lon) | is.na(df$lat)])

The results are the same as in the first version.


All versions only call geocode twice.

Note that while it is possible to use purrr for this job, it is not particularly better suited than regular dplyr. purrr excels at dealing with lists, and while a list column is one option, it does not really have to be manipulated here.
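For completeness, here is a sketch of what a purrr::pmap version might look like — purely illustrative, with a stub geocode_fn standing in for ggmap::geocode so the example runs without API calls; swap in geocode itself for real use:

 library(dplyr)
 library(purrr)
 library(tidyr)

 # Stand-in for ggmap::geocode so this sketch runs offline;
 # replace with geocode for actual geotagging.
 geocode_fn <- function(address) data.frame(lon = 0, lat = 0)

 df %>%
   # Walk the six columns in parallel, row by row
   mutate(coords = pmap(list(Street, City, State, Zip, lon, lat),
                        function(Street, City, State, Zip, lon, lat) {
                          if (is.na(lon) || is.na(lat)) {
                            geocode_fn(paste(Street, City, State, Zip))
                          } else {
                            data.frame(lon = lon, lat = lat)
                          }
                        })) %>%
   select(-lon, -lat) %>%
   unnest(coords)

This is essentially version 1 with pmap in place of rowwise, and it still only calls the geocoder for the NA rows.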

+6

I'm not sure I would reach for `purrr`, but the following code works here:

 df <- frame_data(
   ~Street,                ~City,        ~State,     ~Zip,  ~lon,      ~lat,
   "226 W 46th St",        "New York",   "New York", 10036, -73.9867,  40.75902,
   "5th Ave",              "New York",   "New York", 10022, NA,        NA,
   "75 Broadway",          "New York",   "New York", 10006, -74.01205, 40.70814,
   "350 5th Ave",          "New York",   "New York", 10118, -73.98566, 40.74871,
   "20 Sagamore Hill Rd",  "Oyster Bay", "New York", 11771, NA,        NA,
   "45 Rockefeller Plaza", "New York",   "New York", 10111, -73.97771, 40.75915
 )

 df2 <- df %>%
   filter(is.na(lon) | is.na(lat)) %>%
   group_by(Street, City, State) %>%  # not really necessary, but it suppresses a warning
   mutate(lon = ifelse(is.na(lon) | is.na(lat),
                       geocode(paste(Street, City, State, sep = " ")),
                       0)) %>%
   mutate(lat = ifelse(is.na(lon) | is.na(lat),
                       rev(geocode(paste(Street, City, State, sep = " "))),
                       0))

If you just want the partial output, as in the code example above:

 as.data.frame(df2)[, 5:6]
        lon       lat
 1 40.77505 -73.96515
 2 40.88259 -73.50538

Or include all columns:

 as.data.frame(df2)
                Street       City    State   Zip      lon       lat
 1             5th Ave   New York New York 10022 40.77505 -73.96515
 2 20 Sagamore Hill Rd Oyster Bay New York 11771 40.88259 -73.50538

And if you want to combine your original data with the new data, you can do the following:

 as.data.frame(rbind(filter(df, !is.na(lon) | !is.na(lat)), df2))
                 Street       City    State   Zip       lon       lat
 1        226 W 46th St   New York New York 10036 -73.98670  40.75902
 2          75 Broadway   New York New York 10006 -74.01205  40.70814
 3          350 5th Ave   New York New York 10118 -73.98566  40.74871
 4 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915
 5              5th Ave   New York New York 10022  40.77505 -73.96515
 6  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.96515  40.77505

Or you can do it all in one pipeline, as follows (this preserves the original order):

 df2 <- df %>%
   # group_by(Street, City, State) %>%  # uncomment if you want to suppress the warning
   mutate(lon = ifelse(is.na(lon) | is.na(lat),
                       geocode(paste(Street, City, State, sep = " ")),
                       lon)) %>%
   mutate(lat = ifelse(is.na(lon) | is.na(lat),
                       rev(geocode(paste(Street, City, State, sep = " "))),
                       lat))

 as.data.frame(df2)
                 Street       City    State   Zip       lon       lat
 1        226 W 46th St   New York New York 10036 -73.98670  40.75902
 2              5th Ave   New York New York 10022 -73.98670  40.75902
 3          75 Broadway   New York New York 10006 -74.01205  40.70814
 4          350 5th Ave   New York New York 10118 -73.98566  40.74871
 5  20 Sagamore Hill Rd Oyster Bay New York 11771  40.75902 -73.98670
 6 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915
+3

Using dplyr:

 df %>%
   mutate(
     lon = case_when(
       is.na(lon) ~ geocode(paste(Street, City, State, Zip))[, 1],
       TRUE ~ lon),
     lat = case_when(
       is.na(lat) ~ geocode(paste(Street, City, State, Zip))[, 2],
       TRUE ~ lat
     )
   )
0
