In fact, you want to avoid calling geocode more than necessary because it is slow, and if you use Google, you are limited to 2500 requests per day. Thus, it is best to fill both columns from the same call, which can be done with a list column, by building a new version of the data.frame with do, or with a self-join.
1. With a list column
With a list column, you make a new version of lon and lat with ifelse, geocoding if there are NAs and otherwise just copying the existing values. Afterwards, get rid of the old versions of the columns and unnest the new ones:
    library(dplyr)
    library(ggmap)
    library(tidyr)
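Only the library calls of this version appear here, so the following is a minimal sketch of how the list-column approach could look, not the original code. It assumes a toy df with Street, City, State, Zip, lon and lat columns, and it replaces ggmap's geocode with a hypothetical fake_geocode stub that returns fixed coordinates, so that it runs offline and without an API key.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for ggmap::geocode (an assumption, so the sketch runs
# offline): returns a one-row data.frame with lon and lat and counts its calls.
calls <- 0
fake_geocode <- function(address) {
  calls <<- calls + 1
  data.frame(lon = -88.79, lat = 39.41)
}

# Toy data (assumed); the second row is missing its coordinates
df <- data.frame(
  Street = c("1 Main St", "2 Oak Ave"),
  City   = c("Springfield", "Shelbyville"),
  State  = c("IL", "IL"),
  Zip    = c("62701", "62565"),
  lon    = c(-89.65, NA),
  lat    = c(39.80, NA),
  stringsAsFactors = FALSE
)

result <- df %>%
  rowwise() %>%
  # One list element per row: geocode if a coordinate is missing,
  # otherwise wrap the existing pair in a one-row data.frame
  mutate(lon_lat = ifelse(is.na(lon) | is.na(lat),
                          list(fake_geocode(paste(Street, City, State, Zip))),
                          list(data.frame(lon = lon, lat = lat)))) %>%
  ungroup() %>%
  select(-lon, -lat) %>%  # drop the old columns
  unnest(lon_lat)         # expand the list column into the new lon and lat

result
```

Because the test is a single value for each row, ifelse only evaluates the geocoding branch for rows that actually have a missing coordinate. With the real geocode, swap the stub for ggmap's function.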
2. Using do
do, on the other hand, creates a completely new data.frame from pieces of the old one. It requires the slightly clunky .$ notation, because . represents the grouped data.frame that is piped in. Using if and else instead of ifelse avoids nesting the results in lists (which they would have to be in the first version, anyway).
    # Evaluate each row separately
    df %>% rowwise() %>%
        # Make a new data.frame from the first four columns and the
        # geocode results or existing lon/lat
        do(bind_cols(.[1:4], if (any(is.na(c(.$lon, .$lat)))) {
            geocode(paste(.[1:4], collapse = ' '))
        } else {
            .[5:6]
        }))
which returns the same as the first version.
3. On a subset, recombining with a self-join
If the ifelse is too confusing, you can simply geocode a subset and then recombine by binding its rows to the result of anti_join, i.e. all the rows that are in df but not in the subset (which the pipe passes along as .):
    df %>% filter(is.na(lon) | is.na(lat)) %>%
        select(1:4) %>%
        bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>%
        bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
which returns the same, but with the newly geocoded rows at the top. The same subsetting approach would also work with a list column or do, but since there is no need to combine two sets of columns here, bind_cols alone does the trick.
4. On a subset with mutate_geocode
ggmap actually includes a mutate_geocode function that adds lon and lat columns when passed a data.frame and a column of addresses. It has a shortcoming: it cannot accept more than a column name for the address, and therefore requires a single column holding the entire address. Thus, although this version could be quite nice, it requires creating and then deleting an extra column with the whole address, which makes it less concise:
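    df %>% filter(is.na(lon) | is.na(lat)) %>%
        select(1:4) %>%
        mutate(address = paste(Street, City, State, Zip)) %>%  # make an address column
        mutate_geocode(address) %>%                            # adds lon and lat
        select(-address) %>%                                   # get rid of the address column
        bind_rows(anti_join(df, ., by = c('Street', 'Zip')))   # recombine as in version 3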
5. Base R
Base R can assign to a subset directly, which makes the idiom much simpler here, even if it requires a lot of subsetting:
    df[is.na(df$lon) | is.na(df$lat), c('lon', 'lat')] <-
        geocode(paste(df$Street, df$City, df$State, df$Zip)[is.na(df$lon) | is.na(df$lat)])
The results are consistent with the first version.
All versions geocode only the rows with missing coordinates, which means just two lookups for this data.
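One way to check how many lookups a version makes, without spending real requests, is to swap in a counting stub. The sketch below does this for the base R version; fake_geocode, the lookups counter, and the toy df are assumptions made for illustration and are not part of ggmap.

```r
# Hypothetical offline stand-in for ggmap::geocode: accepts a character vector,
# returns one lon/lat row per address, and tallies how many addresses it saw.
lookups <- 0
fake_geocode <- function(addresses) {
  lookups <<- lookups + length(addresses)
  data.frame(lon = rep(-88.79, length(addresses)),
             lat = rep(39.41, length(addresses)))
}

# Toy data (assumed): two of the three rows lack coordinates
df <- data.frame(
  Street = c("1 Main St", "2 Oak Ave", "3 Elm Rd"),
  City   = c("Springfield", "Shelbyville", "Capital City"),
  State  = c("IL", "IL", "IL"),
  Zip    = c("62701", "62565", "62702"),
  lon    = c(-89.65, NA, NA),
  lat    = c(39.80, NA, NA),
  stringsAsFactors = FALSE
)

# Compute the row filter once and reuse it on both sides of the assignment
needs_geocode <- is.na(df$lon) | is.na(df$lat)
df[needs_geocode, c("lon", "lat")] <- fake_geocode(
  paste(df$Street, df$City, df$State, df$Zip)[needs_geocode]
)

lookups  # number of addresses that were looked up
```

Storing the filter in needs_geocode also avoids repeating the is.na() condition, at the cost of one extra variable.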
Note that although you could use purrr for the job, it is not particularly better suited than regular dplyr. purrr excels at working with lists, and although a list column is one of the options here, it does not really need to be manipulated.