Here is a simple solution that can help you get started. It uses the city and country data that ships with the maps package. If you can find a better database, switching to it should only require small changes to the code.
    library(maps)
    library(plyr)

    # Load data from package maps
    data(world.cities)

    # Create test data
    aa <- c(
      "Mechanical and Production Engineering Department, National University of Singapore.",
      "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK",
      "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
      "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
    )

    # Remove punctuation from data
    # (gsub takes pattern, replacement, x, in that order)
    caa <- gsub("[[:punct:]]", "", aa)

    # Split data at word boundaries
    saa <- strsplit(caa, " ")

    # Match on cities in world.cities$name
    # Assumes that if there are multiple matches, the last takes precedence, i.e. max()
    llply(saa, function(x) x[max(which(x %in% world.cities$name))])

    # Match on countries in world.cities$country.etc
    llply(saa, function(x) x[which(x %in% world.cities$country.etc)])
This is the result for cities:
    [[1]]
    [1] "Singapore"

    [[2]]
    [1] "Cambridge"

    [[3]]
    [1] "Cambridge"

    [[4]]
    [1] "Indianapolis"
And the result for countries:
    [[1]]
    [1] "Singapore"

    [[2]]
    [1] "UK"

    [[3]]
    [1] "UK"

    [[4]]
    character(0)
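With a bit of data cleansing, you may be able to fix cases like the fourth address, where no country is matched because the text contains only a US state abbreviation and no country name.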
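One possible cleansing step, as a rough sketch only: fall back to the `state.abb` vector that ships with base R when no country name is found. The helper name `infer_country`, the small `countries` vector (a stand-in for `world.cities$country.etc`), and the `"USA"` label are all assumptions made for this illustration, so adjust them to whatever your database uses.

```r
# Sketch: infer a country when an address only contains a US state abbreviation.
# state.abb is part of base R (datasets package).
# infer_country() is a hypothetical helper, not part of maps or plyr.
infer_country <- function(words, countries) {
  hit <- words[words %in% countries]
  if (length(hit) > 0) return(hit)
  # Case-sensitive match, so the ordinary word "in" is not read as Indiana
  if (any(words %in% state.abb)) return("USA")  # "USA" label is an assumption
  character(0)
}

# Hypothetical stand-in for world.cities$country.etc
countries <- c("Singapore", "UK", "USA")

addr  <- "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
words <- strsplit(gsub("[[:punct:]]", "", addr), " ")[[1]]
infer_country(words, countries)
```

Note that some state abbreviations ("OR", "ME", "OK", "HI") can also appear as capitalised words or acronyms, so this fallback may produce false positives on text that is not a postal address.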
Andrie