Extract country name from author affiliations (PubMed)

I am currently looking into extracting country names from author affiliations (PubMed articles). My sample data is as follows:

Mechanical and Production Engineering Department, National University of Singapore.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.

Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285.

At first I tried removing punctuation, splitting the vector into words, and then comparing the words against a list of country names from Wikipedia, but I did not succeed.
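Roughly, the approach I tried looks like this (a sketch, with a small hand-made vector standing in for the Wikipedia list):

    # Roughly what I tried (tiny hand-made vector in place of the Wikipedia list)
    countries <- c("Singapore", "United Kingdom", "United States")
    affiliation <- "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK"
    words <- strsplit(gsub("[[:punct:]]", "", affiliation), " ")[[1]]
    words[words %in% countries]
    # character(0) -- abbreviations like "UK" are not in the list, so exact
    # matching against full country names misses them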

Can anyone suggest a better way to do this? I would prefer a solution in R, since I need to continue the analysis and generate graphics in R.

Tags: text, r, nlp

2 answers

Here is a simple solution that should help you get started. It uses the database of city and country data in the maps package. If you can get hold of a better database, switching it in should only require small changes to the code.

    library(maps)
    library(plyr)

    # Load data from package maps
    data(world.cities)

    # Create test data
    aa <- c(
      "Mechanical and Production Engineering Department, National University of Singapore.",
      "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK",
      "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
      "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
    )

    # Remove punctuation from data
    caa <- gsub("[[:punct:]]", "", aa)

    # Split data at word boundaries
    saa <- strsplit(caa, " ")

    # Match on cities in world.cities
    # Assumes that if there are multiple matches, the last takes precedence, i.e. max()
    llply(saa, function(x) x[max(which(x %in% world.cities$name))])

    # Match on countries in world.cities
    llply(saa, function(x) x[which(x %in% world.cities$country.etc)])

This is the result for cities:

 [[1]] [1] "Singapore" [[2]] [1] "Cambridge" [[3]] [1] "Cambridge" [[4]] [1] "Indianapolis" 

And the result for countries:

 [[1]] [1] "Singapore" [[2]] [1] "UK" [[3]] [1] "UK" [[4]] character(0) 

With a bit of data cleaning, you should be able to build something useful from this.
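As a possible refinement (my own sketch, not part of the answer above): when no country name matches directly, you could fall back to the country of the matched city, since world.cities also records each city's country. The get_country helper name is hypothetical.

    # Sketch: fall back from a country match to the matched city's country.
    # get_country is a hypothetical helper, not from the answer above.
    get_country <- function(x) {
      country <- x[x %in% world.cities$country.etc]
      if (length(country) > 0) return(country[1])
      hits <- which(x %in% world.cities$name)
      if (length(hits) == 0) return(NA_character_)
      city <- x[max(hits)]
      world.cities$country.etc[match(city, world.cities$name)]
    }

    llply(saa, get_country)
    # The fourth entry should now resolve to "USA" via the Indianapolis lookup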


One way could be to split the string to isolate the geographic information (for example, by deleting everything up to the first comma), and then send the result to a geocoding service.
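A minimal sketch of that splitting step (here keeping only the text after the last comma, one variant of the idea; whether the first or the last comma works better depends on your data):

    # Sketch: keep only the text after the last comma as the geographic part
    aa <- c(
      "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK",
      "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
    )
    trimws(sub(".*,", "", aa))
    # [1] "UK"        "IN 46285."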

For example, the Google Geocoding API lets you send an address and get back its location and related geographic information, such as the country. I do not think there is a ready-made R package for this, but you can find some functions for it, for example:

Geocoding in R with Google Maps
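For illustration, a hedged sketch of calling the Geocoding API directly from R (assumes the jsonlite package and an API key; geocode_country is a hypothetical helper, and the exact shape of the parsed JSON may differ from this):

    # Sketch: query the Google Geocoding API and pull out the country component.
    # geocode_country is a hypothetical helper; an API key is assumed.
    library(jsonlite)

    geocode_country <- function(address, api_key) {
      url <- paste0(
        "https://maps.googleapis.com/maps/api/geocode/json?address=",
        URLencode(address, reserved = TRUE),
        "&key=", api_key
      )
      res <- fromJSON(url)
      if (res$status != "OK") return(NA_character_)
      # address_components is a list of data frames, one per result;
      # pick the component whose types include "country"
      comps <- res$results$address_components[[1]]
      comps$long_name[vapply(comps$types, function(t) "country" %in% t, logical(1))]
    }

    geocode_country("Department of Zoology, Cambridge, UK", api_key = "YOUR_KEY")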

There are also libraries in other languages, such as Ruby:

http://geokit.rubyforge.org/

It also depends on the number of observations you have; the free Google API, for example, is limited to about 200 addresses per IP per day, if I remember correctly.
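If the quota is a concern, one simple option is to throttle the requests, for example (a sketch, reusing the hypothetical geocode_country helper from above):

    # Sketch: pause between calls to stay under the service's rate limits
    addresses <- c(
      "Department of Zoology, Cambridge, UK",
      "Eli Lilly and Company, Indianapolis, IN 46285"
    )
    countries <- lapply(addresses, function(a) {
      Sys.sleep(0.5)  # simple rate limiting between requests
      geocode_country(a, api_key = "YOUR_KEY")
    })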
