I would use the natural language processing tools that NLTK offers to extract entities.
Below is an example (largely based on this answer) that reads a file line by line, splits each line into sentences, tokenizes and POS-tags them, and then recursively searches each chunk tree for NE (named entity) subtrees:
```python
import nltk

def extract_entity_names(t):
    # Recursively collect the text of every 'NE' subtree in a chunk tree.
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        # Split the line into sentences, tokenize and POS-tag each one,
        # then chunk the tagged sentences into named entity trees.
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))
        print(entities)
```
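Note that the tokenizer, tagger, and chunker each rely on pretrained models that are not bundled with the nltk package itself. If you have never downloaded them, a one-time setup along these lines should work (these are the classic resource names; recent NLTK releases may ask for slightly different ones in their error messages):

```python
import nltk

# One-time model downloads used by the example above.
nltk.download('punkt')                       # sentence and word tokenizer models
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger model
nltk.download('maxent_ne_chunker')           # named entity chunker model
nltk.download('words')                       # word list the chunker depends on
```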
For sample.txt containing:
```
Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)
```
it prints:
```
['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']
```
The solution is not perfect, but it may be a good start for you.
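If you also want the entity types rather than a single generic NE label, you can chunk with binary=False (the default), which labels subtrees as PERSON, GPE, ORGANIZATION, and so on. Here is a minimal sketch of that variant; the helper name and the example sentence are just for illustration:

```python
import nltk

def extract_typed_entities(tree):
    # Collect (entity text, label) pairs from a non-binary chunk tree.
    # The root is labeled 'S'; every other labeled subtree is an entity chunk.
    pairs = []
    for subtree in tree.subtrees():
        if subtree.label() != 'S':
            pairs.append((' '.join(word for word, tag in subtree.leaves()),
                          subtree.label()))
    return pairs

sentence = "My location is Eugene, Oregon."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))  # binary=False by default
print(extract_typed_entities(tree))  # e.g. [('Eugene', 'GPE'), ('Oregon', 'GPE')]
```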