I would use the natural language processing tools that NLTK offers to extract entities.
Below is an example (largely based on this answer) that reads a file line by line, splits each line into sentences, tokenizes and POS-tags them, and then recursively searches each chunk tree for NE (named entity) subtrees:
```python
import nltk

def extract_entity_names(t):
    # Recursively collect the text of every 'NE' subtree in a chunk tree.
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        # Split the line into sentences, tokenize and POS-tag each one,
        # then chunk the tagged sentences into named entity trees.
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))
        print(entities)
```
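Note that the tokenizer, tagger, and chunker each rely on pretrained models that are not bundled with the nltk package itself. If you have never downloaded them, a one-time setup along these lines should work (these are the classic resource names; recent NLTK releases may ask for slightly different ones in their error messages):

```python
import nltk

# One-time model downloads used by the example above.
nltk.download('punkt')                       # sentence and word tokenizer models
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger model
nltk.download('maxent_ne_chunker')           # named entity chunker model
nltk.download('words')                       # word list the chunker depends on
```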
For sample.txt containing:
```
Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)
```
it prints:
```
['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']
```
The solution is not perfect, but it may be a good start for you.
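If you also want the entity types rather than a single generic NE label, you can chunk with binary=False (the default), which labels subtrees as PERSON, GPE, ORGANIZATION, and so on. Here is a minimal sketch of that variant; the helper name and the example sentence are just for illustration:

```python
import nltk

def extract_typed_entities(tree):
    # Collect (entity text, label) pairs from a non-binary chunk tree.
    # The root is labeled 'S'; every other labeled subtree is an entity chunk.
    pairs = []
    for subtree in tree.subtrees():
        if subtree.label() != 'S':
            pairs.append((' '.join(word for word, tag in subtree.leaves()),
                          subtree.label()))
    return pairs

sentence = "My location is Eugene, Oregon."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))  # binary=False by default
print(extract_typed_entities(tree))  # e.g. [('Eugene', 'GPE'), ('Oregon', 'GPE')]
```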