Extract nationalities and countries from text

I want to extract all mentions of countries and nationalities from text using NLTK. I used POS tagging and collected every chunk labeled with the GPE tag, but the results were unsatisfactory.

 import nltk

 abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO. Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls)."

 # Tokenize, POS-tag, and chunk named entities, then keep GPE chunks.
 sent = nltk.tokenize.wordpunct_tokenize(abstract)
 pos_tag = nltk.pos_tag(sent)
 nes = nltk.ne_chunk(pos_tag)

 places = []
 for ne in nes:
     if type(ne) is nltk.tree.Tree:
         if ne.label() == 'GPE':
             places.append(u' '.join([i[0] for i in ne.leaves()]))
 if len(places) == 0:
     places.append("N/A")

Results:

 ['Thyroid', 'Australian', 'Caucasian', 'Graves'] 

Some of these are nationalities, but others are just ordinary nouns.

So what am I doing wrong or is there another way to extract such information?

+6
4 answers

So, after the helpful comments, I dug into various NER tools to find the one best at recognizing nationality and country mentions, and found that spaCy has a NORP entity label (nationalities, religious or political groups) that extracts nationalities effectively. https://spacy.io/docs/usage/entity-recognition

+5

If you want country names to be extracted, you need NER tags, not POS tags.

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into predefined categories, such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Go for the Stanford NER tagger!

 from nltk.tag.stanford import NERTagger

 # Point the tagger at a trained Stanford NER model and the Stanford NER jar.
 st = NERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
 tagging = st.tag(text.split())
+2

Here's geograpy, a library that uses NLTK to perform entity extraction. It keeps a gazetteer of geographic place names and looks the extracted entities up against it to find matching places and locations. Browse the docs for more usage information.
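The gazetteer-lookup idea can be sketched in plain Python; the tiny hand-made `GAZETTEER` dict below is only a stand-in for geograpy's much larger database of geographic names:

```python
import re

# Toy stand-in for a real gazetteer of countries and nationalities.
GAZETTEER = {
    "australia": "country",
    "australian": "nationality",
    "france": "country",
    "french": "nationality",
}

def find_places(text):
    """Return (token, kind) pairs for tokens found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [(t, GAZETTEER[t.lower()]) for t in tokens if t.lower() in GAZETTEER]

print(find_places("pooled DNA from an Australian Caucasian discovery cohort"))
# → [('Australian', 'nationality')]
```

Pure lookup is fast and predictable, but unlike a statistical NER model it only finds names that are already in the gazetteer.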

 from geograpy import extraction

 e = extraction.Extractor(text="Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO. Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls).")
 e.find_entities()
 print(e.places())
+1

You can use spaCy for NER. It gives better results than NLTK.

 import spacy

 nlp = spacy.load('en_core_web_sm')
 doc = nlp(u"Apple is opening its first big office in San Francisco and California.")

 # Each entity carries a label such as ORG, GPE (countries/cities), or NORP (nationalities).
 print([(ent.text, ent.label_) for ent in doc.ents])
0
