I want to extract all mentions of countries and nationalities from a text using NLTK. I ran POS tagging followed by named-entity chunking and collected every chunk labeled GPE, but the results are not satisfactory.
import nltk

abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

# Tokenize, POS-tag, then run the named-entity chunker
sent = nltk.tokenize.wordpunct_tokenize(abstract)
pos_tag = nltk.pos_tag(sent)
nes = nltk.ne_chunk(pos_tag)

# Keep only the chunks labeled as geo-political entities (GPE)
places = []
for ne in nes:
    if type(ne) is nltk.tree.Tree:
        if (ne.label() == 'GPE'):
            places.append(u' '.join([i[0] for i in ne.leaves()]))

if len(places) == 0:
    places.append("N/A")
Results:
['Thyroid', 'Australian', 'Caucasian', 'Graves']
Some of them are nationalities, while others are just nouns.
What am I doing wrong, or is there another way to extract this kind of information?
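One direction I am considering, sketched here only as an assumption and not as a verified fix: skip the chunker for this task and match tokens against a lookup table of country names and nationalities. The `GAZETTEER` dictionary and the `find_country_mentions` function below are hypothetical names made up for illustration, and the table is deliberately tiny; a usable one would have to be built from a complete country and demonym list.

```python
import re

# Hypothetical, hand-made gazetteer: country name or nationality -> country.
# This is an illustrative stub, not a complete list.
GAZETTEER = {
    "australia": "Australia",
    "australian": "Australia",
    "china": "China",
    "chinese": "China",
    "germany": "Germany",
    "german": "Germany",
}


def find_country_mentions(text):
    """Return (matched word, country) pairs in order of appearance."""
    mentions = []
    # Split on runs of letters; single-word names only in this sketch.
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0)
        country = GAZETTEER.get(word.lower())
        if country is not None:
            mentions.append((word, country))
    return mentions


sample = "pooled DNA from an Australian Caucasian discovery cohort with Graves' disease"
print(find_country_mentions(sample))
```

With this table, words such as "Thyroid", "Caucasian" and "Graves" are ignored simply because they are not listed. The sketch does not handle multi-word names such as "New Zealand", and it cannot tell a nationality from an identical ordinary word, so it would probably need to be combined with the tagger output.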