Getting a large list of nouns (or adjectives) in Python using NLTK; or Python Mad Libs

As in this question, I'm interested in getting a large list of words by part of speech (a long list of nouns, or a list of adjectives) to be used programmatically elsewhere. That answer offers a solution using the WordNet database (in SQL form).

Is there any way to get such a list using the tools built into Python's NLTK? I could take a large pile of text, tag it, and then save the nouns and adjectives. But given the dictionaries and other built-in resources, is there a smarter way to simply extract the words that are already encoded as nouns/adjectives (whatever) in the NLTK datasets?

Thanks.

python machine-learning nltk
3 answers

It is worth noting that WordNet is actually one of the corpora included in the NLTK downloader by default, so you could just use the solution you already found without reinventing the wheel.

For example, you can do something like this to get all the noun synsets:

    from nltk.corpus import wordnet as wn

    # Iterate over every noun synset in WordNet
    for synset in wn.all_synsets('n'):
        print(synset)

    # Or, equivalently:
    for synset in wn.all_synsets(wn.NOUN):
        print(synset)

This will give you every noun in WordNet, and it even groups them into synsets so you can be sure they are used in the right context.
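If you want plain words rather than Synset objects, here is a minimal sketch (assuming NLTK 3.x with the WordNet corpus downloaded) that collects the unique lemma names; note that multi-word lemmas come back with underscores:

    from nltk.corpus import wordnet as wn

    # Collect the unique surface form of every noun lemma in WordNet
    nouns = set()
    for synset in wn.all_synsets(wn.NOUN):
        for lemma in synset.lemmas():
            nouns.add(lemma.name())  # e.g. 'dog', 'domestic_dog'

    print(len(nouns))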


You should look at the Moby Part-of-Speech Project data. Don't limit yourself to only what ships with NLTK by default: it takes little work to download the Moby files, and they are pretty easy to parse once loaded.
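As a rough sketch of that parsing step, assuming the classic mobypos.txt file, whose documented format is one word per line followed by a '×' (0xD7) delimiter and single-letter part-of-speech codes ('N' for noun, 'A' for adjective, and so on) — the filename and format here are assumptions from the Moby documentation, not something NLTK ships:

    # Sketch: split the Moby POS list into nouns and adjectives.
    # Assumes mobypos.txt with lines of the form "word×codes",
    # where '×' is the 0xD7 delimiter used by the Moby project.
    nouns, adjectives = set(), set()
    with open('mobypos.txt', encoding='latin-1') as f:
        for line in f:
            word, sep, codes = line.rstrip('\n').partition('\u00d7')
            if not sep:
                continue  # skip lines without the delimiter
            if 'N' in codes:
                nouns.add(word)
            if 'A' in codes:
                adjectives.add(word)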


I saw a similar question earlier this week (I can't find the link), but as I said there, I don't think maintaining a static list of nouns/adjectives is a great idea, primarily because the same word can be a different part of speech depending on the context.

However, if you are still dead set on using such lists, here's how I would do it (I don't have a working NLTK installation on this machine, but I remember the basics):

    import nltk

    nouns = set()
    for sentence in my_corpus.sents():
        # Each sentence is a list of words; drop the nltk.pos_tag call
        # if your corpus already yields (word, POS tag) tuples.
        for word, pos in nltk.pos_tag(sentence):
            if pos in ('NN', 'NNP'):  # feel free to add other noun tags
                nouns.add(word)
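To make that concrete, here is a minimal usage sketch with NLTK's bundled Brown corpus (the corpus choice is an assumption; any reader with a sents() method works, and nltk.pos_tag needs the averaged_perceptron_tagger resource downloaded):

    import nltk
    from nltk.corpus import brown

    nouns = set()
    for sentence in brown.sents()[:100]:  # first 100 sentences, for speed
        for word, pos in nltk.pos_tag(sentence):
            if pos in ('NN', 'NNP'):
                nouns.add(word)

    print(sorted(nouns)[:20])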

Hope this helps.

