Part-of-Speech Tagging
Identifying and extracting all the verbs in a text is straightforward with a part-of-speech (POS) tagger. Such taggers label every word in a text with a part-of-speech tag that indicates whether it is a verb, noun, adjective, adverb, etc. Modern POS taggers are highly accurate. For example, Toutanova et al. (2003) report that the open-source Stanford tagger assigns the correct tag 97.24% of the time on news text.
Running a POS Tagger
Java: If you use Java, the Stanford Log-linear Part-Of-Speech Tagger is a good POS tagging package. Matthew Jockers has put together an excellent tutorial on using this tagger, which you can find here.
Python: If you prefer Python, you can use the POS tagger included in the Natural Language Toolkit (NLTK). The following code snippet demonstrates how to perform POS tagging with this package:
import nltk
text = "I am very happy to be here today"
tokens = nltk.word_tokenize(text)
pos_tagged_tokens = nltk.pos_tag(tokens)
The resulting POS-tagged tokens are a list of tuples, where the first element of each tuple is the token and the second element is its POS tag. For example, after running the code snippet above, pos_tagged_tokens will be set to:
[('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'), ('to', 'TO'), ('be', 'VB'), ('here', 'RB'), ('today', 'NN')]
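As a minimal sketch of working with this structure, each tuple can be unpacked into a word and a tag inside a loop. The example hard-codes the output shown above so that it runs without NLTK installed; words_by_tag is a hypothetical name chosen for illustration.

```python
from collections import defaultdict

# Tagger output for the example sentence, copied from the list above
pos_tagged_tokens = [('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'),
                     ('to', 'TO'), ('be', 'VB'), ('here', 'RB'), ('today', 'NN')]

# Group the words by their POS tag (words_by_tag is an illustrative name)
words_by_tag = defaultdict(list)
for word, tag in pos_tagged_tokens:
    words_by_tag[tag].append(word)

print(dict(words_by_tag))
```

Grouping by tag makes it easy to see at a glance which words received which label, for instance that both "very" and "here" were tagged RB.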
Understanding the Tag Set
Both the Stanford POS tagger and NLTK use the Penn Treebank tag set. If you're only interested in extracting verbs, pull out all the words whose POS tag starts with "V" (for example, VB, VBD, VBG, VBN, VBP, and VBZ).
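The verb filter described here might be sketched as follows. The helper extract_verbs is a hypothetical name, not part of NLTK, and the tagged tokens are hard-coded from the earlier example so the snippet runs on its own.

```python
def extract_verbs(tagged_tokens):
    """Return the words whose POS tag starts with 'V' (hypothetical helper)."""
    return [word for word, tag in tagged_tokens if tag.startswith('V')]

# Tagger output for "I am very happy to be here today", as shown earlier
pos_tagged_tokens = [('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'),
                     ('to', 'TO'), ('be', 'VB'), ('here', 'RB'), ('today', 'NN')]

verbs = extract_verbs(pos_tagged_tokens)
print(verbs)  # ['am', 'be']
```

In practice you would pass the result of nltk.pos_tag(tokens) to the helper instead of a hard-coded list.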