Extract English verbs from a given text

I need to extract all English verbs from a given text, and I was wondering how I could do this. At first glance, my idea is to use regular expressions, because all English verb tenses follow patterns, but maybe there is another way to do it. What I have thought of so far:

  • Create a template for each verb tense. I must somehow distinguish between regular verbs (http://en.wikipedia.org/wiki/English_verbs) and irregular verbs (http://www.chompchomp.com/rules/irregularrules01.htm).
  • Iterate over these patterns and split the text on them (the last word of each substring should be a verb that gives the sentence its full meaning, which I need for another purpose → nominalization).

What do you think? I don't think this is an efficient way to do it, but I can't think of another.

Thank you in advance!

PS:

  • I have two dictionaries, one for all English verbs and the other for all English nouns.
  • The main problem is that the project is about the nominalization of verbs (it is just a uni project), so all the effort should be concentrated on that part. Concretely, I follow this model: acl.ldc.upenn.edu/P/P00/P00-1037.pdf. Given a text, the project finds all the verbs in it and offers several nominalizations for each verb. So the first step (finding the verbs) should be as simple as possible, but I cannot use any parser; it is not allowed.
+8
java regex nlp
4 answers

Part-of-Speech Tagging

Identifying and extracting all the verbs in a text is very easy with a part-of-speech (POS) tagger. Such taggers label every word in the text with a part-of-speech tag that indicates whether it is a verb, noun, adjective, adverb, etc. Modern POS taggers are very accurate. For example, Toutanova et al. (2003) report that Stanford's open-source tagger assigns the correct tag 97.24% of the time on newswire text.

Running a POS Tagger

Java: If you use Java, the Stanford Log-linear Part-Of-Speech Tagger is a good POS tagging package. Matthew Jockers has put together an excellent tutorial on using this tagger.

Python: If you prefer Python, you can use the POS tagger included in the Natural Language Toolkit (NLTK). The following code snippet demonstrates how to perform POS tagging with this package:

    import nltk
    text = "I am very happy to be here today"
    tokens = nltk.word_tokenize(text)
    pos_tagged_tokens = nltk.pos_tag(tokens)

The resulting POS-tagged tokens will be a list of tuples, where the first entry in each tuple is the tagged word and the second entry is the word's POS tag. For example, after running the code snippet above, pos_tagged_tokens will be set to:

 [('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'), ('to', 'TO'), ('be', 'VB'), ('here', 'RB'), ('today', 'NN')] 

Understanding the Tag Set

Both the Stanford POS tagger and NLTK use the Penn Treebank tag set. If you're just interested in extracting verbs, pull out all the words with a POS tag that starts with "V" (for example, VB, VBD, VBG, VBN, VBP and VBZ).
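As a minimal sketch of that filtering step, the snippet below reuses the tagged tokens from the example output above as a literal list, so it runs without NLTK installed. The helper name extract_verbs is only illustrative and is not part of NLTK.

```python
# Tagged tokens copied from the example output above (Penn Treebank tags).
pos_tagged_tokens = [
    ('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'),
    ('to', 'TO'), ('be', 'VB'), ('here', 'RB'), ('today', 'NN'),
]


def extract_verbs(tagged_tokens):
    """Return the words whose POS tag starts with 'V' (hypothetical helper)."""
    return [word for word, tag in tagged_tokens if tag.startswith("V")]


verbs = extract_verbs(pos_tagged_tokens)
print(verbs)  # ['am', 'be']
```

Note that the Penn Treebank tags modal verbs (can, will, should, ...) as MD, so a filter on the "V" prefix alone will skip them; add MD to the check if modals should count as verbs for your purposes.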

+13

Parsing natural language with regular expressions is not possible. Forget about it.

As a pointed example: how would you find the verb (marked with asterisks) in this sentence?

Buffalo Buffalo Buffalo Buffalo Buffalo *Buffalo* Buffalo Buffalo

While you are unlikely to encounter such extreme cases, there are dozens of verbs that can also be nouns, adjectives, etc., if you look only at the word itself.

You need a natural language parser like Stanford NLP. I have never used it, so I don't know how good your results will be, but they will be better than with regex, that much I can tell you.

+4

This is actually a very difficult task in NLP (Natural Language Processing). Regular expressions are not enough here. Take, for example, the word "training": it can be used as a verb or as a noun ("I'm going to training"). Obviously, a regular expression cannot tell the difference between the two. Suffixes cause problems as well: "-ed" is the usual way to form past-tense verbs, but relying on it fails in cases such as "disgust."
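To make the point concrete, here is a small sketch of a suffix-based regex heuristic and how it misfires. The pattern and the function name naive_verb_candidates are made up for illustration; they are not taken from the question or from any library.

```python
import re

# Naive heuristic (illustrative only): treat any word ending in -ing or -ed
# as a verb. The pattern needs at least one character before the suffix.
SUFFIX_PATTERN = re.compile(r"\b\w+(?:ing|ed)\b", re.IGNORECASE)


def naive_verb_candidates(text):
    """Return every word that the suffix heuristic would flag as a verb."""
    return SUFFIX_PATTERN.findall(text)


# "training" is a noun here, but it is flagged anyway.
print(naive_verb_candidates("I'm going to training"))  # ['going', 'training']

# "red" and "bed" are not verbs, but they end in -ed.
print(naive_verb_candidates("The red bed was moved"))  # ['red', 'bed', 'moved']

# The irregular past tense "went" is missed completely.
print(naive_verb_candidates("She went home"))  # []
```

The heuristic only sees the spelling of a word, never its role in the sentence, which is why it produces both false positives and false negatives.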

There are several methods that can give you a good (not perfect, but good) indication of whether a given word is a verb or not, but they can also be computationally expensive.

So, the first question you should ask yourself (in my opinion) is what quality of answer you need and how much time you are willing to spend to get it.

0

Although it is a year later, I found a very useful tool from Northwestern University called MorphAdorner.

It handles all kinds of tasks, for example lemmatization, language recognition, name recognition, parsing, sentence splitting, etc.

It is convenient and easy to use.

0
