Designing a regular expression to search for any noun phrase

Question

I am trying to create a chunker (or shallow parser) using regular expressions (and without NLTK), but I cannot come up with a regular expression that does what I want. My immediate goal is to find all the noun phrases in a natural-language text.

My first step is to tag all the sentences with my homemade part-of-speech tagger, and then join the list of token/tag pairs into a single string, as follows:

'he PRN and CC bill NP could MOD hear VB them PRN on IN the DT large JJ balcony NN near IN the DT house NN'

My next step is to use a regular expression to search the string for instances of noun phrases. The general linguistic formula for a noun phrase is: an optional determiner (DT), zero or more adjectives (JJ), and a noun (NN), a proper noun (NP), or a pronoun (PRN). Given this general formula, I tried this regular expression (remember that the tagged string alternates between words and tags):

 '(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))'

Here is my code:

    import re

    text = 'he PRN and CC bill NP could MOD hear VB them PRN on IN the DT large JJ balcony NN near IN the DT house NN'
    regex = re.compile(r'(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))')
    m = regex.findall(text)
    if m:
        print(m)

And here is my output:

 [('the DT', 'large JJ', 'balcony NN', 'NN')]

It does not find the pronouns or proper nouns, and for some reason it only matches the NN in a "\w+ DT \w+ NN" pattern. I assumed that my regex would match these patterns, since I set the determiner pattern as optional (?) and the adjective pattern as zero or more (*).

Chris

+8

python regex chunking

user3609038 Jun 24 '14 at 1:13


2 answers

Your regular expression would be:

 (\w+ DT)? (\w+ JJ)*|(\w+ (?:NN|NP|PRN))


0

Avinash raj Jun 24 '14 at 1:34


zx81 · Accepted Answer · 2014-06-24T01:18:05+0000

Use this:

 (?:(?:\w+ DT )?(?:\w+ JJ )*)?\w+ (?:N[NP]|PRN)


(?:(?:\w+ DT )?(?:\w+ JJ )*)? optionally matches a DT followed by zero or more adjectives (JJ)
\w+ (?:N[NP]|PRN) matches an NN, NP, or PRN
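
To check the pattern above against the question's sample string, here is a minimal sketch. The variable names (original, fixed, phrases) are only illustrative. It also runs the question's own pattern for comparison: the likely reason that pattern finds so little is that its two literal spaces sit outside the optional groups, so they are required even when no DT or JJ is present.

```python
import re

text = ('he PRN and CC bill NP could MOD hear VB them PRN on IN '
        'the DT large JJ balcony NN near IN the DT house NN')

# Pattern from the question: the two literal spaces between the groups are
# mandatory, so a match needs both of them even when DT or JJ is absent.
original = re.compile(r'(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))')

# Pattern from this answer: each space lives inside its own optional group,
# and the groups are non-capturing, so findall returns whole matches.
fixed = re.compile(r'(?:(?:\w+ DT )?(?:\w+ JJ )*)?\w+ (?:N[NP]|PRN)')

print(original.findall(text))
# [('the DT', 'large JJ', 'balcony NN', 'NN')]

phrases = fixed.findall(text)
for phrase in phrases:
    print(phrase)
# he PRN
# bill NP
# them PRN
# the DT large JJ balcony NN
# the DT house NN
```

Note that the pattern has no trailing word boundary, so a longer tag that merely starts with NN or NP would also match on its prefix. Appending \b is one possible guard if your tag set needs it.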
