I am trying to create a chunker (or shallow parser) using regular expressions (and without NLTK), but I cannot come up with a regular expression that does what I want. My immediate goal is to find all the noun phrases in a natural-language text.
My first step is to tag all the sentences with my home-made part-of-speech tagger, and then join the list of token/tag pairs into a single string, like so:
'he PRN and CC bill NP could MOD hear VB them PRN on IN the DT large JJ balcony NN near IN the DT house NN'
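For reference, the join step can be sketched roughly as follows. This assumes the tagger returns a list of (token, tag) tuples; the name `tagged` and the hard-coded list are illustrative stand-ins, not the real tagger's output format.

```python
# Hypothetical tagger output: a list of (token, tag) pairs.
tagged = [
    ('he', 'PRN'), ('and', 'CC'), ('bill', 'NP'), ('could', 'MOD'),
    ('hear', 'VB'), ('them', 'PRN'), ('on', 'IN'), ('the', 'DT'),
    ('large', 'JJ'), ('balcony', 'NN'), ('near', 'IN'), ('the', 'DT'),
    ('house', 'NN'),
]

# Flatten the pairs into one space-separated string: word tag word tag ...
text = ' '.join('%s %s' % (token, tag) for token, tag in tagged)
print(text)
```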
My next step is to use a regular expression to search the string for instances of noun phrases. The general linguistic formula for a noun phrase is: an optional determiner (DT), zero or more adjectives (JJ), and a noun (NN), a proper noun (NP), or a pronoun (PRN). Given this general formula, I tried this regular expression (keep in mind that the tagged string alternates between words and tags):
'(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))'
Here is my code:
import re

text = 'he PRN and CC bill NP could MOD hear VB them PRN on IN the DT large JJ balcony NN near IN the DT house NN'
regex = re.compile(r'(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))')
m = regex.findall(text)
if m:
    print(m)
And here is my output:
[('the DT', 'large JJ', 'balcony NN', 'NN')]
It does not find the pronouns or proper nouns, and for some reason it only matches the "NN" in a \w+ DT \w+ NN pattern. I assumed that my regex would match these patterns, since I made the determiner pattern optional (?) and the adjective pattern zero or more (*).
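One likely explanation, offered as a sketch rather than a confirmed diagnosis: the literal spaces between the groups are not optional, so even when the determiner and adjective groups match nothing, the pattern still demands two consecutive spaces before the noun. The variant below moves each space inside its optional group and uses non-capturing groups; the names `np_pattern` and `chunks` are illustrative.

```python
import re

text = ('he PRN and CC bill NP could MOD hear VB them PRN on IN '
        'the DT large JJ balcony NN near IN the DT house NN')

# Each optional part carries its own trailing space, so skipping it
# leaves no stray space behind. \b keeps NN from matching inside a
# longer tag such as NNS.
np_pattern = re.compile(r'(?:\w+ DT )?(?:\w+ JJ )*\w+ (?:NN|NP|PRN)\b')

# finditer + group(0) returns the whole matched phrase.
chunks = [m.group(0) for m in np_pattern.finditer(text)]
print(chunks)
# ['he PRN', 'bill NP', 'them PRN', 'the DT large JJ balcony NN', 'the DT house NN']
```

With capturing groups, findall returns a tuple of group contents per match, which is why 'NN' shows up as a separate item in the output above. A repeated capturing group such as (\w+ JJ)* also keeps only its last repetition.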
Chris
python regex chunking
user3609038