Alphanumeric string matching in nltk grammar

I am trying to use NLTK's grammar and parsing algorithms, since they seem pretty easy to use. However, I cannot find a way to match an alphanumeric string, for example:

```python
import nltk

grammar = nltk.CFG.fromstring("""
    # Is this possible?
    TEXT -> \w*
""")
parser = nltk.RecursiveDescentParser(grammar)
print(parser.parse("foo"))
```

Is there an easy way to achieve this?

1 answer

It would be very difficult to do this cleanly. The parser base classes rely on exact matching of a production's RHS content when popping tokens off the stack, so supporting regular-expression terminals would require subclassing and rewriting large parts of the parser class. I tried doing the same with the grammar class and gave up.
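To make the exact-matching point concrete, here is a minimal sketch (using the current `nltk.CFG.fromstring` API rather than the older `nltk.parse_cfg` from the question): terminals in an NLTK CFG are plain string literals, and a token is consumed only when it is exactly equal to the terminal, so a pattern like `\w*` is never interpreted as a regex.

```python
import nltk

# Terminals are string literals matched by exact equality, not patterns.
grammar = nltk.CFG.fromstring("""
    S -> 'foo'
""")
parser = nltk.RecursiveDescentParser(grammar)

# The literal token 'foo' matches and yields one tree: (S foo)
print(list(parser.parse(["foo"])))
```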

Instead I did something hackier: I first extract regular-expression matches from the text and then add them to the grammar as productions. This will be very slow with a large grammar, since the grammar and parser have to be rebuilt for every call.

```python
import re

import nltk
from nltk.grammar import Nonterminal, Production, CFG

grammar = nltk.CFG.fromstring("""
    S -> TEXT
    TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")
productions = grammar.productions()

def literal_production(key, rhs):
    """Return a production <key> -> <rhs>.

    :param key: symbol for the lhs
    :param rhs: string literal
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])

def parse(text):
    """Parse some text."""
    # Extract new words and numbers
    words = set(match.group(0) for match in re.finditer(r"[a-zA-Z]+", text))
    numbers = set(match.group(0) for match in re.finditer(r"\d+", text))

    # Make a local copy of the productions
    lproductions = list(productions)

    # Add a production for every word and number
    lproductions.extend(literal_production("WORD", word) for word in words)
    lproductions.extend(literal_production("NUMBER", number) for number in numbers)

    # Make a local copy of the grammar with the extra productions
    lgrammar = CFG(grammar.start(), lproductions)

    # Load the grammar into a parser
    parser = nltk.RecursiveDescentParser(lgrammar)

    tokens = text.split()
    return parser.parse(tokens)

for tree in parse("foo hello world 123 foo"):
    print(tree)
```
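The slowness mentioned above comes from rebuilding the grammar and parser on every call. One possible mitigation, sketched here with names of my own choosing (`BASE`, `make_parser`) rather than anything from the original answer, is to memoise the parser on the frozen sets of extracted literals, so texts that reuse the same vocabulary reuse the same parser:

```python
import re
from functools import lru_cache

import nltk
from nltk.grammar import Nonterminal, Production, CFG

BASE = nltk.CFG.fromstring("""
    S -> TEXT
    TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")

@lru_cache(maxsize=128)
def make_parser(words, numbers):
    """Build a parser for fixed tuples of word/number literals (cached)."""
    productions = list(BASE.productions())
    productions.extend(Production(Nonterminal("WORD"), [w]) for w in words)
    productions.extend(Production(Nonterminal("NUMBER"), [n]) for n in numbers)
    return nltk.RecursiveDescentParser(CFG(BASE.start(), productions))

def parse(text):
    # Sorted tuples are hashable, so they can serve as cache keys
    words = tuple(sorted(set(re.findall(r"[a-zA-Z]+", text))))
    numbers = tuple(sorted(set(re.findall(r"\d+", text))))
    return make_parser(words, numbers).parse(text.split())

for tree in parse("foo hello world 123 foo"):
    print(tree)
```

This only helps when the same vocabulary recurs across calls; for genuinely open vocabularies, a chart parser with a custom tokenizer-to-preterminal step would be a better fit.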

Here's more background from the nltk-users group on Google Groups, where this was discussed: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion

