It would be very difficult to do this cleanly. The base parser classes rely on exact matches against the production RHS to pop content, so it would require subclassing and rewriting large parts of the parser class. I tried to do this with a grammar class and gave up.
What I did instead is more of a hack: I first extract the regular-expression matches from the text and add them to the grammar as productions. This will be very slow with a large grammar, since the grammar and parser have to be rebuilt on every call.
```python
import re

import nltk
# NLTK 2 names; NLTK 3 renamed ContextFreeGrammar to CFG
# and nltk.parse_cfg to nltk.CFG.fromstring.
from nltk.grammar import Nonterminal, Production, ContextFreeGrammar

# WORD and NUMBER have no productions here; they are added per input text.
grammar = nltk.parse_cfg("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")

productions = grammar.productions()


def literal_production(key, rhs):
    """ Return a production <key> -> n

    :param key: symbol for lhs
    :param rhs: string literal
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])


def parse(text):
    """ Parse some text. """
```
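As a rough sketch of how the whole approach could look end to end, here is a self-contained version. It is not the original code: it assumes NLTK 3 naming (`CFG.fromstring`, `CFG`, `ChartParser`), and the names `BASE_GRAMMAR` and `TOKEN_RE`, the two token patterns, and the body of `parse` are my own assumptions for illustration.

```python
import re

from nltk import CFG, ChartParser
from nltk.grammar import Nonterminal, Production

# Fixed part of the grammar; WORD and NUMBER have no productions yet.
BASE_GRAMMAR = CFG.fromstring("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")

# Hypothetical token patterns; the group names double as nonterminal names.
TOKEN_RE = re.compile(r"(?P<WORD>[a-zA-Z]+)|(?P<NUMBER>\d+)")


def parse(text):
    """Return all parse trees for text, rebuilding the grammar on each call."""
    matches = list(TOKEN_RE.finditer(text))
    tokens = [m.group(0) for m in matches]

    # One literal production per distinct (category, token) pair,
    # e.g. WORD -> 'apples' or NUMBER -> '3'.
    pairs = dict.fromkeys((m.lastgroup, m.group(0)) for m in matches)
    extra = [Production(Nonterminal(key), [tok]) for key, tok in pairs]

    # Recomputing the grammar and parser here is the slow part noted above.
    grammar = CFG(BASE_GRAMMAR.start(), list(BASE_GRAMMAR.productions()) + extra)
    return list(ChartParser(grammar).parse(tokens))
```

Calling `parse("buy 3 apples")` tokenizes the text with the regular expressions, adds `WORD -> 'buy'`, `NUMBER -> '3'` and `WORD -> 'apples'` to the base grammar, and then parses the token list with the combined grammar. Anything the patterns do not match (punctuation, for example) is silently dropped in this sketch.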
For more background, this was discussed in the nltk-users Google Group: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion