How to parse a list of words according to a simplified grammar?

Question


Just to clarify, this is not homework. I was asked to help someone with this and couldn't do it, so it has become a personal quest to solve this problem.

Imagine you have a grammar for an English sentence:

    S   => NP VP | VP
    NP  => N | Det N | Det Adj N
    VP  => V | V NP
    N   => i you bus cake bear
    V   => hug love destroy am
    Det => a the
    Adj => pink stylish

I searched for a few hours and am really out of ideas. I found articles about shallow parsing, depth-first backtracking, and related topics, and although I am familiar with most of them, I still can't apply them to this problem. I tagged Lisp and Haskell because those are the languages I plan to implement this in, but I don't mind if you use other languages in your answers.

I would be grateful for any advice, good articles, or pointers in general.

+8

algorithm lisp haskell nlp

Ben Oct 18 '11 at 7:02


5 answers

There are several different approaches to parsing with a context-free grammar.

If you want to implement this yourself, you could start by familiarizing yourself with parsing algorithms: you can look here and here, or, if you prefer something on paper, the chapter on parsing in Jurafsky and Martin may be a good start.

I know that it is not too hard to implement a simple parser in the Prolog programming language. Just google for "prolog shift-reduce parser" or "prolog predictive parser". I don't know Haskell or Lisp, but there might be similarities to Prolog, so maybe you can get some ideas from there.
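As a rough illustration of how Prolog-style backtracking can carry over to Haskell, here is a minimal "list of successes" recognizer for the grammar in the question. It uses only the Prelude. The names `Parser`, `oneOf`, `.>`, `<+>` and `accepts` are made up for this sketch and do not come from any library, and it only answers yes or no rather than building a tree.

```haskell
-- A nondeterministic parser maps the remaining words to every possible
-- remainder after recognizing one constituent (empty list = failure).
type Parser = [String] -> [[String]]

-- Recognize one word drawn from a fixed word list.
oneOf :: [String] -> Parser
oneOf ws (t:ts) | t `elem` ws = [ts]
oneOf _  _                    = []

-- Sequencing: run p, then run q on every remainder that p leaves behind.
(.>) :: Parser -> Parser -> Parser
(p .> q) ts = concatMap q (p ts)

-- Choice: keep the results of both alternatives (this is the backtracking).
(<+>) :: Parser -> Parser -> Parser
(p <+> q) ts = p ts ++ q ts

infixr 3 .>
infixr 2 <+>

-- The grammar from the question, written rule by rule.
n, v, det, adj, np, vp, s :: Parser
n   = oneOf ["i", "you", "bus", "cake", "bear"]
v   = oneOf ["hug", "love", "destroy", "am"]
det = oneOf ["a", "the"]
adj = oneOf ["pink", "stylish"]
np  = n <+> det .> n <+> det .> adj .> n
vp  = v <+> v .> np
s   = np .> vp <+> vp

-- A sentence is accepted if some derivation consumes every word.
accepts :: String -> Bool
accepts = any null . s . words
```

With these definitions, `accepts "i love the pink cake"` evaluates to `True` and `accepts "i love pink cake"` to `False`. The approach terminates here because the grammar has no left-recursive rules; a left-recursive grammar would need a different technique.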

If you don't need to implement a full parser yourself, I would look at the Python NLTK, which offers tools for CFG parsing. There is a chapter about this in the NLTK book.

+4

tobigue Oct 18 '11 at 7:54


Well, there are a number of algorithms you could use. Some popular dynamic programming algorithms are:

1) CKY algorithm (the grammar must be in Chomsky normal form, CNF)
2) Earley algorithm
3) Chart parsing

Please search Google for implementations. Basically, given a sentence, these algorithms let you assign a context-free parse tree to it.
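To make the dynamic programming idea concrete, here is a small CKY-style recognizer in Haskell for the grammar in the question. It is a sketch under stated assumptions, not a reference implementation: instead of converting the grammar to strict CNF, it keeps the unary rules and applies them as a closure step, and it introduces a helper category `AdjN` (not part of the original grammar) to binarize `NP => Det Adj N`. All names are hypothetical; `Data.Map` comes from the `containers` package that ships with GHC.

```haskell
import qualified Data.Map as M
import Data.List (nub)

type Cat = String

-- Terminal rules from the question's grammar.
lexicon :: [(Cat, [String])]
lexicon =
  [ ("N",   ["i", "you", "bus", "cake", "bear"])
  , ("V",   ["hug", "love", "destroy", "am"])
  , ("Det", ["a", "the"])
  , ("Adj", ["pink", "stylish"]) ]

-- Unary rules (parent, child). Strict CNF forbids these; here they are
-- applied as a closure step instead of being eliminated.
unary :: [(Cat, Cat)]
unary = [("NP", "N"), ("VP", "V"), ("S", "VP")]

-- Binary rules (parent, left, right). "AdjN" is a helper category
-- introduced only to binarize NP => Det Adj N.
binary :: [(Cat, Cat, Cat)]
binary =
  [ ("S", "NP", "VP"), ("NP", "Det", "N"), ("NP", "Det", "AdjN")
  , ("AdjN", "Adj", "N"), ("VP", "V", "NP") ]

-- Add every category reachable through the unary rules.
close :: [Cat] -> [Cat]
close cs
  | null new  = cs
  | otherwise = close (cs ++ new)
  where new = nub [p | (p, c) <- unary, c `elem` cs, p `notElem` cs]

-- The chart entry at (i, len) holds the categories that span len words
-- starting at word i. Cells for longer spans look up shorter ones lazily.
cky :: [String] -> M.Map (Int, Int) [Cat]
cky ws = chart
  where
    n = length ws
    chart = M.fromList
      [((i, len), cell i len) | len <- [1 .. n], i <- [0 .. n - len]]
    cell i 1   = close [c | (c, forms) <- lexicon, (ws !! i) `elem` forms]
    cell i len = close $ nub
      [ p | k <- [1 .. len - 1]
          , (p, l, r) <- binary
          , l `elem` (chart M.! (i, k))
          , r `elem` (chart M.! (i + k, len - k)) ]

-- A sentence is recognized if S spans the whole input.
recognizes :: String -> Bool
recognizes str = not (null ws) && "S" `elem` (cky ws M.! (0, length ws))
  where ws = words str
```

Here `recognizes "i love the pink cake"` is `True` and `recognizes "i love pink cake"` is `False`. Recovering a parse tree would additionally require storing back-pointers in each cell.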

+1

Programmer Oct 18 '11 at 15:14


The grammar you give is non-probabilistic, so you can use tools such as ANTLR, JFlex, Scala parser combinators, or Python parsing libraries to implement a parser for it with code very similar to the grammar you provided.

0

yura Oct 18 '11 at 9:49


I think your problem may be that the way you parse a computer language is very different from the way you parse natural language.

Computer languages are designed to be unambiguous, so that it is relatively easy for a computer to extract their exact meaning.

Natural languages evolved to be compact, expressive, and generally understood by people. You may be able to make the deterministic parsing that compilers use work for some very simple subset of English grammar, but it is nothing like what is used to parse real natural language.

0

Rob neuhaus Oct 18 '11 at 20:13


Daniel Wagner · Accepted Answer · 2011-10-18T14:41:11+0000

Here is a working example in Haskell. It turns out there are a few tricks to learn before you can get it to work! The zeroth thing to do is the boilerplate: turn off the dreaded monomorphism restriction, import some libraries, and define some functions that aren't in the libraries (but should be):

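    -- FlexibleContexts is an addition: some GHC versions need it to accept
    -- the inferred Stream constraints on the parsers below.
    {-# LANGUAGE NoMonomorphismRestriction, FlexibleContexts #-}
    import Control.Applicative ((<*))
    import Control.Monad
    import Text.ParserCombinators.Parsec

    -- Keep x only if it satisfies the predicate p.
    ensure p x = guard (p x) >> return x
    -- Match exactly the token t; each token advances the column by one.
    singleToken t = tokenPrim id (\pos _ _ -> incSourceColumn pos 1) (ensure (==t))
    -- Match any one of the given tokens.
    anyOf xs = choice (map singleToken xs)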

Now that the zeroth thing is done... first we define a data type for our abstract syntax trees. We can just follow the form of the grammar here. However, to make it more convenient, I've factored a few of the grammar's rules; in particular, the two rules

    NP => N | Det N | Det Adj N
    VP => V | V NP

are more convenient to write this way when it comes to writing a parser:

    NP => N | Det (Adj | empty) N
    VP => V (NP | empty)
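To see why the factored form is easier to work with in Parsec, compare three ways of writing the NP rule. This is a sketch with explicit types instead of the inferred ones used elsewhere in this answer; `P`, `word`, `run`, `npNaive`, `npTry` and `npFactored` are hypothetical names chosen for the illustration. The point is that Parsec's `<|>` does not try its second alternative once the first has consumed input, so the unfactored rule needs `try`, while the factored rule never has to back up.

```haskell
import Control.Monad (guard)
import Data.Maybe (maybeToList)
import Text.ParserCombinators.Parsec

-- Parsers over a list of words, with no user state.
type P = GenParser String ()

-- Match any word from the given list, advancing one column per word.
word :: [String] -> P String
word ws = tokenPrim id (\pos _ _ -> incSourceColumn pos 1)
                    (\t -> guard (t `elem` ws) >> return t)

n, det, adj :: P String
n   = word ["i", "you", "bus", "cake", "bear"]
det = word ["a", "the"]
adj = word ["pink", "stylish"]

npNaive, npTry, npFactored :: P [String]

-- Literal transcription of NP => N | Det N | Det Adj N. If "Det N" fails
-- after consuming the Det, the third alternative is never attempted.
npNaive = fmap (:[]) n <|> sequence [det, n] <|> sequence [det, adj, n]

-- Same rule, but `try` rewinds the input when "Det N" fails.
npTry = fmap (:[]) n <|> try (sequence [det, n]) <|> sequence [det, adj, n]

-- Factored rule NP => N | Det (Adj | empty) N: no backtracking needed.
npFactored = fmap (:[]) n <|> do
  d <- det
  a <- optionMaybe adj
  x <- n
  return (d : maybeToList a ++ [x])

-- Tokenize on whitespace and run a parser over the words.
run :: P a -> String -> Either ParseError a
run p = parse p "words" . words
```

With these definitions, `run npNaive "the pink cake"` yields a `Left` error, while `run npTry "the pink cake"` and `run npFactored "the pink cake"` both yield `Right ["the","pink","cake"]`.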

Any good parsing book will have a chapter on why this kind of factoring is a good idea. So, the type for the AST:

    -- The Show instances are needed for ghci to print parse results.
    data Sentence
        = Complex NounPhrase VerbPhrase
        | Simple VerbPhrase
        deriving Show
    data NounPhrase
        = Short Noun
        | Long Article (Maybe Adjective) Noun
        deriving Show
    data VerbPhrase
        = VerbPhrase Verb (Maybe NounPhrase)
        deriving Show

    type Noun      = String
    type Verb      = String
    type Article   = String
    type Adjective = String

Then we can write our parser. This one follows the (factored) grammar even more closely! One wrinkle here is that we always want our parser to consume the whole sentence, so we have to explicitly ask it to do that by demanding an "eof", that is, the end of the "file".

    s   = (liftM2 Complex np vp <|> liftM Simple vp) <* eof
    np  = liftM Short n <|> liftM3 Long det (optionMaybe adj) n
    vp  = liftM2 VerbPhrase v (optionMaybe np)
    n   = anyOf ["i", "you", "bus", "cake", "bear"]
    v   = anyOf ["hug", "love", "destroy", "am"]
    det = anyOf ["a", "the"]
    adj = anyOf ["pink", "stylish"]

The last piece is the tokenizer. For this simple application, we'll just tokenize on whitespace, so the built-in words function works just fine. Let's try it out! Load the whole file in ghci:

    *Main> parse s "stdin" (words "i love the pink cake")
    Right (Complex (Short "i") (VerbPhrase "love" (Just (Long "the" (Just "pink") "cake"))))
    *Main> parse s "stdin" (words "i love pink cake")
    Left "stdin" (line 1, column 3):
    unexpected "pink"
    expecting end of input

Here, Right indicates a successful parse, and Left indicates an error. The "column" number reported in the error is actually the number of the word where the error occurred, because of how we compute source positions in singleToken.
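If the word index is needed programmatically rather than just in the printed message, it can be read off the error position. The following is a small self-contained sketch; `word` and `failingWord` are hypothetical helpers written for this illustration, using the same one-column-per-token convention as singleToken.

```haskell
import Control.Monad (guard)
import Text.ParserCombinators.Parsec

-- Match exactly the word t, advancing the column by one per token.
word :: String -> GenParser String () String
word t = tokenPrim id (\pos _ _ -> incSourceColumn pos 1)
                   (\x -> guard (x == t) >> return x)

-- Return the 1-based index of the word at which parsing failed, if any.
failingWord :: GenParser String () a -> String -> Maybe Int
failingWord p input =
  case parse p "stdin" (words input) of
    Left err -> Just (sourceColumn (errorPos err))
    Right _  -> Nothing
```

For example, `failingWord (word "i" >> word "love" >> word "cake") "i love pink"` evaluates to `Just 3`, because the third word is the one that does not match.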
