How to define grammar for clojure code using instaparse?

I am new to parsing and want to parse clojure code. I hope someone can provide an example of how clojure code can be parsed using instaparse. I just need to make numbers, symbols, keywords, sexps, vectors and spaces.

Some examples I want to analyze:

(+ 1 2 (+ 3 4)) {:hello "there" :look '(i am indented)} 
+8
clojure instaparse
source share
1 answer

Well, there are two parts to your question. The first part analyzes the expression

 (+ 1 2 (+ 3 4)) 

The second part converts the result to the desired result. To get a good idea of ​​these principles, I highly recommend the Udacity programming language course . Carin Meier 's blog post is also quite helpful.

The best way to understand how the parser works is to break it into smaller parts. Therefore, in the first we will consider some rules of parsing, and in the second part we will build our floors.

  • Simple example

    First you need to write a grammar that tells instaparse how to parse this expression. We start with a simple parsing of number 1 :

     (def parser (insta/parser "sexp = number number = #'[0-9]+' ")) 

    sexp describes the highest level grammar for sex. Our grammar says that sexp can only have a number. The next line indicates that the number can be any digit 0-9, and + similar to the regular expression + , which means that it must have one number repeated any number of times. If we run our parser, we get the following parsing tree:

     (parser "1") => [:sexp [:number "1"]] 

    Parentheses

    We can ignore specific values ​​by adding angle brackets < to our grammar. Therefore, if we want to parse "(1)" as just 1 , we can correct our grammar as follows:

     (def parser (insta/parser "sexp = lparen number rparen <lparen> = <'('> <rparen> = <')'> number = #'[0-9]+' ")) 

    and if we run the parser again, it will ignore the left and right parentheses:

     (parser "(1)") => [:sexp [:number "1"]] 

    This will become useful when we write the grammar for sexp below.

    Adding spaces

    Now, if we add spaces and run (parser "( 1 )") ? We get the error:

     (parser "( 1 )") => Parse error at line 1, column 2: ( 1 ) ^ Expected: #"[0-9]+" 

    This is because we have not defined the concept of space in our grammar! Therefore, we can add spaces as such:

     (def parser (insta/parser "sexp = lparen space number space rparen <lparen> = <'('> <rparen> = <')'> number = #'[0-9]+' <space> = <#'[ ]*'> ")) 

    Again, * looks like a regular expression * and means zero or more than one occurrence of space. This means that the following examples will return the same result:

     (parser "(1)") => [:sexp [:number "1"]] (parser "( 1 )") => [:sexp [:number "1"]] (parser "( 1 )") => [:sexp [:number "1"]] 
  • Making Sexp

    We are gradually going to build our grammar from scratch. It might be useful to look at the final product here to give an overview of where we are heading.

    So sexp contains more than just numbers defined by our simple grammar. One high level view we can have for sexp is to treat them as an operation between two brackets. So basically like ( operation ) . We can write this directly into our grammar.

     (def parser (insta/parser "sexp = lparen operation rparen <lparen> = <'('> <rparen> = <')'> operation = ??? ")) 

    As I said above, angle brackets < tell instaparse to ignore these values ​​when it creates a parse tree. Now what is an operation? Well, an operation consists of an operator, such as + , and some arguments, such as numbers 1 and 2 . Therefore, we can write the grammar as follows:

     (def parser (insta/parser "sexp = lparen operation rparen <lparen> = <'('> <rparen> = <')'> operation = operator + args operator = '+' args = number number = #'[0-9]+' ")) 

    We indicated only one possible + operator, just to make everything simple. We also included the number grammar rule from the simple example above. However, our grammar is very limited. The only valid sexp that he can analyze is (+1) . This is because we did not include the concept of spaces and stated that arguments can only have one number. So, at this point we will do two things. We will add spaces and we will indicate that args can have more than one number.

     (def parser (insta/parser "sexp = lparen operation rparen <lparen> = <'('> <rparen> = <')'> operation = operator + args operator = '+' args = snumber+ <snumber> = space number <space> = <#'[ ]*'> number = #'[0-9]+' ")) 

    We added space using the spatial grammar rule, which we defined in a simple example. We created a new snumber , which is defined as space and number , and added + to snumber to indicate that it should appear once, but it can be repeated as many times as necessary. Therefore, we can run our parser like this:

     (parser "(+ 1 2)") => [:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]] 

    We can make our grammar more reliable by returning the args link to sexp . So we can have sexp in our sexp! We can do this by creating ssexp , which adds space to sexp , and then adds ssexp to args .

     (def parser (insta/parser "sexp = lparen operation rparen <lparen> = <'('> <rparen> = <')'> operation = operator + args operator = '+' args = snumber+ ssexp* <ssexp> = space sexp <snumber> = space number <space> = <#'[ ]*'> number = #'[0-9]+' ")) 

    Now we can run

     (parser "(+ 1 2 (+ 1 2))") => [:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"] [:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]]]]] 
  • Transformations

    This step can be performed using any number of tools that work on trees, such as revitalizing, zipper, matching, and tree-seq. However, Instaparse also includes its useful feature called insta\transform . We can build our transformations by replacing the keys in our parsing tree with the actual clojure functions. For example :number becomes read-string to turn our strings into valid numbers :args becomes vector to build our arguments.

    So we want to convert this:

      [:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]] 

    In it:

      (identity (apply + (vector (read-string "1") (read-string "2")))) => 3 

    We can do this by defining our conversion options:

     (defn choose-op [op] (case op "+" +)) (def transform-options {:number read-string :args vector :operator choose-op :operation apply :sexp identity }) 

    The only tricky thing here was to add the choose-op function. We want to pass the + function to apply , but if we replace operator with + , it will use + as a regular function. Therefore, he will transform our tree to this:

      ... (apply (+ (vector ... 

    But using choose-op , it will pass + as an argument to apply as such:

      ... (apply + (vector ... 

Conclusion

Now we can start our small interpreter by combining the parser and the transformer:

 (defn lisp [input] (->> (parser input) (insta/transform transform-options))) (lisp "(+ 1 2)") => 3 (lisp "(+ 1 2(+ 3 4))") => 10 

You can find the final code used in this tutorial here .

I hope this brief introduction is enough to engage in our own projects. You can create new lines by declaring a grammar for \n , and you can even not ignore spaces in the parse tree by removing the angle brackets < . This may be useful considering that you are trying to keep the indentation. Hope this helps if don't just write a comment!

+22
source share

All Articles