Break a row into a tree structure?

Question

Break a row into a tree structure?

I am trying to understand how to parse a string in this format in the form of a tree, as a data structure of arbitrary depth.

"{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] [["Goodbye" "farewell"] ["planet" "rock" "globe" ["." "!"]]]]

I tried playing with some regular expressions for this (for example, # "{([^ {}] *)}"), but everything I tried seems to "smooth" the tree into a large list of lists. I could approach this from the wrong angle, or maybe the regex just isn't right for the job.

Thank you for your help!

+6

parsing clojure text-parsing text-processing tree

erikcw Sep 29 '10 at 22:35

source share

4 answers

Vivin paliath · Answer 1 · 2010-09-29T22:39:30+0000

Do not use regular expressions for this task. An easier way is to describe your string using grammar (BNF or EBNF) and then write a parser to parse the string according to the grammar. You can generate a syntax tree from your EBNF and BNF, and so you naturally get a tree structure.

You can start with something like this:

 element ::= element-type, { ["|"], element-type } element-type ::= primitive | "{", element, "}" primitive ::= symbol | word symbol ::= "." | "!" word ::= character { character } character ::= "a" | "b" | ... | "z"

Note. I wrote it quickly and maybe this is not entirely correct. But that should give you an idea.

aschepler · Answer 2 · 2010-09-29T22:46:53+0000

Trying to match all of this with one regular expression doesn’t make you go too far, because regular expressions output nothing more than a list of matching substring positions, nothing tree-like. You need a lexer or grammar that does something like this:

Divide the input into tokens - atomic parts such as '{', '|' and 'world', then process these markers in order. Start with an empty tree with a single root node.

Every time you find { , create and go to the child element of node.

Every time you find | , create and go to the sibling node.

Every time you find } go to the parent node.

Each time you find a word, put that word in the current node sheet.

Arthur ulfeldt · Answer 3 · 2010-09-29T22:45:08+0000

if you want to hack quickly:

replace {chars with [
replace} characters with]
replace | characters with spaces
hope you don't get input with spaces.

read , so it appears as nested arrays.

ps: I agree that reg-ex cannot do this.

pss: set * read-eval * to false (you don't want the input to execute by itself)

edbond · Answer 4 · 2010-10-11T12:09:20+0000

You can use amotoen to build grammar and analyze this:

 (ns pegg.core (:gen-class) (:use (com.lithinos.amotoen core string-wrapper)) (:use clojure.contrib.pprint)) (def input "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}") (def grammar { :Start :List :ws #"^[ \n\r\t]*" :Sep "|" :String #"^[A-Za-z !.]+" :Item '(| :String :List) :Items [:Item '(+ [:Sep :Item])] :List [:ws "{" '(* (| :Items :Item)) "}" :ws] }) (def parser (create-parser grammar)) (defn parse [^String input] (validate grammar) (pprint (parser (wrap-string input))))

Result:

 pegg.core> (parse input) {:List [{:ws ""} "{" ({:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "Hello big"}} ([{:Sep "|"} {:Item {:String "Hi"}}] [{:Sep "|"} {:Item {:String "Hey"}}])]}) "}" {:ws " "}]}} {:Items [{:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "world"}} ([{:Sep "|"} {:Item {:String "earth"}}])]}) "}" {:ws ""}]}} ([{:Sep "|"} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "Goodbye"}} ([{:Sep "|"} {:Item {:String "farewell"}}])]}) "}" {:ws " "}]}}])]} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "planet"}} ([{:Sep "|"} {:Item {:String "rock"}}] [{:Sep "|"} {:Item {:String "globe"}}])]} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "."}} ([{:Sep "|"} {:Item {:String "!"}}])]}) "}" {:ws ""}]}}) "}" {:ws ""}]}}) "}" {:ws ""}]}

PS This is one of my first binding grammars, and it could be better. Also see http://en.wikipedia.org/wiki/Parsing_expression_grammar

Break a row into a tree structure?

More articles: