Break a row into a tree structure?

I am trying to understand how to parse a string in this format in the form of a tree, as a data structure of arbitrary depth.

"{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] [["Goodbye" "farewell"] ["planet" "rock" "globe" ["." "!"]]]] 

I tried playing with some regular expressions for this (for example, # "{([^ {}] *)}"), but everything I tried seems to "smooth" the tree into a large list of lists. I could approach this from the wrong angle, or maybe the regex just isn't right for the job.

Thank you for your help!

+6
parsing clojure text-parsing text-processing tree
source share
4 answers

Do not use regular expressions for this task. An easier way is to describe your string using grammar (BNF or EBNF) and then write a parser to parse the string according to the grammar. You can generate a syntax tree from your EBNF and BNF, and so you naturally get a tree structure.

You can start with something like this:

 element ::= element-type, { ["|"], element-type } element-type ::= primitive | "{", element, "}" primitive ::= symbol | word symbol ::= "." | "!" word ::= character { character } character ::= "a" | "b" | ... | "z" 

Note. I wrote it quickly and maybe this is not entirely correct. But that should give you an idea.

+9
source share

Trying to match all of this with one regular expression doesn’t make you go too far, because regular expressions output nothing more than a list of matching substring positions, nothing tree-like. You need a lexer or grammar that does something like this:

Divide the input into tokens - atomic parts such as '{', '|' and 'world', then process these markers in order. Start with an empty tree with a single root node.

Every time you find { , create and go to the child element of node.

Every time you find | , create and go to the sibling node.

Every time you find } go to the parent node.

Each time you find a word, put that word in the current node sheet.

+4
source share

if you want to hack quickly:

  • replace {chars with [
  • replace} characters with]
  • replace | characters with spaces
  • hope you don't get input with spaces.

read , so it appears as nested arrays.

ps: I agree that reg-ex cannot do this.

pss: set * read-eval * to false (you don't want the input to execute by itself)

+3
source share

You can use amotoen to build grammar and analyze this:

 (ns pegg.core (:gen-class) (:use (com.lithinos.amotoen core string-wrapper)) (:use clojure.contrib.pprint)) (def input "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}") (def grammar { :Start :List :ws #"^[ \n\r\t]*" :Sep "|" :String #"^[A-Za-z !.]+" :Item '(| :String :List) :Items [:Item '(+ [:Sep :Item])] :List [:ws "{" '(* (| :Items :Item)) "}" :ws] }) (def parser (create-parser grammar)) (defn parse [^String input] (validate grammar) (pprint (parser (wrap-string input)))) 

Result:

 pegg.core> (parse input) {:List [{:ws ""} "{" ({:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "Hello big"}} ([{:Sep "|"} {:Item {:String "Hi"}}] [{:Sep "|"} {:Item {:String "Hey"}}])]}) "}" {:ws " "}]}} {:Items [{:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "world"}} ([{:Sep "|"} {:Item {:String "earth"}}])]}) "}" {:ws ""}]}} ([{:Sep "|"} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "Goodbye"}} ([{:Sep "|"} {:Item {:String "farewell"}}])]}) "}" {:ws " "}]}}])]} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "planet"}} ([{:Sep "|"} {:Item {:String "rock"}}] [{:Sep "|"} {:Item {:String "globe"}}])]} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "."}} ([{:Sep "|"} {:Item {:String "!"}}])]}) "}" {:ws ""}]}}) "}" {:ws ""}]}}) "}" {:ws ""}]} 

PS This is one of my first binding grammars, and it could be better. Also see http://en.wikipedia.org/wiki/Parsing_expression_grammar

+1
source share

All Articles