Trying to match all of this with a single regular expression won't get you very far, because a regular expression gives you nothing more than a list of matched substring positions, nothing tree-like. You need a lexer or grammar that does something like this:
Divide the input into tokens (atomic pieces such as '{', '|' and 'world'), then process those tokens in order. Start with an empty tree with a single root node.
Every time you find {, create a child of the current node and move to it.
Every time you find |, create a sibling of the current node and move to it.
Every time you find }, move up to the parent node.
Every time you find a word, put that word in the current leaf node.
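The steps above could be sketched roughly as follows in Python. This is only a minimal illustration, not a definitive implementation: the names Node, tokenize, parse and to_nested are made up for the example, and it assumes that words are separated by whitespace and that only '{', '|' and '}' are special.

```python
import re


class Node:
    """One tree node: the words stored in it plus its child nodes."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.words = []

    def add_child(self):
        child = Node(self)
        self.children.append(child)
        return child


def tokenize(text):
    # Assumption: a token is one of '{', '|', '}' or a run of other
    # non-whitespace characters (a "word").
    return re.findall(r"[{}|]|[^{}|\s]+", text)


def parse(text):
    root = Node()
    current = root
    for tok in tokenize(text):
        if tok == "{":
            # Create a child of the current node and move to it.
            current = current.add_child()
        elif tok == "|":
            # Create a sibling (another child of the parent) and move to it.
            if current.parent is None:
                raise ValueError("'|' outside of any '{...}' group")
            current = current.parent.add_child()
        elif tok == "}":
            # Move back up to the parent node.
            if current.parent is None:
                raise ValueError("unbalanced '}'")
            current = current.parent
        else:
            # A word: store it in the current node.
            current.words.append(tok)
    if current is not root:
        raise ValueError("unclosed '{'")
    return root


def to_nested(node):
    # Convert the tree to plain dicts and lists so it is easy to inspect.
    return {
        "words": node.words,
        "children": [to_nested(child) for child in node.children],
    }


tree = to_nested(parse("{hello|hi {there|you}} world"))
```

Followed literally, these steps do not record where a word sat relative to a nested group, and two adjacent groups at the same level end up sharing one list of children. If that ordering matters, each node would need to keep its words and children in a single ordered list instead.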
aschepler