What is an elegant way to parse this data format in Clojure?

The legacy application I'm working with has a funky SGS data format. I have reviewed and started working on several brute-force solutions, including a hand-written finite state machine and my own recursive descent parser, but I am trying to build an application in which the volume of (non-library) source code is just enough to express what needs to be done.

So I have been looking at Clojure-based parser libraries. I was messing with

None of them has enough documentation or support online to get me unstuck. So I'm looking for someone who has experience with one of these tools (or a good alternative) to give me a hand.

Here's the data language:


  • Data is represented as rows, each consisting of a label (starting in column 1) and one or more fields separated by one or more spaces.

  • Fields consist of one or more subfields separated by commas. Commas may be followed by spaces for readability, but they are not significant.

  • Labels are identifiers consisting of characters in the set [-$0-9A-Z_*%] and do not have to be unique.

  • - , (). . , .

  • Space-dot-space . - . . , , .

  • . , , .

  • ( ) .

Sample input:

. Comment
.
LAB1  F1S1                       . Minimal data row, with line comment
LAB1  F1S1,F1S2,F1S3  F2S1  F3S1 . 2nd row with same label
LAB2  , , , F1S4     ''Field #2 (only 1 subfield)''  F3S1,,F3S3
LAB99 F1S1,                      . Field 1 has 2 subfields, 2nd is nil
LAB3  F1S1,F1S2, ;
      F1S3       ;
      F2S1                       . Row continued over 3 lines. 

Expected result:

[
 ("LAB1" ["F1S1"])
 ("LAB1" ["F1S1" "F1S2" "F1S3"] ["F2S1"] ["F3S1"])
 ("LAB2" [nil nil nil "F1S4"] ["Field #2 (only 1 subfield)"] ["F3S1" nil "F3S3"])
 ("LAB99" ["F1S1" nil])
 ("LAB3" ["F1S1" "F1S2" "F1S3"] ["F2S1"])
]
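
As a baseline to compare library solutions against, the mapping from the sample rows to the structure above can be sketched in plain Clojure with a regex tokenizer and a small fold. This is only a sketch under assumptions read off the sample, not a full specification of the format: a trailing `;` joins the next line, a line whose first non-blank character is `.` is a comment, whitespace-dot-whitespace starts a line comment, and whitespace directly after a comma is ignored. The names `token-re`, `classify`, `step`, `parse-row` and `parse-sgs` are made up for illustration.

```clojure
(ns sgs.sketch
  (:require [clojure.string :as str]))

;; One token is: a ''quoted string'', an identifier, a comma,
;; a trailing line comment, or a run of whitespace.
(def token-re
  #"''[^']*''|[-$0-9A-Z_*%]+|,|\s+\.(?:\s.*)?$|\s+")

(defn- classify [tok]
  (cond
    (= tok ",")                 [:comma]
    (str/starts-with? tok "''") [:value (subs tok 2 (- (count tok) 2))]
    (str/blank? tok)            [:space]
    (re-matches #"\s+\..*" tok) [:comment]
    :else                       [:value tok]))

(defn- close-field
  "Moves the field being built (if any) onto the list of finished fields."
  [{:keys [cur] :as st}]
  (if cur
    (-> st (update :fields conj cur) (assoc :cur nil))
    st))

(defn- step
  "Folds one token into the state. A comma appends a nil placeholder that
  the next value (if any) overwrites, so empty subfields come out as nil."
  [{:keys [cur prev] :as st} [kind v]]
  (case kind
    :comma   (assoc st :cur (conj (or cur [nil]) nil) :prev :comma)
    :value   (if (and cur (= prev :comma))
               (assoc st :cur (conj (pop cur) v) :prev :value)
               (assoc (close-field st) :cur [v] :prev :value))
    :space   (if (= prev :comma) st (assoc (close-field st) :prev :space))
    :comment (assoc (close-field st) :prev :space)))

(defn parse-row
  "Parses one logical row (continuations already joined) into
  (label [subfield ...] ...). Assumes the label starts in column 1."
  [line]
  (let [[[_ label] & more] (map classify (re-seq token-re line))
        {:keys [fields]}   (close-field
                            (reduce step {:fields [] :cur nil :prev :space} more))]
    (cons label fields)))

(defn parse-sgs
  "Joins continued lines, drops blank and comment lines, parses the rest."
  [text]
  (->> (str/split-lines (str/replace text #";[ \t]*\r?\n" " "))
       (remove #(re-matches #"\s*(\..*)?" %))
       (mapv parse-row)))
```

For example, `(parse-row "LAB99 F1S1,   . trailing comma")` yields `("LAB99" ["F1S1" nil])`. Characters that match no token are silently skipped by `re-seq`, so a real implementation would want error reporting on top of this.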

UPDATE:

@edwood . , , ", ".

This grammar, written for InstaParse, sorta-works:

     SGS = (<COMMENT_LINE> / DATA_LINES) *
     COMMENT_LINE = #' *\\.(?: [^\\n]*)?\\n' 
     DATA_LINES = LABEL FIELDS SEPARATOR? (LINE_COMMENT | '\\n')
     LABEL = IDENTIFIER
     FIELDS = '' | (SEPARATOR FIELD)+
     SEPARATOR = CONTINUATION #' +' | #' +' (CONTINUATION #' *')?
     CONTINUATION = #'; *\\n'
     LINE_COMMENT = #' .[^\\n]*\\n'  
     FIELD = SUBFIELD (',' SEPARATOR? SUBFIELD)*
     SUBFIELD = IDENTIFIER | QUOTED_STRING | ''
     IDENTIFIER = #'[-$0-9A-Z_*%]+'
     QUOTED_STRING = #'\\'\\'[^\\']*\\'\\''

249 , , . , , -, 431 2

CompilerException java.lang.OutOfMemoryError: Java heap space, compiling:(sgs2.clj:40:13)

regexp-handled regexps, , , . , , .


228 , 16 . , . - ?


, :

"<SGS> = (<COMMENT_ROW> | ROW)+
<NL> = '\\n'
<qq> = \"''\"
space = <#'\\s*'>
COMMENT_ROW = COMMENT NL?
LABEL = 'LAB' #'\\d+'
EMPTY_F = <space>
FFIELD = 'F' #'[0-9A-Z]+'
QFIELD = (<qq> (!qq #'.')+ <qq>)
<F> = FFIELD / QFIELD / EMPTY_F
F_SEP = ((space? | ',')* ';' NL space?) / (<space?> ',' <space?>) / <space>
<NEXT_FIELDS> = F <space?> (<F_SEP> NEXT_FIELDS)? <space?>
FIELDS = F <space?> (<F_SEP> NEXT_FIELDS)? <space?>
COMMENT = '.' #'.*'
ROW = LABEL <space?> FIELDS <space?> <COMMENT?> <NL?>"

, - . :

sgs.core> (sgs example-input)
([:ROW [:LABEL "LAB" "1"] [:FIELDS [:FFIELD "F" "1S1"]]] [:ROW [:LABEL "LAB" "1"] [:FIELDS [:FFIELD "F" "1S1"] [:FFIELD "F" "1S2"] [:FFIELD "F" "1S3"] [:FFIELD "F" "2S1"] [:FFIELD "F" "3S1"]]] [:ROW [:LABEL "LAB" "2"] [:FIELDS [:EMPTY_F] [:EMPTY_F] [:EMPTY_F] [:FFIELD "F" "1S4"] [:QFIELD "F" "i" "e" "l" "d" " " "#" "2" " " "(" "o" "n" "l" "y" " " "1" " " "s" "u" "b" "f" "i" "e" "l" "d" ")"] [:FFIELD "F" "3S1"] [:EMPTY_F] [:FFIELD "F" "3S3"]]] [:ROW [:LABEL "LAB" "99"] [:FIELDS [:FFIELD "F" "1S1"] [:EMPTY_F]]] [:ROW [:LABEL "LAB" "3"] [:FIELDS [:FFIELD "F" "1S1"] [:FFIELD "F" "1S2"] [:FFIELD "F" "1S3"] [:FFIELD "F" "2S1"]]])
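
The nodes in this tree still carry their text in pieces (`"LAB" "1"`, or one string per character for `:QFIELD`). A minimal sketch of collapsing them with core functions only is shown here; `node->value` and `row->flat` are hypothetical names, and this is not the `parse-and-transform` function from the gist. Because the printed tree is flat, the sketch does not recover field/subfield grouping.

```clojure
(defn node->value
  "Collapses one leaf node of the printed tree into a string, or nil
  for an empty field."
  [[tag & parts]]
  (case tag
    :EMPTY_F                 nil
    (:LABEL :FFIELD :QFIELD) (apply str parts)))

(defn row->flat
  "Turns [:ROW [:LABEL ...] [:FIELDS ...]] into (label value ...)."
  [[_ label [_ & fields]]]
  (cons (node->value label) (map node->value fields)))
```

Applied to the fourth row above, `(row->flat [:ROW [:LABEL "LAB" "99"] [:FIELDS [:FFIELD "F" "1S1"] [:EMPTY_F]]])` gives `("LAB99" "F1S1" nil)`. Instaparse's own `insta/transform` takes a map from tag to function and could express the same mapping.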

50 . .

sgs.core> (pprint (parse-and-transform sgs example-input))
[("LAB1" ["F1S1"])
 ("LAB1" ["F1S1" "F1S2" "F1S3"] ["F2S1"] ["F3S1"])
 ("LAB2"
  [nil nil nil "F1S4"]
  ["Field #2 (only 1 subfield)"]
  ["F3S1" nil "F3S3"])
 ("LAB99" ["F1S1" nil])
 ("LAB3" ["F1S1" "F1S2" "F1S3"] ["F2S1"])]

Full code: https://gist.github.com/edbond/8052305

https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md

.


, , Parse-EZ. , / Parse-EZ (-trim-off). - "sgs" . : ( sgs )

(ns sgs-parser
  (:use [protoflex.parse]))

(defn line-comments [] (multi* #(regex #"(\r?\n)?\..*\r?\n")))
(defn wsp [] (regex #"[ \t]*(\. .*)?"))
(defn trim [parse-fn] (wsp) (let [r (parse-fn)]  (wsp) r))

(defn label [] (regex #"[-$0-9A-Z_*%]+"))
(defn quoted-str [] (between #(string "''") #(regex #"[^']*") #(string "''")))

(defn sub-field [] (trim #(any label quoted-str)))

(defn- eol? [] (starts-with-re? #"\r?\n"))
(defn field [] 
  (when (not (or (eol?) (at-end?)))
    (when (starts-with? ";") (skip-over "\n"))
    (loop [sfs []]
      (let [sf (opt sub-field)]
        (if (opt #(trim comma))
          (recur (conj sfs sf))
          (conj sfs sf))))))

(defn record [] 
  (line-comments) 
  (let [ret (into [(trim label)] (multi+ field))]
    (any #(regex #"\r?\n") at-end?)
    ret))

(defn sgs [] (with-trim-off (wsp) (multi* record)))