Prolog DCG: Writing a programming language lexer

I am currently trying to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which does not really go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several small problems that suggest I am missing something obvious.

All of my little parsers seem to be working fine; at the moment this is a snippet of my code:

    :- use_module(library(dcg/basics)).

    operator('(') --> "(".
    operator(')') --> ")".
    operator('[') --> "[".
    operator(']') --> "]".
    % ... etc.

    keyword(array) --> "array".
    keyword(break) --> "break".
    % ... etc.

This is a little repetitive, but it seems to work. Then I have some code that I don’t entirely like and would welcome suggestions on, but which does seem to work:

    id(id(Id)) -->
        [C], { char_type(C, alpha) },
        idRest(Rest),
        { atom_chars(Id, [C|Rest]) }.

    idRest([C|Rest]) -->
        [C],
        { char_type(C, alpha) ; char_type(C, digit) ; C = '_' },
        idRest(Rest).
    idRest([]) --> [].

    int(int(Int)) --> integer(Int).

    string(str(String)) -->
        "\"", stringContent(Codes), "\"",
        { string_chars(String, Codes) }.

    stringContent([C|Chars]) --> stringChar(C), stringContent(Chars).
    stringContent([])        --> [].

    stringChar(0'\n) --> "\\n".
    stringChar(0'\t) --> "\\t".
    stringChar(0'\") --> "\\\"".
    stringChar(0'\\) --> "\\\\".
    stringChar(C)    --> [C].

The basic rule for my tokenizer is the following:

 token(X) --> whites, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)). 

This is not perfect; I will see int parsed into in, id(t), because keyword(X) comes before id(X). So I guess that is the first question.

The biggest question I have is that I don’t see how to properly integrate comments into this situation. I tried the following:

    skipAhead --> [].
    skipAhead --> (comment ; whites), skipAhead.

    comment --> "/*", anything, "*/".

    anything --> [].
    anything --> [_], anything.

    token(X) -->
        skipAhead,
        (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).

This does not seem to work; the parses that are returned (and I get many parses) do not seem to have the comment removed. I am nervous that my comment rule is needlessly inefficient and probably causes a lot of unnecessary backtracking. I am also nervous that whites//0 from dcg/basics is deterministic; however, that part of the equation seems to work. It is just integrating it with the comment skipping that does not.

As a final note, I don’t see how to handle propagating parse errors back to the user with line/column information from here. It feels like I would have to track and thread through some kind of current line/column information, write it into the tokens, and then maybe try to rebuild the line if I wanted to do something similar to how llvm does it. Is that fair, or is there a “recommended practice”?

The whole code can be found in this haste.

+6
2 answers

I have this code to support error reporting, which itself must be handled with care, sprinkling meaningful messages and "skip rules" around the code. But there is no ready-to-use alternative: a DCG is a nice computation engine, but out of the box it cannot compete with specialized parsing engines, which are able to emit error messages automatically by exploiting the theoretical properties of the target grammars ...

    :- dynamic text_length/1.

    parse_conf_cs(Cs, AST) :-
        length(Cs, TL),
        retractall(text_length(_)),
        assert(text_length(TL)),
        phrase(cfg(AST), Cs).

    ....

    %% tag(?T, -X, -Y)// is det.
    %
    %  Start/Stop tokens for XML like entries.
    %  Maybe this should restrict somewhat the allowed text.
    %
    tag(T, X, Y) -->
        pos(X),
        unquoted(T),
        pos(Y).

    ....

    %% pos(-C, +P, -P) is det.
    %
    %  capture offset from end of stream
    %
    pos(C, P, P) :-
        text_length(L),
        length(P, Q),
        C is L - Q.

tag//3 is just an example of usage. In this parser I am building an editable AST, so I store the positions to be able to properly attribute each nested part in an editor ...
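To make the offset idea concrete, here is a minimal, self-contained sketch built around the same pos//1 trick. The toy grammar words//1, the term w(Atom, From, To) and the wrapper locate/2 are hypothetical names introduced only for illustration; blanks//0 and nonblanks//1 come from library(dcg/basics). It assumes SWI-Prolog 7 or later.

```prolog
:- use_module(library(dcg/basics)).   % provides blanks//0 and nonblanks//1
:- dynamic text_length/1.

% pos(-Offset)// : 0-based offset of the current position, computed from the
% length of the remaining input (same idea as pos//1 in the answer).
pos(Offset, Rest, Rest) :-
    text_length(Total),
    length(Rest, Remaining),
    Offset is Total - Remaining.

% words(-Ws)// : hypothetical toy grammar. Each word is returned together
% with its start and end offsets as w(Atom, From, To).
words([w(A, From, To)|Ws]) -->
    blanks, pos(From),
    nonblanks(Cs), { Cs = [_|_] },     % require a non-empty word
    pos(To), !,
    { atom_codes(A, Cs) },
    words(Ws).
words([]) --> blanks.

% locate(+Codes, -Ws): record the total input length, then parse.
locate(Codes, Ws) :-
    length(Codes, Total),
    retractall(text_length(_)),
    assertz(text_length(Total)),
    phrase(words(Ws), Codes).
```

With this sketch, the query ?- locate(`ab  cde`, Ws). should bind Ws to [w(ab,0,2), w(cde,4,7)]. Note that length/2 walks the remaining input on every call to pos//1, so this is convenient rather than fast.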

Edit

A small enhancement for id//1: SWI-Prolog has a specialized code_type/2 for that:

    1 ?- code_type(0'a, csymf).
    true.

    2 ?- code_type(0'1, csymf).
    false.

so (glossing over the literal transformation)

    id([C|Cs]) --> [C], {code_type(C, csymf)}, id_rest(Cs).

    id_rest([C|Cs]) --> [C], {code_type(C, csym)}, id_rest(Cs).
    id_rest([]) --> [].

Depending on your attitude towards generalizing small snippets, and on the actual grammar details, id_rest//1 can be written in a reusable fashion and made deterministic:

    id([C|Cs]) --> [C], {code_type(C, csymf)}, codes(csym, Cs).

    % greedy and deterministic
    codes(Kind, [C|Cs]) --> [C], {code_type(C, Kind)}, !, codes(Kind, Cs).
    codes(Kind, []), [C] --> [C], {\+code_type(C, Kind)}, !.
    codes(_, []) --> [].

This stricter definition of id//1 also makes it possible to remove some ambiguity with respect to identifiers that have keyword prefixes: recoding keyword//1 like

    keyword(K) --> id(id(K)), {memberchk(K, [ array, break, ... ])}.

will correctly identify

    ?- phrase(tokenize(Ts), `if1*2`).
    Ts = [id(if1), *, int(2)] ;
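To try the idea in isolation, here is a minimal, self-contained sketch (SWI-Prolog 7 or later is assumed, because of the backquoted code lists). The names keyword_atom/1 and word//1, the tiny keyword table and the small operator set are hypothetical stand-ins for the parts elided above; blanks//0, digit//1 and digits//1 come from library(dcg/basics).

```prolog
:- use_module(library(dcg/basics)).   % provides blanks//0, digit//1, digits//1

% Hypothetical keyword table, standing in for the elided keyword list.
keyword_atom(array).
keyword_atom(break).
keyword_atom(if).

% tokens(-Ts)// : tokenize the whole input, committing to each token once.
tokens([T|Ts]) --> blanks, token(T), !, tokens(Ts).
tokens([])     --> blanks.

token(K)      --> keyword(K).
token(T)      --> id(T).
token(int(N)) --> digit(D), digits(Ds), { number_codes(N, [D|Ds]) }.
token(Op)     --> [C], { memberchk(C, `()[]*+-`), atom_codes(Op, [C]) }.

% A keyword is a complete identifier-shaped word that is in the table,
% so a word like if1 can never be split into a keyword plus a rest.
keyword(K) --> word(K), { keyword_atom(K) }.
id(id(A))  --> word(A).

% word(-Atom)// : longest run of identifier characters, as an atom.
word(A) -->
    [C], { code_type(C, csymf) },
    codes(csym, Cs),
    { atom_codes(A, [C|Cs]) }.

% codes(+Kind, -Cs)// : greedy and deterministic run of codes of type Kind.
codes(Kind, [C|Cs]) --> [C], { code_type(C, Kind) }, !, codes(Kind, Cs).
codes(_, [])        --> [].
```

With this sketch, ?- phrase(tokens(Ts), `if1*2`). should give Ts = [id(if1), *, int(2)], while the input `if break x` should give [if, break, id(x)]. The keyword test runs only after the whole word has been read, which is what removes the prefix ambiguity.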

Your string//1 (OT: what an unfortunate clash with library(dcg/basics):string//1) is an easy candidate for implementing a simple error recovery strategy:

    stringChar(0'\\) --> "\\\\".
    stringChar(0'") --> pos(X), "\n", {format('unclosed string at ~d~n', [X])}.

This is an example of “report the error and insert the missing token”, so that parsing can go on ...
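A self-contained variation on this recovery idea could look as follows. It is only a sketch under stated assumptions: instead of printing with format/2, it collects a hypothetical error term unclosed_string in a second argument so that the result can be inspected, and the names string_token//2, string_body//2 and unescape/2 are invented for the example. eos//0 comes from library(dcg/basics).

```prolog
:- use_module(library(dcg/basics)).   % provides eos//0

% string_token(-Token, -Errors)// : a double-quoted string on a single line.
% Errors is [] for a well-formed string, [unclosed_string] otherwise.
string_token(str(S), Es) -->
    "\"", string_body(Cs, Es),
    { string_codes(S, Cs) }.

% Proper closing quote.
string_body([], []) --> "\"", !.
% Newline before the closing quote: report it, behave as if the quote had
% been there, and push the newline back so that lexing can go on.
string_body([], [unclosed_string]), [0'\n] --> [0'\n], !.
% End of input before the closing quote.
string_body([], [unclosed_string]) --> eos, !.
% Escape sequence: backslash followed by one code.
string_body([C|Cs], Es) --> "\\", [E], !, { unescape(E, C) }, string_body(Cs, Es).
% Any other code is taken literally.
string_body([C|Cs], Es) --> [C], string_body(Cs, Es).

% unescape(+EscapeCode, -Code)
unescape(0'n, 0'\n) :- !.
unescape(0't, 0'\t) :- !.
unescape(C, C).                        % \" and \\ stand for themselves
```

For a well-formed input the error list stays empty. For an input that hits a newline first, the token still comes back, the error list holds unclosed_string, and the remaining input starts at the newline.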

+2

It currently still looks a bit strange (unreadableNamesLikeInJavaAnyone?), but at its core it is quite solid, so I only have a few comments about some aspects of the code and the questions:

  • Separating lexing from parsing makes perfect sense. It is also a perfectly acceptable solution to store line and column information along with each token, leaving tokens (for example) of the form l_c_t(Line,Column,Token) or Token-lc(Line,Column) for the parser to process.
  • Comments are always nasty, or should I say, often not-nesty? A useful pattern in DCGs is often to go for the longest match, which you are already using in some cases, but not yet for anything//0. So reordering the two rules may help you skip over everything that is meant to be commented away.
  • Regarding determinism: it is fine to commit to the first parse that matches, but do it only once, and resist the temptation to mess up the declarative grammar.
  • In DCGs, it is elegant to use | instead of ;.
  • tokenize//1? Come on! That is just tokens//1. It makes sense in all directions.
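As an illustration of the comment and determinism points, here is a minimal sketch of a skipping rule. It is one possible reading of the advice, not necessarily what was meant: the longest-match idea is applied to the skipping rule itself (skipping is tried before stopping, and committed to once), while the comment body deliberately ends at the first "*/", which assumes comments do not nest. The names skip_ahead//0 and comment_body//0 are hypothetical; blank//0 comes from library(dcg/basics).

```prolog
:- use_module(library(dcg/basics)).   % provides blank//0

% skip_ahead// : consume as much white space and as many comments as
% possible. The recursive clause comes first, so skipping wins over
% stopping, and the cut commits to each skipped item exactly once.
skip_ahead --> ( blank | comment ), !, skip_ahead.
skip_ahead --> [].

% comment// : a non-nesting /* ... */ comment.
comment --> "/*", comment_body.

% comment_body// : stop at the first "*/", otherwise drop one code and go on.
comment_body --> "*/", !.
comment_body --> [_], comment_body.
```

For example, ?- phrase(skip_ahead, `  /* a */ /* b */ x`, Rest), atom_codes(A, Rest). should leave only the x, binding A to x. An unterminated comment makes comment//0 fail, so skip_ahead//0 stops in front of it and the token rules get to report the problem.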
+5