Unwanted regex in Flex

Question

Unwanted regex in Flex

I just started with Flex and can't figure out how to match the following expression:

"Dog".*"Cat" ------------------ Input : Dog Ca Cat Cc Cat ------------------ Output: Dog Ca Cat Cc Cat

But I want a non-greedy match with the following output:

 Output: Dog Ca Cat

How can this be implemented on Flex?

EDIT

Tried the following:

 %% Dog.*Cat/.*Cat printf("Matched : ||%s||", yytext); dog.*cat printf("Matched : ||%s||", yytext); dOg[^c]*cAt printf("Matched : ||%s||", yytext); DOG.*?CAT printf("Matched : ||%s||", yytext); %%

Entrance:

 Dog Ca Cat Cc Cat dog Ca cat Cc cat dOg Ca cAt Cc cAt DOG CA CAT CC CAT

Exit:

 Matched : ||Dog Ca Cat Cc Cat|| Matched : ||dog Ca cat Cc cat|| Matched : ||dOg Ca cAt|| Cc cAt Matched : ||DOG CA CAT CC CAT||

Also gets a warning:

 lex4.l:2: warning, dangerous trailing context

Flex Version:

 flex 2.5.35 Apple(flex-31)

+4

regex flex-lexer lex

Kyuubi Oct 21 '13 at 7:45

source share

2 answers

Brian Tompsett - 汤莱恩 · Answer 1 · 2015-04-19T13:58:53+0000

This is a fairly common problem using lex / flex tools that feed newbies (and sometimes not newbies). There are two solutions to the problem that require two different advanced tool features. A phrase like dog ... cat presents the same problem as matching comments in different programming languages, for example, the comment form C /* ... */ or even 'comment' ... 'tnemmoc' . They have the same characteristics as your example. Consider the following C code:

 /* This is a comment */ "This is a String */"

The greedy lexical coincidence of this will correspond to the wrong comment terminator (and this is a good student test lexer BTW!).

Solutions are offered for several university compiler courses. One that explains this well is here (in Manchester) . Which leads to some good books that also cover issues:

J. Levine, T. Mason, and D. Brown: Lex and Yacc (2nd ed.)
MELesk and E.Schmidt: Lex - generator of the lexical analyzer

The two methods described are: Launch conditions to explain the state machine or manual input to read characters directly.

For your problem with cat ... dog they can be programmed as follows:

Launch conditions

In this solution, we need several states. The keyword dog causes it to go into the state dog , which continues until the letter c . Then it enters the LETTERC state, followed by the letter a if the dog state is not satisfied; the letter a causes the CAT state, which must be followed by the letter t , which causes the entire phrase to match and returns to the INITIAL state. yymore saves all dog ... cat text for use.

 %x DOG LETTERC CAT d [dD] o [oO] g [gG] c [cC] a [aA] t [tT] ws [ \t\r\n]+ %% <INITIAL>{d}{o}{g} { BEGIN(DOG); printf("DOG\n"); yymore(); } <DOG>[^cC]*{c} { printf("C: %s\n",yytext); yymore(); BEGIN(LETTERC); } <LETTERC>{a} { printf("A: %s\n",yytext); yymore(); BEGIN(CAT); } <LETTERC>[^aA] { BEGIN(DOG); yymore(); } <CAT>{t} { printf("CAT: %s\n",yytext); BEGIN(INITIAL); } <CAT>[^tT] { BEGIN(DOG); yymore(); } <INITIAL>{ws} /* skip */ ;

Manual input

The manual input method corresponds to the initial phrase dog and enters the C code, which swallows the input characters until a CAT sequence is found. (I was not worried about both upper and lower case). The problem with this solution is that it is difficult to store the value of the input text in yytext for later use in the parser. He discards it, which would be nice if the construct was a comment, but otherwise it would not be so useful.

 d [dD] o [oO] g [gG] ws [ \t\r\n]+ %% {d}{o}{g} { register int c; for ( ; ; ) { /* Not dealt with upper case .. left as an exercise */ while ( (c = input()) != 'c' && c != EOF ) ; /* eat up text of dog */ if ( c == 'c' ) { if ( ( c = input()) == 'a' ) if ( (c = input()) == 't' ) break; /* found the end */ } if ( c == EOF ) { REJECT; break; } } /* because we have used input() yytext always contains "dog" */ printf("cat: %s\n", yytext); } {ws} /* skip */ ;

(Both of these solutions have been tested)

Paul · Answer 2 · 2016-03-26T14:17:25+0000

Good question. Here is a pure regex, without using unwanted syntax .*? :

Dog([^C]|C+(aC+)*([^Ca]|a[^Ct]))*C+(aC+)*at

Unwanted regex in Flex

Launch conditions

Manual input

More articles: