As long as the patterns are for full words (you do not want "or" to match "storage"; spaces and punctuation marks act as match anchors), an easy way to go is to translate your vocabulary into the input of a scanner generator (for example, you could use flex), generate a scanner, and then run it over your input.
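The translation step can be sketched in C. This is only an illustration under assumptions: the helper name emit_rule is hypothetical, and the entry format (plain words, a trailing '*' meaning "any alphabetic suffix", single spaces between the words of a phrase) is assumed here rather than taken from flex or from your data.

```c
#include <stdio.h>
#include <string.h>

/* Append s to out (capacity outsz), tracking the length in *n.
 * Returns 0 on success, -1 if the text would not fit. */
static int append(char *out, size_t outsz, size_t *n, const char *s)
{
    size_t len = strlen(s);
    if (*n + len + 1 > outsz) return -1;
    memcpy(out + *n, s, len + 1);   /* copies the terminating NUL too */
    *n += len;
    return 0;
}

/* Hypothetical helper: turn one vocabulary entry into one flex rule.
 * Assumed entry format: literal text, a space between words of a phrase,
 * and an optional trailing '*' meaning "any alphabetic suffix".
 * Returns the rule length, or -1 if out is too small. */
int emit_rule(const char *entry, int id, char *out, size_t outsz)
{
    size_t n = 0;
    int quoted = 0;                 /* currently inside a "..." literal? */
    char tmp[32];

    if (outsz == 0) return -1;
    out[0] = '\0';
    for (const char *p = entry; *p; p++) {
        int wildcard = (*p == '*' && p[1] == '\0');
        if (*p == ' ' || wildcard) {
            /* Close the open literal, then emit a character-class pattern. */
            if (quoted && append(out, outsz, &n, "\"") < 0) return -1;
            quoted = 0;
            if (append(out, outsz, &n,
                       wildcard ? "[[:alpha:]]*" : "[[:space:]]+") < 0)
                return -1;
            continue;
        }
        if (!quoted && append(out, outsz, &n, "\"") < 0) return -1;
        quoted = 1;
        /* Inside a flex quoted string, '"' and '\' must be escaped. */
        if (*p == '"' || *p == '\\') { tmp[0] = '\\'; tmp[1] = *p; tmp[2] = '\0'; }
        else                         { tmp[0] = *p;   tmp[1] = '\0'; }
        if (append(out, outsz, &n, tmp) < 0) return -1;
    }
    if (quoted && append(out, outsz, &n, "\"") < 0) return -1;
    snprintf(tmp, sizeof tmp, " { return %d; }", id);
    if (append(out, outsz, &n, tmp) < 0) return -1;
    return (int)n;
}
```

Calling emit_rule("great loss", 5, buf, sizeof buf) fills buf with a rule that matches the two words separated by any run of whitespace. Writing one such line per entry between the two %% markers gives the rules section of the scanner specification.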
Scanner generators are designed to identify occurrences of tokens in the input, where each type of token is described by a regular expression. Flex and similar programs create scanners quickly. By default, flex handles up to 8k rules (in your case, vocabulary entries), and this limit can be raised. The generated scanners run in linear time and in practice are very fast.
Internally, the token regular expressions are converted by the standard Kleene's theorem pipeline: first into an NFA, then into a DFA. The DFA is then converted to its unique minimal form. This is encoded in an HLL table, which is emitted inside a wrapper that implements the scanner by referring to the table. This is what flex does, but other strategies are possible. For example, the DFA can be translated into goto code, where the state of the DFA is represented implicitly by the instruction pointer as the code runs.
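To make the two strategies concrete, here is a minimal hand-written sketch in C of one small DFA, coded both ways. It is not flex output: the DFA accepts only the whole strings "good" and "bad", and the state numbering, character classes, and function names are all chosen for illustration.

```c
#include <stddef.h>

/* Character classes keep the table narrow: one column per class
 * instead of one column per possible byte value. */
enum { C_G, C_O, C_D, C_B, C_A, C_OTHER, NCLASS };
enum { DEAD = -1, ACCEPT = 5, NSTATE = 6 };

static int char_class(char c)
{
    switch (c) {
    case 'g': return C_G;
    case 'o': return C_O;
    case 'd': return C_D;
    case 'b': return C_B;
    case 'a': return C_A;
    default:  return C_OTHER;
    }
}

/* State 3 is shared: after "goo" or after "ba", only 'd' is acceptable. */
static const signed char next_state[NSTATE][NCLASS] = {
    /*              g     o     d     b     a  other */
    /* 0 start */ {    1, DEAD, DEAD,    4, DEAD, DEAD },
    /* 1 "g"   */ { DEAD,    2, DEAD, DEAD, DEAD, DEAD },
    /* 2 "go"  */ { DEAD,    3, DEAD, DEAD, DEAD, DEAD },
    /* 3 want d*/ { DEAD, DEAD,    5, DEAD, DEAD, DEAD },
    /* 4 "b"   */ { DEAD, DEAD, DEAD, DEAD,    3, DEAD },
    /* 5 accept*/ { DEAD, DEAD, DEAD, DEAD, DEAD, DEAD },
};

/* Table-driven: the current state lives in a variable. */
int match_table(const char *s)
{
    int state = 0;
    for (; *s; s++) {
        state = next_state[state][char_class(*s)];
        if (state == DEAD) return 0;
    }
    return state == ACCEPT;
}

/* Goto-coded: each label is a state, so "where execution is"
 * replaces the state variable. */
int match_goto(const char *s)
{
    /* state 0 (start) */
    switch (*s++) {
    case 'g': goto s1;
    case 'b': goto s4;
    default:  return 0;
    }
s1: if (*s++ == 'o') goto s2; return 0;
s2: if (*s++ == 'o') goto s3; return 0;
s3: if (*s++ == 'd') goto s5; return 0;
s4: if (*s++ == 'a') goto s3; return 0;
s5: return *s == '\0';
}
```

Both functions return 1 for "good" and "bad" and 0 for anything else. The table version is easy to generate mechanically and keeps the driver loop tiny; the goto version trades that for straight-line code with no table lookups.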
The reason for the caveat about spaces as anchors is that scanners created by programs such as flex are usually unable to identify overlapping matches: "strangers" cannot match both "strangers" and "range", for example.
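The difference can be shown with a small C sketch. It only imitates the consume-and-continue behavior of a tokenizing scanner with plain string functions (it is not how flex is implemented), and both function names are made up for this example.

```c
#include <string.h>

/* Count matches the way a tokenizing scanner would: at each position
 * take the longest word that matches there, then resume AFTER it.
 * Text inside a consumed match is never examined again. */
size_t count_scanner_style(const char *text,
                           const char *const words[], size_t nwords)
{
    size_t count = 0;
    while (*text) {
        size_t best = 0;
        for (size_t i = 0; i < nwords; i++) {
            size_t len = strlen(words[i]);
            if (len > best && strncmp(text, words[i], len) == 0)
                best = len;
        }
        if (best) { count++; text += best; }
        else      { text++; }   /* like a catch-all rule: skip one char */
    }
    return count;
}

/* Count every occurrence of every word, including occurrences that
 * overlap another match. Empty words are skipped. */
size_t count_overlapping(const char *text,
                         const char *const words[], size_t nwords)
{
    size_t count = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (words[i][0] == '\0') continue;
        for (const char *p = strstr(text, words[i]); p;
             p = strstr(p + 1, words[i]))
            count++;
    }
    return count;
}
```

With the words "strangers" and "range" and the text "strangers", the scanner-style count is 1, because "range" lies entirely inside the already consumed token, while the overlapping count is 2.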
Here is a flex scanner that matches the example vocabulary you gave:
    %option noyywrap
    %%
    "good"                      { return 1; }
    "bad"                       { return 2; }
    "freed"[[:alpha:]]*         { return 3; }
    "careless"[[:alpha:]]*      { return 4; }
    "great"[[:space:]]+"loss"   { return 5; }
    .                           { }
    "\n"                        { }
    <<EOF>>                     { return -1; }
    %%
    int main(int argc, char *argv[])
    {
        yyin = argc > 1 ? fopen(argv[1], "r") : stdin;
        for (;;) {
            int found = yylex();
            if (found < 0) return 0;
            printf("matched pattern %d with '%s'\n", found, yytext);
        }
    }
And to run it:
    $ flex -i foo.l
    $ gcc lex.yy.c
    $ ./a.out
    Good men can only lose freedom to bad
    matched pattern 1 with 'Good'
    matched pattern 3 with 'freedom'
    matched pattern 2 with 'bad'
    through carelessness or apathy.
    matched pattern 4 with 'carelessness'