How can I use a more general data structure?

Question

I am trying to create a text parser that will allow limited custom replacement rules.

In particular, I am reading codes from a DOS ASCII file in which the ordering is significant and line numbering should be supported. Using this input, I want to apply custom substitution rules (replace this line with that line; if this line is followed by that line, perform this translation; and so on).

The output is also a DOS ASCII formatted file.

Most rules are direct replacements (tit for tat). However, there are situations where I want to define a rule such as: if A follows B at any time in the future, apply this rule.

For this, I use a tree of structures like this:

    struct node {
        list<string> common;  // the text which is not affected by conditions
        string condition;     // matching this string selects the left, otherwise the right
        node *lptr, *rptr;    // pointers to the child nodes, if needed
    };

Whenever I come across such a rule, I can keep both versions of the output, one with the rule applied and one with it omitted, delaying the decision of which to use until it is unambiguously resolved.

This wastes a bit of memory, but it seems like the best way to avoid having to scan the input twice (the size of the input is unknown, but probably less than 1 megabyte).

Of course, another rule of this type may be triggered in one or both of the child nodes, hence the tree structure.

There is no requirement that children be resolved before their parents; it is possible that a parent can be resolved in only one branch of a child. Reaching EOF resolves all unresolved children in the false direction.

Thus, I have to be careful when unwinding and collapsing nodes.
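To make the collapsing step concrete, here is a minimal sketch of how such a tree might be flattened when EOF is reached. It assumes a variant of the struct above with owning pointers; the names Node, flatten and resolve_at_eof are hypothetical, and the sketch covers only the EOF case, not resolution in the middle of the input.

```cpp
#include <list>
#include <memory>
#include <string>

// Hypothetical variant of the node above, using owning pointers.
struct Node {
    std::list<std::string> common;     // text not affected by any condition
    std::string condition;             // pending condition; empty if none
    std::unique_ptr<Node> lptr, rptr;  // left = condition matched, right = not matched
};

// Append the lines of `n` to `out`, then follow the branch selected by
// `matched`. Nested nodes that are still unresolved take the right (false) branch.
void flatten(const Node& n, bool matched, std::list<std::string>& out) {
    out.insert(out.end(), n.common.begin(), n.common.end());
    const Node* next = matched ? n.lptr.get() : n.rptr.get();
    if (next) flatten(*next, false, out);
}

// At EOF every unresolved condition is treated as false.
std::list<std::string> resolve_at_eof(const Node& root) {
    std::list<std::string> out;
    flatten(root, false, out);
    return out;
}
```

With std::unique_ptr the discarded branch is freed automatically when its parent is destroyed, which removes some of the manual care needed with raw pointers.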

Is there a simpler solution to this common problem? Is there a way to use standard library containers in a more efficient way than my tree?

c++ parsing

Stephen May 23 '12 at 19:31

3 answers

kevin · Answer 1 · 2012-05-23T19:49:13+0000

You can look at NFAs and DFAs, that is, nondeterministic finite automata and deterministic finite automata. These two approaches are the most common, and often very efficient, ways to write parsers.

In fact, there is no need to store data in node; doing so bloats the structure and wastes memory. A better way is to use a variable (e.g. int state = 0) to keep track of the current state of the parse. Based on the current state and the input, your algorithm then changes state. States normally move forward, but you can have your algorithm return to a previous state if a certain condition is not met (known as "backtracking").

E.g., if "ab" and "ac" are two valid inputs, then when analyzing "ac" the algorithm may look like this:

    char is 'a'             ==> go to state.checkB
    char is not 'b'         ==> go back to state.checkA
    checkB was already done ==> go to state.checkC
    char is 'c'             ==> DoSomething();
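The state-variable idea can be sketched in C++ roughly as follows. This is only an illustration: it is written as a deterministic automaton, so it needs no backtracking, and the State enum and the functions step and accepts are hypothetical names that do not come from the question.

```cpp
#include <string>

// States of a hypothetical automaton that accepts exactly "ab" or "ac".
enum class State { Start, SawA, Accept, Reject };

// One transition: the next state depends only on the current state and the input char.
State step(State s, char c) {
    switch (s) {
        case State::Start: return c == 'a' ? State::SawA : State::Reject;
        case State::SawA:  return (c == 'b' || c == 'c') ? State::Accept : State::Reject;
        default:           return State::Reject;  // any input after Accept or Reject fails
    }
}

// Run the automaton over the whole input, keeping only the current state.
bool accepts(const std::string& input) {
    State s = State::Start;
    for (char c : input) s = step(s, c);
    return s == State::Accept;
}
```

Here accepts("ab") and accepts("ac") are true, while inputs such as "ad" or "abc" are rejected. No per-node data is stored; the only memory used is the single state variable.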

Explaining everything fully would take pages of text and diagrams, but hopefully this gives you an idea of where to look next.

user1408985 · Answer 2 · 2012-05-23T21:38:25+0000

I assume that by "text parser" you mean that you are trying to condense words and phrases with the same meaning in order to simplify responding to commands.

In that case, as in the programs for old text adventures, a simple left-to-right parser driven by a lookup table of rules will work here.

Unless I misunderstand your problem domain, your solution seems horribly over-engineered.
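A table-driven left-to-right pass might look like the sketch below. It assumes that each rule maps one whole input line to one replacement line, which covers only the direct replacements from the question and not the look-ahead rules; RuleTable and apply_rules are hypothetical names.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical rule table: an input line maps to its replacement line.
using RuleTable = std::unordered_map<std::string, std::string>;

// Single left-to-right pass: replace each line that has a rule, keep the rest,
// and preserve the original ordering.
std::vector<std::string> apply_rules(const std::vector<std::string>& lines,
                                     const RuleTable& rules) {
    std::vector<std::string> out;
    out.reserve(lines.size());
    for (const std::string& line : lines) {
        auto it = rules.find(line);
        out.push_back(it != rules.end() ? it->second : line);
    }
    return out;
}
```

For example, with the single rule "tit" -> "tat", the input lines "foo", "tit", "bar" become "foo", "tat", "bar".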

BeReal82 · Answer 3 · 2012-06-14T19:50:26+0000

Looks like you should try regular expressions. Here is a link to a discussion about choosing a library: C++ RegEx Library Choice. Boost is popular.
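As a rough illustration of a regex-based replacement, here is a sketch that uses the C++11 standard library's <regex> header instead of Boost, which is a substitution on my part. The function name replace_tit and the "tit"/"tat" pattern are only examples borrowed from the question's wording.

```cpp
#include <regex>
#include <string>

// Replace every whole-word occurrence of "tit" in a line with "tat".
// \b is the ECMAScript word-boundary assertion, so "title" is left alone.
std::string replace_tit(const std::string& line) {
    static const std::regex pattern("\\btit\\b");
    return std::regex_replace(line, pattern, "tat");
}
```

Regular expressions handle per-line replacements well, but a rule that depends on a later line would still need some state kept across lines.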

Also, have you considered using another language to solve the problem? Python has an excellent collection of useful libraries, including one for regular expressions (import re). If this is in your wheelhouse, you may find it easier than a C++ solution.

Finally, consider using an already-defined text format instead of a custom one for the input file. XML is a good choice. It may be easier to embed rules in an XML tree. In C++ you can use the Expat XML parser (in Python it would be xml.etree.ElementTree).
