I am trying tokenize a string. I have a table of available tokens ordered in form three. Each token knows that it has children. A simple token table will look like this:
pattern value has_children -------- ------ -------- s s-val 1 stack stack-val 0 over over-val 1 overflow overflow-val 0
In this table, stack is a child of s and overflow is a child of over . In practice, this table will have 5,000+ records sorted this way.
Now, given the stackover line, it should output stack-valover-val . The algorithm is greedy, and he will try to find the longest match ever.
To do this, I will start reading each character from the input, look for a match, if a match is found, and the token has children, find the match again by including the next character. Do this until we find the longest match. If no match is found, try matching by including the next character until we reach the end of the line or a successful match.
If we reach the end of the line without matching, print the character ? and remove the first character from the input. Repeat the whole process with the remaining characters.
This algorithm works, but rolling back and repeating all possible input combinations makes it slow and complicated.
I am wondering if there is a better way to solve this? Any help would be greatly appreciated.
c algorithm
Navaneeth KN
source share