I have a question about DFA minimization. I used the well-known methods to convert a regular expression to an NFA, and then built a DFA from it using the subset construction (the goto/closure algorithm). Now the question is: how do I minimize it? I watched a lecture about this here: http://www.youtube.com/watch?v=T9Z66NF5YRk , and I still don't understand. What is DFA minimization? Is it just merging IDENTICAL states (states that go to the same states on the same characters), or is it something more?
So, I started with the following grammar:
%digit = '0'..'9'
%letter = 'a'..'z' | 'A'..'Z'
%exponent = ("e" | "E") ("+" | "-")? digit+
T_INT = digit+
T_FLOAT = T_INT exponent
T_IDENTIFIER = (letter | "$" | "_") (letter | "$" | "_" | digit)*
and ultimately with the following DFA (represented as JSON):
{ "START": [{ "type": "range", "from": 36, "to": 36, "shift": "1" }, { "type": "range", "from": 48, "to": 57, "shift": "2" }, { "type": "range", "from": 65, "to": 90, "shift": "1" }, { "type": "range", "from": 95, "to": 95, "shift": "1" }, { "type": "range", "from": 97, "to": 122, "shift": "1" }], "1": [{ "type": "range", "from": 36, "to": 36, "shift": "1" }, { "type": "range", "from": 48, "to": 57, "shift": "1" }, { "type": "range", "from": 65, "to": 90, "shift": "1" }, { "type": "range", "from": 95, "to": 95, "shift": "1" }, { "type": "range", "from": 97, "to": 122, "shift": "1" }, { "shift": ["t_identifier"] }], "2": [{ "type": "range", "from": 48, "to": 57, "shift": "2" }, { "type": "range", "from": 69, "to": 69, "shift": "3" }, { "type": "range", "from": 101, "to": 101, "shift": "3" }, { "shift": ["t_int"] }], "3": [{ "type": "range", "from": 43, "to": 43, "shift": "5" }, { "type": "range", "from": 45, "to": 45, "shift": "5" }, { "type": "range", "from": 48, "to": 57, "shift": "4" }], "4": [{ "type": "range", "from": 48, "to": 57, "shift": "4" }, { "shift": ["t_float"] }], "5": [{ "type": "range", "from": 48, "to": 57, "shift": "4" }] }
So how do I minimize it?
UPDATE:
OK, here is my algorithm. Given the following DFA:
{
  0: [{ from: 97, to: 97, shift: 1 }],
  1: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 2 }],
  2: [{ from: 98, to: 98, shift: 4 }],
  3: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 4 }],
  4: [{ from: 98, to: 98, shift: 4 }]
}
This is what I do to minimize it:
For each state (numbered 0, 1, 2, 3, 4 in my example), compute a unique hash that identifies that state's outgoing transitions (for example, for state 0 it is: from=97, to=97, shift=1; for state 1 it is: from=97, to=97, shift=3 & from=98, to=98, shift=2; and so on).
Compare the computed hashes, and if we find two that are the same, merge those states. In my example, the hash of state 2 is from=98, to=98, shift=4, and the hash of state 4 is also from=98, to=98, shift=4; they are the same, so we can combine them into a new state. After that, the DFA looks like this:
{
  0: [{ from: 97, to: 97, shift: 1 }],
  1: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }],
  3: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }],
  5: [{ from: 98, to: 98, shift: 5 }]
}
Repeat this until nothing changes any more; so the next step (merging states 1 and 3) transforms the DFA into the following form:
{
  0: [{ from: 97, to: 97, shift: 6 }],
  6: [{ from: 97, to: 97, shift: 6 }, { from: 98, to: 98, shift: 5 }],
  5: [{ from: 98, to: 98, shift: 5 }]
}
There are no more identical states, so we are done.
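The merge loop described above can be sketched roughly like this (my own reconstruction, using the simplified edge format of this example; `signature` and `mergeIdentical` are names I made up, and unlike my walkthrough it keeps the first of two twins rather than inventing a fresh state number):

```javascript
// Hash a state's outgoing edges into a canonical string.
function signature(edges) {
  return edges
    .map(e => `${e.from}-${e.to}>${e.shift}`)
    .sort()
    .join("&");
}

// Merge states with identical outgoing edges, repeating until no two
// states share a signature. This is the naive "merge identical states"
// loop described above, not Hopcroft's partition refinement.
function mergeIdentical(dfa) {
  let changed = true;
  while (changed) {
    changed = false;
    const seen = {}; // signature -> state name
    for (const state of Object.keys(dfa)) {
      const sig = signature(dfa[state]);
      if (sig in seen) {
        const keep = seen[sig];
        // drop the duplicate and redirect every edge pointing at it
        // (a real implementation must never delete the start state)
        delete dfa[state];
        for (const s of Object.keys(dfa))
          for (const e of dfa[s])
            if (e.shift === state) e.shift = keep;
        changed = true;
        break; // signatures are stale now; restart the pass
      }
      seen[sig] = state;
    }
  }
  return dfa;
}
```

On the five-state example above this merges 4 into 2, which then makes 3 identical to 1, leaving three states, the same shape as the hand-worked result.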
SECOND UPDATE:
So, given the following regular expression: 'a' ('ce')* ('d' | 'xa' | 'AFe')+ | 'b' ('ce')* ('d' | 'xa' | 'AFe')+ 'ce'+ , I have the following DFA (START → the start state, ["accept"] → a transition to the accepting state):
{ "START": [{ "type": "range", "from": 98, "to": 98, "shift": "1.2" }, { "type": "range", "from": 97, "to": 97, "shift": "17.18" }], "1.2": [{ "type": "range", "from": 120, "to": 120, "shift": "10" }, { "type": "range", "from": 100, "to": 100, "shift": "6.7" }, { "type": "range", "from": 65, "to": 65, "shift": "8" }, { "type": "range", "from": 99, "to": 99, "shift": "4" }], "10": [{ "type": "range", "from": 97, "to": 97, "shift": "6.7" }], "6.7": [{ "type": "range", "from": 99, "to": 99, "shift": "15" }, { "type": "range", "from": 120, "to": 120, "shift": "13" }, { "type": "range", "from": 100, "to": 100, "shift": "6.7" }, { "type": "range", "from": 65, "to": 65, "shift": "11" }], "15": [{ "type": "range", "from": 101, "to": 101, "shift": "14.accept" }], "14.accept": [{ "type": "range", "from": 99, "to": 99, "shift": "16" }, { "shift": ["accept"] }], "16": [{ "type": "range", "from": 101, "to": 101, "shift": "14.accept" }], "13": [{ "type": "range", "from": 97, "to": 97, "shift": "6.7" }], "11": [{ "type": "range", "from": 70, "to": 70, "shift": "12" }], "12": [{ "type": "range", "from": 101, "to": 101, "shift": "6.7" }], "8": [{ "type": "range", "from": 70, "to": 70, "shift": "9" }], "9": [{ "type": "range", "from": 101, "to": 101, "shift": "6.7" }], "4": [{ "type": "range", "from": 101, "to": 101, "shift": "2.3" }], "2.3": [{ "type": "range", "from": 120, "to": 120, "shift": "10" }, { "type": "range", "from": 100, "to": 100, "shift": "6.7" }, { "type": "range", "from": 65, "to": 65, "shift": "8" }, { "type": "range", "from": 99, "to": 99, "shift": "5" }], "5": [{ "type": "range", "from": 101, "to": 101, "shift": "2.3" }], "17.18": [{ "type": "range", "from": 120, "to": 120, "shift": "25" }, { "type": "range", "from": 100, "to": 100, "shift": "22.accept" }, { "type": "range", "from": 65, "to": 65, "shift": "23" }, { "type": "range", "from": 99, "to": 99, "shift": "20" }], "25": [{ "type": "range", "from": 97, "to": 97, "shift": "22.accept" }], "22.accept": [{ 
"type": "range", "from": 120, "to": 120, "shift": "28" }, { "type": "range", "from": 100, "to": 100, "shift": "22.accept" }, { "type": "range", "from": 65, "to": 65, "shift": "26" }, { "shift": ["accept"] }], "28": [{ "type": "range", "from": 97, "to": 97, "shift": "22.accept" }], "26": [{ "type": "range", "from": 70, "to": 70, "shift": "27" }], "27": [{ "type": "range", "from": 101, "to": 101, "shift": "22.accept" }], "23": [{ "type": "range", "from": 70, "to": 70, "shift": "24" }], "24": [{ "type": "range", "from": 101, "to": 101, "shift": "22.accept" }], "20": [{ "type": "range", "from": 101, "to": 101, "shift": "18.19" }], "18.19": [{ "type": "range", "from": 120, "to": 120, "shift": "25" }, { "type": "range", "from": 100, "to": 100, "shift": "22.accept" }, { "type": "range", "from": 65, "to": 65, "shift": "23" }, { "type": "range", "from": 99, "to": 99, "shift": "21" }], "21": [{ "type": "range", "from": 101, "to": 101, "shift": "18.19" }] }
Now, how do I minimize this one? If I follow the classic Hopcroft algorithm, with all the table construction, finding indistinguishable states, combining them together and so on, I get a DFA that contains 15 states (you can check this with this tool: http://regexvisualizer.apphb.com/ using the regular expression a(ce)*(d|xa|AFe)+|b(ce)*(d|xa|AFe)+ce+ ). Here's what the DFA looks like after minimization with the Hopcroft algorithm:

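For reference, the classical approach can be sketched as partition refinement (this is Moore's formulation of the idea that Hopcroft's algorithm speeds up; `refine`, `delta`, and the explicit transition map are my own illustrative names, not my range format):

```javascript
// Minimization by partition refinement: start by splitting states into
// accepting / non-accepting blocks, then split any block whose members
// disagree on which block each input symbol leads to, until stable.
function refine(states, alphabet, delta, accepting) {
  let block = {}; // state -> block id
  for (const s of states) block[s] = accepting.has(s) ? "A" : "N";
  for (;;) {
    // key each state by its own block plus the blocks its edges reach
    const key = {};
    for (const s of states)
      key[s] = block[s] + "|" +
        alphabet.map(a => block[delta[s][a]] ?? "-").join(",");
    const ids = {};
    let next = 0;
    const newBlock = {};
    for (const s of states) {
      if (!(key[s] in ids)) ids[key[s]] = next++;
      newBlock[s] = String(ids[key[s]]);
    }
    // refinement only ever splits blocks, so an unchanged block count
    // means the partition is stable
    if (new Set(Object.values(newBlock)).size ===
        new Set(Object.values(block)).size) return newBlock;
    block = newBlock;
  }
}
```

States that end up in the same block are indistinguishable and can be merged; crucially, the initial accepting/non-accepting split means two states are never merged just because their outgoing edges look alike.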
The algorithm I came up with, after "rethinking" the Hopcroft algorithm, builds a DFA that is smaller than the one you see above (sorry for the image quality, I had to redraw it step by step to understand why it is smaller):

And this is how it works: the decision about "state equivalence" is based on the result of a hash function that, for a given state (for example, START), builds the short strings the DFA can produce starting from that state. Given the DFA above and the START state, we can build the following strings: 98->120, 98->100, 98->65, 98->99, 97->120, 97->100, 97->65, 97->99; let that be the result of the hash function for the START state. If we run this function for every state in the DFA, we will see that for some states it gives the same result: ("1.2", "6.7", "2.3"), ("10", "13"), ("15", "16"), ("11", "8", "26", "23"), ("12", "9", "4", "5", "20", "21"), ("17.18", "18.19"), ("25", "28"), ("27", "24"). So all we need to do is combine the states in each group together.
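The two-step hash described above could be reconstructed like this (my own sketch, using the simplified edge format and only `from`, since every range in this example covers a single character; note it only looks two transitions ahead, so it cannot distinguish states that differ only on longer strings, which a full equivalence check would keep apart):

```javascript
// For a state, collect every "firstSymbol->secondSymbol" string
// reachable in two transitions (plus "accept" markers), and use the
// sorted set as the state's hash.
function twoStepHash(dfa, state) {
  const strings = new Set();
  for (const e of dfa[state]) {
    if (e.from === undefined) { strings.add("accept"); continue; }
    for (const e2 of dfa[e.shift] || []) {
      strings.add(
        `${e.from}->` + (e2.from === undefined ? "accept" : e2.from)
      );
    }
  }
  return [...strings].sort().join(",");
}
```

On the DFA above, states "10" and "25" both hash to the strings starting with 97, even though "10" eventually needs a trailing 'ce'+ to accept and "25" does not, which is exactly the kind of difference a two-symbol lookahead cannot see.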
I can see that I must be wrong somewhere, but I don't understand: what is wrong with the minimized DFA that my algorithm produces?