DFA minimization

Question

DFA minimization

I have a question about minimizing DFA. Therefore, I used very well-known methods to convert a regular expression to NFA, and then built a DFA from it using the goto / clos algorithm. Now the question is how to minimize it? I watched the lectures about this here: http://www.youtube.com/watch?v=T9Z66NF5YRk , and I still can’t understand. What is DFA minimization? Is it just a merger of IDENTICAL states (states that go to the same states by the same characters) or something else?

So, I started with the following grammar:

%digit = '0'..'9' %letter = 'a'..'z' | 'A'..'Z' %exponent = ("e" | "E") ("+" | "-")? digit+ T_INT = digit+ T_FLOAT = T_INT exponent T_IDENTIFIER = (letter | "$" | "_") (letter | "$" | "_" | digit)*

and ultimately with the following DFA (represented as JSON):

 { "START": [{ "type": "range", "from": 36, "to": 36, "shift": "1" }, { "type": "range", "from": 48, "to": 57, "shift": "2" }, { "type": "range", "from": 65, "to": 90, "shift": "1" }, { "type": "range", "from": 95, "to": 95, "shift": "1" }, { "type": "range", "from": 97, "to": 122, "shift": "1" }], "1": [{ "type": "range", "from": 36, "to": 36, "shift": "1" }, { "type": "range", "from": 48, "to": 57, "shift": "1" }, { "type": "range", "from": 65, "to": 90, "shift": "1" }, { "type": "range", "from": 95, "to": 95, "shift": "1" }, { "type": "range", "from": 97, "to": 122, "shift": "1" }, { "shift": ["t_identifier"] }], "2": [{ "type": "range", "from": 48, "to": 57, "shift": "2" }, { "type": "range", "from": 69, "to": 69, "shift": "3" }, { "type": "range", "from": 101, "to": 101, "shift": "3" }, { "shift": ["t_int"] }], "3": [{ "type": "range", "from": 43, "to": 43, "shift": "5" }, { "type": "range", "from": 45, "to": 45, "shift": "5" }, { "type": "range", "from": 48, "to": 57, "shift": "4" }], "4": [{ "type": "range", "from": 48, "to": 57, "shift": "4" }, { "shift": ["t_float"] }], "5": [{ "type": "range", "from": 48, "to": 57, "shift": "4" }] }

So how do I keep it to a minimum?

UPDATE:

Ok, here is my algorithm. Given the following DFAs:

 { 0: [{ from: 97, to: 97, shift: 1 }], 1: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 2 }], 2: [{ from: 98, to: 98, shift: 4 }], 3: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 4 }], 4: [{ from: 98, to: 98, shift: 4 }] }

This is what I do to minimize it:

For each state (numbered 0, 1, 2, 3, 4 in my example), a unique hash is obtained that identifies such a state (for example, for state 0 it will be: from = 97, to = 97, shift = 1, for state 1 it will be: from = 97, to = 97, shift = 3 & from = 98, to = 98, shift = 2, etc.)
Compare the received hashes, and if we find two of the same, then merge it. In my example, the hash of state 2 will be: from = 98 and shift = 4 & to = 98, and the hash of the hash of state 4 will be: from = 98 and shift = 4 & to = 98, they are the same, so we can combine them into a new state , after that, the DFA will look like this:
```
 { 0: [{ from: 97, to: 97, shift: 1 }], 1: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }], 3: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }], 5: [{ from: 98, to: 98, shift: 5 }] 
```
}

Continue this until we get any changes, so the next step (merging states 1 and 3) will transform the DFA into the following form:

 { 0: [{ from: 97, to: 97, shift: 6 }], 6: [{ from: 97, to: 97, shift: 6 }, { from: 98, to: 98, shift: 5 }], 5: [{ from: 98, to: 98, shift: 5 }]

}

No more identical states, it means that they were made.

SECOND UPDATE:

So, we give the following regular expression: 'a' ('ce') * ('d' | 'xa' | 'AFe') + | 'b' ('ce') * ('d' | 'aa' | 'AFe') + 'ce' + I have the following DFA (START → start state, ["accept"] → so to say go to receiving state):

 { "START": [{ "type": "range", "from": 98, "to": 98, "shift": "1.2" }, { "type": "range", "from": 97, "to": 97, "shift": "17.18" }], "1.2": [{ "type": "range", "from": 120, "to": 120, "shift": "10" }, { "type": "range", "from": 100, "to": 100, "shift": "6.7" }, { "type": "range", "from": 65, "to": 65, "shift": "8" }, { "type": "range", "from": 99, "to": 99, "shift": "4" }], "10": [{ "type": "range", "from": 97, "to": 97, "shift": "6.7" }], "6.7": [{ "type": "range", "from": 99, "to": 99, "shift": "15" }, { "type": "range", "from": 120, "to": 120, "shift": "13" }, { "type": "range", "from": 100, "to": 100, "shift": "6.7" }, { "type": "range", "from": 65, "to": 65, "shift": "11" }], "15": [{ "type": "range", "from": 101, "to": 101, "shift": "14.accept" }], "14.accept": [{ "type": "range", "from": 99, "to": 99, "shift": "16" }, { "shift": ["accept"] }], "16": [{ "type": "range", "from": 101, "to": 101, "shift": "14.accept" }], "13": [{ "type": "range", "from": 97, "to": 97, "shift": "6.7" }], "11": [{ "type": "range", "from": 70, "to": 70, "shift": "12" }], "12": [{ "type": "range", "from": 101, "to": 101, "shift": "6.7" }], "8": [{ "type": "range", "from": 70, "to": 70, "shift": "9" }], "9": [{ "type": "range", "from": 101, "to": 101, "shift": "6.7" }], "4": [{ "type": "range", "from": 101, "to": 101, "shift": "2.3" }], "2.3": [{ "type": "range", "from": 120, "to": 120, "shift": "10" }, { "type": "range", "from": 100, "to": 100, "shift": "6.7" }, { "type": "range", "from": 65, "to": 65, "shift": "8" }, { "type": "range", "from": 99, "to": 99, "shift": "5" }], "5": [{ "type": "range", "from": 101, "to": 101, "shift": "2.3" }], "17.18": [{ "type": "range", "from": 120, "to": 120, "shift": "25" }, { "type": "range", "from": 100, "to": 100, "shift": "22.accept" }, { "type": "range", "from": 65, "to": 65, "shift": "23" }, { "type": "range", "from": 99, "to": 99, "shift": "20" }], "25": [{ "type": "range", "from": 97, "to": 97, "shift": "22.accept" }], "22.accept": [{ "type": "range", "from": 120, "to": 120, "shift": "28" }, { "type": "range", "from": 100, "to": 100, "shift": "22.accept" }, { "type": "range", "from": 65, "to": 65, "shift": "26" }, { "shift": ["accept"] }], "28": [{ "type": "range", "from": 97, "to": 97, "shift": "22.accept" }], "26": [{ "type": "range", "from": 70, "to": 70, "shift": "27" }], "27": [{ "type": "range", "from": 101, "to": 101, "shift": "22.accept" }], "23": [{ "type": "range", "from": 70, "to": 70, "shift": "24" }], "24": [{ "type": "range", "from": 101, "to": 101, "shift": "22.accept" }], "20": [{ "type": "range", "from": 101, "to": 101, "shift": "18.19" }], "18.19": [{ "type": "range", "from": 120, "to": 120, "shift": "25" }, { "type": "range", "from": 100, "to": 100, "shift": "22.accept" }, { "type": "range", "from": 65, "to": 65, "shift": "23" }, { "type": "range", "from": 99, "to": 99, "shift": "21" }], "21": [{ "type": "range", "from": 101, "to": 101, "shift": "18.19" }] }

Is the story the same as minimizing it? If I follow the classic Hopcroft algorithm with all this table construction, defining indistinguishable states, combining them together and so on, then I'm going to get a DFA that contains 15 states (use this tool: http://regexvisualizer.apphb.com/ with this regular expression a (ce) (d | xa | AFe) + | b (ce) (d | xa | AFe) + ce + to check this). Here's what DFA looks like after minimization using the Hopcroft algorithm:

The algorithm I came across, after “rethinking” the Hopcroft algorithm, builds a DFA that is smaller than the one you see above (sorry for the image quality, I had to redraw it step by step to understand why it is smaller):

And this is how it works, the decision on “state equivalence” is based on the result of a hash function that sets the state (for example, “START”) builds short strings that can be built from DFA if we start from this state. Given the DFA above and the START state, we can build the following lines: 98-> 120, 98-> 100, 98-> 65, 98-> 99, 97-> 120, 97-> 100, 97-> 65, 97- > 99, so let it be the result of a hash function for the START state. If we run this function for each state in the DFA, we will see that for some states this function gives us the same results ("1,2", "6,7", "2,3" and "10", "13" and "15", "16" and "11", "8", "26", "23" and "12", "9", "4", "5", "20", "21" AND " 17.18 ", 18.19" and "25", "28" and "27", "24"), so all we need to do is to combine these states together.

I see that I’m wrong somewhere, but I don’t understand what happened to the minimized DFA created by my algorithm?

+4

regex dfa nfa lexer

Ruslan Jun 21 '12 at 5:56

source share

3 answers

Let your original DFA be called M1. Simply put, building a minimized DFA (let's call it M2) means converting it to a DFA that contains a minimum number of states. Thus, the number of states in M2 will be less than the number of states in M1. It is important to note here that M1 and M2 must be equivalent, which means that they must define the MOST common language. Building a minimized DFA involves not only finding identical states, but also the following:

Removing "unreachable" states:
This includes removing states that are not available from the initial DFA state for any input string.
Deleting dead or trap states:
This includes the removal of non-host states that end on themselves. They are also called TRAP states.
Removing "indistinguishable" states:
This includes deleting states that cannot be distinguished from each other for any input string.

There are also popular algorithms used to minimize DFA:

Maybe you should go through these algorithms!

+4

Deepti sharma Jun 21 '12 at 9:14

source share

Given that you have code to determine the NFA for the DFA, the Brzozovsky algorithm is the easiest solution to minimize. You will need to implement a function to modify the NFA, but it is quite simple. (flip transitions, start and start accepting states)

Once you have a function of definition and inverse function, minimizing Brzozowski is implemented as:

 minimize(nfa) = determinize(reverse(determinize(reverse(nfa))))

IMHO, this is a very elegant solution.

+1

Chrisblom Jun 26 '13 at 19:40

source share

Gunther · Accepted Answer · 2012-06-21T11:32:54+0000

Your proposed algorithm does not completely minimize it, because it does not detect complex structures that behave the same way. To understand this look at this DFA (Figure JFLAP ):

Minimization would combine q1 and q2, but the described algorithm failed.

In contrast, the Hopcroft algorithm would initially crash as follows:

  {q0, q1, q2}, {q3}

then split the first set, since q0 has no transition to q3:

  {q0}, {q1, q2}, {q3}

and does not split further, since q1 and q2 behave identically.

DFA minimization

More articles: