Work with grammatical ambiguity (parsing poker files)

I am currently working on a poker hand history parser as part of my bachelor project. I did some research in a couple of days and came across some good parser generators (of which I chose JavaCC, since the project itself will be encoded in Java).

Although the grammar of the hand history is fairly simple and straightforward, the problem of ambiguity arises due to the valid character set in the player’s alias.

Suppose we have a string in the following format:

Seat 5: myNickname (1500 in chips) 

The myNickname token can contain any character, as well as spaces. This means that both (1500 in chip and Seat 5: are valid nicknames, which ultimately leads to an ambiguity problem. There are no restrictions on the player’s alias except for the length (4-12 characters).

I need to parse and save some data along with the player’s alias (for example, the position of the place and the number of chips in this particular case), so my question is: what are my options here?

I would like to do this using JavaCC, something like this:

 SeatRecord seat() : { Token seatPos, nickname, chipStack; } { "Seat" seatPos=<INTEGER> ":" nickname=<NICKNAME> "(" chipStack=<INTEGER> "in chips)" { return new SeatRecord(seatPos.image, nickname.image, chipStack.image); } } 

What does not work now (due to the indicated problem)

I also searched for GLR parsers (which apparently handle ambiguous grammars), but they mostly seem abandoned or poorly documented, with the exception of Bison, but that doesn't support the GLR parser for Java and can be too complicated (aside from the problem ambiguities, the grammar itself is quite simple, as I mentioned)

Or should I stick with the tokenization of the string itself and use indexOf(), lastIndexOf() , etc. to analyze the data I need? I would go for it only if it was the only option left, since it would be too ugly IMHO, and I could skip some cases (which would lead to an incorrect analysis).

+4
source share
3 answers

If your input format is simple, as you specify, you can probably get away with a simple regex:

 ^Seat ([0-9]+): (.*) \(([0-9]+) in chips\)$ 

The NFA regex engine in this case solves your ambiguity, and parentheses are capture groups so that you can extract the information you are interested in.

+7
source

You have two solutions:

  • Add some name restrictions. I can’t remember any widely used system that would accept such nicknames. Just let them use the alphanumeric characters and the separators "_". You can also add keywords for the place, for example, so that such a word could not be a nickname.
  • You can also create a state machine for parsing based on your grammar. I think FSM can handle such an ambiguous grammar. After that, you can analyze everything you want.

In any case, I think there is a problem with the original design. Nicky should not allow such a set of names. In addition, why you can not use identifiers instead of names - names can be stored in the database.

+2
source

The grammar for your system may look like this (written as contextual grammar):

 S -> seating nickname chips seating -> "Seat " number ":" number -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" number -> number number nickname -> "a" | "b" | "c" ...... | "z" | ...."+" | "?" | number nickname -> nickname nickname chips -> "(" number "in chips)" 

Pay attention to the rule of the form:

 number -> number number 

This basically allows you to use infinite grammar. Note that “infinite grammar” does not mean that you encapsulate everything. The above line is basically equivalent to the regular expression (\d*) .

I find that typing a grammar into CFG and then converting it to regular grammar helps me most of the time. Learn more about how to do it here . Good luck

+2
source

All Articles