How to implement group patterns, character classes, negative character classes, etc. In the model for regular grammar?

Question

How to implement group patterns, character classes, negative character classes, etc. In the model for regular grammar?

TL DR: How does a computational model model grammatical derivatives, so that there is an indefinite number of products for the same left side?

I am working on a project dealing with formal language theory and am trying to write a class to create regular grammar objects that can be passed to a finite machine. My naive attempt was to create an API to add products for each allowed input. A stripped-down version of my attempt is as follows (based on a formal definition of a formal grammar G = (N, Σ, P, S) ):

 class ContextFreeGrammar: def __init__(self, variables, alphabet, production_rules, start_variable): self.variables = variables self.alphabet = alphabet self.production_rules = production_rules self.start_variable = start_variable def __repr__(self): return '{}({}, {}, {}, {})'.format( self.__class__.__name__, self.variables, self.alphabet, self.production_rules, self.start_variable ) class RegularGrammar(ContextFreeGrammar): _regular_expression_grammar = None # TODO @classmethod def from_regular_expression(cls, regular_expression): raise NotImplementedError()

I still have not reached the point of writing a state machine or a push machine.

The grammar for regex has no context, so I included my definition in WSN below:

 syntax = expression . expression = term "|" expression . expression = term . term = factor repetition term . term = factor term . term = . repetition = "*" . repetition = "+" . repetition = "?" . repetition = "{" nonnegative_integer "," nonnegative_integer "}" . repetition = "{" nonnegative_integer ",}" . repetition = "{," nonnegative_integer "}" . nonnegative_integer = nonzero_arabic_numeral arabic_numerals . nonnegative_integer = arabic_numeral . nonzero_arabic_numeral = "1" . nonzero_arabic_numeral = "2" . nonzero_arabic_numeral = "3" . nonzero_arabic_numeral = "4" . nonzero_arabic_numeral = "5" . nonzero_arabic_numeral = "6" . nonzero_arabic_numeral = "7" . nonzero_arabic_numeral = "8" . nonzero_arabic_numeral = "9" . arabic_numeral = nonzero_arabic_numeral . arabic_numeral = "0" . arabic_numerals = arabic_numeral . arabic_numerals = arabic_numeral arabic_numerals . factor = "(" expression ")" . factor = character_class . factor = character . escaped_character = "\\." . escaped_character = "\\(" . escaped_character = "\\)" . escaped_character = "\\+" . escaped_character = "\\*" . escaped_character = "\\?" . escaped_character = "\\[" . escaped_character = "\\]" . escaped_character = "\\\\" . escaped_character = "\\{" . escaped_character = "\\}" . escaped_character = "\\|" . character -> TODO ; character_class = TODO .

It is easy to see that I clearly break the alternations into separate pieces. I do this for ease of implementation. But I am fixated on how I should be engaged in personal lessons and the like. I wanted production_rules be a map from each left side to a set of each of its respective right sides. But now it seems unfeasible.

+6

python regex parsing context-free-grammar formal-languages

Tyler crompton Dec 01 '15 at 10:27

source share

1 answer

Enteleform · Answer 1 · 2016-03-14T13:04:22+0000

I do not quite understand your question, but from the comments it seems that you are trying to work in a predefined character set that excludes different Unicode and ASCII characters.

Here is a method I recently applied to work with similar limitations:

Character groups

Here is an example that implements the above definitions:

 global rx_Trim_FromAlphaNumeric rx_Trim_FromAlphaNumeric = \ "[" + rx_AlphaNumeric + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]*" global rx_StartsWithSymbol rx_StartsWithSymbol = \ "[^" + rx_AlphaNumeric + "]" + \ "[" + rx_Symbols + "]+" + \ "[" + rx_LineSpace + rx_Symbols + "]*" + \ "[" + rx_AlphaNumeric + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]*" global rx_StartsWithLetter rx_StartsWithLetter = \ "^[" + rx_Alphabetic + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]+" global rx_StartsWithNumber rx_StartsWithNumber = \ "^[" + rx_Numeric + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]+" global rx_WordSegments rx_WordSegments = \ "([" + rx_Symbols + "]+|" + \ "[" + rx_Numeric + "]+|" + \ "[" + rx_Alphabetic + "]+|" + \ "[" + rx_LineSpace + "]+)"

Note. . I prefer to avoid all characters, as certain characters, such as ^ , have contextual escaping requirements. If they always avoid, it is less likely that the problem will be met.

How to implement group patterns, character classes, negative character classes, etc. In the model for regular grammar?

Character groups

More articles: