How to implement group patterns, character classes, negative character classes, etc. In the model for regular grammar?

TL DR: How does a computational model model grammatical derivatives, so that there is an indefinite number of products for the same left side?


I am working on a project dealing with formal language theory and am trying to write a class to create regular grammar objects that can be passed to a finite machine. My naive attempt was to create an API to add products for each allowed input. A stripped-down version of my attempt is as follows (based on a formal definition of a formal grammar G = (N, Σ, P, S) ):

 class ContextFreeGrammar: def __init__(self, variables, alphabet, production_rules, start_variable): self.variables = variables self.alphabet = alphabet self.production_rules = production_rules self.start_variable = start_variable def __repr__(self): return '{}({}, {}, {}, {})'.format( self.__class__.__name__, self.variables, self.alphabet, self.production_rules, self.start_variable ) class RegularGrammar(ContextFreeGrammar): _regular_expression_grammar = None # TODO @classmethod def from_regular_expression(cls, regular_expression): raise NotImplementedError() 

I still have not reached the point of writing a state machine or a push machine.

The grammar for regex has no context, so I included my definition in WSN below:

 syntax = expression . expression = term "|" expression . expression = term . term = factor repetition term . term = factor term . term = . repetition = "*" . repetition = "+" . repetition = "?" . repetition = "{" nonnegative_integer "," nonnegative_integer "}" . repetition = "{" nonnegative_integer ",}" . repetition = "{," nonnegative_integer "}" . nonnegative_integer = nonzero_arabic_numeral arabic_numerals . nonnegative_integer = arabic_numeral . nonzero_arabic_numeral = "1" . nonzero_arabic_numeral = "2" . nonzero_arabic_numeral = "3" . nonzero_arabic_numeral = "4" . nonzero_arabic_numeral = "5" . nonzero_arabic_numeral = "6" . nonzero_arabic_numeral = "7" . nonzero_arabic_numeral = "8" . nonzero_arabic_numeral = "9" . arabic_numeral = nonzero_arabic_numeral . arabic_numeral = "0" . arabic_numerals = arabic_numeral . arabic_numerals = arabic_numeral arabic_numerals . factor = "(" expression ")" . factor = character_class . factor = character . escaped_character = "\\." . escaped_character = "\\(" . escaped_character = "\\)" . escaped_character = "\\+" . escaped_character = "\\*" . escaped_character = "\\?" . escaped_character = "\\[" . escaped_character = "\\]" . escaped_character = "\\\\" . escaped_character = "\\{" . escaped_character = "\\}" . escaped_character = "\\|" . character -> TODO ; character_class = TODO . 

It is easy to see that I clearly break the alternations into separate pieces. I do this for ease of implementation. But I am fixated on how I should be engaged in personal lessons and the like. I wanted production_rules be a map from each left side to a set of each of its respective right sides. But now it seems unfeasible.

+6
source share
1 answer

I do not quite understand your question, but from the comments it seems that you are trying to work in a predefined character set that excludes different Unicode and ASCII characters.

Here is a method I recently applied to work with similar limitations:

Character groups

Here is an example that implements the above definitions:

 global rx_Trim_FromAlphaNumeric rx_Trim_FromAlphaNumeric = \ "[" + rx_AlphaNumeric + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]*" global rx_StartsWithSymbol rx_StartsWithSymbol = \ "[^" + rx_AlphaNumeric + "]" + \ "[" + rx_Symbols + "]+" + \ "[" + rx_LineSpace + rx_Symbols + "]*" + \ "[" + rx_AlphaNumeric + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]*" global rx_StartsWithLetter rx_StartsWithLetter = \ "^[" + rx_Alphabetic + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]+" global rx_StartsWithNumber rx_StartsWithNumber = \ "^[" + rx_Numeric + "]+" + \ "[" + rx_ValidCharacters_WithLineSpace + "]+" global rx_WordSegments rx_WordSegments = \ "([" + rx_Symbols + "]+|" + \ "[" + rx_Numeric + "]+|" + \ "[" + rx_Alphabetic + "]+|" + \ "[" + rx_LineSpace + "]+)" 

Note. . I prefer to avoid all characters, as certain characters, such as ^ , have contextual escaping requirements. If they always avoid, it is less likely that the problem will be met.

0
source

All Articles