What are the types of strings generated (a * + b *)

Besides strings of any number a and b, for example aa .. or bb .., will the regular expression (a * + b *) contain a string of type

ab

or any line ending with b?

Is (a * + b *) the same as (a * b *)?

I am a little confused about the lines generated by the regular expression (a * + b *), and would really appreciate it if someone could help.

+3
compiler-construction regex computation-theory
source share
2 answers

If you are not working with a regular expression language that explicitly classifies *+ as a special token that either has special meaning or is reserved for future expansion (and creates certain behavior at the present time, or a syntax error), the natural analysis of a*+ is to what it means (a*)+ : postfix + applies to the expression a* .

If this interpretation is applicable, we may notice that (a*)+ is simply equivalent to a* . Therefore, a*+b* coincides with a*b* .

First, by definition, R+ means RR* . Match one R and then zero or more of them. Therefore, we can rewrite (a*)+ as (a*)(a*)* .

Secondly, * is idempotent, therefore (a*)* is simply (a*) . If we match “zero or more a ”, zero or more times, nothing changes; the net effect is zero or greater than a . Proof: R* denotes this infinite extension: (|R|RR|RRR|RRRR|RRRRR|...) : nothing matches or does not correspond to one R , or corresponds to two R , therefore ... (a*)* dentes this extension: (|a*|a*a*|a*a*a*|...) . These internal a* -s, in turn, denote separate decompositions of the second level: (|(|a|aa|aaa|...|)|(|a|aa|aaa|...)(a|a|aaa|...))|...) . By associative property of a branch | we can smooth out a structure like (a|(b|c)) into (a|b|c) , and when we do this with decomposition, we note that there are many identical terms: empty regular expression () , single a , double aa etc. They all come down to one copy, because (|||) equivalent to () , and (a|a|a|a|...) equivalent only to (a) , etc. That is, when we sort the terms, increasing the length, and squire several identical terms into only one copy, we get (|a|aa|aaa|aaaa|...) , which is recognized as an extension only a* . Thus, (a*)* a* .

Finally, (a*)(a*) simply means a* . Evidence. . Similarly to the previous one, we turn to the branches: (|a|aa|aaa|...)(|a|aa|aaa|...) . Further, we note that the notation of branch expressions is equivalent to the Cartesian set of products. That is, (a|b|c|..)(i|j|k|...) means exactly: (ai|aj|ik|...|bi|bj|bk|...|ci|cj|ck|...|...) . When we apply this product to (|a|aa|aaa|...)(|a|aa|aaa|...) , we get a range of terms, which, being ordered in increasing order and subject to deduplication, reduces to (|a|aa|aaa|aaaa|...) , and it's just a* .

+3
source share

I think it’s useful here to look at the formal definition of regular expressions, i.e. look for every regular expression e that it expresses in the language of L (e).

So, let's just start:

(1) What about the regular expression a (letter only)? His tongue

  L(a) := {a}, 

just one word / character "a".

(2) For the regular expression e1 + e2, where e1 and e2 are the regular expressions themselves,

 L(e1 + e2) := L(e1) UL(e2). 

So, for example, if a and b are symbols, L (a + b) = {a, b}.

(3) For the regular expression e1 e2 (concatenation), where e1 and e2 are the regular expressions themselves,

 L(e1 e2) := all words w such that we can write w = w_1w_2 with w_1 in L(e1) and w_2 in L(e2)". 

(4) What about the regular expression * e **, where e can be the regular expression itself? Intuitively, a word is in L (e *) if it has the form w_1 w_2w_3w_4 ... w_n, with w_i in L (e) for each i. So,

 L(e*) := all words w such that we can write w = w_1 w_2 .. w_n for an >= 0 with all w_i in L(e) (for i = 1, 2, ..., n) 

So, what about L ((a * + b *))?

 L((a* + b*)) (according to rule 2) = L(a*) UL(b*) (according to rule 4/1) = {eps, a, aa, aaa, aaaa, ....} U {eps, b, bb, bbb, bbbb} = all strings that have either only a OR only b in it (including eps, the so-called empty word) 

Similarly for (a * b *):

  L((a* b*)) (according to rule 3) = all words w = w_1 w_2 with w_1 in L(a*) and w_2 in L(b*) = {eps eps, eps b, a eps, ab, aa eps, aab, ...} = {eps, b, a, ab, aa, aab, aabb, ... } = all strings that first have zero or more a's, then zero or more b's. 

To begin with, I think this helps to “deconstruct” the regular expression, as we did above, since regular expressions can also be considered as trees, as well as the more well-known arithmetic expressions, for example:

  + / \ * * | | ab 
+1
source share

All Articles