Extension [optionals], groupings and | or operator in the text

I am trying to expand sentences that include [ ] to indicate options, ( ) to specify groupings and | to indicate the or operator and list all the possibilities. For example:

"Hey [there] you [hood]." should return four sentences:

 Hey there you hood. Hey there you. Hey you hood. Hey you. 

The final goal will look like this:

Input: "(His|Her) dog was [very|extremely] confused."

 Output: His dog was very confused. His dog was extremely confused. His dog was confused. Her dog was very confused. Her dog was extremely confused. Her dog was confused. 

I do this using regular expression matching and recursion. I searched for both CPAN and SO under the phrases:

Text extension
extended offers
expanding conditions
extension options
expanding groupings

no luck.

Thanks.


I pretty much edited this question to better reflect its evolution and removed large portions that became obsolete as the question developed. The question posed above is that most of the answers below are trying to solve.

My current status is as follows:

After struggling with the problem above during the day, I have two solutions that are very close to what I want. One of them is mine, and the second is PLT. However, I decided to try a fundamentally different approach.

Using regular expressions and manually parsing these sentences seems like a very ugly way to do something. So I decided instead to write a grammar for my β€œlanguage” and use a parser to parse it.

This gives me an extra layer of abstraction and avoids the following scenario described by Damian Conway in Perl Best Practices: [about regular expressions]

cut and paste and change-a-bit-and-oh-now-it-does-not-work-on-all-so-modification let's get this a little more, and watch if not-that-helps-not-he-didn't- but-we-entrusted-now it can be so if we, with a change of what-bit instead of-hmmm-that's-closer, but still not-Perfectly-right-possible-if-I did the third, repetition-non-greedy- instead of-oops-now this-counter-not-match-for-all-possible-I-should-only-post-it-to-PerlMonks.org-see if-they-know-how wrong there

It also makes it easier if the grammar of these expressions needs to change, and I need to subsequently support other constructs.


Last update:

I solved my problem with open source tools . This will decrypt the JSGF version of my input and create a stateful converter. From there, you can go through FST to create all possible results.

+4
source share
3 answers

Ok, another complete revision of the answer. This one will work as intended. :) Now it also extends nested parsers. The new line is still a divider, but I added a way to quickly change it to something more complex if the need arises.

Basically, I started by replacing the brackets with parens + pipe, since [word ] and (|word ) equivalent.

Then I extracted all the encapsulating parsers, for example. like (you |my friend) and (you |my (|friendly ) friend ) . Then I expanded the nested parses into regular parsers, for example. (you |my (|friendly ) friend ) been replaced by (you |my friendly friend |my friend ) .

Using this word, words can be processed using the original subroutine.

It remains to pass testing on more complex extensions, but it works fine during my testing.

Here is the code:

 use strict; use warnings; sub addwords { my ($aref, @words) = @_; my @total; for my $start (@$aref) { for my $add (@words) { push @total, $start . $add; } } return @total; } sub expand_words { my $str = shift; my @sentences = (''); for my $word (word_split($str)) { if ($word =~ /^([(])([^)]+)[)]$/) { my @options = split /\|/, $2; push @options, '' if ($1 eq '['); @sentences = addwords(\@sentences, @options); } else { @sentences = addwords(\@sentences, $word); } } return @sentences; } sub fix_parens { my $str = shift; $str =~ s/\[/(|/g; $str =~ s/\]/)/g; return $str; } sub fix_nested { my @array = @_; my @return; for (my $i=0; $i <= $#array; ) { my $inc = 1; my ($co, $cc); do { $co = () = $array[$i] =~ /\(/g; $cc = () = $array[$i] =~ /\)/g; if ( $co > $cc ) { $array[$i] .= $array[$i + $inc++]; } } while ( $co > $cc ); push @return, expand_nest($array[$i]); $i += $inc; } return @return; } sub expand_nest { my $str = shift; my $co = () = $str =~ /\(/g; return $str unless ($co > 1); while ($str =~ /([^|(]+\([^)]+\)[^|)]+)/) { my $match = $1; my @match = expand_words($match); my $line = join '|', @match; $match =~ s/([()|])/"\\" . $1/ge; $str =~ s/$match/$line/ or die $!; } return $str; } sub word_split { my $str = shift; my $delimeter = "\n"; $str = fix_parens($str); $str =~ s/([[(])/$delimeter$1/g; $str =~ s/([])])/$1$delimeter/g; my @tot = split /$delimeter/, $str; @tot = fix_nested(@tot); return @tot; } my $str = "Hey [there ](you|my [friendly ]friend) where my [red|blue]berry?"; my @sentences = expand_words($str); print "$_\n" for (@sentences); print scalar @sentences . " sentences\n"; 

It will output:

 Hey you where my berry? Hey you where my redberry? Hey you where my blueberry? Hey my friend where my berry? Hey my friend where my redberry? Hey my friend where my blueberry? Hey my friendly friend where my berry? Hey my friendly friend where my redberry? Hey my friendly friend where my blueberry? Hey there you where my berry? Hey there you where my redberry? Hey there you where my blueberry? Hey there my friend where my berry? Hey there my friend where my redberry? Hey there my friend where my blueberry? Hey there my friendly friend where my berry? Hey there my friendly friend where my redberry? Hey there my friendly friend where my blueberry? 18 sentences 
+1
source

Data :: Create . I found this when looking for a combination, which is the mathematical term of what you are doing, with your sets of words.

+1
source

Here's a pretty simple solution if you go through some of the ugly regular expressions due to collisions between your syntax and the regexp syntax. It allows you to use the syntax [] and (), which is actually very similar, [foo] is the same as (foo |).

The basis is to replace each rotation with marker # 0, # 1, # 2 ... while saving them in an array. then replace the last marker by creating a few phrases, and then replace the next-last marker in each of these phrases ... until all the markers are replaced. Attentive readers of the higher order Perl will surely find a more elegant way to do this.

 #!/usr/bin/perl use strict; use warnings; while( my $phrase=<DATA>) { my $original= $phrase; $phrase=~s{\[([^\]]*)\]}{($1| )}g; # replace [c|d] by (c|d| ) my $alts=[]; my $i=0; while( $phrase=~ s{\(([^)]*)\)}{#$i}) # replace (a|b) ... (c|d| ) by #0 ... #1 { push @$alts, [ split /\|/, $1 ]; $i++; # store [ ['a', 'b'], [ 'c', 'd', ' '] ] } my $expanded=[$phrase]; # seed the expanded list with the phrase while( @$alts) { expand( $alts, $expanded); } # expand each alternation, until none left print "$original - ", join( " - ", @$expanded), "\n\n"; } exit; # expand the last #i of the phrase in all the phrases in $expanded sub expand { my( $alts, $expanded) =@ _; my @these_alts= @{pop(@$alts)}; # the last alternations my $i= @$alts; # the corresponding index in the phrases @$expanded= map { my $ph= $_; map { my $ph_e= $ph; $ph_e=~ s{#$i}{$_}; # replace the marker #i by one option $ph_e=~ s{ +}{ }; # fix double spaces $ph_e; } @these_alts # for all options } @$expanded # for all phrases stored so far } __DATA__ (His|Her) dog was [very|extremely 
+1
source

All Articles