So, my Perl script basically takes a string and then tries to clean it up by running multiple search-and-replace operations on it, for example:
$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\)\}\]]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g;
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g;
As you can see, I am dealing with nasty HTML and have to beat it into submission.
I am hoping there is a simpler, more aesthetically pleasing way to do this. I have about 50 lines that look just like the ones above.
I solved one version of this problem using a hash where the key is a comment and the value is a regular expression, for example:
%rxcheck = (
'time of day'=>'\d+:\d+',
'starts with capital letters then a capital word'=>'^([A-Z]+\s)+[A-Z][a-z]',
'ends with a single capital letter'=>'\b[A-Z]\.'
);
And here is how I use it:
foreach my $key (keys %rxcheck) {
if($snippet =~ /$rxcheck{ $key }/g){ blah blah }
}
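For reference, here is a minimal sketch of the same lookup with the patterns precompiled using qr// and the keys visited in sorted order, so the result does not depend on hash ordering. The sub name matching_checks and the sample input are only illustrative assumptions, not part of my real script.

```perl
use strict;
use warnings;

# Check table: description => precompiled pattern (qr//).
my %rxcheck = (
    'time of day'                       => qr/\d+:\d+/,
    'ends with a single capital letter' => qr/\b[A-Z]\./,
);

# Hypothetical helper: return the description of every check that
# matches the snippet, in sorted key order.
sub matching_checks {
    my ($snippet) = @_;
    return grep { $snippet =~ $rxcheck{$_} } sort keys %rxcheck;
}

print "$_\n" for matching_checks('Meet me at 10:30 with John Q.');
```

No /g is used here because each pattern is only tested for a match, not iterated over.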
The problem arises when I try to use a hash where the key is the expression and the value is what I want to replace it with ... and there is a $1 or $2 in the replacement.
%rxcheck2 = (
'(\w) \"'=>'$1\"'
);
The above should do this:
$snippet =~ s/(\w) \"/$1\"/g;
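However, I cannot get the "$1" part to carry over into the substitution as a back-reference (I think that is the right term ... the $1 is not handled the way I expect, whichever quotes I use). So, this is what I tried: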
if($snippet =~ s/$key/$rxcheck2{ $key }/g){ }
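That does not work.
So, 2 questions:
Easy: how do I manage a large number of regexes in a way that is easy to edit, so I can change and add them without copying and pasting lines?
Harder: how do I handle them using a hash (or an array, if I have several pieces per rule, such as 1) the part to search for, 2) the replacement, 3) a comment, 4) global/case-insensitive modifiers), if that is in fact the easiest way to do this?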
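A minimal sketch of one possible way to keep such substitutions in an editable table while still using $1: each rule is an array of comment, precompiled pattern, and a small sub that builds the replacement, and the substitution runs with /e so the sub is called for every match. The names @rules and clean_text are illustrative assumptions, and the three rules are only samples taken from the lines above.

```perl
use strict;
use warnings;

# Hypothetical rule table: [ comment, pattern, replacement callback ].
# An array keeps the rules in a fixed order, which a hash would not.
my @rules = (
    [ 'strip tags',              qr/<[^>]+>/, sub { ' ' }     ],
    [ 'collapse whitespace',     qr/\s+/,     sub { ' ' }     ],
    [ 'no space before a quote', qr/(\w) "/,  sub { "$1\"" } ],
);

sub clean_text {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($comment, $pattern, $replacement) = @$rule;
        # /e evaluates the callback once per match, so $1 is read
        # at match time rather than when the table is built.
        $text =~ s/$pattern/$replacement->()/ge;
    }
    return $text;
}

print clean_text('<i>5 "</i>   tall'), "\n";
```

Modifiers such as case-insensitivity can be stored with the pattern itself (for example qr/.../i), so only /g and /e need to appear on the s/// line.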