Regex: lazy quantifier versus negated character class in the Markdown source

I'm reading the Markdown source written by John Gruber in Perl, and it has a _Detab subroutine that converts tabs to spaces while preserving the text's indentation. The line in question is line 1314 of Markdown.pl:

 $text =~ s{(.*?)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge; 

Wouldn't that cause unnecessary backtracking? Would the following pattern be more efficient?

 /([^\t\n]*)\t/ 

Or am I missing something? Thanks.

By the way, I only negate \n, not \r, because all line endings are normalized to \n beforehand.

+2
performance regex perl markdown
2 answers

Don't guess when you can benchmark:

 use Benchmark 'cmpthese';

 my $source = "\t\thello\n\t\t\tworld\n" x 100;
 my $g_tab_width = 8;
 my ($textU, $textN);

 cmpthese(-3, {
     ungreedy => sub {
         $textU = $source;
         $textU =~ s{(.*?)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
     },
     negated => sub {
         $textN = $source;
         $textN =~ s{([^\n\t]*)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
     },
 });

 # ensure they do the same thing
 die "whoops" unless $textN eq $textU;

I found that the ungreedy version (as it appears in the Markdown source) is about 40% faster than the negated character class you propose:

             Rate negated ungreedy
 negated   1204/s      --     -30%
 ungreedy  1718/s     43%       --

My guess is that matching . is more efficient than matching a negated character class, which more than makes up for the extra backtracking. Further tests would be needed to confirm this.
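As a starting point for that further testing, here is a rough sketch that runs the same comparison over a few differently shaped inputs. The input shapes and the helper names detab_ungreedy and detab_negated are my own, picked only to vary how much text comes before each tab; they are not from Markdown.pl, and the numbers will depend on the Perl version and the machine.

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

my $g_tab_width = 8;

# Hypothetical test inputs (assumptions for illustration only).
my $long_line    = ("x" x 60) . "\tend\n";   # long run of text before the tab
my $tabless_line = ("x" x 60) . "\n";        # line with no tab at all

my %inputs = (
    leading_tabs  => "\t\thello\n\t\t\tworld\n" x 100,
    long_prefix   => $long_line x 100,
    tabless_lines => ($tabless_line x 50) . "\tend\n",
);

# Same substitution as in Markdown.pl, wrapped so it works on a copy.
sub detab_ungreedy {
    my ($text) = @_;
    $text =~ s{(.*?)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
    return $text;
}

# The proposed alternative using a negated character class.
sub detab_negated {
    my ($text) = @_;
    $text =~ s{([^\n\t]*)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
    return $text;
}

for my $name (sort keys %inputs) {
    my $source = $inputs{$name};

    # Refuse to compare timings if the two versions disagree.
    die "outputs differ for $name"
        unless detab_ungreedy($source) eq detab_negated($source);

    print "--- $name ---\n";
    cmpthese(-1, {
        ungreedy => sub { detab_ungreedy($source) },
        negated  => sub { detab_negated($source) },
    });
}
```

If the ranking flips between input shapes, that would suggest the result above depends on the particular test string rather than on the patterns themselves.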

+5

You're right: it does cause unnecessary backtracking, and yes, your pattern will be more efficient.

Most people don't understand or think about how regular expressions work, and/or they just do what they were taught. I don't know the details of this code or its author, but this is a very common regex idiom that you will see in Perl code.

And frankly, for most use cases this doesn't make much of a difference.
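If you want to look at what the engine actually does with each pattern, one option is Perl's re pragma in debug mode, which prints the compiled program and a trace of the match to STDERR. This is only a sketch: the sample line is made up, and the trace format varies between Perl versions, so treat the output as something to read by eye rather than rely on.

```perl
use strict;
use warnings;

# Made-up sample line (an assumption for illustration).
my $line = "hello world\tend";

my ($lazy, $negated);

{
    # Lexically scoped on modern Perls: only regexes compiled
    # inside this block are traced to STDERR.
    use re 'debug';

    ($lazy)    = $line =~ /(.*?)\t/;       # lazy quantifier
    ($negated) = $line =~ /([^\t\n]*)\t/;  # negated character class
}

print "lazy captured:    '$lazy'\n";
print "negated captured: '$negated'\n";
```

Both patterns capture the same text here; the interesting part is comparing the two traces.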

+1
