Why does this regular expression call substcont an excessive number of times?

Question

This is more out of curiosity than anything else, as I cannot find any useful information on Google about this function (CORE::substcont).

While profiling and optimizing some old, slow XML-parsing code, I found that the following regular expression calls substcont 31 times each time the line is executed, and takes a huge amount of time:

Calls: 10000 Time: 2.65s Sub calls: 320,000 Time in subs: 1.15s

    $handle =~s/(>)\s*(<)/$1\n$2/g;
    # spent 1.09s making 310000 calls to main::CORE:substcont, avg 4µs/call
    # spent 58.8ms making 10000 calls to main::CORE:subst, avg 6µs/call

Compared to the previous line:

Calls: 10,000 Time: 371ms Sub calls: 30,000 Time in subs: 221ms

    $handle =~s/(.*)\s*(<\?)/$1\n$2/g;
    # spent 136ms making 10000 calls to main::CORE:subst, avg 14µs/call
    # spent 84.6ms making 20000 calls to main::CORE:substcont, avg 4µs/call

The number of substcont calls is quite surprising, especially since I would have expected the second regular expression to be the more expensive one. This is obviously why profiling is a Good Thing :-)

I subsequently changed both of these lines to remove the unnecessary backrefs, with dramatic results for the badly performing line:

Calls: 10000 Time: 393ms Sub calls: 10000 Time in subs: 341ms

    $handle =~s/>\s*</>\n</g;
    # spent 341ms making 10000 calls to main::CORE:subst, avg 34µs/call
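The rewrite is only safe if both forms produce the same string. A minimal sketch of that check follows; the sample input is hypothetical, since the real XML is not shown here.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real XML from the profiled code is not shown.
my $xml = '<a>  <b>text</b><c/> </a>';

# Original form: captures the angle brackets and puts them back via $1 and $2.
(my $with_backrefs = $xml) =~ s/(>)\s*(<)/$1\n$2/g;

# Rewritten form: the brackets are literal in the replacement, so no captures.
(my $literal = $xml) =~ s/>\s*</>\n</g;

print $with_backrefs eq $literal ? "same output\n" : "different output\n";
```

Because the captured text is always a single fixed character, writing it literally in the replacement changes nothing about the result, only about how Perl compiles the substitution.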

So my question is: why does the original make SO many calls to substcont, and what does substcont even do in the regex engine that takes so long?

optimization profiling regex perl

paulw1128 May 24, '10 at 16:52

1 answer

Schwern · Accepted Answer · 2010-05-24T19:59:33+0000

substcont is the internal Perl name for the "substitution iterator", something to do with s///. Based on what little information I have, it seems that substcont is triggered when a backref is used, that is, when $1 is present in the replacement. You can play with it a bit using B::Concise.

Here are the opcodes for a simple substitution without a backref in the replacement.

    $ perl -MO=Concise,-exec -we'$foo = "foo"; $foo =~ s/(foo)/bar/ig'
    1  <0> enter
    2  <;> nextstate(main 1 -e:1) v:{
    3  <$> const[PV "foo"] s
    4  <#> gvsv[*foo] s
    5  <2> sassign vKS/2
    6  <;> nextstate(main 1 -e:1) v:{
    7  <#> gvsv[*foo] s
    8  <$> const[PV "bar"] s
    9  </> subst(/"(foo)"/) vKS
    a  <@> leave[1 ref] vKP/REFC
    -e syntax OK

And one with.

    $ perl -MO=Concise,-exec -we'$foo = "foo"; $foo =~ s/(foo)/$1/ig'
    1  <0> enter
    2  <;> nextstate(main 1 -e:1) v:{
    3  <$> const[PV "foo"] s
    4  <#> gvsv[*foo] s
    5  <2> sassign vKS/2
    6  <;> nextstate(main 1 -e:1) v:{
    7  <#> gvsv[*foo] s
    8  </> subst(/"(foo)"/ replstart->9) vKS
    9  <#> gvsv[*1] s
    a  <|> substcont(other->8) sK/1
    b  <@> leave[1 ref] vKP/REFC
    -e syntax OK

That's all I can offer. You could try Rx, mjd's old regex debugger.
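In the second listing, substcont sits in a loop with the replacement ops, so its call count plausibly grows with the number of matches in the string. That is an inference from the listing and the profile figures, not something confirmed here. The sketch below counts matches and replacement evaluations for a hypothetical input with 30 tag boundaries; the input and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical input: 31 adjacent tags, which gives 30 "><" boundaries.
my $xml = join '', map { "<t$_/>" } 1 .. 31;

# Count the matches; without capture groups, a list-context /g match
# returns one element per match.
my $matches = () = $xml =~ />\s*</g;

# Count how often the replacement is evaluated, using /e so that the
# replacement runs as code once per match.
my $evaluations = 0;
(my $copy = $xml) =~ s/(>)\s*(<)/$evaluations++; "$1\n$2"/ge;

print "matches: $matches, replacement evaluations: $evaluations\n";
```

If substcont runs once per evaluated replacement plus once more to set up the loop (a guess), 30 matches per string would account for the 31 calls per execution seen in the profile.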
