Is there any performance difference between sed pipelined calls and multiple sed expressions?

I have a question about the efficiency of sed in bash. I have a pipelined series of sed calls, for example:

    var1="Some string of text"
    var2=$(echo "$var1" | sed 's/pattern1/replacement1/g' | sed 's/pattern2/replacement2/g' | sed 's/pattern3/replacement3/g' | sed 's/pattern4/replacement4/g' | sed 's/pattern5/replacement5/g')

Assuming no substitution depends on the edited output of an earlier sed in the pipeline, would it be better to write the above as a single sed call with multiple expressions? For instance:

    var2=$(echo "$var1" | sed -e 's/pattern1/replacement1/g' -e 's/pattern2/replacement2/g' -e 's/pattern3/replacement3/g' -e 's/pattern4/replacement4/g' -e 's/pattern5/replacement5/g')

Is there any efficiency to be gained here?

+4
5 answers

Short answer

Using multiple expressions will be faster than using multiple pipelines, because you avoid the additional overhead of creating pipes and forking extra sed processes. In practice, however, the difference rarely matters.

Benchmarks

Using multiple expressions is faster than multiple pipelines, but probably not by enough for the average use case. Using your example, the average difference in execution speed was only two thousandths of a second, which is not enough to worry about.

    # Average run with multiple pipelines.
    $ time { echo "$var1" | sed 's/pattern1/replacement1/g' | sed 's/pattern2/replacement2/g' | sed 's/pattern3/replacement3/g' | sed 's/pattern4/replacement4/g' | sed 's/pattern5/replacement5/g'; }
    Some string of text

    real    0m0.007s
    user    0m0.000s
    sys     0m0.004s

    # Average run with multiple expressions.
    $ time { echo "$var1" | sed \
        -e 's/pattern1/replacement1/g' \
        -e 's/pattern2/replacement2/g' \
        -e 's/pattern3/replacement3/g' \
        -e 's/pattern4/replacement4/g' \
        -e 's/pattern5/replacement5/g'; }
    Some string of text

    real    0m0.005s
    user    0m0.000s
    sys     0m0.000s

Of course, this doesn't test a large input file, thousands of input files, or a loop with tens of thousands of iterations. Still, it is safe to say the difference is small enough to be irrelevant in most common situations.

Unusual situations are a different story. In those cases, benchmarking will tell you whether replacing pipelines with multiple expressions is a worthwhile optimization for your use case.
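As a sketch of such a benchmark (the temporary file path and the input size are arbitrary choices, not anything from the question):

```shell
# Generate a large throwaway input file.
seq 1000000 > /tmp/sed_bench_input.txt

# Time the pipelined form: three sed processes, two pipes.
time sed 's/1/a/g' /tmp/sed_bench_input.txt | sed 's/2/b/g' | sed 's/3/c/g' > /dev/null

# Time the single-process, multi-expression form doing the same work.
time sed -e 's/1/a/g' -e 's/2/b/g' -e 's/3/c/g' /tmp/sed_bench_input.txt > /dev/null

rm -f /tmp/sed_bench_input.txt
```

Both commands produce identical output, so any difference in the reported times is pure pipeline/process overhead.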

+8

Most of sed's overhead tends to be in processing the regular expressions, but you process the same number of regular expressions in both of your examples.

Note that the operating system has to set up stdin and stdout for each element of the pipeline. sed also takes memory on your system, and the OS must allocate that memory for each sed instance, whether there is one instance or four.

Here is my benchmark:

    $ jot -r 1000000 1 10000 | time sed 's/1/_/g' | time sed 's/2/_/g' | time sed 's/3/_/g' | time sed 's/4/_/g' >/dev/null
            2.38 real         0.84 user         0.01 sys
            2.38 real         0.84 user         0.01 sys
            2.39 real         0.85 user         0.01 sys
            2.39 real         0.85 user         0.01 sys
    $ jot -r 1000000 1 10000 | time sed 's/1/_/g;s/2/_/g;s/3/_/g;s/4/_/g' >/dev/null
            2.71 real         2.57 user         0.02 sys
    $ jot -r 1000000 1 10000 | time sed 's/1/_/g;s/2/_/g;s/3/_/g;s/4/_/g' >/dev/null
            2.71 real         2.56 user         0.02 sys
    $ jot -r 1000000 1 10000 | time sed 's/1/_/g;s/2/_/g;s/3/_/g;s/4/_/g' >/dev/null
            2.71 real         2.57 user         0.02 sys
    $ jot -r 1000000 1 10000 | time sed 's/1/_/g;s/2/_/g;s/3/_/g;s/4/_/g' >/dev/null
            2.74 real         2.57 user         0.02 sys
    $ dc
    .84 2* .85 2* + p
    3.38
    $

And since 3.38 > 2.57, the single sed instance uses less total CPU time.

+3

Yes. You avoid the overhead of starting sed again every time.

+2

You could measure the performance yourself to evaluate the difference, perhaps using the time command, and determine empirically which is more effective.
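One way to do that with bash's time keyword is to repeat each variant in a loop, so the per-invocation startup cost accumulates into something measurable (the iteration count of 1000 and the patterns are arbitrary illustrations):

```shell
var1="Some string of text"

# Two sed processes forked per iteration.
time for i in $(seq 1000); do
  echo "$var1" | sed 's/Some/A/' | sed 's/text/words/' > /dev/null
done

# One sed process per iteration, same substitutions.
time for i in $(seq 1000); do
  echo "$var1" | sed -e 's/Some/A/' -e 's/text/words/' > /dev/null
done
```

The second loop should finish sooner, since each iteration forks one sed instead of two.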

0

As noted in ghoti's answer, your examples process the same number of regular expressions either way (separate sed calls vs. a series of -e expressions), but the OS overhead includes setting up pipes and processes and allocating memory for each sed instance. For a handful of calls the OS overhead is not worth worrying about, but if the count runs into the thousands or more, it could be.

In any case, machine efficiency aside, programmer efficiency is often the more important concern. Both methods shown so far are clumsy to type. It is easier (at least with GNU sed) to use a list of sed commands separated by semicolons instead of many separate -e options. For example:

    $ var1="Some p1 string p2 of p3 text p4 etc"
    $ var2=$(echo "$var1" | sed 's/p1/a1/g; s/p2/b2/g; s/p3/c3/g; s/p4/d4/; s/p5/e5/g')
    $ echo $var2
    Some a1 string b2 of c3 text d4 etc

Unfortunately, I do not see the semicolon documented as a sed command separator in the sed documentation, and I don't know whether it is available in versions other than GNU sed.
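If portability is a concern, newline-separated commands inside a single quoted script string work the same way as the semicolon form and are specified by POSIX, as are multiple -e options. A minimal sketch (using simplified patterns, not the exact ones from the question):

```shell
var1="Some p1 string p2 of p3 text p4 etc"

# One sed process; each substitution on its own line of the script.
var2=$(echo "$var1" | sed '
s/p1/a1/g
s/p2/b2/g
s/p3/c3/g
s/p4/d4/g
')
echo "$var2"
```

This keeps the single-process efficiency of the semicolon form while staying within what POSIX sed guarantees.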

0
