Ruby string search: which is faster, split or regex?

Question

This is a two-part question. Say you have an array of strings that can be split at a character (for example, email addresses at "@" or file names at "."). Which is the most performant way to get the characters before the split character?

my_string.split(char)[0]

or

 my_string[/regex/]

The second part of the question is: how do you write a regex that gets everything before the first instance of a character? The regex below matches a run of certain characters before the '.' (because '.' is not in the pattern), but that was my hacky way of getting to a solution.

 my_string[/[A-Za-z0-9\_-]+/]

thanks!

ruby regex

kreek Sep 23 '11 at 18:45

2 answers

partition will be faster than split because it will not continue checking after the first match.

A regular slice with an index will be faster than a regexp slice.

The regexp slice also slows down dramatically as the part of the string before the match gets longer. It becomes slower than the original split after roughly 10 characters, and gets much worse from there. If your Regexp does not use a + or * quantifier, I think it fares a little better.
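The benchmark below only measures speed on well-formed addresses. As a minimal sketch of how the four expressions behave on other input, the following compares them on a normal address, on a string with no "@", and on a string that starts with "@". The helper `head_by_index` is a hypothetical name introduced here for illustration; it is not part of the answer.

```ruby
# Hypothetical helper (not from the answer): index-based slice that
# returns the whole string when the separator is absent.
def head_by_index(str, sep)
  idx = str.index(sep)
  idx ? str[0, idx] : str
end

email = 'some_name@regulardomain.com'

# All four approaches agree when the separator is present.
results = [
  email.split('@')[0],
  email.partition('@').first,
  email[/[^@]+/],
  head_by_index(email, '@')
]
# results.uniq  # => ["some_name"]

# Without a separator, index returns nil, so the bare
# no_sep[0, no_sep.index('@')] form raises TypeError.
no_sep = 'some_name'
missing_index = no_sep.index('@')             # => nil

# With a leading separator, the regexp skips ahead to the first
# run of non-"@" characters instead of returning an empty string.
leading = '@domain'
leading_partition = leading.partition('@').first   # => ""
leading_regexp    = leading[/[^@]+/]               # => "domain"
```

If empty or missing prefixes can occur in your data, it is worth deciding which of these behaviors you want before picking the fastest variant.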

    require 'benchmark'

    n = 1000000

    def bench n, email
      printf "\n%s %s times\n", email, n
      Benchmark.bm do |x|
        x.report('split    ') do
          n.times{ email.split('@')[0] }
        end
        x.report('partition') do
          n.times{ email.partition('@').first }
        end
        x.report('slice reg') do
          n.times{ email[/[^@]+/] }
        end
        x.report('slice ind') do
          n.times{ email[0,email.index('@')] }
        end
      end
    end

    bench n, 'a@be.pl'
    bench n, 'some_name@regulardomain.com'
    bench n, 'some_really_long_long_email_name@regulardomain.com'
    bench n, 'some_name@rediculously-extra-long-silly-domain.com'
    bench n, 'some_really_long_long_email_name@rediculously-extra-long-silly-domain.com'
    bench n, 'a'*254 + '@' + 'b'*253    # rfc limits
    bench n, 'a'*1000 + '@' + 'b'*1000  # for other string processing

Results on Ruby 1.9.3p484:

    a@be.pl 1000000 times
                    user     system      total        real
    split       0.405000   0.000000   0.405000 (  0.410023)
    partition   0.375000   0.000000   0.375000 (  0.368021)
    slice reg   0.359000   0.000000   0.359000 (  0.357020)
    slice ind   0.312000   0.000000   0.312000 (  0.309018)

    some_name@regulardomain.com 1000000 times
                    user     system      total        real
    split       0.421000   0.000000   0.421000 (  0.432025)
    partition   0.374000   0.000000   0.374000 (  0.379021)
    slice reg   0.421000   0.000000   0.421000 (  0.411024)
    slice ind   0.312000   0.000000   0.312000 (  0.315018)

    some_really_long_long_email_name@regulardomain.com 1000000 times
                    user     system      total        real
    split       0.593000   0.000000   0.593000 (  0.589034)
    partition   0.531000   0.000000   0.531000 (  0.529030)
    slice reg   0.764000   0.000000   0.764000 (  0.771044)
    slice ind   0.484000   0.000000   0.484000 (  0.478027)

    some_name@rediculously-extra-long-silly-domain.com 1000000 times
                    user     system      total        real
    split       0.483000   0.000000   0.483000 (  0.481028)
    partition   0.390000   0.016000   0.406000 (  0.404023)
    slice reg   0.406000   0.000000   0.406000 (  0.411024)
    slice ind   0.312000   0.000000   0.312000 (  0.344020)

    some_really_long_long_email_name@rediculously-extra-long-silly-domain.com 1000000 times
                    user     system      total        real
    split       0.639000   0.000000   0.639000 (  0.646037)
    partition   0.609000   0.000000   0.609000 (  0.596034)
    slice reg   0.764000   0.000000   0.764000 (  0.773044)
    slice ind   0.499000   0.000000   0.499000 (  0.491028)

    a<254>@b<253> 1000000 times
                    user     system      total        real
    split       0.952000   0.000000   0.952000 (  0.960055)
    partition   0.733000   0.000000   0.733000 (  0.731042)
    slice reg   3.432000   0.000000   3.432000 (  3.429196)
    slice ind   0.624000   0.000000   0.624000 (  0.625036)

    a<1000>@b<1000> 1000000 times
                    user     system      total        real
    split       1.888000   0.000000   1.888000 (  1.892108)
    partition   1.170000   0.016000   1.186000 (  1.188068)
    slice reg  12.885000   0.000000  12.885000 ( 12.914739)
    slice ind   1.108000   0.000000   1.108000 (  1.097063)

Ruby 2.1.3p242 shows about the same percentage differences but is roughly 10-30% faster across the board, with the exception of the regexp case, where it slows down even more.

Matt Oct 13 '14 at 21:43

mu is too short · Accepted Answer · 2011-09-23T19:09:13+0000

The easiest way to answer the first part is, as always, to benchmark it with your real data. For example:

    require 'benchmark'

    Benchmark.bm do |x|
      x.report { 50000.times { a = 'a@b.c'.split('@')[0] } }
      x.report { 50000.times { a = 'a@b.c'[/[^@]+/] } }
    end

says (on my setup):

          user     system      total        real
      0.130000   0.010000   0.140000 (  0.130946)
      0.090000   0.000000   0.090000 (  0.096260)

So the regex solution looks a little faster, but the difference is barely noticeable even with 50,000 iterations. OTOH, the regex says exactly what you mean ("give me everything up to the first @"), while the split solution gets the desired result in a slightly roundabout way.

The split approach is probably slower because it has to scan the entire string to break it into pieces, then build an array of the pieces, and finally extract the first element of the array and throw the rest away; I don't know if the VM is clever enough to recognize that it doesn't need to build the array, so that it could do a little less work.
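To make the extra work concrete, here is a small sketch of what `split` returns on a string with several separators, next to two standard-library alternatives that stop at the first one: `split` with a limit argument and `partition`. The sample string is made up for illustration, and no speed claim is intended; whether either alternative is faster on your data is something to benchmark.

```ruby
address = 'first@second@third'

# Plain split cuts at every separator and builds the full array.
all_parts = address.split('@')        # => ["first", "second", "third"]

# A limit of 2 stops splitting after the first separator.
two_parts = address.split('@', 2)     # => ["first", "second@third"]

# partition returns [head, separator, tail] around the first match.
head_sep_tail = address.partition('@')  # => ["first", "@", "second@third"]

head = two_parts[0]                   # => "first"
```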

As for your second question, say what you mean:

 my_string[/[^.]+/]

If you want everything before the first period, then say "everything up to the first period" rather than "the first chunk made up of these characters (which happen not to include a period)".
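A few made-up file names show where the two patterns diverge. This is only an illustrative sketch; the anchored variant `/\A[^.]*/` is an addition here, not something from the answer, for the case where a leading period should yield an empty prefix.

```ruby
negated    = /[^.]+/            # everything up to the first period
whitelist  = /[A-Za-z0-9_-]+/   # first run of the listed characters
anchored   = /\A[^.]*/          # assumption: allow an empty prefix

simple = 'archive.tar.gz'
simple_negated   = simple[negated]      # => "archive"
simple_whitelist = simple[whitelist]    # => "archive"

# A space is not in the whitelist, so that pattern stops early.
spaced = 'my file.txt'
spaced_negated   = spaced[negated]      # => "my file"
spaced_whitelist = spaced[whitelist]    # => "my"

# With a leading period, the unanchored pattern skips past it.
dotfile = '.bashrc'
dotfile_negated  = dotfile[negated]     # => "bashrc"
dotfile_anchored = dotfile[anchored]    # => ""
```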
