Fast line suffix check in C # (.NET 4.0)?

Question

Fast line suffix check in C # (.NET 4.0)?

What is the fastest way to check string suffixes in C #?

I need to check every line in a large list (from 5000 to 100000 items) for a specific term. The term is guaranteed never to be embedded in a string. In other words, if a line contains this term, it will be at the end of the line. The line must also be longer than the suffix. Cultural information does not matter.

Here's how different methods work against 100,000 lines (half of them have a suffix):

1. Substring Comparison - 13.60ms 2. String.Contains - 22.33ms 3. CompareInfo.IsSuffix - 24.60ms 4. String.EndsWith - 29.08ms 5. String.LastIndexOf - 30.68ms

This is the average time. [Edit] Forgot to mention that the lines also fall into separate lists, but this is not important. However, this increases the execution time.

On my system, comparing a substring (extracting the end of a string using the String.Substring method and comparing it with a suffix) is the fastest when testing over 100,000 lines. The problem with using substring comparison is that the Garbage Collection can slow it down significantly (more than other methods), because String.Substring creates new lines. The effect is not as bad in .NET 4.0 as it was in 3.5 and below, but it is still noticeable. In my tests, String.Substring ran sequentially slower on line sets 12000-13000. This will obviously differ between systems and implementations.

[EDIT] Verification code: http://pastebin.com/smEtYNYN

[EDIT] FlyingStreudel code is fast, but Jon Skeet's recommendation for using EndsWith in conjunction with StringComparison.Ordinal seems to be the best option.

+6

string c #

ilitirit Feb 08 '11 at 14:55

source share

7 answers

John is absolutely right; this is potentially not a comparison of apples with apples, because different string methods have different default values for hidden sensitivity. Be sure that you get the comparison semantics that you intend in each of them.

In addition to John's answer, I would add that the corresponding question is not “the fastest”? but rather "too slow"? What is your performance goal for this code? The slowest method still finds the result in less time than it takes for the movie projector to advance to the next frame, and obviously this does not apply to people. If your goal is that the search appears instantly for the user, then you are done; any of these methods work. If your goal is that the search takes less than a millisecond, then none of these methods work; they are all too slow. What is the budget?

+13

Eric Lippert Feb 08 '11 at 15:12

source share

I took a look at your test code, and frankly, it looks dodgy.

You measure all kinds of extraneous things along with what you want to measure; you measure the value of foreach and adding to the list, both of which can have costs of the same order of magnitude as the thing you are trying to test.

In addition, you do not throw away the first run; remember that the JIT compiler is going to write the code that you call the first time through the loop, and it will be hot and ready to work the second time, so your results will be distorted; you average one potentially very large thing with many little things. In the past, when I did this, I discovered situations where jit time actually dominated the time of everything else. This is real? Do you want to measure jit time or not be considered part of the mean?

+6

Eric Lippert Feb 08 '11 at 18:08

source share

I don’t know how fast this happens, but is that what I will do?

 static bool HasSuffix(string check, string suffix) { int offset = check.Length - suffix.Length; for (int i = 0; i < suffix.Length; i++) { if (check[offset + i] != suffix[i]) { return false; } } return true; }

Edit: OOPS x2

edit: So, I wrote my own little test ... is that so? He conducts 25 trials evaluating a million rows and accepts the average value of the difference in performance. Several times when I ran it, he constantly deduced that CharCompare was ~ 10-40 ms faster per million records. Thus, this is a very slight increase in efficiency (.000000001s / call) :) In general, I doubt that it will be important which method you implement.

 class Program { volatile static List<string> strings; static double[] results = new double[25]; static void Main(string[] args) { strings = new List<string>(); Random r = new Random(); for (int rep = 0; rep < 25; rep++) { Console.WriteLine("Run " + rep); strings.Clear(); for (int i = 0; i < 1000000; i++) { string temp = ""; for (int j = 0; j < r.Next(3, 101); j++) { temp += Convert.ToChar( Convert.ToInt32( Math.Floor(26 * r.NextDouble() + 65))); } if (i % 4 == 0) { temp += "abc"; } strings.Add(temp); } OrdinalWorker ow = new OrdinalWorker(strings); CharWorker cw = new CharWorker(strings); if (rep % 2 == 0) { cw.Run(); ow.Run(); } else { ow.Run(); cw.Run(); } Thread.Sleep(1000); results[rep] = ow.finish.Subtract(cw.finish).Milliseconds; } double tDiff = 0; for (int i = 0; i < 25; i++) { tDiff += results[i]; } double average = tDiff / 25; if (average < 0) { average = average * -1; Console.WriteLine("Char compare faster by {0}ms average", average.ToString().Substring(0, 4)); } else { Console.WriteLine("EndsWith faster by {0}ms average", average.ToString().Substring(0, 4)); } } } class OrdinalWorker { List<string> list; int count; public Thread t; public DateTime finish; public OrdinalWorker(List<string> l) { list = l; } public void Run() { t = new Thread(() => { string suffix = "abc"; for (int i = 0; i < list.Count; i++) { count = (list[i].EndsWith(suffix, StringComparison.Ordinal)) ? count + 1 : count; } finish = DateTime.Now; }); t.Start(); } } class CharWorker { List<string> list; int count; public Thread t; public DateTime finish; public CharWorker(List<string> l) { list = l; } public void Run() { t = new Thread(() => { string suffix = "abc"; for (int i = 0; i < list.Count; i++) { count = (HasSuffix(list[i], suffix)) ? count + 1 : count; } finish = DateTime.Now; }); t.Start(); } static bool HasSuffix(string check, string suffix) { int offset = check.Length - suffix.Length; for (int i = 0; i < suffix.Length; i++) { if (check[offset + i] != suffix[i]) { return false; } } return true; } }

+4

Flyingstreudel Feb 08 '11 at 15:10

source share

Have you tried direct access? I mean, you can make a loop watching a similar string, it can be faster than making a substring and have the same behavior.

 int i,j; foreach(String testing in lists){ i=0; j=0; int ok=1; while(ok){ i = testing.lenght - PATTERN.lenght; if(i>0 && i<testing.lenght && testing[i] != PATTERN[j]) ok = 0; i++; j++; } if(ok) return testing; }

Also, if these are large strings, you can try using a hash.

0

ykatchou Feb 08 '11 at 15:02

source share

I do not pretend to be an expert in this field, but I was forced to at least profile it to some extent (knowing that my dummy scenario will be significantly different from your own), and here is what I came up with:

It seems, at least for me, EndsWith is leading with LastIndexOf , sequentially entering the second, some timings:

 SubString: 00:00:00.0191877 Contains: 00:00:00.0201980 CompareInfo: 00:00:00.0255181 EndsWith: 00:00:00.0120296 LastIndexOf: 00:00:00.0133181

They were gleaned from processing 100,000 lines, where the desired suffix appeared in all lines, and therefore, for me, just echoes of John's answer (where the advantage is speed and descriptiveness). And the code used to achieve these results:

 class Program { class Profiler { private Stopwatch Stopwatch = new Stopwatch(); public TimeSpan Elapsed { get { return Stopwatch.Elapsed; } } public void Start() { Reset(); Stopwatch.Start(); } public void Stop() { Stopwatch.Stop(); } public void Reset() { Stopwatch.Reset(); } } static string suffix = "_sfx"; static Profiler profiler = new Profiler(); static List<string> input = new List<string>(); static List<string> output = new List<string>(); static void Main(string[] args) { GenerateSuffixedStrings(); FindStringsWithSuffix_UsingSubString(input, suffix); Console.WriteLine("SubString: {0}", profiler.Elapsed); FindStringsWithSuffix_UsingContains(input, suffix); Console.WriteLine("Contains: {0}", profiler.Elapsed); FindStringsWithSuffix_UsingCompareInfo(input, suffix); Console.WriteLine("CompareInfo: {0}", profiler.Elapsed); FindStringsWithSuffix_UsingEndsWith(input, suffix); Console.WriteLine("EndsWith: {0}", profiler.Elapsed); FindStringsWithSuffix_UsingLastIndexOf(input, suffix); Console.WriteLine("LastIndexOf: {0}", profiler.Elapsed); Console.WriteLine(); Console.WriteLine("Press any key to exit..."); Console.ReadKey(); } static void GenerateSuffixedStrings() { for (var i = 0; i < 100000; i++) { input.Add(Guid.NewGuid().ToString() + suffix); } } static void FindStringsWithSuffix_UsingSubString(IEnumerable<string> strings, string suffix) { output.Clear(); profiler.Start(); foreach (var s in strings) { if(s.Substring(s.Length - 4) == suffix) output.Add(s); } profiler.Stop(); } static void FindStringsWithSuffix_UsingContains(IEnumerable<string> strings, string suffix) { output.Clear(); profiler.Start(); foreach (var s in strings) { if (s.Contains(suffix)) output.Add(s); } profiler.Stop(); } static void FindStringsWithSuffix_UsingCompareInfo(IEnumerable<string> strings, string suffix) { var ci = CompareInfo.GetCompareInfo("en-GB"); output.Clear(); profiler.Start(); foreach (var s in strings) { if (ci.IsSuffix(s, suffix)) output.Add(s); } profiler.Stop(); } static void FindStringsWithSuffix_UsingEndsWith(IEnumerable<string> strings, string suffix) { output.Clear(); profiler.Start(); foreach (var s in strings) { if (s.EndsWith(suffix)) output.Add(s); } profiler.Stop(); } static void FindStringsWithSuffix_UsingLastIndexOf(IEnumerable<string> strings, string suffix) { output.Clear(); profiler.Start(); foreach (var s in strings) { if (s.LastIndexOf(suffix) == s.Length - 4) output.Add(s); } profiler.Stop(); } }

EDIT:

As commented, I tried to repeat this with only some lines having a suffix, and these are the results:

 SubString: 00:00:00.0079731 Contains: 00:00:00.0243696 CompareInfo: 00:00:00.0334056 EndsWith: 00:00:00.0196668 LastIndexOf: 00:00:00.0229599

The line generator method has been updated as follows to create lines:

  static void GenerateSuffixedStrings() { var nxt = false; var rnd = new Random(); for (var i = 0; i < 100000; i++) { input.Add(Guid.NewGuid().ToString() + (rnd.Next(0, 2) == 0 ? suffix : string.Empty)); } }

Further, this trend continues if none of the lines has a suffix:

 SubString: 00:00:00.0055584 Contains: 00:00:00.0187089 CompareInfo: 00:00:00.0228983 EndsWith: 00:00:00.0114227 LastIndexOf: 00:00:00.0199328

However, this gap is narrowed by assigning a quarter of the suffix entries (first quarter and then sorting to randomise coverage):

 SubString: 00:00:00.0302997 Contains: 00:00:00.0305685 CompareInfo: 00:00:00.0306335 EndsWith: 00:00:00.0351229 LastIndexOf: 00:00:00.0322899

Output? IMO, and agreeing with John, EndsWith seems to be the way to go (based on this limited test, anyway).

Further editing:

To cure John's curiosity, I did some more tests on EndsWith , with and without Ordinal string comparison ...

In 100,000 lines with a quarter of suffixes:

 EndsWith: 00:00:00.0795617 OrdinalEndsWith: 00:00:00.0240631

For 1,000,000 lines with a quarter of suffixes:

 EndsWith: 00:00:00.5460591 OrdinalEndsWith: 00:00:00.2807860

In 10,000,000 lines with a quarter of suffixes:

 EndsWith: 00:00:07.5889581 OrdinalEndsWith: 00:00:03.3248628

Please note that I only ran the last test once, creating lines proving that this laptop needs to be replaced

0

Grant thomas Feb 08 '11 at 15:53

source share

There is a lot of good information here. I would like to note that if your suffix is short, it may be even faster to look at the last few characters. A modified version of the reference code is here: http://pastebin.com/6nNdbEvW . He gives the results of theses:

Last char equality: 1.52 ms (50,000)
Last 2 char equality: 1.56 ms (50,000)
EndsWith using StringComparison.Ordinal: 3.75 ms (50,000)
Contains: 11.10 ms (50,000)
LastIndexOf: 14.85 ms (50,000)
IsSuffix: 11.30 ms (50,000)
Substring comparison: 17.69 ms (50,000)

-one

user1088761 Jun 27 '12 at 10:56

source share

Jon skeet · Accepted Answer · 2011-02-08T14:59:18+0000

If the time spent checking 100,000 rows does it really matter?

Personally, I would use string.EndsWith on the grounds that it is the most descriptive: it definitely says what you are trying to check.

I am somewhat suspicious that it seems to work worse, though ... if you could post your test code, that would be very helpful. (In particular, he really doesn't need to do as much work as string.Contains .)

Have you tried to indicate the ordinal match? This can make it significantly faster:

 if (x.EndsWith(y, StringComparison.Ordinal))

Of course, you should not do this if you do not want comparisons in order - do you expect culturally sensitive matches? (Developers, as a rule, do not consider such things, and I am very firmly in this category.)

Fast line suffix check in C # (.NET 4.0)?

More articles: