String Benchmarks in C # - Refactoring for Speed / Performance

Question

String Benchmarks in C # - Refactoring for Speed / Performance

I used to do small functions at the time, trying to find ways to refactor them (I recently read Martin Fowler's book Refactoring: Improving the Design of Existing Code ). I discovered the following MakeNiceString() function, updating another part of the code base next to it, and it looked like a good candidate to talk to. Be that as it may, there is no real reason to replace it, but it is small enough and does something small, so it’s easy to follow and still get a “good” experience.

 private static string MakeNiceString(string str) { char[] ca = str.ToCharArray(); string result = null; int i = 0; result += System.Convert.ToString(ca[0]); for (i = 1; i <= ca.Length - 1; i++) { if (!(char.IsLower(ca[i]))) { result += " "; } result += System.Convert.ToString(ca[i]); } return result; } static string SplitCamelCase(string str) { string[] temp = Regex.Split(str, @"(?<!^)(?=[AZ])"); string result = String.Join(" ", temp); return result; }

The first MakeNiceString() function is the function that I found in some code that I updated at work. The purpose of this function is to translate ThisIsAString to an It string . It was used in half a dozen places in the code, and it is pretty minor in the whole scheme of things.

I built the second function only as an academic exercise to see if using regular expressions will last longer or not.

Well, here are the results:

With 10 iterations:

  MakeNiceString took 2649 ticks
 SplitCamelCase took 2502 ticks

However, it varies greatly along the long course:

With 10,000 iterations:

  MakeNiceString took 121625 ticks
 SplitCamelCase took 443001 ticks

Refactoring `MakeNiceString()`

The refactoring process of MakeNiceString() began with a simple removal of the transformations that took place. This gave the following results:

  MakeNiceString took 124716 ticks
 ImprovedMakeNiceString took 118486

Here is the code after Refactor # 1:

 private static string ImprovedMakeNiceString(string str) { //Removed Convert.ToString() char[] ca = str.ToCharArray(); string result = null; int i = 0; result += ca[0]; for (i = 1; i <= ca.Length - 1; i++) { if (!(char.IsLower(ca[i]))) { result += " "; } result += ca[i]; } return result; }

Reactor # 2 - Use `StringBuilder`

My second task was to use StringBuilder instead of String . Because String immutable, unnecessary copies are created during the loop. The test for using this is below, as well as the code:

 static string RefactoredMakeNiceString(string str) { char[] ca = str.ToCharArray(); StringBuilder sb = new StringBuilder((str.Length * 5 / 4)); int i = 0; sb.Append(ca[0]); for (i = 1; i <= ca.Length - 1; i++) { if (!(char.IsLower(ca[i]))) { sb.Append(" "); } sb.Append(ca[i]); } return sb.ToString(); }

This leads to the following benchmark:

  MakeNiceString Took: 124497 Ticks // Original
 SplitCamelCase Took: 464459 Ticks // Regex
 ImprovedMakeNiceString Took: 117369 Ticks // Remove Conversion
 RefactoredMakeNiceString Took: 38542 Ticks // Using StringBuilder

Changing the for loop in the foreach resulted in the following test result:

 static string RefactoredForEachMakeNiceString(string str) { char[] ca = str.ToCharArray(); StringBuilder sb1 = new StringBuilder((str.Length * 5 / 4)); sb1.Append(ca[0]); foreach (char c in ca) { if (!(char.IsLower(c))) { sb1.Append(" "); } sb1.Append(c); } return sb1.ToString(); }

  RefactoredForEachMakeNiceString Took: 45163 Ticks

As you can see, maintenance, the foreach cycle will be the easiest to maintain and have the “cleanest” look. It is slightly slower than the for loop, but infinitely easier to follow.

Alternative refactoring: use compiled `Regex`

I moved Regex right before the start of the loop, hoping that since it only compiles it once, it will run faster. What I found out (and I'm sure I have a mistake somewhere) is that this is not the way it should:

 static void runTest5() { Regex rg = new Regex(@"(?<!^)(?=[AZ])", RegexOptions.Compiled); for (int i = 0; i < 10000; i++) { CompiledRegex(rg, myString); } } static string CompiledRegex(Regex regex, string str) { string result = null; Regex rg1 = regex; string[] temp = rg1.Split(str); result = String.Join(" ", temp); return result; }

The final test results:

  MakeNiceString Took 139363 Ticks
 SplitCamelCase Took 489174 Ticks
 ImprovedMakeNiceString Took 115478 Ticks
 RefactoredMakeNiceString Took 38819 Ticks
 RefactoredForEachMakeNiceString Took 44700 Ticks
 CompiledRegex Took 227021 Ticks

Or, if you prefer milliseconds:

  MakeNiceString Took 38 ms
 SplitCamelCase Took 123 ms
 ImprovedMakeNiceString Took 33 ms
 RefactoredMakeNiceString Took 11 ms
 RefactoredForEachMakeNiceString Took 12 ms
 CompiledRegex Took 63 ms

Thus, interest income:

  MakeNiceString 38 ms Baseline
 SplitCamelCase 123 ms 223% slower
 ImprovedMakeNiceString 33 ms 13.15% faster
 RefactoredMakeNiceString 11 ms 71.05% faster
 RefactoredForEachMakeNiceString 12 ms 68.42% faster
 CompiledRegex 63 ms 65.79% slower

(Please check my math)

In the end, I'm going to replace what is there with RefactoredForEachMakeNiceString() , and while I'm in it, I will rename it something useful, like SplitStringOnUpperCase .

Performance test:

For comparison, I just call a new Stopwatch for each method call:

  string myString = "ThisIsAUpperCaseString"; Stopwatch sw = new Stopwatch(); sw.Start(); runTest(); sw.Stop(); static void runTest() { for (int i = 0; i < 10000; i++) { MakeNiceString(myString); } }

Questions

What makes these functions be so "long" as well
How can I improve this feature a) be more maintainable or b) work faster?
How would I do memory tests on them to find out which one has less memory?

Thank you for your responses. I added all the suggestions made by @Jon Skeet, and would like to receive feedback on the updated questions that I asked as a result.

NB . This question is intended to learn how to refactor string processing functions in C #. I copied / pasted the first as is code. I know well that you can remove System.Convert.ToString() in the first method, and I did just that. If anyone is aware of any consequences of removing System.Convert.ToString() , this would also be helpful to know.

+6

string c # refactoring

George stocker Jan 23 '09 at 14:33

source share

10 answers

You might want to try creating an instance of the Regex object as a member of the class and use the RegexOptions.Compiled parameter when creating it.

You are currently using the static Split Regex member and are not caching the regex. Using a member object instead of a static method should improve your performance even further (in the end).

+3

Andrew rollings Jan 23 '09 at 2:43

source share

Use StringBuilder instead of concatenation. Each concatenation creates a new instance of the string and discards the old.

+2

Kent boogaart Jan 23 '09 at 2:43

source share

Try refactoring so that the regular expression that you use to split the string in the second method is stored in the static method and built using the RegexOptions.Compiled parameter. More on this here: http://msdn.microsoft.com/en-us/library/8zbs0h2f.aspx .

I did not test the theory, but I would suggest that having to re-create a regular expression each time would be time-consuming.

+2

Joel martinez Jan 23 '09 at 2:43

source share

This is in response to ctacke's comment on John Skeet's answer (he is not long for comment)

I always thought that foreach was quite known to be slower than for since it should use an iterator.

Actually, no, in this case foreach will be faster. The index constraint is limited (that is, I check that it is in the range three times in the loop: once in for () and once for two ca [i] s), which makes the for loop slower than foreach.

If the C # compiler detects a specific syntax:

  for(i = 0; i < ca.Length; i++)

then it will perform special optimization by removing internal border checks, making the for () loop faster. However, since here we should consider ca [0] as a special case (to exclude the leading space at the output), we cannot initiate this optimization.

+2

James curran Jan 23 '09 at 15:45

source share

I know what they say about RegEx, use it to solve the problem, and now you have two problems, but I remain a fan. Just for fun, here is the RegEx version. RegEx, with little initiation, is easy to read, less code, and makes it easy to insert extra delimiters (just like a comma).

  s1 = MakeNiceString( "LookOut,Momma,There'sAWhiteBoatComingUpTheRiver" ) ); private string MakeNiceString( string input ) { StringBuilder sb = new StringBuilder( input ); int Incrementer = 0; MatchCollection mc; const string SPACE = " "; mc = Regex.Matches( input, "[AZ|,]" ); foreach ( Match m in mc ) { if ( m.Index > 0 ) { sb.Insert( m.Index + Incrementer, SPACE ); Incrementer++; } } return sb.ToString().TrimEnd(); }

+2

rp. Jan 23 '09 at 16:29

source share

My first refactoring was to change the method name to something more descriptive. MakeNiceString imo is not a name that would tell me what this method does.

What about PascalCaseToSentence? I do not like this name, but it is better than MakeNiceString.

+2

steve_c Jan 23 '09 at 16:41

source share

Here is a slightly better version. I accepted suggestions from previous posters, but also added to the line builder in a blocky way. This may allow the line builder to copy 4 bytes at a time, depending on the size of the word. I also removed the line selection and just replaced it with str.length.

  static string RefactoredMakeNiceString2(string str) { char[] ca = str.ToCharArray(); StringBuilder sb = new StringBuilder(str.Length); int start = 0; for (int i = 0; i < ca.Length; i++) { if (char.IsUpper(ca[i]) && i != 0) { sb.Append(ca, start, i - start); sb.Append(' '); start = i; } } sb.Append(ca, start, ca.Length - start); return sb.ToString(); }

+1

avid Nov 10 '09 at 12:22

source share

The version modes of your solution are not equivalent to the source code. Perhaps the broader code context avoids the areas in which they differ. The source code will add a space for anything that is not a lowercase character. For example, "This\tIsATest" will become "This \t Is A Test" in the original, but "This\t Is A Test" with Regex versions.

 (?<!^)(?=[^az])

This is the pattern you want for a closer match, but even then it still ignores i18n issues. The following template should take care of this:

 (?<!^)(?=\P{Ll})

0

toad Jan 23 '09 at 19:52

source share

In C # (.Net, really) When you add a line, several things happen in the background. Now I forget the specifics, but it's something like:

string A = B + C;

A + = D; A + = E;

// ... rinse-repeat for 10,000 iterations

For each line above, .NET will: 1) Allocate a new memory for A. 2) Copy line B to the new memory. 3) Expand memory to hold C. 4) Add line C to A.

The longer the string A, the longer it takes. Add to this, the more times you do this, the more A gets, exponentially more than what is required.

However, when using StringBuilder you do not allocate new memory, so you will skip this problem.

If you say:

 StringBuilder A = new StringBuilder(); A.Append(B); A.Append(C); // .. rinse/repeat for 10,000 times... string sA = A.ToString();

code>

StringBuilder (Edit: fixed description) has a string in memory. There is no need to redistribute the entire string for each added substring. When you issue ToString (), the string is already added in the appropriate format.

One shot instead of a cycle, which takes a longer period.

I hope this helps answer the question of why it took much less time.

-2

Jerry Jan 23 '09 at 17:08

source share

Jon skeet · Accepted Answer · 2009-01-23T14:43:45+0000

1) Use a StringBuilder, preferably given with a reasonable initial capacity (e.g. string length * 5/4, to allow extra space of four characters).

2) Try using a foreach loop instead of a for loop - it might be easier

3) You do not need to convert the string to a char array first - foreach will already work on the string or use the index.

4) Do not do additional string conversions everywhere - by calling Convert.ToString (char), and then adding this line is pointless; no single character string needed

5) For the second option, just create a regular expression once, outside the method. Try using RegexOptions.Compiled.

EDIT: Ok, complete test results. I tried a few more things and also executed the code with more iterations to get a more accurate result. This only works on the Eee PC, so no doubt it will work faster on "real" PCs, but I suspect that wide results are suitable. First code:

 using System; using System.Collections; using System.Collections.Generic; using System.Diagnostics; using System.Linq; using System.Reflection; using System.Text; using System.Text.RegularExpressions; class Benchmark { const string TestData = "ThisIsAUpperCaseString"; const string ValidResult = "This Is A Upper Case String"; const int Iterations = 1000000; static void Main(string[] args) { Test(BenchmarkOverhead); Test(MakeNiceString); Test(ImprovedMakeNiceString); Test(RefactoredMakeNiceString); Test(MakeNiceStringWithStringIndexer); Test(MakeNiceStringWithForeach); Test(MakeNiceStringWithForeachAndLinqSkip); Test(MakeNiceStringWithForeachAndCustomSkip); Test(SplitCamelCase); Test(SplitCamelCaseCachedRegex); Test(SplitCamelCaseCompiledRegex); } static void Test(Func<string,string> function) { Console.Write("{0}... ", function.Method.Name); Stopwatch sw = Stopwatch.StartNew(); for (int i=0; i < Iterations; i++) { string result = function(TestData); if (result.Length != ValidResult.Length) { throw new Exception("Bad result: " + result); } } sw.Stop(); Console.WriteLine(" {0}ms", sw.ElapsedMilliseconds); GC.Collect(); } private static string BenchmarkOverhead(string str) { return ValidResult; } private static string MakeNiceString(string str) { char[] ca = str.ToCharArray(); string result = null; int i = 0; result += System.Convert.ToString(ca[0]); for (i = 1; i <= ca.Length - 1; i++) { if (!(char.IsLower(ca[i]))) { result += " "; } result += System.Convert.ToString(ca[i]); } return result; } private static string ImprovedMakeNiceString(string str) { //Removed Convert.ToString() char[] ca = str.ToCharArray(); string result = null; int i = 0; result += ca[0]; for (i = 1; i <= ca.Length - 1; i++) { if (!(char.IsLower(ca[i]))) { result += " "; } result += ca[i]; } return result; } private static string RefactoredMakeNiceString(string str) { char[] ca = str.ToCharArray(); StringBuilder sb = new StringBuilder((str.Length * 5 / 4)); int i = 0; sb.Append(ca[0]); for (i = 1; i <= ca.Length - 1; i++) { if (!(char.IsLower(ca[i]))) { sb.Append(" "); } sb.Append(ca[i]); } return sb.ToString(); } private static string MakeNiceStringWithStringIndexer(string str) { StringBuilder sb = new StringBuilder((str.Length * 5 / 4)); sb.Append(str[0]); for (int i = 1; i < str.Length; i++) { char c = str[i]; if (!(char.IsLower(c))) { sb.Append(" "); } sb.Append(c); } return sb.ToString(); } private static string MakeNiceStringWithForeach(string str) { StringBuilder sb = new StringBuilder(str.Length * 5 / 4); bool first = true; foreach (char c in str) { if (!first && char.IsUpper(c)) { sb.Append(" "); } sb.Append(c); first = false; } return sb.ToString(); } private static string MakeNiceStringWithForeachAndLinqSkip(string str) { StringBuilder sb = new StringBuilder(str.Length * 5 / 4); sb.Append(str[0]); foreach (char c in str.Skip(1)) { if (char.IsUpper(c)) { sb.Append(" "); } sb.Append(c); } return sb.ToString(); } private static string MakeNiceStringWithForeachAndCustomSkip(string str) { StringBuilder sb = new StringBuilder(str.Length * 5 / 4); sb.Append(str[0]); foreach (char c in new SkipEnumerable<char>(str, 1)) { if (char.IsUpper(c)) { sb.Append(" "); } sb.Append(c); } return sb.ToString(); } private static string SplitCamelCase(string str) { string[] temp = Regex.Split(str, @"(?<!^)(?=[AZ])"); string result = String.Join(" ", temp); return result; } private static readonly Regex CachedRegex = new Regex("(?<!^)(?=[AZ])"); private static string SplitCamelCaseCachedRegex(string str) { string[] temp = CachedRegex.Split(str); string result = String.Join(" ", temp); return result; } private static readonly Regex CompiledRegex = new Regex("(?<!^)(?=[AZ])", RegexOptions.Compiled); private static string SplitCamelCaseCompiledRegex(string str) { string[] temp = CompiledRegex.Split(str); string result = String.Join(" ", temp); return result; } private class SkipEnumerable<T> : IEnumerable<T> { private readonly IEnumerable<T> original; private readonly int skip; public SkipEnumerable(IEnumerable<T> original, int skip) { this.original = original; this.skip = skip; } public IEnumerator<T> GetEnumerator() { IEnumerator<T> ret = original.GetEnumerator(); for (int i=0; i < skip; i++) { ret.MoveNext(); } return ret; } IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); } } }

Now the results:

 BenchmarkOverhead... 22ms MakeNiceString... 10062ms ImprovedMakeNiceString... 12367ms RefactoredMakeNiceString... 3489ms MakeNiceStringWithStringIndexer... 3115ms MakeNiceStringWithForeach... 3292ms MakeNiceStringWithForeachAndLinqSkip... 5702ms MakeNiceStringWithForeachAndCustomSkip... 4490ms SplitCamelCase... 68267ms SplitCamelCaseCachedRegex... 52529ms SplitCamelCaseCompiledRegex... 26806ms

As you can see, the version of the row index index is the winner - it is also pretty simple code.

Hope this helps ... and don't forget that there are other options that I haven't thought about!

String Benchmarks in C # - Refactoring for Speed ​​/ Performance

Refactoring MakeNiceString()

Reactor # 2 - Use StringBuilder

Alternative refactoring: use compiled Regex

The final test results:

Performance test:

Questions

More articles:

String Benchmarks in C # - Refactoring for Speed / Performance

Refactoring `MakeNiceString()`

Reactor # 2 - Use `StringBuilder`

Alternative refactoring: use compiled `Regex`