Optimizing code that uses string functions

Are there any recommendations worth following before calling the standard string-handling functions in C?

For example, how much of an optimization does comparing the first character of two strings (to check whether they are equal) before calling strcmp provide?

What kinds of overhead can be expected from the string functions in C, and what techniques help avoid them?

Thanks!

+4
source share
11 answers

String functions are there for you to use. If you need to compare two strings, call strcmp . Don't worry about tiny performance differences you can only imagine anyway. Profile your code first.

First, the answer to any performance question of the form "How much will this optimization..." is "Profile!". Nobody can predict how fast something will run. C stdlib implementations have been improving steadily over the years, and any optimization trick you try to bolt on could defeat them.

For example, I think GCC will use vectorized comparisons, so you are actually comparing 4-8 elements at a time. Did you expect that? Adding a single-character pre-comparison could actually slow it down.

That said, a typical implementation simply compares character by character, so you would just be moving one comparison out of the loop, with no net gain. (And, as noted, possibly a net loss!)

So the guideline:

Program now, optimize later.

And optimize the rational way: with evidence and testing, not guessing.

+19
source

Worrying about strcmp() is micro-optimization most of the time — it isn't worth the effort. There are other habits that can be far more costly, for example:

 for (i = 0; i < strlen(s); i++) {
     ...do stuff with s[i]...
 }

The optimizer may not be able to see that it can (and should) hoist the strlen call out of the loop — and if you modify s inside the loop, hoisting is impossible. Calling strlen on every iteration is expensive.

Once upon a time, I used a function like strstr() on buffers of 20 KB or so, and the program worked fine on the HP box where I developed it. I moved it to a Sun machine (bear in mind this was 20 years ago — the problems were fixed long ago), and the program didn't even crawl — it was practically stationary. The problem turned out to be a loop in strstr() that called strlen() more or less as shown above. Used on short buffers, that was no big deal; used on 20 KB buffers, searching for a pattern that appeared every couple of kilobytes, the poor machine ground to a halt. Profiling the application exposed the problem. I dropped in a substitute strstr() that avoided the mistake, and everything went back to normal.

Another common source of unnecessary slowness is overuse of strcat() . For instance:

 strcpy(dst, name);
 strcat(dst, "/subdir/");
 strcat(dst, file);
 strcat(dst, ".sfx");

This is not usually a source of performance problems, though in the absence of evidence to the contrary it can be a source of buffer overflows: you must know that the combined lengths are small enough to fit in dst . But if you know the length of each piece (as you must, to be sure they fit), you can instead write:

 strcpy(dst, name);          dst += len_name;
 strcpy(dst, "/subdir/");    dst += sizeof("/subdir/") - 1;
 strcpy(dst, file);          dst += len_file;
 strcpy(dst, ".sfx");

This avoids repeatedly scanning a string of known length to find its end before appending new material. With short strings it hardly matters; with long strings and many concatenations, it can make a real difference. As before, the key is measurement.

Kernighan and Pike tell an instructive story about improving the performance of a spam filter in their book "The Practice of Programming". It started out using strstr() in a loop; that turned out to perform badly because the circumstances strstr() is tuned for did not match the needs of their spam-filtering system. But again, they only did this work because a performance problem had been demonstrated, and they did just enough for the program to no longer be the bottleneck in the system — and no more.

+7
source

It won't be an optimization, because that is exactly what strcmp() does anyway.

In general, since the str...() functions are so heavily used, you can count on the libraries that implement them being as efficient as possible. Only after you have written code that uses these functions, found that you have a problem, and tracked it down WITH A PROFILER should you consider writing replacements.

+4
source

The glibc implementation of strlen makes an instructive study:

  • There are no checks for NULL arguments (not even an assert).
  • Bytes are examined one at a time until a word-aligned address is reached; after that, all reads proceed in 4- or 8-byte chunks.
  • There is no vectorization and nothing architecture-specific (i.e. no #ifdefs).
  • The hard part of reading 4 or 8 bytes at a time is figuring out which byte was zero when the test fires.
+1
source

You may be interested in this article by Joel Spolsky about low-level functions (particularly in C) and how they are optimized.

+1
source
 <sarcasm> 

Unlike other answers about your statement:

For example, how much of an optimization does comparing the first character of two strings (to check whether they are equal) before calling strcmp provide?

I think this is a great idea. So we have to do this:

 int compstr(const char *a, const char *b)
 {
     if (*a == *b)
         return strcmp(a + 1, b + 1);
     else
         return *(unsigned char *)a < *(unsigned char *)b ? -1 : 1;
 }

But why stop there? We have to check one more character, which gives us the best optimization:

 int compstr(const char *a, const char *b)
 {
     size_t i;
     for (i = 0; *a == *b && i < 2; ++a, ++b, ++i)
         if (*a == 0)
             return 0;
     if (i == 2)
         return strcmp(a, b);
     return *(unsigned char *)a < *(unsigned char *)b ? -1 : 1;
 }

Of course, we can do much better. Let's make the number of characters to compare a parameter:

 /* Really fast implementation to compare strings,
    takes the optimization parameter n_comp */
 int compstr(const char *a, const char *b, size_t n_comp)
 {
     size_t i;
     for (i = 0; *a == *b && i < n_comp; ++a, ++b, ++i)
         if (*a == 0)
             return 0;
     if (i == n_comp)
         return strcmp(a, b);
     return *(unsigned char *)a < *(unsigned char *)b ? -1 : 1;
 }

But if we are going to solve this problem by comparing the first few characters, why not just do it ourselves? So here is the final, fully optimized version:

 /* Final, highly optimized function to compare strings */
 int strcmp(const char *a, const char *b)
 {
     for (; *a == *b; ++a, ++b)
         if (*a == 0)
             return 0;
     return *(unsigned char *)a < *(unsigned char *)b ? -1 : 1;
 }

After writing our version, we are delighted to discover that it is essentially identical to the version in P. J. Plauger's The Standard C Library (and of course it forgoes any architecture-specific optimizations that a good library would use)!

 </sarcasm> 

In other words, as others have said, there is no point in premature optimization.

Note: I did not check whether my code fragments are actually correct. Yet another reason not to reinvent the wheel: you have to do all the hard work yourself!

+1
source

Jonathan is absolutely right, especially about the strlen(s) example, which I have found (in other people's code :-) with a single stackshot.

You are talking about micro-optimization, which is the right thing to worry about only AFTER you have otherwise tuned the daylights out of the code. Comparing the first characters before calling strcmp saves some function-call overhead, but my rule of thumb is to bother only if those calls to strcmp cost more than 10%.

0
source

The standard C runtime string functions are highly optimized. You are unlikely to improve on them, except by exploiting knowledge of your problem domain that the C runtime cannot have.

Your idea of pre-testing the first character has some merit — IF most of your comparisons are between unlike strings (i.e. most comparisons fail). In that case you avoid the overhead of the function call.

But you make comparing strings that do match even more expensive!

strcmp is most expensive when given strings that match. Therefore, if your algorithm ever passes the same pointer as both strcmp parameters, you could optimize by comparing the pointers first. Only you can know whether your code actually does this often enough to be worth it.

The only other general tip I have is: don't use strcat . Sure, it's quick and easy, but each call gets more expensive the longer the string grows. Better to keep track of the end of the string and strcpy to that point.

0
source

The standard C library is good because it is highly optimized, and some compilers treat CRT functions as intrinsics, saving the overhead of the call instruction. But if you still want more speed, there are options. If you follow the link below, you can download a program that benchmarks several strcmp routines written by expert assembly programmers.

http://www.masm32.com/board/index.php?topic=2508.0

I would particularly look at the routine written by the forum member Lingo. He writes the fastest assembly code I have ever seen.

If you don't know how to call an assembly-language function from your C program, just ask another Stack Overflow question and plenty of people, including me, will be able to help you.

Here are the results they got comparing the string abcdefg with abcz :

 lstrcmp - original (from Microsoft) : 19314 clocks; Return value: 1
 lstrcmp - kunt0r                    :   957 clocks; Return value: 24832
 lstrcmp - Lingo                     :   501 clocks; Return value: 1

As you can see from the clock counts (lower is better), the other functions are much faster.

0
source

Are there any recommendations that can be followed before calling standard string functions in C?

Yes. Don't worry about which library functions are faster or slower than others, or how they might be sped up (or slowed down!) microscopically. Instead, find the functions that let you express your intent most clearly.

After all, if you have evidence that your application is too slow, you can profile it and see whether any string functions figure in your problem. And if they do, the improvement is more likely to come from a sublinear algorithm such as Boyer-Moore than from tweaking strcmp .


Michael A. Jackson's two rules of optimization:

  • Don't do it.

  • (For experts only) Don't do it yet.

0
source

I believe the important point is that every string-related store or database can have characteristics of its own that could be exploited to build optimal string functions. That said, this post has some simple tricks for certain cases — pick whatever suits your needs: http://www.codemaestro.com/articles/21

0
source
