Lookup table vs if-else

Today I came across some code that uses a lookup table instead of an if-else to clamp the sum of two uint8 values. The table maps i to i for i = {0...255} and to 255 for i = {256...511}. I wondered how big the gain could be, so I tried to measure it with gprof,

 g++ -std=c++0x -pg perfLookup.cpp -O2 -o perfLookup && ./perfLookup && gprof perfLookup |less 

with the code below. Without the -O2 flag, gprof says lookup() takes 45% and ifelse() takes 48% of the execution time. With -O2, it is 56% for lookup() and 43% for ifelse(). But is this test valid? Perhaps most of the code is being optimized away, since dst is never read?

 #include <iostream>
 #include <cstdint>
 #include <cstdlib>
 #include <vector>

 void lookup(std::vector<uint8_t> src, int repeat) {
     uint8_t lookup[511];
     for (int i = 0; i < 256; i++) { lookup[i] = i; }
     for (int i = 256; i < 512; i++) { lookup[i] = 255; }
     std::vector<uint8_t> dst(src.size());
     for (int i = 0; i < repeat; i++) {
         for (int i = 0; i < src.size(); i++) {
             dst[i] = lookup[src[i]];
         }
     }
 }

 void ifelse(std::vector<uint8_t> src, int repeat) {
     std::vector<uint8_t> dst(src.size());
     for (int i = 0; i < repeat; i++) {
         for (int i = 0; i < src.size(); i++) {
             dst[i] = (src[i] > 255) ? 255 : src[i];
         }
     }
 }

 int main() {
     int n = 10000;
     std::vector<uint8_t> src(n);
     for (int i = 0; i < src.size(); i++) {
         src[i] = rand() % 510;
     }
     lookup(src, 10000);
     ifelse(src, 10000);
 }

Updated code:

 #include <iostream>
 #include <cstdint>
 #include <cstdlib>
 #include <cstring>
 #include <vector>
 #include <algorithm>

 // g++ -std=c++0x -pg perfLookup.cpp -O2 -o perfLookup && ./perfLookup && gprof perfLookup | less

 std::vector<uint16_t> lookup(std::vector<uint16_t> src, int repeat) {
     uint16_t lookup[511];
     for (int i = 0; i < 256; i++) { lookup[i] = i; }
     for (int i = 256; i < 511; i++) { lookup[i] = 255; }
     std::vector<uint16_t> dst(src.size());
     for (int i = 0; i < repeat; i++) {
         for (int k = 0; k < src.size(); k++) {
             dst[k] = lookup[src[k]];
         }
     }
     return dst;
 }

 std::vector<uint16_t> ifelse(std::vector<uint16_t> src, int repeat) {
     std::vector<uint16_t> dst(src.size());
     for (int i = 0; i < repeat; i++) {
         for (int k = 0; k < src.size(); k++) {
             dst[k] = (src[k] > 255) ? 255 : src[k];
         }
     }
     return dst;
 }

 std::vector<uint16_t> copyv(std::vector<uint16_t> src, int repeat) {
     std::vector<uint16_t> dst(src.size());
     for (int i = 0; i < repeat; i++) {
         dst = src;
         for (int k = 0; k < src.size(); k++) {
             if (dst[k] > 255) { dst[k] = 255; }
         }
     }
     return dst;
 }

 std::vector<uint16_t> copyC(std::vector<uint16_t> src, int repeat) {
     uint16_t* dst = (uint16_t*) malloc(sizeof(uint16_t) * src.size()); // alloc array for dst
     for (int i = 0; i < repeat; i++) {
         std::memcpy(dst, &src[0], sizeof(uint16_t) * src.size()); // copy src into the array
         for (int k = 0; k < src.size(); k++) {
             if ((dst[k] & 0xFF00) != 0) dst[k] = 0x00FF;
         }
     }
     free(dst);
     return std::vector<uint16_t>();
 }

 int main() {
     int n = 10000;
     std::vector<uint16_t> src(n);
     for (int i = 0; i < src.size(); i++) {
         src[i] = rand() % 510;
     }
     std::vector<uint16_t> dst;
     dst = lookup(src, 10000);
     dst = ifelse(src, 10000);
     dst = copyv(src, 10000);
 }
+7

7 answers

Well, since src is declared as std::vector<uint8_t> , src[i] can never be greater than 255 , which is the highest possible value for an 8-bit unsigned integer.

Therefore, I assume the compiler optimizes the check away. What remains is just the loop scaffolding, so the benchmark result is meaningless.

If the test were not meaningless (that is, if it clamped at, say, 64 instead of 255), the result of the "optimization" would presumably be highly machine-dependent. Branch prediction can (depending on the input) do a good job of reducing the cost of the branch. The lookup table, on the other hand, requires (depending on the input) random memory access and pollutes the cache...

+7

In addition to what Alexander has already said:

Lookup tables can significantly improve performance. However, this is offset by the time taken to build the table. Usually you would measure that separately.

Another thing to keep in mind is that the lookup table takes up space in the cache, and can therefore cause cache misses if it is large. If there are enough cache misses, the if approach will be faster than the lookup table.

Finally, gprof is good at identifying bottlenecks. But I would not use it for benchmarks. Use a timing function instead. gprof uses sampling, which only approximates the time actually spent, so it is less accurate here.

+7

The lookup array handling is broken. This line:

 uint8_t lookup[511]; 

is off by one; you want lookup[512]; , since you apparently expect to index up to 511 (which refers to the 512th element). Of course, as Alexander noted, it is all moot anyway, since uint8_t means you cannot have an index above 255.

As it stands, this code:

 for (int i = 256; i < 512; i++) { lookup[i] = 255; } 

will index out of bounds and write 255 into a more or less randomly chosen memory location.

+3

You are also measuring the initialization time of the lookup table, which may not be what you want. If the table is initialized only once in production code but used many times, then you should not measure the initialization.

+2

Both approaches seem rather odd. Do you really need this level of optimization? If so, I would question the use of vectors and consider plain C arrays instead!

The ifelse approach seems the more obvious one. I doubt it is noticeably slower or faster than the lookup table unless you call it billions of times.

Personally, I would just clone the src vector, then iterate over it and clamp the values (using 250 here, because 255 doesn't make sense for uint8_t, as noted):

 std::vector<uint8_t> dst(src);
 for (std::vector<uint8_t>::size_type i = 0; i != dst.size(); i++) {
     if (dst[i] > 250) dst[i] = 250;
 }

Depending on how the cloning is actually performed and optimized by the compiler (it might do a block memory copy, for example), this could actually be slightly faster. It is certainly neater and easier to understand.

+2

Sometimes the compiler is smart enough to optimize away trivial profiling tests. If so, you need some trick so that the compiler does not optimize them away. Using a much higher repeat count can also help you get better results, or tell you whether something has been optimized away.

Lookup tables can be faster than chains of if/else-ifs, but in this case, with only a single comparison, I would not expect much difference. If you had 10, 100, 1000... comparisons, the lookup table should generally win.

+1

A possible quick-and-dirty C-style solution (off the top of my head, unverified and uncompiled, so it probably contains errors):

 std::vector<uint16_t> copyC(std::vector<uint16_t> src, int repeat) {
     uint16_t* dst = (uint16_t*) malloc(sizeof(uint16_t) * src.size()); // alloc array for dst
     for (int i = 0; i < repeat; i++) {
         memcpy(dst, &src[0], sizeof(uint16_t) * src.size()); // copy src into the array
         for (int k = 0; k < src.size(); k++) {
             if ((dst[k] & 0xFF00) != 0) dst[k] = 0x00FF;
         }
     }
     std::vector<uint16_t> result(dst, dst + src.size());
     free(dst);
     return result;
 }

I would be interested to see how this compares. (Again, this may depend on the memcpy implementation, as it is only faster if large block copies are more efficient than byte-by-byte copies.)

Depending on your chip's characteristics (for example, 8-bit or 16-bit register sizes), single-byte access may be faster than double-byte access. If so, the code above could also be rewritten to treat dst as a uint8_t array. It would then only inspect every second byte, and whenever that byte is non-zero, set it to 0 and the next byte* to 0xFF.

(* or the previous byte, depending on endianness)

+1
