Built-in Type Performance: char vs. short vs. int vs. float vs. double

This may seem like a bit of a silly question, but after seeing Alexandre C's reply in another topic, I'm curious to know whether there is any performance difference between the built-in types:

char vs. short vs. int vs. float vs. double.

Usually we don't consider such a performance difference (if any) in our real-life projects, but I would like to know, for educational purposes. The general questions are:

  • Is there a performance difference between integral arithmetic and floating point arithmetic?

  • Which is faster? What is the reason it is faster? Please explain.

+53
c++ performance c built-in
Feb 21 '11
9 answers

Float vs. integer:

Historically, floating point could be much slower than integer arithmetic. On modern computers, this is no longer really the case (it is somewhat slower on some platforms, but unless you write perfectly tuned code and optimize for every cycle, the difference will be swamped by the other inefficiencies in your code).

On somewhat limited processors, like those in high-end cell phones, floating point may be somewhat slower than integer, but it is generally within an order of magnitude (or better), so long as hardware floating point is available. It is worth noting that this gap is closing fairly rapidly, as cell phones are called on to run more and more general computing workloads.

On very limited processors (cheap cell phones and your toaster), there is generally no floating-point hardware at all, so floating-point operations have to be emulated in software. This is slow: a couple of orders of magnitude slower than integer arithmetic.

As I said, though, people expect their phones and other devices to behave more and more like "real computers", and hardware designers are rapidly beefing up FPUs to meet that demand. Unless you are chasing every last cycle, or you are writing code for very limited CPUs that have little or no floating-point support, the performance difference does not matter to you.

Different integer types:

Typically, CPUs are fastest at operating on integers of their native word size (with some caveats about 64-bit systems). 32-bit operations are often faster than 8- or 16-bit operations on modern CPUs, but this varies quite a bit between architectures. Also, remember that you can't consider the speed of a CPU in isolation; it is part of a complex system. Even if operating on 16-bit numbers is twice as slow as operating on 32-bit numbers, you can fit twice as much data into the cache hierarchy when you represent it with 16-bit numbers. If that makes the difference between having all your data come from cache instead of taking frequent cache misses, the faster memory access will trump the slower operation of the CPU.
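
As a rough illustration of that trade-off, here is a minimal micro-benchmark sketch (the function name and sizes are made up for demonstration; real results depend heavily on compiler, flags, and hardware) that sums the same number of 16-bit and 32-bit elements. The 16-bit array occupies half the memory, and therefore half the cache lines:

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Sum a large array and report the elapsed time. The element width
    // determines how many values fit per 64-byte cache line (32 for
    // int16_t vs. 16 for int32_t), so the narrower type does half the
    // memory traffic for the same element count.
    template <typename T>
    void timed_sum(const std::vector<T>& v)
    {
        auto start = std::chrono::steady_clock::now();
        long long sum = std::accumulate(v.begin(), v.end(), 0LL);
        auto stop = std::chrono::steady_clock::now();
        std::cout << sizeof(T) * 8 << "-bit sum = " << sum << " in "
                  << std::chrono::duration_cast<std::chrono::microseconds>(
                         stop - start).count()
                  << "us\n";
    }

    int main()
    {
        const std::size_t n = 1 << 24;       // ~16M elements
        std::vector<int16_t> narrow(n, 1);   // 32 MB of data
        std::vector<int32_t> wide(n, 1);     // 64 MB of data
        timed_sum(narrow);
        timed_sum(wide);
    }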

Other notes:

Vectorization tips the balance further in favor of narrower types (float and 8- and 16-bit integers): you can do more operations in a vector of the same width. However, good vector code is hard to write, so it is not as though you get this benefit without a lot of careful work; a sketch of the effect follows.
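
For instance (a sketch; whether these loops actually get vectorized depends on the compiler, optimization flags, and target), a 128-bit SSE register holds four floats but only two doubles, so the float loop can process twice as many elements per vector instruction:

    #include <cstddef>

    // With 128-bit vector registers, the compiler can process four
    // elements per instruction here...
    void scale(float* a, float s, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            a[i] *= s;
    }

    // ...but only two per instruction here: same loop, half the
    // elements per vector operation.
    void scale(double* a, double s, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            a[i] *= s;
    }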

Why are there differences in performance?

In essence, there are only two factors that influence whether an operation on a CPU is fast: the circuit complexity of the operation, and user demand for that operation to be fast.

(Within reason) any operation can be made fast if the chip designers are willing to throw enough transistors at the problem. But transistors cost money (or rather, using lots of transistors makes your chip larger, which means you get fewer chips per wafer and lower yields, which costs money), so chip designers have to balance how much complexity to spend on which operations, and they do this based on (perceived) user demand. Roughly, you might think of breaking operations down into four categories:

                      high demand              low demand
    high complexity   FP add, multiply         division
    low complexity    integer add,             popcount, hcf
                      boolean ops, shifts

High demand, low complexity operations will be fast on nearly any CPU: they are the low-hanging fruit and confer maximum user benefit per transistor.

High demand, high complexity operations will be fast on expensive CPUs (like those used in computers), because users are willing to pay for them. You probably are not willing to pay an extra $3 for your toaster to have a fast FP multiply, however, so cheap CPUs will skimp on these instructions.

Low demand, high complexity operations will generally be slow on nearly all processors; there simply is not enough benefit to justify the cost.

Low demand, low complexity operations will be fast if someone bothers to think about them, and non-existent otherwise.

Further reading:

  • Agner Fog maintains a nice website with lots of discussion of low-level performance details (and has a very scientific data-collection methodology to back it up).
  • The Intel® 64 and IA-32 Architectures Optimization Reference Manual (the PDF download link is partway down the page) covers a lot of these issues as well, though it focuses on one specific family of architectures.
+101
Feb 21 '11 at 18:17

Absolutely.

First, of course, it depends entirely on the processor architecture.

However, integral and floating-point types are handled very differently, so the following is nearly always the case:

  • for simple operations, integral types are fast. For example, an integer addition often has a latency of only one cycle, and an integer multiplication typically takes about 2-4 cycles, IIRC.
  • floating point types used to be much slower. On today's processors, however, they have excellent throughput, and each floating point unit can usually retire one operation per cycle, giving the same (or similar) throughput as for integer operations. The latency is generally worse, though: a floating-point add often has a latency of around 4 cycles (vs. 1 for ints). The sketch after this list illustrates the latency/throughput distinction.
  • for some complex operations, the situation is different, or even reversed. For example, FP division may have lower latency than integer division, simply because the operation is complex to implement in both cases, but it is more commonly useful on FP values, so more effort (and transistors) may be spent optimizing that case.
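
Here is a sketch of that latency/throughput distinction (illustrative only; compile without FP reassociation such as -ffast-math, or the compiler may rewrite both loops): a dependent chain of adds runs at the speed of the FP-add latency, while independent accumulators can overlap in the pipeline and run closer to the throughput limit:

    #include <cstdio>

    int main()
    {
        const int N = 100000000;

        // Dependent chain: every add needs the previous result, so the
        // loop is limited by FP-add latency (several cycles per add).
        double chained = 0.0;
        for (int i = 0; i < N; ++i)
            chained += 1.0;

        // Four independent accumulators: the adds can overlap in the
        // pipeline, so the loop runs closer to FP-add throughput.
        double a = 0, b = 0, c = 0, d = 0;
        for (int i = 0; i < N; i += 4) {
            a += 1.0; b += 1.0; c += 1.0; d += 1.0;
        }

        std::printf("%.0f %.0f\n", chained, a + b + c + d);
    }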

On some processors, doubles may be significantly slower than floats. Some architectures have no dedicated hardware for doubles, so they are handled by passing two float-sized chunks through, giving you worse throughput and twice the latency. On others (the x86 FPU, for example), both types are converted to the same internal format (80-bit floating point, in the case of x86), so performance is identical. On yet others, both float and double have proper hardware support, but because float has fewer bits, it can be done a little faster, typically shaving a bit of latency off relative to double operations.

Disclaimer: all the timings and characteristics mentioned are just pulled from memory. I didn't look any of it up, so it may be wrong. ;)

For different integer types, the answer depends heavily on the CPU architecture. The x86 architecture, due to its long, convoluted history, has to support 8-, 16-, 32- (and today 64-) bit operations natively, and in general they are all equally fast (they use basically the same hardware and just zero out the upper bits as needed).

On other CPUs, however, data types smaller than int may be more costly to load/store (writing a byte to memory might have to be done by loading the entire 32-bit word it sits in, doing bit masking to update the single byte in a register, and then writing the whole word back; see the sketch below). Likewise, for data types larger than int, some CPUs may have to split the operation in two, loading/storing/computing the lower and upper halves separately.
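
A sketch of what such an emulated byte store might look like, expressed in C++ (a hypothetical function; a real compiler would emit the equivalent load/mask/store instruction sequence, and endianness and alignment details vary by architecture):

    #include <cstdint>

    // Store one byte on a machine that can only load/store whole,
    // aligned 32-bit words: read the containing word, merge the byte
    // in with bit masking, and write the whole word back.
    void store_byte(uint32_t* memory, uint32_t byte_addr, uint8_t value)
    {
        uint32_t* word = memory + byte_addr / 4;   // containing word
        uint32_t shift = (byte_addr % 4) * 8;      // byte position (little-endian)
        uint32_t mask  = uint32_t(0xFF) << shift;
        *word = (*word & ~mask) | (uint32_t(value) << shift);
    }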

But on x86, the answer is that it mostly doesn't matter. For historical reasons, the CPU is required to have pretty robust support for each and every data type. So the only difference you are likely to notice is that floating-point operations have higher latency (but similar throughput, so they are not slower per se, at least if you write your code correctly).

+7
Feb 21 '11 at 18:29

I don't think anyone has mentioned the integer promotion rules. In standard C/C++, no operation can be performed on a type smaller than int. If char or short happens to be smaller than int on the current platform, they are implicitly promoted to int (which is a major source of bugs). The compiler is required to do this implicit promotion; there is no way around it without violating the standard.

The integer promotions mean that no operation (addition, bitwise, logical, etc.) in the language can occur on an integer type smaller than int. Thus, operations on char/short/int are generally equally fast, since the former are promoted to the latter.

And on top of the integer promotions, there are the "usual arithmetic conversions", meaning that C strives to make both operands the same type, converting one of them to the larger of the two, should they be different.
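
A minimal sketch of both rules in action (the values are arbitrary, chosen to make the promotion visible):

    #include <cstdint>
    #include <iostream>

    int main()
    {
        uint8_t a = 200, b = 100;

        // Integer promotion: both operands are promoted to int before
        // the addition, so the intermediate result is 300; it is only
        // truncated (to 300 % 256 == 44) if stored back into a uint8_t.
        int     wide   = a + b;   // 300
        uint8_t narrow = a + b;   // 44

        // Usual arithmetic conversions: the int operand is converted
        // to double and the addition happens in floating point.
        double mixed = wide + 0.5;  // 300.5

        std::cout << wide << ' ' << int(narrow) << ' ' << mixed << '\n';
    }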

However, the CPU can perform various load/store operations at the 8-, 16-, 32-bit level, etc. On 8- and 16-bit architectures, this often means that 8- and 16-bit types are faster overall, despite the integer promotions. On a 32-bit CPU it might mean that the smaller types are slower, because it wants everything nicely aligned in 32-bit chunks. 32-bit compilers typically optimize for speed and allocate smaller integer types in larger space than specified.

Generally, though, the smaller integer types of course take up less space than the larger ones, so if you aim to optimize for RAM size, they are preferable.

+6
Feb 21 '11

Is there a performance difference between integral arithmetic and floating point arithmetic?

Yes. However, it is very platform- and CPU-specific. Different platforms can do different arithmetic operations at different speeds.

That being said, the reply in question was more specific: pow() is a general-purpose routine that works on double values. By feeding it integer values, it still does all of the work that would be required to handle non-integer exponents. Using direct multiplication bypasses a lot of that complexity, which is where the speed gain comes in. This is really not an issue of different types, but rather of bypassing the large amount of complex code required to make pow work with any exponent.
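
A small sketch of the point (illustrative; the exact internals of pow() vary by C library): for a known small integer exponent, a direct multiply does one operation, while pow() goes through its general-purpose path for arbitrary exponents:

    #include <cmath>
    #include <iostream>

    int main()
    {
        double x = 3.7;

        // General-purpose: must be able to handle any exponent, so it
        // does far more work internally than a single multiply.
        double general = std::pow(x, 2.0);

        // Special case: one multiplication. Compilers can often make
        // this transformation themselves when the exponent is a small
        // integer constant.
        double direct = x * x;

        std::cout << general << ' ' << direct << '\n';
    }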

+2
Feb 21 '11 at 18:05

Depends on the composition of the processor and platform.

Platforms that use a floating-point coprocessor may be slower than integral arithmetic, because the values have to be transferred to and from the coprocessor.

If the floating-point processing is within the core of the processor, the difference in execution time may be negligible.

If floating point calculations are emulated by software, then integral arithmetic will be faster.

If in doubt, profile.

Make the program correct and robust before optimizing.

+1
Feb 21 '11 at 18:19

No, not really. This of course depends on the CPU and the compiler, but the performance difference is typically negligible, if there even is one.

0
Feb 21 '11 at 18:09

There is definitely a difference between floating point and integer arithmetic. Depending on the hardware and the micro-instructions of the CPU, you get different performance and/or precision. Good google terms for precise descriptions (I don't know exactly either):

FPU x87 MMX SSE

Regarding integer sizes, it is best to use the platform/architecture word size (or double that), which comes down to int32_t on x86 and int64_t on x86_64. Some processors have intrinsic instructions that handle several such values at once (like SSE (floating point) and MMX), which will speed up parallel additions or multiplications.

0
Feb 21 '11 at 18:12

As a rule, integer math is faster than floating-point math. This is because integer math involves simpler computations. However, in most operations we are talking about less than a dozen clocks. Not millis, micros, nanos, or ticks; clocks. The ones that happen 2-3 billion times per second in modern cores. Also, since the 486 a lot of cores have had dedicated floating-point processing units, or FPUs, which are hardwired to perform floating-point arithmetic efficiently, and often in parallel with the CPU.

As a result, although floating-point calculations are technically slower, they are still so fast that any attempt to time the difference would have more error inherent in the timing mechanism and thread scheduling than it actually takes to perform the calculation. Use ints when you can, but understand when you can't, and don't worry too much about relative calculation speed.

0
Feb 21 '11 at 18:16

The answer above is great, and I copied a small block of it across to a duplicate thread (since that is where I ended up first):

There are "char" and "small int" slower than "int",

I would like to offer the following code, which profiles allocating, initializing, and doing some arithmetic on the various integer sizes:

    #include <iostream>
    #include <cstdint>
    #include <cstdio>
    #include <windows.h>

    using std::cout;
    using std::endl;

    // Timing scaffolding based on the Windows high-resolution counter.
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
    LARGE_INTEGER Frequency;

    // Report how long the current activity took, in microseconds.
    inline void showElapsed(const char activity[])
    {
        QueryPerformanceCounter(&EndingTime);
        ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
        ElapsedMicroseconds.QuadPart *= 1000000;
        ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
        cout << activity << " took: " << ElapsedMicroseconds.QuadPart << "us" << endl;
    }

    int main()
    {
        cout << "Hallo!" << endl << endl;

        QueryPerformanceFrequency(&Frequency);

        const int32_t count = 1100100;
        char activity[200];

        // ---- 8-bit integers --------------------------------------------
        sprintf_s(activity, "Initialise & Set %d 8 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        int8_t *data8 = new int8_t[count];
        for (int i = 0; i < count; i++) { data8[i] = i; }
        showElapsed(activity);

        sprintf_s(activity, "Add 5 to %d 8 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        for (int i = 0; i < count; i++) { data8[i] = i + 5; }
        showElapsed(activity);
        cout << endl;

        // ---- 16-bit integers -------------------------------------------
        sprintf_s(activity, "Initialise & Set %d 16 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        int16_t *data16 = new int16_t[count];
        for (int i = 0; i < count; i++) { data16[i] = i; }
        showElapsed(activity);

        sprintf_s(activity, "Add 5 to %d 16 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        for (int i = 0; i < count; i++) { data16[i] = i + 5; }
        showElapsed(activity);
        cout << endl;

        // ---- 32-bit integers -------------------------------------------
        sprintf_s(activity, "Initialise & Set %d 32 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        int32_t *data32 = new int32_t[count];
        for (int i = 0; i < count; i++) { data32[i] = i; }
        showElapsed(activity);

        sprintf_s(activity, "Add 5 to %d 32 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        for (int i = 0; i < count; i++) { data32[i] = i + 5; }
        showElapsed(activity);
        cout << endl;

        // ---- 64-bit integers -------------------------------------------
        sprintf_s(activity, "Initialise & Set %d 64 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        int64_t *data64 = new int64_t[count];
        for (int i = 0; i < count; i++) { data64[i] = i; }
        showElapsed(activity);

        sprintf_s(activity, "Add 5 to %d 64 bit integers", count);
        QueryPerformanceCounter(&StartingTime);
        for (int i = 0; i < count; i++) { data64[i] = i + 5; }
        showElapsed(activity);
        cout << endl;

        getchar();
    }

My results in MSVC on i7 4790k:

Initialize and set 1100100 8-bit integers: 444us
Add 5 to 1100100 8-bit integers: 358us

Initialize and set 1100100 16-bit integers: 666us
Add 5 to 1100100 16-bit integers: 359us

Initialize and set 1100100 32-bit integers: 870us
Add 5 to 1100100 32-bit integers: 276us

Initialize and set 1100100 64-bit integers: 2201us
Add 5 to 1100100 64-bit integers: 659us

0
May 3 '16 at 1:02


