This is a tutorial question that involves rewriting some C code so that it works best on a given processor architecture.
Considering: aims at one superscalar processor with 4 adders and 2 units of the multiplier.
Input structure (initialized elsewhere):
struct s { short a; unsigned v; short b; } input[100];
Here is a routine for working with this data. Obviously, correctness must be ensured, but the goal is to optimize the crap out of it.
int compute(int x, int *r, int *q, int *p) { int i; for(i = 0; i < 100; i++) { *r *= input[i].v + x; *p = input[i].v; *q += input[i].a + input[i].v + input[i].b; } return i; }
Thus, this method has 3 arithmetic instructions for updating the integers r, q, p.
Here's my attempt at commenting to explain what I think:
Okay, so what gave me the solution? If you look at the old code, each iteration of the loop I will be the minimum latency max ((1A + 1M), (3A)), where the first value is for calculating the new r, and the latency of 3 additions is for q.
In my solution, I turn around 2 and try to calculate the second new value of r and q. Assuming that the latency of the adders / factors is such that M = c * A (c is an integer> 1), the multiplication operations for r are definitely a step in speed limiting, so I focus on that. I tried to use the factors in parallel as much as I could.
In my code, two multipliers are first used in parallel to count the intermediate steps, then adding should combine them, then the final multiplication is used to get the last result. So, for 2 new r values (although I only keep / care about the latter), it takes me (1M // 1M // 1A) + 1A + 1M for full latency of 2M + 1M sequentially. Division by 2, my delay value for each cycle is 1M + 0.5A . I calculate the initial delay / value (for r) equal to 1A + 1M. Therefore, if my code is correct (I did it all manually, I have not tested it yet!), Then I have a small performance gain.
In addition, we hope that, without accessing or updating the pointers directly in the loop (mainly due to the r_temp and q_temp temporary variables), I save with some load / storage delay.
That was my hit. It's definitely interesting to see others come up with this better!