Optimize and rewrite the following C code

This is a tutorial question that involves rewriting some C code so that it runs as fast as possible on a given processor architecture.

Target: a superscalar processor with 4 adder units and 2 multiplier units.

Input structure (initialized elsewhere):

struct s { short a; unsigned v; short b; } input[100]; 

Here is a routine for working with this data. Obviously, correctness must be ensured, but the goal is to optimize the crap out of it.

 int compute(int x, int *r, int *q, int *p) {
     int i;
     for (i = 0; i < 100; i++) {
         *r *= input[i].v + x;
         *p = input[i].v;
         *q += input[i].a + input[i].v + input[i].b;
     }
     return i;
 }

So each loop iteration performs three updates, to the integers pointed to by r, q, and p.

Here's my attempt at commenting to explain what I think:

 // Use temp variables so we don't keep using loads and stores for mem accesses;
 // hopefully the temps will just be kept in the register file
 int r_temp = *r;
 int q_temp = *q;
 for (i = 0; i < 99; i = i + 2) {
     // going to try partially unrolling the loop by 2
     struct s data1 = input[i];
     struct s data2 = input[i + 1];
     int a1 = data1.a; int a2 = data2.a;
     int b1 = data1.b; int b2 = data2.b;
     int v1 = data1.v; int v2 = data2.v;
     // I will use brackets to make clear the order of operations I was planning
     // with respect to the functional (adder, multiplier) units available.
     // This calculates the q value two iterations ahead,
     // from q += a + v + b, i.e. q(new) = q(old) + v1 + a1 + b1 + a2 + b2 + v2.
     // In the first step I try to use a max of 3 adders in parallel,
     // saving one to start the next computation.
     q_temp = ((v1 + q_temp) + (a1 + b1)) + ((a2 + b2) + v2);
     // This calculates the r value two iterations ahead,
     // from r *= v + x, i.e. r(new) = r(old) * (v1 + x) * (v2 + x).
     // Both multipliers run in parallel (distributing r_temp * (v1 + x) into
     // r_temp*v1 + r_temp*x) while an adder computes v2 + x;
     // then one add combines the products and one final multiply finishes.
     r_temp = ((r_temp * v1) + (r_temp * x)) * (v2 + x);
 }
 // Because the last iteration is i = 98 and I unrolled by exactly 2,
 // there are no leftover elements to clean up after the loop.
 *p = input[99].v; // Why this was inside the loop I don't understand; only the last store matters
 *r = r_temp;
 *q = q_temp;

Okay, so how did I arrive at this solution? If you look at the old code, each iteration of the loop has a minimum latency of max((1A + 1M), (3A)), where the first term is for calculating the new r, and the latency of 3 additions is for q.

In my solution, I unroll by 2 and compute two iterations' worth of r and q at once. Assuming the latency of a multiply relative to an add is M = c * A (c an integer > 1), the multiplications for r are definitely the speed-limiting step, so I focus on those. I tried to keep the multipliers busy in parallel as much as I could.

In my code, the two multipliers are first used in parallel to compute the intermediate products, then an addition combines them, then a final multiplication produces the result. So for 2 new r values (although I only keep/care about the final one), the full latency is (1M // 1M // 1A) + 1A + 1M = 2M + 1A sequentially. Dividing by 2, my latency per original iteration is 1M + 0.5A, versus 1A + 1M for the original code. Therefore, if my code is correct (I did it all by hand and have not tested it yet!), I get a small performance gain.

In addition, by not dereferencing the pointers inside the loop (thanks to the r_temp and q_temp temporaries), I hopefully save some load/store latency.


That was my attempt. It would definitely be interesting to see others improve on it!

1 answer

Yes, you can reorder the two shorts so they sit next to each other. Rearrange your structure like this

 struct s { unsigned v; short a; short b; } input[100]; 

and you can improve the alignment of the fields on your architecture, which may let more of these structures fit on the same memory page, which in turn may mean fewer page faults.

All this is speculative, so it is very important to profile.

On the right architecture, the rearrangement gives you better data structure alignment, which leads to a higher density of data in memory: fewer bytes are wasted on the padding inserted to satisfy the type-alignment boundaries imposed by common memory architectures.

