It looks like you want to transpose the matrix, which is slightly different from rotating it. In a rotation, rows also become columns, but either the rows or the columns end up in reverse order, depending on the direction of rotation. A transpose keeps the original order of the rows and columns.
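To make the difference concrete, here is a minimal sketch in C. It assumes row-major matrices stored in flat arrays; the function names `transpose` and `rotate_cw` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose: element (r, c) of the input lands at (c, r) of the output.
   in is rows x cols, out is cols x rows, both row-major. */
void transpose(const int *in, int *out, size_t rows, size_t cols)
{
    assert(in != out); /* this sketch is out-of-place only */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}

/* Rotate 90 degrees clockwise: element (r, c) lands at (c, rows - 1 - r),
   so the input rows become output columns in reverse order. */
void rotate_cw(const int *in, int *out, size_t rows, size_t cols)
{
    assert(in != out);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + (rows - 1 - r)] = in[r * cols + c];
}
```

For the 2x3 input {1,2,3 / 4,5,6}, the transpose gives rows {1,4}, {2,5}, {3,6}, while the clockwise rotation gives {4,1}, {5,2}, {6,3}: the same columns, but each one reversed.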
I think that choosing the right algorithm is much more important than whether you use assembly or plain C. Rotating by 90 degrees or transposing really comes down to just moving memory around. The biggest thing to consider is the effect of cache misses if you use a naive algorithm like this:
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++)
        out[x][y] = in[y][x];
}
This will cause a lot of cache misses because you jump around in memory so much. It is more efficient to use a blocked approach. Google for "cache efficient matrix transpose".
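A blocked (tiled) transpose could look roughly like the sketch below. The tile size `BLOCK` of 16 is only a guess and would need tuning for your cache; the function name `transpose_blocked` and the flat row-major layout are likewise assumptions for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16 /* assumed tile size; tune for the target cache */

/* Out-of-place transpose of a rows x cols row-major matrix,
   processed one BLOCK x BLOCK tile at a time so that the reads
   and the writes of each tile stay within a small memory region. */
void transpose_blocked(const float *in, float *out, size_t rows, size_t cols)
{
    assert(in != out);
    for (size_t rb = 0; rb < rows; rb += BLOCK) {
        /* Clamp the tile at the matrix edge. */
        size_t r_end = (rb + BLOCK < rows) ? rb + BLOCK : rows;
        for (size_t cb = 0; cb < cols; cb += BLOCK) {
            size_t c_end = (cb + BLOCK < cols) ? cb + BLOCK : cols;
            for (size_t r = rb; r < r_end; r++)
                for (size_t c = cb; c < c_end; c++)
                    out[c * rows + r] = in[r * cols + c];
        }
    }
}
```

The result is identical to the naive loop; only the order in which elements are visited changes, so each tile is finished while it is still likely to be in cache.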
One place where you can gain something is by using SSE instructions to move more than one piece of data at a time. They are available in both assembly and C. Also check this link. About halfway through, it has a section on computing a fast matrix transpose.
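As a rough illustration of the SSE idea from C, the sketch below transposes a single 4x4 block of floats with the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>`. It is x86-only, the function name `transpose4x4_sse` is made up, and it is not taken from the linked page; a full transpose would apply something like this tile by tile.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics, x86/x86-64 only */

/* Transpose one 4x4 row-major block of floats.
   Unaligned loads and stores are used, so no alignment is assumed. */
void transpose4x4_sse(const float *in, float *out)
{
    assert(in && out);

    /* Load the four rows, four floats per register. */
    __m128 row0 = _mm_loadu_ps(in + 0);
    __m128 row1 = _mm_loadu_ps(in + 4);
    __m128 row2 = _mm_loadu_ps(in + 8);
    __m128 row3 = _mm_loadu_ps(in + 12);

    /* Shuffle in registers; afterwards row0..row3 hold the columns. */
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);

    _mm_storeu_ps(out + 0, row0);
    _mm_storeu_ps(out + 4, row1);
    _mm_storeu_ps(out + 8, row2);
    _mm_storeu_ps(out + 12, row3);
}
```

Each load and store moves four floats at once, which is the "more than one piece of data at a time" part.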
Edit: I just saw your comment that you are doing this for an assembly class, so you can probably ignore most of what I said. I assumed you wanted to squeeze out the best performance since you were using assembly.
Jason b