Take advantage of spatial locality
In C, matrices are stored in row-major order. So if you access the element a[i][j], the neighboring element a[i][j+1] will most likely already be in the cache, and no main-memory access is needed. The cache is much faster than main memory, so the access pattern matters.
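To see what "row-major" means concretely, a minimal sketch: stepping to the next column moves one int forward in memory, while stepping to the next row jumps a whole row's worth of bytes (the exact byte counts assume a 4-byte int, which is typical but not guaranteed by the standard).

#include <stdio.h>
#include <stddef.h>

int main(void) {
    int a[4][4];
    /* Row-major layout: a[i][j] and a[i][j+1] are adjacent in memory,
       while a[i][j] and a[i+1][j] are a full row apart. */
    ptrdiff_t step_in_row = (char *)&a[0][1] - (char *)&a[0][0];
    ptrdiff_t step_in_col = (char *)&a[1][0] - (char *)&a[0][0];
    printf("next column: %td bytes, next row: %td bytes\n",
           step_in_row, step_in_col);
    return 0;
}

With a 4-byte int this prints "next column: 4 bytes, next row: 16 bytes": consecutive a[i][j], a[i][j+1] accesses stay within one cache line far more often than column-wise strides.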
Of course, there are more factors to consider, such as read vs. write access, the write policy (write-through / write-back, write-allocate / no-write-allocate), multi-level caches, and so on. But that seems excessive for this question.
Get to know a profiling tool like cachegrind and see for yourself.
For example, consider a dummy program that works on a 4 MB matrix. Check how the miss counts differ between the two access patterns.
column access
$ cat col_major.c
#include <stdio.h>

int main() {
    size_t i, j;
    const size_t dim = 1024;
    int matrix[dim][dim];

    for (i = 0; i < dim; i++) {
        for (j = 0; j < dim; j++) {
            matrix[j][i] = i * j;
        }
    }
    return 0;
}
$ valgrind --tool=cachegrind ./col_major
==3228== D   refs:       10,548,665  (9,482,149 rd + 1,066,516 wr)
==3228== D1  misses:      1,049,704  (      968 rd + 1,048,736 wr)
==3228== L2d misses:      1,049,623  (      893 rd + 1,048,730 wr)
==3228== D1  miss rate:         9.9% (      0.0%   +      98.3%  )
==3228== L2d miss rate:         9.9% (      0.0%   +      98.3%  )
row access
$ cat row_major.c
#include <stdio.h>

int main() {
    size_t i, j;
    const size_t dim = 1024;
    int matrix[dim][dim];

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            matrix[i][j] = i * j;
    return 0;
}
$ valgrind --tool=cachegrind ./row_major
==3524== D   refs:       10,548,665  (9,482,149 rd + 1,066,516 wr)
==3524== D1  misses:         66,664  (      968 rd +    65,696 wr)
==3524== L2d misses:         66,583  (      893 rd +    65,690 wr)
==3524== D1  miss rate:         0.6% (      0.0%   +       6.1%  )
==3524== L2d miss rate:         0.6% (      0.0%   +       6.1%  )
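The miss counts also translate into wall-clock time. A minimal sketch that sums the same matrix both ways and times each pass with standard clock() (the matrix is static to keep it off the stack; actual timings depend on your CPU and cache sizes, but the row-wise pass is usually noticeably faster):

#include <stdio.h>
#include <time.h>

#define DIM 1024

int main(void) {
    static int matrix[DIM][DIM];
    size_t i, j;
    long long row_sum = 0, col_sum = 0;

    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            matrix[i][j] = (int)(i * j);

    clock_t t0 = clock();
    for (i = 0; i < DIM; i++)          /* row-wise: sequential addresses */
        for (j = 0; j < DIM; j++)
            row_sum += matrix[i][j];
    clock_t t1 = clock();
    for (j = 0; j < DIM; j++)          /* column-wise: strided addresses */
        for (i = 0; i < DIM; i++)
            col_sum += matrix[i][j];
    clock_t t2 = clock();

    printf("row-wise: %.3f ms, column-wise: %.3f ms\n",
           (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
           (t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
    /* Both loops compute the same total; only the access order differs. */
    printf("sums equal: %d\n", row_sum == col_sum);
    return 0;
}

Both passes touch exactly the same data and produce the same sum; any difference in time comes purely from the access pattern.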