Sparse matrix-vector multiplication with AVX2 gather

I am trying to use the new AVX2 GATHER instructions to speed up sparse matrix-vector multiplication. The matrix is in CSR (Yale) format, with a row pointer array that indexes into the array of column indices, which in turn hold the column of each stored value. The C code for such a mat-vec mul looks like this:

    for (int row = 0; row < n_rows - 1; row++) {
        double rowsum = 0;
        for (int col = row_ptr[row]; col < row_ptr[row + 1]; col++) {
            rowsum += values[col] * x[col_indices[col]];
        }
        result[row] = rowsum;
    }
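
To make the layout concrete, here is a small CSR example (my own illustration, not from the original post):

    /* Example: 3x4 matrix        CSR representation:
     *   10  0 20  0              values      = {10, 20, 30, 40, 50}
     *    0 30  0 40              col_indices = { 0,  2,  1,  3,  2}
     *    0  0 50  0              row_ptr     = { 0,  2,  4,  5}
     * Row r's nonzeros are values[row_ptr[r]] .. values[row_ptr[r+1] - 1].
     */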

Now my goal is to speed this up with AVX2 intrinsics. The following code works with recent Intel or GCC compilers and is based on https://blog.fox-toolkit.org/?p=174 . I removed the remainder handling here because all my row lengths happen to be multiples of 4 doubles (col_length % 4 == 0) anyway (lucky me). I have the remainder-handling code if anyone is interested, but the point is that this version is actually slightly slower. I checked the disassembly: for the version above only scalar FP instructions are generated, and for my AVX2 code all the AVX2 instructions look as expected. Even with small matrices that fit into the cache, the AVX2 version is no better. I am puzzled here...

    double* value_base = &values[0];
    double* x_base = &x[0];
    int* index_base = &col_indices[0];

    for (int row = 0; row < n_rows - 1; row++) {
        int col_length = row_ptr[row + 1] - row_ptr[row];

        __m256d rowsum = _mm256_set1_pd(0.);
        for (int col4 = 0; col4 < col_length; col4 += 4) {
            // Load 4 indices for the x vector
            __m128i idxreg = _mm_load_si128((const __m128i*)index_base);
            // Gather 4 doubles from x indexed by idxreg (AVX2)
            __m256d x_ = _mm256_i32gather_pd(x_base, idxreg, 8);
            // Load 4 doubles contiguously from memory (value array)
            __m256d v_ = _mm256_load_pd(value_base);
            // FMA: rowsum += x_ * v_
            rowsum = _mm256_fmadd_pd(x_, v_, rowsum);

            index_base += 4;
            value_base += 4;
        }
        // Horizontal sum of the 4 partial sums
        __m256d s = _mm256_hadd_pd(rowsum, rowsum);
        result[row] = ((double*)&s)[0] + ((double*)&s)[2];

        // Alternative (not faster): swap the two 128-bit halves, add them,
        // then do a horizontal add within the low half
        //__m256d hsum = _mm256_add_pd(rowsum, _mm256_permute2f128_pd(rowsum, rowsum, 0x1));
        //_mm_store_sd(&result[row], _mm_hadd_pd(_mm256_castpd256_pd128(hsum), _mm256_castpd256_pd128(hsum)));
    }
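
For reference, the remainder could be handled with a scalar tail like the following sketch (a minimal illustration of the idea, not the exact code mentioned above):

    // Hypothetical remainder handling (not the author's version).
    // The vector loop above would then run while col4 < (col_length & ~3),
    // leaving index_base/value_base at the first leftover element.
    double tail = 0.0;
    for (int r = 0; r < (col_length & 3); r++) {
        tail += value_base[r] * x_base[index_base[r]];
    }
    index_base += col_length & 3;
    value_base += col_length & 3;
    result[row] += tail;  // add to the horizontally summed vector part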

Any suggestions are welcome.

Thanks a lot, Chris

1 answer

Gather on Haswell is slow. I implemented an 8-bit-index LUT lookup of 16-bit values (for GF16 multiply, for par2) in several different ways to find out which is fastest. On Haswell, the VPGATHERDD version took about 1.7x as long as the movd/pinsrw version. (Only a couple of VPUNPCK/shift instructions were needed beyond the gathers.) The code is here, if anyone wants to run the benchmark.

As is usual when an instruction is first introduced, they don't dedicate a huge amount of silicon to making it super-fast. It's there to get HW support out in the field, so code can be written to use it. For ideal performance on all CPUs, you need to do what x264 did for pshufb: have a SLOW_SHUFFLE flag for CPUs like Core2, and factor that into your function-pointer setup that picks the best routine, rather than just asking which instructions a CPU supports.
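
A minimal sketch of that kind of dispatch in C, applied to the question's kernel (all the helper and kernel names here are hypothetical, just to show the pattern):

    // x264-style dispatch: select the kernel once at startup. Don't just
    // test instruction support; also flag CPUs where the instruction is slow.
    typedef void (*spmv_fn)(const double* values, const int* col_indices,
                            const int* row_ptr, const double* x,
                            double* result, int n_rows);

    spmv_fn pick_spmv(void) {
        if (cpu_has_avx2() && !cpu_has_slow_gather())  // e.g. flag Haswell as slow
            return spmv_avx2_gather;
        return spmv_scalar;
    }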

For projects less fanatical than x264, which don't have asm versions tuned for every CPU they can run on, introducing a no-speedup version of an instruction gets people using it sooner, so when the next design comes along with a fast gather, more code speeds up. Releasing a design like Haswell, where gather is actually a slowdown, is a bit dicey. Maybe they wanted to see how people would use it? It does increase code density, which helps when the gather isn't in a tight loop.

Broadwell is supposed to have a faster gather implementation, but I don't have access to one. The Intel manual that lists latency/throughput for instructions says Broadwell's gather is about 1.6x faster, so it would still be slightly slower than a hand-crafted loop that shifts/unpacks indices in GP registers and uses them for PINSRW into vectors.
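
Translated to the double-precision gathers in the question, a "manual gather" with intrinsics might look like this sketch (untested, my own illustration of the technique; whether it beats the hardware gather on a given CPU needs benchmarking):

    // Manual gather: extract the 4 indices into GP registers and do
    // 4 scalar loads, instead of _mm256_i32gather_pd.
    __m128i idxreg = _mm_load_si128((const __m128i*)index_base);
    __m256d x_ = _mm256_set_pd(
        x_base[_mm_extract_epi32(idxreg, 3)],   // highest lane
        x_base[_mm_extract_epi32(idxreg, 2)],
        x_base[_mm_extract_epi32(idxreg, 1)],
        x_base[_mm_extract_epi32(idxreg, 0)]);  // lowest lane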

If gather could take advantage of cases where multiple elements have the same index, or even an index that points into the same 32B fetch block, there could be some nice speedups depending on the input data.

Hopefully Skylake will improve it even further. I thought I had read something saying it would, but I didn't find anything when I checked.

RE: sparse matrices: isn't there a format that duplicates the data, so you can do contiguous reads for rows or columns? It's not something I've had to write code for, but I think I've seen it mentioned in some answers.
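
One format along those lines (my addition; the answer doesn't name one) is ELLPACK: every row is padded to the same length and the data is stored column-major, so reads sweep contiguously across rows, at the cost of padded storage:

    // ELLPACK-style SpMV (hypothetical layout): n_rows x max_row_len arrays,
    // column-major, short rows padded with zero values and a valid dummy index.
    // values_ell[k * n_rows + row] is the k-th stored entry of 'row', so the
    // inner loop over 'row' reads values_ell and col_ell contiguously.
    for (int row = 0; row < n_rows; row++)
        result[row] = 0.0;
    for (int k = 0; k < max_row_len; k++)
        for (int row = 0; row < n_rows; row++)
            result[row] += values_ell[k * n_rows + row] * x[col_ell[k * n_rows + row]];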
