Gather on Haswell is slow. I implemented a LUT lookup with 8-bit indices and 16-bit values (for GF16 multiply for par2) in several different ways, to find out which is fastest. On Haswell, the VPGATHERDD version took 1.7 times as long as the movd / pinsrw version. (Only a couple of VPUNPCK / shift instructions were needed beyond the gathers.) The code is here if anyone wants to run the benchmark.
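For readers who have not seen the movd / pinsrw approach, here is a minimal sketch of the idea. It is not the benchmarked code: the function name lut16_lookup and its signature are made up for illustration, and the SSE2 path is guarded so that the same file falls back to a plain scalar loop on other targets.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[i] = lut[idx[i]] for 8-bit indices and 16-bit values. */
void lut16_lookup(const uint16_t lut[256], const uint8_t *idx,
                  uint16_t *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Build each vector of eight results with one movd and seven pinsrw. */
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_cvtsi32_si128(lut[idx[i]]);      /* movd: element 0 */
        v = _mm_insert_epi16(v, lut[idx[i + 1]], 1);     /* pinsrw */
        v = _mm_insert_epi16(v, lut[idx[i + 2]], 2);
        v = _mm_insert_epi16(v, lut[idx[i + 3]], 3);
        v = _mm_insert_epi16(v, lut[idx[i + 4]], 4);
        v = _mm_insert_epi16(v, lut[idx[i + 5]], 5);
        v = _mm_insert_epi16(v, lut[idx[i + 6]], 6);
        v = _mm_insert_epi16(v, lut[idx[i + 7]], 7);
        _mm_storeu_si128((__m128i *)(out + i), v);
    }
#endif
    /* Scalar tail, and the whole job on targets without SSE2. */
    for (; i < n; i++)
        out[i] = lut[idx[i]];
}
```

In a real kernel the vector would feed further SIMD work instead of being stored straight back, which is where building it in a register pays off.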
As is common when an instruction is first introduced, Intel did not dedicate a huge amount of silicon to making it super fast. It is there mainly to get HW support out there, so that code can be written to use it. For ideal performance on all CPUs, you need to do what x264 did for pshufb: have a SLOW_SHUFFLE flag for CPUs like Core2, and factor that into the code that picks the best routine and sets the function pointers, rather than looking only at what the CPU supports.
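A rough sketch of that kind of dispatch, applied to gather instead of pshufb. The flag names, the lookup_fn type and both routines are hypothetical, and the two routines are plain-C stand-ins so the example runs anywhere. This is not x264's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits. A real project would fill these in from CPUID
   plus a list of CPU models where the instruction is known to be slow. */
enum {
    CPU_HAS_GATHER  = 1u << 0,  /* the instruction is supported */
    CPU_SLOW_GATHER = 1u << 1   /* supported, but slower than the fallback */
};

typedef void (*lookup_fn)(const uint16_t *lut, const uint8_t *idx,
                          uint16_t *out, size_t n);

/* Stand-in for the movd / pinsrw routine. */
void lookup_insert(const uint16_t *lut, const uint8_t *idx,
                   uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = lut[idx[i]];
}

/* Stand-in for the VPGATHERDD routine (plain C here, not real gather code). */
void lookup_gather(const uint16_t *lut, const uint8_t *idx,
                   uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = lut[idx[i]];
}

/* Choose by expected speed, not just by what the CPU supports. */
lookup_fn select_lookup(unsigned cpu_flags)
{
    int has  = (cpu_flags & CPU_HAS_GATHER) != 0;
    int slow = (cpu_flags & CPU_SLOW_GATHER) != 0;
    return (has && !slow) ? lookup_gather : lookup_insert;
}
```

With this layout, a Haswell-like CPU would be tagged CPU_HAS_GATHER | CPU_SLOW_GATHER and still get the insert-based routine, while a later CPU with a fast gather only needs the slow bit cleared.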
For projects less fanatical about having asm versions tuned for every CPU they can run on, introducing a no-speedup version of an instruction gets people using it sooner, so when the next design comes along and the instruction is fast, more code speeds up. Releasing a design like Haswell, where gather is actually a slowdown, is a bit risky. Maybe they wanted to see how people would use it? It does increase code density, which helps when the gather is not in a tight loop.
Broadwell is supposed to have a faster gather implementation, but I do not have access to one. The Intel manual that lists latency / throughput for instructions says that Broadwell's gather is about 1.6 times faster. Against the 1.7x deficit I measured on Haswell, that would still leave it slightly slower than a hand-crafted loop that shifts / unpacks the indices in GP regs and uses them for PINSRW into vectors.
If gather could take advantage of cases where several elements have the same index, or even indices that point into the same 32B fetch block, there could be some big speedups, depending on the input data.
Hopefully Skylake will improve even further. I thought I had read something saying that it would, but when I checked I did not find anything.
RE: sparse matrices: isn't there a format that duplicates the data, so you can do contiguous reads for either rows or columns? It is not something I have had to write code for, but I think I have seen it mentioned in some answers.
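One way to get that, assuming the memory for two copies is acceptable, is to keep the matrix in both CSR (row-major) and CSC (column-major) form. The sketch below is only an illustration of the idea, not a recommendation of a specific library format. The function names csr_to_csc and slice_sum are made up.

```c
#include <stddef.h>

/* Build a column-major (CSC) copy of a matrix given in row-major (CSR) form.
   col_ptr needs n_cols + 1 entries; row_idx and csc_val need nnz entries. */
void csr_to_csc(size_t n_rows, size_t n_cols,
                const size_t *row_ptr, const size_t *col_idx,
                const double *csr_val,
                size_t *col_ptr, size_t *row_idx, double *csc_val)
{
    size_t nnz = row_ptr[n_rows];

    for (size_t c = 0; c <= n_cols; c++)
        col_ptr[c] = 0;
    for (size_t k = 0; k < nnz; k++)            /* count entries per column */
        col_ptr[col_idx[k] + 1]++;
    for (size_t c = 0; c < n_cols; c++)         /* prefix sum: column starts */
        col_ptr[c + 1] += col_ptr[c];

    for (size_t r = 0; r < n_rows; r++) {
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
            size_t dst = col_ptr[col_idx[k]]++; /* next free slot in that column */
            row_idx[dst] = r;
            csc_val[dst] = csr_val[k];
        }
    }

    /* Each col_ptr[c] now holds the end of column c; shift back to starts. */
    for (size_t c = n_cols; c > 0; c--)
        col_ptr[c] = col_ptr[c - 1];
    col_ptr[0] = 0;
}

/* Sum one row (CSR arrays) or one column (CSC arrays). Either way the
   values read are contiguous in memory, so no gather is needed. */
double slice_sum(const size_t *ptr, const double *val, size_t major)
{
    double s = 0.0;
    for (size_t k = ptr[major]; k < ptr[major + 1]; k++)
        s += val[k];
    return s;
}
```

The cost is double the storage and having to update both copies whenever the matrix changes, so this only makes sense when the matrix is read far more often than it is modified.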