Matrix reduction can be somewhat easier to implement, since each column / row vector can be reduced independently of the others. You can let each thread handle one column / row (depending on whether the matrix is stored row-major or column-major) so that the memory reads coalesce. I doubt you can buy much more performance without resorting to the texture / constant cache, where locality can become important.
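To make the one-thread-per-column idea concrete, here is a minimal sketch, assuming a row-major `float` matrix and a plain sum as the reduction. The names `column_sums_kernel` and `column_sums` are made up for illustration, and the block size of 256 is an arbitrary choice, not a tuned value.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One thread per column of a row-major rows x cols matrix.
// For a fixed row r, neighbouring threads read neighbouring addresses
// (a[r * cols + col]), so the loads of a warp can coalesce.
__global__ void column_sums_kernel(const float* a, float* out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;  // guard the partially filled last block

    float sum = 0.0f;
    for (int r = 0; r < rows; ++r)
        sum += a[static_cast<size_t>(r) * cols + col];
    out[col] = sum;
}

// Hypothetical host wrapper: copies the matrix to the device, launches the
// kernel, and copies the per-column sums back. Returns false on a CUDA error.
bool column_sums(const std::vector<float>& a, int rows, int cols, std::vector<float>& out)
{
    assert(rows > 0 && cols > 0);
    assert(a.size() == static_cast<size_t>(rows) * cols);

    const size_t in_bytes = sizeof(float) * a.size();
    const size_t out_bytes = sizeof(float) * static_cast<size_t>(cols);
    out.assign(cols, 0.0f);

    float* d_a = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_a, in_bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, out_bytes) != cudaSuccess) { cudaFree(d_a); return false; }

    bool ok = cudaMemcpy(d_a, a.data(), in_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;  // assumed block size, not tuned
        const int blocks = (cols + threads - 1) / threads;
        column_sums_kernel<<<blocks, threads>>>(d_a, d_out, rows, cols);
        ok = cudaGetLastError() == cudaSuccess;
    }
    if (ok) {
        // cudaMemcpy waits for the kernel to finish before copying.
        ok = cudaMemcpy(out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_a);
    cudaFree(d_out);
    return ok;
}
```

Calling `column_sums(a, rows, cols, out)` with `a` holding the matrix row by row should leave one sum per column in `out`. If the matrix is column-major instead, the same pattern applies with the roles swapped: one thread per row, reading `a[c * rows + row]` in the loop, so neighbouring threads still touch neighbouring addresses.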