Is Pure Haskell 10x-100x faster than HMatrix for small matrices?

We spend most of our processor cycles on operations with small matrices, so I wondered if it was possible to optimize for this case. Consider the following code:

module Main where import Numeric.LinearAlgebra.HMatrix import Criterion.Main data Matrix2x2 = Matrix2x2 {-# UNPACK #-} !Double !Double !Double !Double mul2x2p :: Matrix2x2 -> Matrix2x2 -> Matrix2x2 mul2x2p (Matrix2x2 a1 b1 c1 d1) (Matrix2x2 a2 b2 c2 d2) = Matrix2x2 (a1*a2 + b1*c2) (a1*b2 + b1*d2) (c1*a2 + d1*c2) (c1*b2 + d1*d2) inv2x2 :: Matrix2x2 -> Matrix2x2 inv2x2 (Matrix2x2 abcd) = let detInv = a * d - b * c in Matrix2x2 (d / detInv) (-b / detInv) (-c / detInv) (a / detInv) add2x2 (Matrix2x2 a1 b1 c1 d1) (Matrix2x2 a2 b2 c2 d2) = Matrix2x2 (a1+a2) (b1+b2) (c1+c2) (d1+d2) hm1 = matrix 2 [1, 2, 3, 4] hm2 = matrix 2 [5, 6, 7, 8] pm1 = Matrix2x2 1 2 3 4 pm2 = Matrix2x2 5 6 7 8 main = defaultMain [ bgroup "matrix tests" [ bench "pure mult" $ whnf (mul2x2p pm1) pm2 , bench "hmatrix mult" $ whnf (hm1 <>) hm2 , bench "pure add" $ whnf (add2x2 pm1) pm2 , bench "hmatrix add" $ whnf (hm1 +) hm2 , bench "pure inv" $ whnf inv2x2 pm1 , bench "hmatrix inv" $ whnf inv hm1 ]] 

Results:

 benchmarking matrix tests/pure mult time 6.461 ns (6.368 ns .. 6.553 ns) 0.999 R² (0.998 R² .. 0.999 R²) mean 6.482 ns (6.394 ns .. 6.594 ns) std dev 345.1 ps (271.4 ps .. 477.3 ps) variance introduced by outliers: 77% (severely inflated) benchmarking matrix tests/hmatrix mult time 180.6 ns (178.2 ns .. 183.1 ns) 0.999 R² (0.998 R² .. 0.999 R²) mean 183.0 ns (180.6 ns .. 186.3 ns) std dev 9.363 ns (7.405 ns .. 12.73 ns) variance introduced by outliers: 71% (severely inflated) benchmarking matrix tests/pure add time 6.262 ns (6.223 ns .. 6.297 ns) 0.999 R² (0.999 R² .. 1.000 R²) mean 6.281 ns (6.220 ns .. 6.355 ns) std dev 235.0 ps (183.3 ps .. 321.0 ps) variance introduced by outliers: 62% (severely inflated) benchmarking matrix tests/hmatrix add time 116.4 ns (115.0 ns .. 117.9 ns) 0.999 R² (0.998 R² .. 0.999 R²) mean 116.3 ns (115.2 ns .. 117.7 ns) std dev 4.176 ns (3.447 ns .. 5.150 ns) variance introduced by outliers: 55% (severely inflated) benchmarking matrix tests/pure inv time 7.811 ns (7.718 ns .. 7.931 ns) 0.999 R² (0.998 R² .. 0.999 R²) mean 7.895 ns (7.808 ns .. 7.988 ns) std dev 296.4 ps (247.2 ps .. 358.3 ps) variance introduced by outliers: 62% (severely inflated) benchmarking matrix tests/hmatrix inv time 908.5 ns (901.3 ns .. 916.6 ns) 0.999 R² (0.998 R² .. 0.999 R²) mean 934.0 ns (917.6 ns .. 961.3 ns) std dev 73.92 ns (50.53 ns .. 108.6 ns) variance introduced by outliers: 84% (severely inflated) 

My questions:

1) Is the speed real or due to an artifact with a reference process?

2) If the speed up is real, is there an existing library that will handle, say, 1x1, 2x2, 3x3, 4x4 matrices as special cases?

3) If not, what is the best way to wrap the HMatrix so that we can take the fast track when the matrix is ​​small? ghc can decompress entries with only one constructor. Is there a way to automatically generate different versions of our code, etc.

Example-test.cabal:

 name: example-test version: 0.1.0.0 build-type: Simple cabal-version: >=1.10 executable example-test main-is: Main.hs build-depends: base >=4.7 && <4.8, criterion, hmatrix default-language: Haskell2010 ghc-options: -H12G -O3 -optc-O3 -fllvm -rtsopts -threaded -fexcess-precision -j6 +RTS -N6 -RTS -fno-ignore-asserts -fcontext-stack=150 -- -fforce-recomp 
+7
performance matrix haskell
source share
2 answers

HMatrix provides a functional interface for standard dense linear algebra (svd, eigenvalues, linear systems, etc.) based on the excellent and highly optimized BLAS / LAPACK Library.

There is overhead caused by FFI calls, memory allocation, error checking, code to avoid transferring any matrix, etc. This is not significant compared to the time required by typical matrix calculations of moderate to large sizes, but for very small arrays and simple operations, the overhead can be large.

There are no plans to redefine calculations for small arrays. it is not easy to do something better than BLAS / LAPACK, and I'm not sure that some speed gain for a particular application is more important than simplicity of code and Generality.

If your application requires a large number of simple operations on 2x2, or 3x3, then other libraries are probably more suitable for this task.

But I recommend running your program in profiling mode to find out the true bottlenecks. A test using persistent data may not be fully representative of the time spent in a “normal” program.

Even if you find that most of the time is spent on these calculations with small matrices, you may be able to rewrite your algorithm using a larger calculation equivalent matrix.

+3
source share

For small vectors and matrices, linear packaging is often used: https://hackage.haskell.org/package/linear

0
source share

All Articles