The best type of Scala collection for vectorized numerical computing

Question

The best type of Scala collection for vectorized numerical computing

Look for a suitable data type (for example, IndexedSeq[Double] ) that will be used when designing a compute library for specific domains. On this issue, I limit the ability to work with 1-dimensional Double arrays. The library will define the numerical functions that are usually applied to each element in the 1D array.

Questions:

Prefer immutable data types such as Vector or IndexedSeq
Want to minimize data conversion.
Reasonably effective in space and time
Friendly to other people using the library
Elegant and clean API

Should I use something above the collection hierarchy like Seq ?

Or is it better to simply define singleton functions and leave the mapping / iteration to the end user?

This seems less efficient (since some calculations can be done once for each set of calls), but at the same time a more flexible API, since it will work with any type of collection.

Any recommendations?

+4

api scala scala-collections numerical-computing

chriswynnyk Dec 05 '12 at 15:17

source share

2 answers

I use Vectors all the time when dealing with numerical values, as it provides very efficient random access as well as append / prepend.

Also note that the current default collection for immutable indexed sequences is Vector, so if you write some code like for (i <- 0 until n) yield {...} , it returns IndexedSeq[...] , but the execution type is Vector. Thus, it might be a good idea to always use Vectors, since some binary operators that accept two sequences as input can benefit from the fact that the two arguments have the same implementation type. (Actually, this is not so, but someone pointed out that vector concatenation can be log (N) times, unlike the current linear time, because the second parameter is simply considered as a general sequence.)

However, I believe that Seq[Double] should provide most of the functional interfaces that you need. And since comparison results from the range do not directly give Vector , I usually put Seq[Double] as the type of the argument as my input, so it has some commonality. I would expect performance to be optimized in the base implementation.

Hope this helps.

+3

Kane Dec 05 '12 at 15:31

source share

Rex kerr · Accepted Answer · 2012-12-05T16:24:17+0000

If your calculations should do anything remotely computationally intensive, use Array , either raw or wrapped in your own classes. You can provide a collection-compatible wrapper, but make it an explicit wrapper only for interoperability. Everything except Array is generic and therefore boxed and therefore relatively slow and cumbersome.

If you are not using Array , people will be forced to give up any things that you have and just use Array instead of having a performance impact. Maybe everything is in order; perhaps you want the calculations to be there for convenience, not efficiency. In this case, I suggest using the IndexedSeq interface for the interface, assuming you want people to know that indexing is not overly slow (for example, it is not a List ) and uses Vector under the hood. You will use about 4 times more memory than Array[Double] , and 3-10x slower for most low-force operations (such as multiplication).

For example, this:

 val u = v.map(1.0 / _) // v is Vector[Double]

about three times slower than this:

 val u = new Array[Double](v.length) var j = 0 while (j<u.length) { u(j) = 1.0/v(j) // v is Array[Double] j += 1 }

If you use the map method on Array , it is as slow as the Vector[Double] path; Array operations are common and therefore boxed. (And that's where most fines start.)

The best type of Scala collection for vectorized numerical computing

More articles: