Sparse matrices / arrays in Java

I am working on a project, written in Java, that requires me to build a very large 2-D sparse array. Very sparse, if that matters. In any case, the most crucial aspect for this application is time efficiency (assume plenty of memory, though not nearly so unlimited that I could use a standard two-dimensional array; the key range is in the billions in both dimensions).

Of the kajillion cells in the array, only several hundred thousand will contain an object. I need to be able to modify cell contents very quickly.

Anyway: does anyone know a particularly good library for this purpose? It would have to be Berkeley, LGPL or a similar license (no GPL, as the product can't be completely open-sourced). Or, if there's just a very simple way to make a homebrew sparse array object, that would be fine too.

I'm looking at MTJ, but I haven't heard any opinions on its quality.

+58
java algorithm sparse-matrix sparse-array
Dec 23 '08 at 21:55
11 answers

Sparse arrays built with hashmaps are very inefficient for frequently read data. The most efficient implementations use a Trie, which gives access to a single vector in which the segments are distributed.

A Trie can determine whether an element is present in the table with only TWO read-only array indexings, which either yield the effective position where the element is stored or tell you that it is absent from the backing store.

It can also provide a default position in the backing store for the default value of your sparse array, so that you don't need ANY test on the returned index: the Trie guarantees that every possible source index maps at least to that default position in the backing store (where you'll frequently store a zero, an empty string, or a null object).
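
To make that read path concrete, here is a minimal sketch of the double indexing; the names and sizes are purely illustrative (a 24-bit key space split into 256-entry subranges), not taken from any particular library, and the write path (which grows the values vector and updates the index) is omitted here but shown in the full class further down:

class IntTrieSketch {
    private static final int SUBRANGE_BITS = 8;
    private static final int SUBRANGE = 1 << SUBRANGE_BITS;   // 256 values per subrange
    private static final int MASK = SUBRANGE - 1;

    // positions[] is the Trie index: one entry per subrange of the key space.
    // values[] is the single backing vector; slots 0..SUBRANGE-1 hold the shared
    // "default" subrange, so an absent key still resolves to a real slot holding 0.0.
    private final int[] positions = new int[1 << 16];  // all zero: every subrange maps to the default block
    private final double[] values = new double[SUBRANGE];

    double get(final int key) {
        // Exactly two array indexings: no hashing, no collision handling,
        // and no test for presence is needed.
        return values[positions[key >>> SUBRANGE_BITS] + (key & MASK)];
    }
}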

There are implementations that support fast-updatable Tries, with an optional compact() operation to optimize the storage size at the end of multiple operations. Tries are much faster than hashmaps because they don't need a complex hashing function, and they don't need to handle collisions for reads (with a hashmap you have collisions BOTH when reading and when writing; that requires a loop to skip to the next candidate position, plus a test on each candidate to compare the effective source index...).

In addition, Java HashMaps can only index Objects, and creating an Integer object for each hashed source index (the object creation is needed for every read, not just for writes) is costly in terms of memory operations, because it puts pressure on the garbage collector.
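
As a small illustration of that boxing cost, using only the standard JDK (the class and variable names here are just for the example):

import java.util.HashMap;
import java.util.Map;

class BoxingCost {
    static double read(final Map<Long, Double> cells, final long key) {
        // Autoboxing: the primitive key is wrapped in a Long object on every call,
        // and the stored Double is unboxed on the way out - extra allocations and
        // GC pressure that a primitive-array Trie avoids entirely.
        return cells.getOrDefault(key, 0.0);
    }

    public static void main(String[] args) {
        final Map<Long, Double> cells = new HashMap<>();
        cells.put(123_456_789L, 42.0);              // key and value are both boxed here
        System.out.println(read(cells, 123_456_789L));
    }
}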

I really hoped that the JRE would include an IntegerTrieMap&lt;Object&gt; as the default implementation for the slow HashMap&lt;Integer, Object&gt;, or a LongTrieMap&lt;Object&gt; as the default implementation for the even slower HashMap&lt;Long, Object&gt;... But that is still not the case.

You may wonder what a Trie is?

It is just a small array of integers (in a smaller range than the full range of coordinates of your matrix) that maps the coordinates to an integer position in a vector.

For example, suppose you want a 1024*1024 matrix containing only a few non-zero values. Instead of storing that matrix in an array containing 1024*1024 elements (more than 1 million), you may want to split it into subranges of size 16*16, and you'll need only 64*64 such subranges.

In that case, the Trie index will contain only 64*64 integers (4096), and there will be at least 16*16 data elements (containing the default zeroes, or the most common subrange found in your sparse matrix).

And the vector used to store the values will contain only one copy for subranges that are equal to each other (most of them being full of zeroes, they will be represented by the same subrange).

Therefore, instead of using syntax like matrix[i][j] you should use syntax like:

 trie.values[trie.subrangePositions[((i >> 4) << 6) + (j >> 4)] + ((i & 15) << 4) + (j & 15)]

which will be more conveniently handled through an access method on the trie object.

Here is an example, built into a commented class (I hope it compiles OK, as it has been simplified; signal me if there are errors to correct):

import java.util.Arrays;

/**
 * Implements a sparse matrix of doubles. Currently limited to a static size
 * (SIZE_I x SIZE_J).
 */
public class DoubleTrie {

    /* Matrix logical options. */
    public static final int SIZE_I = 1024;
    public static final int SIZE_J = 1024;
    public static final double DEFAULT_VALUE = 0.0;

    /* Internal splitting options. */
    private static final int SUBRANGEBITS_I = 4;
    private static final int SUBRANGEBITS_J = 4;

    /* Internal derived splitting constants. */
    private static final int SUBRANGE_I = 1 << SUBRANGEBITS_I;
    private static final int SUBRANGE_J = 1 << SUBRANGEBITS_J;
    private static final int SUBRANGEMASK_I = SUBRANGE_I - 1;
    private static final int SUBRANGEMASK_J = SUBRANGE_J - 1;
    private static final int SUBRANGE_POSITIONS = SUBRANGE_I * SUBRANGE_J;

    /* Internal derived default values for constructors. */
    private static final int SUBRANGES_I = (SIZE_I + SUBRANGE_I - 1) / SUBRANGE_I;
    private static final int SUBRANGES_J = (SIZE_J + SUBRANGE_J - 1) / SUBRANGE_J;
    private static final int SUBRANGES = SUBRANGES_I * SUBRANGES_J;
    private static final int[] DEFAULT_POSITIONS = new int[SUBRANGES]; // all map to position 0
    private static final double[] DEFAULT_VALUES = new double[SUBRANGE_POSITIONS];
    static {
        Arrays.fill(DEFAULT_VALUES, DEFAULT_VALUE);
    }

    /* Internal fast computations of the splitting subrange and offset. */
    private static int subrangeOf(final int i, final int j) {
        return (i >> SUBRANGEBITS_I) * SUBRANGES_J + (j >> SUBRANGEBITS_J);
    }

    private static int positionOffsetOf(final int i, final int j) {
        return (i & SUBRANGEMASK_I) * SUBRANGE_J + (j & SUBRANGEMASK_J);
    }

    /**
     * Utility missing in java.lang.System for arrays of comparable
     * component types, including all native types like double here.
     */
    public static int arraycompare(final double[] values1, final int position1,
                                   final double[] values2, final int position2,
                                   final int length) {
        if (position1 < 0 || position2 < 0 || length < 0)
            throw new ArrayIndexOutOfBoundsException(
                "The positions and length can't be negative");
        for (int k = length; --k >= 0; ) {
            final double value1 = values1[position1 + k];
            final double value2 = values2[position2 + k];
            if (value1 != value2) {
                /* NaN values are different from everything, including other
                 * NaN values; they are also neither lower than nor greater
                 * than anything, including NaN. The two infinities, as well
                 * as denormal values, are exactly ordered and comparable with
                 * <, <=, ==, >=, >, !=. In the comments below, infinites are
                 * considered "defined". */
                if (value1 < value2) return -1;  /* defined < defined. */
                if (value1 > value2) return 1;   /* defined > defined. */
                /* One or both are NaN. */
                if (value1 == value1) return -1; /* value1 is not NaN: defined < NaN. */
                if (value2 == value2) return 1;  /* value2 is not NaN: NaN > defined. */
                /* Both are NaN: compare their precise bits (range
                 * 0x7FF0000000000001L..0x7FFFFFFFFFFFFFFFL, including the
                 * canonical 0x7FF8000000000000L, or range
                 * 0xFFF0000000000001L..0xFFFFFFFFFFFFFFFFL). Needed for sort
                 * stability only, since NaNs are otherwise unordered. */
                final long raw1 = Double.doubleToRawLongBits(value1);
                final long raw2 = Double.doubleToRawLongBits(value2);
                if (raw1 != raw2)
                    return raw1 < raw2 ? -1 : 1;
                /* Otherwise the NaNs are strictly equal: continue. */
            }
        }
        return 0;
    }

    /** Utility shortcut for comparing ranges in the same array. */
    public static int arraycompare(final double[] values, final int position1,
                                   final int position2, final int length) {
        return arraycompare(values, position1, values, position2, length);
    }

    /**
     * Utility missing in java.lang.System for arrays of equalizable
     * component types, including all native types like double here.
     */
    public static boolean arrayequals(final double[] values1, final int position1,
                                      final double[] values2, final int position2,
                                      final int length) {
        return arraycompare(values1, position1, values2, position2, length) == 0;
    }

    /** Utility shortcut for comparing ranges in the same array. */
    public static boolean arrayequals(final double[] values, final int position1,
                                      final int position2, final int length) {
        return arrayequals(values, position1, values, position2, length);
    }

    /** Utility shortcut for copying ranges in the same array. */
    public static void arraycopy(final double[] values, final int srcPosition,
                                 final int dstPosition, final int length) {
        System.arraycopy(values, srcPosition, values, dstPosition, length);
    }

    /** Utility shortcut for resizing an array, preserving values at start. */
    public static double[] arraysetlength(final double[] values, final int newLength) {
        final double[] newValues = new double[newLength];
        System.arraycopy(values, 0, newValues, 0,
                         Math.min(values.length, newLength));
        return newValues;
    }

    /* Internal instance members. */
    private double[] values;
    private int[] subrangePositions;
    private boolean isSharedValues;
    private boolean isSharedSubrangePositions;

    /* Internal method. */
    private void reset(final double[] values, final int[] subrangePositions) {
        this.isSharedValues = (this.values = values) == DEFAULT_VALUES;
        this.isSharedSubrangePositions =
            (this.subrangePositions = subrangePositions) == DEFAULT_POSITIONS;
    }

    /**
     * Reset the matrix to fill it with the same initial value.
     *
     * @param initialValue The value to set in all cell positions.
     */
    public void reset(final double initialValue) {
        final double[] initialValues;
        if (initialValue == DEFAULT_VALUE) {
            initialValues = DEFAULT_VALUES;
        } else {
            initialValues = new double[SUBRANGE_POSITIONS];
            Arrays.fill(initialValues, initialValue);
        }
        reset(initialValues, DEFAULT_POSITIONS);
    }

    /** Reset the matrix to fill it with the default value. */
    public void reset() {
        reset(DEFAULT_VALUE);
    }

    /** Default constructor, using the single default value. */
    public DoubleTrie() {
        this(DEFAULT_VALUE);
    }

    /**
     * Constructor using an alternate default value to initialize all
     * positions in the matrix.
     */
    public DoubleTrie(final double initialValue) {
        this.reset(initialValue);
    }

    /**
     * A useful preinitialized instance containing DEFAULT_VALUE in all cells.
     */
    public static final DoubleTrie DEFAULT_INSTANCE = new DoubleTrie();

    /**
     * Copy constructor. Note that the source trie may be immutable or not;
     * this constructor creates a new mutable trie, even if the new trie
     * initially shares some storage with its source when that source also
     * uses shared storage.
     */
    public DoubleTrie(final DoubleTrie source) {
        this.values = (this.isSharedValues = source.isSharedValues)
            ? source.values
            : source.values.clone();
        this.subrangePositions =
            (this.isSharedSubrangePositions = source.isSharedSubrangePositions)
            ? source.subrangePositions
            : source.subrangePositions.clone();
    }

    /**
     * Fast indexed getter.
     *
     * @param i Row of the position to get in the matrix.
     * @param j Column of the position to get in the matrix.
     * @return The value stored in the matrix at that position.
     */
    public double getAt(final int i, final int j) {
        return values[subrangePositions[subrangeOf(i, j)]
                      + positionOffsetOf(i, j)];
    }

    /**
     * Fast indexed setter.
     * Note: this does not compact the sparse matrix after setting.
     *
     * @param i Row of the position to set in the sparse matrix.
     * @param j Column of the position to set in the sparse matrix.
     * @param value The value to set at this position.
     * @return The passed value.
     * @see #compact()
     */
    public double setAt(final int i, final int j, final double value) {
        final int subrange = subrangeOf(i, j);
        final int positionOffset = positionOffsetOf(i, j);
        // Fast check to see if the assignment will change something.
        int subrangePosition = subrangePositions[subrange];
        int valuePosition = subrangePosition + positionOffset;
        if (Double.compare(values[valuePosition], value) != 0) {
            /* We'll need to perform an effective assignment in values.
             * Check whether the values vector is currently shared; this also
             * covers DEFAULT_VALUES, which may be shared by several other
             * trie instances, including those instantiated by the copy
             * constructor. */
            if (isSharedValues) {
                values = values.clone();
                isSharedValues = false;
            }
            /* Scan all other subranges to check whether the position in
             * values to assign is shared by another subrange. */
            for (int otherSubrange = subrangePositions.length;
                 --otherSubrange >= 0; ) {
                if (otherSubrange == subrange)
                    continue; /* Ignore the target subrange. */
                /* Note: this interval test remains safe with a future
                 * interleaving of common subranges (TODO in compact()); for
                 * now, subranges share positions only when strictly equal, so
                 * the simpler test
                 * (otherSubrangePosition == subrangePosition) would do. */
                final int otherSubrangePosition = subrangePositions[otherSubrange];
                if (otherSubrangePosition <= valuePosition
                        && valuePosition < otherSubrangePosition + SUBRANGE_POSITIONS) {
                    /* The target position is shared by some other subrange:
                     * make it unique by growing the values vector, copying
                     * the current subrange values to the end of it, and
                     * assigning the new value there. This changes the
                     * position of the current subrange, so first check
                     * whether the subrangePositions array itself is shared
                     * between instances (including DEFAULT_POSITIONS, which
                     * must be preserved). */
                    if (isSharedSubrangePositions) {
                        subrangePositions = subrangePositions.clone();
                        isSharedSubrangePositions = false;
                    }
                    /* TODO: no attempt is made to allocate less than a fully
                     * independent subrange using possible interleaving: that
                     * would require scanning all other existing values for a
                     * match, and could leave positions of the old subrange
                     * unreferenced after the move. Such a scan could be
                     * prohibitively long on each assignment; it is assumed
                     * that compact() will be used later. */
                    final int oldPosition = subrangePosition;
                    subrangePositions[subrange] = subrangePosition = values.length;
                    values = arraysetlength(values,
                                            subrangePosition + SUBRANGE_POSITIONS);
                    /* Copy the formerly shared subrange into its new private
                     * location before modifying it. */
                    arraycopy(values, oldPosition, subrangePosition,
                              SUBRANGE_POSITIONS);
                    valuePosition = subrangePosition + positionOffset;
                    break;
                }
            }
            /* Now perform the effective assignment of the value. */
            values[valuePosition] = value;
        }
        return value;
    }

    /**
     * Compact the storage of common subranges.
     * TODO: this is a simple implementation without interleaving, which
     * would offer better data compression. However, interleaving, with its
     * O(N^2) complexity where N is the total length of values, should be
     * attempted only after this basic compression, whose complexity is
     * O(n^2) with n being SUBRANGE_POSITIONS times smaller than N.
     */
    public void compact() {
        final int oldValuesLength = values.length;
        int newValuesLength = 0;
        for (int oldPosition = 0;
             oldPosition < oldValuesLength;
             oldPosition += SUBRANGE_POSITIONS) {
            boolean commonSubrange = false;
            /* Scan the already-kept subranges for an identical one. */
            for (int newPosition = newValuesLength;
                 (newPosition -= SUBRANGE_POSITIONS) >= 0; ) {
                if (arrayequals(values, newPosition, oldPosition,
                                SUBRANGE_POSITIONS)) {
                    commonSubrange = true;
                    /* Update subrangePositions[]: all entries referencing
                     * oldPosition now reference newPosition. There may be
                     * several indexes to change if the trie has already been
                     * compacted before and later reassigned. */
                    for (int subrange = subrangePositions.length;
                         --subrange >= 0; ) {
                        if (subrangePositions[subrange] == oldPosition)
                            subrangePositions[subrange] = newPosition;
                    }
                    break;
                }
            }
            if (!commonSubrange) {
                /* Move down the non-common subrange if some previous
                 * subranges have been compressed because they were common,
                 * keeping the index in sync. */
                if (oldPosition != newValuesLength) {
                    arraycopy(values, oldPosition, newValuesLength,
                              SUBRANGE_POSITIONS);
                    for (int subrange = subrangePositions.length;
                         --subrange >= 0; ) {
                        if (subrangePositions[subrange] == oldPosition)
                            subrangePositions[subrange] = newValuesLength;
                    }
                }
                /* Advance past the preserved subrange. */
                newValuesLength += SUBRANGE_POSITIONS;
            }
        }
        /* Check the number of compressed values. */
        if (newValuesLength < oldValuesLength) {
            values = arraysetlength(values, newValuesLength);
            isSharedValues = false;
        }
    }
}

Note: this code is not complete, because it handles a single matrix size, and its compactor is limited to detecting common subranges only, without interleaving them.

In addition, the code does not determine the best subrange width or height to use for splitting the matrix (for the x or y coordinates) according to the matrix size. It just uses the same static subrange size of 16 (for both coordinates), but it could conveniently be any other small power of 2 (a non-power of 2 would slow down the internal methods int subrangeOf(int, int) and int positionOffsetOf(int, int)), independently for each coordinate, and up to the maximum width or height of the matrix. Ideally, the compact() method should be able to determine the best-fitting sizes as well.

If these shared subranges can vary in size, then you'll need instance members for the subrange sizes instead of the static SUBRANGE_POSITIONS, the static methods int subrangeOf(int i, int j) and int positionOffsetOf(int i, int j) will have to become non-static, and the initialization arrays DEFAULT_POSITIONS and DEFAULT_VALUES will need to be dropped or redefined differently.

If you want to support interleaving, you'll basically start by dividing the existing values into two subsets of about the same size (both multiples of the minimum subrange size, the first subset possibly containing one more subrange than the second), and you'll scan the larger one at every successive position to find a matching interleaving; then you'll try to match those values. Then you'll loop recursively, dividing the subsets in halves (also multiples of the minimum subrange size) and scanning again to match these subsets (this multiplies the number of subsets by 2: you have to wonder whether the doubled size of the subrangePositions index is worth the saving in the size of the values, to see whether it still provides effective compression; if not, you stop there: you have found the optimal subrange size directly from the interleaving compression process). In that case, the subrange size becomes variable during compaction.

But this code shows how you assign non-zero values and reallocate the data array for additional (non-default) subranges, and then how you can optimize the storage of this data (with compact(), after the assignments performed via setAt(int i, int j, double value)) when there are duplicate subranges that may be unified within the data and re-indexed to the same position in the subrangePositions array.

In any case, all the principles of a Trie are implemented there (a short usage sketch follows this list):

  • It is always faster (and more compact in memory, which means better locality) to represent a matrix using a single vector instead of an array of arrays with two indexes (each of them allocated separately). The improvement is visible in the double getAt(int, int) method!

  • You save a lot of space, but when assigning values it may take time to reallocate new subranges. For this reason, the subranges should not be too small, or reallocations will happen too frequently while setting up your matrix.

  • You can automatically convert a large initial matrix into a more compact one by detecting common subranges. A typical implementation will then contain a method such as compact() above. However, while get() access is very fast and set() is reasonably fast, compact() may be very slow if there are lots of common subranges to compress (for example, when subtracting a large non-sparse, randomly-filled matrix from itself, or multiplying it by zero: it will be simpler and much faster in that case to reset the trie by instantiating a new one and dropping the old one).

  • Common subranges use shared storage in the data, so that shared data must be read-only. If you must change a single value without changing the rest of the matrix, you must first make sure that it is referenced only once in the subrangePositions index. Otherwise you will need to allocate a new subrange anywhere (conveniently at the end) of the values vector, and then store the position of this new subrange into the subrangePositions index.
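
Here is a short usage sketch of the DoubleTrie class above (assuming the corrected signatures shown in the listing; the expected outputs are given in the comments):

public class DoubleTrieDemo {
    public static void main(String[] args) {
        DoubleTrie m = new DoubleTrie();   // all cells initially share the default 0.0 subrange

        m.setAt(3, 5, 1.5);                // allocates a private 16x16 subrange for rows 0-15, columns 0-15
        m.setAt(900, 1020, -2.0);          // allocates another private subrange
        System.out.println(m.getAt(3, 5));       // 1.5 - two array indexings, no hashing
        System.out.println(m.getAt(700, 700));   // 0.0 - still served by the shared default subrange

        m.setAt(3, 5, 0.0);                // back to the default value...
        m.compact();                       // ...so compact() can re-share that subrange again
    }
}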

Please note that the Colt library, though very good in general, is not as good when working with sparse matrices, because it uses hashing (or row-compression) techniques that do not implement support for Tries yet, despite that excellent optimization, which saves both time and memory, notably for the most frequent getAt() operations.

Even the setAt() operation described here for Tries saves a lot of time (the way it is implemented here, i.e. without automatic compaction after setting, which could still be implemented based on demand and estimated time, in cases where compaction would still save a lot of storage space at some cost in time): the time saving is proportional to the number of cells per subrange, and the space saving is inversely proportional to the number of cells per subrange. A good compromise is a subrange size such that the number of cells per subrange is the square root of the total number of cells in the 2-D matrix (it would be the cube root when working with a 3-D matrix).
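
A rough sketch of that sizing heuristic, as I read it; the helper name is hypothetical, and the result is rounded to a power of two so that subrangeOf()/positionOffsetOf() remain cheap bit operations:

class SubrangeSizing {
    /** Per-coordinate subrange bits so each (square) subrange holds about sqrt(total cells). */
    static int suggestedSubrangeBits(final long sizeI, final long sizeJ) {
        final double targetCellsPerSubrange = Math.sqrt((double) sizeI * (double) sizeJ);
        final int targetBits = (int) Math.round(Math.log(targetCellsPerSubrange) / Math.log(2.0));
        return Math.max(1, targetBits / 2);   // split the bits between the two coordinates
    }

    public static void main(String[] args) {
        // For a 1024 x 1024 matrix: sqrt(2^20) = 1024 cells per subrange,
        // i.e. 10 bits, so 5 bits per coordinate (32 x 32 subranges).
        System.out.println(suggestedSubrangeBits(1024, 1024));   // prints 5
    }
}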

The hashing techniques used in Colt's sparse matrix implementations have the disadvantage of adding a lot of storage overhead, as well as slow access times due to possible collisions. Tries can avoid all collisions, and can then guarantee to reduce that worst-case linear O(n) time to O(1) time, where (n) is the number of possible collisions (which, in the case of a sparse matrix, may be up to the number of non-default cells in the matrix, i.e. up to the total matrix size multiplied by a factor proportional to the hash fill factor, for a non-sparse, i.e. full, matrix).

Colt's RC (row-compressed) technique is nearer to Tries, but it has another price: the compression technique used has very slow access times for the most frequent read-only get() operations, and very slow compression for setAt() operations. In addition, the compression used is not orthogonal, unlike in this presentation of Tries, where orthogonality is preserved. That orthogonality would also be preserved for the related viewing operations, such as striding, transposing (viewed as a striding operation based on integer cyclic modular operations), and subranging (and subselections in general, including with sorting views).

I just hope that Colt will be updated in the future to implement another implementation using Tries (i.e. a TrieSparseMatrix alongside HashSparseMatrix and RCSparseMatrix). The ideas are in this article.

The Trove implementation (based on int->int maps) is also based on a hashing technique similar to Colt's HashedSparseMatrix, i.e. it has the same inconveniences. Tries will be significantly faster, with a moderate amount of additional space used (but this space can be optimized, becoming even better than Trove and Colt, in deferred time, using a final compact() operation on the resulting matrix/trie).

Note: this Trie implementation is bound to a specific native type (here double). This is deliberate, because a generic implementation using boxed types has a huge space overhead (and is much slower in access time). Here it just uses native one-dimensional arrays of double rather than generic vectors. But it is certainly possible to derive a generic implementation for Tries as well... Unfortunately, Java still does not allow writing truly generic classes with all the benefits of native types, short of writing multiple implementations (for a generic Object type or for each native type) and serving all these operations through a type factory. The language should be able to automatically instantiate the native implementations and build the factory automatically (for now this is not the case, even in Java 7, and this is an area where .NET still keeps its advantage for truly generic types that are as fast as native types).

+62
Aug 02 '10 at 20:21

The following framework for benchmarking Java matrix libraries also provides a good list of them: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/

Tested Libraries:

  • Colt
  • Commons Math
  • Efficient Java Matrix Library (EJML)
  • Jama
  • jblas
  • JScience (older benchmarks only)
  • Matrix Toolkit Java (MTJ)
  • OjAlgo
  • Parallel Colt
  • Universal Java Matrix Package (UJMP)
+8
Feb 04 '18

Here is a paper that might interest you; it discusses data structures for matrix computations, including sparse arrays:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.7544

You can download the document in PDF or PS format. It also contains source code.

+4
Dec 23 '08 at 22:09

Perhaps Colt will help. It provides a sparse matrix implementation.

+4
Dec 25 '08 at 22:45

It seems simple.

You could use a binary tree of the data, using row * maxcolumns + column as the index.

To find an element, you simply compute row * maxcolumns + column and binary-search the tree for it; if it is not there, you can return null (the lookup is O(log n), where n is the number of cells that contain an object).
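
A minimal sketch of that idea, using a JDK TreeMap as the balanced binary search tree and long arithmetic, since the key range is in the billions (the column count below is hypothetical, and row * MAX_COLUMNS + column must still fit in a long):

import java.util.TreeMap;

class TreeSparseGrid {
    private static final long MAX_COLUMNS = 3_000_000_000L;   // hypothetical column count
    private final TreeMap<Long, Object> cells = new TreeMap<>();

    void set(final long row, final long column, final Object value) {
        cells.put(row * MAX_COLUMNS + column, value);   // combined index
    }

    Object get(final long row, final long column) {
        return cells.get(row * MAX_COLUMNS + column);   // O(log n); null when the cell is empty
    }
}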

+3
Dec 24 '08 at 9:46

Not the fastest solution at runtime perhaps, but the quickest I could come up with that seems to work. Create an Index class and use it as a key in a SortedMap, like this:

SortedMap<Index, Object> entries = new TreeMap<Index, Object>();
entries.put(new Index(1, 4), "1-4");
entries.put(new Index(5555555555l, 767777777777l), "5555555555l-767777777777l");
System.out.println(entries.size());
System.out.println(entries.get(new Index(1, 4)));
System.out.println(entries.get(new Index(5555555555l, 767777777777l)));

My index class looks like this (with some help from the Eclipse code generator).

public static class Index implements Comparable<Index> {
    private long x;
    private long y;

    public Index(long x, long y) {
        super();
        this.x = x;
        this.y = y;
    }

    public int compareTo(Index index) {
        long ix = index.x;
        if (ix == x) {
            long iy = index.y;
            if (iy == y) {
                return 0;
            } else if (iy < y) {
                return -1;
            } else {
                return 1;
            }
        } else if (ix < x) {
            return -1;
        } else {
            return 1;
        }
    }

    public int hashCode() {
        final int PRIME = 31;
        int result = 1;
        result = PRIME * result + (int) (x ^ (x >>> 32));
        result = PRIME * result + (int) (y ^ (y >>> 32));
        return result;
    }

    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        final Index other = (Index) obj;
        if (x != other.x)
            return false;
        if (y != other.y)
            return false;
        return true;
    }

    public long getX() {
        return x;
    }

    public long getY() {
        return y;
    }
}
+2
Dec 24 '08 at 10:51

Trove, like Colt, offers primitive int -> int maps (hash-based collections that avoid boxing the keys), which you could use for this.
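
A minimal sketch of how that could look, assuming Trove 3's TLongObjectHashMap and a combined long key (the column count is hypothetical, and class/package names may differ in other Trove versions):

import gnu.trove.map.hash.TLongObjectHashMap;

class TroveSparseGrid<T> {
    private static final long MAX_COLUMNS = 3_000_000_000L;   // hypothetical column count
    private final TLongObjectHashMap<T> cells = new TLongObjectHashMap<>();

    void set(final long row, final long column, final T value) {
        cells.put(row * MAX_COLUMNS + column, value);   // primitive long key, no boxing
    }

    T get(final long row, final long column) {
        return cells.get(row * MAX_COLUMNS + column);   // null when the cell is empty
    }
}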

+2
22 '10 at 21:39

Take a look at la4j (Linear Algebra for Java). It supports both CRS (Compressed Row Storage) and CCS (Compressed Column Storage) internal representations for sparse matrices, which are compact and fast structures for sparse data.

Here is a short example with la4j:

Matrix a = new CRSMatrix(new double[][] {          // 'a' - CRS sparse matrix
    { 1.0, 0.0, 3.0 },
    { 0.0, 5.0, 0.0 },
    { 7.0, 0.0, 9.0 }
});
Matrix b = a.transpose();                          // 'b' - CRS sparse matrix
Matrix c = b.multiply(a, Matrices.CCS_FACTORY);    // 'c' = 'b' * 'a'; 'c' - CCS sparse matrix
+2
08 .

You could simply build a nested map, for example:

  Map<Integer, Map<Integer, Object>> matrix;

Or, if you want to keep more information about each cell than just its value, wrap it in a small class, for example:

class Tuple<T> {                 // T: your data object type
    public final int x;
    public final int y;
    public final T object;

    Tuple(int x, int y, T object) {
        this.x = x;
        this.y = y;
        this.object = object;
    }
}

class Matrix<T> {
    private final Map<Integer, Map<Integer, Tuple<T>>> data = new HashMap<>();

    void add(int x, int y, T object) {
        data.get(x).put(y, new Tuple<>(x, y, object));
    }
    // etc.
}

Null checks and the creation of the missing inner maps are left out here, etc.
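
A hedged sketch of that missing null handling, written as two methods that could live inside the Matrix class above (computeIfAbsent creates the inner row map on demand):

// Inside class Matrix<T> from above:
void add(int x, int y, T object) {
    data.computeIfAbsent(x, k -> new HashMap<>())
        .put(y, new Tuple<>(x, y, object));
}

T get(int x, int y) {
    Map<Integer, Tuple<T>> row = data.get(x);
    Tuple<T> cell = (row == null) ? null : row.get(y);
    return (cell == null) ? null : cell.object;   // null means "empty cell"
}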

0
Dec 24 '08 at 12:38

SuanShu is a Java numerical library that also includes sparse matrix implementations.

-1
19 '11 at 18:31

You can also just use a HashMap with a String key built from the row and column indices separated by '/', concatenated with a StringBuilder (not + or String.format). It was fast enough for me. :-)
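
A minimal sketch of that approach (the names are illustrative; note that every access allocates a String for the key):

import java.util.HashMap;
import java.util.Map;

class StringKeyedGrid {
    private final Map<String, Object> cells = new HashMap<>();

    private static String key(final long row, final long column) {
        // The '/' separator avoids ambiguity between e.g. (1, 23) and (12, 3).
        return new StringBuilder(24).append(row).append('/').append(column).toString();
    }

    void set(final long row, final long column, final Object value) {
        cells.put(key(row, column), value);
    }

    Object get(final long row, final long column) {
        return cells.get(key(row, column));   // null when the cell is empty
    }
}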

-1
14 '14 at 15:37
