Handle a large matrix in Java

I currently need to compute a singular value decomposition (SVD) of a 48K x 50K matrix.

I tried JAMA, but it only works when rows >= columns. I tried Parallel Colt (PColt) and JBLAS, but they throw an error when rows * columns > Integer.MAX_VALUE.

Any suggestions on what I should do?

Sorry if I made mistakes in the lines above.

Many thanks!

+7
3 answers

I had similar problems when doing SVD calculations, and my experience is: don't do this in Java. There are tools that can do it more efficiently. If you really need Java, you could write an interface that calls the tool from your code. I ended up using R: I used it manually, saving the matrix to a file that R could read in as a matrix.

By the way, if the matrix is sparse, various optimizations are possible that reduce both memory usage and the size of the output file (if you decide to use one).
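To illustrate the hand-off to R for a sparse matrix, here is a minimal sketch (the file name, the helper, and the triplet layout are my own choices, not from the original answer): write only the non-zero entries as 1-based "row col value" triplets, which R's Matrix package can read back directly.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class SparseExport {
    // Collect the non-zero entries as "row col value" triplets (1-based, as R expects).
    static List<String> triplets(double[][] m) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                if (m[i][j] != 0.0)
                    lines.add((i + 1) + " " + (j + 1) + " " + m[i][j]);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        double[][] m = {{1.5, 0, 0}, {0, 0, 2.5}};
        try (PrintWriter out = new PrintWriter("matrix.txt")) {
            for (String line : triplets(m)) out.println(line);
        }
        // In R: t <- read.table("matrix.txt")
        //       M <- Matrix::sparseMatrix(i = t$V1, j = t$V2, x = t$V3)
    }
}
```

For a 48K x 50K matrix this keeps the intermediate file proportional to the number of non-zeros rather than to 2.4 billion entries.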

Otherwise, check out this thread to see if this helps: Processing a large data structure in Java

+6

For really large blocks of memory, I tend to suggest using memory-mapped files (maybe this is what R does for you). You can do this in Java with some boilerplate code. Unfortunately, Java does not directly support mapping more than 2 GB at a time, so you need to partition the file into multiple mappings.

import sun.misc.Cleaner;
import sun.nio.ch.DirectBuffer;
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class LargeDoubleMatrix implements Closeable {
    private static final int MAPPING_SIZE = 1 << 30;
    private final RandomAccessFile raf;
    private final int width;
    private final int height;
    private final List<MappedByteBuffer> mappings = new ArrayList<MappedByteBuffer>();

    public LargeDoubleMatrix(String filename, int width, int height) throws IOException {
        this.raf = new RandomAccessFile(filename, "rw");
        try {
            this.width = width;
            this.height = height;
            long size = 8L * width * height;
            for (long offset = 0; offset < size; offset += MAPPING_SIZE) {
                long size2 = Math.min(size - offset, MAPPING_SIZE);
                mappings.add(raf.getChannel().map(FileChannel.MapMode.READ_WRITE, offset, size2));
            }
        } catch (IOException e) {
            raf.close();
            throw e;
        }
    }

    protected long position(int x, int y) {
        return (long) y * width + x;
    }

    public int width() { return width; }
    public int height() { return height; }

    public double get(int x, int y) {
        assert x >= 0 && x < width;
        assert y >= 0 && y < height;
        long p = position(x, y) * 8;
        int mapN = (int) (p / MAPPING_SIZE);
        int offN = (int) (p % MAPPING_SIZE);
        return mappings.get(mapN).getDouble(offN);
    }

    public void set(int x, int y, double d) {
        assert x >= 0 && x < width;
        assert y >= 0 && y < height;
        long p = position(x, y) * 8;
        int mapN = (int) (p / MAPPING_SIZE);
        int offN = (int) (p % MAPPING_SIZE);
        mappings.get(mapN).putDouble(offN, d);
    }

    public void close() throws IOException {
        for (MappedByteBuffer mapping : mappings)
            clean(mapping);
        raf.close();
    }

    private void clean(MappedByteBuffer mapping) {
        if (mapping == null) return;
        Cleaner cleaner = ((DirectBuffer) mapping).cleaner();
        if (cleaner != null)
            cleaner.clean();
    }
}

Here is a test that sets the diagonal values:

@Test
public void getSetMatrix() throws IOException {
    long start = System.nanoTime();
    final long used0 = usedMemory();
    LargeDoubleMatrix matrix = new LargeDoubleMatrix("/tmp/ldm.test", 48 * 1000, 50 * 1000);
    for (int i = 0; i < matrix.width(); i++)
        matrix.set(i, i, i);
    for (int i = 0; i < matrix.width(); i++)
        assertEquals(i, matrix.get(i, i), 0.0);
    long time = System.nanoTime() - start;
    final long used = usedMemory() - used0;
    if (used == 0)
        System.err.println("You need to use -XX:-UseTLAB to see small changes in memory usage.");
    System.out.printf("Setting the diagonal took %,d ms, Heap used is %,d KB%n", time / 1000 / 1000, used / 1024);
    matrix.close();
}

private long usedMemory() {
    return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
}

This prints (when run with -XX:-UseTLAB):

 Setting the diagonal took 60 ms, Heap used is 55 KB 

Only the pages actually touched are created. The file looks very large, but the allocated space is based on usage.

 $ ls -lh /tmp/ldm.test
 -rw-rw-r-- 1 peter peter 18G 2011-12-30 10:18 /tmp/ldm.test
 $ du -sh /tmp/ldm.test
 222M /tmp/ldm.test
+3

Step 1. Use a database to store it.
Step 2. Use a multi-threaded / parallel algorithm.

This document discusses state-of-the-art techniques for large SVDs. The Lanczos algorithm on 3 processors took just over 10 minutes on a 32K x 32K matrix, but only for the smallest singular value. It is probably possible to deflate and retrieve consecutive singular values; I always found that power iteration with deflation worked well for this.

In short, form M * M^T and M^T * M, then take their eigenvectors and eigenvalues to reconstruct the SVD matrices.
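As a toy illustration of this eigenvalue route (a plain-Java sketch I am adding, not code from the original answer): power iteration on M^T * M converges to its dominant eigenvector, and the square root of the corresponding eigenvalue is the largest singular value of M.

```java
import java.util.Arrays;

public class PowerIterationSVD {
    // Multiply an m x n matrix by an n-vector.
    static double[] multiply(double[][] a, double[] v) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < v.length; j++)
                r[i] += a[i][j] * v[j];
        return r;
    }

    static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    // Largest singular value of M via power iteration on M^T M.
    static double largestSingularValue(double[][] m, int iters) {
        int n = m[0].length;
        double[][] mt = new double[n][m.length];      // M^T
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < n; j++)
                mt[j][i] = m[i][j];
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / Math.sqrt(n));           // arbitrary unit start vector
        double lambda = 0;
        for (int k = 0; k < iters; k++) {
            double[] w = multiply(mt, multiply(m, v)); // w = (M^T M) v
            lambda = norm(w);                          // Rayleigh-style estimate
            for (int j = 0; j < n; j++) v[j] = w[j] / lambda;
        }
        return Math.sqrt(lambda);                      // sigma = sqrt(eigenvalue of M^T M)
    }

    public static void main(String[] args) {
        double[][] m = {{3, 0}, {0, 2}};
        System.out.printf("sigma_max = %.6f%n", largestSingularValue(m, 100)); // sigma_max = 3.000000
    }
}
```

To get the next singular value, deflate: subtract sigma^2 * v * v^T from M^T M and iterate again. For a 48K x 50K matrix you would never materialize M^T M explicitly; you would apply M and M^T as operators, e.g. on top of the memory-mapped matrix from the other answer.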

If you're willing to accept approximations, this other document is just one of many that deal with approximate algorithms. Many of them are based on some form of column downsampling or optimally representative submatrices, where you exploit the cubically smaller pieces to do the work, plus parallelism.

Obviously these come with some distortion, but perhaps you can smooth that out in your results.

Finally, you really want to use Strassen's method for the matrix multiplications.
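For reference, a minimal recursive Strassen sketch (my own toy version, assuming square matrices whose size is a power of two; a real implementation would pad to that size and fall back to naive multiplication below a cutoff). It replaces the 8 block multiplications of the naive recursion with 7:

```java
public class Strassen {
    // Extract the h x h block of m starting at (r, c).
    static double[][] block(double[][] m, int r, int c, int h) {
        double[][] b = new double[h][h];
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++)
                b[i][j] = m[r + i][c + j];
        return b;
    }

    static double[][] add(double[][] a, double[][] b) { return combine(a, b, 1); }
    static double[][] sub(double[][] a, double[][] b) { return combine(a, b, -1); }

    static double[][] combine(double[][] a, double[][] b, int sign) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i][j] = a[i][j] + sign * b[i][j];
        return c;
    }

    // Copy block b into m at (r, c).
    static void place(double[][] m, double[][] b, int r, int c) {
        for (int i = 0; i < b.length; i++)
            for (int j = 0; j < b.length; j++)
                m[r + i][c + j] = b[i][j];
    }

    // Strassen's 7-multiplication recursion; n must be a power of two.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        if (n == 1) return new double[][]{{a[0][0] * b[0][0]}};
        int h = n / 2;
        double[][] a11 = block(a, 0, 0, h), a12 = block(a, 0, h, h),
                   a21 = block(a, h, 0, h), a22 = block(a, h, h, h);
        double[][] b11 = block(b, 0, 0, h), b12 = block(b, 0, h, h),
                   b21 = block(b, h, 0, h), b22 = block(b, h, h, h);
        double[][] m1 = multiply(add(a11, a22), add(b11, b22));
        double[][] m2 = multiply(add(a21, a22), b11);
        double[][] m3 = multiply(a11, sub(b12, b22));
        double[][] m4 = multiply(a22, sub(b21, b11));
        double[][] m5 = multiply(add(a11, a12), b22);
        double[][] m6 = multiply(sub(a21, a11), add(b11, b12));
        double[][] m7 = multiply(sub(a12, a22), add(b21, b22));
        double[][] c = new double[n][n];
        place(c, add(sub(add(m1, m4), m5), m7), 0, 0); // C11
        place(c, add(m3, m5), 0, h);                   // C12
        place(c, add(m2, m4), h, 0);                   // C21
        place(c, add(sub(add(m1, m3), m2), m6), h, h); // C22
        return c;
    }

    public static void main(String[] args) {
        double[][] c = multiply(new double[][]{{1, 2}, {3, 4}},
                                new double[][]{{5, 6}, {7, 8}});
        System.out.println(java.util.Arrays.deepToString(c)); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

The payoff is asymptotic (roughly O(n^2.807) vs O(n^3)), so it only matters for blocks large enough to amortize the extra additions.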

+1
