Efficient implementation of mutual information in Java

I want to calculate the mutual information between two features using Java.

I have already read "Computing Mutual Information for Choosing a Learning Set in Java", but that was a discussion of whether mutual information was appropriate for the poster, with only some light pseudocode regarding the implementation.

My current code is below, but I hope there is a way to optimize it, since I have a lot of data to process. I know that calling out to another language or framework might improve speed, but I would like to focus on solving this in Java for now.

Any help is greatly appreciated.

    public static double calculateNewMutualInformation(double frequencyOfBoth, double frequencyOfLeft,
            double frequencyOfRight, int noOfTransactions) {
        if (frequencyOfBoth == 0 || frequencyOfLeft == 0 || frequencyOfRight == 0)
            return 0;

        // supp = f11
        double supp = frequencyOfBoth / noOfTransactions;       // P(x,y)
        double suppLeft = frequencyOfLeft / noOfTransactions;   // P(x)
        double suppRight = frequencyOfRight / noOfTransactions; // P(y)

        double f10 = (suppLeft - supp);     // P(x) - P(x,y)
        double f00 = (1 - suppRight) - f10; // (1 - P(y)) - f10
        double f01 = (suppRight - supp);    // P(y) - P(x,y)

        // H(X) = -1 * ((P(x) * log(P(x))) + ((1 - P(x)) * log(1 - P(x))))
        double HX = -1 * ((suppLeft * MathUtils.logWithoutNaN(suppLeft))
                + ((1 - suppLeft) * MathUtils.logWithoutNaN(1 - suppLeft)));
        // H(Y) = -1 * ((P(y) * log(P(y))) + ((1 - P(y)) * log(1 - P(y))))
        double HY = -1 * ((suppRight * MathUtils.logWithoutNaN(suppRight))
                + ((1 - suppRight) * MathUtils.logWithoutNaN(1 - suppRight)));

        double one = (supp * MathUtils.logWithoutNaN(supp)); // P(x,y) * log(P(x,y))
        double two = (f10 * MathUtils.logWithoutNaN(f10));
        double three = (f01 * MathUtils.logWithoutNaN(f01));
        double four = (f00 * MathUtils.logWithoutNaN(f00));
        double HXY = -1 * (one + two + three + four);

        return (HX + HY - HXY) / (HX == 0 ? MathUtils.EPSILON : HX);
    }

    public class MathUtils {
        public static final double EPSILON = 0.000001;

        public static double logWithoutNaN(double value) {
            if (value == 0) {
                return Math.log(EPSILON);
            } else if (value < 0) {
                return 0;
            }
            return Math.log(value);
        }
    }
2 answers

I have found the following to be fast, but I have not compared it against your method, only against the implementation provided in WEKA.

It works on the premise of rearranging the MI equation so that the number of floating-point operations is minimized:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$

We start by defining p(·) as a count (frequency) divided by the number of samples/transactions. So we define the number of items as n, the number of times x occurs as |x|, the number of times y occurs as |y|, and the number of times they co-occur as |x,y|. We then get

$$I(X;Y) = \sum_{x}\sum_{y} \frac{|x,y|}{n}\,\log\frac{|x,y|/n}{(|x|/n)\,(|y|/n)}$$

Now we can rearrange this by inverting the denominator of the inner division, which gives (n|x,y|)/(|x||y|). We also compute N = 1/n once, so that we have one fewer division operation. This gives us:

$$I(X;Y) = N \sum_{x}\sum_{y} |x,y|\,\log\frac{n\,|x,y|}{|x|\,|y|}$$

This gives us the following code:

    // Assumption: log2 is not defined in the original snippet. It is taken here to be
    // 1/ln(2), which converts the natural-log result to bits.
    static final double log2 = 1 / Math.log(2);

    /**
     * Computes MI between variables t and a. Assumes that a.length == t.length.
     * @param a candidate variable a
     * @param avals number of values a can take (values of a lie in 0..avals-1)
     * @param t target variable
     * @param tvals number of values t can take (values of t lie in 0..tvals-1)
     * @return the mutual information between a and t
     */
    static double computeMI(int[] a, int avals, int[] t, int tvals) {
        double numinst = a.length;
        double oneovernuminst = 1 / numinst;
        double sum = 0;

        // longs are required here because of big multiples in calculation
        long[][] crosscounts = new long[avals][tvals];
        long[] tcounts = new long[tvals];
        long[] acounts = new long[avals];

        // Compute counts for the two variables
        for (int i = 0; i < a.length; i++) {
            int av = a[i];
            int tv = t[i];
            acounts[av]++;
            tcounts[tv]++;
            crosscounts[av][tv]++;
        }

        for (int tv = 0; tv < tvals; tv++) {
            for (int av = 0; av < avals; av++) {
                if (crosscounts[av][tv] != 0) {
                    // Main fraction: (n|x,y|)/(|x||y|)
                    double sumtmp = (numinst * crosscounts[av][tv]) / (acounts[av] * tcounts[tv]);
                    // Log bit (|x,y|/n) and update product
                    sum += oneovernuminst * crosscounts[av][tv] * Math.log(sumtmp) * log2;
                }
            }
        }

        return sum;
    }

For this code to be efficient, it assumes that the values of a and t are not sparse (i.e. min(t) = 0 and tvals = max(t) + 1). Otherwise, as noted in the comments, large and unnecessary arrays are created.
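If the values are sparse, one option is to remap them to dense indices before counting. The following is only a minimal sketch and not part of the original answer; the class and method names (`DenseIndexSketch`, `toDenseIndices`, `countDistinct`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps arbitrary int values to dense indices 0..k-1.
final class DenseIndexSketch {

    /** Returns a copy of values where each distinct value is replaced by its first-seen index. */
    static int[] toDenseIndices(int[] values) {
        Map<Integer, Integer> indexOf = new HashMap<>();
        int[] dense = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer idx = indexOf.get(values[i]);
            if (idx == null) {
                idx = indexOf.size(); // next unused index
                indexOf.put(values[i], idx);
            }
            dense[i] = idx;
        }
        return dense;
    }

    /** Number of distinct values in an array produced by toDenseIndices. */
    static int countDistinct(int[] dense) {
        int max = -1;
        for (int v : dense) {
            max = Math.max(max, v);
        }
        return max + 1;
    }
}
```

The remapped array and its distinct count could then be passed as the variable and its number of values. The remapping costs one extra pass and a hash lookup per element, so it only pays off when the raw value range is much larger than the number of distinct values.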

I believe this approach improves further when calculating MI between several variables at once (the counting operations can be condensed, especially for the target). The implementation I use is one that interfaces with WEKA.

Finally, it might be even more efficient to take the log out of the summations. But I am not sure whether the log or the power will take more computation within the loop. This is done by:
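As a rough illustration of sharing the target counts across several candidate variables, here is a sketch under stated assumptions: it is not the WEKA-interfacing implementation mentioned above, the names (`MultiMiSketch`, `computeMIs`) are hypothetical, and the result is left in nats (natural log).

```java
// Hypothetical sketch: MI of several candidate variables against one shared target.
final class MultiMiSketch {

    /**
     * @param candidates candidates[c][i] is the value of candidate c for sample i (0..avals[c]-1)
     * @param avals      number of values each candidate can take
     * @param t          target variable, values in 0..tvals-1
     * @param tvals      number of values the target can take
     * @return MI in nats of each candidate against t
     */
    static double[] computeMIs(int[][] candidates, int[] avals, int[] t, int tvals) {
        int n = t.length;

        // Target counts are computed once and reused for every candidate.
        long[] tcounts = new long[tvals];
        for (int tv : t) {
            tcounts[tv]++;
        }

        double[] result = new double[candidates.length];
        for (int c = 0; c < candidates.length; c++) {
            int[] a = candidates[c];
            long[] acounts = new long[avals[c]];
            long[][] cross = new long[avals[c]][tvals];
            for (int i = 0; i < n; i++) {
                acounts[a[i]]++;
                cross[a[i]][t[i]]++;
            }
            double sum = 0;
            for (int av = 0; av < avals[c]; av++) {
                for (int tv = 0; tv < tvals; tv++) {
                    long xy = cross[av][tv];
                    if (xy != 0) {
                        // (n|x,y|) / (|x||y|), computed in double to avoid long overflow
                        double ratio = ((double) n * xy) / ((double) acounts[av] * tcounts[tv]);
                        sum += xy * Math.log(ratio);
                    }
                }
            }
            result[c] = sum / n;
        }
        return result;
    }
}
```

For example, with target `{0, 0, 1, 1}`, a candidate identical to the target yields ln 2, and the candidate `{0, 1, 0, 1}` yields 0.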

  • Apply a * log(b) = log(b^a)
  • Move the log outside the summations using log(a) + log(b) = log(ab)

and gives:

$$I(X;Y) = N \log \prod_{x}\prod_{y} \left(\frac{n\,|x,y|}{|x|\,|y|}\right)^{|x,y|}$$
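One caveat with the log-of-product form is that the running product can overflow a double once the counts get large. The sketch below compares the two forms on precomputed counts; it is an illustration only, the names (`LogOutsideSketch`, `miSumOfLogs`, `miLogOfProduct`) are hypothetical, and results are in nats.

```java
// Hypothetical sketch comparing the sum-of-logs and log-of-product forms.
final class LogOutsideSketch {

    /** N * sum over cells of |x,y| * log(n|x,y| / (|x||y|)). */
    static double miSumOfLogs(long[][] cross, long[] xCounts, long[] yCounts, long n) {
        double sum = 0;
        for (int x = 0; x < xCounts.length; x++) {
            for (int y = 0; y < yCounts.length; y++) {
                if (cross[x][y] != 0) {
                    double ratio = ((double) n * cross[x][y]) / ((double) xCounts[x] * yCounts[y]);
                    sum += cross[x][y] * Math.log(ratio);
                }
            }
        }
        return sum / n;
    }

    /** N * log of the product over cells of (n|x,y| / (|x||y|))^|x,y|. One log, many pows. */
    static double miLogOfProduct(long[][] cross, long[] xCounts, long[] yCounts, long n) {
        double product = 1.0;
        for (int x = 0; x < xCounts.length; x++) {
            for (int y = 0; y < yCounts.length; y++) {
                if (cross[x][y] != 0) {
                    double ratio = ((double) n * cross[x][y]) / ((double) xCounts[x] * yCounts[y]);
                    // Can overflow to Infinity (or underflow to 0) for large counts.
                    product *= Math.pow(ratio, cross[x][y]);
                }
            }
        }
        return Math.log(product) / n;
    }
}
```

On a small table such as `{{3, 1}, {1, 3}}` the two methods agree, but with a cell count of 1,000,000 and a ratio of 2 the product becomes `Infinity`, so the sum-of-logs form is the safer default for large transaction counts.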


I am not a mathematician, but...

There is just a bunch of floating-point calculations here. A mathematician may be able to reduce this to fewer calculations; try Math SE.

Meanwhile, you should be able to use a static final double for Math.log(EPSILON).

Your problem may not be a single call, but the volume of data on which this calculation has to be performed. That problem is better solved by throwing more hardware at it.
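A minimal sketch of that suggestion, assuming the same EPSILON value as in the question; the class name `MathUtilsSketch` and the constant name `LOG_EPSILON` are hypothetical:

```java
// Hypothetical variant of the question's MathUtils with the log of EPSILON cached.
final class MathUtilsSketch {
    static final double EPSILON = 0.000001;

    // Computed once at class initialisation instead of on every call with value == 0.
    static final double LOG_EPSILON = Math.log(EPSILON);

    static double logWithoutNaN(double value) {
        if (value == 0) {
            return LOG_EPSILON;
        } else if (value < 0) {
            return 0;
        }
        return Math.log(value);
    }
}
```

This only saves work on the zero-value branch, so the gain depends on how often a probability of exactly 0 occurs in the data.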
