The fastest way to calculate the minimum Euclidean distance between two matrices containing high-dimensional vectors

I started with a similar question in another thread, but there I focused on how to use OpenCV. Since I could not achieve what I originally wanted, I will ask here exactly what I want.

I have two matrices. Matrix a is 2782x128 and matrix b is 4000x128, both holding unsigned char values. The values are stored row by row in a single array. For each vector in a, I need the index of the vector in b with the smallest Euclidean distance.

OK, my current code for this is:

#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <cstdio>
#include <math.h>
#include <climits>
#include <time.h>
#include <sys/timeb.h>
#include <iostream>
#include <fstream>
#include "main.h"

using namespace std;

int main(int argc, char* argv[])
{
    int a_size;
    unsigned char* a = NULL;
    read_matrix(&a, a_size, "matrixa");
    int b_size;
    unsigned char* b = NULL;
    read_matrix(&b, b_size, "matrixb");

    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency(&liPerfFreq);
    QueryPerformanceCounter(&liStart);

    int* indexes = NULL;
    min_distance_loop(&indexes, b, b_size, a, a_size);

    QueryPerformanceCounter(&liEnd);
    cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

    if (a) delete[] a;
    if (b) delete[] b;
    if (indexes) free(indexes); // allocated with malloc, so free, not delete[]
    return 0;
}

void read_matrix(unsigned char** matrix, int& matrix_size, char* matrixPath)
{
    FILE* pFile = fopen(matrixPath, "r");
    fscanf(pFile, "%d", &matrix_size);
    *matrix = new unsigned char[matrix_size * 128];
    for (int i = 0; i < matrix_size * 128; ++i) {
        unsigned int matPtr;
        fscanf(pFile, "%u", &matPtr);
        (*matrix)[i] = (unsigned char)matPtr; // was matrix[i], a bug
    }
    fclose(pFile);
}

void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    *indexes = (int*)malloc(a_size * sizeof(int));
    int dataIndex = 0;
    int vocIndex = 0;
    int min_distance;
    int distance;
    int multiply;
    unsigned char* dataPtr;
    unsigned char* vocPtr;
    for (int i = 0; i < a_size; ++i) {
        min_distance = INT_MAX; // was LONG_MAX, but distance is an int
        for (int j = 0; j < b_size; ++j) {
            distance = 0;
            dataPtr = &a[dataIndex];
            vocPtr = &b[vocIndex];
            for (int k = 0; k < descrSize; ++k) {
                multiply = *dataPtr++ - *vocPtr++;
                distance += multiply * multiply;
                // If the distance already exceeds the current minimum, exit early
                if (distance > min_distance)
                    break;
            }
            if (distance < min_distance) {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}

I have attached files with sample matrices:

matrixa matrixb

I use windows.h to measure elapsed time, so if you want to test the code on a platform other than Windows, just replace the windows.h header and change the way the elapsed time is computed.
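If it helps, here is a portable sketch using std::chrono (C++11) in place of QueryPerformanceCounter; the loop inside is just a stand-in workload where the real min_distance_loop call would go:

```cpp
#include <chrono>
#include <iostream>

// Returns the wall-clock seconds spent in the timed region.
double time_seconds() {
    auto start = std::chrono::steady_clock::now();
    // ... call min_distance_loop(&indexes, b, b_size, a, a_size) here ...
    volatile long long sink = 0;
    for (long long i = 0; i < 100000; ++i) sink += i; // placeholder workload
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

steady_clock is monotonic, so it is the right choice for benchmarking (system_clock can jump if the wall clock is adjusted).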

This code takes about 0.5 seconds on my computer. The problem is that I have Matlab code that does the same thing in 0.05 seconds. In my experiments I obtain several matrices like a every second, so 0.5 seconds is far too much.

Now, the Matlab code that computes this:

 aa=sum(a.*a,2);
 bb=sum(b.*b,2);
 ab=a*b';
 d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
 [minz index]=min(d,[],2);

OK, the Matlab code uses the identity (x-a)^2 = x^2 + a^2 - 2xa.
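As a sanity check of that identity, here is a minimal C++ sketch (the function names are mine, not from the question) showing that the directly computed squared distance equals |x|^2 + |y|^2 - 2*dot(x,y):

```cpp
#include <vector>
#include <cstddef>

// Direct squared Euclidean distance: sum of squared component differences.
int dist2_direct(const std::vector<int>& x, const std::vector<int>& y) {
    int d = 0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        int diff = x[k] - y[k];
        d += diff * diff;
    }
    return d;
}

// Same value via the expansion |x|^2 + |y|^2 - 2*dot(x,y).
int dist2_expanded(const std::vector<int>& x, const std::vector<int>& y) {
    int xx = 0, yy = 0, xy = 0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        xx += x[k] * x[k];
        yy += y[k] * y[k];
        xy += x[k] * y[k];
    }
    return xx + yy - 2 * xy;
}
```

The payoff of the expansion is that every |a_i|^2 and |b_j|^2 can be precomputed once, leaving only the dot products a*b', which is exactly the kind of dense matrix product that BLAS/MKL is extremely good at.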

So my next attempt was to do the same thing. I wrote my own code performing the same calculations, but it took 1.2 seconds.

Then I tried to use different external libraries. The first attempt was Eigen:

 const int descrSize = 128;
 MatrixXi a(a_size, descrSize);
 MatrixXi b(b_size, descrSize);
 MatrixXi ab(a_size, b_size);

 unsigned char* dataPtr = matrixa;
 for (int i = 0; i < nframes; ++i)
     for (int j = 0; j < descrSize; ++j)
         a(i, j) = (int)*dataPtr++;

 unsigned char* vocPtr = matrixb;
 for (int i = 0; i < vocabulary_size; ++i)
     for (int j = 0; j < descrSize; ++j)
         b(i, j) = (int)*vocPtr++;

 ab = a * b.transpose();
 // Note: cwiseProduct returns an expression; its result must be used.
 // A bare a.cwiseProduct(a); has no effect, so fold it into the sums:
 MatrixXi aa = a.cwiseProduct(a).rowwise().sum();
 MatrixXi bb = b.cwiseProduct(b).rowwise().sum();
 MatrixXi d = (aa.replicate(1, vocabulary_size) + bb.transpose().replicate(nframes, 1) - 2 * ab).cwiseAbs();

 int* index = (int*)malloc(nframes * sizeof(int));
 for (int i = 0; i < nframes; ++i)
     d.row(i).minCoeff(&index[i]);

This Eigen code takes approximately 1.2 seconds, essentially all of it in the line ab = a * b.transpose();

I also wrote similar code using OpenCV, and there the cost of ab = a * b.transpose(); alone was 0.65 seconds.

So it is very annoying that Matlab can do the same thing so quickly and I cannot in C++! Of course, being able to run my experiments would be great, but I think the lack of knowledge is what really annoys me. How can I achieve the same performance as in Matlab? Any kind of solution is welcome: any external library (free if possible), loop unrolling, template tricks, SSE instructions (I know they exist), caching tricks. As I said, my main goal is to increase my knowledge so that I can write this kind of code with better performance.

Thanks in advance

EDIT: More code, as suggested by David Hamman. I cast the arrays to int before doing any calculations. Here is the code:

 void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
 {
     const int descrSize = 128;
     int* a_int;
     int* b_int;

     LARGE_INTEGER liStart;
     LARGE_INTEGER liEnd;
     LARGE_INTEGER liPerfFreq;
     QueryPerformanceFrequency(&liPerfFreq);
     QueryPerformanceCounter(&liStart);

     a_int = (int*)malloc(a_size * descrSize * sizeof(int));
     b_int = (int*)malloc(b_size * descrSize * sizeof(int));
     for (int i = 0; i < descrSize * a_size; ++i)
         a_int[i] = (int)a[i];
     for (int i = 0; i < descrSize * b_size; ++i)
         b_int[i] = (int)b[i];

     QueryPerformanceCounter(&liEnd);
     cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

     *indexes = (int*)malloc(a_size * sizeof(int));
     int dataIndex = 0;
     int vocIndex = 0;
     int min_distance;
     int distance;
     int multiply;
     int* dataPtr;
     int* vocPtr;
     for (int i = 0; i < a_size; ++i) {
         min_distance = INT_MAX; // was LONG_MAX, but distance is an int
         for (int j = 0; j < b_size; ++j) {
             distance = 0;
             dataPtr = &a_int[dataIndex];
             vocPtr = &b_int[vocIndex];
             for (int k = 0; k < descrSize; ++k) {
                 multiply = *dataPtr++ - *vocPtr++;
                 distance += multiply * multiply;
                 // If the distance already exceeds the current minimum, exit early
                 if (distance > min_distance)
                     break;
             }
             if (distance < min_distance) {
                 min_distance = distance;
                 (*indexes)[i] = j;
             }
             vocIndex += descrSize;
         }
         dataIndex += descrSize;
         vocIndex = 0;
     }
     free(a_int); // avoid leaking the temporary int copies
     free(b_int);
 }

The whole process now takes 0.6 seconds, and the casting loops at the beginning take only 0.001 seconds. Maybe I did something wrong?

EDIT2: What about Eigen? Whenever I look for external libraries, everyone talks about Eigen and its speed. Did I do something wrong? Here is some simple Eigen code showing that it is not that fast. Maybe I am missing some configuration or some flag, or...

 MatrixXd A = MatrixXd::Random(1000, 1000);
 MatrixXd B = MatrixXd::Random(1000, 500);
 MatrixXd X;
 X = A * B;

This code takes about 0.9 seconds.

3 answers

As you noticed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05 seconds. That amounts to a rate of 57 GFLOPS, which shows it uses not only vectorization but also multithreading. With Eigen, you can enable multithreading by compiling with OpenMP support (-fopenmp with gcc). On my 5-year-old computer (2.66GHz Core2), using floats and 4 threads, your product takes about 0.053s with OpenMP and 0.16s without, so there must be something wrong with your compilation flags. To summarize, to get the best out of Eigen:

  • compile in 64-bit mode
  • use floats (double is twice as slow owing to vectorization)
  • enable OpenMP
  • if your CPU has hyperthreading, either disable it or set the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important; otherwise performance will be very bad!)
  • if you have other tasks running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
  • use the most recent compiler you can; GCC, clang and ICC are best, MSVC is usually slower.
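To illustrate the multithreading point outside of Eigen: each row of a is independent, so the outer loop of the original code parallelizes with a single OpenMP pragma. A sketch under the question's data layout (the function name is mine); compile with -fopenmp, and without that flag the pragma is simply ignored and the code still runs correctly in serial:

```cpp
#include <climits>
#include <vector>

// Each row of `a` is searched independently, so the outer loop can be
// split across threads with no synchronization: thread-local i, distinct
// writes to indexes[i].
void min_distance_omp(std::vector<int>& indexes,
                      const unsigned char* a, int a_size,
                      const unsigned char* b, int b_size,
                      int descrSize) {
    indexes.assign(a_size, -1);
    #pragma omp parallel for
    for (int i = 0; i < a_size; ++i) {
        int min_distance = INT_MAX;
        const unsigned char* pa = &a[i * descrSize];
        for (int j = 0; j < b_size; ++j) {
            const unsigned char* pb = &b[j * descrSize];
            int distance = 0;
            for (int k = 0; k < descrSize; ++k) {
                int diff = int(pa[k]) - int(pb[k]);
                distance += diff * diff;
            }
            if (distance < min_distance) {
                min_distance = distance;
                indexes[i] = j;
            }
        }
    }
}
```

With 2782 independent rows and a handful of cores, the work per thread is large enough that scheduling overhead is negligible.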

One thing that is definitely hurting you in your C++ code is that it performs a boatload of char-to-int conversions. And by boatload, I mean up to 2*2782*4000*128 char-to-int conversions. Those char-to-int conversions are slow, very slow.

You can reduce this to (2782 + 4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to hold the integer equivalents of char* a and char* b. Work with these int* arrays rather than with the char* arrays.

Another problem might be your use of int versus long. I do not work on Windows, so this may not apply. On the machines I work on, int is 32 bits and long is 64 bits. 32 bits is more than enough, because 255*255*128 < 256*256*128 = 2^23.
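That bound can even be checked at compile time. Assuming the question's 128-dimensional unsigned char vectors, the worst-case squared distance has every component differing by the maximum 255:

```cpp
#include <climits>

// Worst case: all 128 components differ by 255.
constexpr long long kMaxSquaredDistance = 128LL * 255 * 255;

static_assert(kMaxSquaredDistance == 8323200, "128 * 255^2");
static_assert(kMaxSquaredDistance < (1LL << 23), "strictly below 2^23 = 8388608");
static_assert(kMaxSquaredDistance <= INT_MAX, "a 32-bit signed int is enough");
```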

So that is obviously not the problem.

What is striking is that this code does not compute the huge 2782 by 4000 array that the Matlab code creates. What is even more striking is that Matlab is most likely doing this with doubles rather than ints, and it is still beating the pants off the C/C++ code.

One big problem is cache. The 4000*128 array b is far too big for the level-1 cache, and you iterate over that big array 2782 times. Your code is spending too much time waiting for memory. To overcome this problem, work with smaller chunks of the b array so that your code stays in the level-1 cache as long as possible.

Another problem is the optimization if (distance>min_distance) break;. I suspect it is actually a de-optimization. Having an if test inside your innermost loop is a bad idea. Blast through that inner product as fast as possible. Other than some wasted computations, there is no harm in getting rid of this test. Sometimes it is better to perform apparently unneeded calculations if doing so lets you remove a branch from an innermost loop. This is one of those cases. You might even solve your problem just by eliminating this test. Try doing that.
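A sketch of what the de-branched version looks like (helper names are mine): accumulate all 128 terms unconditionally, and compare against the running minimum only once per candidate vector. The branch-free inner loop is also what lets the compiler auto-vectorize:

```cpp
#include <climits>

// Branch-free inner product of differences: no early exit, so the
// compiler can unroll and vectorize the loop.
int squared_distance(const unsigned char* x, const unsigned char* y, int n) {
    int d = 0;
    for (int k = 0; k < n; ++k) {
        int diff = int(x[k]) - int(y[k]);
        d += diff * diff;
    }
    return d;
}

// Compare against min_distance once per candidate row, not once per byte.
int best_index(const unsigned char* a_row, const unsigned char* b,
               int b_size, int n) {
    int best = -1, min_distance = INT_MAX;
    for (int j = 0; j < b_size; ++j) {
        int d = squared_distance(a_row, &b[j * n], n);
        if (d < min_distance) { min_distance = d; best = j; }
    }
    return best;
}
```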

Returning to the cache problem: you need to get rid of that branch so that you can split the operations on matrices a and b into smaller chunks, chunks of no more than 256 rows at a time. That number of rows of 128 unsigned chars fits into one of the two modern L1 caches of an Intel chip. Since 250 divides 4000, look into logically partitioning the b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do it in those small chunks. You can add the if (distance>min_distance) break; back in, but do it at the chunk level rather than at the byte level.
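A sketch of that chunking scheme (the chunk size is a parameter here and the names are mine): stream over b in blocks of rows and keep per-row running minima across blocks, so each block of b stays hot in L1 while all rows of a pass over it:

```cpp
#include <climits>
#include <vector>

// Process b in chunks of `chunk` rows. Each chunk is revisited by every
// row of a before moving on, so the chunk stays resident in L1 cache.
void min_distance_blocked(std::vector<int>& indexes,
                          const unsigned char* a, int a_size,
                          const unsigned char* b, int b_size,
                          int descrSize, int chunk) {
    indexes.assign(a_size, -1);
    std::vector<int> best(a_size, INT_MAX); // running minima across chunks
    for (int j0 = 0; j0 < b_size; j0 += chunk) {
        int j1 = (j0 + chunk < b_size) ? j0 + chunk : b_size;
        for (int i = 0; i < a_size; ++i) {
            const unsigned char* pa = &a[i * descrSize];
            for (int j = j0; j < j1; ++j) {
                const unsigned char* pb = &b[j * descrSize];
                int d = 0;
                for (int k = 0; k < descrSize; ++k) { // branch-free inner loop
                    int diff = int(pa[k]) - int(pb[k]);
                    d += diff * diff;
                }
                if (d < best[i]) { best[i] = d; indexes[i] = j; }
            }
        }
    }
}
```

For the question's sizes, chunk = 250 gives the 16 chunks of b suggested above.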

You should be able to beat Matlab, because it is almost certainly working with doubles, while you can work with unsigned chars and ints.


Matrix multiplication usually uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiplication algorithm that works on data stored that way.

Your matrix is already stored transposed. By transposing it back into normal order and then using a normal matrix multiplication, you are absolutely killing performance.

Write your own matrix multiplication loop that swaps the order of the indices in the second matrix (which in effect transposes it, without actually moving anything around or ruining cache behavior), and pass your compiler whatever options it has for enabling automatic vectorization.
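Concretely: since both matrices store their 128-component vectors as contiguous rows, a*b' is just the dot product of row i of a with row j of b, so both inner accesses are sequential and no transpose is ever materialized. A sketch with int inputs (names are mine):

```cpp
#include <vector>
#include <cstddef>

// ab[i*b_rows + j] = dot(row i of a, row j of b).
// Both rows are contiguous in memory, so every inner-loop access is
// sequential; no explicit transpose of b is needed.
std::vector<int> row_dot_products(const int* a, int a_rows,
                                  const int* b, int b_rows, int cols) {
    std::vector<int> ab(std::size_t(a_rows) * b_rows, 0);
    for (int i = 0; i < a_rows; ++i)
        for (int j = 0; j < b_rows; ++j) {
            int s = 0;
            for (int k = 0; k < cols; ++k)
                s += a[i * cols + k] * b[j * cols + k];
            ab[i * b_rows + j] = s;
        }
    return ab;
}
```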


Source: https://habr.com/ru/post/926291/

