I started a similar question in another thread , but then I focused on how to use OpenCV. Unable to achieve what I originally wanted, I will ask here exactly what I want.
I have two matrices. The matrix a is 2782x128, and the matrix b is 4000x128, both unsigned char values. Values โโare stored in one array. For each vector in I need the index of the vector in b with the closest Euclidean distance.
Ok, now my code for this is:
#include <windows.h> #include <stdlib.h> #include <stdio.h> #include <cstdio> #include <math.h> #include <time.h> #include <sys/timeb.h> #include <iostream> #include <fstream> #include "main.h" using namespace std; void main(int argc, char* argv[]) { int a_size; unsigned char* a = NULL; read_matrix(&a, a_size,"matrixa"); int b_size; unsigned char* b = NULL; read_matrix(&b, b_size,"matrixb"); LARGE_INTEGER liStart; LARGE_INTEGER liEnd; LARGE_INTEGER liPerfFreq; QueryPerformanceFrequency( &liPerfFreq ); QueryPerformanceCounter( &liStart ); int* indexes = NULL; min_distance_loop(&indexes, b, b_size, a, a_size); QueryPerformanceCounter( &liEnd ); cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl; if (a) delete[]a; if (b) delete[]b; if (indexes) delete[]indexes; return; } void read_matrix(unsigned char** matrix, int& matrix_size, char* matrixPath) { ofstream myfile; float f; FILE * pFile; pFile = fopen (matrixPath,"r"); fscanf (pFile, "%d", &matrix_size); *matrix = new unsigned char[matrix_size*128]; for (int i=0; i<matrix_size*128; ++i) { unsigned int matPtr; fscanf (pFile, "%u", &matPtr); matrix[i]=(unsigned char)matPtr; } fclose (pFile); } void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size) { const int descrSize = 128; *indexes = (int*)malloc(a_size*sizeof(int)); int dataIndex=0; int vocIndex=0; int min_distance; int distance; int multiply; unsigned char* dataPtr; unsigned char* vocPtr; for (int i=0; i<a_size; ++i) { min_distance = LONG_MAX; for (int j=0; j<b_size; ++j) { distance=0; dataPtr = &a[dataIndex]; vocPtr = &b[vocIndex]; for (int k=0; k<descrSize; ++k) { multiply = *dataPtr++-*vocPtr++; distance += multiply*multiply; // If the distance is greater than the previously calculated, exit if (distance>min_distance) break; } // if distance smaller if (distance<min_distance) { min_distance = distance; (*indexes)[i] = j; } vocIndex+=descrSize; } dataIndex+=descrSize; vocIndex=0; } }
And attached files with model matrices.
matrixa matrixb
I use windows.h to calculate the consumption time, so if you want to test the code on a platform other than windows, just change the title of windows.h and change the way you calculate the consumption time.
This code on my computer is about 0.5 seconds. The problem is that I have another code in Matlab that does the same thing in 0.05 seconds. In my experiments, I get several matrices, such as a matrix, every second, so 0.5 seconds is too much.
Now matlab code to calculate this value:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b'; d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab)); [minz index]=min(d,[],2);
Ok Matlab code uses this (xa) ^ 2 = x ^ 2 + a ^ 2 - 2ab.
So my next attempt was to do the same. I deleted my own code to do the same calculations, but that was 1.2 seconds.
Then I tried to use different external libraries. The first attempt was Eigen:
const int descrSize = 128; MatrixXi a(a_size, descrSize); MatrixXi b(b_size, descrSize); MatrixXi ab(a_size, b_size); unsigned char* dataPtr = matrixa; for (int i=0; i<nframes; ++i) { for (int j=0; j<descrSize; ++j) { a(i,j)=(int)*dataPtr++; } } unsigned char* vocPtr = matrixb; for (int i=0; i<vocabulary_size; ++i) { for (int j=0; j<descrSize; ++j) { b(i,j)=(int)*vocPtr ++; } } ab = a*b.transpose(); a.cwiseProduct(a); b.cwiseProduct(b); MatrixXi aa = a.rowwise().sum(); MatrixXi bb = b.rowwise().sum(); MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs2(); int* index = NULL; index = (int*)malloc(nframes*sizeof(int)); for (int i=0; i<nframes; ++i) { d.row(i).minCoeff(&index[i]); }
This Eigen code costs 1.2 approximately for a line that says: ab = a * b.transpose ();
Also used a similar code using opencv, and the cost ab = a * b.transpose (); was 0.65 seconds.
So, it is very annoying that Matlab is able to do the same thing so quickly, and I can not in C ++! Of course, the opportunity to launch my experiment will be wonderful, but I think that lack of knowledge is what really annoys me. How to achieve the same performance as in Matlab? Welcome to any permission. I mean that any external library (if possible, free), the cycle of unfolding things, template things, SSE instructions (I know that they exist), caching things. As I said, my main goal is to increase my knowledge for the ability of codes to think in such a way, with higher performance.
Thanks in advance
EDIT: More code suggested by David Hamman. I did arrays before int before doing any calculations. Here is the code:
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size) { const int descrSize = 128; int* a_int; int* b_int; LARGE_INTEGER liStart; LARGE_INTEGER liEnd; LARGE_INTEGER liPerfFreq; QueryPerformanceFrequency( &liPerfFreq ); QueryPerformanceCounter( &liStart ); a_int = (int*)malloc(a_size*descrSize*sizeof(int)); b_int = (int*)malloc(b_size*descrSize*sizeof(int)); for(int i=0; i<descrSize*a_size; ++i) a_int[i]=(int)a[i]; for(int i=0; i<descrSize*b_size; ++i) b_int[i]=(int)b[i]; QueryPerformanceCounter( &liEnd ); cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl; *indexes = (int*)malloc(a_size*sizeof(int)); int dataIndex=0; int vocIndex=0; int min_distance; int distance; int multiply; int* dataPtr; int* vocPtr; for (int i=0; i<a_size; ++i) { min_distance = LONG_MAX; for (int j=0; j<b_size; ++j) { distance=0; dataPtr = &a_int[dataIndex]; vocPtr = &b_int[vocIndex]; for (int k=0; k<descrSize; ++k) { multiply = *dataPtr++-*vocPtr++; distance += multiply*multiply;
The whole process is now 0.6, and the casting loops at the beginning are 0.001 seconds. Maybe I did something wrong?
EDIT2: Anything about Eigen? When I look for external libraries, they always talk about Eigen and their speed. Did I do something wrong? Here's a simple code using Eigen that shows it, not so fast. Perhaps I am missing some configuration or some kind of flag, or ...
MatrixXd A = MatrixXd::Random(1000, 1000); MatrixXd B = MatrixXd::Random(1000, 500); MatrixXd X;
This code is about 0.9 seconds.