I determined using the "random pause" method that the first two lines of the following snippet consume most of the time:
```cpp
cv::Mat pixelSubMue = pixel - vecMatMue[kk_real];           // ca. 35.5 %
cv::Mat pixelTemp = pixelSubMue * covInvRef;                // ca. 58.1 %
cv::multiply(pixelSubMue, pixelTemp, pixelTemp);            // ca. 0 %
cv::Scalar sumScalar = cv::sum(pixelTemp);                  // ca. 3.2 %
double cost = sumScalar.val[0] * 0.5 + vecLogTerm[kk_real]; // ca. 3.2 %
```
- `vecMatMue` is a `std::vector<cv::Mat>` (I know there are many copies here, but using pointers does not noticeably improve performance)
- `pixelSubMue` is a row vector, `cv::Mat(1, 3, CV_64FC1)`
- `covInvRef` is a reference to a matrix, `cv::Mat(3, 3, CV_64FC1)`
- `vecLogTerm` is a `std::vector<double>`
The code snippet above is in the inner loop, which is called millions of times.
Question: Is there a way to speed up this operation?
Edit: Thanks for the comments! I have now measured the time inside the program; the percentages above show how much time is spent on each line. The measurements were done in release mode. I took six measurements, each executing the code millions of times.
I should probably also mention that the `std::vector` objects do not affect performance: I verified this by replacing them with constant objects.
Edit 2: I also implemented the algorithm using the OpenCV C API. The corresponding lines are:
```cpp
cvSub(pixel, vecPMatMue[kk], pixelSubMue);                  // ca. 24.4 %
cvMatMulAdd(pixelSubMue, vecPMatFCovInv[kk], 0, pixelTemp); // ca. 39.0 %
cvMul(pixelSubMue, pixelTemp, pixelSubMue);                 // ca. 22.0 %
CvScalar sumScalar = cvSum(pixelSubMue);                    // ca. 14.6 %
cost = sumScalar.val[0] * 0.5 + vecFLogTerm[kk];            // ca. 0.0 %
```
The C++ implementation needs about 3100 ms for the same input, whereas the C implementation needs only about 2050 ms (both numbers are the total time to execute the fragment millions of times). Still, I prefer my C++ implementation, as it is easier for me to read (other ugly changes were needed to make the code work with the C API).
Edit 3: I rewrote the code without any library calls for the actual calculations:
```cpp
capacity_t mue0 = meanRef.at<double>(0, 0);
capacity_t mue1 = meanRef.at<double>(0, 1);
capacity_t mue2 = meanRef.at<double>(0, 2);

// Only the upper triangle is read, since the inverse covariance is symmetric.
capacity_t sigma00 = covInvRef.at<double>(0, 0);
capacity_t sigma01 = covInvRef.at<double>(0, 1);
capacity_t sigma02 = covInvRef.at<double>(0, 2);
capacity_t sigma11 = covInvRef.at<double>(1, 1);
capacity_t sigma12 = covInvRef.at<double>(1, 2);
capacity_t sigma22 = covInvRef.at<double>(2, 2);

mue0 = p0 - mue0;
mue1 = p1 - mue1;
mue2 = p2 - mue2;

capacity_t pt0 = mue0 * sigma00 + mue1 * sigma01 + mue2 * sigma02;
capacity_t pt1 = mue0 * sigma01 + mue1 * sigma11 + mue2 * sigma12;
capacity_t pt2 = mue0 * sigma02 + mue1 * sigma12 + mue2 * sigma22;

mue0 *= pt0;
mue1 *= pt1;
mue2 *= pt2;

capacity_t cost = (mue0 + mue1 + mue2) / 2.0 + vecLogTerm[kk_real];
```

Now the calculation over all the pixels needs only about 150 ms!