I am new to multithreaded programming, so please excuse me if the following seems obvious. I am adding multithreading to an image-processing program, and the speedup is not quite what I expected.
Currently, I am getting a 4x speedup on a machine with 4 hyper-threaded physical cores (8 logical processors), so I would like to know whether that much speedup is all I should expect. The only explanation I can think of is that the two hyper-threads of the same physical core share the same memory bus.
Being new to multithreading, it's not entirely clear to me whether this would be considered a memory-I/O-bound program, given that all memory is allocated in RAM (I understand that my OS's virtual memory manager is the one deciding when to page this supposedly heap-allocated memory in or out). My machine has 16 GB of RAM, if that helps decide whether paging/swapping could be an issue.
I wrote a test program demonstrating a serial case and two parallel cases using QThreadPool and tbb::parallel_for.
The current program, as you can see, performs no real work other than setting the target image from black to white, and it is deliberately written this way to establish a baseline before any real operations are applied to the image.
I am attaching the program in the hope that someone can explain whether hoping for roughly 8x speedup is a lost cause for this kind of processing algorithm. Please note that I am not interested in other kinds of optimization, such as SIMD, since my real concern is not just making it faster, but making it faster using pure multithreading, without resorting to SSE or processor-cache-level optimizations.
#include <iostream> #include <sys/time.h> #include <vector> #include <QThreadPool> #include "/usr/local/include/tbb/tbb.h" #define LOG(x) (std::cout << x << std::endl) struct col4 { unsigned char r, g, b, a; }; class QTileTask : public QRunnable { public: void run() { for(uint32_t y = m_yStart; y < m_yEnd; y++) { int rowStart = y * m_width; for(uint32_t x = m_xStart; x < m_xEnd; x++) { int index = rowStart + x; m_pData[index].r = 255; m_pData[index].g = 255; m_pData[index].b = 255; m_pData[index].a = 255; } } } col4* m_pData; uint32_t m_xStart; uint32_t m_yStart; uint32_t m_xEnd; uint32_t m_yEnd; uint32_t m_width; }; struct TBBTileTask { void operator()() { for(uint32_t y = m_yStart; y < m_yEnd; y++) { int rowStart = y * m_width; for(uint32_t x = m_xStart; x < m_xEnd; x++) { int index = rowStart + x; m_pData[index].r = 255; m_pData[index].g = 255; m_pData[index].b = 255; m_pData[index].a = 255; } } } col4* m_pData; uint32_t m_xStart; uint32_t m_yStart; uint32_t m_xEnd; uint32_t m_yEnd; uint32_t m_width; }; struct TBBCaller { TBBCaller(std::vector<TBBTileTask>& t) : m_tasks(t) {} TBBCaller(TBBCaller& e, tbb::split) : m_tasks(e.m_tasks) {} void operator()(const tbb::blocked_range<size_t>& r) const { for (size_t i=r.begin();i!=r.end();++i) m_tasks[i](); } std::vector<TBBTileTask>& m_tasks; }; inline double getcurrenttime( void ) { timeval t; gettimeofday(&t, NULL); return static_cast<double>(t.tv_sec)+(static_cast<double>(t.tv_usec) / 1000000.0); } char* getCmdOption(char ** begin, char ** end, const std::string & option) { char ** itr = std::find(begin, end, option); if (itr != end && ++itr != end) { return *itr; } return 0; } bool cmdOptionExists(char** begin, char** end, const std::string& option) { return std::find(begin, end, option) != end; } void baselineSerial(col4* pData, int resolution) { double t = getcurrenttime(); for(int y = 0; y < resolution; y++) { int rowStart = y * resolution; for(int x = 0; x < resolution; x++) { int index = rowStart + x; 
pData[index].r = 255; pData[index].g = 255; pData[index].b = 255; pData[index].a = 255; } } LOG((getcurrenttime() - t) * 1000 << " ms. (Serial)"); } void baselineParallelQt(col4* pData, int resolution, uint32_t tileSize) { double t = getcurrenttime(); QThreadPool pool; for(int y = 0; y < resolution; y+=tileSize) { for(int x = 0; x < resolution; x+=tileSize) { uint32_t xEnd = std::min<uint32_t>(x+tileSize, resolution); uint32_t yEnd = std::min<uint32_t>(y+tileSize, resolution); QTileTask* t = new QTileTask; t->m_pData = pData; t->m_xStart = x; t->m_yStart = y; t->m_xEnd = xEnd; t->m_yEnd = yEnd; t->m_width = resolution; pool.start(t); } } pool.waitForDone(); LOG((getcurrenttime() - t) * 1000 << " ms. (QThreadPool)"); } void baselineParallelTBB(col4* pData, int resolution, uint32_t tileSize) { double t = getcurrenttime(); std::vector<TBBTileTask> tasks; for(int y = 0; y < resolution; y+=tileSize) { for(int x = 0; x < resolution; x+=tileSize) { uint32_t xEnd = std::min<uint32_t>(x+tileSize, resolution); uint32_t yEnd = std::min<uint32_t>(y+tileSize, resolution); TBBTileTask t; t.m_pData = pData; t.m_xStart = x; t.m_yStart = y; t.m_xEnd = xEnd; t.m_yEnd = yEnd; t.m_width = resolution; tasks.push_back(t); } } TBBCaller caller(tasks); tbb::task_scheduler_init init; tbb::parallel_for(tbb::blocked_range<size_t>(0, tasks.size()), caller); LOG((getcurrenttime() - t) * 1000 << " ms. 
(TBB)"); } int main(int argc, char** argv) { int resolution = 1; uint32_t tileSize = 64; char * pResText = getCmdOption(argv, argv + argc, "-r"); if (pResText) { resolution = atoi(pResText); } char * pTileSizeChr = getCmdOption(argv, argv + argc, "-b"); if (pTileSizeChr) { tileSize = atoi(pTileSizeChr); } if(resolution > 16) resolution = 16; resolution = resolution << 10; uint32_t tileCount = resolution/tileSize + 1; tileCount *= tileCount; LOG("Resolution: " << resolution << " Tile Size: "<< tileSize); LOG("Tile Count: " << tileCount); uint64_t pixelCount = resolution*resolution; col4* pData = new col4[pixelCount]; memset(pData, 0, sizeof(col4)*pixelCount); baselineSerial(pData, resolution); memset(pData, 0, sizeof(col4)*pixelCount); baselineParallelQt(pData, resolution, tileSize); memset(pData, 0, sizeof(col4)*pixelCount); baselineParallelTBB(pData, resolution, tileSize); delete[] pData; return 0; }