C++, Qt - splitting a QByteArray as quickly as possible

I am trying to split a massive QByteArray containing UTF-8 encoded text (using a space as the separator) with the best possible performance. I found that I get much better results if I first convert the array to a QString. I tried QString::split with a regexp, but the performance was horrific. This code turned out to be faster:

    QMutex mutex;

    QSet<QString> split(QByteArray body) {
        QSet<QString> slova;
        QString s_body = QTextCodec::codecForMib(106)->toUnicode(body);
        QString current;
        for (int i = 0; i < body.size(); i++) {
            if (s_body[i] == '\r' || s_body[i] == '\n' || s_body[i] == '\t' || s_body[i] == ' ') {
                mutex.lock();
                slova.insert(current);
                mutex.unlock();
                current.clear();
                current.reserve(40);
            } else {
                current.push_back(s_body[i]);
            }
        }
        return slova;
    }

The word list is now a QSet<QString>, but I could use std::set or any other container. This code should find how many unique words there are in the array, with the best possible performance.

Unfortunately, this code is far from fast enough. I want to squeeze the absolute maximum out of this.

Using callgrind, I found that the most gluttonous internal functions were:

    QString::reallocData (18% absolute cost)
    QString::append (10% absolute cost)
    QString::operator= (8% absolute cost)
    QTextCodec::toUnicode (8% absolute cost)

Obviously, this is due to the memory allocations arising from push_back. What is the best way to solve this problem? It does not have to be a Qt solution - pure C or C++ is also acceptable.

+5
3 answers

Minimize the number of copies you need to make. Keep the input buffer in UTF-8, and don't store std::string or QString in your collection; instead, create a small class that references the existing UTF-8 data:

    #include <QString>

    class stringref {
        const char *start;
        size_t length;
    public:
        stringref(const char *start, const char *end);
        operator QString() const;
        bool operator<(const stringref& other) const;
    };

This encapsulates a substring of the UTF-8 input. You need to make sure it does not outlive the input string; you could guarantee that with clever use of std::shared_ptr, but if the code is sufficiently self-contained, it should be tractable enough to reason about the lifetime.
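For illustration, here is a minimal stdlib-only sketch of the shared_ptr idea (the owned_ref name and the make_ref helper are my own, not part of the answer): keeping a shared_ptr to the buffer inside each reference guarantees the data outlives every substring that points into it.

```cpp
#include <cassert>
#include <memory>
#include <string>

// A reference type that shares ownership of the buffer it points into,
// so the substring can never outlive the underlying data.
class owned_ref {
    std::shared_ptr<const std::string> owner; // keeps the buffer alive
    const char *start;
    size_t length;
public:
    owned_ref(std::shared_ptr<const std::string> o, size_t pos, size_t len)
        : owner(std::move(o)), start(owner->data() + pos), length(len) {}
    std::string str() const { return std::string(start, length); }
};

owned_ref make_ref() {
    auto buf = std::make_shared<const std::string>("hello world");
    return owned_ref(buf, 6, 5); // refers to "world"; the buffer stays alive
}
```

The trade-off is one extra shared_ptr copy per stored reference; when the parsing is a single self-contained function, the bare-pointer stringref above avoids even that.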

We can construct it from a pair of pointers into our UTF-8 data, and convert it to a QString when we want to use it:

    stringref::stringref(const char *start, const char *end)
        : start(start), length(end - start)
    {}

    stringref::operator QString() const
    {
        return QString::fromUtf8(start, length);
    }

You need to define operator< so you can use it in a std::set:

    #include <cstring>

    bool stringref::operator<(const stringref& other) const
    {
        return length == other.length
            ? std::strncmp(start, other.start, length) < 0
            : length < other.length;
    }

Note that we compare by length before dereferencing the pointers, to reduce the cache impact of the comparisons.


Now we can write a split method:

    #include <set>
    #include <QByteArray>

    std::set<stringref> split(const QByteArray& a)
    {
        std::set<stringref> words;
        // start and end of the buffer
        const auto s = a.data(), e = s + a.length();
        // current word
        auto w = s;
        for (auto p = s; p <= e; ++p) {
            switch (*p) {
            default:
                break;
            case ' ': case '\r': case '\n': case '\t': case '\0':
                if (w != p)
                    words.insert({w, p});
                w = p + 1;
            }
        }
        return words;
    }

The algorithm is much like yours, with the addition of the w != p test so that runs of whitespace don't insert empty words.


Test it, and time the important bit:

    #include <QDebug>
    #include <chrono>

    int main()
    {
        QByteArray body{"foo bar baz\n foo again\nbar again "};
        // make it a million times longer
        for (int i = 0; i < 20; ++i)
            body.append(body);

        using namespace std::chrono;
        const auto start = high_resolution_clock::now();
        auto words = split(body);
        const auto end = high_resolution_clock::now();

        qDebug() << "Split" << body.length() << "bytes in"
                 << duration_cast<duration<double>>(end - start).count() << "seconds";
        for (auto&& word: words)
            qDebug() << word;
    }

I get:

    Split 35651584 bytes in 1.99142 seconds
    "bar"
    "baz"
    "foo"
    "again"

Compiling with -O3 reduced the time to 0.6188 seconds, so be sure to ask the compiler for help!

If this is still not fast enough, it may be time to parallelize the task. You will want to split the string into approximately equal lengths, but advance each split point to the next space so that no word straddles two threads' work. Each thread builds its own set of results, and the reduction step merges the result sets. I will not offer a complete solution for this, as it is a separate question in itself.
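As a rough illustration of that scheme, here is a minimal stdlib sketch (std::string and std::set standing in for the Qt types; chunk_bounds and parallel_split are illustrative names I've chosen, not part of the answer): each boundary is advanced to the next space, each thread fills its own set with no locking, and the sets are merged at the end.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Split [0, text.size()) into nthreads chunks, advancing each boundary
// to the next space so no word straddles two chunks.
static std::vector<size_t> chunk_bounds(const std::string& text, size_t nthreads) {
    std::vector<size_t> bounds{0};
    for (size_t t = 1; t < nthreads; ++t) {
        size_t b = text.size() * t / nthreads;
        while (b < text.size() && text[b] != ' ') ++b; // advance to a separator
        bounds.push_back(b);
    }
    bounds.push_back(text.size());
    return bounds;
}

// Each thread builds its own set; the reduction step merges them.
static std::set<std::string> parallel_split(const std::string& text, size_t nthreads) {
    auto bounds = chunk_bounds(text, nthreads);
    std::vector<std::set<std::string>> partial(nthreads);
    std::vector<std::thread> threads;
    for (size_t t = 0; t < nthreads; ++t)
        threads.emplace_back([&, t] {
            size_t i = bounds[t];
            while (i < bounds[t + 1]) {
                size_t j = text.find(' ', i);
                if (j == std::string::npos || j > bounds[t + 1]) j = bounds[t + 1];
                if (j > i) partial[t].insert(text.substr(i, j - i));
                i = j + 1;
            }
        });
    for (auto& th : threads) th.join();
    std::set<std::string> result; // reduction: union of the per-thread sets
    for (auto& p : partial) result.insert(p.begin(), p.end());
    return result;
}
```

Note this sketch only splits on spaces, for brevity; a real version would test all the separator characters, and would use the stringref type above instead of copying substrings.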

+2

The first thing I would do in your shoes is change the code so that it does not lock and unlock the QMutex for every single item it inserts into the QSet - that is pure overhead. Either lock the QMutex only once, at the start of the loop, and unlock it after the loop ends; or better yet, insert into a QSet that is not accessible from any other thread, so that no QMutex needs to be locked at all.

With that out of the way, the second thing to do is eliminate as many heap allocations as possible. Ideally you do all the parsing without allocating or freeing any dynamic memory; my implementation below does this (well, almost - the unordered_set may do some internal allocations, but it probably won't). On my computer (a 2.7 GHz Mac Mini) I measure a processing speed of about 11 million words per second, using the Project Gutenberg ASCII text of Moby Dick as test input.

Note that due to the backward-compatible design of the UTF-8 encoding, this program works equally well with UTF-8 or ASCII input.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unordered_set>

    // Loads a text file from disk into an in-memory array.
    // Expected contents of the file are ASCII or UTF8 (doesn't matter which).
    // Note that this function appends a space to the end of the returned array.
    // That way the parsing function doesn't have to include a special case,
    // since it is guaranteed that every word in the array ends with whitespace.
    static char * LoadFile(const char * fileName, unsigned long * retArraySizeBytes)
    {
        char * ret = NULL;
        *retArraySizeBytes = 0;

        FILE * fpIn = fopen(fileName, "r");
        if (fpIn)
        {
            if (fseek(fpIn, 0L, SEEK_END) == 0)
            {
                const unsigned long fileSizeBytes = ftell(fpIn);
                const unsigned long arraySizeBytes = *retArraySizeBytes = fileSizeBytes+1;  // +1 because I'm going to append a space to the end
                rewind(fpIn);

                ret = new char[arraySizeBytes];
                if (fread(ret, 1, fileSizeBytes, fpIn) == fileSizeBytes)
                {
                    ret[fileSizeBytes] = ' ';  // appending a space allows me to simplify the parsing step
                }
                else
                {
                    perror("fread");
                    delete [] ret;
                    ret = NULL;
                }
            }
            else perror("fseek");

            fclose(fpIn);
        }
        return ret;
    }

    // Gotta provide our own equality-testing function, otherwise unordered_set will just compare pointer values
    struct CharPointersEqualityFunction : public std::binary_function<char *, char *, bool>
    {
        bool operator() (char * s1, char * s2) const {return strcmp(s1, s2) == 0;}
    };

    // Gotta provide our own hashing function, otherwise unordered_set will just hash the pointer values
    struct CharPointerHashFunction
    {
        int operator() (char * str) const
        {
            // djb2 by Dan Bernstein -- fast enough and simple enough
            unsigned long hash = 5381;
            int c;
            while((c = *str++) != 0) hash = ((hash << 5) + hash) + c;
            return (int) hash;
        }
    };

    typedef std::unordered_set<char *, CharPointerHashFunction, CharPointersEqualityFunction> CharPointerUnorderedSet;

    int main(int argc, char ** argv)
    {
        if (argc < 2)
        {
            printf("Usage: ./split_words filename\n");
            return 10;
        }

        unsigned long arraySizeBytes;
        char * buf = LoadFile(argv[1], &arraySizeBytes);
        if (buf == NULL)
        {
            printf("Unable to load input file [%s]\n", argv[1]);
            return 10;
        }

        CharPointerUnorderedSet set;
        set.reserve(100000);  // trying to size (set) big enough that no reallocations will be necessary during the parse

        struct timeval startTime;
        gettimeofday(&startTime, NULL);

        // The actual parsing of the text is done here
        int wordCount = 0;
        char * wordStart = buf;
        char * wordEnd   = buf;
        char * bufEnd    = &buf[arraySizeBytes];
        while(wordEnd < bufEnd)
        {
            if (isspace(*wordEnd))
            {
                if (wordEnd > wordStart)
                {
                    *wordEnd = '\0';
                    set.insert(wordStart);
                    wordCount++;
                }
                wordStart = wordEnd+1;
            }
            wordEnd++;
        }

        struct timeval endTime;
        gettimeofday(&endTime, NULL);

        unsigned long long startTimeMicros = (((unsigned long long)startTime.tv_sec)*1000000) + startTime.tv_usec;
        unsigned long long endTimeMicros   = (((unsigned long long)  endTime.tv_sec)*1000000) +   endTime.tv_usec;
        double secondsElapsed = ((double)(endTimeMicros-startTimeMicros))/1000000.0;

        printf("Parsed %i words (%zu unique words) in %f seconds, aka %.0f words/second\n", wordCount, set.size(), secondsElapsed, wordCount/secondsElapsed);
        //for (const auto& elem: set) printf("word=[%s]\n", elem);

        delete [] buf;
        return 0;
    }
+2

Your biggest cost, as you suspected, is push_back causing frequent reallocations as you append one character at a time. Why not scan ahead, then append all the data at once using QString::mid():

    slova.insert(s_body.mid(beginPos, i - beginPos));

where beginPos holds the index of the beginning of the current substring. Instead of appending each character to current and then inserting it into slova, the copy happens all at once. After copying the substring, scan forward for the next valid (non-separator) character and set beginPos to that index.

In the (rough) code:

    QString s_body = ...

    // beginPos tells us the index of the current substring we are working
    // with. -1 means the previous character was a separator
    int beginPos = -1;
    for (...)
    {
        // basically your if statement provided in the question, as a function
        if (isSeparator(s_body[i]))
        {
            // ignore double white spaces, etc.
            if (beginPos != -1)
            {
                mutex.lock();
                slova.insert(s_body.mid(beginPos, i - beginPos));
                mutex.unlock();
                beginPos = -1;  // we are back between words
            }
        }
        else if (beginPos == -1)
        {
            // if beginPos is not valid and we are not on a separator, we
            // are at the start of a new substring
            beginPos = i;
        }
    }

This approach will drastically reduce your heap-allocation overhead and eliminate the calls to QString::push_back().

One final note: QByteArray also provides a mid() function, so you could skip the conversion to QString entirely and work directly with the byte array.
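That works because the ASCII separator bytes can never occur inside a UTF-8 multi-byte sequence, so splitting on raw bytes is safe. A minimal sketch of the byte-level approach, with std::string standing in for QByteArray (substr playing the role of mid(); split_bytes is an illustrative name, not from the answer):

```cpp
#include <cassert>
#include <set>
#include <string>

// Split a UTF-8 byte buffer on ASCII whitespace without ever converting
// to a 16-bit string; substr() plays the role of QByteArray::mid().
std::set<std::string> split_bytes(const std::string& body) {
    std::set<std::string> words;
    size_t begin = std::string::npos; // npos: previous byte was a separator
    for (size_t i = 0; i <= body.size(); ++i) {
        char c = (i < body.size()) ? body[i] : ' '; // sentinel space at the end
        if (c == ' ' || c == '\r' || c == '\n' || c == '\t') {
            if (begin != std::string::npos)
                words.insert(body.substr(begin, i - begin));
            begin = std::string::npos;
        } else if (begin == std::string::npos) {
            begin = i; // start of a new word
        }
    }
    return words;
}
```

This still copies each word once (into the substring), so it sits between the original push_back version and the zero-copy stringref approach in the first answer.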

+2
