Minimize the number of copies you need to complete. Store the input buffer in UTF-8 and do not store std::string or QString in your collection; instead, create a small class to reference existing UTF-8 data:
#include <QString> class stringref { const char *start; size_t length; public: stringref(const char *start, const char *end); operator QString() const; bool operator<(const stringref& other) const; };
This can encapsulate the UTF-8 input substring. You need to make sure that it does not survive the input string; you can do this with the clever use of std::shared_ptr , but if the code is sufficiently autonomous, then it should be convenient enough to talk about the lifetime.
We can build it from a pair of pointers into our UTF-8 data and convert it to QString when we want to use it:
stringref::stringref(const char *start, const char *end) : start(start), length(end-start) {} stringref::operator QString() const { return QString::fromUtf8(start, length); }
You need to define operator< so you can use it in std::set .
#include <cstring> bool stringref::operator<(const stringref& other) const { return length == other.length ? std::strncmp(start, other.start, length) < 0 : length < other.length; }
Note that we sort by length to dereference pointers to reduce cache impact.
Now we can write a split method:
#include <set> #include <QByteArray> std::set<stringref> split(const QByteArray& a) { std::set<stringref> words; // start and end const auto s = a.data(), e = s + a.length(); // current word auto w = s; for (auto p = s; p <= e; ++p) { switch (*p) { default: break; case ' ': case '\r': case '\n': case '\t': case '\0': if (w != p) words.insert({w, p}); w = p+1; } } return words; }
The algorithm is pretty yours, with the addition of the test w!=p , so that spaces are not counted.
Test it, and time is an important bit:
#include <QDebug> #include <chrono> int main() { QByteArray body{"foo bar baz\n foo again\nbar again "}; // make it a million times longer for (int i = 0; i < 20; ++i) body.append(body); using namespace std::chrono; const auto start = high_resolution_clock::now(); auto words = split(body); const auto end = high_resolution_clock::now(); qDebug() << "Split" << body.length() << "bytes in" << duration_cast<duration<double>>(end - start).count() << "seconds"; for (auto&& word: words) qDebug() << word; }
I get:
Split 35651584 bytes in 1.99142 seconds
"bar"
Baz
"Foo"
"Again"
Compiling with -O3 reduced the time to 0.6188 seconds, so be sure to ask the compiler for help!
If this is not fast enough, it may be time to look at the parallelization of the task. You will want to split the string into approximately equal lengths, but move on to the next space so that no work restricts the work of two threads. Each thread must create its own set of results, and the reduction step is to combine the result sets. I will not offer a complete solution for this, as this is a different issue in itself.