How to quickly analyze spatially separated floats in C ++?

I have a file with millions of lines, each line has 3 floats, separated by spaces. It takes a long time to read a file, so I tried to read them using memory-mapped files, only to find out that the problem is not with I / O speed, but with parsing speed.

My current parsing is to take a stream (called a file) and do the following

float x,y,z; file >> x >> y >> z; 

Someone recommended using Boost.Spirit, but I could not find a simple tutorial to explain how to use it.

I am trying to find a simple and efficient way to parse a string that looks like this:

 "134.32 3545.87 3425" 

I will be very grateful for the help. I wanted to use strtok to split it, but I don't know how to convert strings to float, and I'm not quite sure if this is the best way.

I do not mind if the decision is Boost or not. I do not mind if this is not the most effective solution, but I am sure that you can double the speed.

Thanks in advance.

+31
c ++ parsing boost-spirit
Jul 04 '13 at 8:10
source share
7 answers

If conversion is the throat of a bottle (which is entirely possible) you should start by using the various options in the standard. Logically, one would expect that they would be very close, but in practice, they are not always:

  • You have already determined that std::ifstream is too slow.

  • Converting memory mapped data to std::istringstream almost certainly not a good solution; you will first have to create a row that will copy all the data.

  • Writing your own streambuf to read directly from memory, without copying (or using the deprecated std::istrstream ) may be the solution, though if the problem is really conversion ... it still uses the same conversion procedures.

  • You can always try fscanf or scanf in your memory card stream. Depending on the implementation, they may be faster than various istream implementations.

  • Probably faster than any of them should use strtod . No tokenize is needed for this: strtod skips the leading empty space (including '\n' ), and has an out parameter where it puts the address of the first character is not readable. The final condition is a bit complicated, your loop should probably look something like this:

     char * begin;  // Set to point to the mmap'ed data ...
                     // You'll also have to arrange for a '\ 0'
                     // to follow the data.  This is probably
                     // the most difficult issue.
     char * end;
     errno = 0;
     double tmp = strtod (begin, & end);
     while (errno == 0 && end! = begin) {
         // do whatever with tmp ...
         begin = end;
         tmp = strtod (begin, & end);
     }

If none of them is fast enough, you will need to consider the actual data. It probably has some additional restrictions, which means that you can write a conversion procedure that is faster than more general ones; for example strtod should handle both fixed and scientific, and it should be 100% accurate, even if there are 17 significant digits. It must also be locale specific. All this added complexity, which means the added code to execute. But be careful: writing an efficient and correct conversion procedure, even for a limited set of input data, is not trivial; you really need to know what you are doing.

EDIT:

Just out of curiosity, I did some tests. In addition to the above solutions, I wrote a simple custom converter that processes only a fixed point (without scientific), with a maximum of five digits after the decimal point, and the value before the decimal should match int :

 double convert( char const* source, char const** endPtr ) { char* end; int left = strtol( source, &end, 10 ); double results = left; if ( *end == '.' ) { char* start = end + 1; int right = strtol( start, &end, 10 ); static double const fracMult[] = { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 }; results += right * fracMult[ end - start ]; } if ( endPtr != nullptr ) { *endPtr = end; } return results; } 

(If you really use this, you should definitely add some processing error. It was just quickly overturned for experimental purposes, read the test file that I generated, and nothing else.)

The interface exactly matches strtod to simplify coding.

I conducted tests in two environments (on different machines, so the absolute values ​​of any time do not matter). I got the following results:

On Windows 7 compiled with VC 11 (/ O2):

 Testing Using fstream directly (5 iterations)... 6.3528e+006 microseconds per iteration Testing Using fscan directly (5 iterations)... 685800 microseconds per iteration Testing Using strtod (5 iterations)... 597000 microseconds per iteration Testing Using manual (5 iterations)... 269600 microseconds per iteration 

On Linux 2.6.18, compiled with g ++ 4.4.2 (-O2, IIRC):

 Testing Using fstream directly (5 iterations)... 784000 microseconds per iteration Testing Using fscanf directly (5 iterations)... 526000 microseconds per iteration Testing Using strtod (5 iterations)... 382000 microseconds per iteration Testing Using strtof (5 iterations)... 360000 microseconds per iteration Testing Using manual (5 iterations)... 186000 microseconds per iteration 

In all cases, I read 554,000 lines, each of which is 3 randomly generated floating point in the range [0...10000) .

The most striking thing is the huge difference between fstream and fscan under Windows (and the relatively small difference between fscan and strtod ). Secondly, how much a simple user-defined conversion function wins on both platforms. The necessary error handling will slow things down a little, but the difference is still significant. I expected some improvements since it does not handle many things, standard conversion procedures (e.g. scientific format, very, very small numbers, Inf and NaN, i18n, etc.), but not many.

+18
Jul 04 '13 at 8:45
source share

UPDATE

Since Spirit X3 is available for testing, I updated the tests. In the meantime, I used Nonius to get statistically valid tests.

All graphs available below are available online interactive.

Benchmark CMake + testdata project used for github: https://github.com/sehe/bench_float_parsing

enter image description here

Summary:

Spirit parsers are faster. If you can use C ++ 14, consider the experimental version of Spirit X3:

enter image description here

The above measures are used using memory mapped files. Using IOstream, it will be slower compared to the board,

enter image description here

but not as slow as scanf using C / POSIX FILE* function calls:

enter image description here




The following are parts of the OLD response




I implemented the Spirit version and conducted a comparative test compared to the other suggested answers.

Here are my results, all tests are performed on the same input element (515Mb input.txt ). See below for exact specifications.

i1jkm.png
(wall clock time in seconds, average 2+ runs)

To my own surprise, Boost Spirit is the fastest and most elegant:

  • processes / reports errors
  • supports +/- Inf and NaN and variable spaces.
  • no problem finding the end of the input (unlike the other mmap answer)
  • looks good:

     bool ok = phrase_parse(f,l, // source iterators (double_ > double_ > double_) % eol, // grammar blank, // skipper data); // output attribute 

Note that boost::spirit::istreambuf_iterator was unspeakably slower (15s +). Hope this helps!

Benchmark Details

All parsing is done on vector of struct float3 { float x,y,z; } struct float3 { float x,y,z; } .

Create an input file with

 od -f -A none --width=12 /dev/urandom | head -n 11000000 

The result is a 515 MB file containing data of type

  -2627.0056 -1.967235e-12 -2.2784738e+33 -1.0664798e-27 -4.6421956e-23 -6.917859e+20 -1.1080849e+36 2.8909405e-33 1.7888695e-12 -7.1663235e+33 -1.0840628e+36 1.5343362e-12 -3.1773715e-17 -6.3655537e-22 -8.797282e+31 9.781095e+19 1.7378472e-37 63825084 -1.2139188e+09 -5.2464635e-05 -2.1235992e-38 3.0109424e+08 5.3939846e+30 -6.6146894e-20 

Compile the program using:

 g++ -std=c++0x -g -O3 -isystem -march=native test.cpp -o test -lboost_filesystem -lboost_iostreams 

Measure the wall clock time with

 time ./test < input.txt 

Environment:

  • Linux Desktop 4.2.0-42-General # 49-Ubuntu SMP x86_64
  • Intel (R) Core (TM) i7-3770K CPU @ 3.50 GHz
  • 32GiB RAM

Full code

The full code for the old test is in the change history of this post , the newest version on github

+44
Jul 05
source share

Before you begin, make sure that this is the slow part of your application and get test wiring around it so you can measure improvements.

boost::spirit for me, in my opinion, would be redundant. Try fscanf

 FILE* f = fopen("yourfile"); if (NULL == f) { printf("Failed to open 'yourfile'"); return; } float x,y,z; int nItemsRead = fscanf(f,"%f %f %f\n", &x, &y, &z); if (3 != nItemsRead) { printf("Oh dear, items aren't in the right format.\n"); return; } 
+13
Jul 04 '13 at 8:12
source share

I would look at this related post Using ifstream to read a float or How do I want tokenize a string in C ++ especially messages related to the C ++ String Toolkit Library. I used C strtok, C ++ streams, Boost tokenizer and the best of them for convenience and use - C ++ String Toolkit Library.

+2
Jul 07 '16 at 19:49
source share

using C will be the fastest solution. Split into tokens using strtok and then convert to float with strtof . Or, if you know the exact format, use fscanf .

0
Jul 04 '13 at 8:13
source share

a crucial solution would be to throw more cores into the problem, spawning multiple threads. If the only bottleneck is the processor, you can reduce the runtime by two threads (on multi-core processors)

some other tips:

  • try not to parse library functions like boost and / or std. They are overblown with error checking conditions, and most of the processing time is spent on these checks. In just a couple of conversions, they are beautiful, but fail when it comes to handling millions of values. If you already know that your data is well formatted, you can write (or find) a custom optimized C function that only performs data conversion

  • use a large memory buffer (say 10 MB) in which you load chunks of your file and convert there

  • divide et impera: divide your problem into smaller ones: pre-process your file, make it a single, single float, split each line into "." character and convert integers instead of float and then combine two integers to create a floating point number

0
Jul 04 '13 at 8:16
source share

I believe that one of the most important rules in string processing is "read only once, one character at a time." I think it is always easier, faster and more reliable.

I made a simple test program to show how simple it is. My test says that this code is 40% faster than the strtod version.

 #include <iostream> #include <sstream> #include <iomanip> #include <stdlib.h> #include <math.h> #include <time.h> #include <sys/time.h> using namespace std; string test_generate(size_t n) { srand((unsigned)time(0)); double sum = 0.0; ostringstream os; os << std::fixed; for (size_t i=0; i<n; ++i) { unsigned u = rand(); int w = 0; if (u > UINT_MAX/2) w = - (u - UINT_MAX/2); else w = + (u - UINT_MAX/2); double f = w / 1000.0; sum += f; os << f; os << " "; } printf("generated %f\n", sum); return os.str(); } void read_float_ss(const string& in) { double sum = 0.0; const char* begin = in.c_str(); char* end = NULL; errno = 0; double f = strtod( begin, &end ); sum += f; while ( errno == 0 && end != begin ) { begin = end; f = strtod( begin, &end ); sum += f; } printf("scanned %f\n", sum); } double scan_float(const char* str, size_t& off, size_t len) { static const double bases[13] = { 0.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, 10000000.0, 100000000.0, 1000000000.0, 10000000000.0, 100000000000.0, 1000000000000.0, }; bool begin = false; bool fail = false; bool minus = false; int pfrac = 0; double dec = 0.0; double frac = 0.0; for (; !fail && off<len; ++off) { char c = str[off]; if (c == '+') { if (!begin) begin = true; else fail = true; } else if (c == '-') { if (!begin) begin = true; else fail = true; minus = true; } else if (c == '.') { if (!begin) begin = true; else if (pfrac) fail = true; pfrac = 1; } else if (c >= '0' && c <= '9') { if (!begin) begin = true; if (pfrac == 0) { dec *= 10; dec += c - '0'; } else if (pfrac < 13) { frac += (c - '0') / bases[pfrac]; ++pfrac; } } else { break; } } if (!fail) { double f = dec + frac; if (minus) f = -f; return f; } return 0.0; } void read_float_direct(const string& in) { double sum = 0.0; size_t len = in.length(); const char* str = in.c_str(); for (size_t i=0; i<len; ++i) { double f = scan_float(str, i, len); sum += f; } printf("scanned %f\n", sum); } int main() { const int n = 1000000; printf("count = %d\n", n); string in = test_generate(n); { struct timeval t1; gettimeofday(&t1, 0); printf("scan start\n"); read_float_ss(in); struct timeval t2; gettimeofday(&t2, 0); double elapsed = (t2.tv_sec - t1.tv_sec) * 1000000.0; elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0; printf("elapsed %.2fms\n", elapsed); } { struct timeval t1; gettimeofday(&t1, 0); printf("scan start\n"); read_float_direct(in); struct timeval t2; gettimeofday(&t2, 0); double elapsed = (t2.tv_sec - t1.tv_sec) * 1000000.0; elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0; printf("elapsed %.2fms\n", elapsed); } return 0; } 



Below is the console output from the i7 Mac Book Pro (compiled in Xcode 4.6).

 count = 1000000 generated -1073202156466.638184 scan start scanned -1073202156466.638184 elapsed 83.34ms scan start scanned -1073202156466.638184 elapsed 53.50ms 
0
Jul 04 '13 at 12:47 on
source share



All Articles