Reading huge amounts of data from a file and analyzing it efficiently. How can I improve performance for huge data?

I am reading huge data from a file:

//abc.txt

10 12 14 15 129 -12 14 -18 -900 -1234 145 12 13 12 32 68 51 76 -59 -025 ... etc.

I tried using atoi and strtok, but they take a lot of real time when the array is too large, and sscanf is also very slow.

How can I improve performance for huge data?

I use strtok for parsing. I am looking for a fast method to parse each line.

I read the file line by line, and then parse each line like this:

  char *ptr;
  ptr = strtok(str, " ");
  while (ptr != NULL) {
      int value1 = atoi(ptr);
      ptr = strtok(NULL, " ");
  }
  • Is there a quick way to parse a string into an int?
  • Is there an alternative approach that would be faster than the code above? I use atoi to convert char * to int.
  • Is there another, quicker method to convert char * to int?
+4
5 answers

To convert an ASCII string to an integer value, you cannot get much faster than atoi, but you may be able to speed things up by implementing an inline conversion function of your own. The version below advances the pointer past the scanned digits, so it does not match atoi's semantics, but it should help the parser's efficiency, as shown below. (There is deliberately no error checking, so add it if you need it.)

 #include <cctype>   // isdigit
 #include <cstring>  // strspn

 static inline int my_parsing_atoi(const char *&s) {
     if (s) {
         bool neg = false;
         int val = 0;
         if (*s == '-') {
             neg = true;
             ++s;
         }
         for (; isdigit(*s); ++s)
             val = 10 * val + (*s - '0');
         return neg ? -val : val;
     }
     return 0;
 }

 const char *p = input_line;  // input_line: the line just read from the file
 if (p) {
     p += strspn(p, " ");
     while (*p) {
         int value1 = my_parsing_atoi(p);
         // ... use value1 ...
         p += strspn(p, " ");
     }
 }

Make sure you profile your code correctly, so that you know your routine is compute-bound and not I/O-bound. In most cases you will be I/O-bound, and the suggestions below are ways to mitigate that.

If you use the C or C++ file reading routines, such as fread or fstream, you should get buffered reads that are already quite efficient, but you can try using basic OS calls, such as POSIX read, to read the file in large blocks at a time and speed up file reading. To get really fancy, you can read the file asynchronously while it is being processed, either using threads or aio_read. You can even use mmap, which removes some of the data-copying overhead, but if the file is extremely large you will need to manage the mapping, so that you munmap the parts of the file that have already been scanned and mmap in the next part to scan. A sketch of the block-reading approach follows.
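For example, here is a minimal sketch of reading the file in large blocks with POSIX read. The process_chunk callback is hypothetical, and a real version must carry over a number that is split across a block boundary:

 #include <fcntl.h>
 #include <unistd.h>

 enum { CHUNK = 1 << 20 };  // read 1 MiB at a time

 void read_in_blocks(const char *path) {
     int fd = open(path, O_RDONLY);
     if (fd < 0)
         return;
     static char buf[CHUNK];
     ssize_t n;
     while ((n = read(fd, buf, CHUNK)) > 0) {
         // process_chunk(buf, n);  // hypothetical: parse the numbers in this
         // block, remembering any number split across the block boundary
     }
     close(fd);
 }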

I benchmarked my parsing routine above against the OP's routine, using code that looked like this:

 clock_t before_real;
 clock_t after_real;
 struct tms before;
 struct tms after;

 std::vector<char *> numbers;
 make_numbers(numbers);

 before_real = times(&before);
 for (int i = 0; i < numbers.size(); ++i) {
     parse(numbers[i]);
 }
 after_real = times(&after);

 std::cout << "user: " << after.tms_utime - before.tms_utime << std::endl;
 std::cout << "real: " << after_real - before_real << std::endl;

The difference between real and user is that real is wall-clock time, while user is the CPU time actually spent running the process (so time during which the OS has context-switched away from it is not counted).

In my tests, my routine was almost twice as fast as the OP's routine (compiled with g++ -O3 on a 64-bit Linux system).

+3

You are looking in the wrong place. The problem is not the parsing, unless you are doing something really strange. On a modern multi-GHz processor, the cycles required for each line are tiny. What kills performance is physical I/O: a spinning disk is orders of magnitude slower than the CPU.

I also doubt that the problem is the physical reading of the file, since that will effectively be cached in the file system cache.

No: as samy.vilar suggests, the problem is almost certainly virtual memory:

... the array is too big ...

Use a system monitor / psinfo / top to watch your application. Almost certainly it is growing a large working set as it builds the array in memory, and your OS is paging it out to disk.

So forget about the reading as the problem. Your real problem is how to manipulate huge data sets in memory. The approaches here are:

  • Don't. Read the data in batches and process it batch by batch.
  • Use space-efficient storage (e.g. compact elements); see the sketch after this list.
  • Allocate more memory resources.
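For instance, a minimal sketch of the space-efficient-storage idea, assuming the values are known to fit in 16 bits (the element type and counts here are illustrative):

 #include <cstdint>
 #include <vector>

 int main() {
     // 2 bytes per element instead of 4 or 8: a 100-million-number data set
     // needs ~200 MB of working set instead of ~400 MB or ~800 MB.
     std::vector<int16_t> values;
     values.reserve(100000000);  // reserve up front to avoid reallocation spikes
     // ... fill with parsed numbers that are known to fit in int16_t ...
 }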

This is discussed a lot on SO.

+3

If your file is truly huge, then I/O is killing you, not parsing. Each time you read a line you make a system call, which can be quite expensive.

A more effective alternative is to use memory-mapped file I/O. If you are running on a POSIX system such as Linux, you can use the mmap call, which maps the whole file at once and returns a pointer to its location in memory. The memory manager then takes care of the file I/O, reading and evicting pages as the data is accessed through this pointer.

It will look something like this:

 #include <fcntl.h>
 #include <sys/mman.h>
 #include <sys/stat.h>

 int fd = open("abc.txt", O_RDONLY);
 struct stat sb;
 fstat(fd, &sb);  // the file size gives the length of the mapping
 char *ptr = (char *)mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

but I highly recommend that you read the man page and work out the best options for yourself. Once mapped, the buffer can be scanned in place, as sketched below.
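A sketch of scanning the mapped region in place. It reuses my_parsing_atoi from the first answer and assumes the file ends in whitespace, so the digit loop cannot run past the end of the mapping:

 #include <cctype>

 const char *s = ptr;
 const char *end = ptr + sb.st_size;
 while (s < end) {
     while (s < end && isspace(*s))  // skip whitespace between numbers
         ++s;
     if (s >= end)
         break;
     int value = my_parsing_atoi(s);  // advances s past the digits
     // ... use value ...
 }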

+1
source
  • If your file contains int numbers, you can use operator >>, but this solution is C++-only. Something like:

     std::fstream f("abc.txt");
     int value = 0;
     while (f >> value) {
         // ... use value ...
     }
  • If you convert your file to a list of binary numbers, you will have more room to improve performance: not only do you avoid parsing the numbers from a string into a type, you also gain other ways to access your data (e.g. using mmap). A sketch of the binary approach follows.
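A minimal sketch of the binary-file idea (the path and count parameters are illustrative): write the numbers out once as raw int32_t values, after which loading them is a single fread with no parsing at all:

 #include <cstddef>
 #include <cstdint>
 #include <cstdio>
 #include <vector>

 std::vector<int32_t> load_binary(const char *path, size_t count) {
     std::vector<int32_t> v(count);
     FILE *f = fopen(path, "rb");
     if (!f) {
         v.clear();
         return v;
     }
     size_t got = fread(v.data(), sizeof(int32_t), count, f);
     fclose(f);
     v.resize(got);  // keep only the elements actually read
     return v;
 }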

0

First of all, the general recommendation is to always profile, to verify that it is actually the conversion that is slow, and not something else, such as physically reading the file from disk.

You may be able to improve performance by writing your own minimal parsing function. strtok modifies the string, so it will not be optimally fast, and if you know that all the numbers are decimal integers and you don't need error checking, you can simplify the conversion a bit.

Here is some strtok-free code that can speed up the processing, if it is indeed the conversion and not (for example) the I/O that is the problem:

 void handle_one_number(int number) {
     // ....
 }

 void handle_numbers_in_buffer(char *buffer) {
     while (1) {
         while (*buffer != '\0' && isspace(*buffer))
             ++buffer;
         if (*buffer == '\0')
             return;
         int negative = 0;
         if (*buffer == '-') {
             negative = 1;
             ++buffer;
         }
         int number = 0;
         while (isdigit(*buffer)) {
             number = number * 10 + *buffer - '0';
             ++buffer;
         }
         if (negative)
             number = -number;
         handle_one_number(number);
     }
 }
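A possible driver for the routine above (a sketch; the answer does not show how the buffer was filled): read the whole file into one NUL-terminated buffer and hand it to handle_numbers_in_buffer:

 #include <cstdio>
 #include <vector>

 int main() {
     FILE *f = fopen("abc.txt", "rb");
     if (!f)
         return 1;
     fseek(f, 0, SEEK_END);
     long size = ftell(f);
     fseek(f, 0, SEEK_SET);
     std::vector<char> buf(size + 1);
     fread(buf.data(), 1, size, f);
     fclose(f);
     buf[size] = '\0';  // the parser relies on NUL termination
     handle_numbers_in_buffer(buf.data());
     return 0;
 }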

I actually went and ran some tests. I expected I/O to dominate, but it turns out that (with the usual caveat of “on my specific system, with my specific compiler”) parsing the numbers takes quite a lot of the time.

By changing the strtok version to my code above, I was able to improve the conversion time for 100 million numbers (with the text already in memory) from 5.2 seconds to about 1.1 seconds. When reading from a slow disk (a Caviar Green), I measured an improvement from 5.9 seconds to 3.5 seconds. When reading from an SSD, I measured an improvement from 5.8 to 1.8 seconds.

I also tried reading the file directly with while (fscanf(f, "%d", ....) == 1) ...., but that turned out to be much slower (10 seconds), probably because fscanf is thread-safe and the many calls incur more locking overhead.

(GCC 4.5.2 on Ubuntu 11.04 with -O2 optimization, several runs of each version, caches cleared between runs, i7 processor.)

0
