How fast can we make a particular tr?

I have to replace all the null bytes in a file with another character (I arbitrarily chose @ ), and I was surprised that tr '\00' '@' ran at about a quarter of the speed of gzip:

    $ pv < lawl | gzip > /dev/null
    ^C13MiB 0:00:04 [28.5MiB/s] [====>                            ] 17% ETA 0:00:18
    $ pv < lawl | tr '\00' '@' > /dev/null
    ^C58MiB 0:00:08 [7.28MiB/s] [==>                              ]  9% ETA 0:01:20

My real data file is 3 GB gzipped and takes 50 minutes with tr, and I need to do this to many such files, so it's not a purely academic problem. Note that reading from disk (a fast enough SSD here) or pv is not the bottleneck in either case: both gzip and tr use 100% CPU, and cat is much faster:

    $ pv < lawl | cat > /dev/null
     642MiB 0:00:00 [1.01GiB/s] [================================>] 100%

This code:

    #include <stdio.h>

    int main() {
        int ch;

        while ((ch = getchar()) != EOF) {
            if (ch == '\00') {
                putchar('@');
            } else {
                putchar(ch);
            }
        }
    }

compiled with clang -O3, runs a little faster:

    $ pv < lawl | ./stupidtr > /dev/null
    ^C52MiB 0:00:06 [ 8.5MiB/s] [=>                               ]  8% ETA 0:01:0

Compiling with gcc -O4 -mtune=native -march=native (4.8.4) gives comparable, perhaps very slightly faster, results. Adding -march=native to clang (Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)) produces an identical binary.

Presumably this is only faster because the generic processing code for tr substitutions gets replaced with constants and some of the checks can be compiled away. The LLVM IR (clang -S -O3 stupidtr.c) looks pretty good.

I figure gzip must be faster because it's using SIMD instructions or something like that. Is it possible to get this up to gzip speed?

Some specifics, in case they're relevant:

  • The file is CSV; the null byte can only occur in one specific field, but some of the other fields are variable-length, so you can't just seek to a fixed offset. Most lines have a null byte in that field. I suppose that means you could do a Boyer-Moore search for ,\00, if that helps (a rough sketch of that idea appears after this list). Once you've found a null byte, it's also guaranteed that there can't be another one for a hundred bytes or so.

  • A typical file is about 20 GiB uncompressed, but is stored bzip2-compressed on disk, if that matters.

  • You can parallelize if you want, though gzip manages with one core, so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OS X or on a cloud server with two vCPUs running Linux.

  • Both machines that I can work on have 16 GB of RAM.
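
To illustrate the search-and-skip idea from the first bullet, here is a rough sketch; it just uses memchr rather than a real Boyer-Moore, and the 100-byte skip and the helper name are arbitrary illustrative choices:

    /* Sketch only: find each null byte, replace it, then skip ahead,
       relying on the guarantee that nulls are at least ~100 bytes apart. */
    #include <string.h>

    static void replace_sparse_nuls(char *buf, size_t n)
    {
        char *p = buf, *end = buf + n;

        while (p < end && (p = memchr(p, '\0', (size_t)(end - p))) != NULL) {
            *p = '@';
            p += 100;   /* safe: no other null byte can occur this close */
        }
    }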

+7
performance c io
4 answers

Combining ideas from different answers with some extra bits, here is an optimized version:

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BUFFER_SIZE   16384
    #define REPLACE_CHAR  '@'

    int main(void) {
        /* define buffer as uint64_t to force alignment */
        /* make it one slot longer to allow for loop guard */
        uint64_t buffer[BUFFER_SIZE / 8 + 1];
        ssize_t size, chunk;
        uint64_t *p, *p_end;
        uint64_t rep8 = (uint8_t)REPLACE_CHAR * 0x0101010101010101ULL;

        while ((size = read(0, buffer, BUFFER_SIZE)) != 0) {
            if (size < 0) {
                if (errno == EINTR) continue;
                fprintf(stderr, "read error: %s\n", strerror(errno));
                return 1;
            }
            p = buffer;
            p_end = p + ((size + 7) >> 3);
            *p_end = 0ULL;   /* force a 0 at the end */

            for (;; p++) {
    #define LOWBITS  0x0101010101010101ULL
    #define HIGHBITS 0x8080808080808080ULL
                /* classic trick: m has 0x80 set in each byte of *p that is 0 */
                uint64_t m = ((*p - LOWBITS) & ~*p & HIGHBITS);
                if (m != 0) {
                    if (p >= p_end) break;   /* the zeroed guard slot stops the loop */
                    m |= m >> 1;             /* smear 0x80 into 0xFF per zero byte */
                    m |= m >> 2;
                    m |= m >> 4;
                    *p |= m & rep8;          /* overwrite the zero bytes with '@' */
                }
            }
            for (unsigned char *pc = (unsigned char *)buffer;
                 (chunk = write(1, pc, (size_t)size)) != size;
                 pc += chunk, size -= chunk) {
                if (chunk < 0) {
                    if (errno == EINTR) { chunk = 0; continue; }  /* retry, advance by 0 */
                    fprintf(stderr, "write error: %s\n", strerror(errno));
                    return 2;
                }
            }
        }
        return 0;
    }
+4

You need to read and write in large blocks for speed. (Even with a buffered I/O library such as stdio.h, the per-character cost of managing the buffer can be significant.) Something like:

    #include <unistd.h>

    int main( void )
    {
        char buffer[16384];
        int size, i;

        while ((size = read(0, buffer, sizeof buffer)) > 0) {
            for( i = 0; i < size; ++i ) {
                if (buffer[i] == '\0') {
                    buffer[i] = '@';
                    // optionally, i += 64; since
                    // "Once you've found a null byte, it's also guaranteed that there
                    //  can't be another one for a hundred bytes or so"
                }
            }
            write(1, buffer, size);
        }
    }

Naturally, compile with optimization so that the compiler can convert indexing to pointer arithmetic, if useful.

This version also lends itself to SIMD optimization if you still aren't meeting your speed targets (and a sufficiently smart compiler may auto-vectorize the for loop).
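
For instance, here is a minimal sketch of what an explicit SIMD version of the inner loop could look like with SSE2 intrinsics; the function name and the unaligned-load/scalar-tail handling are just illustrative choices, not a drop-in replacement for the code above:

    #include <emmintrin.h>   /* SSE2, available on any x86-64 */
    #include <stddef.h>

    static void replace_nul_sse2(unsigned char *buf, size_t n, unsigned char rep)
    {
        const __m128i zero = _mm_setzero_si128();
        const __m128i repv = _mm_set1_epi8((char)rep);
        size_t i = 0;

        for (; i + 16 <= n; i += 16) {
            __m128i v    = _mm_loadu_si128((const __m128i *)(buf + i));
            __m128i mask = _mm_cmpeq_epi8(v, zero);   /* 0xFF where byte == 0 */
            /* keep original bytes where mask is clear, insert rep where it is set */
            v = _mm_or_si128(_mm_andnot_si128(mask, v), _mm_and_si128(mask, repv));
            _mm_storeu_si128((__m128i *)(buf + i), v);
        }
        for (; i < n; i++)                            /* scalar tail */
            if (buf[i] == 0) buf[i] = rep;
    }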

Also, this code has no robust error handling. As @chqrlie mentions in the comments, you should retry when you get EINTR, and you should handle partial writes.

+4

Your code is incorrect because it doesn't check for end of file at the right place. This is a very common bug with do {} while loops. I recommend avoiding that construct entirely (except in macros that turn a sequence of statements into a single statement).
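
To illustrate the pitfall (this is my own tiny example, not the code from the question): the character gets used before EOF is tested, so the EOF value itself is processed once:

    #include <stdio.h>

    int main(void) {
        int c;

        do {
            c = getchar();
            putchar(c == '\0' ? '@' : c);   /* bug: also runs once with c == EOF */
        } while (c != EOF);
        return 0;
    }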

Also try telling glibc to skip the per-character thread-safety locking on the streams:

    #include <stdio.h>

    int main() {
        int c;

        while ((c = getchar_unlocked()) != EOF) {
            if (c == '\0')
                c = '@';
            putchar_unlocked(c);
        }
    }

You can also play with different buffer sizes; for example, try these before the while() loop:

    setvbuf(stdin, NULL, _IOFBF, 1024 * 1024);
    setvbuf(stdout, NULL, _IOFBF, 1024 * 1024);

This shouldn't make much difference if you use this utility as a filter through pipes, but may be more effective if you run it on files.

If you work with files, you could also mmap the file and use memchr to find the '\0' bytes, or even strchr, which may be faster provided you can ensure there is a '\0' at the end of the file (making it a proper C string).
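
Here is a minimal sketch of the mmap/memchr variant, assuming you edit the file in place; the minimal error handling and the hard-coded '@' are just for brevity:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDWR);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) return 1;

        char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) return 1;

        char *p = data, *end = data + st.st_size;
        while ((p = memchr(p, '\0', (size_t)(end - p))) != NULL)
            *p++ = '@';                 /* patch the mapped pages in place */

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }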

+3

First off, as others have noted, don't use getchar()/putchar(), or even any of the FILE-based functions such as fopen()/fread()/fwrite(). Use open()/read()/write() instead.

If the file is already uncompressed on disk, do not use pipes. If it's compressed, you do want to use a pipe, to eliminate an entire read/write cycle. If you decompress to disk and then replace the NUL characters disk-to-disk, the data path is disk -> memory/CPU -> disk -> memory/CPU -> disk. With a pipe, the path is disk -> memory/CPU -> disk. If you're disk-limited, that extra read/write cycle will roughly double the time it takes to process your gigabytes (or more) of data.

Another thing: given your I/O pattern and the amount of data you're moving (read the entire multi-GB file, write the entire file), the page cache only gets in your way. So use direct I/O. In C on Linux, something like this (headers and robust error checking omitted for clarity):

    #define CHUNK_SIZE ( 1024UL * 1024UL * 4UL )
    #define NEW_CHAR   '@'

    int main( int argc, char **argv )
    {
        /* page-aligned buffer */
        char *buf = valloc( CHUNK_SIZE );

        /* just set "in = 0" to read a stdin pipe */
        int in  = open( argv[ 1 ], O_RDONLY | O_DIRECT );
        int out = open( argv[ 2 ], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644 );

        for ( ;; )
        {
            ssize_t bytes = read( in, buf, CHUNK_SIZE );
            if ( bytes < 0 )
            {
                if ( errno == EINTR )
                {
                    continue;
                }
                break;
            }
            else if ( bytes == 0 )
            {
                break;
            }

            for ( int ii = 0; ii < bytes; ii++ )
            {
                if ( !buf[ ii ] )
                {
                    buf[ ii ] = NEW_CHAR;
                }
            }

            write( out, buf, bytes );
        }

        close( in );
        close( out );

        return( 0 );
    }

Crank the compiler optimizations all the way up. To use this code on real data you do need to check the results of the write() calls; direct I/O on Linux is a finicky beast. I've had to close a file opened with O_DIRECT and reopen it without direct I/O in order to write the last bytes of a file on Linux when the final chunk wasn't a multiple of a full page.
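
For illustration, a sketch of that workaround, with headers omitted as above: close the O_DIRECT descriptor and append the non-page-sized tail through a normal one. The names out_path, tail and tail_len are just placeholders:

    static int write_tail(const char *out_path, int out_direct,
                          const char *tail, size_t tail_len)
    {
        close(out_direct);                            /* done with direct I/O */
        int fd = open(out_path, O_WRONLY | O_APPEND); /* no O_DIRECT here */
        if (fd < 0) return -1;
        ssize_t n = write(fd, tail, tail_len);        /* ordinary buffered path */
        close(fd);
        return n == (ssize_t)tail_len ? 0 : -1;
    }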

If you want to go even faster, you can multithread the processing: one thread reads, one thread translates characters, and another thread writes. Use as many buffers, passed from thread to thread, as it takes to keep the slowest part of the process busy at all times.
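
As a sketch of the hand-off mechanics only (two threads and a small ring of buffers rather than the full three-thread version; all names and sizes are arbitrary, and write errors are ignored):

    #include <pthread.h>
    #include <unistd.h>

    #define NBUF  4
    #define BUFSZ (1 << 20)

    static struct { char data[BUFSZ]; ssize_t len; } slot[NBUF];
    static int head, tail, count;                /* ring-buffer state */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    static void *writer(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == 0) pthread_cond_wait(&not_empty, &lock);
            ssize_t len = slot[tail].len;
            char *data = slot[tail].data;
            pthread_mutex_unlock(&lock);

            if (len <= 0) return NULL;           /* end-of-stream sentinel */
            write(1, data, (size_t)len);         /* error handling omitted */

            pthread_mutex_lock(&lock);
            tail = (tail + 1) % NBUF;
            count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);
        }
    }

    int main(void)
    {
        pthread_t wr;
        pthread_create(&wr, NULL, writer, NULL);

        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == NBUF) pthread_cond_wait(&not_full, &lock);
            pthread_mutex_unlock(&lock);

            ssize_t n = read(0, slot[head].data, BUFSZ);
            for (ssize_t i = 0; i < n; i++)      /* translate in this thread */
                if (slot[head].data[i] == '\0') slot[head].data[i] = '@';

            pthread_mutex_lock(&lock);
            slot[head].len = n;                  /* n <= 0 doubles as the sentinel */
            head = (head + 1) % NBUF;
            count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);

            if (n <= 0) break;
        }
        pthread_join(wr, NULL);
        return 0;
    }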

If you're really interested in finding out how fast you can move the data, multithread the reads and writes too. And if your filesystem supports it, use asynchronous reads/writes.
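
For completeness, a minimal sketch of the asynchronous-read idea using POSIX AIO, double-buffered so the next read is in flight while the previous chunk is translated and written; the file argument, chunk size and plain synchronous write are arbitrary choices, and error handling is mostly omitted:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)

    int main(int argc, char **argv)
    {
        static char buf[2][CHUNK];
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) return 1;

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[0];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = 0;
        aio_read(&cb);                       /* kick off the first read */

        for (int cur = 0;;) {
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);  /* wait for the read in flight */
            ssize_t n = aio_return(&cb);
            if (n <= 0) break;

            int next = 1 - cur;              /* queue the next read right away */
            cb.aio_buf     = buf[next];
            cb.aio_offset += n;
            aio_read(&cb);

            for (ssize_t i = 0; i < n; i++)  /* translate the finished chunk */
                if (buf[cur][i] == '\0') buf[cur][i] = '@';
            write(1, buf[cur], (size_t)n);   /* write while the next read runs */
            cur = next;
        }
        close(fd);
        return 0;
    }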

0
