Fast md5sum on millions of lines in bash / ubuntu

I need MD5 sums of about 3 million lines in a bash script on Ubuntu: 3 million lines → 3 million MD5 hashes. A trivial implementation takes about 0.005 seconds per line, which adds up to more than 4 hours. What are the faster alternatives? Is there a way to feed groups of lines to md5sum in one pass?

    # time md5sum running 100 times on short strings
    # each iteration is ~0.494s/100 = 0.005s
    time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null

    real    0m0.494s
    user    0m0.120s
    sys     0m0.356s

A good answer would include a bash / Perl script that takes a list of strings on stdin and prints their MD5 hashes.
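For concreteness, the interface I am after would behave roughly like this (md5_lines is just a placeholder name; the example hashes are the MD5s of the bare digits, without a trailing newline):

    printf '1\n2\n3\n' | md5_lines
    c4ca4238a0b923820dcc509a6f75849b
    c81e728d9d4c2f636f067f89cc14862c
    eccbc87e4b5ce2fe28308fd9f2a7baf3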

+4
5 answers

This is not difficult to do in C (or Perl, or Python) using any of the many md5 implementations: at its heart, md5 is just a hash function that maps a vector of bytes to a 16-byte digest.

So just write an external program that reads your 3 million lines and feeds them one by one to the md5 implementation of your choice. That way you launch one program instead of 3 million, and that alone will save you most of the time.

FWIW, in one project I used the md5 implementation (in C) by Christophe Devine; there is also OpenSSL, and I'm sure CPAN has plenty of implementations for Perl.

Edit: Well, I could not resist. The md5 implementation I mentioned is available, for example, inside this little tarball. Take the md5.c file and replace the (#ifdef'ed out) main() at the bottom with this:

    int main( int argc, char *argv[] )
    {
        FILE *f;
        int j;
        md5_context ctx;
        unsigned char buf[1000];
        unsigned char md5sum[16];

        if( ! ( f = fopen( argv[1], "rb" ) ) ) {
            perror( "fopen" );
            return( 1 );
        }

        /* read one whitespace-delimited token at a time (bounded to fit buf) */
        while( fscanf( f, "%999s", buf ) == 1 ) {
            /* hash the token and print its digest in hex, followed by the input */
            md5_starts( &ctx );
            md5_update( &ctx, buf, (uint32) strlen( (char *) buf ) );
            md5_finish( &ctx, md5sum );

            for( j = 0; j < 16; j++ )
                printf( "%02x", md5sum[j] );
            printf( " <- %s\n", buf );
        }

        return( 0 );
    }

to build a simple standalone program, for example:

    /tmp$ gcc -Wall -O3 -o simple_md5 simple_md5.c

and then you get the following:

    # first, generate 300,000 numbers in a file (using 'little r', an R variant)
    /tmp$ r -e'for (i in 1:300000) cat(i,"\n")' > foo.txt

    # illustrate the output
    /tmp$ ./simple_md5 foo.txt | head
    c4ca4238a0b923820dcc509a6f75849b <- 1
    c81e728d9d4c2f636f067f89cc14862c <- 2
    eccbc87e4b5ce2fe28308fd9f2a7baf3 <- 3
    a87ff679a2f3e71d9181a67b7542122c <- 4
    e4da3b7fbbce2345d7772b0674a318d5 <- 5
    1679091c5a880faf6fb5e6087eb1b2dc <- 6
    8f14e45fceea167a5a36dedd4bea2543 <- 7
    c9f0f895fb98ab9159f51fd0297e236d <- 8
    45c48cce2e2d7fbdea1afc51c7c6ad26 <- 9
    d3d9446802a44259755d38e6d163e820 <- 10

    # let the program rip over it, suppressing stdout
    /tmp$ time (./simple_md5 foo.txt > /dev/null)

    real    0m1.023s
    user    0m1.008s
    sys     0m0.012s
    /tmp$

So, about a second for 300,000 (short) lines, which extrapolates to roughly ten seconds for the 3 million lines in the question.

+6
    perl -MDigest::MD5=md5_hex -lpe '$_ = md5_hex $_'
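For example (just a usage sketch; Digest::MD5 ships with Perl), feeding it line-delimited input on stdin gives one hash per line, matching the output of the C program above:

    seq 1 5 | perl -MDigest::MD5=md5_hex -lpe '$_ = md5_hex $_'
    c4ca4238a0b923820dcc509a6f75849b
    c81e728d9d4c2f636f067f89cc14862c
    eccbc87e4b5ce2fe28308fd9f2a7baf3
    a87ff679a2f3e71d9181a67b7542122c
    e4da3b7fbbce2345d7772b0674a318d5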
+4
    #~/sw/md5$ time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null

    real    0m0.220s
    user    0m0.084s
    sys     0m0.160s

    #~/sw/md5$ time (python test.py `for i in {0..99}; do echo $i; done`) > /dev/null

    real    0m0.041s
    user    0m0.024s
    sys     0m0.012s

The Python code is five times faster for these small samples; for larger inputs the difference is much bigger, because no processes have to be spawned per line. With 1k samples it is roughly 0.033 s versus 2.3 s :) The script:

    #!/usr/bin/env python
    # Python 2: print the hex MD5 of each command-line argument, one per line
    import hashlib, sys

    for arg in sys.argv[1:]:
        print hashlib.md5(arg).hexdigest()
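One caveat worth noting: 3 million lines will not fit into a single argument list, so in practice you would batch the arguments, for example with xargs, which packs as many argv entries into each call as the OS allows. A sketch, assuming the test.py above and a Python 2 python on PATH:

    seq 1 3000000 | xargs python test.py > hashes.txt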
+4

I don't have a machine to test on right now, but is md5sum <<< "$i" faster than md5sum <(echo $i)? The <<< here-string syntax would avoid the subprocess overhead of echo and feed $i to md5sum on standard input directly.
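A quick way to check, mirroring the timing loop from the question (untested here, just a sketch):

    # here-string variant: no echo subprocess; note that <<< appends a newline,
    # the same as echo does, so the resulting hashes stay identical
    time (for i in {0..99}; do md5sum <<< "$i"; done) > /dev/null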

+3

To get better performance, you probably need to use another program or create a C program that calls one of the public md5 hash APIs.

Another option is to issue several md5 calls at once, in order to take advantage of multiple cores. Each pass through the loop you could generate 8 calls, the first 7 launched with an '&' at the end (to run them asynchronously in the background). If you have 4-8 cores, this can speed things up by up to 8x.
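A rough sketch of that chunked, multi-core idea (not from the original answer; it assumes GNU split with the -n l/ option and reuses the Perl one-liner from above, with input.txt as a placeholder for the real file):

    # split the input into 8 line-aligned chunks, hash them on up to 8 cores,
    # then reassemble the results in the original order
    split -n l/8 input.txt chunk_
    ls chunk_* | xargs -P 8 -I{} sh -c \
        "perl -MDigest::MD5=md5_hex -lpe '\$_ = md5_hex \$_' {} > {}.md5"
    cat chunk_*.md5 > all_hashes.txt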

+1
