Boost SHA-1 ComputeHash Performance

I use the following code to create a checksum file, and it works fine. But when I generate a hash for a large file, say 2 GB, it is rather slow. How can I improve the performance of this code?

    fs = new FileStream(txtFile.Text, FileMode.Open);
    formatted = string.Empty;
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        byte[] hash = sha1.ComputeHash(fs);
        foreach (byte b in hash)
        {
            formatted += b.ToString("X2");
        }
    }
    fs.Close();

Update:

System:

OS: Windows 7 64-bit, CPU: Core i5 750, RAM: 4 GB, hard drive: 7200 rpm

Tests:

Test1 = 59.895 seconds

Test2 = 59.94 seconds

+6
performance c#
6 answers

First question: why is this checksum needed? If you do not need the cryptographic properties, then a non-cryptographic hash, or a hash that is less cryptographically secure (MD5 being "broken" does not prevent it from being a good hash, and it is still strong enough for some uses), is likely to be more performant. You could also make your own hash by reading only a subset of the data (I would advise making this subset work in 4096-byte chunks of the underlying file, since that matches the buffer size used by SHA1Managed, and also allows for a faster chunked read than you would get reading every X bytes for some value of X).
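
As a rough sketch of that sampling idea (the stride value and the method name are illustrative assumptions, not part of the original answer), one could hash every Nth 4096-byte chunk of the file:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    // Hash every Nth 4096-byte chunk of the file instead of all of it.
    static string HashSampledChunks(string path, int stride)
    {
        const int chunkSize = 4096; // matches SHA1Managed's read size
        using (var sha1 = new SHA1Managed())
        using (var fs = File.OpenRead(path))
        {
            var buffer = new byte[chunkSize];
            long chunks = fs.Length / chunkSize + 1;
            for (long i = 0; i < chunks; i += stride)
            {
                fs.Seek(i * chunkSize, SeekOrigin.Begin);
                int read = fs.Read(buffer, 0, chunkSize);
                if (read > 0)
                    sha1.TransformBlock(buffer, 0, read, null, 0);
            }
            sha1.TransformFinalBlock(buffer, 0, 0); // finish the hash
            return BitConverter.ToString(sha1.Hash).Replace("-", "");
        }
    }

Note that such a sampled hash only detects differences that fall inside the sampled chunks; it trades completeness for speed.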

Edit: An upvote that reminded me of this answer has also reminded me that I have since written SpookilySharp, which provides high-performance 32-, 64- and 128-bit hashes that are not cryptographic, but are good for providing checksums against errors, storage, etc. (This, in turn, reminded me that I should update it to support .NET Core.)

Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.

I would experiment with different buffer sizes, as increasing the size of the FileStream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (4096 is, incidentally, the default), since SHA1Managed requests 4096-byte chunks at a time; that way FileStream never returns less than a full request (allowed, but sometimes suboptimal) and never performs more than one copy at a time.
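
For example (a minimal sketch; the 32 KB buffer size is just one value to experiment with, and the method name is illustrative):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static string HashWithLargerBuffer(string path)
    {
        // 8 * 4096 = 32 KB: a whole multiple of the 4096-byte reads SHA1Managed makes.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 8 * 4096, FileOptions.SequentialScan))
        using (var sha1 = new SHA1Managed())
        {
            byte[] hash = sha1.ComputeHash(fs);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }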

+3

Well, is it I/O-bound or CPU-bound? If it is CPU-bound, there is not much we can do about it.

It is possible that opening the FileStream with different parameters would allow the file system to do more buffering, or to assume that you are going to read the file sequentially - but I doubt that will help very much. (It is certainly not going to do much if the task is CPU-bound.)

How slow is "rather slow", anyway? Compared with, say, copying the file?

If you have a lot of memory (e.g. 4 GB or more), how long does hashing the file take the second time around, when it may already be in the file system cache?
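
One quick way to answer those questions (a sketch of my own, not from the answer): time a plain sequential read of the file, then time the hash over the same file. If the two times are close, the task is I/O-bound.

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Security.Cryptography;

    class Bench
    {
        static void Main(string[] args)
        {
            string path = args[0];
            var buffer = new byte[64 * 1024];

            // Pass 1: raw sequential read only.
            var sw = Stopwatch.StartNew();
            using (var fs = File.OpenRead(path))
                while (fs.Read(buffer, 0, buffer.Length) > 0) { }
            Console.WriteLine("read only:    " + sw.Elapsed);

            // Pass 2: read + SHA-1 (the file may now be in the OS cache,
            // which isolates the CPU cost of the hash itself).
            sw.Restart();
            using (var fs = File.OpenRead(path))
            using (var sha1 = new SHA1Managed())
                sha1.ComputeHash(fs);
            Console.WriteLine("read + SHA-1: " + sw.Elapsed);
        }
    }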

+1

First of all, have you measured "rather slow"? From this site, SHA-1 has approximately half the throughput of MD5, at about 100 MB/s (depending on the CPU), so hashing 2 GB should take about 20 seconds. Also note that if you are using a slow hard drive, that could be your real bottleneck, since 30-70 MB/s is not unusual.

To speed things up, you could simply not hash the whole file, but only the first X KB, or representative parts of it (the parts that are most likely to differ). If your files are not too similar, this should not cause duplicates.
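
A minimal sketch of the first-X-KB approach (the method name and parameter are illustrative; the caller picks the prefix length):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static string HashFirstChunk(string path, int prefixBytes)
    {
        using (var fs = File.OpenRead(path))
        using (var sha1 = new SHA1Managed())
        {
            var buffer = new byte[(int)Math.Min(prefixBytes, fs.Length)];
            int read = fs.Read(buffer, 0, buffer.Length);
            // Hash only what was actually read (the first X KB of the file).
            byte[] hash = sha1.ComputeHash(buffer, 0, read);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

Mixing the file length into the result as well (for example, appending it to the hash input) helps distinguish files that happen to share a prefix.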

+1

First: SHA-1 file hashing should be I/O-bound on any non-ancient CPU - and an i5 certainly does not qualify as ancient. Of course, it depends on the SHA-1 implementation, but I doubt SHA1Managed is über-slow.

Next, 60 seconds for 2 GB of data is ~34 MB/s - that is slow for a hard drive read; even a 2.5-inch laptop drive can read faster than that. Assuming the hard drive is internal (no USB2 or network bottleneck) and there is not much other disk I/O going on, I would be surprised to see less than 60 MB/s from a modern drive.

My guess would be that ComputeHash() uses a small buffer internally. Try manual reading/hashing, so that you can specify a larger buffer (64 KB or even more) to increase throughput. You could also move to asynchronous processing, so that disk reads and hash computation overlap.
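
Something along these lines (a sketch of the manual read/hash loop; the 64 KB buffer size is the answer's suggestion, everything else is illustrative):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static string HashManually(string path)
    {
        using (var sha1 = new SHA1Managed())
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 64 * 1024, FileOptions.SequentialScan))
        {
            var buffer = new byte[64 * 1024]; // the larger buffer the answer suggests
            int read;
            // Feed the hash one 64 KB block at a time.
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                sha1.TransformBlock(buffer, 0, read, null, 0);
            sha1.TransformFinalBlock(buffer, 0, 0); // finish the hash
            return BitConverter.ToString(sha1.Hash).Replace("-", "");
        }
    }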

+1

Neither is SHA1Managed the best choice for large inputs, nor is Byte.ToString("X2") the fastest way to convert a byte array to a string.

I just finished an article with detailed benchmarks on this topic. It compares SHA1Managed, SHA1CryptoServiceProvider and SHA1Cng, and also considers SHA1.Create(), for input strings of different lengths.

The second part shows five different ways to convert a byte array to a string, of which Byte.ToString("X2") is the slowest.
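
For illustration, here is one common faster alternative, a lookup-table conversion (a sketch; not necessarily one of the exact five variants benchmarked in the article):

    static string ToHexFast(byte[] bytes)
    {
        const string digits = "0123456789ABCDEF";
        var chars = new char[bytes.Length * 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            chars[2 * i] = digits[bytes[i] >> 4];      // high nibble
            chars[2 * i + 1] = digits[bytes[i] & 0xF]; // low nibble
        }
        // A single allocation, instead of one intermediate string per byte.
        return new string(chars);
    }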

My largest input was only 10,000 characters, so you may want to run my benchmarks against your 2 GB file. It would be interesting to see if and how that changes the numbers.

http://wintermute79.wordpress.com/2014/10/10/c-sha-1-benchmark/

However, for checking file integrity, you may be better off using MD5, as already written above.

0

You can use the following logic to get the SHA-1 value. I used it in Java.

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class Sha1Calculate {

        public static void main(String[] args) {
            File file = new File("D:\\Android Links.txt");
            try {
                // Read the whole file into memory first. Note: for very large
                // files this is a poor fit; feed the digest chunk by chunk
                // instead of buffering everything.
                FileInputStream input = new FileInputStream(file);
                ByteArrayOutputStream output = new ByteArrayOutputStream();
                byte[] buffer = new byte[65536];
                int l;
                while ((l = input.read(buffer)) > 0)
                    output.write(buffer, 0, l);
                input.close();
                output.close();

                byte[] data = output.toByteArray();
                MessageDigest digest = MessageDigest.getInstance("SHA-1");
                digest.update(data, 0, data.length);
                byte[] hash = digest.digest();

                // Convert the digest to an uppercase hex string.
                StringBuilder sb = new StringBuilder();
                for (byte b : hash) {
                    sb.append(String.format("%02X", b));
                }
                System.out.println("Digest (in hex format): " + sb.toString());
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (NoSuchAlgorithmException e) {
                e.printStackTrace();
            }
        }
    }
-1
