Get a list of duplicate files by comparing their MD5 hashes

I have an array that contains file paths, and I want to build a list of the files that are duplicates of each other, based on their MD5 hashes. I calculate the hashes as follows:

    private void calcMD5(Array files) //Array contains a path of all files
    {
        int i = 0;
        string[] md5_val = new string[files.Length];
        foreach (string file_name in files)
        {
            using (var md5 = MD5.Create())
            {
                using (var stream = File.OpenRead(file_name))
                {
                    md5_val[i] = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
                    i += 1;
                }
            }
        }
    }

With the code above I can calculate the MD5 hashes, but how do I get a list of only the duplicate files? If there is another way to do this, please let me know. I'm also new to LINQ.

+6
5 answers

1. Rewrite your calcMD5 function to take a single file path and return that file's MD5 hash.
2. Store your file names in a string[] or List<string>, not in an untyped Array, if possible.
3. Use the following LINQ query to get groups of files with the same hash:

    var groupsOfFilesWithSameHash = files // or files.Cast<string>() if you're stuck with an Array
        .GroupBy(f => calcMD5(f))
        .Where(g => g.Count() > 1);

4. You can iterate over the groups with nested foreach loops, for example:

    foreach (var group in groupsOfFilesWithSameHash)
    {
        // group.Key is the MD5 hash shared by every file in this group
        Console.WriteLine("Shared MD5: " + group.Key);
        foreach (var file in group)
            Console.WriteLine(" " + file);
    }
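
For step 1, the rewritten function could look roughly like the sketch below. It reuses the hashing code from the question and only changes the signature to take one path and return one hash. The containing class name `Md5Helper` is a placeholder chosen for illustration, not something from the question.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class Md5Helper // hypothetical container class, name chosen for illustration
{
    // Returns the MD5 hash of a single file as a lowercase hex string.
    public static string calcMD5(string fileName)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(fileName))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
        }
    }
}
```

With this signature the function can be passed straight to GroupBy, as in step 3.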
+11
    static void Main(string[] args)
    {
        // returns a list of file names, which have duplicate MD5 hashes
        var duplicates = CalcDuplicates(new[] {"Hello.txt", "World.txt"});
    }

    private static IEnumerable<string> CalcDuplicates(IEnumerable<string> fileNames)
    {
        return fileNames.GroupBy(CalcMd5OfFile)
                        .Where(g => g.Count() > 1)
                        // skip SelectMany() if you'd like the duplicates grouped by their hashes as group key
                        .SelectMany(g => g);
    }

    private static string CalcMd5OfFile(string path)
    {
        // I took your implementation - I don't know if there are better ones
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(path))
            {
                return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
            }
        }
    }
+2
 var duplicates = md5_val.GroupBy(x => x).Where(x => x.Count() > 1).Select(x => x.Key); 

This will give you a list of hashes that are duplicated in the array.

To get names instead of hashes:

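    // Pair each hash with its index, group by hash, then map the indexes back to file names.
    // Assumes files is a string[] (or List<string>) in the same order as md5_val.
    var duplicates = md5_val.Select((x, i) => new Tuple<string, int>(x, i))
        .GroupBy(x => x.Item1)
        .Where(g => g.Count() > 1)
        .SelectMany(g => g.Select(x => files[x.Item2]));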
0

Instead of returning an array with the MD5 hashes of all files, do it as follows:

  • Write one method, calculateFileHash(), that hashes a single file.
  • Create an array of file names to test.
  • Then run:

    var dupes = Filenames.GroupBy(fn => calculateFileHash(fn)).Where(gr => gr.Count() > 1);

This returns a sequence of groups, where each group is an enumerable of the file names that share the same hash and therefore, in practice, the same contents.

0
    private void calcMD5(String[] filePathes) //Array contains a path of all files
    {
        // A Dictionary<String, String> keyed by hash would throw on the second file
        // with the same hash, so collect (hash, path) pairs in a list instead.
        var hashToFilePathes = new List<KeyValuePair<String, String>>();
        foreach (string file_name in filePathes)
        {
            using (var md5 = MD5.Create())
            {
                using (var stream = File.OpenRead(file_name))
                {
                    // Key is the MD5 hash, value is the file path
                    string hash = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
                    hashToFilePathes.Add(new KeyValuePair<String, String>(hash, file_name));
                }
            }
        }

        // Here will be all duplicates
        List<String> listOfDuplicates = hashToFilePathes
            .GroupBy(e => e.Key)
            .Where(e => e.Count() > 1)
            .SelectMany(e => e)
            .Select(e => e.Value)
            .ToList();
    }
0