If your goal is to compare two files stored on HDFS, I would not use "hdfs dfs -checksum URI", because in my case it generates different checksums for files with identical content.
In the example below, I compare two files that have the same content but are stored in different locations.
The old-school md5sum method returns the same checksum for both:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -
However, the checksum generated on HDFS is different for files with the same contents:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
I am a little puzzled by this, as I would expect identical content to produce identical checksums.
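If you need this check in a script, the md5sum comparison shown earlier can be wrapped in a small helper. This is only a sketch: the function name same_hdfs_content and its exit codes are my own invention, not part of Hadoop, and it assumes that the hdfs client, md5sum and cut are all on the PATH.

```shell
# Hypothetical helper (not part of Hadoop): compare two HDFS files by
# streaming their contents through md5sum.
# Exit status: 0 = same content, 1 = different content, 2 = a path is missing.
same_hdfs_content() (
    # The body runs in a subshell, so the variables and "exit" stay local.
    # Check both paths first: otherwise two missing files would both hash
    # as empty input and wrongly compare as equal.
    hdfs dfs -test -e "$1" || exit 2
    hdfs dfs -test -e "$2" || exit 2

    # md5sum prints "<hash>  -" for stdin; keep only the hash field.
    sum_a=$(hdfs dfs -cat "$1" | md5sum | cut -d' ' -f1)
    sum_b=$(hdfs dfs -cat "$2" | md5sum | cut -d' ' -f1)

    [ "$sum_a" = "$sum_b" ]
)
```

It could be called as: same_hdfs_content /project1/file.txt /project2/file.txt && echo "identical". Keep in mind that this streams both files in full through the client, so it may be slow for very large files.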
Tomek