Checksum in Hadoop

Do I need to verify the checksum after moving files to Hadoop (HDFS) from a Linux server via WebHDFS?

I would like to make sure that the files on HDFS are not corrupted after they are copied. But do I need to check the checksum myself?

I have read that the client computes a checksum before the data is written to HDFS.

Can someone help me understand how I can make sure the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?

+5
5 answers

The checksum for the file can be calculated using the hadoop fs command.

Usage: hadoop fs -checksum URI

Returns file checksum information.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1

Read More: Hadoop Documentation

So, if you want to compare file1 on Linux and on HDFS, you can use the above utility.
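If you want to do the same comparison programmatically rather than from the shell, here is a minimal Scala sketch (my own illustration, not part of the documentation); the NameNode URI and paths are the placeholders from the example above, and getFileChecksum may return null on filesystems that do not expose a checksum:

 import java.net.URI
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}

 val conf = new Configuration()
 val hdfs = FileSystem.get(new URI("hdfs://nn1.example.com"), conf)
 val local = FileSystem.get(new URI("file:///"), conf)

 // May be null, e.g. on the raw local filesystem without .crc sidecar files.
 val hdfsSum = Option(hdfs.getFileChecksum(new Path("/file1")))
 val localSum = Option(local.getFileChecksum(new Path("/path/in/linux/file1")))

 // Note: the composite HDFS checksum only matches when block size and
 // bytes-per-checksum agree on both sides, so a mismatch here does not
 // necessarily mean the contents differ.
 println(s"hdfs: $hdfsSum local: $localSum equal: ${hdfsSum.isDefined && hdfsSum == localSum}")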

+5

I wrote a library with which you can calculate the checksum of a local file, just the way Hadoop does it for HDFS files.

So you can cross-verify the checksums: https://github.com/srch07/HDFSChecksumForLocalfile

+2

If your goal is to compare two files located on HDFS, I would not use "hdfs dfs -checksum URI", because in my case it generates different checksums for files with identical content.

In the example below, I compare two files with the same content in different places:

The old-school method md5sum returns the same checksum:

 $ hdfs dfs -cat /project1/file.txt | md5sum
 b9fdea463b1ce46fabc2958fc5f7644a  -
 $ hdfs dfs -cat /project2/file.txt | md5sum
 b9fdea463b1ce46fabc2958fc5f7644a  -

However, the checksum generated on HDFS is different for files with the same contents:

 $ hdfs dfs -checksum /project1/file.txt
 0000020000000000000000003e50be59553b2ddaf401c575f8df6914
 $ hdfs dfs -checksum /project2/file.txt
 0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e

I am a little puzzled, as I would expect an identical checksum to be generated for identical content. (Presumably this is because the HDFS file checksum is a composite of per-block CRC checksums, so it also depends on block size and bytes-per-checksum settings rather than on the content alone.)

+2

If you do this check via the API:

 import org.apache.hadoop.fs._
 import org.apache.hadoop.io._

Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a

 val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString 

Option 2: for the value 3e50be59553b2ddaf401c575f8df6914

 val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(1)
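As a small usage sketch of my own (assuming, as above, that hadoopConfiguration is a Configuration already in scope, for example from a Spark shell), Option 1 can be used to compare the two files by content, which matches what md5sum reports:

 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.io.MD5Hash

 val fs = FileSystem.get(hadoopConfiguration)
 // Content-level MD5 of each file, independent of HDFS block settings.
 val md5a = MD5Hash.digest(fs.open(new Path("/project1/file.txt"))).toString
 val md5b = MD5Hash.digest(fs.open(new Path("/project2/file.txt"))).toString
 println(s"identical content: ${md5a == md5b}")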
+1

This is a CRC check. For each and every file it creates .crc checksum data to make sure there is no corruption.
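For illustration (my own sketch, not from this answer), the client-side verification that uses those CRCs can be seen through the FileSystem API; the path is a placeholder, and a corrupted block shows up as a ChecksumException while reading:

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.io.IOUtils

 val fs = FileSystem.get(new Configuration())
 // Checksum verification is on by default; setVerifyChecksum(false) would skip it.
 fs.setVerifyChecksum(true)
 val in = fs.open(new Path("/project1/file.txt"))
 try IOUtils.copyBytes(in, System.out, 4096, false)
 finally in.close()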

0
