Checksum in Hadoop

Do I need to verify the checksum after moving files to Hadoop (HDFS) from a Linux server via WebHDFS?

I would like to make sure that the files on HDFS are not corrupted after they are copied. But do I need to check the checksum myself?

I have read that the client computes a checksum before the data is written to HDFS.

Can someone help me understand how I can make sure the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?

+5
5 answers

The checksum for the file can be calculated using the hadoop fs command.

Usage: hadoop fs -checksum URI

Returns file checksum information.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1

Read More: Hadoop Documentation

So, if you want to compare file1 on Linux and on HDFS, you can use the above utility.
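If you want to do the same comparison programmatically rather than from the shell, here is a minimal Scala sketch (my own illustration, not part of the documentation); the NameNode URI and paths are the placeholders from the example above, and getFileChecksum may return null on filesystems that do not expose a checksum:

 import java.net.URI
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}

 val conf = new Configuration()
 val hdfs = FileSystem.get(new URI("hdfs://nn1.example.com"), conf)
 val local = FileSystem.get(new URI("file:///"), conf)

 // May be null, e.g. on the raw local filesystem without .crc sidecar files.
 val hdfsSum = Option(hdfs.getFileChecksum(new Path("/file1")))
 val localSum = Option(local.getFileChecksum(new Path("/path/in/linux/file1")))

 // Note: the composite HDFS checksum only matches when block size and
 // bytes-per-checksum agree on both sides, so a mismatch here does not
 // necessarily mean the contents differ.
 println(s"hdfs: $hdfsSum local: $localSum equal: ${hdfsSum.isDefined && hdfsSum == localSum}")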

+5

I wrote a library with which you can calculate the checksum of a local file, just the way Hadoop does it for HDFS files.

So you can cross-verify the checksums: https://github.com/srch07/HDFSChecksumForLocalfile

+2

If your goal is to compare two files located on HDFS, I would not use "hdfs dfs -checksum URI", because in my case it generates different checksums for files with identical content.

In the example below, I compare two files with the same content in different places:

The old-school method md5sum returns the same checksum:

 $ hdfs dfs -cat /project1/file.txt | md5sum
 b9fdea463b1ce46fabc2958fc5f7644a  -
 $ hdfs dfs -cat /project2/file.txt | md5sum
 b9fdea463b1ce46fabc2958fc5f7644a  -

However, the checksum generated on HDFS is different for files with the same contents:

 $ hdfs dfs -checksum /project1/file.txt
 0000020000000000000000003e50be59553b2ddaf401c575f8df6914
 $ hdfs dfs -checksum /project2/file.txt
 0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e

I am a little puzzled, as I would expect an identical checksum to be generated for identical content. (Presumably this is because the HDFS file checksum is a composite of per-block CRC checksums, so it also depends on block size and bytes-per-checksum settings rather than on the content alone.)

+2

If you do this check via the API:

 import org.apache.hadoop.fs._
 import org.apache.hadoop.io._

Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a

 val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString 

Option 2: for the value 3e50be59553b2ddaf401c575f8df6914

 val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(1)
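As a small usage sketch of my own (assuming, as above, that hadoopConfiguration is a Configuration already in scope, for example from a Spark shell), Option 1 can be used to compare the two files by content, which matches what md5sum reports:

 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.io.MD5Hash

 val fs = FileSystem.get(hadoopConfiguration)
 // Content-level MD5 of each file, independent of HDFS block settings.
 val md5a = MD5Hash.digest(fs.open(new Path("/project1/file.txt"))).toString
 val md5b = MD5Hash.digest(fs.open(new Path("/project2/file.txt"))).toString
 println(s"identical content: ${md5a == md5b}")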
+1

This is a CRC check. For each and every file it creates .crc checksum data to make sure there is no corruption.
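For illustration (my own sketch, not from this answer), the client-side verification that uses those CRCs can be seen through the FileSystem API; the path is a placeholder, and a corrupted block shows up as a ChecksumException while reading:

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.io.IOUtils

 val fs = FileSystem.get(new Configuration())
 // Checksum verification is on by default; setVerifyChecksum(false) would skip it.
 fs.setVerifyChecksum(true)
 val in = fs.open(new Path("/project1/file.txt"))
 try IOUtils.copyBytes(in, System.out, 4096, false)
 finally in.close()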

0
