HDFS Checksum

I am trying to check file consistency after copying to HDFS using the Hadoop API - DFSClient.getFileChecksum().

I get the following output for the code below:

    Null
    HDFS : null
    Local : null

Can someone point out the error or mistake? Here is the code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;

    public class fileCheckSum {

        /**
         * @param args
         * @throws IOException
         */
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem hadoopFS = FileSystem.get(conf);
            // Path hdfsPath = new Path("/derby.log");
            LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
            // Path localPath = new Path("file:///home/ubuntu/derby.log");
            // System.out.println("HDFS PATH : "+hdfsPath.getName());
            // System.out.println("Local PATH : "+localPath.getName());

            FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
            FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));

            if (null != hdfsChecksum || null != localChecksum) {
                System.out.println("HDFS Checksum : " + hdfsChecksum.toString() + "\t" + hdfsChecksum.getLength());
                System.out.println("Local Checksum : " + localChecksum.toString() + "\t" + localChecksum.getLength());
                if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                    System.out.println("Equal");
                } else {
                    System.out.println("UnEqual");
                }
            } else {
                System.out.println("Null");
                System.out.println("HDFS : " + hdfsChecksum);
                System.out.println("Local : " + localChecksum);
            }
        }
    }
2 answers

Since you do not set a remote address in conf, both hadoopFS and localFS use essentially the same configuration and therefore both point to an instance of LocalFileSystem.

getFileChecksum is not implemented for LocalFileSystem, so it returns null. It should work for DistributedFileSystem, though: if your conf points to a distributed cluster, FileSystem.get(conf) returns a DistributedFileSystem instance, which computes an MD5 of MD5 of CRC32 checksums over chunks of bytes.per.checksum size. The value therefore depends on the block size and on the cluster-wide bytes.per.checksum setting. That is why these two parameters are also encoded into the returned value as the algorithm name: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
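For illustration, here is a minimal sketch of pointing the configuration at an actual cluster so that FileSystem.get returns a DistributedFileSystem and the checksum comes back non-null. The NameNode URI hdfs://namenode:8020 is a placeholder; substitute your own.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsChecksumDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode URI; older Hadoop versions use the key
            // "fs.default.name" instead of "fs.defaultFS".
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf); // now a DistributedFileSystem
            FileChecksum checksum = fs.getFileChecksum(new Path("/derby.log"));
            if (checksum != null) {
                // The algorithm name looks like MD5-of-xxxMD5-of-yyyCRC32
                System.out.println(checksum.getAlgorithmName() + " : " + checksum);
            }
        }
    }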

getFileChecksum is not intended for comparisons across file systems. Although it is possible to simulate the distributed checksum locally, or to write map-reduce jobs that compute equivalents of the local hashes, I suggest relying on Hadoop's own integrity checks, which happen whenever a file is written to or read from Hadoop. A sketch of what triggering those checks looks like from client code follows below.
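As a rough sketch (the path is a placeholder): simply reading a file to the end through the FileSystem client verifies the checksum of every chunk, and corruption surfaces as a ChecksumException.

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VerifyOnRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            fs.setVerifyChecksum(true); // verification is on by default

            // Draining the stream forces the client to check each chunk;
            // corruption raises org.apache.hadoop.fs.ChecksumException.
            try (InputStream in = fs.open(new Path("/derby.log"))) {
                byte[] buffer = new byte[8192];
                while (in.read(buffer) != -1) {
                    // discard the data; we only want the integrity check
                }
            }
            System.out.println("Read completed without checksum errors.");
        }
    }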


Try this. Here I calculate the MD5 of both the local file and the HDFS file, then compare the two. Hope this helps.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.net.URI;
    import java.security.MessageDigest;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public static void compareChecksumForLocalAndHdfsFile(String sourceHdfsFilePath,
            String sourceLocalFilepath, Map<String, String> hdfsConfigMap) throws Exception {
        System.setProperty("HADOOP_USER_NAME", hdfsConfigMap.get(Constants.USERNAME));
        System.setProperty("hadoop.home.dir", "/tmp");

        Configuration hdfsConfig = new Configuration();
        hdfsConfig.set(Constants.USERNAME, hdfsConfigMap.get(Constants.USERNAME));
        hdfsConfig.set("fsURI", hdfsConfigMap.get("fsURI"));

        FileSystem hdfs = FileSystem.get(new URI(hdfsConfigMap.get("fsURI")), hdfsConfig);
        Path inputPath = new Path(hdfsConfigMap.get("fsURI") + "/" + sourceHdfsFilePath);
        InputStream is = hdfs.open(inputPath);

        // MD5 over the HDFS stream and over the local file, byte for byte.
        String localChecksum = getMD5Checksum(new FileInputStream(sourceLocalFilepath));
        String hdfsChecksum = getMD5Checksum(is);

        if (null != hdfsChecksum && null != localChecksum) {
            System.out.println("HDFS Checksum : " + hdfsChecksum + "\t" + hdfsChecksum.length());
            System.out.println("Local Checksum : " + localChecksum + "\t" + localChecksum.length());
            if (hdfsChecksum.equals(localChecksum)) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }

    // Streams the input through an MD5 digest and returns the raw bytes.
    public static byte[] createChecksum(InputStream fis) throws Exception {
        byte[] buffer = new byte[1024];
        MessageDigest complete = MessageDigest.getInstance("MD5");
        int numRead;
        do {
            numRead = fis.read(buffer);
            if (numRead > 0) {
                complete.update(buffer, 0, numRead);
            }
        } while (numRead != -1);
        fis.close();
        return complete.digest();
    }

    // See this How-to for a faster way to convert a byte array to a HEX string.
    public static String getMD5Checksum(InputStream is) throws Exception {
        byte[] b = createChecksum(is);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < b.length; i++) {
            result.append(Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1));
        }
        return result.toString();
    }
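A hypothetical call, for context: the user name and NameNode URI below are placeholders, and Constants.USERNAME is assumed to be the key string defined elsewhere in this code.

    Map<String, String> hdfsConfigMap = new HashMap<>();
    hdfsConfigMap.put(Constants.USERNAME, "ubuntu");      // placeholder user
    hdfsConfigMap.put("fsURI", "hdfs://namenode:8020");   // placeholder URI

    compareChecksumForLocalAndHdfsFile("derby.log", "/home/ubuntu/derby.log", hdfsConfigMap);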

Output:

    HDFS Checksum : d99513cc4f1d9c51679a125702bd27b0    32
    Local Checksum : d99513cc4f1d9c51679a125702bd27b0   32
    Equal
