Copy large files to HDFS

I am trying to copy a large file (32 GB) to HDFS. I have never had problems copying files to HDFS before, but those files were all smaller. I use hadoop fs -put <myfile> <myhdfsfile>, and everything goes well up to about 13.7 GB, but then I get this exception:

    hadoop fs -put * /data/unprocessed/
    Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: Input/output error
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:150)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:384)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:217)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:230)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:191)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1183)
        at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:130)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1762)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
    Caused by: java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:242)
        at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.read(RawLocalFileSystem.java:91)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:144)
        ... 20 more

When I check the log files (on my NameNode and DataNodes), I see that the lease on the file has been removed, but no reason is given. According to the log files, everything was going fine. Here are the last lines of my NameNode log:

    2013-01-28 09:43:34,176 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/unprocessed/AMR_EXPORT.csv. blk_-4784588526865920213_1001
    2013-01-28 09:44:16,459 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.6.114:50010 is added to blk_-4784588526865920213_1001 size 30466048
    2013-01-28 09:44:16,466 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /data/unprocessed/AMR_EXPORT.csv from client DFSClient_1738322483
    2013-01-28 09:44:16,472 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /data/unprocessed/AMR_EXPORT.csv is closed by DFSClient_1738322483
    2013-01-28 09:44:16,517 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 168 Total time for transactions(ms): 26 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0

Does anyone know what is going on here? I checked core-default.xml and hdfs-default.xml for properties that I could override to extend the lease or something similar, but could not find any.

+4
2 answers

Some suggestions:

  • If you have multiple files to copy, copy them in parallel using multiple sessions instead of pushing them all through a single session.
  • If there is only one large file, compress it before copying, or split the large file into smaller pieces and copy those (a sketch of both options follows this list).
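A minimal sketch of both options, assuming the file is the AMR_EXPORT.csv from the logs above and the target is /data/unprocessed/; the 1 GB piece size and the degree of parallelism are illustrative, not prescribed:

    # Option 1: split the large file into smaller pieces and put them
    # in parallel sessions.
    split -b 1G AMR_EXPORT.csv AMR_EXPORT.csv.part-
    for part in AMR_EXPORT.csv.part-*; do
        hadoop fs -put "$part" /data/unprocessed/ &
    done
    wait

    # Option 2: compress the file first so less data has to be read
    # from the local disk and sent over the wire.
    gzip -c AMR_EXPORT.csv > AMR_EXPORT.csv.gz
    hadoop fs -put AMR_EXPORT.csv.gz /data/unprocessed/

Keep in mind that a plain .gz file is not splittable by MapReduce jobs, so you may need to decompress it or recombine the pieces on the cluster side before processing.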
+1

This sounds like a problem reading the local file, not an issue with the HDFS client. The stack trace shows an error while reading the local file that bubbled all the way up. The lease is removed because the client disconnected after hitting that IOException while reading the file.
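One way to confirm this, assuming the same file name as above: read the whole file locally, outside of Hadoop, and see whether the same I/O error appears around the same offset.

    # Read the entire local file and discard the output; if this also fails
    # around the 13.7 GB mark, the local disk or filesystem is the problem.
    dd if=AMR_EXPORT.csv of=/dev/null bs=1M

    # Check the kernel log for disk or controller errors reported at that time.
    dmesg | tail -n 50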

0
