Writing to a file in HDFS in Hadoop

I was looking for a disk-intensive Hadoop application to exercise I/O in Hadoop, but I could not find one that kept disk utilization above, say, 50%, or that actually keeps the drives busy. I tried randomwriter, but surprisingly it is not disk I/O intensive.

So, I wrote a small program that creates a file in the Mapper and writes some text to it. This application works well, but disk utilization is high only on the master node, which is also the namenode, the jobtracker and one of the slaves. Disk utilization is nil or negligible on the other tasktrackers. I cannot understand why disk I/O is so low on the tasktrackers. Could someone push me in the right direction if I am doing something wrong? Thanks in advance.

Here is the code segment that I wrote in the WordCount.java file to create and write a UTF string to a file:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outFile;
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);

    // Create a per-task-attempt dummy file, write a short UTF string, then delete it.
    outFile = new Path("./dummy" + context.getTaskAttemptID());
    FSDataOutputStream out = fs.create(outFile);
    out.writeUTF("helloworld");
    out.close();
    fs.delete(outFile);
}
2 answers

OK, I must have been really careless, because I hadn't checked this before. The actual problem was that all my datanodes were down. I reformatted the namenode and everything fell back into place; I now get 15-20% disk utilization, which is not bad for WordCount. I will run TestDFSIO next and see if I can drive the disks even harder.
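For reference, TestDFSIO ships with the Hadoop test jar. On a Hadoop 1.x installation the write benchmark is typically invoked roughly as below; the exact jar name and path vary by version and distribution, so treat the file name and the sizes as assumptions, not as the poster's actual command:

    hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 16 -fileSize 1000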


I think that any mechanism that creates Java objects for each cell in each row, and runs Java serialization on them before saving to disk, has little chance of saturating the disk I/O.
In my experience such serialization runs at a few MB per second, or a little more, but not at 100 MB per second.
So what you did, bypassing those serialization layers on the write path, is absolutely correct. Now consider how writing to HDFS works: the data is written to the local disk via the local datanode, and then synchronously replicated to other nodes over the network, according to your replication factor. In this case you cannot write data into HDFS faster than your network bandwidth allows. If your cluster is relatively small, this becomes significant. For a 3-node cluster with replication factor 3, every piece of data is pushed to all three nodes, so the aggregate HDFS write bandwidth of the whole cluster is about 1 GBit, assuming that is the speed of your network.
So, I would suggest:
a) Reduce the replication factor to 1, so that you are no longer bound by the network.
b) Write larger chunks of data in each mapper call (a rough sketch of both is shown below).
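For illustration only, here is a minimal sketch of both suggestions using the same FileSystem API the question already uses. The class name, helper method, and the 64 MB chunk size are my own illustrative choices, not from the original post:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch, not the poster's actual mapper: write one large, unreplicated
    // chunk per call instead of many tiny "helloworld" strings.
    public class DummyHdfsWriter {

        // Call this once per map() invocation; "taskId" is just a unique suffix
        // (e.g. context.getTaskAttemptID().toString() inside a real mapper).
        public static void writeChunk(String taskId) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path outFile = new Path("./dummy" + taskId);

            byte[] chunk = new byte[64 * 1024 * 1024];   // 64 MB of dummy data (assumed size)

            // create(Path, overwrite, bufferSize, replication, blockSize):
            // replication = 1 keeps all bytes on the local datanode, so the write
            // is bounded by the local disk rather than by the network.
            FSDataOutputStream out = fs.create(outFile, true, 4096, (short) 1,
                    fs.getDefaultBlockSize());
            out.write(chunk);
            out.close();

            fs.delete(outFile, true);
        }
    }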

