HDFS performance for small files

I am new to Hadoop. Recently I have been trying to process (read only) many small files on HDFS/Hadoop. The average file size is about 1 KB, and the number of files is more than 10 million. The program must be written in C++ due to some limitations.

This is just a performance measurement, so I am only using 5 machines as data nodes. Each data node has 5 data disks.

I wrote a small C++ program that reads files directly from the local disks (not from HDFS) to establish a performance baseline. The program creates 4 reader threads per disk. The result is about 14 MB/s per disk, so the aggregate throughput is about 14 MB/s * 5 disks * 5 machines = 350 MB/s.

However, when the same program (still C++, dynamically linked against libhdfs.so, creating 4 * 5 * 5 = 100 threads) reads files from HDFS, the throughput drops to approximately 55 MB/s.
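The per-thread read loop against the libhdfs C API looks roughly like this (a sketch: the connection parameters and file path are placeholders, and it needs a running cluster plus the libhdfs headers and library to build and run):

```cpp
#include <fcntl.h>  // O_RDONLY
#include <cstdio>
#include "hdfs.h"   // libhdfs C API, shipped with Hadoop

// Read one small file from HDFS and return the number of bytes read.
// Every file costs a namenode round trip (open) plus a datanode transfer
// (read) -- for 1 KB files the open dominates.
static long read_one(hdfsFS fs, const char* path) {
    hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
    if (!f) return -1;
    char buf[4096];
    long total = 0;
    tSize n;
    while ((n = hdfsRead(fs, f, buf, sizeof(buf))) > 0) total += n;
    hdfsCloseFile(fs, f);
    return total;
}

int main() {
    hdfsFS fs = hdfsConnect("default", 0);  // fs.default.name from the config
    if (!fs) { std::fprintf(stderr, "connect failed\n"); return 1; }
    long got = read_one(fs, "/smallfiles/part-00000");  // hypothetical path
    std::printf("read %ld bytes\n", got);
    hdfsDisconnect(fs);
    return 0;
}
```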

When this program is launched via MapReduce (Hadoop Streaming, 5 tasks, each with 20 threads, 100 threads in total), the throughput drops further, to about 45 MB/s. (I suspect the scheduling/bookkeeping overhead slows it down.)

I am wondering what performance HDFS can reasonably provide. As you can see, compared with native reads the throughput is only about 1/7. Is this a problem with my configuration, a limitation of HDFS, or a Java limitation? What is the best approach for my scenario? Would SequenceFiles help (a lot)? What throughput, relative to plain local I/O, is it reasonable to expect?

Here are some of my settings:

NameNode heap size: 32 GB.

Job/Task node heap size: 8 GB.

Number of NameNode handlers: 128

Number of DataNode Handlers: 8

DataNode maximum number of transfer threads: 4096.

1 Gbps Ethernet.

Thanks.

3 answers

Let's try to understand our limitations and see when we hit them.
a) We need the NameNode to give us information about where the files are. I would assume this rate is on the order of thousands of requests per second. More information here: https://issues.apache.org/jira/browse/HADOOP-2149. Assuming that rate to be 10000 requests per second, with 1 KB files we can locate about 10 MB of data per second. (Somehow you are getting more than that...)

b) HDFS overhead. This overhead is mostly about latency rather than bandwidth. HDFS can be tuned to serve a large number of files in parallel; HBase does this, and we can take settings from the HBase configuration guides. The question here is how many datanodes you need.
c) Your LAN. You move data over the network, so you can be capped by the 1 Gbps bandwidth. (I think you may have hit it.)

I also have to agree with Joe: HDFS is not designed for this scenario, and you should either use other technologies (like HBase, if you want to stay on the Hadoop stack) or pack the files together, for example into SequenceFiles.

As for reading large files from HDFS: run the DFSIO benchmark, and that will be your number.
At the same time, SSDs on a single host can also be a perfectly good solution.


HDFS is really not designed for many small files.

For each file that you read, the client must first talk to the namenode, which gives it the location(s) of the block(s) of that file; only then does the client stream the data from a datanode.

Now, in the best case, the client does this once, then discovers that it is itself the machine with the data on it, and can read it directly from disk. This will be fast: comparable to direct disk reads.

If it is not the machine that has the data, then it must stream the data over the network. Then you are bound by network I/O speeds, which should not be terrible, but still a bit slower than a direct disk read.

However, you are hitting an even worse case, where the overhead of talking to the namenode becomes significant. With only 1 KB files, you reach the point where you are exchanging almost as much metadata as actual data. The client has to make two separate network exchanges to get the data for each file. Add to this that the namenode is probably being hammered by all these different threads, and so it may well become the bottleneck.

So, to answer your question: yes, if you use HDFS for something it is not designed for, it will be slow. Merge your small files and use MapReduce to fetch the data, and you will get much better performance. In fact, because sequential disk reads can be exploited much better, I would not be surprised if reading from one large HDFS file were even faster than reading many small local files.


Just to add to what Joe said: another difference between HDFS and other file systems is that it keeps disk I/O as low as possible by storing data in large blocks (typically 64 MB or 128 MB), compared to a traditional FS where the block size is on the order of KBs. For that reason, it is always said that HDFS is good at handling a small number of large files rather than a large number of small files. The reason is that, while components such as CPU and RAM have seen significant improvements recently, disk I/O is an area where we are still not making much progress. HDFS uses such huge blocks (unlike a traditional FS) precisely to keep disk seeks to a minimum.

Also, if the block size is too small, we end up with more blocks, which means more metadata, and that can again degrade performance because more information has to be loaded into memory. Every block, which is treated as an object in HDFS, has about 200 bytes of metadata associated with it. If you have lots of small blocks, that metadata simply balloons, and you may run into RAM problems on the namenode.

There is a very good post on the Cloudera blog that discusses this same issue. You can read it here.

