I am new to Hadoop. I am trying to process (read only) a large number of small files stored on HDFS. The average file size is about 1 KB, and there are more than 10 million files. The program must be written in C++ due to some constraints.
This is just a performance test, so I am using only 5 machines as data nodes. Each data node has 5 data disks.
I wrote a small C++ program that reads the files directly from the local disks (not through HDFS) to establish a performance baseline. The program creates 4 read threads per disk. The result is about 14 MB/s per disk, so the total throughput is about 14 MB/s * 5 disks * 5 machines = 350 MB/s.
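For reference, here is a minimal sketch of what that baseline looks like: 4 reader threads per disk, each opening and reading its share of the ~1 KB files, with total bytes/second reported at the end. The file-list setup is hypothetical and stands in for however the files are enumerated on each disk.

    // Direct-disk baseline sketch: N reader threads, each reads a list of small files.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    static std::atomic<long long> g_bytes_read{0};

    // Each thread sequentially opens and reads its list of ~1 KB files.
    static void read_files(const std::vector<std::string>& paths) {
        std::vector<char> buf(64 * 1024);
        for (const auto& p : paths) {
            FILE* f = std::fopen(p.c_str(), "rb");
            if (!f) continue;
            size_t n;
            while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
                g_bytes_read += static_cast<long long>(n);
            std::fclose(f);
        }
    }

    int main() {
        // Assume the per-disk file lists have already been collected
        // (e.g. /data1 ... /data5) and split into 4 shards per disk,
        // giving 4 read streams per disk as described above.
        std::vector<std::vector<std::string>> shards = /* build shards here */ {};

        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (const auto& shard : shards)
            workers.emplace_back(read_files, std::cref(shard));
        for (auto& t : workers) t.join();
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();

        std::printf("%.1f MB/s\n", g_bytes_read / (1024.0 * 1024.0) / secs);
        return 0;
    }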
However, when the same program (still C++, dynamically linked against libhdfs.so, creating 4 * 5 * 5 = 100 threads) reads the files from HDFS, the throughput is only about 55 MB/s.
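The per-thread read loop in the HDFS case looks roughly like the sketch below, using the libhdfs C API (hdfsConnect, hdfsOpenFile, hdfsRead). The NameNode address and file list are placeholders; the real program runs 100 such threads.

    // One HDFS reader thread (sketch), linked with -lhdfs.
    #include <fcntl.h>
    #include <string>
    #include <vector>
    #include "hdfs.h"   // from the Hadoop distribution

    static long long read_from_hdfs(const std::vector<std::string>& paths) {
        // "default"/0 picks up the NameNode address from the Hadoop
        // configuration on the classpath; adjust as needed.
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return -1;

        long long total = 0;
        std::vector<char> buf(64 * 1024);
        for (const auto& p : paths) {
            hdfsFile f = hdfsOpenFile(fs, p.c_str(), O_RDONLY, 0, 0, 0);
            if (!f) continue;
            tSize n;
            while ((n = hdfsRead(fs, f, buf.data(), buf.size())) > 0)
                total += n;
            hdfsCloseFile(fs, f);
        }
        hdfsDisconnect(fs);
        return total;
    }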
If the same program is launched through MapReduce (Hadoop streaming, 5 tasks, each with 20 threads, 100 threads in total), the throughput drops further to about 45 MB/s. (I suspect the framework's bookkeeping overhead slows it down.)
I am wondering what throughput HDFS can reasonably provide. As you can see, compared with native disk reads, the HDFS throughput is only about 1/7. Is this a problem with my configuration? A limitation of HDFS? A limitation of Java? What is the best approach for my scenario? Would SequenceFiles help (much)? What throughput, relative to plain local I/O, is it reasonable to expect?
Here are some of my settings:
NameNode heap size: 32 GB.
JobTracker/TaskTracker heap size: 8 GB.
NameNode handler count: 128
DataNode handler count: 8
DataNode maximum number of transfer threads: 4096
1 Gbps Ethernet.
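For clarity, my assumption is that the handler and transfer-thread settings above correspond to the usual hdfs-site.xml properties, roughly like this (heap sizes are set separately via HADOOP_HEAPSIZE / HADOOP_NAMENODE_OPTS in hadoop-env.sh):

    <property>
      <name>dfs.namenode.handler.count</name>
      <value>128</value>
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>8</value>
    </property>
    <property>
      <!-- dfs.datanode.max.xcievers in older releases -->
      <name>dfs.datanode.max.transfer.threads</name>
      <value>4096</value>
    </property>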
Thanks.