Performance with many output files in Hadoop

I use a custom output format that writes a new sequence file per key from each mapper, so you end up with something like this.

Input

    Key1    Value
    Key2    Value
    Key1    Value

Files

    /path/to/output/Key1/part-00000
    /path/to/output/Key2/part-00000

I noticed a huge performance hit: it usually takes about 10 minutes to simply map the input, but after two hours the mappers were still not finished, although they were emitting records. I expect the number of unique keys to be about half the number of input records, around 200,000.
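For reference, my format does something roughly equivalent to the MultipleOutputs sketch below (simplified, with assumed Text keys and values; it is not my actual code):

    // Simplified sketch only: per-key output paths via MultipleOutputs,
    // assuming Text keys/values and SequenceFileOutputFormat set on the job.
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class PerKeyMapper extends Mapper<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // baseOutputPath "Key1/part" ends up as /path/to/output/Key1/part-m-00000
            out.write(key, value, key.toString() + "/part");
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            out.close();
        }
    }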

Has anyone done something similar, or can anyone suggest something that might help performance? I would like to keep this key-splitting process if at all possible.

Thanks!

+4
2 answers

I believe you should reconsider your design. I do not believe HDFS scales to anything like 10 million files. I suggest reading more about Hadoop, HDFS, and MapReduce. A good place to start would be http://www.cloudera.com/blog/2009/02/the-small-files-problem/ .
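For instance, a consolidated design keeps the key inside the records of a single SequenceFile per reducer, so the file count scales with the number of reducers rather than the number of keys. A minimal driver sketch (placeholder paths, assuming Text keys and values):

    // Sketch: one SequenceFile per reducer, keys kept inside the records,
    // instead of one file per key. Paths and types are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class KeyedSequenceOutputJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "keyed-sequence-output");
            job.setJarByClass(KeyedSequenceOutputJob.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);   // assumes Text/Text sequence files
            job.setOutputFormatClass(SequenceFileOutputFormat.class); // one part file per reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Identity map/reduce by default; records stay grouped and sorted by key
            // within each part file, so later jobs can still process them per key.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }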

Good luck

EDIT 8/26: Based on a comment by @David Gruzman, I looked deeper into the problem. Indeed, the penalty for storing a large number of small files applies only to the NameNode; there is no additional penalty on the data nodes. I have removed the incorrect part of my answer.

+2

It looks like writing the output to a key-value store could help a lot.
For example, HBase might suit your needs, since it is optimized for a large number of writes, and you would reuse part of your Hadoop infrastructure. There is an existing output format for writing directly to HBase: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
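A rough sketch of a reducer feeding TableOutputFormat (it assumes a pre-created table "output_table" with column family "v"; those names are placeholders):

    // Sketch: write values to HBase instead of many small HDFS files.
    // Assumes an existing table "output_table" with column family "v" (placeholders).
    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public class HBaseSinkReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                Put put = new Put(Bytes.toBytes(key.toString()));  // row key = map key
                // Repeated values for a key land in the same cell (new timestamps);
                // use a composite row key or qualifier if all of them must be kept.
                put.add(Bytes.toBytes("v"), Bytes.toBytes("val"), Bytes.toBytes(value.toString()));
                context.write(null, put);  // TableOutputFormat ignores the output key
            }
        }
        // Driver wiring, roughly:
        //   TableMapReduceUtil.initTableReducerJob("output_table", HBaseSinkReducer.class, job);
    }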

+1
