Combining compressed files on HDFS

How can I combine all the files in a directory on HDFS, which I know are all compressed, into a single compressed file, without copying the data through the local machine? For example, but not necessarily, using Pig?

As an example, I have a folder /data/input containing the files part-m-00000.gz and part-m-00001.gz. Now I want to merge them into a single file /data/output/foo.gz.

+4
3 answers

I would suggest looking at FileCrush (https://github.com/edwardcapriolo/filecrush), a tool for combining files on HDFS using MapReduce. It does exactly what you described and provides several options for handling compression and for controlling the number of output files.

  Crush --max-file-blocks XXX /data/input /data/output

The max-file-blocks option sets the maximum number of dfs blocks per output file. For example, from the project's documentation:

With the default value of 8, 80 small files, each 1/10th of a dfs block, are grouped into a single output file, since 8 * 1/10 = 8 dfs blocks. With 81 such files, each again 1/10th of a dfs block, two output files are created: one holding the combined contents of 41 files and the other holding the remaining 40. A directory of many small files is thus replaced by a smaller number of larger files of roughly equal size.
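
For reference, here is a hedged sketch of how such a Crush run is usually launched as a MapReduce job; the jar name, the main-class path, and the trailing timestamp argument are assumptions about the filecrush build rather than anything stated above, so check the project's README for the exact syntax of your version:

  # jar name, main class, and timestamp argument are assumptions; see the filecrush README
  hadoop jar filecrush-*.jar com.m6d.filecrush.crush.Crush \
    --max-file-blocks 8 \
    /data/input /data/output 20230101000000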

+4

If you use Pig, setting the parallelism of the final job to 1 gives you a single output file. There are two ways to do this (a sketch follows the list):

  • set default_parallel 20; at the top of the script, which then applies to everything in the script
  • a PARALLEL clause on a single operation, e.g. DISTINCT ID PARALLEL 1;
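
A minimal sketch of that approach, using the paths from the question; forcing a reduce with ORDER and the PigStorage compression settings are assumptions of mine rather than part of the original answer, so verify them against your Pig version:

  -- a minimal sketch: one reducer means one gzipped output part
  SET default_parallel 1;
  -- ask PigStorage to gzip its output
  SET output.compression.enabled true;
  SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

  -- .gz input files are decompressed automatically
  data   = LOAD '/data/input' USING PigStorage();
  -- PARALLEL applies to reduce-side operators, so force a reduce (ORDER also sorts the records)
  merged = ORDER data BY $0 PARALLEL 1;
  -- the output directory must not exist before the job runs
  STORE merged INTO '/data/output' USING PigStorage();

Pig writes the result under a name like part-r-00000.gz; if a fixed name such as foo.gz is needed, it can be renamed afterwards with hdfs dfs -mv.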

+1

I know that it is possible to merge to the local file system using the command "hdfs dfs -getmerge". Perhaps you can merge to the local file system that way and then use "hdfs dfs -copyFromLocal" to copy the result back to HDFS.
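
A worked version of that workaround for the question's paths; concatenated gzip members form a valid gzip stream, so the merged file stays readable by gunzip and, in reasonably recent Hadoop versions, by the built-in gzip codec. Note that the data does pass through the local machine, which the question wanted to avoid:

  # pull and concatenate the parts locally, then push the result back as a single file
  hdfs dfs -getmerge /data/input /tmp/foo.gz
  hdfs dfs -mkdir -p /data/output
  hdfs dfs -copyFromLocal /tmp/foo.gz /data/output/foo.gz
  rm /tmp/foo.gz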

0