Should I combine spark output files?

Apache Spark usually writes its output as multiple part-00XXX files. Is it best practice to merge them into a single file, or to leave them as they are? (I store the output in Google Cloud Storage.)

2 answers

I believe this is a matter of choice, but I would say no, because:

  • Merging large outputs is itself expensive, and it leaves you with one huge file that is awkward to handle.
  • Each part file corresponds to an RDD partition, so downstream jobs can exploit that layout (for example, a filter can read only some of the files instead of all of them).
  • Spark can read all the parts in one call anyway, by passing a wildcard path such as `part-*` to `sc.textFile`, so there is little practical need to merge.
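The wildcard point can be sketched without a cluster: Spark expands a path like `gs://bucket/out/part-*` as a glob, so one `sc.textFile` call reads every part file. Below is a minimal pure-Scala illustration of the same glob matching; the file names are hypothetical.

```scala
import java.nio.file.{FileSystems, Paths}

// A Spark input path such as "out/part-*" is treated as a glob,
// so every matching part file feeds the same RDD. Here we mimic
// that matching against a hypothetical directory listing.
val matcher = FileSystems.getDefault.getPathMatcher("glob:part-*")
val listing = Seq("part-00000", "part-00001", "_SUCCESS")
val partFiles = listing.filter(name => matcher.matches(Paths.get(name)))
println(partFiles.mkString(","))  // only the part files; _SUCCESS is excluded
```

Note that the `_SUCCESS` marker file Spark writes alongside its output is skipped automatically by such a pattern, which is one more reason the wildcard approach is convenient.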



It depends on what you do next. Leaving the output split preserves concurrency: later jobs can read the parts in parallel.

But if you do need a single file, for example to load the result into Python Pandas, you can merge the parts with Hadoop's `FileUtil.copyMerge`:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

// Concatenates every file under srcPath into the single file dstPath.
// (Note: FileUtil.copyMerge was removed in Hadoop 3.x.)
def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // deleteSource = false keeps the original part files;
  // the last argument is an optional separator string appended
  // after each file (none here).
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
