Should I combine spark output files?

Apache Spark usually writes its output as multiple part-00XXX files. Is it best practice to merge them into a single file, or to leave them as they are? (I store the output in Google Cloud Storage.)

2 answers

I believe this is a matter of choice, but I would say no, because:

  • Merging large outputs is itself expensive, and it leaves you with one huge file that is awkward to handle.
  • Each part file corresponds to an RDD partition, so downstream jobs can exploit that layout (for example, a filter can read only some of the files instead of all of them).
  • Spark can read all the parts in one call anyway, by passing a wildcard path such as `part-*` to `sc.textFile`, so there is little practical need to merge.
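The wildcard point can be sketched without a cluster: Spark expands a path like `gs://bucket/out/part-*` as a glob, so one `sc.textFile` call reads every part file. Below is a minimal pure-Scala illustration of the same glob matching; the file names are hypothetical.

```scala
import java.nio.file.{FileSystems, Paths}

// A Spark input path such as "out/part-*" is treated as a glob,
// so every matching part file feeds the same RDD. Here we mimic
// that matching against a hypothetical directory listing.
val matcher = FileSystems.getDefault.getPathMatcher("glob:part-*")
val listing = Seq("part-00000", "part-00001", "_SUCCESS")
val partFiles = listing.filter(name => matcher.matches(Paths.get(name)))
println(partFiles.mkString(","))  // only the part files; _SUCCESS is excluded
```

Note that the `_SUCCESS` marker file Spark writes alongside its output is skipped automatically by such a pattern, which is one more reason the wildcard approach is convenient.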



It depends on what you do next. Leaving the output split preserves concurrency: later jobs can read the parts in parallel.

But if you do need a single file, for example to load the result into Python Pandas, you can merge the parts with Hadoop's `FileUtil.copyMerge`:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

// Concatenates every file under srcPath into the single file dstPath.
// (Note: FileUtil.copyMerge was removed in Hadoop 3.x.)
def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // deleteSource = false keeps the original part files;
  // the last argument is an optional separator string appended
  // after each file (none here).
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
