How many partitions does Spark create when loading a file from an S3 bucket?

When a file is loaded from HDFS, Spark by default creates one partition per HDFS block. But how does Spark determine the number of partitions when a file is loaded from an S3 bucket?

2 answers

See the code in org.apache.hadoop.mapred.FileInputFormat.getSplits().

The block size depends on the implementation of the S3 filesystem (see FileStatus.getBlockSize()). For instance, S3AFileStatus just sets it to 0, and then FileInputFormat.computeSplitSize() comes into play.
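To make the split math concrete, here is a minimal Scala sketch of the formula used by computeSplitSize() in the old mapred API (the real method is Java; the three parameter names follow the Hadoop source):

// goalSize  - totalSize / numSplits (Spark passes its minPartitions as numSplits)
// minSize   - the configured minimum split size (default 1)
// blockSize - whatever FileStatus.getBlockSize() reports for the file
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

// If the filesystem reports a block size of 0, math.min(goalSize, 0) is 0,
// so the split size collapses to minSize.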

Also, you don't get multiple partitions if your InputFormat is not splittable :)
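For example, gzip is a non-splittable codec, so a .gz object becomes a single partition no matter how large it is (the bucket and path here are hypothetical):

val gz = sc.textFile("s3a://my-bucket/logs.gz") // gzip cannot be split
println(gz.partitions.length)                   // prints 1: the whole file is one partition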


Spark S3, , HDFS S3 : . :

val inputRDD = sc.textFile("s3a://...") // load the file from S3
println(inputRDD.partitions.length)     // number of partitions Spark created

This prints the number of partitions Spark created for the S3 input.
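Note that later versions of the S3A connector no longer report a block size of 0: they report the value of the fs.s3a.block.size property, which lets you influence how many partitions Spark creates. A sketch, assuming such a Hadoop version (the 64 MB value is just an example):

// Hypothetical tuning example: lower the block size S3A reports so that
// FileInputFormat produces more, smaller splits for large S3 objects.
sc.hadoopConfiguration.set("fs.s3a.block.size", "67108864") // 64 MB, example value
val tuned = sc.textFile("s3a://...") // same elided path as above
println(tuned.partitions.length)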
