Use your favorite method to list the files under a path, take a random sample of the names, and then union the resulting RDDs:
    import pyspark
    import random

    sc = pyspark.SparkContext(appName="Sampler")

    # "path" is the S3 prefix you want to sample from (see list_files below)
    file_list = list_files(path)

    # Keep roughly 5% of the files
    desired_pct = 5
    file_sample = random.sample(file_list, int(len(file_list) * desired_pct / 100))

    # Read the sampled files into one RDD and repartition for parallelism
    file_sample_rdd = sc.emptyRDD()
    for f in file_sample:
        file_sample_rdd = file_sample_rdd.union(sc.textFile(f))
    sample_data_rdd = file_sample_rdd.repartition(160)
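Chaining union() in a loop builds a long RDD lineage. As a minimal alternative sketch (assuming the same file_sample list of S3 paths), you can pass all sampled paths to a single textFile call, since Spark accepts a comma-separated list of paths:

    # Alternative sketch: textFile takes a comma-separated list of paths,
    # which avoids building a deep chain of union() calls.
    sample_data_rdd = sc.textFile(",".join(file_sample)).repartition(160)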
Here is one possible quick-and-dirty implementation of list_files, which lists the files under a directory on S3:
    import os

    def list_files(path, profile = None):
        # Only prefixes ending in "/" are handled here
        if not path.endswith("/"):
            raise Exception("not handled...")
        command = 'aws s3 ls %s' % path
        if profile is not None:
            command = 'aws --profile %s s3 ls %s' % (profile, path)
        # Parse the CLI output and rebuild the full S3 paths
        result = os.popen(command)
        _r = result.read().strip().split('\n')
        _r = [path + i.strip().split(' ')[-1] for i in _r]
        return _r
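If you would rather not shell out to the AWS CLI, here is a minimal sketch using boto3 instead (assuming you split the bucket and prefix out of the path yourself, and that credentials are available to the session); the function name is just for illustration:

    import boto3

    def list_files_boto3(bucket, prefix, profile=None):
        # Sketch only: lists object keys under bucket/prefix via the S3 API
        session = boto3.Session(profile_name=profile) if profile else boto3.Session()
        s3 = session.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        keys = []
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                keys.append("s3://%s/%s" % (bucket, obj["Key"]))
        return keys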