I have 20 million files in S3 spanning approximately 8000 days.
Files are organized by UTC date, for example: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text, ranging from 0 bytes (empty) to about 100 KB at the 95th percentile, although a few files reach several MB.
Using Spark and Scala (I am new to both and want to learn them), I would like to save "daily packages" (8000 of them), each containing all the files for that day. Ideally I would like to keep the original file names as well as their contents. The output should also go to S3 and be compressed in a format that is suitable as input for subsequent Spark steps and experiments.
One idea was to store each package as a bunch of JSON objects (one per line, '\n'-separated), for example:
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"} {id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set it up elegantly.
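For comparison, this is roughly how I picture the SequenceFile variant (again untested; key = original file name, value = content, the output path is a placeholder, and older Spark versions may need import org.apache.spark.SparkContext._ for the sequence-file implicits):

val day = "1996/04/09"
sc.wholeTextFiles(s"s3n://mybucket/path/txt/$day/*.txt.gz")
  .map { case (path, content) => (path.substring(path.lastIndexOf('/') + 1), content) }
  .saveAsSequenceFile(s"s3n://mybucket/path/seq/$day",
    Some(classOf[org.apache.hadoop.io.compress.GzipCodec]))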
Playing with the Spark shell, I saw that reading files is very easy, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") // which would take forever
But how can I "intercept" the reader to provide the file name?
Or maybe I should build an RDD of all the files, keyed by day, and at the reduce stage write out K=filename, V=fileContent? A rough sketch of what I mean follows.
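Untested sketch of that last idea; the path parsing assumes the s3n://.../YYYY/MM/DD/filename.txt.gz layout above, and the actual per-day write step is left out:

// (path, content) pairs already carry the file name, so no "interception" needed
val all = sc.wholeTextFiles("s3n://mybucket/path/txt/*/*/*/*.txt.gz")

val byDay = all.map { case (path, content) =>
  val parts = path.split("/")
  val day = parts.slice(parts.length - 4, parts.length - 1).mkString("/") // "YYYY/MM/DD"
  (day, (parts.last, content))                                            // K = day, then (fileName, fileContent)
}

// collect everything belonging to one day into one group, to be written as a package
val packages = byDay.groupByKey()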
scala amazon-s3 hadoop apache-spark
Pierre d