How to load a directory of JSON files in Apache Spark in Python

I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are stored in multiple JSON files (each file is gzipped and contains a list of dictionaries). The resulting RDD would, roughly speaking, contain all of the lists combined into one list of dictionaries. I could not find anything for this in the documentation ( https://spark.apache.org/docs/1.2.0/api/python/pyspark.html ), but if I missed it, please let me know.
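For concreteness, each file (once decompressed) is assumed to contain something like the following; the field names are made up for illustration:

    [{"name": "Alice", "age": 31}, {"name": "Bob", "age": 47}]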

So far I have tried reading the JSON files and building the combined list in plain Python, then using sc.parallelize(), but the whole data set is too large to fit in memory, so this is not a practical solution. It seems like Spark should have a sensible way of handling this use case, but I don't know what it is.
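A minimal sketch of that in-memory approach (the path is hypothetical), shown only to illustrate why it does not scale: everything is materialized on the driver before parallelize() is called:

    import glob
    import gzip
    import json

    combined = []
    for path in glob.glob('/data/json/*.json.gz'):  # hypothetical path
        with gzip.open(path, 'rt') as f:
            combined.extend(json.load(f))  # each file holds one JSON list
    rdd = sc.parallelize(combined)  # fails once the combined list no longer fits in memory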

How can I create a single RDD in Python containing the lists from all of the JSON files?

I should also mention that I do not want to use Spark SQL. I would like to use functions like map, filter, etc., if possible.

+7
json python dictionary apache-spark
4 answers

Following up on what tgpfeiffer said in their answer and comment, here is what I did.

First, as they mentioned, the JSON files had to be reformatted so that each file had one dictionary per line rather than a single list of dictionaries. Then it was as simple as:

    import json

    my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
    my_RDD_dictionaries = my_RDD_strings.map(json.loads)

If there is a better or more efficient way to do this, let me know, but it seems to work.
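One detail worth noting (from Spark's documented behavior, not from the answer above): textFile can read .gz files directly, decompressing them on the fly, so the gzipped files do not need to be unpacked first. From there the usual transformations apply; the field name below is hypothetical:

    # hypothetical follow-on processing with plain RDD operations
    adults = my_RDD_dictionaries.filter(lambda d: d.get('age', 0) >= 18)
    print(adults.count())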

+5

You can use sqlContext.jsonFile() to get a SchemaRDD (which is an RDD[Row] plus a schema) that can then be used with Spark SQL. Or see Loading JSON Datasets in Spark and then use filter, map, etc. for a non-SQL processing pipeline. I think you may have to unzip the files, and also note that Spark requires each line to be a single JSON document (i.e., no multi-line objects).
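A minimal sketch of that approach under the Spark 1.2 API (the path is hypothetical):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    # jsonFile expects one JSON object per line
    schema_rdd = sqlContext.jsonFile('/data/json/')  # hypothetical directory
    schema_rdd.printSchema()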

+2

You can load a directory of files into a single RDD using textFile, which also supports wildcards. This will not give you the file names, but you don't seem to need them.
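For example (the paths are hypothetical):

    rdd = sc.textFile('/data/json/')           # every file in the directory
    rdd = sc.textFile('/data/json/*.json.gz')  # or a glob pattern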

You can use Spark SQL while still using basic transformations such as map, filter, etc., because a SchemaRDD is also an RDD (in Python as well as Scala).
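For instance, rows from a SchemaRDD can be fed straight into plain RDD operations (the field name here is hypothetical):

    names = schema_rdd.map(lambda row: row.name).filter(lambda n: n is not None)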

+1

To load a JSON list from a file as an RDD:

    import json

    # wholeTextFiles yields (filename, contents) pairs; each file holds one JSON list
    def flat_map_json(x):
        return json.loads(x[1])

    rdd = sc.wholeTextFiles('example.json').flatMap(flat_map_json)
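Note that wholeTextFiles reads each file into memory in full, so this suits many moderately sized files rather than a few huge ones; it also accepts a directory or wildcard path, which is what makes it useful for the multi-file case in the question.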
+1
