I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are stored in multiple JSON files (each of them is gzipped and contains a list of dictionaries). The resulting RDD would, roughly speaking, contain all of the lists of dictionaries combined into one single list of dictionaries. I could not find this in the documentation ( https://spark.apache.org/docs/1.2.0/api/python/pyspark.html ), but if I missed it, please let me know.
So far I have tried reading the JSON files, building the combined list in Python, and then using sc.parallelize(), but the whole data set is too large to fit in memory, so this is not a practical solution. It seems like Spark should have a sensible way of handling this use case, but I don't know what it is.
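To make the problem concrete, here is a minimal sketch of what I have been doing. The file names and contents are made-up stand-ins for the real data set; the point is that everything gets loaded on the driver before Spark ever sees it:

```python
import gzip
import json

# Stand-in data: two small gzipped files, each holding a JSON list of dicts.
# (In the real problem there are many large files, which is why this fails.)
for i, chunk in enumerate([[{"a": 1}, {"a": 2}], [{"a": 3}]]):
    with gzip.open(f"part-{i}.json.gz", "wt") as f:
        json.dump(chunk, f)

# What I have been doing: load and combine everything in driver memory...
records = []
for i in range(2):
    with gzip.open(f"part-{i}.json.gz", "rt") as f:
        records.extend(json.load(f))

# ...and only then hand the combined list to Spark:
# rdd = sc.parallelize(records)
# This requires the entire data set to fit in memory first, which it doesn't.
```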
How can I create a single RDD in Python that contains the lists from all of the JSON files?
I should also mention that I do not want to use Spark SQL. I would like to use functions like map, filter, etc., if possible.
json python dictionary apache-spark
Brandt