I have a set of tuples that consist of compound keys and values. For example,
tfile.collect() = [(('id1','pd1','t1'),5.0), (('id2','pd2','t2'),6.0), (('id1','pd1','t2'),7.5), (('id1','pd1','t3'),8.1) ]
I want to perform SQL-like operations on this collection, where I can aggregate the information based on id[1..n] or pd[1..n]. I want to implement this using plain pyspark APIs, without using SQLContext. In my current implementation, I am reading from a bunch of files and merging the RDDs.
def readfile():
    fr = range(6, 23)
    # union the per-file RDDs; bind f as a default argument so each
    # lambda keeps its own file index instead of the loop's final value
    tfile = sc.union([sc.textFile(basepath + str(f) + ".txt")
                        .map(lambda view, f=f: set_feature(view, f))
                        .reduceByKey(lambda a, b: a + b)
                      for f in fr])
    return tfile
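set_feature is not shown above; for the purpose of this question it can be assumed to parse a raw line into a ((id, pd, t), value) pair, roughly like this simplified sketch (the real parsing is more involved, and may also use the file index f):

def set_feature(view, f):
    # hypothetical sketch only: split a raw line and build the compound key
    fields = view.strip().split(',')
    key = (fields[0], fields[1], fields[2])   # (id, pd, t)
    return (key, float(fields[3]))            # numeric value, summed by reduceByKey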
I intend to create an aggregated array as a value. For example,
agg_tfile = [(('id1','pd1'), [5.0, 7.5, 8.1])]
where 5.0, 7.5, 8.1 correspond to [t1, t2, t3]. I currently achieve this with vanilla Python code using dictionaries. It works great for small data sets, but I'm worried it may not scale to large datasets. Is there an efficient way to achieve the same thing using the pyspark APIs?
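For context, the dictionary-based approach I use today is roughly the sketch below (it assumes the data has already been collected to the driver, which is exactly the part I suspect will not scale):

from collections import defaultdict

# plain-Python fallback: group values by the (id, pd) prefix of the compound key
agg = defaultdict(list)
for (id_, pd_, t), value in tfile.collect():
    agg[(id_, pd_)].append(value)
agg_tfile = list(agg.items())   # e.g. [(('id1', 'pd1'), [5.0, 7.5, 8.1]), ...]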
python apache-spark pyspark
Rahul