Emit multiple pairs in a map operation

Let's say I have lines of phone call records:

[CallingUser, ReceivingUser, Duration] 

I want to find the total amount of time each user was on the phone: the sum of Duration over all records where the user appears as either CallingUser or ReceivingUser.

Effectively, for each record I would like to emit two pairs: (CallingUser, Duration) and (ReceivingUser, Duration).

What is the most efficient way to do this? I can combine two mapped RDDs, but it is not clear whether this is a good approach:

    # Sample data:
    callData = sc.parallelize([["User1", "User2", 2], ["User1", "User3", 4], ["User2", "User1", 8]])
    calls = callData.map(lambda record: (record[0], record[2]))
    # The potentially inefficient step in question. RDDs do not support +=,
    # so union() is used to combine the two mapped RDDs:
    calls = calls.union(callData.map(lambda record: (record[1], record[2])))
    totals = calls.reduceByKey(lambda a, b: a + b)
apache-spark pyspark
2 answers

You need a flatMap. If you write a function that returns the list [(record[0], record[2]), (record[1], record[2])], then you can flatMap it!
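A minimal sketch of that suggestion, reusing the callData RDD from the question (emit_pairs is just an illustrative name, not part of the Spark API):

    def emit_pairs(record):
        # One call record yields two (user, duration) pairs.
        return [(record[0], record[2]), (record[1], record[2])]

    calls = callData.flatMap(emit_pairs)
    totals = calls.reduceByKey(lambda a, b: a + b)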


Use flatMap(), which is designed for taking a single input and producing multiple mapped outputs. Complete code:

    callData = sc.parallelize([["User1", "User2", 2], ["User1", "User3", 4], ["User2", "User1", 8]])
    calls = callData.flatMap(lambda record: [(record[0], record[2]), (record[1], record[2])])
    print(calls.collect())
    # prints [('User1', 2), ('User2', 2), ('User1', 4), ('User3', 4), ('User2', 8), ('User1', 8)]
    totals = calls.reduceByKey(lambda a, b: a + b)
    print(totals.collect())
    # prints [('User2', 10), ('User3', 4), ('User1', 14)]
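On the efficiency question: flatMap() emits both pairs in a single pass over callData, while the union of two map() results references callData twice, so it gets recomputed at action time unless it is cached. For comparison, a sketch of the union-based alternative (variable names here are illustrative):

    calling = callData.map(lambda record: (record[0], record[2]))
    receiving = callData.map(lambda record: (record[1], record[2]))
    combined = calling.union(receiving)  # scans callData twice unless it is persisted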
