Running a sum over the Int values of an RDD

Is there a built-in transformation to sum the Ints of the following RDD?

org.apache.spark.rdd.RDD[(String, (Int, Int))]
The String is the key and the Ints are the value; I need the sum of all the Ints as RDD[(String, Int)]. I tried groupByKey without success ...

In addition, the result should again be an RDD.

Thanks in advance

+4
2 answers

If the goal is to sum the elements of the (Int, Int) value, a map transformation can achieve this:

val arr = Array(("A", (1, 1)), ("B", (2, 2)), ("C", (3, 3)))

val rdd = sc.parallelize(arr)

val result = rdd.map{ case (a, (b, c)) => (a, b + c) }

// result.collect = Array((A,2), (B,4), (C,6))

Instead, if the value type is an array, Array.sum can be used.

val rdd = sc.parallelize(Array(("A", Array(1, 1)), 
                               ("B", Array(2, 2)), 
                               ("C", Array(3, 3)))

rdd.map { case (a, b) => (a, b.sum) }

Edit:

The map transformation does not preserve the original partitioner; as @Justin suggested, mapValues might be more appropriate here:

rdd.mapValues{ case (x, y) => x + y }
rdd.mapValues(_.sum) 
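
To see the difference, here is a minimal sketch (assuming a SparkContext named sc, as in the examples above) that pre-partitions the pair RDD and compares the partitioner of the two results; partitionBy, HashPartitioner and the partitioner field are standard Spark API:

import org.apache.spark.HashPartitioner

// Pre-partition the pair RDD so it carries a partitioner
val partitioned = sc.parallelize(Array(("A", (1, 1)), ("B", (2, 2)), ("C", (3, 3))))
                    .partitionBy(new HashPartitioner(4))

// map drops the partitioner, so a later reduceByKey/join would reshuffle
partitioned.map { case (k, (x, y)) => (k, x + y) }.partitioner    // None

// mapValues keeps it, because the keys are guaranteed to be unchanged
partitioned.mapValues { case (x, y) => x + y }.partitioner        // Some(HashPartitioner)
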
+5

In PySpark:

rdd = sc.parallelize([ ('A', (1,1)), ('B', (2,2)), ('C', (3, 3)) ])
rdd.mapValues(lambda v: v[0] + v[1]).collect()

>>> rdd.map(lambda kv: (kv[0], sum(kv[1]))).collect()
[('A', 2), ('B', 4), ('C', 6)]

>>> rdd.map(lambda kv: (kv[0], kv[1][0] + kv[1][1])).collect()
[('A', 2), ('B', 4), ('C', 6)]

>>> def fn(x):
...   k_s = (x[0], sum(x[1]))
...   print(k_s)
... 
>>> rdd.foreach(fn)
('C', 6)
('A', 2)
('B', 4)
+2
