Matrix multiplication with Apache Spark and Python

I am trying to do matrix multiplication using Apache Spark and Python.

Here are the details.

from pyspark.mllib.linalg.distributed import RowMatrix 

My RDD vectors

 rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
 rows_2 = sc.parallelize([[1, 2], [4, 5]])

My matrices

 mat1 = RowMatrix(rows_1)
 mat2 = RowMatrix(rows_2)

I would like to do something like this:

 mat = mat1 * mat2 

I wrote a function to handle the matrix multiplication, but I'm afraid it will have a long processing time. Here is my function:

 def matrix_multiply(df1, df2):
     nb_row = df1.count()
     list_col = df2.columns  # column names of the right-hand matrix
     mat = []
     for i in range(0, nb_row):
         # fetch row i of df1 (assumes df1 has an 'index' column)
         row = list(df1.filter(df1['index'] == i).take(1)[0])
         row_out = []
         for r in range(0, len(row)):
             r_value = 0
             # fetch column r of df2 as a plain Python list
             col = df2.select(df2[list_col[r]]).collect()
             col = [list(c)[0] for c in col]
             for c in range(0, len(col)):
                 r_value += row[c] * col[c]
             row_out.append(r_value)
         mat.append(row_out)
     return mat

My function does a lot of Spark actions (take, collect, etc.). Will it take a lot of processing time? If anyone has a better approach, that would be helpful.

1 answer

You cannot. Since RowMatrix has no meaningful row indices, it cannot be used for multiplication. And even setting that aside, the only distributed matrix type that supports multiplication by another distributed structure is BlockMatrix.

 from pyspark.mllib.linalg.distributed import *

 def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
     # attach a row index to each vector, then convert to a BlockMatrix
     return IndexedRowMatrix(
         rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
     ).toBlockMatrix(rowsPerBlock, colsPerBlock)

 as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))
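As a local sanity check of what the distributed multiply should produce, here is the same product computed with plain NumPy (NumPy is not part of the answer above, just an assumption for verification; it only works because these example matrices are tiny):

```python
import numpy as np

# The same small matrices as in the question, built locally.
m1 = np.array([[1, 2], [4, 5], [7, 8]])  # 3 x 2
m2 = np.array([[1, 2], [4, 5]])          # 2 x 2

# The (3 x 2) @ (2 x 2) product that the BlockMatrix multiply computes.
product = m1 @ m2
print(product.tolist())  # [[9, 12], [24, 33], [39, 54]]
```

To inspect the distributed result on the driver, you can call `.toLocalMatrix()` on the returned BlockMatrix, again only for matrices small enough to fit in driver memory.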
