I am trying to do matrix multiplication using Apache Spark and Python.
Here is my setup:
from pyspark.mllib.linalg.distributed import RowMatrix
My RDDs of row vectors:
rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])
My matrices:
mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)
I would like to do something like this:
mat = mat1 * mat2
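To make the target concrete, here is a small plain-Python sketch of the product for the sample data above. It does not use Spark; `local_matmul` is a hypothetical helper name, and the sketch assumes both matrices fit in memory as lists of rows.

```python
def local_matmul(a, b):
    """Multiply two matrices given as lists of rows (no Spark involved)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    # Transpose b once so each of its columns can be paired with a row of a.
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


# Same values as rows_1 (3x2) and rows_2 (2x2) above.
expected = local_matmul([[1, 2], [4, 5], [7, 8]], [[1, 2], [4, 5]])
print(expected)
```

A distributed version should produce the same 3x2 result for this input.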
I wrote a function to handle the matrix multiplication, but I'm afraid it will take a long time to run. Here is my function:
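def matrix_multiply(df1, df2):
    # df1 and df2 are DataFrames; df1 has an 'index' column identifying each row.
    # Note: list_col (the column names of df2) is not defined in this snippet.
    nb_row = df1.count()
    mat = []
    for i in range(0, nb_row):
        # One Spark job per row of df1.
        row = list(df1.filter(df1['index'] == i).take(1)[0])
        row_out = []
        for r in range(0, len(row)):
            r_value = 0
            # One Spark job per column of df2, repeated for every row of df1.
            col = df2.select(df2[list_col[r]]).collect()
            col = [list(c)[0] for c in col]
            for c in range(0, len(col)):
                r_value += row[c] * col[c]
            row_out.append(r_value)
        mat.append(row_out)
    return mat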
My function performs many Spark actions (take, collect, etc.), each of which launches its own job. Will it take a lot of processing time? If anyone has another approach, it would be very helpful.
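One possible direction, sketched here in plain Python as an assumption rather than as tested Spark code: store each matrix as (row, column, value) entries, match entries of the first matrix with entries of the second on the shared index, and sum the partial products per output cell. This is the shape that would map onto RDD operations such as `join` and `reduceByKey`, so no per-row `take` or `collect` is needed. The helper names `to_entries` and `multiply_entries` are hypothetical.

```python
from collections import defaultdict


def to_entries(rows):
    """Flatten a list of rows into (row, col, value) coordinate entries."""
    return [(i, j, v) for i, row in enumerate(rows) for j, v in enumerate(row)]


def multiply_entries(a_entries, b_entries):
    """Multiply two matrices given as coordinate entries.

    Step 1 groups B by its row index k (the key the two sides share).
    Step 2 pairs each A entry (i, k) with every B entry (k, j).
    Step 3 sums the partial products for each output cell (i, j).
    """
    b_by_row = defaultdict(list)
    for k, j, v in b_entries:
        b_by_row[k].append((j, v))

    cells = defaultdict(int)
    for i, k, a_val in a_entries:
        for j, b_val in b_by_row[k]:
            cells[(i, j)] += a_val * b_val
    return dict(cells)


# Same values as rows_1 and rows_2 above.
product = multiply_entries(
    to_entries([[1, 2], [4, 5], [7, 8]]),
    to_entries([[1, 2], [4, 5]]),
)
print(sorted(product.items()))
```

In Spark the grouping and summing steps would run as distributed transformations, with a single action at the end to materialise the result.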
Raouf