Cannot access RowMatrix methods in PySpark: columnSimilarities(), computeColumnSummaryStatistics()

I am trying to use the RowMatrix methods columnSimilarities() and computeColumnSummaryStatistics().

  • I am especially interested in the columnSimilarities() method mentioned in this post:

https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

I build a list of sparse vectors from MLlib:

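from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# df is assumed to be a pandas DataFrame: column 0 = customer,
# column 1 = column index, column 2 = value; sc is an existing SparkContext.
sparse_vectors = []

for cust, group in df.groupby(0):
    # Vectors.sparse expects the indices in increasing order
    i_v = sorted(zip(group[1].values, group[2].values))
    indices = [x[0] for x in i_v]
    values = [x[1] for x in i_v]
    sparse_vectors.append(Vectors.sparse(len(df[1].unique()), indices, values))

# Distribute the vectors and wrap them in a RowMatrix
rows = sc.parallelize(sparse_vectors)
mat = RowMatrix(rows)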

Every time I call these methods, I get an error message:

AttributeError: 'RowMatrix' object has no attribute 'computeColumnSummaryStatistics'

or

AttributeError: 'RowMatrix' object has no attribute 'columnSimilarities'

Is this a limitation of PySpark compared with Scala Spark? I also cannot find a documentation page for the RowMatrix methods with a Google search.

Thanks.


As of now (Spark 1.6), these methods are not available in PySpark.

IndexedRowMatrix.columnSimilarities has been added for PySpark (see SPARK-12041), so it should be available in a later release.
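
For small inputs, the quantity that columnSimilarities() computes (the cosine similarity between every pair of columns) can be worked out locally without Spark. The following is only a minimal brute-force sketch in plain Python, not PySpark API: the function name column_cosine_similarities and the (indices, values) row representation are assumptions chosen to mirror the arguments passed to Vectors.sparse in the question.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def column_cosine_similarities(rows):
    """Brute-force cosine similarity between columns of a sparse matrix.

    `rows` is an iterable of (indices, values) pairs, one pair per row
    (hypothetical representation, not a Spark type). Indices are assumed
    to be unique within a row.
    Returns {(i, j): similarity} for i < j, omitting pairs whose dot
    product is zero.
    """
    dot = defaultdict(float)      # dot[(i, j)] = sum over rows of x_i * x_j
    sq_norm = defaultdict(float)  # sq_norm[i] = sum over rows of x_i ** 2

    for indices, values in rows:
        entries = sorted(zip(indices, values))
        for i, v in entries:
            sq_norm[i] += v * v
        # Every pair of non-zero entries in a row contributes to one dot product
        for (i, vi), (j, vj) in combinations(entries, 2):
            dot[(i, j)] += vi * vj

    return {
        (i, j): d / (sqrt(sq_norm[i]) * sqrt(sq_norm[j]))
        for (i, j), d in dot.items()
        if d != 0.0
    }


if __name__ == "__main__":
    example_rows = [([0, 1], [1.0, 2.0]), ([1, 2], [3.0, 4.0]), ([0, 2], [5.0, 6.0])]
    for pair, sim in sorted(column_cosine_similarities(example_rows).items()):
        print(pair, sim)
```

This runs on the driver only and costs time quadratic in the number of non-zero entries per row, so it is a fallback for data that fits in memory rather than a replacement for the distributed method.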
