Efficient CSV load of coordinate format (COO) data into a local Spark matrix

I want to convert CSV data in coordinate format into a local matrix. Currently, I first convert it to a CoordinateMatrix and then to a local matrix. But is there a better way to do this?

Sample data:

0,5,5.486978435
0,3,0.438472867
0,0,6.128832321
0,7,5.295923198
0,1,7.738270234

Code:

var loadG = sqlContext.read.option("header", "false").csv("file.csv").rdd
  .map(mapFunctionCreatingMatrixEntryOutOfRow) // user-defined Row => MatrixEntry
var G = new CoordinateMatrix(loadG)

var matrixG = G.toBlockMatrix().toLocalMatrix()
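The mapping function is not shown in the question; a minimal sketch of what it might look like, assuming each CSV row holds a row index, a column index, and a value (the function name and column order are assumptions):

```scala
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.sql.Row

// Hypothetical mapping step: turn one CSV Row of strings
// ("rowIndex", "colIndex", "value") into a MatrixEntry.
def mapFunctionCreatingMatrixEntryOutOfRow(row: Row): MatrixEntry =
  MatrixEntry(
    row.getString(0).toLong,   // row index
    row.getString(1).toLong,   // column index
    row.getString(2).toDouble  // cell value
  )
```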
1 answer

A local matrix will be stored on a single machine and therefore does not make use of Spark's strengths. In other words, using Spark here seems a bit wasteful, although it is still possible.

The easiest way to get the CSV file into a local matrix is to read the CSV directly with Scala, not with Spark:

import scala.io.Source

// Read (rowIndex, colIndex, value) triples straight from the file.
val entries = Source.fromFile("data.csv").getLines()
  .map(_.split(","))
  .map(a => (a(0).toInt, a(1).toInt, a(2).toDouble))
  .toSeq

The SparseMatrix variant of the local matrix types has a fromCOO method that builds a matrix from such COO entries. The number of rows and columns must be specified, and can be derived from the data itself:

val numRows = entries.map(_._1).max + 1
val numCols = entries.map(_._2).max + 1

Then the matrix can be created:

val matrixG = SparseMatrix.fromCOO(numRows, numCols, entries)

The matrix will be stored in CSC format. Printing matrixG for the sample data gives:

1 x 8 CSCMatrix
(0,0) 6.128832321
(0,1) 7.738270234
(0,3) 0.438472867
(0,5) 5.486978435
(0,7) 5.295923198
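If a dense representation or individual cells are needed afterwards, SparseMatrix offers toDense and an apply(i, j) accessor. A small sketch using the sample data, assuming spark-mllib is on the classpath (no SparkContext is required, since these are purely local classes):

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix}

// Build the 1 x 8 matrix from the sample COO entries.
val entries = Seq(
  (0, 0, 6.128832321), (0, 1, 7.738270234), (0, 3, 0.438472867),
  (0, 5, 5.486978435), (0, 7, 5.295923198))
val g = SparseMatrix.fromCOO(1, 8, entries)

// Read a single cell, or convert the whole matrix to dense form.
println(g(0, 7))              // 5.295923198
val dense: DenseMatrix = g.toDense
```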
