Apache Spark MLlib Model File Format

Apache Spark MLlib algorithms (e.g., decision trees) save the model to a location (e.g., myModelPath), where two directories are created: myModelPath/data and myModelPath/metadata. Each directory contains several files, and they are not plain text files. Some of them are *.parquet files.

I have a few questions:

  • What is the format of these files?
  • Which file(s) contain the actual model?
  • Can I save the model somewhere else, for example, in a database?
1 answer

Spark >= 2.4

Since version 2.4, Spark provides format-agnostic writer interfaces, and some models already implement them. For example, LinearRegressionModel:

    val lrm: org.apache.spark.ml.regression.LinearRegressionModel = ???
    val path: String = ???
    lrm.write.format("pmml").save(path)

This creates a directory with a single file containing the PMML representation of the model.
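For illustration, here is a minimal end-to-end sketch of the export above, written as a Scala script (or spark-shell session) running Spark 2.4+ in local mode. The toy training data, the application name, and the temporary output directory are assumptions made for the example and are not part of the original answer.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pmml-export-sketch")
  .getOrCreate()

// Toy data (assumption): label = 2 * x + 1.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0)),
  (3.0, Vectors.dense(1.0)),
  (5.0, Vectors.dense(2.0))
)).toDF("label", "features")

val lrm = new LinearRegression().fit(training)

// The target path must not exist yet, so use a fresh sub-path of a temp directory.
val path = java.nio.file.Files.createTempDirectory("lrm-pmml").resolve("model").toString
lrm.write.format("pmml").save(path)

// Read the exported text back to confirm that it is PMML (XML).
val pmml = spark.read.textFile(path).collect().mkString("\n")
println(pmml)
```

The output is plain XML text, so it can be inspected with any text tool, unlike the Parquet files written by the default format.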

Spark < 2.4

What is the format of these files?

  • data/*.parquet files are in the Apache Parquet columnar storage format
  • metadata/part-* looks like JSON
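Both formats can be checked from Spark itself. The sketch below, a Scala script for local mode, trains a tiny decision tree on made-up data, saves it, and reads the two directories back. The toy data and the temporary path are assumptions, and the exact Parquet schema depends on the model type and the Spark version.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("inspect-model-files-sketch")
  .getOrCreate()

// Toy data (assumption): one feature, two classes.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0)),
  (0.0, Vectors.dense(1.0)),
  (1.0, Vectors.dense(2.0)),
  (1.0, Vectors.dense(3.0))
)).toDF("label", "features")

val model = new DecisionTreeClassifier().fit(training)

// Save to a fresh location; this creates <modelPath>/data and <modelPath>/metadata.
val modelPath =
  java.nio.file.Files.createTempDirectory("dt-model").resolve("myModelPath").toString
model.write.save(modelPath)

// metadata/part-* holds JSON, so the JSON reader can parse it.
val metadata = spark.read.json(s"$modelPath/metadata")
metadata.select("class", "sparkVersion").show(false)

// data/*.parquet holds the model contents in Parquet format.
val data = spark.read.parquet(s"$modelPath/data")
data.printSchema()
data.show(false)
```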

Which file(s) contain the actual model?

  • data/*.parquet

Is it possible to save the model in another place, for example, in the database?

I am not aware of any direct method, but you can load the model data as a DataFrame and then save it to the database:

    val modelDf = spark.read.parquet("/path/to/data/")
    modelDf.write.jdbc(...)
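One practical caveat: for some models the data files contain nested columns (structs or arrays), and Spark's JDBC writer generally cannot map those to SQL column types. A possible workaround, sketched below, is to serialize each row to a JSON string and store that in a single text column. The helper names asJsonRows and saveModelDataToJdbc, the column name row_json, and the connection details in the usage comment are hypothetical, and the matching JDBC driver is assumed to be on the classpath.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Turn every row into one JSON string so that nested columns survive the JDBC write.
// The column name "row_json" is an arbitrary choice for this sketch.
def asJsonRows(df: DataFrame): DataFrame =
  df.toJSON.toDF("row_json")

// Hypothetical helper: read <model path>/data and append it to a database table.
def saveModelDataToJdbc(
    spark: SparkSession,
    dataPath: String,
    jdbcUrl: String,
    table: String,
    props: Properties): Unit = {
  val modelDf = spark.read.parquet(dataPath)
  asJsonRows(modelDf).write.mode(SaveMode.Append).jdbc(jdbcUrl, table, props)
}

// Usage (all values are placeholders, not taken from the original answer):
//   val props = new Properties()
//   props.setProperty("user", "spark")
//   props.setProperty("password", "secret")
//   saveModelDataToJdbc(spark, "/path/to/data/",
//     "jdbc:postgresql://localhost:5432/models", "model_rows", props)
```

Note that this stores only the contents of the data directory. To load the model again with Spark, the rows would have to be written back as Parquet next to the original metadata directory.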