SBT: How to pack an instance of a class as a JAR?

I have some code that looks something like this:

    class FoodTrainer(images: S3Path) { // data is a >100 GB file living in S3
      def train(): FoodClassifier // Very expensive - takes ~5 hours!
    }

    class FoodClassifier { // Lightweight API class
      def isHotDog(input: Image): Boolean
    }

I want to build a JAR assembly (sbt assembly), call val classifier = new FoodTrainer(s3Dir).train(), and publish a JAR that contains this classifier instance, instantly accessible to downstream users of the library.

What is the easiest way to do this? What are some established paradigms for it? I know this is a fairly common idiom in ML projects for publishing trained models, for example: http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

How can I do this with sbt assembly, without checking a large model class or data file into version control?

java scala jar sbt sbt-assembly
4 answers

Ok, I managed to do this:

  • Split the food trainer into 2 separate sbt submodules: food-trainer and food-model . The former is invoked only at build time to create the model and serialize it into the generated resources of food-model . The latter serves as a simple factory object that instantiates the model from its serialized form. Every downstream project depends only on the food-model submodule.

  • food-trainer holds the bulk of the code and has a main method that serializes a FoodModel :

     object FoodTrainer {
       def main(args: Array[String]): Unit = {
         val input = args(0)
         val outputDir = args(1)
         val model: FoodModel = new FoodTrainer(input).train()
         // ObjectOutputStream needs an OutputStream, not a File
         val out = new ObjectOutputStream(new FileOutputStream(new File(outputDir, "model.bin")))
         out.writeObject(model)
         out.close()
       }
     }
  • Add a resource-generator task in build.sbt that runs the food-trainer module:

     lazy val foodTrainer = (project in file("food-trainer"))

     lazy val foodModel = (project in file("food-model"))
       .dependsOn(foodTrainer)
       .settings(
         resourceGenerators in Compile += Def.task {
           val log = streams.value.log
           val dest = (resourceManaged in Compile).value
           IO.createDirectory(dest)
           runModuleMain(
             cmd = s"com.foo.bar.FoodTrainer $pathToImages ${dest.getAbsolutePath}",
             cp = (fullClasspath in Runtime in foodTrainer).value.files,
             log = log
           )
           Seq(dest / "model.bin")
         }.taskValue
       )

     def runModuleMain(cmd: String, cp: Seq[File], log: Logger): Unit = {
       log.info(s"Running $cmd")
       val opt = ForkOptions(bootJars = cp, outputStrategy = Some(LoggedOutput(log)))
       val res = Fork.scala(config = opt, arguments = cmd.split(' '))
       require(res == 0, s"$cmd exited with code $res")
     }
  • Now in your food-model module you have something like this:

     object FoodModel {
       lazy val model: FoodModel =
         new ObjectInputStream(getClass.getResourceAsStream("/model.bin"))
           .readObject()
           .asInstanceOf[FoodModel]
     }

Each downstream project now depends only on food-model and simply uses FoodModel.model . This gives you:

  • The model loads quickly at runtime from JAR-packed resources
  • No need to train the model at runtime (very expensive)
  • No need to check the model into version control (again, the binary model is very large) - it is only packaged into your JAR
  • No need to split FoodTrainer and FoodModel into their own separately published JARs (deploying those internally would be its own headache) - instead, we keep them in the same project as different submodules that are packaged into one JAR.

You should serialize the data produced by training into a file of its own. You can then pack that data file into the JAR. Your production code opens and reads the file rather than running the training algorithm.
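As a minimal sketch of this save-then-load pattern, assuming the trained result is Serializable (Model and ModelIO here are hypothetical stand-ins, not names from the question):

```scala
import java.io.{File, FileInputStream, FileOutputStream, InputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for the trained classifier; any Serializable class works.
case class Model(threshold: Double) extends Serializable

object ModelIO {
  // Build time: write the trained model to a file that gets packed into the JAR.
  def save(model: Model, file: File): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try out.writeObject(model) finally out.close()
  }

  // Runtime: read the model back from a stream (e.g. getResourceAsStream),
  // with no training involved.
  def load(in: InputStream): Model = {
    val ois = new ObjectInputStream(in)
    try ois.readObject().asInstanceOf[Model] finally ois.close()
  }
}
```

At runtime the load side would be handed getClass.getResourceAsStream("/model.bin") instead of a FileInputStream.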


Following are the steps.

During the build resource generation phase:

  • Generate the model.
  • Serialize the model's contents to a file in the managed resources folder.
      resourceGenerators in Compile += Def.task {
        val classifier = new FoodTrainer(s3Dir).train()
        val contents = FoodClassifier.serialize(classifier)
        val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
        IO.write(file, contents)
        Seq(file)
      }.taskValue
    
  • The resource will be automatically included in the jar file and will not appear in the source tree.
  • To load the model, simply add code that reads the resource and parses the model.
      object FoodClassifierModel {
        lazy val classifier = readResource("/mypackage/food-classifier.model")
        def readResource(resourceName: String): FoodClassifier = {
          val stream = getClass.getResourceAsStream(resourceName)
          val lines = scala.io.Source.fromInputStream(stream).getLines
          val contents = lines.mkString("\n")
          FoodClassifier.parse(contents)
        }
      }

      // Signatures you would need to implement:
      object FoodClassifier {
        def parse(content: String): FoodClassifier
        def serialize(classifier: FoodClassifier): String
      }
    

Of course, since your data is quite large, you will need to use streaming serializers and parsers so as not to overflow the Java heap. The above only shows how to pack a resource at build time.
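A rough sketch of the streaming idea: getLines is lazy, so the model can be folded up line by line instead of first materializing the whole file as one string. The "name=weight" line format and ModelParser name are hypothetical, chosen only for illustration:

```scala
import java.io.InputStream
import scala.io.Source

object ModelParser {
  // Parse a (hypothetical) line-oriented model format, one "name=weight" entry
  // per line, without ever holding the full file contents in a single string.
  def parseWeights(in: InputStream): Map[String, Double] = {
    val src = Source.fromInputStream(in)
    try {
      src.getLines().map { line =>
        val Array(name, weight) = line.split("=", 2)
        name -> weight.toDouble
      }.toMap
    } finally src.close()
  }
}
```

For binary formats the same shape applies with a BufferedInputStream and an incremental decoder in place of Source.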

See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html


Here is an idea: drop your model into the resources folder, which gets added to the JAR assembly. I believe the model then ships along with your JAR as long as it lives in that folder. Lmk how it goes, cheers!

Check this out for reading from the resource:

https://www.mkyong.com/java/java-read-a-file-from-resources-folder/

This is in Java, but you can use the same API from Scala.
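For instance, the same getResourceAsStream call from the linked Java article looks like this in Scala (ResourceLoader and the resource name are illustrative, not from the article):

```scala
import scala.io.Source

object ResourceLoader {
  // Read a classpath resource (e.g. a file packed into the JAR under
  // src/main/resources) as a string; fails loudly if it is missing.
  def readAsString(name: String): String = {
    val stream = getClass.getResourceAsStream(name)
    require(stream != null, s"resource $name not found on classpath")
    val src = Source.fromInputStream(stream)
    try src.mkString finally src.close()
  }
}
```

For a large binary model you would read the raw InputStream instead of decoding it as text.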

