How to distribute the xgboost module for use in PySpark?

I would like to use a pre-trained xgboost classifier in PySpark, but the nodes in the cluster do not have the xgboost module installed. I can pickle the classifier that I trained and broadcast it, but that is not enough, because the xgboost module still needs to be importable on each cluster node.
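Roughly what I am doing now (a sketch of the broadcast approach described above; the model path, feature values and variable names are placeholders):

    import pickle
    from pyspark import SparkContext

    sc = SparkContext(appName="xgb-scoring")

    # Load the locally trained, pickled xgboost classifier on the driver
    # ("model.pkl" is a placeholder path).
    with open("model.pkl", "rb") as f:
        clf = pickle.load(f)

    # Broadcast the fitted model to the executors.
    clf_bc = sc.broadcast(clf)

    feature_rows = [[0.1, 0.2], [0.3, 0.4]]  # placeholder feature vectors

    def predict(features):
        # This is where it breaks: unpickling the broadcast value (and
        # calling predict) requires the xgboost module on the executor.
        return float(clf_bc.value.predict([features])[0])

    print(sc.parallelize(feature_rows).map(predict).collect())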

I cannot install it on the cluster nodes, since I do not have root access, and there is no shared file system.

How can I distribute the xgboost classifier for use in Spark?


I have an egg for xgboost. Would something like http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-td7059.html or https://stackoverflow.com/a/31664/ work?

1 answer

Cloudera explains this really well in a blog post. All credit goes to them.

But to answer your question in short: no, this is not possible. Any complex third-party dependency has to be installed on each node of your cluster and configured properly. For simple modules / dependencies you can build *.egg, *.zip or *.py files and ship them to the cluster with the --py-files flag of spark-submit (see the sketch below).
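For a pure-Python dependency, that pattern looks roughly like this (a sketch; deps.zip and mymodule are placeholder names):

    # Ship the archive at submit time (placeholder names):
    #   spark-submit --py-files deps.zip my_job.py
    # or add it from inside the job:
    from pyspark import SparkContext

    sc = SparkContext(appName="ship-pure-python-deps")
    sc.addPyFile("deps.zip")  # placeholder archive of a pure-Python module

    def use_dep(x):
        import mymodule  # hypothetical module packaged inside deps.zip
        return mymodule.transform(x)

    print(sc.parallelize(range(4)).map(use_dep).collect())

This only works for dependencies without compiled extensions, which is exactly the problem with xgboost, as explained below.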

However, xgboost is a numerical package that depends heavily not only on other Python packages, but also on a low-level C++ library that must be compiled for the target platform. If you shipped the compiled code to the cluster, you could run into errors caused by a different hardware architecture. Add to that the fact that clusters are often heterogeneous in terms of hardware, and this approach would behave very badly.

