Apache Spark (MLLib) for real-time analysis

I have a few questions about using Apache Spark for real-time analysis from Java. When the Spark application is submitted, the data stored in the Cassandra database is loaded and processed with the Support Vector Machine (SVM) algorithm. With the Spark Streaming extension, when new data arrives it is stored in the database, the model is retrained on the current data set, and the SVM algorithm is executed again. The result of this process is also stored in the database.
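For reference, here is a minimal sketch of the batch part of this pipeline in Java. The Cassandra read is stubbed out with an in-memory RDD (in the real job it would come from something like the spark-cassandra-connector); the MLLib calls are the standard linear SVM API:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class BatchSvmJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("BatchSvmJob"));

        // Placeholder: in the real job this RDD would be loaded from Cassandra.
        JavaRDD<LabeledPoint> training = sc.parallelize(java.util.Arrays.asList(
                new LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
                new LabeledPoint(0.0, Vectors.dense(-1.0, -0.5))));
        training.cache();

        // Train a linear SVM with distributed stochastic gradient descent.
        int numIterations = 100;
        SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);

        // Score a new point; in this setup the result would be written back to Cassandra.
        double prediction = model.predict(Vectors.dense(0.8, 0.3));
        System.out.println("prediction = " + prediction);

        sc.stop();
    }
}
```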

  • Apache Spark MLLib provides an implementation of a linear Support Vector Machine. If I need a non-linear SVM, should I implement my own algorithm or use an existing library such as libsvm or jkernelmachines? Those implementations are not based on Spark RDDs; is there a way to use them without implementing the algorithm from scratch on RDD collections? If not, that would be a huge effort if I wanted to test several algorithms.
  • Does MLLib provide a utility for scaling the data before executing the SVM algorithm, as described in section 2.2 of http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf ?
  • While new data is streaming in, do I need to retrain on the whole dataset? Is there a way to just add the new data to an already trained model?
cassandra machine-learning apache-spark
1 answer

To answer your questions piecewise,

  • Spark provides the MLUtils class, which lets you load data in LIBSVM format into an RDD, so the data-loading part will not stop you from using that library (a short loading sketch appears at the end of this answer). You can also implement your own algorithms if you know what you are doing, although my recommendation would be to take an existing one, tweak the objective function, and see how it runs. Spark basically provides you with the functionality of a distributed Stochastic Gradient Descent process; you can build almost anything on top of it.
  • Not that I know of. Hopefully someone else knows the answer.
  • What do you mean by retraining the model while the data is streamed?

From the MLLib streaming linear regression docs:

"... except fitting occurs on each batch of data, so that the model continually updates to reflect the data from the stream."
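To illustrate that quote, here is a minimal Java sketch of the streaming pattern using StreamingLinearRegressionWithSGD, the streaming learner MLLib ships (a streaming SVM would have to be adapted along the same lines). The file-based source, directory path, and feature count are assumptions:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingRegressionExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingRegressionExample");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Each line is a LabeledPoint in MLLib text format, e.g. "(1.0,[0.5,0.2])".
        // Reading from a directory is just an example source; the stream could
        // equally come from the same feed that writes to Cassandra.
        JavaDStream<LabeledPoint> trainingData =
                jssc.textFileStream("/tmp/training").map(LabeledPoint::parse);

        int numFeatures = 2; // assumption: must match the incoming feature vectors
        StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD()
                .setInitialWeights(Vectors.zeros(numFeatures));

        // Fitting occurs on each batch, so the model is updated continually
        // instead of being retrained on the whole dataset.
        model.trainOn(trainingData);

        jssc.start();
        jssc.awaitTermination();
    }
}
```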

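Returning to the first point, here is the loading sketch: MLUtils reads a LIBSVM-format file straight into an RDD of labeled points. The file path is just a placeholder:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public class LoadLibSvmExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("LoadLibSvmExample"));

        // Load a LIBSVM-format file into an RDD of labeled points.
        // The path is a placeholder; point it at your own export.
        JavaRDD<LabeledPoint> data = MLUtils
                .loadLibSVMFile(sc.sc(), "data/sample_libsvm_data.txt")
                .toJavaRDD();

        System.out.println("loaded " + data.count() + " points");
        sc.stop();
    }
}
```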
