I have a few questions about using Apache Spark for real-time analysis with Java. When the Spark application is submitted, the data stored in a Cassandra database is loaded and processed with the Support Vector Machine (SVM) algorithm. Using Spark Streaming, whenever new data arrives it is stored in the database, the SVM model is retrained on the updated dataset, and the result of this process is also stored in the database.
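For context, the batch part currently looks roughly like this; the keyspace, table, and column names are placeholders, and I'm assuming Java 8 lambdas and the DataStax spark-cassandra-connector Java API:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class BatchTraining {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("svm-batch")
                .set("spark.cassandra.connection.host", "127.0.0.1"); // placeholder host
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the training rows from Cassandra ("ks" / "samples" are placeholders).
        JavaRDD<CassandraRow> rows = javaFunctions(sc).cassandraTable("ks", "samples");

        // Map each row to a LabeledPoint: a label column plus two feature columns.
        JavaRDD<LabeledPoint> training = rows.map(r -> new LabeledPoint(
                r.getDouble("label"),
                Vectors.dense(r.getDouble("f1"), r.getDouble("f2"))));
        training.cache();

        // Train MLlib's linear SVM on the full dataset.
        SVMModel model = SVMWithSGD.train(training.rdd(), 100);

        // ... use `model` for predictions and write the results back to Cassandra (omitted).
        sc.stop();
    }
}
```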
- Apache Spark MLlib provides an implementation of a linear Support Vector Machine. If I need a non-linear SVM, should I implement my own algorithm or use an existing library such as libsvm or jkernelmachines? These implementations are not based on Spark RDDs; is there a way to use them without rewriting the algorithm from scratch on top of RDD collections (the only workaround I can think of is the first sketch after this list)? If not, it would be a huge effort to test several algorithms.
- Does MLlib provide a utility for scaling the data before executing the SVM algorithm, as recommended in section 2.2 of http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf? (The second sketch below shows what I have found so far.)
- While new data keeps streaming in, do I need to retrain on the whole dataset every time? Is there a way to just add the new data to the already trained model instead of rebuilding everything (third sketch below)?
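For the first question, the only workaround I can think of, short of reimplementing a kernel SVM on RDDs, is to collect the data (or a sample of it) to the driver and hand it to libsvm directly; that gives up distribution entirely and assumes everything fits in driver memory, so it is just a sketch of the idea:

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;

import libsvm.svm;
import libsvm.svm_model;
import libsvm.svm_node;
import libsvm.svm_parameter;
import libsvm.svm_problem;

public class DriverSideLibsvm {

    // Trains an RBF-kernel (non-linear) SVM with libsvm on data pulled to the driver.
    public static svm_model trainRbf(JavaRDD<LabeledPoint> data) {
        List<LabeledPoint> points = data.collect(); // whole dataset on the driver!

        svm_problem prob = new svm_problem();
        prob.l = points.size();
        prob.y = new double[prob.l];
        prob.x = new svm_node[prob.l][];

        for (int i = 0; i < prob.l; i++) {
            LabeledPoint p = points.get(i);
            double[] f = p.features().toArray();
            prob.y[i] = p.label();
            prob.x[i] = new svm_node[f.length];
            for (int j = 0; j < f.length; j++) {
                svm_node node = new svm_node();
                node.index = j + 1;          // libsvm feature indices are 1-based
                node.value = f[j];
                prob.x[i][j] = node;
            }
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;   // non-linear kernel
        param.gamma = 1.0 / points.get(0).features().size();
        param.C = 1.0;
        param.eps = 1e-3;
        param.cache_size = 100;

        return svm.svm_train(prob, param);
    }
}
```

jkernelmachines could presumably be wrapped the same way, but nothing in this approach is distributed, which is exactly my concern.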
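For the scaling question, the closest thing I have found in MLlib is org.apache.spark.mllib.feature.StandardScaler (I believe it appeared around Spark 1.1, so this may depend on the version); used before training it would look something like:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.feature.StandardScaler;
import org.apache.spark.mllib.feature.StandardScalerModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

public class Scaling {

    // Standardizes every feature to zero mean and unit variance, as suggested in
    // section 2.2 of the libsvm guide, before the SVM is trained.
    public static JavaRDD<LabeledPoint> scale(JavaRDD<LabeledPoint> data) {
        JavaRDD<Vector> features = data.map(LabeledPoint::features);

        // Fit the scaler on the feature vectors (withMean = true, withStd = true).
        StandardScalerModel scaler = new StandardScaler(true, true).fit(features.rdd());

        // Apply the same transformation to every labeled point.
        return data.map(p -> new LabeledPoint(p.label(), scaler.transform(p.features())));
    }
}
```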
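And for the last question, this is the full-retraining loop I have in mind for the streaming part (assuming Spark 1.4+ so that foreachRDD accepts a Java 8 lambda; the Cassandra read/write calls are left out). What I would like to avoid is step 3 going over the complete dataset on every batch:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.streaming.api.java.JavaDStream;

public class StreamingRetrain {

    // Growing training set and latest model, kept on the driver between batches.
    private static JavaRDD<LabeledPoint> dataset;
    private static SVMModel model;

    public static void attach(JavaRDD<LabeledPoint> historical,
                              JavaDStream<LabeledPoint> newPoints) {
        dataset = historical.cache();

        newPoints.foreachRDD(batch -> {
            if (batch.isEmpty()) {
                return;
            }
            // 1. Persist the new batch to Cassandra (connector call omitted).

            // 2. Grow the in-memory training set instead of re-reading the whole table.
            //    (Managing the cache / lineage of this ever-growing union is part of
            //    what I am unsure about.)
            dataset = dataset.union(batch).cache();
            dataset.count(); // materialize the union while the batch data is still available

            // 3. Retrain from scratch; as far as I can tell MLlib has no incremental
            //    or streaming SVM, so the existing model cannot simply be updated.
            model = SVMWithSGD.train(dataset.rdd(), 100);

            // 4. Write predictions / results back to Cassandra (omitted).
        });
    }
}
```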
cassandra machine-learning apache-spark
Pantelis