How to format data for the Spark MLlib KMeans clustering algorithm?

Question


I am trying to run the KMeans clustering algorithm from Apache Spark's MLlib library. I have everything set up, but I'm not sure how to format the input. I am relatively new to machine learning, so any help would be greatly appreciated. In the sample data.txt, the data looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

The data I want to run the algorithm on is in this format (a JSON array):

[{"customer":"ddf6022","order_id":"20031-19958","asset_id":"dd1~33","price":300,"time":1411134115000,"location":"bt2"},{"customer":"ddf6023","order_id":"23899-23825","asset_id":"dd1~33","price":300,"time":1411954672000,"location":"bt2"}]

How can I convert it into something that can be used with the k-means clustering algorithm? I am using Java, and I assume the data needs to be in a JavaRDD, but I don't know how to do that.


java algorithm machine-learning apache-spark

Raza gill Apr 29 '15 at 18:22


emecas · Answer 1 · 2015-05-14T15:05:20+0000

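First, decide which attributes you want to include in the clustering. KMeans works on numeric vectors, and the sample data shipped with Spark has three dimensions per point (call them X, Y and Z). KMeans in MLlib supports n dimensions, where n >= 1.

Next, decide which attributes of your JSON records play the role of X, Y and Z, and turn each of them into a number. For example, using price, time and location, the input would look like this: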

300 1411134115000 2
300 1411954672000 2
...

Here "bt2" is encoded as 2 (an arbitrary numeric code, chosen for illustration). Data in this form can be passed to KMeans.

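Depending on what you want the clusters to capture, the time attribute can also be split into separate dimensions such as Year, Month, Day, Hour, Minute, Second and so on, instead of being used as a single number.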
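As a rough illustration of splitting the timestamp, here is a minimal sketch using only java.time. The class name TimeFeatures and the choice of UTC are assumptions made for this example; they do not come from the question or from Spark.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;

// Hypothetical helper (not part of Spark): splits an epoch-millisecond
// timestamp into separate numeric dimensions.
final class TimeFeatures {

    // Returns {year, month, day, hour, minute, second}.
    // UTC is an assumption; use the zone that matches your data.
    static double[] split(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return new double[] {
            t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
            t.getHour(), t.getMinute(), t.getSecond()
        };
    }

    public static void main(String[] args) {
        // "time" value of the first record in the question
        System.out.println(Arrays.toString(split(1411134115000L)));
    }
}
```

Which of these components are worth keeping depends on the question you are asking of the data; for instance, hour of day may matter more than year if you are looking for daily ordering patterns.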

To convert the JSON into this plain-text format, a tool such as JSON2CSV can be used. See also: fooobar.com/questions/192337/...
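If you would rather do the conversion in Java, the mapping from price, time and location to numbers could look roughly like the sketch below. It assumes the JSON has already been parsed into plain values by whatever JSON library you use. The class OrderFeatures and the location-to-code map are hypothetical names introduced for this example.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper (not part of Spark): turns one order record into
// a numeric feature vector, or into a space-separated text line.
final class OrderFeatures {

    // Caller-supplied numeric code per location, e.g. "bt2" -> 2.
    private final Map<String, Integer> locationCodes;

    OrderFeatures(Map<String, Integer> locationCodes) {
        this.locationCodes = locationCodes;
    }

    // Builds the vector {price, time, locationCode}.
    double[] toVector(long price, long timeMillis, String location) {
        Integer code = locationCodes.get(location);
        if (code == null) {
            throw new IllegalArgumentException("No numeric code for location: " + location);
        }
        return new double[] { price, timeMillis, code.intValue() };
    }

    // Formats the same three values as one line of a data.txt-style file.
    String toLine(long price, long timeMillis, String location) {
        double[] v = toVector(price, timeMillis, location);
        return (long) v[0] + " " + (long) v[1] + " " + (long) v[2];
    }

    public static void main(String[] args) {
        OrderFeatures features = new OrderFeatures(Collections.singletonMap("bt2", 2));
        System.out.println(features.toLine(300, 1411134115000L, "bt2"));
        System.out.println(features.toLine(300, 1411954672000L, "bt2"));
    }
}
```

With Spark on the classpath, each double[] can be wrapped with Vectors.dense(...) from org.apache.spark.mllib.linalg, and the resulting JavaRDD<Vector> passed to KMeans.train via its rdd() method. That part is left out here so the sketch runs without Spark.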
