I do not think that such a thing exists for String in MLlib. But this is likely to be a valuable contribution if you intend to implement it.
Calculating just one of these metrics is easy. For example, for the top 4 by frequency:
def top4(rdd: org.apache.spark.rdd.RDD[String]): Array[String] =
  rdd
    .map(s => (s, 1))
    .reduceByKey(_ + _)                      // count occurrences of each string
    .map { case (s, count) => (count, s) }   // swap so top() sorts by count
    .top(4)
    .map { case (count, s) => s }
Or the number of unique:
def numUnique(rdd: org.apache.spark.rdd.RDD[String]): Long =
  rdd.distinct.count
But computing all of these metrics in a single pass takes more work.
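As a sketch of that single-pass idea: once you have the per-string frequency map, several metrics (top-k, number of unique values) fall out of it without rescanning the data. The helper below uses plain Scala collections to keep the example self-contained; on an RDD the same frequency map would come from `rdd.map(s => (s, 1L)).reduceByKey(_ + _)`, and the function name `metrics` is my own, not part of MLlib.

```scala
// Hypothetical helper: build one frequency map, then derive several
// metrics from it. Illustrates the logic only; not a Spark API.
def metrics(data: Seq[String], k: Int): (Seq[String], Long) = {
  // One pass over the data to count occurrences.
  val counts = data.foldLeft(Map.empty[String, Long]) { (m, s) =>
    m.updated(s, m.getOrElse(s, 0L) + 1L)
  }
  // Top k strings by descending count.
  val topK = counts.toSeq.sortBy { case (_, c) => -c }.take(k).map(_._1)
  // Number of distinct strings is just the map's size.
  (topK, counts.size.toLong)
}
```

The point is that the expensive part (the shuffle that builds the counts) happens once, and every count-derived statistic reuses it.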
These examples assume that if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it is necessary for operations that perform a shuffle.
What I mean by splitting the columns:
def split(together: RDD[(Long, Seq[String])], columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache  // We will make N passes over this RDD.
  (0 until columns).map { i =>
    together.mapValues(s => s(i))
  }
}
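To show the shape of the result, here is the same split expressed over a local collection (the name `splitLocal` is my own; it mirrors the RDD version above, with each row keyed by a `Long` id and carrying one `String` per column):

```scala
// Local-collection analogue of split(): column i of the result
// pairs each row's id with that row's i-th field.
def splitLocal(together: Seq[(Long, Seq[String])], columns: Int): Seq[Seq[(Long, String)]] =
  (0 until columns).map { i =>
    together.map { case (id, row) => (id, row(i)) }
  }
```

Each element of the returned sequence plays the role of one per-column RDD, ready for the per-column operations shown earlier.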
Daniel Darabos