Summary statistics for string types in Spark

Is there something like R's summary() function in Spark?

The summary statistics that come with Spark (MultivariateStatisticalSummary) only work on numeric types.
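For context, here is roughly how the numeric summary is used (a minimal sketch; sc is assumed to be an existing SparkContext):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

    // colStats only accepts RDD[Vector], i.e. purely numeric columns.
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)))
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
    println(summary.mean)      // per-column means
    println(summary.variance)  // per-column variances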

I am interested in getting results for string types as well: for example, the four most frequent strings (as in a group-by count), the number of unique values, and so on.

Is there any existing code for this?

If not, please suggest a good way to deal with string types.

scala apache-spark
1 answer

I do not think such a thing exists for Strings in MLlib, but it would likely be a valuable contribution if you intend to implement it.

Calculating just one of these metrics is easy. For example, for the top 4 by frequency:

    def top4(rdd: org.apache.spark.rdd.RDD[String]) =
      rdd
        .map(s => (s, 1))                       // pair each string with a count of 1
        .reduceByKey(_ + _)                     // sum the counts per distinct string
        .map { case (s, count) => (count, s) }  // swap so top() orders by count
        .top(4)                                 // take the 4 highest-count pairs
        .map { case (count, s) => s }           // keep just the strings

Or the number of unique values:

    def numUnique(rdd: org.apache.spark.rdd.RDD[String]) =
      rdd.distinct.count  // shuffle to deduplicate, then count
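If an exact figure is not required, RDD also has countApproxDistinct, which estimates the count with HyperLogLog and avoids the shuffle that distinct needs (the 0.01 relative error below is just an illustrative choice):

    def numUniqueApprox(rdd: org.apache.spark.rdd.RDD[String]) =
      rdd.countApproxDistinct(0.01)  // HyperLogLog estimate, ~1% relative error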

But computing all of these metrics in one pass takes more work.
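One way to share most of the work (a sketch, and stringSummary is a name I am making up here): build the per-string counts once, cache them, and derive both metrics from that single shuffle:

    import org.apache.spark.rdd.RDD

    def stringSummary(rdd: RDD[String]): (Long, Seq[String]) = {
      val counts = rdd.map(s => (s, 1L)).reduceByKey(_ + _).cache()  // the one shuffle
      val numUnique = counts.count()                  // each distinct string appears once
      val top4 = counts.map(_.swap).top(4).map(_._2)  // reuse the cached counts
      (numUnique, top4.toSeq)
    }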


These examples assume that if you have multiple "columns" of data, you split each column out into a separate RDD. This is a good way to organize the data, and it is necessary for operations that perform a shuffle.

What I mean by splitting up the columns:

    import org.apache.spark.rdd.RDD

    def split(together: RDD[(Long, Seq[String])], columns: Int): Seq[RDD[(Long, String)]] = {
      together.cache  // We will do N passes over this RDD.
      (0 until columns).map { i =>
        together.mapValues(s => s(i))  // project out column i from each row
      }
    }
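To put it together (hypothetical usage; the sample data and the Long row ids are just for illustration), you can then run the per-column helpers over each split-out RDD:

    // Three string columns keyed by a Long row id.
    val together: RDD[(Long, Seq[String])] = sc.parallelize(Seq(
      (0L, Seq("a", "x", "red")),
      (1L, Seq("b", "x", "blue")),
      (2L, Seq("a", "y", "red"))))

    split(together, 3).zipWithIndex.foreach { case (col, i) =>
      val values = col.map(_._2)  // drop the row key, keep the strings
      println(s"column $i: top4 = ${top4(values).toSeq}, unique = ${numUnique(values)}")
    }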
