PySpark matrix with dummy variables

I have a DataFrame with two columns:

 ID  Text
  1  a
  2  b
  3  c

How can I create a matrix with dummy variables like this:

 ID  a  b  c
  1  1  0  0
  2  0  1  0
  3  0  0  1

using the PySpark library and its functions?

1 answer
 from pyspark.sql import functions as F

 # sqlContext is predefined in the PySpark shell; with a SparkSession,
 # use spark.createDataFrame instead.
 df = sqlContext.createDataFrame([
     (1, "a"),
     (2, "b"),
     (3, "c"),
 ], ["ID", "Text"])

 # Collect the distinct values of Text; each one becomes a dummy column.
 categories = df.select("Text").distinct().rdd.flatMap(lambda x: x).collect()

 # One 0/1 expression per category: 1 when Text matches it, 0 otherwise.
 exprs = [F.when(F.col("Text") == category, 1).otherwise(0).alias(category)
          for category in categories]

 df.select("ID", *exprs).show()

Output

 +---+---+---+---+
 | ID|  a|  b|  c|
 +---+---+---+---+
 |  1|  1|  0|  0|
 |  2|  0|  1|  0|
 |  3|  0|  0|  1|
 +---+---+---+---+
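
As an aside not covered in the answer above, the same dummy matrix can be built with groupBy and pivot (available since Spark 1.6), which avoids collecting the distinct values to the driver first. A minimal sketch, assuming a SparkSession named spark:

 from pyspark.sql import SparkSession

 spark = SparkSession.builder.getOrCreate()

 df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["ID", "Text"])

 # Each distinct Text value becomes a column; count() yields 1 where the
 # value occurs for that ID and null elsewhere, which na.fill turns into 0.
 dummies = df.groupBy("ID").pivot("Text").count().na.fill(0)
 dummies.orderBy("ID").show()

Note that pivot infers the output columns from the data, so the column order may differ from the when/otherwise version unless the categories are passed explicitly, e.g. pivot("Text", ["a", "b", "c"]).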