SparkSQL and explode on DataFrame in Java

Is there an easy way to use explode on an array column of a SparkSQL DataFrame? This is relatively simple in Scala, but the feature seems to be unavailable in Java (as the javadoc suggests).

One option is to use SQLContext.sql(...) and put the explode inside the query, but I'm looking for a slightly better and, above all, cleaner way. The DataFrames are loaded from Parquet files.
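For reference, a minimal sketch of the SQL-string workaround mentioned above; the table name "people" and the columns "fullName" and "positions" are hypothetical, and a SQLContext from Spark 1.x is assumed:

```java
// Load the Parquet data and register it so it can be queried by name.
// "people.parquet", "people", "fullName" and "positions" are placeholder names.
DataFrame df = sqlContext.read().parquet("people.parquet");
df.registerTempTable("people");

// explode() in the SQL string produces one output row per array element.
DataFrame exploded = sqlContext.sql(
    "SELECT fullName, explode(positions) AS pos FROM people");
```

This works, but it forces the transformation into a SQL string instead of the typed DataFrame API, which is what the question is trying to avoid.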

+7
java apache-spark apache-spark-sql
2 answers

It seems you can use a combination of org.apache.spark.sql.functions.explode(Column col) and DataFrame.withColumn(String colName, Column col) to replace the column with its exploded version.
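A minimal sketch of that combination, assuming a Spark 1.x DataFrame df with a hypothetical array column "items"; withColumn with the same column name replaces the array with one element per row:

```java
import static org.apache.spark.sql.functions.explode;

// Replace the array column "items" with its exploded version:
// each row of df becomes N rows, one per element of its "items" array.
DataFrame exploded = df.withColumn("items", explode(df.col("items")));
```

Note that rows whose array is empty or null are dropped by explode, so the row count can shrink as well as grow.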

+6

I solved this as follows: say you have an array column "positions" containing job descriptions, one entry per person identified by a "fullName" column.

The original schema looks like this:

 root
  |-- fullName: string (nullable = true)
  |-- positions: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- companyName: string (nullable = true)
  |    |    |-- title: string (nullable = true)
  ...

and you want to flatten it to this schema:

 root
  |-- personName: string (nullable = true)
  |-- companyName: string (nullable = true)
  |-- positionTitle: string (nullable = true)

The code:

 DataFrame personPositions = persons.select(
     persons.col("fullName").as("personName"),
     org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));
 DataFrame test = personPositions.select(
     personPositions.col("personName"),
     personPositions.col("pos").getField("companyName").as("companyName"),
     personPositions.col("pos").getField("title").as("positionTitle"));
+12