SparkSQL and explode on DataFrame in Java

Is there an easy way to use explode on an array column of a SparkSQL DataFrame? This is relatively simple in Scala, but the feature seems to be unavailable in Java (as the javadoc suggests).

One option is to use SQLContext.sql(...) and put the explode inside the query, but I'm looking for a slightly better and, above all, cleaner way. The DataFrames are loaded from Parquet files.
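For reference, a minimal sketch of the SQL-string workaround mentioned above; the table name "people" and the columns "fullName" and "positions" are hypothetical, and a SQLContext from Spark 1.x is assumed:

```java
// Load the Parquet data and register it so it can be queried by name.
// "people.parquet", "people", "fullName" and "positions" are placeholder names.
DataFrame df = sqlContext.read().parquet("people.parquet");
df.registerTempTable("people");

// explode() in the SQL string produces one output row per array element.
DataFrame exploded = sqlContext.sql(
    "SELECT fullName, explode(positions) AS pos FROM people");
```

This works, but it forces the transformation into a SQL string instead of the typed DataFrame API, which is what the question is trying to avoid.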

+7
java apache-spark apache-spark-sql
2 answers

It seems you can use a combination of org.apache.spark.sql.functions.explode(Column col) and DataFrame.withColumn(String colName, Column col) to replace the column with its exploded version.
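A minimal sketch of that combination, assuming a Spark 1.x DataFrame df with a hypothetical array column "items"; withColumn with the same column name replaces the array with one element per row:

```java
import static org.apache.spark.sql.functions.explode;

// Replace the array column "items" with its exploded version:
// each row of df becomes N rows, one per element of its "items" array.
DataFrame exploded = df.withColumn("items", explode(df.col("items")));
```

Note that rows whose array is empty or null are dropped by explode, so the row count can shrink as well as grow.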

+6

I solved this as follows: say you have an array column "positions" containing job descriptions, one entry per person identified by a "fullName" column.

The original schema looks like this:

 root
  |-- fullName: string (nullable = true)
  |-- positions: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- companyName: string (nullable = true)
  |    |    |-- title: string (nullable = true)
  ...

and you want to flatten it to this schema:

 root
  |-- personName: string (nullable = true)
  |-- companyName: string (nullable = true)
  |-- positionTitle: string (nullable = true)

The code:

 DataFrame personPositions = persons.select(
     persons.col("fullName").as("personName"),
     org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));
 DataFrame test = personPositions.select(
     personPositions.col("personName"),
     personPositions.col("pos").getField("companyName").as("companyName"),
     personPositions.col("pos").getField("title").as("positionTitle"));
+12