How to combine several columns into one column (without prior knowledge of their number)?

Let's say I have the following dataframe:

+----------+-----------+---------+-------+----+
| agentName|original_dt|parsed_dt|   user|text|
+----------+-----------+---------+-------+----+
|qwertyuiop|          0|        0|16102.0|   0|
+----------+-----------+---------+-------+----+

I want to create a new dataframe with another column that has a concatenation of all the elements of a row:

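+----------+-----------+---------+-------+----+----------------------------+
| agentName|original_dt|parsed_dt|   user|text|                      newCol|
+----------+-----------+---------+-------+----+----------------------------+
|qwertyuiop|          0|        0|16102.0|   0|[qwertyuiop, 0, 0, 16102, 0]|
+----------+-----------+---------+-------+----+----------------------------+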

Note: this is just an example. The number of columns and their names are not known in advance; they are dynamic.

4 answers

I think this works well for your case. Here is an example.

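import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, split}

val spark =
  SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(
    ("qwertyuiop", 0, 0, 16102.0, 0)
  )).toDF("agentName", "original_dt", "parsed_dt", "user", "text")

// Join every column into one ";"-separated string, then split it back into an array.
val result = data.withColumn("newCol",
  split(concat_ws(";", data.schema.fieldNames.map(c => col(c)): _*), ";"))
// show(false) prints the new column without truncating it
result.show(false)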

+----------+-----------+---------+-------+----+------------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                        |
+----------+-----------+---------+-------+----+------------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop, 0, 0, 16102.0, 0]|
+----------+-----------+---------+-------+----+------------------------------+

Hope this helps!


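TL;DR Use the struct function with the Dataset.columns operator.

Quoting the scaladoc of struct:

struct(colName: String, colNames: String*): Column. Creates a new struct column that composes multiple input columns.

There are two variants: a string-based one that takes column names, and one that takes Column expressions (which gives you more flexibility over what is computed for the combined columns).

From Dataset.columns:

columns: Array[String]. Returns all column names as an array.

Your case would then look as follows: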

scala> df.withColumn("newCol",
  struct(df.columns.head, df.columns.tail: _*)).
  show(false)
+----------+-----------+---------+-------+----+--------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                    |
+----------+-----------+---------+-------+----+--------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop,0,0,16102.0,0]|
+----------+-----------+---------+-------+----+--------------------------+

In general, you can merge several dataframe columns into one using array.

df.select($"*", array($"col1", $"col2").as("newCol")) // $"*" captures all existing columns

For your case:

df.select($"*", array($"agentName", $"original_dt", $"parsed_dt", $"user", $"text").as("newCol"))
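Since the question says the column names are not known in advance, the array call can also be built from df.columns rather than from hard-coded names. What follows is a minimal sketch and not part of the original answer: it assumes a local SparkSession, and it casts every column to string so that all array elements share one type. The names spark, df, allAsString and result are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col}

// Assumption: a local session, created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("array-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("qwertyuiop", 0, 0, 16102.0, 0))
  .toDF("agentName", "original_dt", "parsed_dt", "user", "text")

// Build the column list from df.columns, so no column name is hard-coded.
// Casting to string gives every array element the same type.
val allAsString = df.columns.map(c => col(c).cast("string"))
val result = df.withColumn("newCol", array(allAsString: _*))

result.show(false)
```

Because the values are never joined into one string, this variant does not depend on choosing a separator character that cannot occur in the data.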

You can use a udf function to combine all the columns into one. All you have to do is define the udf function, pass it all the columns you want to combine, and call it through the dataframe's withColumn function.
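The answer gives no code for the udf route, so here is one possible sketch of it, under stated assumptions. It wraps all columns in a struct so that the udf receives a single Row, whatever the number of columns. The helper name rowToStrings and the local SparkSession are hypothetical choices made for this illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

// Assumption: a local session, created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("udf-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("qwertyuiop", 0, 0, 16102.0, 0))
  .toDF("agentName", "original_dt", "parsed_dt", "user", "text")

// Hypothetical helper: turns a whole row into a sequence of strings.
// Null values stay null instead of failing on toString.
val rowToStrings = udf((row: Row) => row.toSeq.map(v => Option(v).map(_.toString).orNull))

// Wrap every column in one struct so the udf gets a single Row argument.
val result = df.withColumn("newCol", rowToStrings(struct(df.columns.map(col): _*)))

result.show(false)
```

A udf is opaque to Spark's optimizer, so the built-in functions are usually the better choice when they can express the same thing.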

Or

You can use the concat_ws(java.lang.String sep, Column... exprs) function available for dataframes.

val df = Seq(("qwertyuiop",0,0,16102.0,0))
  .toDF("agentName","original_dt","parsed_dt","user","text")
// withColumn returns a new dataframe, so show that result rather than the original df
df.withColumn("newCol", concat_ws(",",$"agentName",$"original_dt",$"parsed_dt",$"user",$"text"))
  .show(false)

This gives the following result:

+----------+-----------+---------+-------+----+------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                  |
+----------+-----------+---------+-------+----+------------------------+
|qwertyuiop|0          |0        |16102.0|0   |qwertyuiop,0,0,16102.0,0|
+----------+-----------+---------+-------+----+------------------------+

This will give you the result you want
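The concat_ws call can also be written without naming any column, which matches the requirement that the columns are dynamic. The sketch below is an illustration, not the answerer's code: it assumes a local SparkSession and expands df.columns into the varargs parameter of concat_ws.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

// Assumption: a local session, created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("concat-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("qwertyuiop", 0, 0, 16102.0, 0))
  .toDF("agentName", "original_dt", "parsed_dt", "user", "text")

// Map each column name to a Column and expand the array into varargs.
val result = df.withColumn("newCol", concat_ws(",", df.columns.map(col): _*))

result.show(false)
```

Note that concat_ws skips null values, so a row with nulls produces fewer separated fields than there are columns.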
