PySpark - Rename more than one column using withColumnRenamed

I want to rename two columns using the Spark withColumnRenamed function. Of course, I can write:

    data = sqlContext.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
    data = (data
        .withColumnRenamed('x1', 'x3')
        .withColumnRenamed('x2', 'x4'))

but I want to do it in one step (with a list/tuple of new names). Unfortunately, neither this:

    data = data.withColumnRenamed(['x1', 'x2'], ['x3', 'x4'])

nor this:

    data = data.withColumnRenamed(('x1', 'x2'), ('x3', 'x4'))

works. Can this be done?

+11
3 answers

This cannot be done with a single withColumnRenamed call.

  • You can use the DataFrame.toDF* method:

     data.toDF('x3', 'x4')

    or

     new_names = ['x3', 'x4']
     data.toDF(*new_names)
  • You can also rename with a simple select:

     from pyspark.sql.functions import col

     mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
     data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
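The mapping.get(c, c) lookup is what makes this safe for columns that are not being renamed: unmapped names fall through unchanged. A plain-Python sketch of that behaviour, no Spark session required (the extra column 'x5' is a made-up example):

```python
# Columns present in the mapping are renamed; others pass through unchanged.
mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
columns = ['x1', 'x2', 'x5']  # 'x5' is a hypothetical untouched column
renamed = [mapping.get(c, c) for c in columns]
print(renamed)  # → ['x3', 'x4', 'x5']
```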

Similarly in Scala you can:

  • Rename all columns:

     val newNames = Seq("x3", "x4")
     data.toDF(newNames: _*)
  • Rename from a mapping with select:

     val mapping = Map("x1" -> "x3", "x2" -> "x4")
     df.select(
       df.columns.map(c => df(c).alias(mapping.get(c).getOrElse(c))): _*
     )

    or foldLeft + withColumnRenamed:

     mapping.foldLeft(data) { case (data, (oldName, newName)) =>
       data.withColumnRenamed(oldName, newName)
     }
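The same foldLeft pattern translates to Python with functools.reduce. A sketch over a plain list of column names rather than a real DataFrame, so it runs without Spark:

```python
from functools import reduce

mapping = {'x1': 'x3', 'x2': 'x4'}
columns = ['x1', 'x2']

# Each reduce step renames one column, mirroring the foldLeft over
# withColumnRenamed: the accumulator is the current list of names.
renamed = reduce(
    lambda cols, kv: [kv[1] if c == kv[0] else c for c in cols],
    mapping.items(),
    columns,
)
print(renamed)  # → ['x3', 'x4']
```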

* Not to be confused with RDD.toDF, which is not a variadic function and takes column names as a list.
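A quick plain-Python illustration of that footnote, using hypothetical stand-in functions (not the real pyspark signatures):

```python
# Hypothetical stand-ins: DataFrame.toDF is variadic, RDD.toDF takes a list.
def df_toDF(*cols):      # called like df.toDF('x3', 'x4') or df.toDF(*names)
    return list(cols)

def rdd_toDF(names):     # called like rdd.toDF(['x3', 'x4'])
    return list(names)

print(df_toDF('x3', 'x4'))     # → ['x3', 'x4']
print(rdd_toDF(['x3', 'x4']))  # → ['x3', 'x4']
```

This is why `data.toDF(*new_names)` needs the unpacking star, while an RDD would take the list directly.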

+24

I also could not find an easy solution for PySpark, so I just built my own, similar to pandas' df.rename(columns={'old_name_1': 'new_name_1', 'old_name_2': 'new_name_2'}).

     def rename_columns(df, columns):
         if isinstance(columns, dict):
             for old_name, new_name in columns.items():
                 df = df.withColumnRenamed(old_name, new_name)
             return df
         else:
             raise ValueError("'columns' should be a dict, like "
                              "{'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}")

So your solution will look like data = rename_columns(data, {'x1': 'x3', 'x2': 'x4'})

This saves me a few lines of code; I hope it helps you too.

+5

Why do you want to do it in a single call? If you print the execution plan, both renames are actually performed in a single step:

    data = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
    data = (data
        .withColumnRenamed('x1', 'x3')
        .withColumnRenamed('x2', 'x4'))
    data.explain()

Output:

    == Physical Plan ==
    *(1) Project [x1#1548L AS x3#1552L, x2#1549L AS x4#1555L]
    +- Scan ExistingRDD[x1#1548L,x2#1549L]

If you want to do this with a list of tuples, you can use a simple map function:

    from pyspark.sql import functions as F

    data = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
    new_names = [("x1", "x3"), ("x2", "x4")]
    data = data.select(list(
        map(lambda old, new: F.col(old).alias(new), *zip(*new_names))
    ))
    data.explain()

It still produces the same plan.

Output:

    == Physical Plan ==
    *(1) Project [x1#1650L AS x3#1654L, x2#1651L AS x4#1655L]
    +- Scan ExistingRDD[x1#1650L,x2#1651L]
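The *zip(*new_names) trick transposes the list of (old, new) pairs into two parallel tuples, which map then consumes as its old and new arguments. A plain-Python sketch:

```python
new_names = [("x1", "x3"), ("x2", "x4")]

# zip(*pairs) transposes: one tuple of old names, one of new names.
olds, news = zip(*new_names)
print(olds)  # → ('x1', 'x2')
print(news)  # → ('x3', 'x4')

# map with two iterables consumes them element-wise, pairing each
# old name with its new name again.
pairs = list(map(lambda old, new: (old, new), olds, news))
print(pairs)  # → [('x1', 'x3'), ('x2', 'x4')]
```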
+1
