PySpark - Rename more than one column using withColumnRenamed

I want to rename two columns using the Spark withColumnRenamed function. Of course, I can write:

    data = sqlContext.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
    data = (data
        .withColumnRenamed('x1', 'x3')
        .withColumnRenamed('x2', 'x4'))

but I want to do it in one step (with a list/tuple of new names). Unfortunately, neither this:

    data = data.withColumnRenamed(['x1', 'x2'], ['x3', 'x4'])

nor this:

    data = data.withColumnRenamed(('x1', 'x2'), ('x3', 'x4'))

works. Can this be done?

+11
3 answers

This cannot be done with a single withColumnRenamed call.

  • You can use the DataFrame.toDF* method:

     data.toDF('x3', 'x4')

    or

     new_names = ['x3', 'x4']
     data.toDF(*new_names)
  • You can also rename with a simple select:

     from pyspark.sql.functions import col

     mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
     data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
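The mapping.get(c, c) lookup is what makes this safe for columns that are not being renamed: unmapped names fall through unchanged. A plain-Python sketch of that behaviour, no Spark session required (the extra column 'x5' is a made-up example):

```python
# Columns present in the mapping are renamed; others pass through unchanged.
mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
columns = ['x1', 'x2', 'x5']  # 'x5' is a hypothetical untouched column
renamed = [mapping.get(c, c) for c in columns]
print(renamed)  # → ['x3', 'x4', 'x5']
```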

Similarly in Scala you can:

  • Rename all columns:

     val newNames = Seq("x3", "x4")
     data.toDF(newNames: _*)
  • Rename from a mapping with select:

     val mapping = Map("x1" -> "x3", "x2" -> "x4")
     df.select(
       df.columns.map(c => df(c).alias(mapping.get(c).getOrElse(c))): _*
     )

    or foldLeft + withColumnRenamed:

     mapping.foldLeft(data) { case (data, (oldName, newName)) =>
       data.withColumnRenamed(oldName, newName)
     }
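The same foldLeft pattern translates to Python with functools.reduce. A sketch over a plain list of column names rather than a real DataFrame, so it runs without Spark:

```python
from functools import reduce

mapping = {'x1': 'x3', 'x2': 'x4'}
columns = ['x1', 'x2']

# Each reduce step renames one column, mirroring the foldLeft over
# withColumnRenamed: the accumulator is the current list of names.
renamed = reduce(
    lambda cols, kv: [kv[1] if c == kv[0] else c for c in cols],
    mapping.items(),
    columns,
)
print(renamed)  # → ['x3', 'x4']
```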

* Not to be confused with RDD.toDF, which is not a variadic function and takes column names as a list.
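A quick plain-Python illustration of that footnote, using hypothetical stand-in functions (not the real pyspark signatures):

```python
# Hypothetical stand-ins: DataFrame.toDF is variadic, RDD.toDF takes a list.
def df_toDF(*cols):      # called like df.toDF('x3', 'x4') or df.toDF(*names)
    return list(cols)

def rdd_toDF(names):     # called like rdd.toDF(['x3', 'x4'])
    return list(names)

print(df_toDF('x3', 'x4'))     # → ['x3', 'x4']
print(rdd_toDF(['x3', 'x4']))  # → ['x3', 'x4']
```

This is why `data.toDF(*new_names)` needs the unpacking star, while an RDD would take the list directly.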

+24

I also could not find an easy solution for PySpark, so I just built my own, similar to pandas' df.rename(columns={'old_name_1': 'new_name_1', 'old_name_2': 'new_name_2'}).

     def rename_columns(df, columns):
         if isinstance(columns, dict):
             for old_name, new_name in columns.items():
                 df = df.withColumnRenamed(old_name, new_name)
             return df
         else:
             raise ValueError("'columns' should be a dict, like "
                              "{'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}")

So your solution will look like data = rename_columns(data, {'x1': 'x3', 'x2': 'x4'})

This saves me a few lines of code; I hope it helps you too.

+5

Why do you want to do it in a single call? If you print the execution plan, both renames are actually performed in a single step:

    data = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
    data = (data
        .withColumnRenamed('x1', 'x3')
        .withColumnRenamed('x2', 'x4'))
    data.explain()

Output:

    == Physical Plan ==
    *(1) Project [x1#1548L AS x3#1552L, x2#1549L AS x4#1555L]
    +- Scan ExistingRDD[x1#1548L,x2#1549L]

If you want to do this with a list of tuples, you can use a simple map function:

    from pyspark.sql import functions as F

    data = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
    new_names = [("x1", "x3"), ("x2", "x4")]
    data = data.select(list(
        map(lambda old, new: F.col(old).alias(new), *zip(*new_names))
    ))
    data.explain()

It still produces the same plan.

Output:

    == Physical Plan ==
    *(1) Project [x1#1650L AS x3#1654L, x2#1651L AS x4#1655L]
    +- Scan ExistingRDD[x1#1650L,x2#1651L]
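The *zip(*new_names) trick transposes the list of (old, new) pairs into two parallel tuples, which map then consumes as its old and new arguments. A plain-Python sketch:

```python
new_names = [("x1", "x3"), ("x2", "x4")]

# zip(*pairs) transposes: one tuple of old names, one of new names.
olds, news = zip(*new_names)
print(olds)  # → ('x1', 'x2')
print(news)  # → ('x3', 'x4')

# map with two iterables consumes them element-wise, pairing each
# old name with its new name again.
pairs = list(map(lambda old, new: (old, new), olds, news))
print(pairs)  # → [('x1', 'x3'), ('x2', 'x4')]
```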
+1
