PySpark: split a column into multiple columns without pandas

My question is how to split a column into multiple columns. I do not know why df.toPandas() does not work.

For example, I would like to transform 'df_test' into 'df_test2' below. I have seen many examples that use the pandas module. Is there another way? Thank you in advance.

    df_test = sqlContext.createDataFrame([
        (1, '14-Jul-15'),
        (2, '14-Jun-15'),
        (3, '11-Oct-15'),
    ], ('id', 'date'))

df_test2:

    id  day  month  year
     1   14    Jul    15
     2   14    Jun    15
     3   11    Oct    15
1 answer

Spark >= 2.2

You can skip unix_timestamp and use to_date or to_timestamp:

    from pyspark.sql.functions import to_date, to_timestamp

    df_test.withColumn("date", to_date("date", "dd-MMM-yy")).show()
    ## +---+----------+
    ## | id|      date|
    ## +---+----------+
    ## |  1|2015-07-14|
    ## |  2|2015-06-14|
    ## |  3|2015-10-11|
    ## +---+----------+

    df_test.withColumn("date", to_timestamp("date", "dd-MMM-yy")).show()
    ## +---+-------------------+
    ## | id|               date|
    ## +---+-------------------+
    ## |  1|2015-07-14 00:00:00|
    ## |  2|2015-06-14 00:00:00|
    ## |  3|2015-10-11 00:00:00|
    ## +---+-------------------+

and then apply the other datetime functions shown below.
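For instance, a minimal sketch that combines to_date with those functions (the day/month/year column names are chosen here only to match the desired df_test2):

    from pyspark.sql.functions import to_date, dayofmonth, date_format, year, col

    # Parse the string into a proper date, then derive the parts from it.
    # Casting to string mirrors the all-string layout of df_test2.
    transformed = (df_test
        .withColumn("date", to_date("date", "dd-MMM-yy"))
        .withColumn("day", dayofmonth(col("date")).cast("string"))
        .withColumn("month", date_format(col("date"), "MMM"))
        .withColumn("year", year(col("date")).cast("string")))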

Spark < 2.2

It is not possible to derive multiple top-level columns in a single access. You can use structs or collection types with a UDF, as follows:

    from pyspark.sql.types import StringType, StructType, StructField
    from pyspark.sql import Row
    from pyspark.sql.functions import udf, col

    schema = StructType([
        StructField("day", StringType(), True),
        StructField("month", StringType(), True),
        StructField("year", StringType(), True)
    ])

    def split_date_(s):
        try:
            d, m, y = s.split("-")
            return d, m, y
        except:
            return None

    split_date = udf(split_date_, schema)

    transformed = df_test.withColumn("date", split_date(col("date")))
    transformed.printSchema()
    ## root
    ##  |-- id: long (nullable = true)
    ##  |-- date: struct (nullable = true)
    ##  |    |-- day: string (nullable = true)
    ##  |    |-- month: string (nullable = true)
    ##  |    |-- year: string (nullable = true)

but this is not only quite verbose in PySpark, it is also expensive.
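To surface those nested fields as ordinary top-level columns, an extra select is still needed; a minimal sketch, assuming the transformed DataFrame from the snippet above (the flat name is illustrative):

    # Promote the struct fields of `date` to top-level columns
    flat = transformed.select(
        col("id"),
        col("date.day").alias("day"),
        col("date.month").alias("month"),
        col("date.year").alias("year"))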

For date-based conversions, you can simply use the built-in functions:

    from pyspark.sql.functions import unix_timestamp, dayofmonth, year, date_format

    transformed = (df_test
        .withColumn("ts", unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
        .withColumn("day", dayofmonth(col("ts")).cast("string"))
        .withColumn("month", date_format(col("ts"), "MMM"))
        .withColumn("year", year(col("ts")).cast("string"))
        .drop("ts"))
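With the sample data from the question, transformed.show() should produce roughly the following (note that year() returns the four-digit year rather than the two-digit 15 from the original string):

    transformed.show()
    ## +---+---------+---+-----+----+
    ## | id|     date|day|month|year|
    ## +---+---------+---+-----+----+
    ## |  1|14-Jul-15| 14|  Jul|2015|
    ## |  2|14-Jun-15| 14|  Jun|2015|
    ## |  3|11-Oct-15| 11|  Oct|2015|
    ## +---+---------+---+-----+----+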

Similarly, you can use regexp_extract to split the date.
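A minimal sketch of that alternative (the pattern below is illustrative and assumes the dd-MMM-yy strings from the question):

    from pyspark.sql.functions import regexp_extract, col

    # Capture groups: day, three-letter month abbreviation, two-digit year
    pattern = r"(\d{1,2})-([A-Za-z]{3})-(\d{2})"

    transformed = (df_test
        .withColumn("day", regexp_extract(col("date"), pattern, 1))
        .withColumn("month", regexp_extract(col("date"), pattern, 2))
        .withColumn("year", regexp_extract(col("date"), pattern, 3)))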

See also Output multiple columns from one column in a Spark DataFrame

Note

If you use a version that is not patched with SPARK-11724, this will require a fix after unix_timestamp(...) and before cast("timestamp").
