How to arrange fields of Row objects in Spark (Python)

Question

I am creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following, they are sorted alphabetically.

row = Row(foo=1, bar=2)

Then it creates an object similar to the following:

 Row(bar=2, foo=1)

When I then create a DataFrame from this object, the column order will be bar first and foo second, when I would rather have it the other way around.

I know that I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with the corresponding names "foo" and "bar"). But is there a way to prevent the Row object from ordering them?

+11

python apache-spark pyspark apache-spark-sql pyspark-sql

rye Feb 11 '16 at 15:33

3 answers

From the documentation:

Row can also be used to create another Row-like class, which can then be used to create Row objects.

In this case, the column order is preserved:

 >>> FooRow = Row('foo', 'bar')
 >>> row = FooRow(1, 2)
 >>> spark.createDataFrame([row]).dtypes
 [('foo', 'bigint'), ('bar', 'bigint')]

+1

Patrick z Feb 07 '17 at 12:05

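To sort the original schema so that it matches the alphabetical field order of the RDD:

 from pyspark.sql.types import StructType

 # df is an existing DataFrame; rebuild its schema with fields sorted by name
 schema_sorted = StructType()
 structfield_list_sorted = sorted(df.schema, key=lambda x: x.name)
 for item in structfield_list_sorted:
     schema_sorted.add(item)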

+1

bloodrootfc Mar 05 '18 at 16:38

zero323 · Accepted Answer · 2016-02-11T15:50:48+0000

But is there a way to prevent the Row object from ordering them?

No. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python versions before 3.6 do not preserve the order of keyword arguments.
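
To illustrate why sorting by name gives a deterministic result, here is a minimal pure-Python sketch. Note that make_sorted_row is a hypothetical helper written only for this illustration; it is not PySpark's implementation and merely mimics the behavior described above.

```python
def make_sorted_row(**kwargs):
    """Hypothetical stand-in that orders keyword arguments by name."""
    # Sorting the names makes the result independent of the order
    # in which the keyword arguments were written at the call site.
    names = sorted(kwargs)
    values = tuple(kwargs[name] for name in names)
    return tuple(names), values


names_a, values_a = make_sorted_row(foo=1, bar=2)
names_b, values_b = make_sorted_row(bar=2, foo=1)
```

Both calls produce the same field order (bar, then foo), which matches the Row(bar=2, foo=1) shown in the question.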

Just use simple tuples:

 rdd = sc.parallelize([(1, 2)])

and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):

 rdd.toDF(["foo", "bar"])

or createDataFrame:

 from pyspark.sql.types import IntegerType, StructField, StructType

 spark.createDataFrame(rdd, ["foo", "bar"])

 # With full schema
 schema = StructType([
     StructField("foo", IntegerType(), False),
     StructField("bar", IntegerType(), False)])
 spark.createDataFrame(rdd, schema)

You can also use namedtuples:

 from collections import namedtuple

 FooBar = namedtuple("FooBar", ["foo", "bar"])
 spark.createDataFrame([FooBar(foo=1, bar=2)])
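
The reason namedtuples help can be checked without Spark: the field order is fixed by the class definition, not by the order of the keyword arguments at the call site. A small standard-library sketch (no Spark session is assumed here):

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])

# Keyword arguments are deliberately passed in the reverse order.
record = FooBar(bar=2, foo=1)

field_order = FooBar._fields  # field names in declaration order
values = tuple(record)        # values follow the declaration order
```

Regardless of how the arguments are written, the fields stay in the declared order foo, bar.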

Finally, you can reorder the columns with select:

 sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
