How to arrange fields of Row objects in Spark (Python)

I am creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following, they are sorted alphabetically.

row = Row(foo=1, bar=2) 

Then it creates an object similar to the following:

 Row(bar=2, foo=1) 

When I then create a DataFrame from this object, the column order will be bar first and foo second, when I would rather have it the other way around.

I know that I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with the corresponding names "foo" and "bar"). But is there a way to prevent the Row object from ordering them?

3 answers

But is there a way to prevent the Row object from ordering them?

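No. If you provide kwargs, the arguments will be sorted by name. Sorting is necessary for deterministic behavior, since Python prior to 3.6 does not preserve the order of keyword arguments. (As of Spark 3.0, fields passed as named arguments are no longer sorted on Python 3.6 and above, so this applies to earlier versions.)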
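To make the effect of sorting concrete, here is a minimal pure-Python sketch that mimics the behavior described above. `make_sorted_row` is a hypothetical helper written for illustration only; it is not PySpark's implementation.

```python
def make_sorted_row(**kwargs):
    """Hypothetical stand-in: return (names, values) with fields sorted by name."""
    names = sorted(kwargs)  # sorting makes the result independent of call order
    values = tuple(kwargs[name] for name in names)
    return tuple(names), values


names, values = make_sorted_row(foo=1, bar=2)
print(names, values)  # ('bar', 'foo') (2, 1)
```

Whatever order the keyword arguments are written in, the result is the same, which is the determinism the sorting buys at the cost of losing the declared order.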

Just use simple tuples:

 rdd = sc.parallelize([(1, 2)]) 

and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):

 rdd.toDF(["foo", "bar"]) 

or to createDataFrame:

 from pyspark.sql.types import IntegerType, StructField, StructType

 # With column names only
 spark.createDataFrame(rdd, ["foo", "bar"])

 # With full schema
 schema = StructType([
     StructField("foo", IntegerType(), False),
     StructField("bar", IntegerType(), False)])
 spark.createDataFrame(rdd, schema)
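If the source records are dictionaries rather than tuples, one way to apply the advice above is to flatten each record into a tuple in a fixed column order first. The sketch below is plain Python; `to_ordered_tuples`, `columns`, and the sample records are hypothetical names introduced for illustration.

```python
def to_ordered_tuples(records, columns):
    """Turn dict records into tuples whose positions follow `columns`."""
    return [tuple(record[name] for name in columns) for record in records]


columns = ["foo", "bar"]  # desired column order (illustrative)
records = [{"bar": 2, "foo": 1}, {"bar": 4, "foo": 3}]  # made-up sample data

rows = to_ordered_tuples(records, columns)
print(rows)  # [(1, 2), (3, 4)]
```

The resulting list can then be passed together with `columns`, for example `spark.createDataFrame(rows, columns)`, in the same way as the tuple-based calls above.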

You can also use namedtuples:

 from collections import namedtuple

 FooBar = namedtuple("FooBar", ["foo", "bar"])
 spark.createDataFrame([FooBar(foo=1, bar=2)])
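The namedtuple approach relies on the fact that a namedtuple keeps its fields in the order they were declared, regardless of the order of the keyword arguments used to construct it. A small check in plain Python (no Spark needed; `record` is just an illustrative name):

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])

# Keyword order at construction differs from the declared field order.
record = FooBar(bar=2, foo=1)

print(FooBar._fields)  # ('foo', 'bar')
print(tuple(record))   # (1, 2)
```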

Finally, you can reorder the columns with select:

 from pyspark.sql import Row

 sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")

From the documentation:

Row can also be used to create another Row-like class, which can then be used to create Row objects.

In this case, the column order is preserved:

 >>> FooRow = Row('foo', 'bar')
 >>> row = FooRow(1, 2)
 >>> spark.createDataFrame([row]).dtypes
 [('foo', 'bigint'), ('bar', 'bigint')]

To sort the original schema so that it matches the alphabetical field order of the RDD:

 from pyspark.sql.types import StructType

 # Rebuild the schema with its fields sorted by name
 schema_sorted = StructType()
 structfield_list_sorted = sorted(df.schema, key=lambda x: x.name)
 for item in structfield_list_sorted:
     schema_sorted.add(item)
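The sorting step itself can be tried without Spark. In this sketch, `Field` is a hypothetical stand-in for a schema field (only its `name` attribute matters here), and the field names and types are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a schema field; not a PySpark class.
Field = namedtuple("Field", ["name", "data_type"])

schema = [Field("foo", "int"), Field("bar", "string")]

# Same idea as above: order the fields alphabetically by their name.
fields_sorted = sorted(schema, key=lambda f: f.name)
print([f.name for f in fields_sorted])  # ['bar', 'foo']
```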