Spark Error: expected zero arguments to build a ClassDict (for numpy.core.multiarray._reconstruct)

I have a DataFrame in Spark in which one of the columns contains an array. I have written a UDF that converts that array into another array containing only its unique values. See the example below:

Example: [24, 23, 27, 23] should be converted to [24, 23, 27].

Code:

 def uniq_array(col_array):
     x = np.unique(col_array)
     return x

 uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))
 Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))

In the code above, Df2.age_array is the array column to which I apply the UDF to get another column, "age_array_unique", which should contain only the unique values of the array.

However, as soon as I run the Df3.show() command, I get an error message:

net.razorvine.pickle.PickleException: expected zero arguments to build a ClassDict (for numpy.core.multiarray._reconstruct)

Can anyone tell me why this is happening?

Thanks!

Tags: arrays, user-defined-functions, apache-spark, pyspark, apache-spark-sql
1 answer

The source of the problem is that the object returned from the UDF does not match the declared type. np.unique not only returns a numpy.ndarray, it also converts the numeric values to the corresponding NumPy scalar types (such as numpy.int64), which are not compatible with the DataFrame API. You can try something like this:

 udf(lambda x: list(set(x)), ArrayType(IntegerType())) 

or this (to preserve order):

 from collections import OrderedDict

 udf(lambda xs: list(OrderedDict((x, None) for x in xs)), ArrayType(IntegerType()))

instead.
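As a quick illustration (plain Python, outside Spark, using the question's example values as a hypothetical input), the difference between the two approaches is only in ordering:

```python
from collections import OrderedDict

xs = [24, 23, 27, 23]  # example input from the question

# set-based dedup: unique values, but iteration order is not guaranteed
unordered = list(set(xs))

# OrderedDict-based dedup: unique values in first-seen order
ordered = list(OrderedDict((x, None) for x in xs))

print(sorted(unordered))  # [23, 24, 27]
print(ordered)            # [24, 23, 27]
```

Both return plain Python ints, so either works with ArrayType(IntegerType()).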

If you really want to use np.unique, you have to convert its output:

 udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType())) 
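To see why the conversion matters, here is a small sketch (pure NumPy, no Spark required) of the types involved. Spark's pickler can serialize the plain Python list and ints, but not the NumPy array and NumPy scalars:

```python
import numpy as np

arr = np.unique([24, 23, 27, 23])
# np.unique returns a sorted numpy.ndarray whose elements are
# NumPy scalar types (e.g. numpy.int64 on most platforms)
print(type(arr))     # <class 'numpy.ndarray'>
print(type(arr[0]))  # e.g. <class 'numpy.int64'>

# .tolist() converts both the container and its elements to
# native Python types, which ArrayType(IntegerType()) can serialize
lst = arr.tolist()
print(type(lst), type(lst[0]))  # <class 'list'> <class 'int'>
print(lst)                      # [23, 24, 27]
```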
