Apache Spark throws a NullPointerException when it encounters a missing feature

Question

I have a strange problem with PySpark when indexing a column of strings in the features. Here is my tmp.csv file:

    x0,x1,x2,x3
    asd2s,1e1e,1.1,0
    asd2s,1e1e,0.1,0
    ,1e3e,1.2,0
    bd34t,1e1e,5.1,1
    asd2s,1e3e,0.2,0
    bd34t,1e2e,4.3,1

where one value of 'x0' is missing. First I read the features from the csv file into a DataFrame using pyspark_csv (https://github.com/seahboonsiew/pyspark-csv), then I index x0 with StringIndexer:

    import pyspark_csv as pycsv
    from pyspark.ml.feature import StringIndexer

    sc.addPyFile('pyspark_csv.py')
    features = pycsv.csvToDataFrame(sqlCtx, sc.textFile('tmp.csv'))
    indexer = StringIndexer(inputCol='x0', outputCol='x0_idx')
    ind = indexer.fit(features).transform(features)
    print ind.collect()

When calling ind.collect(), Spark throws java.lang.NullPointerException. Everything works fine for a complete column, for example 'x1'.

Does anyone know what causes this and how to fix it?

Thanks in advance!

Sergey

Update:

I am using Spark 1.5.1. Exact error:

      File "/spark/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 258, in show
        print(self._jdf.showString(n))
      File "/spark/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/spark/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o444.showString.
    : java.lang.NullPointerException
        at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:208)
        at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:196)
        at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:196)
    ... etc

I tried to create the same DataFrame without reading the csv file,

    df = sqlContext.createDataFrame(
        [('asd2s', '1e1e', 1.1, 0), ('asd2s', '1e1e', 0.1, 0),
         (None, '1e3e', 1.2, 0), ('bd34t', '1e1e', 5.1, 1),
         ('asd2s', '1e3e', 0.2, 0), ('bd34t', '1e2e', 4.3, 1)],
        ['x0', 'x1', 'x2', 'x3'])

and it gives the same error. A slightly different example works fine:

    df = sqlContext.createDataFrame(
        [(0, None, 1.2), (1, '06330986ed', 2.3),
         (2, 'b7584c2d52', 2.5), (3, None, .8),
         (4, 'bd17e19b3a', None), (5, '51b5c0f2af', 0.1)],
        ['id', 'x0', 'num'])

    # after indexing x0
    +---+----------+----+------+
    | id|        x0| num|x0_idx|
    +---+----------+----+------+
    |  0|      null| 1.2|   0.0|
    |  1|06330986ed| 2.3|   2.0|
    |  2|b7584c2d52| 2.5|   4.0|
    |  3|      null| 0.8|   0.0|
    |  4|bd17e19b3a|null|   1.0|
    |  5|51b5c0f2af| 0.1|   3.0|
    +---+----------+----+------+

Update 2:

I just discovered the same problem in Scala, so I think this is a Spark bug rather than a PySpark-only one. In particular, the data frame

    val df = sqlContext.createDataFrame(
      Seq(("asd2s", "1e1e", 1.1, 0), ("asd2s", "1e1e", 0.1, 0),
          (null, "1e3e", 1.2, 0), ("bd34t", "1e1e", 5.1, 1),
          ("asd2s", "1e3e", 0.2, 0), ("bd34t", "1e2e", 4.3, 1))
    ).toDF("x0", "x1", "x2", "x3")

throws java.lang.NullPointerException when indexing the feature "x0". Moreover, when indexing "x0" in the following data frame

    val df = sqlContext.createDataFrame(
      Seq((0, null, 1.2), (1, "b", 2.3),
          (2, "c", 2.5), (3, "a", 0.8),
          (4, "a", null), (5, "c", 0.1))
    ).toDF("id", "x0", "num")

I get "java.lang.UnsupportedOperationException: Schema for type Any is not supported", which is caused by the missing "num" value in the 5th row. If I replace it with a number, everything works well, even with the missing "x0" value in the 1st row.

I also tried older versions of Spark (1.4.1), and the result is the same.

python apache-spark pyspark apache-spark-sql apache-spark-ml

serge_k Nov 06 '15 at 20:02

2 answers

Well, for now, the only solution is either to get rid of the NAs, as @zero323 proposed, or to convert the Spark DataFrame to a Pandas DataFrame with the toPandas() method, impute the data using the sklearn Imputer or any custom imputer (for example, the one in "Impute categorical missing values in scikit-learn"), and then convert the Pandas DataFrame back to a Spark DataFrame and work with it. The underlying issue remains, though, and I will try to submit a bug report if there isn't one already. I am relatively new to Spark, so there is a chance I missed something.
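To make the imputation idea concrete, here is a minimal plain-Python sketch of most-frequent-value imputation for a categorical column. It is independent of Spark, pandas and sklearn; the function name impute_most_frequent is hypothetical, and the sample values are the 'x0' column from the question.

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None entries with the most common non-missing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to impute from: return the column unchanged.
        return list(values)
    fill_value = Counter(observed).most_common(1)[0][0]
    return [fill_value if v is None else v for v in values]


# The 'x0' column from tmp.csv, with the missing entry as None.
x0 = ["asd2s", "asd2s", None, "bd34t", "asd2s", "bd34t"]
x0_imputed = impute_most_frequent(x0)
```

Whether the most frequent value is a sensible fill depends on the data; a dedicated "missing" category is another common choice.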

serge_k Nov 07 '15 at 13:01

zero323 · Accepted Answer · 2015-11-07T04:41:48+0000

It looks like the module you are using converts empty strings to NULL values, and at some point these interfere with downstream processing. At first glance, it looks like a PySpark bug.

How to fix it? A simple workaround is to either drop the nulls before indexing:

 features.na.drop()

or replace the nulls with some placeholder:

    from pyspark.sql.functions import col, when

    features.withColumn(
        "x0", when(col("x0").isNull(), "__SOME_PLACEHOLDER__").otherwise(col("x0")))
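The same placeholder idea can also be applied one step earlier, to the raw CSV lines, so that no empty field reaches the parser at all. The following is only a sketch in plain Python 3: the helper name fill_empty_fields is hypothetical, and it assumes simple comma-separated lines like those in tmp.csv.

```python
import csv
import io

PLACEHOLDER = "__SOME_PLACEHOLDER__"


def fill_empty_fields(line, placeholder=PLACEHOLDER):
    """Return the CSV line with every empty field replaced by placeholder."""
    fields = next(csv.reader([line]))
    # Treat whitespace-only fields as empty too.
    filled = [f if f.strip() else placeholder for f in fields]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(filled)
    return out.getvalue()


fixed = fill_empty_fields(",1e3e,1.2,0")
```

Assuming the loader accepts an RDD of text lines, as in the question, this could be mapped over sc.textFile('tmp.csv') before parsing. Note that it fills empty fields in every column, including numeric ones, so you may prefer to restrict it to the string columns.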

Alternatively, you can use spark-csv. It is efficient, tested and, as a bonus, does not convert empty strings to nulls.

    features = (sqlContext.read
        .format('com.databricks.spark.csv')
        .option("inferSchema", "true")
        .option("header", "true")
        .load("tmp.csv"))
