SparkR Column provides a long list of useful methods, including isNull and isNotNull:
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
  Id Age
1  1  21
2  2  18
3  3  30
4  4  NA
> filter(people, isNotNull(people$Age)) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
> filter(people, isNull(people$Age)) %>% head()
  Id Age
1  4  NA
Please keep in mind that in SparkR there is no difference between NA and NaN.
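To make this concrete, here is a minimal sketch (the nan_local frame and its Score column are invented for this illustration): once the local frame has been converted with createDataFrame, a NaN is treated as a missing value just like an NA, so the same isNull filter used above should match both rows:

> # Hypothetical frame mixing a regular value, an NA and a NaN
> nan_local <- data.frame(Id=1:3, Score=c(1.5, NA, NaN))
> nan_df <- createDataFrame(sqlContext, nan_local)
> # Should return the rows with Id 2 (NA) and Id 3 (NaN)
> filter(nan_df, isNull(nan_df$Score)) %>% head()

The NA functions introduced next treat the two the same way as well.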
If you prefer operations on an entire data frame, there is a set of NA functions, including fillna and dropna:
> fillna(people, 99) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
4  4  99
> dropna(people) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
Both can be restricted to a subset of columns via the cols argument (a short sketch of this follows the next example), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null values a row must contain to be kept:
> people_with_names_local <- data.frame(
+   Id=1:4,
+   Age=c(21, 18, 30, NA),
+   Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
4  4  NA  <NA>
> dropna(people_with_names, minNonNulls=2) %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
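As a final hedged sketch, the cols argument limits dropna (and fillna) to the listed columns, and fillna also accepts a named list mapping column names to replacement values, in which case cols is ignored. The replacement values 99 and "unknown" below are arbitrary choices for this example:

> # Drop only the rows whose Name is missing; the missing Age in row 4 is ignored
> dropna(people_with_names, cols="Name") %>% head()
> # Give each column its own replacement value
> fillna(people_with_names, list(Age=99, Name="unknown")) %>% head()

The first call should keep only the Alice and Bob rows, while the second should return all four rows with the missing Age replaced by 99 and the missing Names by "unknown".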