SparkR Column provides a long list of useful methods, including isNull and isNotNull:
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
  Id Age
1  1  21
2  2  18
3  3  30
4  4  NA
> filter(people, isNotNull(people$Age)) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
> filter(people, isNull(people$Age)) %>% head()
  Id Age
1  4  NA
Please keep in mind that in SparkR there is no difference between NA and NaN.
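To make this concrete, here is a minimal sketch (the nan_local frame and its Score column are invented for this illustration): once the local frame has been converted with createDataFrame, a NaN is treated as a missing value just like an NA, so the same isNull filter used above should match both rows:

> # Hypothetical frame mixing a regular value, an NA and a NaN
> nan_local <- data.frame(Id=1:3, Score=c(1.5, NA, NaN))
> nan_df <- createDataFrame(sqlContext, nan_local)
> # Should return the rows with Id 2 (NA) and Id 3 (NaN)
> filter(nan_df, isNull(nan_df$Score)) %>% head()

The NA functions introduced next treat the two the same way as well.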
If you prefer operations on an entire data frame, there is a set of NA functions, including fillna and dropna:
> fillna(people, 99) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
4  4  99
> dropna(people) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
Both can be restricted to a subset of columns via the cols argument (a short sketch of this follows the next example), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null values a row must contain to be kept:
> people_with_names_local <- data.frame(
+   Id=1:4,
+   Age=c(21, 18, 30, NA),
+   Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
4  4  NA  <NA>
> dropna(people_with_names, minNonNulls=2) %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
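As a final hedged sketch, the cols argument limits dropna (and fillna) to the listed columns, and fillna also accepts a named list mapping column names to replacement values, in which case cols is ignored. The replacement values 99 and "unknown" below are arbitrary choices for this example:

> # Drop only the rows whose Name is missing; the missing Age in row 4 is ignored
> dropna(people_with_names, cols="Name") %>% head()
> # Give each column its own replacement value
> fillna(people_with_names, list(Age=99, Name="unknown")) %>% head()

The first call should keep only the Alice and Bob rows, while the second should return all four rows with the missing Age replaced by 99 and the missing Names by "unknown".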