Julia DataFrames.jl - filter data with NA (NAException)

I am not sure how to handle NA values in DataFrames.

For example, with the following DataFrame:

    > import DataFrames
    > a = DataFrames.@data ([1, 2, 3, 4, 5]);
    > b = DataFrames.@data ([3, 4, 5, 6, NA]);
    > ndf = DataFrames.DataFrame(a=a, b=b)

I can successfully perform the following operation on column :a:

 > ndf[ndf[:a] .== 4, :] 

but if I try the same operation on :b, I get the error NAException("cannot index an array with a DataArray containing NA values").

    > ndf[ndf[:b] .== 4, :]
    NAException("cannot index an array with a DataArray containing NA values")
    while loading In[108], in expression starting on line 1

     in to_index at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:85
     in getindex at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:210
     in getindex at /Users/abisen/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:268

This is due to the presence of NA.

My question is: how should DataFrames containing NA be handled? I can understand that a > or < comparison against NA is undefined, but == should work (no?).

+6
3 answers

What is your desired behavior here? If you want a selection like this, you can combine two conditions: (not NA) AND (equal to 4). Rows where the first test is false are excluded regardless of what the second test returns.

    using DataFrames
    a = @data([1, 2, 3, 4, 5]);
    b = @data([3, 4, 5, 6, NA]);
    ndf = DataFrame(a=a, b=b)
    ndf[(!isna(ndf[:b]))&(ndf[:b].==4),:]

In some cases, you can simply drop all rows that have NA in specific columns:

 ndf = ndf[!isna(ndf[:b]),:] 
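
On current Julia releases the NA value from DataArrays is no longer used; `missing` from Base plays that role. The same guard can be sketched as follows, assuming Julia 1.x and using plain vectors as stand-ins for the DataFrame columns. The names `raw_mask`, `mask`, and `selected` are illustrative only and do not come from the answer above.

```julia
# Plain vectors stand in for the columns :a and :b; `missing` plays the role of NA.
a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, missing]

# Elementwise == propagates missing, so this mask still holds a missing entry
# and cannot be used directly as a boolean index.
raw_mask = b .== 4

# Guard with a "not missing" test first; `false & missing` evaluates to false,
# so the combined mask contains only true/false values.
mask = .!ismissing.(b) .& (b .== 4)

# Select the entries of `a` for which b == 4.
selected = a[mask]    # [2]
```

The guard works because of three-valued logic in the elementwise AND, not because of short-circuiting: both tests are evaluated for every row.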
+5

As in this question I asked earlier, you can change this NA behavior directly in the source code of the module if you want. The file indexing.jl contains a function Base.to_index(A::DataArray), starting at line 75, where you can add a line that sets the NA entries of the boolean index array to false. For example:

    # Indexing with NA throws an error
    function Base.to_index(A::DataArray)
        A[A.na] = false   # added line: treat NA entries as false
        any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
        Base.to_index(A.data)
    end

Ignoring NA with isna() will result in less readable source code and, in large formulas, performance loss:

    @timeit ndf[(!isna(ndf[:b])) & (ndf[:b] .== 4),:]   # 3.68 µs per loop
    @timeit ndf[ndf[:b] .== 4, :]                        # 2.32 µs per loop

    ## 71x179 2D Array
    @timeit dm[(!isna(dm)) & (dm .< 3)] = 1              # 14.55 µs per loop
    @timeit dm[dm .< 3] = 1                              # 754.79 ns per loop
+1

In many cases, you want to treat NA as a distinct value, that is, assume that every NA is "equal" to every other NA and different from everything else.

If this is the behavior you want, the current DataFrames API will not help you, since (NA == NA) and (NA == 1) both return NA instead of the expected boolean results.

This makes DataFrame filters written as loops very tedious:

    function filter(df,c)
        for r in eachrow(df)
            if (isna(c) && isna(r[:c])) || ( !isna(r[:c]) && r[:c] == c )
                ...

It also breaks the selection functions in DataFramesMeta.jl and Query.jl when NA values are present or queried for.

One way to solve the problem is to use isequal(a,b) instead of a==b:

 test = @where(df, isequal.(:a,"cc"), isequal.(:b,NA) ) #from DataFramesMeta.jl 
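
The difference between == and isequal can be checked in isolation. This is a minimal sketch, assuming Julia 1.x, where `missing` from Base takes the place of NA; the vector `b` and the names `rows_eq_4` and `rows_na` are illustrative and not part of DataFramesMeta.jl.

```julia
b = [3, 4, 5, 6, missing]

# == propagates missing: the result of the comparison is itself missing, not a Bool.
eq_self = missing == missing
eq_one  = missing == 1

# isequal always returns a Bool and treats missing as equal only to itself.
ie_self = isequal(missing, missing)   # true
ie_one  = isequal(missing, 1)         # false

# Used elementwise, isequal yields a purely boolean mask that is safe for selection.
rows_eq_4 = findall(isequal.(b, 4))         # [2]
rows_na   = findall(isequal.(b, missing))   # [5]
```

Because isequal never returns a missing value, the resulting mask can be used for filtering without a separate "not missing" guard.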
0