Perfect (or close) multicollinearity in julia

Question

Perfect (or close) multicollinearity in julia

Launching a simple regression model in Julia with perfect multicollinearity causes an error. In R, we can run the same model producing NAs in the estimates of the corresponding covariates, which R interprets: "not defined due to features." We can identify these variables using the alias() function in R.

Is there any way to check perfect multicollinearity in Julia before modeling in order to abandon collinear variables?

+4

r regression linear-algebra julia-lang

Kani Aug 22 '16 at 14:54

source share

1 answer

Michael ohlrogge · Accepted Answer · 2016-08-22T14:59:54+0000

Discovering Perfect Collinearity

Suppose X is your design matrix. You can verify perfect multicollinearity by doing:

 rank(X) == size(X,2)

This will give false if you have perfect multicollinearity.

Determination of near collinearity + Search for which columns are collinear or close to collinear

I do not know any specific built-in functions for this. But applying some basic principles of linear algebra can quite easily determine this. Below is a function I wrote that does this, and then a more detailed explanation for those interested. Its essence is that we want to find the eigenvalues X*X' equal to zero (for perfect collinearity) or close to zero (for close collinearity). Then we find the eigenvectors associated with these eigenvalues. The components of those eigenvectors that are nonzero (for perfect collinearity) or moderately large (a term that is ambiguous in the nature of “close collinearity”, is ambiguous) are columns that have collinearity problems.

 function LinDep(A::Array, threshold1::Float64 = 1e-6, threshold2::Float64 = 1e-1; eigvec_output::Bool = false) (L, Q) = eig(A'*A) max_L = maximum(abs(L)) conditions = max_L ./ abs(L) max_C = maximum(conditions) println("Max Condition = $max_C") Collinear_Groups = [] Tricky_EigVecs = [] for (idx, lambda) in enumerate(L) if lambda < threshold1 push!(Collinear_Groups, find(abs(Q[:,idx]) .> threshold2)) push!(Tricky_EigVecs, Q[:,idx]) end end if eigvec_output return (Collinear_Groups, Tricky_EigVecs) else return Collinear_Groups end end

A simple example to get you started. It is easy to see that this matrix has collinearity problems:

 A1 = [1 3 1 2 ; 0 0 0 0 ; 1 0 0 0 ; 1 3 1 2] 4x4 Array{Int64,2}: 1 3 1 2 0 0 0 0 1 0 0 0 1 3 1 2 Collinear_Groups1 = LinDep(A1) [2,3] [2,3,4] Max Condition = 5.9245306995900904e16

There are two eigenvalues equal to 0. Thus, the function gives us two sets of “problem” columns. We want to remove one or more columns here to consider collinearity. It is clear that, as in the nature of collinearity, there is no “right” answer. For instance. Col3 is clearly 1/2 of Col4. Thus, we could remove one of them to solve the collinearity problem.

Note: here the maximum condition is the largest ratio of the maximum eigenvalue to each of the other eigenvalues. The general rule is that a maximum condition> 100 means moderate collinearity, and> 1000 means strong collinearity (see, for example, Wikipedia .) But LOT depends on the specifics of your situation, so using simplified rules like this is not particularly advisable . It is much better to consider this as one of the factors among many, including such things as the analysis of eigenvectors and your knowledge of basic data, and where you suspect that collinearity may or may not be present. In any case, we see that there is a lot to expect.

Now let's look at a more complicated situation when there is no perfect collinearity, but close collinearity. We can use the as is function, but I think it’s useful to enable the eigvec_output option eigvec_output that we can see eigenvectors that correspond to problem eigenvalues. In addition, you may want to work a little with the indicated thresholds in order to adjust the sensitivity to close collinearity. Or just set them both quite large (especially the second), and most of your time, exploring the outputs of the eigector.

 srand(42); ## set random seed for reproducibility N = 10 A2 = rand(N,N); A2[:,2] = 2*A2[:,3] +0.8*A2[:,4] + (rand(N,1)/100); ## near collinearity (Collinear_Groups2, Tricky_EigVecs2) = LinDep(A2, eigvec_output = true) Max Condition = 4.6675275950744677e8

Our maximum condition is noticeably less, which is nice, but still clearly quite difficult.

 Collinear_Groups2 1-element Array{Any,1}: [2,3,4] Tricky_EigVecs2[1] julia> Tricky_EigVecs2[1] 10-element Array{Float64,1}: 0.00537466 0.414383 -0.844293 -0.339419 0.00320918 0.0107623 0.00599574 -0.00733916 -0.00128179 -0.00214224

Here we see that columns 2,3,4 have relatively large components of the associated eigenvector. This shows us that these are problematic columns for close collinearity, which, of course, we expected, given how we created our matrix!

Why does it work?

From the basic linear algebra, any symmetric matrix can be diagonalized as:

 A = Q * L * Q'

Where L is the diagonal matrix containing its eigenvalues, and Q is the matrix of its corresponding eigenvectors.

Thus, suppose that in the regression analysis we have a constructive matrix X The matrix X'X will always be symmetric and therefore diagonalizable, as described above.

Similarly, we will always have rank(X) = rank(X'X) , which means that if X contains linearly dependent columns and is less than the full rank, then there will be X'X .

Now recall that by the definition of eigenvalue ( L[i] ) and the eigenvector Q[:,i] we have:

 A * Q[:,i] = L[i] * Q[:,i]

If L[i] = 0 , then it will be:

 A * Q[:,i] = 0

for some nonzero Q[:,i] . This definition of A has a linearly dependent column.

In addition, since A * Q[:,i] = 0 can be rewritten as the sum of the columns of A weighted by the components of Q[:,i] . Thus, if we give S1 and S2 two mutually exclusive sets, then we have

 sum (j in S1) A[:,j]*Q[:,i][j] = sum (j in S2) A[:,j]*Q[:,i][j]

those. some combination of columns A can be written as a weighted combination of other columns.

Thus, if we knew that L[i] = 0 for some i , and then we look at the corresponding Q[:,i] and see Q[:,i] = [0 0 1 0 2 0] , then we know that column 3 = -2 times column 5 and therefore we want to delete one or the other.

Perfect (or close) multicollinearity in julia

Why does it work?

More articles: