Multicollinearity in Regression Model

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. This is a common situation in real-life regression problems, and depending on the goal of the analysis, it can be more or less troublesome.

If the goal is simply to predict Y from a set of X variables, then multicollinearity is usually not a problem, provided the new data come from the same population as the training data, with the same correlation structure among the predictors. The predictions will still be accurate, and the overall R^2 (or adjusted R^2) quantifies how well the model predicts the Y values. If the goal is to understand how the various X variables impact Y, then multicollinearity is a big problem.

We begin with the linear regression model. The most basic version uses the ordinary least squares (OLS) technique to estimate the regression coefficients. The solution of an OLS linear regression is given by:

\hat{\beta} = (X^TX)^{-1} X^T y
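As a minimal sketch of this formula, the snippet below fits OLS on simulated data with NumPy. The data, the seed, and the names (X, y, beta_true) are illustrative assumptions, not part of the original discussion. It solves the normal equations directly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data are purely illustrative
n = 100

# Design matrix with an intercept column and two unrelated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -3.0])  # hypothetical "true" coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least-squares solver, generally the numerically safer route.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

With well-conditioned predictors like these, both routes should agree to numerical precision.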

Note here that if X is not of full column rank (i.e. the predictors are not linearly independent), X^TX is not invertible, and thus there is no unique solution for \hat{\beta}. In practice the predictors are rarely exactly dependent, but when they are strongly correlated, X^TX is close to singular, and the estimates, although they exist, become very unstable. This is where all the troubles begin. One problem is that the individual P-values can be misleading (a P-value can be high, even though the variable is important). The second problem is that the confidence intervals on the regression coefficients will be very wide. The confidence intervals may even include zero, which means one can’t even be confident whether an increase in the X value is associated with an increase, or a decrease, in Y. Because the estimates are so unstable, excluding an observation (or adding a new one) can change the coefficients dramatically and may even change their signs.
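The difference between exact and near collinearity can be sketched numerically. The example below uses simulated data (all names and noise levels are assumptions made for illustration). It checks the rank of a design matrix with an exactly duplicated direction, then compares the condition number of X^TX for nearly collinear predictors and for unrelated ones.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed; illustrative data only
n = 50
x1 = rng.normal(size=n)

# Exact collinearity: the third column is a multiple of the second,
# so X has rank 2 instead of 3 and X^T X cannot be inverted.
X_exact = np.column_stack([np.ones(n), x1, 2.0 * x1])
rank_exact = np.linalg.matrix_rank(X_exact)

# Near collinearity: x2 is x1 plus a little noise. X^T X is technically
# invertible, but it is badly conditioned.
x2_near = x1 + 0.01 * rng.normal(size=n)
X_near = np.column_stack([np.ones(n), x1, x2_near])
cond_near = np.linalg.cond(X_near.T @ X_near)

# Reference case: two unrelated predictors.
X_indep = np.column_stack([np.ones(n), x1, rng.normal(size=n)])
cond_indep = np.linalg.cond(X_indep.T @ X_indep)

print("rank with exact collinearity:", rank_exact)
print("condition number, near-collinear:", cond_near)
print("condition number, independent:", cond_indep)
```

A large condition number means that small perturbations of the data can produce large changes in the solution, which is the numerical side of the instability described above.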

The unstable P-values can lead to misleading and confusing results. For example, if we have two correlated predictors, we might stumble upon an extreme situation in which the F-test confirms that the model is useful for predicting y, but the coefficient t-tests are non-significant, which suggests that neither of the two predictors is significantly associated with y! The explanation is quite simple. Let \beta_1 and \beta_2 be the two coefficients: \beta_1 is the expected change in y due to x_1 given that x_2 is already in the model, and vice versa, \beta_2 is the expected change in y due to x_2 given that x_1 is already in the model. Since x_1 and x_2 contribute redundant information about y, once one of the predictors is in the model, the other one does not have much more to contribute. This is why the F-test indicates that at least one of the predictors is important, yet the individual t-tests indicate that the contribution of each predictor, given that the other one has already been included, is not really important.

Statistically speaking, high multicollinearity inflates the standard errors of the coefficient estimates (and thus decreases their reliability). Consequently, multicollinearity results in a decline in the t-statistic (because t = \frac{\hat{\beta}}{SE(\hat{\beta})}). This means that the power of the hypothesis test (the probability of correctly detecting a non-zero coefficient) is reduced by collinearity: a predictor with a small coefficient might be “masked”, appearing non-significant even though its true effect is non-zero.
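The inflation of the standard errors can be illustrated with a small simulation. This is a sketch under assumptions: the data are simulated, and the helper ols_standard_errors and the noise levels are hypothetical choices, not something from the original text. It fits the same model twice, once with an unrelated second predictor and once with a nearly collinear one. It also computes the variance inflation factor, which for two predictors reduces to 1 / (1 - r^2), where r is their correlation.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficients and their standard errors.

    X must already contain the intercept column.
    """
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)        # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)  # fixed seed; illustrative data only
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

x2_indep = rng.normal(size=n)             # unrelated to x1
x2_coll = x1 + 0.05 * rng.normal(size=n)  # nearly a copy of x1

results = {}
for label, x2 in [("independent", x2_indep), ("collinear", x2_coll)]:
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 1.0 * x1 + 1.0 * x2 + noise  # same true coefficients in both cases
    beta, se = ols_standard_errors(X, y)
    r = np.corrcoef(x1, x2)[0, 1]
    results[label] = {"beta": beta, "se": se, "t": beta / se,
                      "vif": 1.0 / (1.0 - r ** 2)}
    print(label, "SE:", se, "t:", beta / se, "VIF:", results[label]["vif"])
```

In the collinear case the standard errors of the two slopes should be many times larger, and the t-statistics correspondingly smaller, even though the true coefficients are the same in both designs.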

So we can see that the biggest problem with multicollinearity is that the predictor coefficients have a large variance, and thus it is very hard to interpret their effect.

Many methods have been proposed to overcome this problem. Some people suggest dropping the correlated variables. However, this is a very risky method because:

  1. As the variable coefficients are not stable, we cannot be sure which variables are the most suitable ones to drop; moreover, removing one variable will cause the coefficients of the other correlated variables to change in an unpredictable way.
  2. If we use step-wise regression for variable selection, we risk overfitting the dataset. (In general, step-wise methods are not recommended.)

Another direction is to use shrinkage methods, especially ridge regression. The solution to the ridge regression problem is given by:

\hat{\beta} = (X^TX + \lambda I)^{-1} X^T y

At the beginning we said that in the linear regression model, the OLS estimates do not always exist because X^TX is not always invertible. With ridge regression, the problem is solved: for any design matrix X and any \lambda > 0, the matrix (X^TX + \lambda I) is always invertible; thus, there is always a unique solution \hat{\beta}.

Ridge regression uses the regularization parameter \lambda to penalize large coefficients. Being a biased estimator, it trades some bias for a reduction in variance, and therefore results in more stable estimates.
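A minimal sketch of the closed-form ridge solution follows, again on simulated data. The function name ridge, the value lam=1.0, and the data are assumptions chosen for illustration. No intercept is fitted here; in practice the predictors are usually centered and scaled first, and the intercept is left unpenalized. The design deliberately contains the same predictor twice, so X^TX is singular and OLS has no unique solution, while ridge still returns one.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)  # fixed seed; illustrative data only
n = 100
x1 = rng.normal(size=n)

# The same predictor twice: an extreme case of collinearity.
X = np.column_stack([x1, x1])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

rank_xtx = np.linalg.matrix_rank(X.T @ X)  # 1, so X^T X is not invertible
beta_ridge = ridge(X, y, lam=1.0)

print("rank of X^T X:", rank_xtx)
print("ridge coefficients:", beta_ridge)
```

Because the penalty treats both columns symmetrically, ridge splits the effect evenly between the two copies, and their sum stays close to the single underlying coefficient (slightly shrunk toward zero).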


2 thoughts on “Multicollinearity in Regression Model”

  1. > If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem

    Not always. If the value of at least one of the predictor variables is significantly different from what was seen during training, the prediction may be very wrong. Of course, this can happen without the collinearity problem, but to a smaller extent, since there is a higher probability that the model managed to learn better “how the various X variables impact Y”.

    1. Yes you are totally correct! I should have specified in my statement that it is valid only when the testing set comes from the same population with the same multicollinearity level as the training set.
