# Multicollinearity in Regression Models

Multicollinearity occurs when two or more predictors in a model are correlated and provide redundant information about the response. This is a common situation in real-life regression problems, and depending on the analyst's goal it can be more or less troublesome.

If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem: the predictions will still be accurate, and the overall $R^2$ (or adjusted $R^2$) quantifies how well the model predicts the Y values. If the goal is to understand how the various X variables impact Y, however, then multicollinearity is a big problem.

We begin with the linear regression model. The most basic approach uses the Ordinary Least Squares (OLS) technique to estimate the regression coefficients. The solution of an OLS linear regression is given by:

$\hat{\beta} = (X^TX)^{-1} X^T y$

Note here that if $X$ is not full rank (i.e. the predictor columns are linearly dependent), $X^TX$ is not invertible, and thus there is no unique solution for $\hat{\beta}$. This is where all the trouble begins. One problem is that the individual p-values can be misleading: a p-value can be high even though the variable is important. A second problem is that the confidence intervals on the regression coefficients will be very wide. They may even include zero, which means one cannot even be confident whether an increase in an X value is associated with an increase or a decrease in Y. Because the confidence intervals are so wide, excluding a subject (or adding a new one) can change the coefficients dramatically and may even change their signs.
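A quick numerical sketch of the rank problem (hypothetical data, using NumPy): if one predictor is an exact linear combination of the others, $X^TX$ becomes singular and the normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                                     # exact linear dependence among predictors
X = np.column_stack([np.ones(n), x1, x2, x3])    # design matrix with intercept

gram = X.T @ X
# Four columns, but only three are linearly independent,
# so X^T X is singular and (X^T X)^{-1} does not exist.
print(np.linalg.matrix_rank(gram))   # 3, not 4
```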

The unstable p-values can lead to some misleading and confusing results. For example, with two correlated predictors we might stumble upon an extreme situation in which the F-test confirms that the model is useful for predicting $y$, yet the individual t-tests are non-significant, suggesting that neither of the two predictors is significantly associated with $y$! The explanation is quite simple. If $\beta_1$ and $\beta_2$ are the two estimated coefficients, $\beta_1$ is the expected change in $y$ due to $x_1$ given that $x_2$ is already in the model, and vice versa, $\beta_2$ is the expected change in $y$ due to $x_2$ given that $x_1$ is already in the model. Since $x_1$ and $x_2$ contribute redundant information about $y$, once one of the predictors is in the model, the other does not have much more to contribute. This is why the F-test indicates that at least one of the predictors is important, yet each individual t-test indicates that the contribution of that predictor, given that the other one has already been included, is not really important.
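This F-versus-t phenomenon is easy to reproduce on simulated data (a hypothetical example; all variable names are made up). With two nearly identical predictors, the overall F statistic tends to be large while each slope's t statistic stays small:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)       # correlation with x1 close to 1
y = x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1]
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
rss = resid @ resid
tss = ((y - y.mean()) ** 2).sum()

# Overall F-test: does the model explain y at all?
F = ((tss - rss) / (p - 1)) / (rss / (n - p))
p_F = stats.f.sf(F, p - 1, n - p)

# Individual t-tests: each slope given that the other is in the model
sigma2 = rss / (n - p)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se
p_t = 2 * stats.t.sf(np.abs(t), n - p)

print(f"F p-value: {p_F:.2e}")                         # typically tiny: the model is useful
print(f"slope p-values: {p_t[1]:.2f}, {p_t[2]:.2f}")   # typically large: neither slope "matters" alone
```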

Statistically speaking, high multicollinearity inflates the standard errors of the coefficient estimates (and thus decreases their reliability). Consequently, multicollinearity drives down the t-statistics (because $t = \frac{\hat{\beta}}{SE(\hat{\beta})}$). This means that the power of the hypothesis test (the probability of correctly detecting a non-zero coefficient) is reduced by collinearity: a predictor with a real but modest effect might be "masked" and appear non-significant.
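This standard-error inflation can be quantified with the variance inflation factor, $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is obtained by regressing predictor $j$ on all the other predictors. A minimal NumPy sketch (the `vif` function name and the simulated data are ours):

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (predictors only, no intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # regress column j on an intercept plus the remaining columns
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(Z, target, rcond=None)[0]
        resid = target - Z @ coef
        r2 = 1.0 - resid @ resid / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)    # strongly collinear with x1
x3 = rng.normal(size=200)               # independent of the others
v = vif(np.column_stack([x1, x2, x3]))
print(v)   # large VIFs for x1 and x2, near 1 for x3
```

A common rule of thumb flags predictors with $VIF > 10$ (some practitioners use 5) as suffering from problematic collinearity.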

So we can see that the biggest problem with multicollinearity is that the estimated coefficients have a large variance, which makes it very hard to interpret their effects.

Many methods have been proposed to overcome this problem. Some people suggest dropping the correlated variables. However, this is a very risky approach because:

1. Since the coefficients are unstable, we cannot be sure which variables are the most suitable ones to drop; moreover, removing one variable will cause the coefficients of the remaining correlated variables to change in unpredictable ways.
2. If we use step-wise regression for variable selection, we risk overfitting the dataset. (In general, step-wise methods are not recommended.)

Another direction is to use shrinkage methods, especially ridge regression. The solution to the ridge regression problem is given by:

$\hat{\beta} = (X^TX + \lambda I)^{-1} X^T y$

At the beginning we said that in the linear regression model the OLS estimates do not always exist, because $X^TX$ is not always invertible. With ridge regression, this problem is solved: for any design matrix $X$ and any $\lambda > 0$, the quantity $(X^TX + \lambda I)$ is always invertible, so there is always a unique solution $\hat{\beta}$.

Ridge regression uses the regularization parameter $\lambda$ to penalize large coefficients. The resulting estimator is biased: it trades some degree of bias for a reduction in variance, and therefore yields more stable estimates.
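To illustrate on hypothetical data: even when two columns of $X$ are exactly identical (so OLS has no unique solution), the ridge system solves cleanly and splits the effect between the duplicated predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1])          # two identical predictors: X^T X is singular
y = 2 * x1 + rng.normal(scale=0.5, size=n)

lam = 1.0
gram = X.T @ X
print(np.linalg.matrix_rank(gram))     # 1: the OLS normal equations are degenerate

# Ridge: gram + lam*I is positive definite for any lam > 0, so this always solves
beta_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)
print(beta_ridge)                      # roughly (1, 1): the effect is shared equally
```

Note how the penalty resolves the ambiguity in a symmetric way: since neither duplicated column is preferred, each receives about half of the total effect.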