Unveiling the Depths of Linear Regression

The Probabilistic Foundations of Linear Regression

8/26/2024 · 3 min read

Linear regression is a cornerstone of statistical methods, forming the foundation of supervised learning and neural networks. While many people have a basic understanding of it, this blog aims to uncover its deeper nature.

Foundation

At its core, linear regression begins with the equation Y = Xβ + e, where (X, Y) represents our observed data, with X as the explanatory variable and Y as the predicted variable. Here, β is the true parameter of the linear regression, and e is the error term with mean zero and constant variance sigma^2. The key assumptions underlying this model are as follows:

  • A linear relationship between X and Y

  • Constant variance of the error term (homoscedasticity)

  • No correlation between the error term and X

  • Zero mean of the error term

Given these assumptions, we can derive the conditional expectation E(Y|X, theta) = Xβ and the conditional variance VAR(Y|X, theta) = sigma^2, where theta represents the model parameters, including both β and sigma.

Since Y = Xβ + e, Y is essentially the error term e shifted by the quantity Xβ. If e follows a normal distribution, then Y also follows a normal distribution: Y | X, theta ~ N(Xβ, sigma^2). When we condition Y on X and theta, Xβ becomes deterministic, with randomness introduced solely by the error term. Linear regression focuses on the conditional probability distribution of Y given the value of X.
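To make this probabilistic view concrete, here is a minimal simulation sketch of the data-generating process. The specific values of β and sigma and the design of X are illustrative assumptions, not anything prescribed by the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500
beta = np.array([2.0, -1.5])   # assumed "true" parameters (illustrative)
sigma = 0.8                    # assumed error standard deviation (illustrative)

# Design matrix: an intercept column plus one explanatory variable
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, size=n)])

e = rng.normal(0.0, sigma, size=n)   # e ~ N(0, sigma^2), independent of X
Y = X @ beta + e                     # so Y | X ~ N(X beta, sigma^2)

# The conditional mean is X beta; the scatter around it has constant spread sigma
print(Y[:3])
print((X @ beta)[:3])
```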

This probabilistic aspect is crucial to understanding the true nature of linear regression—it isn't just about finding a line of best fit, but about modeling the underlying probability distribution of the data. It provides a framework for understanding both the expected outcome and the uncertainty around that expectation.

Estimation

How do we estimate β? There are two commonly used methods: Ordinary Least Squares (OLS) and Maximum Likelihood Estimation (MLE). OLS finds the estimator of β that minimizes the sum of squared residuals, where a residual is the difference between the observed Y and the predicted Y (our estimate of the unobserved error term). The sum of squared residuals SUM((Y_i - X_i β)^2) is a convex function of β, so setting its partial derivative with respect to β to zero yields the minimizer. The OLS estimator is β_hat = (X^T X)^(-1) X^T Y. If X has full column rank, this solution is unique; if X is not full rank, X^T X is not invertible and the normal equations admit infinitely many solutions for β. OLS is also not robust to outliers: because squared error assigns much greater weight to large residuals, the fitted model can be pulled toward outlying points rather than the bulk of the data.
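Below is a minimal numpy sketch of the OLS computation via the normal equations; the simulated data are an assumption made only for illustration. Solving the linear system is preferred to explicitly inverting X^T X.

```python
import numpy as np

def ols_fit(X, Y):
    """OLS estimate: solve the normal equations X^T X beta = X^T Y.

    Assumes X has full column rank; np.linalg.solve avoids forming
    the explicit inverse of X^T X.
    """
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Illustrative data (assumed, not from the post)
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, size=n)])
Y = X @ np.array([2.0, -1.5]) + rng.normal(0.0, 0.8, size=n)

beta_hat = ols_fit(X, Y)
print(beta_hat)   # close to the true (2.0, -1.5)
```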

The MLE method, on the other hand, finds the β that maximizes the likelihood of observing the given data. As derived earlier, Y | X, theta ~ N(Xβ, sigma^2) if the error term follows a normal distribution. The likelihood function is then p(Y|X, theta) = PRODUCT_i f(Y_i; X_i β, sigma^2), where f is the normal density. Because the exponent of the normal density contains the sum of squared residuals, the joint likelihood is maximized exactly when that sum is minimized, making the MLE estimator identical to the OLS estimator under the normality assumption on the error term. When the error term does not follow a normal distribution, MLE can still be applied by specifying a likelihood function based on the actual distribution of the error terms.
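As a sanity check on the OLS/MLE equivalence under normal errors, here is a sketch that maximizes the Gaussian log-likelihood numerically with scipy and compares the result to the normal-equations solution; the data and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 3.0]) + rng.normal(0.0, 0.5, size=n)  # illustrative data

def neg_log_likelihood(params):
    # params = (beta_0, beta_1, log_sigma); the log keeps sigma positive
    beta, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2.0 * log_sigma)
    resid = Y - X @ beta
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x[:-1]
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_mle, beta_ols)   # identical up to optimizer tolerance
```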

Properties of Beta
The OLS estimator is β_hat = (X^T X)^(-1) X^T Y. Substituting Y = Xβ + e gives β_hat = (X^T X)^(-1) X^T Xβ + (X^T X)^(-1) X^T e = β + (X^T X)^(-1) X^T e. Thus, β_hat is β plus a linear transformation of the error term. Since the error term has mean zero and constant variance sigma^2, it follows that E(β_hat) = β and VAR(β_hat) = sigma^2 (X^T X)^(-1).

If the error term follows a normal distribution, then β_hat ~ N(β, sigma^2 (X^T X)^(-1)). Even if the error term doesn't follow a normal distribution, β_hat is approximately normally distributed when the sample size is sufficiently large, by the Central Limit Theorem (CLT). The CLT applies because β_hat - β = Ae, where A = (X^T X)^(-1) X^T, so each component of β_hat is a weighted sum of independent random variables, assuming the errors are independent.
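The following Monte Carlo sketch checks these properties empirically with deliberately non-normal (centered exponential) errors; the sample size, design, and parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 5000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
beta, sigma = np.array([0.5, 2.0]), 1.0   # illustrative true values

# Centered exponential errors: mean 0, variance sigma^2, clearly non-normal
E = rng.exponential(scale=sigma, size=(reps, n)) - sigma

# beta_hat = beta + (X^T X)^(-1) X^T e, computed for every replication at once
B = beta + np.linalg.solve(X.T @ X, X.T @ E.T).T

print(B.mean(axis=0))                 # ~ beta                  (unbiasedness)
print(np.cov(B, rowvar=False))        # ~ sigma^2 (X^T X)^(-1)
print(sigma**2 * np.linalg.inv(X.T @ X))
# A histogram of B[:, 1] looks approximately normal, as the CLT predicts.
```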

By the Gauss-Markov theorem, β_hat is the BLUE (Best Linear Unbiased Estimator) provided that linearity holds, the error terms are independently and identically distributed with mean zero and constant variance sigma^2, X is a full-rank matrix (implying no perfect multicollinearity), and the observations (X, Y) are randomly and independently sampled from the population. "Best" here means having the smallest variance among all linear unbiased estimators.

Let's dive into the linear algebra aspect of linear regression. Starting again from Y = Xβ + e, where e is the error term with mean zero and constant variance, β is estimated by minimizing the sum of squared residuals, which leads to the normal equations X^T Xβ = X^T Y and thus β_hat = (X^T X)^(-1) X^T Y. When X^T X is invertible, the estimate of β is uniquely defined. This is the case exactly when X has full column rank (e.g., no perfect multicollinearity and the number of observations n exceeds the number of features p). When X^T X is not invertible, the estimate of β is not unique and its variance becomes inflated; in such cases, techniques like the pseudoinverse, feature selection, and regularization are typically employed to address the problem.

For the geometric interpretation of linear regression, note that Xβ can be written as β_1*X_c1 + β_2*X_c2 + ... + β_p*X_cp, where X is an n*p matrix and X_ci is its i-th column vector, so Xβ is a linear combination of the columns of X. Essentially, we aim to find the vector Y_hat in the column space of X such that ||Y - Y_hat|| is minimized. This happens when Y_hat is the orthogonal projection of the vector Y onto the column space of X. Writing the projection as Y_hat = P*Y, the projection matrix is P = X(X^T X)^(-1) X^T.
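To tie the geometry back to the algebra, here is a small sketch that forms the projection matrix P = X(X^T X)^(-1) X^T on simulated data (illustrative, and sensible only for small n, since P is n*n) and verifies its defining properties.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))   # assumed full-rank design, for illustration
Y = rng.normal(size=n)

# Projection (hat) matrix onto the column space of X
P = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = P @ Y

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(Y_hat, X @ beta_hat))             # projection equals the OLS fit
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.allclose(X.T @ (Y - Y_hat), 0.0))          # residuals orthogonal to col(X)
```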