OLS, Ridge, and LASSO

Posted by: Christina P. on February 7, 2021

I recently read a great paper by Hal Varian, titled “Big Data: New Tricks for Econometrics.” The paper, published in the Spring 2014 issue of the Journal of Economic Perspectives, describes new tools that economists can use to home in on the ever-relevant world of Big Data.

Hal Varian, by the way, is currently the Chief Economist at Google, following a superlative career in academia. With over 2 TRILLION searches per year on Google (source: SearchEngineLand), Big Data naturally plays a Big Role in his recent work.

One of the issues that Varian discussed in his paper is that of variable selection. That is, when confronted with a large dataset, and many potential explanatory variables, how do you know which ones to include in the model?

You should first be informed by theory and common sense. I am reminded of the following comic:

[Comic: the probability of failing to reject a hypothesis when an experiment is repeated many times.]
(Source: XKCD).

In the case of variable selection, if a model selects an explanatory variable that has no theoretical or common-sense relationship to the outcome variable, then it may be a spurious result. For example, Varian uses a variable selection technique called Bayesian Structural Time Series (BSTS) to search for Google queries that are predictors of new home sales in the US. The resulting predictors are shown below:

Source: Figure 6 in Varian (2014)

The initial predictors in block A include the search terms “oldies lyrics” and “www.mail2web”. Since there is no common-sense relationship between these search terms and new home sales in the US, they are eliminated from the model. Re-running the model without them produces the search terms shown in block B.

BSTS is a useful technique for variable selection when working with time-series data, such as seasonal home sales data. However, as the title of this post suggests, I would like to discuss two very simple methods for variable selection that are closely related to the Ordinary Least Squares (OLS) linear regression model: namely, Ridge regression and the Least Absolute Shrinkage and Selection Operator (LASSO).

Suppose that we observe the following data, with n observations and (k-1) explanatory variables:

$(y_i, x_{i,1}, \dots, x_{i,k-1})$, for $i = 1, \dots, n$.

Define ${y} \in \mathbb{R}^{n}$, ${X} \in \mathbb{R}^{n \times k}$, and ${b} \in \mathbb{R}^{k}$ as:

$$
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_1^T \\ \vdots & \vdots \\ 1 & x_n^T \end{pmatrix}, \qquad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{k-1} \end{pmatrix},
$$

where ${x_i} = (x_{i,1}, \dots, x_{i,k-1})^T$, for $i \in \{1,\dots,n\}$.

The following optimization problem, solved for the parameters in ${b}$, nests all three regression methodologies:

$$
\min_{b_0,\dots,b_{k-1}} \; \sum_{i=1}^{n}\left(y_i - b_0 - \sum_{j=1}^{k-1} b_j x_{i,j}\right)^2 + \lambda \sum_{j=1}^{k-1}\left[\alpha\, b_j^2 + (1-\alpha)\,|b_j|\right]
$$

Specifically, when $\lambda=0$, the second term drops out and we have classic OLS regression, in which we minimize the first term, the Sum of Squared Residuals (SSR). In other words, OLS minimizes the sum of squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i = b_0 + \sum_{j=1}^{k-1} b_j x_{i,j}$ by adjusting the parameters $b_j$ for $j \in \{0,...,k-1\}$. Note that $b_0$ is an intercept term.
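As a point of reference (this is a standard result rather than something from Varian's paper), when $X^T X$ is invertible the OLS problem has the familiar closed-form solution

$$
\hat{b}_{OLS} = (X^T X)^{-1} X^T y.
$$

Ridge also admits a closed form, $(X^T X + \lambda I)^{-1} X^T y$, if every coefficient is penalized (or the data are centered so that no intercept is needed), whereas the LASSO generally has no closed form and is solved numerically.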

When $\lambda>0$, the second term penalizes the size of the non-intercept coefficients $b_j$ for $j \in \{1,...,k-1\}$. Effectively, an additional variable is only worth including if it reduces the first term (the SSR) by more than it increases the penalty. When $\alpha=1$ the penalty is quadratic in the coefficients and we have Ridge regression, which shrinks coefficients toward zero but rarely sets them exactly to zero; when $\alpha=0$ the penalty is the absolute value of the coefficients and we have the LASSO, which can drive coefficients exactly to zero and therefore performs variable selection.
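To make the objective concrete, here is a minimal NumPy sketch of it; the function name and the convention that $X$ carries a leading column of ones are my own choices for illustration, not notation from Varian's paper:

```python
import numpy as np

def penalized_ssr(b, X, y, lam=0.0, alpha=1.0):
    """Objective that nests OLS (lam=0), Ridge (alpha=1), and LASSO (alpha=0).

    Assumes X includes a leading column of ones, so b[0] is the intercept
    and is excluded from the penalty.
    """
    resid = y - X @ b                      # y_i minus the predicted values
    ssr = np.sum(resid ** 2)               # first term: sum of squared residuals
    penalty = lam * np.sum(alpha * b[1:] ** 2              # quadratic (Ridge) part
                           + (1 - alpha) * np.abs(b[1:]))  # absolute-value (LASSO) part
    return ssr + penalty
```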

To demonstrate the selection effect of Ridge and LASSO relative to OLS, I generated a dataset in Python, with n = 100 observations and 9 potential explanatory variables (k = 10), such that the TRUE relationship between ${y}$ and ${X}$ is given by:

${y}={X}{b}+{\epsilon}$, where ${b}=(3,1,2,3,4,5,0,0,0,0)^T$ and $\epsilon \sim N(0,10)$.
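For readers who want to reproduce something similar, here is a sketch of how such data could be generated and the three models fit with scikit-learn. It is not my original script: the seed and penalty strengths are arbitrary, and note that scikit-learn's `alpha` argument plays the role of $\lambda$ above (Ridge and LASSO are separate estimators there, so the blending parameter $\alpha$ from the objective has no direct counterpart).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(7)                        # arbitrary seed

n, k = 100, 10
b_true = np.array([3, 1, 2, 3, 4, 5, 0, 0, 0, 0])     # intercept first, then 9 slopes

X = rng.normal(size=(n, k - 1))                       # 9 explanatory variables
eps = rng.normal(0, np.sqrt(10), size=n)              # N(0, 10), reading 10 as the variance
y = b_true[0] + X @ b_true[1:] + eps

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),   # alpha plays the role of lambda; the value is arbitrary
    "LASSO": Lasso(alpha=1.0),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name:6s}", np.round(np.r_[model.intercept_, model.coef_], 2))
```

Exact estimates will vary with the seed and with the choice of penalty strength.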

The results of the OLS, Ridge, and LASSO regressions are as follows:

In this example, OLS and Ridge have done the best job of approximating the true parameter values, while LASSO has done the best job of weeding out the last four explanatory variables, which are irrelevant to the outcome variable ${y}$.

Works Cited:

Varian, Hal R. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28.2 (2014): 3-28.