class: top, left, title-slide

.title[
# Machine learning
]
.subtitle[
## Regularization
]
.author[
### Joshua Loftus
]

---
class: inverse

<style type="text/css">
.remark-slide-content {
  font-size: 1.2rem;
  padding: 1em 4em 1em 4em;
}
</style>

# Benefits of shrinkage/bias

### In "high dimensions" (p > 2)

Shrinking estimates/models toward a pre-specified point

# (Adaptive) regularization

### Tuning `\(\lambda\)` with cross-validation

Splitting data into training and testing subsets

---
class: inverse, center, middle

# Bias *can* be good, actually

### Especially in higher dimensions

---

### Stein "paradox" and bias

Estimating `\(\mu \in \mathbb R^p\)` from an i.i.d. sample `\(\mathbf Y_1, \ldots, \mathbf Y_n \sim N(\mathbf{\mu}, \sigma^2 I)\)`

- The MLE is `\(\mathbf{\bar Y}\)` (obvious and best, right?)

--

- [Charles Stein](https://news.stanford.edu/2016/12/01/charles-m-stein-extraordinary-statistician-anti-war-activist-dies-96/) discovered in the 1950s that the MLE is *[inadmissible](https://en.wikipedia.org/wiki/Admissible_decision_rule)* if `\(p > 2\)` 🤯

- The [James-Stein estimator](https://en.wikipedia.org/wiki/Stein%27s_example) **shrinks** `\(\mathbf{\bar Y}\)` toward some other point, any other point, chosen *a priori*, e.g. 0

`$$\text{MSE}(\mathbf{\hat \mu}_{\text{JS}}) < \text{MSE}(\mathbf{\bar Y}) \text{ for all } \mathbf \mu, \text{ if } p > 2$$`

`$$\mathbf{\hat \mu}_{\text{JS}} = \left(1 - \frac{(p-2)\sigma^2/n}{\|\mathbf{\bar Y}\|^2} \right) \mathbf{\bar Y}$$`

---

### Shrinkage: less variance, more bias

.pull-left[
<img src="06-1-regularization_files/figure-html/unnamed-chunk-1-1.png" width="504" />

Solid points are improved by shrinking, hollow red points do worse
]
.pull-right[
If `\(\bar Y\)` is between `\(\mu\)` and 0 then shrinking does worse

In higher dimensions, a greater portion of space is *not* between `\(\mu\)` and 0

e.g. there are `\(2^p\)` orthants in `\(p\)`-dimensional space, and only 1 contains the segment from 0 to `\(\mu\)`

(*Not meant to be a [proof](https://statweb.stanford.edu/~candes/teaching/stats300c/Lectures/Lecture18.pdf)*)
]

---

## Historical significance

Statisticians (particularly frequentists) emphasized unbiasedness

But after Stein's example, we must admit bias is not always bad

Opens the door to many interesting methods

Most (almost all?) ML methods use bias this way

(Even if some famous CS profs say otherwise on Twitter 🤨)

---

### Regularized (i.e. penalized) regression

Motivation: if the JS estimator can do better than the MLE at estimating a sample mean, does a similar thing happen when estimating regression coefficients?

--

For some penalty function `\(\mathcal P_\lambda\)`, which depends on a tuning parameter `\(\lambda\)`, the estimator

`$$\hat \beta_\lambda = \arg \min_\beta \| \mathbf y - \mathbf X \beta \|^2_2 + \mathcal P_\lambda(\beta)$$`

is "regularized," i.e. shrunk toward values that decrease the penalty. Often `\(\mathcal P_\lambda = \lambda \| \cdot \|\)` for some norm

--

Many ML methods optimize "loss + penalty"

---

### Ridge (i.e. L2 penalized) regression

- Originally motivated by problems where `\(\mathbf X^T \mathbf X\)` is not invertible (or badly conditioned, i.e. almost non-invertible)
- If `\(p > n\)` then this always happens
- The least squares estimator is then undefined or numerically unstable

For some constant `\(\lambda > 0\)`,

`$$\text{minimize } \| \mathbf y - \mathbf X \beta \|^2_2 + \lambda \| \beta \|^2$$`

**Shrinks** coefficients `\(\hat \beta\)` toward 0

Larger coefficients are penalized more (squared penalty)
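---

### Closed-form ridge solution (sketch)

For any `\(\lambda > 0\)` the ridge objective has the unique minimizer `\((\mathbf X^T \mathbf X + \lambda I)^{-1} \mathbf X^T \mathbf y\)`, which exists even when `\(p > n\)`. A minimal base-R sketch (the helper name and simulated data are purely illustrative; later slides use `glmnet`, which also standardizes predictors and fits an intercept):

```r
# Illustrative helper: ridge coefficients from the closed-form solution
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
}

set.seed(1)
n <- 100; p <- 200                        # p > n: least squares is undefined
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
y <- drop(X %*% beta) + rnorm(n)

beta_hat <- ridge_coef(X, y, lambda = 1)  # exists despite p > n
```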
---

### High-dimensional simulation

Parameters in covariate space (rather than outcome space)

1. Simulate a high-dimensional linear model
$$\mathbf y = \mathbf X \beta + \varepsilon, \text{ for } \varepsilon \sim N(0, \sigma^2 I)$$
2. Fit **ridge regression** on a grid of `\(\lambda\)` values
3. Iterate over multiple realizations of `\(\varepsilon\)`
4. Plot the MSE of the estimated coefficients as a function of `\(\lambda\)`, with one line for each iterate
$$\text{MSE}(\hat \beta_\text{ridge}(\lambda))$$

Simulation is "cheating" -- we can only compute MSE because we know the true `\(\beta\)`
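---

#### A ridge MSE simulation (sketch)

The `high_dim_MSE_MC()` helper used on the following slides is not defined on these slides; below is a rough, hypothetical sketch of a simulation in the same spirit (the function name and all internals here are guesses, and the real helper may differ):

```r
library(glmnet)

ridge_mse_paths <- function(n = 100, p = 100, instances = 20,
                            lambda = 10^seq(3, -2, length.out = 50)) {
  beta <- rnorm(p)                            # fixed "true" coefficients
  X <- matrix(rnorm(n * p), n, p)             # fixed design
  sapply(seq_len(instances), function(i) {
    y <- drop(X %*% beta) + rnorm(n)          # new noise realization
    fit <- glmnet(X, y, alpha = 0, lambda = lambda)
    colMeans((as.matrix(fit$beta) - beta)^2)  # coefficient MSE at each lambda
  })
}

mse <- ridge_mse_paths(n = 100, p = 100, instances = 20)
matplot(mse, type = "l", lty = 1,
        xlab = "lambda index (decreasing)", ylab = "MSE of coefficients")
```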
---

#### MSE(ridge) lower-dimensional

```r
high_dim_MSE_MC(n = 100, p = 10, instances = 20)
```

<img src="06-1-regularization_files/figure-html/ridgen100p10-1.png" width="66%" />

---

#### MSE(ridge) higher-dimensional

```r
high_dim_MSE_MC(n = 100, p = 100, instances = 50)
```

<img src="06-1-regularization_files/figure-html/ridgen100p100_1-1.png" width="66%" />

---

#### MSE(ridge) higher-dimensional

```r
high_dim_MSE_MC(n = 100, p = 100, instances = 50)
```

<img src="06-1-regularization_files/figure-html/ridgen100p100_2-1.png" width="66%" />

---

#### MSE(ridge) higher-dimensional

```r
high_dim_MSE_MC(n = 100, p = 100, instances = 100)
```

<img src="06-1-regularization_files/figure-html/ridgen100p100_3-1.png" width="66%" />

---

#### MSE(ridge) higher-dimensional

```r
high_dim_MSE_MC(n = 100, p = 100, instances = 100)
```

<img src="06-1-regularization_files/figure-html/ridgen100p100_4-1.png" width="66%" />

---

#### MSE(ridge) higher-dimensional

```r
high_dim_MSE_MC(n = 100, p = 150, instances = 20)
```

<img src="06-1-regularization_files/figure-html/ridgen100p150-1.png" width="66%" />

---

#### MSE(ridge) much higher-dimensional

```r
high_dim_MSE_MC(n = 100, p = 200, instances = 20)
```

<img src="06-1-regularization_files/figure-html/ridgen100p200_1-1.png" width="66%" />

---

#### MSE(ridge) much higher-dimensional

```r
high_dim_MSE_MC(n = 100, p = 200, instances = 20)
```

<img src="06-1-regularization_files/figure-html/ridgen100p200_2-1.png" width="66%" />

---

### Lessons about bias

#### Bias can help

- Even in such a basic problem as **estimating the multivariate normal mean** (JS)
- More important in higher dimensions

--

#### But it depends! Task-specific

- What's the scientific question?
- Which estimator(s) are we evaluating?
- How will the estimator / ML pipeline be used? For what?

e.g. If `\(\hat \sigma\)` underestimates `\(\sigma\)` it may have lower MSE, but do we care about estimating `\(\sigma\)`? Or do we care about C.I. coverage?

---
class: inverse, middle, center

# Regularization can help
# us to avoid overfitting

### But we have to choose the *amount* of regularization

e.g. in norm-penalized regression, choose `\(\lambda\)`

Maybe use a formula for `\(\text{df}(\hat \beta_\lambda)\)`

Is there another way? What if we don't have a formula?

---
class: inverse, center, middle

# Validation

## Estimate test error directly

### using "**validation data**" / "**test data**"

--

#### i.e. a new set of data, "unseen" by `\(\hat f\)`

Indep. samples `\(D = \{ (\mathbf x_i, y_i) \}_{i=1}^n\)` and `\(D' = \{ (\mathbf x_i', y_i') \}_{i=1}^{n'}\)`

Estimate `\(\hat f\)` on `\(D\)`, evaluate `\(\hat f\)` on `\(D'\)`

---

## Motives

- Debiasing the risk estimate: since `\(\hat f\)` does not depend on `\(D'\)`, it is not **overfit to the variability** in `\(D'\)`
- If `\(\hat f\)` is overfit to `\(D\)` then its test error on `\(D'\)` will be large (complexity too high, variability too high)

--

- Actual practice: analogous to "deploying an ML model in production"
- Philosophy of science: use novelty, actual prediction (not accommodation)
- Tukey: [Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) vs Confirmatory
- Use test error to choose **model complexity** / **amount of regularization**

---

# Choosing model complexity

## Using test/validation data

Indep. samples `\(D = \{ (\mathbf x_i, y_i) \}_{i=1}^n\)` and `\(D' = \{ (\mathbf x_i', y_i') \}_{i=1}^{n'}\)`

- Estimate `\(\hat f_\lambda\)` on `\(D\)` for a "path" or grid of `\(\lambda\)` values
- Evaluate `\(\hat f_\lambda\)` on `\(D'\)` and choose `\(\hat \lambda\)` accordingly (e.g. with minimum loss)
- Refit `\(\hat f_{\hat \lambda}\)` on the full data `\(D \cup D'\)`; this is our final model

*Common when the computational cost of fitting one model is high*

---

## Cross-validation

*When the computational cost of fitting one model is not too high*

**Idea**: swap `\(D\)` and `\(D'\)` in the previous process and get two estimates, `\(\hat R(\hat f_\lambda)\)` and `\(\hat R(\hat f_\lambda')\)`

Average these and choose `\(\hat \lambda\)` using the average (e.g. its minimizer)

--

**Idea**: apply the same process with multiple independent "folds" of data

#### `\(K\)`-fold cross-validation

Each subset is used once as a test set, and `\(K-1\)` times for training

Minimize `\(\hat R_{K\text{-cv}}(\lambda) = \frac{1}{K} \sum_{k=1}^K \hat R_k(\hat f^{(k)}_\lambda)\)`

---

## Cross-validation cartoon

![](cv.wikipedia.svg.png)

Gives `\(K\)` estimates of test error (risk) at each `\(\lambda\)`

Credit: [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation)

---

## `\(K\)`-fold cross-validation

Each subset is used once as a test set, and `\(K-1\)` times for training

Choose `\(\hat \lambda\)` to minimize

`$$\hat R_{K\text{-cv}}(\lambda) = \frac{1}{K} \sum_{k=1}^K \hat R_k(\hat f^{(k)}_\lambda)$$`

where `\(\hat f^{(k)}_\lambda\)` is fit on the dataset that "holds out" the `\(k\)`th fold

Then refit the model `\(\hat f_{\hat \lambda}\)` at that value of `\(\hat \lambda\)` on the entire dataset
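---

### `\(K\)`-fold cross-validation by hand (sketch)

A minimal sketch of the recipe above, assuming the same `X` and `Y` objects used on the next slide; the fold assignment and `\(\lambda\)` grid here are illustrative, and in practice `cv.glmnet()` does all of this for us:

```r
library(glmnet)

K <- 5
folds <- sample(rep(1:K, length.out = nrow(X)))    # random fold labels
lambda <- 10^seq(3, -2, length.out = 50)           # decreasing grid

# One column per fold, one row per value of lambda
cv_err <- sapply(1:K, function(k) {
  fit_k <- glmnet(X[folds != k, ], Y[folds != k], alpha = 0, lambda = lambda)
  pred  <- predict(fit_k, newx = X[folds == k, ])  # held-out predictions
  colMeans((Y[folds == k] - pred)^2)               # held-out MSE at each lambda
})

R_cv <- rowMeans(cv_err)               # average over the K folds
lambda_hat <- lambda[which.min(R_cv)]  # minimizer of the CV risk estimate
```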
---

## plot(cv.glmnet) and plot(glmnet)

.pull-left[

```r
cv_ridge <- cv.glmnet(X, Y, alpha = 0)
plot(cv_ridge)
```

<img src="06-1-regularization_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]
.pull-right[

```r
ridge_fit <- glmnet(X, Y, alpha = 0)
plot_glmnet(ridge_fit, s = cv_ridge$lambda.1se)
```

<img src="06-1-regularization_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]

---

### Lessons about cross-validation

- Think of it as a way to **choose model complexity**
- **Beware** common cross-validation errors!

From Duda and Hart, quoted in [MLstory](https://mlstory.org/data.html):

> ... the same data were often used for designing and testing the classifier. This mistake is frequently referred to as "testing on the training data." A related but less obvious problem arises *when a classifier undergoes a long series of refinements guided by the results of repeated testing on the same data. This form of "**training on the testing data**" often escapes attention until new test samples are obtained*.

---

### Lessons about cross-validation

- **Beware** common cross-validation errors!

From ESL:

> Ideally, the test set should be kept in a "vault," and be brought out only at the end of the data analysis. *Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error*, sometimes substantially.

- Cross-validate the entire model-building pipeline (not just one step), and only do it once -- or at *least* not many times
- Choosing `\(K\)`: larger `\(K\)` `\(\to\)` `\(\hat R_{K\text{-cv}}\)` has lower bias but more variance. Often `\(K = 5\)` or `\(10\)` is used

---
class: inverse

### Bias can be good or bad

- Good for estimation in high dimensions (lower MSE)
- Bad for inference (intervals/hypothesis tests)
- Uncertainty principle: better estimate *or* better inference

### Regularization

- Fancy-sounding word for "simplification": simpler models
- Increases bias to reduce variance

### Cross-validation

- Fit and evaluate models on different subsets of the data
- Choose the amount of regularization/complexity
- Re-using data *more than once* `\(\to\)` overfitting again
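---

### Recap sketch: tuning `\(\lambda\)` with a validation set

A minimal sketch of the earlier validation-set recipe (estimate on `\(D\)`, evaluate on `\(D'\)`, refit at `\(\hat \lambda\)` on `\(D \cup D'\)`); the simulated data and object names here are purely illustrative:

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rowSums(X[, 1:5]) + rnorm(n)                # sparse "true" signal

train <- sample(n, n / 2)                        # split into D and D'

fit <- glmnet(X[train, ], y[train], alpha = 0)   # ridge path on D only

pred <- predict(fit, newx = X[-train, ])         # n' x nlambda predictions
val_mse <- colMeans((y[-train] - pred)^2)        # validation error per lambda
lambda_hat <- fit$lambda[which.min(val_mse)]

final_fit <- glmnet(X, y, alpha = 0)             # refit the path on D union D'
beta_hat  <- coef(final_fit, s = lambda_hat)     # coefficients at lambda_hat
```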