class: bottom, left, title-slide

# Machine learning
## Tree-based methods
### Joshua Loftus

---
class: inverse

<style type="text/css">
.remark-slide-content {
    font-size: 1.2rem;
    padding: 1em 4em 1em 4em;
}
</style>

## Regression and classification trees

### More interpretable than linear models?

- Sequence of simple questions about individual predictors
- Growing and pruning

### Strategies for improving "weak" models

- Bagging
- Random forests (similar to "dropout" -- future topic)
- Boosting

---

## Decision trees

### Are you eligible for the COVID-19 vaccine?

- If `Age >= 50` then `yes`, otherwise continue
- If `HighRisk == TRUE` then `yes`, otherwise continue
- If `Job == CareWorker` then `yes`, otherwise `no`

This is (arguably) more interpretable than a linear model with multiple predictors

(This is just an example; find the real vaccination criteria [here](https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/coronavirus-vaccine/))

---

[Penguin data](https://education.rstudio.com/blog/2020/07/palmerpenguins-cran) from Palmer Station, Antarctica

![](https://education.rstudio.com/blog/2020/07/palmerpenguins-cran/penguin-expedition.jpg)

---

### Measuring our large adult penguins


```r
library(tidyverse)
library(palmerpenguins)
pg <- penguins %>% drop_na()
```

<img src="08-2-trees_files/figure-html/penguinplot-1.png" width="720" style="display: block; margin: auto;" />

---

### Regression tree to predict penguin massiveness


```r
library(tree)
fit_tree <-
* tree(body_mass_g ~ flipper_length_mm + bill_length_mm,
    control = tree.control(nrow(pg), mindev = 0.007),
    data = pg)
plot(fit_tree, type = "uniform")
text(fit_tree, pretty = 0, cex = 1.7)
```

<img src="08-2-trees_files/figure-html/penguintree-1.png" width="720" style="display: block; margin: auto;" />

---

#### Partial dependence plots with `plotmo`


```r
library(plotmo)
vars <- c("bill_length_mm", "flipper_length_mm")
plotmo(fit_tree, trace = -1, degree1 = NULL, degree2 = vars)
```

<img src="08-2-trees_files/figure-html/plotmotree-1.png" width="720" style="display: block; margin: auto;" />

---

### Recursive rectangular splitting on predictors

"Stratification of the feature space"

```
Input: subset of data
For each predictor variable x_j in subset
  For each cutoff value of x_j
    Split left: observations with x_j < cutoff
    Split right: observations with x_j >= cutoff
    Predict constants in each split
    Compute model improvement
  Keep the cutoff with the best improvement for x_j
Output: predictor and cutoff with best improvement overall
```

--

Starting from the full dataset, compute the first split as above

**Recurse**: take the two subsets of data from each side of the split and plug them both back into the same function

Until some **stopping rule** prevents more splitting

(a code sketch of this split search appears a few slides ahead)

---

### Regression tree predictions

<img src="08-2-trees_files/figure-html/penguinctreeplot-1.png" width="720" style="display: block; margin: auto;" />

---

### Tree diagram again for comparison

<img src="08-2-trees_files/figure-html/penguintreediagramagain-1.png" width="1008" style="display: block; margin: auto;" />

---

### Categorical predictors


```r
fit_tree <- tree(body_mass_g ~ ., data = pg)
plot(fit_tree, type = "uniform")
text(fit_tree, pretty = 0, cex = 1.7)
```

<img src="08-2-trees_files/figure-html/penguinctree-1.png" width="648" style="display: block; margin: auto;" />

Split using `levels`, e.g. the species Adelie, Chinstrap, Gentoo
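---

### Greedy split search, sketched in R

Not how the `tree` package is actually implemented -- just a minimal sketch of the split search from the pseudocode a few slides back, assuming one numeric predictor, a numeric outcome, and RSS as the fit measure. The function name `best_split` is ours, for illustration only.


```r
# Scan all cutoffs for one predictor; return the cutoff minimizing total RSS
best_split <- function(x, y) {
  cutoffs <- sort(unique(x))[-1] # every such cutoff keeps both sides non-empty
  rss <- sapply(cutoffs, function(cut) {
    left <- y[x < cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutoff = cutoffs[which.min(rss)], rss = min(rss))
}
best_split(pg$flipper_length_mm, pg$body_mass_g)
```

Growing a tree means running this for every predictor, splitting on the best one, then recursing on the two resulting subsets until a stopping rule is met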
---

### Stopping rules


```r
fit_tree <- tree(body_mass_g ~ .,
* control = tree.control(nrow(pg), mindev = 0.001),
  data = pg)
```

<img src="08-2-trees_files/figure-html/penguinbigtree-1.png" width="1008" style="display: block; margin: auto;" />

Interpretable?... (see `?tree.control` for options)

---

## Complexity and overfitting

We could keep recursively splitting the predictor space until each bin contains only one unique set of predictor values

This would be like 1-nearest neighbors

**Lab exercise**: create a plot of training error versus tree size


```r
fit_tree <- tree(body_mass_g ~ .,
* control = tree.control(nrow(pg), mindev = 0.000001),
  data = pg)
summary(fit_tree)$size # number of "leaf" endpoints
```

```
## [1] 53
```

---

## Growing and pruning

#### Problem: greedy splitting

Each split uses the best possible predictor, similar to forward stepwise selection. Early stopping may prevent the model from finding useful but weaker predictors later on

**Solution**: don't use early stopping. Grow a large tree

#### Problem: overfitting

Larger trees are more complex, more difficult to interpret, and may overfit the training data

**Solution**: (cost complexity / weakest link) pruning

---

### How to prune a tree

After growing a large tree, find the "best" sub-tree

#### Problem: too many sub-trees

The number of sub-trees grows combinatorially in the number of splits (it depends on the depth as well -- an interesting counting problem)

**Solution**: consider only a one-dimensional path of sub-tree models, the ones that minimize

`$$RSS(\text{sub-tree}) + \alpha |\text{sub-tree}|$$`

for `\(\alpha \geq 0\)`, where `\(|\text{sub-tree}|\)` is its number of leaves. Now we can choose `\(\alpha\)`, and therefore a specific sub-tree, using validation (a code sketch using `cv.tree` appears at the end of these slides)

---

## Classification trees

If the outcome is categorical we need to modify the splitting algorithm

- When making a split, classify all observations in each leaf with the same class (modal category rather than mean numeric prediction)
- Can't measure improvement in fit by reduction in RSS; instead, use the reduction of some measure related to classification error

Software generally uses the **Gini index** by default. In a leaf, with `\(\hat p_k\)` the proportion of observations in class `\(k\)`:

`$$\sum_{k=1}^K \hat p_k(1-\hat p_k)$$`

(a worked example appears at the end of these slides)

---

## Trees, forests, and other models

- A model using a single tree is very simple: high interpretability, but likely low prediction accuracy
- For proper *machine learning* we'll combine many trees into one model (next topic)
- When should we use these tree methods?
  - High complexity, so usually want `\(n > p\)`
  - If the "true" relationships are linear/smooth, tree methods may fit poorly compared to linear/smooth methods
  - Trees handle categorical predictors and missing values more easily (can treat missingness as a category)

---

### Tree-based fit vs smooth fit

<img src="08-2-trees_files/figure-html/smoothvstree-1.png" width="1008" style="display: block; margin: auto;" />
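---

### Pruning in code

A sketch of the grow-then-prune recipe using `cv.tree` and `prune.tree` from the `tree` package, assuming the `pg` data from earlier; the object names (`big_tree`, `best_size`, etc.) are ours. Choosing the sub-tree size by cross-validation corresponds to choosing `\(\alpha\)`


```r
# Grow a large tree, then let 10-fold CV pick the sub-tree size
big_tree <- tree(body_mass_g ~ .,
  control = tree.control(nrow(pg), mindev = 0.001),
  data = pg)
cv_results <- cv.tree(big_tree, K = 10) # CV deviance along the alpha path
best_size <- cv_results$size[which.min(cv_results$dev)]
pruned_tree <- prune.tree(big_tree, best = best_size)
plot(pruned_tree, type = "uniform")
text(pruned_tree, pretty = 0)
```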
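---

### Computing the Gini index

A hand computation of the Gini formula from the classification trees slide, for the class labels in a single leaf; the function `gini` below is ours, written for illustration (not from any package)


```r
# Gini index of a leaf: sum over classes of p_k * (1 - p_k)
gini <- function(labels) {
  p_hat <- table(labels) / length(labels) # proportion of each class
  sum(p_hat * (1 - p_hat))
}
gini(c("Adelie", "Adelie", "Gentoo")) # impure leaf: 2/3*1/3 + 1/3*2/3 = 4/9
gini(rep("Chinstrap", 5))             # pure leaf: 0
```

A pure leaf has Gini index 0, so splits that reduce the total Gini index make leaves more homogeneous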