class: bottom, left, title-slide

# Machine learning
## Tree-based methods
### Joshua Loftus

---
class: inverse

<style type="text/css">
.remark-slide-content {
    font-size: 1.2rem;
    padding: 1em 4em 1em 4em;
}
</style>

## Regression and classification trees

### More interpretable than linear models?

- Sequence of simple questions about individual predictors
- Growing and pruning

### Strategies for improving "weak" models

- Bagging
- Random forests (similar to "dropout" -- future topic)
- Boosting

---

## Decision trees

### Are you eligible for the COVID-19 vaccine?

- If `Age >= 50` then `yes`, otherwise continue
- If `HighRisk == TRUE` then `yes`, otherwise continue
- If `Job == CareWorker` then `yes`, otherwise `no`

This is (arguably) more interpretable than a linear model with multiple predictors

(This is just an example; find the real vaccination criteria [here](https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/coronavirus-vaccine/))

---

[Penguin data](https://education.rstudio.com/blog/2020/07/palmerpenguins-cran) from Palmer Station, Antarctica

![](https://education.rstudio.com/blog/2020/07/palmerpenguins-cran/penguin-expedition.jpg)

---

### Measuring our large adult penguins


```r
library(tidyverse)
library(palmerpenguins)
pg <- penguins %>% drop_na()
```

<img src="08-2-trees_files/figure-html/penguinplot-1.png" width="720" style="display: block; margin: auto;" />

---

### Regression tree to predict penguin massiveness


```r
library(tree)
fit_tree <-
* tree(body_mass_g ~ flipper_length_mm + bill_length_mm,
    control = tree.control(nrow(pg), mindev = 0.007),
    data = pg)
plot(fit_tree, type = "uniform")
text(fit_tree, pretty = 0, cex = 1.7)
```

<img src="08-2-trees_files/figure-html/penguintree-1.png" width="720" style="display: block; margin: auto;" />

---

#### Partial dependence plots with `plotmo`


```r
library(plotmo)
vars <- c("bill_length_mm", "flipper_length_mm")
plotmo(fit_tree, trace = -1, degree1 = NULL, degree2 = vars)
```

<img src="08-2-trees_files/figure-html/plotmotree-1.png" width="720" style="display: block; margin: auto;" />

---

### Recursive rectangular splitting on predictors

"Stratification of the feature space"

```
Input: subset of data
For each predictor variable x_j in subset
  For each cutoff value of x_j
    Split left: observations with x_j < cutoff
    Split right: observations with x_j >= cutoff
    Predict constants in each split
    Compute model improvement
  Keep the cutoff with the best improvement for x_j
Output: predictor and cutoff with best improvement overall
```

--

Starting from the full dataset, compute the first split as above

**Recurse**: take the two subsets of data from each side of the split and plug them both back into the same function

Until some **stopping rule** prevents more splitting

(a code sketch of this split search appears a few slides ahead)

---

### Regression tree predictions

<img src="08-2-trees_files/figure-html/penguinctreeplot-1.png" width="720" style="display: block; margin: auto;" />

---

### Tree diagram again for comparison

<img src="08-2-trees_files/figure-html/penguintreediagramagain-1.png" width="1008" style="display: block; margin: auto;" />

---

### Categorical predictors


```r
fit_tree <- tree(body_mass_g ~ ., data = pg)
plot(fit_tree, type = "uniform")
text(fit_tree, pretty = 0, cex = 1.7)
```

<img src="08-2-trees_files/figure-html/penguinctree-1.png" width="648" style="display: block; margin: auto;" />

Split using `levels`, e.g. the species Adelie, Chinstrap, Gentoo
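---

### Greedy split search, sketched in R

Not how the `tree` package is actually implemented -- just a minimal sketch of the split search from the pseudocode a few slides back, assuming one numeric predictor, a numeric outcome, and RSS as the fit measure. The function name `best_split` is ours, for illustration only.


```r
# Scan all cutoffs for one predictor; return the cutoff minimizing total RSS
best_split <- function(x, y) {
  cutoffs <- sort(unique(x))[-1] # every such cutoff keeps both sides non-empty
  rss <- sapply(cutoffs, function(cut) {
    left <- y[x < cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutoff = cutoffs[which.min(rss)], rss = min(rss))
}
best_split(pg$flipper_length_mm, pg$body_mass_g)
```

Growing a tree means running this for every predictor, splitting on the best one, then recursing on the two resulting subsets until a stopping rule is met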
---

### Stopping rules


```r
fit_tree <- tree(body_mass_g ~ .,
* control = tree.control(nrow(pg), mindev = 0.001),
  data = pg)
```

<img src="08-2-trees_files/figure-html/penguinbigtree-1.png" width="1008" style="display: block; margin: auto;" />

Interpretable?... (see `?tree.control` for options)

---

## Complexity and overfitting

We could keep recursively splitting the predictor space until each bin contains only one unique set of predictor values

This would be like 1-nearest neighbors

**Lab exercise**: create a plot of training error versus tree size


```r
fit_tree <- tree(body_mass_g ~ .,
* control = tree.control(nrow(pg), mindev = 0.000001),
  data = pg)
summary(fit_tree)$size # number of "leaf" endpoints
```

```
## [1] 53
```

---

## Growing and pruning

#### Problem: greedy splitting

Each split uses the best possible predictor, similar to forward stepwise selection. Early stopping may prevent the model from finding useful but weaker predictors later on

**Solution**: don't use early stopping. Grow a large tree

#### Problem: overfitting

Larger trees are more complex, more difficult to interpret, and may overfit the training data

**Solution**: (cost complexity / weakest link) pruning

---

### How to prune a tree

After growing a large tree, find the "best" sub-tree

#### Problem: too many sub-trees

The number of sub-trees grows combinatorially in the number of splits (it depends on the depth as well -- an interesting counting problem)

**Solution**: consider only a one-dimensional path of sub-tree models, the ones that minimize

`$$RSS(\text{sub-tree}) + \alpha |\text{sub-tree}|$$`

for `\(\alpha \geq 0\)`, where `\(|\text{sub-tree}|\)` is its number of leaves. Now we can choose `\(\alpha\)`, and therefore a specific sub-tree, using validation (a code sketch using `cv.tree` appears at the end of these slides)

---

## Classification trees

If the outcome is categorical we need to modify the splitting algorithm

- When making a split, classify all observations in each leaf with the same class (modal category rather than mean numeric prediction)
- Can't measure improvement in fit by reduction in RSS; instead, use the reduction of some measure related to classification error

Software generally uses the **Gini index** by default. In a leaf, with `\(\hat p_k\)` the proportion of observations in class `\(k\)`:

`$$\sum_{k=1}^K \hat p_k(1-\hat p_k)$$`

(a worked example appears at the end of these slides)

---

## Trees, forests, and other models

- A model using a single tree is very simple: high interpretability, but likely low prediction accuracy
- For proper *machine learning* we'll combine many trees into one model (next topic)
- When should we use these tree methods?
  - High complexity, so usually want `\(n > p\)`
  - If the "true" relationships are linear/smooth, tree methods may fit poorly compared to linear/smooth methods
  - Trees handle categorical predictors and missing values more easily (can treat missingness as a category)

---

### Tree-based fit vs smooth fit

<img src="08-2-trees_files/figure-html/smoothvstree-1.png" width="1008" style="display: block; margin: auto;" />
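---

### Pruning in code

A sketch of the grow-then-prune recipe using `cv.tree` and `prune.tree` from the `tree` package, assuming the `pg` data from earlier; the object names (`big_tree`, `best_size`, etc.) are ours. Choosing the sub-tree size by cross-validation corresponds to choosing `\(\alpha\)`


```r
# Grow a large tree, then let 10-fold CV pick the sub-tree size
big_tree <- tree(body_mass_g ~ .,
  control = tree.control(nrow(pg), mindev = 0.001),
  data = pg)
cv_results <- cv.tree(big_tree, K = 10) # CV deviance along the alpha path
best_size <- cv_results$size[which.min(cv_results$dev)]
pruned_tree <- prune.tree(big_tree, best = best_size)
plot(pruned_tree, type = "uniform")
text(pruned_tree, pretty = 0)
```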
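---

### Computing the Gini index

A hand computation of the Gini formula from the classification trees slide, for the class labels in a single leaf; the function `gini` below is ours, written for illustration (not from any package)


```r
# Gini index of a leaf: sum over classes of p_k * (1 - p_k)
gini <- function(labels) {
  p_hat <- table(labels) / length(labels) # proportion of each class
  sum(p_hat * (1 - p_hat))
}
gini(c("Adelie", "Adelie", "Gentoo")) # impure leaf: 2/3*1/3 + 1/3*2/3 = 4/9
gini(rep("Chinstrap", 5))             # pure leaf: 0
```

A pure leaf has Gini index 0, so splits that reduce the total Gini index make leaves more homogeneous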