class: top, left, title-slide

.title[
# Machine learning
]

.subtitle[
## Classification and logistic regression
]

.author[
### Joshua Loftus
]

---
class: inverse

<style type="text/css">
.remark-slide-content {
  font-size: 1.2rem;
  padding: 1em 4em 1em 4em;
}
</style>

## Classification

- **Supervised learning** with categorical/qualitative outcomes (in contrast to regression, with numeric outcomes)

- Often called "labels", `\(K\)` = number of unique classes

- Binary: positive/negative or 0/1 or yes/no or success/fail, etc.

*Label names are not mathematically important*, e.g. use `\(1, ..., K\)`

--

- Limitations: the labels are already defined (not learned from the data -- that would be unsupervised learning), and `\(K\)` is fixed

- Plots: often use color/point shape for categorical variables

---

## Interpretable classification

### Logistic regression

`$$\mathbb E(Y| \mathbf X = \mathbf x) = g^{-1}(\mathbf x^T \beta)$$`

`$$g(p) = \log{\left(\frac{p}{1-p}\right)}$$`

--

### Generalized linear models (GLMs)

- [Various](https://en.wikipedia.org/wiki/Generalized_linear_model#General_linear_models) "link" functions `\(g\)`
- Linear regression is a special case with `\(g = \text{id}\)`
- Logistic in `R`: `glm(..., family = binomial())`
- Others: Poisson, multinomial, ..., see `?family` in `R`

---

### One predictor, "S curve"

<img src="04-1-classification_files/figure-html/logit-1dm-1.png" width="540px" style="display: block; margin: auto;" />

---

### Classifications/decisions: threshold probability

<img src="04-1-classification_files/figure-html/logit-1d-class-1.png" width="540px" style="display: block; margin: auto;" />

---

### Without giving `\(y\)` a spatial dimension

<img src="04-1-classification_files/figure-html/logit-0d-class-1.png" width="540px" style="display: block; margin: auto;" />

---

### Two predictors, binary outcome

<img src="04-1-classification_files/figure-html/logit-data-plot-1.png" width="540px" style="display: block; margin: auto;" />

---

### Contours of GLM-predicted class probabilities

<img src="04-1-classification_files/figure-html/logit-contour-1.png" width="540px" style="display: block; margin: auto;" />

---
class: middle, center

**Classification boundaries** with

## `\(p = 3\)` predictors

### Boundary = plane

## `\(p > 3\)` predictors

### Boundary = hyperplane

(In practice, "high-dimensional" = can't easily plot it)

---

### Interpretation: coefficients


```r
model_fit <- glm(y ~ x1 + x2, family = "binomial", data = train)
broom::tidy(model_fit)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    -2.24     0.656     -3.42 0.000635
## 2 x1              2.17     0.584      3.72 0.000198
## 3 x2              1.53     0.499      3.07 0.00215
```

Coefficient scale: log-odds? Exponentiate `\(\to\)` odds


```r
broom::tidy(model_fit, exponentiate = TRUE)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    0.106     0.656     -3.42 0.000635
## 2 x1             8.78      0.584      3.72 0.000198
## 3 x2             4.62      0.499      3.07 0.00215
```

---

### Interpretation: inference and diagnostics

- MLEs `\(\to\)` asymptotic normality for intervals/tests

`summary()`, `coef()`, `confint()`, `anova()`, etc. in `R`

- "Deviance" instead of RSS


```r
broom::glance(model_fit)
```

```
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          110.      79  -25.8  57.5  64.7     51.5          77    80
```

- Because `\(y\)` is 0 or 1, residual plots show patterns and are not as easy to interpret geometrically

---

### Challenges

#### Separable case (guaranteed if `\(p > n\)`)

If the classes can be perfectly separated, the MLE is undefined and the fitting algorithm diverges, with the coordinates of `\(\hat \beta \to \pm \infty\)`

Awkwardly, classification is *too easy*(!?) for this probabilistic approach

--

#### Curse of dimensionality

Biased MLE and wrong variance/asymptotic distribution if `\(n/p \to \text{const}\)`, even if the constant is `\(> 1\)`

.footnote[See [Sur and Candès (2019)](https://www.pnas.org/content/116/29/14516)]

---
class: inverse

### Classification summary

- Numeric prediction `\(\to\)` classification

`$$\hat y = \mathbb I(\hat p > c) = \begin{cases} 0 & \text{ if } \hat p \leq c \\ 1 & \text{ if } \hat p > c \end{cases}$$`

The log-odds function is monotonic, so (decision boundaries are hyperplanes)

`$$\hat p > c \leftrightarrow \mathbf x^T \hat \beta > c'$$`

- More classes: transform to binary, predict using the largest `\(\hat p_k\)`

- Non-linear boundaries: transform the predictors, or use methods other than GLMs (we'll learn more soon)

- Some classification methods output categorical classes, not probabilities (or other numeric scores)
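---

### Thresholding in `R`

The threshold rule on the previous slide is one line of code. A minimal sketch, assuming a hypothetical held-out data frame `test` with the same columns as `train`, and taking `\(c = 0.5\)` purely for illustration:


```r
# Sketch: convert fitted probabilities from model_fit into 0/1 classes at c = 0.5
# (`test` is an assumed data frame with the same columns x1, x2, y as `train`)
p_hat <- predict(model_fit, newdata = test, type = "response") # estimated probabilities
y_hat <- as.integer(p_hat > 0.5)                               # hard classifications
table(predicted = y_hat, actual = test$y)                      # confusion matrix
```

Changing `\(c\)` trades off the two kinds of classification error; `\(c = 0.5\)` is not always the right choice.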
---

### Fitting logistic regression

How do we estimate `\(\beta\)`? **Maximum likelihood**:

$$
\text{maximize } L(\beta ; \mathbf y | \mathbf X) = \prod_{i=1}^n L(\beta ; y_i | \mathbf x_i)
$$

(assuming the data is i.i.d.)

Next slide: a bit of mathematics

---

### MLE

$$
L(\beta ; \mathbf y | \mathbf x) = \prod_{i=1}^n \left( \frac{1}{1+e^{-x_i \beta}} \right)^{y_i} \left(1- \frac{1}{1+e^{-x_i \beta}} \right)^{1-y_i}
$$

--

$$
\ell(\beta ; \mathbf y | \mathbf x) = \sum_{i=1}^n y_i \log \left( \frac{1}{1+e^{-x_i \beta}} \right) + (1-y_i) \log \left(1- \frac{1}{1+e^{- x_i \beta}} \right)
$$

--

$$
\frac{\partial}{\partial \beta} \ell(\beta ; \mathbf y | \mathbf x) = \sum_{i=1}^n y_i \left( \frac{x_i e^{-x_i \beta}}{1+e^{-x_i \beta}} \right) + (1-y_i) \left(\frac{-x_i}{1+e^{- x_i \beta}} \right)
$$

`$$= \sum_{i=1}^n x_i \left[ y_i - \left(\frac{1}{1+e^{- x_i \beta}} \right) \right] = \color{skyblue}{\sum_{i=1}^n x_i [y_i - \hat p_i(\beta)]}$$`

Set this equal to 0 and solve for `\(\beta\)` using Newton-Raphson

---

### Newton-Raphson

- Finds the roots of a function
- Iteratively approximates the function by its tangent line
- The root of the tangent line is the starting point for the next approximation
- See the [animation](https://en.wikipedia.org/wiki/Newton%27s_method#/media/File:NewtonIteration_Ani.gif) on [Wikipedia](https://en.wikipedia.org/wiki/Newton%27s_method)

**Exercise**: using the result from the previous slide, compute the second derivative of `\(\ell\)` and derive the expressions needed to apply Newton-Raphson

---

### Logistic regression fitting: multivariate case

Newton-IRLS (equivalent) steps, with a code sketch on the next slide:

$$
`\begin{eqnarray}
\hat{\mathbf p}_t & = & g^{-1}(\mathbf X \hat \beta_t) & \ \text{ update probs.} \\
\mathbf W_t & = & \text{diag}[\hat{\mathbf p}_t (1 - \hat{\mathbf p}_t)] & \ \text{ update weights} \\
\hat{\mathbf{y}}_t & = & g(\hat{\mathbf p}_t) + \mathbf W_t^{-1}(\mathbf y - \hat{\mathbf p}_t) & \ \text{ update response}
\end{eqnarray}`
$$

and then update the parameter estimate (weighted LS sub-problem)

`$$\hat{\beta}_{t+1} = \arg \min_{\beta} (\hat{\mathbf{y}}_t - \mathbf X \beta)^T \mathbf W_t (\hat{\mathbf{y}}_t - \mathbf X \beta)$$`

--

**Note**: larger weights on observations with `\(\hat p\)` closer to 1/2, i.e. the ones most difficult to classify (***look for variations of this theme***)

.footnote[See Section 4.4.1 of [ESL](https://web.stanford.edu/~hastie/ElemStatLearn/)]
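---

### IRLS in `R`: a sketch

A minimal sketch of the steps on the previous slide, assuming a numeric matrix `X` whose first column is the intercept and a 0/1 vector `y`; the function name `irls_logistic` is made up for illustration, and `glm()` uses a more careful version of this iteration.


```r
# Minimal IRLS sketch for logistic regression
# (assumes X is an n x p numeric matrix whose first column is all 1s,
#  y is a 0/1 response vector; no safeguards for the separable case,
#  where the weights underflow to zero)
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (t in 1:max_iter) {
    p_hat <- as.vector(1 / (1 + exp(-X %*% beta)))          # update probabilities
    w     <- p_hat * (1 - p_hat)                            # update weights (diag of W)
    z     <- X %*% beta + (y - p_hat) / w                   # update working response
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))   # weighted LS step
    if (max(abs(beta_new - beta)) < tol) break               # stop when estimates settle
    beta <- beta_new
  }
  drop(beta_new)
}

# For example, this should roughly agree with coef(model_fit) from earlier:
# irls_logistic(cbind(1, train$x1, train$x2), train$y)
```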
---

### Optimization algorithms

Downside of Newton-Raphson: it requires second derivatives, including *inverting the `\(p \times p\)` Hessian matrix* when optimizing over `\(p > 1\)` parameters

If `\(p\)` is large, **second-order** optimization methods like Newton's are very costly

--

**First-order** methods only require computing the `\(p \times 1\)` gradient vector

Recall that the gradient is a vector in the *direction of steepest increase* in the parameter space

---

### Gradient (steepest) descent

i.e. skiing as fast as possible. Notation: let

`$$L(\beta) = L(\mathbf X, \mathbf y, g_\beta) \color{skyblue}{\text{ (loss function)}}$$`

1. Start at an initial point `\(\beta^{(0)}\)`
2. For step `\(n = 1, \ldots\)`
  - Compute `\(\mathbf d_n = \nabla L(\beta^{(n-1)}) \color{skyblue}{\text{ (gradient)}}\)`
  - Update `\(\beta^{(n)} = \beta^{(n-1)} - \gamma_n \mathbf d_n\)`
3. Repeat until some **convergence criterion** is satisfied

--

Here the **step size** `\(\gamma_n > 0\)` is made small enough not to "overshoot" and increase the loss, i.e. the loss only decreases (a code sketch appears at the end of these slides)

---
class: inverse

## Optimization more generally

- Components: objective functions, algorithms, local/global optima, approximate solutions
- Computational cost: speed, storage (time and space)

#### Closed form / analytic solutions

e.g. the OLS formula for `\(\hat \beta\)` (remember?)

#### Iterative algorithms (e.g. Newton-Raphson)

- Rates of convergence
- Might have guarantees, e.g. if the objective is **convex**

---
class: inverse

Machine learning = optimization algorithms applied to data

Understanding optimization is very important!

- Intuition (challenge: dimensionality)
- Mathematical guarantees (challenge: relevance)
- Empirical evaluation (challenge: overfitting...)
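---

### Gradient descent in `R`: a sketch

To make "optimization algorithms applied to data" concrete, a minimal gradient descent sketch for the logistic regression loss (the negative log-likelihood, so the gradient is `\(-\sum_i x_i [y_i - \hat p_i(\beta)]\)` from the MLE slide). The function name `gd_logistic` and the fixed step size `gamma` are illustrative assumptions, and `X`, `y` are as in the IRLS sketch.


```r
# Minimal gradient descent sketch for logistic regression
# (loss = negative log-likelihood; X is an n x p matrix with an intercept
#  column, y is a 0/1 vector; the fixed step size gamma is an assumption --
#  in practice it needs tuning or a line search)
gd_logistic <- function(X, y, gamma = 0.01, max_iter = 10000, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (n in 1:max_iter) {
    p_hat <- as.vector(1 / (1 + exp(-X %*% beta)))  # current probabilities
    d     <- -t(X) %*% (y - p_hat)                  # gradient of the loss
    beta_new <- beta - gamma * d                    # step in the downhill direction
    if (max(abs(beta_new - beta)) < tol) break      # convergence criterion
    beta <- beta_new
  }
  drop(beta)
}
```

Compared with the Newton-IRLS sketch, each step here is much cheaper (no `\(p \times p\)` system to solve) but many more steps are typically needed.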