class: bottom, left, title-slide .title[ # Machine learning ] .subtitle[ ## Interpreting models, changing the world ] .author[ ### Joshua Loftus ] --- class: inverse <style type="text/css"> .remark-slide-content { font-size: 1.2rem; padding: 1em 4em 1em 4em; } </style> ## **Storytelling** with ML models Suppose we tuned a model to have good predictions on test data We often want to interpret/explain the model, not just predict - Why does the model predict a certain outcome for a given value of the predictors? - How does the model depend on a specific variable or set of variables? - Which variables does the model depend on most? - Can we create plots/visuals to answer any of these, even if `\(p\)` is large? --- ## Conditioning / observation `\(\mathbb P(\mathbf y | \mathbf X = \mathbf x)\)`, coefficients, dependence plots, CEF, etc ## Manipulating / intervention `\(\mathbb P(\mathbf y | \text{do}(\mathbf X = \mathbf x))\)`, ATE, [CATE](https://en.wikipedia.org/wiki/Average_treatment_effect#Heterogenous_treatment_effects), etc **All interpretations are non-causal by default** Without explicit causal assumptions there are no causal conclusions "[No causes in, no causes out](https://oxford.universitypressscholarship.com/view/10.1093/0198235070.001.0001/acprof-9780198235071-chapter-3)" --- ## Plots instead of coefficients - Plot predictor `\(\mathbf x_j\)` vs `\(\hat f(\mathbf x)\)` - We saw this with GAMs -- #### Simulated examples Next we'll walk-through two examples First when the true CEF is additive, and second when it is not We'll see how **partial dependence plots** can be used to interpret models --- ### True model is additive ```r library(tidyverse) *f1 <- function(x) x^2 *f2 <- function(x) cos(pi*x/2) set.seed(1) n <- 800 df <- data.frame( x1 = 2*(runif(n)-1/2), x2 = 2*sort(sample(-100:100 / 50, n, replace = TRUE)) ) %>% mutate( * CEF = f1(x1) + f2(x2), y = CEF + rnorm(n) ) ``` **Note**: these simple examples have only 2 predictors so 3d-plot can show everything. In higher-dimensional models we may not be able to visualize the interactions we see here --- ### Correctly specified GAM ```r library(gam) fit_gam <- gam(y ~ s(x1) + s(x2), data = df) par(mfrow = c(1,2)) plot(fit_gam, se = TRUE, ylim = c(-1,1)) points(df$x2, f2(df$x2), type = "l", col = "green", lwd = 10) ``` <img src="10-1-interpretation_files/figure-html/df_plot-1.png" width="864" style="display: block; margin: auto;" /> --- ### Can plot entire model since `\(p = 2\)` .pull-left[
True CEF ] .pull-right[
GAM estimated ] --- ### Now true model is **not** additive Univariate dependence plots are easily interpreted under **additivity assumption** -- like GAMs **Danger**: does not show associations between predictors, interactions e.g. ```r df_int <- df %>% mutate( * CEF = f1(x1) * f2(x2), y = CEF + rnorm(n) ) ``` Additivity assumption is now (importantly) wrong --- ### Misspecified GAM ```r fit_gam_wrong <- gam(y ~ s(x1) + s(x2), data = df_int) par(mfrow = c(1,2)) plot(fit_gam_wrong, ylim = c(-1,1)) ``` <img src="10-1-interpretation_files/figure-html/dfi_plot-1.png" width="576" style="display: block; margin: auto;" /> Partial dependence on `\(x_1\)` is a parabola opening upward when `\(\cos(x_2) > 0\)` but downward when `\(\cos(x_2) < 0\)` (cancellation) --- ### GAM summaries ```r gam_summary <- summary(fit_gam) wrong_summary <- summary(fit_gam_wrong) wrong_summary$anova ``` ``` ## Anova for Nonparametric Effects ## Npar Df Npar F Pr(F) ## (Intercept) ## s(x1) 3 0.4497 0.7175862 ## s(x2) 3 6.9832 0.0001221 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` .pull-left[ ```r gam_summary$null.deviance ``` ``` ## [1] 1316.843 ``` ```r gam_summary$deviance ``` ``` ## [1] 931.711 ``` ] .pull-right[ ```r wrong_summary$null.deviance ``` ``` ## [1] 726.0882 ``` ```r wrong_summary$deviance ``` ``` ## [1] 703.4611 ``` ] --- ### Strong modeling assumption restricts learning .pull-left[
True CEF (non-additive) ] .pull-right[
GAM estimate ] --- class: center, middle ### Danger of interpreting misspecified models These plots and interaction tests interpret the fitted model They do not check the model fitting assumptions #### If we fit models that are importantly wrong, our interpretations can also be importantly wrong --- ### `iml` package ```r library(iml) # methods for interpretation pred_gam <- Predictor$new(fit_gam, data = df_subsample, y = df_subsample$y) pdp <- FeatureEffects$new(pred_gam, features = c("x1", "x2"), method = "pdp+ice") plot(pdp) + stat_function(fun = f2, color = "green", size = 2) + ylim(-1,1) ``` <img src="10-1-interpretation_files/figure-html/df_pdp-1.png" width="648" style="display: block; margin: auto;" /> --- ### Checking for interactions in the predictions ```r ia <- Interaction$new(pred_gam) ``` .pull-left[ ```r ia ``` ``` Interpretation method: Interaction Analysed predictor: Prediction task: unknown Analysed data: Sampling from data.frame with 240 rows and 3 columns. Head of results: .feature .interaction 1: x1 1.449399e-16 2: x2 1.225800e-16 3: y 0.000000e+00 ``` ] .pull-right[ ```r plot(ia) + xlim(0,1) ``` <img src="10-1-interpretation_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ] --- ```r pred_gamw <- Predictor$new(fit_gam_wrong, data = df_subsample, y = df_subsample$y) pdp <- FeatureEffects$new(pred_gamw, features = c("x1", "x2"), method = "pdp+ice") plot(pdp) + ylim(-1,1) ``` <img src="10-1-interpretation_files/figure-html/df_pdp_int-1.png" width="648" style="display: block; margin: auto;" /> --- ### Checking for interactions **in the predictions** ```r ia <- Interaction$new(pred_gamw) ``` .pull-left[ ```r ia ``` ``` Interpretation method: Interaction Analysed predictor: Prediction task: unknown Analysed data: Sampling from data.frame with 240 rows and 3 columns. Head of results: .feature .interaction 1: x1 1.401751e-16 2: x2 1.156637e-16 3: y 0.000000e+00 ``` ] .pull-right[ ```r plot(ia) + xlim(0,1) ``` <img src="10-1-interpretation_files/figure-html/unnamed-chunk-15-1.png" width="504" /> ] --- ### Regression trees + bagging ```r library(randomForest) fit_rf <- randomForest(y ~ ., ntree = 100, mtry = 2, minsize = 20, data = df_int_noCEF) pred_rf <- Predictor$new(fit_rf, data = df_subsample, y = df_subsample$y) pdp <- FeatureEffects$new(pred_rf, features = c("x1", "x2"), method = "pdp+ice") plot(pdp) + ylim(-1,1) ``` <img src="10-1-interpretation_files/figure-html/dfi_plot2-1.png" width="864" style="display: block; margin: auto;" /> --- ### Tree methods can fit interactions ```r ia <- Interaction$new(pred_rf) ``` .pull-left[ ```r ia ``` ``` Interpretation method: Interaction Analysed predictor: Prediction task: unknown Analysed data: Sampling from data.frame with 240 rows and 3 columns. Head of results: .feature .interaction 1: x1 0.7247983 2: x2 0.6523332 3: y 0.0000000 ``` ] .pull-right[ ```r plot(ia) + xlim(0,1) ``` <img src="10-1-interpretation_files/figure-html/unnamed-chunk-18-1.png" width="504" /> ] --- ### Tree methods aren't smooth .pull-left[
True CEF ] .pull-right[
Bagged trees estimate ] --- ### Smooth and non-additive approach ```r library(kernlab) fit_ksvm <- ksvm(y ~ ., kernel = rbfdot(sigma = 2), data = df_int_noCEF) pred_ksvm <- Predictor$new(fit_ksvm, data = df_subsample %>% select(-y), y = df_subsample$y, predict.fun=function(object, newdata) predict(object, newdata)) pdp <- FeatureEffects$new(pred_ksvm, features = c("x1", "x2"), method = "pdp+ice") plot(pdp) + ylim(-1,1) ``` <img src="10-1-interpretation_files/figure-html/dfi_plot3-1.png" width="864" style="display: block; margin: auto;" /> --- ### RBF kernel-SVM model fits interactions ```r ia <- Interaction$new(pred_ksvm) ``` .pull-left[ ```r ia ``` ``` Interpretation method: Interaction Analysed predictor: Prediction task: unknown Analysed data: Sampling from data.frame with 240 rows and 2 columns. Head of results: .feature .interaction 1: x1 0.7047211 2: x2 0.7269615 ``` ] .pull-right[ ```r plot(ia) + xlim(0,1) ``` <img src="10-1-interpretation_files/figure-html/unnamed-chunk-23-1.png" width="504" /> ] --- ### Best fit (for this example)? .pull-left[
True CEF ] .pull-right[
RBF kernel-SVM estimate ] --- ### What to do about other variables? For models with multiple predictors, partial dependence plots can hide model structure **Danger**: simple interpretation methods can be misleading, e.g. simple 2d-plots when the model is not additive Options - Hold other variables fixed (e.g. at medians), `plotmo` default - Average over them, `pdp` method - Plot a curve for each observation, `ice` method Each has pros/cons, no satisfactory answer for dependence --- ### Higher dimensions - When `\(p > 2\)`... Many plots: `\(p\)` univariate plots, `\(\binom{p}{2}\)` 3d plots, and 3d plots no longer show the whole model... Focus on one (or few) predictors of interest Simpler, fewer opportunities for interpretation errors But *each plot now potentially hides more of the model's complexity* - If `\(n\)` is (too) large Computational complexity, very busy plots (e.g. `ice` plots) Downsample, use a random fraction of data --- class: inverse ### (Local) surrogate/explanatory models Fit simple models to explain the predictions of a complex one From a **class of simple models**, let `\(\hat g\)` be the closest model to `\(\hat f\)` - **Local**: explanation for an individual prediction Different simple models `\(\hat g\)` for different parts of the predictor space, more complicated interpretation - **Global**: explanation for the overall model Simpler interpretation, but usually less fidelity to `\(\hat f\)` --- ### Calculus analogy (warning: only an analogy) Taylor expansion: approximate a differentiable function (locally) by a linear one (or polynomial). Image from [Wikipedia](https://en.wikipedia.org/wiki/Tangent_space) ![](https://upload.wikimedia.org/wikipedia/commons/e/e7/Tangentialvektor.svg) --- ### Models for different purposes | Prediction | Interpretation | | ----------- | ----------- | | `\(\hat f(\mathbf x)\)` complex, fit to `\(\mathbf y\)` | `\(\hat g(\mathbf z)\)` simple, fit to `\(\hat f(\mathbf x)\)` | | Accuracy on test data | Why did `\(\hat f\)` predict this value? | | Original features `\(\mathbf x\)` | Simplified features `\(\mathbf z = \mathbf z(\mathbf x)\)` | Fit `\(\hat g\)` using e.g. lasso, or a simple tree For text/image data, `\(\mathbf z\)` could be indicators for specific words or regions of an image --- ### LIME #### Local Interpretable Model-Agnostic Explanations ```r x_local <- data.frame(x1 = 0.9, x2 = 0) lime_ksvm <- LocalModel$new(pred_ksvm, k = 2, x.interest = x_local, dist.fun = "euclidean", kernel.width = .5) lime_ksvm$results ``` ``` ## beta x.recoded effect x.original feature feature.value ## x1 0.254412004 0.9 0.2289708 0.9 x1 x1=0.9 ## x2 -0.002814105 0.0 0.0000000 0 x2 x2=0 ``` ``` ## beta x.recoded effect x.original feature feature.value ## x1 -0.2923272 0.9 -0.2630945 0.9 x1 x1=0.9 ## x2 -0.3746932 1.0 -0.3746932 1 x2 x2=1 ``` ``` ## beta x.recoded effect x.original feature feature.value ## x1 -0.7984347 0.9 -0.7185913 0.9 x1 x1=0.9 ## x2 -0.1282582 2.0 -0.2565163 2 x2 x2=2 ``` Recall the CEF `\(= x_1^2 \cos(\pi x_2 / 2)\)` --- class: inverse ## Feature/variable "importance" - Which predictors are *used most* by the model? - Does not necessarily tell us about **causation** (none of these methods do, but an important reminder) #### Tree-specific importance When `\(x_j\)` is used for a split, how much does the accuracy improve? Average over all splits (across all bagged trees) #### Permutation importance Permute values of predictor `\(x_j\)`. How much does the accuracy of the model decrease? 
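---

### Permutation importance with `iml`

A minimal sketch using `iml::FeatureImp` with the `fit_rf` and `pred_rf` objects from the random forest slide above; the loss function and number of repetitions here are arbitrary choices

```r
# Permutation importance: shuffle each predictor, measure the increase in loss
imp_rf <- FeatureImp$new(pred_rf, loss = "mse", n.repetitions = 10)
imp_rf$results
plot(imp_rf)

# Tree-specific (split-based) importance from randomForest, for comparison
importance(fit_rf)
```

Both measures describe the fitted model's reliance on each predictor, not the strength of any causal relationship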
--- ### Local + "fair" importance Prediction is payoff, predictors are players, **Shapley values** distribute the payoff in a [fair](https://en.wikipedia.org/wiki/Shapley_value#Characterization) way Change in prediction when adding variable `\(x_j\)` to another other subset of predictors, averaged over all subsets ```r ksvm_shapley <- Shapley$new(pred_ksvm, x.interest = x_local) ksvm_shapley$results ``` ``` ## feature phi phi.var feature.value ## 1 x1 0.1302725 0.0691334 x1=0.9 ## 2 x2 0.2993309 0.1255579 x2=0 ``` ``` ## feature phi phi.var feature.value ## 1 x1 0.03226445 0.1132297 x1=0.9 ## 2 x2 0.40410080 0.2060785 x2=0 ``` ``` ## feature phi phi.var feature.value ## 1 x1 0.01575709 0.03429754 x1=0 ## 2 x2 0.04087980 0.05024968 x2=1 ``` --- class: inverse ## Interpretation challenges ### Tuning **Trading off complexity of interpretation**: fidelity to the model vs simplicity of explanations How simple (e.g. sparse)? For local explanations, how local? --- class: inverse ## Interpretation challenges ### Error **Inaccurate interpretation of a good model**: reach wrong conclusions about why model generalizes well Univariate plots or local surrogate may be too simple, poorly fitting to `\(\hat f\)`. Model explanations can even be optimized to be misleading (e.g. to give appearance of compliance with regulations, see "[fairwashing](http://proceedings.mlr.press/v97/aivodji19a.html)") **Accurate interpretation of a bad model**: understanding the model may increase confidence, leading to greater surprise when the model fails e.g. model with poor OOD generalization used in a context where DGP changes --- class: inverse ## Interpretation challenges ### Communication Some of these methods are still quite technical Variable importance measures with randomized computation, values can change Many of these tools are popular because software makes it easy to get simple output, but may not understand the method they are using (*ask someone using Shapley values how to compute them...*) **Exercise**: explain to different audiences why variable importance cannot be interpreted as strength of a causal relationship --- ## Additional reading - [Free online textbook](https://christophm.github.io/interpretable-ml-book/) on interpretability by Christoph Molnar - [Vignette](https://cran.r-project.org/web/packages/iml/vignettes/intro.html) for `iml` package by Molnar - [Vignette](https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html) for `lime` package - [R tutorial](https://uc-r.github.io/lime) on lime - [lime homepage](https://homes.cs.washington.edu/~marcotcr/blog/lime/) with links to Python package --- ### Should/do we always care about causality? > How does the model depend on a specific variable or set of variables? *Why* focus on that variable (those variables)? > Which variables does the model depend on most? What will you do with this information? > I don't care about interpretation, I just want the most accurate predictions! Predictions are (usually) used to make decisions / take actions --- ### Causality assumptions Remember: "[No causes in, no causes out](https://oxford.universitypressscholarship.com/view/10.1093/0198235070.001.0001/acprof-9780198235071-chapter-3)" 1. Assume some mathematical (probability) model relating variables (possibly including unobserved variables) 2. **Also** assume a direction for each relationship 3. 
**Also** assume that changing/manipulating/intervening a variable results in corresponding changes in other variables (or their distributions) which are functionally "downstream" Deterministic e.g.: `\(X \to Y\)`, so `\(y = f(x)\)` and changing `\(x\)` to `\(x'\)` means `\(y\)` also changes to `\(f(x')\)` [but if `\(f\)` has an inverse, this model does *not* mean that changing `\(y\)` to `\(y'\)` results in `\(x\)` changing to `\(f^{-1}(y')\)`] (temperature `\(\to\)` thermometer) --- ### This is how we *want* to interpret regression Suppose we estimate a CEF `\(\hat f(x) \approx \mathbb E[Y |X = x]\)` We would like to interpret this as meaning > If we change `\(x\)` to `\(x'\)` then `\(\mathbb E[Y|X=x]\)` will change from `\(\hat f(x)\)` to `\(\hat f(x')\)` But this interpretation depends on the extra, causal assumptions (2 and 3 on the previous slide) #### *These assumptions are often **importantly wrong*** We must use them carefully, always be clear about it, and try to improve --- ## Various inferential goals/methods - **Causal discovery**: learn a graph (DAG) from data (very difficult especially if there are many variables) Otherwise we may assume the DAG structure is known and want to estimate the functions - **Estimation/inference** - *Parametric*: estimate `\(f\)` and/or `\(g\)` assuming they are linear or have some other (low-dimensional) parametric form, get CIs - *Non-parametric*: use machine learning instead - **Optimization**: find the "best" intervention/policy --- class: inverse, middle, center ### A few examples of estimation tasks and ML --- ### Mediation analysis Does `\(X\)` influence `\(Y\)` directly, and/or through a mediator variable `\(M\)`? *How much* of the relationship between `\(X\)` and `\(Y\)` is "explained" by `\(M\)`? e.g. Does `gdp per capita` influence `life expectancy` directly, and/or through `healthcare expenditure`? e.g. Does `SES` influence `graduation rate` directly, and/or through `major`? e.g. (dear to me) How much of `weightlifting` influences `strength` directly, vs how much of the change in `strength` is mediated through `muscle size`? --- ### Simple model for mediation If we know/estimate functions, we can simulate/compute the consequences of an intervention `$$x = \text{exogeneous}, \quad m = f(x) + \varepsilon_m, \quad y = g(m, x) + \varepsilon_y$$` <img src="10-1-interpretation_files/figure-html/unnamed-chunk-32-1.png" width="504" style="display: block; margin: auto;" /> Counterfactual: `\(x \gets x'\)`, so `$$m \gets f(x') + \varepsilon_m, \quad y \gets g(f(x') + \varepsilon_m, x') + \varepsilon_y$$` --- ### Mediation questions Direct/indirect effects: holding `\(m\)` constant or allowing it to be changed by `\(x\)`, various comparisons ("controlled" or "natural") #### Direct effect Quantifying/estimating the strength of the arrow `\(X \to Y\)` #### Total and indirect effects How much of the resulting change in `\(Y\)` (total effect) goes through the path `\(X \to M \to Y\)` (i.e. indirectly)? 
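---

### Simulating direct and indirect effects

A minimal simulation sketch of these comparisons; the specific `f` and `g` below are hypothetical choices plugged into the mediation model from the previous slides

```r
# Hypothetical structural functions (illustrative choices only)
f <- function(x) 2 * x         # mediator model:  m = f(x) + eps_m
g <- function(m, x) m + x / 2  # outcome model:   y = g(m, x) + eps_y
set.seed(1)
n_sim <- 1e5

simulate_y <- function(x) {
  m <- f(x) + rnorm(n_sim)
  g(m, x) + rnorm(n_sim)
}

# Total effect of the intervention x: 0 -> 1 (mediator allowed to change)
mean(simulate_y(1)) - mean(simulate_y(0))

# Controlled direct effect: change x but hold the mediator fixed
m_fixed <- f(0) + rnorm(n_sim)
mean(g(m_fixed, 1) + rnorm(n_sim)) - mean(g(m_fixed, 0) + rnorm(n_sim))
```

In this linear example the gap between the two is the indirect effect, the part flowing through `\(X \to M \to Y\)`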
--- ### Treatment with **observed** confounding Let's relabel from `\(M\)` to `\(T\)`, a "treatment" (or "exposure") of interest but for which we do not have experimental data #### The effect of `\(T\)`, "controlling" for `\(X\)` Counterfactual: `\(T \gets t\)`, so `\(Y^t := Y \gets g(t, X) + \varepsilon_y\)` -- Often `\(T\)` is binary, and we are interested in - Average treatment effect (ATE) `$$\tau = \mathbb E[Y^1 - Y^0]$$` - Conditional ATE (CATE) `$$\tau(x) = \mathbb E[Y^1 - Y^0 | X = x]$$` --- ### Strategy: two staged regression - Parametric $$ Y = T \theta + X \beta + \varepsilon_Y $$ $$ T = X \alpha + \varepsilon_T $$ Estimate `\(\hat \theta\)` from e.g. 2SLS -- - Non/semi-parametric [PLM](https://en.wikipedia.org/wiki/Partially_linear_model) ("*double machine learning*") $$ Y = T \theta + g(X) + \varepsilon_Y $$ $$ T = m(X) + \varepsilon_T $$ Replace LS (least squares) with machine learning methods to estimate "*nuisance functions*" `\(\hat g\)` and `\(\hat m\)` --- ### High-dimensional regression example **Double selection** ("post-double-selection") Use a high-dimensional regression method that chooses sparse regression models, like lasso, in the first stage of a two stage procedure 1. Estimate nuisance functions with e.g. lasso - Fit `\(T \sim X\)`, get a subset of predictors `\(X_m\)` - Fit `\(Y \sim X\)`, get a subset of predictors `\(X_g\)` 2. Fit least squares `\(Y \sim T + X_m + X_g\)` The coefficient of `\(T\)` in the least squares model is our estimate --- class: middle ### Note Can combine (sparse) variable selection (e.g. lasso) with (global) non-linearity, e.g. selecting interaction/polynomial/spline basis functions Even if our original predictors satisfy `\(p \ll n\)`, we can still select (sparse) models from a high-dimensional model space this way --- ### Binary treatment and propensity models When `\(T\)` is binary we may keep the same *outcome model* (e.g. if `\(Y\)` is numeric then we usually use) $$ Y = T \theta + g(X) + \varepsilon_Y $$ But now the treatment model is called a *propensity function* $$ \pi(x) = \mathbb P(T = 1 | X = x) $$ e.g. `\(T\)` smoking, `\(Y\)` some health outcome, `\(X\)` could be SES etc -- Intuition: can't randomize, so compare people who have similar values of `\(\hat \pi(x)\)` Use ML **classification** methods to get `\(\hat \pi(x)\)` --- ### Tree based: causal forests Suppose we want to learn the CATE `\(\tau(x)\)` - *heterogeneous treatment effects* - could be used for *individualized treatments* e.g. Randomized trial, but we want to learn how `\(X\)` interacts with the treatment effect. When is `\(T\)` more/less effective? Idea: search predictor space `\(X\)` for regions where `\(Y\)` is maximally different between the treated and controls Causal forests use this splitting strategy when growing trees, instead of splitting to improve predictions --- class: inverse, middle, center ### Validation for causal ML --- ### Causal relationships should generalize (Specifically, the *kinds of causal relationships we study with statistical methods* should generalize, otherwise why are we using probability?) 
- If `\(T \to Y\)` and we estimate this causal relationship well then it should be useful for predicting on new data - If `\(X \to T\)` then we should be able to predict `\(T\)` (using `\(\hat m\)` or `\(\hat \pi\)`) on test data - If `\(X \to Y\)` then `\(\hat g\)` should be accurate on test data, etc Causal ML applies this in various ways to different problems --- ## Cross-fitting Sample splitting, using separate subsets of data to compute 1. estimates of nuisance functions, like `\(\hat g\)` or `\(\hat \pi\)` 2. the desired causal estimates, like `\(\hat \theta\)` Why? e.g. if `\(\hat g\)` is (a little) overfit, this has less effect on `\(\hat \theta\)` because `\(\hat g\)` is overfit to a different subset of the data (than the one used to estimate `\(\hat \theta\)`) Can be iterated: swap subsets and average `\(\hat \theta\)` estimates --- ## Honest trees/forests Split training data, use one subset for deciding tree structure (which variables to use for splitting) and the other subset for predictions e.g. Suppose we fit a tree with depth 1, on one subset of data we find the best split is `\(X_7 < c\)`. Then for predictions we use the values of `\(Y\)` *in a separate subset of data*, averaging/voting those `\(Y\)` values with `\(X_7 < c\)` for one prediction and those with `\(X_7 \geq c\)` for the other prediction. (No test/validation yet) -- **Exercise**: explain how this is different from the out-of-bag validation approach in bagging. Draw diagrams (perhaps with inspiration from ISLR Figure 5.5) to represent the two cases --- class: inverse, middle, center ### These methods can be importantly wrong --- ### Why causal inference (and ML) is difficult #### Out-of-distribution generalization An ML model which is not overfit--has good test error, good in-distribution generalization--can still fail at OOD generalization For a different data-generating process, the *probability distribution changes*, and (maybe) *causal relationships change* #### Identifiability The same data can be consistent with many different causal structures. We rely on assumptions which may be wrong ML specific: flexible models can relax some assumptions --- ### Unobserved confounding *Observed* predictors must capture *everything* about how treatment assignment and treatment effectiveness are related <img src="10-1-interpretation_files/figure-html/unnamed-chunk-33-1.png" width="360" style="display: block; margin: auto;" /> Practically all methods assume this isn't happening (some conduct sensitivity analysis) --- class: middle **Exercise**: draw a few DAGs representing the partially linear model and give real-world examples both with and without unobserved confounders. For the case with unobserved confounding, explain how our estimates of `\(\theta\)` would fail to capture the correct causal relationship (i.e. why an intervention on `\(T\)` would fail to have the estimated effect) --- ### More warnings for data about humans Social science is hard! Health is also (at least partially) social. 
It is very difficult to [predict life outcomes](https://www.pnas.org/content/117/15/8398), even with machine learning models and [big data](https://hdsr.mitpress.mit.edu/pub/uan1b4m9/release/3) Strategic/adversarial/adaptive behavior: [Goodhart's law / Campbell's law](https://twitter.com/CT_Bergstrom/status/1329175501882019841) / [Lucas critique](https://en.wikipedia.org/wiki/Lucas_critique) / [Friedman's thermostat](https://worthwhile.typepad.com/worthwhile_canadian_initi/2012/07/why-are-almost-all-economists-unaware-of-milton-friedmans-thermostat.html) / [Washback effect](https://en.wikipedia.org/wiki/Washback_effect) / [Performativity](https://en.wikipedia.org/wiki/Performativity) / [Reflexivity](https://en.wikipedia.org/wiki/Reflexivity_(social_theory%29) / [Hammer's law](https://twitter.com/MCHammer/status/1363908982289559553) For these reasons, *when dealing with human data*: - Almost all variables correlate - Unconfoundedness assumptions are always wrong - Can't separate "normative vs scientific" (ethics always important) Remember: focus on what is **importantly wrong** --- class: inverse, center, middle ### You are close to the cutting edge of research Practice using some of these methods, combine them with interpretable plots (previous lecture), and your labor will be in high demand! --- ## R packages [DoubleML](https://cran.r-project.org/web/packages/DoubleML/vignettes/DoubleML.html) package Causal forests in [grf](https://grf-labs.github.io/grf/index.html) package, and a [blog post](https://www.markhw.com/blog/causalforestintro) introduction A few [DML packages](https://github.com/MCKnaus) by Michael C. Knaus
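---

### e.g. Causal forest on simulated data

A minimal `grf::causal_forest` sketch; the data generating process below is made up for illustration

```r
# Simulate a randomized binary treatment with a known heterogeneous effect
library(grf)
set.seed(1)
n <- 2000; p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)            # randomized binary treatment
tau <- pmax(X[, 1], 0)            # true CATE depends on X1
Y <- X[, 2] + tau * W + rnorm(n)

cf <- causal_forest(X, Y, W)
average_treatment_effect(cf)      # ATE estimate with standard error
head(predict(cf)$predictions)     # estimated CATEs, tau-hat(x)
```

Compare the estimated CATEs with the true `\(\tau(x) = \max(x_1, 0)\)` used in the simulation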