```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(broom)
library(glmnet)
library(selectiveInference)
theme_set(theme_minimal(base_size = 14))
```
## Inference for the lasso
#### Generate data for a high-dimensional regression
```{r}
n <- 100  # observations
p <- 200  # predictors (p > n, so this is high-dimensional)
X <- matrix(rnorm(n * p), nrow = n)
Y <- rnorm(n)  # Y is generated independently of X, so every null hypothesis is true
```
#### Fit the lasso solution path and choose a sparsity level automatically
Note: choosing the sparsity level with an AIC-style criterion along the path is an alternative to cross-validation.
```{r}
# Compute the lasso (LAR) solution path, storing the
# information required for conditional (truncated) inference
lasso_path <- lar(X, Y)
# Use an AIC-style criterion to choose the model size and
# compute adjusted p-values; in higher dimensions, change
# `mult` to log(n) for BIC or log(p) for RIC
lasso_inference <- larInf(lasso_path, mult = 2, type = "aic", ntimes = 1)
```
The selected predictors:
```{r}
active_set <- lasso_inference$vars
active_set
```
#### Fit an OLS model using only selected variables, compute p-values the classical way
(Use `summary()`, or `tidy()` from the `broom` package)
```{r}
# Refit by OLS on the selected columns only; "- 1" drops the intercept
OLS_fit <- lm(Y ~ X[, active_set] - 1)
tidy(OLS_fit)
```
#### Adjusted inference
These p-values (from the `larInf` function in the `selectiveInference` package above) are computed from a truncated distribution: they condition on the event that the lasso selected this particular model, which corrects the inference for the selection step
```{r}
lasso_inference
```
#### How do the two sets of p-values compare? Which are "better", and why?
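To see why the classical p-values are misleading here, consider a small self-contained illustration. It uses marginal screening in place of the lasso (a simpler selection rule, but the bias mechanism is the same): under a global null, picking the variables most associated with `Y` and then running ordinary t-tests on them yields p-values that are far too small.

```{r}
# Illustration (not the lab's exact pipeline): select the 5 predictors
# most correlated with Y, then naively test them with classical OLS
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), nrow = n)
Y <- rnorm(n)                                    # pure noise: no true signal
scores <- abs(crossprod(X, Y))                   # marginal association with Y
chosen <- order(scores, decreasing = TRUE)[1:5]  # keep the 5 strongest
naive_fit <- lm(Y ~ X[, chosen] - 1)
summary(naive_fit)$coefficients[, "Pr(>|t|)"]    # often "significant" despite no signal
```

The same data were used twice, once to pick the variables and once to test them, so the classical null distribution no longer applies; the conditional p-values from `larInf` account for this.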
#### Compare to p-values calculated on a new set of test data
```{r}
```
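A minimal sketch of one way to do this: generate fresh data from the same distribution and refit OLS on the selected variables. Because the test data played no role in selecting the model, the classical p-values are valid there. The `active_set` below is a hard-coded placeholder so the sketch runs on its own; in the lab, use `lasso_inference$vars` from above.

```{r}
# Fresh data, never touched by the selection step
set.seed(2)
n <- 100; p <- 200
active_set <- c(3, 17, 52)                # placeholder: use lasso_inference$vars
X_test <- matrix(rnorm(n * p), nrow = n)
Y_test <- rnorm(n)
test_fit <- lm(Y_test ~ X_test[, active_set] - 1)
summary(test_fit)$coefficients[, "Pr(>|t|)"]  # classical p-values, valid on test data
```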
#### Rerun the code several times and observe the variability
#### Change `Y` to be generated from a coefficient vector where only the first 5 entries are nonzero and then run the code again a few more times
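One way to set this up, as a sketch (the coefficient value 2 is an arbitrary choice for the signal strength):

```{r}
# Sparse truth: only the first 5 coefficients are nonzero
n <- 100; p <- 200
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(rep(2, 5), rep(0, p - 5))
Y <- as.numeric(X %*% beta + rnorm(n))
```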