```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(broom)
library(glmnet)
library(selectiveInference)
theme_set(theme_minimal(base_size = 14))
```
## Inference for the lasso
#### Generate data for a high-dimensional regression
```{r}
n <- 100  # observations
p <- 200  # predictors (p > n, so this is high-dimensional)
X <- matrix(rnorm(n * p), nrow = n)
Y <- rnorm(n)  # Y is generated independently of X, so every null hypothesis is true
```
#### Fit the lasso solution path and choose a sparsity level automatically
Note: choosing the sparsity level with an AIC-style criterion along the path is an alternative to cross-validation.
```{r}
# Compute the lasso (LAR) solution path, storing the
# information required for conditional (truncated) inference
lasso_path <- lar(X, Y)
# Use an AIC-style criterion to choose the model size and
# compute adjusted p-values; in higher dimensions, change
# `mult` to log(n) for BIC or log(p) for RIC
lasso_inference <- larInf(lasso_path, mult = 2, type = "aic", ntimes = 1)
```
The selected predictors:
```{r}
active_set <- lasso_inference$vars
active_set
```
#### Fit an OLS model using only selected variables, compute p-values the classical way
(Use `summary()`, or `tidy()` from the `broom` package)
```{r}
# Refit by OLS on the selected columns only; "- 1" drops the intercept
OLS_fit <- lm(Y ~ X[, active_set] - 1)
tidy(OLS_fit)
```
#### Adjusted inference
These p-values (from the `larInf` function in the `selectiveInference` package above) are computed from a truncated distribution: they condition on the event that the lasso selected this particular model, which corrects the inference for the selection step
```{r}
lasso_inference
```
#### How do the two sets of p-values compare? Which are "better", and why?
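To see why the classical p-values are misleading here, consider a small self-contained illustration. It uses marginal screening in place of the lasso (a simpler selection rule, but the bias mechanism is the same): under a global null, picking the variables most associated with `Y` and then running ordinary t-tests on them yields p-values that are far too small.

```{r}
# Illustration (not the lab's exact pipeline): select the 5 predictors
# most correlated with Y, then naively test them with classical OLS
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), nrow = n)
Y <- rnorm(n)                                    # pure noise: no true signal
scores <- abs(crossprod(X, Y))                   # marginal association with Y
chosen <- order(scores, decreasing = TRUE)[1:5]  # keep the 5 strongest
naive_fit <- lm(Y ~ X[, chosen] - 1)
summary(naive_fit)$coefficients[, "Pr(>|t|)"]    # often "significant" despite no signal
```

The same data were used twice, once to pick the variables and once to test them, so the classical null distribution no longer applies; the conditional p-values from `larInf` account for this.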
#### Compare to p-values calculated on a new set of test data
```{r}
```
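A minimal sketch of one way to do this: generate fresh data from the same distribution and refit OLS on the selected variables. Because the test data played no role in selecting the model, the classical p-values are valid there. The `active_set` below is a hard-coded placeholder so the sketch runs on its own; in the lab, use `lasso_inference$vars` from above.

```{r}
# Fresh data, never touched by the selection step
set.seed(2)
n <- 100; p <- 200
active_set <- c(3, 17, 52)                # placeholder: use lasso_inference$vars
X_test <- matrix(rnorm(n * p), nrow = n)
Y_test <- rnorm(n)
test_fit <- lm(Y_test ~ X_test[, active_set] - 1)
summary(test_fit)$coefficients[, "Pr(>|t|)"]  # classical p-values, valid on test data
```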
#### Rerun the code several times and observe the variability
#### Change `Y` to be generated from a coefficient vector where only the first 5 entries are nonzero and then run the code again a few more times
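One way to set this up, as a sketch (the coefficient value 2 is an arbitrary choice for the signal strength):

```{r}
# Sparse truth: only the first 5 coefficients are nonzero
n <- 100; p <- 200
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(rep(2, 5), rep(0, p - 5))
Y <- as.numeric(X %*% beta + rnorm(n))
```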