```{r setup, message = FALSE, echo = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
theme_set(theme_minimal(base_size = 16))
```

### Simulate high-dimensional data

Generate data from a high-dimensional model. After your implementation is finished, you can come back here to experiment and see how the algorithm's performance depends on these values. (One interesting experiment: reduce `lambda` in `rpois` to get a sparser `beta` and see how the results change.)

```{r}
set.seed(1) # comment/uncomment/change this
n <- 100
p <- 400
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 1.2)
SGD_starting_beta <- rnorm(p) # generate once
sigma <- 2
y <- X %*% beta + rnorm(n, sd = sigma)
# sparsity level
sum(beta != 0)
```

**Question**: Can we fit OLS to estimate `beta`? Why or why not?

### Implement stochastic gradient descent

#### Helper functions

```{r}
# Loss function
least_squares_loss <- function(x, y, beta) {
  residuals <- y - x %*% beta
  # TODO
}
```

```{r}
# Gradient of the loss function
least_squares_gradient <- function(x, y, beta) {
  # TODO
}
```

#### Mini-batch example

Computing the gradient on a subset of the sample:

```{r}
batch_size <- 10
batch_indices <- sample(1:nrow(X), batch_size)
X_batch <- X[batch_indices, , drop = FALSE]
y_batch <- y[batch_indices]
gradient_batch <- least_squares_gradient(X_batch, y_batch, SGD_starting_beta)
dim(X_batch)
length(y_batch)
dim(gradient_batch)
```

#### SGD: main loop

Modify the code below to complete your implementation. In this implementation, normalize the gradient to make it a unit vector before taking a step. Recall that if `v` is a vector, then `u <- v / sqrt(sum(v^2))` is a unit vector in the direction of `v`. (Optional: for numerical stability you could also add a small positive number, like `1e-8`, to the denominator.) One possible way to fill in the TODOs, both here and in the helper functions above, is sketched after this chunk.

```{r}
epochs <- 40 # decrease for early stopping?
batch_size <- 8 # for batch SGD
gamma <- 0.9
beta_hat <- SGD_starting_beta # generated once above
num_batches <- ceiling(n/batch_size)
steps <- epochs * num_batches
SGD_path <- data.frame(
  step = 1:steps,
  epoch = numeric(steps),
  gamma = numeric(steps),
  loss = numeric(steps),
  MSE = numeric(steps),
  beta_hat_norm = numeric(steps),
  gradient_norm = numeric(steps)
)

start_time <- Sys.time()
step <- 1
for (epoch in 1:epochs) {
  # Pass over sample in a random order each time
  shuffled_indices <- sample(1:n, n)
  for (batch in 1:num_batches) {
    batch_start <- 1 + batch_size * (batch - 1)
    batch_end <- min(batch_size * batch, nrow(X))
    batch_indices <- shuffled_indices[batch_start:batch_end]

    # Mini-batch
    # TODO copy/paste from example above

    # Update beta_hat
    # TODO beta_hat <- beta_hat - gamma^epoch * ...

    LS_loss <- least_squares_loss(X, y, beta_hat)

    # Store algorithm path information
    SGD_path[step, ] <- c(step, epoch, gamma^epoch, LS_loss,
                          mean((beta_hat - beta)^2),
                          sqrt(sum(beta_hat^2)), gradient_norm)
    step <- step + 1
  }
}
print(Sys.time() - start_time)
```
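For reference, here is one possible completion of the two helper chunks above; treat it as a sketch rather than the required answer. It takes the loss to be the residual sum of squares, `sum((y - x %*% beta)^2)`, whose gradient with respect to `beta` is `-2 * t(x) %*% (y - x %*% beta)`. (Averaging over the batch instead only rescales the gradient, which makes no difference once it is normalized to a unit vector.) The chunk is set to `eval = FALSE` so it does not override whatever you write above.

```{r, eval = FALSE}
# Possible loss: residual sum of squares
least_squares_loss <- function(x, y, beta) {
  residuals <- y - x %*% beta
  sum(residuals^2)
}

# Its gradient with respect to beta (returned as a p x 1 matrix)
least_squares_gradient <- function(x, y, beta) {
  residuals <- y - x %*% beta
  -2 * t(x) %*% residuals
}
```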
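Similarly, one way (among several) to fill in the two TODOs inside the inner loop: form the mini-batch exactly as in the example above, normalize the batch gradient to a unit vector (with a small constant in the denominator for stability), and take a step of size `gamma^epoch`. The names `gradient_batch` and `gradient_unit` are just part of this sketch; note that `gradient_norm` has to be computed in any case, because the loop stores it in `SGD_path`.

```{r, eval = FALSE}
# Mini-batch (same pattern as the example above, but at the current beta_hat)
X_batch <- X[batch_indices, , drop = FALSE]
y_batch <- y[batch_indices]
gradient_batch <- least_squares_gradient(X_batch, y_batch, beta_hat)

# Normalize to a unit vector; 1e-8 guards against division by zero
gradient_norm <- sqrt(sum(gradient_batch^2))
gradient_unit <- gradient_batch / (gradient_norm + 1e-8)

# Update beta_hat with a step of size gamma^epoch
# (drop() keeps beta_hat a plain vector rather than a p x 1 matrix)
beta_hat <- beta_hat - gamma^epoch * drop(gradient_unit)
```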
## Plots by **epoch**

(Averaged over all batches in that epoch.)

Note: the `loss` is plotted on a logarithmic scale so that large initial changes don't obscure later variation.

```{r echo = FALSE, fig.align='center'}
SGD_epoch <- SGD_path |>
  select(-step) |>
  group_by(epoch) |>
  summarize(across(where(is.numeric), mean))

SGD_epoch |> ggplot(aes(epoch, gamma)) + geom_point() + ylab("gamma^epoch")
SGD_epoch |> ggplot(aes(epoch, loss)) + geom_point() + ylab("Loss") + scale_y_log10()
SGD_epoch |> ggplot(aes(epoch, MSE)) + geom_point() + ylab("MSE(beta_hat)")
SGD_epoch |> ggplot(aes(epoch, beta_hat_norm)) + geom_point() + ylab("norm(beta_hat)")
SGD_epoch |> ggplot(aes(epoch, gradient_norm)) + geom_point() + ylab("norm(gradient)")
```

**Question**: Does the MSE of `beta_hat` level off? After roughly how many epochs? Is this happening because `gamma` decreased too fast?

## Plots by **mini-batch**

```{r echo = FALSE, fig.align='center'}
SGD_path |> ggplot(aes(step, gamma)) + geom_point() + ylab("gamma^epoch")
SGD_path |> ggplot(aes(step, loss)) + geom_point() + ylab("Loss") + scale_y_log10()
SGD_path |> ggplot(aes(step, MSE)) + geom_point() + ylab("MSE(beta_hat)")
SGD_path |> ggplot(aes(step, beta_hat_norm)) + geom_point() + ylab("norm(beta_hat)")
SGD_path |> ggplot(aes(step, gradient_norm)) + geom_point() + ylab("norm(gradient)")
```

## Squared errors

Has the algorithm pulled more of the squared errors closer to zero compared to the (random) initial starting point?

```{r echo = FALSE}
data.frame(
  coordinate = c(1:p, 1:p),
  estimate = c(rep("initial", p), rep("final", p)),
  error = c(beta - SGD_starting_beta, beta - beta_hat)
) |>
  ggplot(aes(error^2, fill = estimate)) +
  geom_density(alpha = .7) +
  scale_fill_viridis_d()
```

Compute the (Euclidean) distances between the following points:

```{r}
# From algorithm starting point to end point
sqrt(sum((beta_hat - SGD_starting_beta)^2))
# From starting point to true beta
sqrt(sum((beta - SGD_starting_beta)^2))
# From end point to true beta
sqrt(sum((beta - beta_hat)^2))
```

**Question**: Does SGD do better than a random guess (the starting point) at estimating the true coefficients?