Seminar note

During the live seminar we’ll implement the following for ordinary least squares. Completing the implementation for ridge regression might be on the next problem set (and/or some simple mathematical analysis of it could be examinable), so this is good practice.

Ridge regression

(You can delete this section if you get error messages about LaTeX)

Consider the loss function

\[L(\beta) = \| \mathbf y - \mathbf X \beta \|^2 + \lambda \| \beta \|^2\]

You might also want to consider writing it this way:

\[L(\beta) = \sum_{i=1}^n (y_i - \mathbf x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2\]

Compute the gradient \(\nabla L\) of this loss function, and use your answer to write an R function below that inputs \((X, y, \beta, \lambda)\) and outputs the value of the gradient. Also write a function that takes the same inputs and outputs the value of the ridge loss.
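One way to check your answer: differentiating the matrix form above gives

\[\nabla L(\beta) = -2 \mathbf X^T (\mathbf y - \mathbf X \beta) + 2 \lambda \beta,\]

and setting \(\lambda = 0\) recovers the ordinary least squares gradient implemented in the next chunk.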

(Don’t delete below here)

# Gradient of the least squares loss (the lambda = 0 special case of the ridge gradient)
least_squares_gradient <- function(x, y, beta) {
  -2 * t(x) %*% (y - x %*% beta)
}

# Least squares loss
least_squares_loss <- function(x, y, beta) {
  sum((y - x %*% beta)^2)
}
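
A minimal sketch of the corresponding ridge versions, assuming the gradient formula above; the names ridge_gradient and ridge_loss are illustrative choices, not part of the seminar code.

# Sketch: ridge versions of the two functions above (names are illustrative)
ridge_gradient <- function(x, y, beta, lambda) {
  -2 * t(x) %*% (y - x %*% beta) + 2 * lambda * beta
}

ridge_loss <- function(x, y, beta, lambda) {
  sum((y - x %*% beta)^2) + lambda * sum(beta^2)
}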

Simulate data

Write a function that inputs a coefficient vector beta and a design matrix, and uses these to generate a vector of outcomes from a linear model with normal noise.
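
One possible sketch of such a function; the name simulate_outcome and the noise_sd argument are our own choices, not fixed by the exercise. The chunk below then does the same thing inline for a small example.

# Sketch: outcomes from a linear model with i.i.d. normal noise (name and noise_sd are illustrative)
simulate_outcome <- function(X, beta, noise_sd = 1) {
  X %*% beta + rnorm(nrow(X), sd = noise_sd)
}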

set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 3)
y <- X %*% beta + rnorm(n)
least_squares_gradient(X, y, beta)
##             [,1]
##  [1,]  -7.655822
##  [2,]  29.799193
##  [3,]   7.017735
##  [4,]  -2.586372
##  [5,] -10.374698
##  [6,]   1.528508
##  [7,] -17.181227
##  [8,] -41.579952
##  [9,]  12.879127
## [10,] -21.106348

Now generate a design matrix and a coefficient vector, and use these with the previous function to generate an outcome.

# Choose n and p, consider making p large, maybe even larger than n
# Also think about the size of the coefficients, and sparsity
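
A hedged sketch of one way to fill this in, relying on the simulate_outcome helper sketched above; the specific values of n, p, the coefficient sizes, and the sparsity level are arbitrary choices.

# Sketch: a wider design with a sparse coefficient vector (all values here are arbitrary)
set.seed(2)
n2 <- 200
p2 <- 150
X2 <- matrix(rnorm(n2 * p2), nrow = n2)
beta2 <- c(rep(2, 10), rep(0, p2 - 10))  # sparse: only the first 10 coefficients are nonzero
y2 <- simulate_outcome(X2, beta2)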

Gradient descent

# Start from beta = 0 and take one small fixed-size step to initialise the loop
beta0 <- rep(0, p)
previous_loss <- least_squares_loss(X, y, beta0)
grad0 <- least_squares_gradient(X, y, beta0)
beta1 <- beta0 - 0.0001 * grad0
next_loss <- least_squares_loss(X, y, beta1)
previous_beta <- beta1
steps <- 1

# Normalised gradient steps with a geometrically decaying step size,
# stopping when the loss changes by less than 0.0001 between iterations
while (abs(previous_loss - next_loss) > 0.0001) {
  gradn <- least_squares_gradient(X, y, previous_beta)
  next_beta <- previous_beta - (0.99)^steps * gradn / sqrt(sum(gradn^2))
  if (steps %% 10 == 0) print(previous_loss)  # report progress every 10 steps
  steps <- steps + 1
  previous_beta <- next_beta
  previous_loss <- next_loss
  next_loss <- least_squares_loss(X, y, next_beta)
}
## [1] 1536.472
## [1] 159.3278
## [1] 151.77
## [1] 145.2942
## [1] 139.7435
## [1] 135.0087
## [1] 130.9636
## [1] 125.5322
## [1] 110.2006
## [1] 108.4356
## [1] 106.9456
## [1] 105.6849
## [1] 104.6159
## [1] 103.7073
## [1] 102.9331
## [1] 102.2718
## [1] 101.7055
## [1] 101.0657
## [1] 99.00849
## [1] 98.3756
## [1] 98.16812
## [1] 97.99213
## [1] 97.8425
## [1] 97.71496
## [1] 97.60598
## [1] 97.51188
## [1] 97.00878
## [1] 96.96367
## [1] 96.92572
## [1] 96.89373
## [1] 96.8667
## [1] 96.84383
## [1] 96.82441
## [1] 96.8079
## [1] 96.79381
## [1] 96.78178
## [1] 96.77146
## [1] 96.71617
## [1] 96.71431
## [1] 96.70984
## [1] 96.70606
## [1] 96.70285
## [1] 96.70012
## [1] 96.69778
## [1] 96.69579
## [1] 96.69121
## [1] 96.68443
## [1] 96.68362
## [1] 96.68293
## [1] 96.68235
## [1] 96.68187
## [1] 96.68145
## [1] 96.6811
## [1] 96.6808
## [1] 96.68055
## [1] 96.68033
## [1] 96.68014
## [1] 96.67907
## [1] 96.67898
## [1] 96.67891
## [1] 96.67884
## [1] 96.67879
## [1] 96.67875
## [1] 96.67871
## [1] 96.67868
## [1] 96.67865
## [1] 96.67863
## [1] 96.67861
## [1] 96.67859
c(steps, previous_loss, next_loss)
## [1] 695.00000  96.67844  96.67854
beta - previous_beta
##               [,1]
##  [1,] -0.028475498
##  [2,]  0.177765248
##  [3,] -0.004290968
##  [4,] -0.017128255
##  [5,] -0.054355122
##  [6,] -0.050542705
##  [7,] -0.047081517
##  [8,] -0.169549647
##  [9,]  0.086593620
## [10,] -0.106931045
mean((beta - previous_beta)^2)
## [1] 0.008812863
coef(lm(y ~ X - 1)) - previous_beta
##                [,1]
##  [1,] -1.845182e-04
##  [2,]  6.105159e-04
##  [3,]  1.512364e-04
##  [4,]  1.368817e-04
##  [5,]  2.888292e-04
##  [6,] -5.384685e-05
##  [7,]  4.033052e-04
##  [8,]  3.873572e-04
##  [9,]  3.729351e-05
## [10,]  5.589654e-05

Optional:

Stochastic gradient descent

Find a problem scale (n and p) large enough that your previous implementation runs slowly (perhaps taking a few minutes to converge). Implement mini-batch gradient descent with a smaller subset size and apply it to the new data. Compare the running times and estimation accuracy of the two implementations; a sketch of one possible mini-batch update follows below.
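
A minimal sketch of what the mini-batch version might look like, reusing least_squares_gradient on a random subset of rows at each step; the function name, batch size, step-size schedule, and fixed iteration count are all arbitrary choices rather than part of the exercise.

# Sketch: mini-batch gradient descent for least squares
# (batch size, step schedule, and iteration count are arbitrary choices)
minibatch_gd <- function(X, y, batch_size = 32, n_iter = 5000, step0 = 0.01) {
  beta_hat <- rep(0, ncol(X))
  for (i in seq_len(n_iter)) {
    rows <- sample(nrow(X), batch_size)          # random subset of observations
    grad <- least_squares_gradient(X[rows, , drop = FALSE], y[rows], beta_hat)
    beta_hat <- beta_hat - (step0 / sqrt(i)) * grad / batch_size   # decaying step size
  }
  beta_hat
}

Running times could then be compared with system.time(), and estimation accuracy by comparing mean squared error of the coefficients (as in the earlier chunk) for the full-gradient loop and the mini-batch version on the same simulated data.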