During the live seminar we’ll implement the following for ordinary least squares. Completing the implementation for ridge regression might appear on the next problem set, and/or some simple mathematical analysis of it could be examinable, so this is good practice.
(You can delete this section if you get error messages about LaTeX)
Consider the loss function
\[L(\beta) = \| \mathbf y - \mathbf X \beta \|^2 + \lambda \| \beta \|^2\] You might also find it helpful to write it as
\[L(\beta) = \sum_{i=1}^n (y_i - \mathbf x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2\] Compute the gradient \(\nabla L\) of this loss function, and use your answer to write an R
function below that inputs \((X, y, \beta, \lambda)\) and outputs the value of the gradient. Also write a function that takes the same inputs and outputs the value of the ridge loss.
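For reference, differentiating the penalized form term by term gives
\[\nabla L(\beta) = -2 \mathbf X^T (\mathbf y - \mathbf X \beta) + 2 \lambda \beta,\] and setting \(\lambda = 0\) recovers the ordinary least squares case implemented below.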
(Don’t delete below here)
# Gradient of the least squares loss (the ridge penalty term is not included yet)
least_squares_gradient <- function(x, y, beta) {
  -2 * t(x) %*% (y - x %*% beta)
}
# Least squares loss (again without the penalty term)
least_squares_loss <- function(x, y, beta) {
  sum((y - x %*% beta)^2)
}
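The two functions above cover only the unpenalized part. One possible sketch of the ridge versions the prompt asks for (the names ridge_gradient and ridge_loss are ours; lambda = 0 recovers least squares) is:
# Sketch: add the lambda * ||beta||^2 penalty to the loss and its derivative
# 2 * lambda * beta to the gradient (setting lambda = 0 gives least squares back)
ridge_gradient <- function(x, y, beta, lambda) {
  -2 * t(x) %*% (y - x %*% beta) + 2 * lambda * beta
}
ridge_loss <- function(x, y, beta, lambda) {
  sum((y - x %*% beta)^2) + lambda * sum(beta^2)
}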
Write a function that inputs a coefficient vector beta and a design matrix, and uses these to generate a vector of outcomes from a linear model with normal noise.
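A minimal sketch of such a function might look like the following (the name simulate_linear_model and the default unit noise standard deviation are illustrative choices, not fixed by the prompt); the chunk below generates data directly in the same way.
# Sketch: simulate y = X beta + noise with N(0, sd^2) noise (sd = 1 by default)
simulate_linear_model <- function(X, beta, sd = 1) {
  X %*% beta + rnorm(nrow(X), sd = sd)
}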
set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 3)
y <- X %*% beta + rnorm(n)
least_squares_gradient(X, y, beta)
## [,1]
## [1,] -7.655822
## [2,] 29.799193
## [3,] 7.017735
## [4,] -2.586372
## [5,] -10.374698
## [6,] 1.528508
## [7,] -17.181227
## [8,] -41.579952
## [9,] 12.879127
## [10,] -21.106348
Now generate a design matrix and coefficient vector, and use these with the previous function to generate an outcome vector.
# Choose n and p, consider making p large, maybe even larger than n
# Also think about the size of the coefficients, and sparsity
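# For example (not run here, so the output below still refers to the data generated above):
# n <- 50; p <- 200
# beta <- c(rpois(5, lambda = 3), rep(0, p - 5))  # sparse: only the first 5 entries are nonzero
# X <- matrix(rnorm(n * p), nrow = n)
# y <- X %*% beta + rnorm(n)

# Gradient descent on the full least squares loss, starting from beta = 0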
beta0 <- rep(0, p)
previous_loss <- least_squares_loss(X, y, beta0)
grad0 <- least_squares_gradient(X, y, beta0)
beta1 <- beta0 - 0.0001 * grad0
next_loss <- least_squares_loss(X, y, beta1)
previous_beta <- beta1
steps <- 1
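# Iterate: step against the gradient, normalised to unit length, with a geometrically
# decaying step size; stop once the change in the loss between iterations is below 0.0001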
while (abs(previous_loss - next_loss) > 0.0001) {
gradn <- least_squares_gradient(X, y, previous_beta)
next_beta <- previous_beta - (0.99)^steps * gradn / sqrt(sum(gradn^2))
if (steps %% 10 == 0) print(previous_loss)
steps <- steps + 1
previous_beta <- next_beta
previous_loss <- next_loss
next_loss <- least_squares_loss(X, y, next_beta)
}
## [1] 1536.472
## [1] 159.3278
## [1] 151.77
## [1] 145.2942
## [1] 139.7435
## [1] 135.0087
## [1] 130.9636
## [1] 125.5322
## [1] 110.2006
## [1] 108.4356
## [1] 106.9456
## [1] 105.6849
## [1] 104.6159
## [1] 103.7073
## [1] 102.9331
## [1] 102.2718
## [1] 101.7055
## [1] 101.0657
## [1] 99.00849
## [1] 98.3756
## [1] 98.16812
## [1] 97.99213
## [1] 97.8425
## [1] 97.71496
## [1] 97.60598
## [1] 97.51188
## [1] 97.00878
## [1] 96.96367
## [1] 96.92572
## [1] 96.89373
## [1] 96.8667
## [1] 96.84383
## [1] 96.82441
## [1] 96.8079
## [1] 96.79381
## [1] 96.78178
## [1] 96.77146
## [1] 96.71617
## [1] 96.71431
## [1] 96.70984
## [1] 96.70606
## [1] 96.70285
## [1] 96.70012
## [1] 96.69778
## [1] 96.69579
## [1] 96.69121
## [1] 96.68443
## [1] 96.68362
## [1] 96.68293
## [1] 96.68235
## [1] 96.68187
## [1] 96.68145
## [1] 96.6811
## [1] 96.6808
## [1] 96.68055
## [1] 96.68033
## [1] 96.68014
## [1] 96.67907
## [1] 96.67898
## [1] 96.67891
## [1] 96.67884
## [1] 96.67879
## [1] 96.67875
## [1] 96.67871
## [1] 96.67868
## [1] 96.67865
## [1] 96.67863
## [1] 96.67861
## [1] 96.67859
c(steps, previous_loss, next_loss)
## [1] 695.00000 96.67844 96.67854
beta - previous_beta
## [,1]
## [1,] -0.028475498
## [2,] 0.177765248
## [3,] -0.004290968
## [4,] -0.017128255
## [5,] -0.054355122
## [6,] -0.050542705
## [7,] -0.047081517
## [8,] -0.169549647
## [9,] 0.086593620
## [10,] -0.106931045
mean((beta - previous_beta)^2)
## [1] 0.008812863
coef(lm(y ~ X - 1)) - previous_beta
## [,1]
## [1,] -1.845182e-04
## [2,] 6.105159e-04
## [3,] 1.512364e-04
## [4,] 1.368817e-04
## [5,] 2.888292e-04
## [6,] -5.384685e-05
## [7,] 4.033052e-04
## [8,] 3.873572e-04
## [9,] 3.729351e-05
## [10,] 5.589654e-05
Optional:
Find a problem scale (n and p) large enough that your previous implementation runs slowly (perhaps taking a few minutes to converge). Implement mini-batch gradient descent with a smaller subset size and apply it to the new data. Compare the running times and estimation accuracy of the two implementations.
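One sketch of mini-batch gradient descent for the same loss, reusing the functions above (the function name, batch size, step size schedule, and stopping rule are all illustrative choices, not a prescribed solution):
# Sketch: mini-batch gradient descent for the least squares loss
# (batch_size, max_steps, tol and the decaying step size are illustrative, not prescribed)
minibatch_gd <- function(X, y, batch_size = 32, max_steps = 5000, tol = 1e-4) {
  beta_hat <- rep(0, ncol(X))
  previous_loss <- least_squares_loss(X, y, beta_hat)
  for (step in 1:max_steps) {
    idx <- sample(nrow(X), batch_size)                  # random subset of observations
    grad <- least_squares_gradient(X[idx, , drop = FALSE], y[idx], beta_hat)
    beta_hat <- beta_hat - 0.99^step * grad / sqrt(sum(grad^2))
    current_loss <- least_squares_loss(X, y, beta_hat)  # full loss, used only to monitor convergence
    if (abs(previous_loss - current_loss) < tol) break
    previous_loss <- current_loss
  }
  beta_hat
}
Timing both versions with system.time() and comparing mean((beta - estimate)^2), as above, gives the requested comparison of running time and estimation accuracy.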