During the live seminar we’ll implement the following for ordinary least squares. Completing the implementation for ridge regression might appear on the next problem set, and/or some simple mathematical analysis of it could be examinable, so this is good practice.
(You can delete this section if you get error messages about LaTeX)
Consider the loss function
\[L(\beta) = \| \mathbf y - \mathbf X \beta \|^2 + \lambda \| \beta \|^2\] You might also find it helpful to write it as
\[L(\beta) = \sum_{i=1}^n (y_i - \mathbf x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2\] Compute the gradient \(\nabla L\) of this loss function, and use your answer to write an R
function below that inputs \((X, y, \beta, \lambda)\) and outputs the value of the gradient. Also write a function that takes the same inputs and outputs the value of the ridge loss.
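For reference, differentiating the penalized form term by term gives
\[\nabla L(\beta) = -2 \mathbf X^T (\mathbf y - \mathbf X \beta) + 2 \lambda \beta,\] and setting \(\lambda = 0\) recovers the ordinary least squares case implemented below.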
(Don’t delete below here)
# Gradient of the least squares loss (the ridge penalty term is not included yet)
least_squares_gradient <- function(x, y, beta) {
  -2 * t(x) %*% (y - x %*% beta)
}
# Least squares loss (again without the penalty term)
least_squares_loss <- function(x, y, beta) {
  sum((y - x %*% beta)^2)
}
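The two functions above cover only the unpenalized part. One possible sketch of the ridge versions the prompt asks for (the names ridge_gradient and ridge_loss are ours; lambda = 0 recovers least squares) is:
# Sketch: add the lambda * ||beta||^2 penalty to the loss and its derivative
# 2 * lambda * beta to the gradient (setting lambda = 0 gives least squares back)
ridge_gradient <- function(x, y, beta, lambda) {
  -2 * t(x) %*% (y - x %*% beta) + 2 * lambda * beta
}
ridge_loss <- function(x, y, beta, lambda) {
  sum((y - x %*% beta)^2) + lambda * sum(beta^2)
}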
Write a function that inputs a coefficient vector beta and a design matrix, and uses these to generate a vector of outcomes from a linear model with normal noise.
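A minimal sketch of such a function might look like the following (the name simulate_linear_model and the default unit noise standard deviation are illustrative choices, not fixed by the prompt); the chunk below generates data directly in the same way.
# Sketch: simulate y = X beta + noise with N(0, sd^2) noise (sd = 1 by default)
simulate_linear_model <- function(X, beta, sd = 1) {
  X %*% beta + rnorm(nrow(X), sd = sd)
}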
set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 3)
y <- X %*% beta + rnorm(n)
least_squares_gradient(X, y, beta)
## [,1]
## [1,] -7.655822
## [2,] 29.799193
## [3,] 7.017735
## [4,] -2.586372
## [5,] -10.374698
## [6,] 1.528508
## [7,] -17.181227
## [8,] -41.579952
## [9,] 12.879127
## [10,] -21.106348
Now generate a design matrix and coefficient vector, and use these with the previous function to generate an outcome vector.
# Choose n and p, consider making p large, maybe even larger than n
# Also think about the size of the coefficients, and sparsity
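# For example (not run here, so the output below still refers to the data generated above):
# n <- 50; p <- 200
# beta <- c(rpois(5, lambda = 3), rep(0, p - 5))  # sparse: only the first 5 entries are nonzero
# X <- matrix(rnorm(n * p), nrow = n)
# y <- X %*% beta + rnorm(n)

# Gradient descent on the full least squares loss, starting from beta = 0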
beta0 <- rep(0, p)
previous_loss <- least_squares_loss(X, y, beta0)
grad0 <- least_squares_gradient(X, y, beta0)
beta1 <- beta0 - 0.0001 * grad0
next_loss <- least_squares_loss(X, y, beta1)
previous_beta <- beta1
steps <- 1
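# Iterate: step against the gradient, normalised to unit length, with a geometrically
# decaying step size; stop once the change in the loss between iterations is below 0.0001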
while (abs(previous_loss - next_loss) > 0.0001) {
gradn <- least_squares_gradient(X, y, previous_beta)
next_beta <- previous_beta - (0.99)^steps * gradn / sqrt(sum(gradn^2))
if (steps %% 10 == 0) print(previous_loss)
steps <- steps + 1
previous_beta <- next_beta
previous_loss <- next_loss
next_loss <- least_squares_loss(X, y, next_beta)
}
## [1] 1536.472
## [1] 159.3278
## [1] 151.77
## [1] 145.2942
## [1] 139.7435
## [1] 135.0087
## [1] 130.9636
## [1] 125.5322
## [1] 110.2006
## [1] 108.4356
## [1] 106.9456
## [1] 105.6849
## [1] 104.6159
## [1] 103.7073
## [1] 102.9331
## [1] 102.2718
## [1] 101.7055
## [1] 101.0657
## [1] 99.00849
## [1] 98.3756
## [1] 98.16812
## [1] 97.99213
## [1] 97.8425
## [1] 97.71496
## [1] 97.60598
## [1] 97.51188
## [1] 97.00878
## [1] 96.96367
## [1] 96.92572
## [1] 96.89373
## [1] 96.8667
## [1] 96.84383
## [1] 96.82441
## [1] 96.8079
## [1] 96.79381
## [1] 96.78178
## [1] 96.77146
## [1] 96.71617
## [1] 96.71431
## [1] 96.70984
## [1] 96.70606
## [1] 96.70285
## [1] 96.70012
## [1] 96.69778
## [1] 96.69579
## [1] 96.69121
## [1] 96.68443
## [1] 96.68362
## [1] 96.68293
## [1] 96.68235
## [1] 96.68187
## [1] 96.68145
## [1] 96.6811
## [1] 96.6808
## [1] 96.68055
## [1] 96.68033
## [1] 96.68014
## [1] 96.67907
## [1] 96.67898
## [1] 96.67891
## [1] 96.67884
## [1] 96.67879
## [1] 96.67875
## [1] 96.67871
## [1] 96.67868
## [1] 96.67865
## [1] 96.67863
## [1] 96.67861
## [1] 96.67859
c(steps, previous_loss, next_loss)
## [1] 695.00000 96.67844 96.67854
beta - previous_beta
## [,1]
## [1,] -0.028475498
## [2,] 0.177765248
## [3,] -0.004290968
## [4,] -0.017128255
## [5,] -0.054355122
## [6,] -0.050542705
## [7,] -0.047081517
## [8,] -0.169549647
## [9,] 0.086593620
## [10,] -0.106931045
mean((beta - previous_beta)^2)
## [1] 0.008812863
coef(lm(y ~ X - 1)) - previous_beta
## [,1]
## [1,] -1.845182e-04
## [2,] 6.105159e-04
## [3,] 1.512364e-04
## [4,] 1.368817e-04
## [5,] 2.888292e-04
## [6,] -5.384685e-05
## [7,] 4.033052e-04
## [8,] 3.873572e-04
## [9,] 3.729351e-05
## [10,] 5.589654e-05
Optional:
Find a problem scale (n and p) large enough that your previous implementation runs slowly (perhaps taking a few minutes to converge). Implement mini-batch gradient descent with a smaller subset size and apply it to the new data. Compare the running times and estimation accuracy of the two implementations.
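One sketch of mini-batch gradient descent for the same loss, reusing the functions above (the function name, batch size, step size schedule, and stopping rule are all illustrative choices, not a prescribed solution):
# Sketch: mini-batch gradient descent for the least squares loss
# (batch_size, max_steps, tol and the decaying step size are illustrative, not prescribed)
minibatch_gd <- function(X, y, batch_size = 32, max_steps = 5000, tol = 1e-4) {
  beta_hat <- rep(0, ncol(X))
  previous_loss <- least_squares_loss(X, y, beta_hat)
  for (step in 1:max_steps) {
    idx <- sample(nrow(X), batch_size)                  # random subset of observations
    grad <- least_squares_gradient(X[idx, , drop = FALSE], y[idx], beta_hat)
    beta_hat <- beta_hat - 0.99^step * grad / sqrt(sum(grad^2))
    current_loss <- least_squares_loss(X, y, beta_hat)  # full loss, used only to monitor convergence
    if (abs(previous_loss - current_loss) < tol) break
    previous_loss <- current_loss
  }
  beta_hat
}
Timing both versions with system.time() and comparing mean((beta - estimate)^2), as above, gives the requested comparison of running time and estimation accuracy.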