ISLR Chapter 4, Section 4.7

ISLR Chapter 9, Section 9.7

Optional practice

The remaining exercises are optional and will not be graded.


The \(L^1\)-norm of a \(d\)-dimensional vector \(\beta = (\beta_1, \ldots, \beta_d)\) is \(\| \beta \|_1 = \sum_{j=1}^d |\beta_j|\), i.e. the sum of absolute values of its components. More generally, the \(L^p\)-norm is \(\| \beta \|_p = (\sum_{j=1}^d |\beta_j|^p)^{1/p}\). There is also the \(L^\infty\)-norm \(\| \beta \|_\infty = \max_j | \beta_j |\). In machine learning we most often consider only the cases \(p = 1, 2, \infty\), and we often use the notation \(p\) for the dimension of the space instead.

  • Prove that each of these are norms, i.e. show that they satisfy the definition of a norm.
  • For each of these derive a formula for its gradient (at a point where \(\beta_j \neq 0\) for all \(j\))
  • Show that the \(L^0\)-“norm,” \(\| \beta \|_0 = | \{ j : \beta_j \neq 0 \}|\) which counts the number of non-zero components of \(\beta\) does not satisfy the definition of a norm


  • Read the linked Wikipedia article, particularly the sections on Projectors, Examples, Linearly independent columns, and Applications
  • In R, generate a high-dimensional matrix of predictors X, a sparse cofficient vector beta (having only a few nonzero entries), and an outcome from the noise-free linear model y = X %*% beta. Use the svd function in R to compute the pseudoinverse of X and find the minimum (\(L^2\)-)norm solution to the under-determined linear equation \(y = X b\). Is this solution sparse like the original beta?

Gradient descent for high-dim. logistic regression

  1. Generate some synthetic data including a design matrix (predictors) with n <- 100, p <- 200, a (sparse) coefficient vector with only 20 nonzero coefficients, and a binary outcome (using the rbinom function) with probability determined by the inverse logistic function applied to X %*% beta. In other words, the data generation process is truly a logistic regression probability model. Adjust your linear function coefficients so that the classes are roughly balanced or at least not very imbalanced. (You may have numerical errors later if, for example, the entire training data only has observations with y = 1 and no observations with y = 0).

  2. Pick 2 of the predictors that have nonzero coefficients (from among the 20 that you know) and create a plot with these two predictors as the horizontal and vertical axes and the binary values of the outcome determine the color of the points.

  3. Using only these two predictors, fit a logistic regression with the glm function. Use the fitted low-dimensional model to classify test data as follows: generate 100 new observations from the same model as test data and calculate the misclassification rate, e.g. mean(y == yhat) (note that for classification you need to classify as 0 or 1 rather than the fitted probability phat)

  4. Write down the logistic loss function with a 2-norm penalty on the coefficient vector, and derive a formula for its gradient. (Your code should allow changing the penalty parameter lambda in front of the 2-norm)

  5. Write R functions to evaluate the loss function and gradient

  6. Implement gradient descent, using the Barzilai–Borwein method for step size described here – run it on your synthetic data starting from an initial estimate of beta <- rep(0, p). Compare the estimated value of beta to the true beta in several ways: compute the MSE of the estimate, check to see if the estimate is sparse like the true vector, and compare the nonzero values of the true beta to their estimated values, making note of anything that stands out to you

  7. Start gradient descent from a few random initial estimates, try beta <- rnorm(p) a few times and try beta <- rnorm(p, sd = 10) a few times, and each time compute the MSE of the estimate after convergence. Can you notice/conjecture anything about these solutions?

  8. Use one of these gradient descent fits to classify the same test data from part (c) and compare the misclassification rate to the low-dimensional model from that part

  9. Experiment with running gradient descent for different values of the 2-norm penalty parameter and see if any of your previous conclusions must be changed because they depend strongly on the choice of lambda