- 1
- 4
- Write down the loss function and derive the gradient for logistic regression
- Derive the gradient of the function that inputs a coefficient vector and outputs its squared 2-norm, writing your answer in vector notation

- 3
- 5(a-f) [save your code for this because we’ll re-use it in the next exercise set]

The remaining exercises are optional and will not be graded.

The \(L^1\)-norm of a \(d\)-dimensional vector \(\beta = (\beta_1, \ldots, \beta_d)\) is \(\| \beta \|_1 = \sum_{j=1}^d |\beta_j|\), i.e. the sum of absolute values of its components. More generally, the \(L^p\)-norm is \(\| \beta \|_p = (\sum_{j=1}^d |\beta_j|^p)^{1/p}\). There is also the \(L^\infty\)-norm \(\| \beta \|_\infty = \max_j | \beta_j |\). In machine learning we most often consider only the cases \(p = 1, 2, \infty\), and we often use the notation \(p\) for the dimension of the space instead.

- Prove that each of these are norms, i.e. show that they satisfy the definition of a norm.
- For each of these derive a formula for its gradient (at a point where \(\beta_j \neq 0\) for all \(j\))
- Show that the \(L^0\)-“norm,” \(\| \beta \|_0 = | \{ j : \beta_j \neq 0 \}|\) which counts the number of non-zero components of \(\beta\) does not satisfy the definition of a norm

- Read the linked Wikipedia article, particularly the sections on Projectors, Examples, Linearly independent columns, and Applications
- In R, generate a high-dimensional matrix of predictors
`X`

, a sparse cofficient vector`beta`

(having only a few nonzero entries), and an outcome from the noise-free linear model`y = X %*% beta`

. Use the`svd`

function in R to compute the pseudoinverse of`X`

and find the*minimum (\(L^2\)-)norm solution*to the under-determined linear equation \(y = X b\). Is this solution sparse like the original`beta`

?

Generate some synthetic data including a design matrix (predictors) with

`n <- 100`

,`p <- 200`

, a (sparse) coefficient vector with only 20 nonzero coefficients, and a binary outcome (using the`rbinom`

function) with probability determined by the inverse logistic function applied to`X %*% beta`

. In other words, the data generation process is truly a logistic regression probability model.**Adjust your linear function coefficients so that the classes are roughly balanced**or at least not very imbalanced. (You may have numerical errors later if, for example, the entire training data only has observations with`y = 1`

and no observations with`y = 0`

).Pick 2 of the predictors that have nonzero coefficients (from among the 20 that you know) and create a plot with these two predictors as the horizontal and vertical axes and the binary values of the outcome determine the color of the points.

Using only these two predictors, fit a logistic regression with the

`glm`

function. Use the fitted low-dimensional model to**classify**test data as follows: generate 100 new observations from the same model as test data and calculate the misclassification rate, e.g.`mean(y == yhat)`

(note that for classification you need to classify as 0 or 1 rather than the fitted probability`phat`

)Write down the logistic loss function

**with a 2-norm penalty on the coefficient vector**, and derive a formula for its gradient. (Your code should allow changing the penalty parameter`lambda`

in front of the 2-norm)Write R functions to evaluate the loss function and gradient

Implement gradient descent, using the Barzilai–Borwein method for step size described here – run it on your synthetic data starting from an initial estimate of

`beta <- rep(0, p)`

. Compare the estimated value of beta to the true beta in several ways: compute the MSE of the estimate, check to see if the estimate is sparse like the true vector, and compare the nonzero values of the true beta to their estimated values, making note of anything that stands out to youStart gradient descent from a few random initial estimates, try

`beta <- rnorm(p)`

a few times and try`beta <- rnorm(p, sd = 10)`

a few times, and each time compute the MSE of the estimate after convergence. Can you notice/conjecture anything about these solutions?Use one of these gradient descent fits to classify the same test data from part (c) and compare the misclassification rate to the low-dimensional model from that part

Experiment with running gradient descent for different values of the 2-norm penalty parameter and see if any of your previous conclusions must be changed because they depend strongly on the choice of

`lambda`