Homework 2
Advanced Topics in Statistical Learning, Spring 2024
Due Friday March 1
1 Properties of RKHS regression [32 points]
In this exercise, we will work out some facts about RKHS regression.
(a) First, let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be an arbitrary kernel. Recall that this means there exists a feature map
$\phi : \mathcal{X} \to \mathcal{H}$, and a Hilbert space $\mathcal{H}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, such that for any $x, y \in \mathcal{X}$, $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$.
Prove that $k$ is positive semidefinite, meaning, it is symmetric, and for any $n \ge 1$ and $x_1, \ldots, x_n \in \mathcal{X}$, if we define a matrix $K \in \mathbb{R}^{n \times n}$ to have entries $K_{ij} = k(x_i, x_j)$, then [6 pts]
\[ a^\top K a \ge 0, \quad \text{for all } a \in \mathbb{R}^n. \]
Hint: express $a^\top K a$ in terms of the feature map $\phi$.
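As an optional sanity check (not a substitute for the proof), the positive semidefiniteness claim can be verified numerically for a specific kernel. The sketch below is a minimal illustration, assuming a Gaussian kernel (one arbitrary choice) and random inputs:

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_kernel(x, y, bandwidth=1.0):
        # k(x, y) = exp(-||x - y||_2^2 / (2 * bandwidth^2)), one example of a kernel
        return np.exp(-np.sum((x - y) ** 2) / (2 * bandwidth ** 2))

    n, d = 30, 2
    xs = rng.uniform(size=(n, d))
    K = np.array([[gaussian_kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)])

    # a^T K a should be nonnegative for any a, since K is positive semidefinite
    for _ in range(1000):
        a = rng.normal(size=n)
        assert a @ K @ a >= -1e-10  # allow tiny numerical error

    print("smallest eigenvalue of K:", np.linalg.eigvalsh(K).min())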
(b) Henceforth, suppose that $k$ is the reproducing kernel for $\mathcal{H}$ (and hence $\mathcal{H}$ is an RKHS). Recall that this means the following two properties are satisfied:
(i) for any $x \in \mathcal{X}$, the function $k(\cdot, x)$ is an element of $\mathcal{H}$;
(ii) for any function $f \in \mathcal{H}$ and $x \in \mathcal{X}$, it holds that $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$.
Let $f$ be a function of the form
\[ f = \sum_{i=1}^n \beta_i\, k(\cdot, x_i), \]
for coefficients $\beta_1, \ldots, \beta_n \in \mathbb{R}$. Show that [4 pts]
\[ \|f\|_{\mathcal{H}}^2 = \beta^\top K \beta. \]
(c) Let $h$ be any function (in $\mathcal{H}$) that is orthogonal (with respect to $\langle \cdot, \cdot \rangle_{\mathcal{H}}$) to the linear space of functions of the form in part (b). Prove that [6 pts]
\[ h(x_i) = 0, \quad i = 1, \ldots, n, \]
and
\[ \|f + h\|_{\mathcal{H}} \ge \|f\|_{\mathcal{H}}, \quad \text{with equality iff } h = 0. \]
(d) Argue that for any $\lambda > 0$, the infinite-dimensional RKHS optimization problem
\[ \mathop{\mathrm{minimize}}_{f} \; \sum_{i=1}^n \big( y_i - f(x_i) \big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \]
(where the minimization is implicitly over $f \in \mathcal{H}$) has a unique solution of the form in part (b), and we can rewrite it as [4 pts]
\[ \mathop{\mathrm{minimize}}_{\beta} \; \|Y - K\beta\|_2^2 + \lambda\, \beta^\top K \beta, \]
for the matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = k(x_i, x_j)$.
For the uniqueness part, you may assume that $k$ is a (strictly) positive definite kernel, so that $K$ is a (strictly) positive definite matrix.
Hint: let g = f + h and use the results in part (c) to argue that g has a larger criterion value, unless h = 0. Use part (b) to complete the reduction to finite-dimensional form.
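As an aside (not required for your solution), once the problem is in the finite-dimensional form above, the zero-gradient condition reads $-2K(Y - K\beta) + 2\lambda K\beta = 0$, so when $K$ is strictly positive definite the minimizer is $\hat\beta = (K + \lambda I)^{-1} Y$. A minimal numerical sketch, assuming a Gaussian kernel and synthetic data (all choices below are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)

    def gram(xs, bandwidth=0.5):
        # Gaussian kernel Gram matrix (an arbitrary illustrative kernel choice)
        d2 = np.sum((xs[:, None, :] - xs[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))

    n = 50
    x = rng.uniform(size=(n, 1))
    y = np.sin(4 * x[:, 0]) + 0.1 * rng.normal(size=n)

    lam = 0.1
    K = gram(x)
    beta = np.linalg.solve(K + lam * np.eye(n), y)  # hat{beta} = (K + lam I)^{-1} Y

    # Check the zero-gradient condition: -2 K (Y - K beta) + 2 lam K beta = 0
    grad = -2 * K @ (y - K @ beta) + 2 * lam * K @ beta
    print("max abs gradient entry:", np.max(np.abs(grad)))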
(e) Finally, we establish a cool fact about leave-one-out cross-validation (LOOCV) in RKHS regression problems. Recall that in general, the LOOCV error of an estimator $\hat f$ is defined as
\[ \mathrm{CV}(\hat f) = \frac{1}{n} \sum_{i=1}^n \big( y_i - \hat f^{-i}(x_i) \big)^2, \]
where $\hat f^{-i}$ is the estimator trained on all but the $i$th pair $(x_i, y_i)$. Prove the following shortcut formula for LOOCV with an RKHS regression estimator $\hat f$: [12 pts]
\[ \frac{1}{n} \sum_{i=1}^n \big( y_i - \hat f^{-i}(x_i) \big)^2 = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat f(x_i)}{1 - S_{ii}} \right)^2, \]
where $S = K(K + \lambda I)^{-1}$ is the smoother matrix for the RKHS estimator (so that the vector of fitted values is given by $\hat Y = SY$).
Hint: prove that for each $i$,
\[ \hat f(x_i) = S_{ii}\, y_i + (1 - S_{ii})\, \hat f^{-i}(x_i). \]
The desired result will follow by rearranging, squaring both sides, and summing over $i = 1, \ldots, n$. There are different ways to establish the above display; one nice way is as follows. Consider solving the RKHS regression problem on
\[ (x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), (x_{i+1}, y_{i+1}), \ldots, (x_n, y_n), \]
and consider solving it on
\[ (x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), (x_i, \tilde y_i), (x_{i+1}, y_{i+1}), \ldots, (x_n, y_n), \]
where $\tilde y_i = \hat f^{-i}(x_i)$. Argue that these should produce the same solutions. Derive the zero-gradient condition for optimality (differentiate the criterion and set it equal to zero) for each problem, and use these to solve for $\tilde y_i = \hat f^{-i}(x_i)$.
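The shortcut can also be checked numerically: the sketch below compares brute-force LOOCV (refitting with each pair held out) against the formula based on $S = K(K + \lambda I)^{-1}$, using the closed form from part (d) and a Gaussian kernel; it is only an illustration, and all parameter choices are arbitrary:

    import numpy as np

    rng = np.random.default_rng(2)

    def gram(xa, xb, bandwidth=0.5):
        # Gaussian kernel matrix between two sets of points (illustrative choice)
        d2 = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))

    n = 40
    x = rng.uniform(size=(n, 1))
    y = np.sin(4 * x[:, 0]) + 0.1 * rng.normal(size=n)
    lam = 0.5

    K = gram(x, x)
    S = K @ np.linalg.inv(K + lam * np.eye(n))  # smoother matrix
    yhat = S @ y                                # fitted values

    # Brute-force LOOCV: refit with the ith pair held out, then predict at x_i
    loo = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Ki = gram(x[keep], x[keep])
        beta = np.linalg.solve(Ki + lam * np.eye(n - 1), y[keep])
        loo[i] = (gram(x[i:i + 1], x[keep]) @ beta)[0]  # \hat f^{-i}(x_i)

    cv_brute = np.mean((y - loo) ** 2)
    cv_short = np.mean(((y - yhat) / (1 - np.diag(S))) ** 2)
    print(cv_brute, cv_short)  # these should agree up to numerical error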
2 Minimax lower bounds for Hölder classes [30 points]
Consider the $d$-dimensional Hölder ball
\[ C^s(L; [0,1]^d) = \Big\{ f : [0,1]^d \to \mathbb{R} \;:\; |D^\alpha f(x) - D^\alpha f(z)| \le L \|x - z\|_2 \ \text{for all } |\alpha| = s - 1, \text{ and } x, z \in [0,1]^d \Big\}, \]
where $s \ge 1$ is an integer and $L > 0$ is a constant (not growing with $n$). Consider i.i.d. samples $(x_i, y_i)$, $i = 1, \ldots, n$, where each
\[ y_i = f(x_i) + \epsilon_i, \quad x_i \sim \mathrm{Unif}([0,1]^d), \quad \epsilon_i \sim N(0, \sigma^2), \quad \text{and } x_i \perp\!\!\!\perp \epsilon_i. \]
Generalize the calculations given in the minimax theory lecture for the minimax lower bound over the Lipschitz ball to show that [30 pts]
\[ \inf_{\hat f} \, \sup_{f \in C^s(L; [0,1]^d)} \mathbb{E}\, \|\hat f - f\|_2^2 \gtrsim n^{-\frac{2s}{2s + d}}. \]
Make sure to write out the full calculation from start to finish (even detailing the steps that may be analogous to those in lecture).
Hint: consider defining (adhering to the notation used in the lecture)
where $K$ is any smooth function supported on the unit $\ell_2$ ball $\{x : \|x\|_2 \le 1\}$, with the following properties: $K(0) = 1$, $0 < \int K(x)^2\, dx < \infty$, and $D^\beta K$ is 1-Lipschitz for all $|\beta| < s$ (we can equivalently assume bounded derivatives: $\|D^\beta K\|_2 \le 1$ for all $|\beta| \le s$). Then follow the calculations in the lecture, where we used Fano's method and the Varshamov-Gilbert lemma.
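For concreteness, a classical example of a smooth, compactly supported function with $K(0) = 1$ is the bump function $K(x) = \exp\big(1 - 1/(1 - \|x\|_2^2)\big)$ for $\|x\|_2 < 1$ and $K(x) = 0$ otherwise; whether its derivative bounds match the exact constants above depends on normalization, so treat the sketch below only as an illustration of the shape of such a function:

    import numpy as np

    def bump(x):
        # K(x) = exp(1 - 1/(1 - ||x||_2^2)) for ||x||_2 < 1, and 0 otherwise
        x = np.atleast_2d(x)
        r2 = np.sum(x ** 2, axis=-1)
        out = np.zeros(len(r2))
        inside = r2 < 1
        out[inside] = np.exp(1.0 - 1.0 / (1.0 - r2[inside]))
        return out

    d = 2
    print("K(0) =", bump(np.zeros((1, d)))[0])  # equals 1

    # Monte Carlo estimate of the integral of K^2 over [-1, 1]^d (positive and finite)
    rng = np.random.default_rng(3)
    u = rng.uniform(-1, 1, size=(200000, d))
    print("integral of K^2, approx:", (2 ** d) * np.mean(bump(u) ** 2))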
3 Sub-Gaussian maximal inequalities [24 points]
In this exercise, we will derive tail bounds on maxima of sub-Gaussian random variables $X_i$, $i = 1, \ldots, n$. Suppose each $X_i$ has mean zero and variance proxy $\sigma^2$. (We assume nothing about their dependence structure.)
(a) Prove that for any $\lambda \in \mathbb{R}$, [6 pts]
\[ \mathbb{E}\Big[ e^{\lambda \max_{i=1,\ldots,n} X_i} \Big] \le n\, e^{\lambda^2 \sigma^2 / 2}. \]
(b) Rearrange the result in part (a), then choose a suitable value of $\lambda$ to show that [4 pts]
\[ \mathbb{E}\Big[ \max_{i=1,\ldots,n} X_i \Big] \le \sigma \sqrt{2 \log n}. \]
(c) Prove that for any $\lambda \ge 0$, [4 pts]
\[ \mathbb{P}\Big( \max_{i=1,\ldots,n} X_i \ge \lambda \Big) \le n\, e^{-\lambda^2 / (2\sigma^2)}. \]
Hint: use the fact that $\mathbb{P}(X_i \ge \lambda) \le e^{-\lambda^2 / (2\sigma^2)}$, for any $\lambda \ge 0$, which you can view as a consequence of the tail bound for sub-Gaussian averages when $n = 1$.
(d) Reparametrize the result in part (c) to show that for any $t > 0$, [2 pts]
(e) Now, we turn to the question of the role of dependence: do correlations between $X_i$, $i = 1, \ldots, n$, make their maximum stochastically larger or smaller? Conduct (and include the results of) a small simulation in order to inform your answer. [8 pts]
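A minimal simulation sketch along these lines, assuming (as one concrete setting, chosen only for illustration) equicorrelated Gaussian $X_i$ with common correlation $\rho$:

    import numpy as np

    rng = np.random.default_rng(4)

    def mean_max(n, rho, sigma=1.0, reps=5000):
        # Equicorrelated Gaussians: X = sigma * (sqrt(rho) * Z0 + sqrt(1 - rho) * Z),
        # so that Var(X_i) = sigma^2 and Corr(X_i, X_j) = rho for i != j.
        z0 = rng.normal(size=(reps, 1))
        z = rng.normal(size=(reps, n))
        x = sigma * (np.sqrt(rho) * z0 + np.sqrt(1 - rho) * z)
        return np.mean(np.max(x, axis=1))

    n = 100
    for rho in [0.0, 0.25, 0.5, 0.9]:
        print(rho, mean_max(n, rho))
    print("sigma * sqrt(2 log n) =", np.sqrt(2 * np.log(n)))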
4 Risk analysis for the constrained lasso [24 points]
This exercise explores simple in-sample and out-of-sample risk bounds for the lasso. Assume that we observe i.i.d. $(x_i, y_i) \in [-M, M]^d \times \mathbb{R}$, $i = 1, \ldots, n$, where each
\[ y_i = x_i^\top \beta_0 + \epsilon_i, \]
and each $\epsilon_i$ is sub-Gaussian with mean zero and variance proxy $\sigma^2$. Consider the constrained-form lasso estimator $\hat\beta$, which solves
\[ \mathop{\mathrm{minimize}}_{\beta} \; \|Y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \le t, \]
where $Y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times d}$ is the predictor matrix (whose $i$th row is $x_i \in \mathbb{R}^d$), and $t \ge 0$ is a tuning parameter.
(a) Prove that the lasso estimator, with $t = \|\beta_0\|_1$, has in-sample risk satisfying [6 pts]
where the expectation is taken over the training data $(x_i, y_i)$, $i = 1, \ldots, n$.
Hint: follow the strategy that we used in lecture to derive the “slow” rate for the constrained lasso. Note that this was for fixed $X$, so you will need to condition on $X$ here. Then apply the result from Q3 part (b). You may use the fact that for $\epsilon = (\epsilon_1, \ldots, \epsilon_n) \in \mathbb{R}^n$, a vector of i.i.d. sub-Gaussians with mean zero and variance proxy $\sigma^2$, and an arbitrary fixed vector $a \in \mathbb{R}^n$, the random variable
$a^\top \epsilon$ is mean zero sub-Gaussian with variance proxy $\sigma^2 \|a\|_2^2$.
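For reference, this last fact follows from a short moment generating function calculation, sketched here under the assumption that the $\epsilon_i$ are independent (as in the i.i.d. setup above): for any $\lambda \in \mathbb{R}$,
\[ \mathbb{E}\big[ e^{\lambda\, a^\top \epsilon} \big] = \prod_{i=1}^n \mathbb{E}\big[ e^{\lambda a_i \epsilon_i} \big] \le \prod_{i=1}^n e^{\lambda^2 a_i^2 \sigma^2 / 2} = e^{\lambda^2 \sigma^2 \|a\|_2^2 / 2}, \]
which is exactly the sub-Gaussian MGF bound with variance proxy $\sigma^2 \|a\|_2^2$.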
(b) For i.i.d. mean zero random variables $Z_i$, $i = 1, \ldots, n$, that lie almost surely in $[a, b]$, prove that for any $t \in \mathbb{R}$, [4 pts]
Hint: you may use the fact that each $Z_i$ is sub-Gaussian with variance proxy $(b - a)^2/4$, and the hint about linear combinations of sub-Gaussians from part (a).
(c) Let $\Sigma$ denote the predictor covariance matrix, that is, $\Sigma = \mathbb{E}[x_0 x_0^\top]$ for a draw $x_0$ from the predictor distribution. Let $\hat\Sigma = X^\top X / n$ be the empirical covariance matrix, and let $V = \hat\Sigma - \Sigma$. Prove that [4 pts]
Hint: apply part (b) to the entries of $V$.
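An illustrative simulation of the quantity being controlled here, assuming (purely for the sketch) independent $\mathrm{Unif}[-M, M]$ coordinates, so that $\Sigma = (M^2/3) I$:

    import numpy as np

    rng = np.random.default_rng(5)

    def max_entry_dev(n, d, M=1.0, reps=200):
        # Average (over repetitions) of the max absolute entry of V = hat{Sigma} - Sigma
        sigma_true = (M ** 2 / 3.0) * np.eye(d)
        devs = []
        for _ in range(reps):
            x = rng.uniform(-M, M, size=(n, d))
            v = x.T @ x / n - sigma_true
            devs.append(np.max(np.abs(v)))
        return np.mean(devs)

    d = 50
    for n in [100, 400, 1600]:
        print(n, max_entry_dev(n, d), np.sqrt(np.log(d) / n))  # compare the scaling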
(d) Prove that the lasso estimator, with $t = \|\beta_0\|_1$, has out-of-sample risk satisfying [10 pts]
where the expectation is taken over the training data $(x_i, y_i)$, $i = 1, \ldots, n$, and an independent draw $x_0$ from the predictor distribution.
Hint: first, argue that the in-sample risk and out-of-sample risk can be written as
\[ \mathbb{E}\big[ (\hat\beta - \beta_0)^\top \hat\Sigma\, (\hat\beta - \beta_0) \big] \quad \text{and} \quad \mathbb{E}\big[ (\hat\beta - \beta_0)^\top \Sigma\, (\hat\beta - \beta_0) \big], \]
respectively. (Note that the expectations above are each taken with respect to the training samples $(x_i, y_i)$, $i = 1, \ldots, n$ only; there is nothing else that is random.) Next, argue that
where recall $V_{jk} = (\hat\Sigma - \Sigma)_{jk}$. Then do a little bit more algebra to bound the right-hand side above and apply the previous parts of this question to conclude the result.
(e) The bound derived in this question for the out-of-sample risk is always larger than that for the in-sample risk (by nature of its construction). As a bonus, investigate: can the out-of-sample risk of the lasso be lower than the in-sample risk? Use a simulation, a pointer to an experiment or result in the literature, or any means of answering that you believe provides a convincing argument.
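One possible starting point for this bonus is the simulation sketch below; note that it uses scikit-learn's penalized-form Lasso as a stand-in for the constrained-form estimator above (an assumption made purely for convenience), and all parameter values are illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(6)

    n, d, sigma, M = 100, 200, 1.0, 1.0
    beta0 = np.zeros(d)
    beta0[:5] = 1.0  # a sparse true coefficient vector

    def risks(reps=50, lam=0.1):
        # Compare in-sample and out-of-sample risk, averaged over repetitions
        in_risk, out_risk = [], []
        Sigma = (M ** 2 / 3.0) * np.eye(d)  # covariance of Unif[-M, M]^d predictors
        for _ in range(reps):
            X = rng.uniform(-M, M, size=(n, d))
            y = X @ beta0 + sigma * rng.normal(size=n)
            bhat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
            delta = bhat - beta0
            in_risk.append(delta @ (X.T @ X / n) @ delta)  # (bhat - b0)^T hatSigma (bhat - b0)
            out_risk.append(delta @ Sigma @ delta)         # (bhat - b0)^T Sigma (bhat - b0)
        return np.mean(in_risk), np.mean(out_risk)

    print(risks())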