Week 13¶
Exercise 1: Supporting hyperplane theorem¶
Exercise
In this exercise we will prove the supporting hyperplane theorem:
Let \(C\subset \mathbb R^n\) be a closed convex set. Suppose \(x\in \mathbb R^n\setminus \operatorname{int}(C)=( \mathbb R^n\setminus C)\cup \partial C\). Then there exists a vector \(a\in \mathbb R^n\), \(a\neq 0\), such that
\[
a^\top (y-x)\leq 0 \quad\text{for all } y\in C.
\]
That is, \(C\) is on one side of the hyperplane that goes through \(x\) and that is orthogonal to \(a\).
Remark: the theorem also holds if \(C\) is not closed. This can essentially be proved by replacing \(C\) with its closure \(\overline C\).
Exercise 1a)¶
Exercise
Let \(x\in \mathbb R^n\setminus C\) (so we exclude the case \(x\in\partial C\)). Using the fact that \(C\) is closed we can define the projection onto \(C\) by
\[
P_C(x)=\operatorname{argmin}_{y\in C}\tfrac12\|y-x\|^2.
\]
Use the optimality conditions from convex optimization to show that
\[
a=\frac{x-P_C(x)}{\|x-P_C(x)\|}
\]
satisfies
\[
a^\top (y-x)\leq 0 \quad\text{for all } y\in C.
\]
Recall that if \(x^*=\operatorname{argmin}_{x\in C}f(x)\) for a convex differentiable \(f\), then the optimality condition is \(\nabla f(x^*)^\top (y-x^*)\geq 0\) for all \(y\in C\). For the projection, where \(f(y)=\tfrac12\|y-x\|^2\) and \(x^*=P_C(x)\), this means that
\[
\big(P_C(x)-x\big)^\top\big(y-P_C(x)\big)\geq 0 \quad\text{for all } y\in C,
\]
that is, \(a^\top\big(y-P_C(x)\big)\leq 0\) for all \(y\in C\), and thus that
\[
a^\top (y-x) = \underbrace{a^\top\big(y-P_C(x)\big)}_{\leq 0} + a^\top\big(P_C(x)-x\big) < 0 \quad\text{for all } y\in C,
\]
since \(a^\top (P_C(x)-x) = -\|x-P_C(x)\| <0\).
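As a quick numerical sanity check we can take \(C\) to be the closed unit ball, for which the projection has the simple closed form \(P_C(x)=x/\max\{1,\|x\|\}\), and verify the inequality \(a^\top(y-x)\leq 0\) on random points \(y\in C\). The choice of set, point and sample size below are of course just illustrative:
[ ]:
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)
x = 2 * x / np.linalg.norm(x)                # a point at distance 2 from the origin, so outside C
P_Cx = x / max(1.0, np.linalg.norm(x))       # projection of x onto the closed unit ball C
a = (x - P_Cx) / np.linalg.norm(x - P_Cx)    # the direction constructed in exercise 1a)

Y = rng.normal(size=(1000, n))               # random points, projected into C below
Y = Y / np.maximum(1.0, np.linalg.norm(Y, axis=1, keepdims=True))
print(np.max(Y @ a - a @ x))                 # should be <= 0: C lies on one side of the hyperplane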
Exercise 1b)¶
Exercise
Now consider the case that \(x\in \partial C\), with \(C\) still closed. Take a sequence \((x_k)\) with \(x_k\notin C\) such that \(\lim_k x_k=x\). Then consider
\[
a_k=\frac{x_k-P_C(x_k)}{\|x_k-P_C(x_k)\|}.
\]
Prove that \((a_k)\) has a limit point \(a\) satisfying the requirements of the supporting hyperplane theorem.
The sequence \((a_k)\) is bounded (each \(a_k\) has unit norm), and hence has a limit point \(a\) with \(\|a\|=1\). By the same reasoning as before we then have
\[
a_k^\top (y-x_k)\leq 0 \quad\text{for all } y\in C.
\]
Hence if we restrict to a subsequence of \(a_k\) converging to \(a\), then in the limit we obtain
\[
a^\top (y-x)\leq 0 \quad\text{for all } y\in C,
\]
as required.
Exercise 2: Heavy ball method¶
Exercise
Last week we studied the convergence rate of gradient descent for unconstrained optimization of the quadratic function
\[
f(x)=\tfrac12 x^\top A x - b^\top x,
\]
where \(A\in\mathbb R^{n\times n}\) is symmetric positive definite and \(b\in\mathbb R^n\).
This time we will see a different method, with asymptotically optimal convergence rate when applied to \(f\). In particular, we consider the “heavy ball” iteration (Boris Polyak, 1964)
\[
x_{k+1} = x_k - \alpha\nabla f(x_k) + \beta(x_k - x_{k-1}), \qquad \alpha,\beta>0.
\]
We will analyse the convergence speed of this method.
Remark: the term \(x_k-x_{k-1}\) is referred to as “momentum”, since it adds to the update a contribution proportional to the previous step.
The heavy ball iteration is equivalent to a discretization of the ODE
\[
\ddot x(t) + c\,\dot x(t) + \nabla f(x(t)) = 0, \qquad c>0,
\]
which models the motion of a particle in the potential field \(f\) with a friction term, hence the name “heavy ball”: it could be used to model a heavy ball moving through a fluid, for example. The friction damps out the oscillations that occur in the steepest descent method and should allow the iteration to converge faster.
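To get a feeling for the method before analysing it, here is a minimal sketch of the heavy ball iteration applied to a quadratic \(f(x)=\tfrac12 x^\top Ax-b^\top x\); the matrix, the step size \(\alpha\) and the momentum parameter \(\beta\) below are illustrative choices, not tuned values:
[ ]:
import numpy as np

rng = np.random.default_rng(1)
n = 20
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
lam = np.linspace(1.0, 100.0, n)             # chosen eigenvalues of A (condition number 100)
A = Q @ np.diag(lam) @ Q.T
b = rng.normal(size=n)
x_star = np.linalg.solve(A, b)               # the minimizer of f

alpha, beta = 0.01, 0.5                      # illustrative parameter values
x_prev = np.zeros(n)
x = np.zeros(n)
for _ in range(500):
    # heavy ball step: gradient step plus a multiple of the previous step
    x, x_prev = x - alpha * (A @ x - b) + beta * (x - x_prev), x
print(np.linalg.norm(x - x_star))            # distance to the minimizer after 500 iterations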
Exercise 2a)¶
Exercise
Let \(x_*\) be the solution to the optimization problem. Since \(x_{k+1}\) depends on both \(x_k\) and \(x_{k-1}\) we will have to study the convergence rate two steps at a time. To do this we want to find a matrix \(T\) such that
\[
\begin{pmatrix} x_{k+1}-x_* \\ x_k-x_* \end{pmatrix} = T \begin{pmatrix} x_k-x_* \\ x_{k-1}-x_* \end{pmatrix}.
\]
Find such a matrix \(T\) of the form
\[
T = \begin{pmatrix} c_1 I + c_2 A & c_3 I \\ c_4 I & 0 \end{pmatrix}.
\]
Hint: like last week, note that \(Ax_*=b\).
First of all it’s easy to see that \(c_4=1\). For the other three we get the equation
\[
x_{k+1}-x_* = (c_1 I + c_2 A)(x_k-x_*) + c_3 (x_{k-1}-x_*).
\]
The left hand side is given by
\[
x_{k+1}-x_* = x_k - \alpha(Ax_k-b) + \beta(x_k-x_{k-1}) - x_*.
\]
Using the fact that \(Ax_*=b\) this gives
\[
x_{k+1}-x_* = \big((1+\beta)I-\alpha A\big)(x_k-x_*) - \beta(x_{k-1}-x_*).
\]
Thus \(c_1=1+\beta\), \(c_2=-\alpha\), \(c_3=-\beta\), \(c_4=1\), that is,
\[
T = \begin{pmatrix} (1+\beta)I-\alpha A & -\beta I \\ I & 0 \end{pmatrix}.
\]
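A quick numerical check of this relation, using an arbitrary symmetric positive definite matrix and arbitrary values of \(\alpha\) and \(\beta\):
[ ]:
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)                      # a random symmetric positive definite matrix
b = rng.normal(size=n)
x_star = np.linalg.solve(A, b)
alpha, beta = 0.05, 0.3                      # arbitrary parameter values

I = np.eye(n)
T = np.block([[(1 + beta) * I - alpha * A, -beta * I],
              [I,                          np.zeros((n, n))]])

x_prev, x = rng.normal(size=n), rng.normal(size=n)
x_next = x - alpha * (A @ x - b) + beta * (x - x_prev)   # one heavy ball step

lhs = np.concatenate([x_next - x_star, x - x_star])
rhs = T @ np.concatenate([x - x_star, x_prev - x_star])
print(np.max(np.abs(lhs - rhs)))             # should be of the order of machine precision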
Exercise 2b)¶
Exercise
Let \(U AU^\top=\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_n)\) be an eigendecomposition of \(A\).
Show that there is a permutation matrix \(\Pi\) such that
\[
\Pi^\top \begin{pmatrix} U & 0 \\ 0 & U \end{pmatrix} T \begin{pmatrix} U^\top & 0 \\ 0 & U^\top \end{pmatrix} \Pi = \begin{pmatrix} T_1 & & \\ & \ddots & \\ & & T_n \end{pmatrix},
\]
where each \(T_i\) is a \(2\times 2\) block.
We have that
\[
\begin{pmatrix} U & 0 \\ 0 & U \end{pmatrix} T \begin{pmatrix} U^\top & 0 \\ 0 & U^\top \end{pmatrix} = \begin{pmatrix} (1+\beta)I-\alpha\Lambda & -\beta I \\ I & 0 \end{pmatrix}.
\]
This matrix can be conjugated by the shuffle permutation to obtain a \(2\times 2\) block diagonal matrix. We simply need to send the diagonal entries of the top-left block to the odd diagonal entries of the new matrix and the diagonal entries of the bottom-right block to the even diagonal entries. This already fixes the permutation, and it does precisely the right thing to the off-diagonal blocks. We can see it in action for small matrices:
[3]:
import numpy as np

n = 3
# shuffle permutation: row i has a 1 in column 2i, row i+n in column 2i+1
Pi = np.zeros((2*n, 2*n), dtype=int)
for i in range(n):
    Pi[i, 2*i] = 1
    Pi[i+n, 2*i+1] = 1
print("Pi = ")
print(Pi)

# a test matrix X with the same sparsity structure as the block matrix above
X = np.zeros((2*n, 2*n), dtype=int)
for i in range(n):
    X[i, i] = i+1
    X[i, i+n] = i+n+1
    X[i+n, i] = i+2*n+1
print("X = ")
print(X)

print("Pi.T@X@Pi = ")
print(Pi.T@X@Pi)
Pi =
[[1 0 0 0 0 0]
[0 0 1 0 0 0]
[0 0 0 0 1 0]
[0 1 0 0 0 0]
[0 0 0 1 0 0]
[0 0 0 0 0 1]]
X =
[[1 0 0 4 0 0]
[0 2 0 0 5 0]
[0 0 3 0 0 6]
[7 0 0 0 0 0]
[0 8 0 0 0 0]
[0 0 9 0 0 0]]
Pi.T@X@Pi =
[[1 4 0 0 0 0]
[7 0 0 0 0 0]
[0 0 2 5 0 0]
[0 0 8 0 0 0]
[0 0 0 0 3 6]
[0 0 0 0 9 0]]
Exercise 2c)¶
Exercise
The eigenvalues of each individual block \(T_i\) are given by the roots of the equation
\[
u^2 - (1+\beta-\alpha\lambda_i)u + \beta = 0.
\]
Show that if \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) then the roots are complex and have magnitude
\[
|u| = \sqrt{\beta}.
\]
The block \(T_i\) is given by
\[
T_i = \begin{pmatrix} 1+\beta-\alpha\lambda_i & -\beta \\ 1 & 0 \end{pmatrix},
\]
hence the characteristic equation is indeed \(u^2-(1+\beta-\alpha\lambda_i)u+\beta=0\). The roots of this equation are
\[
u = \frac{(1+\beta-\alpha\lambda_i) \pm \sqrt{(1+\beta-\alpha\lambda_i)^2-4\beta}}{2}.
\]
Hence, if \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\), the discriminant is non-positive and the two roots are complex conjugates of each other. The magnitude is then the same for both roots; since their product equals the constant term \(\beta\) of the characteristic equation, it is given by
\[
|u|^2 = u\,\bar u = \beta, \qquad\text{i.e.}\qquad |u| = \sqrt{\beta}.
\]
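A one-line check of this for some arbitrary values of \(\alpha\), \(\beta\) and \(\lambda_i\) satisfying the condition:
[ ]:
import numpy as np

alpha, beta, lam_i = 0.1, 0.5, 3.0                   # arbitrary values
t = 1 + beta - alpha * lam_i
assert t**2 <= 4 * beta                              # the condition from the exercise holds here
T_i = np.array([[t, -beta],
                [1.0, 0.0]])
print(np.abs(np.linalg.eigvals(T_i)), np.sqrt(beta)) # both eigenvalue magnitudes equal sqrt(beta)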
Exercise 2d)¶
Exercise
Let \(\lambda_1\) and \(\lambda_n\) be respectively the smallest and largest eigenvalue of \(A\). Show that if we choose \(\beta = \max\{(1-\sqrt{\alpha\lambda_1})^2,(1-\sqrt{\alpha \lambda_n})^2\}\), then the condition \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) is satisfied for each \(i\). Conclude that then the spectral radius of \(T\) is given by \(\rho(T)=\sqrt\beta\).
Hint: Prove that \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) is equivalent to \((1-\sqrt{\alpha \lambda_i})^2\leq\beta\leq (1+\sqrt{\alpha\lambda_i})^2\). For this it may be easier to first solve the equation \((1+\beta-\alpha\lambda_i)^2= 4\beta\).
We consider the inequality \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\). This is an inequality where an upward parabola in \(\beta\) is smaller than a line, which happens between the two intersection points (if they exist). Therefore we solve \((1+\beta-\alpha\lambda_i)^2=4\beta\). Expanding the left hand side gives the equation
\[
\beta^2 + 2(1-\alpha\lambda_i)\beta + (1-\alpha\lambda_i)^2 = 4\beta.
\]
Solving this gives
\[
\beta = 1+\alpha\lambda_i \pm 2\sqrt{\alpha\lambda_i},
\]
which is equivalent to \(\beta\in \{(1-\sqrt{\alpha\lambda_i})^2,(1+\sqrt{\alpha\lambda_i})^2\}\). Therefore \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) is equivalent to \((1-\sqrt{\alpha \lambda_i})^2\leq\beta\leq (1+\sqrt{\alpha\lambda_i})^2\).
Now if \(\beta=\max\{(1-\sqrt{\alpha\lambda_1})^2,(1-\sqrt{\alpha \lambda_n})^2\}\) then \((1-\sqrt{\alpha\lambda_i})^2\leq\beta\leq 1\leq (1+\sqrt{\alpha\lambda_i})^2\) for all \(\lambda_i\) (the bound \(\beta\leq 1\) holds for the relevant step sizes, i.e. whenever \(\alpha\lambda_n\leq 4\)). By the previous exercise, the eigenvalues of all blocks \(T_i\) therefore have magnitude \(\sqrt{\beta}\), and thus every eigenvalue of \(T\) lies on the circle of radius \(\sqrt{\beta}\). Therefore \(\rho(T)=\sqrt{\beta}\).
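We can again check this numerically; the eigenvalues of \(A\) and the step size below are arbitrary choices:
[ ]:
import numpy as np

lam = np.array([1.0, 2.5, 7.0, 30.0])        # some eigenvalues of A (arbitrary)
alpha = 0.02
beta = max((1 - np.sqrt(alpha * lam[0]))**2, (1 - np.sqrt(alpha * lam[-1]))**2)
# the condition (1 + beta - alpha*lambda_i)^2 <= 4*beta holds for every eigenvalue
print(np.all((1 + beta - alpha * lam)**2 <= 4 * beta + 1e-12))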
Exercise 2e)¶
Exercise
To get fast convergence, we want to minimize \(\rho(T)\). Find the value of \(\alpha>0\) that minimizes the spectral radius \(\rho(T)^2=\beta=\max\{(1-\sqrt{\alpha\lambda_1})^2,(1-\sqrt{\alpha \lambda_n})^2\}\). Show that then
\[
\rho(T) = \frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+\sqrt{\lambda_1}}.
\]
Hint: this is very similar to exercise 2c) of week 12
Like in exercise 2c) of week 12, we can argue that the maximum of these two functions is minimized precisely if
\[
1-\sqrt{\alpha\lambda_1} = \sqrt{\alpha\lambda_n}-1.
\]
This is equivalent to
\[
\alpha = \frac{4}{(\sqrt{\lambda_1}+\sqrt{\lambda_n})^2}.
\]
Plugging this in for \(\beta\) gives
\[
\beta = \left(1-\frac{2\sqrt{\lambda_1}}{\sqrt{\lambda_1}+\sqrt{\lambda_n}}\right)^2 = \left(\frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+\sqrt{\lambda_1}}\right)^2;
\]
using \(\rho(T)=\sqrt{\beta}\) gives the required result.
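The following snippet verifies this numerically for a diagonal \(A\) with arbitrarily chosen eigenvalues (working in the eigenbasis is no loss of generality):
[ ]:
import numpy as np

rng = np.random.default_rng(3)
n = 6
lam = np.sort(rng.uniform(1.0, 50.0, size=n))        # eigenvalues of A
A = np.diag(lam)                                     # A in its eigenbasis
lam1, lamn = lam[0], lam[-1]

alpha = 4 / (np.sqrt(lam1) + np.sqrt(lamn))**2       # the step size from exercise 2e)
beta = max((1 - np.sqrt(alpha * lam1))**2, (1 - np.sqrt(alpha * lamn))**2)

I = np.eye(n)
T = np.block([[(1 + beta) * I - alpha * A, -beta * I],
              [I,                          np.zeros((n, n))]])
rho = np.max(np.abs(np.linalg.eigvals(T)))           # spectral radius of T
print(rho, (np.sqrt(lamn) - np.sqrt(lam1)) / (np.sqrt(lamn) + np.sqrt(lam1)))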
Exercise 2f)¶
Exercise
Recall Gelfand’s formula for the spectral radius, which tells us that
\[
\rho(T) = \lim_{k\to\infty}\|T^k\|^{1/k}.
\]
Therefore, for any \(\epsilon>0\) we have for all \(k\) sufficiently large that
\[
\|T^k\| \leq \big(\rho(T)+\epsilon\big)^k.
\]
Let \(\kappa=\lambda_n/\lambda_1\) be the condition number of \(A\). Show that for a good choice of \(\alpha,\beta\) we have for all \(\epsilon>0\) and \(k\) sufficiently large that
\[
\|x_{k+1}-x_*\| \leq \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}+\epsilon\right)^k \left\|\begin{pmatrix} x_1-x_* \\ x_0-x_* \end{pmatrix}\right\|.
\]
Up to the arbitrarily small term \(\epsilon\), this convergence rate matches the optimal rate that we have seen in the course.
First of all, note simply that
\[
\|x_{k+1}-x_*\| \leq \left\|\begin{pmatrix} x_{k+1}-x_* \\ x_k-x_* \end{pmatrix}\right\|.
\]
From exercise 2a) we have that
\[
\begin{pmatrix} x_{k+1}-x_* \\ x_k-x_* \end{pmatrix} = T \begin{pmatrix} x_k-x_* \\ x_{k-1}-x_* \end{pmatrix}.
\]
Hence
\[
\left\|\begin{pmatrix} x_{k+1}-x_* \\ x_k-x_* \end{pmatrix}\right\| = \left\|T\begin{pmatrix} x_k-x_* \\ x_{k-1}-x_* \end{pmatrix}\right\|.
\]
By induction
\[
\left\|\begin{pmatrix} x_{k+1}-x_* \\ x_k-x_* \end{pmatrix}\right\| = \left\|T^k\begin{pmatrix} x_1-x_* \\ x_0-x_* \end{pmatrix}\right\| \leq \|T^k\|\,\left\|\begin{pmatrix} x_1-x_* \\ x_0-x_* \end{pmatrix}\right\|.
\]
Then using the remark about Gelfand’s formula together with the expression for \(\rho(T)\) gives the result.
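To see the predicted rate in practice, the sketch below runs the heavy ball iteration with the parameters from 2e) on a diagonal quadratic with condition number \(\kappa=100\) and compares the observed per-step contraction factor with \(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\); the observed value comes out slightly larger, consistent with the \(\epsilon\) in the statement. The problem instance and the number of iterations are arbitrary choices:
[ ]:
import numpy as np

rng = np.random.default_rng(4)
n = 50
lam = np.linspace(1.0, 100.0, n)             # eigenvalues of A, condition number kappa = 100
A, b = np.diag(lam), rng.normal(size=n)
x_star = np.linalg.solve(A, b)

lam1, lamn = lam[0], lam[-1]
alpha = 4 / (np.sqrt(lam1) + np.sqrt(lamn))**2
beta = max((1 - np.sqrt(alpha * lam1))**2, (1 - np.sqrt(alpha * lamn))**2)

x_prev = np.zeros(n)
x = np.zeros(n)
e0 = np.linalg.norm(x - x_star)
k = 100
for _ in range(k):
    x, x_prev = x - alpha * (A @ x - b) + beta * (x - x_prev), x

kappa = lamn / lam1
print("observed per-step rate :", (np.linalg.norm(x - x_star) / e0) ** (1 / k))
print("predicted rate         :", (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))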