
Week 13

Exercise 1: Supporting hyperplane theorem

Exercise

In this exercise we will prove the supporting hyperplane theorem:

Let \(C\subset \mathbb R^n\) be a closed convex set. Suppose that \(x\in \mathbb R^n\setminus \operatorname{int}(C)=(\mathbb R^n\setminus C)\cup \partial C\). Then there exists a nonzero vector \(a\in \mathbb R^n\) such that

\[a^\top y \leq a^\top x\qquad \forall y\in C\]

That is, \(C\) lies on one side of the hyperplane through \(x\) that is orthogonal to \(a\).

Remark: the theorem also holds if \(C\) is not closed. This can essentially be proved by replacing \(C\) with its closure \(\overline C\).

Exercise 1a)

Exercise

Let \(x\in \mathbb R^n\setminus C\) (so we exclude the case \(x\in\partial C\)). Using the fact that \(C\) is closed we can define the projection onto \(C\) by

\[P_C(x) = \operatorname{argmin}_{y\in C}\frac12\|x-y\|^2.\]

Use the optimality conditions from convex optimization to show that

\[a = \frac{x-P_C(x)}{\|x-P_C(x)\|}\]

satisfies

\[a^\top y<a^\top x,\qquad \forall y\in C.\]

Recall that if \(x^*=\operatorname{argmin}_{x\in C}f(x)\) with \(f\) and \(C\) convex, then the optimality condition is \(\nabla f(x^*)^\top (y-x^*)\geq 0\) for all \(y\in C\). For the projection (where we minimize \(f(y)=\frac12\|x-y\|^2\) over \(y\in C\)) this means that

\[(P_C(x)-x)^\top(y-P_C(x))\geq 0\]

and thus that

\[a^\top y\leq a^\top P_C(x)=a^\top (P_C(x)-x)+a^\top x<a^\top x\]

since \(a^\top (P_C(x)-x) = -\|x-P_C(x)\| <0\).
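As a quick numerical sanity check (not part of the proof), here is a minimal sketch in numpy: we take the closed unit ball as \(C\), for which the projection has the closed form \(P_C(x)=x/\max(1,\|x\|)\), and compare \(a^\top y\) with \(a^\top x\) for randomly sampled \(y\in C\). The helper proj_ball and the specific point are illustrative choices only.

import numpy as np

rng = np.random.default_rng(0)

def proj_ball(x):
    # projection onto the closed unit ball
    return x / max(1.0, np.linalg.norm(x))

x = np.array([2.0, 1.0])                         # a point outside C
Px = proj_ball(x)
a = (x - Px) / np.linalg.norm(x - Px)

# sample points of C and compare a^T y with a^T x
Y = rng.normal(size=(1000, 2))
Y = Y / np.maximum(1.0, np.linalg.norm(Y, axis=1, keepdims=True))
print(np.max(Y @ a), a @ x)                      # the maximum over the samples stays below a^T x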

Exercise 1b)

Exercise

Now consider the case that \(x\in \partial C\), with \(C\) still closed. Take a sequence \((x_k)\) with \(x_k\notin C\) such that \(\lim_k x_k=x\) (such a sequence exists because \(x\) is a boundary point). Then consider

\[a_k = \frac{x_k-P_C(x_k)}{\|x_k-P_C(x_k)\|}.\]

Prove that \((a_k)\) has a limit point \(a\) satisfying the requirements of the supporting hyperplane theorem.

The sequence \((a_k)\) consists of unit vectors, so it is bounded and hence has a limit point \(a\). By the same reasoning as in part a) we then have

\[a_k^\top y< a_k^\top x_k,\qquad \forall y\in C\]

Hence, if we restrict to a subsequence along which \(a_k\to a\), then using \(x_k\to x\) and passing to the limit we obtain

\[a^\top y\leq a^\top x,\qquad \forall y\in C,\]

as required.
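Purely as an illustration of this limiting argument (again with the unit ball as \(C\), an arbitrary boundary point \(x\), and a hand-picked sequence \(x_k\notin C\)):

import numpy as np

def proj_ball(x):
    # projection onto the closed unit ball
    return x / max(1.0, np.linalg.norm(x))

x = np.array([0.6, 0.8])                     # boundary point of C (norm 1)
for k in [1, 10, 100, 10000]:
    xk = x + np.array([1.0 / k, 0.0])        # x_k outside C with x_k -> x
    Pxk = proj_ball(xk)
    ak = (xk - Pxk) / np.linalg.norm(xk - Pxk)
    print(k, ak)                             # a_k converges to the outward normal (0.6, 0.8)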

Exercise 2: Heavy ball method

Exercise

Last week we studied the convergence rate of gradient descent for unconstrained optimization of the quadratic function

\[f(x) = \frac12x^\top A x -b^\top x+c.\]

This time we will see a different method, with asymptotically optimal convergence rate when applied to \(f\). In particular, we consider the “heavy ball” iteration (Boris Polyak, 1964)

\[x_{k+1}=x_k - \alpha \nabla f(x_k)+\beta(x_k-x_{k-1}).\]

We will analyse the convergence speed of this method.

Remark: The term \(x_k-x_{k-1}\) is referred to as “momentum”, since it pushes the new iterate further in the direction of the previous step.

The heavy ball iteration is equivalent to a discretization of the ODE

\[\ddot{x}+a\dot{x} +b\nabla f(x) = 0\]

which models the motion of a particle in a potential field \(f\) with a friction term, hence the name “heavy ball”: it could be used to model a heavy ball moving through a fluid, for example. The friction term damps out the oscillations that occur in the steepest descent method and should allow the iteration to converge faster.
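A minimal implementation sketch of the iteration on a random quadratic (the test problem is arbitrary; the step size \(\alpha\) and momentum \(\beta\) used here anticipate the choices derived in exercises 2d) and 2e)):

import numpy as np

rng = np.random.default_rng(1)

# random positive definite quadratic f(x) = 1/2 x^T A x - b^T x
n = 20
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)
b = rng.normal(size=n)
x_star = np.linalg.solve(A, b)

def heavy_ball(alpha, beta, iters=500):
    x_prev = np.zeros(n)
    x = np.zeros(n)
    for _ in range(iters):
        x, x_prev = x - alpha * (A @ x - b) + beta * (x - x_prev), x
    return x

lam = np.linalg.eigvalsh(A)                                  # eigenvalues in ascending order
alpha = 4 / (np.sqrt(lam[-1]) + np.sqrt(lam[0])) ** 2        # step size from 2e)
beta = ((np.sqrt(lam[-1]) - np.sqrt(lam[0])) / (np.sqrt(lam[-1]) + np.sqrt(lam[0]))) ** 2
print(np.linalg.norm(heavy_ball(alpha, beta) - x_star))      # distance to the minimizer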

Exercise 2a)

Exercise

Let \(x_*\) be the solution to the optimization problem. Since \(x_{k+1}\) depends on both \(x_k\) and \(x_{k-1}\), we will have to study the convergence rate two steps at a time. To do this we want to find a matrix \(T\) such that

\[\begin{split}\begin{pmatrix}x_{k+1}-x_*\\x_k-x_*\end{pmatrix} = T\begin{pmatrix}x_{k}-x_*\\x_{k-1}-x_*\end{pmatrix}\end{split}\]

Find such a matrix \(T\) of the form

\[\begin{split}T=\begin{pmatrix}c_1I+c_2A&c_3I\\c_4I&0\end{pmatrix}\end{split}\]

Hint: like last week, note that \(Ax_*=b\).

First of all it is easy to see that \(c_4=1\). For the other three constants we get the equation

\[x_{k+1}-x_* = c_1(x_k-x_*)+c_2A(x_k-x_*)+c_3(x_{k-1}-x_*)\]

Here the heavy ball iteration gives

\[x_{k+1} = x_k - \alpha \nabla f(x_k)+\beta(x_k-x_{k-1})=x_k - \alpha (Ax_k-b)+\beta(x_k-x_{k-1})\]

Using the fact that \(Ax_*=b\) and comparing coefficients gives

\[\begin{split}\begin{align} 1+\beta &=c_1 \\ -\alpha &= c_2\\ 1 &= c_1+c_3 \Rightarrow c_3 = -\beta \end{align}\end{split}\]

Thus

\[\begin{split}T=\begin{pmatrix}(1+\beta)I-\alpha A&-\beta I\\I&0\end{pmatrix}\end{split}\]
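A quick numerical check (on a small random quadratic, with arbitrary values of \(\alpha\) and \(\beta\)) that \(T\) indeed maps \((x_k-x_*,\,x_{k-1}-x_*)\) to \((x_{k+1}-x_*,\,x_k-x_*)\):

import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)
b = rng.normal(size=n)
x_star = np.linalg.solve(A, b)

alpha, beta = 0.01, 0.5                          # arbitrary parameters

T = np.block([[(1 + beta) * np.eye(n) - alpha * A, -beta * np.eye(n)],
              [np.eye(n), np.zeros((n, n))]])

# one heavy ball step from arbitrary x_{k-1}, x_k
x_prev = rng.normal(size=n)
x = rng.normal(size=n)
x_next = x - alpha * (A @ x - b) + beta * (x - x_prev)

lhs = np.concatenate([x_next - x_star, x - x_star])
rhs = T @ np.concatenate([x - x_star, x_prev - x_star])
print(np.linalg.norm(lhs - rhs))                 # agrees up to rounding error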

Exercise 2b)

Exercise

Let \(UAU^\top=\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_n)\) be an eigendecomposition of \(A\), with \(U\) orthogonal.

Show that there is a permutation matrix \(\Pi\) such that

\[\begin{split}\Pi\begin{pmatrix}U&0\\0&U\end{pmatrix}T\begin{pmatrix}U^\top&0\\0&U^\top\end{pmatrix}\Pi^\top = \begin{pmatrix}T_1&0&\cdots&0\\ 0 & T_2&\cdots &0\\\vdots &\vdots&\ddots &\vdots \\ 0&0&\cdots &T_n\end{pmatrix}\end{split}\]

with \(T_i\) a \(2\times 2\) block matrix

\[\begin{split}T_i = \begin{pmatrix}c_1+c_2\lambda_i&c_3\\c_4&0\end{pmatrix}\end{split}\]

We have that

\[\begin{split}\begin{align} &\begin{pmatrix}U&0\\0&U\end{pmatrix}\begin{pmatrix}(1+\beta)I-\alpha A&-\beta I\\I&0\end{pmatrix}\begin{pmatrix}U^\top&0\\0&U^\top\end{pmatrix}\\ &=\begin{pmatrix}(1+\beta)I-\alpha \Lambda&-\beta I\\I&0\end{pmatrix}\\ \end{align}\end{split}\]

This can be brought into \(2\times 2\) block-diagonal form by conjugating with the “perfect shuffle” permutation

\[\Pi = \begin{pmatrix}e_1&e_{n+1}&e_2&e_{n+2}&\cdots&e_n&e_{2n}\end{pmatrix}^\top,\]

i.e. the permutation that sends row (and column) \(i\) to position \(2i-1\) and row (and column) \(n+i\) to position \(2i\). Under this reordering the top-left diagonal block lands on the odd diagonal entries, the bottom-left identity block on the (even, odd) entries of each \(2\times 2\) block, and the top-right block on the (odd, even) entries, which is exactly the required block-diagonal structure. We can see it in action for small matrices:

[3]:
import numpy as np
n = 3
# perfect shuffle: row 2*i of Pi is e_i, row 2*i+1 is e_{n+i}
Pi = np.zeros((2*n,2*n),dtype=int)
for i in range(n):
    Pi[2*i,i] = 1
    Pi[2*i+1,i+n] = 1
print("Pi = ")
print(Pi)

# X has the same sparsity pattern as T: diagonal blocks in the
# top-left, top-right and bottom-left positions
X = np.zeros((2*n,2*n),dtype=int)
for i in range(n):
    X[i,i] = i+1
    X[i,i+n] = i+n+1
    X[i+n,i] = i+2*n+1
print("X = ")
print(X)
print("Pi@X@Pi.T = ")
print(Pi@X@Pi.T)
Pi =
[[1 0 0 0 0 0]
 [0 0 0 1 0 0]
 [0 1 0 0 0 0]
 [0 0 0 0 1 0]
 [0 0 1 0 0 0]
 [0 0 0 0 0 1]]
X =
[[1 0 0 4 0 0]
 [0 2 0 0 5 0]
 [0 0 3 0 0 6]
 [7 0 0 0 0 0]
 [0 8 0 0 0 0]
 [0 0 9 0 0 0]]
Pi@X@Pi.T =
[[1 4 0 0 0 0]
 [7 0 0 0 0 0]
 [0 0 2 5 0 0]
 [0 0 8 0 0 0]
 [0 0 0 0 3 6]
 [0 0 0 0 9 0]]

Exercise 2c)

Exercise

The eigenvalues of each individual block \(T_i\) are given by the roots of the equation

\[u^2-(1+\beta-\alpha\lambda_i)u+\beta=0.\]

Show that if \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) then the roots are complex and have magnitude

\[|u|^2 = \beta.\]

The block \(T_i\) is given by

\[\begin{split}\begin{pmatrix}1+\beta-\alpha\lambda_i&-\beta\\1&0\end{pmatrix}\end{split}\]

hence the characteristic equation is indeed \(u^2-(1+\beta-\alpha\lambda_i)u+\beta=0\). The roots of this equation are

\[u=\frac12\left(1+\beta-\alpha\lambda_i\pm \sqrt{(1+\beta-\alpha\lambda_i)^2-4\beta}\right).\]

Hence, if \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\), the discriminant is non-positive and the two roots are complex conjugates. Their common magnitude is given by

\[|u|^2 = \frac14\left((1+\beta-\alpha\lambda_i)^2+4\beta-(1+\beta-\alpha\lambda_i)^2\right) = \beta.\]
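A small check of this claim for one block, with arbitrary values of \(\alpha,\beta,\lambda_i\) chosen to satisfy the condition:

import numpy as np

alpha, beta, lam = 0.1, 0.5, 2.0
assert (1 + beta - alpha * lam) ** 2 <= 4 * beta   # condition for complex roots

Ti = np.array([[1 + beta - alpha * lam, -beta],
               [1.0, 0.0]])
print(np.abs(np.linalg.eigvals(Ti)) ** 2)          # both entries equal beta = 0.5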

Exercise 2d)

Exercise

Let \(\lambda_1\) and \(\lambda_n\) be respectively the smallest and largest eigenvalue of \(A\). Show that if we choose \(\beta = \max\{(1-\sqrt{\alpha\lambda_1})^2,(1-\sqrt{\alpha \lambda_n})^2\}\), then the condition \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) is satisfied for each \(i\). Conclude that then the spectral radius of \(T\) is given by \(\rho(T)=\sqrt\beta\).

Hint: Prove that \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) is equivalent to \((1-\sqrt{\alpha \lambda_i})^2\leq\beta\leq (1+\sqrt{\alpha\lambda_i})^2\). For this it may be easier to first solve the equation \((1+\beta-\alpha\lambda_i)^2= 4\beta\)

We consider the inequality \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\). As a function of \(\beta\), the left-hand side is an upward parabola and the right-hand side a line, so the inequality holds exactly between the two intersection points (if they exist). Therefore we solve \((1+\beta-\alpha\lambda_i)^2=4\beta\). Expanding the left hand side gives the equation

\[\beta^2-2\beta(1+\alpha\lambda_i)+(1-\alpha\lambda_i)^2 = 0\]

Solving this gives

\[\beta = 1+\alpha\lambda_i\pm2\sqrt{\alpha\lambda_i}\]

which is equivalent to \(\beta\in \{(1-\sqrt{\alpha\lambda_i})^2,(1+\sqrt{\alpha\lambda_i})^2\}\). Therefore \((1+\beta-\alpha\lambda_i)^2\leq 4\beta\) is equivalent to \((1-\sqrt{\alpha \lambda_i})^2\leq\beta\leq (1+\sqrt{\alpha\lambda_i})^2\).

Now let \(\beta=\max\{(1-\sqrt{\alpha\lambda_1})^2,(1-\sqrt{\alpha \lambda_n})^2\}\). Since \(t\mapsto(1-\sqrt{t})^2\) attains its maximum over the interval \([\alpha\lambda_1,\alpha\lambda_n]\) at one of the endpoints, we have \((1-\sqrt{\alpha\lambda_i})^2\leq\beta\) for all \(i\). Moreover \(\beta\leq 1\leq(1+\sqrt{\alpha\lambda_i})^2\) (the bound \(\beta\leq 1\) holds as long as \(\alpha\lambda_n\leq 4\), which is the case for the step sizes considered below). By the previous exercise, the eigenvalues of all blocks \(T_i\) therefore have magnitude \(\sqrt{\beta}\), so every eigenvalue of \(T\) lies on the circle of radius \(\sqrt{\beta}\). Therefore \(\rho(T)=\sqrt{\beta}\).
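As a numerical check (a diagonal \(A\) is enough since only the eigenvalues matter; \(\alpha\) is an arbitrary admissible step size):

import numpy as np

lam = np.array([1.0, 2.0, 5.0, 10.0])                  # eigenvalues of A
A = np.diag(lam)
n = len(lam)

alpha = 0.1                                            # arbitrary step size with alpha * lam_max <= 4
beta = max((1 - np.sqrt(alpha * lam[0])) ** 2,
           (1 - np.sqrt(alpha * lam[-1])) ** 2)

T = np.block([[(1 + beta) * np.eye(n) - alpha * A, -beta * np.eye(n)],
              [np.eye(n), np.zeros((n, n))]])
print(np.max(np.abs(np.linalg.eigvals(T))), np.sqrt(beta))   # the two numbers agree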

Exercise 2e)

Exercise

To get fast convergence, we want to minimize \(\rho(T)\). Find the value of \(\alpha>0\) that minimizes the spectral radius \(\rho(T)^2=\beta=\max\{(1-\sqrt{\alpha\lambda_1})^2,(1-\sqrt{\alpha \lambda_n})^2\}\). Show that then

\[\rho(T) = \frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+\sqrt{\lambda_1}}\]

Hint: this is very similar to exercise 2c) of week 12

As in exercise 2c) of week 12, we can argue that the maximum of these two functions of \(\alpha\) is minimized precisely when

\[(1-\sqrt{\alpha\lambda_1})=-(1-\sqrt{\alpha \lambda_n})\]

This is equivalent to

\[\alpha = \frac{4}{(\sqrt{\lambda_n}+\sqrt{\lambda_1})^2}\]

Plugging this in for \(\beta\) gives

\[\beta = \left(\frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+\sqrt{\lambda_1}}\right)^2\]

Using \(\rho(T)=\sqrt{\beta}\) then gives the required result.
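Plugging the optimal step size into the same kind of numerical check as before (again with a diagonal \(A\) for simplicity):

import numpy as np

lam = np.array([1.0, 2.0, 5.0, 10.0])                        # eigenvalues of A
A = np.diag(lam)
n = len(lam)

alpha = 4 / (np.sqrt(lam[-1]) + np.sqrt(lam[0])) ** 2        # optimal step size
beta = ((np.sqrt(lam[-1]) - np.sqrt(lam[0])) / (np.sqrt(lam[-1]) + np.sqrt(lam[0]))) ** 2

T = np.block([[(1 + beta) * np.eye(n) - alpha * A, -beta * np.eye(n)],
              [np.eye(n), np.zeros((n, n))]])
print(np.max(np.abs(np.linalg.eigvals(T))))                  # spectral radius of T
print((np.sqrt(lam[-1]) - np.sqrt(lam[0])) / (np.sqrt(lam[-1]) + np.sqrt(lam[0])))  # predicted rate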

Exercise 2f)

Exercise

Recall Gelfand’s formula for the spectral radius, which tells us that

\[\rho(T) = \lim_{k\to\infty}\|T^k\|^{\frac1k}.\]

Therefore, for any \(\epsilon>0\) we have for all \(k\) sufficiently large that

\[\|T^k\|\leq (\rho(T)+\epsilon)^k.\]

Let \(\kappa=\lambda_n/\lambda_1\) be the condition number of \(A\). Show that for a good choice of \(\alpha,\beta\) we have for all \(\epsilon>0\) and \(k\) sufficiently large that

\[\begin{split}\left\|\begin{matrix}x_{k+1}-x_*\\x_k-x_*\end{matrix}\right\| \leq \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}+\epsilon\right)^k\left\|\begin{matrix}x_{1}-x_*\\x_{0}-x_*\end{matrix}\right\|.\end{split}\]

Up to the arbitrarily small term \(\epsilon\), this convergence rate matches the optimal rate that we have seen in the course.

First of all, note that for the choice of \(\alpha\) and \(\beta\) from exercise 2e),

\[\rho(T) = \frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+\sqrt{\lambda_1}}=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\]

From exercise 2a) we have that

\[\begin{split}\begin{pmatrix}x_{k+1}-x_*\\x_k-x_*\end{pmatrix} = T\begin{pmatrix}x_{k}-x_*\\x_{k-1}-x_*\end{pmatrix}\end{split}\]

Applying this relation \(k\) times gives

\[\begin{split}\begin{pmatrix}x_{k+1}-x_*\\x_k-x_*\end{pmatrix} = T^k\begin{pmatrix}x_{1}-x_*\\x_{0}-x_*\end{pmatrix},\end{split}\]

and hence

\[\begin{split}\left\|\begin{matrix}x_{k+1}-x_*\\x_k-x_*\end{matrix}\right\| \leq \|T^k\|\left\|\begin{matrix}x_{1}-x_*\\x_{0}-x_*\end{matrix}\right\|.\end{split}\]

By the remark about Gelfand’s formula, \(\|T^k\|\leq(\rho(T)+\epsilon)^k\) for all \(k\) sufficiently large, which together with the expression for \(\rho(T)\) above gives the result.
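Finally, a sketch (random test problem, illustrative only) comparing the observed per-step error decay of the heavy ball method with the predicted rate \((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\):

import numpy as np

rng = np.random.default_rng(3)
n = 50
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)
b = rng.normal(size=n)
x_star = np.linalg.solve(A, b)

lam = np.linalg.eigvalsh(A)
kappa = lam[-1] / lam[0]
alpha = 4 / (np.sqrt(lam[-1]) + np.sqrt(lam[0])) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

x_prev = np.zeros(n)
x = np.zeros(n)
errs = []
for _ in range(150):
    x, x_prev = x - alpha * (A @ x - b) + beta * (x - x_prev), x
    errs.append(np.linalg.norm(x - x_star))

# geometric mean of the contraction factor over the last 100 steps vs. the predicted rate
print((errs[-1] / errs[-100]) ** (1 / 99), (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))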