Open in Colab

Week 12

Exercise 1: Subgradients


Let \(f\colon \Omega\subset\mathbb R^n\to \mathbb R\) (not necessarily convex or smooth). We call \(g\in \mathbb R^n\) a subgradient of \(f\) at \(x\in\Omega\) if:

\[f(y)\geq f(x)+g^\top (y-x) \qquad \forall y\in\Omega\]

The set of all subgradients of \(f\) at \(x\) is denoted \(\partial f(x)\). Note that if \(f\) is convex and differentiable at \(x\) then \(\nabla f(x)\in \partial f(x)\), by the definition of a convex function. The point of subgradients is that they give a useful well-defined analogue of gradients at points where \(f\) is not differentiable.

Exercise 1a)


Show that \(\partial f(x)\) is a convex set.

Let \(g,g'\in\partial f(x)\) and let \(t\in [0,1]\), \(y\in \Omega\). Then,

\[\begin{split}\begin{align} f(y) &= tf(y)+(1-t)f(y)\\ & \geq t\left(f(x) + g^\top (y-x)\right)+(1-t)\left(f(x) + (g')^\top (y-x)\right)\\ & = f(x) + (tg+(1-t)g')^\top (y-x) \end{align}\end{split}\]

and therefore \(tg+(1-t)g'\in \partial f(x)\) and thus \(\partial f(x)\) is convex.

Exercise 1b)


Let \(f\colon \Omega\subset\mathbb R^n\to \mathbb R\). Let \(x^*\in \Omega\). Show that \(0\in \partial f(x^*)\) if and only if \(x^*\) is a global minimizer.

\(0\in\partial f(x^*)\) means that

\[f(y)\geq f(x^*) \qquad\forall y\in\Omega\]

which is means \(x^*\) is a global minimizer.

Exercise 1c)


Consider \(f=|\cdot|\colon \mathbb R\to \mathbb R\). Show that \(\partial f(0)=[-1,1]\) and conclude that \(0\) is a global minimizer of \(f\).

Let \(g\in\mathbb R\). Then we need \(g\) such that \(f(y)=|y|\geq gy\) for all \(y\in\mathbb R\). If \(g\) and \(y\) have opposite signs, then this inequality is trivially true, since the right-hand side is negative. Then if \(y\geq 0\) we obtain \(g\leq 1\), whereas for \(g\leq 0\) we obtain \(g\geq -1\). Therefore only \(g\in[-1,1]\) satisfies the inequality for all \(y\in\mathbb R\), and \(\partial f(0)=[-1,1]\). Since \(0\in\partial f(0)\), \(0\) is a global minimizer by 1b).

Exercise 1d)


Let \(f(x) = \max_{i=1,\dots,n} f_i(x)\). Show that

\[\partial f(x) \supset \text{Co}\left(\bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)\right)\]

where Co denotes the convex hull. By \(i\mid f_i(x)=f(x)\) we mean those \(i\in\{1,\dots,m\}\) such that \(f_i(x)=f(x)\).

Remark: The opposite inclusion is also true, but a bit harder to show.

Let \(x\in \Omega\) and consider \(k\) such that \(f(x) = f_k(x)\). Then for \(g\in \partial f_k(x)\) we have that

\[f(y)\geq f_k(y) \geq f_k(x)+g^\top (y-x) = f(x)+g^\top (y-x)\]

hence \(g\in \partial f(x)\). Therefore we have that

\[\partial f(x)\supset \bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)\]

since \(\partial f(x)\) is convex, this inequality remains true if we take the convex hull on the right hand side.

Exercise 1e)


Consider the function \(f\colon \mathbb R^2\to \mathbb R\), \(f(x,y)=\max \{f_1(x,y),\,f_2(x,y)\}\) with

\[f_1(x,y) = x^2,\qquad f_2(x,y) = \alpha x^2+y\]

where \(\alpha\in (0,1)\) is a fixed parameter. Compute \(\partial f(0,0)\) and show that \((0,0)\) is a global minimzer of \(f\).

Hint: You can use the opposite inclusion of 1d): \(\partial f(x) = \text{Co}\left(\bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)\right)\)

Let \(g = (g_1,g_2)\), then since \(f_1(0,0)=f_2(0,0)=f(0,0)\) we have that

\[\partial f(0,0) = \text{Co}(\partial f_1(0,0)\cup\partial f_2(0,0))\]

For \(\partial f_1(0,0)\) we get inequality

\[x^2\geq g_1x+g_2y\]

For this to hold for all \(y\), we need \(g_2=0\). Furthermore \(g_1=0\) as well, since \(x^2<g_1 x\) for any \(g_1\neq 0\) for \(x\) sufficiently close to zero. Hence \(\partial f_1(0,0)=\{(0,0)\}\). In fact, since \(f_1\) is differentiable we could immediately conclude this.

For \(\partial f_2(0,0)\) we get inequality,

\[\alpha x^2+y \geq g_1x+g_2y\]

By the same argumentation as before, we obtain \(g_1=0\). If \(x=0\), \(y>0\) the inequality holds iff \(g_2\leq 1\), whereas for \(y<0\) it holds iff \(g_2\geq 1\). Hence \(g_2=1\), and \(\partial f_2(0,0) = \{(0,1)\}\). Again, we could have conlcuded this by taking the derivative of \(f_2\) as well.

In conclusion, we have that \(\partial f(0,0) = \{0\}\times [0,1]\). Since this includes \(\{(0,0)\}\), we conclude \((0,0)\) is a global minimizer of \(f\).

Below we see a contour plot of this function. The plot suggests that all the points on the line \(0\times(\infty,0]\) minimize the function (solid red line). The dotted black line shows the region where \(f_1=f_2\).

import matplotlib.pyplot as plt
import numpy as np

alpha = 0.5

t = np.linspace(-1, 1, 100)
xx, yy = np.meshgrid(t, t)
f1 = xx ** 2
f2 = alpha * xx ** 2 + yy
f = np.max([f1, f2], axis=0)

CS = plt.contour(xx, yy, f, [0.003, 0.05, 0.1, 0.2, 0.5, 1])
plt.plot(t, t ** 2 * alpha, ls="--", c="k", alpha=0.7)
plt.axvline(0, 0, 0.5, c="r")

Exercise 2: Gradient descent for quadratic problem


Let \(0\prec A\in \mathbb R^{n\times n}\) be a positive definite matrix, and let \(b\in\mathbb R^n\) and \(c\in \mathbb R\). We consider the problem of minimizing the quadratic function

\[f(x) = \frac12x^\top A x - b^\top x+c\]

using gradient descent methods. Recall that this problem has unique minimizer \(x_*=A^{-1}b\).

Exercise 2a)


Let \(x_{k+1} = x_k-t\nabla f(x_k)\). Show that

\[\|x_{k+1}-x_*\|\leq \|I-t A\|\|x_k-x_*\|\]

where the norm on the left of the right-hand-side is the operator norm.

Hint: You can use that \(Ax_*=b\).

We have

\[\begin{split}\begin{align} \|x_{k+1}-x_*\| &= \|x_k-t\nabla f(x_k)-x_*\|\\ &= \|x_k-t (A x_k-b ) -x_*\|\\ &= \|x_k-t (A x_k-Ax_* ) -x_*\|\\ &= \|(I-t A)(x_k-x_*)\|\\ &\leq \|I-tA\|\|x_k-x_*\| \end{align}\end{split}\]

Here we used the property that \(\|Av\|\leq \|A\|\|v\|\). This is true by the definiton of the operatorn norm: \(\|A\| = \operatorname{sup}_{v\neq 0} \|Av\|/\|v\|\).

Exercise 2b)


Let \(\lambda_1,\lambda_n>0\) be respectively the smallest and biggest eigenvalue of \(A\). Show that

\[\|I-tA\| = \max\{|1-t\lambda_1|,|1-t\lambda_n|\}$.\]

Hint: First show that \(\|I-tA\| = \max_i|1-t\lambda_i|\) with \(\lambda_i\) the eigenvalues of \(A\).

Since \(I\) and \(tA\) commute, the specturm of \(I-tA\) is given by \(1-t\cdot \sigma (A)\). Hence we have that \(\|I-tA\| = \max_i |1-t\lambda_i|\), with \(\lambda_i\) the eigenvalues of \(A\). For fixed \(t\), we have that \(|1-t\lambda_i|\) is as large as possible either if \(\lambda_i\) is the smallest or the biggest eigenvalue, hence

\[h(t) = \|I-tA\| = \max\{|1-t\lambda_1|,|1-t\lambda_n|\}\]

Exercise 2c)


Show that \(t=2/(\lambda_1+\lambda_n)\) minimizes \(h(t)=\|I-tA\|\)

Hint: You can use the fact that this function is minimized if and only if \(0\in\partial h(t)\) (c.f. Exercise 1b) and 1d))

The subgradient \(\partial h(t)\) is a single (non-zero) point if \(|1-t\lambda_1|\neq |1-t\lambda_n|\), since both \(|1-t\lambda_1|\) and \(|1-t\lambda_n|\) are differentiable if \(1\neq t\lambda_i\), but if \(1=t\lambda_i\) then \(|1-t\lambda_j|\) for \(j\neq i\) is always bigger. Therefore \(0\in \partial h(t)\) only if \(|1-t\lambda_1|=|1-t\lambda_n|\).

This means either \(1-t\lambda_1=1-t\lambda_n\), and hence \(t=0\). It is easy to see that \(h(0)=[\lambda_1,\lambda_n]\not\owns 0\).

Otherwise it means \(1-t\lambda-1=t\lambda_n-1\), and hence \(t=2/(\lambda_n+\lambda_n)\). At this point we have \(h(t)=[-\lambda_n,\lambda_1]\owns 0\) and hence \(t=2/(\lambda_n+\lambda_n)\) is a global minimizer.

The function \(h(t)\) is shown below. Between \(-\infty\) and \(0\) it has slope \(-\lambda_n\), then between \(0\) and \(2/(\lambda_1+\lambda_2)\) it has slope \(-\lambda_1\). Then after that it has slope \(\lambda_n\).

import matplotlib.pyplot as plt
import numpy as np

lambda_1 = 1
lambda_n = 2
t_opt = 2 / (lambda_1 + lambda_n)

t = np.linspace(-0.5, 1.5, 100)
y = np.max([np.abs(1 - t * lambda_1), np.abs(1 - t * lambda_n)], axis=0)
plt.plot(t, np.abs(1 - t * lambda_1), alpha=0.8)
plt.plot(t, np.abs(1 - t * lambda_n), alpha=0.8)
plt.plot(t, y, lw=3, c="r")
plt.axvline(0, c="k", ls="--", alpha=0.5)
plt.axvline(t_opt, c="k", ls="--", alpha=0.5);

Exercise 2d)


Using the previous exercises, derive a convergence rate \(\gamma>0\) of gradient descent with optimal step size \(t\) in terms of the condition number \(\kappa = \lambda_n/\lambda_1\):

\[\|x_k-x_*\|\leq \gamma^k\|x_0-x_*\|\]

We have that \(h(t)=\|I-tA\|\) is minimzed if \(t=2/(\lambda_1+\lambda_n)\), in which case

\[h(t) = 1-\frac{2\lambda_1}{\lambda_1+\lambda_n} = \frac{\lambda_n-\lambda_1}{\lambda_1+\lambda_n} = \frac{\kappa-1}{\kappa+1}.\]

Hence by induction we get convergence rate

\[\|x_k-x_*\|\leq \left(\frac{\kappa-1}{\kappa+1}\right)^k\|x_0-x_*\|\]