Week 12¶

Exercise

Let $$f\colon \Omega\subset\mathbb R^n\to \mathbb R$$ (not necessarily convex or smooth). We call $$g\in \mathbb R^n$$ a subgradient of $$f$$ at $$x\in\Omega$$ if:

$f(y)\geq f(x)+g^\top (y-x) \qquad \forall y\in\Omega$

The set of all subgradients of $$f$$ at $$x$$ is denoted $$\partial f(x)$$. Note that if $$f$$ is convex and differentiable at $$x$$ then $$\nabla f(x)\in \partial f(x)$$, by the definition of a convex function. The point of subgradients is that they give a useful well-defined analogue of gradients at points where $$f$$ is not differentiable.

Exercise 1a)¶

Exercise

Show that $$\partial f(x)$$ is a convex set.

Let $$g,g'\in\partial f(x)$$ and let $$t\in [0,1]$$, $$y\in \Omega$$. Then,

\begin{split}\begin{align} f(y) &= tf(y)+(1-t)f(y)\\ & \geq t\left(f(x) + g^\top (y-x)\right)+(1-t)\left(f(x) + (g')^\top (y-x)\right)\\ & = f(x) + (tg+(1-t)g')^\top (y-x) \end{align}\end{split}

and therefore $$tg+(1-t)g'\in \partial f(x)$$ and thus $$\partial f(x)$$ is convex.

Exercise 1b)¶

Exercise

Let $$f\colon \Omega\subset\mathbb R^n\to \mathbb R$$. Let $$x^*\in \Omega$$. Show that $$0\in \partial f(x^*)$$ if and only if $$x^*$$ is a global minimizer.

$$0\in\partial f(x^*)$$ means that

$f(y)\geq f(x^*) \qquad\forall y\in\Omega$

which is means $$x^*$$ is a global minimizer.

Exercise 1c)¶

Exercise

Consider $$f=|\cdot|\colon \mathbb R\to \mathbb R$$. Show that $$\partial f(0)=[-1,1]$$ and conclude that $$0$$ is a global minimizer of $$f$$.

Let $$g\in\mathbb R$$. Then we need $$g$$ such that $$f(y)=|y|\geq gy$$ for all $$y\in\mathbb R$$. If $$g$$ and $$y$$ have opposite signs, then this inequality is trivially true, since the right-hand side is negative. Then if $$y\geq 0$$ we obtain $$g\leq 1$$, whereas for $$g\leq 0$$ we obtain $$g\geq -1$$. Therefore only $$g\in[-1,1]$$ satisfies the inequality for all $$y\in\mathbb R$$, and $$\partial f(0)=[-1,1]$$. Since $$0\in\partial f(0)$$, $$0$$ is a global minimizer by 1b).

Exercise 1d)¶

Exercise

Let $$f(x) = \max_{i=1,\dots,n} f_i(x)$$. Show that

$\partial f(x) \supset \text{Co}\left(\bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)\right)$

where Co denotes the convex hull. By $$i\mid f_i(x)=f(x)$$ we mean those $$i\in\{1,\dots,m\}$$ such that $$f_i(x)=f(x)$$.

Remark: The opposite inclusion is also true, but a bit harder to show.

Let $$x\in \Omega$$ and consider $$k$$ such that $$f(x) = f_k(x)$$. Then for $$g\in \partial f_k(x)$$ we have that

$f(y)\geq f_k(y) \geq f_k(x)+g^\top (y-x) = f(x)+g^\top (y-x)$

hence $$g\in \partial f(x)$$. Therefore we have that

$\partial f(x)\supset \bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)$

since $$\partial f(x)$$ is convex, this inequality remains true if we take the convex hull on the right hand side.

Exercise 1e)¶

Exercise

Consider the function $$f\colon \mathbb R^2\to \mathbb R$$, $$f(x,y)=\max \{f_1(x,y),\,f_2(x,y)\}$$ with

$f_1(x,y) = x^2,\qquad f_2(x,y) = \alpha x^2+y$

where $$\alpha\in (0,1)$$ is a fixed parameter. Compute $$\partial f(0,0)$$ and show that $$(0,0)$$ is a global minimzer of $$f$$.

Hint: You can use the opposite inclusion of 1d): $$\partial f(x) = \text{Co}\left(\bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)\right)$$

Let $$g = (g_1,g_2)$$, then since $$f_1(0,0)=f_2(0,0)=f(0,0)$$ we have that

$\partial f(0,0) = \text{Co}(\partial f_1(0,0)\cup\partial f_2(0,0))$

For $$\partial f_1(0,0)$$ we get inequality

$x^2\geq g_1x+g_2y$

For this to hold for all $$y$$, we need $$g_2=0$$. Furthermore $$g_1=0$$ as well, since $$x^2<g_1 x$$ for any $$g_1\neq 0$$ for $$x$$ sufficiently close to zero. Hence $$\partial f_1(0,0)=\{(0,0)\}$$. In fact, since $$f_1$$ is differentiable we could immediately conclude this.

For $$\partial f_2(0,0)$$ we get inequality,

$\alpha x^2+y \geq g_1x+g_2y$

By the same argumentation as before, we obtain $$g_1=0$$. If $$x=0$$, $$y>0$$ the inequality holds iff $$g_2\leq 1$$, whereas for $$y<0$$ it holds iff $$g_2\geq 1$$. Hence $$g_2=1$$, and $$\partial f_2(0,0) = \{(0,1)\}$$. Again, we could have conlcuded this by taking the derivative of $$f_2$$ as well.

In conclusion, we have that $$\partial f(0,0) = \{0\}\times [0,1]$$. Since this includes $$\{(0,0)\}$$, we conclude $$(0,0)$$ is a global minimizer of $$f$$.

Below we see a contour plot of this function. The plot suggests that all the points on the line $$0\times(\infty,0]$$ minimize the function (solid red line). The dotted black line shows the region where $$f_1=f_2$$.

[33]:

import matplotlib.pyplot as plt
import numpy as np

alpha = 0.5

t = np.linspace(-1, 1, 100)
xx, yy = np.meshgrid(t, t)
f1 = xx ** 2
f2 = alpha * xx ** 2 + yy
f = np.max([f1, f2], axis=0)

CS = plt.contour(xx, yy, f, [0.003, 0.05, 0.1, 0.2, 0.5, 1])
plt.clabel(CS)
plt.plot(t, t ** 2 * alpha, ls="--", c="k", alpha=0.7)
plt.axvline(0, 0, 0.5, c="r")
plt.xlabel("x")
plt.ylabel("y");


Exercise

Let $$0\prec A\in \mathbb R^{n\times n}$$ be a positive definite matrix, and let $$b\in\mathbb R^n$$ and $$c\in \mathbb R$$. We consider the problem of minimizing the quadratic function

$f(x) = \frac12x^\top A x - b^\top x+c$

using gradient descent methods. Recall that this problem has unique minimizer $$x_*=A^{-1}b$$.

Exercise 2a)¶

Exercise

Let $$x_{k+1} = x_k-t\nabla f(x_k)$$. Show that

$\|x_{k+1}-x_*\|\leq \|I-t A\|\|x_k-x_*\|$

where the norm on the left of the right-hand-side is the operator norm.

Hint: You can use that $$Ax_*=b$$.

We have

\begin{split}\begin{align} \|x_{k+1}-x_*\| &= \|x_k-t\nabla f(x_k)-x_*\|\\ &= \|x_k-t (A x_k-b ) -x_*\|\\ &= \|x_k-t (A x_k-Ax_* ) -x_*\|\\ &= \|(I-t A)(x_k-x_*)\|\\ &\leq \|I-tA\|\|x_k-x_*\| \end{align}\end{split}

Here we used the property that $$\|Av\|\leq \|A\|\|v\|$$. This is true by the definiton of the operatorn norm: $$\|A\| = \operatorname{sup}_{v\neq 0} \|Av\|/\|v\|$$.

Exercise 2b)¶

Exercise

Let $$\lambda_1,\lambda_n>0$$ be respectively the smallest and biggest eigenvalue of $$A$$. Show that

$\|I-tA\| = \max\{|1-t\lambda_1|,|1-t\lambda_n|\}.$

Hint: First show that $$\|I-tA\| = \max_i|1-t\lambda_i|$$ with $$\lambda_i$$ the eigenvalues of $$A$$.

Since $$I$$ and $$tA$$ commute, the specturm of $$I-tA$$ is given by $$1-t\cdot \sigma (A)$$. Hence we have that $$\|I-tA\| = \max_i |1-t\lambda_i|$$, with $$\lambda_i$$ the eigenvalues of $$A$$. For fixed $$t$$, we have that $$|1-t\lambda_i|$$ is as large as possible either if $$\lambda_i$$ is the smallest or the biggest eigenvalue, hence

$h(t) = \|I-tA\| = \max\{|1-t\lambda_1|,|1-t\lambda_n|\}$

Exercise 2c)¶

Exercise

Show that $$t=2/(\lambda_1+\lambda_n)$$ minimizes $$h(t)=\|I-tA\|$$

Hint: You can use the fact that this function is minimized if and only if $$0\in\partial h(t)$$ (c.f. Exercise 1b) and 1d))

The subgradient $$\partial h(t)$$ is a single (non-zero) point if $$|1-t\lambda_1|\neq |1-t\lambda_n|$$, since both $$|1-t\lambda_1|$$ and $$|1-t\lambda_n|$$ are differentiable if $$1\neq t\lambda_i$$, but if $$1=t\lambda_i$$ then $$|1-t\lambda_j|$$ for $$j\neq i$$ is always bigger. Therefore $$0\in \partial h(t)$$ only if $$|1-t\lambda_1|=|1-t\lambda_n|$$.

This means either $$1-t\lambda_1=1-t\lambda_n$$, and hence $$t=0$$. It is easy to see that $$h(0)=[\lambda_1,\lambda_n]\not\owns 0$$.

Otherwise it means $$1-t\lambda-1=t\lambda_n-1$$, and hence $$t=2/(\lambda_n+\lambda_n)$$. At this point we have $$h(t)=[-\lambda_n,\lambda_1]\owns 0$$ and hence $$t=2/(\lambda_n+\lambda_n)$$ is a global minimizer.

The function $$h(t)$$ is shown below. Between $$-\infty$$ and $$0$$ it has slope $$-\lambda_n$$, then between $$0$$ and $$2/(\lambda_1+\lambda_2)$$ it has slope $$-\lambda_1$$. Then after that it has slope $$\lambda_n$$.

[2]:

import matplotlib.pyplot as plt
import numpy as np

lambda_1 = 1
lambda_n = 2
t_opt = 2 / (lambda_1 + lambda_n)

t = np.linspace(-0.5, 1.5, 100)
y = np.max([np.abs(1 - t * lambda_1), np.abs(1 - t * lambda_n)], axis=0)
plt.plot(t, np.abs(1 - t * lambda_1), alpha=0.8)
plt.plot(t, np.abs(1 - t * lambda_n), alpha=0.8)
plt.plot(t, y, lw=3, c="r")
plt.axvline(0, c="k", ls="--", alpha=0.5)
plt.axvline(t_opt, c="k", ls="--", alpha=0.5);


Exercise 2d)¶

Exercise

Using the previous exercises, derive a convergence rate $$\gamma>0$$ of gradient descent with optimal step size $$t$$ in terms of the condition number $$\kappa = \lambda_n/\lambda_1$$:

$\|x_k-x_*\|\leq \gamma^k\|x_0-x_*\|$

We have that $$h(t)=\|I-tA\|$$ is minimzed if $$t=2/(\lambda_1+\lambda_n)$$, in which case

$h(t) = 1-\frac{2\lambda_1}{\lambda_1+\lambda_n} = \frac{\lambda_n-\lambda_1}{\lambda_1+\lambda_n} = \frac{\kappa-1}{\kappa+1}.$

Hence by induction we get convergence rate

$\|x_k-x_*\|\leq \left(\frac{\kappa-1}{\kappa+1}\right)^k\|x_0-x_*\|$