Week 12¶
Exercise 1: Subgradients¶
Exercise
Let \(f\colon \Omega\subset\mathbb R^n\to \mathbb R\) (not necessarily convex or smooth). We call \(g\in \mathbb R^n\) a subgradient of \(f\) at \(x\in\Omega\) if:
\[f(y)\geq f(x)+\langle g,\,y-x\rangle\quad\text{for all } y\in\Omega.\]
The set of all subgradients of \(f\) at \(x\) is denoted \(\partial f(x)\). Note that if \(f\) is convex and differentiable at \(x\), then \(\nabla f(x)\in \partial f(x)\) by the first-order characterization of convexity. The point of subgradients is that they give a useful, well-defined analogue of gradients at points where \(f\) is not differentiable.
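As a quick numerical illustration of this definition (not part of the exercise), the following sketch checks the subgradient inequality for the convex, differentiable example \(f(x)=\|x\|^2\) (an arbitrary choice), whose gradient \(2x\) should be a subgradient at every point; the random test points are likewise arbitrary.
[ ]:
import numpy as np

# Sketch: for f(x) = ||x||^2 the gradient g = 2x should satisfy the
# subgradient inequality f(y) >= f(x) + <g, y - x> at every x.
rng = np.random.default_rng(0)
f = lambda x: np.sum(x ** 2)

x = rng.normal(size=3)
g = 2 * x  # gradient of f at x
ys = rng.normal(size=(1000, 3))
gaps = np.array([f(y) - (f(x) + g @ (y - x)) for y in ys])
print(gaps.min() >= -1e-12)  # True: the inequality holds for all sampled y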
Exercise 1a)¶
Exercise
Show that \(\partial f(x)\) is a convex set.
Let \(g,g'\in\partial f(x)\) and let \(t\in [0,1]\), \(y\in \Omega\). Then,
\[f(x)+\langle tg+(1-t)g',\,y-x\rangle = t\bigl(f(x)+\langle g,\,y-x\rangle\bigr)+(1-t)\bigl(f(x)+\langle g',\,y-x\rangle\bigr)\leq t f(y)+(1-t)f(y)=f(y),\]
and therefore \(tg+(1-t)g'\in \partial f(x)\) and thus \(\partial f(x)\) is convex.
Exercise 1b)¶
Exercise
Let \(f\colon \Omega\subset\mathbb R^n\to \mathbb R\). Let \(x^*\in \Omega\). Show that \(0\in \partial f(x^*)\) if and only if \(x^*\) is a global minimizer.
\(0\in\partial f(x^*)\) means that
\[f(y)\geq f(x^*)+\langle 0,\,y-x^*\rangle = f(x^*)\quad\text{for all } y\in\Omega,\]
which holds if and only if \(x^*\) is a global minimizer.
Exercise 1c)¶
Exercise
Consider \(f=|\cdot|\colon \mathbb R\to \mathbb R\). Show that \(\partial f(0)=[-1,1]\) and conclude that \(0\) is a global minimizer of \(f\).
Let \(g\in\mathbb R\). Since \(f(0)=0\), \(g\in\partial f(0)\) if and only if \(f(y)=|y|\geq gy\) for all \(y\in\mathbb R\). If \(g\) and \(y\) have opposite signs, this inequality is trivially true, since the right-hand side is non-positive. If \(y>0\) we obtain \(g\leq 1\), whereas for \(y<0\) we obtain \(g\geq -1\). Therefore exactly the \(g\in[-1,1]\) satisfy the inequality for all \(y\in\mathbb R\), and \(\partial f(0)=[-1,1]\). Since \(0\in\partial f(0)\), \(0\) is a global minimizer by 1b).
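A small numerical check of this computation (illustration only, with an arbitrary grid of test points \(y\)):
[ ]:
import numpy as np

# g is a subgradient of |.| at 0 iff |y| >= g*y for all y; test on a grid.
y = np.linspace(-2, 2, 401)
is_subgradient = lambda g: bool(np.all(np.abs(y) >= g * y))

print([is_subgradient(g) for g in (-1.0, -0.5, 0.0, 0.5, 1.0)])  # all True
print([is_subgradient(g) for g in (-1.5, 1.5)])                  # all False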
Exercise 1d)¶
Exercise
Let \(f(x) = \max_{i=1,\dots,m} f_i(x)\). Show that
\[\partial f(x) \supseteq \text{Co}\left(\bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)\right),\]
where Co denotes the convex hull. By \(i\mid f_i(x)=f(x)\) we mean those \(i\in\{1,\dots,m\}\) such that \(f_i(x)=f(x)\).
Remark: The opposite inclusion is also true, but a bit harder to show.
Let \(x\in \Omega\) and consider \(k\) such that \(f(x) = f_k(x)\). Then for \(g\in \partial f_k(x)\) we have that
\[f(y)\geq f_k(y)\geq f_k(x)+\langle g,\,y-x\rangle = f(x)+\langle g,\,y-x\rangle\quad\text{for all } y\in\Omega,\]
hence \(g\in \partial f(x)\). Therefore we have that
\[\partial f(x) \supseteq \bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x);\]
since \(\partial f(x)\) is convex, this inclusion remains true if we take the convex hull on the right-hand side.
Exercise 1e)¶
Exercise
Consider the function \(f\colon \mathbb R^2\to \mathbb R\), \(f(x,y)=\max \{f_1(x,y),\,f_2(x,y)\}\) with
\[f_1(x,y)=x^2,\qquad f_2(x,y)=\alpha x^2+y,\]
where \(\alpha\in (0,1)\) is a fixed parameter. Compute \(\partial f(0,0)\) and show that \((0,0)\) is a global minimizer of \(f\).
Hint: You can use the opposite inclusion of 1d): \(\partial f(x) = \text{Co}\left(\bigcup_{i\mid f_i(x)=f(x)}\partial f_i(x)\right)\)
Let \(g = (g_1,g_2)\). Since \(f_1(0,0)=f_2(0,0)=f(0,0)=0\), we have by the hint that
\[\partial f(0,0) = \text{Co}\bigl(\partial f_1(0,0)\cup \partial f_2(0,0)\bigr).\]
For \(\partial f_1(0,0)\) we get the inequality
\[x^2\geq g_1 x+g_2 y\quad\text{for all } (x,y)\in\mathbb R^2.\]
For this to hold for all \(y\), we need \(g_2=0\). Furthermore \(g_1=0\) as well, since \(x^2<g_1 x\) for any \(g_1\neq 0\) when \(x\) has the same sign as \(g_1\) and is sufficiently close to zero. Hence \(\partial f_1(0,0)=\{(0,0)\}\). In fact, since \(f_1\) is convex and differentiable, we could have concluded this immediately from \(\nabla f_1(0,0)=(0,0)\).
For \(\partial f_2(0,0)\) we get the inequality
\[\alpha x^2+y\geq g_1 x+g_2 y\quad\text{for all } (x,y)\in\mathbb R^2.\]
By the same argument as before, we obtain \(g_1=0\). If \(x=0\), \(y>0\), the inequality holds iff \(g_2\leq 1\), whereas for \(y<0\) it holds iff \(g_2\geq 1\). Hence \(g_2=1\), and \(\partial f_2(0,0) = \{(0,1)\}\). Again, we could have concluded this by computing \(\nabla f_2(0,0)\).
In conclusion, we have that \(\partial f(0,0) = \text{Co}\bigl(\{(0,0)\}\cup\{(0,1)\}\bigr) = \{0\}\times [0,1]\). Since this contains \((0,0)\), we conclude by 1b) that \((0,0)\) is a global minimizer of \(f\).
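This can also be verified numerically; the sketch below (with the arbitrary choice \(\alpha=0.5\), matching the plot further down, and an arbitrary grid of test points) tests the subgradient inequality \(f(x,y)\geq g_1x+g_2y\) for a few candidate vectors \(g\).
[ ]:
import numpy as np

# Numerical check of the subdifferential at (0,0): since f(0,0) = 0,
# g = (g1, g2) is a subgradient iff f(x,y) >= g1*x + g2*y for all (x,y).
alpha = 0.5
t = np.linspace(-1, 1, 201)
xx, yy = np.meshgrid(t, t)
f = np.maximum(xx ** 2, alpha * xx ** 2 + yy)

def is_subgradient(g1, g2, tol=1e-12):
    return bool(np.all(f >= g1 * xx + g2 * yy - tol))

print(is_subgradient(0, 0), is_subgradient(0, 0.5), is_subgradient(0, 1))  # True
print(is_subgradient(0.1, 0.5), is_subgradient(0, 1.1))                    # False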
Below we see a contour plot of this function. The plot suggests that all the points on the ray \(\{0\}\times(-\infty,0]\) minimize the function (solid red line). The dashed black line shows the curve where \(f_1=f_2\).
[33]:
import matplotlib.pyplot as plt
import numpy as np

alpha = 0.5

# Evaluate f = max(f1, f2) on a grid
t = np.linspace(-1, 1, 100)
xx, yy = np.meshgrid(t, t)
f1 = xx ** 2
f2 = alpha * xx ** 2 + yy
f = np.max([f1, f2], axis=0)

CS = plt.contour(xx, yy, f, [0.003, 0.05, 0.1, 0.2, 0.5, 1])
plt.clabel(CS)
plt.plot(t, t ** 2 * (1 - alpha), ls="--", c="k", alpha=0.7)  # curve where f1 = f2: y = (1 - alpha) * x^2
plt.axvline(0, 0, 0.5, c="r")  # minimizing ray {0} x (-inf, 0]
plt.xlabel("x")
plt.ylabel("y");

Exercise 2: Gradient descent for quadratic problem¶
Exercise
Let \(0\prec A\in \mathbb R^{n\times n}\) be a positive definite matrix, and let \(b\in\mathbb R^n\) and \(c\in \mathbb R\). We consider the problem of minimizing the quadratic function
\[f(x)=\tfrac12\langle x,\,Ax\rangle-\langle b,\,x\rangle+c\]
using gradient descent methods. Recall that this problem has unique minimizer \(x_*=A^{-1}b\).
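A minimal sketch of the setting (the matrix, vector, seed and step size below are made-up illustration data): gradient descent on \(f\) converges to \(x_*=A^{-1}b\).
[ ]:
import numpy as np

# Gradient descent on f(x) = 0.5 <x, Ax> - <b, x> (the constant c does not
# affect the iterates) converges to x_* = A^{-1} b.
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)      # symmetric positive definite
b = rng.normal(size=5)
x_star = np.linalg.solve(A, b)

t = 1.0 / np.linalg.norm(A, 2)   # a safe (not yet optimal) step size
x = np.zeros(5)
for _ in range(500):
    x = x - t * (A @ x - b)      # gradient step: grad f(x) = A x - b
print(np.linalg.norm(x - x_star))  # close to zero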
Exercise 2a)¶
Exercise
Let \(x_{k+1} = x_k-t\nabla f(x_k)\). Show that
\[\|x_{k+1}-x_*\|\leq \|I-tA\|\,\|x_k-x_*\|,\]
where the first norm on the right-hand side is the operator norm.
Hint: You can use that \(Ax_*=b\).
We have
\[x_{k+1}-x_* = x_k-t\nabla f(x_k)-x_* = x_k-x_*-t(Ax_k-b) = x_k-x_*-tA(x_k-x_*) = (I-tA)(x_k-x_*),\]
using \(Ax_*=b\), and therefore
\[\|x_{k+1}-x_*\| = \|(I-tA)(x_k-x_*)\|\leq \|I-tA\|\,\|x_k-x_*\|.\]
Here we used the property that \(\|Av\|\leq \|A\|\|v\|\), which holds by the definition of the operator norm: \(\|A\| = \sup_{v\neq 0} \|Av\|/\|v\|\).
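The one-step bound can be checked numerically on random data (illustration only; the matrix, seed and step size are arbitrary):
[ ]:
import numpy as np

# One gradient step satisfies ||x_{k+1} - x_*|| <= ||I - tA|| ||x_k - x_*||.
rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)
b = rng.normal(size=4)
x_star = np.linalg.solve(A, b)

t = 0.1
x_k = rng.normal(size=4)
x_next = x_k - t * (A @ x_k - b)
lhs = np.linalg.norm(x_next - x_star)
rhs = np.linalg.norm(np.eye(4) - t * A, 2) * np.linalg.norm(x_k - x_star)
print(lhs <= rhs + 1e-12)  # True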
Exercise 2b)¶
Exercise
Let \(\lambda_1,\lambda_n>0\) be respectively the smallest and biggest eigenvalue of \(A\). Show that
\[\|I-tA\| = \max\{|1-t\lambda_1|,\,|1-t\lambda_n|\}.\]
Hint: First show that \(\|I-tA\| = \max_i|1-t\lambda_i|\) with \(\lambda_i\) the eigenvalues of \(A\).
Since \(A\) is symmetric, \(I-tA\) is symmetric as well, so its operator norm equals the largest absolute value of its eigenvalues; these eigenvalues are \(1-t\lambda_i\), with \(\lambda_i\) the eigenvalues of \(A\). Hence \(\|I-tA\| = \max_i |1-t\lambda_i|\). For fixed \(t\), the map \(\lambda\mapsto|1-t\lambda|\) is convex, so its maximum over \([\lambda_1,\lambda_n]\) is attained at the smallest or the biggest eigenvalue, hence
\[\|I-tA\| = \max\{|1-t\lambda_1|,\,|1-t\lambda_n|\}.\]
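A numerical check of this identity for a randomly generated symmetric positive definite matrix (illustration only; the matrix and step sizes are arbitrary):
[ ]:
import numpy as np

# ||I - tA|| should equal max(|1 - t*lambda_1|, |1 - t*lambda_n|).
rng = np.random.default_rng(3)
M = rng.normal(size=(6, 6))
A = M @ M.T + np.eye(6)
eigs = np.linalg.eigvalsh(A)   # eigenvalues in ascending order
lam_1, lam_n = eigs[0], eigs[-1]

for t in (0.01, 0.1, 0.5, 2.0):
    lhs = np.linalg.norm(np.eye(6) - t * A, 2)
    rhs = max(abs(1 - t * lam_1), abs(1 - t * lam_n))
    print(np.isclose(lhs, rhs))  # True for each t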
Exercise 2c)¶
Exercise
Show that \(t=2/(\lambda_1+\lambda_n)\) minimizes \(h(t)=\|I-tA\|\).
Hint: You can use the fact that this function is minimized if and only if \(0\in\partial h(t)\) (c.f. Exercise 1b) and 1d))
The subdifferential \(\partial h(t)\) is a single non-zero point if \(|1-t\lambda_1|\neq |1-t\lambda_n|\): the larger of the two terms is then the only active one, and it is differentiable at \(t\) with derivative \(\pm\lambda_i\neq 0\). (Both terms are differentiable wherever \(1\neq t\lambda_i\), and if \(1=t\lambda_i\), then \(|1-t\lambda_j|\) for \(j\neq i\) is strictly bigger, so the non-differentiable term is not active.) Therefore \(0\in \partial h(t)\) only if \(|1-t\lambda_1|=|1-t\lambda_n|\).
This means either \(1-t\lambda_1=1-t\lambda_n\), and hence \(t=0\); but \(\partial h(0)=\text{Co}\{-\lambda_1,-\lambda_n\}=[-\lambda_n,-\lambda_1]\not\owns 0\).
Otherwise it means \(1-t\lambda_1=t\lambda_n-1\), and hence \(t=2/(\lambda_1+\lambda_n)\). At this point we have \(\partial h(t)=\text{Co}\{-\lambda_1,\lambda_n\}=[-\lambda_1,\lambda_n]\owns 0\), and hence \(t=2/(\lambda_1+\lambda_n)\) is a global minimizer.
The function \(h(t)\) is shown below. Between \(-\infty\) and \(0\) it has slope \(-\lambda_n\), between \(0\) and \(2/(\lambda_1+\lambda_n)\) it has slope \(-\lambda_1\), and after that it has slope \(\lambda_n\).
[2]:
import matplotlib.pyplot as plt
import numpy as np

# Example eigenvalues
lambda_1 = 1
lambda_n = 2
t_opt = 2 / (lambda_1 + lambda_n)

t = np.linspace(-0.5, 1.5, 100)
# h(t) = max(|1 - t*lambda_1|, |1 - t*lambda_n|)
y = np.max([np.abs(1 - t * lambda_1), np.abs(1 - t * lambda_n)], axis=0)

plt.plot(t, np.abs(1 - t * lambda_1), alpha=0.8)
plt.plot(t, np.abs(1 - t * lambda_n), alpha=0.8)
plt.plot(t, y, lw=3, c="r")  # h(t)
plt.axvline(0, c="k", ls="--", alpha=0.5)
plt.axvline(t_opt, c="k", ls="--", alpha=0.5);  # optimal step size
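As a further sanity check (with a randomly generated matrix, illustration only), minimizing \(h(t)=\|I-tA\|\) over a fine grid recovers a value close to \(2/(\lambda_1+\lambda_n)\):
[ ]:
import numpy as np

# Grid search for the minimizer of h(t) = ||I - tA||; it should be close
# to the analytic optimum 2 / (lambda_1 + lambda_n).
rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)
eigs = np.linalg.eigvalsh(A)
lam_1, lam_n = eigs[0], eigs[-1]

ts = np.linspace(0, 2 / lam_1, 20001)
h = np.array([np.linalg.norm(np.eye(5) - t * A, 2) for t in ts])
print(ts[np.argmin(h)], 2 / (lam_1 + lam_n))  # nearly equal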

Exercise 2d)¶
Exercise
Using the previous exercises, derive a convergence rate \(\gamma>0\) of gradient descent with optimal step size \(t\) in terms of the condition number \(\kappa = \lambda_n/\lambda_1\):
\[\|x_k-x_*\|\leq \gamma^k\,\|x_0-x_*\|.\]
We have that \(h(t)=\|I-tA\|\) is minimized at \(t=2/(\lambda_1+\lambda_n)\), in which case
\[\|I-tA\| = \frac{\lambda_n-\lambda_1}{\lambda_n+\lambda_1} = \frac{\kappa-1}{\kappa+1}.\]
Hence by induction we get the convergence rate
\[\|x_k-x_*\|\leq \left(\frac{\kappa-1}{\kappa+1}\right)^k\|x_0-x_*\|,\qquad\text{i.e. }\gamma=\frac{\kappa-1}{\kappa+1}.\]
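The rate can also be observed experimentally; the sketch below (made-up matrix and seed) runs gradient descent with the optimal step size and checks that the error contracts by at least \(\gamma=(\kappa-1)/(\kappa+1)\) per iteration.
[ ]:
import numpy as np

# With the optimal step size the error contracts by at least
# gamma = (kappa - 1) / (kappa + 1) in every iteration.
rng = np.random.default_rng(5)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)
b = rng.normal(size=5)
x_star = np.linalg.solve(A, b)

eigs = np.linalg.eigvalsh(A)
lam_1, lam_n = eigs[0], eigs[-1]
kappa = lam_n / lam_1
gamma = (kappa - 1) / (kappa + 1)
t_opt = 2 / (lam_1 + lam_n)

x = np.zeros(5)
errors = [np.linalg.norm(x - x_star)]
for _ in range(50):
    x = x - t_opt * (A @ x - b)
    errors.append(np.linalg.norm(x - x_star))
ratios = np.array(errors[1:]) / np.array(errors[:-1])
print(bool(np.all(ratios <= gamma + 1e-12)), gamma)  # True, and the value of gamma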