Open in Colab

Exam project

This is the Python programming homework that counts towards the exam of this course. It will require a bit more independence from you than the other homework, and you will need to write more code entirely by yourself. Furthermore, there are no automatic tests, so you need to verify yourself that your solution is correct.

As usual, you can do this homework either on Google Colab or on your own machine (after installing conda).

We begin by running the cell below. If you’re running this notebook on a local Python installation, make sure that this cell does not throw any errors. If it does, you will need to install the missing packages, which you can do with pip from the command line.

[1]:
from importlib_metadata import version

import cvxpy as cp
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (9, 6)
import numpy as np
import pandas as pd
import scipy

import warnings

warnings.filterwarnings("ignore")  # don't display irrelevant deprecation warnings

Part 1: Alternating projection algorithm

Most solvers of convex problems \(\min_{x\in\Omega}f(x)\) start by finding a feasible point \(x\in \Omega\). For some constraint sets this step can be very difficult, but if \(\Omega\) is the intersection of several simple constraints, then we can use a relatively simple method to find a feasible point.

Suppose that \(\Omega = C_1\cap\dots\cap C_k\subset \mathbb R^n\), with each \(C_i\subset \mathbb R^n\) closed and convex, and such that we have a projection map \(P_{C_i}\colon \mathbb R^n\to C_i\). Recall that \(P_{C_i}(x)\) returns the point \(y \in C_i\) that is closest to \(x\) in Euclidean norm. Then, starting with some \(x_0\), we could naively set \(x = P_{C_k}\dots P_{C_2}P_{C_1}x_0\) and expect that \(x\in \Omega\). Unfortunately this is not the case in general, but we can iterate the procedure; that is, we can set

\[x_{i+1} = P_{C_k}\dots P_{C_2}P_{C_1}x_i, \qquad i=0,1, \ldots.\]

This iterative procedure is known as the alternating projection algorithm, also known as projection onto convex sets (POCS).

One can prove that, as long as \(\Omega\neq\emptyset\), the iterates \((x_i)\) produced by the alternating projection algorithm converge linearly to a point \(x\in\Omega\). Note that this point \(x\) is not necessarily the projection of \(x_0\) onto \(\Omega\), nor does the algorithm necessarily produce a point in \(\Omega\) in a finite number of steps (hence I think that “projection onto convex sets” is a bit of a misnomer). Nevertheless, for the purpose of finding a feasible point for a convex solver this is good enough: as long as \(x\) satisfies the constraints up to some small error on the order of the machine epsilon, we can consider \(x\) to lie in \(\Omega\) from a numerical point of view.
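Before turning to concrete constraint sets, here is a minimal sketch of the iteration above, assuming the projections \(P_{C_i}\) are given as Python callables; the function name, the projections argument, and the tolerance tol are only illustrative, not part of the exercises below.

import numpy as np


def alternating_projection(projections, x0, max_its=100, tol=1e-12):
    """Sketch of the alternating projection (POCS) iteration.

    `projections` is a list of callables [P_C1, ..., P_Ck], each mapping
    a point of R^n to its Euclidean projection onto one convex set C_i.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_its):
        x_old = x.copy()
        for P in projections:  # apply P_{C_1}, then P_{C_2}, ..., then P_{C_k}
            x = P(x)
        if np.linalg.norm(x - x_old) <= tol:  # iterates have (numerically) stopped moving
            break
    return x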

Exercise 1a) Finding a point in a polyhedron

Recall that a convex polyhedron is given by a set of inequalities of the form \(C_i=\{x\in\mathbb R^n\mid a_i^\top x\leq b_i\}\) with \(a_i\in\mathbb R^n\) and \(b_i\in \mathbb R\), or, in matrix form, \(\Omega = \{x\in\mathbb R^n\mid Ax\leq b\}\) for some matrix \(A\) and vector \(b\). Each convex set \(C_i\) is a halfspace, and we can project onto it by setting

\[\begin{split}P_{C_i}(x) = \begin{cases}x &\text{if } a_i^\top x\leq b_i\\ x-(a_i^\top x-b_i){a_i}/{\|a_i\|^2} & \text{if } a_i^\top x> b_i\end{cases}\end{split}\]
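As a small illustration (the helper project_halfspace and the numbers below are just for this example, not part of the exercise): when the constraint is violated, the projected point lands exactly on the hyperplane \(a_i^\top x = b_i\).

import numpy as np


def project_halfspace(x, a, b):
    """Project x onto the halfspace {z : a @ z <= b}."""
    violation = a @ x - b
    if violation <= 0:  # x already satisfies the constraint
        return x
    return x - violation * a / (a @ a)


a = np.array([1.0, 2.0])
x = np.array([3.0, 3.0])  # violates a @ x <= 1
p = project_halfspace(x, a, 1.0)
print(a @ p)  # 1.0: the projection lies on the boundary hyperplane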

Implement the alternating projection algorithm for finding a point in a polyhedron. As input you should accept any \(m\times n\) matrix A and length \(m\) vector b.

Do at most max_its iterations. Sometimes the algorithm will find a point inside the polyhedron after a finite number of steps; otherwise it will converge to a point very close to the polyhedron (at a distance on the order of \(10^{-16}\)). Try to detect this convergence, and when it happens make the algorithm stop early (i.e. in fewer than max_its iterations).


To generate test problems, we need to generate random polytopes in arbitrary dimensions. The simplest way to do this is to compute the convex hull of a random set of points. For this we can use the scipy routine scipy.spatial.ConvexHull, which is a Python interface to the Qhull library.

  • Computing the convex hull in high dimensions can be very expensive. Only run random_polytope for values of n_dims up to about 10.
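As a quick, self-contained sanity check (not required for the exercise), you can verify that the A and b extracted from hull.equations in the cell below really describe the hull: every point used to build it must satisfy \(Ax\leq b\) up to round-off.

import numpy as np
from scipy.spatial import ConvexHull

pts = np.random.normal(size=(20, 3))
hull = ConvexHull(pts)
A = hull.equations[:, :-1]  # outward facet normals
b = -hull.equations[:, -1]  # offsets, so that the hull is {x : A @ x <= b}
print(np.max(A @ pts.T - b[:, None]))  # at most of the order of machine precision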

[2]:
def random_polytope(n_dims, n_points=None):
    """Generate a random polytope by taking convex hull of random points"""
    from scipy.spatial import ConvexHull

    if n_points is None:  # number of random points used to build the hull
        n_points = n_dims * 2 + 10
    points = np.random.normal(size=(n_points, n_dims))
    shift = np.random.uniform(
        -100, 100, size=n_dims
    )  # Shift the points so 0 isn't a feasible point
    points += shift

    # Find the convex hull, and extract the hyperplane constraints
    hull = ConvexHull(points)
    A = hull.equations[:, :-1]
    b = -hull.equations[:, -1]
    return A, b, hull


def plot_hull_2D(hull):
    """Plot a 2D convex hull"""
    points = hull.points
    for n1, n2 in hull.simplices:  # in 2D, each simplex is an edge given by two point indices
        v1 = points[n1]
        v2 = points[n2]
        plt.plot([v1[0], v2[0]], [v1[1], v2[1]], c="k")
    plt.plot(points[hull.vertices, 0], points[hull.vertices, 1], "or")
[3]:
def alternating_projection_polyhedron(A, b, max_its=100):
    """
    Compute a feasible point in the polyhedron A@x<=b using alternating
    projection algorithm.
    Use a maximum of `max_its` iterations. Stop early when convergence is reached.
    """

    # BEGIN SOLUTION
    x = np.zeros(A.shape[1])

    def proj_once(i, x):
        "project `x` onto `i`th hyperplane"
        a = A[i]
        bi = b[i]
        violation = np.dot(a, x) - bi
        if violation <= 0:
            return x, 0
        else:
            return x - violation * a / np.dot(a, a), violation

    for _ in range(max_its):
        max_violation = 0
        for i in range(len(A)):
            x, violation = proj_once(i, x)
            max_violation = max(max_violation, violation)
        if max_violation <= 1e-12:
            break
    return x
    # END SOLUTION


# Generate a random polytope, and plot it if it is 2D.
# You should set `dim` to other values, and see if it still works.
# Don't set `dim` higher than ~10, since then the convex hull
# takes a long time to compute.
dim = 2
A, b, hull = random_polytope(dim)
if dim == 2:
    plot_hull_2D(hull)

x = alternating_projection_polyhedron(A, b)

plt.plot([x[0]], [x[1]], "ob", markersize=10);
[Output figure: the random 2D polytope, with the feasible point found by alternating projection shown as a blue dot.]

Exercise 1b): Feasible point for SDP constraints

Recall that SDP constraints on a real symmetric matrix \(X\in\mathbb S_n\) are given by

\[\begin{split}\begin{align} X&\succeq 0\\ \langle A_i,X\rangle &= b_i,\qquad i=1,\dots,k \end{align}\end{split}\]

with \(A_i\in \mathbb S_n\) and \(b_i\in\mathbb R\). Projecting onto the hyperplane \(C_i=\{X\in \mathbb S_n\mid \langle A_i,X\rangle = b_i\}\) is easy: similar to the linear case, we have

\[P_{C_i}(X) = X - \left(\langle A_i,X\rangle-b_i\right)A_i/\langle A_i,A_i\rangle.\]

Projecting onto the positive semidefinite cone \(\mathbb S^n_+=\{X\succeq 0\}\) can be achieved through an eigenvalue decomposition. If \(X = U\Lambda U^\top\), with \(\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n)\), then we can project \(X\) by setting all the negative eigenvalues to zero:

\[P_{\mathbb S^n_+}(X) = U\mathrm{diag}(\max\{0,\lambda_1\},\dots,\max\{0,\lambda_n\})U^\top.\]
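A minimal sketch of these two projections (the function names project_affine and project_psd are only illustrative):

import numpy as np


def project_affine(X, A_i, b_i):
    """Project X onto the affine set {Z symmetric : <A_i, Z> = b_i}."""
    residual = np.sum(A_i * X) - b_i  # <A_i, X> - b_i
    return X - residual * A_i / np.sum(A_i * A_i)


def project_psd(X):
    """Project a symmetric X onto the PSD cone by clipping its eigenvalues."""
    eigvals, U = np.linalg.eigh(X)
    return (U * np.clip(eigvals, 0, None)) @ U.T  # U diag(max(0, lambda)) U^T


M = np.random.normal(size=(4, 4))
X = M + M.T
print(np.linalg.eigvalsh(project_psd(X)).min() >= -1e-12)  # True: the result is PSD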

Implement the alternating projection algorithm for SDP constraints. As input you need to accept a \(k\times n\times n\) array A and a length-\(k\) vector b. Here, A[i,:,:] represents \(A_i\). Do a maximum of max_its iterations.

You also need to study the convergence speed. To study the error you can, for example, look at the residual vector \(r \in \mathbb{R}^k\) defined as \(r_i = \langle A_i,X\rangle-b_i\), and then take a suitable norm. Make a plot showing how this error changes with each iteration. The vertical axis should be logarithmic.

Based on this analysis you can also implement a stopping criterion, that is, a criterion to stop the algorithm before max_its iterations.


Tips:

  • To compute the eigendecomposition of a symmetric matrix, you can use np.linalg.eigh. This routine is specialized to symmetric matrices, which allows it to use a faster algorithm.

  • If you first normalize \(A\) so that \(\langle A_i,A_i\rangle=1\), then you don’t need to divide by \(\langle A_i,A_i\rangle\) at every step.

  • You can make matplotlib use a logarithmic \(y\)-axis by calling plt.yscale('log')

  • For the experts: To compute the inner product of two matrices \(\langle A,B\rangle=\operatorname{tr}(A^\top B)\) efficiently, you can use np.tensordot. With the default argument axes=2, np.tensordot(A, B) contracts the last two axes of A with the first two axes of B. In effect this means that if A has shape (k,n,n) and X has shape (n,n), then np.tensordot(A, X) has shape (k,) and contains the \(k\) values \(\langle A_i,X\rangle\).
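A small demonstration of the normalization and tensordot tips, with throwaway random data (nothing here is needed for the exercise):

import numpy as np

k, n = 5, 8
A = np.random.normal(size=(k, n, n))
A = A + np.transpose(A, (0, 2, 1))  # make each A_i symmetric
X = np.random.normal(size=(n, n))
X = X + X.T

# np.tensordot(A, X) gives the k inner products <A_i, X> = tr(A_i^T X) at once
print(np.allclose(np.tensordot(A, X), [np.trace(Ai.T @ X) for Ai in A]))  # True

# normalizing each A_i so that <A_i, A_i> = 1 removes the division later on
A_unit = A / np.linalg.norm(A, axis=(1, 2), keepdims=True)
print(np.allclose([np.sum(Ai * Ai) for Ai in A_unit], 1.0))  # True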

[4]:
def alternating_projection_SDP(A, b, max_its=200):
    """
    Compute a feasible point for SDP constraints using alternating projection algorithm.
    Return an (almost) feasible point, together with residuals for the constraints.
    """
    # BEGIN SOLUTION
    k, n, _ = A.shape
    A_norm = 1 / np.linalg.norm(A, axis=(1, 2)) ** 2  # inverse squared Frobenius norms, computed only once
    X = np.eye(n)
    residuals = []
    for _ in range(max_its):
        # tensordot is more efficient than for loop containing matrix product + trace
        residual = np.tensordot(A, X) - b

        # subtract residual_i * A_i / <A_i, A_i> for every constraint i at once
        # (einsum is not more efficient than a loop here, but it is concise)
        X = X - np.einsum("i,ijk,i->jk", residual, A, A_norm)

        # compute eigendecomposition. Set negative eigenvalues to zero
        v, U = np.linalg.eigh(X)
        v = np.clip(v, 0, None)
        X = U @ np.diag(v) @ U.T

        residual_norm = np.linalg.norm(residual)
        residuals.append(residual_norm)
        if residual_norm < 1e-10:
            break
    return X, residuals
    # END SOLUTION


n = 10
k = 5

# Create array of random symmetric matrices
A = np.random.normal(size=(k, n, n))
A = A + np.transpose(A, (0, 2, 1))
b = np.random.normal(size=k)

X, residuals = alternating_projection_SDP(A, b)

# Plot how the residuals change with each iteration
# BEGIN SOLUTION
# here we plot norm of residual error.
plt.title("Norm of constraint deviations")
plt.plot(residuals, "o-")
plt.yscale("log")
# END SOLUTION
[Output figure: norm of the constraint deviations per iteration, on a logarithmic scale.]

Part 2: Frank-Wolfe method and LASSO

After finding a feasible point, there are many algorithms to solve a convex problem. The algorithm we will consider is the Frank-Wolfe method. The idea is that optimizing a linear function over a convex domain is relatively simple, so at each step of the algorithm we linearize our objective and take a step towards the optimum of the linearized function. That is, suppose we have a convex function \(f\colon\Omega \to \mathbb R\) and we start with some point \(\beta_0\in \Omega\). Then, in an iterative procedure, we define \(\beta_{k+1}\in\Omega\) given \(\beta_k\in\Omega\) at step \(k\) by first solving the linear problem

\[\alpha_k =\operatorname{argmin}_{\alpha \in\Omega}\alpha^\top\nabla f(\beta_k).\]

Then we update \(\beta_{k}\) by setting

\[\beta_{k+1} = \beta_k + t_k(\alpha_k-\beta_k),\]

with \(t_k\) a suitably chosen stepsize. Because of the optimality conditions, \(\alpha_k-\beta_k\) is a descent direction, so the function is guaranteed to decrease along this line. Note that for \(t_k=1\) we obtain the point \(\alpha_k\), and therefore by convexity any value \(t_k\in[0,1]\) is a valid stepsize (i.e., one such that \(\beta_{k+1}\in\Omega\)). For many problems we can compute the optimal value of \(t_k\); if not, we can use a constant stepsize or a line search method such as Armijo backtracking.

The main advantage of the Frank-Wolfe method is that it is versatile and simple to implement: we can use it as long as we can compute the gradient of the objective cheaply, using a linear programming solver as a black box for the subproblems. Its main disadvantage is its sublinear convergence speed; for most convex problems there are more efficient specialized solvers.

The Frank-Wolfe method is similar to the projected gradient descent method, where we instead take an unconstrained step in the direction of the negative gradient and then project back onto the constraint set. For many constraint sets there is no efficient way to compute this projection, in which case the Frank-Wolfe method can be cheaper.
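In code, a Frank-Wolfe run only needs a gradient oracle and a linear minimization oracle for \(\Omega\). A minimal generic sketch (the names grad_f and linear_oracle are placeholders, and the classical stepsize \(t_k = 2/(k+2)\) is used instead of a line search):

import numpy as np


def frank_wolfe(grad_f, linear_oracle, beta0, max_its=100):
    """Generic Frank-Wolfe sketch.

    grad_f(beta)     -> gradient of the objective at beta
    linear_oracle(g) -> argmin over Omega of alpha @ g
    """
    beta = np.asarray(beta0, dtype=float)
    for k in range(max_its):
        alpha = linear_oracle(grad_f(beta))  # solve the linearized problem
        t = 2.0 / (k + 2.0)  # classical default stepsize; an exact line search also works
        beta = beta + t * (alpha - beta)
    return beta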

We will apply the Frank-Wolfe method to the LASSO problem. This is least-squares regression, with an \(\ell_1\)-norm constraint on the parameters. That is, given \(N\) (mean centered) data points \(X=(x_1,\dots,x_N)^\top\in \mathbb R^{N\times n}\) and \(y=(y_1,\dots,y_N)^\top\in \mathbb R^N\) we consider the following problem for \(s\in\mathbb R\) fixed:

\[\begin{split}\begin{array}{ll} \text{minimize} & \|y-X\beta\|^2_2\\ \text{subject to} & \|\beta\|_1=\sum_{i}|\beta_i| \leq s \end{array}\end{split}\]

This problem is convex, and we will solve it using the Frank-Wolfe method.
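Since the problem is small and convex, you can also hand it directly to CVXPY and use that solution as a reference to check your Frank-Wolfe implementation against. A sketch with throwaway data (the variable names are arbitrary):

import cvxpy as cp
import numpy as np

N, n, s = 50, 8, 1.0
X = np.random.normal(size=(N, n))
X = X - np.mean(X, axis=0)  # mean-center the data
y = np.random.normal(size=N)

beta = cp.Variable(n)
problem = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta)),
                     [cp.norm(beta, 1) <= s])
problem.solve()
print(problem.value, np.linalg.norm(beta.value, ord=1))  # optimal value, and ||beta||_1 <= s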


Implement the Frank-Wolfe method to solve the LASSO problem.


Tips:

  • Finding a feasible point for this problem is very easy.

  • Find an explicit expression for the gradient \(\nabla f(\beta_k)\) at a point \(\beta_k\). Use this to formulate the linear problem \(\alpha_k = \operatorname{argmin}_{\|\alpha\|_1\leq s}\alpha^\top \nabla f(\beta_k)\), and solve it using CVXPY.

  • For this problem we can do an exact linesearch; that is, there is a simple formula for the optimal stepsize \(t_k\). Find it by solving the one-dimensional problem \(t_k=\operatorname{argmin}_{t\in[0,1]}\|y-X(\beta_k+t(\alpha_k-\beta_k))\|\) (a derivation is sketched after these tips).

  • You can use the stepsize as a stopping criterion: if the stepsize \(t_k\) is very small, then we have reached the boundary of the domain, and the search direction was close to orthogonal to this boundary.
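For reference, here is a sketch of the formulas these tips ask for, worked out from \(f(\beta)=\|y-X\beta\|_2^2\) and writing \(d_k=\alpha_k-\beta_k\) (you should check the computation yourself):

\[\nabla f(\beta) = 2X^\top(X\beta - y),\]

and setting the derivative of \(t\mapsto\|y-X(\beta_k+t d_k)\|_2^2\) to zero gives

\[t_k = \min\left\{1,\ \frac{(y-X\beta_k)^\top X d_k}{\|X d_k\|_2^2}\right\},\]

where the minimum with \(1\) keeps \(\beta_{k+1}\) inside the \(\ell_1\)-ball (the unconstrained minimizer is nonnegative because \(d_k\) is a descent direction).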

[7]:
N = 100  # number of data points
n = 10  # dimension

# Generate random ground-truth data
X = np.random.normal(size=(N, n))
beta_true = np.random.normal(size=n)

# Make `beta_true` lie inside the constraint set. If you comment out this line,
# then the estimated beta is probably no longer equal to `beta_true`.
# You should experiment with this, and also with the value of `s`.
beta_true /= 2 * np.linalg.norm(beta_true, ord=1)
X = X - np.mean(X, axis=0)
y = X @ beta_true


def FW_LASSO(X, y, s=1, max_its=100):
    """Solve the LASSO problem using Frank-Wolfe.

    Input is mean-centered data X,y, the value of `s`,
    and the maximum number of iterations."""

    # BEGIN SOLUTION
    N, n = X.shape
    beta = np.zeros(n)

    beta_norms = []
    stepsizes = []

    for _ in range(max_its):
        grad = LASSO_grad(X, y, beta)
        alpha = min_linear_problem(grad, s)
        t = opt_linesearch(X, y, beta, alpha - beta)
        beta_norms.append(np.linalg.norm(alpha - beta))  # norm of the search direction
        stepsizes.append(t)
        beta = beta + t * (alpha - beta)
        if t < 1e-8:  # a tiny stepsize is a good stopping criterion
            break
    return beta, beta_norms, stepsizes

    # END SOLUTION


def LASSO_grad(X, y, beta):
    """Return the gradient of the LASSO problem for data X,y at point `beta`"""

    # BEGIN SOLUTION
    return 2 * (X.T @ X) @ beta - 2 * X.T @ y
    # END SOLUTION


def min_linear_problem(grad, s):
    """Find the minimum of the linear problem.

    Depends on the LASSO gradient and the value of `s` determining
    the constraint set."""

    # BEGIN SOLUTION
    alpha = cp.Variable(len(grad))
    objective = cp.Minimize(grad @ alpha)
    constraints = [cp.norm(alpha, p=1) <= s]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return alpha.value
    # END SOLUTION


def opt_linesearch(X, y, beta, direction):
    """Return the result of exact linesearch.
    That is, the best possible choice of stepsize `t`. Make sure
    that this doesn't escape the constraint set."""

    # BEGIN SOLUTION
    numerator = np.dot(y - X @ beta, X @ direction)
    denominator = np.linalg.norm(X @ direction) ** 2
    if denominator == 0:
        return 1
    else:
        # clip to [0, 1] so that the step cannot leave the constraint set
        return min(max(numerator / denominator, 0), 1)
    # END SOLUTION


# Compute beta from data, and compare it to the ground truth
beta, beta_norms, stepsizes = FW_LASSO(X, y, s=2)
print("Maximum deviation from beta_true:", np.max(np.abs(beta - beta_true)))

plt.plot(beta_norms, label=r"$\|\alpha_k - \beta_k\|$")
plt.plot(stepsizes, label="stepsize $t_k$")
plt.yscale('log')
plt.legend();
plt.title("Norm of search direction and stepsize for Frank-Wolfe Lasso")
[7]:
Text(0.5, 1.0, 'Norm of search direction and stepsize for Frank-Wolfe Lasso')
[Output figure: norm of the search direction and stepsize per iteration for Frank-Wolfe LASSO, on a logarithmic scale.]