Optimisation - Concise Notes
MATH70141
Table of contents
- Optimisation - Concise Notes
- Mathematical Preliminaries
- Unconstrained Optimisation
- Linear and Nonlinear Least Squares Problems
- Gradient Descent Algorithm
- Algorithm 1: Schematic Descent Direction Method
- Algorithm 2: The Gradient Method
- Algorithm 3: Scaled Gradient Method
- Algorithm 4: The Damped Gauss-Newton Method
- Stochastic Gradient Descent
- Convex Sets and Functions
- Convex Optimisation
- Optimality Conditions
- Duality
- Optimal Control
Mathematical Preliminaries
Vector Spaces
Definition 1 (Vector Space). A vector space over a field $\mathbb{F}$ is a set $V$ equipped with two operations: vector addition $+: V \times V \to V$ and scalar multiplication $\cdot: \mathbb{F} \times V \to V$ such that the following properties hold:
Closure under Addition: For all $u, v \in V$, $u + v \in V$.
Associativity of Addition: For all $u, v, w \in V$, $(u + v) + w = u + (v + w)$.
Commutativity of Addition: For all $u, v \in V$, $u + v = v + u$.
Additive Identity: There exists a vector $0 \in V$ such that for all $v \in V$, $v + 0 = v$.
Additive Inverse: For all $v \in V$, there exists a vector $-v \in V$ such that $v + (-v) = 0$.
Closure under Scalar Multiplication: For all $\lambda \in \mathbb{F}$ and $v \in V$, $\lambda v \in V$.
Distributivity of Scalar Multiplication over Vector Addition: For all $\lambda \in \mathbb{F}$ and $u, v \in V$, $\lambda(u + v) = \lambda u + \lambda v$.
Distributivity of Scalar Multiplication over Field Addition: For all $\lambda, \mu \in \mathbb{F}$ and $v \in V$, $(\lambda + \mu)v = \lambda v + \mu v$.
Compatibility of Scalar Multiplication with Field Multiplication: For all $\lambda, \mu \in \mathbb{F}$ and $v \in V$, $(\lambda \mu)v = \lambda(\mu v)$.
Identity Element of Scalar Multiplication: For all $v \in V$, $1v = v.$
Definition 2 (Inner Product). An inner product on a vector space $V$ is a function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{F}$ such that for all $u, v, w \in V$ and $\lambda \in \mathbb{F}$, the following properties hold (we consider only real vector spaces, $\mathbb{F} = \mathbb{R}$):
Symmetry: $\langle u, v \rangle = \langle v, u \rangle$.
Additivity: $\langle u + v, w \rangle = \langle u, w \rangle + \langle v, w \rangle$.
Homogeneity: $\langle \lambda u, v \rangle = \lambda \langle u, v \rangle, \forall \lambda \in \mathbb{R}$.
Positive Definiteness: $\langle v, v \rangle \geq 0$ with equality if and only if $v = 0$.
Definition 3 (Norm). The norm of a vector $v \in V$ is defined as $\left\lVert v \right\rVert = \sqrt{\langle v, v \rangle}$. The norm satisfies the following properties:
$\left\lVert v \right\rVert \geq 0$ with equality if and only if $v = 0$.
$\left\lVert \lambda v \right\rVert = \left\vert \lambda \right\vert \left\lVert v \right\rVert$.
$\left\lVert u + v \right\rVert \leq \left\lVert u \right\rVert + \left\lVert v \right\rVert$.
Theorem 4 (Cauchy-Schwarz Inequality). For all $u, v \in V$, $\left\vert \langle u, v \rangle \right\vert \leq \left\lVert u \right\rVert \left\lVert v \right\rVert$. Equality holds if and only if $u$ and $v$ are linearly dependent.
Definition 5 (Matrix Norms). A norm $\left\vert \cdot \right\vert$ on $\mathbb{R}^{m\times n}$ is a function $\left\vert \cdot \right\vert : \mathbb{R}^{m\times n} \to \mathbb{R}$ such that for all $A, B \in \mathbb{R}^{m\times n}$ and $\lambda \in \mathbb{R}$, the following properties hold:
Non-negativity: $\left\vert A \right\vert \geq 0$ with equality if and only if $A = 0$.
Homogeneity: $\left\vert \lambda A \right\vert = \left\vert \lambda \right\vert \left\vert A \right\vert$.
Triangle Inequality: $\left\vert A + B \right\vert \leq \left\vert A \right\vert + \left\vert B \right\vert$.
Definition 6 (Induced Norms). Given a matrix $A\in \mathbb{R}^{m\times n}$ and two norms $\left\lVert \cdot \right\rVert_{a}$ and $\left\lVert \cdot \right\rVert_{b}$ on $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$ respectively, the induced norm (the $(a,b)$-norm) of $A$ is defined as:
\[\left\lVert A \right\rVert_{a,b} = \mathop{\max}_{\mathbf{x}} \{ \left\lVert A\mathbf{x} \right\rVert_{b} : \left\lVert \mathbf{x} \right\rVert_{a} \leq 1 \}\]
Example 7 (Examples of Norms).
Spectral Norm ($(2,2)$-norm): $\left\lVert A \right\rVert_{2} = \sigma_{\max}(A)$, the largest singular value of $A$.
$\ell_1$ norm: $\left\lVert A \right\rVert_{1} = \mathop{\max}_{j} \sum_{i} \left\vert a_{ij} \right\vert$ (maximum absolute column sum).
$\ell_{\infty}$ norm: $\left\lVert A \right\rVert_{\infty} = \mathop{\max}_{i} \sum_{j} \left\vert a_{ij} \right\vert$ (maximum absolute row sum).
Frobenius Norm: $\left\lVert A \right\rVert_{F} = \sqrt{\sum_{i,j} a_{ij}^{2}}$.
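The four matrix norms above are easy to check numerically; a quick NumPy sanity check (the matrix is an illustrative assumption):

```python
import numpy as np

# Hypothetical 2x3 matrix to illustrate the four norms above.
A = np.array([[1.0, -2.0, 3.0],
              [0.0,  4.0, -1.0]])

spectral = np.linalg.norm(A, 2)        # largest singular value sigma_max(A)
one_norm = np.linalg.norm(A, 1)        # max absolute column sum
inf_norm = np.linalg.norm(A, np.inf)   # max absolute row sum
frob     = np.linalg.norm(A, 'fro')    # sqrt of sum of squared entries

# Cross-check against the explicit formulas from the definitions.
assert np.isclose(spectral, np.linalg.svd(A, compute_uv=False).max())
assert np.isclose(one_norm, np.abs(A).sum(axis=0).max())
assert np.isclose(inf_norm, np.abs(A).sum(axis=1).max())
assert np.isclose(frob, np.sqrt((A**2).sum()))
```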
Eigenvalues and Eigenvectors
Definition 8 (Eigenvalues and Eigenvectors). Let $A \in \mathbb{R}^{n\times n}$. A scalar $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ if there exists a non-zero vector $\mathbf{x} \in \mathbb{C}^{n}$ such that $A\mathbf{x} = \lambda \mathbf{x}$. The vector $\mathbf{x}$ is called an eigenvector corresponding to the eigenvalue $\lambda$.
Theorem 9 (Spectral Factorization Theorem). Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix. Then, there exists an orthogonal matrix $Q$ and a diagonal matrix $\Lambda$ such that $A = Q\Lambda Q^{T}$.
The columns of $Q$ are the eigenvectors of $A$ and the diagonal elements of $\Lambda$ are the corresponding eigenvalues of $A$.
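The factorization can be computed with `numpy.linalg.eigh`, which is specialised to symmetric matrices; a small sanity check on an illustrative matrix:

```python
import numpy as np

# A small symmetric matrix (an illustrative assumption).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and orthonormal
# eigenvectors in the columns of Q.
lam, Q = np.linalg.eigh(A)
Lambda = np.diag(lam)

# Verify the spectral factorization A = Q Lambda Q^T and orthogonality of Q.
assert np.allclose(A, Q @ Lambda @ Q.T)
assert np.allclose(Q.T @ Q, np.eye(2))
```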
Corollary 10 (Trace and Determinant). Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix with eigenvalues $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{n}$. Then the trace and determinant of $A$ are given by:
\[\text{tr}(A) = \sum_{i=1}^{n} \lambda_{i} \quad \text{and} \quad \text{det}(A) = \prod_{i=1}^{n} \lambda_{i}\]
Corollary 11 (Rayleigh Quotient Bound). Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix with eigenvalues $\lambda_{1} \leq \lambda_{2} \leq \ldots \leq \lambda_{n}$. Then, for any non-zero vector $\mathbf{x} \in \mathbb{R}^{n}$, the Rayleigh quotient satisfies:
\[\lambda_{1} \leq \frac{\mathbf{x}^{T}A\mathbf{x}}{\mathbf{x}^{T}\mathbf{x}} \leq \lambda_{n}\]
Basic Topological Concepts
Definition 12 (Balls). An open ball of radius $\epsilon > 0$ centered at a point $\mathbf{x} \in \mathbb{R}^{n}$ is defined as: \(B(\mathbf{x}, \epsilon) = \{ \mathbf{y} \in \mathbb{R}^{n} : \left\lVert \mathbf{y} - \mathbf{x} \right\rVert < \epsilon \}\) A closed ball of radius $\epsilon > 0$ centered at a point $\mathbf{x} \in \mathbb{R}^{n}$ is defined as: \(\bar{B}(\mathbf{x}, \epsilon) = \{ \mathbf{y} \in \mathbb{R}^{n} : \left\lVert \mathbf{y} - \mathbf{x} \right\rVert \leq \epsilon \}\)
Definition 13 (Interior Point). A point $\mathbf{x} \in \mathbb{R}^{n}$ is an interior point of a set $S \subseteq \mathbb{R}^{n}$ if there exists an open ball centered at $\mathbf{x}$ that is contained in $S$. The set of all interior points of $S$ is denoted by $\text{int}(S)$. \(\text{int}(S) = \{ \mathbf{x} \in S : \text{there exists an open ball } B(\mathbf{x}, \epsilon) \subseteq S \}\)
Definition 14 (Closed Set). A set $S \subseteq \mathbb{R}^{n}$ is closed if it contains all its limit points. A limit point of a set $S$ is a point $\mathbf{x} \in \mathbb{R}^{n}$ such that every open ball centered at $\mathbf{x}$ contains a point in $S$.
We have that a set $U$ is closed $\iff U^c$ is open.
Definition 15 (Boundary Points). A point $\mathbf{x} \in \mathbb{R}^{n}$ is a boundary point of a set $S \subseteq \mathbb{R}^{n}$ if every open ball centered at $\mathbf{x}$ contains points in $S$ and points not in $S$. The set of all boundary points of $S$ is denoted by $\partial S$. \(\partial S = \{ \mathbf{x} \in \mathbb{R}^{n} : \text{every open ball centered at } \mathbf{x} \text{ contains points in } S \text{ and points not in } S \}\)
Definition 16 (Closure). The closure of a set $S \subseteq \mathbb{R}^{n}$ is the union of $S$ and its boundary points. Equivalently the closure of a set $S$ is the smallest closed set containing $S$. \(\bar{S} = S \cup \partial S, \quad \bar{S} = \bigcap \{ U \subseteq \mathbb{R}^{n} : U \text{ is closed and } S \subseteq U \}\)
Directional Derivatives and Gradients
Definition 17 (Directional Derivative). Let $f: \mathbb{R}^{n} \to \mathbb{R}$ be a function and $\mathbf{x} \in \mathbb{R}^{n}$. The directional derivative of $f$ at $\mathbf{x}$ in the direction of a vector $\mathbf{d} \in \mathbb{R}^{n}$ is defined as: \(\nabla_{\mathbf{d}} f(\mathbf{x}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{d}) - f(\mathbf{x})}{t}\)
Definition 18 (Continuous Differentiability). A function $f: \mathbb{R}^{n} \to \mathbb{R}$ is continuously differentiable at a point $\mathbf{x} \in \mathbb{R}^{n}$ if all the partial derivatives of $f$ exist and are continuous in a neighbourhood of $\mathbf{x}$.
Proposition 19. Let $f:U \to \mathbb{R}$ be defined on an open set $U \subseteq \mathbb{R}^n$. Suppose that $f$ is continuously differentiable over $U$. Then \(\lim\limits_{\mathbf{d} \to \mathbf{0}} \frac{f(\mathbf{x}+\mathbf{d}) - f(\mathbf{x} ) - \nabla f(\mathbf{x})^T \mathbf{d} }{\left\lVert \mathbf{d} \right\rVert } = 0, \quad \forall \mathbf{x} \in U\) Or equivalently: \(f(\mathbf{y} ) = f(\mathbf{x} ) + \nabla f(\mathbf{x} )^T(\mathbf{y} -\mathbf{x} ) + o(\left\lVert \mathbf{y} -\mathbf{x} \right\rVert)\) where $o(\left\lVert \mathbf{y} -\mathbf{x} \right\rVert)$ is Landau notation for a function that vanishes faster than a linear function, i.e. $\frac{o(t)}{t} \to 0$ as $t \to 0$.
Definition 20 (Twice differentiablility and Hessian). Let $f: \mathbb{R}^{n} \to \mathbb{R}$ be a function. The second order partial derivatives of $f$ at a point $\mathbf{x} \in \mathbb{R}^{n}$ are defined as: \(\frac{\partial^{2} f}{\partial x_{i} \partial x_{j}} = \frac{\partial}{\partial x_{i}} \left( \frac{\partial f}{\partial x_{j}} \right)\) The Hessian matrix of $f$ at $\mathbf{x}$ is defined as:
\[\nabla^{2} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{n}^{2}} \end{bmatrix}\]
Theorem 21 (Linear Approximation Theorem). Let $f: U \to \mathbb{R}$ be a function defined on an open set $U \subseteq \mathbb{R}^n$ that is twice continuously differentiable over $U$. Let $\mathbf{x} \in U$ and $\epsilon > 0$ satisfy $B(\mathbf{x},\epsilon ) \subseteq U$. Then for any $\mathbf{y} \in B(\mathbf{x},\epsilon )$, there exists $\xi \in [\mathbf{x} ,\mathbf{y} ]$ such that:
\[f(\mathbf{y}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^{T}(\mathbf{y} - \mathbf{x}) + \frac{1}{2}(\mathbf{y} -\mathbf{x} )^{T} \nabla^{2} f(\xi)(\mathbf{y} -\mathbf{x} )\]
Theorem 22 (Quadratic Approximation Theorem). Let $f:U \to \mathbb{R}$ be defined on an open set $U \subseteq \mathbb{R}^n$. Suppose $f$ is twice continuously differentiable over $U$. Let $\mathbf{x} \in U$ and $r >0$ satisfy $B(\mathbf{x} ,r) \subseteq U$. Then for any $\mathbf{y} \in B(\mathbf{x} ,r)$: \(f(\mathbf{y}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^{T}(\mathbf{y} -\mathbf{x} ) + \frac{1}{2}(\mathbf{y} -\mathbf{x} )^{T} \nabla^{2} f(\mathbf{x})(\mathbf{y} -\mathbf{x} ) + o(\left\lVert \mathbf{y} -\mathbf{x} \right\rVert^{2})\)
Proposition 23 (Gradient of Linear Function). Given the function $f(\mathbf{w} ) = \mathbf{a}^T \mathbf{w}$, we have \(\nabla f(\mathbf{w} ) = \mathbf{a}\)
Proposition 24 (Gradient of Quadratic Function). Given the function $f(\mathbf{w} ) = \mathbf{w}^T A \mathbf{w}$, we have \(\nabla f(\mathbf{w} ) = (A + A^T)\mathbf{w}\)
Note that if $A$ is symmetric, then $\nabla f(\mathbf{w} ) = 2A\mathbf{w}$.
More generally, for a function $f(\mathbf{w} ) = \frac{1}{2}\mathbf{w}^T A \mathbf{w} + \mathbf{b}^T \mathbf{w} + \gamma$, the gradient is given by: \(\nabla f(\mathbf{w} ) = \frac{1}{2}(A^T + A ) \mathbf{w} + \mathbf{b}\) And again, if $A$ is symmetric, then $\nabla f(\mathbf{w} ) = A\mathbf{w} + \mathbf{b}$.
Proposition 25 (Hessian of Quadratic Function). For a function of the form $f(\mathbf{w} ) = \mathbf{w}^T A \mathbf{w}$, the Hessian is given by: \(\nabla^{2} f(\mathbf{w} ) = A + A^T\) which for $A$ symmetric simplifies to $\nabla^{2} f(\mathbf{w} ) = 2A$.
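The gradient formula $\nabla f(\mathbf{w}) = (A + A^T)\mathbf{w}$ can be verified against finite differences; a minimal NumPy sketch (the matrix $A$ and point $\mathbf{w}$ are illustrative):

```python
import numpy as np

# Check grad f = (A + A^T) w for f(w) = w^T A w, with a hypothetical
# non-symmetric A so the transpose term matters.
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
w = np.array([1.0, -1.0])

f = lambda w: w @ A @ w
grad = (A + A.T) @ w

# Central finite differences as an independent check.
eps = 1e-6
fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(grad, fd, atol=1e-5)
```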
Unconstrained Optimisation
Global Minimum and Maximum
Definition 26 (Global Minimum and Maximum). Let $f: S \to \mathbb{R}$, defined on set $S \subseteq \mathbb{R}^n$. Then
$\mathbf{x}^{\ast} \in S$ a global minimum of $f$ over $S$ if $f(\mathbf{x}^{\ast} ) \leq f(\mathbf{x} )$ for all $\mathbf{x} \in S$.
$\mathbf{x}^{\ast} \in S$ a strict global minimum of $f$ over $S$ if $f(\mathbf{x}^{\ast} ) < f(\mathbf{x} )$ for all $\mathbf{x} \in S, \mathbf{x} \neq \mathbf{x}^{\ast}$.
$\mathbf{x}^{\ast} \in S$ a global maximum of $f$ over $S$ if $f(\mathbf{x}^{\ast} ) \geq f(\mathbf{x} )$ for all $\mathbf{x} \in S$.
$\mathbf{x}^{\ast} \in S$ a strict global maximum of $f$ over $S$ if $f(\mathbf{x}^{\ast} ) > f(\mathbf{x} )$ for all $\mathbf{x} \in S, \mathbf{x} \neq \mathbf{x}^{\ast}$.
We denote by global optimum the global minimum or maximum.
The maximal value of $f$ over $S$ is \(\sup \{ f(\mathbf{x} ) : \mathbf{x} \in S \}\)
The minimal value of $f$ over $S$ is \(\inf \{ f(\mathbf{x} ) : \mathbf{x} \in S \}\)
Local Minima and Maxima
Definition 27 (Local Minimum and Maximum). Let $f: S\to \mathbb{R}$ be defined on a set $S \subseteq \mathbb{R}^n$. Then:
$\mathbf{x}^{\ast} \in S$ a local minimum of $f$ over $S$ if there exists $r >0$ for which $f(\mathbf{x}^{\ast} ) \leq f(\mathbf{x})$ for any $\mathbf{x} \in S \cap B(\mathbf{x}^{\ast} , r)$
$\mathbf{x}^{\ast} \in S$ a strict local minimum of $f$ over $S$ if there exists $r >0$ for which $f(\mathbf{x}^{\ast} ) < f(\mathbf{x})$ for any $\mathbf{x} \neq \mathbf{x}^{\ast} \in S \cap B(\mathbf{x}^{\ast} , r)$
$\mathbf{x}^{\ast} \in S$ a local maximum of $f$ over $S$ if there exists $r >0$ for which $f(\mathbf{x}^{\ast} ) \geq f(\mathbf{x})$ for any $\mathbf{x} \in S \cap B(\mathbf{x}^{\ast} , r)$
$\mathbf{x}^{\ast} \in S$ a strict local maximum of $f$ over $S$ if there exists $r >0$ for which $f(\mathbf{x}^{\ast} ) > f(\mathbf{x})$ for any $\mathbf{x} \neq \mathbf{x}^{\ast} \in S \cap B(\mathbf{x}^{\ast} , r)$
Theorem 28 (Fermat’s Theorem: First Order Optimality Conditions). Let $f:U \to \mathbb{R}$ be a function defined on a set $U\subseteq \mathbb{R}^n$. Suppose that $\mathbf{x}^{\ast} \in \text{int}(U)$ is a local optimum point and all partial derivatives of $f$ exist at $\mathbf{x}^{\ast}$. Then: \(\nabla f(\mathbf{x}^{\ast} ) = \mathbf{0}\)
Definition 29 (Stationary Points). Let $f:U\to \mathbb{R}$ a function defined on set $U \subseteq \mathbb{R}^n$. Suppose that $\mathbf{x}^{\ast} \in \text{int} (U)$ and all partial derivatives of $f$ exist at $\mathbf{x}^{\ast}$. Then $\mathbf{x}^{\ast}$ is a stationary point of $f$ if: \(\nabla f(\mathbf{x}^{\ast} ) = \mathbf{0}\)
Second Order Optimality Conditions
Theorem 30 (Necessary Second Order Optimality Conditions). Let $f:U\to \mathbb{R}$ a function defined on open set $U\subseteq \mathbb{R}^n$. Suppose $f$ is twice continuously differentiable over $U$ and $\mathbf{x}^{\ast}$ a stationary point. Then:
if $\mathbf{x}^{\ast}$ is a local minimum point, then $\nabla^{2} f(\mathbf{x}^{\ast} ) \succeq 0$
if $\mathbf{x}^{\ast}$ is a local maximum point, then $\nabla^{2} f(\mathbf{x}^{\ast} ) \preceq 0$
Theorem 31 (Sufficient Second Order Optimality Conditions). Let $f:U\to \mathbb{R}$ a function defined on open set $U\subseteq \mathbb{R}^n$. Suppose $f$ is twice continuously differentiable over $U$ and $\mathbf{x}^{\ast}$ a stationary point. Then:
if $\nabla f(\mathbf{x}^{\ast} ) = \mathbf{0}$ and $\nabla^{2} f(\mathbf{x}^{\ast} ) \succ 0$, then $\mathbf{x}^{\ast}$ a strict local minimum point
if $\nabla f(\mathbf{x}^{\ast} ) = \mathbf{0}$ and $\nabla^{2} f(\mathbf{x}^{\ast} ) \prec 0$, then $\mathbf{x}^{\ast}$ a strict local maximum point
Saddle Points
Definition 32 (Saddle Point). Let $f:U\to \mathbb{R}$ a function defined on open set $U\subseteq \mathbb{R}^n$. Suppose $f$ is continuously differentiable over $U$ and $\mathbf{x}^{\ast}$ a stationary point. Then $\mathbf{x}^{\ast}$ a saddle point if it is neither a local minimum nor a local maximum.
Theorem 33 (Sufficient Condition for Saddle Points). Let $f:U\to \mathbb{R}$ a function defined on open set $U\subseteq \mathbb{R}^n$. Suppose $f$ is twice continuously differentiable over $U$ and $\mathbf{x}^{\ast}$ a stationary point. Then:
- if $\nabla f(\mathbf{x}^{\ast} ) = \mathbf{0}$ and $\nabla^{2} f(\mathbf{x}^{\ast} )$ is indefinite, then $\mathbf{x}^{\ast}$ a saddle point
Attainment of Minimal/Maximal Values
Theorem 34 (Weierstrass Theorem). Let $f:U\to \mathbb{R}$ a continuous function defined on a non-empty compact set $U\subseteq \mathbb{R}^n$. Then $f$ attains its maximal and minimal values over $U$.
Definition 35 (Coerciveness). A function $f:U\to \mathbb{R}$ is coercive if for any sequence $\{ \mathbf{x}_k \} \subseteq U$ such that $\left\lVert \mathbf{x}_k \right\rVert \to \infty$, we have $f(\mathbf{x}_k) \to \infty$.
Theorem 36 (Attainment of Global Optima Points for Coercive Functions). Let $f:U\to \mathbb{R}$ a coercive function defined on a non-empty closed set $U\subseteq \mathbb{R}^n$. Then $f$ attains its global minimal value over $U$.
Global Optimality Conditions
Theorem 37 (Global Optimality Conditions). Let $f$ be twice continuously differentiable over $\mathbb{R}^n$ . Suppose that $\nabla^{2} f(x) \succeq 0$ for any $\mathbf{x} \in \mathbb{R}^n$. Let $\mathbf{x}^{\ast} \in \mathbb{R}^n$ be a stationary point of $f$. Then $\mathbf{x}^{\ast}$ is a global minimum point of $f$.
Quadratic Functions
Proposition 38. Let $f(\mathbf{x} ) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{b}^T \mathbf{x} + c$ be a quadratic function defined on $\mathbb{R}^n$, with $A \in \mathbb{R}^{n\times n}$ symmetric, $\mathbf{b} \in \mathbb{R}^n$ and $c \in \mathbb{R}$. Then:
$\mathbf{x}$ is a stationary point of $f$ if and only if $A \mathbf{x} = -\mathbf{b}$
if $A \succeq 0$, then $\mathbf{x}$ is a global minimum point of $f$ if and only if $A \mathbf{x} = -\mathbf{b}$
if $A \succ 0$, then $\mathbf{x}^{\ast} = - A^{-1}\mathbf{b}$ is a strict global minimum point of $f$
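A numerical illustration of Proposition 38 for a positive definite $A$ (the data below are illustrative assumptions):

```python
import numpy as np

# Strict global minimum of f(x) = 1/2 x^T A x + b^T x when A > 0:
# x* = -A^{-1} b (values below are illustrative).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

x_star = -np.linalg.solve(A, b)

# Stationarity: the gradient A x + b vanishes at x*.
assert np.allclose(A @ x_star + b, 0.0)

# x* beats random perturbations, consistent with a strict global minimum.
f = lambda x: 0.5 * x @ A @ x + b @ x
rng = np.random.default_rng(0)
for _ in range(100):
    assert f(x_star) < f(x_star + rng.normal(size=2))
```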
Two Important Theorems on Quadratic Functions
Lemma 39 (Coerciveness of Quadratic Functions). Let $f(\mathbf{x} ) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{b}^T \mathbf{x} + c$ be a quadratic function defined on $\mathbb{R}^n$, with $A \in \mathbb{R}^{n\times n}$ symmetric, $\mathbf{b} \in \mathbb{R}^n$ and $c \in \mathbb{R}$. Then $f$ is coercive $\iff A \succ 0$.
Theorem 40 (Characterization of the Nonnegativity of Quadratic Functions). Consider the quadratic function $f(\mathbf{x} ) = \mathbf{x}^T A \mathbf{x} + 2\mathbf{b}^T \mathbf{x} + c$ defined on $\mathbb{R}^n$, with $A \in \mathbb{R}^{n\times n}$ symmetric, $\mathbf{b} \in \mathbb{R}^n$ and $c \in \mathbb{R}$. Then the following are equivalent:
$f(\mathbf{x}) \geq 0$ for all $\mathbf{x} \in \mathbb{R}^n$
The augmented matrix $\begin{pmatrix} A & \mathbf{b} \\ \mathbf{b}^T & c \end{pmatrix} \succeq 0$ is positive semidefinite
Appendix: Classification of Matrices
Definition 41 (Positive Definiteness). A symmetric matrix $A \in \mathbb{R}^{n\times n}$ is positive semidefinite ($A\succeq 0$) if \(\mathbf{x}^T A \mathbf{x} \geq 0, \quad \forall \mathbf{x} \in \mathbb{R}^n\) A symmetric matrix $A \in \mathbb{R}^{n\times n}$ is positive definite ($A\succ 0$) if \(\mathbf{x}^T A \mathbf{x} > 0, \quad \forall \mathbf{x} \in \mathbb{R}^n, \mathbf{x} \neq \mathbf{0}\)
Proposition 42. Let $A$ be a positive definite (respectively, positive semidefinite) matrix. Then the diagonal elements of $A$ are positive (respectively, non-negative).
Definition 43 (Negative Definiteness). A symmetric matrix $A \in \mathbb{R}^{n\times n}$ is negative semidefinite ($A\preceq 0$) if \(\mathbf{x}^T A \mathbf{x} \leq 0, \quad \forall \mathbf{x} \in \mathbb{R}^n\) A symmetric matrix $A \in \mathbb{R}^{n\times n}$ is negative definite ($A\prec 0$) if \(\mathbf{x}^T A \mathbf{x} < 0, \quad \forall \mathbf{x} \in \mathbb{R}^n, \mathbf{x} \neq \mathbf{0}\)
Remark 1.
$A$ is negative (semi)definite $\iff -A$ is positive (semi)definite
A matrix is indefinite if and only if it is neither positive nor negative semidefinite
A symmetric matrix with positive and negative diagonal elements is indefinite
The sum of two positive/negative (semi)definite matrices is positive/negative (semi)definite
Theorem 44 (Eigenvalue Characterization). Let $A$ be a symmetric $n\times n$ matrix with eigenvalues $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{n}$. Then:
$A$ is positive definite $\iff \lambda_{i} > 0, \quad \forall i = 1,2,\ldots,n$
$A$ is positive semidefinite $\iff \lambda_{i} \geq 0, \quad \forall i = 1,2,\ldots,n$
$A$ is negative definite $\iff \lambda_{i} < 0, \quad \forall i = 1,2,\ldots,n$
$A$ is negative semidefinite $\iff \lambda_{i} \leq 0, \quad \forall i = 1,2,\ldots,n$
$A$ is indefinite $\iff \exists i,j \text{ s.t. } \lambda_{i} > 0, \lambda_{j} < 0$
Definition 45 (Leading Principal Minor). Let $A$ be a symmetric $n\times n$ matrix. The $k$-th leading principal minor of $A$, denoted $D_k(A)$, is the determinant of the $k\times k$ submatrix of $A$ obtained by deleting the last $n-k$ rows and columns of $A$.
Proposition 46. Let $A$ be an $n\times n$ symmetric matrix. Then $A$ is positive definite if and only if all the leading principal minors of $A$ are positive: \(D_1(A) > 0, D_2(A) > 0, \ldots, D_n(A) > 0\)
Proposition 47. Let $A$ be an $n\times n$ symmetric matrix. Then $A$ is negative definite if and only if the leading principal minors of $A$ alternate in sign, starting with a negative one: \((-1)^k D_{k}(A) > 0, \quad \forall k = 1,2,\ldots,n\)
Definition 48 (Diagonal Dominance). Let $A$ be a symmetric $n\times n$ matrix.
$A$ is diagonally dominant if \(\left| a_{ii} \right| \geq \sum_{j\neq i} \left| a_{ij} \right|, \quad \forall i = 1,2,\ldots,n\)
$A$ is strictly diagonally dominant if \(\left| a_{ii} \right| > \sum_{j\neq i} \left| a_{ij} \right|, \quad \forall i = 1,2,\ldots,n\)
Theorem 49 (Positive Definiteness of diagonally dominant matrices).
if $A$ symmetric, diagonally dominant, with non-negative diagonal elements, then $A$ is positive semidefinite
if $A$ symmetric, strictly diagonally dominant, with positive diagonal elements, then $A$ is positive definite
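The eigenvalue characterization (Theorem 44) gives a simple numerical classifier, and Theorem 49 can be checked against it; a sketch with an illustrative strictly diagonally dominant matrix:

```python
import numpy as np

def classify(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues (Theorem 44)."""
    lam = np.linalg.eigvalsh(A)
    if lam.min() > tol:
        return "positive definite"
    if lam.max() < -tol:
        return "negative definite"
    if lam.min() >= -tol:
        return "positive semidefinite"
    if lam.max() <= tol:
        return "negative semidefinite"
    return "indefinite"

# An illustrative strictly diagonally dominant symmetric matrix with
# positive diagonal, which Theorem 49 guarantees is positive definite.
A = np.array([[4.0, 1.0, 1.0],
              [1.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
assert all(abs(A[i, i]) > np.abs(A[i]).sum() - abs(A[i, i]) for i in range(3))
assert classify(A) == "positive definite"
assert classify(np.diag([1.0, -1.0])) == "indefinite"
```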
Linear and Nonlinear Least Squares Problems
Linear Least Squares
Definition 50 (Linear Least Squares). Let $A \in \mathbb{R}^{m\times n}$ and $\mathbf{b} \in \mathbb{R}^{m}$. We assume $A$ has full column rank. The linear least squares problem is to find $\mathbf{x}^{\ast} \in \mathbb{R}^{n}$ that minimizes the residual sum of squares:
\[\mathbf{x}^{\ast} = \underset{\mathbf{x} \in \mathbb{R}^{n}}{\mathop{\arg\ \min}} \left\lVert A\mathbf{x} - \mathbf{b} \right\rVert^{2}\]
This is equivalent to:
\[\mathop{\min}_{\mathbf{x}\in \mathbb{R}^n} \{ f( \mathbf{x} ) \equiv \mathbf{x}^T A^T A \mathbf{x} - 2\mathbf{b}^T A \mathbf{x} + \mathbf{b}^T \mathbf{b} \}\]
Note that $\nabla^{2} f(\mathbf{x}) = 2A^T A \succ 0$ since $A$ has full column rank (which requires $m \geq n$). Therefore the unique optimal solution $\mathbf{x}_{LS}$ is obtained by setting $\nabla f(\mathbf{x} ) = \mathbf{0}$, which gives the normal equations:
\[(A^T A) \mathbf{x}_{LS} = A^T \mathbf{b}\] \[\mathbf{x}_{LS} = (A^T A)^{-1} A^T \mathbf{b}\]
Definition 51 (Regularised Least Squares). Let $A \in \mathbb{R}^{m\times n}$ and $\mathbf{b} \in \mathbb{R}^{m}$. We assume $A$ has full column rank. The regularised least squares problem is to find $\mathbf{x}^{\ast} \in \mathbb{R}^{n}$ that minimizes the regularised residual sum of squares:
\[\mathbf{x}^{\ast} = \underset{\mathbf{x} \in \mathbb{R}^{n}}{\mathop{\arg\ \min}} \left\lVert A\mathbf{x} - \mathbf{b} \right\rVert^{2} + \lambda R(\mathbf{x})\]
where $\lambda > 0$ is the regularisation parameter and $R(\mathbf{x})$ is the regularisation/penalty function. A common choice is the quadratic regularisation function:
\[\mathop{\min} \left\lVert A \mathbf{x} - \mathbf{b} \right\rVert^{2} + \lambda \left\lVert D \mathbf{x} \right\rVert^{2}\]
The optimal solution is then given by:
\[\mathbf{x}_{RLS} = (A^T A + \lambda D^T D)^{-1} A^T \mathbf{b}\]
We must have $\text{null}(D) \cap \text{null}(A) = \{ \mathbf{0} \}$ for the above inversion to be possible.
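A minimal NumPy sketch of both solutions on illustrative random data: the normal-equations solution against `np.linalg.lstsq`, and a regularised solution with $D = I$:

```python
import numpy as np

# Least squares via the normal equations vs. np.linalg.lstsq, plus a
# ridge-style regularised solution with D = I (illustrative data).
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))   # full column rank with probability 1
b = rng.normal(size=20)

x_ls = np.linalg.solve(A.T @ A, A.T @ b)          # (A^T A) x = A^T b
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_ls, x_ref)

lam, D = 0.5, np.eye(3)
x_rls = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ b)
# With D = I, regularisation shrinks the solution norm.
assert np.linalg.norm(x_rls) <= np.linalg.norm(x_ls) + 1e-12
```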
Example 52 (Denoising). Suppose we have a noisy signal $\mathbf{y} \in \mathbb{R}^{n}$ that is the sum of a clean signal $\mathbf{x} \in \mathbb{R}^{n}$ and some noise $\mathbf{e} \in \mathbb{R}^{n}$. We can model this as:
\[\mathbf{y} = \mathbf{x} + \mathbf{e}\]We can then solve the regularised least squares problem to recover the clean signal $\mathbf{x}$ from the noisy signal $\mathbf{y}$. Taking our problem as follows:
\[\mathop{\min} \left\lVert \mathbf{y} - \mathbf{x} \right\rVert^{2} + \lambda \left\lVert L \mathbf{x} \right\rVert^{2}\]
for the difference matrix
\(L = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{bmatrix}\)
This is a common choice for denoising as it penalises differences between adjacent elements of the signal. The direct solution is given by:
\[\mathbf{x}_{RLS} = (I + \lambda L^T L)^{-1} \mathbf{y}\]
Nonlinear Least Squares
Definition 53 (Nonlinear Least Squares). The aim is to find $\mathbf{x}^{\ast} \in \mathbb{R}^{n}$ that solves: \(\mathop{\min}_{\mathbf{x}} \sum_{i=1}^{m} (f_{i}(\mathbf{x}) - b_{i})^{2}\)
Gradient Descent Algorithm
Definition 54 (Descent Direction). Let $f: \mathbb{R}^n \to \mathbb{R}$ a continuously differentiable function over $\mathbb{R}^n$. A vector $\mathbf{d} \in \mathbb{R}^n$ is a descent direction at $\mathbf{x} \in \mathbb{R}^n$ if: \(\nabla f(\mathbf{x} )^T \mathbf{d} < 0\)
Lemma 55. _Let $f$ be a continuously differentiable function over $\mathbb{R}^n$, and let $\mathbf{x} \in \mathbb{R}^n$. Suppose $\mathbf{d}$ a descent direction of $f$ at $\mathbf{x}$. Then there exists $\epsilon >0$, such that for any $\alpha \in (0,\epsilon )$, we have: \(f(\mathbf{x} + \alpha \mathbf{d} ) < f(\mathbf{x} )\)
Algorithm 1: Schematic Descent Direction Method
Initialisation: Choose $\mathbf{x}^{(0)} \in \mathbb{R}^n$, set $k = 0$
General Step: For $k = 0,1,2,\ldots$ execute the following:
Pick a descent direction $\mathbf{d}^{(k)}$ at $\mathbf{x}^{(k)}$
Find a stepsize $t^{k}$ satisfying $f(\mathbf{x}^{(k)} + t^{k}\mathbf{d}^{(k)}) < f(\mathbf{x}^{(k)})$
Set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + t^{k} \mathbf{d}^{(k)}$
If a stopping criterion is satisfied, then STOP and output $\mathbf{x}^{(k +1)}$
Choosing a step-size
Constant stepsize: $t^{k} = t$
Exact stepsize: $t^{k} = \underset{t}{\mathop{\arg\ \min}} f(\mathbf{x}^{(k)} + t \mathbf{d}^{(k)} )$
Backtracking line search: Start with $t = 1$, and reduce $t$ until the Armijo condition is satisfied: \(f(\mathbf{x}^{(k)} + t \mathbf{d}^{(k)} ) \leq f(\mathbf{x}^{(k)} ) + \alpha t \nabla f(\mathbf{x}^{(k)} )^T \mathbf{d}^{(k)}\)
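The backtracking strategy can be sketched in a few lines of Python; the quadratic test function, parameters and test point below are illustrative assumptions:

```python
import numpy as np

# A minimal backtracking (Armijo) line search, assuming f and its gradient
# are available as callables.
def backtracking(f, grad_f, x, d, s=1.0, alpha=0.25, beta=0.5):
    """Shrink t = s, beta*s, beta^2*s, ... until the Armijo condition holds."""
    t = s
    fx, slope = f(x), grad_f(x) @ d   # slope = grad f(x)^T d < 0 for descent d
    while f(x + t * d) > fx + alpha * t * slope:
        t *= beta
    return t

# Illustrative objective f(x) = ||x||^2 with descent direction -grad f(x).
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, 2.0])
t = backtracking(f, grad_f, x, -grad_f(x))
assert f(x - t * grad_f(x)) < f(x)
```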
Lemma 56. Let $f$ be a continuously differentiable function and let $\mathbf{x} \in \mathbb{R}^n$ be a non-stationary point. Then an optimal solution of \(\mathop{\min}_{\mathbf{d}} \{f^\prime (\mathbf{x} ; \mathbf{d}) : \left\lVert \mathbf{d} \right\rVert =1 \}\) is $\mathbf{d} = -\nabla f(\mathbf{x} ) / \left\lVert \nabla f(\mathbf{x} )\right\rVert$, where $f^\prime(\mathbf{x};\mathbf{d}) = \nabla f(\mathbf{x})^T \mathbf{d}$ denotes the directional derivative.
Algorithm 2: The Gradient Method
Initialisation: A tolerance parameter $\epsilon >0$ and choose $\mathbf{x}^{(0)} \in \mathbb{R}^n$
General Step: For $k = 0,1,2,\ldots$ execute the following:
Pick a stepsize $t^k$ by a line search method on the function \(g(t) = f(\mathbf{x}^{(k)} - t \nabla f(\mathbf{x}^{(k)} ) )\)
Set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - t^{k} \nabla f(\mathbf{x}^{(k)} )$
If $\left\lVert \nabla f(\mathbf{x}^{(k+1)} ) \right\rVert < \epsilon$, then STOP and output $\mathbf{x}^{(k+1)}$
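The gradient method with a constant stepsize can be sketched directly; the quadratic objective and stepsize below are illustrative (for this $f$, $t$ must lie in $(0, 2/L)$ with $L = 2\left\lVert A \right\rVert$):

```python
import numpy as np

# Gradient method with constant stepsize (a sketch; illustrative objective).
def gradient_method(grad_f, x0, t, eps=1e-8, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:   # stopping criterion ||grad f|| < eps
            break
        x = x - t * g
    return x

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
grad_f = lambda x: 2 * A @ x          # gradient of f(x) = x^T A x
# Here L = 2||A|| = 4, so any constant stepsize in (0, 0.5) works; take 0.2.
x = gradient_method(grad_f, np.array([5.0, -3.0]), t=0.2)
assert np.linalg.norm(x) < 1e-6       # converges to the minimiser 0
```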
Lemma 57. Let $\{\mathbf{x}^{(k)}\}_{k\geq 0}$ be the sequence generated by the gradient method with exact line search for minimizing a continuously differentiable function $f$. Then for any $k = 0,1,2, \ldots$ \(f(\mathbf{x}^{(k+1)} ) \leq f(\mathbf{x}^{(k)} )\)
Definition 58 (Lipschitz Gradient). Let $f$ be a continuously differentiable function over $\mathbb{R}^n$. We say that $f$ has a Lipschitz gradient if there exists $L \geq 0$ for which: \(\left\lVert \nabla f(\mathbf{x} ) - \nabla f(\mathbf{y} ) \right\rVert \leq L \left\lVert \mathbf{x} - \mathbf{y} \right\rVert, \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n\)
Remark 2.
If $\nabla f$ is Lipschitz with constant $L$, then it is also Lipschitz continuous with constant $\widetilde{L}$ for all $\widetilde{L} \geq L$
The class of functions with Lipschitz gradient with constant $L$ denoted by $C^{1,1}_{L}(\mathbb{R}^n)$ or just $C^{1,1}(\mathbb{R}^n)$
Linear functions: Given $\mathbf{a}\in \mathbb{R}^n$, the function $f(\mathbf{x} ) = \mathbf{a}^T \mathbf{x}$ has constant gradient $\nabla f(\mathbf{x}) = \mathbf{a}$, so it has a Lipschitz gradient with constant $L = 0$ (while $f$ itself is Lipschitz with constant $\left\lVert \mathbf{a} \right\rVert$)
Quadratic functions: Given $A \in \mathbb{R}^{n\times n}$, the function $f(\mathbf{x} ) = \mathbf{x}^T A \mathbf{x} + 2 \mathbf{b}^T \mathbf{x} + c$ a $C^{1,1}$ function with the smallest Lipschitz constant $L = 2\left\lVert A \right\rVert$
Theorem 59 (Equivalence to Boundedness of the Hessian). Let $f$ be a twice continuously differentiable function over $\mathbb{R}^n$. Then the following are equivalent:
$f$ has a Lipschitz gradient with constant $L$
_The Hessian of $f$ is bounded, i.e. there exists $M \geq 0$ for which: \(\left\lVert \nabla^{2} f(\mathbf{x} ) \right\rVert \leq M, \quad \forall \mathbf{x} \in \mathbb{R}^n\)
Lemma 60 (Sufficient decrease of the gradient method). Let $f \in C^{1,1}_{L}(\mathbb{R}^n)$. Let $\{\mathbf{x}^{(k)}\}_{k\geq 0}$ be the sequence generated by the gradient method for solving
\[\mathop{\min}_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x} )\]
with one of the following stepsize strategies:
constant stepsize $\widetilde{t} \in (0, \frac{2}{L})$
exact line search
backtracking procedure with parameters $s\in \mathbb{R}_{++}, \alpha \in (0,1)$ and $\beta \in (0,1)$. Then \(f(\mathbf{x}^{(k)}) - f(\mathbf{x}^{(k+1)}) \geq M \left\lVert \nabla f(\mathbf{x}^{(k)}) \right\rVert^{2}\) where \(M = \begin{cases} \widetilde{t} (1 - \widetilde{t} \frac{L}{2}) & \text{if constant stepsize} \\ \frac{1}{2L} & \text{if exact line search} \\ \alpha \mathop{\min} \{s, \frac{2(1-\alpha )\beta }{L} \} & \text{if backtracking line search} \end{cases}\)
Theorem 61 (Convergence of the Gradient Method). Let $\{\mathbf{x}^{(k)}\}_{k\geq 0}$ be the sequence generated by the gradient method for solving \(\mathop{\min}_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x} )\)
with one of the following stepsize strategies:
constant stepsize $\widetilde{t} \in (0, \frac{2}{L})$
exact line search
backtracking procedure with parameters $s\in \mathbb{R}_{++}, \alpha \in (0,1)$ and $\beta \in (0,1)$
Assume that
$f\in C_{L}^{1,1}(\mathbb{R}^n)$
$f$ is bounded below over $\mathbb{R}^n$, that is, there exists $m\in \mathbb{R}$ such that $f(\mathbf{x} ) > m$ for all $\mathbf{x} \in \mathbb{R}^n$
Then:
for any $k$, $f(\mathbf{x}^{(k+1)}) < f(\mathbf{x}^{(k)})$ unless $\nabla f(\mathbf{x}^{(k)}) = \mathbf{0}$
$\nabla f(\mathbf{x}^{(k)}) \to \mathbf{0}$ as $k \to \infty$
Definition 62 (Condition number). Let $A$ be an $n\times n$ positive definite matrix. The condition number of $A$ is defined as: \(\kappa (A) = \frac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)}\) where $\lambda_{\text{max}}(A)$ and $\lambda_{\text{min}}(A)$ are the largest and smallest eigenvalues of $A$ respectively.
Lemma 63 (Kantorovich Inequality). Let $A$ be an $n\times n$ positive definite matrix. Then for any $0 \neq \mathbf{x} \in \mathbb{R}^n$, we have:
\[\frac{(\mathbf{x}^T \mathbf{x})^{2} }{(\mathbf{x}^T A \mathbf{x} )(\mathbf{x}^T A^{-1} \mathbf{x} )} \geq \frac{4 \lambda_{max}(A) \lambda_{min}(A)}{(\lambda_{max}(A) + \lambda_{min}(A))^2}\]Theorem 64 (Gradient method for minimizing $\mathbf{x}^T A \mathbf{x}$ ). Let $\{x^k\}_{k\geq 0}$ be the sequence generated by the gradient method with exact line search for solving the problem \(\mathop{\min}_{\mathbf{x} \in \mathbb{R}^n } \mathbf{x}^T A \mathbf{x} \quad (A \succ 0)\) Then for any $k = 0,1, \ldots$
\[f(x^{k+1} ) \leq \left(\frac{M-m}{M+m}\right)^{2} f(x^k)\]where $M = \lambda_{max}(A)$ and $m = \lambda_{min}(A)$
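The per-step contraction factor of Theorem 64 can be checked numerically. A sketch with an illustrative diagonal matrix (so $M$ and $m$ are read off directly); for $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ the exact stepsize along $-\nabla f$ has the closed form $t = \frac{\mathbf{g}^T\mathbf{g}}{2\,\mathbf{g}^T A \mathbf{g}}$ with $\mathbf{g} = 2A\mathbf{x}$:

```python
import numpy as np

A = np.array([[10.0, 0.0], [0.0, 1.0]])   # M = 10, m = 1
M, m = 10.0, 1.0
rate = ((M - m) / (M + m))**2             # contraction factor of Theorem 64

f = lambda x: x @ A @ x
x = np.array([1.0, 1.0])
for _ in range(100):
    g = 2 * A @ x
    if np.linalg.norm(g) < 1e-12:
        break
    t = (g @ g) / (2 * g @ A @ g)         # exact minimiser of f(x - t g)
    x_new = x - t * g
    assert f(x_new) <= rate * f(x) + 1e-12  # bound of Theorem 64 at every step
    x = x_new
```

The iterates zig-zag between the two eigendirections, which is the behaviour the condition number $\kappa(A)$ quantifies.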
Algorithm 3: Scaled Gradient Method {#algorithm-3-scaled-gradient-method .unnumbered}
Initialisation: A tolerance parameter $\epsilon >0$ and choose $\mathbf{x}^{(0)} \in \mathbb{R}^n$
General Step: For $k = 0,1,2,\ldots$ execute the following:
Pick $\mathbf{D}^{(k)} \succ 0$, a scaling matrix
Pick a stepsize $t^k$ by a line search method on the function \(g(t) = f(\mathbf{x}^{(k)} - t \mathbf{D}^{(k)} \nabla f(\mathbf{x}^{(k)} ) )\)
Set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - t^{k} \mathbf{D}^{(k)} \nabla f(\mathbf{x}^{(k)} )$
If $\left\lVert \nabla f(\mathbf{x}^{(k+1)} )\right\rVert \leq \epsilon$, then STOP and output $\mathbf{x}^{(k+1)}$
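A sketch of the scaled gradient step. The choice of scaling matrix is ours, not prescribed by the algorithm; here $\mathbf{D}$ is taken as the inverse Hessian of an illustrative quadratic, in which case the scaled step with $t = 1$ is exactly the Newton step and a single iteration suffices:

```python
import numpy as np

# ill-conditioned quadratic f(x) = x^T A x, Hessian 2A
A = np.diag([100.0, 1.0])
grad = lambda x: 2 * A @ x

D = np.diag(1.0 / np.diag(2 * A))   # D > 0: inverse of the (diagonal) Hessian
t = 1.0                              # constant stepsize; exact for this D

x = np.array([1.0, 1.0])
for _ in range(5):
    x = x - t * D @ grad(x)          # scaled gradient step
```

With $\mathbf{D} = \mathbf{I}$ this reduces to the plain gradient method, which on this problem would crawl because $\kappa(A) = 100$.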
Algorithm 4: The Damped Gauss-Newton Method {#algorithm-4-the-damped-gauss-newton-method .unnumbered}
Initialisation: A tolerance parameter $\epsilon >0$ and choose $\mathbf{x}^{(0)} \in \mathbb{R}^n$
General Step: For $k = 0,1,2,\ldots$ execute the following:
Set $\mathbf{d}^k = -\left(J(\mathbf{x}^k)^T J(\mathbf{x}^k)\right)^{-1} J(\mathbf{x}^k)^T F(\mathbf{x}^k)$
Set $t^k$ by a line search procedure on the function \(h(t) = g(\mathbf{x}^k + t\mathbf{d}^k)\)
Set $\mathbf{x}^{k+1} = \mathbf{x}^k + t^k \mathbf{d}^k$
If $\left\lVert \nabla g(x^{k+1} )\right\rVert \leq \epsilon$, then STOP and output $\mathbf{x}^{(k+1)}$
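A sketch of the damped Gauss-Newton step, assuming the nonlinear least-squares objective $g(\mathbf{x}) = \left\lVert F(\mathbf{x}) \right\rVert^2$ (so $\nabla g = 2J^T F$). The residual map $F$ below is an illustrative example of my own, and the line search is a crude Armijo backtracking:

```python
import numpy as np

def F(x):   # illustrative residuals; zero at x = (1, 2)
    return np.array([x[0] - 1.0, x[0] * x[1] - 2.0])

def J(x):   # Jacobian of F
    return np.array([[1.0, 0.0], [x[1], x[0]]])

def g(x):
    return F(x) @ F(x)

x = np.array([2.0, 2.0])
for _ in range(50):
    Jx, Fx = J(x), F(x)
    if np.linalg.norm(Jx.T @ Fx) < 1e-10:     # stationarity of g
        break
    d = -np.linalg.solve(Jx.T @ Jx, Jx.T @ Fx)  # Gauss-Newton direction
    gr = 2 * Jx.T @ Fx                           # gradient of g
    t = 1.0
    while g(x + t * d) > g(x) + 0.5 * t * gr @ d:  # Armijo backtracking
        t *= 0.5
    x = x + t * d
```

Near a zero-residual solution the full step $t = 1$ is accepted and convergence is fast, which is the appeal of Gauss-Newton over plain gradient descent on $g$.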
Stochastic Gradient Descent
The Kaczmarz Algorithm
Definition 65 (Kaczmarz Algorithm). Let $A \in \mathbb{R}^{m\times n}$ and $\mathbf{b} \in \mathbb{R}^{m}$. The Kaczmarz Algorithm is an iterative method for solving the linear system $A\mathbf{x} = \mathbf{b}$. The algorithm is as follows:
Start with an initial guess $\mathbf{x}^{(0)} \in \mathbb{R}^n$
For $k = 0,1,2,\ldots$ execute the following:
- For $i = 1,2,\ldots,m$ execute the following: \(\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \frac{b_i - a_i^T \mathbf{x}^{(k)} }{\left\lVert a_i \right\rVert^{2}} a_i\)
Note that we do not need to compute $A^{-1}$. In the cyclic variant, the row $i$ used at the $k$-th iteration cycles periodically through all $m$ rows of $A$: \(i = (k \bmod m) + 1\) Provided the system is consistent, the Kaczmarz algorithm converges to a solution of the linear system.
The Kaczmarz algorithm converges exponentially (i.e. linearly in the optimisation sense) to the solution if, at the $k$-th iteration, the row $i$ is chosen at random, either uniformly or with probability proportional to the squared row norm: $p_i = \left\lVert a_i \right\rVert^{2} / \left\lVert A \right\rVert_F^{2}$, where $\left\lVert A \right\rVert_F$ is the Frobenius norm.
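The randomised variant fits in a few lines. A sketch on a consistent synthetic system; the matrix size, seed, and iteration count are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
b = A @ x_true                             # consistent system by construction

p = np.sum(A**2, axis=1) / np.sum(A**2)    # p_i = ||a_i||^2 / ||A||_F^2
x = np.zeros(5)
for k in range(5000):
    i = rng.choice(20, p=p)                # sample a row index
    a = A[i]
    x = x + (b[i] - a @ x) / (a @ a) * a   # project onto {z : a^T z = b_i}
```

Each update is the orthogonal projection of the current iterate onto the hyperplane defined by one equation, so no matrix factorisation or inverse is ever formed.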
Stochastic Gradient Descent
Definition 66 (Stochastic Gradient Descent (SGD)). An iterative optimization algorithm used in machine learning to minimize a loss function $L(\theta)$, where $\theta$ represents the parameters of the model. SGD modifies the parameters by following these steps:
Initialize the parameters $\theta$ to some starting values $\theta_0$.
At each iteration $t$, randomly select a minibatch $\mathcal{B}_t$ from the dataset.
Compute the gradient of the loss function approximated over the minibatch: \(g_t = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla L_i(\theta_t)\) where $\nabla L_i(\theta_t)$ is the gradient of the loss function with respect to $\theta$ computed at the $i$-th data point.
Update the parameters using a learning rate $\eta$: \(\theta_{t+1} = \theta_t - \eta g_t\)
Repeat steps 2-4 until the parameters converge or a predefined number of iterations is reached.
SGD is particularly effective for large datasets as it requires less computational resources per iteration. The random selection of data helps in avoiding local minima, but it also introduces variability in the gradient estimation, potentially causing fluctuation in the convergence path.
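The steps above can be sketched for a least-squares loss. Everything here is an illustrative choice (data, minibatch size 32, learning rate $\eta = 0.05$); the data is noiseless, i.e. the $\sigma = 0$ case of the convergence theorem below:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                    # noiseless labels

theta = np.zeros(3)                   # step 1: initialise parameters
eta, batch = 0.05, 32
for t in range(3000):
    B = rng.choice(500, size=batch, replace=False)   # step 2: random minibatch
    # step 3: minibatch gradient of L(theta) = mean_i (x_i^T theta - y_i)^2
    g = 2 * X[B].T @ (X[B] @ theta - y[B]) / batch
    theta = theta - eta * g           # step 4: parameter update
```

With noisy labels the iterates would instead hover in a neighbourhood of the minimiser whose radius scales with $\eta$, matching the discussion below.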
Theorem 67 (Convergence of SGD). Assume:
The cost $g(\mathbf{x} )$ is such that \(\left\lVert \nabla g(\mathbf{x} ) - \nabla g(\mathbf{y} ) \right\rVert \leq L \left\lVert \mathbf{x} - \mathbf{y} \right\rVert , \quad \text{and } \nabla^{2} g(\mathbf{x} ) \succeq \mu I\)
The sample gradient $\nabla Q_{i} (\mathbf{x}^k )$ is an unbiased estimator of $\nabla g(\mathbf{x}^k)$
For all $\mathbf{x}$, \(\mathbb{E} \left[ \left\lVert \nabla Q_{i}(\mathbf{x} )\right\rVert^{2} \right] \leq \sigma^{2} + c \left\lVert \nabla g(\mathbf{x} ) \right\rVert^{2}\) Then if $t^k \equiv t \leq \frac{1}{Lc}$, SGD achieves \(\mathbb{E} \left[ g(\mathbf{x}^{k} ) - g(\mathbf{x}^{\ast} ) \right] \leq \frac{tL \sigma^{2} }{2\mu } + (1- t\mu )^k \left[ g(\mathbf{x}^{0} ) - g(\mathbf{x}^{\ast} ) \right]\)
The above implies
Fast (linear) convergence during first iterations
Convergence to a neighbourhood of the optimal solution, without further improvement
If the gradient computation is noiseless, that is $\sigma = 0$, then linear convergence to the optimal solution
A smaller stepsize $t$ yields convergence to a smaller neighbourhood of the optimal solution
Convex Sets and Functions
Definition 68 (Convex Set). A set $C \subseteq \mathbb{R}^n$ is convex if for any $\mathbf{x}, \mathbf{y} \in C$ and any $\lambda \in [0,1]$, we have: \(\lambda \mathbf{x} + (1-\lambda )\mathbf{y} \in C\)
Example 69.
Any line in $\mathbb{R}^n$ is a convex set \(L = \{\mathbf{z} + t \mathbf{d} : t\in \mathbb{R}\}\)
where $\mathbf{z}, \mathbf{d} \in \mathbb{R}^n$ with $\mathbf{d} \neq 0$
The closed and open segments $[x,y]$ and $(x,y)$ for $x,y \in \mathbb{R}^n$ with $x \neq y$, as well as $\varnothing$ and $\mathbb{R}^n$, are convex sets
A hyperplane is a convex set \(H = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{a}^T \mathbf{x} = b\} \quad\)
A half-space is a convex set \(H^- = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{a}^T \mathbf{x} \leq b\} \quad\)
The open and closed balls are convex sets
An ellipsoid is a convex set \(E = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{x}^T Q \mathbf{x} + 2 \mathbf{b}^T \mathbf{x} + c \leq 0\}, \quad Q \succ 0\)
Lemma 70 (Intersection of Convex Sets). The intersection of any collection of convex sets is a convex set
Theorem 71 (Properties of Convex Sets).
Let $C_1, C_2, \ldots , C_{k} \subseteq \mathbb{R}^n$ be convex sets and let $\mu_1, \mu_2,\ldots, \mu_{k} \in \mathbb{R}$. Then the set \(C = \mu_1 C_1 + \mu_2 C_2 + \ldots + \mu_{k} C_{k} = \{ \mu_1 \mathbf{x}_1 + \ldots + \mu_{k} \mathbf{x}_{k} : \mathbf{x}_i \in C_i \}\) is convex
Let $C_{i}\subseteq \mathbb{R}^{k_{i}}, i = 1, \ldots , m$ be convex sets. Then \(C = C_1 \times C_2 \times \ldots \times C_{m} = \{ (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{m} ) : \mathbf{x}_i \in C_i, i = 1, \ldots , m \}\) is convex
Let $M \subseteq \mathbb{R}^n$ be a convex set and let $A \in \mathbb{R}^{m\times n}$. Then the set \(A(M) = \{ A\mathbf{x} : \mathbf{x} \in M \}\) is convex
Let $D \subseteq \mathbb{R}^m$ be convex and let $A \in \mathbb{R}^{m\times n}$. Then the set \(A^{-1} (D) = \{ \mathbf{x} \in \mathbb{R}^n : A\mathbf{x} \in D \}\) is convex
Definition 72 (Convex Combinations). Given $m$ points $x_1, x_2, \ldots , x_{m}\in \mathbb{R}^n$, a convex combination of these points is a point of the form: \(\sum_{i=1}^{m} \lambda_i x_i, \quad \text{where } \lambda_i \geq 0, \sum_{i=1}^{m} \lambda_i = 1\)
Theorem 73. Let $C \subseteq \mathbb{R}^n$ be a convex set and let $x_1, x_2, \ldots , x_{m}\in C$. Then for any $\lambda \in \Delta_{m}$ (the unit simplex), the relation $\sum_{i=1}^{m} \lambda_{i}\mathbf{x}_{i} \in C$ holds
Definition 74 (The Convex Hull). Let $S \subseteq \mathbb{R}^n$. The convex hull of $S$, denoted by $\text{conv}(S)$, is the set of all convex combinations of points in $S$: \(\text{conv}(S) = \left\{ \sum_{i=1}^{m} \lambda_i \mathbf{x}_i : \mathbf{x}_i \in S, \lambda_i \geq 0, \sum_{i=1}^{m} \lambda_i = 1 \right\}\)
Theorem 75 (Caratheodory). Let $S \subseteq \mathbb{R}^n$ and let $\mathbf{x} \in \text{conv}(S)$. Then there exists a subset $S' \subseteq S$ with $|S'| \leq n+1$ such that $\mathbf{x} \in \text{conv}(S')$. That is, there exist $\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_{n+1} \in S$ and $\lambda \in \Delta_{n+1}$ such that \(\mathbf{x} = \sum_{i=1}^{n+1} \lambda_i \mathbf{x}_i\)
Definition 76 (Extreme Point). A point $\mathbf{x} \in C$ is an extreme point of a convex set $C$ if it cannot be expressed as a convex combination of two distinct points in $C$. That is, $\mathbf{x}$ is an extreme point of $C$ if for any $\mathbf{y}, \mathbf{z} \in C$ and any $\lambda \in (0,1)$, we have: \(\mathbf{x} = \lambda \mathbf{y} + (1-\lambda )\mathbf{z} \implies \mathbf{x} = \mathbf{y} = \mathbf{z}\) The set of all extreme points of a convex set $C$ is denoted by $\text{ext}(C)$.
Theorem 77 (The Krein-Milman Theorem). Let $S \subseteq \mathbb{R}^n$ be a compact convex set. Then \(S = \text{conv}(\text{ext}(S))\)
Convex Functions
Definition 78 (Convex Function). A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and any $\lambda \in [0,1]$, we have: \(f(\lambda \mathbf{x} + (1-\lambda )\mathbf{y} ) \leq \lambda f(\mathbf{x} ) + (1-\lambda )f(\mathbf{y} )\)
Definition 79 (Strict Convexity). A function $f: C \to \mathbb{R}$ defined on convex set $C \subseteq \mathbb{R}^n$ is called strictly convex if for any $\mathbf{x}, \mathbf{y} \in C$ and any $\lambda \in (0,1)$, we have: \(f(\lambda \mathbf{x} + (1-\lambda )\mathbf{y} ) < \lambda f(\mathbf{x} ) + (1-\lambda )f(\mathbf{y} )\)
Definition 80 (Concavity). A function $f$ is concave if $-f$ is convex. Equivalently; a function $f: \mathbb{R}^n \to \mathbb{R}$ is concave if for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and any $\lambda \in [0,1]$, we have: \(f(\lambda \mathbf{x} + (1-\lambda )\mathbf{y} ) \geq \lambda f(\mathbf{x} ) + (1-\lambda )f(\mathbf{y} )\)
Theorem 81 (Jensen’s Inequality). Let $f:C \to \mathbb{R}$ be a convex function where $C\subseteq \mathbb{R}^n$ is a convex set. Then for any $\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_{m} \in C$ and any $\lambda \in \Delta_{m}$, we have: \(f\left( \sum_{i=1}^{m} \lambda_i \mathbf{x}_i \right) \leq \sum_{i=1}^{m} \lambda_i f(\mathbf{x}_i)\)
Definition 82 (Epigraph). Let $f: \mathbb{R}^n \to \mathbb{R}$ be a function. The epigraph of $f$ is the set: \(\text{epi}(f) = \{ (\mathbf{x}, t) \in \mathbb{R}^{n+1} : f(\mathbf{x} ) \leq t \}\)
Theorem 83. Let $f:\mathbb{R}^n \to \mathbb{R}$. Then $f$ is convex if and only if $\text{epi}(f)$ is a convex set
First-order Characterizations of Convexity
Theorem 84 (Gradient Inequality). Let $f:C \to \mathbb{R}$ be a continuously differentiable function defined on a convex set $C \subseteq \mathbb{R}^n$. Then $f$ is convex if and only if \(f(\mathbf{y} ) \geq f(\mathbf{x} ) + \nabla f(\mathbf{x} )^T (\mathbf{y} - \mathbf{x} ), \quad \forall \mathbf{x}, \mathbf{y} \in C\)
Theorem 85 (Stationarity Implies Global Optimality). Let $f$ be a continuously differentiable function which is convex over a convex set $C \subseteq \mathbb{R}^n$. Suppose that $\nabla f(\mathbf{x}^{\ast} ) = 0$ for some $\mathbf{x}^{\ast} \in C$. Then $\mathbf{x}^{\ast}$ is a global minimizer of $f$ over $C$
Theorem 86 (Convexity of Quadratic Functions with Positive Semidefinite Matrices). Let $f: \mathbb{R}^n \to \mathbb{R}$ be the quadratic function $f(\mathbf{x} ) = \mathbf{x}^T A \mathbf{x} + 2 \mathbf{b}^T \mathbf{x} + c$, where $A \in \mathbb{R}^{n\times n}$ is symmetric, $\mathbf{b} \in \mathbb{R}^n, c\in \mathbb{R}$. Then $f$ is (strictly) convex if and only if $A \succeq 0$ ( $A \succ 0$ )
Theorem 87 (Monotonicity of the Gradient). Suppose that $f$ is a continuously differentiable function over a convex set $C \subseteq \mathbb{R}^n$. Then $f$ is convex over $C$ if and only if \((\nabla f(\mathbf{x} ) - \nabla f(\mathbf{y} ))^T (\mathbf{x} - \mathbf{y} ) \geq 0, \quad \forall \mathbf{x}, \mathbf{y} \in C\)
Second-order Characterizations of Convexity
Theorem 88 (Second-order Characterization of Convexity). Let $f$ be a twice continuously differentiable function over an open convex set $C\subseteq \mathbb{R}^n$. Then $f$ is convex over $C$ if and only if $\nabla^{2} f(\mathbf{x} ) \succeq 0$ for any $\mathbf{x} \in C$
Further Results for Convex Functions
Lemma 89 (Operations Preserving Convexity). The following preserve convexity:
Let $f$ be a convex function over a convex set $C \subseteq \mathbb{R}^n$. Then $\alpha f$ is convex over $C$ for any $\alpha \geq 0$
Let $f$ and $g$ be convex functions over a convex set $C \subseteq \mathbb{R}^n$. Then the function $f+g$ is convex over $C$
Let $f$ be a convex function over a convex set $C \subseteq \mathbb{R}^n$. Let $A \in \mathbb{R}^{n\times m}$ and $\mathbf{b} \in \mathbb{R}^n$. Then the function $g$ defined by \(g(\mathbf{y} ) = f(A \mathbf{y} + \mathbf{b} ) \text{ is convex over } D = \{ \mathbf{y} \in \mathbb{R}^m : A\mathbf{y} + \mathbf{b} \in C \}\)
Theorem 90 (Preservation of Convexity Under Partial Minimization). Let $f: C \times D \to \mathbb{R}$ be a convex function over the set $C \times D$, where $C \subseteq \mathbb{R}^n$ and $D \subseteq \mathbb{R}^m$ are convex sets. Let \(g(\mathbf{x}) = \mathop{\min}_{\mathbf{y} \in D} f(\mathbf{x}, \mathbf{y} ), \quad \mathbf{x} \in C\) where we assume that the minimum is finite. Then $g$ is convex over $C$
Definition 91 (Level sets). Let $f: S\to \mathbb{R}$ be a function defined over a set $S \subseteq \mathbb{R}^n$. The level set of $f$ at level $\alpha$ is the set: \(\text{Lev}(f,\alpha) = \{ \mathbf{x} \in S : f(\mathbf{x} ) \leq \alpha \}\)
Theorem 92 (Convexity of Level Sets). Let $f: S \to \mathbb{R}$ be a convex function defined over a convex set $S \subseteq \mathbb{R}^n$. Then $\text{Lev}(f,\alpha)$ is a convex set for any $\alpha \in \mathbb{R}$
Four Important Theorems for Convex Functions
Theorem 93 (Continuity of Convex Functions). Let $f: S \to \mathbb{R}$ be a convex function defined over a convex set $S \subseteq \mathbb{R}^n$. Then $f$ is continuous on $\text{int}(S)$. In particular, let $\mathbf{x}_0 \in \text{int}(S)$. Then there exist $\epsilon >0$ and $L > 0$ such that $B(\mathbf{x}_0, \epsilon ) \subseteq S$ and \(|f(\mathbf{x} ) - f(\mathbf{x}_0 )| \leq L \left\lVert \mathbf{x} - \mathbf{x}_0 \right\rVert, \quad \forall \mathbf{x} \in B(\mathbf{x}_0, \epsilon )\)
Theorem 94 (Existence of directional derivatives of Convex Functions). Let $f: S \to \mathbb{R}$ be a convex function defined over a convex set $S \subseteq \mathbb{R}^n$. Then for any $\mathbf{x} \in \text{int}(S)$ and any $0 \neq \mathbf{d} \in \mathbb{R}^n$, the directional derivative of $f$ at $\mathbf{x}$ in the direction $\mathbf{d}$ exists and is given by: \(f'(\mathbf{x} ; \mathbf{d} ) = \lim_{t\to 0^+} \frac{f(\mathbf{x} + t\mathbf{d} ) - f(\mathbf{x} )}{t}\)
Theorem 95 (No Maximum inside the Convex Set). Let $f: C \to \mathbb{R}$ be a convex function defined over a non-empty convex set $C \subseteq \mathbb{R}^n$. Then $f$ does not attain a maximum at a point in $\text{int}(C)$, unless $f$ is constant
Theorem 96 (Maximum of a Convex Function over a compact convex set). Let $f:C \to \mathbb{R}$ be convex over the nonempty convex and compact set $C \subseteq \mathbb{R}^n$. Then there exists at least one maximiser of $f$ over $C$ that is an extreme point of $C$
Convex Optimisation
Theorem 97 (Local minima are global in CVX). Let $f: C \to \mathbb{R}$ be a convex function defined over a convex set $C \subseteq \mathbb{R}^n$. Suppose that $\mathbf{x}^{\ast} \in C$ is a local minimizer of $f$ over $C$. Then $\mathbf{x}^{\ast}$ is a global minimizer of $f$ over $C$
Theorem 98. Let $f:C \to \mathbb{R}$ be a strictly convex function defined on the convex set $C$. Let $\mathbf{x}^{\ast} \in C$ be a local minimum of $f$ over $C$. Then $\mathbf{x}^{\ast}$ is a strict global minimum of $f$ over $C$
Theorem 99. Let $f:C \to \mathbb{R}$ be a convex function defined over the convex set $C \subseteq \mathbb{R}^n$. Then the set of optimal solutions of the problem
\[\mathop{\min} \{f(\mathbf{x} ) : \mathbf{x} \in C\}\]is a convex set. If in addition $f$ is strictly convex, then the set of optimal solutions contains at most one point
Definition 100 (Stationarity). Let $f$ be a continuously differentiable function over a closed and convex set $C$. Then $\mathbf{x}^{\ast}$ is called a stationary point of $\mathop{\min}_{\mathbf{x} } \{f(\mathbf{x} ) : \mathbf{x} \in C\}$ if \(\nabla f(\mathbf{x}^{\ast} )^T (\mathbf{x} - \mathbf{x}^{\ast} ) \geq 0, \quad \forall \mathbf{x} \in C\)
Theorem 101 (Stationarity as a Necessary Optimality Condition). Let $f$ be a continuously differentiable function over a nonempty closed convex set $C$, and let $\mathbf{x}^{\ast}$ be a local minimum of $\mathop{\min}_{\mathbf{x} } \{f(\mathbf{x} ) : \mathbf{x} \in C\}$. Then $\mathbf{x}^{\ast}$ is a stationary point of the problem.
The Orthogonal Projection Operator
Definition 102 (Orthogonal Projection). Given a nonempty closed convex set $C$, the orthogonal projection operator $P_{C}: \mathbb{R}^n \to C$ is defined by: \(P_{C}(\mathbf{x} ) = \mathop{\arg\min}_{\mathbf{y} \in C} \left\lVert \mathbf{x} - \mathbf{y} \right\rVert^{2}\)
Theorem 103 (The First Projection Theorem). Let $C \subseteq \mathbb{R}^n$ be a nonempty closed and convex set. Then for any $\mathbf{x} \in \mathbb{R}^n$, the orthogonal projection $P_{C}(\mathbf{x} )$ exists and is unique
Theorem 104 (The Second Projection Theorem). Let $C$ be a nonempty closed convex set and let $\mathbf{x} \in \mathbb{R}^n$. Then $\mathbf{z} = P_{C}(\mathbf{x} )$ if and only if \((\mathbf{x} - \mathbf{z} )^T (\mathbf{y} - \mathbf{z} ) \leq 0, \quad \forall \mathbf{y} \in C\)
Theorem 105 (Representation of Stationarity via the Orthogonal Projection Operator). Let $f$ be a continuously differentiable function over the nonempty closed convex set $C$, and let $s>0$. Then $\mathbf{x}^{\ast}$ is a stationary point of the problem $\mathop{\min}_{\mathbf{x} } \{f(\mathbf{x} ) : \mathbf{x} \in C\}$ if and only if \(\mathbf{x}^{\ast} = P_{C}(\mathbf{x}^{\ast} - s\nabla f(\mathbf{x}^{\ast} ) )\)
The Gradient Projection Method
Algorithm 7: The Gradient Projection Method {#algorithm-7-the-gradient-projection-method .unnumbered}
Initialisation: A tolerance parameter $\epsilon >0$ and choose $\mathbf{x}^{(0)} \in \mathbb{R}^n$
General Step: For $k = 0,1,2,\ldots$ execute the following:
Pick stepsize $t^k$ by a line search procedure.
Set $\mathbf{x}^{(k+1)} = P_{C}(\mathbf{x}^{(k)} - t^{k} \nabla f(\mathbf{x}^{(k)} ) )$
If $\left\lVert \mathbf{x}^k - \mathbf{x}^{k+1} \right\rVert \leq \epsilon$, then STOP and output $\mathbf{x}^{(k+1)}$
Algorithm 8: The Gradient Projection Method with Backtracking {#algorithm-8-the-gradient-projection-method-with-backtracking .unnumbered}
Initialisation: A tolerance parameter $\epsilon >0$ and choose $\mathbf{x}^{(0)} \in \mathbb{R}^n$, Parameters: $s >0, \alpha \in (0,1)$ and $\beta \in (0,1)$ General Step: For $k = 0,1,2,\ldots$ execute the following:
Set $t^k = s$
While $f(\mathbf{x}^k) - f(P_{C}(\mathbf{x}^k - t^k \nabla f(\mathbf{x}^k))) < \alpha t^k \left\lVert G_{\frac{1}{t^k}}(\mathbf{x}^k)\right\rVert^{2}$, set $t^k = \beta t^k$
Set $\mathbf{x}^{(k+1)} = P_{C}(\mathbf{x}^{(k)} - t^{k} \nabla f(\mathbf{x}^{(k)} ) )$
If $\left\lVert \mathbf{x}^k - \mathbf{x}^{k+1} \right\rVert \leq \epsilon$, then STOP and output $\mathbf{x}^{(k+1)}$
Theorem 106 (Convergence of the Gradient Projection Method). Let $\{\mathbf{x}^k\}$ be the sequence generated by the gradient projection method for solving the problem $\mathop{\min}_{\mathbf{x} \in C} f(\mathbf{x} )$ with either a constant stepsize $\overline{t} \in (0,\frac{2}{L})$, where $L$ is a Lipschitz constant of $\nabla f$, or a backtracking stepsize strategy. Assume $f$ is bounded below. Then:
The sequence $\{f(\mathbf{x}^k)\}$ is nonincreasing
$G_{d}(\mathbf{x}^k) \to 0$ as $k \to \infty$, where \(d = \begin{cases} \frac{1}{\overline{t}} & \text{if constant stepsize} \\ \frac{1}{s} & \text{if backtracking} \end{cases}\)
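The gradient projection step is easy to sketch when the projection has a closed form. An illustrative example with a box constraint, where $P_C$ is a componentwise clip; the objective, box, and stepsize $t = 1/L$ are my own choices:

```python
import numpy as np

# min f(x) = ||x - c||^2 over the box C = [0,1]^2
c = np.array([2.0, -0.5])
f_grad = lambda x: 2 * (x - c)
P_C = lambda x: np.clip(x, 0.0, 1.0)   # projection onto the box

L = 2.0                  # Lipschitz constant of the gradient
t = 1.0 / L              # constant stepsize in (0, 2/L)
x = np.array([0.5, 0.5])
for _ in range(100):
    x_new = P_C(x - t * f_grad(x))     # projected gradient step
    if np.linalg.norm(x - x_new) < 1e-10:
        break                          # fixed point = stationary point
    x = x_new
```

The stopping test uses the fixed-point characterisation of stationarity from Theorem 105: the method halts once $\mathbf{x} = P_C(\mathbf{x} - t\nabla f(\mathbf{x}))$ up to tolerance.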
Optimality Conditions
Separation Theorem
Definition 107. A hyperplane \(H = \{ \mathbf{x} \in \mathbb{R}^n : \mathbf{a}^T \mathbf{x} = b \} \quad \text{where } \mathbf{a} \in \mathbb{R}^n \setminus \{0\}, b \in \mathbb{R}\) is said to strictly separate a point $\mathbf{y} \notin S$ from $S$ if
\[\mathbf{a}^T \mathbf{y} > b\]and
\[\mathbf{a}^T \mathbf{x} \leq b \quad \forall \mathbf{x} \in S\]Theorem 108 (Separation of a Point from a Closed and Convex Set). Let $C \subseteq \mathbb{R}^n$ be a nonempty closed and convex set, and let $\mathbf{y} \notin C$. Then there exist $\mathbf{p} \in \mathbb{R}^n \setminus \{0\}$ and $\alpha \in \mathbb{R}$ such that $\mathbf{p}^T \mathbf{y} > \alpha$ and $\mathbf{p}^T \mathbf{x} \leq \alpha$ for all $\mathbf{x} \in C$
Lemma 109 (Farkas’ Lemma - an Alternative Theorem). Let $\mathbf{c} \in \mathbb{R}^n$ and $\mathbf{A} \in \mathbb{R}^{m\times n}$. Then exactly one of the following systems has a solution:
$\mathbf{Ax} \leq 0, \mathbf{c}^T \mathbf{x} >0$
$\mathbf{A}^T \mathbf{y} = \mathbf{c} , \mathbf{y} \geq 0$
Lemma 110 (Farkas’ Lemma - Second Formulation). Let $\mathbf{c} \in \mathbb{R}^n$ and $\mathbf{A} \in \mathbb{R}^{m\times n}$. Then the following two claims are equivalent:
The implication $\mathbf{Ax} \leq 0 \implies \mathbf{c}^T \mathbf{x} \leq 0$ holds true
There exists $\mathbf{y} \in \mathbb{R}^m_+$ such that $\mathbf{A}^T \mathbf{y} = \mathbf{c}$
Theorem 111 (Gordan’s Alternative Theorem). Let $\mathbf{A} \in \mathbb{R}^{m\times n}$. Then exactly one of the following systems has a solution:
$\mathbf{Ax} < 0$
$\mathbf{p} \neq \textbf{0} , \mathbf{A}^T \mathbf{p} = 0, \mathbf{p} \geq 0$
Theorem 112 (KKT conditions for Linearly Constrained Problems - Necessary Optimality Conditions). Consider the minimization problem
\[\begin{cases} \mathop{\min}_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x} ) \\ \text{subject to } \mathbf{a}_{i}^T \mathbf{x} \leq b_{i}, \quad i = 1, \ldots , m \end{cases}\]where $f$ is continuously differentiable over $\mathbb{R}^n$, $\mathbf{a}_1, \ldots ,\mathbf{a}_m \in \mathbb{R}^n$ and $b_1, \ldots,b_m \in \mathbb{R}$, and let $\mathbf{x}^{\ast}$ be a local minimum point of the problem. Then there exist $\lambda_1, \ldots, \lambda_m \geq 0$ such that:
\[\nabla f(\mathbf{x}^{\ast} ) + \sum_{i=1}^{m} \lambda_i \mathbf{a}_i = 0\]and
\[\lambda_{i}( \mathbf{a}_{i}^T \mathbf{x}^{\ast} - b_i ) = 0, \quad i = 1, \ldots , m\]Theorem 113 (KKT Conditions for Convex Linearly Constrained Problems - Necessary and Sufficient Optimality Conditions). Consider the minimization problem \(\begin{cases} \mathop{\min}_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x} ) \\ \text{subject to } \mathbf{a}_{i}^T \mathbf{x} \leq b_{i}, \quad i = 1, \ldots , m \end{cases}\) where in addition $f$ is a convex continuously differentiable function over $\mathbb{R}^n$, and let $\mathbf{x}^{\ast}$ be a feasible solution. Then $\mathbf{x}^{\ast}$ is an optimal solution if and only if there exist $\lambda_1, \ldots , \lambda_m \geq 0$ such that: \(\nabla f(\mathbf{x}^{\ast} ) + \sum_{i=1}^{m} \lambda_i \mathbf{a}_i = 0\) and \(\lambda_{i}(\mathbf{a}_{i}^T \mathbf{x}^{\ast} - b_i ) = 0, \quad i = 1, \ldots , m\)
Theorem 114 (KKT Conditions for Linearly Constrained Problems). Consider the minimization problem \(\begin{cases} &\mathop{\min}_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x} ) \\ \text{subject to } &\mathbf{a}_{i}^T \mathbf{x} \leq b_{i}, \quad i = 1, \ldots , m\\ &\mathbf{c}_{j}^T \mathbf{x} = d_{j}, \quad j = 1, \ldots , p \end{cases}\) where $f$ is continuously differentiable, $\mathbf{a}_{i}, \mathbf{c}_{j} \in \mathbb{R}^n$ and $b_{i}, d_{j}\in \mathbb{R}$
(Necessity of the KKT conditions) If $\mathbf{x}^{\ast}$ is a local minimum of the problem, then there exist $\lambda_1, \ldots , \lambda_{m} \geq 0$ and $\mu_1, \ldots , \mu_{p} \in \mathbb{R}$ such that:
\[\begin{aligned} \nabla f(\mathbf{x}^{\ast} ) + \sum_{i=1}^{m} \lambda_i \mathbf{a}_i + \sum_{j=1}^{p} \mu_j \mathbf{c}_j &= 0 \\ \lambda_{i}(\mathbf{a}_{i}^T \mathbf{x}^{\ast} - b_{i} ) &= 0, \quad i = 1, \ldots , m \end{aligned}\](Sufficiency of the KKT conditions) If $f$ is convex over $\mathbb{R}^n$ and $\mathbf{x}^{\ast}$ is a feasible solution of the problem, for which there exist $\lambda_1, \ldots , \lambda_{m} \geq 0$ and $\mu_1, \ldots , \mu_{p} \in \mathbb{R}$ such that the KKT conditions are satisfied, then $\mathbf{x}^{\ast}$ is an optimal solution of the problem
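The KKT system can be verified numerically on a small problem. An illustrative example of my own: $\min \left\lVert \mathbf{x} \right\rVert^2$ subject to $x_1 + x_2 \leq -1$, whose minimiser is $\mathbf{x}^{\ast} = (-\frac{1}{2}, -\frac{1}{2})$ with multiplier $\lambda = 1$; since $f$ is convex, the sufficiency part then certifies optimality:

```python
import numpy as np

# min ||x||^2  s.t.  a^T x <= b  with a = (1,1), b = -1
a, b = np.array([1.0, 1.0]), -1.0
x_star = np.array([-0.5, -0.5])   # candidate minimiser
lam = 1.0                          # candidate multiplier

grad_f = 2 * x_star
assert np.allclose(grad_f + lam * a, 0)        # stationarity of the Lagrangian
assert np.isclose(lam * (a @ x_star - b), 0)   # complementary slackness
assert a @ x_star <= b + 1e-12                 # primal feasibility
assert lam >= 0                                # dual feasibility
```

The constraint is active ($\mathbf{a}^T\mathbf{x}^{\ast} = b$), so the positive multiplier carries the full weight of the gradient, exactly as the stationarity equation requires.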
Orthogonal Projections
Definition 115 (Orthogonal Projection onto Affine Spaces). Let $C$ be the affine space $C = \{ \mathbf{x} \in \mathbb{R}^n : \mathbf{Ax} = \mathbf{b} \}$ where $\mathbf{A} \in \mathbb{R}^{m\times n}$ has full row rank and $\mathbf{b} \in \mathbb{R}^m$. The orthogonal projection operator $P_{C}: \mathbb{R}^n \to C$ is given by: \(P_{C}(\mathbf{y} ) = \mathbf{y} - \mathbf{A}^T (\mathbf{A} \mathbf{A}^T )^{-1} (\mathbf{A} \mathbf{y} - \mathbf{b} )\)
Definition 116 (Orthogonal Projection onto Hyperplanes). Consider the hyperplane \(H = \{ \mathbf{x} \in \mathbb{R}^n : \mathbf{a}^T \mathbf{x} = b \} \quad \text{where } \mathbf{a} \in \mathbb{R}^n \setminus \{0\}, b \in \mathbb{R}\) The orthogonal projection operator $P_{H}: \mathbb{R}^n \to H$ is defined by: \(P_{H}(\mathbf{y} ) = \mathbf{y} - \frac{\mathbf{a}^T \mathbf{y} - b}{\left\lVert \mathbf{a} \right\rVert^{2}} \mathbf{a}\)
Lemma 117 (Distance of a point from a hyperplane). Let $H = \{ \mathbf{x} \in \mathbb{R}^n : \mathbf{a}^T \mathbf{x} = b \}$ be a hyperplane where $\mathbf{a} \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$. Then for any $\mathbf{y} \in \mathbb{R}^n$, the distance of $\mathbf{y}$ from $H$ is given by: \(\text{dist}(\mathbf{y} , H) = \frac{\left| \mathbf{a}^T \mathbf{y} - b \right|}{\left\lVert \mathbf{a} \right\rVert}\)
Lemma 118. Let $H^- = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{a}^T \mathbf{x} \leq b \}$ be a half-space where $\mathbf{a} \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$. Then \(P_{H^-}(\mathbf{x} ) = \mathbf{x} - \frac{[\mathbf{a}^T \mathbf{x} - b]_+}{\left\lVert \mathbf{a} \right\rVert^{2}} \mathbf{a}\)
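The closed-form projections onto a hyperplane and a half-space, together with the distance formula of Lemma 117, can be checked directly (the vectors $\mathbf{a}$, $b$, $\mathbf{y}$ below are illustrative):

```python
import numpy as np

def proj_hyperplane(y, a, b):
    """P_H(y) = y - (a^T y - b)/||a||^2 * a  (Definition 116)."""
    return y - (a @ y - b) / (a @ a) * a

def proj_halfspace(y, a, b):
    """P_{H^-}(y) = y - [a^T y - b]_+ / ||a||^2 * a  (Lemma 118)."""
    return y - max(a @ y - b, 0.0) / (a @ a) * a

a, b = np.array([1.0, 2.0]), 3.0
y = np.array([4.0, 4.0])

z = proj_hyperplane(y, a, b)
assert np.isclose(a @ z, b)            # z lies on H
# distance formula of Lemma 117
assert np.isclose(np.linalg.norm(y - z), abs(a @ y - b) / np.linalg.norm(a))

w = proj_halfspace(y, a, b)
assert a @ w <= b + 1e-12              # w lies in H^-
assert np.allclose(proj_halfspace(np.zeros(2), a, b), 0)  # interior points are fixed
```

Note the only difference between the two formulas is the positive-part operator: points already inside the half-space are left untouched.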
KKT Conditions for nonlinear problems
Lemma 119. Let $\mathbf{x}^{\ast}$ be a local optimal solution of
\[\mathop{\min} f(\mathbf{x} ) \quad \text{s.t } \mathbf{x} \in C\]for some convex set $C \subseteq \mathbb{R}^n$ and $f$ continuously differentiable. Then there are no feasible descent directions at $\mathbf{x}^{\ast}$. We say $\mathbf{d} \neq 0$ is a feasible descent direction at $\mathbf{x} \in C$ if
$\nabla f(\mathbf{x} )^T \mathbf{d} < 0$, and
There exists $\epsilon > 0$ such that $\mathbf{x} + t\mathbf{d} \in C$ for all $t \in (0, \epsilon )$
Extending this to the problem: \(\begin{cases} \mathop{\min} &f(\mathbf{x} ) \\ \text{s.t } & g_{i}(\mathbf{x} ) \leq 0 \quad i = 1, \ldots , m \end{cases}\) we introduce the notion of active constraints. We say the $i$-th constraint is active at $\mathbf{\widetilde{x} }$ if $g_{i}(\mathbf{\widetilde{x}}) = 0$, i.e. when the constraint is binding. \(I(\mathbf{\widetilde{x} } ) = \{i \in \{1, \ldots , m \}: g_{i}(\mathbf{\widetilde{x} } ) = 0\} \quad \text{is the set of active constraints at } \mathbf{\widetilde{x} }\)
Lemma 120. Let $\mathbf{x}^{\ast}$ be a local minimum of the problem \(\begin{cases} \mathop{\min} &f(\mathbf{x} ) \\ \text{s.t } & g_{i}(\mathbf{x} ) \leq 0 \quad i = 1, \ldots , m \end{cases}\) where $f$ and $g_1, \ldots, g_{m}$ are continuously differentiable functions over $\mathbb{R}^n$. Let $I(\mathbf{x}^{\ast} )$ be the set of active constraints at $\mathbf{x}^{\ast}$. Then there does not exist a vector $\mathbf{d} \in \mathbb{R}^n$ such that
$\nabla f(\mathbf{x}^{\ast} )^T \mathbf{d} < 0$, and
$\nabla g_{i}(\mathbf{x}^{\ast} )^T \mathbf{d} \leq 0, \quad \forall i \in I(\mathbf{x}^{\ast} )$
KKT Conditions for nonlinear convex problems
Theorem 121 (Sufficiency of the KKT conditions for convex optimization problems). Let $\mathbf{x}^{\ast}$ be a feasible solution of \(\begin{cases} \mathop{\min} &f(\mathbf{x} ) \\ \text{subject to } &g_{i}(\mathbf{x} ) \leq 0, \quad i = 1, \ldots , m\\ &h_{j}(\mathbf{x} ) = 0, \quad j = 1, \ldots , p \end{cases}\) where $f, g_1, \ldots ,g_m$ are continuously differentiable convex functions over $\mathbb{R}^n$ and $h_1, \ldots , h_p$ are affine functions. Suppose that there exist multipliers $\lambda_1, \ldots , \lambda_{m} \geq 0$ and $\mu_1, \ldots , \mu_{p} \in \mathbb{R}$ such that \(\begin{aligned} \nabla f(\mathbf{x}^{\ast} ) + \sum_{i=1}^{m} \lambda_i \nabla g_{i}(\mathbf{x}^{\ast} ) + \sum_{j=1}^{p} \mu_j \nabla h_{j}(\mathbf{x}^{\ast} ) &= 0\\ \lambda_{i} g_{i}(\mathbf{x}^{\ast} ) &= 0, \quad i = 1, \ldots , m\\ \end{aligned}\) Then $\mathbf{x}^{\ast}$ is an optimal solution of the problem
Theorem 122 (Necessity of the KKT conditions under the generalized Slater’s Condition). Let $\mathbf{x}^{\ast}$ be an optimal solution of the problem \(\begin{cases} &\mathop{\min} f(\mathbf{x} ) \\ \text{subject to } &g_{i}(\mathbf{x} ) \leq 0, \quad i = 1, \ldots , m\\ &h_{j}(\mathbf{x} ) \leq 0, \quad j = 1, \ldots , p\\ &s_{k}(\mathbf{x} ) = 0, \quad k = 1, \ldots , q \end{cases}\) where $f, g_1, \ldots ,g_m$ are continuously differentiable convex functions over $\mathbb{R}^n$, and $h_1, \ldots , h_p, s_1, \ldots ,s_q$ are affine functions. Suppose that there exists a point $\mathbf{\hat{x} }$ satisfying the generalized Slater’s condition: \(\begin{aligned} g_{i}(\mathbf{\hat{x} } ) &< 0, \quad i = 1, \ldots , m\\ h_{j}(\mathbf{\hat{x} } ) &\leq 0, \quad j = 1, \ldots , p\\ s_{k}(\mathbf{\hat{x} } ) &= 0, \quad k = 1, \ldots , q \end{aligned}\) Then there exist multipliers $\lambda_1, \ldots , \lambda_{m} \geq 0$, $\eta_1, \ldots , \eta_{p} \geq 0$ and $\mu_1, \ldots , \mu_{q} \in \mathbb{R}$ such that \(\begin{aligned} \nabla f(\mathbf{x}^{\ast} ) + \sum_{i=1}^{m} \lambda_i \nabla g_{i}(\mathbf{x}^{\ast} ) + \sum_{j=1}^{p} \eta_{j} \nabla h_{j}(\mathbf{x}^{\ast} ) + \sum_{k=1}^{q} \mu_{k} \nabla s_{k}(\mathbf{x}^{\ast} ) &= 0\\ \lambda_{i} g_{i}(\mathbf{x}^{\ast} ) &= 0, \quad i = 1, \ldots , m\\ \eta_{j}h_{j}(\mathbf{x}^{\ast} ) &= 0, \quad j = 1, \ldots , p\\ \end{aligned}\)
Duality
Definition 123 (Primal Problem). The primal problem is the problem of the form: \(\begin{cases} &\mathop{\min} f(\mathbf{x} ) \\ \text{subject to } &g_{i}(\mathbf{x} ) \leq 0, \quad i = 1, \ldots , m\\ &h_{j}(\mathbf{x} ) = 0, \quad j = 1, \ldots , p\\ &x \in X \end{cases}\) where $f, g_1, \ldots ,g_m, h_1, \ldots , h_p$ are functions defined on the set $X \subseteq \mathbb{R}^n$. This is the "usual" optimization problem. The Lagrangian associated to this problem is \(L(\mathbf{x} , \mathbf{\lambda}, \mathbf{\mu} ) = f(\mathbf{x} ) + \sum_{i=1}^{m} \lambda_{i} g_{i}(\mathbf{x} ) + \sum_{j=1}^{p} \mu_{j} h_{j}(\mathbf{x} ) \quad (\mathbf{x} \in X, \mathbf{\lambda} \in \mathbb{R}^m_+, \mathbf{\mu} \in \mathbb{R}^p)\) The dual objective function is \(q(\mathbf{\lambda} , \mathbf{\mu} ) = \mathop{\min}_{\mathbf{x} \in X} L(\mathbf{x} , \mathbf{\lambda}, \mathbf{\mu} )\) and its domain is
\[\text{dom} (q) = \{(\lambda ,\mu ) \in \mathbb{R}^m_+ \times \mathbb{R}^p : q(\lambda , \mu ) > -\infty \}\]The dual problem is given by
\[\begin{aligned} q^{\ast} &= \mathop{\max} q(\lambda , \mu )\\ \text{subject to } (\lambda, \mu ) &\in \text{dom}(q) \end{aligned}\]Theorem 124. Consider the primal problem with $f, g_{i}, h_{j}, i = 1, \ldots , m, j = 1, \ldots , p$, functions defined on the set $X\subseteq \mathbb{R}^n$, and let $q$ be the dual function defined above. Then:
$\text{dom}(q)$ is a convex set
$q$ is a concave function over $\text{dom}(q)$
Weak and Strong Duality
Theorem 125 (Weak Duality Theorem). Consider the primal problem and its dual problem. Then \(q^{\ast} \leq f^{\ast}\) where $f^{\ast} , q^{\ast}$ are the primal and dual optimal values respectively
Theorem 126 (Supporting Hyperplane Theorem). Let $C \subseteq \mathbb{R}^n$ be a convex set and let $\mathbf{y} \notin C$. Then there exists $\textbf{0} \neq \mathbf{p} \in \mathbb{R}^n$ such that \(\mathbf{p}^T \mathbf{x} \leq \mathbf{p}^T \mathbf{y} \quad \forall \mathbf{x} \in C\)
Theorem 127 (Separation of Two Convex Sets). Let $C_1, C_2 \subseteq \mathbb{R}^n$ be two nonempty convex sets such that $C_1 \cap C_2 = \emptyset$. Then there exists $\mathbf{p} \neq \textbf{0}$ for which \(\mathbf{p}^T \mathbf{x} \leq \mathbf{p}^T \mathbf{y} \quad \forall \mathbf{x} \in C_1, \forall \mathbf{y} \in C_2\)
Theorem 128 (Nonlinear Farkas Lemma). Let $X\subseteq \mathbb{R}^n$ be a convex set and let $f, g_1, \ldots , g_m$ be convex functions defined over $X$. Assume that there exists $\hat{x} \in X$ such that \(g_1 (\mathbf{\hat{x} }) <0, g_2(\mathbf{\hat{x} }) < 0, \ldots , g_m(\mathbf{\hat{x} }) < 0\) Let $c \in \mathbb{R}$. Then the following two claims are equivalent:
The following implication holds true: \(\mathbf{x} \in X, g_{i}(\mathbf{x} ) \leq 0, i = 1, \ldots , m \implies f(\mathbf{x} ) \geq c\)
There exist $\lambda_1, \ldots , \lambda_{m} \geq 0$ such that \(\mathop{\min}_{\mathbf{x} \in X} \left\{ f(\mathbf{x} ) + \sum_{i=1}^{m} \lambda_{i}g_{i}(\mathbf{x} ) \right\} \geq c\)
Theorem 129 (Strong Duality of Convex Problems with Inequality Constraints). Consider the optimization problem \(\begin{aligned} f^{\ast} &= \mathop{\min} f(\mathbf{x} )\\ \text{subject to } g_{i}(\mathbf{x} ) &\leq 0, \quad i = 1, \ldots , m\\ \mathbf{x} &\in X \end{aligned}\) where $X$ is a convex set and $f, g_1, \ldots , g_m$ are convex functions defined over $X$. Suppose that there exists $\mathbf{\hat{x} }\in X$ for which $g_{i}(\mathbf{\hat{x} } ) < 0, i = 1,2, \ldots ,m$. If this problem has a finite optimal value, then
the optimal value of the dual problem is attained
the primal and dual problems have the same optimal value $f^{\ast} = q^{\ast}$
Theorem 130 (Complementary Slackness Conditions). Consider the optimization problem \(f^{\ast} := \mathop{\min} \{ f(\mathbf{x} ) : g_{i}(\mathbf{x} ) \leq 0, i = 1, \ldots , m, \mathbf{x} \in X\}\) and assume that $f^{\ast} = q^{\ast}$ where $q^{\ast}$ is the optimal value of the dual problem. Let $\mathbf{x}^{\ast} , \mathbf{\lambda}^{\ast}$ be feasible solutions of the primal and dual problems. Then $\mathbf{x}^{\ast} , \mathbf{\lambda}^{\ast}$ are optimal solutions of the primal and dual problems iff \(\begin{aligned} \mathbf{x}^{\ast} &\in \underset{\mathbf{x} \in X}{\mathop{\arg\ \min}} L (\mathbf{x} ,\mathbf{\lambda}^{\ast} )\\ \lambda_{i}^{\ast} g_{i}(\mathbf{x}^{\ast} ) &= 0, \quad i = 1, \ldots , m \end{aligned}\)
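Both conditions can be checked numerically on a small example (an illustrative sketch with an assumed problem, not from the notes): for $\min\{x^2 : 1 - x \leq 0\}$ the optimal primal-dual pair is $x^{\ast} = 1$, $\lambda^{\ast} = 2$, the constraint is active, and $x^{\ast}$ minimises $L(\cdot, \lambda^{\ast})$:

```python
import numpy as np

# Toy problem: min x^2 subject to g(x) = 1 - x <= 0.
# Optimal primal/dual pair: x* = 1, lambda* = 2.
x_star, lam_star = 1.0, 2.0
g = lambda x: 1.0 - x
L = lambda x, lam: x**2 + lam * g(x)

# Condition 1: x* minimises L(., lambda*) over X = R.
# Indeed L(x, 2) = x^2 - 2x + 2 = (x - 1)^2 + 1, minimised at x = 1.
xs = np.linspace(-5.0, 5.0, 2001)
assert np.all(L(xs, lam_star) >= L(x_star, lam_star) - 1e-12)

# Condition 2: complementary slackness, lambda_i* g_i(x*) = 0.
assert abs(lam_star * g(x_star)) < 1e-12
```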
Theorem 131 (General Strong Duality Theorem). Consider the optimization problem \(\begin{aligned} f^{\ast} &= \mathop{\min} f(\mathbf{x} )\\ \text{subject to } g_{i}(\mathbf{x} ) &\leq 0, \quad i = 1, \ldots , m\\ h_{j}(\mathbf{x} ) &\leq 0, \quad j = 1, \ldots , p\\ s_{k}(\mathbf{x} ) &= 0, \quad k = 1, \ldots , q\\ \mathbf{x} &\in X \end{aligned}\) where $X$ is a convex set and $f,g_{i}, i = 1, \ldots, m$ are convex functions over $X$. The functions $h_{j},s_{k}$ are affine. Suppose that there exists $\mathbf{\hat{x} } \in \text{int}(X)$ for which $g_{i}(\mathbf{\hat{x} } ) <0, h_{j}(\mathbf{\hat{x} } ) \leq 0$ and $s_{k}(\mathbf{\hat{x} } ) = 0$. If the problem has a finite optimal value, then the optimal value of the dual problem
\[q^{\ast} = \mathop{\max}\{ q(\mathbf{\lambda} , \mathbf{\eta}, \mathbf{\mu} ) : (\mathbf{\lambda} , \mathbf{\eta} , \mathbf{\mu} ) \in \text{dom}(q) \}\]where
\[q(\mathbf{\lambda} , \mathbf{\eta} , \mathbf{\mu} ) = \mathop{\min}_{\mathbf{x} \in X} \left[ f(\mathbf{x} ) + \sum_{i=1}^{m} \lambda_{i} g_{i}(\mathbf{x} ) + \sum_{j=1}^{p} \eta_{j} h_{j}(\mathbf{x} ) + \sum_{k=1}^{q} \mu_{k} s_{k}(\mathbf{x} ) \right]\]is attained, and $f^{\ast} = q^{\ast}$
Three Important Examples of Duality Use
Linear Programming
Consider the linear programming problem
\[f^{\ast} = \mathop{\min} \{\mathbf{c}^T \mathbf{x} : \mathbf{Ax} = \mathbf{b}, \mathbf{x} \geq 0\}, \quad \mathbf{c} \in \mathbb{R}^{n}, \mathbf{b} \in \mathbb{R}^{m}, \mathbf{A} \in \mathbb{R}^{m\times n}\]Assume that the problem is feasible, implying strong duality holds. Taking $X = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{x} \geq 0\}$ and a multiplier $\mathbf{\lambda} \in \mathbb{R}^m$ for the equality constraint, the Lagrangian is given by \(L(\mathbf{x} , \mathbf{\lambda} ) = \mathbf{c}^T \mathbf{x} + \mathbf{\lambda}^T (\mathbf{Ax} - \mathbf{b} ) = (\mathbf{c} + \mathbf{A}^T \mathbf{\lambda} )^T \mathbf{x} - \mathbf{b}^T \mathbf{\lambda}\) and the dual objective function is
\[q(\mathbf{\lambda} ) = \mathop{\min}_{\mathbf{x} \geq 0} L(\mathbf{x} ,\mathbf{\lambda} ) = \mathop{\min}_{\mathbf{x} \geq 0} (\mathbf{c} + \mathbf{A}^T \mathbf{\lambda} )^T \mathbf{x} - \mathbf{b}^T \mathbf{\lambda} = \begin{cases} -\mathbf{b}^T \mathbf{\lambda} & \text{if } \mathbf{c} + \mathbf{A}^T \mathbf{\lambda} \geq 0\\ -\infty & \text{otherwise} \end{cases}\]Therefore, the dual problem is
\[\begin{aligned} \mathop{\max}\ &-\mathbf{b}^T \mathbf{\lambda} \\ \text{subject to } &\mathbf{c} + \mathbf{A}^T \mathbf{\lambda} \geq 0 \end{aligned}\]
Strictly Convex Quadratic Programming
Consider the strictly convex quadratic programming problem
\[\begin{aligned} \mathop{\min}\ &\mathbf{x}^T \mathbf{Q} \mathbf{x} + 2 \mathbf{f}^T \mathbf{x}\\ \text{subject to } &\mathbf{Ax} = \mathbf{b}\end{aligned}\]where $\mathbf{Q} \in \mathbb{R}^{n\times n}$ is positive definite, $\mathbf{f} \in \mathbb{R}^n, \mathbf{A} \in \mathbb{R}^{m\times n}, \mathbf{b} \in \mathbb{R}^m$. The Lagrangian is given by
\[L(\mathbf{x} , \mathbf{\lambda} ) = \mathbf{x}^T \mathbf{Q} \mathbf{x} + 2 \mathbf{f}^T \mathbf{x} + 2\mathbf{\lambda}^T (\mathbf{Ax} - \mathbf{b} ) = \mathbf{x}^T \mathbf{Q} \mathbf{x} + 2(\mathbf{A}^T \lambda + \mathbf{f} )^T \mathbf{x} - 2\mathbf{b}^T \mathbf{\lambda}\]the minimizer of the Lagrangian is attained at
$\mathbf{x}^{\ast} = -\mathbf{Q}^{-1} (\mathbf{f} + \mathbf{A}^T \mathbf{\lambda} )$. With this, we work over the dual objective,
\[\begin{aligned} q(\mathbf{\lambda} ) &= L(\mathbf{x}^{\ast} ,\mathbf{\lambda} )\\ &= -\mathbf{\lambda}^T \mathbf{A} \mathbf{Q}^{-1} \mathbf{A}^T \mathbf{\lambda} - 2(\mathbf{A} \mathbf{Q}^{-1} \mathbf{f} + \mathbf{b})^T \mathbf{\lambda} - \mathbf{f}^T \mathbf{Q}^{-1} \mathbf{f} \end{aligned}\]
Computing the Orthogonal Projection onto the Unit Simplex
FILL LATER
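Pending that, here is a sketch of the standard result this duality argument leads to (an assumption about what this section intends, following the classical sorting algorithm, e.g. Duchi et al.): the Euclidean projection of $\mathbf{y}$ onto the unit simplex $\Delta_n = \{\mathbf{x} \geq 0 : \sum_i x_i = 1\}$ has the form $x_i = \max(y_i - \theta^{\ast}, 0)$, where the scalar $\theta^{\ast}$ is the multiplier of the equality constraint and is recovered from the sorted entries of $\mathbf{y}$:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the unit simplex {x >= 0, sum(x) = 1}.

    The KKT conditions give x_i = max(y_i - theta, 0) for a scalar
    multiplier theta chosen so the entries sum to one; theta is found
    by sorting (the classical O(n log n) algorithm).
    """
    u = np.sort(y)[::-1]                 # entries sorted in descending order
    css = np.cumsum(u) - 1.0             # cumulative sums minus the budget
    rho = np.nonzero(u - css / np.arange(1, len(y) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

# A point already on the simplex is its own projection:
print(project_simplex(np.array([0.3, 0.7])))   # -> [0.3 0.7]
# Off-simplex points get shifted and clipped:
print(project_simplex(np.array([2.0, 0.0])))   # -> [1. 0.]
```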
Optimal Control
What is Optimal Control?
Definition 132. Optimal control is the problem of finding a control function that minimizes a given cost functional, subject to a set of differential equations that describe the dynamics of the system.
We do this by means of an objective function to be optimized, i.e. as a dynamic optimization problem such as \(\begin{aligned} &\mathop{\min}_{\mathbf{u}(\cdot ) \in \mathcal{U}} \int_{0}^{T} L(\mathbf{x} (s), \mathbf{u} (s)) \ ds + \Phi (\mathbf{x} (T))\\ \text{subject to } \dot{\mathbf{x}} (t) &= \mathbf{f} (\mathbf{x} (t), \mathbf{u} (t))\\ \mathbf{x} (0) &= \mathbf{x}_0 \end{aligned}\) The control signal lives in a space of admissible controls $\mathcal{U}$ representing the control constraints of our problem. The running cost $L(\mathbf{x} ,\mathbf{u} )$ expresses our wish to achieve an objective with a certain control budget.
The term $\Phi (\mathbf{x} (T))$ is a final-time penalty, encoding the fact that when our optimization finishes at time $t = T$, we expect to find the system in the reference position. The time horizon can be fixed ($T$ or $+\infty$) or variable, i.e. treated as an additional optimization variable.
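A common way to approximate such a problem numerically is direct shooting: discretize the control on a time grid, roll the dynamics forward, and hand the resulting finite-dimensional cost to a generic optimizer. The sketch below uses assumed toy data (scalar dynamics $\dot{x} = u$, running cost $x^2 + u^2$, no terminal penalty, $x(0) = 1$, $T = 1$), not an example from the notes:

```python
import numpy as np
from scipy.optimize import minimize

# Direct single shooting: discretize u on N intervals of length dt,
# integrate the dynamics with forward Euler, and minimise the summed cost.
N, T = 50, 1.0
dt = T / N
x0 = 1.0

def cost(u):
    """Forward-Euler rollout of xdot = u, accumulating L(x, u) = x^2 + u^2."""
    x, J = x0, 0.0
    for uk in u:
        J += (x**2 + uk**2) * dt   # rectangle rule for the running-cost integral
        x += uk * dt               # Euler step of the dynamics
    return J

res = minimize(cost, np.zeros(N), method="L-BFGS-B")
# Doing nothing (u = 0) leaves x at 1 and costs ~1; the optimum is cheaper,
# steering x towards 0 while paying a moderate control penalty.
assert res.fun < cost(np.zeros(N))
```

This is the crudest possible transcription (first-order integrator, no constraints on $\mathbf{u}$); in practice one would use multiple shooting or collocation, but the structure "cost functional over controls, dynamics as a constraint" is the same.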
Definition 133 (Closed-loop Optimal Control). In the optimal control problem, the control signal $\mathbf{u} (t)$ is a function of the state $\mathbf{x} (t)$, i.e. $\mathbf{u} = \mathbf{u} (\mathbf{x} )$. This is known as the closed-loop optimal control problem. The optimal control is a function of the state of the system.
Definition 134 (Open-loop Optimal Control). In the optimal control problem, the control signal $\mathbf{u} (t)$ is a function of time, i.e. $\mathbf{u} = \mathbf{u} (t)$. This is known as the open-loop optimal control problem. The optimal control is a function of time.
The Optimal Control Formulation
Start with the nonlinear dynamical system \(\dot{\mathbf{x}} = \mathbf{f} (t,\mathbf{x}(t) , \mathbf{u}(t)), \quad \mathbf{x}(t_0) = \mathbf{x}_0 \in \mathbb{R}^n, \quad t \in [t_0, t_f]\) We write a cost functional expressing our control goals \(\mathcal{J} (\mathbf{x} ,\mathbf{u} ) := \Phi (\mathbf{x} (t_{f}), t_{f}) + \int_{t_0}^{t_{f}} L(\mathbf{x} (t), \mathbf{u} (t)) \,\mathrm{d}t\) by means of a running cost $L(\mathbf{x}(t), \mathbf{u} (t))$ and a terminal penalty $\Phi (\mathbf{x} (t_{f}), t_{f})$. We cast the control synthesis as a nonlinear optimization problem
\[\mathop{\min}_{\mathbf{u}(\cdot )} \mathcal{J} ( \mathbf{x} (\cdot ), \mathbf{u} (\cdot )), \quad \text{subject to } \dot{\mathbf{x}} = \mathbf{f} (t, \mathbf{x} , \mathbf{u})\]
Using Calculus of Variations
Lemma 135 (Euler-Lagrange to solve (1)). Formally we adjoin the nonlinear constraints through a time-dependent Lagrange multiplier $\mathbf{p} (t)$ to the cost functional, leading to \(\bar{\mathcal{J}} := \Phi (\mathbf{x} (t_{f}), t_{f}) + \int_{t_0}^{t_{f}} \left\{ L[\mathbf{x} (t), \mathbf{u} (t)] + \mathbf{p}^T (t) [\mathbf{f} (t, \mathbf{x} , \mathbf{u}) - \dot{\mathbf{x}}] \right\} \,\mathrm{d}t\) At this point, we define the Hamiltonian as: \(\mathcal{H} (t,\mathbf{x}(t), \mathbf{u}(t), \mathbf{p}(t)) := L(\mathbf{x}(t), \mathbf{u}(t)) + \mathbf{p}^T (t) \mathbf{f} (t, \mathbf{x}(t), \mathbf{u}(t))\) Integration by parts of the last term ($\mathbf{p}^T \dot{\mathbf{x}}$) in $\bar{\mathcal{J}}$ yields \(\bar{\mathcal{J}} = \Phi (\mathbf{x} (t_{f}), t_{f}) - \mathbf{p}^T (t_{f}) \mathbf{x} (t_{f}) + \mathbf{p}^T (t_0) \mathbf{x} (t_0) + \int_{t_0}^{t_{f}} \left[ \mathcal{H} (t, \mathbf{x} , \mathbf{u} , \mathbf{p} ) + \dot{\mathbf{p}}^T (t) \mathbf{x} (t) \right] \,\mathrm{d}t\)
Pontryagin’s Maximum Principle
Characterises optimality conditions for a problem of the type \(\mathbf{\dot{x}} (t) = \mathbf{f} (t, \mathbf{x} (t), \mathbf{u} (t)), \quad \mathbf{x} (t_0) = \mathbf{x}_0 \in \mathbb{R}^n, \quad \mathbf{u} \in \mathcal{U} \subset \mathbb{R}^m, \quad t \in [t_0, t_f]\) with a cost functional, also known as the Bolza problem, \(\mathcal{J} (\mathbf{x} ,\mathbf{u} ) := \Phi (\mathbf{x} (t_{f}), t_{f}) + \int_{t_0}^{t_{f}} L(\mathbf{x} (t), \mathbf{u} (t)) \,\mathrm{d}t\) and terminal constraints \(\Psi (\mathbf{x} (t_{f})) = 0, \quad \Psi : \mathbb{R}^n \to \mathbb{R}^q\) Differences from the previous formulation: the terminal time $t_{f}$ is allowed to be free; we impose a set of $q$ terminal constraints on the final state through $\Psi (\mathbf{x} (t_{f}))$; and the control signal is restricted to a space of admissible controls $\mathcal{U}$.
Recalling the Hamiltonian \(\mathcal{H} (t, \mathbf{x}(t), \mathbf{u}(t), \mathbf{p}(t)) = L(\mathbf{x}(t), \mathbf{u}(t)) + \mathbf{p}^T (t) \mathbf{f} (t, \mathbf{x}(t), \mathbf{u}(t))\) the optimality conditions now read (state, adjoint and minimum conditions):
\(\dot{\mathbf{x}}^{\ast} = \frac{\partial \mathcal{H}}{\partial \mathbf{p}} = \mathbf{f} (t, \mathbf{x}^{\ast} , \mathbf{u}^{\ast} ), \quad \mathbf{x}^{\ast} (t_0) = \mathbf{x}_0\)
\(\dot{\mathbf{p}}^{\ast} = -\frac{\partial \mathcal{H}}{\partial \mathbf{x}} (t, \mathbf{x}^{\ast} , \mathbf{u}^{\ast} , \mathbf{p}^{\ast} )\)
\(\mathcal{H} (t, \mathbf{x}^{\ast} (t), \mathbf{u}^{\ast} (t), \mathbf{p}^{\ast} (t)) \leq \mathcal{H} (t, \mathbf{x}^{\ast} (t), \mathbf{w} , \mathbf{p}^{\ast} (t)) \quad \forall \mathbf{w} \in \mathcal{U}\)
The last condition states that the optimal control is the minimizer of the Hamiltonian. This is equivalent to the previous condition that $\frac{\partial \mathcal{H}}{\partial \mathbf{u}} = 0$, but since we include the constraint $\mathbf{u} \in \mathcal{U}$, we realise this as \(\mathbf{u}^{\ast} \in \underset{\mathbf{w} \in \mathcal{U} }{\mathop{\arg\ \min}} \mathcal{H} (t, \mathbf{x}^{\ast} (t), \mathbf{w}, \mathbf{p}^{\ast} (t))\)
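The pointwise minimisation over $\mathcal{U}$ can be sketched numerically (a hypothetical scalar example, not from the notes): for dynamics $\dot{x} = u$ with running cost $L = x^2$ and box constraint $\mathcal{U} = [-1, 1]$, the Hamiltonian $\mathcal{H} = x^2 + p\,w$ is linear in $w$, so the minimizer sits on the boundary of $\mathcal{U}$, i.e. the control is bang-bang, $u^{\ast} = -\operatorname{sign}(p)$:

```python
import numpy as np

# Pointwise minimisation of the Hamiltonian over a discretised control set.
# Hypothetical scalar set-up: xdot = u, L(x, u) = x^2, U = [-1, 1], so
# H(t, x, w, p) = x^2 + p * w, linear in w => bang-bang optimal control.
W = np.linspace(-1.0, 1.0, 201)          # grid over the admissible set U

def u_star(x, p):
    """argmin over w in U of H(x, w, p) = x^2 + p * w."""
    H = x**2 + p * W
    return W[np.argmin(H)]

print(u_star(x=0.5, p=2.0))    # -> -1.0  (p > 0 pushes the control to -1)
print(u_star(x=0.5, p=-3.0))   # -> 1.0   (p < 0 pushes it to +1)
```

Here the unconstrained condition $\partial \mathcal{H} / \partial u = 0$ has no solution (the Hamiltonian is affine in $u$), which is exactly why the argmin formulation over $\mathcal{U}$ is the right statement.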