# BFGS – Gradient Approximation Methods

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is the most commonly used update strategy for implementing a quasi-Newton optimization technique. Newton's method was first derived as a numerical technique for finding the roots of a nonlinear equation. It forms a linear approximation to the nonlinear equation at a given guess $x_k$; this linear approximation, the tangent to the curve at $x_k$, is then used to obtain a new guess for the solution, and the process repeats. Formally, the iterative method is expressed as
$x_{k+1}=x_k-\frac{f(x_k)}{f'(x_k)}.$
In terms of optimization, the ultimate goal is to find a stationary point $x^*$ such that $f'(x^*)=0$. Therefore, assuming the objective function is twice differentiable, Newton's method uses $f''$ in order to calculate the roots of $f'$, and the iterative process becomes
$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.$

This iterative scheme can be generalized to multiple dimensions by replacing $f'(x)$ and $f''(x)$ with the gradient, $\nabla f(x)$, and the Hessian, $\nabla^2 f(x)$, respectively. The main advantages of Newton's method are its robustness and its rapid local quadratic rate of convergence. Unfortunately, the method requires computing and storing the Hessian matrix $\nabla^2 f(x)$ and solving the resulting system of equations, which can be costly.
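As a concrete illustration (not part of the original text), the one-dimensional Newton iteration for a stationary point can be sketched in a few lines of Python; the objective $f(x) = x^2 + e^x$ is an arbitrary choice for the example:

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Newton's method for a stationary point: x_{k+1} = x_k - f'(x_k)/f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:
            break
        x -= g / d2f(x)
    return x

# Illustrative objective: f(x) = x^2 + e^x, so f'(x) = 2x + e^x, f''(x) = 2 + e^x.
x_newton = newton_minimize(lambda x: 2 * x + math.exp(x),
                           lambda x: 2 + math.exp(x), x0=0.0)
```

Because the iteration is locally quadratically convergent, only a handful of steps are needed from this starting point.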

The most rudimentary quasi-Newton method is the one-dimensional secant method, which approximates $f''(x_k)$ with a finite-differencing scheme
$f''(x_k) \approx \frac{f'(x_k)-f'(x_{k-1})}{x_k-x_{k-1}}.$ The BFGS quasi-Newton approach can therefore be thought of as a generalization of the secant method.
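The secant iteration can be sketched similarly; the only change from Newton's method is that a finite difference of $f'$ stands in for $f''$ (the objective here is again an illustrative assumption):

```python
import math

def secant_minimize(df, x0, x1, tol=1e-10, max_iter=100):
    """Secant method: f''(x_k) is replaced by a finite difference of f'."""
    g0, g1 = df(x0), df(x1)
    for _ in range(max_iter):
        if abs(g1) < tol:
            break
        h = (g1 - g0) / (x1 - x0)  # finite-difference approximation to f''(x_k)
        x0, g0 = x1, g1
        x1 = x1 - g1 / h           # same step as Newton's method, with h for f''
        g1 = df(x1)
    return x1

# Same illustrative objective as before: f'(x) = 2x + e^x.
x_sec = secant_minimize(lambda x: 2 * x + math.exp(x), 0.0, 1.0)
```

No second derivative is ever evaluated, at the cost of a slightly slower (superlinear rather than quadratic) rate of convergence.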

The general form of a quasi-Newton optimization is as follows: given a point $x_k$, we approximate the Hessian matrix at $x_k$ by a matrix $\beta_k$ and solve for a descent direction $p_k$. We obtain the search direction $p_k$ by solving
$\beta_k p_k = -\nabla f(x_k) \ \text{for} \ k=0,1,... .$ Like Newton’s Method, we follow along this search direction to obtain a new guess
$x_{k+1}=x_k+\alpha_k p_k,$ where $\alpha_k$ is the step-size, often determined through a scalar optimization or a line search procedure.

The simplest and most cost-efficient generalized quasi-Newton method is the method of steepest descent. This method takes $\beta_k$ equal to the identity matrix $I$ for all $k$, and therefore avoids the costly computation and storage of $\beta_k$. As a result, the search direction simplifies to $p_k = -\nabla f(x_k)$, the direction of steepest descent. This simplification, however, reduces the rate of convergence: the method of steepest descent converges only linearly.
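A minimal steepest-descent sketch, with $\beta_k = I$ and, for simplicity, a fixed step-size in place of a line search (both the test objective and the step-size are assumptions for illustration):

```python
import numpy as np

def steepest_descent(grad, x0, alpha=0.05, tol=1e-6, max_iter=10_000):
    """Quasi-Newton iteration with beta_k = I, so p_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g  # fixed step-size for simplicity; a line search is usual
    return x

# Illustrative objective: f(x) = x_1^2 + 10 x_2^2, minimized at the origin.
x_sd = steepest_descent(lambda x: np.array([2 * x[0], 20 * x[1]]), [1.0, 1.0])
```

Because the quadratic is poorly scaled, the slowly-curving coordinate shrinks only by a constant factor per step, illustrating the linear rate of convergence.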

For optimization of $\mathcal{L_\beta}$, we use MATLAB's built-in unconstrained optimization routine fminunc, which implements a quasi-Newton optimization technique and includes multiple options for tackling a wide variety of optimization problems. For convenience, we describe fminunc in terms of the options we have selected for our specific optimization problem.

fminunc uses the BFGS method to update the Hessian matrix $\beta_k$ at each point $x_k$. The BFGS quasi-Newton algorithm can be summarized by the following steps:

1. Specify an initial $\beta_0$ and $x_0$.
2. For $k = 0, 1, 2, \dots$:
    a) Stop if $x_k$ is optimal.
    b) Solve $\beta_k p_k = -\nabla f(x_k)$ for the search direction $p_k$.
    c) Use a line search to determine the step-size $\alpha_k$.
    d) Update $x_{k+1} = x_k + \alpha_k p_k$.
    e) Compute $\beta_{k+1}$ using the BFGS update.
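The steps above can be sketched in Python; the simple backtracking (Armijo) line search here is a stand-in for fminunc's mixed quadratic/cubic procedure, and the quadratic test objective is an assumption for illustration:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                 # step 1: beta_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:    # step 2a: stop if x_k is optimal
            break
        p = np.linalg.solve(B, -g)     # step 2b: solve beta_k p_k = -grad f(x_k)
        alpha = 1.0                    # step 2c: backtracking (Armijo) line search
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p          # step 2d: update the iterate
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:              # step 2e: BFGS update (skip if y^T s <= 0)
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x, g = x_new, g_new
    return x

# Illustrative objective: f(x) = x_1^2 + 10 x_2^2.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
x_bfgs = bfgs(f, lambda x: np.array([2 * x[0], 20 * x[1]]), [1.0, 1.0])
```

The guard on $y_k^T s_k$ preserves positive definiteness of $\beta_k$, since this simple line search does not enforce a curvature condition.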

We have chosen to use the medium-scale implementation of fminunc, in which the user must supply an initial value $x_0$. The user does not, however, need to supply the initial Hessian matrix $\beta_0$, since by default $\beta_0 = I$ under the medium-scale implementation. The first iteration of fminunc therefore follows the method of steepest descent. We define $x_k$ as optimal if $|\nabla f(x_k)| < 10^{-6}$, where $|\nabla f(x_k)|$ is the magnitude of the gradient of the likelihood function, $|\nabla \mathcal{L_\beta}|$. The gradient $\nabla f(x_k)$ is computed using a forward finite-differencing scheme. If $x_k$ is not optimal, we proceed to solve for the search direction $p_k$. A mixed quadratic/cubic line search procedure (described below) is used to determine an appropriate step-size $\alpha_k$, which allows us to update to our new position $x_{k+1}$.
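A forward finite-differencing scheme for the gradient can be sketched as follows (the step-size $h$ and the test function are illustrative assumptions, not fminunc's exact internals):

```python
import numpy as np

def forward_diff_grad(f, x, h=1e-8):
    """Forward finite-difference approximation to the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    g = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f0) / h  # one extra function evaluation per coordinate
    return g

# Illustrative check: f(x) = x_1^2 + 3 x_2 has gradient (2 x_1, 3).
g_fd = forward_diff_grad(lambda x: x[0] ** 2 + 3 * x[1], [1.0, 2.0])
```

The approximation is first-order accurate in $h$, and needs only $n$ extra function evaluations per gradient.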

The BFGS update formula falls into a broader class of rank-two update formulas that maintain both symmetry and positive definiteness (SPD) of the Hessian approximation. Preservation of SPD ensures that the search direction is always a descent direction. The general rank-two update formula is given by
$\beta_{k+1} = \beta_k - \frac{(\beta_k s_k)(\beta_k s_k)^T}{s_k^T \beta_k s_k} + \frac{y_k y_k^T}{y_k^Ts_k} + \phi(s_k^T\beta_k s_k)v_k v_k^T,$
where $\phi$ is a scalar and
$s_k = x_{k+1}-x_{k}, \quad y_k = \nabla f(x_{k+1})- \nabla f(x_k), \quad v_k =\frac{ y_k}{y_k^T s_k}-\frac{\beta_k s_k}{s_k^T \beta_k s_k}.$
The BFGS Hessian update is obtained by setting $\phi = 0$, and therefore simplifies to
$\beta_{k+1} = \beta_k - \frac{(\beta_k s_k)(\beta_k s_k)^T}{s_k^T \beta_k s_k} + \frac{y_k y_k^T}{y_k^Ts_k}.$
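The simplified update can be checked numerically: by construction it satisfies the secant condition $\beta_{k+1} s_k = y_k$ and preserves symmetry, which the sketch below verifies on random data (the dimension and the construction of $s_k$ and $y_k$ are arbitrary choices for the example):

```python
import numpy as np

def bfgs_update(B, s, y):
    """The rank-two BFGS update (phi = 0) of the Hessian approximation."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = A @ A.T + 4 * np.eye(4)               # an arbitrary SPD Hessian approximation
s = rng.standard_normal(4)
y = 2 * s + 0.1 * rng.standard_normal(4)  # constructed so that y^T s > 0
B_new = bfgs_update(B, s, y)
```

Applying the update to $s_k$ gives $\beta_k s_k - \beta_k s_k + y_k = y_k$, so the secant condition holds exactly, and both rank-one corrections are symmetric.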

## Line Search Procedure

A mixed quadratic/cubic line search procedure is used to determine a suitable step-size $\alpha_k$. The procedure first perturbs the initial point $\lambda_1$ along the direction of steepest descent to obtain a second point $\lambda_2$. These two points, together with the initial gradient approximation, define a quadratic interpolant. A third point, $\lambda_3$, is obtained by extrapolating the quadratic interpolant so that the minimum is bracketed with $\lambda_1 < \lambda_2 < \lambda_3$. Using the three points $\lambda_i$, $i \in \{1,2,3\}$, together with the first gradient approximation, we obtain a cubic interpolant whose minimizer, $\hat{\lambda}$, estimates the true minimum. We then bisect the bracketing interval, using two of the $\lambda_i$ along with the estimate $\hat{\lambda}$ to obtain a new cubic interpolant, and the process repeats.
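As an illustration of one building block of such a procedure (this is not MATLAB's exact implementation), the minimizer of the quadratic interpolant through $\phi(0)$, $\phi'(0)$, and $\phi(\alpha_1)$, where $\phi(\alpha) = f(x_k + \alpha p_k)$, has a closed form:

```python
def quadratic_step(phi0, dphi0, alpha1, phi1):
    """Minimizer of the quadratic through phi(0), phi'(0), and phi(alpha1)."""
    # Model q(a) = phi0 + dphi0 * a + c * a^2, with c fit to match phi(alpha1).
    c = (phi1 - phi0 - dphi0 * alpha1) / alpha1 ** 2
    return -dphi0 / (2.0 * c)  # stationary point of q; a minimum when c > 0

# Illustrative check: phi(a) = (a - 0.3)^2 is itself quadratic, so the
# interpolant is exact and the step lands on the minimizer a = 0.3.
alpha_q = quadratic_step(0.09, -0.6, 1.0, (1.0 - 0.3) ** 2)
```

Iterating refinements of this kind, with a cubic model once enough points are available, is the essence of the mixed quadratic/cubic procedure described above.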