A Quick Analysis of Heuristic Optimization by Stochastic Genetic Algorithms and Particle Swarm

I have been inspired by Varadi’s latest series of posts at http://cssanalytics.wordpress.com on heuristic optimization techniques (particularly Genetic Algorithms, GAs, and Particle Swarm Optimization, PSO) to perform a quick analysis comparing these techniques. I highly recommend reading the http://cssanalytics.wordpress.com posts from Sept. 3-6, 2013, for a detailed explanation of GAs and PSO before viewing my analysis.

We investigate the optimization performance of both Genetic Algorithms (GAs) and Particle Swarm Optimization (PSO). In particular, we are interested in comparing the efficiency and accuracy of these techniques in optimizing a variety of well-known test functions. The test functions used are outlined below.
We consider three classes of GAs:

  1. Matlab’s built-in traditional GA (MatlabGA).
  2. A traditional GA as outlined by Varadi at http://cssanalytics.wordpress.com/ (GA).
  3. A simplistic Micro-GA (mGA) as outlined by Varadi at http://cssanalytics.wordpress.com/.

A Brief Review of Traditional GAs
Recall that, like most evolutionary algorithms, a GA is a stochastic sampling method that repeatedly refines a population (a set of candidate solutions). The method is based on the Darwinian notion of “Survival of the Fittest”: only the elite solutions (those that return the best objective function values) are kept, while sub-optimal solutions are discarded.

A typical GA begins by randomly generating an initial population of candidate solutions, called the parents. The population of each generation is then enlarged through the production of cross-over children and/or mutation children. Cross-over children are derived by mixing and matching individual components of the parents, in what Varadi describes as mating. Formally, at each component i of the child vector, the default crossover function randomly selects a value, x_i, from one of the two parents and assigns it to the i^{th} component of the cross-over child. Mutation children are created by randomly perturbing the parent solutions, often via a Gaussian distribution.

Next, we evaluate the objective function at each of the potential solutions in the current population. Only the fraction of individuals in the current population that return the best objective function values will continue onward to form the parent population of the next generation.

The mGA, on the other hand, is a simplified version of the traditional GA and involves a tournament phase for determining which parent solutions are worthy to mate. Those parents deemed worthy to mate produce cross-over children, as described above and at http://cssanalytics.wordpress.com/.

For consistency amongst the varying GAs, we have kept the population size and the maximum number of generations fixed at 100 and 100d, respectively, where d is the number of input dimensions of the test function. Convergence is established either when the maximum number of generations is reached, or when the average relative difference between a generation’s solution and the best solution found, taken over the most recent 50 generations, is less than a tolerance, tol=10^{-6}.
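To make the above concrete, here is a minimal sketch of one such traditional GA in Python/NumPy. The objective (a simple sphere function), the sampling range, the mutation scale and the function names are my own illustrative assumptions, not Varadi’s or Matlab’s exact implementation.

```python
import numpy as np

def sphere(x):
    """Toy objective: strictly convex with global minimum f(0) = 0."""
    return np.sum(x ** 2)

def ga_minimize(f, d, pop_size=100, tol=1e-6, seed=0):
    """Minimal traditional GA sketch: crossover + Gaussian mutation + elitist selection."""
    rng = np.random.default_rng(seed)
    max_gen = 100 * d                              # generation cap of 100d, as above
    parents = rng.uniform(-5.0, 5.0, size=(pop_size, d))
    best_hist = []
    for gen in range(max_gen):
        # Cross-over children: each component copied at random from one of two parents
        i, j = rng.integers(pop_size, size=(2, pop_size))
        mask = rng.random((pop_size, d)) < 0.5
        crossover = np.where(mask, parents[i], parents[j])
        # Mutation children: Gaussian perturbation of the parents
        mutation = parents + rng.normal(scale=0.1, size=(pop_size, d))
        pool = np.vstack([parents, crossover, mutation])
        fitness = np.apply_along_axis(f, 1, pool)
        elite = np.argsort(fitness)[:pop_size]     # survival of the fittest
        parents = pool[elite]
        best_hist.append(fitness[elite[0]])
        # Convergence: average relative change of the best value over the last 50 generations
        if gen >= 50:
            recent = np.array(best_hist[-50:])
            rel = np.abs(np.diff(recent)) / (np.abs(recent[:-1]) + 1e-12)
            if rel.mean() < tol:
                break
    return parents[0], best_hist[-1]

x_best, f_best = ga_minimize(sphere, d=2)
```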

A Brief Review of PSO

PSO is a gradient-free, stochastic optimization algorithm. Like GAs and mGAs, PSO begins by generating an initial set of candidate solutions, or particles, collectively called the swarm. At each iteration, the individual particles of the swarm navigate the search space based on each particle’s personal best position and the overall best position found by the swarm. Previous locations that minimize the objective function, f, therefore act as attractors towards which the swarm migrates. The position of each particle, and of the swarm alike, is defined in terms of its previous position, x, and the velocity, v, of the particle (or swarm). The velocity (or momentum factor) describes a particle’s tendency to move in a certain direction. The position and velocity are defined as:
x_{n+1}=x_n+v_n and
v_{n+1}=v_n+r_1(x_p-x_n) +r_2(x_s-x_n)

where r_1 and r_2 are random numbers on the interval (0,1), x_p denotes the best previous position of that particle, and x_s denotes the best previous position of the swarm.

For the optimization of each test function I have chosen to use a swarm size of 20. For a fair comparison with the GAs, the stopping criterion for PSO is that the average relative difference of the best solution found over the most recent 50 iterations is less than a tolerance, tol=10^{-6}.
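A minimal sketch of these update equations in Python/NumPy follows; the swarm size of 20 matches the setting above, while the search range, iteration cap and test function are illustrative assumptions (the 50-iteration stopping rule is omitted for brevity).

```python
import numpy as np

def pso_minimize(f, d, n_particles=20, max_iter=1000, seed=0):
    """Minimal PSO sketch: x_{n+1} = x_n + v_n,
    v_{n+1} = v_n + r1*(x_p - x_n) + r2*(x_s - x_n)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=(n_particles, d))
    v = np.zeros_like(x)
    f_x = np.apply_along_axis(f, 1, x)
    x_p, f_p = x.copy(), f_x.copy()               # personal best positions/values
    s = np.argmin(f_p)
    x_s, f_s = x_p[s].copy(), f_p[s]              # swarm best position/value
    for _ in range(max_iter):
        r1, r2 = rng.random((2, n_particles, 1))  # fresh random weights each iteration
        v = v + r1 * (x_p - x) + r2 * (x_s - x)   # attraction to personal and swarm bests
        x = x + v
        f_x = np.apply_along_axis(f, 1, x)
        improved = f_x < f_p                      # update personal bests
        x_p[improved], f_p[improved] = x[improved], f_x[improved]
        if f_p.min() < f_s:                       # update swarm best
            s = np.argmin(f_p)
            x_s, f_s = x_p[s].copy(), f_p[s]
    return x_s, f_s
```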

Lastly, note that since both the GA and PSO are stochastic optimization routines, the results presented below have each been averaged over 100 simulations.

Numerical Results 
We analyze three quantities:

  1. Accuracy: Average Percent Error = \frac{|f^*-f_{opt}|}{|f_{opt}|}\times 100\%, where f_{opt} is the true, known, global optimum value.
  2. Efficiency: Average number of Function Evaluations (FEs) required for convergence.
  3. General Performance Assessment: GPA = FEs \times (\%Err)^2, in which accuracy carries twice the weight of efficiency and smaller GPA values indicate better performance (a short worked example follows).
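
To make the GPA weighting concrete, consider a hypothetical run (the numbers are invented purely for illustration) that converges after FEs = 5000 function evaluations with %Err = 0.2:

GPA = 5000 \times (0.2)^2 = 5000 \times 0.04 = 200.

Halving the error would cut the GPA by a factor of four, while halving the number of function evaluations would only cut it in half.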

Simulation results for each test function can be found in Table 1. From Table 1 we can draw the following conclusions:

  • In general, MatlabGA outperforms both my own traditional GA and the mGA, with the exception of the 2-D Hump function and the 6-D Hartmann function. This is not surprising given the level of sophistication of Matlab’s built-in optimization algorithms compared to the 1-day coding blitz behind my GAs.
  • The mGA is indeed the most efficient of the GAs, as measured by the number of FEs required to establish convergence. Of course, improved efficiency comes at the cost of decreased accuracy – ultimately the user’s optimization goals will be a major factor in algorithm selection.
  • As expected, PSO outperforms all GAs for objective functions that are strictly convex, with minimal noise and a single global optimum.
  • As Varadi suggests, PSO (or a hybrid PSO method) may be the preferred technique for several portfolio optimization problems.
  • Objective surfaces such as the Schwefel Function and the Rastrigin Function are noisy and multi-modal, containing several local optima. For global optimization of both these functions, MatlabGA proves to outperform PSO.

Table 1

Table 2

For full test function details see:
http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/TestGO.htm

Analysis of GAs and PSO

Posted in Optimization Algorithms

Hybridization of DIRECT with Local Optimization Techniques

The hybridization of DIRECT with IF or BFGS operates by first allowing DIRECT to globally explore the search space. As described in my previous post, a large number of function evaluations would be required for DIRECT to find the global optimum to our desired level of accuracy. However, DIRECT converges rapidly to locations in the neighbourhood of the global optimum, providing a good starting position from which to execute either Implicit Filtering or BFGS.

Based on experimentation, a budget of 200 times the dimension of the problem in function evaluations enables DIRECT to efficiently determine a reasonable starting position for IF or BFGS. Starting from the termination position of DIRECT, we then allow either IF or BFGS to fine-tune DIRECT’s global search, in what is called the precision phase. Assuming that the global optimum is contained within the boundaries of the feasible region, IF should rapidly converge to a highly accurate global optimum.
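A sketch of this two-phase scheme in Python, assuming a recent SciPy (scipy.optimize.direct is available from SciPy 1.8 onward); the test function, bounds and dimension are illustrative:

```python
import numpy as np
from scipy.optimize import direct, minimize

def rosenbrock(x):
    """Illustrative test function with a narrow curved valley."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

d = 2
bounds = [(-5.0, 5.0)] * d

# Exploration phase: DIRECT with a budget of roughly 200 * dimension evaluations
res_global = direct(rosenbrock, bounds, maxfun=200 * d)

# Precision phase: start BFGS (unbounded) from DIRECT's terminal position
res_local = minimize(rosenbrock, res_global.x, method="BFGS")
print(res_local.x, res_local.fun)
```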

It is, however, possible for the position of the global optimum to be located outside of the domain provided to IF. In this case, IF is strictly confined to this region and will not be able to converge to this position. In theory, this could have a negative effect on the model fitting process. We therefore provide a second hybridization option by replacing IF with the unbounded optimization algorithm, BFGS, which is able to explore outside of the bound constraints provided to IF.

Posted in Optimization Algorithms

DIviding RECTangle = DIRECT; A Partitioning Algorithm

DIRECT, which is shorthand for DIviding RECTangles, is a derivative-free, sampling optimization algorithm. More specifically, DIRECT is a partitioning algorithm that samples points in a given domain and then refines the search domain at each iteration based on the function values at the sampled points. Formally, DIRECT is a global optimization algorithm; that is, given enough function evaluations and a sufficiently tight stopping criterion, DIRECT will always find the global optimum of a function within a given domain. Of course, this exhaustive search can be computationally expensive and thus impractical for optimization of \mathcal{L_\beta}. Therefore, we will use DIRECT to find a sub-region within the feasible region that contains the global optimum, while simultaneously reducing the chance of searching in regions that contain local optima.

The first step of DIRECT is to transform the user-supplied domain of the problem into a d-dimensional unit hypercube. Therefore, the feasible region, as described by Finkel, is
\Omega = \{x \in R^d: 0 \leq x_i \leq 1\}.
The objective function is first evaluated at the center of this hypercube, c_1. DIRECT then divides the hypercube into smaller hyper-rectangles by evaluating the objective function at a distance \delta, one-third of the side length, from the center in all coordinate directions,
c_1 \pm \delta e_i, \ i=1,...,d.
The first division is performed so that the region with the most optimal function value is given the largest space.

Once the first division is complete, the next step is to identify potentially optimal hyper-rectangles. Finkel provides the following definition for selecting potentially optimal hyper-rectangles.

DEFINITION: Let \epsilon > 0 be a positive constant and let f_{min} be the current best function value. A hyper-rectangle j is said to be potentially optimal if there exists \hat{K} > 0 such that
f(c_j)-\hat{K}d_j \leq f(c_i)-\hat{K}d_i, \ \forall \ i, \ \text{and}
f(c_j)-\hat{K}d_j \leq f_{min}- \epsilon |f_{min}|.

Here, c_j is the center of hyper-rectangle j and d_j is the distance from the center to the corresponding vertices. In general, a value of \epsilon = 10^{-4} is used to ensure that f(c_j) exceeds f_{min} by some non-trivial amount.

From the definition, we observe that a hyper-rectangle i is potentially optimal if:
1. f(c_i) \leq f(c_j) for all hyper-rectangles of equal size, d_i = d_j;
2. d_i \geq d_k for all other hyper-rectangles k, and f(c_i) \leq f(c_j) for all hyper-rectangles of equal size, d_i = d_j; or
3. d_i \leq d_k for all other hyper-rectangles k, and f(c_i) = f_{min}.

Once a hyper-rectangle is selected as potentially optimal, DIRECT divides it into smaller hyper-rectangles. First, the objective function is evaluated at the points c \pm \delta e_i for all longest dimensions i. Of course, if the hyper-rectangle is a hypercube, then the objective function is evaluated in all coordinate directions e_i, identically to the initial sampling step. We then divide the hyper-rectangle, with the order of division determined by the objective function values returned during the sampling phase, as the sketch below illustrates.
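The following is my own minimal sketch of this sampling-and-division step for a single hyper-rectangle, written from the description above rather than from any reference implementation; rectangles are represented simply as (center, side-lengths) pairs.

```python
import numpy as np

def divide(f, center, sides):
    """Trisect a hyper-rectangle along its longest sides, most promising dimension first.
    Returns the child rectangles as (center, sides) pairs."""
    center, sides = np.asarray(center, float), np.asarray(sides, float)
    longest = np.flatnonzero(np.isclose(sides, sides.max()))
    delta = sides.max() / 3.0
    # Sample c +/- delta * e_i along each longest dimension; record the best value
    best = {}
    for i in longest:
        e = np.zeros_like(center)
        e[i] = delta
        best[i] = min(f(center + e), f(center - e))
    # Divide the most promising dimension first, so the best sampled
    # points end up at the centers of the largest child rectangles
    children = []
    s = sides.copy()
    for i in sorted(longest, key=best.get):
        s = s.copy()
        s[i] /= 3.0
        e = np.zeros_like(center)
        e[i] = delta
        children += [(center - e, s.copy()), (center + e, s.copy())]
    return children + [(center, s)]   # the central child keeps the original center
```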

The three Figures below provide a 2-dimensional visualization of how DIRECT samples and divides its search space. The bold regions are the potentially optimal rectangles, which are to be divided. The first iteration, (a), shows that the right-most rectangle is potentially optimal, and it is therefore sampled and divided along its vertical dimension. The second iteration, (b), shows that both the left-most rectangle and the right-most central rectangle are potentially optimal. Here we apply the second condition for selecting potentially optimal rectangles: the longest side of the left-most rectangle is greater than the longest side of all other rectangles, and therefore, by default, the function value at the center of the left-most rectangle is less than the function value of all other rectangles of equal size. Again, we sample and divide the rectangles according to their dimensions. This process of selecting potentially optimal rectangles, sampling and dividing is repeated in the third iteration, (c), and continues for subsequent iterations until the stopping criterion is met.

(a) After the initial sampling phase, the rectangle on the far right is potentially optimal, and is therefore divided into thirds along its longest side.

(b) The central rectangle on the far right is potentially optimal. This rectangle is a cube, so its division is identical to that of the initial sampling phase. The rectangle on the far left is also potentially optimal, as it has a side which is longer than the sides of all other rectangles.

(c) The cube in the center of the far right rectangle is potentially optimal and is divided. In this case, the central cube on the bottom row, which has a side length greater than or equal to the side length of all other rectangles, is also potentially optimal and is therefore divided.

Posted in Optimization Algorithms

Particle Swarm Optimization and General Pattern Search

My post-doc supervisor is using a hybridization of particle swarm with a general pattern search for his oil well placement and parameterization optimization problem. PSO and GPS are described below:

PSO
Particle Swarm Optimization is a gradient-free optimization method that uses a stochastic, meta-heuristic approach to finding the optimum value. The algorithm begins by generating an initial array of candidate solutions, called the swarm. The individual particles of the swarm then move through the search space based on each particle’s knowledge of where a better solution is located, as well as the swarm’s collective knowledge of where the optimum value is located. Therefore, previous successful solutions act as attractors towards which the swarm migrates. The position of each particle, and of the swarm alike, is defined in terms of its previous position, x, and the velocity, v, of the particle (or swarm), which describes a particle’s tendency to move in a certain direction. We therefore define the position and velocity as:
x_{n+1} = x_{n} + v_{n} and
v_{n+1} = v_n + c_1 r_1(x^*_p - x_n) + c_2 r_2(x^*_s - x_n),
where c_{1,2} are weighting parameters specified by the user, r_{1,2} are random numbers on the interval (0,1) which enforce exploration of the search space, x^*_p denotes the best previous position of that particle, and x^*_s denotes the best previous position of the swarm. Qualitatively, the velocity component of PSO is specified by three terms:
1. Inertial: tendency of the particle to continue moving in its current direction.
2. Cognitive: tendency of the particle to move towards that particle’s best found solution.
3. Social: tendency of a particle to move towards the swarm’s best found solution.
In general, PSO handles bound-constrained optimization problems, as well as linear and non-linear general constraints.

General Pattern Search
Though we will not be implementing a General Pattern Search (GPS) algorithm, the method is worth mentioning as Humphries has incorporated GPS along with PSO as a hybridized optimization technique. GPS begins at a user-provided initial point x_0. Unless otherwise specified, the first iteration polls with a mesh size of 1 in all n directions, where n is the dimension of the objective function. The objective function is computed at each of the mesh points until a value is found that is less than the value at the current point. Once this happens, the algorithm moves to that point and begins polling again. After a successful poll, the mesh size is multiplied by a factor of 2. If a poll is unsuccessful, the algorithm remains at its current position and polls once again, this time with a mesh size that is 0.5 times the former mesh size. GPS can handle both bound-constrained and generally constrained optimization problems. Humphries’ hybridization technique involves running PSO until we encounter an iteration that does not improve our solution. From there we perform one step of GPS to poll for a better solution; if no better solution is found, we reduce the stencil size and poll again.
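For illustration, here is a minimal sketch of the GPS poll step with the expand/contract rule just described (Python/NumPy); the starting mesh of 1 follows the description above, while the poll order and stopping tolerance are my own assumptions, not Humphries’ exact settings.

```python
import numpy as np

def pattern_search(f, x0, mesh=1.0, tol=1e-6, max_iter=10000):
    """Minimal GPS sketch: poll +/- mesh along each coordinate,
    doubling the mesh after a successful poll and halving it after a failure."""
    x = np.asarray(x0, float)
    fx = f(x)
    n = x.size
    for _ in range(max_iter):
        if mesh <= tol:
            break
        improved = False
        for e in np.vstack([np.eye(n), -np.eye(n)]):   # the 2n poll directions
            trial = x + mesh * e
            ft = f(trial)
            if ft < fx:                                 # successful poll: move there
                x, fx, improved = trial, ft, True
                break
        mesh = mesh * 2.0 if improved else mesh * 0.5
    return x, fx
```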

Posted in Optimization Algorithms

BFGS – Gradient Approximation Methods

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is the most commonly used update strategy for implementing a quasi-Newton optimization technique. Newton’s method was first derived as a numerical technique for solving for the roots of a nonlinear equation. It does so by constructing a linear approximation to the nonlinear equation at a given guess x_k. The linear approximation, given by the tangent to the curve at the point x_k, is then used to obtain a new guess for the solution, and the process repeats. Formally, the iterative method is expressed as
x_{k+1}=x_k-\frac{f(x_k)}{f'(x_k)}.
In terms of optimization, the ultimate goal is to find a stationary point x^* such that f'(x^*)=0. Therefore, assuming the objective function is twice differentiable, Newton’s Method uses f'' in order to calculate the roots of f', and the iterative process becomes
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.

This iterative scheme can be generalized to multiple dimensions by replacing f'(x) and f''(x) with the gradient, \nabla f(x), and the Hessian, \nabla^2 f(x), respectively. The main advantages of Newton’s method are its robustness and its rapid, locally quadratic rate of convergence. Unfortunately, the method requires computing and storing the Hessian matrix \nabla^2 f(x) and solving the resulting system of equations, which can be costly.

The most rudimentary quasi-Newton method is the one-dimensional secant method, which approximates f''(x_k) with a finite differencing scheme,
f''(x_k) \approx \frac{f'(x_k)-f'(x_{k-1})}{x_k-x_{k-1}}.
The BFGS quasi-Newton approach can therefore be thought of as a generalization of the secant method.

The general form of a quasi-Newton optimization is as follows: given a point x_k, we solve for a descent direction p_k using an approximation \beta_k to the Hessian matrix at x_k. The search direction p_k is obtained by solving
\beta_k p_k = -\nabla f(x_k) \ \text{for} \ k=0,1,....
Like Newton’s Method, we follow along this search direction to obtain a new guess
x_{k+1}=x_k+\alpha_k p_k,
where \alpha_k is the step-size, often determined through a scalar optimization or a line search procedure.

The simplest and most cost-efficient generalized quasi-Newton method is the method of steepest descent. This method takes \beta_k equal to the identity matrix I for all k, and therefore does not require costly computation and storage of \beta_k. As a result, the descent direction simplifies to p_k = -\nabla f(x_k), the direction of steepest descent. This simplification, however, reduces the rate of convergence: the method of steepest descent converges only linearly.

For optimization of \mathcal{L_\beta}, we use Matlab’s built-in unconstrained optimization algorithm fminunc. fminunc uses a quasi-Newton optimization technique and includes multiple options for tackling a wide variety of optimization problems. For convenience, we will describe fminunc in terms of the options that we have selected for our specific optimization problem.

fminunc uses the BFGS method to update the Hessian matrix \beta_k at each point x_k. The BFGS quasi-Newton algorithm can be summarized by the following steps:

1. Specify an initial \beta_0 and x_0.
2. For k=0,1,2,…
a) Stop if x_k is optimal.
b) Solve \beta_k p_k = -\nabla f(x_k) for search direction p_k.
c) Use a line search to determine the step-size \alpha_k.
d) Update x_{k+1} = x_k +\alpha_k p_k.
e) Compute \beta_{k+1} using the BFGS update.

We have chosen to use the medium-scale implementation of fminunc, in which the user must supply an initial value for x0. The user, however, does not need to supply the initial Hessian matrix \beta_0 as by default \beta_0 = I under the medium-scale implementation. Therefore, the first iteration of fminunc follows the method of steepest descent. We define x_k as being optimal if |\nabla f(x_k)| < 10^{-6}, where |\nabla f(x_k)| is the magnitude of the gradient of the likelihood function, |\nabla \mathcal{L_\beta}|. The gradient function, \nabla f(x_k), is derived using a forward finite differencing scheme. If x_k is not optimal, then we proceed to solve for the search direction p_k. A mixed quadratic/cubic line search procedure (described below) is used to determine an appropriate step-size \alpha_k, which allows us to update to our new position x_{k+1}.

The BFGS update formula falls into a broader class of rank two update formulas that maintains both symmetry and positive definiteness (SPD) of the Hessian approximation. Preservation of SPD ensures that the search direction is always a descent direction. The general rank two update formula is given by
\beta_{k+1} = \beta_k - \frac{(\beta_k s_k)(\beta_k s_k)^T}{s_k^T \beta_k s_k} + \frac{y_k y_k^T}{y_k^Ts_k} + \phi(s_k^T\beta_k s_k)v_k v_k^T,
where \phi is a scalar and
s_k = x_{k+1}-x_{k}, \quad  y_k = \nabla f(x_{k+1})- \nabla f(x_k), \quad v_k =\frac{ y_k}{y_k^T s_k}-\frac{\beta_k s_k}{s_k^T \beta_k s_k}.
The BFGS Hessian update is obtained by setting \phi = 0, and therefore simplifies to
\beta_{k+1} = \beta_k - \frac{(\beta_k s_k)(\beta_k s_k)^T}{s_k^T \beta_k s_k} + \frac{y_k y_k^T}{y_k^Ts_k}.
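
The steps above translate almost directly into code. Below is a minimal BFGS sketch in Python/NumPy; a crude backtracking line search stands in for the mixed quadratic/cubic procedure described next, and a forward-difference gradient stands in for fminunc’s, so this is an illustration of the update, not a reproduction of fminunc.

```python
import numpy as np

def grad_fd(f, x, h=1e-7):
    """Forward finite-difference approximation to the gradient."""
    fx = f(x)
    return np.array([(f(x + h * e) - fx) / h for e in np.eye(x.size)])

def bfgs(f, x0, tol=1e-6, max_iter=500):
    x = np.asarray(x0, float)
    B = np.eye(x.size)                      # beta_0 = I: the first step is steepest descent
    g = grad_fd(f, x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:         # x_k is optimal
            break
        p = np.linalg.solve(B, -g)          # solve beta_k p_k = -grad f(x_k)
        fx = f(x)
        alpha = 1.0                         # crude backtracking line search
        while f(x + alpha * p) > fx and alpha > 1e-10:
            alpha *= 0.5
        s = alpha * p
        x_new = x + s
        g_new = grad_fd(f, x_new)
        y = g_new - g
        if y @ s > 1e-12:                   # skip the update if curvature info is unreliable
            Bs = B @ s                      # BFGS rank-two update (the phi = 0 case)
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x, g = x_new, g_new
    return x, f(x)
```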

LINE SEARCH PROCEDURE

A mixed quadratic/cubic line search procedure is used to determine a suitable step-size \alpha_k. The line search operates by first perturbing the initial point \lambda_1 along the direction of steepest descent to obtain a second point \lambda_2. The two points, along with the initial gradient approximation, are used to obtain a quadratic interpolant. A third point, \lambda_3, is obtained through extrapolation of the quadratic interpolant, bracketing the minimum such that \lambda_1 < \lambda_2 < \lambda_3. Using each \lambda_i, i=\{1,2,3\}, and the first gradient approximation, we obtain a cubic interpolant whose minimum, \hat{\lambda}, estimates the true minimum. We then bisect the bracketing interval, using two of the \lambda_i along with the minimum estimate \hat{\lambda} to obtain a new cubic interpolant, and the process repeats.

Posted in Optimization Algorithms

Implicit Filtering – Thank You C.T. Kelley

Implicit Filtering (IF), designed by C.T. Kelley, is a sophisticated, deterministic pattern search method for bound-constrained optimization. Like most pattern search algorithms, the evaluation of the objective function at step i is used to determine the next cluster of points for the following step i+1. IF, however, combines sampling, which explores the search space by evaluating the objective function at multiple feasible points, with interpolation, which uses a first-order interpolant constructed via linear least squares together with a quasi-Newton Hessian approximation to infer descent directions.

The sampling method is arranged on a stencil which, given a current iterate x_c, samples the objective function in all coordinate directions via x_c \pm h v_i. Here v_i = (U_i-L_i)e_i, where L_i and U_i represent the components of the lower and upper bounds of the feasible region, respectively, and e_i is the unit coordinate vector. The scale h varies as the optimization progresses, according to \{2^{-n}\}_{n=1}^7. Therefore, if a sampling phase is unsuccessful in finding a more optimal position than the current incumbent point, the scale, and thus the stencil, is reduced by a factor of 0.5 and the sampling phase is repeated. By default, the optimization terminates after the scale has been reduced to 2^{-7}. It is, however, the quasi-Newton iteration that gives IF a clear advantage over a general pattern search algorithm, and it has been shown to improve both overall efficiency and the ability to converge globally.
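A sketch of just this outer sampling loop is given below (Python/NumPy); it deliberately omits the quasi-Newton inner loop that gives IF its edge, so it illustrates only the stencil and the halving schedule.

```python
import numpy as np

def stencil_search(f, x0, lower, upper):
    """Sketch of IF's outer loop: sample x +/- h*v_i on a coordinate stencil
    scaled to the bound box, halving the scale h after each failed sweep."""
    x = np.asarray(x0, float)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    v = upper - lower                             # stencil directions scaled by box width
    fx = f(x)
    for h in [2.0 ** -n for n in range(1, 8)]:    # scales h in {2^-1, ..., 2^-7}
        for _ in range(1000):                     # safeguard on sweeps per scale
            improved = False
            for i in range(x.size):
                for sign in (1.0, -1.0):
                    step = np.zeros_like(x)
                    step[i] = sign * h * v[i]
                    trial = np.clip(x + step, lower, upper)   # respect the bounds
                    ft = f(trial)
                    if ft < fx:
                        x, fx, improved = trial, ft, True
            if not improved:                      # failed sweep: halve the scale
                break
    return x, fx
```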

In terms of algorithm structure, IF can be broken into two main loops: the outer loop, which controls the pattern search, and the inner loop, which implements the quasi-Newton iteration at the current best position, x_{min}, determined at the end of each sampling step. Therefore, IF must be given two main stopping criteria, one for each loop.

The quasi-Newton iteration terminates when the norm of the difference gradient satisfies ||\nabla f(x,v,h)|| \leq \tau h. Here, \tau = \xi \epsilon, where \xi is a scaling factor that attempts to eliminate discrepancies in the step size of the quasi-Newton iteration, which arise when the values of the objective function are either very small or very large, and \epsilon is the tolerance level for terminating the quasi-Newton iteration. The default scale value is \xi = 1.2|f(x_0)|, where x_0 is the user-supplied initial starting point. Highly sensitive stopping tolerances are required for optimization of \mathcal{L_\beta}, as the likelihood function contains many regions that are relatively flat, which can result in premature termination of the optimization process if the stopping tolerance is too large.

Kelley states that IF works efficiently for functions which are noisy, non-smooth or random. Therefore in theory, IF should be well suited for optimization of our particular objective function, \mathcal{L_\beta}. The general thought is that the sampling phase, with a suitable step size, should step over the local minima, whereas the quasi-Newton iteration will establish efficient convergence in regions near the global optimum. For more information on IF’s options and specifications, please refer to Implicit Filtering by C.T. Kelley.

Posted in Optimization Algorithms

Genetic Algorithm – A class of Evolutionary Algorithms

Originally, a Genetic Algorithm (GA) was proposed for optimization of the likelihood function \mathcal{L_\beta}. Like most evolutionary algorithms, a GA is a stochastic sampling method that repeatedly refines a population (a set of candidate solutions). The method is based on the Darwinian notion of “Survival of the Fittest”: only the elite solutions (those that return the best objective function values) are kept, while sub-optimal solutions are discarded.

The GA begins by randomly generating an initial population of candidate solutions, called the parents, that are contained in the feasible region of the objective function. In order to further explore this domain, the initial population of each generation is quadrupled through the generation of cross-over children and mutation children. First, cross-over children are derived by combining individual components of the vectors of independent variables, which doubles the current population. At each coordinate of the child vector, the default crossover function randomly selects a value, x_i, from one of the two parents and assigns it to the cross-over child, providing a mix-and-match solution. Mutation children are then created by randomly perturbing both the parent solutions and the cross-over children via a Gaussian distribution. The final result is a population that is four times the size of the parent population.

Next we evaluate the corresponding \mathcal{L_\beta} value of each of the candidate solutions in the current population in order to create the offspring for the next generation. The 25% of individuals in the current population that provide the most optimal objective function value will continue onward to form the parent population of the next generation.

A GA was originally proposed for optimization of \mathcal{L_\beta} in the original GP model fitting procedure for several reasons. First, non-deterministic evolutionary algorithms such as the GA are generally robust and do not require the use of gradient information. Moreover, using a large population size and maximum number of generations, each scaled to the dimension of the problem, helps guarantee accurate convergence to the global optimum. Achieving both robustness and a high level of accuracy with a GA, however, requires an excessive number of \mathcal{L_\beta} evaluations, which is generally inefficient.

For example, under this scheme, optimizing \mathcal{L_\beta} for a 10-D function requires 4000 function evaluations (FE) per generation, for a maximum of 50 generations, which could cost 200,000 FE in total. Assuming we are operating on a single core, this single optimization could translate to days of computation time. From this realistic example, the demand for a more efficient optimization technique is evident.

Posted in Optimization Algorithms

The Gaussian Process Model

Below I will describe the essential components of the Gaussian Process (GP) Model, designed by Dr. Pritam Ranjan at Acadia University. You will quickly begin to see why efficient and robust optimization of the Model’s processes (particularly through the maximum likelihood approach) is so crucial to the performance and accuracy of the GP model.

The GP model requires as input a set of design points, x_i = (x_{i1},...,x_{id}), and the corresponding simulator outputs, y_i=y(x_i), where i=\{1,...,n\}, n is the number of user supplied design points, and d is the dimension of the problem. We denote X =\{x_1,...,x_n\} as the set of n design points, and Y=y(X)=(y_1,...,y_n)' as the vector that holds the corresponding simulator output. The simulator output is thus modelled as
y(x_i)=\mu + z(x_i),
where \mu is the overall mean and z(x_i) is a collection of random variables with a joint normal distribution, i.e., a Gaussian Process. For example, in the case where the overall mean \mu = 0, the random variables z(x_i) represent the true function values at the locations x_i. The GP z(x_i) is therefore a generalization of the Gaussian probability distribution, and is fully defined by its expected value, E[z(x_i)]=0, variance, \text{Var}[z(x_i)]=\sigma^2, and covariance function, \text{Cov}[z(x_i),z(x_j)]=\sigma^2 R.

Here, R is the spatial correlation matrix, which defines the degree of dependency between any two design points x_i based on their observed simulator values. There are many choices for the correlation matrix R; however, we will use the squared exponential correlation function provided by Macdonald et al.

R_{ij} = \prod \limits_{k=1}^d \text{exp}\{-10^{\beta_k}|x_{ik}-x_{jk}|^2\} \quad \text{for all} \ i,j,
where \beta = (\beta_1,...,\beta_d) is a d \times 1 vector of scaling parameters. Determining the optimal \beta scaling parameters is crucial for ensuring the integrity of the GP model fit. Originally, Dr. Ranjan proposed the use of the scaling parameter \Theta in place of 10^{\beta}, which confines the parameter space to \Theta \in (0,\infty)^d. However, it was noticed that optimal \Theta values would often be positioned at the boundary, \Theta = 0^d, making the optimization process challenging. We therefore proposed a re-parameterization of the \Theta-space, letting \beta = \log_{10}(\Theta), such that \beta \in (-\infty, \infty)^d. Such a re-parameterization is preferred as it smoothes the parameter space, centering it about \beta = 0^d, and ensures that potential \beta solutions do not lie on a boundary.

The GP model’s ability to predict the simulator output for an unknown point x^* is directly related to the underlying properties of the correlation matrix R. Prediction error at any point x^* is relative to the distance between that point and nearby design points. Therefore, the GP model will predict accurately when the distance between x^* and nearby design points is small, whereas the prediction error will be larger for predictor points that are far away from the design points. The hyper-parameter \beta can therefore be interpreted as a measure of the sensitivity of y(x_i) to the spatial distribution |x_{ik}-x_{jk}|^2 for all i, j \in \{1,...,n\} and k \in \{1,...,d\}.

The model uses the closed form estimators of the mean \mu, and variance, \sigma^2, given by
\hat{\mu}(\beta)=(1_n'R^{-1}1_n)^{-1}(1_n'R^{-1}Y) \ \text{and} \ \hat{\sigma}^2(\beta) = \frac{(Y-1_n\hat{\mu}(\beta))'R^{-1}(Y-1_n\hat{\mu}(\beta))}{n}.
Here 1_n is an n \times 1 vector of all ones. Calculation of the mean and variance estimators, along with the repeated update of the inverse and determinant of the correlation matrix, R^{-1}, and |R| respectively, allow us to derive the profile log-likelihood
\mathcal{L}_\beta = -2\log(L_\beta) \propto \log(|R|) + n\log[(Y-1_n\hat{\mu}(\beta))'R^{-1}(Y-1_n \hat{\mu}(\beta))].

The GP model will provide as output the simulator estimate, \hat{y}(x^*), for a desired predictor point x^*. Through the maximum likelihood approach, the best function estimate for a given point x^* is
\hat{y}(x^*) = \hat{\mu}+r'R^{-1}(Y-1_n\hat{\mu}) = \left[\frac{(1_n-r'R^{-1}1_n)}{{1_n}'R^{-1}1_n} {1_n}' + r'\right]R^{-1}Y = C'Y,
where r=[r_1(x^*),...,r_n(x^*)], and r_i(x^*)=\text{corr}[z(x^*),z(x_i)], the correlation between z(x^*) and z(x_i). The GP model will also output the uncertainty in this estimate, s^2(x^*), as measured by the mean squared error (MSE),
s^2(x^*)=E \left[ ( \hat{y}(x^*)- {y}(x^*))^2\right] = \hat{\sigma}^2 (1_n-2C'r+C'RC).
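
Putting these pieces together, here is a minimal sketch of the GP fit and prediction for a fixed \beta in Python/NumPy. It uses the squared exponential correlation, the closed-form \hat{\mu} and \hat{\sigma}^2, and the predictor above; there is no nugget and no optimization over \beta, and a real implementation would use Cholesky solves rather than an explicit inverse.

```python
import numpy as np

def corr_matrix(X1, X2, beta):
    """Squared exponential correlation: prod_k exp(-10^beta_k |x_ik - x_jk|^2)."""
    D = (X1[:, None, :] - X2[None, :, :]) ** 2      # pairwise squared distances per dimension
    return np.exp(-np.sum(10.0 ** beta * D, axis=2))

def gp_fit_predict(X, Y, beta, X_star):
    """Fit the GP for a fixed beta and predict at the rows of X_star."""
    n = len(Y)
    one = np.ones(n)
    R = corr_matrix(X, X, beta)
    Ri = np.linalg.inv(R)
    mu = (one @ Ri @ Y) / (one @ Ri @ one)          # closed-form mean estimator
    resid = Y - one * mu
    sigma2 = (resid @ Ri @ resid) / n               # closed-form variance estimator
    _, logdet = np.linalg.slogdet(R)                # stable log|R|
    L = logdet + n * np.log(resid @ Ri @ resid)     # profile log-likelihood (up to a constant)
    r = corr_matrix(X_star, X, beta)                # correlations with the design points
    y_hat = mu + r @ Ri @ resid                     # best linear unbiased predictor
    return y_hat, sigma2, L
```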

INTRODUCING A NUGGET PARAMETER

Typically, a GP model attempts to find the best-fit interpolant of the user-supplied design points and their corresponding simulator outputs. The maximum likelihood approach then enables the user to estimate the corresponding simulator values for any number of predictor points. Algorithmically searching for the optimal \beta parameter requires a repeated update of the correlation matrix R for each potentially optimal \beta. As shown above, the profile likelihood requires accurate computation of both the inverse and determinant of R. However, if any pair of design points is excessively close in space, then the matrix R can become near-singular, resulting in unstable computation of the inverse and determinant of R, which can lead to an unreliable model fit.

To overcome this instability, we follow a popular technique which adds a small “nugget” parameter, \delta, to the model fitting procedure, replacing R with R_\delta = R + \delta I. The addition of the nugget parameter smoothes the model predictor, so the GP no longer produces an exact interpolant of our design points. To avoid unnecessary over-smoothing of the model predictions, a lower bound on the nugget parameter is introduced, given by
\delta_{lb}=\max\left\{\frac{\lambda_n(\kappa(R) -e^a)}{\kappa(R)(e^a-1)},0\right\}.
Here, the condition number is given by \kappa(R)=\frac{\lambda_n}{\lambda_1}, where \lambda_1 \leq \lambda_2 \leq ... \leq \lambda_n are the eigenvalues of R, and e^a is the desired threshold for \kappa(R). We define an n \times n matrix R to be near-singular if its condition number \kappa(R)=||R||\cdot ||R^{-1}|| exceeds the threshold value e^a, where ||\cdot|| denotes the matrix L_2-norm. A value of a=25 is recommended for the space-filling Latin hypercube designs (LHDs) used in determining the design points. In terms of optimization, it is important to note that \delta_{lb} is a function of both the design points and the optimization parameter \beta.
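
In code, the bound is only a few lines (a Python/NumPy sketch of the formula above):

```python
import numpy as np

def nugget_lower_bound(R, a=25.0):
    """delta_lb = max( lam_n * (kappa(R) - e^a) / (kappa(R) * (e^a - 1)), 0 )."""
    lam = np.linalg.eigvalsh(R)          # eigenvalues in ascending order: lam_1 <= ... <= lam_n
    kappa = lam[-1] / lam[0]             # condition number of R
    ea = np.exp(a)                       # threshold e^a
    return max(lam[-1] * (kappa - ea) / (kappa * (ea - 1.0)), 0.0)
```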

Posted in Financial Mathematics, Optimization Algorithms

Motivation

In order to familiarize the reader with the methods that I will use to apply mathematics (particularly optimization) to everyday problems, I feel it somewhat necessary to provide a few posts summarizing my current research (methods, techniques, etc.) so that future posts don’t appear so peculiar. Here we go:

Computer simulators are useful tools for modelling the complexity of real world systems which are either impractical, too expensive or too time consuming to physically observe. For example, analysis of large physical systems such as the flow of oil in a reservoir, or the tidal energy generated by the tides of large ocean basins, can be achieved through the use of computer simulators. Suppose we wish to optimize the position of injector and production wells in an oil field, or similarly the positions of the tidal turbines in an ocean basin. Strategic planning of such procedures is crucial in order to maximize the net output of energy. Computer simulators, therefore, enable scientists and engineers to experiment with a numerical replica of such complex problems in order to determine the most effective and efficient solution(s).

This being said, realistic computer simulators can be computationally expensive to run, and are often emulated using statistical models, such as Gaussian Process (GP) models. Many mathematical, physical, engineering and geological processes employ the use of GP models as statistical surrogates to replace computationally expensive computer simulators. For example, computer simulations of the elastic deformation of cylinders resulting from high velocity impacts, the manufacturing and integration processes of electrical circuits for quality measures, or the estimation of the magnetic field generated near the Milky Way, can be computationally expensive and overly time consuming. Therefore, in each case a GP model was used as an efficient way to obtain an approximation to the corresponding simulator output.

The GP model requires as input a set of experimental design points and their corresponding simulator outputs. The simulator output is required to estimate the mean and variance of the Gaussian Process z(x), explained further in my next post, whereas the design points are used for calculating the spatial correlation matrix, R. There are many choices for the correlation matrix R, however, we will use the squared exponential correlation function. The components of R depend on both the spatial distribution of the design points, as well as a scaling hyper-parameter beta. In order for the GP model to predict the simulator output with a high level of accuracy we must determine the corresponding optimal beta values.

The maximum likelihood approach (which is the actual optimization problem at hand) determines the hyper-parameter beta that will minimize the deviance in our model estimates, as measured by the mean squared error (MSE). The quality of the GP model is quantified by a likelihood function, \mathcal{L_\beta}, which is the objective function that we seek to optimize. Gradient information for the likelihood function cannot be easily obtained, and the likelihood surface can often have multiple local optima, making the optimization problem challenging. Derivative-free optimization techniques such as the genetic algorithm (GA) are robust, but can be computationally intensive. Gradient approximation methods, such as BFGS, are generally more efficient, but have the potential to converge to a local optimum if poorly initialized. A common technique when implementing local optimizers is a multi-start algorithm, which initiates a local optimizer at multiple points within the search domain, allowing a more globalized search to be performed. Nonetheless, this method demands several full executions of BFGS, which is still computationally expensive.
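As an aside, such a multi-start wrapper is only a few lines; here is a sketch using SciPy's BFGS (the function name, bounds handling and number of starts are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def multistart_bfgs(f, bounds, n_starts=10, seed=0):
    """Run BFGS from several random points in the box and keep the best result."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, float).T
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)                 # random starting point in the domain
        res = minimize(f, x0, method="BFGS")     # one full local optimization per start
        if best is None or res.fun < best.fun:
            best = res
    return best
```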

In order to improve the efficiency of the likelihood optimization, we investigate the use of three new derivative-free optimization techniques. The primary technique is a hybridization of two algorithms: Dividing Rectangles (DIRECT), a partitioning algorithm, and Implicit Filtering (IF), a general pattern search algorithm that incorporates a linear interpolant from which descent direction information is derived. The second technique replaces IF with BFGS in the DIRECT hybridization, which has its advantages, as the BFGS algorithm does not require bound constraints, whereas both DIRECT and IF are constrained optimizers. The third technique modifies a former clustering-based multi-start BFGS method by replacing BFGS with IF and introducing a function evaluation budget for a reduced number of cluster centres (0.5 times the dimension of the problem). We will compare the performance of the three techniques with that of the originally proposed genetic algorithm, and with the multi-start BFGS method employed using (2 times the dimension + 1) starting points. Analysis of both the likelihood value and the resulting MSE will determine each algorithm’s ability to provide an accurate model fit. Efficiency will be measured by both the time and the number of function evaluations required for convergence.

Posted in Optimization Algorithms

Optimization of the GP Model

The negative log-likelihood \mathcal{L}_\beta depends upon several components, namely the mean and variance estimators \hat{\mu}(\beta) and \hat{\sigma}^2(\beta), as well as the inverse and determinant of R_\delta, all of which are themselves dependent upon the hyper-parameter \beta. Recall that \beta = (\beta_1,...,\beta_d) is a vector of d parameters, with individual components \beta_i, i=1,...,d. Additionally, the nugget parameter \delta depends on the condition number \kappa(R), which is again dependent upon \beta. For this reason, it is difficult, if not impossible, to extract analytic gradient information from \mathcal{L_\beta}. It follows that optimization methods that rely on the user providing an accurate expression for \nabla \mathcal{L}_\beta are of no benefit. We can, however, provide numerical approximations to \nabla \mathcal{L}_\beta through finite differencing, as is done in the BFGS and Implicit Filtering (IF) algorithms. Such methods, which do not rely on an exact gradient of the objective function, are known as derivative-free optimization algorithms.

Even when using derivative-free optimization techniques, the optimization process remains challenging. The objective function \mathcal{L_\beta} is often very rough around the global optimum and can contain numerous local optima and flat regions. It is not uncommon for the likelihood value at these sub-optimal solutions to be close to that of the global optimum; however, the corresponding \beta parameterizations of these local optima can vary significantly, resulting in a poor-quality model fit. To ensure that the quality of the GP model is reliable, convergence to a highly precise global optimum is crucial, and thus a highly accurate and robust global optimization technique is required.

For a list of optimization algorithms please navigate to categories->Optimization Algorithms.

Posted in Financial Mathematics, Optimization Algorithms