**A Quick Analysis of Heuristic Optimization by Stochastic Genetic Algorithms and Particle Swarm**

We investigate the optimization performance of both Genetic Algorithms (GAs) and Particle Swarm Optimization (PSO). In particular, we are interested in comparing the efficiency and accuracy of these techniques in optimizing a variety of well-known test functions. The test functions used are outlined below.

We consider three classes of GAs:

- Matlab’s built-in traditional GA (MatlabGA).
- A traditional GA (GA) as outlined by Varadi at *http://cssanalytics.wordpress.com/*.
- A simplistic Micro-GA (mGA) as outlined by Varadi at *http://cssanalytics.wordpress.com/*.

**A Brief Review of Traditional GAs**

Recall that, like most evolutionary algorithms, a GA is a stochastic sampling method that repeatedly refines a population, or set, of candidate solutions. The method is based on the Darwinian notion of “Survival of the Fittest”, in which only the elite solutions (those that return the most optimal objective function values) are kept, while sub-optimal solutions are discarded.

A typical GA begins by randomly generating an initial population of candidate solutions, called the parents. The initial population of each generation is then enlarged through production of a combination of *cross-over children* and/or *mutation children*. Cross-over children are derived by mixing and matching individual components of the parents, in what Varadi describes as mating. Formally, at each component of the child vector, the default crossover function randomly selects a value from one of the two parents and assigns it to the corresponding component of the cross-over child. Mutation children are created by randomly perturbing the parent solutions, often via a Gaussian distribution.

Next we evaluate the corresponding objective function value at each potential solution in the current population. Only the fraction of individuals in the current population that provide the most optimal objective function values will continue onward to form the parent population of the next generation.
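To make the generation cycle concrete, here is a minimal sketch of the scheme just described (uniform crossover, Gaussian mutation, elitist selection). The function name, population size, and mutation scale are illustrative assumptions of mine, not Varadi’s or Matlab’s actual implementation.

```python
import random

def genetic_algorithm(f, dim, bounds, pop_size=30, generations=100, seed=0):
    """One-generation cycle: uniform crossover, Gaussian mutation, elitist selection."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Initial parent population, sampled uniformly from the feasible region.
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            p1, p2 = rng.sample(pop, 2)
            # Cross-over child: each component taken at random from one parent.
            children.append([p1[i] if rng.random() < 0.5 else p2[i]
                             for i in range(dim)])
            # Mutation child: Gaussian perturbation of a parent, clipped to the bounds.
            children.append([min(hi, max(lo, xi + rng.gauss(0, 0.1 * (hi - lo))))
                             for xi in p1])
        # "Survival of the fittest": keep the best pop_size individuals.
        pop = sorted(pop + children, key=f)[:pop_size]
    return min(pop, key=f)

# Example: minimize the 2-D sphere function on [-5, 5]^2 (true optimum 0 at the origin).
best = genetic_algorithm(lambda x: sum(v * v for v in x), dim=2, bounds=(-5.0, 5.0))
```

On a smooth convex surface like this, the elitist selection drives the best objective value monotonically toward the optimum.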

The mGA, on the other hand, is a simplified version of the traditional GA and involves a tournament phase for determining which parent solutions are worthy of mating. Those parents that are worthy produce cross-over children, as described above and at *http://cssanalytics.wordpress.com/*.

For consistency amongst the varying GAs we have kept each population size and maximum number of generations fixed at 100 and 100d, respectively, where d is the number of input dimensions of the test function. Convergence is established either by reaching the maximum number of generations or when the average relative difference between a generation’s solution and the best solution found over the most recent 50 generations is less than a specified tolerance.

**A Brief Review of PSO**

PSO is a gradient-free, stochastic optimization algorithm. Like GAs and mGAs, PSO begins by first generating an initial set of candidate solutions, or particles, collectively called the swarm. At each iteration, individual particles of the swarm navigate the search-space based on each particle’s personal best position and the overall best position found by the swarm. Therefore, previous locations that minimize the objective function act as attractors towards which the swarm will migrate. The position of each particle, and of the swarm alike, is defined in terms of its previous position and the velocity of the particle (or swarm). The velocity (or momentum factor) describes a particle’s tendency to move in a certain direction. The position and velocity are defined as:

$$x_{k+1} = x_k + v_{k+1}$$

and

$$v_{k+1} = w\,v_k + c_1 r_1 \left(p_k - x_k\right) + c_2 r_2 \left(g_k - x_k\right),$$

where $r_1$ and $r_2$ are random numbers on the interval $[0,1]$, $p_k$ denotes the best previous position of that particle, and $g_k$ denotes the best previous position of the swarm.

For optimization of each test function I have chosen to use a swarm size of 20. For a fair comparison with the GAs, the stopping criterion for PSO is that the average relative difference of the best solution found over the most recent 50 iterations is less than a specified tolerance.
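The update loop described above can be sketched in a few lines. This is a minimal illustration: the inertia weight `w` and the coefficients `c1`, `c2` are assumed values, and the stopping rule is a fixed iteration count rather than the 50-iteration tolerance test used in the experiments.

```python
import random

def pso(f, dim, bounds, swarm_size=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO sketch: inertia + cognitive + social velocity update."""
    rng = random.Random(seed)
    lo, hi = bounds
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(swarm_size)]
    v = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in x]                 # personal best positions
    gbest = min(pbest, key=f)[:]              # best position found by the swarm
    for _ in range(iters):
        for j in range(swarm_size):
            for i in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal best + pull toward swarm best.
                v[j][i] = (w * v[j][i]
                           + c1 * r1 * (pbest[j][i] - x[j][i])
                           + c2 * r2 * (gbest[i] - x[j][i]))
                x[j][i] = min(hi, max(lo, x[j][i] + v[j][i]))
            if f(x[j]) < f(pbest[j]):
                pbest[j] = x[j][:]
                if f(pbest[j]) < f(gbest):
                    gbest = pbest[j][:]
    return gbest

# Example: minimize the 2-D sphere function on [-5, 5]^2.
gbest = pso(lambda x: sum(v * v for v in x), dim=2, bounds=(-5.0, 5.0))
```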

Lastly, note that since both the GA and PSO are stochastic optimization routines, the results presented below have each been averaged over 100 simulations.

**Numerical Results**

We analyze three quantities:

- Accuracy: Average Percent Error, measured relative to the true, known global optimum value.
- Efficiency: Average number of Function Evaluations (FEs) required for convergence.
- General Performance Assessment: a GPA score combining error and FEs, in which accuracy holds twice the weight of efficiency; smaller GPA values indicate better performance.

Simulation results for each test function can be found in Table 1. From Table 1 we can draw the following conclusions:

- In general, MatlabGA outperforms both my own traditional GA and mGA, with the exception of optimizing the 2-D Hump Function and 6-D Hartmann function. This is not surprising given the level of sophistication of Matlab’s built-in optimization algorithms as compared to the 1-day coding blitz of my GAs.
- The mGA is indeed the most efficient of the GAs, as measured by the number of FEs required to establish convergence. Of course improved efficiency comes at the cost of decreased accuracy – ultimately the user’s optimization goals will play a huge factor in algorithm selection.
- As expected, PSO outperforms all GAs for objective functions that are strictly convex, with minimal noise and a single global optimum.
- As Varadi suggests, PSO (or a hybrid PSO method) may be the preferred technique for several portfolio optimization problems.
- Objective surfaces such as the Schwefel Function and the Rastrigin Function are noisy and multi-modal, containing several local optima. For global optimization of both these functions, MatlabGA proves to outperform PSO.

For full test function details see:

*http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/TestGO.htm*


Based on experimentation, a budget of 200 times the dimension of the problem in function evaluations enables DIRECT to efficiently determine a reasonable starting position from which to begin IF or BFGS. Starting from the termination position of DIRECT, we then allow either IF or BFGS to fine-tune DIRECT’s global search, in what is called the precision phase. Assuming that the global optimum is contained within the boundaries of the feasible region, IF should rapidly converge to a highly accurate global optimum.

It is, however, possible for the position of the global optimum to be located outside of the domain provided to IF. In this case, IF is strictly confined to this region and will not be able to converge to this position. In theory, this could have a negative effect on the model fitting process. We therefore provide a second hybridization option by replacing IF with the unbounded optimization algorithm, BFGS, which is able to explore outside of the bound constraints provided to IF.


The first step of DIRECT is to transform the user-supplied domain of the problem into a d-dimensional unit hypercube. Therefore, the feasible region, as described by Finkel, is

$$\bar{\Omega} = \{x \in \mathbb{R}^d : 0 \le x_i \le 1\}.$$

The objective function is first evaluated at the center of this hypercube, $c_1$. DIRECT then divides the hypercube into smaller hyper-rectangles by evaluating the objective function at the points $c_1 \pm \delta e_i$, where $\delta = 1/3$ is one-third of the side length and $e_i$ is the $i$th unit vector, in all coordinate directions $i = 1, \dots, d$.

The first division is performed so that the region with the most optimal function value is given the largest space.

Once the first division is complete, the next step is to identify potential optimal hyper-rectangles. Finkel provides the following definition for selecting potential optimal hyper-rectangles.

DEFINITION: Let $\epsilon > 0$ be a positive constant and let $f_{\min}$ be the current best function value. A hyper-rectangle $j$ is said to be potentially optimal if there exists $K > 0$ such that

$$f(c_j) - K d_j \le f(c_i) - K d_i \quad \text{for all } i, \qquad \text{and} \qquad f(c_j) - K d_j \le f_{\min} - \epsilon\,|f_{\min}|.$$

Here, $c_j$ is the center of hyper-rectangle $j$ and $d_j$ is the distance from the center to the corresponding vertices. In general, a value of $\epsilon = 10^{-4}$ is used to ensure that the selected rectangle exceeds the current best solution by some non-trivial amount.

From the definition, we observe that a hyper-rectangle $i$ is potentially optimal if:

1. $f(c_i) \le f(c_j)$ for all hyper-rectangles of equal size, $d_i = d_j$.

2. $\max_{k:\, d_k < d_i} \dfrac{f(c_i) - f(c_k)}{d_i - d_k} \le \min_{k:\, d_k > d_i} \dfrac{f(c_k) - f(c_i)}{d_k - d_i}$ for all other hyper-rectangles $k$, and $f(c_i) \le f(c_j)$ for all hyper-rectangles of equal size, $d_i = d_j$.

3. $\epsilon \le \dfrac{f_{\min} - f(c_i)}{|f_{\min}|} + \dfrac{d_i}{|f_{\min}|} \min_{k:\, d_k > d_i} \dfrac{f(c_k) - f(c_i)}{d_k - d_i}$ for all other hyper-rectangles $k$, when $f_{\min} \ne 0$.

Once a hyper-rectangle is selected as potentially optimal, DIRECT begins dividing it into smaller hyper-rectangles. First, the objective function is evaluated at the points $c \pm \delta_i e_i$ for all longest dimensions $i$, where $\delta_i$ is one-third of the longest side length. Of course, if the hyper-rectangle is a hypercube, then the objective function is evaluated in all coordinate directions, which is identical to the initial sampling step. We then divide the hyper-rectangle, with order determined by the objective function values returned during the sampling phase.

The three figures below provide a 2-dimensional visualization of how DIRECT samples and divides its search space. The bold regions are the potentially optimal rectangles, which are to be divided. The first iteration, (a), shows that the right-most rectangle is potentially optimal, and it is therefore sampled and divided along its vertical dimension. The second iteration, (b), shows that both the left-most rectangle and the right-most central rectangle are potentially optimal. Here we apply the second condition for selecting potentially optimal rectangles: the longest side of the left-most rectangle is greater than the longest side of all other rectangles, and therefore, by default, the function value at the center of the left-most rectangle is less than the function value of all other rectangles of equal size. Again, we sample and divide the rectangles according to their dimensions. This process of selecting potentially optimal rectangles, sampling and dividing is repeated in the third iteration, (c), and continues for subsequent iterations until the stopping criterion is met.
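The selection rule in the definition above can be tested mechanically. The sketch below is a naive feasibility check for the constant $K$ (efficient DIRECT implementations use a lower convex hull shortcut instead); the function name and the default epsilon are my own assumptions.

```python
def potentially_optimal(j, f_vals, sizes, eps=1e-4):
    """Naive check of the definition for hyper-rectangle j.

    f_vals[i] is the value at the center of rectangle i; sizes[i] is the
    center-to-vertex distance d_i.  We look for K > 0 with
    f_j - K*d_j <= f_i - K*d_i for all i, and f_j - K*d_j <= fmin - eps*|fmin|.
    """
    fmin = min(f_vals)
    fj, dj = f_vals[j], sizes[j]
    K_lo, K_hi = 0.0, float("inf")
    for fi, di in zip(f_vals, sizes):
        if di < dj:
            K_lo = max(K_lo, (fj - fi) / (dj - di))  # lower bound on K
        elif di > dj:
            K_hi = min(K_hi, (fi - fj) / (di - dj))  # upper bound on K
        elif fj > fi:
            return False                             # beaten at equal size
    if K_lo > K_hi:
        return False
    if K_hi == float("inf"):
        return True          # since d_j > 0, a large enough K meets the eps condition
    return fj - K_hi * dj <= fmin - eps * abs(fmin)

# Two small rectangles and one large one: only the large rectangle, which also
# holds the best center value, passes the check.
f_vals = [1.0, 2.0, 0.5]
sizes = [1.0, 1.0, 2.0]
flags = [potentially_optimal(j, f_vals, sizes) for j in range(3)]
```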


PSO

Particle Swarm Optimization is a gradient-free optimization method that uses a stochastic, meta-heuristic approach to find the optimum value. The algorithm begins by first generating an initial array of candidate solutions, called the swarm. The individual particles of the swarm then move through the search space based on each particle’s knowledge of where a better solution is located, as well as the swarm’s knowledge, as a whole, of where the optimum value is located. Therefore, previous successful solutions act as attractors towards which the swarm will migrate. The position of each particle, and of the swarm alike, is defined in terms of its previous position and the velocity of the particle (or swarm), which describes a particle’s tendency to move in a certain direction. We therefore define the position and velocity as

$$x_{k+1} = x_k + v_{k+1}, \qquad v_{k+1} = w\,v_k + c_1 r_1 \left(p_k - x_k\right) + c_2 r_2 \left(g_k - x_k\right),$$

where $w$ is a weighting parameter specified by the user, $r_1$ and $r_2$ are random numbers on the interval $[0,1]$ which enforce exploration of the search space, $p_k$ denotes the best previous position of that particle, and $g_k$ denotes the best previous position of the swarm. Qualitatively, the velocity component of PSO is specified by three parameters:

1. Inertia: tendency of the particle to continue moving in its current direction.

2. Cognitive: tendency of the particle to move towards that particle’s best found solution.

3. Social: tendency of a particle to move towards the swarm’s best found solution.

In general, PSO handles bound-constrained optimization problems, as well as linear and non-linear general constraints.

General Pattern Search

Though we will not be implementing a General Pattern Search (GPS) algorithm, the method is worth mentioning as Humphries has incorporated GPS along with PSO as a hybridized optimization technique. GPS begins at a user-provided initial point. Unless otherwise specified, the first iteration searches with a mesh size of 1 in all n directions, where n is the dimension of the objective function. The objective function is then computed at each of the mesh points until it finds a value that is less than the value at the current point. Once it has done so, the algorithm moves to that point and begins polling again. After a successful poll, the mesh size is multiplied by a factor of 2. If a poll is unsuccessful, the algorithm remains at its current position and polls once again, this time with a mesh size that is 0.5 times the former mesh size. GPS can handle both bound- and general-constraint optimization problems. Humphries’ hybridization technique involves implementing PSO until we encounter an iteration that does not improve our solution. From there we perform one step of GPS to poll for a better solution. If no better solution is found, we reduce the stencil size and poll again.
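The poll, expand, and shrink cycle can be sketched as follows. This is a bare-bones illustration of GPS on its own, not Humphries’ hybrid; the 2x expansion and 0.5x contraction factors follow the description above, while the function name and defaults are assumptions.

```python
def pattern_search(f, x0, mesh=1.0, max_iters=200, tol=1e-6):
    """Minimal GPS sketch: poll +/- mesh in each coordinate;
    expand the mesh on success, shrink it on failure."""
    x = list(x0)
    fx = f(x)
    while mesh > tol and max_iters > 0:
        max_iters -= 1
        improved = False
        for i in range(len(x)):
            for step in (mesh, -mesh):
                trial = x[:]
                trial[i] += step
                ft = f(trial)
                if ft < fx:            # successful poll: move to the better point
                    x, fx = trial, ft
                    improved = True
                    break
            if improved:
                break
        mesh = mesh * 2 if improved else mesh * 0.5  # expand or contract the mesh
    return x, fx

# Example: minimize (x - 3)^2 + (y + 1)^2 starting from the origin.
x_best, f_best = pattern_search(lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2,
                                [0.0, 0.0])
```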


In terms of optimization, the ultimate goal is to find a stationary point $x^*$ such that $f'(x^*) = 0$. Therefore, assuming the objective function is twice differentiable, Newton’s Method uses $f''$ in order to calculate the roots of $f'$, and the iterative process becomes

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.$$

This iterative scheme can be generalized to multiple dimensions by replacing $f'$ and $f''$ with the gradient, $\nabla f$, and the Hessian, $\nabla^2 f$, respectively. The main advantages of Newton’s method are its robustness and rapid local quadratic rate of convergence. Unfortunately, this method requires computation and storage of the Hessian matrix and solving the resulting system of equations, which can be costly.
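As a one-dimensional illustration of the iteration above (on an assumed toy objective, not the likelihood in question):

```python
def newton_minimize(fprime, fsecond, x0, tol=1e-10, max_iters=50):
    """1-D Newton sketch: find a root of f'(x) via x_{k+1} = x_k - f'(x_k)/f''(x_k)."""
    x = x0
    for _ in range(max_iters):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# f(x) = x^4 - 3x^2 + 2  =>  f'(x) = 4x^3 - 6x,  f''(x) = 12x^2 - 6.
# Started near x = 1, the iteration converges to the stationary point sqrt(3/2).
x_star = newton_minimize(lambda x: 4 * x**3 - 6 * x,
                         lambda x: 12 * x**2 - 6, x0=1.0)
```

The quadratic convergence is visible in practice: each iteration roughly doubles the number of correct digits.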

The most rudimentary quasi-Newton method is the one-dimensional secant method, which approximates $f''(x_k)$ with a finite differencing scheme,

$$f''(x_k) \approx \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}.$$

The BFGS quasi-Newton approach can therefore be thought of as a generalization of the secant method.

The general form of a quasi-Newton optimization is: given a point $x_k$, we attempt to solve for the descent direction by approximating the Hessian matrix, $B_k$, at $x_k$. We can then obtain a search direction $p_k$ by solving

$$B_k p_k = -\nabla f(x_k).$$

Like Newton’s Method, we follow along this search direction to obtain a new guess

$$x_{k+1} = x_k + \alpha_k p_k,$$

where $\alpha_k$ is the step-size, often determined through a scalar optimization or a line search procedure.

The simplest and most cost-efficient generalized quasi-Newton method is the method of steepest descent. This method assumes $B_k$ is equal to the identity matrix $I$ for all $k$, and therefore does not require costly computation and storage of the Hessian approximation. As a result, the descent direction simplifies to $p_k = -\nabla f(x_k)$, the direction of “most” descent. This simplification, however, decreases the rate of convergence, such that the method of steepest descent converges only linearly.

For optimization of the likelihood, we use Matlab’s built-in unconstrained optimization algorithm fminunc. fminunc uses a quasi-Newton optimization technique and includes multiple options for tackling a wide variety of optimization problems. For convenience, we will describe fminunc in terms of the options that we have selected for our specific optimization problem.

fminunc uses the BFGS method to update the Hessian approximation at each iterate $x_k$. The BFGS quasi-Newton algorithm can be summarized by the following steps:

1. Specify an initial $x_0$ and $B_0$.

2. For k=0,1,2,…

a) Stop if $x_k$ is optimal.

b) Solve $B_k p_k = -\nabla f(x_k)$ for the search direction $p_k$.

c) Use a line search to determine the step-size $\alpha_k$.

d) Update $x_{k+1} = x_k + \alpha_k p_k$.

e) Compute $B_{k+1}$ using the BFGS update.

We have chosen to use the medium-scale implementation of fminunc, in which the user must supply an initial value x0. The user, however, does not need to supply the initial Hessian matrix, as $B_0 = I$ by default under the medium-scale implementation. Therefore, the first iteration of fminunc follows the method of steepest descent. We define $x_k$ as optimal if the magnitude of the gradient of the likelihood function falls below a specified tolerance. The gradient is approximated using a forward finite differencing scheme. If $x_k$ is not optimal, then we proceed to solve for the search direction $p_k$. A mixed quadratic/cubic line search procedure (described below) is used to determine an appropriate step-size $\alpha_k$, which allows us to update to our new position $x_{k+1}$.
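The steps above can be sketched in plain Python. Note the hedges: a simple backtracking (Armijo) line search stands in for fminunc’s mixed quadratic/cubic procedure, and the update is written in the equivalent inverse-Hessian form, so this is an illustrative sketch of the BFGS scheme rather than MathWorks’ actual code.

```python
import math

def bfgs_min(f, grad, x0, max_iters=500, tol=1e-8):
    """BFGS sketch: H_0 = I, direction p_k = -H_k g_k, Armijo backtracking."""
    n = len(x0)
    x = [float(v) for v in x0]
    H = [[float(i == j) for j in range(n)] for i in range(n)]  # H_0 = I
    g = grad(x)
    for _ in range(max_iters):
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break                                  # x_k is optimal
        # Search direction (inverse-Hessian form of B_k p_k = -grad f(x_k)).
        p = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Backtracking (Armijo) line search for the step-size alpha_k.
        fx = f(x)
        slope = sum(gi * pi for gi, pi in zip(g, p))
        alpha = 1.0
        while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5
        x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        ys = sum(yi * si for yi, si in zip(y, s))
        if ys > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / ys
            # H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T
            A = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(n)]
                 for i in range(n)]
            AH = [[sum(A[i][k] * H[k][j] for k in range(n)) for j in range(n)]
                  for i in range(n)]
            H = [[sum(AH[i][k] * A[j][k] for k in range(n)) + rho * s[i] * s[j]
                  for j in range(n)] for i in range(n)]
        x, g = x_new, g_new
    return x

# Example: the Rosenbrock function, whose minimum is at (1, 1).
f = lambda v: (1 - v[0]) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2
grad = lambda v: [-2 * (1 - v[0]) - 400 * v[0] * (v[1] - v[0] ** 2),
                  200 * (v[1] - v[0] ** 2)]
x_star = bfgs_min(f, grad, [-1.2, 1.0])
```

Since $H_0 = I$, the first step is exactly a steepest-descent step, matching the fminunc behaviour described above.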

The BFGS update formula falls into a broader class of rank two update formulas that maintain both symmetry and positive definiteness (SPD) of the Hessian approximation. Preservation of SPD ensures that the search direction is always a descent direction. The general rank two update formula is given by

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \phi \left(s_k^T B_k s_k\right) v_k v_k^T,$$

where $\phi$ is a scalar and

$$v_k = \frac{y_k}{y_k^T s_k} - \frac{B_k s_k}{s_k^T B_k s_k}, \qquad s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$

The BFGS Hessian update is obtained by setting $\phi = 0$, and therefore simplifies to

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}.$$

LINE SEARCH PROCEDURE

A mixed quadratic/cubic line search procedure is used in order to determine a suitable step-size $\alpha_k$. The line search procedure operates by first perturbing the initial point along the line of steepest descent in order to obtain a second point. These two points, along with the initial gradient approximation, are used to obtain a quadratic interpolation. A third point is obtained through extrapolation of the quadratic interpolant, chosen so that the three points bracket the minimum. Using these three points and the first gradient approximation allows us to obtain a cubic interpolant, whose minimum estimates the true minimum. We then proceed to bisect the bracketing interval, using two of the three points along with the minimum estimate to obtain a new cubic interpolant, and the process repeats.


The sampling method is arranged on a stencil which, given a current iterate $x$, will sample the objective function in all coordinate directions via $x \pm h v_i$. Here $v_i = (u_i - l_i) e_i$, where $l_i$ and $u_i$ represent the components of the lower and upper bounds of the feasible region, respectively, and $e_i$ is the unit coordinate vector. The scale, $h$, varies as the optimization progresses. If a sampling phase is unsuccessful in finding a more optimal position than the current incumbent point, the scale, and thus the stencil, is reduced by a factor of 0.5, and the sampling phase is repeated. By default, the optimization will terminate after the scale has been reduced a fixed number of times. It is, however, the quasi-Newton iteration that provides IF with a clear advantage over a general pattern search algorithm, and it has been shown to improve both overall efficiency and the ability to converge globally.

In terms of algorithm structure, IF can be broken into two main loops: the outer loop, which controls the pattern search, and the inner loop, which implements the quasi-Newton iteration at the current best position determined at the end of each sampling step. Therefore, IF must be given two main stopping criteria, one for each loop.

The quasi-Newton iteration will terminate when the norm of the difference gradient falls below a specified tolerance. A scaling factor attempts to eliminate discrepancies in the step size of the quasi-Newton iteration, which arise when the values of the objective function are either very small or very large; the default scale is based on the objective value at the user-supplied initial starting point. Highly sensitive stopping tolerances are required for optimization of the likelihood, as the likelihood function contains many regions that are relatively flat, which can result in premature termination of the optimization process if the stopping tolerance is too large.

Kelley states that IF works efficiently for functions which are noisy, non-smooth or random. Therefore, in theory, IF should be well suited for optimization of our particular objective function. The general thought is that the sampling phase, with a suitable step size, should step over the local minima, whereas the quasi-Newton iteration will establish efficient convergence in regions near the global optimum. For more information on IF’s options and specifications, please refer to *Implicit Filtering* by C.T. Kelley.


The GA begins by randomly generating an initial population of candidate solutions, called the parents, that are contained in the feasible region of the objective function. In order to further explore this domain, the initial population of each generation is quadrupled through generation of cross-over children and mutation children. First, additional cross-over children are derived through combining individual components of the vectors of independent variables, which doubles the current population. At each coordinate of the child vector, the default crossover function randomly selects a value from one of the two parents and assigns it to the cross-over child, providing a mix-and-match solution. Mutation children are then created by randomly perturbing both the parent solutions and cross-over children via a Gaussian distribution. The final result is a population that is four times the size of the parent population.

Next we evaluate the corresponding value of each of the candidate solutions in the current population in order to create the offspring for the next generation. The 25% of individuals in the current population that provide the most optimal objective function value will continue onward to form the parent population of the next generation.

A GA was originally proposed for optimization of the likelihood in the original GP model fitting procedure for several reasons. First, non-deterministic evolutionary algorithms such as the GA are generally robust and do not require the use of gradient information. Moreover, using a large population size and maximum number of generations, each scaled to the dimension of the problem, helps guarantee accurate convergence to the global optimum. Achieving both robustness and a high level of accuracy with a GA, however, requires an excessive number of likelihood evaluations, which is generally inefficient.

For example, optimizing a 10-D function under this scheme requires 4,000 function evaluations (FEs) per generation, for a maximum of 50 generations, which could potentially cost 200,000 FEs in total. Assuming we are operating on a single core, this single optimization process would translate to days of computation time. From this realistic example, the demand for a more efficient optimization technique is evident.


The GP model requires as input a set of design points, $x_i$, and the corresponding simulator outputs, $y_i = y(x_i)$, for $i = 1, \dots, n$, where $n$ is the number of user-supplied design points and $d$ is the dimension of the problem. We denote by $X$ the set of $n$ design points, and by $Y$ the vector that holds the corresponding simulator output. The simulator output is thus modelled as

$$y(x_i) = \mu + z(x_i), \qquad i = 1, \dots, n,$$

where $\mu$ is the overall mean, and $z(x)$ is a collection of random variables which have a joint normal distribution, and thus is a Gaussian Process. For example, in the case where the overall mean $\mu = 0$, the random variables $z(x_i)$ will represent the true function value at the location $x_i$. The GP is therefore a generalization of the Gaussian probability distribution, and is fully defined by its expected value, $E[z(x)] = 0$, variance, $\sigma_z^2$, and covariance function, $\mathrm{Cov}\!\left(z(x_i), z(x_j)\right) = \sigma_z^2 R_{ij}$.

Here, $R$ is the spatial correlation matrix, which defines the degree of dependency between any two design points based on their observed simulator values. There are many choices for the correlation matrix $R$; however, we will use the squared exponential correlation function provided by Macdonald et al.,

$$R_{ij} = \prod_{k=1}^{d} \exp\!\left(-\theta_k \left| x_{ik} - x_{jk} \right|^2 \right),$$

where $\theta = (\theta_1, \dots, \theta_d)$ is a $d \times 1$ vector of scaling parameters. Determining the optimal scaling parameter is crucial for ensuring the integrity of the GP model fit. Originally, Dr. Ranjan proposed the use of an alternative scaling parameter in place of $\theta$, which confines the parameter space to a bounded interval; however, they noticed that optimal values would often be positioned at the boundary, making the optimization process challenging. We therefore proposed a re-parameterization of the parameter space, letting $\theta_k = 10^{\beta_k}$ \cite{P2}. Such a re-parameterization is preferred as it smoothes the parameter space, centering it about $\beta_k = 0$, and ensures that potential solutions do not lie on a boundary.

The GP model’s ability to predict the simulator output for an unknown point is directly related to the underlying properties of the correlation matrix $R$. Prediction error at any point is relative to the distance between that point and nearby design points. Therefore, the GP model will predict accurately when the distance between a predictor point and nearby design points is small, whereas the prediction error will be larger for predictor points that are far away from the design points. The hyper-parameter $\beta$ can therefore be interpreted as a measure of the sensitivity of the correlation to the spatial distribution of the design points.

The model uses the closed form estimators of the mean, $\hat{\mu}$, and variance, $\hat{\sigma}_z^2$, given by

$$\hat{\mu} = \frac{\mathbf{1}^T R^{-1} Y}{\mathbf{1}^T R^{-1} \mathbf{1}}, \qquad \hat{\sigma}_z^2 = \frac{\left(Y - \mathbf{1}\hat{\mu}\right)^T R^{-1} \left(Y - \mathbf{1}\hat{\mu}\right)}{n}.$$

Here $\mathbf{1}$ is an $n \times 1$ vector of all ones. Calculation of the mean and variance estimators, along with the repeated update of the inverse and determinant of the correlation matrix, $R^{-1}$ and $|R|$ respectively, allows us to derive the profile log-likelihood,

$$-2 \log L \propto \log\!\left(|R|\right) + n \log\!\left(\hat{\sigma}_z^2\right).$$
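To make the estimators concrete, here is a small self-contained sketch that evaluates the mean and variance estimators and the profile log-likelihood for a 1-D toy data set. The helper names and the $\theta = 10^\beta$ parameterization are assumptions for illustration, and a tiny Gaussian-elimination solver stands in for a proper Cholesky factorization.

```python
import math

def solve(A, b):
    """Tiny Gauss-Jordan solver with partial pivoting (fine for small matrices)."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                t = M[r][c] / M[c][c]
                M[r] = [mr - t * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def log_det(A):
    """log|A| for a symmetric positive definite matrix via elimination."""
    n = len(A)
    M = [row[:] for row in A]
    ld = 0.0
    for c in range(n):
        ld += math.log(M[c][c])     # pivots of an SPD matrix are positive
        for r in range(c + 1, n):
            t = M[r][c] / M[c][c]
            M[r] = [mr - t * mc for mr, mc in zip(M[r], M[c])]
    return ld

def neg2_profile_loglik(beta, X, Y):
    """-2 log-likelihood, up to an additive constant: log|R| + n*log(sigma2_hat)."""
    n = len(X)
    theta = 10.0 ** beta            # assumed re-parameterization theta = 10^beta
    R = [[math.exp(-theta * (xi - xj) ** 2) for xj in X] for xi in X]
    Rinv_Y = solve(R, list(Y))
    Rinv_1 = solve(R, [1.0] * n)
    mu_hat = sum(Rinv_Y) / sum(Rinv_1)          # (1' R^-1 Y) / (1' R^-1 1)
    resid = [y - mu_hat for y in Y]
    sigma2_hat = sum(r * w for r, w in zip(resid, solve(R, resid))) / n
    return log_det(R) + n * math.log(sigma2_hat)

# Toy 1-D data: three design points and their simulator outputs.
dev = neg2_profile_loglik(0.0, [0.0, 0.5, 1.0], [1.0, 2.1, 2.9])
```

In a real fit this quantity would be evaluated repeatedly as the optimizer searches over $\beta$, which is exactly why each evaluation's inverse and determinant must be cheap and stable.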

The GP model will provide as output the simulator estimate, $\hat{y}(x^*)$, for a desired predictor point $x^*$. Through the maximum likelihood approach, the best function estimate for a given point is

$$\hat{y}(x^*) = \hat{\mu} + r^T R^{-1} \left(Y - \mathbf{1}\hat{\mu}\right),$$

where $r$ is the $n \times 1$ vector holding the correlation between $x^*$ and each design point. The GP model will also output the uncertainty in this estimate, as measured by the mean squared error (MSE),

$$s^2(x^*) = \hat{\sigma}_z^2 \left(1 - r^T R^{-1} r + \frac{\left(1 - \mathbf{1}^T R^{-1} r\right)^2}{\mathbf{1}^T R^{-1} \mathbf{1}}\right).$$

INTRODUCING A NUGGET PARAMETER

Typically, a GP model attempts to find the best-fit interpolant of the user-supplied design points and their corresponding simulator outputs. The maximum likelihood approach enables the user to then estimate the corresponding simulator values for any number of predictor points. Algorithmically searching for the optimal parameter $\beta$ requires a repeated update of the correlation matrix $R$ for each potentially optimal $\beta$. As shown above, the profile likelihood requires accurate computation of both the inverse and determinant of $R$. However, if any pair of design points are excessively close in space, then the matrix $R$ can become near-singular, resulting in unstable computation of its inverse and determinant, which can lead to an unreliable model fit.

To overcome this instability, we follow a popular technique which adds a small “nugget” parameter, $\delta$, to the model fitting procedure, which in turn replaces $R$ with $R_\delta = R + \delta I$. The addition of the nugget parameter smoothes the model predictor, and therefore the GP does not produce an exact interpolant of our design points. To avoid unnecessary over-smoothing of the model predictions, a lower bound on the nugget parameter is introduced, given by

$$\delta_{lb} = \max\!\left\{ \frac{\lambda_n \left(\kappa(R) - e^a\right)}{\kappa(R)\left(e^a - 1\right)},\; 0 \right\}.$$

Here, the condition number is given by $\kappa(R) = \lambda_n / \lambda_1$, where $\lambda_1 \le \cdots \le \lambda_n$ are the eigenvalues of $R$, and $e^a$ is the desired threshold for $\kappa(R)$. We define an $n \times n$ matrix $R$ to be near-singular if its condition number, $\kappa(R) = \|R\| \, \|R^{-1}\|$, exceeds the threshold value $e^a$, where $\|\cdot\|$ denotes the matrix L2-norm. A value of $a = 25$ is recommended for the space-filling Latin hypercube designs (LHDs) used in determining the design points. In terms of optimization, it is important to note that $\delta_{lb}$ is a function of both the design points and the optimization parameter $\beta$.
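A two-point example illustrates why the nugget helps: for the 2 x 2 correlation matrix of two nearly coincident design points the condition number blows up, and adding $\delta$ to the diagonal ($R \to R + \delta I$) caps it. The numerical values below are illustrative assumptions.

```python
import math

def cond_2x2_sym(a, b, d):
    """Condition number (eigenvalue ratio) of the symmetric matrix [[a, b], [b, d]]."""
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    return ((tr + disc) / 2) / ((tr - disc) / 2)

# Two nearly coincident design points make R nearly singular; adding a small
# nugget delta to the diagonal caps the condition number.
r = math.exp(-1e-8)                  # correlation between two very close points
kappa_R = cond_2x2_sym(1.0, r, 1.0)
delta = 1e-6
kappa_R_delta = cond_2x2_sym(1.0 + delta, r, 1.0 + delta)
```

Here the raw matrix has a condition number on the order of $10^8$, while the nugget-augmented matrix drops by roughly two orders of magnitude, at the cost of no longer interpolating exactly.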


Computer simulators are useful tools for modelling the complexity of real world systems which are either impractical, too expensive or too time consuming to physically observe. For example, analysis of large physical systems such as the flow of oil in a reservoir, or the tidal energy generated by the tides of large ocean basins, can be achieved through the use of computer simulators. Suppose we wish to optimize the position of injector and production wells in an oil field, or similarly the positions of the tidal turbines in an ocean basin. Strategic planning of such procedures is crucial in order to maximize the net output of energy. Computer simulators, therefore, enable scientists and engineers to experiment with a numerical replica of such complex problems in order to determine the most effective and efficient solution(s).

This being said, realistic computer simulators can be computationally expensive to run, and are often emulated using statistical models, such as Gaussian Process (GP) models. Many mathematical, physical, engineering and geological processes employ the use of GP models as statistical surrogates to replace computationally expensive computer simulators. For example, computer simulations of the elastic deformation of cylinders resulting from high velocity impacts, the manufacturing and integration processes of electrical circuits for quality measures, or the estimation of the magnetic field generated near the Milky Way, can be computationally expensive and overly time consuming. Therefore, in each case a GP model was used as an efficient way to obtain an approximation to the corresponding simulator output.

The GP model requires as input a set of experimental design points and their corresponding simulator outputs. The simulator output is required to estimate the mean and variance of the Gaussian Process z(x), explained further in my next post, whereas the design points are used for calculating the spatial correlation matrix, R. There are many choices for the correlation matrix R, however, we will use the squared exponential correlation function. The components of R depend on both the spatial distribution of the design points, as well as a scaling hyper-parameter beta. In order for the GP model to predict the simulator output with a high level of accuracy we must determine the corresponding optimal beta values.

The maximum likelihood approach (which is the actual optimization problem at hand) determines the hyper-parameter beta that will minimize the deviance in our model estimates, as measured by the mean squared error (MSE). The quality of the GP model is quantified by a likelihood function, which is the objective function that we seek to optimize. Gradient information for the likelihood function cannot be easily obtained, and the likelihood surface can often have multiple local optima, making the optimization problem challenging. Derivative-free optimization techniques such as the genetic algorithm (GA) are robust, but can be computationally intensive. Gradient approximation methods, such as BFGS, are generally more efficient, but have the potential to converge to a local optimum if poorly initialized. A common technique when implementing local optimizers is a multi-start algorithm, which initiates a local optimizer at multiple points within our search domain, allowing a more globalized search to be performed. Nonetheless, this method demands several full executions of BFGS, which is still computationally expensive.
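The multi-start idea is easy to sketch. Here plain gradient descent stands in for BFGS, and the one-dimensional multimodal surface is an assumed toy, not the likelihood itself: a single start can land in the shallow local minimum, while multiple starts recover the deeper one.

```python
def gradient_descent(f, grad, x0, lr=0.01, iters=2000):
    """Plain gradient descent, standing in for BFGS in this sketch."""
    x = x0
    for _ in range(iters):
        x -= lr * grad(x)
    return x

def multistart(f, grad, starts):
    """Run the local optimizer from several starts; keep the best local solution."""
    return min((gradient_descent(f, grad, s) for s in starts), key=f)

# A 1-D multimodal surface: two local minima, the deeper one near x = -1.47.
f = lambda x: x**4 - 4 * x**2 + x
grad = lambda x: 4 * x**3 - 8 * x + 1
best = multistart(f, grad, starts=[-2.0, 0.0, 2.0])
```

Starting only from x = 2 would converge to the shallow minimum near x = 1.34; the multi-start sweep finds the global one, at the cost of one full local run per starting point.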

In order to improve the efficiency of the likelihood optimization, we investigate the use of three new derivative-free optimization techniques. The primary technique is a hybridization of two algorithms: Dividing Rectangles (DIRECT), a partitioning algorithm, and Implicit Filtering (IF), a general pattern search algorithm that incorporates the use of a linear interpolant from which descent direction information is derived. The second technique replaces IF with BFGS in the DIRECT hybridization, which is advantageous as the BFGS algorithm does not require bound constraints, whereas both DIRECT and IF are constrained optimizers. The third technique modifies a former clustering-based multi-start BFGS method by replacing BFGS with IF and introducing a function evaluation budget for a reduced number of cluster centres (0.5 X dimension of the problem). We will compare the performance of the three techniques with that of the originally proposed genetic algorithm, and with the multi-start BFGS method employed using 2 X dimension + 1 starting points. Analysis of both the likelihood value and the resulting MSE will determine each algorithm’s ability to provide an accurate model fit. Efficiency will be measured by both the time and number of function evaluations required for convergence.


Even when using derivative-free optimization techniques, the optimization process remains challenging. The objective function is often very rough around the global optimum and can contain numerous local optima and flat regions. It is not uncommon for the likelihood value at these sub-optimal solutions to be close to that of the global optimum. However, the corresponding parameterizations of these local optima can vary significantly, resulting in a poor-quality model fit. To ensure that the quality of the GP model is reliable, convergence to a highly precise global optimum is crucial, and thus a highly accurate and robust global optimization technique is required.

For a list of optimization algorithms please navigate to categories->Optimization Algorithms.
