# The Gaussian Process Model

Below I will describe the essential components of the Gaussian Process (GP) model, designed by Dr. Pritam Ranjan at Acadia University. You will quickly begin to see why efficient and robust optimization of the model's likelihood function (the maximum likelihood approach) is so crucial to the performance and accuracy of the GP model.

The GP model requires as input a set of design points, $x_i = (x_{i1},...,x_{id})$, and the corresponding simulator outputs, $y_i=y(x_i)$, where $i \in \{1,...,n\}$, n is the number of user-supplied design points, and d is the dimension of the problem. We denote $X =\{x_1,...,x_n\}$ as the set of n design points, and $Y=y(X)=(y_1,...,y_n)'$ as the vector that holds the corresponding simulator output. The simulator output is thus modelled as
$y(x_i)=\mu +z(x_i),$ where $\mu$ is the overall mean, and $z(x_i)$ is a collection of random variables which have a joint normal distribution, and thus form a Gaussian process. For example, in the case where the overall mean is $\mu = 0$, the random variables $z(x_i)$ represent the true function value at the location $x_i$. The GP $z(x_i)$ is therefore a generalization of the Gaussian probability distribution, and is fully defined by its expected value, $E[z(x_i)]=0$, variance $\text{Var}[z(x_i)]=\sigma^2$, and covariance function $\text{Cov}[z(x_i),z(x_j)]=\sigma^2 R_{ij}$.

Here, R is the spatial correlation matrix, which defines the degree of dependency between any two design points based on their spatial separation. There are many choices for the correlation matrix R; however, we will use the squared exponential correlation function provided by Macdonald et al.

$R_{ij} = \prod \limits_{k=1}^d \text{exp}\{-10^{\beta_k}|x_{ik}-x_{jk}|^2\} \quad \text{for all} \ i,j,$
where $\beta = (\beta_1,...,\beta_d)$ is a $d \times 1$ vector of scaling parameters. Determining the optimal $\beta$ scaling parameter is crucial for ensuring the integrity of the GP model fit. Originally, Dr. Ranjan proposed the use of the scaling parameter $\Theta$ in place of $10^{\beta}$, which confines the parameter space to $\Theta \in (0,\infty)^d$. However, he noticed that optimal $\Theta$ values would often be positioned at the boundary, $\Theta= 0^d$, making the optimization process challenging. We therefore proposed a re-parameterization of the $\Theta$-space, letting $\beta = \log_{10}(\Theta)$, such that $\beta \in (-\infty, \infty)^d$ \cite{P2}. Such a re-parameterization is preferred as it smoothes the parameter space, centres it about $\beta =0^d$, and ensures that potential $\beta$ solutions do not lie on a boundary.
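To make the correlation function concrete, here is a minimal numpy sketch of building R from a design X and the $\beta$ parameterization above. The function name `build_R` and the toy design are illustrative assumptions, not part of the GPfit implementation.

```python
import numpy as np

def build_R(X, beta):
    """Squared-exponential correlation: R_ij = prod_k exp(-10^beta_k |x_ik - x_jk|^2)."""
    X = np.asarray(X, dtype=float)
    theta = 10.0 ** np.asarray(beta, dtype=float)   # re-parameterization: theta = 10^beta
    # pairwise squared differences per dimension, shape (n, n, d)
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2
    return np.exp(-(diff2 * theta).sum(axis=2))

# toy 1-D design of n = 3 points
X = np.array([[0.1], [0.5], [0.9]])
R = build_R(X, beta=[0.0])   # beta = 0 corresponds to theta = 1
```

Note that R has ones on its diagonal and is symmetric by construction; as $\beta_k$ grows, correlations decay more sharply with distance in dimension k.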

The GP model’s ability to predict the simulator output for an unknown point $x^*$ is directly related to the underlying properties of the correlation matrix R. Prediction error at any point $x^*$ depends on the distance between that point and nearby design points. Therefore, the GP model will predict accurately when the distance between $x^*$ and nearby design points is small, whereas the prediction error will be larger for predictor points that are far away from the design points. The hyper-parameter $\beta$ can therefore be interpreted as a measure of the sensitivity of $y(x_i)$ to the spatial distribution $|x_{ik}-x_{jk}|^2$ for all $i, j \in \{1,...,n\}$ and $k \in \{1,...,d\}.$

The model uses the closed form estimators of the mean $\mu$, and variance, $\sigma^2$, given by
$\hat{\mu}(\beta)=(1_n'R^{-1}1_n)^{-1}(1_n'R^{-1}Y) \ \text{and} \ \hat{\sigma}^2(\beta) = \frac{(Y-1_n\hat{\mu}(\beta))'R^{-1}(Y-1_n\hat{\mu}(\beta))}{n}.$
Here $1_n$ is an $n \times 1$ vector of all ones. Calculation of the mean and variance estimators, along with the repeated update of the inverse and determinant of the correlation matrix, $R^{-1}$, and $|R|$ respectively, allow us to derive the profile log-likelihood
$\mathcal{L}_\beta = -2\log{(L_\beta)} \propto \log{(|R|)} + n\log{[(Y-1_n\hat{\mu}(\beta))'R^{-1}(Y-1_n \hat{\mu}(\beta))]}.$

The GP model will provide as output the simulator estimate, $\hat{y}(x^*)$, for a desired predictor point $x^*$. Through the maximum likelihood approach, the best function estimate for a given point $x^*$ is
$\hat{y}(x^*) = \hat{\mu}+r'R^{-1}(Y-1_n\hat{\mu}) = \left[\frac{(1-r'R^{-1}1_n)}{{1_n}'R^{-1}1_n} {1_n}' + r'\right]R^{-1}Y = C'Y,$
where $r=[r_1(x^*),...,r_n(x^*)]'$, and $r_i(x^*)=\text{corr}[z(x^*),z(x_i)]$ is the correlation between $z(x^*)$ and $z(x_i)$. The GP model will also output the uncertainty in this estimate, $s^2(x^*)$, as measured by the mean squared error (MSE),
$s^2(x^*)=E \left[ ( \hat{y}(x^*)- {y}(x^*))^2\right] = \hat{\sigma}^2 (1-2C'r+C'RC).$

# Introducing a Nugget Parameter

Typically, a GP model attempts to find the best-fit interpolant of the user-supplied design points and their corresponding simulator outputs. The maximum likelihood approach then enables the user to estimate the corresponding simulator values for any number of predictor points. Algorithmically searching for the optimal $\beta$ parameter requires a repeated update of the correlation matrix R for each candidate $\beta$. As shown above, the profile likelihood requires accurate computation of both the inverse and determinant of R. However, if any pair of design points are excessively close in space, then the matrix R can become near-singular, resulting in unstable computation of the inverse and determinant of R, which can lead to an unreliable model fit. To overcome this instability, we follow a popular technique which adds a small “nugget” parameter, $\delta$, to the model fitting procedure, which in turn replaces R with $R_\delta = R +\delta I$. The addition of the nugget parameter smoothes the model predictor, and therefore the GP does not produce an exact interpolant of our design points. To avoid unnecessary over-smoothing of the model predictions, a lower bound on the nugget parameter is introduced, given by
$\delta_{lb}=\max\left\{\frac{\lambda_n(\kappa(R) -e^a)}{\kappa(R)(e^a-1)},0\right\}.$
Here, the condition number is given by $\kappa(R)=\frac{\lambda_n}{\lambda_1}$, where $\lambda_1 \leq \lambda_2 \leq ...\leq \lambda_n$ are the eigenvalues of R, and $e^a$ is the desired threshold for $\kappa(R)$.
We define an $n \times n$ matrix R to be near-singular if its condition number $\kappa(R)=||R||\cdot ||R^{-1}||$ exceeds the threshold value $e^a$, where $||\cdot||$ denotes the matrix $L_2$-norm. A value of $a=25$ is recommended for the space-filling Latin hypercube designs (LHDs) used in determining the design points. In terms of optimization, it is important to note that $\delta_{lb}$ is a function of both the design points and the optimization parameter $\beta$.
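Putting the pieces together, here is a sketch of the nugget lower bound and the predictor $\hat{y}(x^*)$ using $R_\delta$. The function names and toy data are illustrative assumptions (this is not the GPfit implementation), and the test of $\delta_{lb}$ uses a smaller threshold ($a=10$) than the recommended $a=25$ purely for demonstration.

```python
import numpy as np

def corr(A, B, beta):
    """Squared-exponential correlation between the rows of A and of B."""
    theta = 10.0 ** np.asarray(beta, dtype=float)
    diff2 = (np.asarray(A, float)[:, None, :] - np.asarray(B, float)[None, :, :]) ** 2
    return np.exp(-(diff2 * theta).sum(axis=2))

def nugget_lb(R, a=25.0):
    """delta_lb = max{ lambda_n (kappa - e^a) / (kappa (e^a - 1)), 0 }."""
    lam = np.linalg.eigvalsh(R)                      # ascending eigenvalues
    kappa = lam[-1] / lam[0]
    return max(lam[-1] * (kappa - np.exp(a)) / (kappa * (np.exp(a) - 1.0)), 0.0)

def gp_predict(xstar, X, y, beta, a=25.0):
    """Predictor y_hat(x*) = mu_hat + r' R_delta^{-1} (Y - 1 mu_hat)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    R = corr(X, X, beta)
    Rd = R + nugget_lb(R, a) * np.eye(n)             # R_delta = R + delta*I
    Rinv = np.linalg.inv(Rd)
    ones = np.ones(n)
    mu_hat = (ones @ Rinv @ y) / (ones @ Rinv @ ones)
    r = corr(np.atleast_2d(xstar), X, beta)[0]       # r_i(x*) = corr(z(x*), z(x_i))
    return mu_hat + r @ Rinv @ (y - mu_hat)

X = np.array([[0.0], [0.25], [0.5], [0.75], [1.0]])
y = np.sin(2 * np.pi * X[:, 0])
yhat = gp_predict([0.25], X, y, beta=[1.0])          # well-conditioned: delta_lb = 0
```

When R is well-conditioned, $\delta_{lb}=0$ and the predictor interpolates the design points exactly; when R is near-singular, adding $\delta_{lb} I$ pulls the condition number back down to the threshold $e^a$.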

# Motivation

In order to familiarize the reader with the methods that I will use to apply mathematics (particularly optimization) to everyday problems, I feel it somewhat necessary to provide a few posts that summarize my current research (methods, techniques, etc.) so that future posts don’t appear so peculiar. Here we go:

Computer simulators are useful tools for modelling the complexity of real world systems which are either impractical, too expensive or too time consuming to physically observe. For example, analysis of large physical systems such as the flow of oil in a reservoir, or the tidal energy generated by the tides of large ocean basins, can be achieved through the use of computer simulators. Suppose we wish to optimize the position of injector and production wells in an oil field, or similarly the positions of the tidal turbines in an ocean basin. Strategic planning of such procedures is crucial in order to maximize the net output of energy. Computer simulators, therefore, enable scientists and engineers to experiment with a numerical replica of such complex problems in order to determine the most effective and efficient solution(s).

This being said, realistic computer simulators can be computationally expensive to run, and are often emulated using statistical models, such as Gaussian Process (GP) models. Many mathematical, physical, engineering and geological processes employ the use of GP models as statistical surrogates to replace computationally expensive computer simulators. For example, computer simulations of the elastic deformation of cylinders resulting from high velocity impacts, the manufacturing and integration processes of electrical circuits for quality measures, or the estimation of the magnetic field generated near the Milky Way, can be computationally expensive and overly time consuming. Therefore, in each case a GP model was used as an efficient way to obtain an approximation to the corresponding simulator output.

The GP model requires as input a set of experimental design points and their corresponding simulator outputs. The simulator output is required to estimate the mean and variance of the Gaussian Process z(x), explained further in my next post, whereas the design points are used for calculating the spatial correlation matrix, R. There are many choices for the correlation matrix R, however, we will use the squared exponential correlation function. The components of R depend on both the spatial distribution of the design points, as well as a scaling hyper-parameter beta. In order for the GP model to predict the simulator output with a high level of accuracy we must determine the corresponding optimal beta values.

The maximum likelihood approach (which is the actual optimization problem at hand) determines the hyper-parameter beta that will minimize the deviance in our model estimates, as measured by the mean squared error (MSE). The quality of the GP model is quantified by a likelihood function, $\mathcal{L}_\beta$, which is the objective function that we seek to optimize. Gradient information of the likelihood function cannot be easily obtained, and the likelihood surface can often have multiple local optima, making the optimization problem challenging. Derivative-free optimization techniques such as the genetic algorithm (GA) are robust, but can be computationally intensive. Gradient approximation methods, such as BFGS, are generally more efficient, but have the potential to converge to a local optimum if poorly initialized. A common technique when implementing local optimizers is a multi-start algorithm, which essentially initiates a local optimizer at multiple points within our search domain, allowing for a more globalized search to be performed. Nonetheless, this method demands several full executions of BFGS, which is still computationally expensive.
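The multi-start idea is simple enough to sketch in a few lines. Below is a toy illustration in pure Python: a crude derivative-free local search (a stand-in for BFGS or IF, not either algorithm) launched from several starting points on a multimodal 1-D surface standing in for the likelihood. Everything here is an illustrative assumption.

```python
import math

def f(x):
    # Toy multimodal objective standing in for the likelihood surface;
    # its global minimum is near x ~ -0.51 with f ~ -0.97.
    return math.sin(3.0 * x) + 0.1 * x * x

def local_search(func, x, step=0.5, tol=1e-6):
    """Crude pattern search: try x +/- step, keep improvements, shrink step."""
    fx = func(x)
    while step > tol:
        improved = False
        for cand in (x - step, x + step):
            fc = func(cand)
            if fc < fx:
                x, fx, improved = cand, fc, True
        if not improved:
            step *= 0.5
    return x, fx

# Multi-start: launch the local search from evenly spaced points in [-5, 5]
# and keep the best result found.
starts = [-5.0 + i for i in range(11)]
best_x, best_f = min((local_search(f, s) for s in starts), key=lambda t: t[1])
```

A single start can easily be trapped in one of the shallow local basins; the multiple starts are what recover the global optimum, at the cost of running the local optimizer many times.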

In order to improve the efficiency of the likelihood optimization, we investigate the use of three new derivative-free optimization techniques. The primary technique is a hybridization of two algorithms: Dividing Rectangles (DIRECT), a partitioning algorithm, and Implicit Filtering (IF), a general pattern search algorithm that incorporates the use of a linear interpolant from which descent direction information is derived. The second technique replaces IF with BFGS in the DIRECT hybridization, which has its advantages, as the BFGS algorithm does not require bound constraints, whereas both DIRECT and IF are constrained optimizers. The third technique modifies a former clustering-based multi-start BFGS method, by replacing BFGS with IF, and introducing a function evaluation budget for a reduced number of cluster centres ($0.5 \times$ the dimension of the problem). We will compare the performance of the three techniques with that of the originally proposed genetic algorithm, and the multi-start BFGS method employed using $2 \times$ dimension $+ 1$ starting points. Analysis of both the likelihood value and the resulting MSE will determine each algorithm’s ability to provide an accurate model fit. Efficiency will be measured by both the time and the number of function evaluations required for convergence.

Posted in Optimization Algorithms

# Optimization of the GP Model

The negative log-likelihood, $\mathcal{L}_\beta$, is dependent upon several components, namely the mean and variance estimators $\hat{\mu}(\beta)$ and $\hat{\sigma}^2(\beta)$, as well as the inverse and determinant of $R_\delta$, all of which are dependent upon the hyper-parameter $\beta$. Recall that $\beta= (\beta_1,...,\beta_d)$ is a $d \times 1$ vector of parameters, with individual components $\beta_i$ for $i \in \{1,...,d\}$. Additionally, the nugget parameter $\delta$ depends on the condition number $\kappa(R)$, which is itself dependent upon $\beta$. For this reason, it is difficult, if not impossible, to extract analytic gradient information from $\mathcal{L}_\beta$. It follows that optimization methods that rely on the user providing an accurate expression for $\nabla \mathcal{L}_\beta$ are of no benefit. We can, however, provide numerical approximations to $\nabla \mathcal{L}_\beta$ through finite differencing, as is performed in the BFGS and Implicit Filtering (IF) algorithms. Such methods do not rely on an exact, user-supplied gradient of the objective function and are known as derivative-free optimization algorithms.

Even when using derivative-free optimization techniques, the optimization process remains challenging. The objective function, $\mathcal{L}_\beta$, is often very rough around the global optimum and can contain numerous local optima and flat regions. It is not uncommon for the likelihood value at these sub-optimal solutions to be close in value to that of the global optimum. However, the corresponding $\beta$ parameterization of these local optima can vary significantly, resulting in a poor-quality model fit. To ensure that the quality of the GP model is reliable, convergence to a highly precise global optimum is crucial, and thus a highly accurate and robust global optimization technique is required.

For a list of optimization algorithms please navigate to categories->Optimization Algorithms.

My current research is in numerical optimization techniques for efficient and robust optimization of the likelihood surface of a Gaussian process model, which we are using to emulate computationally expensive computer simulator output.

Broadly speaking, optimization is a mathematical technique used to determine the maximum or minimum value of a function. Optimization problems arise in almost every field, from mathematics, science and engineering, to finance and risk analysis. The general optimization problem is as follows: $\underset{x \in \Omega}{\min} \ f(x),$ where $f$ is the objective function, confined to the feasible region $\Omega$. The feasible region defines the search space for the optimization algorithm, parameterized by upper and lower bound constraints, as well as general equality and inequality constraints. We should note that in order to search for the maximum value, $\underset{x \in \Omega}{\max} \ f(x)$, one must simply negate the objective function $f$.
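The max-via-negation trick is worth a two-line demonstration. A brute-force grid check on a toy concave function (purely illustrative) confirms that maximizing $f$ and minimizing $-f$ pick out the same point:

```python
# argmax f(x) over a grid equals argmin -f(x) over the same grid.
xs = [i / 100.0 for i in range(-300, 301)]        # grid on [-3, 3]

def f(x):
    return -(x - 1.0) ** 2 + 4.0                  # unique peak of 4 at x = 1

x_max = max(xs, key=f)                            # direct maximization
x_min_neg = min(xs, key=lambda x: -f(x))          # minimize the negation
```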

An excellent 2-dimensional function for testing the performance of optimization algorithms. Notice the multiple local and global optima present.

My post-doc supervisor, Thomas Humphries, uses a simulated oil reservoir, implemented with the use of MATLAB’s Reservoir Simulation Toolbox (MRST). His primary goal is to find the optimum well positions and control parameters that will return the largest net present value (NPV) of the extracted oil over a certain forecast horizon. The NPV equates to the total amount of oil recovered minus the costs required to obtain the oil, such as the cost of injecting or disposing of water. As a result, he is experimenting with several global, gradient-free optimization techniques, in search of the algorithm that is best suited for this problem.

Of course, computer simulators are very useful tools for modelling or imitating the complexity of real world systems, such as the simulation of an oil reservoir, in which the continuously changing properties and sheer size of the field would be too cumbersome to physically observe. As a solution, we input vast amounts of historical data, statistical approximations and known parameters into a simulator in order to best replicate its behaviour. Unfortunately, computer simulators can be very expensive to run, and calling them multiple times can greatly reduce our computational performance. In a recent paper by Ranjan et al. (2012, Acadia University), a Gaussian process (GP) model, GPfit, was proposed in order to build a statistical surrogate that replaces having to make expensive calls to the simulator. This method involves optimizing the likelihood function which, in turn, minimizes the deviance in our approximation, in order to provide the best true function estimates. Ranjan proposed both a genetic algorithm (GA), implemented in Matlab, and a multi-start, gradient-based BFGS algorithm, implemented in R, to optimize the log-likelihood function. I am also currently looking into replacing the genetic algorithm with faster, more robust algorithms. We analyze the performance of the algorithms in terms of their ability to find a suitable global optimum within an optimal number of function evaluations (the classic law of diminishing returns approach). Stay tuned for posts that will discuss and further clarify the Gaussian process and optimization of a likelihood surface.

Gaussian Model fit of a 1-Dimensional function.

Until next time,

AB

Posted in Optimization Algorithms

# My First Post – About Me

### A Brief Personal History

I was born and raised in St. John’s, the capital and most easterly city in Newfoundland.  My grade 12 physics teacher at Holy Heart sparked my interest in both mathematics and physics, which persuaded me towards an Undergraduate B.Sc. (Hons) in Applied Mathematics and Physics at Memorial University of Newfoundland. I am currently working as a research assistant in the Mathematics and Statistics Faculty at Memorial, under the supervision of Dr. Ronald Haynes and Dr. Thomas Humphries. My current research is in numerical optimization techniques, as they pertain to optimization of the elegant likelihood function of a stochastic Gaussian process model. My current goals are to complete my research work-term placement as I prepare for the CFA Level I examination, with a tentative plan to pursue a M.Sc. in Quantitative/Mathematical Finance.

In my spare time I enjoy reading, with a rather broad spectrum of interests, from the social sciences, to quantum mechanics, to historical fiction.  My favorite authors include: Malcolm Gladwell, Michio Kaku, Richard Dawkins, John Grisham and Dan Brown, to name a few.  I am a soft rock and blues guitar enthusiast, though have come to realize that my days of striving to be a rock star are behind me. I tend to want to cram as much into every day as humanly possible, so when I’m not doing any of the above mentioned activities I will often find myself exercising, conquering the outdoors or enjoying time spent with family and friends.

If you are interested in Science, Mathematics, Physics or Computation please stay tuned as the blogging is about to begin!

Cheers,

Andrew Butler

Posted in Miscellaneous