# The Gaussian Process Model

Below I will describe the essential components of the Gaussian Process (GP) model, developed by Dr. Pritam Ranjan at Acadia University. It will quickly become clear why efficient and robust optimization of the model's parameters (particularly through the maximum likelihood approach) is so crucial to the performance and accuracy of the GP model.

The GP model requires as input a set of design points, $x_i = (x_{i1},...,x_{id})$, and the corresponding simulator outputs, $y_i=y(x_i)$, where $i = 1,...,n$, n is the number of user-supplied design points, and d is the dimension of the problem. We denote $X =\{x_1,...,x_n\}$ as the set of n design points, and $Y=y(X)=(y_1,...,y_n)'$ as the vector that holds the corresponding simulator output. The simulator output is thus modelled as
$y(x_i)=\mu +z(x_i),$ where $\mu$ is the overall mean, and $z(x_i)$ is a collection of random variables that have a joint normal distribution and thus form a Gaussian Process. For example, when the overall mean is $\mu = 0$, the random variables $z(x_i)$ represent the true function values at the locations $x_i$. The GP $z(x_i)$ is therefore a generalization of the Gaussian probability distribution, and is fully defined by its expected value, $E[z(x_i)]=0$, variance $\text{Var}[z(x_i)]=\sigma^2$, and covariance function $\text{Cov}[z(x_i),z(x_j)]=\sigma^2 R_{ij}$.

Here, R is the spatial correlation matrix, which defines the degree of dependency between any two design points, $x_i$ and $x_j$, based on their separation in the input space. There are many choices for the correlation matrix R; however, we will use the squared-exponential correlation function provided by Macdonald et al.,

$R_{ij} = \prod \limits_{k=1}^d \text{exp}\{-10^{\beta_k}|x_{ik}-x_{jk}|^2\} \quad \text{for all} \ i,j,$
where $\beta = (\beta_1,...,\beta_d)$ is a $d \times 1$ vector of scaling parameters. Determining the optimal $\beta$ scaling parameter is crucial for ensuring the integrity of the GP model fit. Originally, Dr. Ranjan proposed the use of the scaling parameter $\Theta$ in place of $10^{\beta}$, which confines the parameter space to $\Theta \in (0,\infty)^d$. However, optimal $\Theta$ values were often positioned at the boundary, $\Theta = 0^d$, making the optimization process challenging. We therefore proposed a re-parameterization of the $\Theta$-space by letting $\beta = \log_{10}(\Theta)$, so that $\beta \in (-\infty, \infty)^d$ \cite{P2}. This re-parameterization is preferred because it smooths the parameter space, centering it about $\beta = 0^d$, and ensures that potential $\beta$ solutions do not lie on a boundary.
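The correlation function above is straightforward to compute directly. The following is a minimal sketch (not Dr. Ranjan's implementation); the function name `correlation_matrix` is illustrative:

```python
import numpy as np

def correlation_matrix(X, beta):
    """Squared-exponential correlation R_ij = prod_k exp(-10^beta_k |x_ik - x_jk|^2).

    X    : (n, d) array of design points
    beta : (d,) array of scaling parameters (the log10 re-parameterization of Theta)
    """
    theta = 10.0 ** np.asarray(beta)               # back-transform: Theta = 10^beta
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2   # (n, n, d) squared coordinate differences
    return np.exp(-np.sum(theta * diff2, axis=-1))
```

By construction R is symmetric with unit diagonal, since each design point is perfectly correlated with itself.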

The GP model’s ability to predict the simulator output for an unknown point $x^*$ is directly related to the underlying properties of the correlation matrix R. Prediction error at any point $x^*$ depends on the distance between that point and nearby design points. Therefore, the GP model will predict accurately when the distance between $x^*$ and nearby design points is small, whereas the prediction error will be larger for predictor points that are far away from the design points. The hyper-parameter $\beta$ can therefore be interpreted as a measure of the sensitivity of $y(x_i)$ to the spatial distribution $|x_{ik}-x_{jk}|^2$ for all $i, j \in \{1,...,n\}$ and $k \in \{1,...,d\}.$
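This sensitivity interpretation can be illustrated numerically: for two points a fixed distance apart, increasing $\beta_k$ makes the correlation decay sharply, so nearby outputs are treated as nearly independent; decreasing it makes them nearly perfectly correlated. A small demonstration, in one dimension:

```python
import numpy as np

# Two points a fixed distance 0.5 apart in one dimension.
h2 = 0.5 ** 2
for beta in (-2.0, 0.0, 2.0):
    corr = np.exp(-10.0 ** beta * h2)
    print(f"beta = {beta:+.0f}  ->  corr = {corr:.4f}")
    # beta = -2 gives corr close to 1 (strong dependence);
    # beta = +2 gives corr close to 0 (near independence).
```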

The model uses the closed-form estimators of the mean, $\mu$, and variance, $\sigma^2$, given by
$\hat{\mu}(\beta)=(1_n'R^{-1}1_n)^{-1}(1_n'R^{-1}Y) \ \text{and} \ \hat{\sigma}^2(\beta) = \frac{(Y-1_n\hat{\mu}(\beta))'R^{-1}(Y-1_n\hat{\mu}(\beta))}{n}.$
Here $1_n$ is an $n \times 1$ vector of all ones. Calculation of the mean and variance estimators, along with the repeated update of the inverse and determinant of the correlation matrix, $R^{-1}$ and $|R|$ respectively, allows us to derive the profile log-likelihood
$\mathcal{L}_\beta = -2\log{(L_\beta)} \ \propto \ \log{(|R|)} + n\log{[(Y-1_n\hat{\mu}(\beta))'R^{-1}(Y-1_n \hat{\mu}(\beta))]}.$
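Putting the estimators and the profile log-likelihood together, the objective that must be minimized over $\beta$ can be sketched as follows. This is a minimal illustration, not Dr. Ranjan's implementation; the function name `profile_neg_log_likelihood` is an assumption:

```python
import numpy as np

def profile_neg_log_likelihood(beta, X, Y):
    """Profile log-likelihood log|R| + n log[(Y - 1 mu_hat)' R^{-1} (Y - 1 mu_hat)],
    with mu_hat profiled out in closed form."""
    n = len(Y)
    theta = 10.0 ** np.asarray(beta)
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2
    R = np.exp(-np.sum(theta * diff2, axis=-1))
    ones = np.ones(n)
    # The Cholesky factor gives a numerically stable route to log|R|.
    L = np.linalg.cholesky(R)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    # Closed-form mean estimator: mu_hat = (1' R^{-1} 1)^{-1} (1' R^{-1} Y).
    mu_hat = (ones @ np.linalg.solve(R, Y)) / (ones @ np.linalg.solve(R, ones))
    resid = Y - mu_hat * ones
    quad = resid @ np.linalg.solve(R, resid)
    return log_det + n * np.log(quad)
```

Minimizing this quantity over $\beta$ (e.g. with a generic optimizer) yields the maximum likelihood fit; each candidate $\beta$ requires rebuilding R and re-solving the linear systems, which is why efficient optimization matters.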

The GP model will provide as output the simulator estimate, $\hat{y}(x^*)$, for a desired predictor point $x^*$. Through the maximum likelihood approach, the best function estimate for a given point $x^*$ is
$\hat{y}(x^*) = \hat{\mu}+r'R^{-1}(Y-1_n\hat{\mu}) = \left[\frac{(1-r'R^{-1}1_n)}{{1_n}'R^{-1}1_n} {1_n}' + r'\right]R^{-1}Y = C'Y,$
where $r=[r_1(x^*),...,r_n(x^*)]'$, and $r_i(x^*)=\text{corr}[z(x^*),z(x_i)]$ is the correlation between $z(x^*)$ and $z(x_i)$. The GP model also outputs the uncertainty in this estimate, $s^2(x^*)$, as measured by the mean squared error (MSE),
$s^2(x^*)=E \left[ ( \hat{y}(x^*)- {y}(x^*))^2\right] = \hat{\sigma}^2 (1-2C'r+C'RC).$

## Introducing a Nugget Parameter

Typically, a GP model attempts to find the best-fit interpolant of the user-supplied design points and their corresponding simulator outputs. The maximum likelihood approach then enables the user to estimate the corresponding simulator values for any number of predictor points. Algorithmic searching for the optimal $\beta$ parameter requires a repeated update of the correlation matrix R for each potentially optimal $\beta$. As shown above, the profile likelihood requires accurate computation of both the inverse and determinant of R. However, if any pair of design points are excessively close in space, the matrix R can become near-singular, resulting in unstable computation of the inverse and determinant of R, which can lead to an unreliable model fit. To overcome this instability, we follow a popular technique which adds a small “nugget” parameter, $\delta$, to the model-fitting procedure, replacing R with $R_\delta = R +\delta I$. The addition of the nugget parameter smooths the model predictor, so the GP no longer produces an exact interpolant of the design points. To avoid unnecessary over-smoothing of the model predictions, a lower bound on the nugget parameter is introduced, given by
$\delta_{lb}=\max\left\{\frac{\lambda_n(\kappa(R) -e^a)}{\kappa(R)(e^a-1)},0\right\}.$
Here, the condition number is given by $\kappa(R)=\frac{\lambda_n}{\lambda_1}$, where $\lambda_1 \leq \lambda_2 \leq ...\leq \lambda_n$ are the eigenvalues of R, and $e^a$ is the desired threshold for $\kappa(R)$.
We define an $n \times n$ matrix R to be near-singular if its condition number $\kappa(R)=\|R\|\cdot \|R^{-1}\|$ exceeds the threshold value $e^a$, where $\|\cdot\|$ denotes the matrix $L_2$-norm. A value of $a=25$ is recommended for the space-filling Latin hypercube designs (LHDs) used in determining the design points. In terms of optimization, it is important to note that $\delta_{lb}$ is a function of both the design points and the optimization parameter $\beta$.
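The pieces above can be combined into a short sketch of nugget-stabilized prediction. This is a minimal illustration under the definitions in this section, not Dr. Ranjan's code; the function names `nugget_lower_bound` and `gp_predict` are assumptions:

```python
import numpy as np

def nugget_lower_bound(R, a=25.0):
    """delta_lb = max{ lambda_n (kappa(R) - e^a) / (kappa(R) (e^a - 1)), 0 },
    with kappa(R) = lambda_n / lambda_1."""
    eig = np.linalg.eigvalsh(R)        # ascending: lambda_1 <= ... <= lambda_n
    kappa = eig[-1] / eig[0]
    ea = np.exp(a)
    return max(eig[-1] * (kappa - ea) / (kappa * (ea - 1.0)), 0.0)

def gp_predict(x_star, X, Y, beta, delta):
    """Best function estimate y_hat(x*) = mu_hat + r' R_delta^{-1} (Y - 1 mu_hat)."""
    theta = 10.0 ** np.asarray(beta)
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2
    R_delta = np.exp(-np.sum(theta * diff2, axis=-1)) + delta * np.eye(len(Y))
    r = np.exp(-np.sum(theta * (X - x_star) ** 2, axis=-1))  # corr(z(x*), z(x_i))
    ones = np.ones(len(Y))
    mu_hat = (ones @ np.linalg.solve(R_delta, Y)) / (ones @ np.linalg.solve(R_delta, ones))
    return mu_hat + r @ np.linalg.solve(R_delta, Y - mu_hat * ones)
```

With $\delta = 0$ and a well-conditioned R the predictor interpolates the design points exactly; when $\kappa(R)$ exceeds $e^a$, substituting $\delta = \delta_{lb}$ restores stable linear solves at the cost of a slightly smoothed fit.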