## Subset of Regressors Approximation for GPR Models

The subset of regressors (SR) approximation method consists of replacing the kernel function $k\left(x,{x}_{r}|\theta \right)$ in the exact GPR method by its approximation ${\stackrel{^}{k}}_{SR}\left(x,{x}_{r}|\theta ,\mathcal{A}\right)$, given the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$. You can specify the SR method for parameter estimation by using the `'FitMethod','sr'` name-value pair argument in the call to `fitrgp`. For prediction using SR, you can use the `'PredictMethod','sr'` name-value pair argument in the call to `fitrgp`.

### Approximating the Kernel Function

For the exact GPR model, the expected prediction in GPR depends on the set of n functions ${\mathcal{S}}_{\mathcal{N}}=\left\{k\left(x,{x}_{i}|\theta \right),i=1,2,\dots ,n\right\}$, where $\mathcal{N}=\left\{1,2,...,n\right\}$ is the set of indices of all observations, and n is the total number of observations. The idea is to approximate the span of these functions by a smaller set of functions, ${\mathcal{S}}_{\mathcal{A}}$, where $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ is the subset of indices of points selected to be in the active set. Consider ${\mathcal{S}}_{\mathcal{A}}=\left\{k\left(x,{x}_{j}|\theta \right),j\in \mathcal{A}\right\}$. The aim is to approximate the elements of ${\mathcal{S}}_{\mathcal{N}}$ as linear combinations of the elements of ${\mathcal{S}}_{\mathcal{A}}$.

Suppose the approximation to $k\left(x,{x}_{r}|\theta \right)$ using the functions in ${\mathcal{S}}_{\mathcal{A}}$ is as follows:

`$\stackrel{^}{k}\left(x,{x}_{r}|\theta \right)=\sum _{j\in \mathcal{A}}{c}_{jr}k\left(x,{x}_{j}|\theta \right),$`

where ${c}_{jr}\in ℝ$ are the coefficients of the linear combination for approximating $k\left(x,{x}_{r}|\theta \right)$. Suppose $C$ is the matrix that contains all the coefficients ${c}_{jr}$. Then $C$ is a $|\mathcal{A}|×n$ matrix such that $C\left(j,r\right)={c}_{jr}$. The software finds the best approximation to the elements of ${\mathcal{S}}_{\mathcal{N}}$ using the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ by minimizing the error function

`$E\left(\mathcal{A},C\right)=\sum _{r=1}^{n}{‖k\left(x,{x}_{r}|\theta \right)-\stackrel{^}{k}\left(x,{x}_{r}|\theta \right)‖}_{ℋ}^{2},$`

where $ℋ$ is the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel function k [1], [2].

The coefficient matrix that minimizes $E\left(\mathcal{A},C\right)$ is

`${\stackrel{^}{C}}_{\mathcal{A}}=K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right),$`

and an approximation to the kernel function using the elements in the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ is

`$\stackrel{^}{k}\left(x,{x}_{r}|\theta \right)=\sum _{j\in \mathcal{A}}k\left(x,{x}_{j}|\theta \right){\stackrel{^}{C}}_{\mathcal{A}}\left(j,r\right).$`

The SR approximation to the kernel function using the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ is defined as:

`${\stackrel{^}{k}}_{SR}\left(x,{x}_{r}|\theta ,\mathcal{A}\right)=K\left({x}^{T},{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},{x}_{r}^{T}|\theta \right),$`

and the SR approximation to $K\left(X,X|\theta \right)$ is:

`${\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)=K\left(X,{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right).$`
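As an illustrative sketch of this low-rank construction (not the `fitrgp` implementation), the following NumPy code builds the SR approximation to the kernel matrix for a squared exponential kernel. The kernel choice, the data, and the active-set selection are all hypothetical:

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=1.0):
    """Squared exponential kernel K(A, B); rows of A and B are observations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))      # hypothetical training inputs, n = 50
active = np.arange(10)            # hypothetical active set A, |A| = 10
XA = X[active]

K_XA = se_kernel(X, XA)           # K(X, X_A | theta), an n-by-|A| matrix
K_AA = se_kernel(XA, XA)          # K(X_A, X_A | theta)

# SR approximation: K_hat = K(X, X_A) K(X_A, X_A)^{-1} K(X_A, X)
K_sr = K_XA @ np.linalg.solve(K_AA, K_XA.T)

K_full = se_kernel(X, X)
# The approximation reproduces the kernel exactly on the active-set block.
print(np.allclose(K_sr[np.ix_(active, active)], K_full[np.ix_(active, active)]))  # True
```

Note that the approximation is exact on rows and columns indexed by the active set, since the low-rank factors pass through those points.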

### Parameter Estimation

Replacing $K\left(X,X|\theta \right)$ by ${\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)$ in the marginal log likelihood function produces its SR approximation:

`$\begin{array}{ll}\mathrm{log}{P}_{SR}\left(y|X,\beta ,\theta ,{\sigma }^{2},\mathcal{A}\right)=\hfill & -\frac{1}{2}{\left(y-H\beta \right)}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}\left(y-H\beta \right)\hfill \\ \hfill & -\frac{n}{2}\mathrm{log}2\pi -\frac{1}{2}\mathrm{log}|{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}|\hfill \end{array}$`

As in the exact method, the software estimates the parameters by first computing $\stackrel{^}{\beta }\left(\theta ,{\sigma }^{2}\right)$, the optimal estimate of $\beta$, given $\theta$ and ${\sigma }^{2}$. Then it estimates $\theta$ and ${\sigma }^{2}$ using the $\beta$-profiled marginal log likelihood. The SR estimate of $\beta$ for given $\theta$ and ${\sigma }^{2}$ is:

`${\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right)={\left[\underset{*}{\underbrace{{H}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}H}}\right]}^{-1}\underset{**}{\underbrace{{H}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}y}},$`

where

`$\begin{array}{l}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}=\frac{{I}_{n}}{{\sigma }^{2}}-\frac{K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)}{{\sigma }^{2}},\\ {A}_{\mathcal{A}}=K\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)+\frac{K\left({X}_{\mathcal{A}},X|\theta \right)K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}},\\ *=\frac{{H}^{T}H}{{\sigma }^{2}}-\frac{{H}^{T}K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)H}{{\sigma }^{2}},\\ **=\frac{{H}^{T}y}{{\sigma }^{2}}-\frac{{H}^{T}K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)y}{{\sigma }^{2}}.\end{array}$`

And the SR approximation to the $\beta$-profiled marginal log likelihood is:

`$\begin{array}{l}\mathrm{log}{P}_{SR}\left(y|X,{\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right),\theta ,{\sigma }^{2},\mathcal{A}\right)=\\ \begin{array}{ll}\hfill & -\frac{1}{2}{\left(y-H{\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right)\right)}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}\left(y-H{\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right)\right)\hfill \\ \hfill & -\frac{n}{2}\mathrm{log}2\pi -\frac{1}{2}\mathrm{log}|{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}|.\hfill \end{array}\end{array}$`

### Prediction

The SR approximation to the distribution of ${y}_{new}$ given $y$, $X$, ${x}_{new}$ is

`$P\left({y}_{new}|y,X,{x}_{new}\right)=\mathcal{N}\left({y}_{new}|h{\left({x}_{new}\right)}^{T}\beta +{\mu }_{SR},{\sigma }_{new}^{2}+{\Sigma }_{SR}\right),$`

where ${\mu }_{SR}$ and ${\Sigma }_{SR}$ are the SR approximations to $\mu$ and $\Sigma$ shown in prediction using the exact GPR method.

${\mu }_{SR}$ and ${\Sigma }_{SR}$ are obtained by replacing $k\left(x,{x}_{r}|\theta \right)$ by its SR approximation ${\stackrel{^}{k}}_{SR}\left(x,{x}_{r}|\theta ,\mathcal{A}\right)$ in $\mu$ and $\Sigma$, respectively.

That is,

`${\mu }_{SR}={\stackrel{^}{K}}_{SR}\left({x}_{new}^{T},X|\theta ,\mathcal{A}\right){\left({\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right)}^{-1}\left(y-H\beta \right).$`

Since

`${\stackrel{^}{K}}_{SR}\left({x}_{new}^{T},X|\theta ,\mathcal{A}\right)=K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right),$`

and from the matrix identity for ${\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}$ given in Parameter Estimation, ${\mu }_{SR}$ can be written as

`${\mu }_{SR}=\frac{K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right)\left(y-H\beta \right).$`

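As a numerical sanity check (hypothetical kernel and data; $H\beta$ taken as zero for simplicity), the following sketch confirms that the simplified expression for ${\mu }_{SR}$ in terms of ${A}_{\mathcal{A}}$ agrees with the definition:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma2 = 30, 6, 0.3
X = rng.normal(size=(n, 1))            # hypothetical training inputs
XA = X[:m]                             # hypothetical active-set inputs X_A
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
Hbeta = np.zeros(n)                    # H*beta, taken as 0 for this sketch
x_new = np.array([[0.2]])

def k(A, B):
    # squared exponential kernel with unit parameters (illustrative)
    return np.exp(-0.5 * (A - B.T) ** 2)

K_XA, K_AA, K_newA = k(X, XA), k(XA, XA), k(x_new, XA)
K_sr = K_XA @ np.linalg.solve(K_AA, K_XA.T)            # K_hat_SR(X, X)
K_sr_newX = K_newA @ np.linalg.solve(K_AA, K_XA.T)     # K_hat_SR(x_new, X)

# Definition: mu_SR = K_hat_SR(x_new, X) (K_hat_SR + s2 I)^{-1} (y - H beta)
mu_def = K_sr_newX @ np.linalg.solve(K_sr + sigma2 * np.eye(n), y - Hbeta)

# Simplified form: mu_SR = K(x_new, X_A)/s2 @ A_A^{-1} @ K(X_A, X)(y - H beta)
A_A = K_AA + K_XA.T @ K_XA / sigma2
mu_simp = (K_newA / sigma2) @ np.linalg.solve(A_A, K_XA.T @ (y - Hbeta))

print(np.allclose(mu_def, mu_simp))    # True
```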
Similarly, ${\Sigma }_{SR}$ is derived as follows:

`${\Sigma }_{SR}=\underset{*}{\underbrace{{\stackrel{^}{k}}_{SR}\left({x}_{new},{x}_{new}|\theta ,\mathcal{A}\right)}}-\underset{**}{\underbrace{{\stackrel{^}{K}}_{SR}\left({x}_{new}^{T},X|\theta ,\mathcal{A}\right)}}\underset{***}{\underbrace{{\left({\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right)}^{-1}}}\underset{****}{\underbrace{{\stackrel{^}{K}}_{SR}\left(X,{x}_{new}^{T}|\theta ,\mathcal{A}\right)}}.$`

Because

`${\stackrel{^}{k}}_{SR}\left({x}_{new},{x}_{new}|\theta ,\mathcal{A}\right)=K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},{x}_{new}^{T}|\theta \right),$`

and the remaining terms simplify using the expression for ${A}_{\mathcal{A}}$ given in Parameter Estimation, ${\Sigma }_{SR}$ is found as follows:

`${\Sigma }_{SR}=K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right){A}_{\mathcal{A}}^{-1}K\left({X}_{\mathcal{A}},{x}_{new}^{T}|\theta \right).$`
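The same kind of numerical check applies to the variance. This sketch (hypothetical kernel and data, not the `fitrgp` implementation) verifies that the simplified ${\Sigma }_{SR}$ matches the definition with the starred terms:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, sigma2 = 30, 6, 0.3
X = rng.normal(size=(n, 1))            # hypothetical training inputs
XA = X[:m]                             # hypothetical active-set inputs X_A
x_new = np.array([[0.2]])

def k(A, B):
    # squared exponential kernel with unit parameters (illustrative)
    return np.exp(-0.5 * (A - B.T) ** 2)

K_XA, K_AA, K_newA = k(X, XA), k(XA, XA), k(x_new, XA)
K_sr = K_XA @ np.linalg.solve(K_AA, K_XA.T)            # K_hat_SR(X, X)
K_sr_newX = K_newA @ np.linalg.solve(K_AA, K_XA.T)     # K_hat_SR(x_new, X)
k_sr_newnew = K_newA @ np.linalg.solve(K_AA, K_newA.T) # k_hat_SR(x_new, x_new)

# Definition: Sigma_SR = k_hat_SR(new,new) - K_hat_SR(new,X)(K_hat_SR + s2 I)^{-1} K_hat_SR(X,new)
S_def = k_sr_newnew - K_sr_newX @ np.linalg.solve(K_sr + sigma2 * np.eye(n), K_sr_newX.T)

# Simplified form: Sigma_SR = K(x_new, X_A) A_A^{-1} K(X_A, x_new)
A_A = K_AA + K_XA.T @ K_XA / sigma2
S_simp = K_newA @ np.linalg.solve(A_A, K_newA.T)

print(np.allclose(S_def, S_simp))      # True
```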

### Predictive Variance Problem

One of the disadvantages of the SR method is that it can give unreasonably small predictive variances when making predictions in a region far away from the chosen active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$. Consider making a prediction at a new point ${x}_{new}$ that is far away from the training set $X$. In other words, assume that $K\left({x}_{new}^{T},X|\theta \right)\approx 0$.

For exact GPR, the posterior distribution of ${f}_{new}$ given $y$, $X$ and ${x}_{new}$ would be Normal with mean $\mu =0$ and variance $\Sigma =k\left({x}_{new},{x}_{new}|\theta \right)$. This value is correct in the sense that, if ${x}_{new}$ is far from $X$, then the data $\left(X,y\right)$ does not supply any new information about ${f}_{new}$, and so the posterior distribution of ${f}_{new}$ given $y$, $X$, and ${x}_{new}$ should reduce to the prior distribution of ${f}_{new}$ given ${x}_{new}$, which is a Normal distribution with mean $0$ and variance $k\left({x}_{new},{x}_{new}|\theta \right)$.

For the SR approximation, if ${x}_{new}$ is far away from $X$ (and hence also far away from ${X}_{\mathcal{A}}$), then ${\mu }_{SR}=0$ and ${\Sigma }_{SR}=0$. Thus in this extreme case, ${\mu }_{SR}$ agrees with $\mu$ from exact GPR, but ${\Sigma }_{SR}$ is unreasonably small compared to $\Sigma$ from exact GPR.
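This degenerate behavior is easy to demonstrate numerically. In the following hypothetical setup (illustrative kernel and data), the SR predictive variance collapses to zero at a far-away test point, while the exact prior variance stays at $k\left({x}_{new},{x}_{new}|\theta \right)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, sigma2 = 30, 6, 0.3
X = rng.normal(size=(n, 1))             # hypothetical training inputs near 0
XA = X[:m]                              # hypothetical active-set inputs X_A
x_far = np.array([[50.0]])              # test point far from all training data

def k(A, B):
    # squared exponential kernel with unit parameters (illustrative)
    return np.exp(-0.5 * (A - B.T) ** 2)

K_XA, K_AA, K_farA = k(X, XA), k(XA, XA), k(x_far, XA)
A_A = K_AA + K_XA.T @ K_XA / sigma2

# Exact GPR reverts to the prior variance k(x_far, x_far) far from the data
exact_var = k(x_far, x_far)[0, 0]       # = 1.0 for this kernel

# SR variance: K(x_far, X_A) A_A^{-1} K(X_A, x_far) vanishes far from X_A
sr_var = (K_farA @ np.linalg.solve(A_A, K_farA.T))[0, 0]

print(exact_var)        # 1.0
print(sr_var < 1e-10)   # True: unreasonably small compared to the exact variance
```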

The fully independent conditional (FIC) approximation method can help avoid this problem.

### References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.

[2] Smola, A. J. and B. Schölkopf. "Sparse greedy matrix approximation for machine learning." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.