using Pkg
Pkg.activate(".")  # Activating project at `~/gitrepos/machine-learning/MLJGP.jl/notebooks`
The following is based on the book Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams. You can find the free online book here.1
1 Carl Edward Rasmussen and Christopher K. I. Williams; The MIT Press, 2006. ISBN 0-262-18253-X.
To explain the Gaussian Process model for regression, let’s first consider a motivating example: linear regression. We will use this guiding example to derive GPR from a weight-space view. After this derivation, we will present a simpler, but more abstract, derivation using a function-space view.
using LinearAlgebra
using Distributions
using Random
using Plots

First let’s set up some data we can use for training:
We consider a dataset \mathcal{D} with n observations \begin{equation} \mathcal{D} = \Big\{ (\mathbf{x}_i, y_i) \;\Big\vert \; i = 1,...,n\Big\} \end{equation}
Linear regression is easily understood in terms of linear algebra. We therefore collect our dataset \mathcal{D} into a D \times n dimensional Design Matrix. Note that we have used a transposed definition (features are rows, records are columns) as Julia is a column-major language (like Matlab & Fortran).
\begin{equation} X := \begin{pmatrix} \vdots & \vdots & & \vdots \\ \mathbf{x}_1 & \mathbf{x}_2 & ... & \mathbf{x}_n \\ \vdots & \vdots & & \vdots \end{pmatrix} \end{equation}
and our targets into a target vector
\begin{equation} \mathbf{y} := (y_1, ..., y_n) \end{equation} so that the full training set becomes \begin{equation} \mathcal{D} := (X, \mathbf{y}) \end{equation}
Standard linear regression is a model of the form \begin{equation} f(\mathbf{x}) = \mathbf{x}^T\mathbf{w} \end{equation} where \mathbf{w} is the D-dimensional vector of weights. By minimizing the mean-squared-error between our model and targets, one can show that the optimal weights are given by \begin{equation} \mathbf{w} = (XX^T)^{-1}X\mathbf{y} \end{equation}
This can also be easily obtained geometrically by finding the vector with the shortest distance to the hyperplane defined by the column space of X. This corresponds to solving the normal equations \begin{equation} XX^T \mathbf{w} = X\mathbf{y} \end{equation}
The following demonstrates this procedure on a simple dataset
X = rand(200)                                 # n = 200 scalar inputs
y = 0.3 .* X .+ 0.25 .* (rand(200) .- 0.5)    # linear trend plus uniform noise
X = X'                                        # design matrix: 1 × 200 (features × points)
w = (X*X')\(X*y)                              # solve the normal equations XXᵀw = Xy
ypred = X'*w                                  # model predictions
plot(X', y,
seriestype=:scatter,
color=:red, alpha=0.25,
label="data")
plot!(X', ypred, color=:blue, label="fit")
xlabel!("x")
ylabel!("y")
title!("Linear Regression in 1-Dimension")

We can also fit a y-intercept (aka bias) by augmenting the design matrix X to contain an extra row of all 1’s, i.e. \begin{equation}
X[D+1, :] = (1, ..., 1)
\end{equation}
Standard linear regression assumes that our data \mathcal{D} are perfect, but we can clearly see that the above data are noisy. To account for this, we make our model Bayesian by augmenting it to consider measurement error. We define \begin{align} f(\mathbf{x}) &= \mathbf{x}^T\mathbf{w} \\ \mathbf{y} &= f(\mathbf{x}) + \mathbf{\epsilon} \\ \mathbf{\epsilon} &\sim \mathcal{N}(0, \sigma_n^2) \end{align} or, in words, our observed values differ from the truth by independent, identically distributed Gaussian noise with mean 0 and variance \sigma_n^2. The assumption that the noise is i.i.d. is critical because it allows us to simplify the likelihood function by factoring it into individual contributions from each datapoint: \begin{align} p(\mathbf{y}\vert X,\mathbf{w}) &:= \prod\limits_i^n p(\mathbf{y}_i \vert \mathbf{x}_i, \mathbf{w}) \\ &= \prod\limits_i^n \frac{1}{\sqrt{2\pi\sigma_n^2}}\exp\left( -\dfrac{(\mathbf{y}_i-\mathbf{x}_i^T\mathbf{w})^2}{2\sigma_n^2}\right)\\ &= \dfrac{1}{(2\pi\sigma_n^2)^{n/2}}\exp\left( -\frac{1}{2\sigma_n^2}\lvert \mathbf{y} - X^T\mathbf{w}\rvert^2 \right) \\ &= \mathcal{N}\left(X^T\mathbf{w}, \sigma_n^2I\right) \end{align}
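As a quick numerical sanity check of this factorization, we can compare the product of per-point Gaussian densities with the single multivariate Gaussian. This is a sketch using Distributions.jl; the data and parameter values here are made up for illustration:

```julia
using Distributions, LinearAlgebra, Random

Random.seed!(1)
σn = 0.3
X = rand(2, 5)                        # D = 2 features, n = 5 points (as columns)
w = [0.5, -1.2]
y = X' * w .+ σn .* randn(5)

# likelihood as a product of independent 1-D Gaussians
p_indep = prod(pdf(Normal(dot(X[:, i], w), σn), y[i]) for i in 1:5)

# the same likelihood as one multivariate Gaussian N(Xᵀw, σn²I)
p_joint = pdf(MvNormal(X' * w, σn^2 * Matrix(I, 5, 5)), y)

p_indep ≈ p_joint   # true
```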
To perform inference with this updated model, we apply Bayes’ rule, that is:
\begin{equation} p(\mathbf{w}\vert \mathbf{y}, X) = \dfrac{p(\mathbf{y}\vert X, \mathbf{w})p(\mathbf{w})}{p(\mathbf{y}\vert X)} \end{equation} where p(\mathbf{y}\vert X, \mathbf{w}) is the likelihood, p(\mathbf{w}) is our prior over the weights, and p(\mathbf{y}\vert X) is the marginal likelihood, which normalizes the posterior.
It is now that the utility of choosing Gaussian distributions for our likelihood and prior becomes clear… \begin{align} p(\mathbf{w}\vert\mathbf{y},X) &\propto \exp\left(-\frac{1}{2\sigma_n^2}(\mathbf{y}-X^T\mathbf{w})^T(\mathbf{y}-X^T\mathbf{w}) \right)\exp\left(-\frac{1}{2}\mathbf{w}^T\Sigma_p^{-1}\mathbf{w}\right) \end{align} Taking the log and expanding leads to \begin{align} \log p(\mathbf{w}\vert \mathbf{y}, X)&= -\frac{1}{2}\left[ \frac{1}{\sigma_n^2}\mathbf{y}^T\mathbf{y} - \frac{1}{\sigma_n^2}\mathbf{y}^TX^T\mathbf{w} - \frac{1}{\sigma_n^2}\mathbf{w}^TX\mathbf{y} + \frac{1}{\sigma_n^2}\mathbf{w}^TXX^T\mathbf{w} + \mathbf{w}^T\Sigma_p^{-1}\mathbf{w}\right] + \text{const} \\ &= -\frac{1}{2}\left[ \mathbf{w}^T\left(\frac{1}{\sigma_n^2}XX^T+\Sigma_p^{-1}\right)\mathbf{w} -\left(\frac{1}{\sigma_n^2}\mathbf{y}^TX^T\right)\mathbf{w} - \mathbf{w}^T\left(\frac{1}{\sigma_n^2}X\mathbf{y}\right) + \frac{1}{\sigma_n^2}\mathbf{y}^T\mathbf{y}\right] + \text{const} \\ &= -\frac{1}{2}\left(\mathbf{w}^TA\mathbf{w} - B^T\mathbf{w} - \mathbf{w}^TB + C\right) + \text{const} \end{align} where we have defined \begin{align} A &:= \frac{1}{\sigma_n^2}XX^T + \Sigma_p^{-1} \\ B &:= \frac{1}{\sigma_n^2}X\mathbf{y} \\ C &:= \frac{1}{\sigma_n^2}\mathbf{y}^T\mathbf{y} \end{align}
Now we can complete the square so that \begin{equation} \mathbf{w}^TA\mathbf{w} - B^T\mathbf{w} - \mathbf{w}^TB + C = \left(\mathbf{w} - \bar{\mathbf{w}} \right)^TA\left(\mathbf{w} - \bar{\mathbf{w}} \right) + K \end{equation} leading to \begin{align} \bar{\mathbf{w}} &= A^{-1}B = \frac{1}{\sigma_n^2}\left(\frac{1}{\sigma_n^2}XX^T + \Sigma_p^{-1}\right)^{-1}X\mathbf{y} \\ K &= C- \bar{\mathbf{w}}^TA\bar{\mathbf{w}} \end{align} Since K does not depend on \mathbf{w} directly, it may be absorbed into the normalization of p(\mathbf{w}\vert \mathbf{y}, X). Thus we are left with
\begin{align} p(\mathbf{w}\vert\mathbf{y},X) &= \mathcal{N}\left( \bar{\mathbf{w}}=\frac{1}{\sigma_n^2}A^{-1}X\mathbf{y}, \Sigma=A^{-1}\right) \\ A &= \frac{1}{\sigma_n^2}XX^T+\Sigma_p^{-1} \end{align}
This result gives us the Gaussian distribution over the space of possible parameter vectors \mathbf{w}. To use this distribution to make predictions, consider a newly supplied test point \mathbf{x}_*. We want to find \begin{equation} p(y_* \vert \mathbf{x}_*, \mathbf{y}, X) \end{equation}
We do this by marginalizing over our weight distribution, i.e. \begin{equation} p(y_* \vert \mathbf{x}_*, \mathbf{y}, X) = \int_{\mathbf{w}} p(y_*\vert \mathbf{x}_*,\mathbf{w})p(\mathbf{w}\vert \mathbf{y}, X)d\mathbf{w} \end{equation} The integrand is the product of two Gaussians, so the integral is itself Gaussian. To find the mean and covariance of the predictive distribution, we check \begin{align} \bar{y}_* &= \mathbb{E}[y_*] = \mathbb{E}[\mathbf{x}_*^T\mathbf{w}] = \mathbf{x}_*^T\mathbb{E}[\mathbf{w}] = \mathbf{x}_*^T\bar{\mathbf{w}} \\ \text{Cov}(y_*) &= \mathbb{E}[(y_*-\bar{y}_*)(y_*-\bar{y}_*)^T] \\ &= \mathbb{E}[(\mathbf{x}_*^T\mathbf{w}-\mathbf{x}_*^T\bar{\mathbf{w}})(\mathbf{x}_*^T\mathbf{w}-\mathbf{x}_*^T\bar{\mathbf{w}})^T] \\ &= \mathbb{E}[\mathbf{x}_*^T(\mathbf{w}-\bar{\mathbf{w}})(\mathbf{w}-\bar{\mathbf{w}})^T\mathbf{x}_*] \\ &= \mathbf{x}_*^T\mathbb{E}[(\mathbf{w}-\bar{\mathbf{w}})(\mathbf{w}-\bar{\mathbf{w}})^T]\mathbf{x}_* \\ &= \mathbf{x}_*^T\text{Cov}(\mathbf{w})\mathbf{x}_* \\ &= \mathbf{x}_*^TA^{-1}\mathbf{x}_* \end{align} so that \begin{equation} \boxed{p(y_* \vert \mathbf{x}_*, \mathbf{y}, X) = \mathcal{N}\left(\mathbf{x}_*^T\bar{\mathbf{w}},\; \mathbf{x}_*^TA^{-1}\mathbf{x}_*\right)} \end{equation}
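The boxed result is straightforward to implement directly. Below is a minimal sketch on synthetic data (all names and values here are illustrative choices, not from the text):

```julia
using LinearAlgebra, Random

Random.seed!(42)
D, n = 2, 100
σn = 0.1
Σp = Matrix(1.0I, D, D)              # prior covariance over the weights
X = rand(D, n)                       # design matrix (features × points)
w_true = [0.7, -0.3]
y = X' * w_true .+ σn .* randn(n)

# posterior over the weights: N(w̄, A⁻¹)
A = X * X' ./ σn^2 + inv(Σp)
w̄ = A \ (X * y) ./ σn^2

# predictive distribution at a test point x*
x_star = [0.5, 0.5]
μ_star = dot(x_star, w̄)              # predictive mean  x*ᵀ w̄
σ²_star = dot(x_star, A \ x_star)    # predictive variance  x*ᵀ A⁻¹ x*
```

With enough data and small noise, the posterior mean w̄ should land close to the weights used to generate the data.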
Let’s take a break from our Bayesian regression and return to the standard linear regression model for a moment. The key drawback of linear models like this is, of course, that they’re linear! Considering that many (most?) interesting relationships are not linear, how can we extend our simple linear model to enable us to perform complicated non-linear fits?
In the parlance of machine learning, the simple solution is to do feature engineering. If our initial feature vector is \begin{equation} \mathbf{x} = (x_1, ..., x_D) \end{equation} we can use our expertise to concoct new combinations of these features to produce the augmented vector \begin{equation} \tilde{\mathbf{x}} = (x_1, ..., x_D, x_1^2, \;\sin(x_2), \;x_5x_7/x_4,\;...) \end{equation}
As an example, a linear classifier is unable to distinguish points inside a circle from those outside just from the (x,y) coordinates alone. Augmenting the feature vector to include the squared radius x^2+y^2 as a new feature removes this obstacle.
This works because the linear part of linear regression only refers to the fact that our model takes linear combinations of feature variables to produce its output. There is no restriction that the features themselves need to be independent variables! This same idea is what makes methods like SINDy work…
Constructing new features is often more art than science. To standardize the process, let’s abstract the mapping from the original feature vector \mathbf{x} to the augmented vector \tilde{\mathbf{x}}. This is accomplished via the projection map \phi:\mathbb{R}^D \to \mathbb{R}^N where \begin{equation} \mathbf{x} \mapsto \tilde{\mathbf{x}} = \phi(\mathbf{x}) \end{equation}
The result is that our linear model updates to become \begin{equation} f(\mathbf{x}) := \phi(\mathbf{x})^T\mathbf{w} \end{equation} where the weight vector has gone from D dimensional to N dimensional.
Similarly, the normal equations for \mathbf{w} update to become \begin{equation} \mathbf{w} = (\Phi\Phi^T)^{-1}\Phi\mathbf{y} \end{equation} where \Phi = \phi(X) is the N\times n matrix resulting from applying \phi columnwise to X.
The following example shows how to use such a mapping to produce a quadratic polynomial fit.
# generate noisy quadratic data
X = 2*rand(200)
y = 1.0 .- 2.0 .* X .+ X.^2 .+ (0.5 .* rand(200))
X = X'
# compute projection
Φ(x) = [1.0, x...,[x[i]*x[j] for i∈1:size(x,1) for j∈1:size(x,1)]...]
X̃ = hcat([Φ(col) for col ∈ eachcol(X)]...)
# fit the parameters
w = (X̃*X̃')\(X̃*y)
# compute projection on test points
Xpred = collect(0:0.025:2)
Xpred = Xpred'
X̃pred = hcat([Φ(col) for col ∈ eachcol(Xpred)]...)
# compute prediction on test points
ypred = X̃pred'*w
# visualize
plot(X', y,
seriestype=:scatter,
color=:red, alpha=0.25,
label="data")
plot!(Xpred', ypred, color=:blue, label="fit")
xlabel!("x")
ylabel!("y")
title!("equation fit: $(round(w[1],digits=2)) + $(round(w[2], digits=2))x + $(round(w[3], digits=2))x²")

We see from the above that our linear regression model found a great fit for a 2nd-order polynomial when supplied with polynomial features.
There is a massive problem with this method. Our order-2 polynomial map \phi takes a D-dimensional feature vector to one with 1 + D + D^2 components, and an order-p map grows like O(D^p). This means that as we add more features to our feature map, the dimension of the resulting vector quickly becomes prohibitively large.
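The exact count depends on whether duplicate products are kept; for distinct monomials of degree at most p in D variables, the count is \binom{D+p}{p}, which still blows up quickly. A quick check:

```julia
# number of distinct monomial features of degree ≤ p in D variables
nfeat(D, p) = binomial(D + p, p)

nfeat(10, 2)   # 66
nfeat(10, 5)   # 3003
nfeat(100, 3)  # 176851
```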
Let’s update our Bayesian regression scheme to reflect the use of our feature projection map \phi. First we define \begin{align} \Phi &:= \phi(X) \\ \phi_* &:= \phi(\mathbf{x}_*) \end{align}
Our predictive distribution therefore becomes \begin{align} p(y_* \vert \mathbf{x}_*, X, \mathbf{y}) &= \mathcal{N}\left(\frac{1}{\sigma_n^2}\phi_*^TA^{-1}\Phi\mathbf{y}, \;\phi_*^TA^{-1}\phi_*\right) \\ A &= \frac{1}{\sigma_n^2}\Phi\Phi^T + \Sigma_p^{-1} \end{align} Great! Now we can do our Bayesian inference with non-linear features given by \phi.
Returning to the problem of the rapidly-growing dimensionality of our augmented feature vectors \phi(\mathbf{x}), we see that the computational bottleneck is the matrix inversion of A which requires we invert an N\times N matrix. Our prediction (i.e. the mean) involves multiplication on the right by the n dimensional vector \mathbf{y}. With that in mind, perhaps we can reformulate the above into an equivalent form using at most an n\times n dimensional matrix…
Let K:= \Phi^T\Sigma_p\Phi. Observe the following (writing \Phi = \Sigma_p^{-1}\Sigma_p\Phi in the third line): \begin{align} \frac{1}{\sigma_n^2}\Phi(K+\sigma_n^2I) &= \frac{1}{\sigma_n^2}\Phi\left(\Phi^T\Sigma_p\Phi + \sigma_n^2I \right) \\ &= \frac{1}{\sigma_n^2}\Phi\Phi^T\Sigma_p\Phi + \Phi \\ &= \left(\frac{1}{\sigma_n^2}\Phi\Phi^T \right)\Sigma_p\Phi + \Sigma_p^{-1}\Sigma_p\Phi \\ &= \left(\frac{1}{\sigma_n^2}\Phi\Phi^T + \Sigma_p^{-1}\right)\Sigma_p\Phi \\ &= A\Sigma_p\Phi \end{align}
From there we see that \begin{align} A^{-1}\frac{1}{\sigma_n^2}\Phi\left(K+\sigma_n^2I\right) &= \Sigma_p\Phi \\ \Rightarrow \frac{1}{\sigma_n^2}A^{-1}\Phi &= \Sigma_p\Phi\left(K + \sigma_n^2I\right)^{-1} \\ \Rightarrow \frac{1}{\sigma_n^2}\phi_*^TA^{-1}\Phi &= \phi_*^T\Sigma_p\Phi\left(K + \sigma_n^2I\right)^{-1} \end{align}
For the covariance, we utilize the matrix inversion lemma which states \begin{equation} (Z + UWV^T)^{-1} = Z^{-1} - Z^{-1}U(W^{-1} + V^TZ^{-1}U)^{-1}V^TZ^{-1} \end{equation}
With the identification \begin{align} Z^{-1} &\to \Sigma_p \\ W^{-1} &\to \sigma_n^2I \\ V &\to \Phi \\ U &\to \Phi \end{align} we find \begin{align} \Sigma_p - \Sigma_p\Phi\left(\sigma_n^2I + \Phi^T\Sigma_p\Phi \right)^{-1}\Phi^T\Sigma_p &= \left(\Sigma_p^{-1} + \Phi\frac{1}{\sigma_n^2}I\Phi^T\right)^{-1}\\ &= \left(\frac{1}{\sigma_n^2}\Phi\Phi^T + \Sigma_p^{-1}\right)^{-1} \\ &= A^{-1} \end{align}
Thus, we have the equivalent form for our predictive distribution: \begin{equation} \boxed{p(y_*\vert \mathbf{x}_*, X, \mathbf{y}) = \mathcal{N}\left( \phi_*^T\Sigma_p\Phi(K+\sigma_n^2I)^{-1}\mathbf{y}, \; \phi_*^T\Sigma_p\phi_* - \phi_*^T\Sigma_p\Phi(K+\sigma_n^2I)^{-1}\Phi^T\Sigma_p\phi_*\right)} \end{equation} where the pesky N\times N term has been replaced by the n\times n matrix \Phi^T\Sigma_p\Phi.
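We can verify this algebraic equivalence numerically with random matrices. This is a sketch; the dimensions, seed, and values are arbitrary:

```julia
using LinearAlgebra, Random

Random.seed!(0)
N, n = 5, 20                          # feature dimension N, number of datapoints n
σn = 0.2
Σp = Matrix(0.5I, N, N)
Φ  = randn(N, n)                      # projected training features
ϕs = randn(N)                         # projected test point
y  = randn(n)

# weight-space form: requires inverting the N×N matrix A
A = Φ * Φ' ./ σn^2 + inv(Σp)
μ_w = ϕs' * (A \ (Φ * y)) / σn^2
v_w = ϕs' * (A \ ϕs)

# kernel form: only inverts the n×n matrix K + σn²I
K = Φ' * Σp * Φ
μ_k = ϕs' * Σp * Φ * ((K + σn^2 * I) \ y)
v_k = ϕs' * Σp * ϕs - ϕs' * Σp * Φ * ((K + σn^2 * I) \ (Φ' * Σp * ϕs))

μ_w ≈ μ_k && v_w ≈ v_k   # true
```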
We now make the key observation that the only matrices that appear in the above expression are \begin{align} &\Phi^T\Sigma_p\Phi, &\phi_*^T\Sigma_p\phi_* \\ &\phi_*^T\Sigma_p\Phi, &\Phi^T\Sigma_p\phi_* \end{align} whose matrix elements we can write abstractly as \begin{equation} \phi(\mathbf{x})^T\Sigma_p\phi(\mathbf{x}') \end{equation}
To fit our model, we must determine appropriate values for the symmetric, positive semi-definite covariance matrix \Sigma_p (and \sigma_n too, technically). Instead, we observe that this matrix product is a quadratic form which we can think of as representing an inner product on our transformed vectors: \begin{equation} K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle \end{equation} We call the function k(\mathbf{x},\mathbf{x}') the kernel function or the covariance function.
All we need to perform the above calculations are the matrix elements of K on our data \mathcal{D} and any test points \mathbf{x}_* we wish to apply our model to. In effect, this means we are free to use feature vectors of any dimension, including \infty.2
2 The idea here is that the kernel function represents an inner product over some vector space. As it turns out, the RBF kernel corresponds to an infinite-dimensional feature vector.
There are many choices for the kernel function. One of the most popular is the RBF (radial basis function) kernel, also known as the squared exponential kernel:
\begin{equation} k_{\text{rbf}}(\mathbf{x}, \mathbf{x}') := \sigma_f^2\exp(-\frac{1}{2\ell^2}\lvert \mathbf{x}-\mathbf{x}'\rvert^2) \end{equation}
where \sigma_f^2 is the signal variance and \ell denotes the similarity length scale.
For notational convenience, let’s define \begin{align} K &:= k(X,X) \\ K_{**} &:= k(X_*, X_*) \\ K_{*} &:= k(X, X_*) \end{align}
then, our predictive distribution takes the final, clean form \begin{equation} \boxed{p(\mathbf{y}_* \vert X_*, X, \mathbf{y}) = \mathcal{N}\left( K_*^T(K+\sigma_n^2I)^{-1}\mathbf{y},\; K_{**}-K_{*}^T(K+\sigma_n^2I)^{-1}K_*\right)} \end{equation}
This is the end result of Gaussian process regression achieved via the weight-space view.
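To make the result concrete, here is a minimal from-scratch implementation of the boxed predictive distribution using the RBF kernel. This is a sketch: the helper names (`k_rbf`, `kernelmat`, `gpr_predict`) and hyperparameter values are our own choices.

```julia
using LinearAlgebra

# squared-exponential kernel, and the kernel matrix between two column-sets of points
k_rbf(x, x′; σf=1.0, ℓ=0.2) = σf^2 * exp(-norm(x .- x′)^2 / (2ℓ^2))
kernelmat(A, B; kw...) = [k_rbf(a, b; kw...) for a in eachcol(A), b in eachcol(B)]

# predictive mean and covariance from the boxed formula
function gpr_predict(X, y, Xs; σn=0.1, kw...)
    K   = kernelmat(X, X; kw...)     # K(X, X)
    Ks  = kernelmat(X, Xs; kw...)    # K(X, X*)
    Kss = kernelmat(Xs, Xs; kw...)   # K(X*, X*)
    μ = Ks' * ((K + σn^2 * I) \ y)
    Σ = Kss - Ks' * ((K + σn^2 * I) \ Ks)
    return μ, Σ
end

# with tiny noise, the predictive mean at the training inputs reproduces the targets
X = reshape(collect(0:0.1:1), 1, :)
y = sin.(2π .* vec(X))
μ, Σ = gpr_predict(X, y, X; σn=1e-3)
```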
So far our approach has been to generalize the standard linear regression model to allow for fitting over a (possibly infinite) basis of features with consideration for measurement and model uncertainty (our Bayesian priors). In essence, the idea was to fit the distribution of all possible weights conditioned on the available training data, p(\mathbf{w} \vert X, \mathbf{y}). A second, equivalent approach is to instead consider the distribution of all possible model functions f(\mathbf{x}). By constructing a Bayesian prior over this space, we constrain the space of possible model functions and fit a distribution over all allowed model functions, p(f \vert X, \mathbf{y}). To do so we will need to develop the abstract machinery of distributions over function spaces. When these distributions are Gaussian in nature, the result is called a Gaussian process.
By this point, we are all familiar with the Gaussian distribution, aka the normal distribution \mathcal{N}(\mu, \sigma^2), defined by a mean value \mu and a variance \sigma^2. Its big brother is the multivariate normal distribution, \mathcal{N}(\mathbf{\mu}, \Sigma), described by a vector of means \mathbf{\mu} and a covariance matrix \Sigma. A natural question, then, is: can we generalize the concept of the Gaussian distribution from N dimensions to a continuous index set? This question leads naturally to the definition of a Gaussian process.
Definition: A Gaussian process, \mathcal{GP}, is a collection of random variables for which any finite subset is described by a joint Gaussian distribution.
To see where this comes from, recall that in our previous derivation we already assumed that all of our data points in \mathcal{D} are i.i.d. Gaussian distributed. A Gaussian process is the natural extension of this: it assumes that the continuous collection from which our data are sampled is so thoroughly Gaussian that any finite sample will be jointly Gaussian distributed. The term process is used to distinguish these continuous collections of random variables from their finite counterparts (distributions).
Because each finite subset of this continuous collection is jointly Gaussian, we can completely specify a Gaussian process with two functions: the mean function m(\mathbf{x}) and the covariance function k(\mathbf{x},\mathbf{x}'). To denote this, we typically write \begin{equation} f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}')) \end{equation}
To see this in action, recall our Bayesian regression model \begin{equation} f(\mathbf{x}) = \phi(\mathbf{x})^T\mathbf{w} \qquad \mathbf{w}\sim\mathcal{N}(\mathbf{0}, \Sigma_p) \end{equation} where we have set the prior on \mathbf{w} to have zero mean.
The mean function is given by the expectation value of our model: \begin{equation} \mathbb{E}[f(\mathbf{x})] = \phi(\mathbf{x})^T\mathbb{E}[\mathbf{w}] = 0 \end{equation} and the covariance function is given by \begin{equation} \mathbb{E}[f(\mathbf{x})f(\mathbf{x'})] = \phi(\mathbf{x})^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\phi(\mathbf{x}') = \phi(\mathbf{x})^T\Sigma_p\phi(\mathbf{x}') \end{equation}
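To build intuition for what a draw f \sim \mathcal{GP}(0, k) looks like, we can evaluate a covariance function on a grid and sample from the resulting finite-dimensional Gaussian via a Cholesky factor. This is a sketch; the RBF kernel, length scale, and jitter value are arbitrary choices:

```julia
using LinearAlgebra, Random

Random.seed!(7)
k(x, x′; ℓ=0.2) = exp(-(x - x′)^2 / (2ℓ^2))    # RBF covariance function, m(x) = 0

xs = range(0, 1; length=200)
K = [k(a, b) for a in xs, b in xs] + 1e-8 * I   # small jitter keeps K numerically PSD
L = cholesky(Symmetric(K)).L

f1 = L * randn(200)   # two independent sample paths of f ~ GP(0, k) on the grid
f2 = L * randn(200)
```

Plotting `f1` and `f2` against `xs` would show smooth random functions whose wiggliness is controlled by ℓ.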
To repeat the point, the key feature of Gaussian processes is that finite subsets are jointly Gaussian distributed. Thus we can split our data into the training points \mathcal{D}=(X,\mathbf{y}) and the test points X_*, and treat the combined collection as a joint distribution with the following prior:
\begin{equation}\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\begin{bmatrix} K(X,X) & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right) \end{equation}
where \mathbf{f}:= f(X) and \mathbf{f}_* = f(X_*).
To obtain our predictive distribution, p(\mathbf{f}_* \vert X_*, X, \mathbf{y}), we condition the joint prior distribution on the observations. To see how this works, consider a general joint Gaussian distribution given by \begin{equation} \begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix},\; \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right) \end{equation}
Define the centered values \tilde{x} := x-\mu_x and \tilde{y} := y-\mu_y, and the intermediate variable \begin{equation} z := \tilde{x} - A\tilde{y} \end{equation}
Note that since we’ve subtracted out the mean we have \mathbb{E}[\tilde{x}] = \mathbb{E}[\tilde{y}] = \mathbb{E}[z] = 0
Let’s now find A… \begin{align} \mathbb{E}[z\tilde{y}^T] &= \mathbb{E}[(\tilde{x}-A\tilde{y})\tilde{y}^T] \\ &= \mathbb{E}[\tilde{x}\tilde{y}^T - A\tilde{y}\tilde{y}] \\ &= \mathbb{E}[\tilde{x}\tilde{y}^T] - \mathbb{E}[A\tilde{y}\tilde{y}^T] \\ &= \Sigma_{xy} - A\mathbb{E}[\tilde{y}\tilde{y}^T] \\ &= \Sigma_{xy} - A\Sigma_{yy} \end{align}
Therefore, if we choose A so that z and \tilde{y} are uncorrelated, then \Sigma_{zy} = \mathbb{E}[z\tilde{y}^T] = 0. Imposing this condition, we find \begin{equation} 0 = \mathbb{E}[z\tilde{y}^T] = \Sigma_{xy}-A\Sigma_{yy} \\ \Rightarrow \boxed{A = \Sigma_{xy}\Sigma_{yy}^{-1}} \end{equation}
If we now condition \tilde{x} on \tilde{y} (i.e. look at \tilde{x} when \tilde{y} is held fixed), we find \begin{align} \mathbb{E}[\tilde{x}\vert\tilde{y}] &= A\tilde{y} + \mathbb{E}[z] \\ &= A\tilde{y} + 0 \\ &= \Sigma_{xy}\Sigma_{yy}^{-1}\tilde{y} \\ \end{align}
By manipulating this expression, we can now derive \mathbb{E}[x\vert y] as follows: \begin{align} \mathbb{E}[x\vert\tilde{y}] &= \mathbb{E}[\tilde{x}\vert\tilde{y}] + \mu_x \\ &= \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}\tilde{y} \\ \end{align} \begin{equation} \boxed{\mathbb{E}[x\vert y] = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y)} \end{equation}
Similarly for the covariance, we have \begin{align} \text{Cov}(x \vert y) &= \text{Cov}(\tilde{x}+\mu_x \vert \tilde{y}) \\ &= \text{Cov}(\tilde{x}\vert \tilde{y}) \\ &= \text{Cov}(z+A\tilde{y}\vert\tilde{y}) \\ &= \text{Cov}(z) \qquad (A\tilde{y} \text{ is fixed given } \tilde{y}\text{, and } z \text{ is uncorrelated with } \tilde{y}) \\ &= \mathbb{E}[zz^T] \\ &= \mathbb{E}[(\tilde{x}-A\tilde{y})(\tilde{x}-A\tilde{y})^T]\\ &= \mathbb{E}[\tilde{x}\tilde{x}^T - A\tilde{y}\tilde{x}^T - \tilde{x}(A\tilde{y})^T + A\tilde{y}\tilde{y}^TA^T] \\ &= \Sigma_{xx} - A\Sigma_{yx} - \Sigma_{xy}A^T + A\Sigma_{yy}A^T \\ &= \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} + \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yy}\Sigma_{yy}^{-1}\Sigma_{yx} \\ &= \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} \end{align} where in the second-to-last line we used A = \Sigma_{xy}\Sigma_{yy}^{-1}, the symmetry of \Sigma_{yy}, and \Sigma_{yx} = \Sigma_{xy}^T.
\begin{equation} \boxed{\text{Cov}(x \vert y) = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}} \end{equation}
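These two boxed identities translate directly into code. As a sanity check, the general formula should reproduce the familiar bivariate result \mathbb{E}[x|y] = \mu_x + \rho(\sigma_x/\sigma_y)(y-\mu_y) and \text{Var}(x|y) = \sigma_x^2(1-\rho^2). This is a sketch with arbitrary numbers; `condition` is our own helper name:

```julia
using LinearAlgebra

# conditional mean and covariance of x | y for a joint Gaussian
function condition(μx, μy, Σxx, Σxy, Σyy, yobs)
    μ = μx + Σxy * (Σyy \ (yobs - μy))
    Σ = Σxx - Σxy * (Σyy \ Σxy')
    return μ, Σ
end

# bivariate case
μx, μy, σx, σy, ρ = 1.0, -2.0, 0.5, 2.0, 0.8
Σxy = fill(ρ * σx * σy, 1, 1)
μ, Σ = condition([μx], [μy], fill(σx^2, 1, 1), Σxy, fill(σy^2, 1, 1), [0.0])

μ[1] ≈ μx + ρ * (σx / σy) * (0.0 - μy)   # true
Σ[1] ≈ σx^2 * (1 - ρ^2)                  # true
```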
Armed with this identity for joint Gaussian distributions, we are ready to derive the predictive distribution for Gaussian process regression.
Applying these results to our Gaussian process, we find \begin{equation} p(\mathbf{f}_* \vert X_*, X, \mathbf{f}) = \mathcal{N}\left( K_*^TK^{-1}\mathbf{f},\; K_{**}-K_*^TK^{-1}K_*\right) \end{equation}
To account for noisy observations, we can augment our covariance function to include a noise offset. The joint distribution then becomes:
\begin{equation}\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\begin{bmatrix} K(X,X)+\sigma_n^2I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right) \end{equation}
which leads to the predictive distribution \begin{equation} \boxed{p(\mathbf{f}_* \vert X_*, X, \mathbf{y}) = \mathcal{N}\left( K_*^T\left[K + \sigma_n^2 I\right]^{-1}\mathbf{y},\; K_{**}-K_*^T\left[K + \sigma_n^2 I\right]^{-1}K_*\right)} \end{equation}
using LinearAlgebra
using Distributions
using Random
using Plots
using Tables
using MLJBase
using KernelFunctions
using AbstractGPs
using Statistics

# Make training set
function make_training_data(n, ν)
    xs = reshape(collect(range(0.0, stop=1.0, length=500)), (1, 500))
    ynoisy(x) = exp(-x/(0.5)^2) * sin(2π*ν*x) + (0.3)^2 * (rand() - 0.5)  # this one has some noise
    ytruth(x) = exp(-x/(0.5)^2) * sin(2π*ν*x)
    X = Tables.table(rand(1, n)', header=[:x])
    Xtrue = Tables.table(xs', header=[:x])
    y = ynoisy.(X.x)
    ytrue = ytruth.(Xtrue.x)
    return X, y, Xtrue, ytrue
end
X, y, Xtrue, ytrue = make_training_data(50, 10)
scatter(X.x, y, color=:red, label="noisy data")
plot!(Xtrue.x, ytrue, color=:blue, label="true function")
xlabel!("x")
ylabel!("y")

KernelFunctions.jl provides a clean interface to create various kernel functions and apply them to data to create our matrices K.
Due to the fact that kernel functions obey composition laws, we can easily build up complicated Kernels from basic pieces via function composition with \circ
# create kernel function
ℓ = 0.1 # length scale parameter
k = 1.0 * (SqExponentialKernel() ∘ ScaleTransform(1/2ℓ^2))
# test out kernels interface by generating matrix from k
K = kernelmatrix(k, Tables.matrix(X)') # transpose as we want size (n_features, n_datapoints)
heatmap(K, yflip=true)

Unsurprisingly, there is a lot of activation on the diagonal, as for a single datapoint \mathbf{x} we have \begin{equation} k(\mathbf{x},\mathbf{x}) = \exp\left(-\frac{0}{2\ell^2} \right) = 1.0 \end{equation}
Now that we have our Kernel function, let’s construct our Gaussian Process.
AbstractGPs.jl provides an excellent way to define Gaussian Processes by supplying mean and kernel functions. We can then sample from our GPs with a simple interface designed to extend the basic functions from Statistics.jl. From an AbstractGP we can construct a FiniteGP by indexing into our datasets.
First we construct f\sim\mathcal{GP}(0, k(\cdot, \cdot))
# construct an AbstractGP with mean 0 and cov k(x,x')
f = GP(0, k)  # a GP with ConstMean(0) and our scaled SqExponentialKernel
From this AbstractGP, we can now construct a FiniteGP, i.e. a multivariate normal distribution, by applying the GP to our training data. We include a measurement variance of \sigma_n^2 = (0.1)^2 to account for noisy observations.
fₓ = f(Tables.matrix(X)', (0.1)^2)  # a FiniteGP: f evaluated at our 50 inputs, with diagonal noise covariance Σy = 0.01 I
# let's examine this object
println(fieldnames(typeof(fₓ)))  # (:f, :x, :Σy)
fₓ.x.X   # the underlying data
fₓ.x     # internal representation of the data as columns for fast computation of the kernel matrix
fₓ.Σy    # 50×50 Diagonal covariance matrix built from the σ² we supplied; note the sparsity
Now that we have our Gaussian process, we can compute the log-marginal likelihood p(\mathbf{y}\vert X, \theta), i.e. the probability of obtaining the targets given the features, hyperparameters, etc…
logpdf(fₓ, y)  # the log-marginal likelihood; ≈ -7.7002
Next, we demonstrate how to compute the posterior Gaussian process (for us, that would be f_*). First we create the posterior GP, which we will use to compute the predictive distribution \begin{equation} p(\mathbf{f}_* \vert X_*, X, \mathbf{y}) \end{equation}
p_fₓ = posterior(fₓ, y)
println(typeof(p_fₓ))  # AbstractGPs.PosteriorGP{…}
Now that we have the distribution, we can form our predictions… This can be done a few different ways:
fₓ_pred = p_fₓ(Tables.matrix(Xtrue)')
ypred = mean(fₓ_pred)
yσ² = var(fₓ_pred)  # the variance is the diagonal of the resulting covariance matrix (a 500-element Vector{Float64})
Alternatively, if we instead want a distribution for each datapoint, we can compute \begin{equation}
p(y_* \vert \mathbf{x}_*, X, \mathbf{y})
\end{equation} Treated as a collection, each of these represents a marginalized distribution over a single test point \mathbf{x}_*; hence, we call marginals()
p_pred = marginals(fₓ_pred)
println(p_pred[1])  # Normal{Float64}(μ=0.457…, σ=0.860…)
y_pred = mean.(p_pred);
y_σ = std.(p_pred);
p1 = plot(X.x, y, seriestype=:scatter, color=:red, label="noisy data")
plot!(Xtrue.x, ytrue, color=:black, linestyle=:dash, label="true function")
plot!(Xtrue.x, y_pred, color=:blue, linewidth=3, label="fit" )
plot!(Xtrue.x, y_pred .+ 2y_σ, c=:gray, label="")
plot!(Xtrue.x, y_pred .- 2y_σ, fillrange = y_pred .+ 2y_σ, fillalpha=0.25, color=:gray, label="± 2σ")
xlabel!("x")
ylabel!("y")
title!("Vanilla GPR")

So far we have shown how to:
- construct a kernel function with KernelFunctions.jl
- define a Gaussian process prior with AbstractGPs.jl
- condition on training data to obtain the posterior GP and evaluate it at test points (Xtrue)
- extract predictive means via mean() and variances via var(); alternatively, obtain a normal distribution for each test point by calling marginals()

You may think we have already fit the Gaussian process; however, we were forced to choose values for both \ell and \sigma^2. How can we optimally select the ideal hyperparameters for our Gaussian process? This leads us into the realm of Bayesian model selection.
There are several levels of parameters in machine learning. At the lowest level, we have the model weights \mathbf{w}. Above that, we have model hyperparameters, \theta. At the top we have model structure \mathcal{H}. In our Bayesian framework, we can consider distributions defined at each of these levels. At the bottom, we have \begin{equation} p(\mathbf{w} \vert X, \mathbf{y}, \theta, \mathcal{H}_i) = \frac{p(\mathbf{y} \vert X, \mathbf{w}, \theta, \mathcal{H}_i) p(\mathbf{w}\vert \theta, \mathcal{H}_i) }{p(\mathbf{y}\vert X, \theta, \mathcal{H}_i)} \end{equation}
If this looks confusing, consider Bayes rule for 3 events R, H, S. We have: \begin{align} P(R \vert H, S) &= \frac{P(R,H,S)}{P(H,S)} \\ &= \frac{P(H \vert R, S)P(R, S)}{P(H,S)}\\ &= \frac{P(H \vert R, S)P(R\vert S)P(S)}{P(H\vert S)P(S)} \\ &= \frac{P(H \vert R, S)P(R\vert S)}{P(H\vert S)} \end{align}
To get the result, just think of \theta and \mathcal{H}_i as a single event and translate the above to distribution functions.
The prior p(\mathbf{w}\vert \theta, \mathcal{H}_i) encodes any knowledge we have about the parameters prior to seeing the data. The denominator is the marginal likelihood and is given by \begin{equation} p(\mathbf{y}\vert X, \theta, \mathcal{H}_i) = \int d\mathbf{w}\; p(\mathbf{y} \vert X, \mathbf{w}, \theta, \mathcal{H}_i)p(\mathbf{w}\vert \theta, \mathcal{H}_i) \end{equation}
The next level up is to express the distribution of hyper-parameters \theta: \begin{equation} p(\theta \vert X, \mathbf{y}, \mathcal{H}_i) = \frac{p(\mathbf{y}\vert X, \theta, \mathcal{H}_i)p(\theta \vert \mathcal{H}_i)}{p(\mathbf{y}\vert X, \mathcal{H}_i)} \end{equation} Here p(\theta \vert \mathcal{H}_i) is called the hyper-prior. Similarly, the normalization constant is given by \begin{equation} p(\mathbf{y}\vert X,\mathcal{H}_i) = \int d\theta \; p(\mathbf{y}\vert X, \theta, \mathcal{H}_i)p(\theta \vert \mathcal{H}_i) \end{equation}
Finally, at the top level we have the set of possible model structures \{\mathcal{H}_i\}. This leads to \begin{equation} p(\mathcal{H}_i \vert X, \mathbf{y}) = \frac{p(\mathbf{y} \vert X, \mathcal{H}_i)p(\mathcal{H}_i)}{p(\mathbf{y}\vert X)} \end{equation} with normalization constant \begin{equation} p(\mathbf{y}\vert X) = \sum_i p(\mathbf{y} \vert X, \mathcal{H}_i)p(\mathcal{H}_i) \end{equation}
Depending on the model details, these integrals may be analytically intractable, requiring approximations or Monte Carlo methods. Since we rarely have sufficient knowledge to form a hyperparameter prior, one often instead maximizes the marginal likelihood p(\mathbf{y} \vert X, \theta, \mathcal{H}_i) with respect to the hyperparameters \theta. This is known as Type II Maximum Likelihood Estimation.
In the case of Gaussian Process Regression, we are once again saved by the fact that every piece has a convenient functional form, resulting in analytically tractable integrals for the marginal likelihood. We find \begin{equation} \ln p(\mathbf{y}\vert X, \theta) = -\frac{1}{2}\mathbf{y}^T(K_f + \sigma_n^2 I)^{-1}\mathbf{y} - \frac{1}{2}\ln\lvert K_f + \sigma_n^2 I \rvert -\frac{n}{2}\ln(2\pi) \end{equation}
We should add a derivation of this when possible. It’s just a big ’ol nasty integral.
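In the meantime, we can at least check the expression numerically. The following stdlib-only sketch implements the formula with a hand-coded squared-exponential kernel (all names and values here are illustrative) and verifies the single-observation special case, where the expression reduces to a univariate normal log-density with variance k(x,x) + \sigma_n^2:

```julia
using LinearAlgebra

se(x, x′; σf²=1.0, ℓ=0.3) = σf² * exp(-(x - x′)^2 / (2ℓ^2))

# ln p(y|X,θ) = -½ yᵀ(Kf + σn²I)⁻¹y - ½ ln|Kf + σn²I| - (n/2) ln 2π
function logmarginal(x, y; σn²=0.1)
    n  = length(x)
    Ky = [se(a, b) for a in x, b in x] + σn² * I
    C  = cholesky(Symmetric(Ky))   # stable inverse and log-determinant
    return -0.5 * dot(y, C \ y) - 0.5 * logdet(C) - 0.5 * n * log(2π)
end

# single observation: variance collapses to k(x,x) + σn²
v  = se(0.0, 0.0) + 0.1
lm = logmarginal([0.0], [0.7])
# lm should equal -0.7^2/(2v) - ½ ln v - ½ ln 2π
```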
Let’s try this out in code! See this section from the AbstractGPs.jl docs…
using Optim
# helper functions to ensure parameters are positive
softplus(x) = log(1+exp(x))
softplusinv(x) = log(exp(x)-1)
softplusinv (generic function with 1 method)
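A quick sanity check on these helpers (restated below so the snippet runs on its own): softplus maps any real number to a strictly positive one, and softplusinv inverts it for positive inputs, which is what lets the optimizer search in unconstrained space.

```julia
# softplus: ℝ → (0, ∞); softplusinv is its inverse on positive reals
softplus(x) = log(1 + exp(x))
softplusinv(x) = log(exp(x) - 1)

softplus(-10.0)              # small but strictly positive
softplus(softplusinv(0.25))  # round-trips back to 0.25
```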
We want to maximize the log-marginal-likelihood and therefore want to minimize its negative:
# define loss function
function loss_function(x,y)
function negativelogmarginallikelihood(params)
kernel = softplus(params[1]) * (SqExponentialKernel() ∘ ScaleTransform(softplus(params[2])))
f = GP(kernel)
fx = f(x, softplus(params[3]))
return -logpdf(fx, y)
end
return negativelogmarginallikelihood
end
loss_function (generic function with 1 method)
# prefactor, length parameter, measurement variance
θ₀ = [1.0, 1/(2ℓ^2), (0.25)^2]
3-element Vector{Float64}:
1.0
49.99999999999999
0.0625
# Optim defaults to finite differences; autodiff=:forward switches to forward-mode AD
opt = Optim.optimize(loss_function(Tables.matrix(X)', y), θ₀, LBFGS(); autodiff=:forward)
params_best = opt.minimizer
softplus.(params_best)
3-element Vector{Float64}:
0.1206544051219394
37.409897810113726
0.000582519141662673
kernel_best = softplus(params_best[1]) * (SqExponentialKernel() ∘ ScaleTransform(softplus(params_best[2])))
f = GP(kernel_best)
fₓ = f(Tables.matrix(X)', softplus(params_best[3]))
p_fₓ = posterior(fₓ, y)
fₓ_pred = p_fₓ(Tables.matrix(Xtrue)')
p_pred = marginals(fₓ_pred)
y_pred = mean.(p_pred);
y_σ = std.(p_pred);
p2 = plot(X.x, y, seriestype=:scatter, color=:red, label="noisy data")
plot!(Xtrue.x, ytrue, color=:black, linestyle=:dash, label="true function")
plot!(Xtrue.x, y_pred, color=:blue, linewidth=3, label="fit" )
plot!(Xtrue.x, y_pred .+ 2y_σ, c=:gray, label="")
plot!(Xtrue.x, y_pred .- 2y_σ, fillrange = y_pred .+ 2y_σ, fillalpha=0.25, color=:gray, label="± 2σ")
xlabel!("x")
ylabel!("y")
title!("HPO GPR")
plot(p1, p2, layout=(1,2), size=(900, 500), plot_title="Gaussian Process Regression Fits")
Excellent! Now that we have the hang of it, let’s try another fit for a function of two variables.
n=1000
xs = (rand(n,2) .- 0.5)
x₁ = xs[:,1]
x₂ = xs[:,2]
y = x₁.^5 .+ x₂.^4 .- x₁.^4 .- x₂.^3
Xtrain = Tables.table(xs[1:Int(0.9*n), :], header=[:x₁, :x₂])
Xtest = Tables.table(xs[Int(0.9*n)+1:end, :], header=[:x₁, :x₂])
ytrain = y[1:Int(0.9*n)]
ytest = y[Int(0.9*n)+1:end]
scatter(Xtrain.x₁, Xtrain.x₂, ytrain, ms=3, label="training points")
scatter!(Xtest.x₁, Xtest.x₂, ytest, color=:red, ms=3, label="testing points")
# this is me just trying to manually find some good parameters
σf² = 0.25
ℓ = 0.3
σn² = 0.25
k = σf² * (SqExponentialKernel() ∘ ScaleTransform(1/2ℓ^2))
# test out kernels interface by generating matrix from k
K = kernelmatrix(k, Tables.matrix(Xtrain)') # transpose as we want size (n_features, n_datapoints)
f = GP(0, k)
fₓ = f(Tables.matrix(Xtrain)', σn²)
logpdf(fₓ, ytrain) # this is the log marginal likelihood
p_fₓ = posterior(fₓ, ytrain)
fₓ_pred_train = p_fₓ(Tables.matrix(Xtrain)')
ypred_train = mean(fₓ_pred_train)
yσ²_train = var(fₓ_pred_train) # variance is diagonal of resulting covariance matrix
fₓ_pred_test = p_fₓ(Tables.matrix(Xtest)')
ypred_test = mean(fₓ_pred_test)
yσ²_test = var(fₓ_pred_test) # variance is diagonal of resulting covariance matrix
pf1 = scatter(ypred_train, ytrain, label="training data")
scatter!(ypred_test, ytest, label="testing data")
title!("Fit with guessed parameters")
We need a way to intelligently initialize the hyperparameters. See this post for ideas
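Absent a principled prior, one pragmatic option is multi-start optimization: draw several random initial guesses and keep the best local optimum. Below is a stdlib-only sketch of the pattern, with a toy loss and a naive random-search refinement standing in for Optim.optimize (all names and numbers here are illustrative):

```julia
using Random

# toy stand-in for the negative log marginal likelihood, with two minima
toyloss(θ) = (θ[1]^2 - 1)^2 + 0.1 * θ[1]

# crude local refinement by accepted random perturbations
# (purely illustrative; a real run would call Optim.optimize here)
function refine(loss, θ₀; iters=3_000, step=0.05)
    θ, best = copy(θ₀), loss(θ₀)
    for _ in 1:iters
        cand = θ .+ step .* randn(length(θ))
        l = loss(cand)
        if l < best
            θ, best = cand, l
        end
    end
    return θ, best
end

# multi-start: several random initial guesses, keep the best optimum found
Random.seed!(42)
results = [refine(toyloss, 3 .* randn(1)) for _ in 1:8]
θ_best, loss_best = argmin(last, results)
```

The same wrapper works unchanged around the loss_function defined above, replacing refine with a bounded LBFGS call per start.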
θ_lower = [1e-5, 1e-5, 1e-5]
θ_upper = [Inf, Inf, Inf]
θ₀ = [0.25, 0.3, 0.25]
# define loss function
function loss_function(x,y)
function negativelogmarginallikelihood(params)
kernel = params[1] * (SqExponentialKernel() ∘ ScaleTransform(params[2]))
f = GP(kernel)
fx = f(x, params[3])
return -logpdf(fx, y)
end
return negativelogmarginallikelihood
end
# default option uses finite diff methods
opt = Optim.optimize(loss_function(Tables.matrix(Xtrain)', ytrain), θ_lower, θ_upper, θ₀, Fminbox(LBFGS()); autodiff=:forward)
params_best = opt.minimizer
# softplus.(params_best) # not needed: box constraints already keep parameters positive
3-element Vector{Float64}:
7.44878713782739
1.062150755934373
1.0000000000000003e-5
kernel_best = params_best[1] * (SqExponentialKernel() ∘ ScaleTransform(params_best[2]))
f = GP(kernel_best)
fₓ = f(Tables.matrix(Xtrain)', params_best[3])
p_fₓ = posterior(fₓ, ytrain)
# compute
fₓ_pred = p_fₓ(Tables.matrix(Xtest)')
p_pred = marginals(fₓ_pred)
y_pred = mean.(p_pred);
y_σ = std.(p_pred);
fₓ_pred_train = p_fₓ(Tables.matrix(Xtrain)')
ypred_train = mean(fₓ_pred_train)
yσ²_train = var(fₓ_pred_train) # variance is diagonal of resulting covariance matrix
fₓ_pred_test = p_fₓ(Tables.matrix(Xtest)')
ypred_test = mean(fₓ_pred_test)
yσ²_test = var(fₓ_pred_test) # variance is diagonal of resulting covariance matrix
pf2 = scatter(ypred_train, ytrain, label="training data")
scatter!(ypred_test, ytest, label="testing data")
title!("Fit with HPO Parameters")
marginals(fₓ_pred_train)
900-element Vector{Normal{Float64}}: [Normal(μ=0.0104, σ=0.00035), Normal(μ=-0.0122, σ=0.00038), …] (output truncated)