2023-10-06
Gaussian process regression

Petroleum engineers solve many problems during the process of developing a hydrocarbon asset. One of them is assisting the exploration team to understand the reservoir: the rock in the ground that contains the oil and/or gas.
To do this, they follow the same procedure that a data scientist (DS) (or experimental physicist, or econometrician, or any 21st century data-driven professional) follows, namely:
The main things they want to know about the reservoir are:
One of the ways of collecting data in the petroleum industry is to drill a hole, take out a sample of rock from the bottom of the hole, and perform laboratory measurements on that sample. Because each hole can cost from hundreds of thousands to several millions of dollars to drill, all else being equal we would like to drill as few of them as possible.
Suppose we drill a few holes and take measurements of one of the above quantities from the sample extracted from each hole. Suppose we then want to estimate the value for that quantity at some new location far away from any of the holes. This is the problem of spatial interpolation.
With the above terminology in place, we can now restate the problem formally as follows:
Suppose we drill $n$ holes located at $\mathbf{x}_1, \dots, \mathbf{x}_n$, and we take a measurement of one of the above regionalised variables in each hole, so that we have observed one realisation of each of the $n$ random variables $Z(\mathbf{x}_1), \dots, Z(\mathbf{x}_n)$. Our goal is to estimate the true realisations of the random variables $Z(\mathbf{x}_1^*), \dots, Z(\mathbf{x}_m^*)$ for a set of $m$ new locations $\mathbf{x}_1^*, \dots, \mathbf{x}_m^*$.
Kriging is the term used by the geostatistics community for historical reasons to refer to Gaussian process regression (GPR). In GPR, we model the regionalised variable to be predicted as a Gaussian process (GP), a type of stochastic process. The regression problem then becomes one of estimating the mean and variance of the GP for the test set.
GPR gives linear estimates in the sense that each target prediction can be shown to be a linear combination of the training targets. The covariance between target values for different input features is modelled by a kernel function $k(\mathbf{x}, \mathbf{x}')$, which must be an inner product in feature space. As one can show, this is equivalent to saying that the Gram matrix whose elements are $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semidefinite. Unlike, say, linear regression, GPR is a nonparametric method and as such its complexity grows with the size of the training set.
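As a concrete illustration, here is a minimal sketch (my own, not from the original post) of a kernel and its Gram matrix in numpy; the squared-exponential kernel and the function names are my assumptions:

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma2=1.0, length_scale=10.0):
    """Squared-exponential kernel: the covariance between values at x1 and x2."""
    return sigma2 * np.exp(-0.5 * (x1 - x2) ** 2 / length_scale ** 2)

def gram_matrix(xa, xb, kernel=sq_exp_kernel):
    """Gram matrix K with K[i, j] = k(xa[i], xb[j]) for 1D locations."""
    return kernel(np.asarray(xa)[:, None], np.asarray(xb)[None, :])

# The Gram matrix of a valid kernel is positive semidefinite:
x = np.linspace(0.0, 100.0, 25)
K = gram_matrix(x, x)
print(np.all(np.linalg.eigvalsh(K) >= -1e-8))  # True, up to numerical tolerance
```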
Before proceeding, let's first define some notation: uppercase $Z$ denotes the random variable (the regionalised variable modelled as a GP), uppercase $K$ (and its subscripted variants) denotes covariance matrices, and lowercase $z$ denotes observed values.
Let $\mathbf{z} = (z_1, \dots, z_n)^\top$ be the known values of the regionalised variable of interest at the sampled locations $\mathbf{x}_1, \dots, \mathbf{x}_n$. This is our training set. Our goal is to predict the values $\mathbf{z}^* = (z_1^*, \dots, z_m^*)^\top$ at the unsampled locations $\mathbf{x}_1^*, \dots, \mathbf{x}_m^*$. This requires that we evaluate the conditional distribution $p(\mathbf{z}^* \mid \mathbf{z})$.
Let $\tilde{\mathbf{z}} = (z_1, \dots, z_n, z_1^*, \dots, z_m^*)^\top$ be the joint vector of sampled and unsampled values. Now $\tilde{\mathbf{z}} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, where $\Sigma$ is the covariance matrix built from the Gram matrix of the kernel, $\Sigma_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, with $k(\mathbf{x}, \mathbf{x}) = \sigma^2$ the variance (the autocovariance at lag 0); if the measurements are noisy, a term $\sigma_n^2 I$, where $I$ is the identity matrix, can be added to the observed block. We can partition $\Sigma$ as follows:

$$\Sigma = \begin{pmatrix} K & K_* \\ K_*^\top & K_{**} \end{pmatrix}$$
where $K$ ($n \times n$) and $K_{**}$ ($m \times m$) are the covariance matrices for the sampled and unsampled locations respectively, and $K_*$ ($n \times m$) is the cross-covariance between them. Using well-established results for partitioned covariance matrices of multivariate Gaussians, we see that the conditional distribution $p(\mathbf{z}^* \mid \mathbf{z})$ is a Gaussian with mean and covariance given by:
$$\boldsymbol{\mu}_* = K_*^\top K^{-1} \mathbf{z}$$

$$\Sigma_* = K_{**} - K_*^\top K^{-1} K_*$$
The kriging estimate is simply this posterior mean $\boldsymbol{\mu}_*$, and the posterior covariance $\Sigma_*$ additionally gives us a measure of the uncertainty in our estimate.

The above two equations are the key results that define GPR. (Later on, I will show that the above expression for the mean of the GPR is in fact the best linear unbiased predictor (BLUP) for $Z(\mathbf{x}^*)$.)
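For concreteness, here is a minimal numpy sketch of these two equations (my own illustration, assuming a zero-mean GP, noise-free observations, and a kernel with the vectorised signature used in the earlier sketch):

```python
import numpy as np

def gpr_posterior(x_train, z_train, x_test, kernel, jitter=1e-10):
    """Posterior mean and covariance of a zero-mean GP at the test locations."""
    K = kernel(x_train[:, None], x_train[None, :])      # K    (n x n)
    K_s = kernel(x_train[:, None], x_test[None, :])     # K_*  (n x m)
    K_ss = kernel(x_test[:, None], x_test[None, :])     # K_** (m x m)
    K = K + jitter * np.eye(len(x_train))               # numerical stability
    alpha = np.linalg.solve(K, z_train)                 # K^{-1} z
    mean = K_s.T @ alpha                                # K_*^T K^{-1} z
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)        # K_** - K_*^T K^{-1} K_*
    return mean, cov
```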
What remains to be seen is how we specify the kernel function $k$ and in turn the matrices $K$, $K_*$ and $K_{**}$. One approach is to choose a parametric kernel and estimate its parameters using maximum likelihood estimation (more on this later). An alternative approach is to infer it from the data. For this, we turn to a popular tool in geostatistics: the variogram.
In geostatistics, the kernel function (and hence the covariance) is expressed using the so-called variogram. The variogram is defined as the expected squared difference between the values at two points $\mathbf{x}_1$ and $\mathbf{x}_2$:

$$2\gamma(\mathbf{x}_1, \mathbf{x}_2) = \mathbb{E}\left[\left(Z(\mathbf{x}_1) - Z(\mathbf{x}_2)\right)^2\right]$$

It can be shown that under the stationarity assumptions this reduces to:

$$2\gamma(h) = \mathbb{E}\left[\left(Z(\mathbf{x}) - Z(\mathbf{x} + \mathbf{h})\right)^2\right]$$

where $h = \|\mathbf{h}\| = \|\mathbf{x}_1 - \mathbf{x}_2\|$, so that $h$ is the Euclidean distance between $\mathbf{x}_1$ and $\mathbf{x}_2$. Thus the variogram is the expected squared difference of values of the regionalised variable at different locations, as a function of their distance apart. Further application of the stationarity assumptions and algebraic manipulation shows that this can be written as a function of the autocovariance only, as follows:

$$\gamma(h) = C(0) - C(h)$$

where $C(h) = \operatorname{Cov}\left(Z(\mathbf{x}), Z(\mathbf{x} + \mathbf{h})\right)$ is the autocovariance and $C(0) = \sigma^2$ is the variance. The function $\gamma$, equal to half the variogram, is called the semivariogram. The semivariogram is often used instead of the variogram for convenience. I have illustrated the relationship between the semivariogram and autocovariance in the following figure.
To construct an empirical variogram, for every possible lag $h$ we first find the subset of all pairs of samples whose separation is $h$ and take the mean of their squared differences. (In practice, because we may have limited data, we instead bin the pairs into subsets satisfying $\big|\,\|\mathbf{x}_i - \mathbf{x}_j\| - h\,\big| < \delta$, where $\delta$ is some small threshold.) We then fit various semivariogram models to the data and select the one that has the best fit, perhaps drawing on experience in adjacent fields to use as analogs.
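A minimal sketch of this estimator for 1D locations (my own illustration; the binning scheme and function name are assumptions, not the post's actual code):

```python
import numpy as np

def empirical_semivariogram(x, z, bin_width=5.0, max_lag=None):
    """Binned classical estimator: mean of half squared differences per lag bin."""
    lags = np.abs(x[:, None] - x[None, :])
    half_sq_diff = 0.5 * (z[:, None] - z[None, :]) ** 2
    iu = np.triu_indices(len(x), k=1)                  # count each pair once
    lags, half_sq_diff = lags[iu], half_sq_diff[iu]
    max_lag = lags.max() if max_lag is None else max_lag
    edges = np.arange(0.0, max_lag + bin_width, bin_width)
    centres, gammas = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (lags >= lo) & (lags < hi)
        if mask.any():                                 # skip empty bins
            centres.append(0.5 * (lo + hi))
            gammas.append(half_sq_diff[mask].mean())
    return np.array(centres), np.array(gammas)
```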
The following figure illustrates a few popular variogram models. Range, sill and nugget are 20 m, 1 m² and 0 m², respectively.
In particular, the Gaussian variogram is given by:

$$\gamma(h) = n + (s - n)\left(1 - \exp\!\left(-\frac{h^2}{r^2}\right)\right)$$

where $n$ is the nugget, $s$ is the sill and $r$ is the range parameter (conventions differ; some authors use the practical range $a$, replacing $h^2/r^2$ with $3h^2/a^2$).
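In code, the Gaussian model from the post, together with the exponential and spherical models that I add here for comparison, might look as follows (a sketch under the parameterisation given above; the argument order is my own choice):

```python
import numpy as np

def gaussian_model(h, nugget, sill, rng):
    """Parabolic behaviour near the origin; approaches the sill asymptotically."""
    return nugget + (sill - nugget) * (1.0 - np.exp(-(np.asarray(h) / rng) ** 2))

def exponential_model(h, nugget, sill, rng):
    return nugget + (sill - nugget) * (1.0 - np.exp(-np.asarray(h) / rng))

def spherical_model(h, nugget, sill, rng):
    h = np.asarray(h, dtype=float)
    gamma = nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    return np.where(h < rng, gamma, sill)      # flat at the sill beyond the range
```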
Suppose we are developing a reservoir. For simplicity, let's start with the assumption that most of its variation can be explained by a single direction, in which case we can model it as a 1D problem. Suppose we have drilled 10 holes and have taken laboratory measurements on rock samples extracted from those holes. The results are shown below.
We then perform the aforementioned procedure to build the empirical variogram. The theoretical and empirical variograms are shown in the following figure.

There is an entire body of literature on the best ways to fit empirical variograms. Here, for simplicity, I have just used scipy.optimize with a squared-error objective function. Note that, in this realisation, the empirical variogram grossly overestimates the true variogram. This reflects the fact that, with so few samples, estimates of the mean squared differences are highly variable. The estimator is, however, unbiased: averaged over a large number of sampling experiments, the empirical variogram would reproduce the theoretical one. To convince ourselves of this, and as a sense-check of our approach, we can perform a large number of sampling experiments and plot the variogram cloud. The results are shown in the following figure.
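The fitting step mentioned above could look something like this (a sketch only: the empirical values below are placeholder numbers for illustration, not the post's data, and the post may have used a different scipy.optimize routine):

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_model(h, nugget, sill, rng):
    return nugget + (sill - nugget) * (1.0 - np.exp(-(h / rng) ** 2))

# Binned empirical semivariogram values (placeholders for illustration only;
# in the post these would come from the 10 wells via the estimator above).
lag_centres = np.array([2.5, 7.5, 12.5, 17.5, 22.5, 27.5])
gamma_hat = np.array([0.10, 0.45, 0.80, 0.95, 1.05, 1.00])

# Squared-error fit of nugget, sill and range.
p0 = [0.01, gamma_hat.max(), lag_centres.max() / 2.0]
(nugget, sill, rng_), _ = curve_fit(gaussian_model, lag_centres, gamma_hat,
                                    p0=p0, bounds=(0.0, np.inf))
print(nugget, sill, rng_)
```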
Suppose, for the sake of argument, that before we drilled the holes we already had an idea of the variogram. Then we would have a prior distribution given by a GP with mean 0 and covariance specified by the variogram:

$$Z(\mathbf{x}) \sim \mathcal{GP}\!\left(0, k(\mathbf{x}, \mathbf{x}')\right), \qquad k(\mathbf{x}, \mathbf{x}') = C(h) = \sigma^2 - \gamma(h), \quad h = \|\mathbf{x} - \mathbf{x}'\|.$$
The following figure shows a few sample functions drawn from this prior.
The following figure is a plot of the prior covariance matrix. It is a Toeplitz matrix, reflecting the fact that the prior GP is stationary.
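A sketch of how such prior samples and the Toeplitz covariance matrix can be generated (my own illustration; the grid, random seed and variogram parameters are assumptions). On a regular 1D grid, a stationary covariance of the pairwise lags is Toeplitz by construction:

```python
import numpy as np

def cov_from_gaussian_variogram(h, sill=1.0, rng=20.0):
    """C(h) = C(0) - gamma(h) for a zero-nugget Gaussian variogram."""
    return sill * np.exp(-(h / rng) ** 2)

x_grid = np.linspace(0.0, 100.0, 200)                    # regular 1D grid
H = np.abs(x_grid[:, None] - x_grid[None, :])            # matrix of lags
K_prior = cov_from_gaussian_variogram(H)                 # Toeplitz on a regular grid

# Draw a few sample functions from the zero-mean prior GP.
rng_state = np.random.default_rng(0)
jittered = K_prior + 1e-8 * np.eye(len(x_grid))          # numerical stability
samples = rng_state.multivariate_normal(np.zeros(len(x_grid)), jittered, size=5)
```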
Later on, after we have drilled a few holes and performed laboratory measurements of core samples, we will compare this with the posterior covariance (N.B. this is not something done in practice but is useful for pedagogical purposes).
After we have drilled the holes and taken measurements, we update our prior belief (expressed by the prior GP) with data (AKA observations) to arrive at a posterior belief (our belief after-the-fact). The following figure shows samples drawn from this posterior.
The kriging estimate is simply the expected value of this posterior GP. This is shown in the following figure. As we can see, it interpolates the true realisation reasonably well for densely sampled regions but less so for regions with few or no samples, particularly the region between Wells 6 and 7.
The following figure is a visualisation of the posterior covariance matrix. It of course remains symmetric, but it is no longer Toeplitz, reflecting the fact that the posterior GP is nonstationary: the means are no longer identically zero but are equal to the observed values at the sampled locations, the variance at those locations is zero, and the covariance is higher for nearby unsampled points. The bright yellow patch represents the region of greatest uncertainty, between Wells 6 and 7.
Suppose for geological reasons we could no longer make the 1D assumption and we had to model it in 2D. The above method applies to this 2D case (and any higher dimensions) with no modification. Suppose again we have drilled 10 holes and have taken laboratory measurements on rock samples extracted from those holes. The results are shown below.
2D view:
3D isometric view:
As before, we can estimate the surface using kriging:
2D slice at a fixed value of one of the coordinates:
Note that in this case, we need to drill more holes and collect more data to accurately model the surface. In fact, the amount of data we need grows exponentially with the number of dimensions. This is a manifestation of the curse of dimensionality.
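To make the "no modification" point concrete, here is a sketch of simple kriging written for locations of any dimension (my own illustration; the well coordinates and measured values below are synthetic placeholders, not the post's data):

```python
import numpy as np

def gaussian_cov(h, sill=1.0, rng=20.0):
    return sill * np.exp(-(h / rng) ** 2)

def pairwise_dist(a, b):
    """Euclidean distances between point sets of shape (n, d) and (m, d)."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

def krige(x_train, z_train, x_test, jitter=1e-10):
    """Posterior mean of a zero-mean GP; works unchanged for any dimension d."""
    K = gaussian_cov(pairwise_dist(x_train, x_train)) + jitter * np.eye(len(x_train))
    k_s = gaussian_cov(pairwise_dist(x_train, x_test))
    return k_s.T @ np.linalg.solve(K, z_train)

# Synthetic 2D example: 10 hole locations (easting, northing) and measurements.
rng_state = np.random.default_rng(1)
xy_wells = rng_state.uniform(0.0, 100.0, size=(10, 2))
z_wells = np.sin(xy_wells[:, 0] / 15.0) + np.cos(xy_wells[:, 1] / 25.0)

gx, gy = np.meshgrid(np.linspace(0.0, 100.0, 50), np.linspace(0.0, 100.0, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
z_hat = krige(xy_wells, z_wells, grid).reshape(gx.shape)   # kriged surface
```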
Now that I have defined kriging and provided worked examples, I think it is beneficial to discuss an alternative derivation of kriging from a more classical statistical perspective, in contrast to the derivation I presented first, which takes a more modern machine learning (ML) perspective. The classical derivation I present in the following is in fact the one I learned first at university. Only later in my DS career did I learn the ML derivation above, i.e. the derivation from the Gaussian process perspective. Crucially, the classical derivation reveals that kriging is the best linear unbiased predictor (BLUP) of the regionalised variable of interest.
Suppose we wish to estimate the value of $Z$ at an unsampled location $\mathbf{x}^*$ using a linear estimate, i.e. by taking a linear combination of the values measured at each of the holes:

$$\hat{Z}(\mathbf{x}^*) = \sum_{i=1}^{n} \lambda_i\, z(\mathbf{x}_i)$$
There are many ways we could come up with the weights $\lambda_i$. For example, we could let the weight for the closest data point be equal to 1 and the rest be equal to 0. This is the nearest neighbour (NN) estimate.

Or, as a generalisation of this, we could let the weights for the $k$ closest data points be equal to $1/k$ and the rest equal to 0. This is the k-nearest neighbour (kNN) estimate.

Or we could set the weights equal to the normalised reciprocals of the distances $d_i = \|\mathbf{x}_i - \mathbf{x}^*\|$ between the existing locations and the new location:

$$\lambda_i = \frac{1/d_i}{\sum_{j=1}^{n} 1/d_j}$$
This is the inverse distance weighting (IDW) estimate.
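A sketch of the IDW estimate for 1D locations (my own illustration, not the post's code):

```python
import numpy as np

def idw_estimate(x_train, z_train, x_new, eps=1e-12):
    """Inverse distance weighting: weights are normalised reciprocal distances."""
    d = np.abs(np.asarray(x_train) - x_new)
    if np.any(d < eps):                   # coincides with a sample: return its value
        return z_train[np.argmin(d)]
    w = 1.0 / d
    w = w / w.sum()                       # normalise so the weights sum to one
    return w @ np.asarray(z_train)
```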
Each of the above approaches, while simple, is somewhat unsatisfying and leaves much to be desired from a statistical standpoint.
A question worth asking is: given that we want to make a linear estimate of $Z(\mathbf{x}^*)$, what requirements would we like such an estimate to satisfy? A statistician might answer that the estimate should be unbiased and that, among all linear unbiased estimates, it should have the smallest possible error variance.
Under a set of assumptions, it is possible to show that kriging is the best linear unbiased predictor (BLUP) of the regionalised variable in question at unsampled locations.
What is a BLUP? It is a predictor with the following properties: it is linear (a linear combination of the observations), it is unbiased (its expected error is zero), and it is best in the sense that it minimises the error variance among all linear unbiased predictors.
Before we proceed, a brief note on notation: for simplicity, because for the purpose of this derivation we are only considering a single point to be estimated, I will use $\mathbf{x}^*$ and $Z^* = Z(\mathbf{x}^*)$ to refer to what I have called $\mathbf{x}_1^*$ and $Z(\mathbf{x}_1^*)$ previously. With this in mind, we have

$$\hat{Z}^* = \sum_{i=1}^{n} \lambda_i Z(\mathbf{x}_i)$$

and, requiring the estimator to be unbiased,

$$\mathbb{E}\left[\hat{Z}^* - Z^*\right] = \sum_{i=1}^{n} \lambda_i\,\mathbb{E}[Z(\mathbf{x}_i)] - \mathbb{E}[Z^*] = m_Z\left(\sum_{i=1}^{n} \lambda_i - 1\right) = 0 \quad\Longrightarrow\quad \sum_{i=1}^{n} \lambda_i = 1,$$

where $m_Z$ is the (constant but possibly unknown) mean of the regionalised variable.
This shows that the requirement that it is unbiased implies that the kriging weights are normalised.
The coefficients $\lambda_i$ are determined by minimising the variance of the estimation error subject to this constraint. The constrained minimisation is carried out by introducing a Lagrange multiplier $\mu$, an additional unknown constant which makes the number of unknown parameters equal to the number of equations.

The system of linear equations from which the $\lambda_i$ and $\mu$ must be solved is given by

$$\sum_{j=1}^{n} \lambda_j C(\mathbf{x}_i, \mathbf{x}_j) + \mu = C(\mathbf{x}_i, \mathbf{x}^*), \quad i = 1, \dots, n, \qquad \sum_{j=1}^{n} \lambda_j = 1.$$

To derive it, write down the Lagrangian

$$\mathcal{L}(\lambda_1, \dots, \lambda_n, \mu) = \mathbb{E}\left[\left(\hat{Z}^* - Z^*\right)^2\right] + 2\mu\left(\sum_{i=1}^{n} \lambda_i - 1\right),$$

where the expected squared error expands to

$$\mathbb{E}\left[\left(\hat{Z}^* - Z^*\right)^2\right] = \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j C(\mathbf{x}_i, \mathbf{x}_j) - 2\sum_{i=1}^{n} \lambda_i C(\mathbf{x}_i, \mathbf{x}^*) + C(\mathbf{x}^*, \mathbf{x}^*).$$

Setting the partial derivatives of the Lagrangian with respect to $\mu$ and each $\lambda_i$ to zero recovers the constraint and the system of linear equations above. In matrix form:

$$\begin{pmatrix} C & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda} \\ \mu \end{pmatrix} = \begin{pmatrix} \mathbf{c} \\ 1 \end{pmatrix}, \qquad C_{ij} = C(\mathbf{x}_i, \mathbf{x}_j), \quad c_i = C(\mathbf{x}_i, \mathbf{x}^*).$$
Considering the system for $\boldsymbol{\lambda}$ only (in the zero-mean, simple-kriging setting assumed in the GPR derivation above, the unbiasedness constraint is automatically satisfied and the multiplier $\mu$ drops out), we have $C\boldsymbol{\lambda} = \mathbf{c}$. Taking the transpose of both sides and right-multiplying by $C^{-1}$ (recalling that $C$ is symmetric),

$$\boldsymbol{\lambda}^\top = \mathbf{c}^\top C^{-1}.$$

But the matrices on the RHS are just $K$ and $K_*$ defined previously (here $K_*$ is a vector, since we are estimating a single location)! Substituting these into the estimator,

$$\hat{Z}^* = \boldsymbol{\lambda}^\top \mathbf{z} = K_*^\top K^{-1} \mathbf{z},$$

which is precisely the posterior mean of the Gaussian process derived previously in the GPR equations. This shows that the mean of the GP is indeed a BLUP for $Z(\mathbf{x}^*)$. QED.
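A quick numerical sanity check of this equivalence, under the simple-kriging (zero-mean) assumption; the locations and values below are synthetic placeholders of my own choosing:

```python
import numpy as np

def gaussian_cov(h, sill=1.0, rng=20.0):
    return sill * np.exp(-(h / rng) ** 2)

rng_state = np.random.default_rng(7)
x = np.sort(rng_state.uniform(0.0, 100.0, size=10))      # hypothetical hole locations
z = np.sin(x / 15.0)                                      # stand-in measurements
x_star = 42.0                                             # unsampled location

K = gaussian_cov(np.abs(x[:, None] - x[None, :])) + 1e-10 * np.eye(len(x))
k_star = gaussian_cov(np.abs(x - x_star))

lam = np.linalg.solve(K, k_star)                          # kriging weights
gpr_mean = k_star @ np.linalg.solve(K, z)                 # GPR posterior mean
print(np.isclose(lam @ z, gpr_mean))                      # True
```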
Instead of specifying the covariance via an empirical variogram, we could have estimated it using maximum likelihood estimation (MLE). The log marginal likelihood is given by:

$$\log p(\mathbf{z} \mid X, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{z}^\top K^{-1}\mathbf{z} - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi,$$

where $\boldsymbol{\theta}$ denotes the kernel hyperparameters on which $K$ depends.
The partial derivatives are given by:

$$\frac{\partial}{\partial \theta_j}\log p(\mathbf{z} \mid X, \boldsymbol{\theta}) = \frac{1}{2}\mathbf{z}^\top K^{-1}\frac{\partial K}{\partial \theta_j}K^{-1}\mathbf{z} - \frac{1}{2}\operatorname{tr}\!\left(K^{-1}\frac{\partial K}{\partial \theta_j}\right) = \frac{1}{2}\operatorname{tr}\!\left(\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - K^{-1}\right)\frac{\partial K}{\partial \theta_j}\right), \qquad \boldsymbol{\alpha} = K^{-1}\mathbf{z}.$$
Details of an algorithm for a practical implementation of the above are beyond the scope of this post. In short, it involves performing a Cholesky decomposition of $K$ and solving two linear systems of equations involving the resulting lower triangular matrix. The interested reader should refer to Chapter 2 of Rasmussen and Williams (2006).
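A sketch of what such an implementation might look like, using a Cholesky factorisation and a generic numerical optimiser (the kernel parameterisation, synthetic data and seed are my own assumptions; see Rasmussen and Williams (2006), Chapter 2, for the reference algorithm):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, z):
    """Negative log marginal likelihood of a zero-mean GP with a
    squared-exponential kernel, parameterised by log(sigma^2) and log(length)."""
    sigma2, length = np.exp(log_params)
    H = np.abs(x[:, None] - x[None, :])
    K = sigma2 * np.exp(-0.5 * (H / length) ** 2) + 1e-8 * np.eye(len(x))
    c, low = cho_factor(K, lower=True)          # Cholesky decomposition of K
    alpha = cho_solve((c, low), z)              # two triangular solves: K alpha = z
    log_det = 2.0 * np.sum(np.log(np.diag(c)))  # log|K| from the Cholesky factor
    return 0.5 * z @ alpha + 0.5 * log_det + 0.5 * len(x) * np.log(2.0 * np.pi)

# Synthetic 1D data for illustration; optimise the hyperparameters numerically.
rng_state = np.random.default_rng(3)
x = np.sort(rng_state.uniform(0.0, 100.0, size=10))
z = np.sin(x / 15.0)
result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 20.0]), args=(x, z))
sigma2_hat, length_hat = np.exp(result.x)
```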
Bárdossy, A. Introduction to geostatistics.
Bay, X., Gauthier, B., 2010. Extended formula for kriging interpolation.
Bishop, C. M., 2006. Pattern recognition and machine learning.
Ginsbourger, D., et al., 2008. A note on the choice and the estimation of kriging models for the analysis of deterministic computer experiments.
Le, N. D., Zidek, J. V., 1992. Interpolation with uncertain spatial covariances: A Bayesian alternative to kriging.
Malvic, T., Balic, D., 2010. Linearity and Lagrange linear multiplicator in the equations of ordinary kriging.
Mønness, E., Coleman, S., 2011. A short note on variograms and correlograms.
Rasmussen, C. E., Williams, C. K. I., 2006. Gaussian processes for machine learning.
Sjöstedt-De Luna, S., Young, A., 2003. The bootstrap and kriging prediction intervals.
Wilson, J. T., et al., 2020. Efficiently sampling functions from Gaussian process posteriors.