A foreshadow of my data science career in my engineering undergrad: Kriging

2023-10-06

Tags: Gaussian process regression, Statistics

Contents

1 Spatial interpolation: An important problem in hydrocarbon exploration
2 Stating the problem formally
3 Kriging: Gaussian process regression for the geosciences
4 How the covariance is specified in geostatistics: Variograms
5 Worked example: Ordinary kriging
6 Higher dimensions: 2D ordinary kriging
7 Alternative derivation of kriging: Best linear unbiased predictor
8 Limitations of kriging
9 References

1 Spatial interpolation: An important problem in hydrocarbon exploration

Petroleum engineers solve many problems during the process of developing a hydrocarbon asset. One of them is assisting the exploration team to understand the reservoir - the rock in the ground that contains the oil and/or gas.

To do this, they follow the same procedure that a data scientist (DS) (or experimental physicist, or econometrician, or any 21st century data-driven professional) follows, namely:

  1. Collect and prepare data
  2. Analyse the data
  3. Build a model
  4. Make inferences

The main things they want to know about the reservoir are its physical properties, for example:

  - Porosity: the fraction of the rock's volume that is pore space, which determines how much fluid it can hold
  - Permeability: a measure of how easily fluid can flow through the rock
  - Hydrocarbon saturation: the fraction of the pore space that is occupied by oil or gas

One of the ways of collecting data in the petroleum industry is to drill a hole, take out a sample of rock from the bottom of the hole, and perform laboratory measurements on that sample. Because each hole can cost from hundreds of thousands to several millions of dollars to drill, all else being equal we would like to drill as few of them as possible.

Suppose we drill a few holes and take measurements of one of the above quantities from the sample extracted from each hole. Suppose we then want to estimate the value for that quantity at some new location far away from any of the holes. This is the problem of spatial interpolation.

2 Stating the problem formally

As in any quantitative problem solving, to be able to get anywhere with it, we must first restate it formally in mathematical language. I will first introduce some terminology familiar to the geostatistics community:

  - Regionalised variable: a quantity that takes a value at every location $\mathbf{x}$ in the region of interest (e.g. porosity), modelled as a random variable $Z(\mathbf{x})$ at each location
  - Realisation: the actual value $z(\mathbf{x})$ taken by the random variable $Z(\mathbf{x})$ at a given location
  - Lag: the separation vector (or distance) $\mathbf{h}$ between two locations

With the above terminology in place, we can now restate the problem formally as follows:

Suppose we drill $n$ holes located at $\mathbf{x}_1, \dots, \mathbf{x}_n$, and we take a measurement of one of the above regionalised variables $Z$ in each hole, so that we have observed one realisation $z(\mathbf{x}_i)$ of each of $n$ random variables $Z(\mathbf{x}_1), \dots, Z(\mathbf{x}_n)$. Our goal is to estimate the true realisations $z(\mathbf{x}^*_1), \dots, z(\mathbf{x}^*_m)$ of the random variables $Z(\mathbf{x}^*_1), \dots, Z(\mathbf{x}^*_m)$ for a set of $m$ new locations $\mathbf{x}^*_1, \dots, \mathbf{x}^*_m$.

3 Kriging: Gaussian process regression for the geosciences

Kriging is the term used by the geostatistics community to refer to Gaussian process regression (GPR); the name honours the South African mining engineer Danie Krige, whose work Georges Matheron later formalised. In GPR, we model the regionalised variable to be predicted as a Gaussian process (GP), a type of stochastic process. The regression problem then becomes one of estimating the mean and variance of the GP for the test set.

GPR gives linear estimates in the sense that each target prediction can be shown to be a linear combination of the training targets. The covariance between target values for different input features is modelled by a kernel function $k(\mathbf{x}, \mathbf{x}')$, which must be an inner product in feature space. As can be shown, this is equivalent to saying the Gram matrix $\mathbf{K}$, whose elements are $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, is positive semidefinite. Unlike, say, linear regression, GPR is a nonparametric method and as such its complexity grows with the size of the training set.

Before proceeding, let's first define some notation: uppercase $Z$ denotes a random variable, uppercase bold (e.g. $\mathbf{K}$, $\mathbf{C}$) denotes a matrix, and lowercase $z$ denotes an observed realisation.

Let $\mathbf{z} = (z(\mathbf{x}_1), \dots, z(\mathbf{x}_n))^\top$ be the known values of the regionalised variable of interest at sampled locations $\mathbf{x}_1, \dots, \mathbf{x}_n$. This is our training set. Our goal is to predict the function values $\mathbf{z}^* = (z(\mathbf{x}^*_1), \dots, z(\mathbf{x}^*_m))^\top$ at $m$ unsampled locations $\mathbf{x}^*_1, \dots, \mathbf{x}^*_m$. This requires that we evaluate the conditional distribution $p(\mathbf{z}^* \mid \mathbf{z})$.

Let $\mathbf{z}' = (\mathbf{z}^\top, \mathbf{z}^{*\top})^\top$. Now $\mathbf{z}' \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$, where $\mathbf{C}$ is the covariance matrix, related to the Gram matrix as follows: $\mathbf{C} = \mathbf{K} + \sigma^2\mathbf{I}$, where $\sigma^2$ is the variance (autocovariance at lag 0) and $\mathbf{I}$ is the identity matrix. We can partition $\mathbf{C}$ as follows:

$$\mathbf{C} = \begin{pmatrix} \mathbf{C}_{zz} & \mathbf{C}_{z*} \\ \mathbf{C}_{*z} & \mathbf{C}_{**} \end{pmatrix}$$

where $\mathbf{C}_{zz}$ and $\mathbf{C}_{**}$ are the covariance matrices for sampled and unsampled locations, respectively, and $\mathbf{C}_{z*} = \mathbf{C}_{*z}^\top$ is the cross-covariance between them. Using well-established results for partitioned covariance matrices of multivariate Gaussians, we see that the conditional distribution $p(\mathbf{z}^* \mid \mathbf{z})$ is a Gaussian with mean and covariance given by:

$$\boldsymbol{\mu}^* = \mathbf{C}_{*z}\,\mathbf{C}_{zz}^{-1}\,\mathbf{z}$$

$$\boldsymbol{\Sigma}^* = \mathbf{C}_{**} - \mathbf{C}_{*z}\,\mathbf{C}_{zz}^{-1}\,\mathbf{C}_{z*}$$

(GPR equations)

The kriging estimate is simply this posterior mean $\boldsymbol{\mu}^*$, and the posterior covariance $\boldsymbol{\Sigma}^*$ additionally gives us a measure of the uncertainty in our estimate.

The above two equations are the key results that define GPR. (Later on, I will show that the above expression for the mean of the GPR is in fact a best linear unbiased predictor (BLUP) for $\mathbf{z}^*$.)
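To make the GPR equations concrete, here is a minimal NumPy sketch of how they might be implemented. The function names (`gaussian_cov`, `gpr_posterior`) and the choice of a Gaussian covariance with unit sill and 20 m range are my own illustrative assumptions, not taken from the original analysis:

```python
import numpy as np

def gaussian_cov(x1, x2, sill=1.0, rng=20.0):
    """Gaussian covariance C(h) = sill * exp(-(h / rng)^2) between sets of 1D locations."""
    h = np.abs(x1[:, None] - x2[None, :])  # pairwise lags
    return sill * np.exp(-((h / rng) ** 2))

def gpr_posterior(x_train, z_train, x_test, jitter=1e-10):
    """Posterior mean and covariance of a zero-mean GP (the GPR equations)."""
    C_zz = gaussian_cov(x_train, x_train) + jitter * np.eye(len(x_train))
    C_sz = gaussian_cov(x_test, x_train)                 # cross-covariance C_{*z}
    C_ss = gaussian_cov(x_test, x_test)                  # covariance C_{**}
    mu = C_sz @ np.linalg.solve(C_zz, z_train)           # posterior mean mu*
    Sigma = C_ss - C_sz @ np.linalg.solve(C_zz, C_sz.T)  # posterior covariance Sigma*
    return mu, Sigma
```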

What remains to be seen is how we specify the kernel function $k$ and in turn the $\mathbf{K}$ and $\mathbf{C}$ matrices. One approach is to estimate its parameters using maximum likelihood estimation (more on this later). An alternative approach is to infer it from the data directly. For this, we turn to a popular tool in geostatistics: the variogram.

4 How the covariance is specified in geostatistics: Variograms

In geostatistics, the kernel function (and hence the covariance) is expressed using the so-called variogram. The variogram $2\gamma(\mathbf{x}_1, \mathbf{x}_2)$ is defined as the expected squared difference between the values at two points $\mathbf{x}_1$ and $\mathbf{x}_2$:

$$2\gamma(\mathbf{x}_1, \mathbf{x}_2) = \mathbb{E}\left[\left(Z(\mathbf{x}_1) - Z(\mathbf{x}_2)\right)^2\right]$$

It can be shown that under the stationarity assumptions this reduces to:

$$2\gamma(\mathbf{h}) = \mathbb{E}\left[\left(Z(\mathbf{x} + \mathbf{h}) - Z(\mathbf{x})\right)^2\right]$$

where $\mathbf{h} = \mathbf{x}_1 - \mathbf{x}_2$, so that $h = \lVert\mathbf{h}\rVert$ is the Euclidean distance between $\mathbf{x}_1$ and $\mathbf{x}_2$. Thus the variogram is the expected squared difference of values of the regionalised variable at different locations as a function of their distance apart. Further application of the stationarity assumptions and algebraic manipulation shows that this can be written as a function of $h$ only, as follows:

$$\gamma(h) = \sigma^2 - C(h)$$

where $C(h)$ is the autocovariance at lag $h$. The function $\gamma$, equal to half the variogram, is called the semivariogram. The semivariogram is often used instead of the variogram for convenience. I have illustrated the relationship between the semivariogram and the autocovariance in the following figure.

[Figure: the relationship between the semivariogram $\gamma(h)$ and the autocovariance $C(h)$]

To construct an empirical variogram, for every possible lag $h$, we first find the subset of pairs of observations separated by that lag and take the mean of their half squared differences. (In practice, because we have finite data, we instead bin the pairs into subsets whose separation distances satisfy $\left|\,\lVert\mathbf{x}_i - \mathbf{x}_j\rVert - h\,\right| < \delta$, where $\delta$ is some small threshold.) We then fit various semivariogram models to the data and select the one that has the best fit, perhaps drawing on experience in adjacent fields to use as analogues.
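As a sketch of the binning step (the function name, bin width and maximum lag below are hypothetical choices of mine):

```python
import numpy as np

def empirical_semivariogram(x, z, bin_width=2.0, max_lag=50.0):
    """Empirical semivariogram: bin half squared differences by pairwise lag."""
    i, j = np.triu_indices(len(x), k=1)        # indices of all distinct pairs
    h = np.abs(x[i] - x[j])                    # pairwise lags
    sq = 0.5 * (z[i] - z[j]) ** 2              # half squared differences
    edges = np.arange(0.0, max_lag + bin_width, bin_width)
    lags, gammas = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (h >= lo) & (h < hi)            # pairs whose lag falls in this bin
        if mask.any():
            lags.append(0.5 * (lo + hi))       # bin centre
            gammas.append(sq[mask].mean())     # mean half squared difference
    return np.array(lags), np.array(gammas)
```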

The following figure illustrates a few popular variogram models. Range, sill and nugget are 20 m, 1 m² and 0 m², respectively.

[Figure: popular variogram models with range 20 m, sill 1 m² and nugget 0 m²]

In particular, the Gaussian variogram is given by:

$$\gamma(h) = \sigma^2\left(1 - \exp\left(-\frac{h^2}{r^2}\right)\right)$$

where $r$ is the range parameter and $\sigma^2$ is the sill.

5 Worked example: Ordinary kriging

5.1 1D ordinary kriging

Suppose we are developing a reservoir. For simplicity, let's start with the assumption that most of its variation can be explained by a single direction, in which case we can model it as a 1D problem. Suppose we have drilled 10 holes and have taken laboratory measurements on rock samples extracted from those holes. The results are shown below.

[Figure: laboratory measurements from 10 holes along a 1D transect]

We then perform the aforementioned procedure to build the empirical variogram. The theoretical and empirical variograms are shown in the following figure.

[Figure: theoretical and empirical variograms]

There is an entire body of literature on the best ways to fit empirical variograms. Here, for simplicity, I have just used scipy.optimize with a squared-error objective function (a sketch follows the next figure). Note that the empirical variogram grossly overestimates the true variogram here. This reflects the fact that estimates of mean squared differences computed from a single realisation can deviate substantially from their expected values. They are, however, consistent, which means that in the limit of an infinite number of sampling experiments the empirical variogram would perfectly reproduce the theoretical one. To convince ourselves of this, and as a sense check of our approach, we can perform a large number of sampling experiments and plot the variogram cloud. The results are shown in the following figure.

[Figure: variogram cloud from a large number of sampling experiments]
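Returning to the fitting step: a minimal sketch using scipy.optimize.curve_fit (which minimises squared error) to fit a zero-nugget Gaussian semivariogram model. The lag and semivariance values below are illustrative stand-ins, not the post's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_semivariogram(h, sill, rng):
    """Gaussian semivariogram model with zero nugget."""
    return sill * (1.0 - np.exp(-((h / rng) ** 2)))

# Illustrative empirical (lag, semivariance) pairs, e.g. from the binning step above
lags = np.array([2.5, 7.5, 12.5, 17.5, 22.5, 27.5])
gammas = np.array([0.05, 0.30, 0.62, 0.85, 0.96, 1.00])

(sill_hat, range_hat), _ = curve_fit(
    gaussian_semivariogram, lags, gammas,
    p0=[1.0, 20.0],  # initial guesses for sill and range
)
```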

5.2 Aside: Sampling from the prior

Suppose for the sake of argument that before we drilled the holes, we had an idea of the variogram. Then we would have a prior distribution given by a GP with mean 0 and covariance matrix specified by the variogram:

$$\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}), \qquad C_{ij} = \sigma^2 - \gamma(h_{ij})$$

where $h_{ij}$ is the distance between locations $\mathbf{x}_i$ and $\mathbf{x}_j$.

The following figure shows a few sample functions drawn from this prior.

[Figure: sample functions drawn from the prior GP]
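Drawing such samples amounts to sampling from a multivariate Gaussian whose covariance matrix is built from the variogram. A minimal sketch, assuming the Gaussian model with unit sill and 20 m range:

```python
import numpy as np

xs = np.linspace(0.0, 100.0, 200)              # prediction grid
h = np.abs(xs[:, None] - xs[None, :])          # pairwise lags
C = 1.0 * np.exp(-((h / 20.0) ** 2))           # C(h) = sill - gamma(h) for the Gaussian model
C += 1e-10 * np.eye(len(xs))                   # jitter for numerical stability

prior_samples = np.random.default_rng(0).multivariate_normal(
    mean=np.zeros(len(xs)), cov=C, size=5      # five sample functions from the prior
)
```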

5.3 Aside: Visualising the prior covariance matrix

The following figure is a plot of the prior covariance matrix. It is a Toeplitz matrix, reflecting the fact that the prior GP is stationary.

[Figure: the prior covariance matrix]

Later on, after we have drilled a few holes and performed laboratory measurements of core samples, we will compare this with the posterior covariance (N.B. this is not something done in practice but is useful for pedagogical purposes).

5.4 Kriging estimate: The mean of the posterior GP

After we have drilled the holes and taken measurements, we update our prior belief (expressed by the prior GP) with data (AKA observations) to arrive at a posterior belief (our belief after-the-fact). The following figure shows samples drawn from this posterior.

[Figure: sample functions drawn from the posterior GP]
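Posterior samples can be drawn the same way, using the posterior mean and covariance from the GPR equations. A self-contained sketch, in which the well positions and measurements are hypothetical stand-ins:

```python
import numpy as np

def gcov(a, b, sill=1.0, rng=20.0):
    """Gaussian covariance between two sets of 1D locations."""
    return sill * np.exp(-(((a[:, None] - b[None, :]) / rng) ** 2))

x_tr = np.array([5.0, 12.0, 20.0, 28.0, 33.0, 41.0, 70.0, 78.0, 85.0, 95.0])
z_tr = np.sin(x_tr / 10.0)                     # stand-in measurements
x_te = np.linspace(0.0, 100.0, 200)

C_zz = gcov(x_tr, x_tr) + 1e-10 * np.eye(len(x_tr))
C_sz = gcov(x_te, x_tr)
mu = C_sz @ np.linalg.solve(C_zz, z_tr)                        # posterior mean
Sig = gcov(x_te, x_te) - C_sz @ np.linalg.solve(C_zz, C_sz.T)  # posterior covariance
Sig += 1e-8 * np.eye(len(x_te))                                # jitter before sampling

posterior_samples = np.random.default_rng(1).multivariate_normal(mu, Sig, size=5)
```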

The kriging estimate is simply the expected value of this posterior GP. This is shown in the following figure. As we can see, it interpolates the true realisation reasonably well for densely sampled regions but less so for regions with few or no samples, particularly the region between Wells 6 and 7.

[Figure: the kriging estimate (posterior mean) compared with the true realisation]

5.5 Visualising the posterior covariance matrix

The following figure is a visualisation of the posterior covariance matrix. It of course remains symmetric but is no longer Toeplitz, reflecting the fact that the posterior GP is nonstationary: the means are no longer identically zero but are equal to the observed values at the sampled locations, at which the variance is zero, while the covariance is higher for nearby unsampled points. The bright yellow patch represents the region of greatest uncertainty, between wells 6 and 7.

[Figure: the posterior covariance matrix]

6 Higher dimensions: 2D ordinary kriging

Suppose for geological reasons we could no longer make the 1D assumption and had to model the reservoir in 2D. The above method applies to this 2D case (and any higher dimension) with no modification. Suppose again we have drilled 10 holes and have taken laboratory measurements on rock samples extracted from those holes. The results are shown below.

2D view:

[Figure: 2D map view of the sampled reservoir property]

3D isometric view:

[Figure: 3D isometric view of the surface and samples]

As before, we can estimate the surface using kriging:

[Figure: the kriged surface estimate]

2D slice at a fixed $y$:

[Figure: slice through the kriged surface]

Note that in this case, we need to drill more holes and collect more data to accurately model the surface. In fact, the amount of data we need grows exponentially with the number of dimensions. This is a manifestation of the curse of dimensionality.
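Before moving on, to illustrate the claim that the method carries over with no modification, here is a 2D sketch in which the only change from the 1D case is that pairwise distances are computed between coordinate pairs. The well coordinates and surface below are hypothetical stand-ins:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gcov2d(A, B, sill=1.0, rng=20.0):
    """Gaussian covariance from Euclidean distances between 2D locations."""
    return sill * np.exp(-((cdist(A, B) / rng) ** 2))

wells = np.random.default_rng(2).uniform(0.0, 100.0, size=(10, 2))   # (x, y) of 10 holes
z = np.sin(wells[:, 0] / 10.0) * np.cos(wells[:, 1] / 15.0)          # stand-in measurements

gx, gy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])                     # prediction locations

C_zz = gcov2d(wells, wells) + 1e-10 * np.eye(len(wells))
kriged = (gcov2d(grid, wells) @ np.linalg.solve(C_zz, z)).reshape(gx.shape)
```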

Now that I have defined kriging and provided worked examples, I think it is beneficial to discuss an alternative derivation of kriging from a classic statistical perspective, in contrast to the derivation I presented first, which takes a modern machine learning (ML) perspective. The classic derivation presented in the following is in fact the one I learned first at university. Only later in my DS career did I learn the ML derivation above, i.e. the derivation from the Gaussian process perspective. Crucially, the classic derivation reveals that kriging is the best linear unbiased predictor (BLUP) of the regionalised variable of interest.

7 Alternative derivation of kriging: Best linear unbiased predictor

7.1 Preamble: Linear estimation

Suppose we wish to estimate the value of $Z$ at an unsampled location $\mathbf{x}^*$ using a linear estimate, i.e. by taking a linear combination of the values at each of the holes:

$$\hat{Z}(\mathbf{x}^*) = \sum_{i=1}^n \lambda_i \, z(\mathbf{x}_i)$$

There are many ways we could come up with the weights $\lambda_i$. For example, we could let the weight for the closest data point be equal to 1 and the rest be equal to 0. This is the nearest neighbour (NN) estimate.

Or, as a generalisation of this, we could let the weights for the closest $k$ data points be equal to $1/k$ and the rest equal to $0$. This is the k-nearest neighbour (kNN) estimate.

Or we could set the weights equal to the normalised reciprocals of the distances between the existing locations and the new location:

$$\lambda_i = \frac{1/d_i}{\sum_{j=1}^n 1/d_j}, \qquad d_i = \lVert\mathbf{x}_i - \mathbf{x}^*\rVert$$

This is the inverse distance weighting (IDW) estimate.
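A minimal sketch of the IDW weights in the 1D case; the function name is mine, and the small eps guards against division by zero when the new location coincides with a sample:

```python
import numpy as np

def idw_estimate(x_train, z_train, x_new, eps=1e-12):
    """Inverse distance weighting: normalised reciprocal distances as weights."""
    d = np.abs(x_train - x_new) + eps      # distances to the new location
    w = (1.0 / d) / np.sum(1.0 / d)        # weights, normalised to sum to 1
    return np.sum(w * z_train)             # linear estimate
```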

7.2 Desirable properties of a linear estimator

Each of the above approaches, while simple, is somewhat unsatisfying. They leave much to be desired from a statistical standpoint. In particular:

  - The weights are chosen by ad hoc rules that take no account of the spatial correlation structure of the data
  - There is no sense in which the resulting estimates are optimal
  - They provide no measure of the uncertainty of the estimate

A question worth asking is: Given we want to make a linear estimate $\hat{Z}(\mathbf{x}^*)$ of $Z(\mathbf{x}^*)$, what are some requirements we want such an estimate to have? A statistician might answer this question as follows:

  - It should be unbiased: on average, the estimation error should be zero
  - It should have minimum variance: among all estimators satisfying the above, it should minimise the variance of the estimation error

Under a set of assumptions, it is possible to show that kriging is the best linear unbiased predictor (BLUP) of the regionalised variable in question at unsampled locations.

What is a BLUP? It is a predictor with the following properties:

  - Best: it minimises the variance of the estimation error
  - Linear: it is a linear combination of the observed values
  - Unbiased: the expected value of the estimation error is zero

7.3 The unbiasedness criterion

Before we proceed, a brief note on notation: for simplicity, because for the purpose of this derivation we are only considering a single point to be estimated, I will use the notation $Z^*$ to refer to what I have called $Z(\mathbf{x}^*)$ previously, and $\hat{Z}^*$ for its linear estimate. With this in mind, and writing $m$ for the (constant but unknown) mean of the stationary regionalised variable, we have

$$\mathbb{E}[\hat{Z}^*] = \mathbb{E}\Big[\sum_{i=1}^n \lambda_i Z(\mathbf{x}_i)\Big] = \sum_{i=1}^n \lambda_i\, \mathbb{E}[Z(\mathbf{x}_i)] = m \sum_{i=1}^n \lambda_i$$

Unbiasedness requires $\mathbb{E}[\hat{Z}^*] = \mathbb{E}[Z^*] = m$, which holds if and only if

$$\sum_{i=1}^n \lambda_i = 1$$

This shows that the requirement that the estimator is unbiased implies that the kriging weights are normalised.

7.4 The minimum variance criterion

The remaining requirement is that the weights $\lambda_i$ minimise the variance of the estimation error, $\mathrm{Var}[\hat{Z}^* - Z^*]$, subject to the normalisation constraint $\sum_i \lambda_i = 1$ derived above. Expanding the error variance in terms of the covariance function,

$$\mathrm{Var}[\hat{Z}^* - Z^*] = \sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j C(\mathbf{x}_i, \mathbf{x}_j) - 2\sum_{i=1}^n \lambda_i C(\mathbf{x}_i, \mathbf{x}^*) + \sigma^2$$

This constrained minimisation is solved by introducing a Lagrange multiplier $\mu$. The Lagrange multiplier is an additional unknown constant which makes the number of unknown parameters equal to the number of equations. The Lagrangian is

$$L(\boldsymbol{\lambda}, \mu) = \mathrm{Var}[\hat{Z}^* - Z^*] + 2\mu\left(\sum_{i=1}^n \lambda_i - 1\right)$$

Setting the partial derivative of $L$ with respect to $\mu$ to zero recovers the normalisation constraint, and setting the partial derivatives with respect to the $\lambda_i$ to zero yields a system of $n$ linear equations:

$$\sum_{j=1}^n \lambda_j C(\mathbf{x}_i, \mathbf{x}_j) + \mu = C(\mathbf{x}_i, \mathbf{x}^*), \qquad i = 1, \dots, n$$

In matrix form:

$$\begin{pmatrix} \mathbf{C}_{zz} & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda} \\ \mu \end{pmatrix} = \begin{pmatrix} \mathbf{c}_* \\ 1 \end{pmatrix}$$

where $\mathbf{1}$ is a vector of ones and $\mathbf{c}_*$ is the vector of covariances $C(\mathbf{x}_i, \mathbf{x}^*)$ between the sampled locations and the unsampled location.
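As a sketch, this augmented system can be solved directly; the function name is my own, and note that the recovered weights sum to 1 by construction:

```python
import numpy as np

def ordinary_kriging_weights(C_zz, c_star):
    """Solve the ordinary kriging system for the weights and the Lagrange multiplier."""
    n = len(c_star)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = C_zz            # covariances among sampled locations
    A[:n, n] = 1.0              # column of ones enforcing the constraint
    A[n, :n] = 1.0              # row of ones enforcing the constraint
    b = np.append(c_star, 1.0)  # covariances to the target, then the constraint RHS
    sol = np.linalg.solve(A, b)
    return sol[:n], sol[n]      # weights lambda and Lagrange multiplier mu
```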

7.5 Showing the derivations give equivalent results

Considering the system for $\boldsymbol{\lambda}$ only, i.e. $\mathbf{C}_{zz}\boldsymbol{\lambda} = \mathbf{c}_*$ (the case $\mu = 0$, which corresponds to simple kriging with a known mean), taking the transpose of both sides and right-multiplying by $\mathbf{C}_{zz}^{-1}$ gives $\boldsymbol{\lambda}^\top = \mathbf{c}_*^\top \mathbf{C}_{zz}^{-1}$, so that

$$\hat{Z}^* = \boldsymbol{\lambda}^\top \mathbf{z} = \mathbf{c}_*^\top\, \mathbf{C}_{zz}^{-1}\, \mathbf{z}$$

But the matrices on the RHS are just $\mathbf{C}_{*z}$ and $\mathbf{C}_{zz}^{-1}$ defined previously! Substituting these,

$$\hat{Z}^* = \mathbf{C}_{*z}\, \mathbf{C}_{zz}^{-1}\, \mathbf{z}$$

which is precisely the same as the posterior mean $\boldsymbol{\mu}^*$ of the Gaussian process derived previously in the GPR equations. This shows that the mean of the GP is indeed a BLUP for $Z^*$. QED.

8 Limitations of kriging

8.1 Endnote: Alternative to fitting a variogram: Estimating model parameters using MLE

Instead of specifying the covariance via an empirical variogram, we could have estimated it using maximum likelihood estimation (MLE). The log marginal likelihood is given by:

$$\log p(\mathbf{z} \mid \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{z}^\top \mathbf{C}_{zz}^{-1}\mathbf{z} - \frac{1}{2}\log\lvert\mathbf{C}_{zz}\rvert - \frac{n}{2}\log 2\pi$$

The partial derivatives with respect to the kernel parameters $\theta_j$ are given by:

$$\frac{\partial}{\partial \theta_j}\log p(\mathbf{z}\mid\boldsymbol{\theta}) = \frac{1}{2}\mathbf{z}^\top \mathbf{C}_{zz}^{-1}\frac{\partial \mathbf{C}_{zz}}{\partial \theta_j}\mathbf{C}_{zz}^{-1}\mathbf{z} - \frac{1}{2}\operatorname{tr}\!\left(\mathbf{C}_{zz}^{-1}\frac{\partial \mathbf{C}_{zz}}{\partial \theta_j}\right)$$

Details of an algorithm for a practical implementation of the above are beyond the scope of this post. In short, it involves performing a Cholesky decomposition of $\mathbf{C}_{zz}$ and solving two linear systems of equations involving the resulting lower triangular matrix. The interested reader should refer to Chapter 2 of Rasmussen and Williams (2006).
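A minimal sketch of this, assuming a Gaussian covariance with parameters optimised in log space via scipy.optimize.minimize; the data and names are illustrative stand-ins:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, x, z):
    """Negative log marginal likelihood of a zero-mean GP with Gaussian covariance."""
    sill, rng = np.exp(theta)                            # optimise in log space
    h = np.abs(x[:, None] - x[None, :])
    C = sill * np.exp(-((h / rng) ** 2)) + 1e-10 * np.eye(len(x))
    L = np.linalg.cholesky(C)                            # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))  # C^{-1} z via two triangular solves
    return (0.5 * z @ alpha
            + np.sum(np.log(np.diag(L)))                 # 0.5 * log|C|
            + 0.5 * len(x) * np.log(2.0 * np.pi))

x = np.array([5.0, 12.0, 20.0, 28.0, 33.0, 41.0, 70.0, 78.0, 85.0, 95.0])
z = np.sin(x / 10.0)                                     # stand-in measurements
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 20.0]), args=(x, z))
sill_mle, range_mle = np.exp(res.x)
```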

9 References

Bárdossy, A. Introduction to geostatistics.

Bay, X., Gauthier, B., 2010. Extended formula for kriging interpolation.

Bishop, C. M., 2006. Pattern recognition and machine learning.

Ginsbourger, D., et al., 2008. A note on the choice and the estimation of kriging models for the analysis of deterministic computer experiments.

Le, N. D., Zidek, J. V., 1992. Interpolation with uncertain spatial covariances: A Bayesian alternative to kriging.

Malvic, T., Balic, D., 2010. Linearity and Lagrange linear multiplicator in the equations of ordinary kriging.

Mønness, E., Coleman, S., 2011. A short note on variograms and correlograms.

Rasmussen, C. E., Williams, C. K. I., 2006. Gaussian processes for machine learning.

Ro, Y., Yoo, C., 2022. Numerical experiments applying simple kriging to intermittent and log-normal data.

Sjöstedt-De Luna, S., Young, A., 2003. The bootstrap and kriging prediction intervals.

Wilson, J. T., et al., 2020. Efficiently sampling functions from Gaussian process posteriors.
