# Tutorial on Bayesian statistics for geophysicists

## Tutorial on Bayesian Statistics for Geophysicists

The primary authors of this tutorial are L.M. Berliner and N. Cressie.

### Essence of Bayesian Reasoning

One rationale for the Bayesian approach is that it is a mathematically rigorous method for combining information in the presence of uncertainty. Uncertainty is a notion that applies to observations, models, and parameters, in differing amounts depending on the problem under investigation. This tutorial in combination with the Web Project: Ice Streams, demonstrates how uncertainty can be accounted for and scientific inference can be made on ice-stream dynamics using the Bayesian approach.

The primary computational tool used is Bayes' Theorem, which is simply a result in probability theory relating various conditional distributions. However, the Bayesian view of statistical modeling and analysis involves more than simple application of probability theory. Indeed, it is a paradigm that involves the modeling of unknowns as random variables and using observations to update that modeling effort.

Two primary sources of information are available for inference on the unknown quantities of interest: (i) observations or data that convey some information regarding those unknowns, and (ii) prior information, based on scientific reasoning regarding the unknowns, as well as past experience and data. Both sources of information are subject to uncertainties. The basis of the Bayesian approach is that all such uncertainties are modeled probabilistically. Specifically, a data model or statistical distribution for the data is developed conditional upon the unknowns; the result is denoted by $[\mbox{data} \mid \mbox{unknowns}]$. Similarly, prior information and uncertainty are summarized through a prior probability distribution, denoted by $[\mbox{unknowns}]$.

Construction of a data model does not require that we actually know the values of the unknowns (fortunately!). Rather, this distribution is viewed as a function of both the unknowns and the data. For example, if we observe the surface velocity of an ice-stream at some point in space, we would not believe that the resulting observation, $U$, is exactly equal to the true value, $u$. The presence of an unobservable measurement error, $e$, would be anticipated. A common assumption is that $$U = u + e.$$ To complete the statistical argument, we would suggest a probability distribution for $e$. Suppose that it is plausible to assert that $e$ is a Gaussian or normally distributed random variable with mean zero and a known (say from previous calibration studies) variance $\sigma^2$; in shorthand, $e \sim N(0,\sigma^2)$, where $\sim$ is read "is distributed as". This suggests a statistical model for $U$, conditional upon $u$: $$U \mid u \, \sim \, N(u, \sigma^2).$$

The development of a prior distribution, $[\mbox{unknowns}]$, is the subject of much of the Web Project: Ice Streams. For now, suppose this step has been accomplished. Bayes' Theorem then yields the solution to our problem, which is the posterior distribution of the unknowns conditional upon the observed (i.e., now fixed) data, given by $$[\mbox{unknowns} \mid \mbox{data}] = \frac{[\mbox{data} \mid \mbox{unknowns}] \, [\mbox{unknowns}]}{[\mbox{data}]}.$$ The denominator, $[\mbox{data}]$, in this expression can be viewed simply as that constant, depending on the observed data but not the unknowns, that normalizes the numerator, ensuring that the posterior is indeed a probability distribution (physicists may note the analogy to the partition function in statistical physics).

Pursuing our simple example regarding velocity, suppose that our prior information suggests that the true velocity $u$ is normally distributed:

$$u\sim N(\mu,\sigma^2_u).$$

One can show that the posterior distribution $[u{\mid}U]$ is also Gaussian (e.g., Berger 1985, pp.127-128) with mean

$$\frac{\sigma^2_u}{\sigma^2_u+\sigma^2}U+\frac{\sigma^2}{\sigma^2_u+\sigma^2}\mu$$

and variance

$$\frac{\sigma^2\sigma^2_u}{\sigma^2_u+\sigma^2}.$$

Inspection of these formulas indicates how Bayesian analysis provides combinations of data and prior information. First, note that the posterior mean of $u$ is a weighted average of the prior mean (or guess) $\mu$ and the purely-data-based estimate $U$. Also, note that if the datum is extremely precise relative to our prior information ($\sigma^2$ is small compared with $\sigma^2_u$), the posterior mean is very close to $U$. Next, note that the posterior variance is smaller than either $\sigma^2$ and $\sigma^2_u$; that is, we learn more via the combination of data and prior than each tells us separately. (These aspects are very common, but not universal, aspects of Bayesian analysis.)

It is instructive to compare the Bayesian approach to a conventional statistical approach. In the latter case, primary attention is directed to the data model Eq.(2) only. In particular, the optimal estimator in the sense of both maximum likelihood and minimum squared errors of $u$ is the observation $U$. But notice that if we simply choose the prior distribution to be uniform or constant everywhere, Eq.(3) reduces to $$[\mbox{unknowns} \mid \mbox{data}] \propto [\mbox{data} \mid \mbox{unknowns}],$$ suggesting that many traditional estimation procedures are interpretable as modes (i.e., maxima) of posterior distributions corresponding to uniform priors. This applies quite generally to least-squares estimates, maximum-likelihood estimates, including spline smoothers and certain three- and four-dimensional variational data assimilation procedures, and control methods applied in glaciology (see below). In this sense, the Bayesian approach is more general since geophysical knowledge typically leads to a prior distribution $[\mbox{unknowns}]$ that is not uniform. For example, the force of gravity counterbalanced by stresses on an ice stream can be incorporated into $[\mbox{unknowns}]$, making it far less uncertain than the uniform model that yields the conventional statistical approach.

A key issue in statistical inference is that it involves not only the production of estimates, but also quantifications of uncertainly associated with them. The Bayesian approach responds with the production of the full posterior distribution. By contrast, conventional statistical approaches must rely only on the data model. In our example, the only quantification of variability in $U\mid u\sim N(u,\sigma^2)$ is $\sigma^2$. In classical statistics, this quantity can be used to produce confidence intervals for the true value $u$: $U\pm z\sigma$, where $z$ depends on the level of confidence desired (e.g., a 95% interval arises if $z$=1.96). However, the correct interpretation of a 95% confidence interval is that if data were gathered under identical conditions 100 times, one would expect that 95 out of the 100 confidence intervals would contain the true value $u$. Under the conventional statistical approach, $u$ is a constant and hence is not amenable to a probability statement about its value.

Rather than velocity at a single point, we are interested in velocity profiles or velocity fields. Suppose that a vector of observations $U$ is available for estimation of the surface velocity profile $u(x)$ for locations $x$ in some domain. By analogy to Eq.(2), we might postulate the data model, $[{\bf U}\mid {\bf u}]$ as a multivariate normal (Gaussian) distribution, $N({\bf u}, \sigma^2 I)$, where $u$ is the vector of the true velocities at the observation locations. Choosing the simple estimate $U$ to estimate $u$ is the result of using conventional statistics, but again it accounts for none of our scientific understanding regarding the structure of velocity fields and, furthermore, it does not suggest a method for estimating velocity at unobserved locations. A purely statistical modeling approach would postulate the true profile to be a stochastic process with some mean function and some spatial covariance function, followed by spatial prediction of the field $u(x)$ for all $x$ in the domain. This would require the tasks of formulation and estimation of the covariances. Such approaches are known as kriging or "objective analysis" (though the modifier "objective" is subject to misinterpretation); kriging and related spatial prediction methods are discussed in an introductory article by Cressie (1989).

Physical-statistical modeling offers the opportunity to develop models based on physical reasoning, and we shall use this to build the distribution, $[\mbox{unknowns}]$; see Web Project: Ice Streams. We note that control methods (see below) also make formal use of a physical model, though quantification of uncertainty remains a difficult issue. The physical-statistical Bayesian approach relies on physically based interrelationships among (unknown and sometimes imperfectly observed) quantities of interest, but it focuses on probability distributional interrelationships. To accomplish our goal of scientific inference on the unknown quantities of interest, we use the technology of hierarchical statistical modeling.

### (Bayesian) Hierarchical Statistical Modeling

Many procedures used in the analysis of geophysical data are Bayesian or approximately Bayesian. For example, the Kalman filter is a Bayesian procedure (e.g., Meinhold and Singpurwalla, 1983). As indicated above, a variety of other standard procedures have Bayesian derivations and interpretations (e.g., Evensen and van Leeuwen 2000; Lorenc 1986). General presentations of Bayesian analysis intended for geophysical audiences are given in Epstein (1985) and Tarantola (1987).

Nevertheless, the full power of the methodology is only recently making inroads in geophysics (e.g., Berliner, Wikle, and Cressie 2000; Berliner, Milliff, and Wikle 2003). The two breakthroughs have been in modeling (through emphasis on Bayesian hierarchical modeling) and in computations (through variants of Markov chain Monte Carlo or MCMC).

Consider three basic collections of variables to be modeled: data, our observations; processes, those physical, state variables of interest (e.g., velocities, stresses, etc.); and parameters, unknown physical constants and parameters introduced in the statistical components of the model. Hierarchical thinking suggests a model with three primary components (e.g., Berliner 1996), namely

Data model: $[\mbox{data}|\mbox{processes}, \mbox{parameters}]$,
Prior processes model: $[\mbox{process}|\mbox{parameters}]$,
Prior on parameters: $[\mbox{parameters}]$.

Though the construction of these components is not easy, once the modeling is complete, Bayes' Theorem yields the posterior distribution, $[\mbox{process, parameters}|\mbox{data}]$. Notice that processes and parameters together make up all that is unknown in the problem. That is, $[\mbox{unknowns}] = [\mbox{process, parameters}]$, and Bayes' Theorem given by Eq.(3) can be used to obtain the posterior distribution.

In the Bayesian treatment of parameters, even unknown constants are treated as if they are random. This is important since the definitions of some model parameters may be based on approximations. For example, flow parameters are not really unknown physical constants, but rather functions of other unmodeled process variables (e.g., temperature, pressure, etc.), which introduces uncertainty. While the Bayesian approach may appear complicated compared to a conventional statistical approach, its considerable advantage is that it quantifies and manages uncertainties through to final conclusions.

### Comparison to Other Methods

For further perspective on the Bayesian approach, we offer a brief comparison to a more familiar notion of inverse analysis.

For further perspective on the Bayesian approach, we offer a brief comparison to a more familiar notion of inverse analysis using control methods, for a glaciological problem described in MacAyeal (1989). In their analysis of data from the NE Greenland Ice Stream, Joughin, Fahnestock, MacAyeal, Bamber, and Gogineni (2001) rely on a model that "determines the velocity and stress fields as functions of ice thickness, surface elevation" and other inputs. In the Web Project: Ice Streams, the same ice stream is studied with updated data along a transect down its center. Though their model is more complicated, they have a central step that is parallel to the use of (13) and (14) in Web Project: Ice Streams, using "forward models" that rely on ice thickness and surface elevation.

Joughin et al. (2001) proceed "to find the model parameters (e.g., basal shear stress) that minimize[s] the misfit between a model-derived velocity field and its observed counterpart." In our context and notation, they find $||{\bf U}- {\bf u}||^2$, where ${\bf u}$ is similar to the model in (14), and minimize it over the unknown quantities. One crucial point of departure from the fully Bayesian approach is that in their analysis, no formal adjustments are made to account for the fact that the surface and thickness are unknown, and merely estimated by observations. In a Bayesian analysis, the surface and thickness are random quantities with posterior distributions based on the observations, but sensitive to the uncertainties in those data. And, a Bayesian analysis enables one to explicitly account for spatial dependencies.

The second key difference between the two approaches is that the Bayesian approach does not rely solely on the minimization of a cost function (in this case $||{\bf U}- {\bf u}||^2$) to produce estimates of parameters. Instead, the full distribution of the unknowns is developed. Summaries of the final posterior distribution such as its mean or mode can be computed, and we can also ascribe uncertainty measures to the results, produce realizations from the posterior, and so forth. To emphasize the point further, control methods of inverse modeling produce posterior modes from posterior distributions based on relatively vague priors. That is, conventional statistical approaches are special cases of an approximate Bayesian analysis.

We note that the uncertainty management of a Bayesian analysis should not be expected to come for free. Bayesian modeling is generally more complex, since inverse probability analysis (a historical name for Bayesian analysis) may require more assumptions than inverse estimation. Bayesian computations may also seem more complicated, though nonlinear optimization is seldom easy, with results that may not be stable. Indeed, some of the apparent complexities in either the Bayesian approach or the conventional statistical approach are arguably often matters of experience and tradition.

### Acknowledgments

The preparation of this tutorial was supported by the National Science Foundation, Office of Polar Programs and Probability and Statistics Program, under Grant No. 0229292.

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Berliner, L.M. (1996). Hierarchical Bayesian time series models. In Maximum Entropy and Bayesian Methods, K. Hanson and R. Silver (Eds.), Kluwer Academic Publishers, 15-22.

Berliner, L.M., Wikle, C.K., and Cressie, N. (2000). Long-lead prediction of Pacific SST's via Bayesian dynamic modeling. Journal of Climate13, 3953-3968.

Berliner, L.M., Milliff, R.F., and Wikle, C.K. (2003). Bayesian hierarchical modeling of air-sea interaction. Journal of Geophysical Research108(C4), 3104, doi:10.1029/ 2002JC001413.

Cressie, N. (1989). Geostatistics. American Statistician43, 197-202.

Epstein, E.S. (1985). Statistical Inference and Prediction in Climatology: A Bayesian Approach. American Meteorological Society, Boston, MA.

Evensen, G. and van Leeuwen, P.J. (2000). An ensemble Kalman smoother for nonlinear dynamics. Monthly Weather Review128, 1852-1867.

Joughin, I., Fahnestock, M., MacAyeal, D., Bamber, J.L., and Gogineni, P. (2001). Observation and analysis of ice flow in the largest Greenland ice stream. Journal of Geophysical Research106(D24), 34,021-34,034.

MacAyeal, D.R. (1989). Large-scale flow over a viscous basal sediment: Theory and application to Ice Stream B, Antarctica. Journal of Geophysical Research94(B4), 4071-4088.

Meinhold, J. and Singpurwalla, N. (1983). Understanding the Kalman filter. American Statistician37, 123-127.

Tarantola, A. (1987). Inverse Problem Theory: Methods for Data Fitting and Model Parameter Estimation. Elsevier, New York.