2  Probability and inference

This chapter covers the fundamental concepts of probabilistic modeling and statistical inference. Data analysis proceeds by building a generative model—a formal, probabilistic hypothesis about how data are created. Inference is the inverse problem of using observed data to learn about model parameters. Computational methods make complex inference problems tractable.

Learning objectives

After reading this chapter, you should be able to:

  • Build generative models using probability distributions.
  • Apply maximum likelihood and Bayesian inference to estimate parameters.
  • Use Monte Carlo methods when analytical solutions don’t exist.
  • Recognize when computational methods are required and when analytical approaches suffice.

2.1 Probability theory

The concepts in this section provide the mathematical language for describing uncertainty and underpin the practical modeling applications in the rest of the chapter.

2.1.1 Basic concepts

2.1.1.1 Random variables

A random variable is a function that assigns numerical values to the outcomes of a random experiment. Random variables provide the mathematical foundation for describing uncertainty.

  • Discrete random variables take on countable values (e.g., number of floods per year)
  • Continuous random variables take on uncountable values (e.g., temperature, precipitation amount)

2.1.1.2 Notation conventions

  • Random variables: Capital letters (\(X\), \(Y\), \(Z\))
  • Realizations (specific values): Lowercase letters (\(x\), \(y\), \(z\))
  • Parameters: Greek letters (\(\theta\), \(\mu\), \(\sigma\))
  • Observed data: \(y\) (following Bayesian convention)
  • Predictions: \(\tilde{y}\) (y-tilde)

2.1.2 Distribution functions

The foundation of probability theory rests on three fundamental functions that describe random variables.

2.1.2.1 Probability Mass Function (PMF)

For discrete random variables, \(P(X = x)\) gives the probability that the variable takes on a specific value \(x\). The PMF satisfies \(\sum_x P(X = x) = 1\).

2.1.2.2 Probability Density Function (PDF)

For continuous random variables, \(p(x)\) describes the relative likelihood of different values.

The PDF \(p(x)\) is not a probability but a density. Since probability is density multiplied by a (potentially very small) interval of \(x\), the value of \(p(x)\) itself can exceed 1 without violating the laws of probability. Probabilities are areas under the curve: \(P(a \leq X \leq b) = \int_a^b p(x) \, dx\). PDFs are sometimes written as \(f(x)\) or \(f_X(x)\) and must satisfy \(\int_{-\infty}^{\infty} p(x) \, dx = 1\).
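
A minimal sketch of both points, assuming the Distributions.jl package (loaded later in this chapter): a narrow Normal has density values above 1, while probabilities come from CDF differences.

using Distributions

d = Normal(0.0, 0.1)          # narrow Normal: small σ concentrates density
pdf(d, 0.0)                   # ≈ 3.99, a density, not a probability
cdf(d, 0.1) - cdf(d, -0.1)    # P(-0.1 ≤ X ≤ 0.1) ≈ 0.683, a genuine probability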

2.1.2.3 Cumulative Distribution Function (CDF)

\(F(x) = P(X \leq x)\) gives the probability that a random variable is less than or equal to \(x\). The CDF is the most fundamental descriptor, defined for all random variables (discrete and continuous). It unifies probability concepts and is essential for quantiles and return periods.

2.1.2.4 Quantile function

The quantile function \(Q(p) = F^{-1}(p)\) is the inverse of the CDF. It takes a probability \(p \in [0,1]\) and returns the value \(x\) such that \(P(X \leq x) = p\).

For example, the median is \(Q(0.5)\), and the 99th percentile is \(Q(0.99)\).
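
A quick sketch with Distributions.jl, showing that the quantile function and CDF are inverses of each other:

using Distributions

d = Normal(0, 1)
quantile(d, 0.5)            # median = 0.0
quantile(d, 0.99)           # 99th percentile ≈ 2.326
cdf(d, quantile(d, 0.99))   # round trip recovers 0.99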

2.1.2.5 Examples of key distribution functions

Understanding these concepts requires working through concrete examples. The Distribution Examples notebook demonstrates these relationships using normal and Poisson distributions.

These examples show both forward operations (finding probabilities from values) and inverse operations (finding values from probabilities), illustrating how the three fundamental functions work together to characterize uncertainty.

2.1.3 Multiple variables

2.1.3.1 Joint, marginal, and conditional distributions

Real systems involve multiple random variables, requiring tools to describe their relationships. This machinery allows construction of complex models from simpler components.

The joint distribution \(p(x,y)\) for continuous or \(P(X=x, Y=y)\) for discrete gives the probability of events occurring together.

The marginal distribution \(p(x)\) or \(P(X=x)\) gives the probability of an event, irrespective of other variables. It is calculated by summing or integrating over the other variables: \(p(x) = \int p(x,y) \, dy\).

The conditional distribution \(p(y \mid x)\) or \(P(Y=y \mid X=x)\) gives the probability of an event given that another event has occurred. Conditional distributions describe how variables depend on each other.
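
A minimal discrete sketch, using a hypothetical joint PMF over daily temperature (hot/cool) and rainfall (rain/dry), anticipates the independence example below:

# Hypothetical joint PMF: rows are X ∈ (hot, cool), columns are Y ∈ (rain, dry)
P = [0.05 0.45;
     0.25 0.25]

sum(P)                       # joint PMF sums to 1
P_x = vec(sum(P, dims = 2))  # marginals: P(hot) = 0.5, P(cool) = 0.5
P_y = vec(sum(P, dims = 1))  # marginals: P(rain) = 0.3, P(dry) = 0.7
P[1, 1] / P_x[1]             # conditional: P(rain | hot) = 0.1 < P(rain)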

2.1.3.2 Visualizing joint, marginal, and conditional distributions

Understanding the relationships between joint, marginal, and conditional distributions becomes clearer with visualization. The Distribution Examples notebook includes a comprehensive multivariate example showing how joint distributions decompose into marginal and conditional components.

2.1.3.3 Independence

Two random variables \(X\) and \(Y\) are independent if their joint distribution is the product of their marginal distributions: \(p(x,y) = p(x)p(y)\) for continuous variables, and \(P(X=x, Y=y) = P(X=x)P(Y=y)\) for discrete variables. For example, temperature and rainfall on a given day are typically not independent—hot days often have lower rainfall probability.

2.1.3.4 IID (Independent and Identically Distributed)

A sequence of random variables is IID if the variables are independent and share the same distribution. Many statistical models assume IID data points, which enables powerful analytical and computational techniques.

2.1.3.5 Bayes’ rule

The mechanical relationship between joint, marginal, and conditional distributions:

\[p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}\]

Bayes’ rule is a direct consequence of the definition of conditional probability. It becomes a tool for inference when one variable represents parameters and the other represents data, as Section 2.4 develops.
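
Continuing the hypothetical joint PMF from the sketch above, Bayes’ rule can be verified numerically against the direct conditional:

# Same hypothetical joint PMF: rows (hot, cool), columns (rain, dry)
P = [0.05 0.45;
     0.25 0.25]

P_hot = P[1, 1] + P[1, 2]            # marginal P(hot) = 0.5
P_rain = P[1, 1] + P[2, 1]           # marginal P(rain) = 0.3
P_rain_given_hot = P[1, 1] / P_hot   # conditional P(rain | hot) = 0.1

P_rain_given_hot * P_hot / P_rain    # Bayes' rule: P(hot | rain) ≈ 0.167
P[1, 1] / P_rain                     # direct computation agrees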

2.2 Statistical foundations

This section provides the mathematical toolkit that underlies statistical inference. These results justify why statistical methods work and enable practical computation.

2.2.1 Summary statistics

2.2.1.1 Expectation (Expected Value)

The expectation is the formal definition of the quantity we approximate with the sample mean in a Monte Carlo simulation. The expectation of a function \(g(X)\) is:

\[\mathbb{E}[g(X)] = \int g(x) p(x) \, dx\]

for continuous variables, or \[\mathbb{E}[g(X)] = \sum_x g(x) P(X = x)\]

for discrete variables.

2.2.1.2 Moments of a distribution

Probability distributions are completely described by their PDF/PMF and CDF, but we often need summary statistics that capture essential properties.

  • Mean: \(\mu = \mathbb{E}[X]\) measures central tendency
  • Variance: \(\sigma^2 = \mathbb{E}[(X - \mu)^2]\) measures spread or scale; standard deviation is \(\sigma = \sqrt{\sigma^2}\)
  • Higher-order moments: Skewness measures asymmetry, kurtosis measures tail weight

For certain heavy-tailed distributions, some higher-order moments (or even the variance) may not exist because the defining integrals diverge.
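
A quick illustration, assuming the Cauchy distribution from Distributions.jl (whose variance is undefined): sample variance estimates never settle down, no matter how much data we collect.

using Distributions, Statistics, Random

Random.seed!(1)
x = rand(Cauchy(0, 1), 10^6)                      # heavy tails: variance undefined
[var(x[1:n]) for n in (10^3, 10^4, 10^5, 10^6)]   # estimates do not stabilize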

2.2.2 Fundamental theorems

Two fundamental theorems provide the mathematical foundation for both statistical estimation and computational methods. These results justify why statistical methods work and when we can trust their results.

2.2.2.1 Law of Large Numbers

Subject to mild conditions, the sample mean converges to the expected value as the number of samples increases:

\[\frac{1}{N} \sum_{i=1}^N X_i \to \mathbb{E}[X]\]

This theorem underlies both parameter estimation and Monte Carlo simulation. It guarantees that maximum likelihood estimates become accurate with sufficient data, and that Monte Carlo approximations become precise with enough samples.
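
A minimal demonstration, assuming an Exponential distribution with mean 2 (Distributions.jl uses the scale parameterization):

using Distributions, Random

Random.seed!(1)
x = rand(Exponential(2.0), 100_000)               # E[X] = 2
running_mean = cumsum(x) ./ (1:length(x))
[running_mean[n] for n in (10, 1_000, 100_000)]   # drifts toward 2.0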

2.2.2.2 Central Limit Theorem

The distribution of a sample mean approaches a Normal distribution as the sample size increases, regardless of the underlying distribution shape:

\[\frac{\sqrt{N}(\bar{X} - \mu)}{\sigma} \to N(0,1)\]

where \(\bar{X}\) is the sample mean, \(\mu = \mathbb{E}[X]\), and \(\sigma^2 = \text{Var}(X)\).

This enables uncertainty quantification through confidence intervals and justifies the widespread use of normal approximations in statistical inference. The CLT explains why many phenomena follow normal distributions—they arise from sums of many small, independent effects.
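
A quick numerical check, assuming the same skewed Exponential distribution (mean 2, standard deviation 2): standardized sample means behave like draws from \(N(0,1)\), even though each underlying sample is far from Normal.

using Distributions, Statistics, Random

Random.seed!(1)
μ, σ, N = 2.0, 2.0, 100   # Exponential(2.0): mean 2, standard deviation 2
z = [(mean(rand(Exponential(2.0), N)) - μ) / (σ / sqrt(N)) for _ in 1:10_000]
mean(abs.(z) .< 1.96)     # ≈ 0.95, as the standard Normal predicts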

2.2.3 Monte Carlo Expectations

Most decision-relevant quantities can be expressed as expectations. When analytical calculation is impossible, simulation provides a practical approximation method.

The basic Monte Carlo estimate of an expectation is:

\[\mathbb{E}[g(X)] \approx \frac{1}{N} \sum_{i=1}^N g(x_i)\]

where \(x_1, x_2, \ldots, x_N\) are independent samples from the distribution of \(X\). More sophisticated methods (importance sampling, MCMC) exist for drawing samples from more complex distributions.

The Law of Large Numbers guarantees convergence, while the Central Limit Theorem provides the convergence rate. The Monte Carlo standard error is approximately \(\sigma/\sqrt{N}\), where \(\sigma\) is the standard deviation of \(g(X)\). This means that to halve the error, we need four times as many samples.
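
A minimal sketch, assuming \(g(x) = x^2\) with \(X \sim N(0,1)\) (so the true expectation is 1): quadrupling the sample size roughly halves the Monte Carlo standard error.

using Distributions, Statistics, Random

Random.seed!(1)
g(x) = x^2   # E[g(X)] = 1 for X ~ N(0, 1)
for N in (1_000, 4_000)
    gx = g.(rand(Normal(0, 1), N))
    println("N = $N: estimate ≈ ", round(mean(gx), digits = 3),
        ", standard error ≈ ", round(std(gx) / sqrt(N), digits = 4))
end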

Monte Carlo methods become essential when dealing with high-dimensional integrals that arise in Bayesian inference and uncertainty propagation through complex models.

2.2.4 Transformation of variables

A core task in probabilistic modeling is understanding how randomness propagates through a system. If we have a random variable \(X\) with PDF \(p_X(x)\) and create \(Y = g(X)\), what is the PDF of \(Y\)?

Simply substituting \(x = g^{-1}(y)\) in the original PDF is incorrect because functions can stretch or compress the probability space. We must account for this distortion.

The general change of variables formula is:

\[p_Y(y) = p_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|\]

The term \(\left| \frac{d}{dy} g^{-1}(y) \right|\) is the Jacobian—a “stretching factor” ensuring probability mass is conserved. When a function stretches a region, density decreases proportionally to keep total probability equal to 1. When it compresses a region, density increases.

This formula derives from working through CDFs and applying the chain rule, but the key insight is that transformations distort the coordinate system and we must adjust densities accordingly.
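
A numerical check of the formula, assuming \(Y = e^X\) with \(X \sim N(0,1)\): the transformed density is the known LogNormal.

using Distributions

# Y = exp(X): g⁻¹(y) = log(y), so |d g⁻¹/dy| = 1/y
pX = Normal(0, 1)
pY(y) = pdf(pX, log(y)) / y   # change-of-variables formula

pY(2.0)                       # ≈ 0.157
pdf(LogNormal(0, 1), 2.0)     # matches the known LogNormal density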

2.3 Likelihood and maximum likelihood estimation

The probability theory and statistical foundations we’ve covered provide the mathematical language for uncertainty and the tools for computation. We now turn from describing uncertainty to learning from data. The first major approach to statistical inference connects data to parameters through likelihood functions and optimization.

2.3.1 The likelihood function

The central tool for connecting data to parameters is the likelihood function. The likelihood is the conditional probability \(p(y \mid \theta)\), where \(y\) represents our observed data.

2.3.1.1 Definition

The likelihood tells us how likely we are to see the observed data \(y\) for some value of the parameters \(\theta\).

2.3.1.2 Crucial distinction

The likelihood is not the probability of the parameters. It’s the probability (or probability density) of the data given the parameters.

This confusion is common: \(p(\text{data}|\text{parameters})\) tells us about data likelihood, not parameter probability. MLE provides point estimates of the most likely parameter values, while Bayesian inference provides probability distributions over parameters. Only Bayesian inference gives us \(p(\text{parameters}|\text{data})\).

For continuous variables, since we’re dealing with a density, the probability of getting exactly that value is zero, but the probability of getting near it is the integral of the PDF over a small interval.

2.3.1.3 Independence and the product form

If we assume our data points are independent and identically distributed (IID), then by the definition of independence: \[p(y_1, y_2, \ldots, y_n \mid \theta) = \prod_{i=1}^n p(y_i \mid \theta)\]

2.3.1.4 The log-likelihood

Products of many small probabilities are numerically unstable and difficult to work with. Since the logarithm is monotonic, \[ \arg \max_\theta p(y \mid \theta) = \arg \max_\theta \log p(y \mid \theta) \] although \[ \max_\theta p(y \mid \theta) \neq \max_\theta \log p(y \mid \theta) \] in general.

For independent data, this gives us: \[ \log p(y \mid \theta) = \sum_{i=1}^n \log p(y_i \mid \theta) \]

2.3.2 Maximum likelihood estimation

Maximum likelihood estimation (MLE) finds the parameter values that maximize the likelihood function: \[\hat{\theta}_{\text{MLE}} = \arg\max_\theta p(y \mid \theta)\]

2.3.2.1 Why maximum likelihood makes sense

The likelihood function \(p(y \mid \theta)\) gives the probability of observed data under different parameter values. Maximum likelihood estimation finds the parameter values that maximize the probability of the observed data. The approach selects parameters that best explain the observations.

In practical applications, MLE estimates parameters of distributions describing observed phenomena by finding values that maximize the probability of historical observations. The estimates inform subsequent analysis and decision-making.

2.3.2.2 Implementation

We find the actual parameter values using optimization approaches. This may involve analytical differentiation (setting derivatives to zero) or numerical optimization methods when closed-form solutions don’t exist.

This reframes the statistical problem of inference as a numerical problem of optimization.

2.3.2.3 Properties and theoretical foundations

Understanding when and why MLE works requires defining estimator quality. An estimator should be consistent (converge to the true value as sample size increases), efficient (achieve low variance), and unbiased (correct on average).

Under regularity conditions, MLE estimators have desirable asymptotic properties. As the sample size \(n\) grows large, the MLE estimator \(\hat{\theta}_{\text{MLE}}\) becomes consistent—it converges to the true parameter value \(\theta_0\).

2.3.2.4 Computational considerations

Finding maximum likelihood estimates requires different approaches depending on the complexity of the model.

Analytical solutions exist when we can solve \[ \frac{d}{d\theta} \log p(y \mid \theta) = 0 \] in closed form. This works for simple models like Normal distributions with known variance, or the coin flip example we examine below. These cases provide valuable intuition and serve as building blocks for more complex problems.
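
For instance, anticipating the coin flip example below: for \(y\) heads in \(n\) IID flips with heads probability \(\theta\), the log-likelihood is \(\log p(y \mid \theta) = y \log \theta + (n - y) \log(1 - \theta)\) (up to an additive constant), and setting its derivative to zero gives

\[ \frac{y}{\theta} - \frac{n - y}{1 - \theta} = 0 \quad \Longrightarrow \quad \hat{\theta}_{\text{MLE}} = \frac{y}{n} \]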

Numerical optimization becomes necessary when no closed-form solution exists. Practical challenges arise in applications. The likelihood surface may contain multiple local maxima, requiring different starting values to find the global optimum. Numerical stability requires working with log-likelihoods rather than products of small probabilities. Flat likelihood surfaces indicate that data contain limited information about parameters. All methods assume correct model specification – poor model approximations yield misleading results regardless of optimization quality.

2.3.3 Example: coin flip maximum likelihood estimation

The coin flip example provides a pedagogically clean introduction to maximum likelihood estimation. The Coin Flip Inference notebook demonstrates both the analytical derivation and computational implementation of MLE.

This example shows how MLE connects intuitive parameter estimates (the observed proportion) with formal statistical theory. The same principles apply whether estimating coin bias or the frequency of extreme climate events from historical data.
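
Separate from the notebook’s implementation, here is a minimal numerical sketch of the optimization route, assuming the Optim.jl package and hypothetical data (62 heads in 100 flips):

using Distributions, Optim

n, y = 100, 62                                # hypothetical data
negloglik(θ) = -logpdf(Binomial(n, θ), y)     # negative log-likelihood

result = optimize(negloglik, 1e-6, 1 - 1e-6)  # univariate optimization on (0, 1)
Optim.minimizer(result)                       # ≈ 0.62 = y/n, matching the analytic MLE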

2.3.4 Linear regression example

Linear regression extends inference to multiple parameters, demonstrating the connections between curve fitting, maximum likelihood estimation, and Bayesian approaches.

The Linear Regression Examples notebook shows how the same statistical problem can be approached from three different philosophical perspectives, each providing different insights into parameter estimation and uncertainty quantification.

This example illustrates how probabilistic models connect optimization-based and fully Bayesian approaches, with implications for modeling relationships between climate variables and their impacts.

2.4 Bayesian inference

Maximum likelihood estimation provides point estimates of parameters by finding values that maximize the probability of observed data. Bayesian inference takes a fundamentally different approach: it treats parameters as random variables and computes full probability distributions that quantify uncertainty. In other words, rather than searching for \(\hat{\theta}\) that maximizes the likelihood, Bayesian inference seeks to estimate the entire distribution \(p(\theta \mid y)\).

2.4.1 Motivation and overview

Real-world risk assessment relies on multiple, imperfect data sources: short instrumental records, longer regional records, qualitative historical accounts, and physical constraints from models. Traditional statistical methods often struggle to formally integrate these different types of information into a single analysis. The Bayesian framework provides a principled solution by treating all knowledge—both prior beliefs and new data—as probability distributions that can be mathematically combined.

The core relationship is Bayes’ rule: \[ p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{p(y)} \]

This deceptively simple equation describes how we update our beliefs:

  • Prior distribution \(p(\theta)\): Quantifies existing knowledge about parameters before analyzing the current dataset
  • Likelihood function \(p(y \mid \theta)\): The engine for learning from data (identical to the likelihood used in maximum likelihood estimation)
  • Posterior distribution \(p(\theta \mid y)\): Our updated beliefs after combining prior knowledge with observed data
  • Marginal likelihood \(p(y)\): A normalizing constant ensuring the posterior integrates to 1

Since \(p(y)\) doesn’t depend on \(\theta\) for a fixed dataset, we often write: \[ p(\theta \mid y) \propto p(y \mid \theta) p(\theta) \]

The result is fundamentally different from maximum likelihood estimation: instead of a single “best” parameter estimate, we obtain a full probability distribution that naturally quantifies uncertainty. This enables direct probabilistic statements like “there is a 95% probability that the parameter lies between these values.”

2.4.2 Maximum A Posteriori: a bridge to optimization

Before exploring full Bayesian inference, we can find the single most probable parameter value given the data. Maximum A Posteriori (MAP) estimation finds the mode of the posterior distribution:

\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid y) = \arg\max_\theta p(y \mid \theta) p(\theta)\]

Taking logarithms (since log is monotonic):

\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta [\log p(y \mid \theta) + \log p(\theta)]\]

This reveals an elegant connection to machine learning and optimization. The log-posterior decomposes into the familiar log-likelihood plus a log-prior term. The log-prior acts as a regularization penalty, preventing overfitting by favoring certain parameter values.

When the prior is uniform (non-informative), the log-prior is constant and MAP reduces to maximum likelihood estimation. When the prior is informative, it regularizes the estimate by pulling it toward prior beliefs. This is mathematically identical to penalized likelihood methods like Ridge regression (with Gaussian priors) or Lasso regression (with Laplace priors).
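
To make the Ridge connection concrete: for a linear model with Gaussian noise variance \(\sigma^2\) and prior \(w \sim N(0, \tau^2 I)\), maximizing the log-posterior is equivalent to minimizing the Ridge objective,

\[ \hat{w}_{\text{MAP}} = \arg\min_w \left[ \|y - Xw\|^2 + \lambda \|w\|^2 \right], \qquad \lambda = \frac{\sigma^2}{\tau^2} \]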

However, MAP provides only a point estimate and discards uncertainty information. To fully leverage the Bayesian framework, we need the entire posterior distribution.

2.4.3 Analytic solutions and conjugate priors

In special cases, we can compute the posterior distribution analytically using conjugate priors. A prior is conjugate to a likelihood if the posterior has the same functional form as the prior. This mathematical convenience allows us to update our beliefs with a simple algebraic formula.

The coin flip example demonstrates this perfectly. For the Binomial likelihood, the Beta distribution is conjugate. To see why, let’s work through the mathematics.

Starting with our prior and likelihood:

  • Prior: \(\theta \sim \text{Beta}(\alpha, \beta)\) with PDF \(p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\)
  • Likelihood: \(y \mid \theta \sim \text{Binomial}(n, \theta)\) with \(p(y \mid \theta) \propto \theta^y(1-\theta)^{n-y}\)

The posterior is proportional to the product of prior and likelihood: \[ p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta) \propto \theta^y(1-\theta)^{n-y} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} \] Combining the powers: \[ p(\theta \mid y) \propto \theta^{(\alpha + y) - 1}(1-\theta)^{(\beta + n - y) - 1} \] This is exactly the kernel of a \(\text{Beta}(\alpha + y, \beta + n - y)\) distribution! Therefore: \[ \theta \mid y \sim \text{Beta}(\alpha + y, \beta + n - y) \]

The posterior parameters are intuitive: prior “successes” (\(\alpha\)) plus observed successes (\(y\)), and prior “failures” (\(\beta\)) plus observed failures (\(n-y\)). This demonstrates the core Bayesian principle: new data updates our beliefs in a mathematically principled way.

The following example shows this updating process in action, demonstrating how the posterior distribution evolves with each new coin flip:
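
A minimal sketch of that updating loop, assuming a simulated coin with true bias 0.7 and a weak Beta(2, 2) prior (the notebook’s version visualizes the evolving posterior):

using Distributions, Random

Random.seed!(42)
α, β = 2.0, 2.0                    # hypothetical prior: weakly favors fairness
flips = rand(Bernoulli(0.7), 10)   # simulated coin with true bias 0.7

for (i, flip) in enumerate(flips)
    α += flip                      # a head updates α
    β += 1 - flip                  # a tail updates β
    println("After flip $i: Beta($α, $β), mean ≈ ", round(mean(Beta(α, β)), digits = 3))
end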

As data accumulate, the influence of the prior diminishes relative to that of the likelihood. With sufficient data, Bayesian and maximum likelihood estimates converge regardless of the prior choice.

In practice, conjugate priors exist for only a limited set of models.

2.4.4 Markov Chain Monte Carlo

For any non-trivial model, analytically computing the posterior distribution becomes mathematically intractable. A brute-force approach of evaluating the posterior on a grid fails catastrophically: with \(k\) parameters and \(n\) grid points per parameter, we need \(n^k\) evaluations. For even modest problems (say, 10 parameters with 100 grid points each), this requires \(100^{10} = 10^{20}\) calculations—computationally impossible.

This “curse of dimensionality” means that analytical approaches work only for the simplest models. Real scientific applications require computational methods.

2.4.4.1 Example: A Metropolis-Hastings Sampler from Scratch

Modern MCMC algorithms differ in how they propose new parameter values. A classical, foundational approach is the Metropolis-Hastings algorithm, which gives us a powerful intuition for how MCMC works.

The algorithm allows us to sample from a target distribution that we can’t directly sample from, as long as we can evaluate its density up to a normalizing constant. We use a separate proposal distribution to suggest new parameter values, and then we probabilistically accept or reject these proposals based on how well they align with the target.

(Shout out to Danielle Navarro for a clear explanation and the working example that inspired this section).

The algorithm proceeds as follows:

  1. Start with an initial parameter value \(\theta^{(0)}\).
  2. For each iteration \(i\):
    • Propose a new value \(\theta^*\) from a proposal distribution \(q(\theta^* \mid \theta^{(i-1)})\).
    • Compute the acceptance ratio: \[ r = \frac{p(\theta^* \mid y) \, q(\theta^{(i-1)} \mid \theta^*)}{p(\theta^{(i-1)} \mid y) \, q(\theta^* \mid \theta^{(i-1)})} \]
    • Accept the proposal with probability \(\min(1, r)\):
      • If a uniform random number on \([0, 1]\) is less than \(r\), set \(\theta^{(i)} = \theta^*\).
      • Otherwise, reject the proposal and set \(\theta^{(i)} = \theta^{(i-1)}\).
  3. Repeat for many iterations to generate a chain of samples that approximates the posterior distribution \(p(\theta \mid y)\).

using CairoMakie
using LaTeXStrings
using Distributions

2.4.4.1.1 Step 1: Target Distribution

First, let’s define the distribution we want to sample from. To make it interesting, we’ll use a “squiggly” function that isn’t a standard probability distribution. This is our unnormalized target density, which corresponds to \(p(\theta^* \mid y)\) in the ratio above.

# Define the target distribution (unnormalized)
target(x) = exp(-x^2 / 2) * (2 + sin(5 * x) + sin(2 * x))

# Visualize what it looks like
fig = Figure()
ax = Axis(fig[1, 1],
    xlabel = L"θ",
    ylabel = L"p(θ | y) \propto \text{Target Density}")
lines!(ax, -4:0.01:4, target, linewidth = 3)
fig

Our goal is to generate samples whose histogram matches the shape of this distribution.

2.4.4.1.2 Step 2: Proposal Distribution

Next, we need a way to propose new steps. For this example, we’ll use a Normal distribution centered on our current position. This is like telling our sampler to take a random “hop” left or right.

This choice is purely for didactic purposes. A Normal distribution is symmetric, meaning the probability of hopping from \(\theta_A\) to \(\theta_B\) is the same as hopping from \(\theta_B\) to \(\theta_A\). This simplifies the acceptance ratio, as the proposal terms \(q(\theta^{(i-1)} \mid \theta^*)\) and \(q(\theta^* \mid \theta^{(i-1)})\) cancel out, leaving us with the simpler Metropolis algorithm.

proposal(μ, σ) = Normal(μ, σ)

The standard deviation, σ, controls the “hop size.” A larger σ will propose more distant jumps, while a smaller σ will explore the local area more finely.

2.4.4.1.3 Step 3: Sampler

Now we can write the core function that implements the Metropolis-Hastings logic. The function will take the number of samples, a starting value, and the proposal hop size as inputs. It will loop through the algorithm, building the chain one sample at a time.

function metropolis_hastings_sampler(n_samples, start_val, proposal_sd)
    # 1. Initialize an array to store the chain
    chain = zeros(n_samples)
    chain[1] = start_val
    n_accepted = 0

    # 2. Loop from the second sample to the end
    for i in 2:n_samples
        current_x = chain[i-1]

        # 3. Propose a new value
        proposed_x = rand(proposal(current_x, proposal_sd))

        # 4. Calculate acceptance ratio (r). 
        # Since the proposal is symmetric, the q terms cancel.
        r = target(proposed_x) / target(current_x)

        # 5. Accept or reject the proposal
        if rand() < r
            chain[i] = proposed_x # Accept
            n_accepted += 1
        else
            chain[i] = current_x # Reject (stay put)
        end
    end

    return chain, n_accepted / n_samples
end
2.4.4.1.4 Step 4: Run the Sampler

Finally, we run the sampler and plot the results. We need to check two things:

  1. The Trace Plot: This shows the value of the parameter at each iteration. We are looking for a “fuzzy caterpillar” pattern, which indicates the chain is exploring the distribution well and is stationary.
  2. The Histogram: This shows the distribution of the collected samples. If the sampler worked, its shape should match our target density.

# Run the sampler
n_samples = 20000
burn_in = 2000 # Number of initial samples to discard
chain, acceptance_rate = metropolis_hastings_sampler(n_samples, 0.0, 1.0)

println("Acceptance rate: ", round(acceptance_rate, digits = 3))

# Visualize the results
fig = Figure(size = (800, 600))

# 1. Trace Plot
ax1 = Axis(fig[1, 1],
    title = "MCMC Trace Plot",
    xlabel = "Iteration",
    ylabel = L"θ")
lines!(ax1, chain, label = "Trace")

# 2. Histogram of Samples
ax2 = Axis(fig[2, 1],
    title = "Posterior Approximation",
    xlabel = L"θ",
    ylabel = "Density")

# Plot the histogram, removing the initial "burn-in" period
hist!(ax2, chain[burn_in+1:end],
    bins = 60,
    normalization = :pdf,
    label = "MCMC Samples")

# Overlay the true target density (scaled to be a valid PDF)
x_grid = -4:0.01:4
target_pdf = target.(x_grid)
target_pdf ./= sum(target_pdf .* 0.01) # Normalize
lines!(ax2, x_grid, target_pdf,
    label = "Target Density",
    linewidth = 3,
    color = :purple)

# Add a legend to the second plot
axislegend(ax2)

fig
Acceptance rate: 0.594

Results from our Metropolis-Hastings sampler. The trace plot (top) shows the sampler exploring the parameter space. The histogram of the MCMC samples (bottom) closely matches the shape of the unnormalized target density.

The acceptance rate is a useful diagnostic. A very high rate (> 0.8) might mean the hop size is too small, and the chain isn’t exploring efficiently. A very low rate (< 0.1) might mean the hop size is too large, and most proposals are being rejected. Rates between 0.2 and 0.5 are often considered good for this type of sampler.

As we can see, the final histogram of samples beautifully recovers the shape of the target distribution, demonstrating that this simple algorithm successfully turned a function we can evaluate into a distribution we can sample from.

2.4.4.2 Modern MCMC algorithms

In practice, Metropolis-Hastings is rarely used due to its inefficiency and because it scales poorly to high-dimensional spaces where the acceptance rate can become very low. Modern MCMC algorithms, such as Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS), leverage gradient information to propose more informed jumps in parameter space. These methods explore the posterior distribution more efficiently, especially in high-dimensional problems common in climate modeling and other scientific applications.

When using these advanced samplers, several best practices ensure robust and reliable inference:

  1. Running multiple independent chains helps assess convergence and identify multimodal posteriors. You will often see four chains as a default, but that’s largely because laptops ten years ago typically had four cores. More is often better.
  2. Initial iterations allow the sampler to adapt to the posterior geometry and find the typical set. We typically discard these “warm-up” samples before final inference.
  3. Most samplers have hyperparameters (e.g., step size, number of leapfrog steps) that require tuning for optimal performance. Many software packages include automatic tuning during a warm-up phase.
  4. Diagnostics are essential to ensure the sampler has converged and is exploring the posterior adequately.
    • Trace plots show the parameter values over iterations. We are looking for a “fuzzy caterpillar” pattern, which indicates the chain is exploring the distribution well and is stationary. If you see trends or drifts, the chain has not converged.
    • Trace plots for multiple chains should overlap and mix well, indicating they are sampling from the same distribution.
    • Autocorrelation plots help assess the degree of correlation between samples. Rapidly decreasing autocorrelation suggests good mixing, while high autocorrelation indicates the chain is stuck in local modes.
    • Quantitative metrics, such as the \(\hat{R}\) statistic and effective sample size, provide numerical assessments of convergence and sampling efficiency.

In all cases, diagnostics can rule out some kinds of convergence problems, but they cannot prove convergence, so we must always include other checks. Avoid the tendency, common in applied work, to conclude without reflection or critique that because \(\hat{R} < 1.1\) no further thought is needed. For tricky problems, work with a computational statistician (or take Bayesian statistics courses beyond the scope of these lecture notes).
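
As a minimal sketch, the (non-split) \(\hat{R}\) statistic can be computed by hand for chains from the Metropolis-Hastings sampler defined earlier; modern packages compute a more robust split-\(\hat{R}\) automatically.

using Statistics

# Gelman-Rubin R̂ from m chains of n post-warm-up samples each (non-split version)
function rhat(chains::Vector{Vector{Float64}})
    n = length(first(chains))
    chain_means = mean.(chains)
    W = mean(var.(chains))              # average within-chain variance
    B = n * var(chain_means)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n  # pooled variance estimate
    return sqrt(var_plus / W)
end

# Four independent chains from the sampler above, discarding warm-up
chains = [metropolis_hastings_sampler(20_000, 0.0, 1.0)[1][2_001:end] for _ in 1:4]
rhat(chains)   # values near 1 are consistent with convergence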

2.4.4.3 From samples to inference

After running the chain for thousands of iterations, the resulting samples serve as a high-fidelity numerical approximation of the true posterior distribution. We can use these samples to approximate any posterior quantity: \[ \mathbb{E}[g(\theta) \mid y] \approx \frac{1}{N} \sum_{i=1}^N g(\theta^{(i)}) \] where \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\) are our MCMC samples after discarding warm-up iterations.

This Monte Carlo approximation enables:

  • Parameter estimates and credible intervals
  • Posterior predictive distributions for forecasting
  • Model comparison through marginal likelihood estimation
  • Decision analysis under parameter uncertainty

The quality of these approximations depends on sample size, chain mixing, and the specific posterior quantity being estimated. Tail probabilities and extreme quantiles require more samples than central tendencies.
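
Continuing the Metropolis-Hastings example, a few such summaries can be read directly off the chain (reusing chain and burn_in from above; the threshold 2.0 is arbitrary):

using Statistics

posterior = chain[burn_in+1:end]      # discard warm-up samples

mean(posterior)                       # posterior mean
quantile(posterior, [0.025, 0.975])   # 95% credible interval
mean(posterior .> 2.0)                # a tail probability, P(θ > 2 | y)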

2.4.5 Example: coin flip Bayesian analysis

The coin flip example demonstrates the complete Bayesian workflow, from prior specification through posterior computation and interpretation.

The Coin Flip Inference notebook shows how treating parameters as random variables enables full uncertainty quantification. This example compares analytical solutions (using conjugate priors) with computational approaches, validating MCMC methods against known exact results.

The example illustrates key Bayesian concepts: how priors influence posterior distributions, the role of conjugacy in analytical tractability, and the use of MCMC when analytical solutions don’t exist.

2.4.6 Example: linear regression Bayesian analysis

Linear regression demonstrates Bayesian inference for multivariate problems requiring computational methods.

The Linear Regression Examples notebook provides a comprehensive comparison of least squares, maximum likelihood, and Bayesian approaches to the same problem. This comparison illuminates the philosophical differences between methods and their practical implications for uncertainty quantification.

The Bayesian approach excels when parameter uncertainty must be propagated through subsequent calculations or decision processes—a crucial capability for climate risk assessment and decision-making under uncertainty.

2.4.7 Posterior predictive distribution and model checking

The Bayesian workflow extends beyond parameter estimation to include systematic model validation through posterior predictive checking. This approach uses the fitted model to generate simulated data, then compares these simulations to observed data on relevant metrics.

The posterior predictive distribution represents our beliefs about new data given what we’ve observed: \[p(y^{\text{rep}} | y^{\text{obs}}) = \int p(y^{\text{rep}} | \theta) p(\theta | y^{\text{obs}}) d\theta\]

where \(y^{\text{rep}}\) denotes replicated (simulated) data and \(y^{\text{obs}}\) denotes observed data.

2.4.7.1 The posterior predictive checking workflow

The systematic approach follows these steps:

  1. Use MCMC to obtain samples from \(p(\theta | y^{\text{obs}})\)
  2. For each posterior sample \(\theta^{(i)}\), draw \(y^{\text{rep}(i)} \sim p(y | \theta^{(i)})\)
  3. Calculate relevant “test statistics” or metrics \(T(y^{\text{rep}(i)})\) and \(T(y^{\text{obs}})\)
  4. Assess whether \(T(y^{\text{obs}})\) falls within the distribution of \(T(y^{\text{rep}})\)

This is related to the frequentist idea of p-values explored in the mosquitos case study.
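
A minimal sketch of this workflow for the coin flip model, assuming hypothetical data (62 heads in 100 flips) and a flat Beta(1, 1) prior so the posterior is available in closed form:

using Distributions, Statistics

n, y_obs = 100, 62   # hypothetical observed data

# Step 1: posterior samples (conjugate Beta posterior under a flat prior)
θ_samples = rand(Beta(1 + y_obs, 1 + n - y_obs), 5_000)

# Step 2: one replicated dataset per posterior draw
y_rep = [rand(Binomial(n, θ)) for θ in θ_samples]

# Steps 3-4: compare a test statistic between replications and observation
mean(y_rep .>= y_obs)   # values near 0 or 1 would signal model misfit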

Further reading

For deeper study of probability and statistics:

  • Blitzstein and Hwang (2019) provides excellent intuition with computational examples
  • Downey (2021) emphasizes Bayesian thinking with practical applications
  • Gelman (2021) connects regression to broader statistical modeling
  • Gelman et al. (2014) gives a comprehensive treatment of Bayesian computation

This chapter’s concepts are also demonstrated through focused computational notebooks: the Distribution Examples, Coin Flip Inference, and Linear Regression Examples notebooks referenced throughout the chapter.