Linear regression: three perspectives on the same problem
This notebook demonstrates how the same statistical problem can be approached from three different perspectives: curve fitting, maximum likelihood estimation, and Bayesian inference. Linear regression provides an ideal example because it bridges simple parameter estimation with multivariate modeling while maintaining analytical tractability.
Understanding these different approaches illuminates the philosophical foundations underlying statistical methods and their practical implications for uncertainty quantification.
The regression problem
Linear regression models the relationship between a predictor variable \(x\) and a response variable \(y\) as: \[y_i = \alpha + \beta x_i + \epsilon_i\]
where \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon_i\) represents random noise.
This framework applies broadly in climate science: relating temperature to elevation, precipitation to atmospheric pressure, or sea level rise to global temperature.
Generating synthetic data
We create synthetic data from a known linear relationship to evaluate how well different methods recover the true parameters. First, we generate and plot our synthetic data.
# Set reproducible parameters
Random.seed!(42)
n_obs = 30
true_intercept = 2.0
true_slope = 1.5
true_sigma = 1.5
# Generate predictor and response data
x_data = sort(rand(Uniform(0, 5), n_obs))
y_data = true_intercept .+ true_slope .* x_data .+ rand(Normal(0, true_sigma), n_obs)
function plot_synthetic_data()
fig = Figure(size = (700, 500))
ax = Axis(fig[1, 1],
xlabel = "x",
ylabel = "y",
title = "Synthetic linear regression dataset")
# Plot data points
scatter!(ax, x_data, y_data, color = :blue, markersize = 8, label = "Observed data")
# Plot true relationship
x_line = range(minimum(x_data), maximum(x_data), length = 100)
y_true = true_intercept .+ true_slope .* x_line
lines!(ax, x_line, y_true, color = :red, linewidth = 3,
label = L"True: $y = %$(true_intercept) + %$(true_slope)x$")
# Add uncertainty band
y_upper = y_true .+ 2 * true_sigma
y_lower = y_true .- 2 * true_sigma
band!(ax, x_line, y_lower, y_upper, color = (:red, 0.2), label = L"±2σ")
axislegend(ax, position = :lt)
return fig
end
fig_data = plot_synthetic_data()
fig_data
The plot shows our synthetic data with the true linear relationship and noise level. The scattered points represent our “observations,” while the red line and shaded band show the true underlying process we’re trying to recover.
Approach 1: Curve fitting (least squares)
The curve fitting perspective treats regression as an optimization problem: find the line that minimizes prediction errors. This approach focuses on algorithmic solutions without probabilistic interpretation.
Mathematical foundation
Least squares minimizes the sum of squared residuals: \[\text{minimize} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2\]
This optimization problem has a closed-form analytical solution.
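Writing \(\bar{x}\) and \(\bar{y}\) for the sample means, the minimizers are the familiar closed-form estimators that the function below implements: \[\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}.\]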
"""Analytical least squares regression via the closed-form estimators."""
function fit_least_squares(x, y)
n = length(x)
x_bar = mean(x)
y_bar = mean(y)
# Analytical formulas
slope = sum((x .- x_bar) .* (y .- y_bar)) / sum((x .- x_bar) .^ 2)
intercept = y_bar - slope * x_bar
# Calculate residuals and fit statistics
y_pred = intercept .+ slope .* x
residuals = y .- y_pred
mse = mean(residuals .^ 2)
r_squared = 1 - sum(residuals .^ 2) / sum((y .- y_bar) .^ 2)
return (
intercept = intercept,
slope = slope,
mse = mse,
r_squared = r_squared,
predictions = y_pred,
residuals = residuals,
)
end
# Fit the model
ols_results = fit_least_squares(x_data, y_data)
# Create results summary
ols_summary = DataFrame(
"parameter" => ["Intercept", "Slope", "MSE", "R²"],
"estimate" => [ols_results.intercept, ols_results.slope, ols_results.mse, ols_results.r_squared],
"true_value" => [true_intercept, true_slope, missing, missing],
)
ols_summary
| Row | parameter | estimate | true_value |
|----:|-----------|---------:|------------:|
| 1 | Intercept | 1.85082 | 2.0 |
| 2 | Slope | 1.58365 | 1.5 |
| 3 | MSE | 2.22587 | missing |
| 4 | R² | 0.711843 | missing |
The least squares estimates are close to the true values, demonstrating that the method works well for this problem. However, this approach provides no direct way to quantify uncertainty in the parameter estimates.
Visualization with residuals
function plot_least_squares_fit()
fig = Figure(size = (800, 500))
ax = Axis(fig[1, 1],
xlabel = "x",
ylabel = "y",
title = "Least squares curve fitting")
# Plot data points
scatter!(ax, x_data, y_data, color = :blue, markersize = 8, label = "Data")
# Plot fitted line
x_line = range(minimum(x_data), maximum(x_data), length = 100)
y_fitted = ols_results.intercept .+ ols_results.slope .* x_line
lines!(ax, x_line, y_fitted, color = :red, linewidth = 2,
label = L"Fitted: $y = %$(round(ols_results.intercept, digits=2)) + %$(round(ols_results.slope, digits=2))x$")
# Show residuals as vertical lines
for i in eachindex(x_data)
lines!(ax, [x_data[i], x_data[i]], [y_data[i], ols_results.predictions[i]],
color = :gray, linestyle = :dash, alpha = 0.7, linewidth = 1)
end
# Add true line for comparison
y_true_line = true_intercept .+ true_slope .* x_line
lines!(ax, x_line, y_true_line, color = :green, linewidth = 2, linestyle = :dot,
label = L"True: $y = %$(true_intercept) + %$(true_slope)x$")
axislegend(ax, position = :lt)
return fig
end
fig_ols = plot_least_squares_fit()
fig_ols
The residual lines (gray dashes) show prediction errors for each data point. Least squares minimizes the sum of squared lengths of these lines, providing the “best fit” in this specific sense.
Approach 2: Maximum likelihood estimation
The probabilistic perspective treats regression as a statistical model with explicitly specified error distributions. This enables likelihood-based inference and uncertainty quantification.
Statistical model specification
We assume: \[y_i \sim \text{Normal}(\alpha + \beta x_i, \sigma^2)\]
This specifies that responses are normally distributed around the linear predictor with constant variance.
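Under this model the log-likelihood of the data is \[\ell(\alpha, \beta, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2,\] so for any fixed \(\sigma\), maximizing over \(\alpha\) and \(\beta\) is the same as minimizing the sum of squared residuals. This is why the MLE and least squares estimates coincide below, while MLE additionally estimates \(\sigma\).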
# Define the regression model - we'll reuse this for Bayesian inference
@model function linear_regression_model(x, y)
n = length(y)
# Parameters to estimate
α ~ Normal(0, 5) # Intercept
β ~ Normal(0, 5) # Slope
σ ~ LogNormal(0, 1) # Noise standard deviation
# Likelihood
μ = α .+ β .* x
for i in 1:n
y[i] ~ Normal(μ[i], σ)
end
end
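# Side note (a sketch, not used in the rest of this notebook): the per-observation loop
# above can be replaced by an equivalent vectorized likelihood, which is often faster for
# larger datasets. This assumes LinearAlgebra is loaded so that `I` is available.
using LinearAlgebra
@model function linear_regression_model_vectorized(x, y)
    α ~ Normal(0, 5)        # Intercept
    β ~ Normal(0, 5)        # Slope
    σ ~ LogNormal(0, 1)     # Noise standard deviation
    μ = α .+ β .* x
    y ~ MvNormal(μ, σ^2 * I)  # one multivariate-normal likelihood; same model as above
end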
# Find MLE estimates using Turing's mode estimation
model = linear_regression_model(x_data, y_data)
mle_result = optimize(model, MLE())
mle_params = mle_result.values
# Extract parameter estimates
mle_results = (
intercept = mle_params[:α],
slope = mle_params[:β],
sigma = mle_params[:σ],
)
# Create comparison table
mle_summary = DataFrame(
Parameter = ["Intercept", "Slope", "Sigma"],
MLE = [mle_results.intercept, mle_results.slope, mle_results.sigma],
OLS = [ols_results.intercept, ols_results.slope, sqrt(ols_results.mse)],
True_Value = [true_intercept, true_slope, true_sigma],
)
mle_summary
| Row | Parameter | MLE | OLS | True_Value |
|----:|-----------|----:|----:|-----------:|
| 1 | Intercept | 1.85082 | 1.85082 | 2.0 |
| 2 | Slope | 1.58365 | 1.58365 | 1.5 |
| 3 | Sigma | 1.49193 | 1.49193 | 1.5 |
Turing’s `optimize` function with `MLE()` provides a clean, robust approach to maximum likelihood estimation without requiring custom optimization code. (With `MLE()`, only the likelihood is maximized; the priors declared in the model matter only for the Bayesian analysis below.) The MLE estimates for the intercept and slope match the OLS results to all displayed digits, confirming the theoretical equivalence between the two approaches for linear regression with normal errors. In addition, MLE provides an estimate of the noise parameter \(\sigma\).
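The match in the Sigma row is also expected: under normal errors the MLE of \(\sigma\) is the root of the mean squared residual (dividing by \(n\)), which is exactly the square root of the MSE reported in the OLS column. A quick check, reusing the residuals computed earlier (the conventional unbiased estimator divides by \(n - 2\) instead):
# Under normal errors, the σ MLE uses an n divisor; the unbiased estimator uses n - 2
sigma_mle_check = sqrt(mean(ols_results.residuals .^ 2))              # ≈ mle_results.sigma
sigma_unbiased = sqrt(sum(ols_results.residuals .^ 2) / (n_obs - 2))  # slightly larger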
Approach 3: Bayesian inference
Bayesian inference treats all parameters as random variables with probability distributions. It combines prior beliefs with observed data to produce posterior distributions that fully characterize parameter uncertainty.
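Concretely, Bayes’ rule combines the likelihood from the previous section with the independent priors declared in the model: \[p(\alpha, \beta, \sigma \mid y) \propto p(y \mid \alpha, \beta, \sigma)\, p(\alpha)\, p(\beta)\, p(\sigma),\] and MCMC gives us samples from this posterior rather than a single point estimate.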
Bayesian inference implementation
function run_bayesian_regression(x, y; samples_per_chain = 2000, n_chains = 4)
model = linear_regression_model(x, y)
chains = begin
sampler = NUTS()
discard_initial = 1_000
sample(model, sampler, MCMCThreads(),
samples_per_chain, n_chains,
discard_initial = discard_initial,
verbose = false, progress = false)
end
return chains
end
# Run Bayesian inference
bayesian_chains = run_bayesian_regression(x_data, y_data)
# Posterior summary statistics and convergence diagnostics (mean, std, ess_bulk, ess_tail, rhat)
summarize(bayesian_chains)
The diagnostic table is consistent with, though does not guarantee, MCMC convergence and accurate sampling. Fundamentally, we can check for common signs of non-convergence and rule them out, but we cannot prove that we are sampling from the true posterior distribution. While there are no perfect rules, in general (see also the short sketch after this list):
- Values of \(\hat{R}\) (`rhat`) close to 1 (e.g., below 1.1) are consistent with good mixing, meaning the chains appear to have converged to the same target distribution.
- High effective sample sizes (`ess_bulk`, `ess_tail`) indicate that the samples behave more like independent draws, which is desirable for accurate posterior estimation.
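If you want these diagnostics as numbers rather than a printed table, recent versions of MCMCChains expose them directly (a small sketch; the exact function names have shifted across MCMCChains releases, with older versions combining them as `ess_rhat`):
# Pull convergence diagnostics programmatically (names assume a recent MCMCChains release)
rhat_values = rhat(bayesian_chains)  # one R̂ estimate per parameter
ess_values = ess(bayesian_chains)    # effective sample sizes per parameter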
Trace plot
# Flatten posterior samples across chains for later reuse (e.g., the posterior predictive);
# the trace-plot function below re-extracts the per-chain structure itself
α_samples = vec(Array(bayesian_chains[:α]))
β_samples = vec(Array(bayesian_chains[:β]))
σ_samples = vec(Array(bayesian_chains[:σ]))
function plot_mcmc_trace(chains, title_prefix = "")
fig = Figure(size = (1200, 800))
# Extract parameter arrays (chains are organized as samples x variables x chains)
α_arr = Array(chains[:α])
β_arr = Array(chains[:β])
σ_arr = Array(chains[:σ])
nsamples, nchains = size(α_arr, 1), size(α_arr, 2)
# Set up all axes
ax1 = Axis(fig[1, 1], xlabel = "Iteration", ylabel = "α", title = "$(title_prefix)Intercept Parameter Trace")
ax2 = Axis(fig[1, 2], xlabel = "Iteration", ylabel = "β", title = "$(title_prefix)Slope Parameter Trace")
ax3 = Axis(fig[1, 3], xlabel = "Iteration", ylabel = "σ", title = "$(title_prefix)Noise Parameter Trace")
ax4 = Axis(fig[2, 1], xlabel = "α", ylabel = "Density", title = "$(title_prefix)Intercept Posterior")
ax5 = Axis(fig[2, 2], xlabel = "β", ylabel = "Density", title = "$(title_prefix)Slope Posterior")
ax6 = Axis(fig[2, 3], xlabel = "σ", ylabel = "Density", title = "$(title_prefix)Noise Posterior")
# Define colors for chains
colors = [:red, :blue, :green, :orange]
# Plot traces and histograms for each chain
for c in 1:nchains
# Trace plots
lines!(ax1, 1:nsamples, α_arr[:, c], label = "Chain $(c)", color = colors[c])
lines!(ax2, 1:nsamples, β_arr[:, c], label = "Chain $(c)", color = colors[c])
lines!(ax3, 1:nsamples, σ_arr[:, c], label = "Chain $(c)", color = colors[c])
# Marginal densities
hist!(ax4, α_arr[:, c], bins = 40, normalization = :pdf, color = (colors[c], 0.3), label = "Chain $(c)")
hist!(ax5, β_arr[:, c], bins = 40, normalization = :pdf, color = (colors[c], 0.3), label = "Chain $(c)")
hist!(ax6, σ_arr[:, c], bins = 40, normalization = :pdf, color = (colors[c], 0.3), label = "Chain $(c)")
end
# Add true value lines
vlines!(ax4, [true_intercept], color = :black, linestyle = :dash, linewidth = 2, label = "True value")
vlines!(ax5, [true_slope], color = :black, linestyle = :dash, linewidth = 2, label = "True value")
vlines!(ax6, [true_sigma], color = :black, linestyle = :dash, linewidth = 2, label = "True value")
# Add legends to posterior plots
axislegend(ax4; position = :rt)
axislegend(ax5; position = :rt)
axislegend(ax6; position = :rt)
return fig
end
fig_trace = plot_mcmc_trace(bayesian_chains, "Linear Regression: ")
fig_trace
The trace plots show chains that appear well mixed and stationary, which again is consistent with, though not definitive proof of, convergence.
Posterior predictive distribution
The posterior predictive distribution represents our beliefs about new data given what we’ve observed. It’s computed as \(p(y_{\text{new}} | y_{\text{obs}}) = \int p(y_{\text{new}} | \theta) p(\theta | y_{\text{obs}}) d\theta\).
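In practice we approximate this integral by Monte Carlo: for each retained posterior draw \((\alpha^{(s)}, \beta^{(s)}, \sigma^{(s)})\) we simulate \(y_{\text{new}}^{(s)} \sim \text{Normal}(\alpha^{(s)} + \beta^{(s)} x_{\text{new}},\ \sigma^{(s)})\), and the collection of simulated values approximates the posterior predictive distribution. That is exactly what the function below does.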
function plot_posterior_predictive()
# Generate posterior predictive samples at new x values
x_new = range(0, 6, length = 50)
n_samples = 200
sample_indices = rand(1:length(α_samples), n_samples)
# For each posterior sample, generate predictions
y_pred_samples = []
for i in sample_indices
α_i = α_samples[i]
β_i = β_samples[i]
σ_i = σ_samples[i]
# Mean prediction
μ_pred = α_i .+ β_i .* x_new
# Add observation noise
y_pred = rand.(Normal.(μ_pred, σ_i))
push!(y_pred_samples, y_pred)
end
# Calculate prediction quantiles
y_lower = [quantile([y_pred_samples[j][i] for j in 1:n_samples], 0.05) for i in 1:length(x_new)]
y_upper = [quantile([y_pred_samples[j][i] for j in 1:n_samples], 0.95) for i in 1:length(x_new)]
y_median = [quantile([y_pred_samples[j][i] for j in 1:n_samples], 0.5) for i in 1:length(x_new)]
# Create plot
fig = Figure(size = (800, 600))
ax = Axis(fig[1, 1],
xlabel = "x",
ylabel = "y",
title = "Posterior predictive distribution")
# Plot observed data
scatter!(ax, x_data, y_data, color = :black, markersize = 8, label = "Observed data")
# Plot prediction interval
band!(ax, x_new, y_lower, y_upper, color = (:blue, 0.3), label = "90% prediction interval")
# Plot median prediction
lines!(ax, x_new, y_median, color = :blue, linewidth = 2, label = "Posterior median")
# Add true relationship for reference
y_true = true_intercept .+ true_slope .* x_new
lines!(ax, x_new, y_true, color = :red, linewidth = 2, linestyle = :dash, label = "True relationship")
axislegend(ax, position = :lt)
return fig
end
fig_ppd = plot_posterior_predictive()
fig_ppd
The posterior predictive distribution incorporates both parameter uncertainty and observation noise. The 90% prediction interval shows where we expect new observations to fall, while the median line shows our best prediction. This is the fundamental quantity for forecasting and decision-making under uncertainty.
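As a small illustration of the decision-making point, the flattened posterior samples from above can answer threshold questions directly; a sketch (the query point \(x = 4\) and threshold \(10\) are arbitrary choices for illustration):
# Posterior predictive probability that a new observation at x = 4 exceeds y = 10
x_query, threshold = 4.0, 10.0
y_draws = [rand(Normal(α_samples[i] + β_samples[i] * x_query, σ_samples[i]))
    for i in eachindex(α_samples)]
prob_exceed = mean(y_draws .> threshold)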