Extreme Value Theory Examples ✏️

This notebook demonstrates extreme value analysis concepts using real Houston precipitation data, illustrating the fundamental extrapolation problem in climate risk assessment.

using DataFrames
using Statistics
using Printf
using Random
using CSV
using Dates
using Distributions

using CairoMakie
using LaTeXStrings

CairoMakie.activate!(; type="svg")

Shape parameter implications

The shape parameter ξ in extreme value distributions fundamentally determines how heavy the tail is, which directly affects our extrapolation to rare events. Let’s visualize how different shape parameter values lead to dramatically different tail behavior:

The key insight: the shape parameter \(\xi\) controls how rapidly probabilities decrease in the tail.

Houston precipitation data

Let’s look at this using a realistic example

# Load GHCN daily precipitation data from Houston area station USC00414313
# Data downloaded from: https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/USC00414313.csv.gz

raw_data = CSV.read(joinpath(dirname(@__FILE__()), "data/USC00414313.csv"), DataFrame, header=true)
station_id = "GHCND:USC00414313"
station_name = "HOUSTON BARKER, TX US"

println("Loaded $(nrow(raw_data)) raw GHCN daily observations")
println("First few rows:")
display(first(raw_data, 3))
Loaded 70366 raw GHCN daily observations
First few rows:
3×8 DataFrame
Row USC00414313 19430101 PRCP 0 Column5 Column6 6 Column8
String15 Int64 String7 Int64 String1? Missing String1 Int64?
1 USC00414313 19430102 PRCP 0 missing missing 6 missing
2 USC00414313 19430103 PRCP 0 missing missing 6 missing
3 USC00414313 19430104 PRCP 0 missing missing 6 missing

This loads the raw GHCN (Global Historical Climatology Network) data file. Each row contains weather measurements with a specific format that we need to decode.

Processing GHCN format

GHCN daily files use a fixed format where each row represents one measurement at one station on one date:

After filtering for precipitation (PRCP):
- Total PRCP observations: 25225
- Date range (raw): 19430102 to 20131130

GHCN includes many weather variables (temperature, snow, etc.), but we only need precipitation (PRCP). The date is stored as YYYYMMDD format and values are in tenths of millimeters.

Converting dates and units

After format conversion:
- Daily observations: 25225
- Date range: 1943-01-02 to 2013-11-30
- Years of data: 71
- Max daily rainfall: 272.3 mm

Now we have properly formatted dates and rainfall measurements in standard millimeter units. This represents over 70 years of daily precipitation measurements - our “high-frequency observations.”

Filling missing dates for complete time series

GHCN data only includes dates where measurements were recorded. For proper time series analysis and plotting, we need to fill in missing dates:

Date completeness:
- GHCN observations: 25225 days
- Complete date range: 25901 days
- Missing days: 676
- Complete time series: 25901 days
- Missing rainfall values: 676

This creates a complete daily time series from $(minimum(precip_data.date)) to $(maximum(precip_data.date)) with missing values for days without measurements. Missing values are common in historical weather data due to equipment issues, observer absence, or data quality problems.

Daily precipitation time series

Now let’s visualize the daily precipitation data to see the “high-frequency observations” mentioned in our extrapolation problem:

Or a subset