Extreme Value Theory Examples ✏️
This notebook demonstrates extreme value analysis concepts using real Houston precipitation data, illustrating the fundamental extrapolation problem in climate risk assessment.
Shape parameter implications
The shape parameter ξ in extreme value distributions fundamentally determines how heavy the tail is, which directly affects our extrapolation to rare events. Let’s visualize how different shape parameter values lead to dramatically different tail behavior:
The key insight: the shape parameter \(\xi\) controls how rapidly probabilities decrease in the tail.
Houston precipitation data
Let’s look at this using a realistic example
# Load GHCN daily precipitation data from Houston area station USC00414313
# Data downloaded from: https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/USC00414313.csv.gz
raw_data = CSV.read(joinpath(dirname(@__FILE__()), "data/USC00414313.csv"), DataFrame, header=true)
station_id = "GHCND:USC00414313"
station_name = "HOUSTON BARKER, TX US"
println("Loaded $(nrow(raw_data)) raw GHCN daily observations")
println("First few rows:")
display(first(raw_data, 3))
Loaded 70366 raw GHCN daily observations
First few rows:
Row | USC00414313 | 19430101 | PRCP | 0 | Column5 | Column6 | 6 | Column8 |
---|---|---|---|---|---|---|---|---|
String15 | Int64 | String7 | Int64 | String1? | Missing | String1 | Int64? | |
1 | USC00414313 | 19430102 | PRCP | 0 | missing | missing | 6 | missing |
2 | USC00414313 | 19430103 | PRCP | 0 | missing | missing | 6 | missing |
3 | USC00414313 | 19430104 | PRCP | 0 | missing | missing | 6 | missing |
This loads the raw GHCN (Global Historical Climatology Network) data file. Each row contains weather measurements with a specific format that we need to decode.
Processing GHCN format
GHCN daily files use a fixed format where each row represents one measurement at one station on one date:
After filtering for precipitation (PRCP):
- Total PRCP observations: 25225
- Date range (raw): 19430102 to 20131130
GHCN includes many weather variables (temperature, snow, etc.), but we only need precipitation (PRCP). The date is stored as YYYYMMDD format and values are in tenths of millimeters.
Converting dates and units
After format conversion:
- Daily observations: 25225
- Date range: 1943-01-02 to 2013-11-30
- Years of data: 71
- Max daily rainfall: 272.3 mm
Now we have properly formatted dates and rainfall measurements in standard millimeter units. This represents over 70 years of daily precipitation measurements - our “high-frequency observations.”
Filling missing dates for complete time series
GHCN data only includes dates where measurements were recorded. For proper time series analysis and plotting, we need to fill in missing dates:
Date completeness:
- GHCN observations: 25225 days
- Complete date range: 25901 days
- Missing days: 676
- Complete time series: 25901 days
- Missing rainfall values: 676
This creates a complete daily time series from $(minimum(precip_data.date)) to $(maximum(precip_data.date)) with missing
values for days without measurements. Missing values are common in historical weather data due to equipment issues, observer absence, or data quality problems.
Daily precipitation time series
Now let’s visualize the daily precipitation data to see the “high-frequency observations” mentioned in our extrapolation problem:
Or a subset