In Notebook 1, I am laying the ETL foundation: a persistent DuckDB store with NUTS boundaries loaded and ready for spatial joins. It is not the flashiest notebook in the series, but it is the one I would least like to debug at the end.
I wanted to make district geometry boring early, so that the follow-up notebooks can focus on data quality and signal while the polygon assignments stay settled.
Technical lane: Data Ingestion
Business lane: Product & Delivery
Decision relevance
This notebook removes geospatial ambiguity early. Once district geometry is stable, coverage, feature engineering, and risk scoring can all use the same frame and stay comparable.
- Notebook role: Foundational ingest and geospatial normalization
- Primary artifact: DuckDB-backed NUTS region tables
- Granularity: NUTS-0 to NUTS-3 hierarchy for later joins
NUTS-3 DuckDB Data Lake
Load NUTS 0-3 polygons, validate geometry ingest, and prepare spatial joins for all downstream notebooks.
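To give a feel for what "validate geometry ingest" can mean in practice, here is a minimal sketch in plain Python. It is not the notebook's actual validation code: the function names, the flag names, and the example rings are all hypothetical, and a real pipeline would more likely lean on the spatial engine's own validity checks. It only flags three basic problems on a single exterior ring: the ring is not closed, it has too few vertices, or it encloses no area.

```python
from typing import Dict, List, Tuple

Ring = List[Tuple[float, float]]  # (lon, lat) vertices of one exterior ring


def signed_area(ring: Ring) -> float:
    """Shoelace formula. The sign reflects winding order; zero means degenerate."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def geometry_flags(ring: Ring) -> Dict[str, bool]:
    """Return basic sanity flags for one ring (hypothetical flag names)."""
    is_closed = len(ring) > 0 and ring[0] == ring[-1]
    has_enough_points = len(ring) >= 4  # a closed triangle already needs 4 vertices
    has_area = is_closed and has_enough_points and abs(signed_area(ring)) > 0.0
    return {
        "is_closed": is_closed,
        "has_enough_points": has_enough_points,
        "has_area": has_area,
    }


# Hypothetical rings for illustration, not real NUTS geometry.
good_ring = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (0.0, 0.0)]
open_ring = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]  # last vertex missing
flat_ring = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 0.0)]  # collapsed to a line

good_flags = geometry_flags(good_ring)
open_flags = geometry_flags(open_ring)
flat_flags = geometry_flags(flat_ring)
```

Storing flags like these next to each region makes a bad polygon visible at ingest time rather than as a missing district several notebooks later.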
Key output
The notebook creates a reusable nuts_regions table for point-in-polygon operations and district-level aggregation. It sounds plain because it is plain. Plain is the point.
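For readers who have not met point-in-polygon assignment before, the sketch below shows the idea in plain Python using ray casting. This is a stand-in for the spatial join that the database would run against nuts_regions, not the notebook's implementation. The region IDs, square "districts", and station coordinates are invented for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Ring = List[Point]  # closed ring: first vertex repeated at the end


def point_in_ring(point: Point, ring: Ring) -> bool:
    """Ray casting: toggle on each edge crossed by a ray going right from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the ray's y value can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_region(point: Point, regions: Dict[str, Ring]) -> Optional[str]:
    """Return the first region containing the point, or None if nothing matches."""
    for region_id, ring in regions.items():
        if point_in_ring(point, ring):
            return region_id
    return None


# Hypothetical square districts and stations, not real NUTS data.
regions = {
    "XX001": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (0.0, 0.0)],
    "XX002": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0), (2.0, 0.0)],
}
stations = {
    "s1": (1.0, 1.0),
    "s2": (1.5, 0.5),
    "s3": (3.0, 1.0),
    "s4": (5.0, 5.0),  # outside every district
}

assignments = {name: assign_region(pt, regions) for name, pt in stations.items()}
stations_per_region = Counter(assignments.values())
```

The unmatched station ends up under the None key, which is the kind of case worth counting explicitly before any district-level aggregation.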
Practical takeaway: the same district geometries drive every later analytic step. This lowers the chance of comparing station coverage, weather features, and final risk scores on subtly different maps.
| Layer | What is stored | Why it matters |
|---|---|---|
| Spatial boundaries | NUTS polygons from level 0 to 3 | Keeps all later joins on one official administrative hierarchy |
| Reference keys | Region IDs and hierarchy links | Enables deterministic aggregation and roll-up checks |
| Geometry validation flags | Basic geometry sanity checks | Prevents silent failures in downstream point-in-polygon operations |
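The "Reference keys" row can be illustrated with a small roll-up check. The sketch below assumes that region IDs are standard NUTS codes, where a two-letter country code is followed by one extra character per level, so a parent ID is simply the child ID without its last character. Whether the notebook's tables store hierarchy links this way is an assumption on my part; the function names and the values being summed are hypothetical.

```python
from typing import Dict, List, Optional


def nuts_level(nuts_id: str) -> int:
    """Level 0 is the two-letter country code; each level adds one character."""
    return len(nuts_id) - 2


def parent_id(nuts_id: str) -> Optional[str]:
    """Parent code under the prefix convention, or None for a country."""
    return nuts_id[:-1] if nuts_level(nuts_id) > 0 else None


def orphaned_regions(region_ids: List[str]) -> List[str]:
    """Regions whose parent code is missing from the table."""
    known = set(region_ids)
    return sorted(
        r for r in region_ids
        if parent_id(r) is not None and parent_id(r) not in known
    )


def roll_up(values_by_nuts3: Dict[str, float], level: int) -> Dict[str, float]:
    """Sum NUTS-3 values up to the requested level by truncating the ID."""
    totals: Dict[str, float] = {}
    for nuts_id, value in values_by_nuts3.items():
        key = nuts_id[: 2 + level]
        totals[key] = totals.get(key, 0.0) + value
    return totals


# Illustrative IDs in NUTS format; "DE21" is left out on purpose.
region_ids = ["DE", "DE1", "DE11", "DE111", "DE112", "DE2", "DE212"]
orphans = orphaned_regions(region_ids)

# Made-up per-district values (for example, station counts).
values = {"DE111": 3.0, "DE112": 5.0, "DE212": 2.0}
by_nuts2 = roll_up(values, level=2)
by_country = roll_up(values, level=0)

# Roll-up check: totals must agree no matter which level they are summed at.
totals_match = sum(by_nuts2.values()) == sum(by_country.values())
```

A check like this is cheap to run after every ingest, and an orphan list that is not empty is a clear signal that a level failed to load.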
What this unlocks
With the spatial frame in place, the next notebooks can ask better questions: where station coverage is thin, whether yield tables map cleanly, and which soil or weather features survive district aggregation.