Notebook 1 - Giving the Project a Spatial Backbone

Published 4 April 2026

agri-weather-yield-drivers notebooks duckdb geospatial nuts3

In Notebook 1, I am laying the ETL foundation: a persistent DuckDB store with NUTS boundaries loaded and ready for spatial joins. It is not the flashiest notebook in the series, but it is the one I would least like to debug at the end.

I wanted to make district geometry boring early, so that the follow-up notebooks can focus on data quality and signal while the polygon assignments stay settled.

Technical lane: Data Ingestion
Business lane: Product & Delivery

Decision relevance.

This notebook removes geospatial ambiguity early. Once district geometry is stable, coverage, feature engineering, and risk scoring can all use the same frame and stay comparable.

Notebook role: Foundational ingest and geospatial normalization
Primary artifact: DuckDB-backed NUTS region tables
Granularity: NUTS-0 to NUTS-3 hierarchy for later joins
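
To make the hierarchy concrete: NUTS codes nest by prefix (a two-letter country code at level 0, plus one extra character per level), so parent links can be derived from the code itself. The following is a minimal sketch in plain Python, not the notebook's actual code; the function names and the sample codes are illustrative assumptions.

```python
from typing import Optional


def nuts_level(nuts_id: str) -> int:
    """Return the NUTS level (0-3) implied by the code length."""
    level = len(nuts_id) - 2  # two-letter country code is level 0
    if not 0 <= level <= 3:
        raise ValueError(f"not a NUTS 0-3 code: {nuts_id!r}")
    return level


def nuts_parent(nuts_id: str) -> Optional[str]:
    """Return the code one level up, or None for a NUTS-0 country code."""
    return nuts_id[:-1] if nuts_level(nuts_id) > 0 else None


def orphans(nuts_ids: set) -> list:
    """List codes whose parent is missing from the set (a broken roll-up)."""
    return sorted(
        code
        for code in nuts_ids
        if nuts_parent(code) is not None and nuts_parent(code) not in nuts_ids
    )


# Hypothetical sample: the "FR" country row is deliberately left out.
regions = {"DE", "DE2", "DE21", "DE212", "FR1", "FR10"}
print(orphans(regions))
```

A check like `orphans` is cheap to run right after ingest: an empty result means every region can be rolled up to its country without gaps.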

NUTS-3 DuckDB Data Lake

Load NUTS 0-3 polygons, validate geometry ingest, and prepare spatial joins for all downstream notebooks.

DuckDB spatial, GeoParquet, GISCO
Core notebook sequence completed: 17%

Key output

The notebook creates a reusable nuts_regions table for point-in-polygon operations and district-level aggregation. It sounds plain because it is plain. Plain is the point.
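
For readers who want to see what "point-in-polygon plus district-level aggregation" amounts to, here is a minimal sketch. It is a plain-Python stand-in using ray casting, not the notebook's implementation; in the notebook this work would presumably be done by the DuckDB spatial extension (for example with `ST_Contains`) against `nuts_regions`. The district codes, square polygons, and station names below are invented for illustration.

```python
from collections import Counter


def point_in_ring(pt, ring):
    """Ray casting: toggle on each ring edge crossed by a rightward ray from pt."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_district(pt, districts):
    """Return the first district whose ring contains pt, else None."""
    for nuts_id, ring in districts.items():
        if point_in_ring(pt, ring):
            return nuts_id
    return None


# Hypothetical districts: two adjacent squares given as closed rings.
districts = {
    "XX001": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "XX002": [(2, 0), (4, 0), (4, 2), (2, 2), (2, 0)],
}
# Hypothetical station coordinates.
stations = {"s1": (1, 1), "s2": (3, 1), "s3": (0.5, 1.5), "s4": (5, 5)}

assigned = {name: assign_district(pt, districts) for name, pt in stations.items()}
per_district = Counter(d for d in assigned.values() if d is not None)
unassigned = sorted(name for name, d in assigned.items() if d is None)

print(dict(per_district))
print(unassigned)
```

Keeping the unassigned stations visible, rather than silently dropping them, is what makes thin or missing coverage show up in later notebooks.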

Practical takeaway: the same district geometries drive every later analytic step. This lowers the chance of comparing station coverage, weather features, and final risk scores on subtly different maps.

| Layer | What is stored | Why it matters |
| --- | --- | --- |
| Spatial boundaries | NUTS polygons from level 0 to 3 | Keeps all later joins on one official administrative hierarchy |
| Reference keys | Region IDs and hierarchy links | Enables deterministic aggregation and roll-up checks |
| Geometry validation flags | Basic geometry sanity checks | Prevents silent failures in downstream point-in-polygon operations |
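
As an illustration of the geometry validation flags in the last row, the sketch below computes a few basic checks on a polygon's outer ring. The flag names and the choice of checks are my assumptions, not the notebook's schema, and they are deliberately weaker than full validity testing (nothing here detects self-intersections, which a function such as `ST_IsValid` in DuckDB spatial would cover).

```python
def shoelace_area(ring):
    """Signed area of a closed ring via the shoelace formula."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )


def validation_flags(ring):
    """Return basic sanity flags for one outer ring (hypothetical flag names)."""
    is_closed = len(ring) > 1 and ring[0] == ring[-1]
    # A closed ring needs three distinct corners plus the repeated first point.
    has_enough_points = len(ring) >= 4
    has_area = is_closed and abs(shoelace_area(ring)) > 0.0
    return {
        "is_closed": is_closed,
        "has_enough_points": has_enough_points,
        "has_area": has_area,
        "ok": is_closed and has_enough_points and has_area,
    }


# Hypothetical rings: one sound, one unclosed, one collapsed onto a line.
square = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]
unclosed = [(0, 0), (2, 0), (2, 2), (0, 2)]
collinear = [(0, 0), (1, 1), (2, 2), (0, 0)]

for name, ring in [("square", square), ("unclosed", unclosed), ("collinear", collinear)]:
    print(name, validation_flags(ring))
```

Storing flags like these next to each region lets a later join filter on `ok` instead of failing, or quietly matching nothing, on a degenerate polygon.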

What this unlocks

With the spatial frame in place, the next notebooks can ask better questions: where station coverage is thin, whether yield tables map cleanly, and which soil or weather features survive district aggregation.
