DuckDB

  • Open-source, column-oriented RDBMS (relational database management system)
  • High performance on complex queries, even on large datasets
  • Flexible extension system — spatial is a core extension

Why DuckDB for spatial data?

  • R/sf loads entire datasets into memory → bottleneck with large data
  • DuckDB can stream from disk — processes data exceeding available RAM
  • SQL: declarative language — describe what you want, not how
  • Spatial extension uses familiar function names (st_intersects, st_area, …)
  • Spatial indexing (R-Tree) makes large-scale spatial queries fast

Reinventing DBMS?

  • dplyr, data.table, arrow do relational transformations (select, join, aggregate, …)
  • But compared to a real DBMS, they lack:
    • Holistic query optimization across a pipeline
    • Out-of-memory computation
    • Parallelism and pipelining
  • DuckDB brings 50+ years of DBMS research into your R/Python session

OLAP vs. OLTP

OLAP (Analytical) OLTP (Transactional)
Workload Read-mostly Many small writes
Queries Complex, scan large parts Simple, touch individual rows
Updates Bulk appends Frequent row-level updates
  • DuckDB = OLAP

In-process vs. standalone

  • In-process: runs inside your application (no separate server)
    • DuckDB, SQLite
  • Standalone: separate server process, client connects over network
    • PostgreSQL, MySQL
Table 1: DuckDB fills a niche that no previous software has filled yet
OLTP OLAP
In-Process
Stand-Alone

Let’s put DuckDB to work

DuckDB in practice

  • Download wald-kantone.duckdb from Moodle (forest + canton boundaries)
  • Install DuckDB CLI + R package
  • Install DBeaver Community
  • Connect to the database in DBeaver
  • Install and load the spatial extension:
INSTALL spatial;
LOAD spatial;
  • Verify both tables are present and explore them:
SHOW TABLES;
SELECT * FROM wald;
SELECT * FROM kantone;
  • Create an R-Tree spatial index for both tables:
CREATE INDEX kantone_idx ON kantone USING RTREE (geom);
CREATE INDEX wald_idx ON wald USING RTREE (geom);

Why spatial indices?

  • Without an index: WHERE st_intersects(a, b) checks every pair → slow
  • R-Tree index organizes geometries by bounding boxes
  • Eliminates non-overlapping pairs cheaply → only candidates get the expensive exact test
  • Reduces comparisons dramatically (e.g. millions → thousands)

Building an analysis step-by-step

  • Indices speed up queries — but we still need to compose the analysis
  • Goal: compute forest share per canton — intersect forest polygons with canton boundaries, then aggregate areas
  • Strategy: break it into small, reusable pieces using SQL VIEWs

SQL VIEWs

  • A VIEW = a named SQL query (virtual table)
  • Re-executed every time you access it
  • Lets us build complex analyses step-by-step
  • Our first VIEW: a subset of the forest dataset (for faster iteration)
  • Limit to 1’000 rows and store as a VIEW:
1CREATE VIEW wald2 AS
2SELECT * FROM wald LIMIT 1000;
1
Prepend CREATE VIEW name AS to any SELECT
2
… to store it as a reusable virtual table
  • Query it like a regular table:
SELECT * FROM wald2;

Spatial intersection in SQL

  • Step 1 of the pipeline: intersect forest polygons with canton boundaries
SELECT 
  name, 
2  st_intersection(w.geom, k.geom),
1FROM wald2 w, kantone k;
1
w and k are aliases…
2
… used in the intersection
┌──────────────────────┬───────────────────────────────────────────────────────┐
│         name         │            st_intersection(w.geom, k.geom)            │
│       varchar        │                       geometry                        │
├──────────────────────┼───────────────────────────────────────────────────────┤
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│   ·                  │       ·                                               │
│   ·                  │       ·                                               │
│   ·                  │       ·                                               │
│ Appenzell Innerrho…  │ POLYGON Z ((2752421.3200000883 1239807.9399965794 9…  │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
├──────────────────────┴───────────────────────────────────────────────────────┤
│ 26000 rows (10 shown)                                              2 columns │
└──────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 1.292 user 1.293885 sys 0.005016
  • Problem: cross join — every forest polygon × every canton (1’000 × 26 = 26’000 pairs)
  • Solution: add WHERE to filter early:
SELECT
  name,
  st_intersection(w.geom, k.geom),
FROM wald2 w, kantone k
1WHERE st_intersects(w.geom, k.geom);
1
Only compute the intersection for pairs that actually overlap (R-Tree!)
┌──────────────────────┬───────────────────────────────────────────────────────┐
│         name         │            st_intersection(w.geom, k.geom)            │
│       varchar        │                       geometry                        │
├──────────────────────┼───────────────────────────────────────────────────────┤
│ Appenzell Innerrho…  │ POLYGON Z ((2750577.8690000786 1236499.1299968604 1…  │
│ St. Gallen           │ POLYGON Z ((2733301.3710000697 1235901.1909968755 1…  │
│ St. Gallen           │ POLYGON Z ((2730828.668000072 1236533.6979968112 10…  │
│ Appenzell Innerrho…  │ POLYGON Z ((2751117.7790000793 1236511.4949968604 1…  │
│ Zürich               │ POLYGON Z ((2712200.8850000626 1236517.2959967866 6…  │
│   ·                  │                           ·                           │
│   ·                  │                           ·                           │
│   ·                  │                           ·                           │
│ Appenzell Innerrho…  │ POLYGON Z ((2752421.3200000883 1239807.9399965794 9…  │
│ Zürich               │ POLYGON Z ((2684156.710000051 1239693.0049964704 51…  │
│ St. Gallen           │ POLYGON Z ((2734307.2300000787 1239757.5959965463 1…  │
│ Appenzell Ausserrh…  │ POLYGON Z ((2735958.5650000805 1239767.1879965463 1…  │
│ Appenzell Ausserrh…  │ POLYGON Z ((2738798.7040000805 1239749.8799965512 8…  │
├──────────────────────┴───────────────────────────────────────────────────────┤
│ 1027 rows (10 shown)                                               2 columns │
└──────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.818 user 0.823316 sys 0.006000

st_intersects vs st_intersection

This query uses both: st_intersects in WHERE = predicate (true/false filter), st_intersection in SELECT = operation (computes geometry). Same pattern as in R/sf!

  • We need the area, not the geometry itself:
SELECT 
  name, 
1  st_area(st_intersection(w.geom, k.geom)) as wald_area,
FROM wald2 w, kantone k
WHERE st_intersects(w.geom, k.geom);
1
st_area calculates the area of the intersection
┌────────────────────────┬────────────────────┐
│          name          │     wald_area      │
│        varchar         │       double       │
├────────────────────────┼────────────────────┤
│ Appenzell Innerrhoden  │  3223.381481670876 │
│ St. Gallen             │ 490040.98548061884 │
│ St. Gallen             │ 1515.7608568453097 │
│ Appenzell Innerrhoden  │ 5697.7576364166425 │
│ Zürich                 │  9248.165822466484 │
│   ·                    │          ·         │
│   ·                    │          ·         │
│   ·                    │          ·         │
│ Appenzell Innerrhoden  │ 31971.908096492414 │
│ Zürich                 │ 11339.135541535818 │
│ St. Gallen             │ 11570.958247649043 │
│ Appenzell Ausserrhoden │ 3659.7222327571385 │
│ Appenzell Ausserrhoden │ 1371.8734574811795 │
├────────────────────────┴────────────────────┤
│ 1027 rows (10 shown)              2 columns │
└─────────────────────────────────────────────┘
Run Time (s): real 0.810 user 0.811628 sys 0.018473
  • Save this as a VIEW before aggregating:
1CREATE VIEW wald_kantone AS
SELECT 
  name, 
  st_area(st_intersection(w.geom, k.geom)) AS wald_area,
FROM wald2 w, kantone k
WHERE st_intersects(w.geom, k.geom);
1
This creates a VIEW from the preceding query
  • Query the VIEW like a table, then aggregate with GROUP BY:
SELECT 
3  name,
2  sum(wald_area) as wald_area
FROM wald_kantone
1GROUP BY name;
1
If we use GROUP BY in a SQL query…
2
… we need to wrap all columns with aggregate function…
3
… except for the columns that we use for grouping
  • Save the aggregation as a VIEW:
CREATE VIEW wald_kanton_grp AS
SELECT 
  name, 
  sum(wald_area) as wald_area
FROM wald_kantone
GROUP BY name;
  • Join with kantone to get the total canton area and compute the fraction:
SELECT 
3    kantone.name,
4    wald_area/area as waldanteil,
FROM wald_kanton_grp 
1LEFT JOIN kantone
2ON wald_kanton_grp.name=kantone.name;
1
LEFT JOIN appends columns from another table…
2
… matched by the ON condition
3
We only need the canton name…
4
… and the fraction wald_area / area
  • Save as a final VIEW, ordered by forest share:
CREATE VIEW kanton_frac AS
SELECT 
    kantone.name,                 
    wald_area/area as waldanteil, 
FROM wald_kanton_grp 
LEFT JOIN kantone 
ON wald_kanton_grp.name=kantone.name
1ORDER BY waldanteil DESC;
1
We can ORDER BY to show us the highest values first

Scaling up

  • So far: only 1’000 forest features → incomplete results
  • Since we used VIEWs, switching to the full dataset is trivial:
CREATE OR REPLACE VIEW wald2 AS
SELECT * FROM wald;
  • Every downstream VIEW now automatically uses the full data:
SELECT * FROM kanton_frac;

VIEW vs CREATE TABLE ... AS

  • VIEW = lazy (re-executed on every access)
  • CREATE TABLE ... AS = materialized (stored on disk):
1CREATE TABLE wald_kantone_mat AS
SELECT
  name,
  st_area(st_intersection(w.geom, k.geom)) AS wald_area
FROM wald2 w, kantone k
WHERE st_intersects(w.geom, k.geom);
1
This stores the result as an actual table on disk
  • Trade-off:
    • Materialized: faster to query, takes disk space, does not auto-update
    • VIEW: no extra storage, always current, slower to query

Import into R

  • Connect, load spatial extension, read the VIEW:
library(duckdb)

con <- dbConnect(
  duckdb(),
  dbdir = "data/vector-deepdive/wald-kantone.duckdb",
  read_only = TRUE
)

dbExecute(con, "LOAD spatial;")
kanton_frac <- dbReadTable(con, "kanton_frac")

dbDisconnect(con)

Wrap-up

  • DuckDB = in-process OLAP database with a spatial extension
  • SQL lets you express spatial analyses declaratively
  • R-Tree indices make spatial predicates fast
  • VIEWs let you build analyses incrementally — easy to iterate and scale
  • Results flow back into R for further analysis or visualization