DuckDB

Open-source, column-oriented RDBMS (relational database management system)
High performance on complex queries, even on large datasets
Flexible extension system — spatial is a core extension

Why DuckDB for spatial data?

R/sf loads entire datasets into memory → bottleneck with large data
DuckDB can stream from disk — processes data exceeding available RAM
SQL: declarative language — describe what you want, not how
Spatial extension uses familiar function names (st_intersects, st_area, …)
Spatial indexing (R-Tree) makes large-scale spatial queries fast

Reinventing DBMS?

dplyr, data.table, arrow do relational transformations (select, join, aggregate, …)
But compared to a real DBMS, they lack:
- Holistic query optimization across a pipeline
- Out-of-memory computation
- Parallelism and pipelining
DuckDB brings 50+ years of DBMS research into your R/Python session

OLAP vs. OLTP

	OLAP (Analytical)	OLTP (Transactional)
Workload	Read-mostly	Many small writes
Queries	Complex, scan large parts	Simple, touch individual rows
Updates	Bulk appends	Frequent row-level updates

DuckDB = OLAP

In-process vs. standalone

In-process: runs inside your application (no separate server)
- DuckDB, SQLite
Standalone: separate server process, client connects over network
- PostgreSQL, MySQL

Table 1: DuckDB fills a niche that no previous software has filled yet

Let’s put DuckDB to work

DuckDB in practice

Download wald-kantone.duckdb from Moodle (forest + canton boundaries)
Install DuckDB CLI + R package
Install DBeaver Community
Connect to the database in DBeaver
Install and load the spatial extension:

INSTALL spatial;
LOAD spatial;

Verify both tables are present and explore them:

SHOW TABLES;
SELECT * FROM wald;
SELECT * FROM kantone;

Create an R-Tree spatial index for both tables:

CREATE INDEX kantone_idx ON kantone USING RTREE (geom);
CREATE INDEX wald_idx ON wald USING RTREE (geom);

Why spatial indices?

Without an index: WHERE st_intersects(a, b) checks every pair → slow
R-Tree index organizes geometries by bounding boxes
Eliminates non-overlapping pairs cheaply → only candidates get the expensive exact test
Reduces comparisons dramatically (e.g. millions → thousands)

Building an analysis step-by-step

Indices speed up queries — but we still need to compose the analysis
Goal: compute forest share per canton — intersect forest polygons with canton boundaries, then aggregate areas
Strategy: break it into small, reusable pieces using SQL VIEWs

SQL `VIEW`s

A VIEW = a named SQL query (virtual table)
Re-executed every time you access it
Lets us build complex analyses step-by-step

Our first VIEW: a subset of the forest dataset (for faster iteration)

Limit to 1’000 rows and store as a VIEW:

1CREATE VIEW wald2 AS
2SELECT * FROM wald LIMIT 1000;

1: Prepend CREATE VIEW name AS to any SELECT…
2: … to store it as a reusable virtual table

Query it like a regular table:

SELECT * FROM wald2;

Spatial intersection in SQL

Step 1 of the pipeline: intersect forest polygons with canton boundaries

SELECT 
  name, 
2  st_intersection(w.geom, k.geom),
1FROM wald2 w, kantone k;

1: w and k are aliases…
2: … used in the intersection

┌──────────────────────┬───────────────────────────────────────────────────────┐
│         name         │            st_intersection(w.geom, k.geom)            │
│       varchar        │                       geometry                        │
├──────────────────────┼───────────────────────────────────────────────────────┤
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│ Genève               │ POLYGON EMPTY                                         │
│   ·                  │       ·                                               │
│   ·                  │       ·                                               │
│   ·                  │       ·                                               │
│ Appenzell Innerrho…  │ POLYGON Z ((2752421.3200000883 1239807.9399965794 9…  │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
│ Appenzell Innerrho…  │ POLYGON EMPTY                                         │
├──────────────────────┴───────────────────────────────────────────────────────┤
│ 26000 rows (10 shown)                                              2 columns │
└──────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 1.292 user 1.293885 sys 0.005016

Problem: cross join — every forest polygon × every canton (1’000 × 26 = 26’000 pairs)
Solution: add WHERE to filter early:

SELECT
  name,
  st_intersection(w.geom, k.geom),
FROM wald2 w, kantone k
1WHERE st_intersects(w.geom, k.geom);

1: Only compute the intersection for pairs that actually overlap (R-Tree!)

┌──────────────────────┬───────────────────────────────────────────────────────┐
│         name         │            st_intersection(w.geom, k.geom)            │
│       varchar        │                       geometry                        │
├──────────────────────┼───────────────────────────────────────────────────────┤
│ Appenzell Innerrho…  │ POLYGON Z ((2750577.8690000786 1236499.1299968604 1…  │
│ St. Gallen           │ POLYGON Z ((2733301.3710000697 1235901.1909968755 1…  │
│ St. Gallen           │ POLYGON Z ((2730828.668000072 1236533.6979968112 10…  │
│ Appenzell Innerrho…  │ POLYGON Z ((2751117.7790000793 1236511.4949968604 1…  │
│ Zürich               │ POLYGON Z ((2712200.8850000626 1236517.2959967866 6…  │
│   ·                  │                           ·                           │
│   ·                  │                           ·                           │
│   ·                  │                           ·                           │
│ Appenzell Innerrho…  │ POLYGON Z ((2752421.3200000883 1239807.9399965794 9…  │
│ Zürich               │ POLYGON Z ((2684156.710000051 1239693.0049964704 51…  │
│ St. Gallen           │ POLYGON Z ((2734307.2300000787 1239757.5959965463 1…  │
│ Appenzell Ausserrh…  │ POLYGON Z ((2735958.5650000805 1239767.1879965463 1…  │
│ Appenzell Ausserrh…  │ POLYGON Z ((2738798.7040000805 1239749.8799965512 8…  │
├──────────────────────┴───────────────────────────────────────────────────────┤
│ 1027 rows (10 shown)                                               2 columns │
└──────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.818 user 0.823316 sys 0.006000

`st_intersects` vs `st_intersection`

This query uses both: st_intersects in WHERE = predicate (true/false filter), st_intersection in SELECT = operation (computes geometry). Same pattern as in R/sf!

We need the area, not the geometry itself:

SELECT 
  name, 
1  st_area(st_intersection(w.geom, k.geom)) as wald_area,
FROM wald2 w, kantone k
WHERE st_intersects(w.geom, k.geom);

1: st_area calculates the area of the intersection

┌────────────────────────┬────────────────────┐
│          name          │     wald_area      │
│        varchar         │       double       │
├────────────────────────┼────────────────────┤
│ Appenzell Innerrhoden  │  3223.381481670876 │
│ St. Gallen             │ 490040.98548061884 │
│ St. Gallen             │ 1515.7608568453097 │
│ Appenzell Innerrhoden  │ 5697.7576364166425 │
│ Zürich                 │  9248.165822466484 │
│   ·                    │          ·         │
│   ·                    │          ·         │
│   ·                    │          ·         │
│ Appenzell Innerrhoden  │ 31971.908096492414 │
│ Zürich                 │ 11339.135541535818 │
│ St. Gallen             │ 11570.958247649043 │
│ Appenzell Ausserrhoden │ 3659.7222327571385 │
│ Appenzell Ausserrhoden │ 1371.8734574811795 │
├────────────────────────┴────────────────────┤
│ 1027 rows (10 shown)              2 columns │
└─────────────────────────────────────────────┘
Run Time (s): real 0.810 user 0.811628 sys 0.018473

Save this as a VIEW before aggregating:

1CREATE VIEW wald_kantone AS
SELECT 
  name, 
  st_area(st_intersection(w.geom, k.geom)) AS wald_area,
FROM wald2 w, kantone k
WHERE st_intersects(w.geom, k.geom);

1: This creates a VIEW from the preceding query

Query the VIEW like a table, then aggregate with GROUP BY:

SELECT 
3  name,
2  sum(wald_area) as wald_area
FROM wald_kantone
1GROUP BY name;

1: If we use GROUP BY in a SQL query…
2: … we need to wrap all columns with aggregate function…
3: … except for the columns that we use for grouping

Save the aggregation as a VIEW:

CREATE VIEW wald_kanton_grp AS
SELECT 
  name, 
  sum(wald_area) as wald_area
FROM wald_kantone
GROUP BY name;

Join with kantone to get the total canton area and compute the fraction:

SELECT 
3    kantone.name,
4    wald_area/area as waldanteil,
FROM wald_kanton_grp 
1LEFT JOIN kantone
2ON wald_kanton_grp.name=kantone.name;

1: LEFT JOIN appends columns from another table…
2: … matched by the ON condition
3: We only need the canton name…
4: … and the fraction wald_area / area

Save as a final VIEW, ordered by forest share:

CREATE VIEW kanton_frac AS
SELECT 
    kantone.name,                 
    wald_area/area as waldanteil, 
FROM wald_kanton_grp 
LEFT JOIN kantone 
ON wald_kanton_grp.name=kantone.name
1ORDER BY waldanteil DESC;

1: We can ORDER BY to show us the highest values first

Scaling up

So far: only 1’000 forest features → incomplete results
Since we used VIEWs, switching to the full dataset is trivial:

CREATE OR REPLACE VIEW wald2 AS
SELECT * FROM wald;

Every downstream VIEW now automatically uses the full data:

SELECT * FROM kanton_frac;

`VIEW` vs `CREATE TABLE ... AS`

VIEW = lazy (re-executed on every access)
CREATE TABLE ... AS = materialized (stored on disk):

1CREATE TABLE wald_kantone_mat AS
SELECT
  name,
  st_area(st_intersection(w.geom, k.geom)) AS wald_area
FROM wald2 w, kantone k
WHERE st_intersects(w.geom, k.geom);

1: This stores the result as an actual table on disk

Trade-off:
- Materialized: faster to query, takes disk space, does not auto-update
- VIEW: no extra storage, always current, slower to query

Import into R

Connect, load spatial extension, read the VIEW:

library(duckdb)

con <- dbConnect(
  duckdb(),
  dbdir = "data/vector-deepdive/wald-kantone.duckdb",
  read_only = TRUE
)

dbExecute(con, "LOAD spatial;")
kanton_frac <- dbReadTable(con, "kanton_frac")

dbDisconnect(con)

Wrap-up

DuckDB = in-process OLAP database with a spatial extension
SQL lets you express spatial analyses declaratively
R-Tree indices make spatial predicates fast
VIEWs let you build analyses incrementally — easy to iterate and scale
Results flow back into R for further analysis or visualization

	OLTP	OLAP
In-Process
Stand-Alone

DuckDB

Why DuckDB for spatial data?

Reinventing DBMS?

OLAP vs. OLTP

In-process vs. standalone

Let’s put DuckDB to work

DuckDB in practice

Why spatial indices?

Building an analysis step-by-step

SQL VIEWs

Spatial intersection in SQL

st_intersects vs st_intersection

Scaling up

VIEW vs CREATE TABLE ... AS

Import into R

Wrap-up

SQL `VIEW`s

`st_intersects` vs `st_intersection`

`VIEW` vs `CREATE TABLE ... AS`