duckdb#

What it is#

duckdb is an embedded OLAP database — SQL engine, columnar storage, vectorised executor — packaged as a Python library you pip install. There is no server process; queries run in-process against in-memory tables, local CSV/Parquet/JSON files, pandas DataFrames, polars DataFrames, and Arrow tables, with zero-copy where possible.

On PyPI it occupies the niche between SQLite (row-store transactional) and a standalone analytics warehouse — fast, embedded, zero-config, designed for analytical queries on local data.

Install#

pip install duckdb

Output: (none — exits 0 on success)

uv add duckdb

Output: dependency resolved, added to pyproject.toml

poetry add duckdb

Output: lockfile updated, installed into the project venv

pip install --upgrade duckdb

Output: in-place upgrade — re-open any persistent .duckdb files written by an older binary

Versioning & Python support#

duckdb uses a fast-moving CalVer-ish minor + storage-format version scheme. Releases are roughly monthly; major (0.x → 1.x) jumps signal stable on-disk format guarantees. As of late 2025 the project is on the 1.x line with cross-version storage compatibility.

DuckDB line	Python support	On-disk format
0.10.x	3.7 – 3.11	legacy; requires `EXPORT DATABASE` / re-`IMPORT` to upgrade
0.11.x	3.8 – 3.12	improved storage; still legacy
1.x	3.9 – 3.13	forward-compatible storage format

The Python binding bundles the C++ engine — there is no separate libduckdb to install. Wheels exist for Linux (manylinux/musllinux), macOS (universal2), and Windows.

Package metadata#

Maintainer: DuckDB Foundation; primary authors Mark Raasveldt and Hannes Mühleisen (CWI Amsterdam)
Project home: github.com/duckdb/duckdb
Docs: duckdb.org/docs
License: MIT
PyPI: pypi.org/project/duckdb
Governance: non-profit DuckDB Foundation; commercial sponsor MotherDuck
First released: 2018 (research project); 1.0 in 2024
Downloads: > 20 M / month on PyPI as of late 2025

Optional dependencies & extras#

duckdb itself ships with no PyPI extras — additional functionality is loaded as DuckDB extensions from inside SQL:

INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_parquet('s3://bucket/data.parquet');

Output: httpfs extension downloaded into ~/.duckdb/extensions/<version>/ and registered

Common extensions:

Extension	What it enables
`httpfs`	read/write `http://`, `s3://`, `gcs://`, `azure://` URLs
`parquet`	bundled by default since 1.0
`json`	bundled; full JSON read/write
`iceberg`	read Apache Iceberg tables
`delta`	read Delta Lake tables
`postgres`	scan a live Postgres instance from DuckDB
`sqlite`	scan a SQLite file as DuckDB tables
`mysql`	scan a live MySQL instance
`excel`	read XLSX files natively
`spatial`	GIS — geometry types, ST_* functions

PyPI companion packages most pipelines pull alongside duckdb:

pip install duckdb pandas polars pyarrow jupyter

Output: installs duckdb + the DataFrame libraries it interops with zero-copy

Alternatives#

Package	One-line trade-off
SQLite	row-store OLTP — slow on analytical queries
pandas	imperative API; slower joins/aggregations on multi-GB data
polars	lazy DataFrame; complementary, not a replacement (use both)
PostgreSQL	server-based; needs a daemon, much heavier setup
ClickHouse	server-based OLAP, scales further but operational overhead
Apache Datafusion	Rust-native query engine; less Python-polished

Common gotchas#

In-process only. duckdb is not a server. Concurrent processes opening the same .duckdb file fight for the write lock; use a single writer or switch to ClickHouse/Postgres for true concurrency.
Storage-format compatibility. Files written by 0.x can require export/import to load in 1.x. Within 1.x, format is forward-compatible. Pin the duckdb version in lockfiles for any persistent database.
Default connection is process-shared. duckdb.sql(...) uses one global in-memory connection. For threads or independent state use duckdb.connect() and pass the connection explicitly.
pandas/polars interop is zero-copy on Arrow — but not always. df = duckdb.sql("...").df() materialises into pandas (copies). Use .arrow() or .pl() to stay in Arrow and avoid the round-trip.
Extensions download on first INSTALL. Air-gapped environments need duckdb_extension_directory pre-seeded or the extension built in.
httpfs S3 credentials. duckdb does not read ~/.aws/credentials automatically — use SET s3_access_key_id and s3_secret_access_key, or CREATE SECRET (1.0+).
pip install duckdb on Apple Silicon Rosetta. Wheels are universal2; under Rosetta you accidentally pull the x86_64 slice and run 3× slower than necessary. Verify with import duckdb; duckdb.__file__.

Real-world recipes#

duckdb is the “one-process SQL warehouse” — its sweet spot is querying mixed local files (CSV, Parquet, JSON), pandas / polars frames, and remote object storage from a single Python script. The companion sections/python/duckdb covers SQL/API depth; the recipes below focus on the packaging-level setup each pattern requires.

Multi-format federation in one query — duckdb’s killer feature: join a Parquet directory, a CSV, a JSON dump, and a pandas frame in a single SQL statement.

import duckdb
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "tier": ["free", "pro", "free"]})

con = duckdb.connect(":memory:")
con.register("users_df", users)

result = con.sql("""
    SELECT u.tier, COUNT(*) AS event_count, SUM(e.revenue) AS total
    FROM read_parquet('events/*.parquet') e
    JOIN users_df u USING (user_id)
    WHERE e.event_date >= DATE '2026-01-01'
    GROUP BY u.tier
    ORDER BY total DESC
""").df()
print(result)

Output: the aggregation; no copy of users is made — DuckDB scans it via Arrow as a virtual table

S3 / GCS query with the httpfs extension:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("CREATE SECRET aws (TYPE s3, KEY_ID 'AKIA...', SECRET '...', REGION 'us-east-1');")

q = """
SELECT date_trunc('day', ts) AS day, count(*) AS n
FROM read_parquet('s3://bucket/events/2026/*/*.parquet')
GROUP BY day
ORDER BY day
"""
df = con.sql(q).df()
print(df.head())

Output: the per-day count; httpfs streams just the necessary row groups from S3, never downloading whole files

Parquet pushdown for column projection + filter:

import duckdb

q = """
SELECT user_id, SUM(revenue) AS total
FROM read_parquet('warehouse/events/*.parquet')
WHERE event_date >= '2026-01-01' AND status = 'active'
GROUP BY user_id
ORDER BY total DESC
LIMIT 10
"""
duckdb.sql(q).show()

Output: the top-10 users by revenue; DuckDB’s optimiser pushes the filter and projection into the Parquet reader so only the necessary columns and row groups are decoded

Persistent on-disk database:

import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS orders AS SELECT * FROM read_csv_auto('orders.csv')")
con.execute("CREATE INDEX IF NOT EXISTS idx_orders_user ON orders(user_id)")
con.close()

Output: a single-file warehouse.duckdb on disk, queryable by any DuckDB client at the same major version

Postgres scan (no ETL needed):

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'host=db.example.com dbname=prod user=ro' AS pg (TYPE postgres)")
df = con.sql("SELECT id, name FROM pg.public.users WHERE created_at > '2026-01-01'").df()
print(df.head())

Output: a DataFrame containing live Postgres data; no dump-and-reload step required

Performance tuning#

duckdb’s vectorised executor is fast by default. The remaining tuning levers are about what to skip (column projection, predicate pushdown) and how to spend RAM (thread count, temp-disk spill).

import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8")           # cap CPU usage
con.execute("SET memory_limit = '12GB'")  # spill to disk past this
con.execute("SET temp_directory = '/fast-ssd/duckdb-tmp'")

Output: no return value; subsequent queries respect the settings

Tuning levers in order of impact:

Lever	Mechanism	When it helps
Use Parquet over CSV	columnar pushdown	most analytical workloads
`SELECT col1, col2` (no `SELECT *`)	projection pushdown	wide tables, narrow queries
Explicit `WHERE ts >= ...` on partitioned data	predicate pushdown	partitioned Parquet trees
`SET threads TO N`	thread cap	shared/CI runners
`SET memory_limit = '...'`	spill threshold	low-RAM containers
`SET preserve_insertion_order=false`	drop ordering for shuffles	huge group-by / sort
`pragma enable_progress_bar`	progress UI	long interactive queries

Inspecting the query plan:

import duckdb

q = """
SELECT region, sum(revenue)
FROM read_parquet('events/*.parquet')
WHERE event_date >= '2026-01-01'
GROUP BY region
"""
duckdb.sql("EXPLAIN " + q).show()
duckdb.sql("EXPLAIN ANALYZE " + q).show()

Output: the physical plan (EXPLAIN) and the same plan annotated with per-operator timings (EXPLAIN ANALYZE) — the difference shows where time is going

Memory & dataset-size scaling#

duckdb is out-of-core by design — it spills to disk past the configured memory limit and works on datasets much larger than RAM. The main scaling lever is configuring spill location and memory budget; you rarely need to chunk data manually.

import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '4GB'")
con.execute("SET temp_directory = '/var/tmp/duckdb'")

# Aggregate 500 GB of Parquet on a 4 GB-RAM box
con.sql("""
COPY (
    SELECT user_id, SUM(revenue) AS total
    FROM read_parquet('warehouse/**/*.parquet')
    GROUP BY user_id
) TO 'user_totals.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');
""")

Output: writes user_totals.parquet; spills intermediate hash tables to /var/tmp/duckdb as needed

Streaming arrow output without materialising:

import duckdb
import pyarrow as pa

con = duckdb.connect()
reader = con.execute("SELECT * FROM read_parquet('huge.parquet')").fetch_record_batch(rows_per_batch=100_000)
for batch in reader:
    # process one batch at a time — never holds the full table in RAM
    print(batch.num_rows)

Output: prints the rows per batch as DuckDB streams Arrow record batches; useful for piping into downstream processors

For genuinely massive workloads (TB+), the right pairing is DuckDB + MotherDuck (the hosted DuckDB service from the same authors) or DuckDB + an Iceberg/Delta table layout so multiple processes can read the same dataset.

Version migration guide#

DuckDB ships roughly monthly. The on-disk storage format compatibility story is the most important migration concern.

Format compatibility timeline:

0.8.x → 0.9.x → 0.10.x: required EXPORT DATABASE/IMPORT DATABASE round-trips between minors. Painful for any persistent file.
1.0 (2024): forward-compatible storage format declared. Files written by 1.0 can be opened by every later 1.x release.
1.x → 1.x: SQL surface and extension API can change, but storage stays compatible. Read the changelog for breaking SQL keywords or function renames.

import duckdb

print(duckdb.__version__)
con = duckdb.connect("warehouse.duckdb")
print(con.execute("PRAGMA database_size").fetchone())

Output: the installed version and the on-disk size — sanity-check both before upgrading

Common SQL-level migrations within 1.x:

read_csv_auto is still alive but read_csv with auto_detect=true is the preferred form.
The STRUCT_PACK keyword for inline structs became more strict on field naming.
LIST and ARRAY are aliases; older code mixed them — pick one.
Several extensions moved from “bundled” to “auto-loadable” — first-run pulls them from the extension repository unless autoinstall_known_extensions is false (offline-safe).

Pin the duckdb version in production lockfiles. Persistent .duckdb files written by a future version cannot be opened by an older binary even within 1.x.

Interop with adjacent ecosystems#

DuckDB speaks Arrow natively, which is why it pairs so well with pandas and polars. The interop matrix below tells you when a query result is zero-copy versus when it triggers a serialisation.

Result method	Returns	Zero-copy?
`.df()`	pandas DataFrame	Yes if backed by Arrow string types; copy otherwise
`.pl()`	polars DataFrame	Yes
`.arrow()`	pyarrow Table	Yes
`.fetch_record_batch()`	streaming Arrow batches	Yes, streamed
`.fetchall()`	list of tuples	Copy (Python objects)
`.fetchone()`	one tuple	Copy
`.fetchnumpy()`	dict of NumPy arrays	Yes for numeric, copy for strings

import duckdb
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2, 3]})
plf = pl.DataFrame({"a": [4, 5, 6]})

# Both frames are accessible inside SQL with zero copy
result = duckdb.sql("SELECT a FROM pdf UNION ALL SELECT a FROM plf").pl()
print(result)

Output: a polars frame with [1, 2, 3, 4, 5, 6]; DuckDB sees both source frames as Arrow scans

Crossing into the Jupyter ecosystem: duckdb.sql(query) returns a DuckDBPyRelation that renders nicely in notebooks. Use %load_ext sql + %sql duckdb:/// for cell-magic SQL.

Troubleshooting common errors#

The errors below are the recurring frictions; most resolve with a one-line setting change.

IO Error: Cannot open file '...': No such file — relative path mismatch. DuckDB resolves paths relative to the process CWD, not the script directory. Use absolute paths.
Catalog Error: Table with name X does not exist — you used duckdb.sql(...) (which uses a fresh in-memory connection) and registered the frame on a different connection. Use one explicit con = duckdb.connect(...) for the whole script.
Permission denied writing to a .duckdb file — another process holds the write lock. DuckDB allows multiple readers but only one writer.
HTTP 403 from httpfs — wrong region or stale credentials. Use CREATE SECRET ... TYPE s3 (1.0+) rather than legacy SET s3_* for clean error messages.
Cannot load extension '...': not on the autoload list — an extension is community-maintained. Set SET allow_community_extensions = true first.
Stale ~/.duckdb/extensions/<version>/ after upgrading — extensions are version-keyed; delete the old version directory or run INSTALL <ext> -- update to refresh.
Storage version error opening a .duckdb file with an older binary — upgrade the binary or use EXPORT DATABASE from the newer one and IMPORT DATABASE on the older.
OOM despite memory_limit — some operators (large window functions, recursive CTEs) ignore the limit. Add SET temp_directory = '/fast-disk/...' and ensure that disk has room.

When NOT to use this#

duckdb is the right answer surprisingly often; the cases below are where another tool wins.

Concurrent writers / transactional workloads: DuckDB is one-writer. Use Postgres, SQLite, or a true OLTP system.
Streaming ingest: duckdb is batch-oriented; use Flink, Beam, or a streaming framework.
Cluster-scale (multi-TB): duckdb is single-process. Use Snowflake, BigQuery, Clickhouse, or a distributed query engine for petabyte-scale.
Pure DataFrame ergonomics: if everyone on the team prefers fluent APIs over SQL, polars is a better daily-driver.
Vector / time-series specialist workloads: duckdb has good support but specialised stores (DuckDB + VSS extension vs Pinecone; TimescaleDB; ClickHouse) sometimes win on shape-specific operations.
You need cross-region latency-bounded reads: DuckDB does not know about region affinity; a cloud warehouse with regional storage will do better.

g h	home
g p	Programming section
g p	Python section
g j	JavaScript section
g t	TypeScript section
g o	OS section
g l	Linux section
g w	Windows section
g z	z/OS section
g o	macOS section
g a	AI section
g c	Claude Code section
g c	Codex CLI section
g c	Claude API section
g p	Prompting section
g f	Frameworks section
g p	Packages section
g p	Pip (Python) section
g p	npm (Node) section
g p	Cargo (Rust) section
g p	Go modules section
g g	graph view
g t	tags index

⌘K / /	open search palette
t	cycle theme (dark → light → system)
?	toggle this panel

[ / ]	previous / next sheet in section
j / k	scroll down / up

duckdb — Embedded Analytical SQL