duckdb#
What it is#
duckdb is an embedded OLAP database — SQL engine, columnar storage, vectorised executor — packaged as a Python library you pip install. There is no server process; queries run in-process against in-memory tables, local CSV/Parquet/JSON files, pandas DataFrames, polars DataFrames, and Arrow tables, with zero-copy where possible.
On PyPI it occupies the niche between SQLite (row-store transactional) and a standalone analytics warehouse — fast, embedded, zero-config, designed for analytical queries on local data.
Install#
pip install duckdb
Output: (none — exits 0 on success)
uv add duckdb
Output: dependency resolved, added to pyproject.toml
poetry add duckdb
Output: lockfile updated, installed into the project venv
pip install --upgrade duckdb
Output: in-place upgrade — re-open any persistent .duckdb files written by an older binary
Versioning & Python support#
duckdb uses a fast-moving CalVer-ish minor + storage-format version scheme. Releases are roughly monthly; major (0.x → 1.x) jumps signal stable on-disk format guarantees. As of late 2025 the project is on the 1.x line with cross-version storage compatibility.
| DuckDB line | Python support | On-disk format |
|---|---|---|
| 0.10.x | 3.7 – 3.11 | legacy; requires EXPORT DATABASE / re-IMPORT to upgrade |
| 0.11.x | 3.8 – 3.12 | improved storage; still legacy |
| 1.x | 3.9 – 3.13 | forward-compatible storage format |
The Python binding bundles the C++ engine — there is no separate libduckdb to install. Wheels exist for Linux (manylinux/musllinux), macOS (universal2), and Windows.
Package metadata#
- Maintainer: DuckDB Foundation; primary authors Mark Raasveldt and Hannes Mühleisen (CWI Amsterdam)
- Project home: github.com/duckdb/duckdb
- Docs: duckdb.org/docs
- License: MIT
- PyPI: pypi.org/project/duckdb
- Governance: non-profit DuckDB Foundation; commercial sponsor MotherDuck
- First released: 2018 (research project); 1.0 in 2024
- Downloads: > 20 M / month on PyPI as of late 2025
Optional dependencies & extras#
duckdb itself ships with no PyPI extras — additional functionality is loaded as DuckDB extensions from inside SQL:
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_parquet('s3://bucket/data.parquet');
Output: httpfs extension downloaded into ~/.duckdb/extensions/<version>/ and registered
Common extensions:
| Extension | What it enables |
|---|---|
httpfs | read/write http://, s3://, gcs://, azure:// URLs |
parquet | bundled by default since 1.0 |
json | bundled; full JSON read/write |
iceberg | read Apache Iceberg tables |
delta | read Delta Lake tables |
postgres | scan a live Postgres instance from DuckDB |
sqlite | scan a SQLite file as DuckDB tables |
mysql | scan a live MySQL instance |
excel | read XLSX files natively |
spatial | GIS — geometry types, ST_* functions |
PyPI companion packages most pipelines pull alongside duckdb:
pip install duckdb pandas polars pyarrow jupyter
Output: installs duckdb + the DataFrame libraries it interops with zero-copy
Alternatives#
| Package | One-line trade-off |
|---|---|
| SQLite | row-store OLTP — slow on analytical queries |
| pandas | imperative API; slower joins/aggregations on multi-GB data |
| polars | lazy DataFrame; complementary, not a replacement (use both) |
| PostgreSQL | server-based; needs a daemon, much heavier setup |
| ClickHouse | server-based OLAP, scales further but operational overhead |
| Apache Datafusion | Rust-native query engine; less Python-polished |
Common gotchas#
- In-process only. duckdb is not a server. Concurrent processes opening the same
.duckdbfile fight for the write lock; use a single writer or switch to ClickHouse/Postgres for true concurrency. - Storage-format compatibility. Files written by 0.x can require export/import to load in 1.x. Within 1.x, format is forward-compatible. Pin the duckdb version in lockfiles for any persistent database.
- Default connection is process-shared.
duckdb.sql(...)uses one global in-memory connection. For threads or independent state useduckdb.connect()and pass the connection explicitly. - pandas/polars interop is zero-copy on Arrow — but not always.
df = duckdb.sql("...").df()materialises into pandas (copies). Use.arrow()or.pl()to stay in Arrow and avoid the round-trip. - Extensions download on first
INSTALL. Air-gapped environments needduckdb_extension_directorypre-seeded or the extension built in. - httpfs S3 credentials. duckdb does not read
~/.aws/credentialsautomatically — useSET s3_access_key_idands3_secret_access_key, orCREATE SECRET(1.0+). pip install duckdbon Apple Silicon Rosetta. Wheels are universal2; under Rosetta you accidentally pull the x86_64 slice and run 3× slower than necessary. Verify withimport duckdb; duckdb.__file__.
Real-world recipes#
duckdb is the “one-process SQL warehouse” — its sweet spot is querying mixed local files (CSV, Parquet, JSON), pandas / polars frames, and remote object storage from a single Python script. The companion sections/python/duckdb covers SQL/API depth; the recipes below focus on the packaging-level setup each pattern requires.
Multi-format federation in one query — duckdb’s killer feature: join a Parquet directory, a CSV, a JSON dump, and a pandas frame in a single SQL statement.
import duckdb
import pandas as pd
users = pd.DataFrame({"user_id": [1, 2, 3], "tier": ["free", "pro", "free"]})
con = duckdb.connect(":memory:")
con.register("users_df", users)
result = con.sql("""
SELECT u.tier, COUNT(*) AS event_count, SUM(e.revenue) AS total
FROM read_parquet('events/*.parquet') e
JOIN users_df u USING (user_id)
WHERE e.event_date >= DATE '2026-01-01'
GROUP BY u.tier
ORDER BY total DESC
""").df()
print(result)
Output: the aggregation; no copy of users is made — DuckDB scans it via Arrow as a virtual table
S3 / GCS query with the httpfs extension:
import duckdb
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("CREATE SECRET aws (TYPE s3, KEY_ID 'AKIA...', SECRET '...', REGION 'us-east-1');")
q = """
SELECT date_trunc('day', ts) AS day, count(*) AS n
FROM read_parquet('s3://bucket/events/2026/*/*.parquet')
GROUP BY day
ORDER BY day
"""
df = con.sql(q).df()
print(df.head())
Output: the per-day count; httpfs streams just the necessary row groups from S3, never downloading whole files
Parquet pushdown for column projection + filter:
import duckdb
q = """
SELECT user_id, SUM(revenue) AS total
FROM read_parquet('warehouse/events/*.parquet')
WHERE event_date >= '2026-01-01' AND status = 'active'
GROUP BY user_id
ORDER BY total DESC
LIMIT 10
"""
duckdb.sql(q).show()
Output: the top-10 users by revenue; DuckDB’s optimiser pushes the filter and projection into the Parquet reader so only the necessary columns and row groups are decoded
Persistent on-disk database:
import duckdb
con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS orders AS SELECT * FROM read_csv_auto('orders.csv')")
con.execute("CREATE INDEX IF NOT EXISTS idx_orders_user ON orders(user_id)")
con.close()
Output: a single-file warehouse.duckdb on disk, queryable by any DuckDB client at the same major version
Postgres scan (no ETL needed):
import duckdb
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'host=db.example.com dbname=prod user=ro' AS pg (TYPE postgres)")
df = con.sql("SELECT id, name FROM pg.public.users WHERE created_at > '2026-01-01'").df()
print(df.head())
Output: a DataFrame containing live Postgres data; no dump-and-reload step required
Performance tuning#
duckdb’s vectorised executor is fast by default. The remaining tuning levers are about what to skip (column projection, predicate pushdown) and how to spend RAM (thread count, temp-disk spill).
import duckdb
con = duckdb.connect()
con.execute("SET threads TO 8") # cap CPU usage
con.execute("SET memory_limit = '12GB'") # spill to disk past this
con.execute("SET temp_directory = '/fast-ssd/duckdb-tmp'")
Output: no return value; subsequent queries respect the settings
Tuning levers in order of impact:
| Lever | Mechanism | When it helps |
|---|---|---|
| Use Parquet over CSV | columnar pushdown | most analytical workloads |
SELECT col1, col2 (no SELECT *) | projection pushdown | wide tables, narrow queries |
Explicit WHERE ts >= ... on partitioned data | predicate pushdown | partitioned Parquet trees |
SET threads TO N | thread cap | shared/CI runners |
SET memory_limit = '...' | spill threshold | low-RAM containers |
SET preserve_insertion_order=false | drop ordering for shuffles | huge group-by / sort |
pragma enable_progress_bar | progress UI | long interactive queries |
Inspecting the query plan:
import duckdb
q = """
SELECT region, sum(revenue)
FROM read_parquet('events/*.parquet')
WHERE event_date >= '2026-01-01'
GROUP BY region
"""
duckdb.sql("EXPLAIN " + q).show()
duckdb.sql("EXPLAIN ANALYZE " + q).show()
Output: the physical plan (EXPLAIN) and the same plan annotated with per-operator timings (EXPLAIN ANALYZE) — the difference shows where time is going
Memory & dataset-size scaling#
duckdb is out-of-core by design — it spills to disk past the configured memory limit and works on datasets much larger than RAM. The main scaling lever is configuring spill location and memory budget; you rarely need to chunk data manually.
import duckdb
con = duckdb.connect()
con.execute("SET memory_limit = '4GB'")
con.execute("SET temp_directory = '/var/tmp/duckdb'")
# Aggregate 500 GB of Parquet on a 4 GB-RAM box
con.sql("""
COPY (
SELECT user_id, SUM(revenue) AS total
FROM read_parquet('warehouse/**/*.parquet')
GROUP BY user_id
) TO 'user_totals.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');
""")
Output: writes user_totals.parquet; spills intermediate hash tables to /var/tmp/duckdb as needed
Streaming arrow output without materialising:
import duckdb
import pyarrow as pa
con = duckdb.connect()
reader = con.execute("SELECT * FROM read_parquet('huge.parquet')").fetch_record_batch(rows_per_batch=100_000)
for batch in reader:
# process one batch at a time — never holds the full table in RAM
print(batch.num_rows)
Output: prints the rows per batch as DuckDB streams Arrow record batches; useful for piping into downstream processors
For genuinely massive workloads (TB+), the right pairing is DuckDB + MotherDuck (the hosted DuckDB service from the same authors) or DuckDB + an Iceberg/Delta table layout so multiple processes can read the same dataset.
Version migration guide#
DuckDB ships roughly monthly. The on-disk storage format compatibility story is the most important migration concern.
Format compatibility timeline:
- 0.8.x → 0.9.x → 0.10.x: required
EXPORT DATABASE/IMPORT DATABASEround-trips between minors. Painful for any persistent file. - 1.0 (2024): forward-compatible storage format declared. Files written by 1.0 can be opened by every later 1.x release.
- 1.x → 1.x: SQL surface and extension API can change, but storage stays compatible. Read the changelog for breaking SQL keywords or function renames.
import duckdb
print(duckdb.__version__)
con = duckdb.connect("warehouse.duckdb")
print(con.execute("PRAGMA database_size").fetchone())
Output: the installed version and the on-disk size — sanity-check both before upgrading
Common SQL-level migrations within 1.x:
read_csv_autois still alive butread_csvwithauto_detect=trueis the preferred form.- The
STRUCT_PACKkeyword for inline structs became more strict on field naming. LISTandARRAYare aliases; older code mixed them — pick one.- Several extensions moved from “bundled” to “auto-loadable” — first-run pulls them from the extension repository unless
autoinstall_known_extensionsis false (offline-safe).
Pin the duckdb version in production lockfiles. Persistent .duckdb files written by a future version cannot be opened by an older binary even within 1.x.
Interop with adjacent ecosystems#
DuckDB speaks Arrow natively, which is why it pairs so well with pandas and polars. The interop matrix below tells you when a query result is zero-copy versus when it triggers a serialisation.
| Result method | Returns | Zero-copy? |
|---|---|---|
.df() | pandas DataFrame | Yes if backed by Arrow string types; copy otherwise |
.pl() | polars DataFrame | Yes |
.arrow() | pyarrow Table | Yes |
.fetch_record_batch() | streaming Arrow batches | Yes, streamed |
.fetchall() | list of tuples | Copy (Python objects) |
.fetchone() | one tuple | Copy |
.fetchnumpy() | dict of NumPy arrays | Yes for numeric, copy for strings |
import duckdb
import pandas as pd
import polars as pl
pdf = pd.DataFrame({"a": [1, 2, 3]})
plf = pl.DataFrame({"a": [4, 5, 6]})
# Both frames are accessible inside SQL with zero copy
result = duckdb.sql("SELECT a FROM pdf UNION ALL SELECT a FROM plf").pl()
print(result)
Output: a polars frame with [1, 2, 3, 4, 5, 6]; DuckDB sees both source frames as Arrow scans
Crossing into the Jupyter ecosystem: duckdb.sql(query) returns a DuckDBPyRelation that renders nicely in notebooks. Use %load_ext sql + %sql duckdb:/// for cell-magic SQL.
Troubleshooting common errors#
The errors below are the recurring frictions; most resolve with a one-line setting change.
IO Error: Cannot open file '...': No such file— relative path mismatch. DuckDB resolves paths relative to the process CWD, not the script directory. Use absolute paths.Catalog Error: Table with name X does not exist— you usedduckdb.sql(...)(which uses a fresh in-memory connection) and registered the frame on a different connection. Use one explicitcon = duckdb.connect(...)for the whole script.Permission deniedwriting to a.duckdbfile — another process holds the write lock. DuckDB allows multiple readers but only one writer.HTTP 403from httpfs — wrong region or stale credentials. UseCREATE SECRET ... TYPE s3(1.0+) rather than legacySET s3_*for clean error messages.Cannot load extension '...': not on the autoload list— an extension is community-maintained. SetSET allow_community_extensions = truefirst.- Stale
~/.duckdb/extensions/<version>/after upgrading — extensions are version-keyed; delete the old version directory or runINSTALL <ext> -- updateto refresh. - Storage version error opening a
.duckdbfile with an older binary — upgrade the binary or useEXPORT DATABASEfrom the newer one andIMPORT DATABASEon the older. - OOM despite
memory_limit— some operators (large window functions, recursive CTEs) ignore the limit. AddSET temp_directory = '/fast-disk/...'and ensure that disk has room.
When NOT to use this#
duckdb is the right answer surprisingly often; the cases below are where another tool wins.
- Concurrent writers / transactional workloads: DuckDB is one-writer. Use Postgres, SQLite, or a true OLTP system.
- Streaming ingest: duckdb is batch-oriented; use Flink, Beam, or a streaming framework.
- Cluster-scale (multi-TB): duckdb is single-process. Use Snowflake, BigQuery, Clickhouse, or a distributed query engine for petabyte-scale.
- Pure DataFrame ergonomics: if everyone on the team prefers fluent APIs over SQL, polars is a better daily-driver.
- Vector / time-series specialist workloads: duckdb has good support but specialised stores (DuckDB + VSS extension vs Pinecone; TimescaleDB; ClickHouse) sometimes win on shape-specific operations.
- You need cross-region latency-bounded reads: DuckDB does not know about region affinity; a cloud warehouse with regional storage will do better.
See also#
- sections/python/duckdb — full API tutorial (SQL, pandas/polars interop, Arrow)
- sections/python/pandas — DataFrames that duckdb queries zero-copy
- sections/python/polars — Arrow-native DataFrame, perfect pair with duckdb
- sections/packages-pip/pip-pandas — package-level comparison
- sections/packages-pip/pip-polars — sibling Arrow-native analytics