skip to content

polars — Fast DataFrames

High-performance DataFrames with a lazy expression API. Covers read/write, select, filter, group_by, joins, LazyFrame, datetime, string operations, and pandas interop.

9 min read 29 snippets deep dive

polars — Fast DataFrames#

What it is#

Polars is a DataFrame library written in Rust that exposes a Python API. It uses Apache Arrow as its memory layout, processes data column-by-column with SIMD instructions, and executes lazy query plans in parallel across all CPU cores. Polars is significantly faster than pandas on most workloads and handles datasets that exceed RAM through streaming. It is the recommended alternative to pandas when performance or memory is a bottleneck.

Install#

pip install polars

Output: (none — exits 0 on success)

For Parquet, Excel, and timezone support:

pip install polars[all]

Output: (none — exits 0 on success)

Quick example#

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["Eng", "Eng", "HR", "HR"],
    "score": [92, 78, 85, 91],
})
print(df.filter(pl.col("score") > 80))

Output:

shape: (3, 3)
┌───────┬──────┬───────┐
│ name  ┆ dept ┆ score │
│ ---   ┆ ---  ┆ ---   │
│ str   ┆ str  ┆ i64   │
╞═══════╪══════╪═══════╡
│ Alice ┆ Eng  ┆ 92    │
│ Carol ┆ HR   ┆ 85    │
│ Dave  ┆ HR   ┆ 91    │
└───────┴──────┴───────┘

When / why to use it#

  • Processing CSV, Parquet, or JSON files faster than pandas with a single-line change.
  • Datasets that are too large for pandas (polars streams through them without loading into RAM with streaming=True).
  • Multi-core machines where pandas’ GIL-constrained single-threaded ops leave cores idle.
  • Pipelines where lazy evaluation (LazyFrame) lets polars optimise the entire query plan before executing it.
  • When you want to avoid pandas’ inconsistent mutation semantics (SettingWithCopyWarning).

Common pitfalls#

[!WARNING] Expressions, not index-based mutation — polars has no .iloc or .loc. All transformations return new DataFrames via expressions. df["col"] = value raises TypeError; use df.with_columns(pl.lit(value).alias("col")) instead.

[!WARNING] collect() must be called on LazyFrames — a LazyFrame is a query plan. Nothing executes until you call .collect(). Forgetting it returns a LazyFrame object, not a DataFrame.

[!WARNING] Column names are case-sensitivepl.col("Score") and pl.col("score") are different. Unlike pandas, polars never coerces column names.

[!TIP] Use df.glimpse() during exploration — it prints one row per column with dtype, much easier to read than df.head() on wide DataFrames.

[!TIP] pl.col("*") selects all columns. pl.col(pl.Float64, pl.Int64) selects all numeric columns by dtype — useful in with_columns for type-based transforms.

Reading and writing files#

Polars has symmetric read_* / write_* and scan_* functions for all major formats. The scan_* family returns a LazyFrame and only reads data that the query plan actually needs, making it ideal for large files.

import polars as pl

# CSV
df = pl.read_csv("data.csv")
df.write_csv("output.csv")

# Parquet (fastest format for polars pipelines)
df = pl.read_parquet("data.parquet")
df.write_parquet("output.parquet")

# JSON (newline-delimited)
df = pl.read_ndjson("records.jsonl")

# Lazy scan — reads only needed columns/rows
lf = pl.scan_csv("large.csv")
result = lf.filter(pl.col("status") == "active").select("id", "name").collect()

select, filter, with_columns#

The core expression-based API. select picks and transforms columns, returning a new DataFrame with only those columns. filter keeps rows matching a boolean expression. with_columns appends or replaces columns while keeping all originals.

import polars as pl

df = pl.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [10.0, 25.0, 8.0, 40.0],
    "qty": [100, 50, 200, 30],
})

# select — compute revenue, keep only two columns
print(df.select(
    pl.col("product"),
    (pl.col("price") * pl.col("qty")).alias("revenue"),
))

Output:

shape: (4, 2)
┌─────────┬─────────┐
│ product ┆ revenue │
│ ---     ┆ ---     │
│ str     ┆ f64     │
╞═════════╪═════════╡
│ A       ┆ 1000.0  │
│ B       ┆ 1250.0  │
│ C       ┆ 1600.0  │
│ D       ┆ 1200.0  │
└─────────┴─────────┘
# filter + with_columns
result = (
    df.filter(pl.col("price") < 30)
    .with_columns(
        (pl.col("price") * pl.col("qty")).alias("revenue"),
        pl.col("product").str.to_lowercase().alias("product_lower"),
    )
)
print(result)

Output:

shape: (3, 5)
┌─────────┬───────┬─────┬─────────┬───────────────┐
│ product ┆ price ┆ qty ┆ revenue ┆ product_lower │
│ ---     ┆ ---   ┆ --- ┆ ---     ┆ ---           │
│ str     ┆ f64   ┆ i64 ┆ f64     ┆ str           │
╞═════════╪═══════╪═════╪═════════╪═══════════════╡
│ A       ┆ 10.0  ┆ 100 ┆ 1000.0  ┆ a             │
│ B       ┆ 25.0  ┆ 50  ┆ 1250.0  ┆ b             │
│ C       ┆ 8.0   ┆ 200 ┆ 1600.0  ┆ c             │
└─────────┴───────┴─────┴─────────┴───────────────┘

group_by and aggregation#

group_by splits the DataFrame by one or more key columns and agg computes one or more aggregation expressions per group. Multiple aggregations are specified in a single agg(...) call.

import polars as pl

sales = pl.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 200, 150, 250, 120],
})

summary = (
    sales.group_by("region", "product")
    .agg(
        pl.col("amount").sum().alias("total"),
        pl.col("amount").count().alias("orders"),
        pl.col("amount").mean().alias("avg"),
    )
    .sort("region", "product")
)
print(summary)

Output:

shape: (4, 5)
┌────────┬─────────┬───────┬────────┬───────┐
│ region ┆ product ┆ total ┆ orders ┆ avg   │
│ ---    ┆ ---     ┆ ---   ┆ ---    ┆ ---   │
│ str    ┆ str     ┆ i64   ┆ u32    ┆ f64   │
╞════════╪═════════╪═══════╪════════╪═══════╡
│ East   ┆ A       ┆ 220   ┆ 2      ┆ 110.0 │
│ East   ┆ B       ┆ 200   ┆ 1      ┆ 200.0 │
│ West   ┆ A       ┆ 150   ┆ 1      ┆ 150.0 │
│ West   ┆ B       ┆ 250   ┆ 1      ┆ 250.0 │
└────────┴─────────┴───────┴────────┴───────┘

Joins#

Polars supports the standard join strategies. All joins are expressed as method calls on the left DataFrame with the right DataFrame as the first argument.

import polars as pl

employees = pl.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries  = pl.DataFrame({"id": [1, 2, 4], "salary": [75000, 82000, 91000]})

# inner join (default)
print(employees.join(salaries, on="id", how="inner"))

Output:

shape: (2, 3)
┌─────┬───────┬────────┐
│ id  ┆ name  ┆ salary │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆ Alice ┆ 75000  │
│ 2   ┆ Bob   ┆ 82000  │
└─────┴───────┴────────┘
# left join — keep all employees
print(employees.join(salaries, on="id", how="left"))

Output:

shape: (3, 3)
┌─────┬───────┬────────┐
│ id  ┆ name  ┆ salary │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆ Alice ┆ 75000  │
│ 2   ┆ Bob   ┆ 82000  │
│ 3   ┆ Carol ┆ null   │
└─────┴───────┴────────┘
# anti join — employees with no salary record
print(employees.join(salaries, on="id", how="anti"))

Output:

shape: (1, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 3   ┆ Carol │
└─────┴───────┘

LazyFrame — deferred execution and query optimisation#

A LazyFrame records transformation steps without executing them. When you call .collect(), polars compiles all steps into an optimised physical plan — it pushes filters early (predicate pushdown), prunes unused columns (projection pushdown), and parallelises independent branches. For large files, use scan_csv / scan_parquet instead of read_* so the optimizer can skip reading columns you don’t need.

import polars as pl

# scan_parquet never reads the file until collect()
lf = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
)

# Inspect the optimised plan before running
print(lf.explain())

# Execute
result = lf.collect()
print(result)

For datasets larger than RAM, add streaming=True to collect():

result = lf.collect(streaming=True)

Datetime expressions#

Polars stores timestamps as Datetime (not Python datetime) and exposes time operations through the .dt namespace on expressions.

import polars as pl

df = pl.DataFrame({
    "ts": ["2026-01-01", "2026-03-15", "2026-06-30"],
    "value": [10, 25, 18],
}).with_columns(pl.col("ts").str.to_date().alias("date"))

print(
    df.with_columns(
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.day_of_week().alias("dow"),  # 0=Monday
    )
)

Output:

shape: (3, 5)
┌────────────┬───────┬──────┬───────┬─────┐
│ date       ┆ value ┆ year ┆ month ┆ dow │
│ ---        ┆ ---   ┆ ---  ┆ ---   ┆ --- │
│ date       ┆ i64   ┆ i32  ┆ i8    ┆ i8  │
╞════════════╪═══════╪══════╪═══════╪═════╡
│ 2026-01-01 ┆ 10    ┆ 2026 ┆ 1     ┆ 3   │
│ 2026-03-15 ┆ 25    ┆ 2026 ┆ 3     ┆ 6   │
│ 2026-06-30 ┆ 18    ┆ 2026 ┆ 6     ┆ 1   │
└────────────┴───────┴──────┴───────┴─────┘

Group by calendar period with group_by_dynamic:

df2 = pl.DataFrame({
    "date": pl.date_range(pl.date(2026, 1, 1), pl.date(2026, 3, 31), "1d", eager=True),
    "sales": range(90),
})

monthly = df2.group_by_dynamic("date", every="1mo").agg(pl.col("sales").sum())
print(monthly.head(3))

Output:

shape: (3, 2)
┌────────────┬───────┐
│ date       ┆ sales │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2026-01-01 ┆ 465   │
│ 2026-02-01 ┆ 1246  │
│ 2026-03-01 ┆ 2015  │
└────────────┴───────┘

String namespace#

The .str namespace mirrors pandas’ .str accessor but is expressed as pl.col("col").str.<method>() inside with_columns or select.

import polars as pl

df = pl.DataFrame({"raw": ["  Hello World  ", "foo_bar", "CAPS LOCK"]})
print(
    df.with_columns(
        pl.col("raw").str.strip_chars().alias("stripped"),
        pl.col("raw").str.to_lowercase().alias("lower"),
        pl.col("raw").str.replace("_", " ").alias("underscores_removed"),
        pl.col("raw").str.contains("CAPS").alias("has_caps"),
        pl.col("raw").str.len_chars().alias("length"),
    )
)

Output:

shape: (3, 6)
┌──────────────────┬───────────────┬───────────┬─────────────────────┬──────────┬────────┐
│ raw              ┆ stripped      ┆ lower     ┆ underscores_removed ┆ has_caps ┆ length │
│ ---              ┆ ---           ┆ ---       ┆ ---                 ┆ ---      ┆ ---    │
│ str              ┆ str           ┆ str       ┆ str                 ┆ bool     ┆ u32    │
╞══════════════════╪═══════════════╪═══════════╪═════════════════════╪══════════╪════════╡
│   Hello World    ┆ Hello World   ┆   hello … ┆   Hello World       ┆ false    ┆ 15     │
│ foo_bar          ┆ foo_bar       ┆ foo_bar   ┆ foo bar             ┆ false    ┆ 7      │
│ CAPS LOCK        ┆ CAPS LOCK     ┆ caps lock ┆ CAPS LOCK           ┆ true     ┆ 9      │
└──────────────────┴───────────────┴───────────┴─────────────────────┴──────────┴────────┘

List namespace#

When a column holds lists (pl.List dtype), the .list namespace provides vectorised operations without exploding the rows.

import polars as pl

df = pl.DataFrame({"scores": [[90, 85, 92], [78, 80], [95, 88, 76, 91]]})
print(
    df.with_columns(
        pl.col("scores").list.len().alias("count"),
        pl.col("scores").list.mean().alias("avg"),
        pl.col("scores").list.max().alias("max"),
    )
)

Output:

shape: (3, 4)
┌────────────────┬───────┬──────┬─────┐
│ scores         ┆ count ┆ avg  ┆ max │
│ ---            ┆ ---   ┆ ---  ┆ --- │
│ list[i64]      ┆ u32   ┆ f64  ┆ i64 │
╞════════════════╪═══════╪══════╪═════╡
│ [90, 85, 92]   ┆ 3     ┆ 89.0 ┆ 92  │
│ [78, 80]       ┆ 2     ┆ 79.0 ┆ 80  │
│ [95, 88, 76, … ┆ 4     ┆ 87.5 ┆ 95  │
└────────────────┴───────┴──────┴─────┘

Interop with pandas and PyArrow#

to_pandas() converts a polars DataFrame to pandas. from_pandas() goes the other way. Both paths use the shared Arrow memory buffer when possible, making conversion near-zero-copy.

import polars as pl
import pandas as pd

df_pl = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

df_pd = df_pl.to_pandas()
print(type(df_pd), df_pd.dtypes.to_dict())

df_pl2 = pl.from_pandas(df_pd)
print(df_pl2.dtypes)

Output:

<class 'pandas.core.frame.DataFrame'> {'a': dtype('int64'), 'b': dtype('float64')}
[Int64, Float64]

Quick reference#

TaskCode
Filter rowsdf.filter(pl.col("x") > 5)
Select columnsdf.select("a", "b")
Add/replace columndf.with_columns(expr.alias("name"))
Renamedf.rename({"old": "new"})
Sortdf.sort("col", descending=True)
Drop nullsdf.drop_nulls(subset=["col"])
Fill nullsdf.with_columns(pl.col("c").fill_null(0))
Group + aggdf.group_by("k").agg(pl.col("v").sum())
Inner joinleft.join(right, on="id", how="inner")
Lazy planpl.scan_csv("f.csv").filter(...).collect()
Streaming.collect(streaming=True)
To pandasdf.to_pandas()
From pandaspl.from_pandas(df_pd)
Schemadf.schema
Shapedf.shape
Glimpsedf.glimpse()