beautifulsoup4#

What it is#

beautifulsoup4 (PyPI name; imported as bs4) is a Python library by Leonard Richardson for parsing HTML and XML into a navigable tree, then searching and mutating it. It does not fetch pages and does not execute JavaScript — pair it with requests or httpx to fetch, and with playwright or selenium when the page needs JS to render.

The library is a façade over one of three parser backends: the stdlib html.parser, the C-based lxml, or the spec-compliant html5lib. Picking the right backend matters more than picking BeautifulSoup itself.
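How much the backend matters shows up even on tiny malformed inputs. The following is a minimal sketch, assuming the fragment `<a></p>` as illustrative input; `parse_with` is a hypothetical helper, and any backend that is not installed is skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # stray closing tag, illustrative input only


def parse_with(parser: str, markup: str = BROKEN):
    """Return the serialised tree for one backend, or None if it is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# Same markup, one entry per backend
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, tree in results.items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

On this input html.parser simply drops the stray tag, while lxml and html5lib typically wrap the result in a full document and disagree on whether a `<p>` should be inferred. That is why the same scraper can behave differently once a different backend is picked.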

[!NOTE] The PyPI distribution is beautifulsoup4. The import name is bs4. The older BeautifulSoup (no 4) package is the abandoned 3.x line — do not install it.

Install#

pip install beautifulsoup4

Output: pip's usual progress lines; exits 0 on success. Installs bs4 plus soupsieve. No third-party parser is installed, so only the stdlib html.parser is available.

pip install beautifulsoup4 lxml

Output: installs the fast C-backed parser. Recommended for production scraping.

pip install beautifulsoup4 html5lib

Output: installs the spec-compliant Python parser. Slower but handles malformed HTML the way browsers do.

uv add beautifulsoup4 lxml

Output: added to pyproject.toml

poetry add beautifulsoup4 lxml

Output: updated lockfile + virtualenv install

Versioning & Python support#

Current line is 4.x — the 4 in the package name is the major version. Has been on 4.x since 2012.
Minor-release cadence is irregular — a few releases per year.
Recent releases support Python 3.7+; older Python versions are dropped gradually, so check the PyPI classifiers of the release you pin.
Loose semver — 4.13 (2025) shipped some find_all behaviour tweaks; check the changelog before upgrading on a production scraper.
The BeautifulSoup 3.x line is abandoned — do not install the BeautifulSoup (no-4) package.

Package metadata#

Maintainer: Leonard Richardson (leonardr) — original author, still primary maintainer
Project home: crummy.com/software/BeautifulSoup
Source: code.launchpad.net/beautifulsoup (Git on Launchpad; older guides mention Bazaar). Unusual: not GitHub-hosted
Docs: crummy.com/software/BeautifulSoup/bs4/doc
PyPI: pypi.org/project/beautifulsoup4
License: MIT
Governance: single maintainer, very long-running project
First released: 2004 (3.x line), 2012 (current 4.x line)
Downloads: tens of millions per month

Optional dependencies & extras#

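beautifulsoup4 declares lxml and html5lib extras (pip install "beautifulsoup4[lxml]"), which simply pull in the matching parser package; installing the parser as a separate package works just as well. The choice matters: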

Parser	Install	Speed	Lenient?	Notes
`html.parser`	stdlib	Slow	Moderate	Default fallback if no other parser is installed. Stricter on malformed HTML.
`lxml`	`pip install lxml`	Fast (C)	Yes	Recommended for production. Requires `libxml2` system libraries — wheels usually ship them, but exotic platforms can require `apt install libxml2-dev libxslt-dev` first.
`html5lib`	`pip install html5lib`	Slow (pure Python)	Most lenient	Parses the way modern browsers do — best for the worst-broken HTML.
`lxml-xml` / `xml`	`pip install lxml`	Fast	n/a	Use for actual XML, not HTML.

soupsieve is a hard dependency (pulled in automatically); it provides the CSS-selector engine for soup.select(...). It was adopted in BeautifulSoup 4.7 (released January 2019); before that, CSS selectors were partially supported in-tree.
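As an illustration of what soupsieve adds to select(), here is a small sketch on made-up markup. The HTML and the variable names are assumptions for the example only.

```python
from bs4 import BeautifulSoup

HTML = """
<ul>
  <li><a href="/a">Linked</a></li>
  <li>Plain text</li>
  <li class="ext"><a href="https://example.com/b">External</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# :has() keeps only the <li> elements with a direct <a> child
linked_items = soup.select("li:has(> a)")

# Attribute selectors and :not() are handled by soupsieve as well
internal_links = soup.select('a[href]:not([href^="http"])')

print([li.get_text(strip=True) for li in linked_items])
print([a["href"] for a in internal_links])
```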

Alternatives#

Package	Trade-off
`lxml` (direct)	Use `lxml.html` directly when you need raw speed and don’t need BeautifulSoup’s API. ~2× faster on large documents.
`selectolax`	Modern, very fast C-backed HTML parser. CSS-selector first, no tree-mutation API. Use for read-only scraping at scale.
`parsel`	The Scrapy team’s selector library. Wraps `lxml` with CSS + XPath. Use inside Scrapy pipelines.
`pyquery`	jQuery-style API over `lxml`. Fading; pick `parsel` or `selectolax` instead.
`html5lib` (direct)	Spec-compliant tokeniser. Slower; use only when you need exact browser behaviour.
`playwright` / `selenium`	For JS-rendered pages — fetch the rendered HTML and then feed it to BeautifulSoup.

Common gotchas#

pip install BeautifulSoup (no 4) installs the abandoned 3.x line. The correct package name is beautifulsoup4 and the correct import is from bs4 import BeautifulSoup. The wrong package still resolves on PyPI but hasn’t shipped a release in years.
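No parser specified → warning plus an environment-dependent parser. BeautifulSoup(html) emits GuessedAtParserWarning and picks the best parser that happens to be installed (lxml if present, then html5lib, then the stdlib html.parser), so the same code can build different trees on different machines. Always pass features="lxml" (or another parser) explicitly to get deterministic behaviour across environments.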
lxml requires libxml2 / libxslt. Wheels ship for common platforms (x86_64/arm64 Linux, macOS, Windows). On Alpine (musl), some BSDs, or older ARM platforms you fall back to source and need the system libraries pre-installed.
html.parser is stricter than html5lib. Malformed HTML that browsers render fine may parse differently — closing tags may be inserted at unexpected points, missing tags may not be inferred. If your scraper works in a browser but not in BeautifulSoup, try features="html5lib".
.find_all returns a list, .select returns a list, .find returns first match (or None). Forgetting the None case crashes scrapers on the one page where the element is missing. Always guard or use .select_one() + if.
Pickling a parsed tree doesn’t round-trip cleanly. The tree holds back-references to the parser. Serialise with str(soup) and re-parse instead.
The SoupSieve CSS engine was added in 4.7 (2019). Code targeting older BS4 that uses select() for complex selectors may behave differently; pin beautifulsoup4>=4.7 if you rely on :has() or pseudo-classes.
Source is on Launchpad, not GitHub. Filing issues requires a Launchpad account — not the usual GitHub Issues flow. PRs are accepted via email patches or Launchpad merge proposals.
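The return-type gotcha above is easy to see on a toy document. This is a minimal sketch; `text_or_default` is a hypothetical helper, not part of bs4.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p class='intro'>Hello</p></div>", "html.parser")

missing_one = soup.find("span")          # single-result lookups return None on a miss
missing_sel = soup.select_one("span.x")  # same for the CSS variant
missing_all = soup.find_all("span")      # multi-result lookups return an empty list-like result
missing_css = soup.select("span.x")      # same for the CSS variant


def text_or_default(tag, default=""):
    """Hypothetical helper: text of a tag, or a default when the lookup missed."""
    return tag.get_text(strip=True) if tag is not None else default


print(text_or_default(soup.select_one("p.intro")))
print(repr(text_or_default(missing_one)))
```

Looping over an empty result is harmless, so the guard is only needed on the single-result calls.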

Real-world recipes#

Paginated scraping with rate limiting#

import time
import httpx
from bs4 import BeautifulSoup

BASE = "https://example.com/articles"

def scrape_listing(page: int) -> list[dict]:
    r = httpx.get(BASE, params={"page": page}, timeout=10.0)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    return [
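        {
            "title": a.get_text(strip=True),
            "url": a["href"],
            # .date can be missing too, so guard it like the link
            "date": date.get_text(strip=True) if date is not None else None,
        }
        for item in soup.select("article.post")
        for a in [item.select_one("h2 > a")]
        for date in [item.select_one(".date")]
        if a is not None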
    ]

results = []
for page in range(1, 11):
    results.extend(scrape_listing(page))
    time.sleep(1.0)  # respect rate limit

The for a in [item.select_one("h2 > a")] clause binds the lookup to a name inside the comprehension so that it can be guarded: select_one returns None when the selector matches nothing, and a["href"] on None raises TypeError and crashes the scraper. Always guard.

Structured data extraction (JSON-LD)#

import json
from bs4 import BeautifulSoup

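soup = BeautifulSoup(html, "lxml")  # html: the page source as a str
for script in soup.find_all("script", type="application/ld+json"):
    try:
        # script.string is None for an empty tag; "" makes json.loads raise JSONDecodeError
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    # A block may hold a single object or a list of objects
    for entry in data if isinstance(data, list) else [data]:
        if isinstance(entry, dict) and entry.get("@type") == "Article":
            print(entry.get("headline"), entry.get("author"))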

JSON-LD blocks are the cleanest data source on most modern sites — they sidestep DOM scraping entirely. Parse the wrapping <script> text as JSON; the schema follows schema.org conventions.

Sitemap parsing#

with open("sitemap.xml", "rb") as f:  # bytes let the parser honour the XML encoding declaration
    soup = BeautifulSoup(f, "xml")  # note: "xml" parser
urls = [loc.text for loc in soup.find_all("loc")]

Pass "xml" to BeautifulSoup() to use lxml’s XML mode (preserves case-sensitive tag names, doesn’t treat tags as HTML). "lxml" would lowercase tag names and treat <loc> as HTML.

Modifying and writing back#

for img in soup.find_all("img", src=lambda v: v and v.startswith("http://")):
    img["src"] = img["src"].replace("http://", "https://")

html_out = str(soup)

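soup.find_all accepts callables as attribute filters; the callable receives the attribute value, which can be None when the attribute is absent, hence the v and guard. After mutation, str(soup) serialises the tree back to HTML without adding whitespace (prettify() is the method that re-indents). Use soup.encode("utf-8") when you need bytes rather than a str.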
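A self-contained variant of the same rewrite, as a sketch on made-up markup. The HTML and the `is_insecure` filter are assumptions for illustration.

```python
from bs4 import BeautifulSoup

HTML = (
    '<p><img src="http://example.com/a.png" alt="a">'
    '<img src="https://example.com/b.png" alt="b">'
    '<img alt="no src"></p>'
)
soup = BeautifulSoup(HTML, "html.parser")


def is_insecure(value):
    # value may be None for the <img> that has no src attribute
    return value is not None and value.startswith("http://")


for img in soup.find_all("img", src=is_insecure):
    # Swap only the scheme prefix, leaving the rest of the URL untouched
    img["src"] = "https://" + img["src"][len("http://"):]

html_out = str(soup)               # str form of the mutated tree
bytes_out = soup.encode("utf-8")   # same tree serialised as bytes
srcs = [img.get("src") for img in soup.find_all("img")]
print(srcs)
```

Slicing off the prefix instead of calling replace() avoids touching an "http://" that appears later in the URL, for example inside a query string.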

Pairing with `httpx.AsyncClient`#

import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch_and_parse(client, url):
    r = await client.get(url)
    return BeautifulSoup(r.text, "lxml")

async def main(urls):
    async with httpx.AsyncClient(timeout=10) as client:
        soups = await asyncio.gather(*(fetch_and_parse(client, u) for u in urls))
    for soup in soups:
        print(soup.title.string if soup.title else "(no title)")

BeautifulSoup is synchronous and CPU-bound, so each parse blocks the event loop while it runs. For async scraping, gather the fetches concurrently, then parse sequentially or hand each parse to a worker thread with asyncio.to_thread.
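The thread-pool option can be sketched as follows. The in-memory PAGES list stands in for fetched response bodies, and `parse_title` / `parse_all` are hypothetical names; asyncio.to_thread needs Python 3.9+.

```python
import asyncio
from bs4 import BeautifulSoup

# Stand-ins for response bodies; a real scraper would get these from client.get(...)
PAGES = [
    "<html><head><title>One</title></head><body></body></html>",
    "<html><head><title>Two</title></head><body></body></html>",
    "<html><body>no title here</body></html>",
]


def parse_title(html: str) -> str:
    """Plain synchronous parse, suitable for running in a worker thread."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else "(no title)"


async def parse_all(pages: list[str]) -> list[str]:
    # Each parse runs in the default thread pool instead of on the event loop
    return await asyncio.gather(*(asyncio.to_thread(parse_title, p) for p in pages))


titles = asyncio.run(parse_all(PAGES))
print(titles)
```

This mainly keeps the event loop responsive while documents are parsed; it should not be expected to make pure-Python parsing itself faster.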

Performance tuning#

Once pages are fetched, parsing dominates the CPU cost of scraping. Levers:

Pick lxml over html.parser. ~5× faster on typical HTML. The C extension is the single biggest performance win.
html5lib is the slowest. Use only when you need exact browser behaviour on broken HTML.
SoupStrainer parses only matching subtrees. Pass parse_only=SoupStrainer("article") to BeautifulSoup() and the parser builds only the matching parts of the tree. Major savings on large pages where you need a small slice. It has no effect with the html5lib backend, which always builds the full tree.
select() (CSS) vs find_all(). select uses soupsieve which is generally faster for complex queries. find_all is faster for simple name="tag" lookups.
Don’t re-parse. Parsing is expensive; reuse the soup across many queries on the same document.
features="lxml-xml" for XML — XML parsing is faster than HTML because the rules are simpler.
Stream parsing isn’t supported. BeautifulSoup loads the full document. For huge documents (>50 MB), use lxml.etree.iterparse directly.
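The SoupStrainer lever from the list above, as a minimal sketch on made-up markup (the HTML and variable names are illustrative):

```python
from bs4 import BeautifulSoup, SoupStrainer

HTML = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <article><h2>First</h2><p>Body one</p></article>
  <article><h2>Second</h2><p>Body two</p></article>
  <footer>Footer text</footer>
</body></html>
"""

only_articles = SoupStrainer("article")  # keep <article> subtrees only
soup = BeautifulSoup(HTML, "html.parser", parse_only=only_articles)

headings = [h2.get_text() for h2 in soup.find_all("h2")]
print(headings)
print(soup.find("nav"))  # the <nav> was never added to the tree
```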

For pure read-only scraping at scale, selectolax (Modest engine, C) outperforms BeautifulSoup by 3–10×, but lacks the mutation API.

Version migration guide#

beautifulsoup4 has been on 4.x since 2012; bumps are small and infrequent. Notable:

4.13 (2025)

find_all behaviour tweaks around attribute matching.
Better handling of namespace-prefixed tags in XML mode.

4.12 (2023)

Improved support for HTML5 spec edge cases.
soupsieve>=2.4 required (modern CSS-selector features like :has()).

4.7 (2019, historical)

Added soupsieve as a hard dep for select() CSS support.
Code targeting older BS4 may use a partial in-tree CSS selector implementation; behaviour diverges on complex selectors.

Upgrade pattern:

Pin beautifulsoup4>=4.13 in pyproject.toml.
soupsieve and lxml floor versions matter too — pin them defensively if you rely on advanced CSS or XML features.
Test against the exact HTML you scrape — minor changes in tag-inference behaviour can shift the tree shape for malformed inputs.

Plugin & rule ecosystem#

BeautifulSoup has no plugin API — extension is by parser choice and soupsieve selectors:

Component	Role
`lxml`	Fast C-backed parser. Recommended for HTML and XML.
`html5lib`	Pure-Python spec-compliant parser. Slow but most lenient.
`html.parser` (stdlib)	Default fallback. Strictest.
`soupsieve`	CSS-selector engine (`select`, `select_one`). Pulled in automatically.
`cchardet` / `chardet`	Encoding detection. BS4 falls back to these for byte-input. `cchardet` is the C-based fast version.

The features= argument selects parser per parse. Common values: "lxml", "html5lib", "html.parser", "lxml-xml", "xml" (alias for lxml-xml).

Configuration & layout patterns#

BeautifulSoup itself has no global config. Conventions for keeping a scraper maintainable:

Always pass features= explicitly. Without it BS4 picks whichever parser happens to be installed, which makes behaviour env-dependent. BeautifulSoup(html, "lxml") is the safe form.
Wrap fetching + parsing in a function. Don’t open files or HTTP responses inline with parsing — separation makes each layer testable.
Selector constants at module top. ARTICLE_SEL = "article.post", TITLE_SEL = "h2 > a". Easier to update when site markup shifts.
Use select_one over find for new code. CSS selectors are more readable and composable than the find(name=, class_=, id=) keyword soup.
Guard None everywhere. select_one and find return None on miss. elem.get_text() if elem else "" is the canonical pattern.
Centralise encoding: pass from_encoding="utf-8" if you know the source encoding (it applies to bytes input only and is ignored for str); otherwise BS4 sniffs (using cchardet if available).
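Several of these conventions fit together in one small module. This is a sketch under stated assumptions: the selectors come from the examples above, while `PARSER`, `parse_articles`, and `SAMPLE` are hypothetical names, and html.parser is used only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

# Selectors live in one place so a markup change is a one-line fix
ARTICLE_SEL = "article.post"
TITLE_SEL = "h2 > a"
PARSER = "html.parser"  # swap for "lxml" when it is installed


def parse_articles(html: str) -> list[dict]:
    """Pure function: HTML in, plain dicts out. No network or file access."""
    soup = BeautifulSoup(html, PARSER)
    rows = []
    for item in soup.select(ARTICLE_SEL):
        link = item.select_one(TITLE_SEL)
        rows.append({
            # Guard the miss case instead of dereferencing None
            "title": link.get_text(strip=True) if link else "",
            "url": link.get("href", "") if link else "",
        })
    return rows


SAMPLE = """
<article class="post"><h2><a href="/one">One</a></h2></article>
<article class="post"><h2>No link</h2></article>
"""
articles = parse_articles(SAMPLE)
print(articles)
```

Because `parse_articles` takes a string, it can be tested against saved HTML without touching the fetch layer.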

Troubleshooting common errors#

Symptom	Cause	Fix
`FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml`	`lxml` not installed	`pip install lxml`.
`UserWarning: No parser was explicitly specified`	No `features` argument, so BS4 guessed a parser from what is installed	Pass `features="lxml"` (or another) explicitly.
`AttributeError: 'NoneType' object has no attribute 'get_text'`	`find` / `select_one` returned `None`	Guard with `if elem:`.
`select()` returns nothing for a selector that looks correct	Selector mismatch on case, namespace, or whitespace	Inspect with `print(soup.prettify()[:500])` to verify the tree. Try a simpler selector first.
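Encoding shows as mojibake	Source page mis-declared its encoding, or the bytes were decoded with the wrong codec	Pass the raw bytes (`r.content`) together with `from_encoding="utf-8"`; `from_encoding` is ignored for `str` input. If the declared encoding is correct, passing `r.text` (decoded by httpx/requests) also works.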
XML parsing treats tags as case-insensitive	Used `features="lxml"` instead of `features="lxml-xml"`	Switch to `"lxml-xml"` or `"xml"`.
`select(":has(...)")` doesn’t work	Old `soupsieve` version	Upgrade to `soupsieve>=2.4`.
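Serialised output has extra whitespace	`prettify()` adds newlines and indentation; `str(soup)` does not	Serialise with `str(soup)` or `soup.decode()` instead of `prettify()`. The `formatter=` argument of `decode()` / `encode()` controls entity escaping, not whitespace.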
Tag found in browser but not in BS4	JavaScript rendered the element after page load	BS4 can’t execute JS — fetch via Playwright/Selenium first, then parse the rendered HTML.
`RecursionError` on deep trees	Python’s default recursion limit	`sys.setrecursionlimit(5000)` (cautiously); or use `lxml.etree` directly which is iterative.

The prettify() method is the diagnostic — print(soup.prettify()) shows the parsed tree exactly as BS4 sees it, surfacing missing elements or unexpected nesting from a malformed input.
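For the mojibake row in particular, the fix hinges on handing BS4 bytes rather than an already-decoded string. Here is a small sketch, assuming a Latin-1 page as a stand-in for `r.content`; the sample text is made up.

```python
from bs4 import BeautifulSoup

# "café" encoded as Latin-1; stands in for r.content from a mis-declared page
raw = "<p>café</p>".encode("latin-1")

# Symptom: decoding with the wrong codec yields replacement characters
wrong = raw.decode("utf-8", errors="replace")

# Fix: pass the bytes and name the real encoding
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
text = soup.p.get_text()
print(repr(wrong), text, soup.original_encoding)
```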

Ecosystem integrations#

BeautifulSoup is the parsing layer. The fetch and (sometimes) render layers below it have many options:

requests — synchronous HTTP, the original pairing. Mature, stable.
httpx — modern, both sync and async, HTTP/2 support. The preferred choice in 2026.
aiohttp — async-only HTTP client. Use when the whole stack is async.
playwright / selenium — for JS-rendered pages. Fetch the rendered DOM, feed page.content() to BeautifulSoup.
scrapy — full crawling framework. Uses parsel (similar API) instead of BeautifulSoup. Choose Scrapy for thousands of pages; BS4 for one-off or small jobs.
pandas.read_html() parses every <table> on a page into a DataFrame. It uses lxml by default and can fall back to BS4 + html5lib (flavor="bs4"). Useful for table extraction.
mechanicalsoup — session-aware browser-like wrapper. Less popular; consider httpx.AsyncClient(follow_redirects=True) instead.
bleach is an HTML sanitiser for untrusted input. It is built on html5lib, not BS4, and was deprecated by its maintainers in 2023, so check its status before adopting it.

CI integration#

Scrapers in CI typically run against recorded fixtures, not the live target. Patterns:

Record-replay with pytest-vcr, responses, or pytest-recording. Capture real responses once, commit to repo, replay in CI.
Static fixtures — check in HTML files under tests/fixtures/ and parse them in tests. Cheaper than VCR but stale faster.
Live-fetch nightly — a scheduled job hits the live target, surfaces breakage early. Separate from per-commit CI to avoid coupling PR-merge to upstream uptime.

name: scraper-tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[test]"
      - run: pytest tests/  # uses VCR cassettes, no live calls

For live-target verification on a cron:

on:
  schedule:
    - cron: "0 2 * * *"  # 02:00 UTC daily

Failures here open an issue (actions-ecosystem/action-create-issue) rather than blocking the build.

When NOT to use this#

BeautifulSoup is excellent for HTML scraping but a wrong fit when:

Pure XML processing. Use lxml.etree directly — faster and more XPath-friendly. BS4’s XML mode wraps lxml and adds overhead.
JS-rendered pages. BS4 can’t execute JavaScript. Fetch with Playwright/Selenium first; then pass the rendered HTML to BS4 if you want its ergonomics.
Very large documents (50+ MB). BS4 loads the whole tree. Use lxml.etree.iterparse for streaming.
Pure CSS-selector reads at scale. selectolax is 3–10× faster than BS4 for read-only queries. BS4’s value is the mutation API and the consistent abstraction over multiple parsers.
Inside a Scrapy pipeline. Scrapy ships with parsel, which is the same idea. Don’t mix.
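For the large-document case, streaming looks roughly like this. The sketch uses the stdlib xml.etree.ElementTree.iterparse as a stand-in so it runs without lxml; lxml.etree.iterparse is used in a broadly similar way. `iter_locs` and the sample sitemap are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Sitemap elements live in this namespace, so tags carry it as a prefix
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_locs(source):
    """Yield each <loc> URL as soon as its element has been fully read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "loc":
            yield (elem.text or "").strip()
        elif elem.tag == NS + "url":
            elem.clear()  # drop the contents of the finished <url> subtree


SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

locs = list(iter_locs(io.BytesIO(SITEMAP)))
print(locs)
```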
