beautifulsoup4#
What it is#
beautifulsoup4 (PyPI name; imported as bs4) is a Python library by Leonard Richardson for parsing HTML and XML into a navigable tree, then searching and mutating it. It does not fetch pages and does not execute JavaScript — pair it with requests or httpx to fetch, and with playwright or selenium when the page needs JS to render.
The library is a façade over one of three parser backends: the stdlib html.parser, the C-based lxml, or the spec-compliant html5lib. Picking the right backend matters more than picking BeautifulSoup itself.
[!NOTE] The PyPI distribution is beautifulsoup4. The import name is bs4. The older BeautifulSoup (no 4) package is the abandoned 3.x line; do not install it.
Install#
pip install beautifulsoup4
Output: (none — exits 0 on success). Installs bs4 + soupsieve. No parser — falls back to stdlib html.parser.
pip install beautifulsoup4 lxml
Output: installs the fast C-backed parser. Recommended for production scraping.
pip install beautifulsoup4 html5lib
Output: installs the spec-compliant Python parser. Slower but handles malformed HTML the way browsers do.
uv add beautifulsoup4 lxml
Output: added to pyproject.toml
poetry add beautifulsoup4 lxml
Output: updated lockfile + virtualenv install
Versioning & Python support#
- Current line is 4.x: the 4 in the package name is the major version. It has been on 4.x since 2012.
- Minor-release cadence is irregular: a few releases per year.
- Recent releases support Python 3.7+; the project drops roughly one Python minor version per minor release.
- Loose semver: 4.13 (2025) shipped some find_all behaviour tweaks; check the changelog before upgrading on a production scraper.
- The BeautifulSoup 3.x line is abandoned; do not install the BeautifulSoup (no-4) package.
Package metadata#
- Maintainer: Leonard Richardson (leonardr), original author, still primary maintainer
- Project home: crummy.com/software/BeautifulSoup
- Source: code.launchpad.net/beautifulsoup (Bazaar) — unusual; not GitHub-hosted
- Docs: crummy.com/software/BeautifulSoup/bs4/doc
- PyPI: pypi.org/project/beautifulsoup4
- License: MIT
- Governance: single maintainer, very long-running project
- First released: 2004 (3.x line), 2012 (current 4.x line)
- Downloads: tens of millions per month
Optional dependencies & extras#
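beautifulsoup4 publishes lxml and html5lib extras (for example pip install "beautifulsoup4[lxml]"), but each one only pulls in the parser package of the same name, so installing the backend as a separate package is equivalent. The choice of backend matters: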
| Parser | Install | Speed | Lenient? | Notes |
|---|---|---|---|---|
| html.parser | stdlib | Slow | Moderate | Default fallback if no other parser is installed. Stricter on malformed HTML. |
| lxml | pip install lxml | Fast (C) | Yes | Recommended for production. Requires libxml2 system libraries; wheels usually ship them, but exotic platforms can require apt install libxml2-dev libxslt-dev first. |
| html5lib | pip install html5lib | Slow (pure Python) | Most lenient | Parses the way modern browsers do; best for the worst-broken HTML. |
| lxml-xml / xml | pip install lxml | Fast | n/a | Use for actual XML, not HTML. |
soupsieve is a hard dependency (pulled in automatically); it provides the CSS-selector engine behind soup.select(...) and soup.select_one(...). Soup Sieve became the selector engine in BeautifulSoup 4.7 (released January 2019); before that, only a limited subset of CSS selectors was supported in-tree.
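As a rough illustration of what the selector engine offers, the sketch below runs select(), select_one(), and a :has() query against a small document. The markup and variable names are invented for the example, and html.parser is used only so that no extra backend is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
HTML = """
<ul id="menu">
  <li class="item"><a href="/a">Alpha</a></li>
  <li class="item">Beta (no link)</li>
  <li class="item featured"><a href="/c">Gamma</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() returns every match in document order.
all_items = [li.get_text(strip=True) for li in soup.select("#menu > li.item")]

# :has() keeps only the <li> elements that have a direct <a> child.
linked = [li.a["href"] for li in soup.select("li:has(> a)")]

# select_one() returns the first match, or None when nothing matches.
featured = soup.select_one("li.featured > a")
missing = soup.select_one("li.archived")
```

Here missing is None, which is why callers should check the result of select_one before dereferencing it.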
Alternatives#
| Package | Trade-off |
|---|---|
| lxml (direct) | Use lxml.html directly when you need raw speed and don’t need BeautifulSoup’s API. ~2× faster on large documents. |
| selectolax | Modern, very fast C-backed HTML parser. CSS-selector first, no tree-mutation API. Use for read-only scraping at scale. |
| parsel | The Scrapy team’s selector library. Wraps lxml with CSS + XPath. Use inside Scrapy pipelines. |
| pyquery | jQuery-style API over lxml. Fading; pick parsel or selectolax instead. |
| html5lib (direct) | Spec-compliant tokeniser. Slower; use only when you need exact browser behaviour. |
| playwright / selenium | For JS-rendered pages: fetch the rendered HTML and then feed it to BeautifulSoup. |
Common gotchas#
- pip install BeautifulSoup (no 4) installs the abandoned 3.x line. The correct package name is beautifulsoup4 and the correct import is from bs4 import BeautifulSoup. The wrong package still resolves on PyPI but hasn’t shipped a release in years.
- No parser specified → warning + environment-dependent parser. BeautifulSoup(html) emits a GuessedAtParserWarning and picks the best parser that happens to be installed (lxml, then html5lib, then the stdlib html.parser). Always pass features="lxml" (or another parser name) explicitly to get deterministic behaviour across environments.
- lxml requires libxml2/libxslt. Wheels ship for common platforms (x86_64/arm64 Linux, macOS, Windows). On Alpine (musl), some BSDs, or older ARM platforms you fall back to a source build and need the system libraries pre-installed.
- html.parser is stricter than html5lib. Malformed HTML that browsers render fine may parse differently: closing tags may be inserted at unexpected points, and missing tags may not be inferred. If your scraper works in a browser but not in BeautifulSoup, try features="html5lib".
- .find_all returns a list, .select returns a list, .find returns the first match (or None). Forgetting the None case crashes scrapers on the one page where the element is missing. Always guard, or use .select_one() followed by an if check.
- Pickling a parsed tree may not round-trip cleanly. The tree is deeply nested and holds references back to its parser. Serialise with str(soup) and re-parse instead.
- The SoupSieve CSS engine was added in 4.7 (2019). Code targeting older BS4 that uses select() for complex selectors may behave differently; pin beautifulsoup4>=4.7 if you rely on :has() or pseudo-classes.
- Source is on Launchpad, not GitHub. Filing issues requires a Launchpad account, not the usual GitHub Issues flow. Patches are accepted by email or as Launchpad merge proposals.
Real-world recipes#
Paginated scraping with rate limiting#
import time

import httpx
from bs4 import BeautifulSoup

BASE = "https://example.com/articles"


def scrape_listing(page: int) -> list[dict]:
    """Fetch one listing page and return one dict per post."""
    r = httpx.get(BASE, params={"page": page}, timeout=10.0)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    return [
        {
            "title": a.get_text(strip=True),
            "url": a["href"],
            "date": d.get_text(strip=True) if d is not None else None,
        }
        for item in soup.select("article.post")
        for a in [item.select_one("h2 > a")]  # bind the (possibly None) link
        for d in [item.select_one(".date")]   # bind the (possibly None) date
        if a is not None  # skip posts with no title link
    ]


results = []
for page in range(1, 11):
    results.extend(scrape_listing(page))
    time.sleep(1.0)  # respect rate limit
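The for a in [item.select_one("h2 > a")] idiom binds the result of select_one to a name inside the comprehension, so the if a is not None clause can filter on it. select_one returns None when the selector matches nothing, and evaluating a["href"] on None raises TypeError and crashes the scraper. Every optional child, including .date, needs the same guard.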
Structured data extraction (JSON-LD)#
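import json

from bs4 import BeautifulSoup

# `html` is the page source fetched earlier (str or bytes).
soup = BeautifulSoup(html, "lxml")
for script in soup.find_all("script", type="application/ld+json"):
    try:
        # script.string is None for an empty tag; "" fails cleanly below.
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue  # empty or malformed block
    # A block may hold a single object or a list of objects.
    for entry in data if isinstance(data, list) else [data]:
        if isinstance(entry, dict) and entry.get("@type") == "Article":
            print(entry.get("headline"), entry.get("author"))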
JSON-LD blocks are the cleanest data source on most modern sites — they sidestep DOM scraping entirely. Parse the wrapping <script> text as JSON; the schema follows schema.org conventions.
Sitemap parsing#
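from bs4 import BeautifulSoup
with open("sitemap.xml", "rb") as f:  # bytes let the parser honour the XML declaration
    soup = BeautifulSoup(f, "xml")  # note: "xml" parser
urls = [loc.get_text(strip=True) for loc in soup.find_all("loc")]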
Pass "xml" to BeautifulSoup() to use lxml’s XML mode (preserves case-sensitive tag names, doesn’t treat tags as HTML). "lxml" would lowercase tag names and treat <loc> as HTML.
Modifying and writing back#
# Upgrade the scheme of every <img> whose src starts with http://
for img in soup.find_all("img", src=lambda v: v and v.startswith("http://")):
    img["src"] = img["src"].replace("http://", "https://", 1)  # scheme only
html_out = str(soup)
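soup.find_all accepts callables as attribute filters; the callable receives the attribute value, or None when the attribute is absent. After mutation, str(soup) serialises the tree back to HTML. The output is not guaranteed to be byte-identical to the input, because entities, attribute quoting, and any markup the parser repaired are normalised; use soup.encode("utf-8") when you need bytes in a known encoding.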
Pairing with httpx.AsyncClient#
import asyncio

import httpx
from bs4 import BeautifulSoup


async def fetch_and_parse(client: httpx.AsyncClient, url: str) -> BeautifulSoup:
    r = await client.get(url)
    r.raise_for_status()
    return BeautifulSoup(r.text, "lxml")  # parsing runs synchronously here


async def main(urls: list[str]) -> None:
    async with httpx.AsyncClient(timeout=10) as client:
        soups = await asyncio.gather(*(fetch_and_parse(client, u) for u in urls))
    for soup in soups:
        print(soup.title.string if soup.title else "(no title)")


asyncio.run(main(["https://example.com/"]))
BeautifulSoup is sync — the parse happens after the fetch. For pure-async scraping, gather the fetches concurrently, then parse sequentially or in a thread pool (asyncio.to_thread).
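The thread-pool variant mentioned above can be sketched as follows. The function names and sample pages are invented for the example, the HTML strings stand in for response bodies that would normally come from the fetch step, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio

from bs4 import BeautifulSoup


def parse_title(html: str) -> str:
    """CPU-bound work: build the tree and pull out the <title> text."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else "(no title)"


async def parse_all(pages: list[str]) -> list[str]:
    """Parse each page in a worker thread so the event loop stays free."""
    return await asyncio.gather(*(asyncio.to_thread(parse_title, p) for p in pages))


# Stand-in documents; in a real scraper these would be response bodies.
PAGES = [
    "<html><head><title>First</title></head><body></body></html>",
    "<html><body><p>no title here</p></body></html>",
]
titles = asyncio.run(parse_all(PAGES))
```

Because of the GIL, threads mainly keep the event loop responsive while parsing runs; they should not be expected to make pure-Python parsing itself much faster.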
Performance tuning#
Parsing dominates the cost of scraping. Levers:
- Pick lxml over html.parser. Roughly 5× faster on typical HTML. The C extension is the single biggest performance win.
- html5lib is the slowest. Use only when you need exact browser behaviour on broken HTML.
- SoupStrainer parses only matching subtrees. Pass parse_only=SoupStrainer("article") to BeautifulSoup(); the parser builds only the matching parts of the tree. Major savings on large pages where you need a small slice. It is not supported by html5lib, which always builds the full tree.
- select() (CSS) vs find_all(). select uses soupsieve, which is generally faster for complex queries. find_all is faster for simple name="tag" lookups.
- Don’t re-parse. Parsing is expensive; reuse the soup across many queries on the same document.
- features="lxml-xml" for XML. XML parsing is faster than HTML because the rules are simpler.
- Stream parsing isn’t supported. BeautifulSoup loads the full document. For huge documents (>50 MB), use lxml.etree.iterparse directly.
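A minimal sketch of the SoupStrainer lever, assuming a page where only the article elements matter. The markup is invented for the example, and html.parser is used so that the snippet runs without extra backends.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: navigation and ads surround the content we care about.
HTML = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <article><h2>One</h2><p>First body</p></article>
  <aside>Ads</aside>
  <article><h2>Two</h2><p>Second body</p></article>
</body></html>
"""

# Only <article> elements (and everything inside them) are kept in the tree.
only_articles = SoupStrainer("article")
soup = BeautifulSoup(HTML, "html.parser", parse_only=only_articles)

headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
has_nav = soup.find("nav") is not None
```

The resulting soup has no nav or aside elements at all, so queries against it only ever see article content.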
For pure read-only scraping at scale, selectolax (Modest engine, C) outperforms BeautifulSoup by 3–10×, but lacks the mutation API.
Version migration guide#
beautifulsoup4 has been on 4.x since 2012; bumps are small and infrequent. Notable:
4.13 (2025)
- find_all behaviour tweaks around attribute matching.
- Better handling of namespace-prefixed tags in XML mode.
4.12 (2023)
- Improved support for HTML5 spec edge cases.
- soupsieve>=2.4 required (modern CSS-selector features like :has()).
4.7 (2019), historical
- Added soupsieve as a hard dep for select() CSS support.
- Code targeting older BS4 may use a partial in-tree CSS selector implementation; behaviour diverges on complex selectors.
Upgrade pattern:
- Pin beautifulsoup4>=4.13 in pyproject.toml.
- soupsieve and lxml floor versions matter too; pin them defensively if you rely on advanced CSS or XML features.
- Test against the exact HTML you scrape; minor changes in tag-inference behaviour can shift the tree shape for malformed inputs.
Plugin & rule ecosystem#
BeautifulSoup has no plugin API — extension is by parser choice and soupsieve selectors:
| Component | Role |
|---|---|
| lxml | Fast C-backed parser. Recommended for HTML and XML. |
| html5lib | Pure-Python spec-compliant parser. Slow but most lenient. |
| html.parser (stdlib) | Default fallback. Strictest. |
| soupsieve | CSS-selector engine (select, select_one). Pulled in automatically. |
| cchardet / chardet | Encoding detection. BS4 falls back to these for byte-input. cchardet is the C-based fast version. |
The features= argument selects parser per parse. Common values: "lxml", "html5lib", "html.parser", "lxml-xml", "xml" (alias for lxml-xml).
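For tools that must run wherever they are installed, one possible pattern is to try parser names in a fixed order and record which one was used. This is a sketch rather than a recommendation: make_soup and PREFERRED are hypothetical names, and the order shown is only an example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

PREFERRED = ("lxml", "html5lib", "html.parser")  # illustrative order


def make_soup(markup: str, parsers=PREFERRED):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # backend not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hi</p>")
```

The trade-off is that tree shape can differ between backends on malformed input, so a production scraper is usually better served by naming one parser and failing fast when it is missing.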
Configuration & layout patterns#
BeautifulSoup itself has no global config. Conventions for keeping a scraper maintainable:
- Always pass features= explicitly. Without it, BS4 picks whichever parser is installed, which makes behaviour env-dependent. BeautifulSoup(html, "lxml") is the safe form.
- Keep fetching and parsing in separate functions. Don’t open files or HTTP responses inline with parsing; separation makes each layer testable.
- Selector constants at module top. ARTICLE_SEL = "article.post", TITLE_SEL = "h2 > a". Easier to update when site markup shifts.
- Use select_one over find for new code. CSS selectors are more readable and composable than the find(name=, class_=, id=) keyword soup.
- Guard None everywhere. select_one and find return None on miss. elem.get_text() if elem else "" is the canonical pattern.
- Centralise encoding. Pass from_encoding="utf-8" if you know the encoding of byte input; otherwise BS4 sniffs (using cchardet if available).
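The conventions above might combine as in the following sketch. The selector values come from the list; text_of, parse_listing, DATE_SEL, and the sample markup are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Selector constants live at module top so markup changes touch one place.
ARTICLE_SEL = "article.post"
TITLE_SEL = "h2 > a"
DATE_SEL = ".date"


def text_of(elem, default: str = "") -> str:
    """Return stripped text, or `default` when the lookup missed."""
    return elem.get_text(strip=True) if elem is not None else default


def parse_listing(html: str) -> list[dict]:
    """Pure parsing step: takes markup, returns plain dicts. No I/O here."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for post in soup.select(ARTICLE_SEL):
        link = post.select_one(TITLE_SEL)
        rows.append({
            "title": text_of(link),
            "url": link.get("href") if link is not None else None,
            "date": text_of(post.select_one(DATE_SEL)),
        })
    return rows


SAMPLE = """
<article class="post"><h2><a href="/p/1">Hello</a></h2><span class="date">2024-01-02</span></article>
<article class="post"><h2>Untitled draft</h2></article>
"""
rows = parse_listing(SAMPLE)
```

Because parse_listing takes a string and does no I/O, it can be tested against saved HTML without any network access.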
Troubleshooting common errors#
| Symptom | Cause | Fix |
|---|---|---|
| FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml | lxml not installed | pip install lxml. |
| UserWarning: No parser was explicitly specified | BS4 guessed a parser from whatever is installed | Pass features="lxml" (or another) explicitly. |
| AttributeError: 'NoneType' object has no attribute 'get_text' | find / select_one returned None | Guard with if elem is not None:. |
| select() returns nothing for a selector that looks correct | Selector mismatch on case, namespace, or whitespace | Inspect with print(soup.prettify()[:500]) to verify the tree. Try a simpler selector first. |
| Encoding shows as mojibake | Source page mis-declared encoding | Pass the raw bytes with from_encoding="utf-8" explicitly; or fetch with r.text (httpx/requests handle declared encoding) before passing to BS4. |
| XML parsing treats tags as case-insensitive | Used features="lxml" instead of features="lxml-xml" | Switch to "lxml-xml" or "xml". |
| select(":has(...)") doesn’t work | Old soupsieve version | Upgrade to soupsieve>=2.4. |
| Whitespace or entities change on serialisation | prettify() adds newlines and indentation; every serialiser escapes entities according to its formatter | Use str(soup) rather than prettify() for output; pass formatter="minimal" (or another formatter) to soup.decode() or soup.encode() to control escaping. |
| Tag found in browser but not in BS4 | JavaScript rendered the element after page load | BS4 can’t execute JS; fetch via Playwright/Selenium first, then parse the rendered HTML. |
| RecursionError on deep trees | Python’s default recursion limit | sys.setrecursionlimit(5000) (cautiously); or use lxml.etree directly, which is iterative. |
The prettify() method is the diagnostic — print(soup.prettify()) shows the parsed tree exactly as BS4 sees it, surfacing missing elements or unexpected nesting from a malformed input.
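For the mojibake row in the table, a small sketch of passing a known encoding for byte input. The sample bytes are invented for the example; from_encoding only applies when the markup is bytes, not str.

```python
from bs4 import BeautifulSoup

# UTF-8 bytes with no charset declaration in the markup.
raw = "<p>café</p>".encode("utf-8")

# Tell BS4 the encoding instead of letting it sniff.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
text = soup.p.get_text()

# original_encoding reports which encoding BS4 ended up using.
detected = soup.original_encoding
```

Checking original_encoding on a problem page is a quick way to see whether BS4 guessed differently from what the page actually uses.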
Ecosystem integrations#
BeautifulSoup is the parsing layer. The fetch and (sometimes) render layers below it have many options:
- requests: synchronous HTTP, the original pairing. Mature, stable.
- httpx: modern, both sync and async, HTTP/2 support. The preferred choice in 2026.
- aiohttp: async-only HTTP client. Use when the whole stack is async.
- playwright / selenium: for JS-rendered pages. Fetch the rendered DOM, feed page.content() to BeautifulSoup.
- scrapy: full crawling framework. Uses parsel (similar API) instead of BeautifulSoup. Choose Scrapy for thousands of pages; BS4 for one-off or small jobs.
- pandas.read_html(): uses lxml by default and falls back to BS4 + html5lib. Useful for table extraction; reads all <table> tags on a page.
- mechanicalsoup: session-aware browser-like wrapper. Less popular; consider httpx.AsyncClient(follow_redirects=True) instead.
- bleach: HTML sanitisation. Built on html5lib rather than BS4. Use for cleaning untrusted HTML.
CI integration#
Scrapers in CI typically run against recorded fixtures, not the live target. Patterns:
- Record-replay with pytest-vcr, responses, or pytest-recording. Capture real responses once, commit to repo, replay in CI.
- Static fixtures: check in HTML files under tests/fixtures/ and parse them in tests. Cheaper than VCR but they go stale faster.
- Live-fetch nightly: a scheduled job hits the live target and surfaces breakage early. Keep it separate from per-commit CI to avoid coupling PR merges to upstream uptime.
name: scraper-tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[test]"
      - run: pytest tests/  # uses VCR cassettes, no live calls
For live-target verification on a cron:
on:
  schedule:
    - cron: "0 2 * * *"  # 02:00 UTC daily
Failures here open an issue (actions-ecosystem/action-create-issue) rather than blocking the build.
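The static-fixture pattern might look like the sketch below in a pytest module. extract_links and the fixture content are hypothetical; in a real repository the HTML would usually be read from a file under tests/fixtures/ instead of an inline string.

```python
from bs4 import BeautifulSoup

# Stand-in for a file such as tests/fixtures/listing.html (hypothetical path).
FIXTURE = """
<html><body>
  <article class="post"><h2><a href="/p/1">First</a></h2></article>
  <article class="post"><h2><a href="/p/2">Second</a></h2></article>
</body></html>
"""


def extract_links(html: str) -> list[str]:
    """The function under test: returns post URLs from a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("article.post h2 > a[href]")]


def test_extract_links_from_fixture():
    assert extract_links(FIXTURE) == ["/p/1", "/p/2"]


def test_extract_links_handles_empty_page():
    assert extract_links("<html><body></body></html>") == []
```

Tests like these run without network access, so they are suited to the per-commit job, while the scheduled job covers drift in the live markup.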
When NOT to use this#
BeautifulSoup is excellent for HTML scraping but a wrong fit when:
- Pure XML processing. Use lxml.etree directly: faster and more XPath-friendly. BS4’s XML mode wraps lxml and adds overhead.
- JS-rendered pages. BS4 can’t execute JavaScript. Fetch with Playwright/Selenium first; then pass the rendered HTML to BS4 if you want its ergonomics.
- Very large documents (50+ MB). BS4 loads the whole tree. Use lxml.etree.iterparse for streaming.
- Pure CSS-selector reads at scale. selectolax is 3–10× faster than BS4 for read-only queries. BS4’s value is the mutation API and the consistent abstraction over multiple parsers.
- Inside a Scrapy pipeline. Scrapy ships with parsel, which is the same idea. Don’t mix.
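To give a feel for the streaming approach recommended for very large documents, here is a sketch that uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml.etree.iterparse, so that it runs without lxml. The function name and the tiny sample sitemap are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def iter_sitemap_urls(source):
    """Yield <loc> values one at a time instead of keeping the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == LOC_TAG:
            yield (elem.text or "").strip()
        elem.clear()  # drop this element's text and children to free memory


# Small stand-in document; a real sitemap would be a file or response stream.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""
urls = list(iter_sitemap_urls(io.BytesIO(SITEMAP)))
```

Clearing each element after it has been handled releases most of its memory, which is the point of streaming a file that would be too large to hold as a full BeautifulSoup tree.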
See also#
- Python: BeautifulSoup (API tutorial, navigation, CSS selectors, scraping recipes)
- Concept: HTTP (what requests/httpx deliver before BeautifulSoup parses it)
- Packages: pip-requests (the canonical fetch-then-parse pairing)