xidel Web Scraping & Data Extraction#
What it is#
xidel is an open-source command-line tool for downloading web pages and extracting structured data from HTML, XML, and JSON sources, maintained at videlibri.de. It supports XPath 2.0/3.0, CSS selectors, custom template-based pattern matching, and JSONiq, making it one of the most expressive scraping tools available without writing a full script. Reach for xidel when you need to query deeply nested HTML or XML documents with XPath from a shell pipeline, or when grep/sed are too brittle for structured markup.
Install:
apt-get install xidel (Debian/Ubuntu), brew install xidel (macOS), or download from videlibri.de/xidel.html
[!NOTE] Release status (May 2026): the last tagged stable release on GitHub remains 0.9.8 (April 2022). Development version 0.9.9 is published irregularly as preview binaries for Windows, Linux, macOS, and Android; it ships ~99.6% XPath/XQuery 3.1 coverage, partial XPath 4.0 syntax, --json-mode, --in-place, and new extension functions (inner-text, x:request-decode, matched-text). The project is low-velocity but not abandoned. If a feature below appears missing on 0.9.8, grab a 0.9.9 preview build from videlibri.de.
Extract with XPath#
XPath is a query language for navigating the tree structure of HTML and XML documents; xidel supports XPath 2.0/3.0, which adds functions, sequences, and regular expressions beyond what most tools support. Use --extract with an XPath expression when you need to traverse nested elements, filter by attribute value, or apply string functions that CSS selectors cannot express.
# Extract all link href attributes from a page
xidel https://example.org --extract "//a/@href"
Output:
/
/about
/contact
https://docs.example.org/
https://github.com/example/repo
# Extract the target URLs of Google search result links
xidel "https://www.google.com/search?q=linux+tips" \
--extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
Output:
https://www.linuxcommand.org/
https://linuxjourney.com/
https://tldr.sh/
https://cheat.sh/
# Extract all image sources
xidel https://example.org --extract "//img/@src"
Output:
/assets/logo.png
/assets/hero.jpg
/assets/icons/arrow.svg
# Extract text content of all headings
xidel https://example.org --extract "//h1|//h2|//h3"
Output:
Welcome to Example
Getting Started
Installation
Configuration
API Reference
Extract with CSS selectors#
CSS selectors are a concise alternative to XPath for element selection by tag name, class, ID, or attribute — the same syntax used in browser DevTools. Use --css when the query is simple and familiar from web development; switch to XPath when you need axis traversal, positional predicates, or string operations.
# Extract text of all paragraphs
xidel https://example.org --css "p"
Output:
This is the first paragraph describing the product.
Use it to simplify your workflow and automate tasks.
See the documentation for full details.
# Extract href from all nav links
xidel https://example.org --extract "css('nav a')/@href"
Output:
/
/docs
/api
/blog
/contact
# Combine: follow CSS-selected links and extract their titles
xidel https://example.org --follow "css('a')" --css title
Output:
Example - Home
Example - Documentation
Example - API Reference
Example - Blog
Pattern matching (template syntax)#
Pattern matching lets you describe the shape of the data you want with placeholders:
# Extract whatever is between <title> and </title>
xidel https://example.org --extract "<title>{.}</title>"
Output:
Example Domain
# Follow all <a> links and extract each page's title
xidel https://example.org \
--follow "<a>{.}</a>*" \
--extract "<title>{.}</title>"
Output:
Example Domain
Example - About
Example - Contact
Example - Documentation
# Extract a specific nested value — also validates structure is present
xidel path/to/example.xml \
--extract "<x><foo>ood</foo><bar>{.}</bar></x>"
Output:
the bar value
Follow links & crawl#
--follow takes an XPath or CSS expression that selects URLs, fetches each one, and applies the --extract expression to the resulting pages. This turns xidel into a single-command crawler — useful for scraping paginated sites or downloading all assets linked from a page.
# Follow all <a> tags on a page and print each linked page's title
xidel https://example.org --follow //a --extract //title
Output:
Example Domain
Example - Getting Started
Example - API Reference
Example - Changelog
# Follow Google result links, print titles, download pages into host-named dirs
xidel "https://www.google.com/search?q=test" \
--follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" \
--extract //title \
--download '{$host}/'
Output:
Test - Wikipedia
Software testing - MDN
Pytest documentation
…
JSON APIs#
xidel parses JSON responses and exposes them as an XPath-navigable tree, so the same --extract "//field" syntax works for JSON as it does for XML. This makes it a lightweight alternative to curl | jq when the extraction logic is straightforward.
# Extract a field from a JSON API response
xidel https://api.github.com/repos/octocat/Hello-World --extract "//name"
Output:
Hello-World
# Extract a nested field with the same path syntax
xidel https://api.example.com/data.json --extract "//items/title"
Output:
First Article
Second Article
Third Article
Structured output from RSS / Atom#
RSS and Atom feeds are well-formed XML, making them ideal targets for xidel’s template syntax. Named variable assignments (field:=.) let you pair related values from different elements in a single pass, producing structured records rather than a flat list of values.
# Extract title + URL from every Stack Overflow question in the RSS feed
xidel http://stackoverflow.com/feeds \
--extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"
Output:
title: How do I reverse a list in Python?
uri: https://stackoverflow.com/questions/3940128/
title: Difference between append and extend in Python
uri: https://stackoverflow.com/questions/252703/
title: How to check if a file exists in Python?
uri: https://stackoverflow.com/questions/82831/
…
The + at the end means “repeat this pattern one or more times.” Named variables (title:=, uri:=) pair related fields.
Form automation & login#
xidel can submit HTML forms by wrapping a CSS selector for the form element with form() and a dictionary of field values. It maintains cookies across requests, enabling login flows where subsequent --follow and --extract calls run as the authenticated session.
# Log in to Reddit and check unread mail count
# Combines CSS selectors, XPath, JSONiq, and form evaluation
xidel https://reddit.com \
--follow "form(css('form.login-form')[1], {'user': 'myuser', 'passwd': 'mypassword'})" \
--extract "css('#mail')/@title"
Output:
3 messages
Output formats#
--output-format controls how extracted values are serialized. The default adhoc prints one result per line, json produces a JSON array suitable for piping into jq, and xml wraps results in a <result> element. Choose json or xml when passing xidel output to another tool that expects structured data.
# Output as JSON array
xidel https://example.org --extract "//a/@href" --output-format json
Output:
["\/","\/about","\/contact","https:\/\/docs.example.org\/","https:\/\/github.com\/example\/repo"]
# Output as XML
xidel https://example.org --extract "//a" --output-format xml
Output:
<result>
<a href="/">Home</a>
<a href="/about">About</a>
<a href="/contact">Contact</a>
</result>
# Print each result on its own line (default)
xidel https://example.org --extract "//a/@href" --output-format adhoc
Output:
/
/about
/contact
https://docs.example.org/
https://github.com/example/repo
Query language comparison#
| Query | XPath | CSS | Pattern |
|---|---|---|---|
| All links | //a | css('a') | <a>{.}</a>* |
| Link href | //a/@href | css('a')/@href | <a href="{.}"> |
| Page title | //title | css('title') | <title>{.}</title> |
| First h1 | //h1[1] | css('h1:first-of-type') | — |
Combine with shell pipelines#
# Save all scraped URLs to a file for aria2c batch download
xidel https://example.org/downloads --extract "//a[contains(@href,'.iso')]/@href" \
> iso-urls.txt
aria2c --input-file=iso-urls.txt -c -d ~/Downloads
Contents of iso-urls.txt:
https://example.org/downloads/ubuntu-24.04-desktop-amd64.iso
https://example.org/downloads/ubuntu-24.04-server-amd64.iso
https://example.org/downloads/debian-12.5.0-amd64-netinst.iso
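Scraped href values are often root-relative or duplicated, and a download tool needs absolute, unique URLs. The sketch below shows one way to normalise a list before handing it on. The `BASE` value, the sample hrefs, and the `absolutize` helper name are illustrative assumptions, not part of xidel or aria2c.

```shell
#!/usr/bin/env bash
# Hypothetical base URL used to resolve root-relative links.
BASE="https://example.org"
urls_file="$(mktemp)"

# absolutize: read hrefs on stdin, prefix root-relative paths with BASE,
# skip blank lines, and drop duplicates while keeping first-seen order.
absolutize() {
  awk -v base="$BASE" '
    NF == 0 { next }
    substr($0, 1, 1) == "/" && substr($0, 2, 1) != "/" { $0 = base $0 }
    !seen[$0]++
  '
}

# Sample hrefs standing in for the xidel --extract output.
printf '%s\n' \
  '/downloads/a.iso' \
  'https://example.org/downloads/b.iso' \
  '' \
  '/downloads/a.iso' \
  | absolutize > "$urls_file"

cat "$urls_file"
```

In a real pipeline the `printf` would be replaced by the xidel command, and the resulting file passed to `aria2c --input-file`.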
XPath 3.0 essentials#
XPath 3.0 is the query language xidel uses by default; it extends XPath 2.0 with higher-order functions, the simple map operator (!), and the string concatenation operator (||), while map and array types arrive with XPath 3.1. Understanding the four primary building blocks (path steps, predicates, axes, and functions) turns xidel into a precise surgical tool rather than a guess-and-check scraper.
Path expressions#
A path expression is a sequence of steps separated by / (one level down) or // (any descendant), where each step yields a sequence of nodes that the next step traverses. The leading / anchors at the document root; an unanchored //foo finds every foo element anywhere in the tree.
xidel page.html --extract "/html/body//p" # absolute path to paragraphs
xidel page.html --extract "//div//span" # spans nested under any div
xidel page.html --extract "//article/h1" # direct h1 children of article
xidel page.html --extract "//*[@id='main']" # any element with id="main"
Output:
Welcome to the article
Section one heading
A paragraph in section one
Section two heading
Predicates#
A predicate is a […] filter appended to a step that keeps only nodes matching its condition. Numeric predicates select positionally ([1] is first, [last()] is last); boolean predicates filter by attribute, text content, or function results.
# First link, last link, links with class "external"
xidel page.html --extract "//a[1]"
xidel page.html --extract "//a[last()]"
xidel page.html --extract "//a[@class='external']"
xidel page.html --extract "//a[contains(@href, 'github')]"
xidel page.html --extract "//li[position() <= 3]" # first three list items
xidel page.html --extract "//tr[td[3] > 100]" # rows where col 3 > 100
Output:
https://example.org/page1
https://github.com/example/repo
Item one
Item two
Item three
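A positional predicate binds to the step it follows, not to the whole path, so `//a[1]` means "the first a child of each parent" while `(//a)[1]` means "the first a in the document". The sketch below sets up a small fixture to compare the two; the file name and markup are hypothetical, and the queries only run if xidel is on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical fixture: two lists, each containing two links.
dir="$(mktemp -d)"
fixture="$dir/links.html"
cat > "$fixture" <<'HTML'
<html><body>
  <ul><li><a href="/a1">A1</a> <a href="/a2">A2</a></li></ul>
  <ul><li><a href="/b1">B1</a> <a href="/b2">B2</a></li></ul>
</body></html>
HTML

if command -v xidel >/dev/null 2>&1; then
  # First <a> child of every parent: should select /a1 and /b1.
  xidel --silent "$fixture" --extract '//a[1]/@href'
  # First <a> in the whole document: should select only /a1.
  xidel --silent "$fixture" --extract '(//a)[1]/@href'
else
  echo "xidel not installed; skipping queries" >&2
fi
```

Wrapping the path in parentheses before the predicate is the usual way to index into the combined result.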
Axes#
An axis defines the direction of traversal from a context node — most queries use the default child:: (implicit before each step) but explicit axes unlock parent/sibling/ancestor lookups. Use parent::, following-sibling::, and ancestor:: when CSS selectors hit a dead end.
# Parent of the first h1
xidel page.html --extract "//h1[1]/parent::*"
# All <p> siblings that follow the first h2 (does not stop at the next heading)
xidel page.html --extract "//h2[1]/following-sibling::p"
# Ancestor div of an element
xidel page.html --extract "//a[@id='link']/ancestor::div[1]"
# All paragraphs that precede the first table in document order
xidel page.html --extract "//table[1]/preceding::p"
Output:
The first paragraph after section one
The second paragraph after section one
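`following-sibling::` selects every later sibling, so stopping at the next heading takes an extra predicate. One possible approach, sketched here on a hypothetical fixture, is to keep only the paragraphs that have exactly one h2 before them among their siblings. The file name, markup, and `query` variable are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical fixture with two sections under the same parent.
dir="$(mktemp -d)"
page="$dir/sections.html"
cat > "$page" <<'HTML'
<html><body>
  <h2>Section one</h2>
  <p>First paragraph after section one</p>
  <p>Second paragraph after section one</p>
  <h2>Section two</h2>
  <p>Paragraph in section two</p>
</body></html>
HTML

# Paragraph siblings of the first h2 that have exactly one preceding h2
# sibling, i.e. those sitting between the first and the second heading.
query='(//h2)[1]/following-sibling::p[count(preceding-sibling::h2) = 1]'

if command -v xidel >/dev/null 2>&1; then
  xidel --silent "$page" --extract "$query"
else
  echo "xidel not installed; skipping query" >&2
fi
```

The count-based predicate assumes all headings and paragraphs share one parent; nested layouts need a different boundary test.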
XPath functions#
XPath 3.0 ships with a rich function library covering strings, numbers, sequences, and dates. The most-used in scraping are normalize-space(), tokenize(), matches(), substring-before/after(), lower-case(), and concat().
xidel page.html --extract "normalize-space(//h1)"
xidel page.html --extract "lower-case(//title)"
xidel page.html --extract "tokenize(//meta[@name='keywords']/@content, ',\s*')"
xidel page.html --extract "//a[matches(@href, '^https://github\.com')]/@href"
xidel page.html --extract "substring-after(//meta[@property='og:url']/@content, '://')"
xidel page.html --extract "concat(//h1, ' — ', //h2[1])"
Output:
welcome to example
linux, cli, tutorial
https://github.com/example/repo
example.org/article/123
Welcome — Getting Started
CSS selector deep dive#
xidel’s CSS engine implements most of Selectors Level 3 (descendant, child, attribute, pseudo-class), invoked via --css for whole-document selection or inside an expression with css('…') for use alongside XPath. Selectors are less expressive than XPath: they cannot walk up the tree and have no built-in string functions, but they are usually shorter for common scraping queries.
# Descendant, child, adjacent sibling
xidel page.html --css "article p" # descendant
xidel page.html --css "article > p" # direct child
xidel page.html --css "h2 + p" # paragraph immediately following each h2
# Attribute selectors
xidel page.html --css "a[href^='https://']" # starts-with
xidel page.html --css "a[href$='.pdf']" # ends-with
xidel page.html --css "a[href*='github']" # contains
xidel page.html --css "input[type='hidden']" # exact match
# Pseudo-classes
xidel page.html --css "li:first-child"
xidel page.html --css "li:nth-child(odd)"
xidel page.html --css "tr:not(.header)"
Output:
First paragraph in the article
Direct child paragraph
https://example.org/intro.pdf
https://github.com/example/repo
Mixing CSS and XPath#
css('selector') is a function inside any expression — combine it with XPath axes when CSS picks the starting element but you need to walk relatives.
# Use CSS to find articles, then XPath to grab their first paragraph
xidel page.html --extract "css('article')/p[1]"
# CSS-selected nav links followed by parent <li> for context
xidel page.html --extract "css('nav a')/parent::li"
Output:
First paragraph from article one
First paragraph from article two
JSON / JSONiq queries#
When xidel reads JSON, the document becomes a tree of maps (objects) and arrays accessible by both XPath-flavoured //field syntax and JSONiq’s dot/bracket notation. JSONiq is a JSON query language modelled on XQuery (it is not a W3C standard) and is closer in feel to jq, while XPath syntax is consistent with everything else xidel does. Pick whichever reads better for the task.
# Plain XPath against JSON
xidel api.json --extract "//items/title"
xidel api.json --extract "//user/email"
# JSONiq dot/bracket — strongly recommended for nested arrays
xidel api.json --extract '$json.items().title'
xidel api.json --extract '$json.items()[$$.score > 80].name'
xidel api.json --extract 'count($json.items())'
# Multiple fields paired
xidel api.json --extract 'for $i in $json.items() return {"name": $i.name, "score": $i.score}'
Output:
First item
Second item
Third item
3
When to prefer xidel over jq for JSON#
jq is the default JSON tool and is faster for pure JSON pipelines; xidel becomes attractive when the same scraping job mixes HTML/XML pages and JSON APIs, when you want a single expression language across all formats, or when you need to follow links discovered inside JSON responses.
# Read a JSON index, follow each item's URL, scrape the HTML title
xidel https://api.example.com/articles.json \
--follow '$json.articles().url' \
--extract "//title"
Output:
First Article Title
Second Article Title
Third Article Title
Variables and bindings#
--variable NAME=value (or -v short form) binds a variable accessible as $NAME inside any expression — useful for parameterising base URLs, query strings, or output prefixes without rewriting the expression. Variables can also be JSON-encoded objects for richer data passing.
# Pass a base URL into the extract expression
xidel "https://example.org/page" \
--variable "host=example.org" \
--extract 'concat($host, ": ", //title)'
# Multiple variables
xidel page.html \
-v "year=2026" \
-v "section=docs" \
--extract 'concat($section, "/", $year, "/", //h1)'
# JSON-typed variable
xidel page.html \
--var-json 'filters={"min": 10, "max": 100}' \
--extract '//item[price >= $filters.min and price <= $filters.max]'
Output:
example.org: Welcome
docs/2026/Getting Started
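One shell-level detail matters when an expression contains `$NAME`: inside double quotes the shell expands the variable itself before xidel ever sees it. The sketch below uses plain `printf` (no xidel needed) to show what each quoting style passes through; the value `from-shell` is purely illustrative.

```shell
#!/usr/bin/env bash
# A shell variable that happens to share its name with the xidel variable.
host="from-shell"

# Double quotes: the shell substitutes $host before the command runs.
double="concat($host, ': ', //title)"

# Single quotes: the literal text $host is passed through unchanged.
single='concat($host, ": ", //title)'

printf '%s\n' "$double" "$single"
# prints:
#   concat(from-shell, ': ', //title)
#   concat($host, ": ", //title)
```

Single-quoting the expression (and using double quotes for string literals inside it) keeps variable references intact for xidel to resolve.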
Recursive crawling with --follow#
--follow accepts the same selectors as --extract but uses each result as a URL to fetch. Add --follow-level N to cap recursion depth, and --follow-from to control which page the follow expression runs against. xidel maintains its own visited-URL set, so cycles are detected automatically.
# Crawl two levels deep, extract every page's title
xidel https://example.org \
--follow "//a[contains(@href, 'example.org')]/@href" \
--follow-level 2 \
--extract "//title"
# Stay within one domain
xidel https://example.org \
--follow "//a/@href" \
--follow-include "example.org" \
--extract "//h1"
# Follow only Atom feed entries
xidel https://example.org/feed.atom \
--follow "//entry/link/@href" \
--extract "//article/h1"
Output:
Welcome to Example
Example - About
Example - Documentation
Example - API Reference
Example - Changelog
Example - Contributing
HTTP options — headers, cookies, auth#
xidel is built on libcurl-style HTTP, exposing user-agent, cookie jar, custom headers, basic auth, POST data, and proxy flags. Use these when the target site differentiates between browsers, requires login, or rate-limits anonymous traffic.
# Custom user-agent (some sites block default xidel UA)
xidel https://example.org \
--user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36" \
--extract "//h1"
# Persistent cookie jar across runs
xidel https://example.org/dashboard \
--cookie-jar ~/.cache/xidel-cookies.txt \
--extract "//span[@class='username']"
# Custom header (API tokens)
xidel "https://api.example.com/v1/me" \
--header "Authorization: Bearer YOUR_TOKEN" \
--extract "//email"
# POST a form body
xidel https://example.org/login \
--post "user=alicedev&pass=secret" \
--extract "//div[@id='status']"
# Basic auth
xidel "https://admin.example.org/stats" \
--user "alicedev:secret" \
--extract "//table//td"
Output:
Welcome
alicedev
alice@example.com
Logged in successfully
42 active sessions
Local files and standard input#
xidel accepts file paths and - for stdin as readily as URLs. This makes it scriptable against archived pages, downloaded snapshots, and pipe-driven workflows where curl or wget handles the fetching and xidel handles the parsing.
# Local file
xidel path/to/example.html --extract "//title"
# stdin
curl -s https://example.org | xidel - --extract "//title"
# Glob across many local files
xidel ./archive/*.html --extract "//h1"
# Pipe gzipped HTML through gunzip
gunzip -c snapshot.html.gz | xidel - --extract "//meta[@property='og:title']/@content"
Output:
Example Domain
Welcome to Example
Page Title from Archived Snapshot
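When many local files are queried at once, the output alone does not say which file each value came from. A per-file loop, sketched below, is one way to keep the pairing. The fixture pages and the `report` variable are hypothetical, and the loop falls back to a placeholder when xidel is not installed.

```shell
#!/usr/bin/env bash
# Two throwaway fixture pages (hypothetical content).
dir="$(mktemp -d)"
printf '<html><head><title>%s</title></head><body></body></html>\n' 'Page One' > "$dir/one.html"
printf '<html><head><title>%s</title></head><body></body></html>\n' 'Page Two' > "$dir/two.html"

# Build one "filename<TAB>title" line per file.
report="$(
  for f in "$dir"/*.html; do
    if command -v xidel >/dev/null 2>&1; then
      title="$(xidel --silent "$f" --extract '//title')"
    else
      title="(xidel not installed)"
    fi
    printf '%s\t%s\n' "$(basename "$f")" "$title"
  done
)"
printf '%s\n' "$report"
```

Starting one xidel process per file is slower than a single glob invocation, so this trade-off suits small archives or cases where the file name matters.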
Comparison with sibling tools#
xidel sits at the intersection of HTML parsing (htmlq, pup), XML query (xmllint), JSON query (jq), and the newer format-agnostic crop (dasel, yq). The table below shows where each tool wins; reach for xidel when one job spans more than one of these formats and needs XPath/XQuery expressiveness.
| Need | Best tool | Why |
|---|---|---|
| Pure JSON pipelines | jq | Smaller, faster, ubiquitous |
| CSS-selector-only HTML extraction | htmlq (Rust) or pup (Go) | Lighter syntax, focused scope, single static binary |
| Cross-format query/edit (JSON + YAML + TOML + XML + CSV) without XPath | dasel | One path-style selector across all formats; small Go binary |
| XML schema validation, namespaces | xmllint | Mature XML stack, XSD support |
| XPath against HTML | xidel | Tolerates malformed HTML, XPath 3.0 (3.1 in 0.9.9 dev) |
| Mixed HTML + JSON crawl in one expression | xidel | One query language, built-in --follow |
| Login flows with cookies | xidel | --cookie-jar + form submission built-in |
| JavaScript-rendered pages | Playwright / Puppeteer | xidel does not execute JS |
When to pick something else#
- You only need CSS selectors on HTML: htmlq or pup start faster, have shorter syntax, and are easier to install on minimal images. pup went dormant for a stretch; htmlq is the more actively maintained Rust replacement.
- You’re slicing JSON/YAML/TOML/XML configs with no HTML in the loop: dasel gives one path syntax across all of them and is a static Go binary with no Pascal runtime. xidel’s XPath wins as soon as the structure gets deep or you need predicates / functions.
- You want pure-XML rigor (namespaces, XSD, XSLT): xmllint and xmlstarlet remain the canonical choice.
[!NOTE] xidel does not run JavaScript, so single-page apps that render content client-side return empty results. For those, render with a headless browser first (Playwright, Puppeteer, or chromium --headless --dump-dom) and pipe the rendered HTML into xidel.
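The render-then-parse approach can be wrapped in a small helper. This is a minimal sketch under stated assumptions: the function name `render_and_extract` and the `CHROMIUM_BIN` override are invented for illustration, and the browser binary name varies by distribution (chromium, chromium-browser, google-chrome).

```shell
#!/usr/bin/env bash
# render_and_extract URL XPATH
# Renders URL in headless Chromium, then queries the resulting DOM with xidel.
render_and_extract() {
  local url="$1" query="$2" browser
  # Binary name is an assumption; override it with CHROMIUM_BIN if needed.
  browser="${CHROMIUM_BIN:-chromium}"
  if ! command -v "$browser" >/dev/null 2>&1 || ! command -v xidel >/dev/null 2>&1; then
    echo "need both $browser and xidel on PATH" >&2
    return 127
  fi
  # --dump-dom prints the DOM after scripts have run; xidel reads it from stdin.
  "$browser" --headless --dump-dom "$url" 2>/dev/null \
    | xidel --silent - --extract "$query"
}

# Usage (requires network access):
#   render_and_extract "https://example.org/app" '//h1'
```

Pages that load data after the initial render may still come back incomplete, in which case a scripted browser such as Playwright gives more control over timing.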
0.9.9 dev-build extras#
The 0.9.9 preview adds a handful of conveniences worth knowing about if you grab the development binary from videlibri.de. They fix small annoyances in 0.9.8 — especially around visible-text extraction and in-place file edits — without changing day-to-day query syntax.
# inner-text: skip <script>/<style>/hidden, collapse whitespace,
# return what a browser actually renders
xidel article.html --extract "inner-text(//article)"
# --json-mode picks the JSON dialect explicitly
xidel api.json --json-mode=xpath --extract '?items?*?title' # XPath 3.1 syntax
xidel api.json --json-mode=jsoniq --extract '$json.items().title'
# --in-place overwrites the input file with the result —
# useful for batch rewrites of local snapshots
xidel ./snapshots/*.html --in-place --extract "//main"
# matched-text exposes what a pattern match actually captured
xidel page.html --extract "<a href='{link:=.}'>{matched-text()}</a>*"
# x:request-decode parses an application/x-www-form-urlencoded body
xidel - --extract "x:request-decode('a=1&b=hello%20world')"
Output:
Welcome to the article.
First paragraph in context.
…
{"a": "1", "b": "hello world"}
Recipes#
A grab bag of complete one-liners that combine the features above for real scraping problems. Each recipe is copy-paste runnable against a site that exposes the relevant markup.
Extract OpenGraph metadata#
xidel "https://example.org/article" --extract '
{
"title": //meta[@property="og:title"]/@content,
"description": //meta[@property="og:description"]/@content,
"image": //meta[@property="og:image"]/@content,
"url": //meta[@property="og:url"]/@content
}
' --output-format json
Output:
{
"title": "How to scrape with xidel",
"description": "A practical walkthrough of XPath and CSS selectors.",
"image": "https://example.org/images/og-card.png",
"url": "https://example.org/article"
}
Sitemap → flat URL list#
xidel "https://example.org/sitemap.xml" --extract "//*[local-name()='loc']"
Output:
https://example.org/
https://example.org/about
https://example.org/blog/post-1
https://example.org/blog/post-2
RSS feed → JSON#
xidel "https://example.org/feed.rss" \
--extract '//item ! { "title": title, "link": link, "date": pubDate }' \
--output-format json
Output:
[
{ "title": "New release 1.4", "link": "https://example.org/blog/1.4", "date": "Wed, 22 Apr 2026 12:00:00 GMT" },
{ "title": "Roadmap update", "link": "https://example.org/blog/roadmap", "date": "Wed, 15 Apr 2026 09:30:00 GMT" }
]
Crawl a blog index and dump articles to JSON#
xidel "https://example.org/blog" \
--follow "//article//a/@href" \
--extract '{
"title": //h1,
"author": //meta[@name="author"]/@content,
"date": //time/@datetime,
"content": string-join(//article//p, "\n\n")
}' --output-format json
Output:
[
{ "title": "Post One", "author": "Alice Dev", "date": "2026-04-22", "content": "First paragraph...\n\nSecond paragraph..." },
{ "title": "Post Two", "author": "Alice Dev", "date": "2026-04-15", "content": "Intro line.\n\nMore text..." }
]
Combine with jq for post-processing#
xidel "https://example.org/products" \
--extract '//product ! { name: name, price: number(price) }' \
--output-format json \
| jq '[.[] | select(.price < 50)] | sort_by(.price)'
Output:
[
{ "name": "Pen", "price": 2.5 },
{ "name": "Notebook", "price": 8.99 },
{ "name": "Mug", "price": 12.0 }
]
[!TIP] Add --silent to suppress xidel’s progress and informational messages, leaving only the extracted data. This is useful when piping into another command or capturing in a variable.
[!TIP] If an expression returns nothing, run xidel with --printed-node-format=text-with-html-tags (or use --extract "//*") to inspect the actual parsed tree. Browser DevTools selectors sometimes target post-JavaScript DOM that xidel cannot see.
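In scripts, an empty result is easy to miss because nothing is printed. The helper below is a sketch of one way to surface it: capture the output with --silent and return a non-zero status when nothing matched. The function name `extract_or_warn` and its exit codes are assumptions made for this example.

```shell
#!/usr/bin/env bash
# extract_or_warn INPUT XPATH
# Prints the extracted values, or warns on stderr and fails if none matched.
extract_or_warn() {
  local input="$1" query="$2" result
  if ! command -v xidel >/dev/null 2>&1; then
    echo "xidel not installed" >&2
    return 127
  fi
  # --silent keeps progress messages out of the captured output.
  result="$(xidel --silent "$input" --extract "$query")"
  if [ -z "$result" ]; then
    echo "no match for: $query" >&2
    return 1
  fi
  printf '%s\n' "$result"
}

# Usage:
#   extract_or_warn page.html '//h1' || echo "check the selector" >&2
```

A caller can then branch on the exit status instead of testing for empty strings at every call site.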
Sources#
- xidel: videlibri.de homepage and downloads
- benibela/xidel on GitHub: releases (0.9.8 latest tagged stable, April 2022)
- benibela/xidel changelog: 0.9.9 dev features (XPath/XQuery 3.1, --json-mode, --in-place, inner-text)
- xidel Homebrew formula
- htmlq: Rust HTML selector CLI (modern alternative for CSS-only HTML)
- pup: Go HTML selector CLI
- dasel: format-agnostic selector for JSON/YAML/TOML/XML/CSV
- structured-text-tools: curated index of CLI tools for structured data