xidel Web Scraping & Data Extraction#

What it is#

xidel is an open-source command-line tool for downloading web pages and extracting structured data from HTML, XML, and JSON sources, maintained at videlibri.de. It supports XPath 2.0/3.0, CSS selectors, custom template-based pattern matching, and JSONiq, making it one of the most expressive scraping tools available without writing a full script. Reach for xidel when you need to query deeply nested HTML or XML documents with XPath from a shell pipeline, or when grep/sed are too brittle for structured markup.

Install: apt-get install xidel (Debian/Ubuntu), brew install xidel (macOS), or download from videlibri.de/xidel.html

[!NOTE] Release status (May 2026): the last tagged stable release on GitHub remains 0.9.8 (April 2022). Development version 0.9.9 is published irregularly as preview binaries for Windows, Linux, macOS, and Android — it ships ~99.6% XPath/XQuery 3.1 coverage, partial XPath 4.0 syntax, --json-mode, --in-place, and new extension functions (inner-text, x:request-decode, matched-text). The project is low-velocity but not abandoned. If a feature below appears missing on 0.9.8, grab a 0.9.9 preview build from videlibri.de.
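
Because features differ between 0.9.8 and the 0.9.9 previews, a script can check the installed version before relying on them. The helpers below are a minimal sketch, not part of xidel: the function names are hypothetical, and the code assumes `xidel --version` prints a dotted version number such as 0.9.8 somewhere in its output.

```shell
# Print the version number reported by the installed xidel binary.
# Assumption: `xidel --version` prints a dotted version such as 0.9.8.
xidel_version() {
  xidel --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Succeed (exit status 0) when the installed version is at least $1.
# Relies on `sort -V` (version sort), available in GNU coreutils and recent BSD sort.
xidel_at_least() {
  local have
  have=$(xidel_version)
  [ -n "$have" ] || return 1
  [ "$(printf '%s\n%s\n' "$1" "$have" | sort -V | head -n 1)" = "$1" ]
}
```

Possible usage: `xidel_at_least 0.9.9 || echo "this script needs a 0.9.9 preview build" >&2`.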

Extract with XPath#

XPath is a query language for navigating the tree structure of HTML and XML documents; xidel supports XPath 2.0/3.0, which adds functions, sequences, and regular expressions beyond what most tools support. Use --extract with an XPath expression when you need to traverse nested elements, filter by attribute value, or apply string functions that CSS selectors cannot express.

# Extract all link href attributes from a page
xidel https://example.org --extract "//a/@href"

Output:

/
/about
/contact
https://docs.example.org/
https://github.com/example/repo

# Extract the result URLs from a Google search page
xidel "https://www.google.com/search?q=linux+tips" \
  --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

Output:

https://www.linuxcommand.org/
https://linuxjourney.com/
https://tldr.sh/
https://cheat.sh/

# Extract all image sources
xidel https://example.org --extract "//img/@src"

Output:

/assets/logo.png
/assets/hero.jpg
/assets/icons/arrow.svg

# Extract text content of all headings
xidel https://example.org --extract "//h1|//h2|//h3"

Output:

Welcome to Example
Getting Started
Installation
Configuration
API Reference

Extract with CSS selectors#

CSS selectors are a concise alternative to XPath for element selection by tag name, class, ID, or attribute — the same syntax used in browser DevTools. Use --css when the query is simple and familiar from web development; switch to XPath when you need axis traversal, positional predicates, or string operations.

# Extract text of all paragraphs
xidel https://example.org --css "p"

Output:

This is the first paragraph describing the product.
Use it to simplify your workflow and automate tasks.
See the documentation for full details.

# Extract href from all nav links
xidel https://example.org --css "nav a" --extract "@href"

Output:

/
/docs
/api
/blog
/contact

# Combine: follow CSS-selected links and extract their titles
xidel https://example.org --follow "css('a')" --css title

Output:

Example - Home
Example - Documentation
Example - API Reference
Example - Blog

Pattern matching (template syntax)#

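Pattern matching lets you describe the shape of the data you want: write a fragment of the expected markup and put the placeholder {.} where the value to capture appears. A trailing * after an element lets it match zero or more times.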

# Extract whatever is between <title> and </title>
xidel https://example.org --extract "<title>{.}</title>"

Output:

Example Domain

# Follow all <a> links and extract each page's title
xidel https://example.org \
  --follow "<a>{.}</a>*" \
  --extract "<title>{.}</title>"

Output:

Example Domain
Example - About
Example - Contact
Example - Documentation

# Extract a specific nested value — also validates structure is present
xidel path/to/example.xml \
  --extract "<x><foo>ood</foo><bar>{.}</bar></x>"

Output:

the bar value

Follow links & crawl#

--follow takes an XPath or CSS expression that selects URLs, fetches each one, and applies the --extract expression to the resulting pages. This turns xidel into a single-command crawler — useful for scraping paginated sites or downloading all assets linked from a page.

# Follow all <a> tags on a page and print each linked page's title
xidel https://example.org --follow //a --extract //title

Output:

Example Domain
Example - Getting Started
Example - API Reference
Example - Changelog

# Follow Google result links, print titles, download pages into host-named dirs
xidel "https://www.google.com/search?q=test" \
  --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" \
  --extract //title \
  --download '{$host}/'

Output:

Test - Wikipedia
Software testing - MDN
Pytest documentation
…

JSON APIs#

xidel parses JSON responses and exposes them as an XPath-navigable tree, so the same --extract "//field" syntax works for JSON as it does for XML. This makes it a lightweight alternative to curl | jq when the extraction logic is straightforward.

# Extract a field from a JSON API response
xidel https://api.github.com/repos/octocat/Hello-World --extract "//name"

Output:

Hello-World

# Extract a nested field from every item with the same XPath-style syntax
xidel https://api.example.com/data.json --extract "//items/title"

Output:

First Article
Second Article
Third Article

Structured output from RSS / Atom#

RSS and Atom feeds are well-formed XML, making them ideal targets for xidel’s template syntax. Named variable assignments (field:=.) let you pair related values from different elements in a single pass, producing structured records rather than a flat list of values.

# Extract title + URL from every Stack Overflow question in the RSS feed
xidel http://stackoverflow.com/feeds \
  --extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"

Output:

title: How do I reverse a list in Python?
uri: https://stackoverflow.com/questions/3940128/

title: Difference between append and extend in Python
uri: https://stackoverflow.com/questions/252703/

title: How to check if a file exists in Python?
uri: https://stackoverflow.com/questions/82831/
…

The + at the end means “repeat this pattern one or more times.” Named variables (title:=, uri:=) pair related fields.

Form automation & login#

xidel can submit HTML forms by wrapping a CSS selector for the form element with form() and a dictionary of field values. It maintains cookies across requests, enabling login flows where subsequent --follow and --extract calls run as the authenticated session.

# Log in to Reddit and check unread mail count
# Combines CSS selectors, XPath, JSONiq, and form evaluation
xidel https://reddit.com \
  --follow "form(css('form.login-form')[1], {'user': 'myuser', 'passwd': 'mypassword'})" \
  --extract "css('#mail')/@title"

Output:

3 messages

Output formats#

--output-format controls how extracted values are serialized. The default adhoc prints one result per line, json produces a JSON array suitable for piping into jq, and xml wraps results in a <result> element. Choose json or xml when passing xidel output to another tool that expects structured data.

# Output as JSON array
xidel https://example.org --extract "//a/@href" --output-format json

Output:

["\/","\/about","\/contact","https:\/\/docs.example.org\/","https:\/\/github.com\/example\/repo"]

# Output as XML
xidel https://example.org --extract "//a" --output-format xml

Output:

<result>
  <a href="/">Home</a>
  <a href="/about">About</a>
  <a href="/contact">Contact</a>
</result>

# Wrap each result on its own line (default)
xidel https://example.org --extract "//a/@href" --output-format adhoc

Output:

/
/about
/contact
https://docs.example.org/
https://github.com/example/repo

Query language comparison#

Query	XPath	CSS	Pattern
All links	`//a`	`css('a')`	`<a>{.}</a>*`
Link href	`//a/@href`	`css('a')` + `@href`	`<a href="{.}">`
Page title	`//title`	`css('title')`	`<title>{.}</title>`
First h1	`(//h1)[1]`	`css('h1')[1]`	—

Combine with shell pipelines#

# Save all scraped URLs to a file for aria2c batch download
xidel https://example.org/downloads --extract "//a[contains(@href,'.iso')]/@href" \
  > iso-urls.txt
aria2c --input-file=iso-urls.txt -c -d ~/Downloads

Output (contents of iso-urls.txt):

https://example.org/downloads/ubuntu-24.04-desktop-amd64.iso
https://example.org/downloads/ubuntu-24.04-server-amd64.iso
https://example.org/downloads/debian-12.5.0-amd64-netinst.iso
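
Scraped link lists often contain duplicates or relative paths that a downloader cannot use directly. The filter below is a small sketch of one way to tidy xidel's line-per-result output before handing it on; `clean_urls` is a hypothetical helper name, and it simply discards relative links rather than resolving them.

```shell
# Read URLs on stdin (one per line, as xidel's default output prints them),
# keep only absolute http(s) URLs, and drop duplicates while preserving order.
clean_urls() {
  grep -E '^https?://' | awk '!seen[$0]++'
}

# Possible usage with the command above:
#   xidel --silent https://example.org/downloads \
#     --extract "//a[contains(@href,'.iso')]/@href" | clean_urls > iso-urls.txt
```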

XPath 3.0 essentials#

XPath 3.0 is the query language xidel uses by default; it extends XPath 2.0 with higher-order functions, inline function expressions, the simple map operator (!), and the string concatenation operator (||). Map and array types belong to XPath 3.1, which the 0.9.9 preview builds target. Understanding the four primary building blocks (path steps, predicates, axes, and functions) turns xidel into a precise surgical tool rather than a guess-and-check scraper.
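
To make the XPath 3.0 additions concrete, here is a hedged sketch of two wrapper functions. The function names are hypothetical and page.html stands for any local file or URL; the expressions use the standard simple map operator (!), the string concatenation operator (||), and the higher-order function for-each() with an inline function.

```shell
# List every link as "href -> text".
#   !   simple map: evaluate the right-hand side once per item on the left
#   ||  string concatenation
# $1 is a URL or a local file.
link_table() {
  xidel --silent "$1" --extract '//a ! (@href || " -> " || normalize-space(.))'
}

# Upper-case every h2 using for-each() with an inline function expression.
shout_headings() {
  xidel --silent "$1" --extract 'for-each(//h2, function($h) { upper-case($h) })'
}
```

Possible usage: `link_table page.html` or `shout_headings https://example.org`. The expressions are single-quoted so the shell does not try to expand `$h`.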

Path expressions#

A path expression is a sequence of steps separated by / (one level down) or // (any descendant), where each step yields a sequence of nodes that the next step traverses. The leading / anchors at the document root; an unanchored //foo finds every foo element anywhere in the tree.

xidel page.html --extract "/html/body//p"        # absolute path to paragraphs
xidel page.html --extract "//div//span"          # spans nested under any div
xidel page.html --extract "//article/h1"         # direct h1 children of article
xidel page.html --extract "//*[@id='main']"      # any element with id="main"

Output:

Welcome to the article
Section one heading
A paragraph in section one
Section two heading

Predicates#

A predicate is a […] filter appended to a step that keeps only nodes matching its condition. Numeric predicates select positionally ([1] is first, [last()] is last); boolean predicates filter by attribute, text content, or function results. Inside a path step the position is counted among the matching children of each parent, so //a[1] returns the first link under every parent element; wrap the path in parentheses, as in (//a)[1], to index into the whole result.

# First link and last link in the document, then links with class "external"
xidel page.html --extract "(//a)[1]"
xidel page.html --extract "(//a)[last()]"
xidel page.html --extract "//a[@class='external']"
xidel page.html --extract "//a[contains(@href, 'github')]"
xidel page.html --extract "//li[position() <= 3]"     # first three items of each list
xidel page.html --extract "//tr[td[3] > 100]"         # rows where col 3 > 100

Output:

https://example.org/page1
https://github.com/example/repo
Item one
Item two
Item three

Axes#

An axis defines the direction of traversal from a context node — most queries use the default child:: (implicit before each step) but explicit axes unlock parent/sibling/ancestor lookups. Use parent::, following-sibling::, and ancestor:: when CSS selectors hit a dead end.

# Parent of the first h1
xidel page.html --extract "//h1[1]/parent::*"

# All <p> siblings that follow an h2 (this does not stop at the next heading)
xidel page.html --extract "//h2[1]/following-sibling::p"

# Ancestor div of an element
xidel page.html --extract "//a[@id='link']/ancestor::div[1]"

# All paragraphs that come before the first table in document order
xidel page.html --extract "//table[1]/preceding::p"

Output:

The first paragraph after section one
The second paragraph after section one

XPath functions#

XPath 3.0 ships with a rich function library covering strings, numbers, sequences, and dates. The most-used in scraping are normalize-space(), tokenize(), matches(), substring-before/after(), lower-case(), and concat().

xidel page.html --extract "normalize-space(//h1)"
xidel page.html --extract "lower-case(//title)"
xidel page.html --extract "tokenize(//meta[@name='keywords']/@content, ',\s*')"
xidel page.html --extract "//a[matches(@href, '^https://github\.com')]/@href"
xidel page.html --extract "substring-after(//meta[@property='og:url']/@content, '://')"
xidel page.html --extract "concat(//h1, ' — ', //h2[1])"

Output:

welcome to example
linux, cli, tutorial
https://github.com/example/repo
example.org/article/123
Welcome — Getting Started

CSS selector deep dive#

xidel’s CSS engine implements most of Selectors Level 3 (descendant, child, attribute, pseudo-class), invoked via --css for whole-document selection or inside an expression with css('…') for use alongside XPath. Selectors are less expressive than XPath: they cannot walk up the tree and have no built-in string functions, but they are usually shorter for common scraping queries.

# Descendant, child, adjacent sibling
xidel page.html --css "article p"          # descendant
xidel page.html --css "article > p"        # direct child
xidel page.html --css "h2 + p"             # first paragraph after each h2

# Attribute selectors
xidel page.html --css "a[href^='https://']"     # starts-with
xidel page.html --css "a[href$='.pdf']"          # ends-with
xidel page.html --css "a[href*='github']"        # contains
xidel page.html --css "input[type='hidden']"     # exact match

# Pseudo-classes
xidel page.html --css "li:first-child"
xidel page.html --css "li:nth-child(odd)"
xidel page.html --css "tr:not(.header)"

Output:

First paragraph in the article
Direct child paragraph
https://example.org/intro.pdf
https://github.com/example/repo

Mixing CSS and XPath#

css('selector') is a function inside any expression — combine it with XPath axes when CSS picks the starting element but you need to walk relatives.

# Use CSS to find articles, then XPath to grab their first paragraph
xidel page.html --extract "css('article')/p[1]"

# CSS-selected nav links followed by parent <li> for context
xidel page.html --extract "css('nav a')/parent::li"

Output:

First paragraph from article one
First paragraph from article two

JSON / JSONiq queries#

When xidel reads JSON, the document becomes a tree of maps (objects) and arrays accessible by both XPath-flavoured //field syntax and JSONiq’s dot/bracket notation. JSONiq is a JSON query language modelled on XQuery (an independent specification, not a W3C standard) and is closer in feel to jq, while XPath syntax is consistent with everything else xidel does. Pick whichever reads better for the task.

# Plain XPath against JSON
xidel api.json --extract "//items/title"
xidel api.json --extract "//user/email"

# JSONiq dot/bracket — strongly recommended for nested arrays
xidel api.json --extract '$json.items[].title'
xidel api.json --extract '$json.items()[$$.score > 80].name'
xidel api.json --extract 'count($json.items())'

# Multiple fields paired
xidel api.json --extract '$json.items()!{name: ., score: .score}'

Output:

First item
Second item
Third item
3

When to prefer xidel over jq for JSON#

jq is the default JSON tool and is faster for pure JSON pipelines; xidel becomes attractive when the same scraping job mixes HTML/XML pages and JSON APIs, when you want a single expression language across all formats, or when you need to follow links discovered inside JSON responses.

# Read a JSON index, follow each item's URL, scrape the HTML title
xidel https://api.example.com/articles.json \
  --follow '$json.articles().url' \
  --extract "//title"

Output:

First Article Title
Second Article Title
Third Article Title

Variables and bindings#

--variable NAME=value (or -v short form) binds a variable accessible as $NAME inside any expression — useful for parameterising base URLs, query strings, or output prefixes without rewriting the expression. Variables can also be JSON-encoded objects for richer data passing.

# Pass a base URL into the extract expression
# (single quotes stop the shell from expanding $host before xidel sees it)
xidel "https://example.org/page" \
  --variable "host=example.org" \
  --extract 'concat($host, ": ", //title)'

# Multiple variables
xidel page.html \
  -v "year=2026" \
  -v "section=docs" \
  --extract 'concat($section, "/", $year, "/", //h1)'

# JSON-typed variable
xidel page.html \
  --var-json 'filters={"min": 10, "max": 100}' \
  --extract '//item[price >= $filters.min and price <= $filters.max]'

Output:

example.org: Welcome
docs/2026/Getting Started

Recursive crawling with --follow#

--follow accepts the same selectors as --extract but uses each result as a URL to fetch. Add --follow-level N to cap recursion depth, and --follow-from to control which page the follow expression runs against. xidel maintains its own visited-URL set, so cycles are detected automatically.

# Crawl two levels deep, extract every page's title
xidel https://example.org \
  --follow "//a[contains(@href, 'example.org')]/@href" \
  --follow-level 2 \
  --extract "//title"

# Stay within one domain by filtering the links in the follow expression
xidel https://example.org \
  --follow "//a[starts-with(@href, '/') or contains(@href, '://example.org')]/@href" \
  --extract "//h1"

# Follow only Atom feed entries
xidel https://example.org/feed.atom \
  --follow "//entry/link/@href" \
  --extract "//article/h1"

Output:

Welcome to Example
Example - About
Example - Documentation
Example - API Reference
Example - Changelog
Example - Contributing

HTTP options — headers, cookies, auth#

xidel exposes curl-like HTTP options: user-agent, cookie jar, custom headers, basic auth, POST data, and proxy flags. Use these when the target site differentiates between browsers, requires login, or rate-limits anonymous traffic.

# Custom user-agent (some sites block default xidel UA)
xidel https://example.org \
  --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36" \
  --extract "//h1"

# Persistent cookie jar across runs
xidel https://example.org/dashboard \
  --cookie-jar ~/.cache/xidel-cookies.txt \
  --extract "//span[@class='username']"

# Custom header (API tokens)
xidel "https://api.example.com/v1/me" \
  --header "Authorization: Bearer YOUR_TOKEN" \
  --extract "//email"

# POST a form body
xidel https://example.org/login \
  --post "user=alicedev&pass=secret" \
  --extract "//div[@id='status']"

# Basic auth
xidel "https://admin.example.org/stats" \
  --user "alicedev:secret" \
  --extract "//table//td"

Output:

Welcome
alicedev
alice@example.com
Logged in successfully
42 active sessions

Local files and standard input#

xidel accepts file paths and - for stdin as readily as URLs. This makes it scriptable against archived pages, downloaded snapshots, and pipe-driven workflows where curl or wget handles the fetching and xidel handles the parsing.

# Local file
xidel path/to/example.html --extract "//title"

# stdin
curl -s https://example.org | xidel - --extract "//title"

# Glob across many local files
xidel ./archive/*.html --extract "//h1"

# Pipe gzipped HTML through gunzip
gunzip -c snapshot.html.gz | xidel - --extract "//meta[@property='og:title']/@content"

Output:

Example Domain
Welcome to Example
Page Title from Archived Snapshot

Comparison with sibling tools#

xidel sits at the intersection of HTML parsing (htmlq, pup), XML query (xmllint), JSON query (jq), and the newer format-agnostic crop (dasel, yq). The table below shows where each tool wins; reach for xidel when one job spans more than one of these formats and needs XPath/XQuery expressiveness.

Need	Best tool	Why
Pure JSON pipelines	`jq`	Smaller, faster, ubiquitous
CSS-selector-only HTML extraction	`htmlq` (Rust) or `pup` (Go)	Lighter syntax, focused scope, single static binary
Cross-format query/edit (JSON + YAML + TOML + XML + CSV) without XPath	`dasel`	One path-style selector across all formats; small Go binary
XML schema validation, namespaces	`xmllint`	Mature XML stack, XSD support
XPath against HTML	`xidel`	Tolerates malformed HTML, XPath 3.0 (3.1 in 0.9.9 dev)
Mixed HTML + JSON crawl in one expression	`xidel`	One query language, built-in `--follow`
Login flows with cookies	`xidel`	`--cookie-jar` + form submission built-in
JavaScript-rendered pages	Playwright / Puppeteer	xidel does not execute JS

When to pick something else#

You only need CSS selectors on HTML — htmlq or pup start faster, have shorter syntax, and are easier to install on minimal images. pup went dormant for a stretch; htmlq is the more actively maintained Rust replacement.
You’re slicing JSON/YAML/TOML/XML configs with no HTML in the loop — dasel gives one path syntax across all of them and is a static Go binary with no Pascal runtime. xidel’s XPath wins as soon as the structure gets deep or you need predicates / functions.
You want pure-XML rigor (namespaces, XSD, XSLT) — xmllint and xmlstarlet remain the canonical choice.

[!NOTE] xidel does not run JavaScript — single-page apps that render content client-side return empty results. For those, render with a headless browser first (Playwright, Puppeteer, or chromium --dump-dom) and pipe the rendered HTML into xidel.
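
The render-then-parse approach in the note can be sketched as a small wrapper. This is an illustration under assumptions: `render_and_extract` is a hypothetical name, and the browser binary is assumed to be called `chromium` (on some systems it is `chromium-browser` or `google-chrome`) and to accept --headless together with --dump-dom.

```shell
# Render a JavaScript-heavy page with headless Chromium, then query the rendered DOM.
# $1 = URL, $2 = XPath expression.
# Assumption: the browser binary is named `chromium` on this system.
render_and_extract() {
  chromium --headless --dump-dom "$1" 2>/dev/null \
    | xidel --silent - --extract "$2"
}
```

Possible usage: `render_and_extract "https://example.org/app" '//h1'`. The browser's stderr is discarded so that only xidel's results reach the terminal.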

0.9.9 dev-build extras#

The 0.9.9 preview adds a handful of conveniences worth knowing about if you grab the development binary from videlibri.de. They fix small annoyances in 0.9.8 — especially around visible-text extraction and in-place file edits — without changing day-to-day query syntax.

# inner-text: skip <script>/<style>/hidden, collapse whitespace,
# return what a browser actually renders
xidel article.html --extract "inner-text(//article)"

# --json-mode picks the JSON dialect explicitly
xidel api.json --json-mode=xpath  --extract '?items?*?title'   # XPath 3.1 syntax
xidel api.json --json-mode=jsoniq --extract '$json.items().title'

# --in-place overwrites the input file with the result —
# useful for batch rewrites of local snapshots
xidel ./snapshots/*.html --in-place --extract "//main"

# matched-text exposes what a pattern match actually captured
xidel page.html --extract "<a href='{link:=.}'>{matched-text()}</a>*"

# x:request-decode parses an application/x-www-form-urlencoded body
xidel - --extract "x:request-decode('a=1&b=hello%20world')"

Output:

Welcome to the article.
First paragraph in context.
…

{"a": "1", "b": "hello world"}

Recipes#

A grab bag of complete one-liners that combine the features above for real scraping problems. Each recipe is copy-paste runnable against a site that exposes the relevant markup.

Extract OpenGraph metadata#

xidel "https://example.org/article" --extract '
  {
    "title":       //meta[@property="og:title"]/@content,
    "description": //meta[@property="og:description"]/@content,
    "image":       //meta[@property="og:image"]/@content,
    "url":         //meta[@property="og:url"]/@content
  }
' --output-format json

Output:

{
  "title": "How to scrape with xidel",
  "description": "A practical walkthrough of XPath and CSS selectors.",
  "image": "https://example.org/images/og-card.png",
  "url": "https://example.org/article"
}

Sitemap → flat URL list#

xidel "https://example.org/sitemap.xml" --extract "//*[local-name()='loc']"

Output:

https://example.org/
https://example.org/about
https://example.org/blog/post-1
https://example.org/blog/post-2

RSS feed → JSON#

xidel "https://example.org/feed.rss" \
  --extract '//item ! { "title": title, "link": link, "date": pubDate }' \
  --output-format json

Output:

[
  { "title": "New release 1.4", "link": "https://example.org/blog/1.4", "date": "Wed, 22 Apr 2026 12:00:00 GMT" },
  { "title": "Roadmap update",   "link": "https://example.org/blog/roadmap", "date": "Mon, 15 Apr 2026 09:30:00 GMT" }
]

Crawl a blog index and dump articles to JSON#

xidel "https://example.org/blog" \
  --follow "//article//a/@href" \
  --extract '{
    "title":   //h1,
    "author":  //meta[@name="author"]/@content,
    "date":    //time/@datetime,
    "content": string-join(//article//p, "\n\n")
  }' --output-format json

Output:

[
  { "title": "Post One", "author": "Alice Dev", "date": "2026-04-22", "content": "First paragraph...\n\nSecond paragraph..." },
  { "title": "Post Two", "author": "Alice Dev", "date": "2026-04-15", "content": "Intro line.\n\nMore text..." }
]

Combine with jq for post-processing#

xidel "https://example.org/products" \
  --extract '//product ! { name: name, price: number(price) }' \
  --output-format json \
  | jq '[.[] | select(.price < 50)] | sort_by(.price)'

Output:

[
  { "name": "Pen", "price": 2.5 },
  { "name": "Notebook", "price": 8.99 },
  { "name": "Mug", "price": 12.0 }
]

[!TIP] Add --silent to suppress xidel’s progress and informational messages, leaving only the extracted data — useful when piping into another command or capturing in a variable.

[!TIP] If an expression returns nothing, run xidel with --printed-node-format=text-with-html-tags (or use --extract "//*") to inspect the actual parsed tree. Browser DevTools selectors sometimes target post-JavaScript DOM that xidel cannot see.
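
The two tips combine naturally in scripts: capture the result with --silent, then treat an empty result as a signal to inspect the page. The helper below is a minimal sketch with a hypothetical name, and the choice of //title is only an example.

```shell
# Capture the first <title> of a page; --silent keeps status messages
# out of the captured text. $1 is a URL or a local file.
require_title() {
  local title
  title=$(xidel --silent "$1" --extract '(//title)[1]')
  if [ -z "$title" ]; then
    # An empty result often means the content is rendered by JavaScript.
    echo "no <title> extracted from $1" >&2
    return 1
  fi
  printf '%s\n' "$title"
}
```

Possible usage: `title=$(require_title https://example.org) || exit 1`.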
