XPath is one of the most powerful tools for parsing and extracting data from HTML and XML documents. It allows you to navigate through element trees and select nodes based on various criteria. In web scraping, XPath is commonly used for robustly extracting data from pages. It can handle complex, deeply nested markup that is difficult to parse reliably with basic regex matching.
This ultimate cheatsheet aims to be a comprehensive reference for using XPath in the context of web scraping. It covers all the key features, functions, and usage patterns you need to master XPath parsing.
XPath Basics
Before diving into the details, let's understand the basics of XPath syntax and how it can be used for web scraping tasks.
What is XPath?
XPath stands for XML Path Language. It is a query language for selecting nodes from XML documents. While HTML is not strictly XML, HTML documents parse into the same kind of node tree, so XPath can be used to query HTML as well.
The primary purpose of XPath is to address elements within a document for retrieval and manipulation. It models an XML or HTML document as a tree of nodes that can be traversed to find specific pieces of data.
XPath Syntax
XPath uses path expressions to select nodes or node-sets in an XML/HTML document. A node is selected by following a path of steps. A location path is a sequence of steps separated by `/` or `//`:

step/step/...

Each step consists of a node name test, optionally filtered by predicates:

node[predicate]
Some examples of simple XPath expressions:

- `/html/body/div` – selects all `div` elements under `body`
- `//div[@id='main']` – `div` with `id='main'` anywhere in the document
- `//a[starts-with(@href, 'category')]` – `a` links whose `href` starts with 'category'
Using XPath for Web Scraping
XPath is supported in all major Python web scraping libraries:
lxml
from lxml import html

doc = html.fromstring(page_html)
products = doc.xpath('//div[@class="product"]')
scrapy
response.xpath('//h1/text()')
parsel
from parsel import Selector

sel = Selector(text=page_html)
sel.xpath('//title/text()')
So XPath can be used for scraping without needing to touch regex!
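As a minimal end-to-end sketch (the URL and class name here are hypothetical, and the requests library is assumed), a scrape with lxml might look like:

```python
import requests
from lxml import html

# Hypothetical URL and markup, purely for illustration
page_html = requests.get('https://example.com/products').text
doc = html.fromstring(page_html)

# Extract the text of each product heading
titles = doc.xpath('//div[@class="product"]/h2/text()')
print(titles)
```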
XPath Versions
There are currently 3 main XPath versions:
- XPath 1.0 – Recommended version for web scraping. Offers wide support and all the necessary features.
- XPath 2.0 – Adds some helpful functions but not fully supported.
- XPath 3.0 – Mainly adds new data types such as maps and arrays.
This guide focuses on XPath 1.0 and 2.0 features that are useful for web scraping. Now let's explore how to use XPath for some common web scraping tasks.
Selecting Nodes by Name
The most basic way to select nodes is by their element name:
<h1>Heading</h1>
<div>Content</div>
To select the `h1` element:

//h1

And to select all `div` elements:

//div
Name wildcards
You can use the `*` wildcard to select any element name:

//*
This allows you to select all elements under a section without knowing the exact tags.
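For example, a small sketch with lxml (using an inline XML fragment for brevity):

```python
from lxml import etree

doc = etree.fromstring('<div><p>One</p><span>Two</span></div>')

# * matches any element name directly under the div
print([el.tag for el in doc.xpath('//div/*')])  # ['p', 'span']
```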
Matching by Attribute
Elements can be selected based on attributes using:
//div[@class='product']

This will select `div` elements with a `class` attribute equal to 'product'.
Attribute operators
Attribute values can also be filtered using comparison operators:
//div[@id!='footer']
//a[starts-with(@href, 'category')]
Some useful operators are `=`, `!=`, `>`, `>=`, `<`, `<=` for numeric values.
Attribute patterns
The `contains()` function allows partial matching on attribute values:
//div[contains(@class, 'product')]
This is useful when the exact values are unpredictable.
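A short sketch of both styles of attribute matching, using lxml on an inline fragment:

```python
from lxml import etree

doc = etree.fromstring(
    '<main><div class="product featured">A</div>'
    '<div class="product">B</div></main>'
)

# Exact match only finds the second div; contains() finds both
print([d.text for d in doc.xpath("//div[@class='product']")])             # ['B']
print([d.text for d in doc.xpath("//div[contains(@class, 'product')]")])  # ['A', 'B']
```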
Selecting by Position
Elements can be selected based on their position using square brackets:
<div>
  <p>First</p>
  <p>Second</p>
  <p>Third</p>
</div>

- `//p[1]` – first `p` element
- `//p[2]` – second `p` element
- `//p[last()]` – last `p` element
The `position()` function also returns the position index:
//p[position() > 1]
This selects `p` elements after the first.
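A quick positional sketch with lxml, including one common gotcha:

```python
from lxml import etree

doc = etree.fromstring('<div><p>First</p><p>Second</p><p>Third</p></div>')

print(doc.xpath('//p[last()]/text()'))          # ['Third']
print(doc.xpath('//p[position() > 1]/text()'))  # ['Second', 'Third']

# Gotcha: //p[1] means "first p within each parent", so it can match
# several nodes; use (//p)[1] for the first p in the whole document.
```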
Matching Text Values
The `text()` node test returns the inner text of elements:
<div>
  <p>Hello World</p>
</div>

//p/text()

This will return "Hello World".
You can then match based on text values using functions like:
//p[contains(text(), 'Hello')]
This will select `p` elements containing the text "Hello".
Some other useful text functions are:

- `starts-with(text(), 'Hello')`
- `ends-with(text(), 'World')` – XPath 2.0 only
- `matches(text(), '\d+')` – regex match, XPath 2.0 only
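A small sketch with lxml; note that lxml implements XPath 1.0, so `ends-with()` and `matches()` are not available there:

```python
from lxml import etree

doc = etree.fromstring('<div><p>Hello World</p><p>Goodbye</p></div>')

# contains() and starts-with() are XPath 1.0 and work everywhere
print(doc.xpath("//p[contains(text(), 'Hello')]/text()"))    # ['Hello World']
print(doc.xpath("//p[starts-with(text(), 'Good')]/text()"))  # ['Goodbye']
```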
Selecting Child and Descendant Elements
The `/` character selects direct children of elements:
<div>
  <p>Child</p>
  <span>
    <p>Grandchild</p>
  </span>
</div>

- `/div/p` will select the direct `p` child
- `/div/span/p` will select the nested `p`
The `//` characters select any descendant element:
//div//p
This will select both `p` elements regardless of nesting depth.
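Both behaviors in a short lxml sketch:

```python
from lxml import etree

doc = etree.fromstring(
    '<div><p>Child</p><span><p>Grandchild</p></span></div>'
)

print(doc.xpath('/div/p/text()'))    # ['Child'] (direct child only)
print(doc.xpath('//div//p/text()'))  # ['Child', 'Grandchild'] (any depth)
```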
Navigating to Parent and Ancestor Nodes
The `..` selector navigates to the parent node:

<div>
  <p>
    <span>Target</span>
  </p>
</div>

//span/..

This selects the parent `p` element.
The `ancestor` axis selects any ancestor node:
//span/ancestor::div
This will select the `div` element. Some other useful axes are:

- `ancestor-or-self` – ancestors and the current node
- `descendant-or-self` – descendants and the current node
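A brief lxml sketch of parent and ancestor navigation:

```python
from lxml import etree

doc = etree.fromstring('<div><p><span>Target</span></p></div>')

print(doc.xpath('//span/..')[0].tag)             # p
print(doc.xpath('//span/ancestor::div')[0].tag)  # div
```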
Selecting Sibling Elements
To select sibling elements, you can use:
/div/p[1]/following-sibling::p
This will select `p` elements following the first `p`. Some other axes for siblings are:

- `preceding-sibling` – previous siblings
- `following-sibling` – next siblings

And for navigating the rest of the document in order:

- `preceding` – all preceding elements
- `following` – all following elements
Which allows you to select elements before and after the current node.
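A short sibling-axis sketch with lxml:

```python
from lxml import etree

doc = etree.fromstring('<div><p>First</p><p>Second</p><p>Third</p></div>')

print(doc.xpath('/div/p[1]/following-sibling::p/text()'))  # ['Second', 'Third']
print(doc.xpath('/div/p[3]/preceding-sibling::p/text()'))  # ['First', 'Second']
```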
Combining Multiple Selectors
You can combine multiple XPath expressions using:
- `|` – union of expressions
- `or` – requires one expression to match
- `and` – requires all expressions to match
For example:
//div[@class='product'] | //div/h2
Will select either `div.product` elements or `h2` elements under `div`.
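The union operator in a small lxml sketch:

```python
from lxml import etree

doc = etree.fromstring(
    '<body><div class="product">A</div><div><h2>B</h2></div></body>'
)

# Union: nodes matching either expression, returned in document order
nodes = doc.xpath("//div[@class='product'] | //div/h2")
print([n.text for n in nodes])  # ['A', 'B']
```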
Filtering Nodes with Predicates
Predicates allow adding advanced criteria to a step:
/books/book[price > 30]
This will select `book` nodes whose `price` child element is greater than 30. You can also chain multiple predicates:
/books/book[price > 30][year > 2010]
Which will apply both filters.
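Chained predicates in a small lxml sketch over an XML fragment:

```python
from lxml import etree

doc = etree.fromstring(
    '<books>'
    '<book><price>25</price><year>2015</year></book>'
    '<book><price>35</price><year>2012</year></book>'
    '</books>'
)

# Both predicates must hold: price > 30 AND year > 2010
books = doc.xpath('/books/book[price > 30][year > 2010]')
print(len(books))  # 1
```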
Utility Functions
Along with the text functions we saw earlier, some other useful utility functions are:
String manipulation

- `concat(a, b, ...)` – concatenate strings
- `substring(str, start, length)` – extract a substring
- `normalize-space(str)` – normalize whitespace
Node info

- `name()` – get node name
- `count(nodes)` – count matching nodes
Conditionals

- `not(expr)` – logical NOT operator
- `true()`/`false()` – boolean functions
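A few of these utilities in action with lxml (string- and number-returning expressions come back as plain Python values):

```python
from lxml import etree

doc = etree.fromstring('<div><p>  Hello   World  </p><p>Bye</p></div>')

print(doc.xpath('count(//p)'))               # 2.0
print(doc.xpath('normalize-space(//p[1])'))  # Hello World
print(doc.xpath('name(//p[1])'))             # p
```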
Regular Expression Matching
The `matches()` function (XPath 2.0) allows matching text values against regular expressions:
//title[matches(text(), '\w+')]
This will select `title` elements with alphanumeric text. Some engines also support a `regexp:` or `re:` namespace:
//title[re:test(text(), '\w+')]
Which works similarly.
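lxml, for instance, only implements XPath 1.0, so `matches()` is unavailable there; the EXSLT regular-expressions extension it supports provides the equivalent `re:test()`:

```python
from lxml import etree

doc = etree.fromstring('<head><title>Page 42</title></head>')

# EXSLT extension: re:test(string, pattern) returns a boolean
ns = {'re': 'http://exslt.org/regular-expressions'}
print(doc.xpath(r"//title[re:test(text(), '\d+')]/text()", namespaces=ns))
# ['Page 42']
```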
Robust Matching Tips
- Avoid long complex paths like `/html/body/div/span/p`
- Prefer the `//` descendant axis for reliability
- Leverage attributes over fickle element names and nesting
- Match partial attribute values with `contains()`
- Use text functions for text node selection
- Apply `normalize-space()` when comparing text
Proper use of XPath can create very robust scrape queries resistant to minor HTML changes.
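An illustrative sketch of these tips (the markup and class name are hypothetical): a brittle absolute path versus a more resilient attribute-anchored query:

```python
from lxml import etree

doc = etree.fromstring(
    '<html><body><main><div class="price-box">'
    '<span> $19.99 </span></div></main></body></html>'
)

# Brittle: breaks as soon as any wrapper element changes
print(doc.xpath('/html/body/main/div/span/text()'))  # [' $19.99 ']

# Robust: anchored on a partial class match, whitespace normalized
print(doc.xpath("normalize-space(//div[contains(@class, 'price')]/span)"))
# $19.99
```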
Example Patterns
Here are some examples of common XPath scraping patterns:
Get text from complex HTML
<div class="description"> Call <span class="phone">(888) 888-8888</span> to order! </div>
//div[@class='description']//text()
Will extract all the text regardless of nesting.
Filter listings by price
<div class="product" data-price="39.99">...</div> <div class="product" data-price="99.99">...</div>
//div[@class="product"][number(data-price) < 40]
Converts to a number and filters by price.
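A runnable sketch of this filter with lxml (note the `@` prefix, since `data-price` is an attribute):

```python
from lxml import etree

doc = etree.fromstring(
    '<main><div class="product" data-price="39.99">A</div>'
    '<div class="product" data-price="99.99">B</div></main>'
)

# number() converts the attribute string for numeric comparison
cheap = doc.xpath('//div[@class="product"][number(@data-price) < 40]')
print([d.text for d in cheap])  # ['A']
```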
Select row cells from tables
<table>
  <tr><td>Cell 1</td><td>Cell 2</td></tr>
</table>
//tr/td[position() = 2]
Gets the 2nd cell value from rows.
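The same pattern over a two-row table, as an lxml sketch:

```python
from lxml import etree

doc = etree.fromstring(
    '<table>'
    '<tr><td>Cell 1</td><td>Cell 2</td></tr>'
    '<tr><td>Cell 3</td><td>Cell 4</td></tr>'
    '</table>'
)

print(doc.xpath('//tr/td[position() = 2]/text()'))  # ['Cell 2', 'Cell 4']
```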
Paginated content
(//h2[text()='Products'])[1]/following::div[1]
Finds the first `div` after the first 'Products' heading.
And many more!
Take a look at the tools section for sites with more example patterns.
Comparison to CSS Selectors
CSS selectors are an alternative to XPath for HTML parsing. Some advantages of each:
CSS Selectors
- Simple and terse syntax
- Widespread browser support
- Integrated into CSS
XPath
- Full programmatic access to manipulate results
- Advanced conditional logic
- Additional helper functions
- Can query both HTML and XML sources
Here are some common conversions between CSS and XPath:
| CSS | XPath |
| --- | --- |
| `.class` | `//*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]` |
| `#id` | `//*[@id='id']` |
| `div > span` | `//div/span` |
| `div span` | `//div//span` |
| `div + p` | `//div/following-sibling::p[1]` |
| `div ~ p` | `//div/following-sibling::p` |
| `a[target]` | `//a[@target]` |
| `a[target="_blank"]` | `//a[@target='_blank']` |
So CSS selectors can typically be converted to XPath equivalents.
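If you want to automate such conversions, the cssselect package (used under the hood by lxml and parsel) translates CSS selectors into XPath. A small sketch, with output shown approximately:

```python
from cssselect import GenericTranslator

translator = GenericTranslator()

# Each call returns an equivalent XPath expression string
print(translator.css_to_xpath('div > span'))
# descendant-or-self::div/span
print(translator.css_to_xpath('a[target="_blank"]'))
# descendant-or-self::a[@target = '_blank']
```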
XPath 2.0/3.0
XPath 2.0 and 3.0 add some advanced capabilities:
XPath 2.0

- Better string and numeric handling: `lower-case()`, `upper-case()`, `ends-with()`, etc.
- Regular expression support via `matches()`
XPath 3.0

- Support for JSON
- New data types: `map`, `array`, `binary`
- `parse-xml()` to load XML documents
- `match` keyword for pattern matching
- Enhanced error handling
Most of these just provide alternative ways to do operations already possible in XPath 1.0. The major advantage is better string processing functions. Since XPath 3.0 is relatively new, it has limited availability in most scraping libraries currently. But XPath 2.0 functions are making their way in.
Tools and Resources
Here are some useful XPath tools and resources:
- XPath Tester – Test XPath expressions interactively
- XPath Helper – Another handy tester
- XML Quire – XPath/XQuery learning platform
- Scrapy Shell – Test XPath in Scrapy environment
- XPath Explorer – Windows app for XPath debugging
For documentation, the official standards are published by the W3C.
And to dive deeper, the book XPath and XPointer provides a great overview.
Conclusion
This concludes our detailed XPath cheatsheet, essential for web scraping mastery! By mastering XPath, you unlock the potential to extract virtually any piece of information from HTML and XML documents. The core skill lies in adeptly navigating and querying node trees with the path syntax. This collection of examples aims to solidify your understanding, empowering you to utilize XPath effectively in your projects. Wishing you successful data extraction!