HTML and XML documents have a natural tree structure that makes querying and extracting data from them convenient and efficient. XPath is a powerful query language specifically designed for navigating and selecting nodes from XML/HTML documents.
In this comprehensive guide, we will explore using XPath for parsing and extracting data from HTML.
What is XPath?
XPath stands for XML Path Language. It is a query language for selecting nodes from an XML/HTML document. Some key features of XPath include:
- Navigation of document structure allows moving up and down an HTML tree
- Selection of nodes based on names, attributes, text content and more
- Extensible with functions to transform and filter nodesets on the fly
- Supported in all major programming languages
At its core, XPath treats an HTML document as a tree structure of nodes. It provides syntax to describe a “path” to target nodes we want to extract. For example, consider this sample HTML:
<div> <p> <b>Hello</b> world! </p> </div>
The XPath /div/p/b would select the <b> node containing the text “Hello”. XPath expressions essentially describe steps from the root of the document down into deeper nodes. The result is zero or more matching nodes, referred to as a node-set.
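As a rough illustration of the path above, here is a minimal sketch using Python's lxml library (assumed to be installed). The variable names are arbitrary, and the fragment is parsed as XML so that <div> is the document root.

```python
from lxml import etree

# Parse the sample fragment; <div> becomes the root element of the document.
root = etree.fromstring("<div> <p> <b>Hello</b> world! </p> </div>")

# Absolute path: the root <div>, then its <p> child, then that <p>'s <b> child.
matches = root.xpath("/div/p/b")
texts = [b.text for b in matches]
print(texts)  # ['Hello']
```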
XPath Syntax Overview
XPath uses path notation to navigate an XML/HTML document tree. The complete XPath syntax is quite extensive, but we will focus on the most common syntax used for parsing HTML below:
Syntax | Description | Example |
---|---|---|
/ | Selects direct child node | /html/body |
// | Selects descendant node at any level | //p |
* | Wildcard matching any element name | //*[@class] |
@ | Select attribute | //a/@href |
text() | Selects text content of node | //a/text() |
[] | Applies filter expression | //a[@class='button'] |
. | Current node | ./@href |
.. | Parent node | ../@id |
position() | Current position in nodeset | [position() < 3] |
Some key things to note about XPath syntax:
- A path starting with / is absolute and begins at the document root; any other path is relative to the current context node
- // allows selecting descendants at any level
- Filters like [@attr] and [expr] constrain matches
- Dot . refers to the current context node
- Double dot .. traverses up to the parent node
This covers the most essential parts of XPath syntax. There are many more advanced functions and syntax forms we will discover later on.
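To make the syntax table more concrete, the following sketch runs a few of those expressions with lxml. The sample markup and variable names are invented for illustration, and lxml is assumed to be available.

```python
from lxml import etree

# Small made-up document, parsed as XML for predictable structure.
doc = etree.fromstring(
    "<html><body>"
    "<a class='button' href='/buy'>Buy</a>"
    "<a href='/about'>About</a>"
    "<p id='intro'>Welcome</p>"
    "</body></html>"
)

hrefs = doc.xpath("//a/@href")                # attribute values
link_texts = doc.xpath("//a/text()")          # text nodes
buttons = doc.xpath("//a[@class='button']")   # filter by attribute value
with_class = doc.xpath("//*[@class]")         # wildcard: any element with a class
first_two = doc.xpath("/html/body/*[position() < 3]")  # first two children of <body>

print(hrefs, link_texts)
print([e.tag for e in with_class], [e.tag for e in first_two])
```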
Selecting Nodes
The core purpose of XPath is to select nodes from an HTML document. Let's look at some examples of selecting elements by tag name, attributes, and positional filters.
Element Name
The simplest XPath syntax is to specify an element name to select all matching nodes:
<div> <p>Paragraph 1</p> <p>Paragraph 2</p> </div>
The expression //p would select both <p> nodes. We can also select direct children with an absolute path like /html/body/div/p, but this is fragile to changes in nesting and order. Using // descendant selection is more robust in most cases.
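The fragility of absolute paths can be sketched as follows. This assumes lxml and two invented versions of the same page, one of which wraps the content in an extra <main> element.

```python
from lxml import etree

flat = etree.fromstring(
    "<html><body><div><p>Paragraph 1</p><p>Paragraph 2</p></div></body></html>"
)
# Same content, but the <div> is now nested inside a new <main> wrapper.
nested = etree.fromstring(
    "<html><body><main><div><p>Paragraph 1</p><p>Paragraph 2</p></div></main></body></html>"
)

absolute = "/html/body/div/p"
descendant = "//p"

print(len(flat.xpath(absolute)), len(nested.xpath(absolute)))      # 2 0
print(len(flat.xpath(descendant)), len(nested.xpath(descendant)))  # 2 2
```

The absolute path stops matching as soon as the nesting changes, while the descendant path keeps working.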
Attributes
Elements can be filtered by attribute values using the [@attr] syntax:
<a class="button">Click</a>
The expression //a[@class='button'] selects the link by its class attribute value. Some other examples:
//input[@type='text']
//div[@id='header']
//img[contains(@src,'logo')]
Attribute filters are a great way to target elements precisely.
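Here is a small sketch of those three attribute filters in lxml. The markup, file names, and variable names are made up for the example.

```python
from lxml import etree

doc = etree.fromstring(
    "<body>"
    "<div id='header'><img src='/static/logo.png'/><img src='/static/banner.png'/></div>"
    "<form><input type='text' name='q'/><input type='submit'/></form>"
    "</body>"
)

text_inputs = doc.xpath("//input[@type='text']")   # exact attribute value
header = doc.xpath("//div[@id='header']")          # element with a given id
logos = doc.xpath("//img[contains(@src, 'logo')]") # substring match on an attribute

print([i.get("name") for i in text_inputs])  # ['q']
print([img.get("src") for img in logos])     # ['/static/logo.png']
```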
Positional Filters
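When a path selects multiple nodes, positional predicates like [1] can grab a specific item from that nodeset:
<p>Paragraph 1</p> <p>Paragraph 2</p>
- //p[1] selects the first <p>
- //p[2] selects the second <p>
- //p[last()] selects the last <p>
Note that positions are counted among siblings: //p[1] matches every <p> that is the first <p> child of its parent. To take the first match in the whole document, wrap the path in parentheses, as in (//p)[1].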
Positional filters enable precise control when extracting data from lists of repeating elements.
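Positions are evaluated per parent, which is easy to overlook. The sketch below (lxml assumed, sample markup invented) compares //p[1] with the parenthesized form (//p)[1] on a document that has two groups of paragraphs.

```python
from lxml import etree

doc = etree.fromstring(
    "<body>"
    "<div><p>A1</p><p>A2</p></div>"
    "<div><p>B1</p><p>B2</p></div>"
    "</body>"
)

# First <p> child of each parent, so one match per <div>.
first_in_each = [p.text for p in doc.xpath("//p[1]")]
# First <p> in the whole document.
first_overall = [p.text for p in doc.xpath("(//p)[1]")]
# Last <p> child of each parent.
last_in_each = [p.text for p in doc.xpath("//p[last()]")]

print(first_in_each)  # ['A1', 'B1']
print(first_overall)  # ['A1']
print(last_in_each)   # ['A2', 'B2']
```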
Putting It Together
XPath selectors generally combine element names, attributes, and filters together to pinpoint target nodes precisely:
<ul class="products"> <li class="product">Product 1</li> <li class="product">Product 2</li> </ul>
//ul[@class='products']/li[@class='product'][2]
This selects the second <li> child of the unordered list with class “products”. As you can see, XPath provides many tools for precisely targeting nodes by name, attributes, position, and relationship.
Matching Text
In addition to selecting elements, we often want to match text content as well. XPath has a few options for matching text:
Contains Text
The XPath expression contains(., 'text') checks if the current node contains the given text:
<p>Hello world</p>
//p[contains(., 'world')]
This finds <p> tags with text containing “world”.
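One reason to pass . rather than text() to contains() is nested markup. In this sketch (lxml assumed, sample markup invented), . uses the full string value of the element, while text() is reduced to the element's first direct text node.

```python
from lxml import etree

doc = etree.fromstring(
    "<div>"
    "<p>Hello world</p>"
    "<p>Hello <b>world</b></p>"
    "<p>Goodbye</p>"
    "</div>"
)

# "." is the concatenated text of the element and all of its descendants.
by_string_value = doc.xpath("//p[contains(., 'world')]")
# "text()" converts to the first direct text node only ("Hello " for the second <p>).
by_first_text_node = doc.xpath("//p[contains(text(), 'world')]")

print(len(by_string_value), len(by_first_text_node))  # 2 1
```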
Exact Text
We can match exact inner text using:
<p>Hello world</p>
//p[text()='Hello world']
The text() node test selects the text nodes that are direct children of an element, so this matches a <p> whose text is exactly “Hello world”.
Normalizing Whitespace
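When matching text, be aware that browsers collapse whitespace when rendering a page, but XPath compares the text exactly as it appears in the source. If the <p> in the source contained extra whitespace, such as indentation, a line break, or repeated spaces between the words, the expression:
//p[text()='Hello world']
would not match it. To match regardless of whitespace, either normalize spaces in the text content before matching or use XPath's normalize-space() function:
//p[normalize-space(text())='Hello world']
This trims leading and trailing whitespace and collapses internal runs of whitespace into single spaces before comparing, so the match succeeds despite whitespace differences.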
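The difference between the exact and the normalized comparison can be sketched with lxml as follows. The second paragraph is an invented example of markup containing a line break and repeated spaces.

```python
from lxml import etree

doc = etree.fromstring(
    "<div>"
    "<p>Hello world</p>"
    "<p>\n  Hello   world\n</p>"
    "</div>"
)

# Exact comparison only matches the paragraph without extra whitespace.
exact = doc.xpath("//p[text()='Hello world']")
# normalize-space() trims the ends and collapses internal whitespace runs.
normalized = doc.xpath("//p[normalize-space(text())='Hello world']")

print(len(exact), len(normalized))  # 1 2
```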
Accessing Attributes
In addition to selecting elements, we often want to extract attributes such as src, href, and alt from nodes:
<a href="page.html">Link</a>
To get an attribute, use the @attr syntax:
//a/@href
This would return the href attribute value, in this case "page.html". Some other examples:
//img/@src
//div/@id
This syntax gives access to element attributes for extraction.
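In lxml, an attribute path returns a list of strings, one per matching attribute. A minimal sketch, with invented markup and variable names:

```python
from lxml import etree

doc = etree.fromstring(
    "<div id='nav'>"
    "<a href='page.html'>Link</a>"
    "<img src='logo.png' alt='Logo'/>"
    "</div>"
)

hrefs = doc.xpath("//a/@href")
srcs = doc.xpath("//img/@src")
ids = doc.xpath("//div/@id")

# The result is always a list, so guard against the "no match" case.
first_href = hrefs[0] if hrefs else None
print(first_href, srcs, ids)
```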
Beyond Basic Selections
Up until now, we've covered the core syntax of XPath for selecting elements and attributes. However, XPath is capable of much more:
- Navigation of ancestor, sibling and relative nodes
- Extensible functions to transform nodesets
- Conditionals, loops, and other programming logic (in XPath 2.0 and later)
- Integration with host language variables and parameters
Now that we understand the basics, let's look at some more advanced yet common XPath usage patterns.
Navigating the Tree
In addition to parent/child and descendant navigation, XPath provides syntax for easily traversing between ancestor, sibling, and other relative nodes.
Ancestors
The .. selector travels up and selects the parent node:
<body> <div> <p>Paragraph</p> </div> </body>
//div/p/.. selects the <div> parent element.
Chaining .. goes further up the ancestry: //div/p/../.. selects <body>.
This is useful for selecting elements based on context rather than position.
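The two parent steps above can be checked with a short lxml sketch. The <html> wrapper is added here only so the sample forms a complete document.

```python
from lxml import etree

doc = etree.fromstring(
    "<html><body><div><p>Paragraph</p></div></body></html>"
)

parent = doc.xpath("//div/p/..")          # one step up from <p>
grandparent = doc.xpath("//div/p/../..")  # two steps up from <p>

print([e.tag for e in parent])       # ['div']
print([e.tag for e in grandparent])  # ['body']
```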
Siblings
XPath can also select sibling elements:
<ul> <li>Item 1</li> <li>Item 2</li> </ul>
//li[1]/following-sibling::li[1] selects the second <li> sibling.
- following-sibling finds later siblings
- preceding-sibling finds earlier siblings
This helps target siblings in relation to others even when positions change.
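A brief lxml sketch of both sibling axes, using an invented three-item list. Note that on the preceding-sibling axis, position 1 is the nearest earlier sibling.

```python
from lxml import etree

doc = etree.fromstring(
    "<ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>"
)

# Nearest later sibling of the first item.
next_item = [li.text for li in doc.xpath("//li[1]/following-sibling::li[1]")]
# All later siblings of the first item.
later = [li.text for li in doc.xpath("//li[1]/following-sibling::li")]
# Nearest earlier sibling of the third item.
previous = [li.text for li in doc.xpath("//li[3]/preceding-sibling::li[1]")]

print(next_item)  # ['Item 2']
print(later)      # ['Item 2', 'Item 3']
print(previous)   # ['Item 2']
```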
Relative Navigation
In addition to absolute paths, . and .. can be used to create relative paths:
./@class # class of current node
../@id # id of parent node
//li/./span # span within current li
//div/..//span # span relative to div's parent
Relative navigation is useful when the exact structure is unknown.
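Relative paths are most useful when evaluated from a node you have already selected. The sketch below assumes lxml, where every element has its own xpath() method; the markup is invented.

```python
from lxml import etree

doc = etree.fromstring(
    "<ul id='menu'>"
    "<li class='item'><span>One</span></li>"
    "<li class='item'><span>Two</span></li>"
    "</ul>"
)

rows = []
for li in doc.xpath("//li"):
    rows.append((
        li.xpath("string(./@class)"),  # class of the current node
        li.xpath("string(../@id)"),    # id of the parent node
        li.xpath("string(./span)"),    # text of the span inside this li
    ))
print(rows)

# A leading // always searches the whole document, even from an element.
first_li = doc.xpath("//li")[0]
print(len(first_li.xpath("//span")), len(first_li.xpath(".//span")))  # 2 1
```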
Axes
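Along with the child and descendant axes we've used already, some other helpful axes are:
- ancestor – All ancestors of the current node (parent, grandparent, and so on) up to the root
- attribute – Attributes of the current node
- following – Nodes after the current node in document order, excluding its descendants
- preceding – Nodes before the current node in document order, excluding its ancestors
- self – The current node itself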
These offer even more approaches to navigating the tree relative to a context node.
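The axes listed above can be explored from a single context node. This is a sketch with lxml and invented markup, using the first <p> as the starting point.

```python
from lxml import etree

doc = etree.fromstring(
    "<body>"
    "<h1>Title</h1>"
    "<div id='box'><p>First</p><p>Second</p></div>"
    "<footer>End</footer>"
    "</body>"
)

p = doc.xpath("//p[1]")[0]  # context node: the "First" paragraph

ancestors = {e.tag for e in p.xpath("ancestor::*")}   # elements above the node
box_ids = doc.xpath("//div/attribute::id")            # long form of //div/@id
following = [e.tag for e in p.xpath("following::*")]  # elements after the node
preceding = [e.tag for e in p.xpath("preceding::*")]  # elements before, minus ancestors
is_p = bool(p.xpath("self::p"))                       # true if the node itself is a <p>

print(ancestors, box_ids, following, preceding, is_p)
```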
Functions
One of XPath's powerful features is extensibility via functions. Many built-in and custom functions are available for manipulating nodesets “on the fly”.
Built-in Functions
Some commonly used XPath functions include:
- text() – Text nodes of an element (strictly a node test rather than a function)
- name() – Name of node
- contains() – Whether string contains text
- substring() – Extract part of a string
- number() – Convert a value to a number
- string() – Convert a value to a string
- normalize-space() – Normalize whitespace in string
- starts-with(), ends-with() – Check string prefixes/suffixes (ends-with() requires XPath 2.0 or later)
XPath functions give great flexibility to massage content as it is extracted without having to post-process in the host language.
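As a sketch of how these functions behave in practice, the following uses lxml, which implements XPath 1.0. An expression that evaluates to a number or a string comes back as a Python float or str rather than a list. The product markup is invented.

```python
from lxml import etree

doc = etree.fromstring(
    "<ul>"
    "<li class='product'>  Widget  <span class='price'>19.99</span></li>"
    "<li class='product'>Gadget <span class='price'>5</span></li>"
    "</ul>"
)

count = doc.xpath("count(//li)")                         # float
price = doc.xpath("number(//span[@class='price'])")      # first match, as a float
child_name = doc.xpath("name(//li/*)")                   # name of the first match
tidy = doc.xpath("normalize-space(//li[1]/text())")      # trimmed first text node
prefix = doc.xpath("substring(//li[2], 1, 6)")           # first six characters
gadgets = doc.xpath("//li[starts-with(normalize-space(.), 'Gad')]")

print(count, price, child_name, tidy, prefix, len(gadgets))
```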
Custom Functions
Many XPath libraries allow registering custom functions as well. The exact API depends on the library; in pseudocode, the idea looks like this:
// Register function
xpath.register('lowercase', str => { return str.toLowerCase() })
// Use in expression
const results = xpath('//div[lowercase(@class)="header"]', { lowercase: xpath.lowercase })
This allows extending XPath's capabilities to handle any application-specific logic.
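For a concrete version of the same idea, here is a sketch using lxml's FunctionNamespace. The function name lowercase and the markup are illustrative choices, not part of any standard; lxml passes an evaluation context as the first argument, followed by the XPath arguments.

```python
from lxml import etree

def lowercase(context, value):
    # "value" arrives as a Python string because the expression wraps
    # the attribute in string() before calling the function.
    return value.lower()

# Register the function without a namespace prefix (applies process-wide).
functions = etree.FunctionNamespace(None)
functions["lowercase"] = lowercase

doc = etree.fromstring(
    "<body><div class='Header'>Top</div><div class='footer'>Bottom</div></body>"
)

matches = doc.xpath("//div[lowercase(string(@class)) = 'header']")
print([d.text for d in matches])  # ['Top']
```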
Putting It All Together
Now that we've covered different types of selections, traversal, and some functions, let's look at an example making use of these together:
<div class="comments"> <div class="comment"> <img src="user1.png"> <p>This is comment #1</p> </div> <div class="comment"> <img src="user2.png"> <p>This is comment #2</p> </div> </div>
To extract the text of each comment that follows a profile image, we can use:
//div[@class='comments']/div[@class='comment']/img/following-sibling::p/text()
Breaking this down:
- //div[@class='comments'] – Match parent container
- /div[@class='comment'] – Select each comment child
- /img – Select image within comment
- /following-sibling::p – Get paragraph after image
- /text() – Extract text content
The key is combining different types of selection and traversal to target the data we want precisely.
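The expression above can be run against the sample markup with lxml's HTML parser, which accepts the unclosed <img> tags. A minimal sketch, assuming lxml is installed:

```python
from lxml import html

page = html.fromstring(
    '<div class="comments">'
    ' <div class="comment"> <img src="user1.png"> <p>This is comment #1</p> </div>'
    ' <div class="comment"> <img src="user2.png"> <p>This is comment #2</p> </div>'
    '</div>'
)

comments = page.xpath(
    "//div[@class='comments']/div[@class='comment']"
    "/img/following-sibling::p/text()"
)
print(comments)  # ['This is comment #1', 'This is comment #2']
```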
Advanced Techniques
We've now covered the core foundations of XPath. Here are some more advanced techniques and tips for special use cases you may encounter.
Conditionals
XPath 2.0 and later support if/then/else conditional expressions for additional logic:
if (condition) then (expression1) else (expression2)
Some examples:
//img[ if (@src) then @src else @data-src ]
//div[ if (@class='highlight') then ./p/text() else string() ]
This allows applying different extraction logic depending on attribute values and other conditions. Note that XPath 1.0 engines, including the browser's document.evaluate(), lxml, and PHP's DOMXPath, do not support this syntax.
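On an XPath 1.0 engine, a similar "use src, else fall back to data-src" rule can often be approximated with a union and a not() predicate. This is a sketch of that workaround in lxml with invented markup; it is a swapped-in technique, not an equivalent of if/then/else in general.

```python
from lxml import etree

doc = etree.fromstring(
    "<div>"
    "<img src='a.png'/>"
    "<img data-src='b.png'/>"
    "<img src='c.png' data-src='c-lazy.png'/>"
    "</div>"
)

# Take @src where present; take @data-src only from images lacking @src.
sources = doc.xpath("//img/@src | //img[not(@src)]/@data-src")
print(sorted(sources))  # ['a.png', 'b.png', 'c.png']
```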
Loops
XPath 2.0 and later also provide iterative processing of sequences using for expressions:
for $item in (expression) return data($item)
For example:
for $product in //product return concat($product/@name, ': ', $product/@price)
This iterates over <product> nodes and returns one string per product. Building new elements such as <item> in the return clause requires XQuery, which extends XPath with element constructors. Loops enable repeating extraction logic over node lists.
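With an XPath 1.0 library, the loop usually moves into the host language: select the nodes once, then evaluate a relative expression for each. A sketch in Python with lxml, using an invented catalog:

```python
from lxml import etree

doc = etree.fromstring(
    "<catalog>"
    "<product name='Widget' price='19.99'/>"
    "<product name='Gadget' price='5.00'/>"
    "</catalog>"
)

# The Python loop plays the role of "for $product in //product return ...".
items = [
    product.xpath("concat(@name, ': ', @price)")
    for product in doc.xpath("//product")
]
print(items)  # ['Widget: 19.99', 'Gadget: 5.00']
```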
Parameters
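Many XPath libraries allow passing in external values as parameters. The exact call depends on the library; in pseudocode:
const results = xpath('//a[text()=$linkText]', { linkText: 'Next Page' })
The $param syntax integrates external data into the expression. Parameters are useful for reuse, for avoiding hard-coded values, and for avoiding quoting problems when a value contains quote characters.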
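In lxml, XPath variables are supplied as keyword arguments to xpath(). The helper name find_link and the markup below are hypothetical, chosen only for this sketch.

```python
from lxml import etree

doc = etree.fromstring(
    "<nav>"
    "<a href='/p/1'>Previous Page</a>"
    "<a href='/p/3'>Next Page</a>"
    "</nav>"
)

def find_link(tree, link_text):
    # $link_text is bound from the keyword argument, so the value is never
    # spliced into the expression string and needs no manual quoting.
    return tree.xpath("//a[text()=$link_text]/@href", link_text=link_text)

print(find_link(doc, "Next Page"))   # ['/p/3']
print(find_link(doc, "It's \"x\""))  # []
```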
Regular Expressions
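In XPath 2.0 and later, the matches() function allows matching text via regex:
//a[matches(text(), '\d+')]
finds links whose text contains digits. XPath 1.0 has no built-in regex support, although some libraries offer extensions (lxml, for instance, supports EXSLT regular expressions).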
Regular expressions provide powerful text parsing.
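Since lxml is an XPath 1.0 engine, matches() is not available there, but its EXSLT regular-expression extension offers a comparable re:test() function. A sketch with invented link texts:

```python
from lxml import etree

doc = etree.fromstring(
    "<div><a>Page 2</a><a>Next</a><a>2024 archive</a></div>"
)

# The "re" prefix must be bound to the EXSLT regular-expressions namespace.
REGEX_NS = {"re": "http://exslt.org/regular-expressions"}

numbered = doc.xpath(r"//a[re:test(text(), '\d+')]", namespaces=REGEX_NS)
print([a.text for a in numbered])  # ['Page 2', '2024 archive']
```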
Namespaces
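Namespaces are a consideration for XML documents, including XHTML. In XPath 1.0, an unprefixed name test such as //div only matches elements in no namespace, so it will not find namespaced elements (the * wildcard, by contrast, matches elements in any namespace). Bind a prefix to the namespace URI through your XPath library and use that prefix in the path to match namespaced elements: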
/xhtml:html/xhtml:body//xhtml:div
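How the prefix gets bound depends on the library. In lxml it is passed through the namespaces argument, as in this sketch; the prefix name xhtml is an arbitrary choice, and only the URI has to match the document.

```python
from lxml import etree

doc = etree.fromstring(
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<body><div>Content</div></body>"
    "</html>"
)

ns = {"xhtml": "http://www.w3.org/1999/xhtml"}

unprefixed = doc.xpath("//div")  # no match: <div> is in the XHTML namespace
prefixed = doc.xpath("/xhtml:html/xhtml:body//xhtml:div", namespaces=ns)
any_element = doc.xpath("//*")   # the wildcard matches namespaced elements too

print(len(unprefixed), len(prefixed), len(any_element))  # 0 1 3
```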
Using XPath in Code
Now that we understand XPath querying, let's look at how it can be used in real code. XPath is supported in all major programming languages either natively or via common libraries.
XPath in JavaScript
JavaScript supports XPath via the DOM document.evaluate() method:
const xpath = '//a[@class="highlight"]'
const nodes = document.evaluate(xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)
This evaluates the XPath and returns an XPathResult snapshot; read the matched nodes with nodes.snapshotLength and nodes.snapshotItem(i).
XPath in Python
For Python, the lxml library provides excellent XPath support:
from lxml import html
tree = html.parse('page.html')
links = tree.xpath('//a[@class="highlight"]')
xpath() returns a list of matched elements to extract data from.
XPath in PHP
PHP also includes XPath capabilities:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[@class="highlight"]');
The DOMXPath::query() method evaluates XPath expressions.
Other Languages
Most other languages have XPath libraries as well, so no matter your language, there are robust XPath processing options available.
Tips and Best Practices
Here are some key tips to follow when using XPath for HTML parsing:
- Favor // descendant over / child axes for more resilient selectors
- Leverage attributes over positional indexes when possible
- Use context nodes like .. and . for relative selectors
- Learn to use sibling, ancestry, and reverse axes over complex paths
- Become familiar with key string, numeric and boolean functions
- Use an XPath tester tool to build and test queries interactively
- Beware of performance issues with complex expressions impacting large documents
Mastering XPath for HTML parsing takes learning and practice. An excellent approach for learning is to test XPath expressions in the browser console on live pages using document.evaluate(). This provides fast feedback on selectors as you iterate.
Conclusion
XPath takes time to master but pays off in providing resilient data extraction from HTML and XML. For parsing web content, it pairs extremely well with tools like lxml in Python and the DOM's document.evaluate() in JavaScript to reliably target and extract relevant information. I hope this guide provides a solid foundation for using XPath in your own projects.