Parsing HTML with XPath

HTML and XML documents have a natural tree structure that makes querying and extracting data from them convenient and efficient. XPath is a powerful query language specifically designed for navigating and selecting nodes from XML/HTML documents.

In this comprehensive guide, we will explore using XPath for parsing and extracting data from HTML.

What is XPath?

XPath stands for XML Path Language. It is a query language for selecting nodes from an XML/HTML document. Some key features of XPath include:

Navigation of document structure allows moving up and down an HTML tree
Selection of nodes based on names, attributes, text content and more
Extensible with functions to transform and filter nodesets on the fly
Supported in all major programming languages

At its core, XPath treats an HTML document as a tree structure of nodes. It provides syntax to describe a “path” to target nodes we want to extract. For example, consider this sample HTML:

<div>
  <p>
    <b>Hello</b> world!
  </p>
</div>

The XPath /div/p/b would select the <b> node containing the text “Hello”. XPath paths essentially describe steps from the root of the document down into deeper nodes. The result is a set of zero or more matching nodes, referred to as a nodeset.
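As a minimal sketch of evaluating that path, assuming the lxml library is installed (`pip install lxml`) and treating the sample as a small XML document whose root is the `<div>`:

```python
from lxml import etree

# The sample markup is well-formed, so it can be parsed as XML with <div> as the root.
doc = etree.fromstring("<div><p><b>Hello</b> world!</p></div>")

matches = doc.xpath("/div/p/b")       # always a list, even for a single match
print([el.tag for el in matches])     # ['b']
print(matches[0].text)                # Hello
print(repr(matches[0].tail))          # ' world!'
```

The text after the closing `</b>` belongs to the `<p>`, which lxml exposes as the `tail` of the `<b>` element.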

XPath Syntax Overview

XPath uses path notation to navigate an XML/HTML document tree. The complete XPath syntax is quite extensive, but we will focus on the most common syntax used for parsing HTML below:

Syntax	Description	Example
`/`	Selects direct child node	`/html/body`
`//`	Selects descendant node at any level	`//p`
`*`	Wildcard matching any element name	`//div/*`
`@`	Select attribute	`//a/@href`
`text()`	Selects text content of node	`//a/text()`
`[]`	Applies filter expression	`//a[@class='button']`
`.`	Current node	`./@href`
`..`	Parent node	`../@id`
`position()`	Current position in nodeset	`[position() < 3]`

Some key things to note about XPath syntax:

Paths starting with / are absolute and begin at the document root; all other paths are relative to the current context node
// allows selecting descendants at any level
Filters like [@attr] and [expr] constrain matches
Dot . refers to current node context
Double dot .. traverses up to the parent node

This covers the most essential parts of XPath syntax. There are many more advanced functions and syntax forms we will discover later on.

Selecting Nodes

The core purpose of XPath is to select nodes from an HTML document. Let's look at some examples of selecting elements by tag name, attributes, and positional filters.

Element Name

The simplest XPath syntax is to specify an element name to select all matching nodes:

<div>
  <p>Paragraph 1</p>
  <p>Paragraph 2</p>
</div>

//p would select both <p> nodes. We can also select direct children with an absolute path like /html/body/div/p, but this is fragile to changes in nesting and order. Using // descendant selection is more robust in most cases.
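The difference in robustness can be sketched with lxml (assumed installed). The second document is a made-up variant of the sample in which the content has been wrapped in one extra `<div>`:

```python
from lxml import html

original = html.fromstring(
    "<html><body><div><p>Paragraph 1</p><p>Paragraph 2</p></div></body></html>"
)
# Same content, but nested one level deeper (hypothetical redesign of the page).
wrapped = html.fromstring(
    "<html><body><div><div><p>Paragraph 1</p><p>Paragraph 2</p></div></div></body></html>"
)

absolute = "/html/body/div/p"
anywhere = "//p"

print(len(original.xpath(absolute)), len(original.xpath(anywhere)))  # 2 2
print(len(wrapped.xpath(absolute)), len(wrapped.xpath(anywhere)))    # 0 2
```

The absolute path silently returns nothing once the nesting changes, while the descendant search still finds both paragraphs.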

Attributes

Elements can be filtered by attribute values using the [@attr] syntax:

<a class="button">Click</a>

//a[@class='button'] selects the link by its class attribute value. Some other examples:

//input[@type='text']
//div[@id='header']
//img[contains(@src,'logo')]

Attribute filters are a great way to target elements precisely.
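A short sketch of these attribute filters with lxml (assumed installed); the page markup is invented for the example. It also shows one caveat: `@class='button'` compares the whole attribute string, so an element with several classes needs a token-aware test.

```python
from lxml import html

page = html.fromstring("""<html><body>
<div id="header"><img src="/static/logo.png"><img src="/static/banner.png"></div>
<form><input type="text" name="q"><input type="submit" value="Go"></form>
<a class="button">Click</a>
<a class="button primary">Buy</a>
</body></html>""")

text_inputs = page.xpath("//input[@type='text']")
logos = page.xpath("//img[contains(@src,'logo')]")
exact_class = page.xpath("//a[@class='button']")
# Token-aware class test: pad the attribute with spaces and look for ' button '.
any_button = page.xpath(
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' button ')]"
)

print([i.get("name") for i in text_inputs])  # ['q']
print([i.get("src") for i in logos])         # ['/static/logo.png']
print([a.text for a in exact_class])         # ['Click']
print([a.text for a in any_button])          # ['Click', 'Buy']
```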

Positional Filters

When a path selects multiple nodes, positional predicates like [1] can grab a specific item from that nodeset:

<p>Paragraph 1</p> 
<p>Paragraph 2</p>

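//p[1] selects the first <p>
//p[2] selects the second <p>
//p[last()] selects the last <p>

Positional filters enable precise control when extracting data from lists of repeating elements. Positions start at 1 and are counted among the siblings under each parent, so on a larger page //p[1] returns the first <p> of every parent; wrap the path in parentheses, as in (//p)[1], to get the first match in the whole document.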
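Positional predicates count within each parent, which is easy to see once a page has more than one group of paragraphs. A minimal sketch with lxml (assumed installed) and made-up markup:

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    "<div><p>A1</p><p>A2</p></div>"
    "<div><p>B1</p><p>B2</p></div>"
    "</body></html>"
)

first_per_parent = [p.text for p in page.xpath("//p[1]")]
first_overall = [p.text for p in page.xpath("(//p)[1]")]
last_per_parent = [p.text for p in page.xpath("//p[last()]")]
last_overall = [p.text for p in page.xpath("(//p)[last()]")]

print(first_per_parent)  # ['A1', 'B1']
print(first_overall)     # ['A1']
print(last_per_parent)   # ['A2', 'B2']
print(last_overall)      # ['B2']
```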

Putting It Together

XPath selectors generally combine element names, attributes, and filters together to pinpoint target nodes precisely:

<ul class="products">
  <li class="product">Product 1</li>
  <li class="product">Product 2</li>
</ul>

//ul[@class='products']/li[@class='product'][2]

This selects the second <li> child with class “product” of the unordered list with class “products”. As you can see, XPath provides many tools for precisely targeting nodes by name, attributes, position, and relationship.
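Run against the product list above, the combined selector behaves as follows. This is a sketch assuming lxml is installed:

```python
from lxml import html

page = html.fromstring(
    '<html><body><ul class="products">'
    '<li class="product">Product 1</li>'
    '<li class="product">Product 2</li>'
    "</ul></body></html>"
)

second = page.xpath("//ul[@class='products']/li[@class='product'][2]")
print([li.text for li in second])  # ['Product 2']
```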

Matching Text

In addition to selecting elements, we often want to match text content as well. XPath has a few options for matching text:

Contains Text

The XPath expression contains(., 'text') checks if the current node contains the given text:

<p>Hello world</p>

//p[contains(., 'world')]

This finds <p> tags with text containing “world”.

Exact Text

We can match exact inner text using:

<p>Hello world</p>

//p[text()='Hello world']

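The text() syntax selects the text nodes that are direct children of an element, so the comparison uses the paragraph's own text and ignores text inside nested tags.

This allows matching the exact text value of nodes.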

Normalizing Whitespace

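When matching text, be aware that browsers collapse whitespace when rendering, but XPath compares the raw text stored in the document, including indentation and line breaks. If the source of a paragraph contains extra spaces or a line break between the words, the expression:

//p[text()='Hello world']

would not match it, even though the rendered page looks identical. To match regardless of whitespace, either normalize spaces in the text content before matching or use XPath's normalize-space() function:

//p[normalize-space(text())='Hello world']

This strips leading and trailing whitespace and collapses internal runs to a single space before comparing. In XPath 1.0, text() contributes only the first text node here, so //p[normalize-space(.)='Hello world'] is the safer form when the paragraph contains inline tags.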
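The three text-matching styles can be compared side by side. The sketch below assumes lxml is installed and uses three invented paragraphs: a clean one, one with stray whitespace, and one with an inline tag.

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    "<p>Hello world</p>"
    "<p>  Hello \n  world  </p>"
    "<p>Hello <b>world</b></p>"
    "</body></html>"
)

contains = page.xpath("//p[contains(., 'world')]")
exact = page.xpath("//p[text()='Hello world']")
first_text_node = page.xpath("//p[normalize-space(text())='Hello world']")
whole_element = page.xpath("//p[normalize-space(.)='Hello world']")

print(len(contains))         # 3
print(len(exact))            # 1
print(len(first_text_node))  # 2
print(len(whole_element))    # 3
```

Only `normalize-space(.)` matches the paragraph with the inline `<b>`, because `.` covers the text of the whole element, including its descendants.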

Accessing Attributes

In addition to selecting elements, we often want to extract attributes such as src, href, and alt from nodes:

<a href="page.html">Link</a>

To get an attribute, use the @attr syntax:

//a/@href

This would return the href attribute value, in this case "page.html". Some other examples:

//img/@src
//div/@id

This syntax gives access to element attributes for extraction.
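With lxml (assumed installed), an attribute path returns a list of strings rather than elements. A small sketch on made-up markup:

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    '<a href="page.html">Link</a>'
    '<a href="about.html">About</a>'
    "<a>No target</a>"
    '<img src="logo.png" alt="Logo">'
    "</body></html>"
)

hrefs = page.xpath("//a/@href")   # links without href are simply skipped
srcs = page.xpath("//img/@src")

print(list(hrefs))  # ['page.html', 'about.html']
print(list(srcs))   # ['logo.png']

# When the element itself is needed as well, select it and read the attribute.
for link in page.xpath("//a"):
    print(link.text, link.get("href"))
```

Selecting the element and calling `.get()` keeps missing attributes visible as `None`, which matters when each link must stay paired with its text.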

Beyond Basic Selections

Up until now, we've covered the core syntax of XPath for selecting elements and attributes. However, XPath is capable of much more:

Navigation of ancestor, sibling and relative nodes
Extensible functions to transform nodesets
Conditionals, loops, and other programming logic (XPath 2.0 and later)
Integration with host language variables and parameters

Now that we understand the basics let's look at some more advanced yet common XPath usage patterns.

Navigating the Tree

In addition to parent/child and descendant navigation, XPath provides syntax for easy traversing between ancestor, sibling, and other relative nodes.

Ancestors

The .. selector travels up and selects the parent node:

<body>
  <div>
    <p>Paragraph</p>
  </div>
</body>

//div/p/.. selects the <div> parent element.

Chaining .. goes further up the ancestry:

//div/p/../.. selects <body>

This is useful for selecting elements based on context rather than position.
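A minimal sketch of walking upward with `..`, assuming lxml is installed. The last query shows a context-based selection, picking a container because of what it contains:

```python
from lxml import html

page = html.fromstring(
    "<html><body><div><p>Paragraph</p></div><div><span>Other</span></div></body></html>"
)

parent = page.xpath("//div/p/..")
grandparent = page.xpath("//div/p/../..")
# Select the <div> that contains a given paragraph, wherever it sits on the page.
container = page.xpath("//p[.='Paragraph']/..")

print([el.tag for el in parent])       # ['div']
print([el.tag for el in grandparent])  # ['body']
print(container[0] is parent[0])       # True
```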

Siblings

XPath can also select sibling elements:

<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>

//li[1]/following-sibling::li[1] selects the second <li> sibling.

following-sibling finds later siblings
preceding-sibling finds earlier siblings

This helps target siblings in relation to others even when positions change.
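The sibling axes can be sketched as follows, assuming lxml is installed and extending the list to three items. Note that on `preceding-sibling`, a reverse axis, position 1 is the nearest earlier sibling, while the returned list is still in document order.

```python
from lxml import html

page = html.fromstring(
    "<html><body><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></body></html>"
)

next_one = page.xpath("//li[1]/following-sibling::li[1]/text()")
all_later = page.xpath("//li[1]/following-sibling::li/text()")
nearest_before = page.xpath("//li[3]/preceding-sibling::li[1]/text()")
all_earlier = page.xpath("//li[3]/preceding-sibling::li/text()")

print(next_one)        # ['Item 2']
print(all_later)       # ['Item 2', 'Item 3']
print(nearest_before)  # ['Item 2']
print(all_earlier)     # ['Item 1', 'Item 2']
```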

Relative Navigation

In addition to absolute paths, . and .. can be used to create relative paths:

./@class         # class of current node
../@id           # id of parent node 
//li/./span      # span within current li
//div/..//span   # span relative to div's parent

Relative navigation is useful when the exact structure is unknown.
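Relative paths matter most when a query is run from an element that was selected earlier. The sketch below assumes lxml is installed and uses invented markup. It shows a common pitfall: a path starting with `//` searches the whole document even when called on a single element.

```python
from lxml import html

page = html.fromstring(
    '<html><body><ul id="list">'
    '<li id="a"><span>One</span></li>'
    '<li id="b"><span>Two</span></li>'
    "</ul></body></html>"
)

first_item = page.xpath("//li")[0]

own_id = first_item.xpath("./@id")               # attribute of the context node
parent_id = first_item.xpath("../@id")           # attribute of its parent
own_span = first_item.xpath("./span/text()")     # relative: only inside this <li>
every_span = first_item.xpath("//span/text()")   # absolute: the whole document

print(own_id, parent_id)  # ['a'] ['list']
print(own_span)           # ['One']
print(every_span)         # ['One', 'Two']
```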

Axes

Along with child and descendant axes we've used already, some other helpful axes are:

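ancestor – All ancestors of the current node (parent, grandparent, and so on) up to the root
attribute – Attributes of the current node
following – Nodes after the current node in document order, excluding its descendants
preceding – Nodes before the current node in document order, excluding its ancestors
self – Current node itself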

These offer even more approaches to navigating the tree relative to a context node.
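A brief sketch of these axes in use, assuming lxml is installed. The markup is invented, and axes are written with the `axis::node-test` syntax:

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    "<h2>Title</h2>"
    "<div><p>First</p><p>Second</p></div>"
    "<p>After</p>"
    "</body></html>"
)

start = page.xpath("//p[.='First']")[0]

ancestors = [el.tag for el in start.xpath("ancestor::*")]
following = start.xpath("following::p/text()")
preceding = page.xpath("//p[.='After']/preceding::p/text()")
is_self = start.xpath("self::p")[0] is start

print(ancestors)  # ['html', 'body', 'div']
print(following)  # ['Second', 'After']
print(preceding)  # ['First', 'Second']
print(is_self)    # True
```

Unlike `following-sibling`, the `following` axis crosses parent boundaries, which is why the paragraph outside the `<div>` is included.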

Functions

One of XPath's powerful features is extensibility via functions. Many built-in and custom functions are available for manipulating nodesets “on the fly”.

Built-in Functions

Some commonly used XPath functions include:

text() – Text node children of a node (strictly a node test rather than a function)
name() – Name of node
contains() – Whether string contains text
substring() – Extract part of a string
number() – Convert a value to a number
string() – Convert a value (node, number, or boolean) to a string
normalize-space() – Normalize whitespace in string
starts-with(), ends-with() – Check string prefixes/suffixes (ends-with() requires XPath 2.0)

XPath functions give great flexibility to massage content as it is extracted without having to post-process in the host language.
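Several of these functions return a plain string or number instead of a nodeset. A minimal sketch with lxml (assumed installed; it implements XPath 1.0, so `ends-with()` is not used) on made-up markup:

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    '<ul><li class="item">  Widget  </li><li class="item">Gadget</li></ul>'
    '<span id="price">19.50</span>'
    "</body></html>"
)

count = page.xpath("count(//li)")                           # float
raw = page.xpath("string(//li[1])")                         # str, whitespace kept
clean = page.xpath("normalize-space(//li[1])")              # str, trimmed
doubled = page.xpath("number(//span[@id='price']) * 2")     # float arithmetic
prefix = page.xpath("substring(normalize-space(//li[2]), 1, 3)")
tag = page.xpath("name(//ul/*[1])")
gad = page.xpath("//li[starts-with(normalize-space(.), 'Gad')]/text()")

print(count, repr(raw), clean)  # 2.0 '  Widget  ' Widget
print(doubled, prefix, tag)     # 39.0 Gad li
print(gad)                      # ['Gadget']
```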

Custom Functions

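Many XPath libraries allow registering custom functions as well. The registration API differs from library to library, so the following is pseudocode rather than a call into a specific package:

// Register function (pseudocode; xpath.register is illustrative only)
xpath.register('lowercase', str => {
  return str.toLowerCase()
})

// Use the registered function inside the expression
const results = xpath("//div[lowercase(@class) = 'header']")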

This allows extending XPath's capabilities to handle any application-specific logic.
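In Python, lxml exposes this through `etree.FunctionNamespace`. The sketch below assumes lxml is installed; the `lowercase` function and the sample markup are illustrative choices. Registering in the `None` namespace makes the function callable without a prefix, and the registration is global for the process.

```python
from lxml import etree, html

# Functions registered in the None namespace can be called without a prefix.
functions = etree.FunctionNamespace(None)

def lowercase(context, text):
    # lxml passes an evaluation context first, then the XPath arguments.
    return text.lower()

functions["lowercase"] = lowercase

page = html.fromstring(
    '<html><body><div class="Header">A</div><div class="FOOTER">B</div></body></html>'
)

# string(@class) converts the attribute to a string before it reaches Python.
matches = page.xpath("//div[lowercase(string(@class)) = 'footer']")
print([div.text for div in matches])  # ['B']
```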

Putting It All Together

Now that we've covered different types of selections, traversal, and some functions, let's look at an example making use of these together:

<div class="comments">

  <div class="comment">
    <img src="user1.png">
    <p>This is comment #1</p>    
  </div>
  
  <div class="comment">
    <img src="user2.png">
    <p>This is comment #2</p>
  </div>
  
</div>

To extract the text of every comment by starting from its profile image, we can use:

//div[@class='comments']/div[@class='comment']/img/following-sibling::p/text()

Breaking this down:

//div[@class='comments'] – Match parent container
/div[@class='comment'] – Select each comment child
/img – Select image within comment
/following-sibling::p – Get paragraph after image
/text() – Extract text content

The key is combining different types of selection and traversal to target the data we want precisely.
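Here is the same expression evaluated with lxml (assumed installed), followed by a per-comment variant that keeps each image paired with its text:

```python
from lxml import html

page = html.fromstring(
    '<html><body><div class="comments">'
    '<div class="comment"><img src="user1.png"><p>This is comment #1</p></div>'
    '<div class="comment"><img src="user2.png"><p>This is comment #2</p></div>'
    "</div></body></html>"
)

texts = page.xpath(
    "//div[@class='comments']/div[@class='comment']/img/following-sibling::p/text()"
)
print(texts)  # ['This is comment #1', 'This is comment #2']

# Relative queries per comment keep avatar and text together.
pairs = [
    (c.xpath("string(./img/@src)"), c.xpath("normalize-space(./p)"))
    for c in page.xpath("//div[@class='comment']")
]
print(pairs)  # [('user1.png', 'This is comment #1'), ('user2.png', 'This is comment #2')]
```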

Advanced Techniques

We've now covered the core foundations of XPath. Here are some more advanced techniques and tips for special use cases you may encounter.

Conditionals

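XPath 2.0 and later support if/else conditional expressions for additional logic:

if (condition) 
  then (expression1)
  else (expression2)

Some examples:

//img/(
  if (@src) 
    then @src
    else @data-src
)

//div/(
  if (@class='highlight')
    then ./p/text()
    else string(.)
)

This allows applying different extraction logic depending on attribute values and other conditions. XPath 1.0 engines, including lxml, PHP's DOMXPath, and the browser's document.evaluate(), do not accept this syntax; there the branching has to be expressed with predicates and unions or done in the host language.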
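With an XPath 1.0 engine the same "src, else data-src" fallback can be sketched in two ways. The example assumes lxml is installed, and the image markup is invented:

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    '<img src="a.png">'
    '<img data-src="b.png">'
    '<img src="c.png" data-src="c-lazy.png">'
    "</body></html>"
)

# Option 1: branch in Python, one decision per element.
in_python = [img.get("src") or img.get("data-src") for img in page.xpath("//img")]

# Option 2: stay in XPath 1.0 with a union; data-src is only taken when src is absent.
in_xpath = page.xpath("//img/@src | //img[not(@src)]/@data-src")

print(in_python)       # ['a.png', 'b.png', 'c.png']
print(list(in_xpath))  # ['a.png', 'b.png', 'c.png']
```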

Loops

XPath 2.0 and later also provide iterative processing of sequences using for expressions:

for $item in (expression)
  return data($item)

For example, in XQuery, which extends XPath with element constructors:

for $product in //product
  return 
    <item>
      {$product/@name}
      {$product/@price} 
    </item>

This iterates over <product> nodes, transforming each into a new <item> structure. Loops enable repeating extraction logic over node lists. XPath 1.0 engines have no for expression, so with them the iteration happens in the host language.
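With an XPath 1.0 library the loop moves into the host language. A minimal sketch with lxml (assumed installed) over an invented product catalog, building dictionaries rather than `<item>` elements:

```python
from lxml import etree

catalog = etree.fromstring(
    "<catalog>"
    '<product name="Pen" price="1.20"/>'
    '<product name="Ink" price="4.50"/>'
    "</catalog>"
)

# XPath selects the nodes; Python does the per-node transformation.
items = [
    {"name": product.get("name"), "price": float(product.get("price"))}
    for product in catalog.xpath("//product")
]
print(items)  # [{'name': 'Pen', 'price': 1.2}, {'name': 'Ink', 'price': 4.5}]
```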

Parameters

Many XPath libraries allow passing in external values as variables. The exact call differs per library; in pseudocode:

const results = xpath('//a[text()=$linkText]', {
  linkText: 'Next Page' 
})

The $param syntax integrates external data into the expression. Parameters are useful for reuse and avoiding hard-coded values.
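In lxml (assumed installed), variables are passed as keyword arguments to `xpath()`. The sketch below uses made-up links; because the value is bound rather than pasted into the expression, text containing quotes needs no escaping.

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    "<a href='/p/2'>Next Page</a>"
    "<a href='/p/0'>Previous Page</a>"
    "<a href='/faq'>Don't panic</a>"
    "</body></html>"
)

query = "//a[text()=$linkText]/@href"

next_href = page.xpath(query, linkText="Next Page")
faq_href = page.xpath(query, linkText="Don't panic")  # apostrophe is safe here

print(list(next_href))  # ['/p/2']
print(list(faq_href))   # ['/faq']
```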

Regular Expressions

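The matches() function, available in XPath 2.0 and later, allows matching text via regex:

//a[matches(text(), '\d+')] finds links containing numbers.

XPath 1.0 engines lack matches(); some, such as lxml, offer the EXSLT regular-expression extension functions instead. Regular expressions provide powerful text parsing.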
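With lxml (assumed installed) the equivalent test can be sketched using its EXSLT regular-expression support, where `re:test()` plays the role of a regex match. The `re` prefix is bound to the EXSLT namespace URI; the links are invented for the example.

```python
from lxml import html

EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

page = html.fromstring(
    "<html><body>"
    '<a href="/1">Page 1</a>'
    '<a href="/next">Next</a>'
    '<a href="/22">Issue 22</a>'
    "</body></html>"
)

# re:test(string, pattern) is true when the pattern is found anywhere in the string.
numbered = page.xpath(r"//a[re:test(string(.), '\d+')]", namespaces=EXSLT_RE)
print([a.text for a in numbered])  # ['Page 1', 'Issue 22']
```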

Namespaces

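Namespaces are a consideration for XML documents, including XHTML. In XPath 1.0, an unprefixed name such as div only matches elements in no namespace, so it will not match namespaced elements (the * wildcard still matches them). Bind a prefix to the namespace URI in your XPath library and use that prefix in the path to match namespaced elements: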

/xhtml:html/xhtml:body//xhtml:div

See Namespace Axes for more details.
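A minimal sketch of prefix binding with lxml (assumed installed), using a small XHTML document. The prefix name `xhtml` is arbitrary; only the namespace URI has to match the document.

```python
from lxml import etree

doc = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div>Outer<div>Inner</div></div></body>"
    "</html>"
)
ns = {"xhtml": "http://www.w3.org/1999/xhtml"}

unprefixed = doc.xpath("//div")                                   # no namespace: no match
prefixed = doc.xpath("/xhtml:html/xhtml:body//xhtml:div", namespaces=ns)
by_local_name = doc.xpath("//*[local-name()='div']")              # namespace-agnostic

print(len(unprefixed), len(prefixed), len(by_local_name))  # 0 2 2
```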

Using XPath in Code

Now that we understand XPath querying, let's look at how it can be used in real code. XPath is supported in all major programming languages either natively or via common libraries.

XPath in JavaScript

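JavaScript supports XPath via the DOM document.evaluate() method:

const xpath = '//a[@class="highlight"]'
const nodes = document.evaluate(xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)
// The snapshot is not an array: read it by index
for (let i = 0; i < nodes.snapshotLength; i++) console.log(nodes.snapshotItem(i).href)

This evaluates the XPath and returns an XPathResult snapshot, whose matched nodes are read with snapshotLength and snapshotItem().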

XPath in Python

For Python, the lxml library provides excellent XPath support:

from lxml import html

tree = html.parse('page.html')

links = tree.xpath('//a[@class="highlight"]')

xpath() returns a list of matched elements to extract data from.

XPath in PHP

PHP also includes XPath capabilities:

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[@class="highlight"]');

The DOMXPath::query() method evaluates XPath expressions.

Other Languages

Most other languages have XPath libraries including:

C#: System.Xml.XPath
Java: javax.xml.xpath
Ruby: Nokogiri
Go: antchfx/xpath (with antchfx/htmlquery for HTML)
Rust: sxd-xpath
Swift: Kanna

No matter your language, there are robust XPath processing options available.

Tips and Best Practices

Here are some key tips to follow when using XPath for HTML parsing:

Favor // descendant over / child axes for more resilient selectors
Leverage attributes over positional indexes when possible
Use context nodes like .. and . for relative selectors
Learn to use sibling, ancestor, and reverse axes over complex paths
Become familiar with key string, numeric and boolean functions
Use an XPath tester tool to build and test queries interactively
Beware of performance issues with complex expressions impacting large documents

Mastering XPath for HTML parsing takes learning and practice. An excellent approach for learning is to test XPath expressions in the browser console on live pages using document.evaluate(). This provides fast feedback on selectors as you iterate.

Conclusion

XPath takes time to master but pays off in providing resilient data extraction from HTML and XML. For parsing web content, it pairs extremely well with HTML parsers like lxml in Python and with the browser's built-in document.evaluate() in JavaScript to reliably target and extract relevant information. I hope this guide provides a solid foundation for using XPath in your own projects.