Ultimate XPath Cheatsheet for HTML Parsing in Web Scraping

XPath is one of the most powerful tools for parsing and extracting data from HTML and XML documents. It allows you to navigate through element trees and select nodes based on various criteria. In web scraping, XPath is commonly used for robustly extracting data from pages. It handles nested, irregular markup that is hard to parse reliably with regular expressions.

This ultimate cheatsheet aims to be a comprehensive reference for using XPath in the context of web scraping. It covers all the key features, functions, and usage patterns you need to master XPath parsing.

XPath Basics

Before diving into the details, let's understand the basics of XPath syntax and how it can be used for web scraping tasks.

What is XPath?

XPath stands for XML Path Language. It is a query language for selecting nodes from XML documents. HTML is not a subset of XML, but HTML parsers build the same kind of node tree, so XPath can be used to query HTML as well.

The primary purpose of XPath is to address elements within a document for retrieval and manipulation. It models an XML or HTML document as a tree of nodes that can be traversed to find specific pieces of data.

XPath Syntax

XPath uses path expressions to select nodes or node-sets in an XML/HTML document. A node is selected by following a path made of steps. A location path is a sequence of steps separated by / or //:

step/step/...

And each step can filter via a node name test and optional predicates:

node[predicate]

Some examples of simple XPath expressions:

  • /html/body/div – selects all div elements directly under body
  • //div[@id='main'] – div with id='main' anywhere in doc
  • //a[starts-with(@href, 'category')] – a links whose href starts with 'category'
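
To see these three expressions run, here is a minimal sketch using lxml. The page_html snippet and the variable names are made up for illustration.

```python
from lxml import html

# Hypothetical page used only for illustration.
page_html = """
<html><body>
  <div id="main"><a href="category/books">Books</a></div>
  <div><a href="/about">About</a></div>
</body></html>
"""

doc = html.fromstring(page_html)

# Absolute path: div elements directly under body.
body_divs = doc.xpath("/html/body/div")

# Attribute predicate: the div with id="main", wherever it sits.
main_ids = [d.get("id") for d in doc.xpath("//div[@id='main']")]

# Function predicate: links whose href starts with "category".
category_links = doc.xpath("//a[starts-with(@href, 'category')]/@href")

print(len(body_divs), main_ids, category_links)
```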

Using XPath for Web Scraping

XPath is supported in all major Python web scraping libraries:

lxml

from lxml import html

doc = html.fromstring(page_html)

xpaths = doc.xpath('//div[@class="product"]')

scrapy

response.xpath('//h1/text()')

parsel

from parsel import Selector

sel = Selector(text=page_html) 
sel.xpath('//title/text()')

So XPath can be used for scraping without needing to touch regex!

XPath Versions

There are currently 3 main XPath versions:

  • XPath 1.0 – Recommended version for web scraping. Offers wide support and all necessary features.
  • XPath 2.0 – Adds some helpful functions but is not supported by libxml2-based libraries such as lxml.
  • XPath 3.0 – Mainly adds new language features and, in XPath 3.1, the map and array data types.

This guide focuses on XPath 1.0 and 2.0 features that are useful for web scraping. Now let's explore how to use XPath for some common web scraping tasks.

Selecting Nodes by Name

The most basic way to select nodes is by their element name:

<h1>Heading</h1>

<div>Content</div>

To select the h1 element:

//h1

And to select all div elements:

//div

Name wildcards

You can use the * wildcard to select any element name:

//*

This allows you to select all elements under a section without knowing the exact tags.
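
As a quick check of name tests and the wildcard, the sketch below wraps the two elements above in a hypothetical html/body shell and parses it with lxml. Note that //* also matches html and body themselves, so scoping the wildcard to a parent is usually what you want.

```python
from lxml import html

doc = html.fromstring(
    "<html><body><h1>Heading</h1><div>Content</div></body></html>"
)

h1_text = [el.text for el in doc.xpath("//h1")]    # ['Heading']
div_text = [el.text for el in doc.xpath("//div")]  # ['Content']

# //* matches every element in the document, including html and body.
all_tags = [el.tag for el in doc.xpath("//*")]

# Scoping the wildcard returns only the elements under that section.
body_children = [el.tag for el in doc.xpath("//body/*")]

print(all_tags, body_children)
```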

Matching by Attribute

Elements can be selected based on attributes using:

//div[@class='product']

This will select div elements whose class attribute is exactly 'product'.

Attribute operators

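Attribute values can also be filtered using comparison operators and functions:

//div[@id!='footer']
//a[starts-with(@href, 'category')]

The operators = and != work on both strings and numbers, while >, >=, <, <= compare numeric values. Note that @id!='footer' only matches elements that actually have an id attribute.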

Attribute patterns

The contains() function allows partial matching on attribute values:

//div[contains(@class, 'product')]

This is useful when the exact values are unpredictable.
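
The difference between exact and partial attribute matching is easiest to see side by side. This is a small sketch with lxml on an invented page; the ids and class names are assumptions for the example.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="header" class="site-header">Header</div>
  <div id="p1" class="product featured">Widget</div>
  <div id="p2" class="product">Gadget</div>
  <div id="footer" class="site-footer">Footer</div>
  <div class="product-note">No id here</div>
</body></html>
""")

# Exact match: the whole attribute value must equal 'product'.
exact = doc.xpath("//div[@class='product']/text()")

# Substring match: also hits 'product featured' and 'product-note'.
partial = doc.xpath("//div[contains(@class, 'product')]/text()")

# Prefix match on the attribute value.
site_parts = doc.xpath("//div[starts-with(@class, 'site-')]/text()")

# != only matches elements that have the attribute at all.
not_footer = doc.xpath("//div[@id!='footer']/@id")

print(exact, partial, site_parts, not_footer)
```

Because contains() works on substrings, it also matches the unrelated product-note class here, so it is worth checking what else on the page shares the fragment.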

Selecting by Position

Elements can be selected based on their position using square brackets:

<div>
  <p>First</p>
  <p>Second</p>
  <p>Third</p>
</div>
  • //p[1] – first p element within its parent
  • //p[2] – second p element within its parent
  • //p[last()] – last p element within its parent

The position() function also returns the position index:

//p[position() > 1]

This selects p elements after the first.
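
Positions are counted among siblings under the same parent, which matters once a page has more than one container. The sketch below adds a second, made-up div to the example to show the difference between //p[1] and (//p)[1].

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<div><p>First</p><p>Second</p><p>Third</p></div>"
    "<div><p>Other</p></div>"
    "</body></html>"
)

# [1] applies per parent: the first p inside each div.
first_per_parent = doc.xpath("//p[1]/text()")

# Parentheses apply the index to the whole result set instead.
first_overall = doc.xpath("(//p)[1]/text()")

last_per_parent = doc.xpath("//p[last()]/text()")
after_first = doc.xpath("//p[position() > 1]/text()")

print(first_per_parent, first_overall, last_per_parent, after_first)
```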

Matching Text Values

The text() node test selects the text nodes directly inside an element:

<div>
  <p>Hello World</p>
</div>

//p/text() will return “Hello World”

You can then match based on text values using functions like:

//p[contains(text(), 'Hello')]

This will select p elements containing the text “Hello”.

Some other useful text functions are:

  • starts-with(text(), 'Hello')
  • ends-with(text(), 'World') – XPath 2.0 only
  • matches(text(), '\d+') – regex match, XPath 2.0 only
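
Here is a short sketch of text matching with lxml, which implements XPath 1.0. The extra paragraphs are invented for the example, and the ends-with check is emulated with substring() and string-length(), one common workaround rather than the only option.

```python
from lxml import html

doc = html.fromstring(
    "<html><body><div>"
    "<p>Hello World</p><p>Goodbye World</p><p>Order 42</p>"
    "</div></body></html>"
)

texts = doc.xpath("//p/text()")
hello = doc.xpath("//p[contains(text(), 'Hello')]/text()")
good = doc.xpath("//p[starts-with(text(), 'Good')]/text()")

# XPath 1.0 has no ends-with(); compare the last 5 characters instead.
# 'World' has 5 characters, so the substring starts at length - 4.
ends_with_world = doc.xpath(
    "//p[substring(text(), string-length(text()) - 4) = 'World']/text()"
)

print(texts, hello, good, ends_with_world)
```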

Selecting Child and Descendant Elements

The / character selects direct children of elements:

<div>
  <p>Child</p>
  <span>
    <p>Grandchild</p>  
  </span>
</div>
  • /div/p will select the direct p child (assuming div is the root element)
  • /div/span/p will select the nested p

The // characters select any descendant element:

//div//p

This will select both p elements regardless of nesting depth.
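
The sketch below runs these paths against the snippet above. It uses lxml's XML parser so that div really is the root element, which is what a path starting with /div assumes. On a full HTML page you would start from //div instead.

```python
from lxml import etree

root = etree.fromstring(
    "<div><p>Child</p><span><p>Grandchild</p></span></div>"
)

# Single slash: only direct children at each step.
direct = root.xpath("/div/p/text()")
nested = root.xpath("/div/span/p/text()")

# Double slash: descendants at any depth.
any_depth = root.xpath("//div//p/text()")

print(direct, nested, any_depth)
```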

Navigating to Parent and Ancestor Nodes

The .. selector navigates to the parent node:

<div>
  <p>
    <span>Target</span>
  </p>
</div>
  • //span/.. selects the parent p element.

The ancestor axis selects any ancestor node:

//span/ancestor::div

This will select the div element. Some other useful axes are:

  • ancestor-or-self – ancestors and current node
  • descendant-or-self – descendants and current node
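
A minimal sketch of upward navigation with lxml, using the snippet above wrapped in a hypothetical html/body shell:

```python
from lxml import html

doc = html.fromstring(
    "<html><body><div><p><span>Target</span></p></div></body></html>"
)

# .. steps up to the parent element.
parent_tags = [el.tag for el in doc.xpath("//span/..")]

# ancestor:: with a name test picks a specific enclosing element.
ancestor_divs = [el.tag for el in doc.xpath("//span/ancestor::div")]

# ancestor::* collects every enclosing element up to the root.
all_ancestors = {el.tag for el in doc.xpath("//span/ancestor::*")}

# ancestor-or-self additionally includes the span itself.
with_self = {el.tag for el in doc.xpath("//span/ancestor-or-self::*")}

print(parent_tags, ancestor_divs, all_ancestors, with_self)
```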

Selecting Sibling Elements

To select sibling elements, you can use:

/div/p[1]/following-sibling::p

This will select p elements following the first p. Some other axes for siblings are:

  • preceding-sibling – previous siblings
  • following-sibling – next siblings

And for navigation in document order:

  • preceding – all nodes before the current node, excluding its ancestors
  • following – all nodes after the current node, excluding its descendants

These let you select elements before and after the current node.
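
The sibling and document-order axes can be compared on a small invented fragment. It is parsed with lxml's XML parser here so that div is the root and the /div/... paths apply as written.

```python
from lxml import etree

root = etree.fromstring(
    "<div><h2>Title</h2><p>First</p><p>Second</p><p>Third</p></div>"
)

# Siblings after the first p, restricted to p elements.
later = root.xpath("/div/p[1]/following-sibling::p/text()")

# Siblings before the third p, restricted to p elements.
earlier = root.xpath("/div/p[3]/preceding-sibling::p/text()")

# preceding:: skips ancestors, so the enclosing div is not included.
before_first_p = [el.tag for el in root.xpath("/div/p[1]/preceding::*")]

# following:: returns everything after the node in document order.
after_first_p = [el.tag for el in root.xpath("/div/p[1]/following::*")]

print(later, earlier, before_first_p, after_first_p)
```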

Combining Multiple Selectors

You can combine multiple XPath expressions using:

  • | – union of the node-sets selected by two expressions
  • or – inside a predicate, requires at least one condition to match
  • and – inside a predicate, requires all conditions to match

For example:

//div[@class='product'] | //div/h2

This will select both the div.product elements and the h2 elements directly under a div.
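
The following sketch contrasts the union operator, which merges two node-sets, with and/or, which combine conditions inside a single predicate. The markup and ids are assumptions made for the example.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="product" id="a"><h2>Widget</h2></div>
  <div class="banner" id="b"><h2>Sale</h2></div>
  <p class="product" id="c">Loose item</p>
</body></html>
""")

# Union: one product div plus both h2 elements that sit under a div.
union_tags = [el.tag for el in doc.xpath("//div[@class='product'] | //div/h2")]

# and: both conditions must hold on the same element.
both = doc.xpath("//*[@class='product' and @id='a']/@id")

# or: either condition is enough.
either = doc.xpath("//*[@class='banner' or @id='c']/@id")

print(union_tags, both, either)
```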

Filtering Nodes with Predicates

Predicates allow adding advanced criteria to a step:

/books/book[price > 30]

This will select book nodes whose price child element is greater than 30 (an attribute would be written @price). You can also chain multiple predicates:

/books/book[price > 30][year > 2010]

This applies both filters, so only books that satisfy both conditions are selected.
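
A small sketch of these predicates with lxml's XML parser. The books document is invented for illustration, with price and year as child elements of each book.

```python
from lxml import etree

root = etree.fromstring("""
<books>
  <book><title>A</title><price>25</price><year>2015</year></book>
  <book><title>B</title><price>45</price><year>2005</year></book>
  <book><title>C</title><price>60</price><year>2018</year></book>
</books>
""")

# One predicate: the price child is compared as a number.
expensive = root.xpath("/books/book[price > 30]/title/text()")

# Chained predicates: each one filters the result of the previous one.
expensive_recent = root.xpath(
    "/books/book[price > 30][year > 2010]/title/text()"
)

print(expensive, expensive_recent)
```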

Utility Functions

Along with the text functions we saw earlier, some other useful utility functions are:

String manipulation

  • concat(str1, str2, ...) – concatenate strings
  • substring(str, start, length) – extract a substring (positions start at 1)
  • normalize-space(str) – trim and collapse whitespace

Node info

  • name() – get node name
  • count(node) – count matching nodes

Conditionals

  • not(expr) – logical NOT operator
  • true()/false() – boolean functions
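
With lxml, an expression that evaluates to a string or number returns a Python str or float rather than a list of nodes. The sketch below tries each utility function on a made-up two-paragraph page.

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<p class='a'>  Hello   World  </p><p>Second</p>"
    "</body></html>"
)

# normalize-space trims the ends and collapses inner whitespace runs.
clean = doc.xpath("normalize-space(//p[1])")

joined = doc.xpath("concat('p:', normalize-space(//p[2]))")

# substring(str, start, length) counts positions from 1.
prefix = doc.xpath("substring('Hello World', 1, 5)")

# count() returns a float in lxml.
n_paragraphs = doc.xpath("count(//p)")

first_child_name = doc.xpath("name(//body/*[1])")

# not() inverts a condition, here "has no class attribute".
without_class = doc.xpath("//p[not(@class)]/text()")

print(clean, joined, prefix, n_paragraphs, first_child_name, without_class)
```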

Regular Expression Matching

The XPath 2.0 matches() function allows matching text values against regular expressions:

//title[matches(text(), '\w+')]

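This will select title elements with alphanumeric text. Engines limited to XPath 1.0, such as lxml, lack matches() but often support the EXSLT regular expression functions through a regexp: or re: namespace prefix:

//title[re:test(text(), '\w+')]

This works similarly.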
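
In lxml the re: prefix has to be bound to the EXSLT regular-expressions namespace through the namespaces argument. A minimal sketch, with an invented page and the name RE_NS chosen for this example:

```python
from lxml import html

# EXSLT regular-expressions namespace, bound to the "re" prefix.
RE_NS = {"re": "http://exslt.org/regular-expressions"}

doc = html.fromstring(
    "<html><head><title>Page 42</title></head>"
    "<body><p>SKU-1234</p><p>no digits</p></body></html>"
)

# Raw strings keep the regex backslashes intact.
titles = doc.xpath(
    r"//title[re:test(text(), '\w+')]/text()", namespaces=RE_NS
)
with_digits = doc.xpath(
    r"//p[re:test(text(), '\d+')]/text()", namespaces=RE_NS
)

print(titles, with_digits)
```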

Robust Matching Tips

  • Avoid long complex paths like /html/body/div/span/p
  • Prefer // descendant axes for reliability
  • Leverage attributes over fickle element names/nesting
  • Match partial attributes with contains()
  • Use text functions for text node selection
  • Apply normalize-space() when comparing text

Proper use of XPath can create very robust scrape queries resistant to minor HTML changes.

Example Patterns

Here are some examples of common XPath scraping patterns:

Get text from complex HTML

<div class="description">
  Call <span class="phone">(888) 888-8888</span> to order!
</div>

//div[@class='description']//text()

Will extract all the text regardless of nesting.
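
In lxml, //text() returns one string per text node, so the pieces usually need joining. The sketch below shows that, plus the string() function as an alternative that concatenates them inside XPath.

```python
from lxml import html

doc = html.fromstring(
    '<html><body><div class="description">'
    'Call <span class="phone">(888) 888-8888</span> to order!'
    '</div></body></html>'
)

# One list entry per text node, in document order.
parts = doc.xpath("//div[@class='description']//text()")
joined = "".join(parts)

# string() concatenates all descendant text in a single call.
as_string = doc.xpath("string(//div[@class='description'])")

print(parts, joined, as_string)
```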

Filter listings by price

<div class="product" data-price="39.99">...</div>
<div class="product" data-price="99.99">...</div>

//div[@class="product"][number(@data-price) < 40]

Converts the data-price attribute to a number and filters by price.
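
A quick sketch of the price filter in lxml. The attribute needs the @ prefix: without it, data-price is read as a child element name, the conversion yields NaN, and nothing matches.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="product" data-price="39.99">Cheap</div>
  <div class="product" data-price="99.99">Pricey</div>
</body></html>
""")

# @data-price reads the attribute; number() makes the comparison numeric.
cheap = doc.xpath(
    '//div[@class="product"][number(@data-price) < 40]/text()'
)

# Without @, data-price is looked up as a child element and never matches.
without_at = doc.xpath(
    '//div[@class="product"][number(data-price) < 40]/text()'
)

print(cheap, without_at)
```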

Select row cells from tables

<table>
 <tr><td>Cell 1</td><td>Cell 2</td></tr>
</table>

//tr/td[position() = 2]

Gets the 2nd cell value from rows.
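
For whole tables it is often convenient to select the rows first and then run a relative path on each row. A sketch with lxml, using a hypothetical second row added to the table above:

```python
from lxml import html

doc = html.fromstring(
    "<html><body><table>"
    "<tr><td>Cell 1</td><td>Cell 2</td></tr>"
    "<tr><td>Cell 3</td><td>Cell 4</td></tr>"
    "</table></body></html>"
)

# Second cell of every row.
second_cells = doc.xpath("//tr/td[position() = 2]/text()")

# Row by row: ./td is evaluated relative to each tr element.
rows = [tr.xpath("./td/text()") for tr in doc.xpath("//tr")]

print(second_cells, rows)
```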

Paginated content

(//h2[text()='Products'])[1]/following::div[1]

Finds the first div after the first products heading.
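
To illustrate, the sketch below runs this expression on an invented page where the heading and the target div are siblings separated by another element:

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<h2>About</h2><div>About text</div>"
    "<h2>Products</h2><p>Intro</p><div>Product list</div><div>More</div>"
    "</body></html>"
)

# The parenthesized [1] picks the first matching heading overall;
# following::div[1] then takes the nearest div after it in document order.
first_div = doc.xpath(
    "(//h2[text()='Products'])[1]/following::div[1]/text()"
)

print(first_div)
```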

And many more!

Take a look at the tools section for sites with more example patterns.

Comparison to CSS Selectors

CSS selectors are an alternative to XPath for HTML parsing. Some advantages of each:

CSS Selectors

  • Simple and terse syntax
  • Widespread browser support
  • Integrated into CSS

XPath

  • Full programmatic access to manipulate results
  • Advanced conditional logic
  • Additional helper functions
  • Can query both HTML and XML sources

Here are some common conversions between CSS and XPath:

CSS                  XPath
---------------------------------------------------
.class               //*[contains(concat(' ',normalize-space(@class),' '), ' class ')]
#id                  //*[@id='id']
div > span           //div/span
div span             //div//span
div + p              //div/following-sibling::*[1][self::p]
div ~ p              //div/following-sibling::p
a[target]            //a[@target]
a[target="_blank"]   //a[@target='_blank']

So CSS selectors can typically be converted to XPath equivalents.
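
The long .class translation exists because a plain contains() on @class matches substrings rather than whole class tokens. A small lxml sketch of the difference, with class names invented for the example:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="product featured">A</div>
  <div class="product-note">B</div>
</body></html>
""")

# Substring test: also matches the unrelated "product-note" class.
loose = doc.xpath("//*[contains(@class, 'product')]/text()")

# Token test: pad with spaces so only the whole word "product" matches.
token = doc.xpath(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' product ')]"
    "/text()"
)

print(loose, token)
```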

XPath 2.0/3.0

XPath 2.0 and 3.0 add some advanced capabilities:

XPath 2.0

  • Better string and numeric handling
  • lower-case(), upper-case(), ends-with(), etc.
  • Regular expression support

XPath 3.0

  • parse-xml() to load XML
  • Support for JSON (added in XPath 3.1)
  • New data types: map, array (added in XPath 3.1)
  • Inline functions and the || string concatenation operator
  • Enhanced error handling

Most of these just provide alternative ways to do operations already possible in XPath 1.0. The major advantage is better string processing functions. Availability in scraping libraries is limited: lxml, and the Scrapy and parsel selectors built on it, use libxml2, which implements XPath 1.0 only, so XPath 2.0 and 3.0 functions require a different engine.

Tools and Resources

Here are some useful XPath tools and resources:

For documentation, the standards are published at:

And to dive deeper, the book XPath and XPointer provides a great overview.

Conclusion

This concludes our detailed XPath cheatsheet, essential for web scraping mastery! By mastering XPath, you unlock the potential to extract virtually any piece of information from HTML and XML documents. The core skill lies in adeptly navigating and querying node trees with the path syntax. This collection of examples aims to solidify your understanding, empowering you to utilize XPath effectively in your projects. Wishing you successful data extraction!

John Rooney

I am John Watson Rooney, a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
