XPath is a powerful query language for selecting elements in an HTML document. When using Selenium for web scraping or testing, mastering XPath can help you precisely target the elements you need to interact with on a page. In this comprehensive guide, we'll cover everything you need to know to find elements by XPath in Selenium.
A Quick History of XPath
Before we dive into the details, let's briefly look at where XPath came from. XPath was created by the W3C in 1999, alongside XSLT, as a Recommendation in its own right. The goal was to provide a flexible, non-XML language for addressing parts of an XML document, so programmers could pull data out of a document with a single expression instead of hand-writing tree-walking code against the DOM.
The XPath 1.0 standard was published in 1999, with 2.0 and 3.0 later adding more advanced features like type checking and regular expressions. Over the years, XPath gained widespread use for web scraping, since an HTML page parses into the same kind of tree structure XPath was designed to query, with concrete element names like <table> in place of generic nodes. (Note that browsers implement XPath 1.0, so the newer features generally aren't available when automating them.)
In fact, XPath remains one of the most widely used techniques for selecting elements from HTML pages. Even on dynamic, JavaScript-heavy sites, it is one of the most powerful tools for scraping structured data out of complex pages, especially when combined with Selenium.
Overview of XPath Syntax
Now that you know where XPath came from, let's look at how it works. XPath uses path expressions to select nodes or elements in an XML or HTML document. Some examples of XPath expressions:
- /html/body/div – Select the <div> inside <body> inside the <html> root node
- //div[@id='header'] – Select all <div> elements with id="header"
- //div[2] – Select every <div> that is the second <div> child of its parent (to get the second <div> on the whole page, use (//div)[2])
As you can see, XPath expressions read like traversal paths through the document tree. Some key things to know about the syntax:
- Absolute vs relative – Paths starting with "/" are absolute, starting from the root; relative paths start from the current (context) node.
- Attributes – [@attribute='value'] matches elements by attribute value.
- Functions – Built-in functions like contains(), position(), etc.
- Axes – Special selectors like ancestor, following-sibling, etc. for moving around the tree.
- Operators – and, or, not() etc. for combining expressions.
By combining different types of XPath selectors, you can target elements with surgical precision. Next, let's see how this works practically in Selenium for scraping web pages.
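For instance, here are a couple of expressions that mix axes and operators (the attribute values are made up for illustration):
//span[@class='price']/ancestor::table – Walk up from a price <span> to its enclosing <table>
//input[@type='text' and not(@disabled)] – Select only the enabled text inputs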
Finding Elements by XPath with Selenium
Selenium provides easy methods for finding elements using XPath selectors. Let's go through the basics.
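The snippets below assume an active WebDriver session named driver, plus the By locator class. A minimal setup sketch (assuming Chrome, though any browser works):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a browser session (Selenium 4.6+ fetches the driver automatically)
driver = webdriver.Chrome()
driver.get("https://example.com")  # stand-in URL for the examples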
Absolute XPath Examples
Absolute XPath expressions start from the root <html> node, like this:
# Absolute XPath
driver.find_element(By.XPATH, "/html/body/div")
Some more examples of absolute XPath in Selenium:
# <input> inside <div>
driver.find_element(By.XPATH, "/html/body/div/input")

# Second <td> in the first table row
driver.find_element(By.XPATH, "/html/body/table/tbody/tr/td[2]")

# <div> with matching id attribute
driver.find_element(By.XPATH, "/html/body/div[@id='header']")
Absolute XPaths are the most fragile kind, since any change in DOM structure tends to break them, but they make a useful starting point when first exploring a page.
Relative XPath Examples
Relative XPaths start from a context node we select first:
# Get the search box area first
search_box = driver.find_element(By.ID, "search")

# Relative XPath from there to the submit button
submit_btn = search_box.find_element(By.XPATH, "./button")
Here are some more examples of relative XPaths:
# <p> child of a <div>
driver.find_element(By.XPATH, "//div/p")

# <li> descendant of a <ul>, at any depth
driver.find_element(By.XPATH, "//ul//li")

# <input> that is a following sibling of a <label>
driver.find_element(By.XPATH, "//label/following-sibling::input")
Relative XPaths are generally preferred, as they are less fragile and easier to reuse across pages.
By Attribute Value
Matching by element attributes is one of the most common XPath techniques. For example:
# Match by id attribute
driver.find_element(By.XPATH, "//*[@id='username']")

# Match by name attribute
driver.find_element(By.XPATH, "//*[@name='email']")
Here are some more examples of matching by different attributes:
# Match by class attribute (exact match; use contains() for multi-class elements)
driver.find_element(By.XPATH, "//div[@class='highlight']")

# Match <a> whose href contains a substring
driver.find_element(By.XPATH, "//a[contains(@href, 'blog')]")

# Match by src attribute
driver.find_element(By.XPATH, "//img[@src='/logo.png']")
Matching on one or more attributes is by far the most common pattern in real-world XPath expressions.
Using text() Nodes
You can match elements based on their inner text using the text() node test:
# Match a paragraph by its exact text
driver.find_element(By.XPATH, "//p[text()='Welcome!']")

# Match any element whose text contains a substring
driver.find_element(By.XPATH, "//*[contains(text(), 'Welcome')]")
Some more examples:
# Starts with text
driver.find_element(By.XPATH, "//p[starts-with(text(), 'Welcome')]")

# ends-with() is XPath 2.0, which browsers don't implement,
# so emulate "ends with" using substring() and string-length()
driver.find_element(By.XPATH, "//h2[substring(text(), string-length(text()) - string-length('Bye!') + 1) = 'Bye!']")
Text-based matching like this is a staple of most web scraping projects.
Using XPath Functions
XPath offers a library of built-in functions for matching elements in complex ways (XPath 1.0, the version browsers implement, ships a few dozen; later versions add many more):
# Match by text length
driver.find_element(By.XPATH, "//p[string-length(text()) > 100]")

# Match by position among sibling rows
driver.find_element(By.XPATH, "//tr[position()=3]")

# Match the last row (// rather than / skips the implicit <tbody> browsers insert)
driver.find_element(By.XPATH, "(//table//tr)[last()]")
Here are some more examples of using functions:
# Normalize whitespace before comparing
driver.find_element(By.XPATH, "//p[normalize-space(text())='Welcome']")

# Element whose class attribute contains a substring
driver.find_element(By.XPATH, "//div[contains(@class, 'warning')]")
Functions like contains() and normalize-space() come up constantly in practice, so they are well worth memorizing.
Combining Multiple Expressions
Using the and / or operators, you can chain together multiple matching criteria:
# Match attribute AND text
driver.find_element(By.XPATH, "//div[@id='intro' and contains(text(), 'Hello')]")

# Match EITHER id
driver.find_element(By.XPATH, "//div[@id='intro' or @id='opening']")
Some more examples:
# Match on text OR attribute
driver.find_element(By.XPATH, "//p[contains(text(), 'Hello') or @class='greeting']")

# Exclude unwanted matches with not()
driver.find_element(By.XPATH, "//div[contains(@class, 'popup') and not(@hidden)]")
Boolean operators like these are what let you narrow a broad match down to exactly the element you want.
Putting It All Together
Now let's see a more complete XPath example combining what we've learned:
# Match the third <td> cell in the "Total" row
driver.find_element(By.XPATH, "//table/tbody/tr[contains(., 'Total')]/td[3]")
This breakdown explains each part:
- //table – Match any <table> element
- /tbody – Its direct child <tbody> element
- /tr[contains(., 'Total')] – The <tr> row whose string value contains "Total" (we test . rather than text(), because the row's visible text actually lives inside its <td> children)
- /td[3] – The third <td> cell in that row

As you can see, we combined:
- Relative traversal (tbody, tr, td)
- A string-value check (contains())
- A position index ([3])
With thoughtful design, you can build very precise XPath selectors like this.
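For instance, reading that cell's value end to end (the table contents here are hypothetical):

# Grab the "Total" row's third cell and read its visible text
total_cell = driver.find_element(By.XPATH, "//table/tbody/tr[contains(., 'Total')]/td[3]")
print(total_cell.text)  # e.g. "$127.50"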
Comparing XPath to CSS Selectors
Like XPath, CSS selectors provide another mechanism for matching page elements. For example, to select a <div> with id="header", in CSS you would use:
div#header
While in XPath, you would use:
//div[@id="header"]
So when should you use XPath vs CSS selectors? Here are some key differences:
XPath Pros:
- More powerful syntax for complex selections
- Expressions are more readable once you know the syntax
- Support for useful functions like position(), contains(), etc.
- Can traverse up and down the document tree with axes like ancestor, descendant, etc.
CSS Selector Pros:
- More concise overall syntax
- Can reuse common CSS concepts like classes, ids, and attributes
- Easy to pick up the basics for anyone familiar with CSS
- Supported in all browsers for inspection and querying
In practice, XPath works better for scraping structured data, as you often need to carefully pick elements deep in the page based on position, text nodes, attributes etc. However, CSS Selectors are simpler and faster for doing basic querying and inspection of pages. The recommendation is to learn both, and use XPath when you need its additional power and flexibility.
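In Selenium, switching between the two is just a matter of locator strategy; here is the same header <div> selected both ways:

# The same element via two locator strategies
header_by_xpath = driver.find_element(By.XPATH, "//div[@id='header']")
header_by_css = driver.find_element(By.CSS_SELECTOR, "div#header")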
Debugging XPaths in Selenium
Now that you know how to write XPaths for matching elements, let's look at how to debug them when things go wrong. Here are some tips for troubleshooting XPaths in Selenium:
- Print out the XPath – When a find call fails, print the XPath to confirm it's what you intended (the helper sketch after this list automates this).
- Use absolute paths – Temporarily switch to the absolute-path version to isolate issues.
- Match on unique attributes – Rely more on @id and @name than on generic class names.
- Shorten the path – Truncate long paths to isolate the problematic portion.
- Print the DOM structure – Use browser dev tools or Selenium to print the DOM and confirm your XPath aligns with it.
- Enable logs – Turn on Selenium debug logs to see how matching progresses.
- Try a CSS selector – Attempt the same selection with CSS to compare results.
- Use contains() – Switch to contains() rather than exact text matches.
- Break into multiple steps – Split one long, complex path into several simpler ones.
- Use sibling axes – Traverse with following-sibling/preceding-sibling rather than brittle absolute positions.
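A small helper can automate the printing tips; here is a sketch (the function name and diagnostics are my own, not a Selenium API):

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def find_or_report(driver, xpath):
    """Find one element, printing diagnostics when the XPath matches nothing."""
    matches = driver.find_elements(By.XPATH, xpath)
    if not matches:
        print(f"No match for XPath: {xpath}")
        raise NoSuchElementException(xpath)
    if len(matches) > 1:
        print(f"Warning: {len(matches)} matches for {xpath}, using the first")
    return matches[0]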
Spending time to master debugging strategies like these pays off greatly when writing automated scripts.
Example Scraping Project Using XPath
Let's look at a real-world example to see how XPath can be applied when web scraping. Suppose we want to scrape car listings from a site like Craigslist to extract details like price, mileage, location, etc. Here is how we could use XPath to extract key fields:
1. Match all listing items
Use a relative XPath to match the repeating listing items under <div class="listings">:
listing_divs = driver.find_elements(By.XPATH, "//div[@class='listings']/div")
2. Extract price
The price is marked up uniquely as <span class="price">. Select that element and read its text:
# Selenium can only return elements, not text() nodes, so select the <span> and read .text
price = div.find_element(By.XPATH, "./span[@class='price']")
print(price.text)  # "$4200"
3. Extract mileage
Mileage is contained in the <p> description text:
# Find the <p> that carries the mileage info
miles = div.find_element(By.XPATH, "./p[contains(text(), 'mileage:')]")

# Get the substring after the "mileage:" prefix
mileage = miles.text.split(":")[-1].strip()
4. Get location
Location is in a <span> with class "location":
loc = div.find_element(By.XPATH, ".//span[@class='location']")
print(loc.text)  # "San Francisco"
5. Compile final output
Our final parsed car listing data looks like:
{ "price": "$4200", "mileage": "25000 miles", "location": "San Francisco" }
This demonstrates how even complex scraping tasks become possible by methodically applying XPath techniques.
Tips for Writing Robust XPaths
With regular practice, you'll keep improving at writing XPath selectors. Here are some final tips:
- Prefer relative paths – Less fragile and more reusable than absolute paths.
- Leverage attributes – Match on @id, @name, etc. over bare element tags when possible.
- Use text() – Great for matching on visible text nodes.
- Master functions – contains(), normalize-space(), etc. open up advanced matching.
- Shorter is better – Avoid long, complex paths; break them into simpler steps.
- Debug carefully – Print, log, and dissect your XPaths to isolate issues.
- Practice regularly – XPath is a skill; it keeps improving with experience.
Hopefully, you now feel empowered to unleash the power of XPath when scraping websites with Selenium!