Web scraping relies on being able to locate and extract data from complex HTML documents accurately. Two of the main technologies used for parsing HTML when scraping are XPath and CSS selectors.
Though XPath and CSS selectors essentially perform the same function – allowing developers to target elements in HTML – they work quite differently under the hood. Each approach has its own strengths and weaknesses. So when it comes to web scraping, should you use XPath or CSS selectors? Or maybe a combination of both?
Overview
XPath and CSS selectors provide two different approaches to parsing and extracting data from HTML. Though they have some overlap in functionality, here are some key differences:
- XPath allows easy traversal up and down the HTML tree, while CSS selectors can only traverse down.
- XPath has advanced built-in functions like text search that CSS lacks.
- CSS selectors have a more concise and readable syntax in many cases.
- XPath handles complex selections with more flexibility where CSS is limited.
The choice between XPath vs CSS comes down to your specific needs:
- Use CSS for simple element selection.
- Use XPath when traversal or advanced logic is required.
- Combine them together to get the best of both!
That summary only scratches the surface. The sections below look at each approach in more depth so you can better understand the differences between them.
How XPath Works
XPath stands for XML Path Language. It's a query language for selecting nodes in an XML document. HTML is not XML, but an HTML parser produces the same kind of node tree, so XPath can be used to target elements on an HTML page as well. XPath models the document as a tree structure made up of nodes. Here's a simple example:
<html>
  <body>
    <div>
      <p>Example paragraph</p>
    </div>
  </body>
</html>
The corresponding XPath tree would look like:
/html
  /body
    /div
      /p
With XPath, you compose expressions to navigate this tree and find nodes you want to extract.
Some examples of XPath expressions:
- /html/body/div/p – Select the paragraph node
- //div – Select all div nodes
- /html/body/div[1] – Select the first div node under body
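As a rough sketch of how such expressions behave, the snippet below runs them with Python's standard-library xml.etree.ElementTree, which implements only a small subset of XPath. Its queries are evaluated relative to the root element (here html), so the absolute paths are rewritten as relative ones. That rewriting is an assumption of this sketch, not part of XPath itself.

```python
import xml.etree.ElementTree as ET

# The sample document from above, parsed as well-formed XML.
doc = "<html><body><div><p>Example paragraph</p></div></body></html>"
root = ET.fromstring(doc)  # root is the <html> element

# /html/body/div/p, written relative to the root element
paragraph = root.find("./body/div/p")
print(paragraph.text)

# //div: every div anywhere below the root
all_divs = root.findall(".//div")
print(len(all_divs))

# /html/body/div[1]: the first div directly under body (XPath counts from 1)
first_div = root.find("./body/div[1]")
print(first_div is all_divs[0])
```

A full XPath engine such as lxml accepts the absolute forms directly. The standard-library module is used here only to keep the example dependency-free.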
XPath has a large set of capabilities for targeting nodes including:
- Navigating the tree with / and //
- Indexing with [n]
- Wildcards with *
- Searching by attribute with @
- Filtering with predicates []
- Searching by text contents with contains() and other functions
This allows you to write complex expressions that home in on specific elements, even in messy HTML.
How CSS Selectors Work
CSS stands for Cascading Style Sheets. It's the language used to style web pages. CSS selectors allow developers to target HTML elements to apply styling rules. Some examples:
p { color: blue; }
div.product { border: 1px solid black; }
a[target="_blank"] { text-decoration: underline; }
Since CSS selectors are designed to find elements on a page, they can also be used for extracting data when web scraping. The main types of CSS selectors are:
- Type selectors – Match by element type like div
- Class selectors – Match by class name like .product
- ID selectors – Match by ID attribute like #header
- Attribute selectors – Match by attribute value like [target=_blank]
- Pseudo-selectors – Special selectors like :first-child
- Combinators – Combine other selectors like div.product > p
Some key differences vs XPath are:
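- CSS can't select a parent or ancestor of a matched element (the newer :has() pseudo-class is a partial workaround, but support in scraping libraries varies)
- Wildcards (*) and positional selection (:nth-child(), :nth-of-type()) exist, but are less flexible than XPath predicates
- More limited search functions
- But the syntax is more concise overall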
Now that we've covered the basics of how XPath and CSS selectors work, let's look at some key differences in more detail.
Key Difference #1: Traversing the HTML Tree
A major difference between XPath and CSS is how they allow you to navigate the hierarchical structure of HTML elements.
XPath can easily traverse both up and down the HTML tree.
For example, to select the body element from a div deep in the page:
//div[@class='product']/ancestor::body
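The ancestor axis allows going up to any ancestor element, not just the immediate parent. CSS selectors, in contrast, move down the tree by combining selectors, or sideways to later siblings with the + and ~ combinators. There is no combinator that selects a parent. The closest we can get is to start from the ancestor and descend, which matches the div rather than the body: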
body div.product { /* Styles */ }
This is a key advantage of XPath in that it can contextualize elements by their position in the full HTML document.
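Upward traversal can be sketched even with Python's standard-library xml.etree.ElementTree. It does not implement the ancestor axis, but its XPath subset does support ".." for the parent step. The document and the section id below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page structure, used only for this sketch.
doc = """
<html>
  <body>
    <section id="catalog">
      <div class="product"><p>Widget</p></div>
    </section>
  </body>
</html>
"""
root = ET.fromstring(doc)

# Go down to the product div, then one step up with ".." (the parent step).
parents = root.findall(".//div[@class='product']/..")
print([el.tag for el in parents])

# Two steps up reaches <body>, similar in spirit to ancestor::body.
grandparents = root.findall(".//div[@class='product']/../..")
print([el.tag for el in grandparents])
```

With a full XPath engine, ancestor::body removes the need to count the steps, which matters when the nesting depth differs between pages.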
Key Difference #2: Searching by Text Content
Another major advantage of XPath is the ability to find elements by their text content. For example, to find a div containing the text “Hello world”:
//div[contains(text(), 'Hello world')]
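The contains() function tests whether a string contains a given substring. Note that text() refers only to the element's own text nodes, and XPath 1.0 tests just the first of them here. To search the element's full text, including nested elements, use contains(., 'Hello world') instead.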
CSS has no native way to select elements by text. The closest equivalent would be attribute selectors:
div[title='Hello world'] { /* Styles */ }
But this only works if the text is present in an attribute value. XPath's text search is much more flexible.
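To make the difference between exact and substring text matching concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree (Python 3.7 or later). Its limited XPath subset has no contains() function, so the substring case is emulated in plain Python. That emulation and the sample markup are assumptions of this example, not a feature of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, used only for this sketch.
doc = """
<body>
  <div>Hello world</div>
  <div>Say <b>Hello world</b> again</div>
  <div>Goodbye</div>
</body>
"""
root = ET.fromstring(doc)

# Exact match on the element's complete text, nested elements included.
exact = root.findall(".//div[.='Hello world']")
print(len(exact))

# Substring match, emulating contains(., 'Hello world') by joining
# all text inside each div, including text of nested elements.
partial = [
    div for div in root.iter("div")
    if "Hello world" in "".join(div.itertext())
]
print(len(partial))
```

Only the first div matches exactly, while the substring test also picks up the second div, whose text is split around a nested element.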
Key Difference #3: Additional Functions
XPath includes many built-in functions that are useful for scraping, but not present in CSS selectors. Some examples:
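- substring(), concat(), string-length() – Handle strings
- contains(), starts-with() – Search text (ends-with() requires XPath 2.0)
- translate() – Replace characters
- boolean(), not(), plus the and / or operators – Boolean logic
- sum() – Math operations (avg(), min(), max() require XPath 2.0)
- ceiling(), floor(), round() – Number rounding
- Regular expression functions for pattern matching (matches() and replace() in XPath 2.0; some XPath 1.0 engines, such as lxml, offer them through EXSLT extensions)
Some libraries, such as lxml, also let you register custom extension functions. Together, these give XPath far more built-in power for data extraction and parsing tasks than CSS selectors offer. Keep in mind that most scraping libraries and browsers implement XPath 1.0, so the XPath 2.0 functions noted above may not be available.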
Key Difference #4: Syntax and Readability
In general, CSS selectors have a more concise and readable syntax compared to long XPath expressions. For example, to get all p elements with a class containing “highlight”:
p[class*="highlight"] { /* Styles */ }
Compared to the equivalent XPath:
//p[contains(@class, 'highlight')]
The CSS is simpler in this case. For more complex selections, though, XPath gives you more flexibility at the cost of verbosity.
When to Use XPath vs CSS Selectors
Given the different strengths and weaknesses, when should you use XPath vs CSS Selectors when scraping pages? Here are some best practices:
- Use CSS selectors for simple element selection, especially by class, ID or attributes. The syntax is more compact.
- Use XPath when you need to traverse up the HTML tree or leverage advanced functions like text search.
- Use XPath for complex selections where you need to filter elements based on position, siblings, values, etc.
- Use CSS for style-related selection, such as matching elements by their inline style attribute (for example [style*="color"]).
- Prefer XPath when scraping XML or XHTML documents.
- Use CSS selectors for integration with browser tools like querySelector.
- Mix them together to take advantage of both tools!
Here is a quick reference guide for when to favor XPath or CSS:
Task | Prefer XPath or CSS? |
---|---|
Simple element selection | CSS |
Traverse HTML tree | XPath |
Search by text | XPath |
Advanced functions | XPath |
Concise syntax | CSS |
Complex selections | XPath |
Style related | CSS |
XML/XHTML documents | XPath |
Browser integration | CSS |
Powerful Techniques for Combining XPath and CSS
In many cases, the best approach is to combine XPath and CSS selectors together. This allows you to benefit from the concise syntax of CSS and the power of XPath where needed. Some examples of combining them:
Use CSS to select elements, then XPath to filter
For example:
divs = response.css('div')  # All divs
prices = divs.xpath('./span[@class="price"]/text()')  # Filter for price
Use XPath to traverse up tree, CSS to go back down
For example:
products = response.xpath('//div[@id="products"]/../div')  # Traverse up
names = products.css('h2.name::text')  # Back down with CSS
Apply CSS classes/IDs then use XPath text search
For example:
panels = response.css('.panel')
ip_addresses = panels.xpath('.//p[contains(text(), "IP Address:")]/text()')
As you can see, combining XPath and CSS can give you the simplicity of CSS with the power of XPath where needed.
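The pattern shared by these examples is a broad selection followed by a relative query run inside each match. A minimal sketch of that pattern, using Python's standard-library xml.etree.ElementTree in place of a Scrapy response, might look like the following. The markup, class names, and prices are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical product listing, used only for this sketch.
doc = """
<div id="products">
  <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
  <div class="ad"><span class="price">0.00</span></div>
</div>
"""
root = ET.fromstring(doc)

# Step 1: a broad selection of container elements.
products = root.findall("./div[@class='product']")

# Step 2: a relative query ("./...") run inside each container,
# so prices outside a product (like the ad) are never seen.
items = [
    (p.findtext("./h2"), float(p.findtext("./span[@class='price']")))
    for p in products
]
print(items)
```

The leading "./" in the second step matters. In a full XPath engine, a query starting with "//" would search the whole document again instead of staying inside the selected container.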
Which is Better – XPath or CSS Selectors?
Given their different strengths and weaknesses, is one of these parsing languages better overall for web scraping?
The short answer is that it depends on your specific needs.
For simple scraping tasks, CSS selectors might be perfectly sufficient with their more compact syntax. But for complex data extraction, XPath gives you more tools to handle messy HTML structures.
Their similarities are:
- Both are W3C standards for navigating XML/HTML
- Can be used independently or together
- Supported by libraries in most major programming languages
Some key differences:
XPath | CSS Selectors |
---|---|
More complex syntax | More concise syntax |
Can traverse up and down | Only downward traversal |
Advanced functions | Limited functions |
Powerful but verbose | Simple but limited |
In most cases, the best approach is to combine XPath and CSS selectors together in your scrapers. This allows you to keep simple selections concise with CSS, while leveraging XPath for more robust logic when required.
Scraping Tools That Support XPath and CSS
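Most major web scraping frameworks and libraries support XPath, CSS selectors, or both, though not every tool supports both natively.
Here are some examples:
- Python: Scrapy and parsel (both), lxml (XPath natively, CSS through the cssselect package), Beautiful Soup (CSS selectors only)
- JavaScript: Puppeteer (both), Cheerio (CSS selectors only)
- Ruby: Nokogiri (both)
- Java: jsoup (CSS selectors, with XPath support added in recent versions)
- C#: HtmlAgilityPack (XPath natively, CSS through add-on packages)
- PHP: Simple HTML DOM (CSS-style selectors only)
And many other languages and tools… With the right library, you'll be able to use both XPath and CSS selectors for almost any web scraping project.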
Whether you choose XPath, CSS, or a mix, a strong understanding of these key technologies will help you become an expert web scraper capable of extracting data from even the most complex websites.