Selecting elements between two anchors in XPath is a common requirement when scraping or parsing HTML and XML documents. With XPath's powerful location path expressions, we can easily select elements based on their position relative to “anchor” nodes in the document tree.
This guide walks through the main techniques with examples so that you can apply them in your own projects. Let's get started!
Overview of Selecting Between Elements in XPath
Before looking at specific techniques, we need to understand some key concepts about XPath and how it models HTML/XML structure:
XPath Models the Document as a Tree
Fundamentally, XPath views XML/HTML as a tree structure with parent-child node relationships. All elements in the document can be traversed based on their position within this hierarchical tree.
Location Path Expressions
An XPath location path like:
/div/section/p[1]
Allows selecting nodes by following the tree relationships:
- div: select the root <div> element
- section: then select its <section> child
- p[1]: finally pick the first <p> under <section>
Any element can be targeted based on its positional connection to other elements.
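As a quick check, the path above can be evaluated from Python. This is a minimal sketch that assumes the third-party lxml library is installed; the sample markup is made up for illustration.

```python
from lxml import etree

# Hypothetical document whose root element is <div>.
root = etree.fromstring(
    "<div><section><p>First</p><p>Second</p></section></div>"
)

# Absolute path: root <div>, its <section> child, then the first <p>.
first_texts = [p.text for p in root.xpath("/div/section/p[1]")]
print(first_texts)
```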
Axes for Navigating Relationships
XPath provides various axes for navigating the document tree:
| Axis | Description | Example |
|---|---|---|
| child | Selects direct children of context node | section/child::p |
| parent | Selects direct parent of context node | p/parent::section |
| ancestor | Selects all ancestors of context node | p/ancestor::div |
| descendant | Selects all descendants under context node | div/descendant::p |
| following-sibling | Selects later siblings of context node | h2/following-sibling::p |
| preceding-sibling | Selects prior siblings of context node | h2/preceding-sibling::p |
These allow flexibility in moving between different element relationships.
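To see what each axis returns from one context node, the sketch below evaluates a few of them against a small, made-up document. It assumes lxml is available; the variable names are illustrative only.

```python
from lxml import etree

root = etree.fromstring(
    "<div><section><p>Intro</p><h2>Title</h2><p>Body</p></section></div>"
)
h2 = root.xpath("//h2")[0]  # context node for the relative steps below

axis_results = {
    "parent": [e.tag for e in h2.xpath("parent::*")],
    "ancestor": [e.tag for e in h2.xpath("ancestor::*")],
    "following-sibling": [e.text for e in h2.xpath("following-sibling::p")],
    "preceding-sibling": [e.text for e in h2.xpath("preceding-sibling::p")],
    "descendant": [e.text for e in root.xpath("descendant::p")],
}
for axis, result in axis_results.items():
    print(axis, result)
```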
Key Advantages
This gives us some key powers when selecting between elements:
- Move up/down between parent and child elements
- Retrieve earlier or later siblings
- Select descendants at any level under a node
- Combine axes to target elements precisely
With this foundation, let's explore specific techniques for selecting between two anchors.
Using Preceding and Following Sibling Axes
When the anchor elements have some unique way to identify them, the simplest approach is to use the preceding-sibling and following-sibling axes.
Matching by Element Text Content
A common scenario is when the anchors contain distinct text values. For example:
<div>
  <p>Before</p>
  <h2>Start Here</h2>
  <p>Match 1</p>
  <p>Match 2</p>
  <h2>End Here</h2>
  <p>After</p>
</div>
We want to select the <p> elements between the <h2> tags with “Start Here” and “End Here” text. The XPath would be:
//h2[text()="Start Here"]/following-sibling::p[following-sibling::h2[text()="End Here"]]
Let's break this down:
- //h2[text()="Start Here"] matches the opening <h2> anchor
- /following-sibling::p selects all subsequent <p> sibling elements
- [following-sibling::h2[text()="End Here"]] keeps only the <p> elements that still have the closing <h2> anchor after them, which excludes everything past the end anchor
This reliably grabs the elements between the two text-defined anchors.
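The direction of the filter predicate is easy to get backwards, so it helps to run both variants. The sketch below assumes lxml is installed and uses the sample markup from this section: a following-sibling filter keeps the paragraphs before the end anchor, while a preceding-sibling filter keeps the ones after it.

```python
from lxml import etree

root = etree.fromstring(
    "<div>"
    "<p>Before</p><h2>Start Here</h2>"
    "<p>Match 1</p><p>Match 2</p>"
    "<h2>End Here</h2><p>After</p>"
    "</div>"
)

start = '//h2[text()="Start Here"]/following-sibling::p'

# Keep <p> elements that still have the end anchor AFTER them.
between = root.xpath(start + '[following-sibling::h2[text()="End Here"]]')
# Keep <p> elements that have the end anchor BEFORE them.
past_end = root.xpath(start + '[preceding-sibling::h2[text()="End Here"]]')

between_texts = [p.text for p in between]
past_end_texts = [p.text for p in past_end]
print(between_texts, past_end_texts)
```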
Benefits:
- Simple and readable for defined text values
- Flexibly allows any element types as anchors
- Works at any depth in the tree, as long as the anchors and targets share a parent
Drawbacks:
- Brittle if anchor text changes frequently
- Exact text() comparison breaks on stray whitespace, while looser contains() matching can hit unintended anchors
Matching by unique fixed text provides an easy and intuitive way to select between known anchors.
Matching by Element Attributes
A safer alternative to matching text is using stable element attributes like ID. For example:
<div>
  <p>Before</p>
  <h2 id="start">Start</h2>
  <p>Match 1</p>
  <p>Match 2</p>
  <h2 id="end">End</h2>
  <p>After</p>
</div>
We can grab the <p> elements between <h2> anchors with IDs “start” and “end”:
//h2[@id="start"]/following-sibling::p[following-sibling::h2[@id="end"]]
Here we match the anchor <h2> tags by id attribute instead of unstable text content.
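When the anchor ids vary, the expression can be wrapped in a small helper. This is a sketch only: it assumes lxml, and paragraphs_between is a hypothetical name, not part of any library. The ids are passed as XPath variables ($start, $end), which lxml fills from keyword arguments, so no string formatting is needed.

```python
from lxml import etree

def paragraphs_between(root, start_id, end_id):
    """Return <p> siblings located between two <h2> anchors, matched by id."""
    expr = (
        "//h2[@id=$start]/following-sibling::p"
        "[following-sibling::h2[@id=$end]]"
    )
    return root.xpath(expr, start=start_id, end=end_id)

root = etree.fromstring(
    "<div>"
    '<p>Before</p><h2 id="start">Start</h2>'
    "<p>Match 1</p><p>Match 2</p>"
    '<h2 id="end">End</h2><p>After</p>'
    "</div>"
)
id_matches = [p.text for p in paragraphs_between(root, "start", "end")]
print(id_matches)
```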
Benefits:
- Very reliable for unique attributes like ID
- Not affected by text changes
- Works for deeply nested elements
Drawbacks:
- Relies on anchors having meaningful attributes
- Can be tedious if matching many attributes
Matching by fixed anchor attributes provides robustness over text content when available. Attributes such as id, class, and data-* are the usual candidates for anchoring.
Handling Multi-Level Nested Elements
This technique also works for anchors and elements nested across multiple levels:
<div>
<section id="start">
<h2>Start Here</h2>
<article>
<p>Match 1</p>
</article>
<article>
<p>Match 2</p>
</article>
</section>
<section id="end">
<h2>End Here</h2>
</section>
</div>
The following XPath will match <p> elements in the <article> tags between the <section> anchors:
//section[@id="start"]//article/p[following::section[@id="end"]]
Notice we:
- Use // to reach the <article> elements at any level inside the start <section>
- Use following rather than following-sibling for the end anchor, since the <p> elements and the end <section> are not on the same level
This demonstrates the power of combining descendant steps with the following and preceding axes to target elements between anchors nested across multiple levels.
Combining descendant and sibling axes provides maximum flexibility to handle complex real-world HTML and XML structures.
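A runnable sketch of the nested case, assuming lxml and the sample markup from this section. It descends from the start <section> to the <article> paragraphs and uses the following axis to require that the end <section> comes later in the document; a sibling-axis attempt is included for contrast.

```python
from lxml import etree

root = etree.fromstring(
    "<div>"
    '<section id="start"><h2>Start Here</h2>'
    "<article><p>Match 1</p></article>"
    "<article><p>Match 2</p></article>"
    "</section>"
    '<section id="end"><h2>End Here</h2></section>'
    "</div>"
)

nested = root.xpath(
    '//section[@id="start"]//article/p[following::section[@id="end"]]'
)
# The <p> elements are not siblings of the start anchor, so this finds nothing.
siblings_only = root.xpath('//section[@id="start"]/following-sibling::p')

nested_texts = [p.text for p in nested]
print(nested_texts, siblings_only)
```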
Handling Peer Elements
Sometimes the anchor and target elements are siblings at the same level. For example:
<div>
  <h2 id="start">Start</h2>
  <p>Match 1</p>
  <p>Match 2</p>
  <h2 id="end">End</h2>
</div>
We can use following and preceding instead of following-sibling and preceding-sibling here:
//h2[@id="start"]/following::p[following::h2[@id="end"]]
This will select the <p> elements:
- Following the start <h2> anchor
- And preceding the end <h2> anchor, because each match must still have that anchor after it
The advantage compared to using following-sibling and preceding-sibling is that these axes also reach targets nested inside other elements between the anchors, which the sibling axes cannot see.
For example:
<div>
  <h2 id="start">Start</h2>
  <section>Other Content</section>
  <p>Match 1</p>
  <p>Match 2</p>
  <h2 id="end">End</h2>
</div>
Both pairs of axes work here, because the <p> elements are still siblings of the anchors. If the <p> elements were moved inside the <section>, only following and preceding would keep matching them.
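The difference between the two pairs of axes shows up once a target is wrapped in another element. The sketch below assumes lxml; the markup is a hypothetical variation of the example in which one paragraph sits inside a <section>.

```python
from lxml import etree

root = etree.fromstring(
    "<div>"
    '<h2 id="start">Start</h2>'
    "<section><p>Match 1</p></section>"
    "<p>Match 2</p>"
    '<h2 id="end">End</h2>'
    "<p>After</p>"
    "</div>"
)

# Sibling axes only see nodes that share the anchors' parent.
via_siblings = root.xpath(
    '//h2[@id="start"]/following-sibling::p[following-sibling::h2[@id="end"]]'
)
# following reaches nodes at any depth after the context node.
via_following = root.xpath(
    '//h2[@id="start"]/following::p[following::h2[@id="end"]]'
)

sibling_texts = [p.text for p in via_siblings]
following_texts = [p.text for p in via_following]
print(sibling_texts, following_texts)
```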
Benefits:
- Flexibly handles peer elements on the same level
- Not affected by new elements inserted between anchors, including wrappers around the targets
- No need to know whether the targets are direct siblings of the anchors
This technique provides maximum robustness for anchors and elements that are sibling peers in the document.
Leveraging Element Positions
When anchors don't have reliable text content or attributes to match, we can leverage their positional indexes to select elements between them.
Using Indexes Directly
The simplest approach is using direct numeric positional indexes on the anchors. For example:
<div>
  <p>Before 1</p>
  <p>Before 2</p>
  <h2>Anchor 1</h2>
  <p>Match 1</p>
  <p>Match 2</p>
  <h2>Anchor 2</h2>
  <p>After 1</p>
  <p>After 2</p>
</div>
To select elements between <h2> anchors 1 and 2:
//h2[1]/following-sibling::p[not(preceding-sibling::h2[2])]
Here:
- //h2[1] selects the first <h2> anchor
- /following-sibling::p gets subsequent <p> siblings
- [not(preceding-sibling::h2[2])] drops any <p> that already has two <h2> siblings before it, that is, any <p> past the second anchor
This relies on the anchors consistently being in the first and second <h2> positions.
Benefits:
- Very simple and readable syntax
- Useful for static documents and templates
Drawbacks:
- Brittle if document structure changes
- Requires fixed anchoring tag positions
Explicit positional indexes provide an easy way to select between anchors when order is guaranteed.
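Positional indexes on a reverse axis such as preceding-sibling count backwards from the context node, which is a frequent source of confusion. The sketch below, assuming lxml and the sample markup from this section, shows that from a late paragraph h2[1] is the nearest anchor and h2[2] the one before it, and then uses that fact to keep only the paragraphs between the two anchors.

```python
from lxml import etree

root = etree.fromstring(
    "<div>"
    "<p>Before 1</p><p>Before 2</p><h2>Anchor 1</h2>"
    "<p>Match 1</p><p>Match 2</p><h2>Anchor 2</h2>"
    "<p>After 1</p><p>After 2</p>"
    "</div>"
)

after_1 = root.xpath('//p[text()="After 1"]')[0]
# On a reverse axis, [1] is the closest node going backwards.
nearest = after_1.xpath("preceding-sibling::h2[1]")[0].text
second_nearest = after_1.xpath("preceding-sibling::h2[2]")[0].text

# Paragraphs after the first anchor that do not yet have two anchors before them.
indexed = root.xpath(
    "//h2[1]/following-sibling::p[not(preceding-sibling::h2[2])]"
)
indexed_texts = [p.text for p in indexed]
print(nearest, second_nearest, indexed_texts)
```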
Using Dynamic Count Instead of Indexes
A more robust option is using count() instead of hard-coded positional indexes. For example:
//h2[1]/following-sibling::p[count(preceding-sibling::h2)=1]
This will find <p> elements:
- Coming after the first <h2>
- With only one <h2> element before them
This avoids hard-coding the position of the second anchor.
Benefits:
- Doesn't depend on definite element positions
- Works even if new non-anchor elements are added
- Helpful when anchor order may change
Drawbacks:
- count() can get slow on large documents
- Not as self-documenting as explicit indexes
Using count() provides more flexibility with documents that have dynamic or evolving structures.
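To make the count() filter concrete, the following sketch prints how many <h2> siblings precede each paragraph and then applies the filter. It assumes lxml and reuses the sample markup from the previous subsection.

```python
from lxml import etree

root = etree.fromstring(
    "<div>"
    "<p>Before 1</p><p>Before 2</p><h2>Anchor 1</h2>"
    "<p>Match 1</p><p>Match 2</p><h2>Anchor 2</h2>"
    "<p>After 1</p><p>After 2</p>"
    "</div>"
)

# count() returns a float in lxml, so convert it for readability.
h2_counts = [
    int(p.xpath("count(preceding-sibling::h2)")) for p in root.xpath("//p")
]

counted = root.xpath(
    "//h2[1]/following-sibling::p[count(preceding-sibling::h2)=1]"
)
counted_texts = [p.text for p in counted]
print(h2_counts, counted_texts)
```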
Intelligently Combining Position Criteria
For optimal robustness, we can combine position criteria with uniqueness filters. For example:
<!-- Other elements -->
<h2>Start Here</h2>
<p>Match 1</p>
<p>Match 2</p>
<!-- Other elements -->
<h2>End Here</h2>
<!-- Other elements -->
We can leverage both textual uniqueness and positional relationships:
//h2[text()="Start Here"]/following-sibling::p[count(following-sibling::h2[text()="End Here"]) = 1]
This selects <p> elements that:
- Follow the “Start Here” anchor <h2>
- Have exactly one “End Here” <h2> after them
This harnesses explicit anchor text matching along with count()-based positioning. Hybrid criteria like this give tighter control when selecting between elements in documents with dynamic structure.
Selecting ALL Elements Between Anchors
The examples so far retrieved specific elements like <p> between the anchors. To select ALL elements regardless of type between the anchors, we can use:
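(//h2)[1]/following::*[not(self::h2)][count(preceding::h2) = 1]
This will match any element (*) between the first and second <h2> anchor elements in document order. The parentheses make [1] pick the first <h2> in the whole document, whereas //h2[1] would pick every <h2> that is the first <h2> child of its parent. The count() filter keeps elements with exactly one <h2> before them, and not(self::h2) drops the end anchor itself.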
Benefits:
- Concise syntax for all elements
- Flexibly allows mixing anchor and target types
Drawbacks:
- Could grab unintended metadata/junk elements
- More post-processing required
Matching all elements is useful when we don't know or care about the specific types present between the anchors. The * wildcard grabs all content between known anchors without hardcoding specific element types.
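A small sketch of wildcard selection, assuming lxml and a made-up document with mixed element types. It keeps every element after the first <h2> in the document that has exactly one <h2> before it, and excludes the second anchor itself. Note that nested elements such as <li> are returned alongside their container.

```python
from lxml import etree

root = etree.fromstring(
    "<div>"
    "<h2>A</h2><p>x</p><ul><li>y</li></ul>"
    "<h2>B</h2><p>z</p>"
    "</div>"
)

everything_between = root.xpath(
    "(//h2)[1]/following::*[not(self::h2)][count(preceding::h2) = 1]"
)
between_tags = [e.tag for e in everything_between]
print(between_tags)
```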
Handling Distantly Separated Elements
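The following and preceding axes are not limited to one level: following covers every node that starts after the context node in document order, at any depth, and preceding covers every node that ends before it. Only the sibling axes are restricted to a single level. So for anchors and targets that sit far apart in the tree, no extra // is needed:
(//h2)[1]/following::p[count(preceding::h2) = 1]
Avoid adding // inside the predicate. A path that starts with // is evaluated from the document root, so a filter such as [//preceding::h2[2]] gives the same answer for every candidate node instead of testing each one.
Benefits:
- Allows selection between distant elements
- Useful for deep or fragmented DOMs
- Avoids the need for complex recursive descendant queries
Drawbacks:
- Can hurt performance on huge documents
- Risk of unintended matches deep in the document
The following and preceding axes provide a simple way to handle anchors and targets separated across a broad DOM tree when needed.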
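One detail worth verifying by hand is how // behaves inside a predicate: a path that begins with // is evaluated from the document root, not from the node being filtered. The sketch below assumes lxml and a made-up document; it contrasts a root-anchored predicate, which lets every candidate through, with a relative one that checks each candidate individually.

```python
from lxml import etree

root = etree.fromstring(
    "<div>"
    '<h2 id="start">S</h2><p>Match</p>'
    '<h2 id="end">E</h2><p>After</p>'
    "</div>"
)

# Root-anchored predicate: true for every <p> because an end anchor exists somewhere.
root_anchored = root.xpath(
    '//h2[@id="start"]/following::p[//h2[@id="end"]]'
)
# Relative predicate: true only when the end anchor follows this particular <p>.
per_node = root.xpath(
    '//h2[@id="start"]/following::p[following::h2[@id="end"]]'
)

root_anchored_texts = [p.text for p in root_anchored]
per_node_texts = [p.text for p in per_node]
print(root_anchored_texts, per_node_texts)
```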
Putting It All Together: A Robust Example
Let's combine some of these techniques into a robust XPath to handle real-world challenges:
<html>
<header>
<h1>Page Title</h1>
</header>
<section id="content">
<div class="post">
<h2>Start Here</h2>
<p>Match</p>
<aside>Ads</aside>
<p>Match</p>
</div>
<div class="post">
<p>Other Content</p>
<div class="comments">Comments</div>
</div>
<div class="post">
<h2>End Here</h2>
<p>After</p>
</div>
</section>
<footer>
<p>Footer</p>
</footer>
</html>
Here we want to:
- Handle anchors and content separated across levels
- Allow intervening elements like <aside> and <div>
- Match any element types between the <h2> anchors
We can use:
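//section[@id="content"]//h2[text()="Start Here"]/following::*[following::h2[text()="End Here"]]
This:
- Uses the id attribute to identify the parent <section> context
- Matches anchors by unique text
- Uses the following axis, which already reaches elements at any depth
- Selects any elements (*) between the <h2> anchors
Note that the third <div class="post"> is not selected, because the “End Here” <h2> sits inside it rather than after it.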
Despite complex nesting and intermediate elements, this provides robust element selection between the anchors.
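The sketch below runs this kind of selection against the sample page, assuming lxml is installed and the markup is parsed as well-formed XML. It lists the tag names of everything that follows the “Start Here” heading and still has the “End Here” heading after it.

```python
from lxml import etree

page = etree.fromstring(
    "<html>"
    "<header><h1>Page Title</h1></header>"
    '<section id="content">'
    '<div class="post"><h2>Start Here</h2><p>Match</p>'
    "<aside>Ads</aside><p>Match</p></div>"
    '<div class="post"><p>Other Content</p>'
    '<div class="comments">Comments</div></div>'
    '<div class="post"><h2>End Here</h2><p>After</p></div>'
    "</section>"
    "<footer><p>Footer</p></footer>"
    "</html>"
)

selected = page.xpath(
    '//section[@id="content"]//h2[text()="Start Here"]'
    '/following::*[following::h2[text()="End Here"]]'
)
selected_tags = [e.tag for e in selected]
print(selected_tags)
```

The result includes the second <div class="post"> and its children, so some post-processing may be needed if only leaf content is wanted.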
Common Pitfalls and Troubleshooting
Mastering XPath element selection relies on an understanding of its “gotchas” and problem scenarios:
Anchor Ambiguity
Ensure anchors have unique identifiers and are not ambiguous. For example, avoid:
<h2>Title</h2> <h2>Title</h2> <!-- Ambiguous -->
Prefer using id or class attributes for precision:
<h2 id="start">Start</h2> <h2 id="end">End</h2>
Position Volatility
Avoid reliance on fixed positional indexes, which break easily:
//h2[1]/following-sibling::p
This will select the wrong range if a new <h2> is introduced before the anchors. Where possible, identify the anchors by attribute or text and use count() only for the relative check:
//h2[@id="start"]/following-sibling::p[count(preceding-sibling::h2[@id="end"]) = 0]
Context Tunnel Vision
Clearly define the overall context before selecting between elements:
/html/body//h2[1]/following-sibling::p
Rather than:
//h2[1]/following-sibling::p
Narrowing the context avoids stray matches.
Forgotten Closing Predicate
An unbalanced quote or square bracket in a predicate makes the expression invalid, and most engines reject it with a syntax error:
//h2[text()='Start]/following-sibling::p
Should be:
//h2[text()='Start']/following-sibling::p
Always double check quote and bracket balancing.
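In practice a malformed expression tends to surface as an exception rather than as wrong matches. The sketch below assumes lxml, where XPath failures derive from etree.XPathError; try_xpath is a hypothetical helper and the exact error message depends on the underlying library version.

```python
from lxml import etree

root = etree.fromstring("<div><h2>Start</h2><p>Body</p></div>")

def try_xpath(expr):
    """Return ('ok', texts) on success or ('error', message) on failure."""
    try:
        return "ok", [e.text for e in root.xpath(expr)]
    except etree.XPathError as exc:
        return "error", str(exc)

# Missing closing quote after Start.
broken = try_xpath("//h2[text()='Start]/following-sibling::p")
fixed = try_xpath("//h2[text()='Start']/following-sibling::p")
print(broken[0], fixed)
```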
Greedier Than Intended
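The following axis runs to the end of the document, so an unbounded step can grab more than intended in deep DOMs:
//h2[1]/following::p
Often safer to bound it with the end anchor:
//h2[@id="start"]/following::p[following::h2[@id="end"]]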
Greedy Wildcard Selection
Selecting all elements between anchors grabs everything:
//h2[1]/following::*
Usually better to specify expected element types like <p>, <div>, etc.
Confusing Adjacent Anchors
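When the anchors are adjacent there is nothing between them, so an empty result is correct rather than a bug. To confirm that the end anchor really comes after the start anchor, look for it with a forward axis:
<h2>Start</h2> <h2>End</h2>
//h2[text()='Start']/following-sibling::h2[text()='End']
NOT:
//h2[text()='Start']/preceding-sibling::h2[text()='End']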
Debugging Steps
When troubleshooting:
- Print out full selected node subtrees to inspect matches
- Split complex expressions into smaller parts
- Add position() checks to validate sequence
- Enable XPath analyzer logs for detailed tracing
Knowing these common issues and debugging practices will help you avoid subtle errors when selecting between anchors with XPath.
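The first two debugging steps can be sketched in a few lines. This assumes lxml; trace_steps is a hypothetical helper that evaluates an expression piece by piece and reports how many nodes survive each step, and the final loop prints the serialized subtree of each match.

```python
from lxml import etree

def trace_steps(root, steps):
    """Evaluate growing prefixes of an expression and record match counts."""
    report = []
    expr = ""
    for step in steps:
        expr += step
        report.append((expr, len(root.xpath(expr))))
    return report

root = etree.fromstring(
    "<div>"
    '<p>Before</p><h2 id="start">Start</h2>'
    "<p>Match 1</p><p>Match 2</p>"
    '<h2 id="end">End</h2><p>After</p>'
    "</div>"
)
steps = [
    '//h2[@id="start"]',
    "/following-sibling::p",
    '[following-sibling::h2[@id="end"]]',
]
report = trace_steps(root, steps)
for expr, count in report:
    print(count, expr)

# Inspect the full subtree of each final match.
for node in root.xpath("".join(steps)):
    print(etree.tostring(node, encoding="unicode"))
```

A sudden drop to zero between two steps points at the step that needs attention.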
Conclusion and Key Takeaways
Selecting elements between known anchors is an extremely common requirement when scraping web pages or processing XML feeds with XPath. This guide explored multiple techniques and best practices for robustly achieving this using XPath axes like following-sibling, preceding-sibling, following, and preceding, together with count() and positional predicates.
With these skills, you can proficiently wield XPath to extract targeted content between known anchors in real-world scenarios.
Happy practicing and happy scraping!