JSON has become the most popular data format on the web. When scraping modern websites, APIs, and web services, you'll often encounter JSON datasets. These can contain hundreds of nested data fields, making JSON parsing an essential part of the web scraping process.
In this guide, we'll explore how to leverage JMESPath – a popular JSON query language – to extract and reshape datasets in Python.
What is JMESPath?
JMESPath stands for JSON Matching Expressions Path and is a query language for parsing JSON documents. It allows writing advanced path expressions to filter, slice, reshape and transform JSON data.
JMESPath works similarly to XPath for XML or CSS selectors for HTML, but is designed specifically for querying JSON datasets. It's implemented in many languages, including Python, JavaScript, Java, Go, PHP and more. Some key features of JMESPath include:
- Queries JSON much like SQL or MongoDB lets you query a database
- Allows selecting, filtering, flattening, indexing, slicing and projecting JSON datasets
- Reshapes JSON by mapping old keys to new ones or collapsing objects and arrays
- Works across different programming languages with only slight syntactical differences
- Implemented for Python via the `jmespath` module we'll use in this tutorial
In web scraping, JMESPath comes in handy with modern JSON-heavy sites and APIs. It makes it easy to extract only the relevant data fields from very large, deeply nested JSON documents.
Now let's look at how we can use JMESPath for parsing JSON in Python.
Installing JMESPath in Python
JMESPath for Python is provided via the `jmespath` package on PyPI. We can install it using pip:
```bash
pip install jmespath
```
After installing, we can import `jmespath` and start using it:
```python
import json
import jmespath

data = {
    "people": [...]
}

# Query data
results = jmespath.search("people[].name", data)
```
The `jmespath.search()` method accepts a JMESPath expression and the JSON data to search, and returns the matched results.
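If you reuse the same expression across many documents (common when scraping paginated APIs), you can also pre-compile it once with `jmespath.compile()`. A minimal sketch:

```python
import jmespath

# Compile the expression once and reuse it across many documents
names = jmespath.compile("people[].name")

pages = [
    {"people": [{"name": "John"}, {"name": "Jane"}]},
    {"people": [{"name": "Bob"}]},
]

for page in pages:
    print(names.search(page))
# ['John', 'Jane']
# ['Bob']

# search() returns None rather than raising an error when nothing matches
print(jmespath.search("missing.key", {"other": 1}))  # None
```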
Now let's explore the common JMESPath features for querying JSON.
Querying JSON with JMESPath Basics
If you've worked with JSON in Python before, JMESPath will feel very familiar. It uses the same dot notation and array indexing as native dictionaries and lists.
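For instance, here is the same lookup written with plain Python indexing and as a JMESPath expression (a minimal comparison sketch):

```python
import jmespath

data = {"people": [{"name": "John", "age": 35}]}

# Native Python access
print(data["people"][0]["name"])                # John

# The equivalent JMESPath expression
print(jmespath.search("people[0].name", data))  # John
```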
Navigating JSON Objects
JSON objects are equivalent to Python dicts. JMESPath accesses their keys using dot syntax:
data = { "people": { "john": {"age": 35}, "jane": {"age": 40} } } jmespath.search('people.john.age', data) # 35
We can traverse multiple layers of nested objects:
data = { "companies": { "acme": { "employees": { "john": {"age": 35}, "jane": {"age": 40} } } } } jmespath.search('companies.acme.employees.john.age', data) # 35
Accessing JSON Arrays
JSON arrays map to Python lists. JMESPath grabs elements or slices using brackets:
data = { "people": [ {"name": "John", "age": 35}, {"name": "Jane", "age": 40} ] } # First element jmespath.search('people[0]', data) # {"name": "John", "age": 35} # Second element jmespath.search('people[1]', data) # {"name": "Jane", "age": 40} # First two elements jmespath.search('people[:2]', data) # [{"name": "John", "age": 35}, {"name": "Jane", "age": 40}]
We can combine dots and brackets to access nested arrays:
data = { "companies": [ { "name": "ABC", "employees": [ {"name": "John", "age": 30}, {"name": "Jane", "age": 25} ] }, # ... ] } # John from first company jmespath.search('companies[0].employees[0]', data) # {"name": "John", "age": 30} # Jane from first company jmespath.search('companies[0].employees[1]', data) # {"name": "Jane", "age": 25}
So far, JMESPath syntax mirrors native Python JSON manipulation. But it can do much more!
Advanced JMESPath Features
Now let's look at some more powerful JMESPath capabilities that really shine for transforming JSON documents.
Filtering Arrays
A common task is filtering arrays to just the elements we want. JMESPath does this using expressions inside brackets:
data = { "people": [ {"name": "John", "age": 30}, {"name": "Jane", "age": 20}, {"name": "Bob", "age": 25} ] } # People older than 25 jmespath.search('people[?age > `25`]', data) # [{"name": "John", "age": 30}] # Names of people older than 25 jmespath.search('people[?age > `25`].name', data) # ["John"]
We can filter by a variety of conditions, as shown in the sketch after this list:
- Exact equality: ``?x == `1` ``
- Greater/less than: ``?x > `1` ``
- String containment: `?contains(x, 'foo')`
- String starts with: `?starts_with(x, 'foo')`
- Array length: ``?length(@) > `1` ``
- Much more: see the JMESPath filter examples
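Here's a short sketch of a few of these filters in action on some made-up sample data:

```python
import jmespath

data = {
    "people": [
        {"name": "John", "age": 30, "tags": ["admin", "staff"]},
        {"name": "Jane", "age": 20, "tags": ["staff"]},
        {"name": "Bob", "age": 25, "tags": []},
    ]
}

# Exact equality
print(jmespath.search("people[?name == 'Jane'].age", data))           # [20]

# String containment
print(jmespath.search("people[?contains(name, 'o')].name", data))     # ['John', 'Bob']

# String starts with
print(jmespath.search("people[?starts_with(name, 'J')].name", data))  # ['John', 'Jane']

# Array length
print(jmespath.search("people[?length(tags) > `1`].name", data))      # ['John']
```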
Filtering is useful for narrowing down arrays to just what's needed.
Flattening Arrays
Often JSON contains arrays nested within arrays. JMESPath makes it easy to flatten everything into a single list using the `[]` flatten operator:
data = { "companies": [ { "employees": [ {"name": "John"}, {"name": "Jane"} ] }, { "employees": [ {"name": "Bob"}, {"name": "Kate"} ] } ] } jmespath.search('companies[*].employees[*].name', data) # ["John", "Jane", "Bob", "Kate"]
The `[]` flatten operator merges nested arrays into a single flat list, while `[*]` projects over an array but keeps the original nested structure.
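Here's a quick sketch contrasting the two operators on the same data as above:

```python
import jmespath

data = {
    "companies": [
        {"employees": [{"name": "John"}, {"name": "Jane"}]},
        {"employees": [{"name": "Bob"}, {"name": "Kate"}]},
    ]
}

# [*] projects but preserves the nesting: one sub-list per company
print(jmespath.search("companies[*].employees[*].name", data))
# [['John', 'Jane'], ['Bob', 'Kate']]

# [] flattens each level into a single list
print(jmespath.search("companies[].employees[].name", data))
# ['John', 'Jane', 'Bob', 'Kate']
```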
Flattening is handy for normalizing nested structures.
Reshaping Objects
One of JMESPath's most useful features is reshaping JSON objects using the multiselect hash `{}` syntax:
data = { "companies": [ { "id": 123, "name": "ABC", "founder": { "firstName": "John", "lastName": "Doe" } }, { "id": 456, "name": "DEF", "founder": { "firstName": "Jane", "lastName": "Doe" } } ] } jmespath.search('companies[].{id: id, name: name, founderName: founder.firstName}', data) # [ # {"id": 123, "name": "ABC", "founderName": "John"}, # {"id": 456, "name": "DEF", "founderName": "Jane"} # ]
We've reshaped the companies array by projecting new objects with just the keys we want. This makes it easy to transform JSON data on the fly. Some other examples:
```python
# Pivot an array of objects into [firstName, lastName] pairs
jmespath.search('employees[*].[firstName, lastName]', data)

# Merge fields into a new object using the join() built-in
jmespath.search("accounts[*].{id: id, fullName: join(' ', [firstName, lastName])}", data)
```
Reshaping is incredibly useful when scraping JSON that doesn't match the format your application expects.
JMESPath vs. Alternatives
Now that we've covered the core functionality, how does JMESPath compare to other JSON parsers and query languages?
JMESPath vs. JSONPath
JSONPath is probably JMESPath's closest competitor. It offers very similar capabilities for filtering, flattening and projecting JSON documents, and the two share similar syntax for basic queries. However, JMESPath has a few advantages:
- JMESPath allows additional expressions for reshaping objects, while JSONPath is more limited
- JMESPath has great library support for Python, Java, JavaScript and other languages
- JMESPath adds useful built-in functions like length, contains, starts_with/ends_with, sort_by and more
- Overall JMESPath feels a bit more fully featured
But JSONPath is also a great option, especially if you need support in a language like Go or C#.
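For a rough feel of the difference, here is the "people older than 25" query from earlier written in both languages (the JSONPath version uses the classic Goessner-style dialect; exact filter syntax varies by implementation):

```python
import jmespath

data = {
    "people": [
        {"name": "John", "age": 30},
        {"name": "Jane", "age": 20},
    ]
}

# JSONPath (classic dialect):  $.people[?(@.age > 25)].name
# JMESPath equivalent:
print(jmespath.search("people[?age > `25`].name", data))  # ['John']
```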
JMESPath vs. XPath
XPath is a query language for parsing XML. It allows traversing XML trees to extract elements and attributes. JMESPath borrows a lot of syntax from XPath but works specifically for JSON structures rather than XML.
In general, XPath has broader support across languages and tools. But JMESPath is much better suited for JSON-based web scraping.
JMESPath vs. CSS Selectors
For HTML scraping, we often use CSS Selectors to extract DOM nodes. The syntax is very different – CSS uses tag, id and class names rather than property accessors. Selectors are limited to DOM scraping. For JSON, JMESPath offers more powerful capabilities to reshape data.
So in summary, JMESPath occupies a nice niche applying XPath-like expressions specifically to JSON documents.
Using JMESPath for Web Scraping
Alright, enough background – let's look at a real example using JMESPath to scrape JSON data! For this demo, we'll build a scraper for Realtor.com – a popular US real estate listings site. Our goal will be extracting key property data from their JSON API.
Here are the steps:
- Fetch raw JSON data from Realtor's API
- Use JMESPath to query and reshape the response
- Output extracted property fields
First, let's install the `requests` module to make API calls:
```bash
pip install requests
```
And import our libraries:
```python
import requests
import json
import jmespath
```
Now let's call the Realtor API to grab JSON data for a property:
property_id = "M000003123" url = f"https://realtor.p.rapidapi.com/properties/v2/detail?property_id={property_id}" headers = { "X-RapidAPI-Key": "YOUR_API_KEY", "X-RapidAPI-Host": "realtor.p.rapidapi.com" } response = requests.get(url, headers=headers) data = json.loads(response.text)
This returns a large JSON document containing every detail about the property:
{ "properties": [ { "address": { "line": "123 Main St", "postal_code": "90210", //... }, "baths": 2, "beds": 4, "building": { //... }, "description": { //... }, //...TONS more fields } ] }
With JMESPath we can query and extract just what we need from this massive response. Let's grab something simple like the address:
```python
jmespath.search('properties[0].address.line', data)
# '123 Main St'
```
And we can also reshape the JSON into a simpler format:
result = jmespath.search(""" properties[0]{ address: address.line, baths: baths, beds: beds, price: price } """, data) print(result)
That outputs:
{ "address": "123 Main St", "baths": 2, "beds": 4, "price": 500000 }
Using JMESPath's filtering and reshaping features, we've pared the large JSON response down to exactly the fields we need. I hope this gives you a good idea of how JMESPath can be leveraged when scraping JSON APIs and websites!
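If the endpoint returned several listings, the same approach would let us filter and reshape in a single expression. A hypothetical sketch, assuming the same field names as above:

```python
# Hypothetical: keep only listings with 3+ beds and reshape them in one go
summary = jmespath.search("""
    properties[?beds >= `3`].{
        address: address.line,
        beds: beds,
        baths: baths,
        price: price
    }
""", data)

print(summary)
```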
Summary
We've explored several key aspects of JMESPath in the discussion above. We started with the basics, touching upon dot notation, array operations, and filtering techniques. As we delved deeper, we examined its advanced features tailored for efficient JSON manipulation.
For context, we also compared JMESPath with alternative query languages such as JSONPath and XPath. To provide a practical perspective, we walked through an example where JMESPath was used to extract real estate data from Realtor's API. For anyone working with web scraping, especially on JSON-heavy platforms, JMESPath proves to be an indispensable tool. I strongly recommend adding it to your toolkit.