JSON has become the most popular data format on the web. When scraping modern websites, APIs, and web services, you'll often encounter JSON datasets. These can contain hundreds of nested data fields, making JSON parsing an essential part of the web scraping process.
In this guide, we'll explore how to leverage JMESPath – a popular JSON query language – to extract and reshape datasets in Python.
What is JMESPath?
JMESPath stands for JSON Matching Expressions Path and is a query language for parsing JSON documents. It allows writing advanced path expressions to filter, slice, reshape and transform JSON data.
JMESPath works similarly to XPath for XML or CSS selectors for HTML, but is designed specifically for querying JSON datasets. It's implemented in many languages, including Python, JavaScript, Java, Go, PHP and more. Some key features of JMESPath include:
- Queries JSON much like SQL or MongoDB lets you query a database
- Allows selecting, filtering, flattening, indexing, slicing and projecting JSON datasets
- Reshapes JSON by mapping old keys to new ones or collapsing objects and arrays
- Works across different programming languages with only slight syntactical differences
- Implemented for Python via the `jmespath` module we'll use in this tutorial
In web scraping, JMESPath comes in handy with modern JSON-heavy sites and APIs. It makes it easy to extract only the relevant data fields from very large, deeply nested JSON documents.
Now let's look at how we can use JMESPath for parsing JSON in Python.
Installing JMESPath in Python
JMESPath for Python is provided via the `jmespath` package on PyPI. We can install it using pip:
```bash
pip install jmespath
```
After installing, we can import `jmespath` and start using it:
```python
import json
import jmespath

data = {
    "people": [...]
}

# Query data
results = jmespath.search("people[].name", data)
```
The `jmespath.search()` method accepts a JMESPath expression and the JSON data to search, and returns the matched results.
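If you reuse the same expression across many documents (common when scraping paginated APIs), you can also pre-compile it once with `jmespath.compile()`. A minimal sketch:

```python
import jmespath

# Compile the expression once and reuse it across many documents
names = jmespath.compile("people[].name")

pages = [
    {"people": [{"name": "John"}, {"name": "Jane"}]},
    {"people": [{"name": "Bob"}]},
]

for page in pages:
    print(names.search(page))
# ['John', 'Jane']
# ['Bob']

# search() returns None rather than raising an error when nothing matches
print(jmespath.search("missing.key", {"other": 1}))  # None
```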
Now let's explore the common JMESPath features for querying JSON.
Querying JSON with JMESPath Basics
If you've worked with JSON in Python before, JMESPath will feel very familiar. It uses the same dot notation and array indexing as native dictionaries and lists.
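For instance, here is the same lookup written with plain Python indexing and as a JMESPath expression (a minimal comparison sketch):

```python
import jmespath

data = {"people": [{"name": "John", "age": 35}]}

# Native Python access
print(data["people"][0]["name"])                # John

# The equivalent JMESPath expression
print(jmespath.search("people[0].name", data))  # John
```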
Navigating JSON Objects
JSON objects are equivalent to Python dicts. JMESPath accesses their keys using dot syntax:
data = { "people": { "john": {"age": 35}, "jane": {"age": 40} } } jmespath.search('people.john.age', data) # 35
We can traverse multiple layers of nested objects:
data = { "companies": { "acme": { "employees": { "john": {"age": 35}, "jane": {"age": 40} } } } } jmespath.search('companies.acme.employees.john.age', data) # 35
Accessing JSON Arrays
JSON arrays map to Python lists. JMESPath grabs elements or slices using brackets:
data = { "people": [ {"name": "John", "age": 35}, {"name": "Jane", "age": 40} ] } # First element jmespath.search('people[0]', data) # {"name": "John", "age": 35} # Second element jmespath.search('people[1]', data) # {"name": "Jane", "age": 40} # First two elements jmespath.search('people[:2]', data) # [{"name": "John", "age": 35}, {"name": "Jane", "age": 40}]
We can combine dots and brackets to access nested arrays:
data = { "companies": [ { "name": "ABC", "employees": [ {"name": "John", "age": 30}, {"name": "Jane", "age": 25} ] }, # ... ] } # John from first company jmespath.search('companies[0].employees[0]', data) # {"name": "John", "age": 30} # Jane from first company jmespath.search('companies[0].employees[1]', data) # {"name": "Jane", "age": 25}
So far, JMESPath syntax mirrors native Python JSON manipulation. But it can do much more!
Advanced JMESPath Features
Now let's look at some more powerful JMESPath capabilities that really shine for transforming JSON documents.
Filtering Arrays
A common task is filtering arrays to just the elements we want. JMESPath does this using expressions inside brackets:
data = { "people": [ {"name": "John", "age": 30}, {"name": "Jane", "age": 20}, {"name": "Bob", "age": 25} ] } # People older than 25 jmespath.search('people[?age > `25`]', data) # [{"name": "John", "age": 30}] # Names of people older than 25 jmespath.search('people[?age > `25`].name', data) # ["John"]
We can filter by a variety of conditions, as shown in the sketch after this list:
- Exact equality: ``?x == `1` ``
- Greater/less than: ``?x > `1` ``
- String containment: `?contains(x, 'foo')`
- String starts with: `?starts_with(x, 'foo')`
- Array length: ``?length(@) > `1` ``
- Much more: see the JMESPath filter examples
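Here's a short sketch of a few of these filters in action on some made-up sample data:

```python
import jmespath

data = {
    "people": [
        {"name": "John", "age": 30, "tags": ["admin", "staff"]},
        {"name": "Jane", "age": 20, "tags": ["staff"]},
        {"name": "Bob", "age": 25, "tags": []},
    ]
}

# Exact equality
print(jmespath.search("people[?name == 'Jane'].age", data))           # [20]

# String containment
print(jmespath.search("people[?contains(name, 'o')].name", data))     # ['John', 'Bob']

# String starts with
print(jmespath.search("people[?starts_with(name, 'J')].name", data))  # ['John', 'Jane']

# Array length
print(jmespath.search("people[?length(tags) > `1`].name", data))      # ['John']
```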
Filtering is useful for narrowing down arrays to just what's needed.
Flattening Arrays
Often JSON contains arrays nested within arrays. JMESPath makes it easy to flatten everything into a single list using the `[]` flatten operator:
data = { "companies": [ { "employees": [ {"name": "John"}, {"name": "Jane"} ] }, { "employees": [ {"name": "Bob"}, {"name": "Kate"} ] } ] } jmespath.search('companies[*].employees[*].name', data) # ["John", "Jane", "Bob", "Kate"]
The `[]` flatten operator merges nested arrays into a single flat list, while `[*]` projects over an array but keeps the original nested structure.
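Here's a quick sketch contrasting the two operators on the same data as above:

```python
import jmespath

data = {
    "companies": [
        {"employees": [{"name": "John"}, {"name": "Jane"}]},
        {"employees": [{"name": "Bob"}, {"name": "Kate"}]},
    ]
}

# [*] projects but preserves the nesting: one sub-list per company
print(jmespath.search("companies[*].employees[*].name", data))
# [['John', 'Jane'], ['Bob', 'Kate']]

# [] flattens each level into a single list
print(jmespath.search("companies[].employees[].name", data))
# ['John', 'Jane', 'Bob', 'Kate']
```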
Flattening is handy for normalizing nested structures.
Reshaping Objects
One of JMESPath's most useful features is reshaping JSON objects using the multiselect hash `{}` syntax:
data = { "companies": [ { "id": 123, "name": "ABC", "founder": { "firstName": "John", "lastName": "Doe" } }, { "id": 456, "name": "DEF", "founder": { "firstName": "Jane", "lastName": "Doe" } } ] } jmespath.search('companies[].{id: id, name: name, founderName: founder.firstName}', data) # [ # {"id": 123, "name": "ABC", "founderName": "John"}, # {"id": 456, "name": "DEF", "founderName": "Jane"} # ]
We've reshaped the companies array by projecting new objects with just the keys we want. This makes it easy to transform JSON data on the fly. Some other examples:
```python
# Pivot an array of objects into [firstName, lastName] pairs
jmespath.search('employees[*].[firstName, lastName]', data)

# Merge fields into a new object using the join() built-in
jmespath.search("accounts[*].{id: id, fullName: join(' ', [firstName, lastName])}", data)
```
Reshaping is incredibly useful when scraping JSON that doesn't match the format your application expects.
JMESPath vs. Alternatives
Now that we've covered the core functionality, how does JMESPath compare to other JSON parsers and query languages?
JMESPath vs. JSONPath
JSONPath is probably JMESPath's closest competitor. It offers very similar capabilities for filtering, flattening and projecting JSON documents, and the two share similar syntax for basic queries. However, JMESPath has a few advantages:
- JMESPath allows additional expressions for reshaping objects, while JSONPath is more limited
- JMESPath has great library support for Python, Java, JavaScript and other languages
- JMESPath adds useful built-in functions like length, contains, starts_with/ends_with, sort_by and more
- Overall JMESPath feels a bit more fully featured
But JSONPath is also a great option, especially if you need support in a language like Go or C#.
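For a rough feel of the difference, here is the "people older than 25" query from earlier written in both languages (the JSONPath version uses the classic Goessner-style dialect; exact filter syntax varies by implementation):

```python
import jmespath

data = {
    "people": [
        {"name": "John", "age": 30},
        {"name": "Jane", "age": 20},
    ]
}

# JSONPath (classic dialect):  $.people[?(@.age > 25)].name
# JMESPath equivalent:
print(jmespath.search("people[?age > `25`].name", data))  # ['John']
```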
JMESPath vs. XPath
XPath is a query language for parsing XML. It allows traversing XML trees to extract elements and attributes. JMESPath borrows a lot of syntax from XPath but works specifically for JSON structures rather than XML.
In general, XPath has broader support across languages and tools. But JMESPath is much better suited for JSON-based web scraping.
JMESPath vs. CSS Selectors
For HTML scraping, we often use CSS Selectors to extract DOM nodes. The syntax is very different – CSS uses tag, id and class names rather than property accessors. Selectors are limited to DOM scraping. For JSON, JMESPath offers more powerful capabilities to reshape data.
So in summary, JMESPath occupies a nice niche applying XPath-like expressions specifically to JSON documents.
Using JMESPath for Web Scraping
Alright, enough background – let's look at a real example using JMESPath to scrape JSON data! For this demo, we'll build a scraper for Realtor.com – a popular US real estate listings site. Our goal will be extracting key property data from their JSON API.
Here are the steps:
- Fetch raw JSON data from Realtor's API
- Use JMESPath to query and reshape the response
- Output extracted property fields
First, let's install the `requests` module to make API calls:
```bash
pip install requests
```
And import our libraries:
```python
import requests
import json
import jmespath
```
Now let's call the Realtor API to grab JSON data for a property:
property_id = "M000003123" url = f"https://realtor.p.rapidapi.com/properties/v2/detail?property_id={property_id}" headers = { "X-RapidAPI-Key": "YOUR_API_KEY", "X-RapidAPI-Host": "realtor.p.rapidapi.com" } response = requests.get(url, headers=headers) data = json.loads(response.text)
This returns a large JSON document containing every detail about the property:
{ "properties": [ { "address": { "line": "123 Main St", "postal_code": "90210", //... }, "baths": 2, "beds": 4, "building": { //... }, "description": { //... }, //...TONS more fields } ] }
With JMESPath we can query and extract just what we need from this massive response. Let's grab something simple like the address:
```python
jmespath.search('properties[0].address.line', data)
# '123 Main St'
```
And we can also reshape the JSON into a simpler format:
result = jmespath.search(""" properties[0]{ address: address.line, baths: baths, beds: beds, price: price } """, data) print(result)
That outputs:
{ "address": "123 Main St", "baths": 2, "beds": 4, "price": 500000 }
Using JMESPath's filtering and reshaping features, we've pared the large JSON response down to exactly the fields we need. I hope this gives you a good idea of how JMESPath can be leveraged when scraping JSON APIs and websites!
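If the endpoint returned several listings, the same approach would let us filter and reshape in a single expression. A hypothetical sketch, assuming the same field names as above:

```python
# Hypothetical: keep only listings with 3+ beds and reshape them in one go
summary = jmespath.search("""
    properties[?beds >= `3`].{
        address: address.line,
        beds: beds,
        baths: baths,
        price: price
    }
""", data)

print(summary)
```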
Summary
We've explored several key aspects of JMESPath in the discussion above. We started with the basics, touching upon dot notation, array operations, and filtering techniques. As we delved deeper, we examined its advanced features tailored for efficient JSON manipulation.
For context, we also compared JMESPath with alternative query languages such as JSONPath and XPath. To provide a practical perspective, we walked through an example where JMESPath was used to extract real estate data from Realtor's API. For anyone working with web scraping, especially on JSON-heavy platforms, JMESPath proves to be an indispensable tool. I strongly recommend adding it to your toolkit.