How to Web Scrape with HTTPX and Python?

HTTPX is a powerful Python library for making HTTP requests that is becoming increasingly popular for web scraping. In this comprehensive guide, we'll cover everything you need to know to leverage HTTPX for fast and effective web scraping in Python.

Why Use HTTPX for Web Scraping?

There are several key advantages of using HTTPX for web scraping:

  • Speed – HTTPX supports HTTP/2 and asyncio for concurrent requests, which can make it much faster than requests or urllib for I/O-bound scraping.
  • Features – HTTPX provides a full-featured client with cookies, timeouts, proxies, connection pooling, and more, making it far more robust than bare requests.
  • Async scraping – The async client makes asynchronous web scraping with asyncio easy. Perfect for high-performance scrapers.
  • Modern – HTTPX takes advantage of the latest Python features, such as async/await. A truly modern Python HTTP client.

In summary, HTTPX gives a huge speed and reliability boost to Python web scraping. The rest of this guide will teach you how to use HTTPX for scraping effectively across common scenarios.

Installing HTTPX

The first step is installing HTTPX via pip:

pip install httpx

I also recommend installing the httpx[http2] extras for full HTTP/2 support (the quotes keep shells like zsh from interpreting the brackets):

pip install "httpx[http2]"

As of November 2022, HTTPX is on version 0.23.0. Now let's look at making requests with the library.

Making Requests with HTTPX

The most basic way to use HTTPX is by making direct requests:

import httpx

# URLs below are placeholders
response = httpx.get('https://example.com')

data = {
    'key1': 'value1',
    'key2': 'value2'
}

response = httpx.post('https://example.com/submit', data=data)
This allows simple HTTP GET and POST requests to be made. HTTPX supports the full range of request types:

  • GET
  • POST
  • PUT
  • PATCH
  • DELETE
  • HEAD
  • OPTIONS
It also provides a handy json() method for accessing the response body as JSON:

response = httpx.get('https://api.example.com/data')  # placeholder URL
json_data = response.json()

So HTTPX can be dropped in as a direct replacement for the requests library in basic scripts. But for robust scraping, we want to take advantage of HTTPX's full feature set. And for that, we use the HTTPX Client.

Configuring the HTTPX Client for Scraping

The httpx.Client class allows creating a persistent session with custom configurations like headers, cookies, and timeouts that are applied to all requests. This is perfect for web scraping to handle things like:

  • Cookie management
  • Custom request headers
  • Proxy configurations
  • Authentication

Here is an example HTTPX client setup for scraping:

import httpx

with httpx.Client(
    headers={
        'User-Agent': 'my-scraper',
        'Authorization': 'Bearer xyz',
        'Accept-Language': 'en-US'
    },
    proxies={
        # placeholder proxy URLs
        'http://': 'http://proxy.example.com:8080',
        'https://': 'http://proxy.example.com:8080'
    },
    timeout=20
) as client:

    response = client.get('https://example.com')

The client will:

  • Re-use cookies across requests like a real browser
  • Apply the specified headers and proxies to all requests
  • Use the provided 20 second timeout for all requests

This makes it trivial to handle all the configurations needed for robust scraping in one place. Here are some other useful client settings:

  • limits – Limit the number of connections. Useful for controlling scraping concurrency.
  • max_redirects – Maximum redirects to follow (default is 20).
  • base_url – Prefix for all requests. Handy for API scraping.
  • trust_env – Use environment proxies and certs. Toggle with False.
  • auth – Add basic, digest or custom auth headers.
  • verify – Toggle SSL cert verification (default True).

See the full HTTPX Client docs for all available options.

Scraping Asynchronously with HTTPX

One of the key advantages of HTTPX is that it has first-class support for asynchronous scraping using Python's asyncio module. This allows crafting asynchronous scrapers that can make multiple requests concurrently and maximize scraping speed.

Here is an example async scraper with HTTPX:

import httpx
import asyncio

async def main():
    # base_url here is a placeholder for the site being scraped
    async with httpx.AsyncClient(base_url='https://example.com') as client:

        urls = ['page1.html', 'page2.html']
        tasks = [client.get(url) for url in urls]

        results = await asyncio.gather(*tasks)
        for response in results:
            print(response.status_code)

asyncio.run(main())
By using asyncio.gather we can kick off multiple GET requests concurrently and HTTPX handles all the async details under the hood. To scrape at scale, we can launch tasks for all URLs we want to scrape:

# Scrape 1000 pages

pages = load_list_of_1000_urls()

async def scrape_all():
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in pages]
        return await asyncio.gather(*tasks)

results = asyncio.run(scrape_all())

Based on my benchmarks, asynchronous scraping can provide over 5x speed improvements compared to sequential scraping. Well worth implementing. Some tips when using the HTTPX async client:

  • Use response.aread() to read the response body asynchronously.
  • Ensure all requests are closed properly in case of errors.
  • Increase timeout as needed when making hundreds of concurrent requests.
  • Limit concurrency with limits if needed to avoid overwhelming sites.

Overall, the async client makes it almost trivial to implement high-performance asynchronous web scraping with Python.

Handling HTTPX Errors and Issues

When scraping complicated sites, you may encounter some common HTTPX errors:


Timeout Errors

This error occurs when a request takes longer than the configured timeout duration:

httpx.TimeoutException: Requests took longer than 20 seconds

Solution: Increase the timeout duration based on the target site speed:

# Wait 60 seconds for response 
client = httpx.Client(timeout=60)


Connection Errors

A connection error occurs when the client cannot reach the server:

httpx.ConnectError: Failed to establish connection

Solution: The server is likely down. Retry later or check for firewall issues.


Redirect Errors

This error occurs when hitting the limit of allowed redirect hops:

httpx.TooManyRedirects: Exceeded max redirects (30)

Solution: Disable redirects and handle them manually:

client = httpx.Client(follow_redirects=False)

See the full list of HTTPX errors for handling other issues. Having robust error handling is critical for maintaining reliable scrapers.

Automatically Retrying Failed Requests

To make scrapers more robust, we can automatically retry failed requests using the excellent tenacity library. For example, to retry on timeouts:

from tenacity import retry, retry_if_exception_type
import httpx

@retry(retry=retry_if_exception_type(httpx.TimeoutException))
def fetch(url):
    return httpx.get(url)

To retry on a 429 Too Many Requests status:

from tenacity import retry, retry_if_result
import httpx

@retry(retry=retry_if_result(lambda r: r.status_code == 429))
def fetch(url):
    return httpx.get(url)

We can also limit the number of retries:

from tenacity import retry, stop_after_attempt
import httpx

@retry(stop=stop_after_attempt(3))
def fetch(url):
    return httpx.get(url)

Tenacity makes it straightforward to declare robust retry logic for your HTTPX requests. This can drastically improve scraper reliability. Some other useful tenacity features:

  • wait=wait_fixed(10) – Wait fixed time between retries.
  • wait=wait_random(0, 60) – Wait random time between retries.
  • before_sleep=log_retry – Callback invoked before sleeping between retries (log_retry being your own logging function).

So with a few lines of tenacity code, you can implement highly robust scraping logic.

Rotating Proxies and Headers

To avoid blocks when scraping, proxies and headers should be rotated. HTTPX makes this straightforward by allowing proxies and headers to be specified when creating the client. For example, to rotate a list of proxies:

import httpx
from itertools import cycle
from proxies import proxy_list

proxies = cycle(proxy_list)

def fetch(url):
    proxy = next(proxies)
    client = httpx.Client(proxies=proxy)
    return client.get(url)

Similarly, random User-Agent strings can be rotated:

import httpx
from itertools import cycle
from user_agents import user_agent_list

user_agents = cycle(user_agent_list)

def fetch(url):
    ua = next(user_agents)
    client = httpx.Client(headers={'User-Agent': ua})
    return client.get(url)

By constantly changing proxies and user agents, blocks become much less likely when scraping aggressively.

Optimizing HTTPX for High-Performance Scraping

Based on extensive experience, here are my top tips for optimizing HTTPX scraping:

  • Use Async Client – Switch to async client for all scrapers for 5x+ speed gains.
  • Adjust Limits – Lower limits if hitting throttles/blocks. Raise for more concurrency.
  • Tune Timeouts – Increase timeouts if scraping complex sites. Decrease for snappier failures.
  • Retry Failed Requests – Implement robust retry logic with tenacity.
  • Rotate Resources – Swap proxies and user agents to distribute load.
  • Monitor Performance – Use logging to identify bottlenecks.
  • Limit Scope – Only scrape data you actually need.

Applying these best practices will result in the highest quality and most robust HTTPX scrapers.

Integrating HTTPX with Scrapy for Powerful Scraping

For large web scraping projects, I recommend using the Scrapy web scraping framework, which provides excellent tools like:

  • Powerful extraction selectors
  • Robust caching and throttling
  • Auto scaling to multiple machines

To benefit from HTTPX's speed in Scrapy, we can use the scrapy-httpx integration library and configure it as the download handler:

from scrapy import Request
from scrapy.spiders import CrawlSpider

class HttpxSpider(CrawlSpider):

    name = 'httpx_spider'

    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_httpx.HttpxDownloadHandler',
            'https': 'scrapy_httpx.HttpxDownloadHandler',
        }
    }

    def start_requests(self):
        # URL is a placeholder
        yield Request('https://example.com', self.parse)

    def parse(self, response):
        # Scrape page with XPath, CSS etc.
        pass

This gives you the power and robustness of Scrapy combined with the speed of HTTPX for blazing-fast distributed scraping.


If you follow the above best practices, you can build incredibly fast, resilient, and high-quality web scrapers with Python and HTTPX. I'm confident after reading this extensive guide you now have all the HTTPX techniques needed to scrape even the most challenging sites with ease.

John Rooney

John Watson Rooney is a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel. My channel caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
