HTTPX is a powerful Python library for making HTTP requests that is becoming increasingly popular for web scraping. In this comprehensive guide, we'll cover everything you need to know to leverage HTTPX for fast and effective web scraping in Python.
Why Use HTTPX for Web Scraping?
There are several key advantages of using HTTPX for web scraping:
- Speed – HTTPX supports HTTP/2 and asyncio for concurrent requests, which can make it far faster than requests or urllib for bulk scraping.
- Features – HTTPX has a full-featured client with cookies, timeouts, proxies, connection pooling and more, making it much more robust than bare requests.
- Async scraping – The async client makes asynchronous web scraping with asyncio straightforward. Perfect for high-performance scrapers.
- Modern – HTTPX takes advantage of the latest Python features, such as async/await. A truly modern Python HTTP client.
In summary, HTTPX gives a huge speed and reliability boost to Python web scraping. The rest of this guide will teach you how to use HTTPX for scraping effectively across common scenarios.
Installing HTTPX
The first step is installing HTTPX via pip:
pip install httpx
I also recommend installing the httpx[http2] extras for HTTP/2 support:
pip install "httpx[http2]"
As of November 2022, HTTPX is on version 0.23.0. Now let's look at making requests with the library.
Making Requests with HTTPX
The most basic way to use HTTPX is by making direct requests:
import httpx

response = httpx.get('https://www.example.com')
print(response.text)

data = {
    'key1': 'value1',
    'key2': 'value2'
}

response = httpx.post('https://httpbin.org/post', data=data)
print(response.json())
This allows simple HTTP GET and POST requests to be made. HTTPX supports the full range of request types:
- GET
- POST
- PUT
- DELETE
- HEAD
- OPTIONS
- PATCH
It also provides a handy json() method for accessing the response body as JSON:
response = httpx.get('https://api.example.com/data')
json_data = response.json()
So HTTPX can be dropped in as a direct replacement for the requests library in basic scripts. But for robust scraping, we want to take advantage of HTTPX's full feature set. And for that, we use the HTTPX Client.
Configuring the HTTPX Client for Scraping
The httpx.Client class allows creating a persistent session with custom configurations like headers, cookies, and timeouts that are applied to all requests. This is perfect for web scraping to handle things like:
- Cookie management
- Custom request headers
- Proxy configurations
- Authentication
Here is an example HTTPX client setup for scraping:
import httpx

with httpx.Client(
    headers={
        'User-Agent': 'my-scraper',
        'Authorization': 'Bearer xyz'
    },
    cookies={
        'language': 'en-US'
    },
    proxies={
        'http://': 'http://10.10.1.10:3128',
        'https://': 'https://10.10.1.10:1080'
    },
    timeout=20
) as client:
    response = client.get('https://www.example.com/data')
    print(response.text)
The client will:
- Re-use cookies across requests like a real browser
- Apply the specified headers and proxies to all requests
- Use the provided 20 second timeout for all requests
This makes it trivial to handle all the configurations needed for robust scraping in one place. Here are some other useful client settings, with a combined example after the list:
- limits – Limit the number of connections. Useful for controlling scraping concurrency.
- max_redirects – Maximum redirects to follow (default is 20).
- base_url – Prefix applied to all requests. Handy for API scraping.
- trust_env – Use environment proxies and certs. Set to False to disable.
- auth – Add basic, digest or custom auth headers.
- verify – Toggle SSL cert verification (default True).
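For illustration, here is a rough sketch combining several of these settings on one client. The base URL and credentials are placeholders, not a real API:

import httpx

# Hypothetical API base URL and credentials, shown only to illustrate the options above.
client = httpx.Client(
    base_url='https://api.example.com',        # prefix applied to every request
    limits=httpx.Limits(max_connections=10),   # cap concurrent connections
    follow_redirects=True,                     # follow redirects up to the cap below
    max_redirects=5,                           # follow at most 5 redirect hops
    auth=('user', 'password'),                 # basic auth credentials
    verify=False,                              # skip SSL verification (use with care)
    trust_env=False,                           # ignore environment proxies/certs
)

response = client.get('/data')   # resolves to https://api.example.com/data
print(response.status_code)
client.close()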
See the full HTTPX Client docs for all available options.
Scraping Asynchronously with HTTPX
One of the key advantages of HTTPX is that it has first-class support for asynchronous scraping using Python's asyncio module. This allows crafting asynchronous scrapers that can make multiple requests concurrently and maximize scraping speed.
Here is an example async scraper with HTTPX:
import asyncio
import httpx

async def main():
    urls = ['https://example.com/page1.html', 'https://example.com/page2.html']
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in urls]
        results = await asyncio.gather(*tasks)
    for response in results:
        print(response.text)

asyncio.run(main())
By using asyncio.gather we can kick off multiple GET requests concurrently, and HTTPX handles all the async details under the hood. To scrape at scale, we can launch tasks for all URLs we want to scrape:
# Scrape 1000 pages
async def scrape_all():
    pages = load_list_of_1000_urls()
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in pages]
        return await asyncio.gather(*tasks)

results = asyncio.run(scrape_all())
Based on my benchmarks, asynchronous scraping can provide over 5x speed improvements compared to sequential scraping. Well worth implementing. Some tips when using the HTTPX async client:
- Use response.aread() to read the response body asynchronously.
- Ensure all requests are closed properly in case of errors.
- Increase timeouts as needed when making hundreds of concurrent requests.
- Limit concurrency with limits if needed to avoid overwhelming sites (see the sketch below).
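Here is a minimal sketch of that last tip, capping concurrency with httpx.Limits plus an asyncio.Semaphore. The URL list is a placeholder and the helper names are just examples:

import asyncio
import httpx

async def scrape(urls, max_concurrency=10):
    # Cap both the connection pool size and the number of in-flight requests.
    limits = httpx.Limits(max_connections=max_concurrency)
    semaphore = asyncio.Semaphore(max_concurrency)

    async with httpx.AsyncClient(limits=limits, timeout=30) as client:
        async def fetch(url):
            async with semaphore:   # at most max_concurrency requests at a time
                return await client.get(url)

        return await asyncio.gather(*[fetch(url) for url in urls])

# Placeholder URLs for illustration.
results = asyncio.run(scrape(['https://example.com/page1', 'https://example.com/page2']))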
Overall, the async client makes it almost trivial to implement high-performance asynchronous web scraping with Python.
Handling HTTPX Errors and Issues
When scraping complicated sites, you may encounter some common HTTPX errors:
httpx.TimeoutException
This error occurs when the request takes longer than the timeout duration:
httpx.TimeoutException: Requests took longer than 20 seconds
Solution: Increase the timeout duration based on the target site speed:
# Wait 60 seconds for response
client = httpx.Client(timeout=60)
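Timeouts can also be tuned per phase with httpx.Timeout rather than a single number; the values below are illustrative:

import httpx

# 5s to connect, 60s to read the response body, 10s default for everything else.
timeout = httpx.Timeout(10.0, connect=5.0, read=60.0)
client = httpx.Client(timeout=timeout)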
httpx.ConnectError
A connection error when the client cannot reach the server:
httpx.ConnectError: Failed to establish connection
Solution: The server may be down or unreachable. Retry later, verify the URL and proxy settings, or check for firewall issues.
httpx.TooManyRedirects
Hitting the limit of allowed redirect hops:
httpx.TooManyRedirects: Exceeded maximum allowed redirects.
Solution: Disable redirects and handle them manually:
client = httpx.Client(follow_redirects=False)
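As a minimal sketch of handling a redirect by hand (the URL is a placeholder), you can inspect the Location header and follow a single hop yourself:

import httpx

client = httpx.Client(follow_redirects=False)
response = client.get('https://www.example.com/old-page')

if response.is_redirect:
    # Resolve the Location header against the original URL (it may be relative).
    next_url = response.url.join(response.headers['location'])
    response = client.get(next_url)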
See the full list of HTTPX errors for handling other issues. Having robust error handling is critical for maintaining reliable scrapers.
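As a rough illustration, a fetch helper could catch the exception classes above (all of which derive from httpx.HTTPError) and decide how to react:

import httpx

def fetch(url):
    try:
        response = httpx.get(url, timeout=20)
        response.raise_for_status()   # raise httpx.HTTPStatusError on 4xx/5xx
        return response
    except httpx.TimeoutException:
        print(f'Timed out fetching {url}')
    except httpx.ConnectError:
        print(f'Could not connect to {url}')
    except httpx.HTTPError as exc:    # catch-all for other httpx errors
        print(f'Request to {url} failed: {exc}')
    return None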
Automatically Retrying Failed Requests
To make scrapers more robust, we can automatically retry failed requests using the excellent tenacity library. For example, to retry on timeouts:
import httpx
from tenacity import retry, retry_if_exception_type

@retry(retry=retry_if_exception_type(httpx.TimeoutException))
def fetch(url):
    return httpx.get(url)
To retry on a 429 Too Many Requests status:
import httpx
from tenacity import retry, retry_if_result

@retry(retry=retry_if_result(lambda r: r.status_code == 429))
def fetch(url):
    return httpx.get(url)
We can also limit the number of retries:
import httpx
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def fetch(url):
    # Make the request (retried up to 3 times on any exception)
    return httpx.get(url)
Tenacity makes it straightforward to declare robust retry logic for your HTTPX requests. This can drastically improve scraper reliability. Some other useful tenacity features:
- wait=wait_fixed(10) – Wait a fixed time between retries.
- wait=wait_random(0, 60) – Wait a random time between retries.
- before_sleep=log_retry – Run a callback before sleeping between retries.
All three ideas are combined in the sketch below.
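Here is a sketch putting these options together, using tenacity's built-in before_sleep_log helper in place of a custom callback; the limits chosen are just examples:

import logging
import httpx
from tenacity import retry, stop_after_attempt, wait_random, before_sleep_log

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(5),                               # give up after 5 tries
    wait=wait_random(1, 10),                                  # wait 1-10s between tries
    before_sleep=before_sleep_log(logger, logging.WARNING),   # log each retry
)
def fetch(url):
    response = httpx.get(url, timeout=30)
    response.raise_for_status()
    return response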
So with a few lines of tenacity code, you can implement highly robust scraping logic.
Rotating Proxies and Headers
To avoid blocks when scraping, proxies and headers should be rotated. HTTPX makes this straightforward by allowing proxies and headers to be specified when creating the client. For example, to rotate a list of proxies:
from itertools import cycle

import httpx
from proxies import proxy_list  # your own list of proxy URLs

proxies = cycle(proxy_list)

def fetch(url):
    proxy = next(proxies)
    client = httpx.Client(proxies=proxy)
    return client.get(url)
Similarly, random User-Agent strings can be rotated:
from itertools import cycle

import httpx
from user_agents import user_agent_list  # your own list of User-Agent strings

user_agents = cycle(user_agent_list)

def fetch(url):
    ua = next(user_agents)
    client = httpx.Client(headers={'User-Agent': ua})
    return client.get(url)
By constantly changing proxies and user agents, blocks become much less likely when scraping aggressively.
Optimizing HTTPX for High-Performance Scraping
Based on extensive experience, here are my top tips for optimizing HTTPX scraping:
- Use Async Client – Switch to async client for all scrapers for 5x+ speed gains.
- Adjust Limits – Lower limits if hitting throttles/blocks. Raise for more concurrency.
- Tune Timeouts – Increase timeouts if scraping complex sites. Decrease for snappier failures.
- Retry Failed Requests – Implement robust retry logic with tenacity.
- Rotate Resources – Swap proxies and user agents to distribute load.
- Monitor Performance – Use logging to identify bottlenecks (see the sketch after this list).
- Limit Scope – Only scrape data you actually need.
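As one way to implement the monitoring tip, HTTPX clients accept event_hooks for request and response callbacks; this small sketch logs the method, URL, and status of every response so slow or failing pages stand out:

import logging
import httpx

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('scraper')

def log_response(response):
    # Called after each response; log status and URL to spot problem pages.
    logger.info('%s %s -> %s', response.request.method, response.url, response.status_code)

client = httpx.Client(event_hooks={'response': [log_response]})
response = client.get('https://www.example.com')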
Applying these best practices will result in the highest quality and most robust HTTPX scrapers.
Integrating HTTPX with Scrapy for Powerful Scraping
For large web scraping projects, I recommend using the Scrapy web scraping framework, which provides excellent tools like:
- Powerful extraction selectors
- Robust caching and throttling
- Auto scaling to multiple machines
To benefit from HTTPX's speed in Scrapy, we can use the scrapy-httpx integration library and configure it as the download handler:
from scrapy import Request
from scrapy.spiders import CrawlSpider

class HttpxSpider(CrawlSpider):
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'httpx': 'scrapy_httpx.HttpxDownloadHandler',
        }
    }

    def start_requests(self):
        yield Request('http://example.com', self.parse)

    def parse(self, response):
        # Scrape the page with XPath, CSS selectors etc.
        pass
This gives you the power and robustness of Scrapy combined with the speed of HTTPX for blazing-fast distributed scraping.
Conclusion
If you follow the above best practices, you can build incredibly fast, resilient, and high-quality web scrapers with Python and HTTPX. I'm confident after reading this extensive guide you now have all the HTTPX techniques needed to scrape even the most challenging sites with ease.