Python has a rich ecosystem of HTTP client libraries. When it comes to web scraping and HTTP clients, three libraries stand out as popular options:
- Requests – The mature, feature-rich, sync HTTP library.
- Aiohttp – An async HTTP client/server for fast, non-blocking requests.
- Httpx – A next-gen HTTP client with both sync and async APIs and HTTP/2 support.
So how do you choose between them and what are the key differences you need to know? In this comprehensive guide, we'll compare Requests, Aiohttp, and Httpx to highlight the strengths of each and help you decide which is best for your Python web scraping needs.
Overview
In the diverse landscape of Python HTTP clients, several options such as Requests, Aiohttp, and Httpx have distinguished themselves. If you're wondering which is right for your web scraping projects, consider the following:
- Requests is the seasoned choice, perfect for simple scripts due to its reliability, user-friendliness, and expansive ecosystem.
- Aiohttp stands out with its asynchronous capabilities, especially suited for crafting async web applications.
- Httpx is swiftly gaining traction as it encapsulates the best features of both Requests and Aiohttp, offering cutting-edge performance and functionalities.
While the trusted Requests has served many well, Httpx's support for asyncio, HTTP/2, and seamless integrations positions it as a premier HTTP client for modern, efficient web scraping. Let's dig into the comparison in more detail.
High Level Comparison
Before diving into details, let's start with a high-level overview of how Requests, Aiohttp and Httpx differ:
Requests
- Released in 2012, Requests is the oldest and most mature option. It has the richest ecosystem of supporting libraries and integrations.
- Requests is synchronous and blocking – it does not support asyncio.
- Simple, easy-to-use interface that hides the complexity of Python's standard urllib machinery. Great docs and tutorials available.
- Lacks support for HTTP/2 and some modern features.
Aiohttp
- Asynchronous HTTP client released in 2014 built on asyncio. Provides non-blocking I/O for better performance.
- Can also act as an HTTP server, making it great for building asynchronous web apps and scrapers.
- Supports HTTP/1.1 but not HTTP/2.
- Steeper learning curve than Requests due to asynchronous usage.
Httpx
- Released in 2019, Httpx is the new, modern HTTP client for Python. It offers both synchronous and asynchronous APIs and optional HTTP/2 support.
- Unifies the interfaces of Requests and Aiohttp into one fast, feature-rich library.
- Fewer third-party integrations compared to Requests but quickly gaining popularity.
Sync vs Async – Performance
One of the biggest differences between these libraries is synchronous versus asynchronous requests.
Requests uses synchronous, blocking I/O. Each request must completely finish before another can be sent. This can impact performance when you need to send many requests.
Aiohttp and Httpx use asynchronous I/O via the asyncio module. This allows them to perform non-blocking requests concurrently rather than sequentially. The async advantage is especially noticeable when sending many requests or when responses are slow to arrive.
Let's compare them with a basic benchmark:
# Example: fetch 100 URLs sequentially with Requests
import time

import requests

urls = [...]  # list of 100 URLs

start = time.time()
for url in urls:
    response = requests.get(url)
end = time.time()

print(f"Total time: {end - start}")
Requests: ~28 seconds
Now the async version:
import asyncio
import time

import httpx

async def get_url(url):
    async with httpx.AsyncClient() as client:
        return await client.get(url)

async def main():
    # Same list of 100 URLs as above
    coroutines = [get_url(url) for url in urls]
    return await asyncio.gather(*coroutines)

start = time.time()
results = asyncio.run(main())
end = time.time()

print(f"Total time: {end - start}")
Httpx: ~3 seconds
Using async allows Httpx to issue all requests concurrently. This provides significant performance benefits when fetching multiple URLs. Aiohttp shows similar async performance gains over Requests.
HTTP/2 Benefits
In addition to async I/O, Httpx also supports HTTP/2. This modern protocol provides further performance improvements:
- Multiplexing – Multiple requests can be sent over one TCP connection, removing the lag of establishing new connections.
- Server Push – The server can push additional resources to clients without waiting for new requests.
- Header Compression – Reduces the volume of header data sent with each request and response.
According to benchmarks, these HTTP/2 features can provide 2-3x speed improvements in Httpx when fetching multiple resources from the same domain compared to HTTP/1.1:
# Fetch 100 SVG images from the same domain
Requests time: 55 seconds
Httpx time: 18 seconds
Aiohttp and Requests currently lack HTTP/2 support. This gives Httpx a big performance advantage when scraping modern websites utilizing HTTP/2.
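Note that HTTP/2 in Httpx is opt-in: you install the optional h2 extra and enable it on the client. A minimal sketch (example.com is just a placeholder target):

# Requires: pip install 'httpx[http2]'
import asyncio

import httpx

async def main():
    # http2=True lets the client negotiate HTTP/2, falling back to
    # HTTP/1.1 for servers that don't support it
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get("https://example.com")
        print(response.http_version)  # "HTTP/2" or "HTTP/1.1"

asyncio.run(main())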
Async Applications with Aiohttp
In addition to an async HTTP client, Aiohttp also provides an HTTP server. This allows you to build asynchronous Python web apps and APIs using asyncio. The server handles incoming requests while the client handles outgoing ones. Why does this matter for web scraping?
Scrapers often persist data to databases or pass it between other services. By using Aiohttp's client and server together, you can build an asynchronous scraper application that avoids unnecessary network overhead and improves performance.
For example:
import asyncio
from aiohttp import ClientSession, web

async def handle_request(request):
    data = await scrape_page(request.url)       # your scraping coroutine
    await save_to_database(data)                 # your persistence coroutine
    return web.Response(text=f"Scraped {request.url}")

app = web.Application()
app.add_routes([web.get('/', handle_request)])

# Run client and server together
async def main():
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "localhost", 8080).start()

    async with ClientSession() as client:
        await client.get("http://localhost:8080/")

    await runner.cleanup()

asyncio.run(main())
This allows scraping requests to be handled asynchronously by the application without extra network hops.
Feature Comparison
Beyond high-level differences, Requests, Aiohttp, and Httpx offer similar feature sets but with varying APIs and implementations. Let's dive deeper into how they compare across common usage:
Sending Requests
All three provide simple, standard ways to make HTTP requests:
# Requests
requests.get("https://www.example.com")

# Aiohttp
async with aiohttp.ClientSession() as session:
    await session.get("https://www.example.com")

# Httpx
async with httpx.AsyncClient() as client:
    await client.get("https://www.example.com")
Httpx mirrors both Requests' familiar interface and Aiohttp's async context-manager approach.
Sessions & Connection Pooling
Reusing session connections and pools provides performance benefits. All three libraries support this:
# Requests
session = requests.Session()
session.get("https://example.com")

# Aiohttp
async with aiohttp.ClientSession() as session:
    await session.get("https://example.com")

# Httpx
client = httpx.AsyncClient()
await client.get("https://example.com")
Sessions handle connection persistence, while pools manage a reusable set of connections.
Httpx matches Requests' API while still providing asyncio support.
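Httpx also ships a synchronous API, so you can keep Requests-style blocking code and switch to the async client only where it pays off. A quick sketch:

import httpx

# One-off request, nearly identical to requests.get()
response = httpx.get("https://example.com")
print(response.status_code)

# A pooled sync client, analogous to requests.Session()
with httpx.Client() as client:
    response = client.get("https://example.com")
    print(response.status_code)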
Timeouts, Retries & Errors
Robust request handling is important for scrapers. All three libraries support configurable timeouts, retries, and error handling:
# Requests
response = requests.get("https://example.com", timeout=3.05)
response.raise_for_status()

# Aiohttp
try:
    timeout = aiohttp.ClientTimeout(total=3.05)
    async with session.get("https://example.com", timeout=timeout) as response:
        response.raise_for_status()
except aiohttp.ClientResponseError:
    print("Request failed")

# Httpx – connection retries are configured on the transport
transport = httpx.AsyncHTTPTransport(retries=3)
client = httpx.AsyncClient(timeout=3.05, transport=transport)
try:
    response = await client.get("https://example.com")
    response.raise_for_status()
except httpx.ConnectTimeout:
    print("Request timed out")
The async nature of Aiohttp and Httpx requires special async syntax for features like timeouts.
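For completeness, Requests has no retry argument on individual calls, but retries can be configured on a session via urllib3's Retry helper; the counts and status codes below are just example values:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed requests up to 3 times with exponential backoff
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount("https://", adapter)
session.mount("http://", adapter)

response = session.get("https://example.com", timeout=3.05)
response.raise_for_status()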
Proxies
Proxied requests are important for careful web scraping. All three libraries allow proxying requests:
# Requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get("https://example.com", proxies=proxies)

# Aiohttp – the proxy is passed per request
async with aiohttp.ClientSession() as session:
    await session.get('https://example.com', proxy='http://10.10.1.10:3128')

# Httpx
client = httpx.AsyncClient(proxies={'all://': 'http://myproxy.com'})
await client.get("https://example.com")
Httpx matches Requests' proxy syntax while supporting async usage.
Cookies, Headers & Redirects
All three clients handle HTTP features like cookies, headers, and redirects:
# Handle cookies
session.cookies.set("sessionid", "1234abc")

# Custom headers
headers = {"User-Agent": "MyScraper 1.0"}
response = session.get("https://example.com", headers=headers)

# Handle redirects
response = session.get("https://example.com", allow_redirects=True)
Httpx and Aiohttp provide async-friendly interfaces such as Aiohttp's ClientResponse for accessing headers and cookies. Overall capabilities are similar.
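For comparison, here is roughly how the same features look with the async Httpx client; note that Httpx names the redirect option follow_redirects rather than allow_redirects:

import asyncio

import httpx

async def main():
    headers = {"User-Agent": "MyScraper 1.0"}
    cookies = {"sessionid": "1234abc"}

    async with httpx.AsyncClient(headers=headers, cookies=cookies,
                                 follow_redirects=True) as client:
        response = await client.get("https://example.com")
        print(response.headers.get("content-type"))
        print(response.cookies)

asyncio.run(main())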
Streaming & Downloads
For large responses, streaming & chunked downloads are supported:
# Stream a response in chunks
with requests.get("https://example.com/bigfile", stream=True) as response:
    for chunk in response.iter_content(1024):
        print(chunk)

# Save a response directly to a file
response = requests.get("https://example.com/bigfile")
with open("bigfile.zip", "wb") as f:
    f.write(response.content)
Aiohttp and Httpx provide asynchronous iterators and context managers for streaming.
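As a rough sketch, Httpx streams through client.stream() with async byte iterators, while Aiohttp exposes chunked iteration over the response body (the URL is just a placeholder):

import asyncio

import aiohttp
import httpx

async def stream_httpx(url):
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            async for chunk in response.aiter_bytes(chunk_size=1024):
                print(len(chunk))

async def stream_aiohttp(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            async for chunk in response.content.iter_chunked(1024):
                print(len(chunk))

asyncio.run(stream_httpx("https://example.com/bigfile"))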
Ecosystem & Utils
Given its maturity, Requests has the biggest ecosystem of supporting libraries and integrations:
- Utility packages like requests-html for parsing HTML responses.
- Packages like requests-cache for caching responses.
- Integration with data science tools like Pandas and NumPy.
Aiohttp and Httpx have smaller ecosystems since they are newer. But both have robust util packages:
- aiohttp-socks – SOCKS proxy support for Aiohttp.
- httpx-oauth – OAuth 1.0 and 2.0 support for Httpx.
Over time, expect the Httpx and Aiohttp ecosystems to grow and match Requests' capabilities.
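As one example of what that ecosystem buys you, requests-cache can transparently cache responses in a couple of lines (assuming the package is installed; the cache name is arbitrary):

# Requires: pip install requests-cache
import requests
import requests_cache

# Patches requests so that responses are stored in a local SQLite database
requests_cache.install_cache("scraper_cache")

requests.get("https://example.com")  # hits the network and stores the response
requests.get("https://example.com")  # served from the cache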
Use Cases & Recommendations
Given their differences and tradeoffs, when should you choose Requests, Aiohttp or Httpx?
Requests
- Simplicity and synchronous usage are preferred.
- Already have existing code using Requests.
- Need compatibility with a library or tool only supporting Requests.
- Require a Requests ecosystem feature such as caching or an existing integration.
Aiohttp
- Require asynchronous performance for many requests or slow websites.
- Building asynchronous web applications along with scraping capability.
- Plan to integrate with other async tools like asyncio queues.
Httpx
- Require modern performance features like HTTP/2 and asyncio support.
- Building a new application without existing legacy dependencies.
- Prefer a single, integrated solution combining Requests and Aiohttp pros.
If I had to choose just one for a robust web scraper, I would go with Httpx since it combines the best of both worlds with HTTP/2 and asyncio capability. However, many scrapers leverage multiple libraries – using Requests for simplicity and Aiohttp when asynchronous performance matters.
Example Code Snippets
To demonstrate usage, here are some examples of how these clients can accomplish common web scraping tasks:
Fetch a page and extract the title:
# Requests
import requests

resp = requests.get("https://example.com")
print(resp.text.split("<title>")[1].split("</title>")[0])

# Aiohttp
import asyncio
import aiohttp

async def get_title(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.text()
            print(data.split("<title>")[1].split("</title>")[0])

asyncio.run(get_title("https://example.com"))

# Httpx – await calls must run inside a coroutine
import asyncio
import httpx

async def get_title(url):
    async with httpx.AsyncClient() as client:
        resp = await client.get(url)
        print(resp.text.split("<title>")[1].split("</title>")[0])

asyncio.run(get_title("https://example.com"))
Speed up multiple requests with async:
# Aiohttp
import asyncio
import aiohttp

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def main():
    async with aiohttp.ClientSession() as session:
        urls = [
            "https://example.com/1",
            "https://example.com/2",
            # etc
        ]
        coroutines = [fetch(url, session) for url in urls]
        results = await asyncio.gather(*coroutines)

asyncio.run(main())

# Httpx – the same list of URLs, fetched concurrently
import asyncio
import httpx

async def main():
    urls = ["https://example.com/1", "https://example.com/2"]
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in urls]
        results = await asyncio.gather(*tasks)

asyncio.run(main())
Make proxied requests:
# Requests
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get("https://example.com", proxies=proxies)

# Aiohttp – pass the proxy per request
import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        await session.get('https://example.com', proxy='http://10.10.1.10:3128')

asyncio.run(main())

# Httpx
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(proxies={'all://': 'http://10.10.1.10:3128'}) as client:
        await client.get('https://example.com')

asyncio.run(main())