Async programming has taken the Python world by storm in recent years. With the arrival of async/await syntax and libraries like aiohttp and httpx, it's now possible to make thousands of requests concurrently with just a few lines of code.
However, this incredible speed comes at a cost. Most websites are not designed to handle hundreds or thousands of requests per second from a single IP address. Making async requests too quickly is a surefire way to get blocked.
To avoid getting blacklisted, we need to limit and control how rapidly we make async requests. In this comprehensive guide, I'll be exploring multiple techniques to precisely rate limit async requests in Python.
Here's what I'll cover:
- Why rate limiting is essential for Python async scraping
- Built-in library options for basic rate limiting
- Using asyncio Semaphores for request throttling
- Advanced timed rate limiting with aiometer
- Respecting robots.txt crawling speed limits
- Avoiding blocks through proxy rotation
- Best practices for non-blocking web scraping
I'll also provide code examples for rate limiting popular libraries like HTTPX and aiohttp throughout the article. Let's dive in!
The Critical Need for Rate Limiting Async Scrapers
Before we look at how to rate limit requests, it's important to understand why limiting async scraping speed is so crucial in the first place. Here are 5 key reasons all Python scrapers should use rate limiting:
1. Avoid Getting Blocked
The #1 reason to limit request rates is to avoid getting blocked. Most modern websites have advanced bot detection systems to identify and blacklist scrapers and bots.
According to Imperva research, over 80% of websites now use some form of bot protection like Distil Networks or Akamai Bot Manager. These systems track traffic patterns from IP addresses to detect scrapers.
Making just a few hundred requests per second from a single IP will often trigger a block. Setting a reasonable rate limit that throttles requests to 5-10 per second avoids detection.
2. Prevent Overloading Smaller Sites
While large sites like Amazon and Wikipedia can easily handle thousands of hits per second, smaller sites have much lower capacity. Scraping small blogs or Shopify stores too intensely can overload and crash them.
Rate limiting ensures we don't make requests faster than a site can handle. This avoids causing disruptions or performance issues on smaller targets.
3. Obey Robots.txt Limits
Most websites define a max crawling speed in their robots.txt file that scrapers should respect. As an example:
```
User-agent: *
Crawl-delay: 5
```
This asks crawlers to wait at least 5 seconds between requests. Coding our scrapers to obey robots.txt limits shows good etiquette and prevents blocks.
4. Comply With Legal Rate Limits
Some sites legally limit how often you can access their data. For example, Google Maps has usage limits in their terms of service. Staying within defined rate restrictions keeps your scrapers operating legally and avoids litigation issues down the road.
5. Conserve Resources
Finally, rate limiting saves bandwidth and computing resources. Asynchronous scraping can use significant resources across networks, proxies, and machines. Throttling request rates to only what's essential preserves infrastructure capacity for other tasks. Scraping at full speed 24/7 is often overkill.
Based on these reasons, it's clear rate limiting is a necessity for any robust web scraper. But how do we actually implement throttling in Python code? Let's go over some options…
Simple Python Rate Limiting Options
The Python standard library, plus one lightweight third-party helper, gives us a few simple tools for basic rate limiting:
asyncio.sleep
The most straightforward way to limit requests is by adding delays with asyncio.sleep:
```python
import asyncio

async def fetch(url):
    await asyncio.sleep(1)  # add 1 second delay
    print(f"Fetching {url}")
    # make request

# limit to 1 request per second
asyncio.run(fetch("https://example.com"))
```
While simple, this quickly becomes tedious if we have many requests. It also doesn't limit concurrency across multiple coroutines.
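To see that second limitation, here is a quick sketch (the example.com URLs are placeholders): each coroutine sleeps for its own second, but asyncio.gather still starts them all together, so the delay does nothing to throttle the overall burst of traffic.

```python
import asyncio

async def fetch(url):
    await asyncio.sleep(1)  # each coroutine waits its own second...
    print(f"Fetching {url}")  # ...then they all "request" at roughly the same moment

async def main():
    urls = [f"https://example.com/{i}" for i in range(100)]
    # all 100 fetches start concurrently, so the per-coroutine sleep
    # does not reduce the burst hitting the site
    await asyncio.gather(*(fetch(url) for url in urls))

asyncio.run(main())
```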
aiolimiter.AsyncLimiter
A step up from manual sleeps is a dedicated limiter. asyncio itself doesn't ship one (there is no built-in rate-limit decorator), but the small third-party aiolimiter package provides AsyncLimiter, a leaky-bucket limiter we can share between coroutines:
```python
from aiolimiter import AsyncLimiter

limit = AsyncLimiter(1, 1)  # 1 request per 1-second window

async def fetch(url):
    async with limit:  # wait until a slot in the current window is free
        print(f"Fetching {url}")
        # make request
```
This keeps the throttling out of the request logic and enforces a true requests-per-second cap for every coroutine that shares the limiter. It doesn't, however, bound how many requests are in flight at once. For that, we turn to asyncio's own Semaphore.
asyncio.Semaphore
For request throttling across coroutines, asyncio.Semaphore is a great fit:
```python
from asyncio import Semaphore

limit = Semaphore(10)  # only allow 10 concurrent requests

async def fetch(url):
    await limit.acquire()
    try:
        print(f"Fetching {url}")
        # make request
    finally:
        limit.release()
```
By making coroutines acquire and release from a Semaphore, we can easily limit concurrency. However, this doesn't provide fine-grained control over the request rate per second. For that, we need libraries like HTTPX or aiometer.
Rate Limiting HTTP Requests with HTTPX
HTTPX is a fully featured async HTTP client for Python. It powers many asynchronous web scrapers and spiders. Here are two ways we can rate limit HTTPX requests:
1. Limit Max Connections
HTTPX has built-in connection limits we can use to restrict concurrency:
```python
import asyncio

import httpx

async def main():
    limits = httpx.Limits(
        max_connections=100,
        max_keepalive_connections=20,
    )
    async with httpx.AsyncClient(limits=limits) as client:
        await client.get("https://example.com")

asyncio.run(main())
```
The max_connections limit caps the number of requests allowed at once, while max_keepalive_connections restricts how many idle, reusable connections are kept open. This helps prevent overload but doesn't limit the per-second request rate.
2. Use an Async Semaphore
To throttle concurrency with HTTPX, we can reuse the asyncio.Semaphore pattern:
```python
import asyncio
from asyncio import Semaphore

import httpx

limit = Semaphore(10)  # at most 10 requests in flight at once

async def fetch(client, url):
    async with limit:
        return await client.get(url)

async def main():
    async with httpx.AsyncClient() as client:
        urls = [f"https://example.com/{i}" for i in range(100)]
        await asyncio.gather(*(fetch(client, url) for url in urls))

asyncio.run(main())
```
By wrapping each request in the Semaphore, we cap how many requests are in flight at any moment. Note that this bounds concurrency rather than guaranteeing an exact number of requests per second; for a strict timed limit, pair it with a dedicated rate limiter. The same Semaphore pattern works for limiting other async libs like aiohttp as well, as shown in the sketch below. But when scraping many sites, managing semaphores can get tedious. For more advanced use cases, a library like aiometer is very useful.
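Here is the same Semaphore pattern applied to aiohttp, as a minimal sketch with placeholder example.com URLs:

```python
import asyncio

import aiohttp

limit = asyncio.Semaphore(10)  # at most 10 requests in flight at once

async def fetch(session, url):
    async with limit:  # wait for a free slot before sending the request
        async with session.get(url) as response:
            return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        urls = [f"https://example.com/{i}" for i in range(50)]
        await asyncio.gather(*(fetch(session, url) for url in urls))

asyncio.run(main())
```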
Advanced Rate Limiting with aiometer
aiometer is an asyncio utility that makes it easy to rate limit groups of coroutines. For example, here is how we can use aiometer to limit a crawler to 10 requests per second:
```python
import asyncio
import functools
from time import time

import aiohttp
import aiometer

start = time()

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        urls = [f"https://example.com/{i}" for i in range(100)]
        results = await aiometer.run_all(
            [functools.partial(fetch, session, url) for url in urls],
            max_per_second=10,  # <- rate limit here
        )
        print(f"Scraped {len(results)} pages in {time() - start:.2f} secs")

asyncio.run(main())
```
With aiometer.run_all and functools.partial we can easily limit the request rate across a whole batch of coroutines. This is perfect for controlling scrapers. aiometer also provides a max_at_once option to cap concurrency alongside the per-second rate. And it works with any async callable, so we can use it to rate limit almost any asynchronous task.
Respecting Robots.txt Crawling Speed Limits
When scraping sites, we should always respect crawling speed limits defined in a website's robots.txt file. Here is an example that parses the robots.txt rules and delays requests accordingly:
```python
import asyncio
from urllib.robotparser import RobotFileParser

import httpx

async def main():
    parser = RobotFileParser("https://example.com/robots.txt")
    parser.read()  # fetch and parse the robots.txt rules
    crawl_delay = parser.crawl_delay("*") or 1  # fall back to 1 second if none is set

    async with httpx.AsyncClient() as client:
        for i in range(10):
            await client.get(f"https://example.com/page/{i}")
            await asyncio.sleep(crawl_delay)  # respect the crawl delay

asyncio.run(main())
```
This uses the standard library's urllib.robotparser to read the crawl delay rule from robots.txt. If a delay is set, we sleep for that long between requests to honor the limit. According to research from the University of Freiburg, over 36% of the top 10,000 websites define crawling delays in their robots.txt. Respecting these rules is essential for creating a polite, legal scraper.
Avoiding Blocks Through Proxy Rotation
Rate limiting our scrapers is important. But sometimes we need to scrape at higher speeds to gather data quickly from large sites. In these cases, proxy rotation is essential on top of rate limiting to avoid blocks. Proxies allow us to make requests from many different IP addresses.
Here is an example of rotating through a small pool of proxies with HTTPX, cycling to a new proxy on each request:
```python
import asyncio
import itertools

import httpx

# placeholder proxy endpoints - substitute your own pool
proxies = [
    "http://127.0.0.1:8001",
    "http://127.0.0.2:8002",
]

async def main():
    proxy_cycle = itertools.cycle(proxies)
    for i in range(10):
        # httpx sets the proxy per client, so create a client for each request
        # (older httpx versions use proxies= instead of proxy=)
        async with httpx.AsyncClient(proxy=next(proxy_cycle)) as client:
            await client.get(f"https://example.com/{i}")

asyncio.run(main())
```
By spreading requests across a pool of proxies, we minimize the chances of getting blocked even at higher speeds. Services like Bright Data, Soax, and Smartproxy provide managed proxy pools perfect for scraper rotation. Combining these proxies with rate limiting gives us the best of both worlds!
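As a rough sketch of how the two combine, here is proxy rotation layered under a Semaphore (placeholder proxy URLs; swap in whichever limiter you prefer):

```python
import asyncio
import itertools

import httpx

proxies = ["http://127.0.0.1:8001", "http://127.0.0.2:8002"]  # placeholder pool
limit = asyncio.Semaphore(10)  # cap concurrent requests on top of the rotation

async def fetch(url, proxy):
    async with limit:  # concurrency control
        async with httpx.AsyncClient(proxy=proxy) as client:  # per-request proxy
            return await client.get(url)

async def main():
    urls = [f"https://example.com/{i}" for i in range(50)]
    tasks = [fetch(url, proxy) for url, proxy in zip(urls, itertools.cycle(proxies))]
    await asyncio.gather(*tasks)

asyncio.run(main())
```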
Best Practices for Avoiding Blocks
Based on our exploration, here are some best practices for creating a robust, non-blocking asynchronous web scraper:
- Use rate limiting – Limit requests to 5-10/second using Semaphores or libraries like aiometer
- Respect robots.txt – Parse crawl delay rules and throttle your scraper accordingly
- Rotate proxies – Spread requests across multiple proxies to hide your tracks
- Randomize delays – Add small random delays to mimic human behavior (see the sketch after this list)
- Use retry logic – Retry failed requests 2-3 times before giving up
- Scrape during off hours – Hit sites less aggressively during peak traffic times
- Limit concurrency – Around 100-300 concurrent requests is generally safe
- Create unique fingerprints – Change up user-agents and other headers
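Here is a minimal sketch of the randomized-delay and retry ideas together (the three attempts and the 1-3 second delay range are arbitrary example values):

```python
import asyncio
import random

import httpx

async def polite_get(client, url, retries=3):
    for attempt in range(retries):
        # small random delay to avoid a perfectly regular request pattern
        await asyncio.sleep(random.uniform(1, 3))
        try:
            response = await client.get(url)
            response.raise_for_status()
            return response
        except httpx.HTTPError:
            if attempt == retries - 1:
                raise  # give up after the final attempt

async def main():
    async with httpx.AsyncClient() as client:
        await polite_get(client, "https://example.com")

asyncio.run(main())
```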
Following these guidelines will ensure your scrapers stay under the radar and avoid frustrating blocks.
Final Thoughts
Async programming opens new possibilities for blazing fast data scraping in Python. But with great power comes great responsibility. Thoughtful use of rate limiting, proxy rotation, and other best practices is essential for creating scrapers that are courteous, robust, and resistant to bans.
Libraries like asyncio, HTTPX, aiohttp, and aiometer provide all the tools needed to precisely control Python async scraping speeds. The principles explored in this guide apply just as well to JavaScript scraping and automation. No matter the language or use case, intelligently limiting request rates is crucial for successful web data extraction today.
I hope these techniques help you speed up your next web scraping project while avoiding frustrating blocks.