As a web scraping expert who has optimized many scrapers over the years, I've seen firsthand how frustratingly slow scrapers can become as they scale up. As they hit large sites for data, scrapers inevitably get bogged down by network requests, data parsing, and other bottlenecks.
But with the right techniques, you can massively speed up scrapers to gather data far faster. In this comprehensive guide, I'll share the best practices I've learned for optimizing web scraper performance using processes, threads, async code, proxies, and more.
The Two Primary Bottlenecks
To understand how to optimize scrapers, you first need to know the main speed bottlenecks:
IO-Bound Tasks: These are tasks that require communicating externally, like making HTTP requests or accessing a database. They end up spending most of their time just waiting on the network or disk I/O.
CPU-Bound Tasks: These are tasks that require intensive computation locally, like parsing and analyzing data. Your CPU speed limits them.
Based on my experience optimizing many scrapers, IO-bound tasks like making requests tend to be the bigger bottleneck, often 80-90%+ of total time. However, very large scrapers doing complex parsing can also run into CPU bottlenecks.
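If you're not sure which bottleneck dominates your scraper, a quick measurement settles it. Below is a minimal sketch (not tied to any particular library) that times the fetch phase separately from the parse phase of a simple synchronous scraper; the `urls` list and `parse` function are placeholders for your own code:

```python
import time
import requests

def profile_scraper(urls, parse):
    """Crude breakdown of where a sync scraper spends its time."""
    fetch_time = parse_time = 0.0
    for url in urls:
        t0 = time.perf_counter()
        response = requests.get(url)        # IO-bound: mostly network waiting
        fetch_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        parse(response.text)                # CPU-bound: local parsing work
        parse_time += time.perf_counter() - t0

    total = fetch_time + parse_time
    print(f"fetching: {fetch_time / total:.0%}, parsing: {parse_time / total:.0%}")
```

If fetching dominates, async IO will give you the biggest win; if parsing dominates, reach for multi-processing.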
So to speed up scrapers, we need solutions that address both major bottlenecks:
- Async processing – for faster IO-bound tasks like requests
- Multi-processing – for parallel CPU-bound tasks like parsing
Let's dive into each one…
Async IO for Faster Requests
The typical Python web scraper does something like this:
```python
import requests

for url in urls:
    response = requests.get(url)
    parse(response)
```
It makes a request, waits for a response, parses it, and repeats serially. The problem is that network request time dominates, so the scraper ends up just waiting around most of the time. On a large crawl, easily 90%+ of time is spent blocking requests.
Async to the rescue! With async libraries like httpx, we can initiate multiple requests concurrently and then handle them as they complete:
```python
import asyncio
import httpx

async def scrape(urls):
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in urls]
        for task in asyncio.as_completed(tasks):
            response = await task
            parse(response)
```
Now instead of waiting for each request to complete, we fire them off concurrently. This allows other code to run while waiting on network IO. In my testing, this reduces request time by 75-90% for most scrapers by eliminating most of the network wait time. The speedup is especially dramatic when hitting slow sites.
For example, here's a simple benchmark of 50 requests taking 1 second each:
```python
# Sync requests
import requests
from time import time

start = time()
for i in range(50):
    requests.get("https://example.com?delay=1")
print(f"Took {time() - start:.2f} secs")

# Async requests
import asyncio
import httpx

async def main():
    start = time()
    async with httpx.AsyncClient() as client:
        tasks = [client.get("https://example.com?delay=1") for i in range(50)]
        await asyncio.gather(*tasks)
    print(f"Took {time() - start:.2f} secs")

asyncio.run(main())
```
| Request Type | Time |
|---|---|
| Sync | 50 secs |
| Async | 1.2 secs |
By using async, we've sped up these requests by 40x! This adds up to giant speed gains for real scrapers. Asyncio and httpx make async programming quite approachable in Python. But to maximize performance, you need to follow two key principles:
- Use `asyncio.gather` and `asyncio.as_completed` to batch up IO-bound ops.
- Avoid mixing async and blocking code.
Let's look at each one…
Properly Batching Async Code
The examples above use `asyncio.gather` to run async tasks concurrently. This batching is key – without it, async turns back into slow synchronous code!
```python
# DON'T DO THIS
async def bad_async():
    for url in urls:
        response = await client.get(url)
        parse(response)

# DO THIS
async def good_async():
    tasks = [client.get(url) for url in urls]
    for task in asyncio.as_completed(tasks):
        response = await task
        parse(response)
```
I see many people mistakenly try to `await` each request individually, killing performance. Always batch up requests into groups using `gather` or `as_completed`. Aside from `gather`, `asyncio.as_completed` is also very useful for streaming results as they finish:
```python
tasks = [client.get(url) for url in urls]
for task in asyncio.as_completed(tasks):
    response = await task
    parse(response)
```
This parses results as soon as each one completes, which is great for processing pipeline scenarios.

So, in summary, properly batching async code with `gather` and `as_completed` is critical for performance. Expect 5-10x slowdowns if you await requests one at a time instead.
Avoiding Sync/Async Mixes
The other key thing is to avoid mixing async and normal blocking code. Native async libraries like httpx work seamlessly together, but calling old blocking code from async defeats the purpose.
Let's look at an example:
```python
# imagine some useful 3rd party parsing library
import parsing_lib

async def scrape(urls):
    tasks = [client.get(url) for url in urls]
    for task in asyncio.as_completed(tasks):
        response = await task
        data = parsing_lib.extract(response)  # OLD BLOCKING CODE :(
        store(data)
```
Here most of our code is async, but we have this old parsing library that blocks. So despite using async, each call blocks waiting on parsing! The solution is to offload the blocking code to a thread pool using `asyncio.to_thread()`:
```python
import parsing_lib

async def scrape(urls):
    tasks = [client.get(url) for url in urls]
    for task in asyncio.as_completed(tasks):
        response = await task
        data = await asyncio.to_thread(parsing_lib.extract, response)
        store(data)
```
Now our parsing call won't block the async event loop. `to_thread` queues it in a thread pool to run concurrently. This is essential for integrating any legacy blocking code into new async scripts without ruining async performance.
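One caveat: `asyncio.to_thread` was only added in Python 3.9. On older interpreters, a roughly equivalent sketch runs the blocking call through the event loop's default thread pool executor (again assuming the hypothetical `parsing_lib` from the example above):

```python
import asyncio
import parsing_lib  # hypothetical blocking parsing library from the example above

async def extract_in_thread(response):
    # Pre-3.9 equivalent of asyncio.to_thread(): run the blocking call
    # in the event loop's default thread pool executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, parsing_lib.extract, response)
```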
So, in summary, properly use:

- `asyncio.gather` and `as_completed` to batch async calls
- `asyncio.to_thread` to integrate blocking code
Follow these principles, and you can speed up the IO-bound portions of scrapers by orders of magnitude.
Multi-Processing for Parallel CPU Tasks
Async IO helps immensely with network bottlenecks. But many scrapers also spend lots of time parsing and analyzing data, which can bog down a single CPU core. To utilize multiple cores, we can use Python's `multiprocessing` module to parallelize these CPU-bound tasks.
As a simple example, let's parallelize some CPU-intensive Fibonacci calculations:
```python
from multiprocessing import Pool
import time

def calc_fib(n):
    # naive recursive Fibonacci -- intentionally CPU-heavy
    if n < 2:
        return n
    return calc_fib(n - 1) + calc_fib(n - 2)

if __name__ == "__main__":
    nums = [35] * 100
    start = time.time()
    with Pool() as pool:
        results = pool.map(calc_fib, nums)
    print(f"Took {time.time() - start:.2f} secs")
```
By dividing the work across multiple processes, we're able to utilize multiple CPU cores at once, often with near-linear speedups. For a web scraper, we can similarly launch a process pool and divide parsing/analysis work across it:
```python
from multiprocessing import Pool

def parse_data(response):
    # CPU-heavy parsing
    ...

if __name__ == "__main__":
    with Pool() as pool:
        pool.map(parse_data, responses)
```
Assuming we have the data gathered already (maybe using async IO), this allows the parsing to scale across all CPU cores. In my experience, multi-processing speeds up CPU-bound scraper code roughly in proportion to the number of cores, so a 12-core machine can process data up to 12x faster.
The exact speedup depends on:
- Overhead – Inter-process communication has some overhead. Small tasks see less benefit (see the `chunksize` sketch below).
- I/O bound – Disk or network I/O limits gains for very intensive tasks.
- Parallelizability – Some logic is hard to divide across processes.
However, for moderately intensive parsing and analysis, near-linear gains are common. Just be aware of diminishing returns for very small or IO-heavy tasks.
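One practical knob for the overhead point above: `Pool.map` accepts a `chunksize` argument that hands work to each process in batches, cutting the per-task communication cost when individual tasks are small. A minimal sketch, where `parse_record` stands in for your own per-record parsing logic:

```python
from multiprocessing import Pool

def parse_record(raw):
    # stand-in for your CPU-heavy per-record parsing
    return raw.strip().lower()

if __name__ == "__main__":
    records = ["  Some Raw Record  "] * 100_000
    with Pool() as pool:
        # chunksize batches many small tasks per inter-process round-trip,
        # reducing the per-task overhead
        results = pool.map(parse_record, records, chunksize=1_000)
```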
Putting It All Together
For maximum scraper speed, we want to combine async IO with multi-processing. A common pattern is:
- Use async code to fetch data quickly
- Dump the raw data into a queue
- Launch parser processes to pull data from queue
This fully utilizes both async and multiprocessing to eliminate both major bottlenecks. Here's some example code to implement this:
```python
import asyncio
from multiprocessing import Queue, Process

import httpx

# Async fetch function
async def fetch(client, url):
    response = await client.get(url)
    return response.text

# Parser process function
def parse_proc(queue):
    while True:
        data = queue.get()
        if data is None:  # sentinel value: no more work
            break
        parse_data(data)

async def main():
    # Set up queue and start parser procs
    queue = Queue()
    procs = [Process(target=parse_proc, args=(queue,)) for _ in range(4)]
    for p in procs:
        p.start()

    # Fetch data asynchronously
    async with httpx.AsyncClient() as client:
        tasks = [fetch(client, url) for url in urls]
        data = await asyncio.gather(*tasks)

    # Queue up data for parsers, then signal them to stop
    for d in data:
        queue.put(d)
    for _ in procs:
        queue.put(None)

    # Join processes
    for p in procs:
        p.join()

asyncio.run(main())
```
By architecting scrapers this way, I've been able to achieve over 100x total speedups compared to a naive single-threaded approach. Async IO minimizes waiting on network requests, while multi-processing parses data in parallel.
The exact speedup will depend on how much your particular scraper is bound by network vs. CPU. But in general, combining async and multiprocessing helps cover all the bases for maximum performance.
Leveraging Proxies for Scraping At Scale
So far, we've focused on async and multiprocessing to optimize scrapers. But when you start hitting sites extremely heavily to gather data, you need to use proxies intelligently to avoid detection. Proxies provide new IP addresses, so each request comes from a different source. This prevents target sites from recognizing the traffic as scraping and blocking it.
Here are some common proxy use cases for web scraping:
Proxy Rotation
Continually rotate different proxies on each request to maximize IP diversity:
```python
import proxies

proxy_pool = proxies.get_proxy_list()

async def fetch(url):
    proxy = next(proxy_pool)
    async with httpx.AsyncClient(proxy=proxy) as client:
        return await client.get(url)
```
Rotating proxies is essential for large crawls to distribute load and avoid blocks.
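The snippet above leans on a hypothetical `proxies` helper module. If you just have a plain list of proxy URLs, a minimal rotation sketch using `itertools.cycle` looks like this (the proxy URLs are placeholders):

```python
import itertools
import httpx

PROXIES = [
    "http://user:pass@proxy1.example.com:8000",  # placeholder proxy URLs
    "http://user:pass@proxy2.example.com:8000",
]
proxy_pool = itertools.cycle(PROXIES)

async def fetch(url):
    # each request goes out through the next proxy in the rotation;
    # recent httpx versions accept a single proxy URL via the `proxy` argument
    proxy = next(proxy_pool)
    async with httpx.AsyncClient(proxy=proxy) as client:
        return await client.get(url)
```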
Residential Proxies
Use proxies from residential ISPs to mimic real user traffic:
```python
import proxies

proxy = proxies.get_residential()
```
Residential proxies from providers like Bright Data, Smartproxy, Proxy-Seller, and Soax avoid blocks better than datacenter proxies since they originate from real homes.
Geo-Targeting
Specify proxies by country to access geo-restricted content:
```python
uk_proxy = proxies.get_proxy(country="GB")
```
Geo-targeting allows you to scrape region-specific data.
Automated Proxy Management
Some providers like Bright Data and Smartproxy offer SDKs that abstract proxy management:
```python
from brightdata import BrightDataClient

client = BrightDataClient(KEY)
client.scrape(url, proxy="auto")
```
Their SDKs take care of proxy rotation, pools, and reliability for you. So, in summary, intelligently leveraging proxies is crucial when scraping at scale to avoid blocks. The best proxy providers make management easy through APIs and SDKs.
When It Makes Sense to Use a Scraping Service
Given all these performance and proxy considerations, you may be wondering: should I just use a web scraping service instead? Services like ScrapingBee and ScraperAPI take care of all the scaling, browser automation, and proxies for you. You just focus on getting the data you want through their API.
The main advantages of using a scraping service are:
- No need to fuss with performance optimization code
- Automated browser and proxy management
- Easy to spin up and integrate into apps
- Usage-based pricing instead of infrastructure costs
The downsides are:
- Limited control compared to running your own scraper
- The pricing plan may cap data volumes
- Additional cost on top of existing infrastructure
So, in summary:
Consider using a scraping service when:
- Your goal is to get data quickly without dealing with lots of scraping code
- You need reliable scraping at large scale
- You want to scrape from an application backend without managing scraping infrastructure
Building your own scraper makes more sense when:
- You want complete control over scraper behavior
- You need to gather extremely high data volumes
- You have existing infrastructure to run crawlers cost-efficiently
So assess your specific needs, but don't discount scraping services as an option if they would make your life dramatically easier!
Conclusion
By applying the above techniques selectively based on where your scraper actually spends time, you can achieve orders of magnitude speedups. Scrapers don't have to be slow – with the right architecture, you can gather data incredibly fast. I hope these tips help you speed up your next web scraping project!