Web Scraping Speed: How to Boost It?

As a web scraping expert who has optimized many scrapers over the years, I've seen firsthand how frustratingly slow scrapers can become as they scale up. Hitting large sites for data inevitably bogs a scraper down with network requests, data parsing, and other bottlenecks.

But with the right techniques, you can massively speed up scrapers to gather data far faster. In this comprehensive guide, I'll share the best practices I've learned for optimizing web scraper performance using processes, threads, async code, proxies, and more.

The Two Primary Bottlenecks

To understand how to optimize scrapers, you first need to know the main speed bottlenecks:

IO-Bound Tasks: These are tasks that require communicating externally, like making HTTP requests or accessing a database. They end up spending most of their time just waiting on the network or disk I/O.

CPU-Bound Tasks: These are tasks that require intensive computation locally, like parsing and analyzing data. Your CPU speed limits them.

Based on my experience optimizing many scrapers, IO-bound tasks like making requests tend to be the bigger bottleneck, often 80-90%+ of total time. However, very large scrapers doing complex parsing can also run into CPU bottlenecks.

So to speed up scrapers, we need solutions that address both major bottlenecks:

  • Async processing – for faster IO-bound tasks like requests
  • Multi-processing – for parallel CPU-bound tasks like parsing

Let's dive into each one…

Async IO for Faster Requests

The typical Python web scraper does something like this:

import requests 

for url in urls:
  response = requests.get(url)
  parse(response)

It makes a request, waits for the response, parses it, and repeats serially. The problem is that network request time dominates, so the scraper spends most of its time just waiting around. On a large crawl, easily 90%+ of the time is spent blocking on requests.

Async to the rescue! With async libraries like httpx, we can initiate multiple requests concurrently and then handle them as they complete:

import asyncio
import httpx

async def scrape(urls):
  async with httpx.AsyncClient() as client:
    tasks = [client.get(url) for url in urls]
    # handle responses in whatever order they finish
    for coro in asyncio.as_completed(tasks):
      response = await coro
      parse(response)

Now instead of waiting for each request to complete, we fire them off concurrently. This allows other code to run while waiting on network IO. In my testing, this reduces request time by 75-90% for most scrapers by eliminating most of the network wait time. The speedup is especially dramatic when hitting slow sites.

For example, here's a simple benchmark of 50 requests taking 1 second each:

# Sync requests
import requests
from time import time

start = time()
for i in range(50):
  requests.get("https://example.com?delay=1")
print(f"Took {time() - start:.2f} secs")

# Async requests
import httpx 
import asyncio

async def main():
  start = time()  
  async with httpx.AsyncClient() as client:
    tasks = [client.get("https://example.com?delay=1") 
             for i in range(50)]
    await asyncio.gather(*tasks)

  print(f"Took {time() - start:.2f} secs") 

asyncio.run(main())

Request Type    Time
Sync            50 secs
Async           1.2 secs

By using async, we've sped up these requests by 40x! This adds up to giant speed gains for real scrapers. Asyncio and httpx make async programming quite approachable in Python. But to maximize performance, you need to follow two key principles:

  1. Use asyncio.gather and asyncio.as_completed to batch up IO-bound ops.
  2. Avoid mixing async and blocking code.

Let's look at each one…

Properly Batching Async Code

The examples above use asyncio.gather to run async tasks concurrently. This batching is key – without it, async turns back into slow synchronous code!

# DON'T DO THIS
async def bad_async():

  # awaiting each request one at a time gives up all concurrency
  for url in urls:
    response = await client.get(url)
    parse(response)

# DO THIS
async def good_async():

  # fire off all requests, then handle them as they finish
  tasks = [client.get(url) for url in urls]

  for coro in asyncio.as_completed(tasks):
    response = await coro
    parse(response)

I see many people mistakenly try to await each request individually, killing performance. Always batch up requests into groups using gather or as_completed. Aside from gather, asyncio.as_completed is also very useful for streaming results as they finish:

tasks = [client.get(url) for url in urls]

for coro in asyncio.as_completed(tasks):
  response = await coro
  parse(response)

This parses results as soon as each one completes, which is great for processing-pipeline scenarios. So, in summary, properly batching async code with gather and as_completed is critical for performance; expect 5-10x slowdowns if you accidentally await each request one at a time.
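
To make the pipeline idea concrete, here's a minimal sketch that streams each response into a JSON-lines file as soon as it finishes (the output path and record fields are just placeholders):

import asyncio
import json
import httpx

async def stream_to_file(urls, path="results.jsonl"):
  async with httpx.AsyncClient() as client:
    tasks = [client.get(url) for url in urls]
    with open(path, "w") as f:
      # write each result the moment it completes
      for coro in asyncio.as_completed(tasks):
        response = await coro
        record = {"url": str(response.url), "length": len(response.text)}
        f.write(json.dumps(record) + "\n")

asyncio.run(stream_to_file(["https://example.com"] * 5))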

Avoiding Sync/Async Mixes

The other key thing is to avoid mixing async and normal blocking code. Native async libraries like httpx work seamlessly together, but calling old blocking code from async defeats the purpose.

Let's look at an example:

# imagine some useful 3rd-party parsing library
import parsing_lib

async def scrape(urls):

  tasks = [client.get(url) for url in urls]

  for coro in asyncio.as_completed(tasks):
    response = await coro

    data = parsing_lib.extract(response)  # OLD BLOCKING CODE :(

    store(data)

Here most of our code is async, but we have this old parsing library that blocks. So despite using async, each call stalls the event loop while parsing! The solution is to offload the blocking code into a thread pool using asyncio.to_thread():

import parsing_lib

async def scrape(urls):

  tasks = [client.get(url) for url in urls]

  for coro in asyncio.as_completed(tasks):
    response = await coro

    # run the blocking parser in a worker thread
    data = await asyncio.to_thread(
      parsing_lib.extract, response)

    store(data)

Now our parsing call won't block the async event loop. to_thread queues it in a thread pool to run concurrently. This is essential for integrating legacy blocking code into new async scripts without ruining async performance.

So, in summary, properly use:

  • asyncio.gather and as_completed to batch async calls
  • asyncio.to_thread to integrate blocking code

Follow these principles, and you can speed up the IO-bound portions of scrapers by orders of magnitude.
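
To tie both principles together, here's a minimal sketch of a fetch-then-parse scraper; cpu_heavy_parse is just a stand-in for whatever blocking parser you actually use:

import asyncio
import httpx

def cpu_heavy_parse(html):
  # stand-in for a blocking, CPU-heavy parser
  return len(html)

async def scrape(urls):
  # 1. batch the IO-bound requests with gather
  async with httpx.AsyncClient() as client:
    responses = await asyncio.gather(*[client.get(url) for url in urls])
  # 2. hand the blocking parsing off to the thread pool
  return await asyncio.gather(
    *[asyncio.to_thread(cpu_heavy_parse, r.text) for r in responses])

print(asyncio.run(scrape(["https://example.com"] * 3)))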

Multi-Processing for Parallel CPU Tasks

Async IO helps immensely with network bottlenecks. But many scrapers also spend lots of time parsing and analyzing data, which can bog down a single CPU core. To utilize multiple cores, we can use Python's multiprocessing module to parallelize these CPU-bound tasks.

As a simple example, let's parallelize some CPU-intensive Fibonacci calculations:

from multiprocessing import Pool
import time

def calc_fib(n):

  # naive recursive Fibonacci - deliberately CPU-heavy
  if n < 2:
    return n
  return calc_fib(n - 1) + calc_fib(n - 2)

if __name__ == "__main__":

  nums = [35] * 100

  start = time.time()

  with Pool() as pool:
    results = pool.map(calc_fib, nums)

  print(f"Took {time.time() - start:.2f} secs")

By dividing the work across multiple processes, we're able to utilize multiple CPU cores at once, with near-linear speedups. For a web scraper, we can similarly launch a process pool and divide parsing/analysis work across it:

from multiprocessing import Pool

def parse_data(response):

  # CPU-heavy parsing
  ...

if __name__ == "__main__":

  # responses gathered earlier, e.g. via async IO
  with Pool() as pool:

    pool.map(parse_data, responses)

Assuming we have the data gathered already (maybe using async IO), this allows the parsing to scale across all CPU cores. In my experience, CPU-bound scraper code scales close to linearly with core count, so a 12-core machine can parse data nearly 12x faster than a single core.

The exact speedup depends on:

  • Overhead – Inter-process communication has some overhead. Small tasks see less benefit.
  • I/O bound – If the tasks also touch disk or the network, that I/O limits the gains.
  • Parallelizability – Some logic is hard to divide across processes.

However, for moderately intensive parsing and analysis, near-linear gains are common. Just be aware of diminishing returns for very small or IO-heavy tasks.
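
One practical knob for the overhead point above is the chunksize argument to Pool.map, which batches many small tasks into fewer inter-process messages. A quick sketch with a deliberately tiny, made-up parsing function:

from multiprocessing import Pool

def parse_record(record):
  # pretend this is lightweight, CPU-bound parsing
  return record.strip().upper()

if __name__ == "__main__":
  records = ["row %d" % i for i in range(100_000)]
  with Pool() as pool:
    # a larger chunksize means fewer, bigger IPC messages and less per-task overhead
    results = pool.map(parse_record, records, chunksize=1_000)
  print(len(results))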

Putting It All Together

For maximum scraper speed, we want to combine async IO with multi-processing. A common pattern is:

  1. Use async code to fetch data quickly
  2. Dump the raw data into a queue
  3. Launch parser processes to pull data from the queue

This fully utilizes both async and multiprocessing to eliminate both major bottlenecks. Here's some example code to implement this:

import asyncio
import httpx
from multiprocessing import Queue, Process

# CPU-heavy parsing function (as defined earlier)
def parse_data(data):
  ...

# Parser process: pull items off the queue until a None sentinel arrives
def parse_proc(queue):
  while True:
    data = queue.get()
    if data is None:
      break
    parse_data(data)

# Async fetch function
async def fetch(client, url):
  response = await client.get(url)
  return response.text

async def main(urls):

  # Setup queue and start parser procs
  queue = Queue()
  procs = [Process(target=parse_proc, args=(queue,))
           for _ in range(4)]
  for p in procs:
    p.start()

  # Fetch data asynchronously
  async with httpx.AsyncClient() as client:
    data = await asyncio.gather(*[fetch(client, url) for url in urls])

  # Queue up data for the parsers, then send one sentinel per process
  for d in data:
    queue.put(d)
  for _ in procs:
    queue.put(None)

  # Wait for the parser processes to finish
  for p in procs:
    p.join()

if __name__ == "__main__":
  asyncio.run(main(urls))  # urls collected elsewhere

By architecting scrapers this way, I've been able to achieve over 100x total speedups compared to a naive single-threaded approach. Async IO minimizes waiting on network requests, while multi-processing parses data in parallel.

The exact speedup will depend on how much your particular scraper is bound by network vs. CPU. But in general, combining async and multiprocessing helps cover all the bases for maximum performance.

Leveraging Proxies for Scraping At Scale

So far, we've focused on async and multiprocessing to optimize scrapers. But when you start hitting sites extremely heavily to gather data, you need to use proxies intelligently to avoid detection. Proxies provide new IP addresses, so each request comes from a different source. This prevents target sites from recognizing the traffic as scraping and blocking it.

Here are some common proxy use cases for web scraping:

Proxy Rotation

Rotate to a different proxy on each request to maximize IP diversity:

import itertools
import httpx
import proxies  # hypothetical helper that returns a list of proxy URLs

proxy_pool = itertools.cycle(proxies.get_proxy_list())

async def fetch(url):

  # pick the next proxy in the rotation for this request
  proxy = next(proxy_pool)
  async with httpx.AsyncClient(proxy=proxy) as client:
    return await client.get(url)

Rotating proxies is essential for large crawls to distribute load and avoid blocks.
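
If a site does start blocking, rotation pairs nicely with a simple retry. Here's a rough sketch reusing the proxy_pool above; the status codes and retry count are arbitrary:

async def fetch_with_retry(url, attempts=3):
  for _ in range(attempts):
    proxy = next(proxy_pool)
    async with httpx.AsyncClient(proxy=proxy) as client:
      response = await client.get(url)
    # 403/429 usually mean this IP was flagged, so try a fresh proxy
    if response.status_code not in (403, 429):
      return response
  return response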

Residential Proxies

Use proxies from residential ISPs to mimic real user traffic:

import proxies

proxy = proxies.get_residential()

Residential proxies from providers like Bright Data, Smartproxy, Proxy-Seller, and Soax avoid blocks better than datacenter proxies since they originate from real home connections.

Geo-Targeting

Specify proxies by country to access geo-restricted content:

uk_proxy = proxies.get_proxy(country="GB")

Geo-targeting allows you to scrape region-specific data.
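
Here's a short sketch of how that might look in practice, again reusing the hypothetical proxies helper and assuming it returns a standard proxy URL:

import httpx
import proxies  # hypothetical helper, as above

async def fetch_uk_version(url):
  uk_proxy = proxies.get_proxy(country="GB")
  # the target site sees a UK IP, so it serves UK-specific content
  async with httpx.AsyncClient(proxy=uk_proxy) as client:
    return await client.get(url)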

Automated Proxy Management

Some providers like Bright Data and Smartproxy offer SDKs that abstract proxy management:

from brightdata import BrightDataClient

client = BrightDataClient(KEY)
client.scrape(url, proxy="auto")

Their SDKs take care of proxy rotation, pools, and reliability for you. So, in summary, intelligently leveraging proxies is crucial when scraping at scale to avoid blocks. The best proxy providers make management easy through APIs and SDKs.

When It Makes Sense to Use a Scraping Service

Given all these performance and proxy considerations, you may be wondering: should I just use a web scraping service instead? Services like ScrapingBee and ScraperAPI take care of all the scaling, browser automation, and proxies for you. You just focus on getting the data you want through their API.
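
The integration is usually just an HTTP call to the provider's endpoint. As a rough sketch of the typical pattern (the endpoint and parameter names here are placeholders, not any specific provider's API):

import requests

API_KEY = "YOUR_API_KEY"  # issued by the scraping service
ENDPOINT = "https://api.example-scraper.com/scrape"  # placeholder endpoint

def scrape_via_service(target_url):
  # the service fetches the page for you, handling browsers, proxies, and retries
  resp = requests.get(ENDPOINT, params={"api_key": API_KEY, "url": target_url})
  resp.raise_for_status()
  return resp.text

html = scrape_via_service("https://example.com")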

The main advantages of using a scraping service are:

  • No need to fuss with performance optimization code
  • Automated browser and proxy management
  • Easy to spin up and integrate into apps
  • Usage-based pricing instead of infrastructure costs

The downsides are:

  • Limited control compared to running your own scraper
  • The pricing plan may cap data volumes
  • Additional cost on top of existing infrastructure

So, in summary:

Consider using a scraping service when:

  • Your goal is to get data quickly without dealing with lots of scraping code
  • You need reliable scraping at large scale
  • You want to scrape from an application backend without managing scraping infrastructure

Building your own scraper makes more sense when:

  • You want complete control over scraper behavior
  • You need to gather extremely high data volumes
  • You have existing infrastructure to run crawlers cost-efficiently

So assess your specific needs, but don't discount scraping services as an option if they would make your life dramatically easier!

Conclusion

By applying the above techniques selectively based on where your scraper actually spends time, you can achieve orders of magnitude speedups. Scrapers don't have to be slow – with the right architecture, you can gather data incredibly fast. I hope these tips help you speed up your next web scraping project!

John Rooney

I'm John Watson Rooney, a self-taught Python developer and content creator focused on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
