Web scraping often requires proxies to avoid blocking and rate limiting. When scraping at scale, correctly managing and rotating proxy resources is crucial for stability, performance and cost efficiency.
In this comprehensive guide, I'll explain common proxy rotation strategies, cover key implementation considerations, and share actionable code examples you can reference.
Why Use Proxies in Web Scraping?
Briefly, proxies serve two key functions:
- Obfuscate scraper identity and distribute requests across different IPs to avoid patterns that trigger blocks
- Provide geographic targeting options to access locally restricted content
Without proxies, scrapers typically get blocked once they exceed thresholds for things like requests per minute. Rotating proxies helps tremendously in avoiding these limits.
Overview of Proxy Rotation
Proxy rotation refers to the pattern in which proxies are chosen from a pool to make requests through. Instead of routing all traffic through a single proxy IP, we can rotate through different ones randomly or based on custom logic. This breaks usage patterns and allows proxies that get blocked time to recover.
Benefits of proxy rotation include:
- Avoiding rate limits and blocks based on request patterns
- Maximizing use of available healthy proxies
- Allowing blocked proxies time to recover
The effectiveness of a rotation strategy depends on the approach and logic used. Next, I'll explain popular rotation options.
Proxy Rotation Strategies
There are several proven proxy rotation strategies, each with its own pros and cons.
Random Proxy Selection
A basic approach is selecting a random proxy from the pool uniformly for each request.
Pros
- Simple to implement
- Provides good randomness
Cons
- Risk of same subnet/ASN getting picked repeatedly
- Doesn't consider performance and health factors
Round-Robin Proxy Selection
Round-robin loops through proxies sequentially. Once the end of the pool is reached, it starts again from the beginning.
Pros
- Very simple to implement
- Built-in recovery period if blocked
Cons
- Deterministic and predictable pattern
- Doesn't account for performance
Weighted Random Proxy Selection
Weighted random selection chooses proxies randomly but applies custom probability weighting. Proxies can be promoted or demoted intelligently.
Pros
- True randomness
- Can optimize for health/performance
- Highly customizable
Cons
- More complex to implement
Proxy Rotation Implementation Factors
When implementing a rotation strategy, here are some key factors to consider:
Subnet
Proxies are often allocated from shared subnets, so picking the same subnet twice in a row creates an obvious pattern that is easy to detect and block.
ASN
The ASN (Autonomous System Number) identifies the network operator that owns the IP range. Rotating across diverse ASNs helps avoid network-level detection.
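To make the subnet and ASN factors concrete, here is a minimal sketch of diversity-aware selection. The `proxy_asns` mapping is hypothetical; in practice, ASN data would come from your provider's metadata or a WHOIS lookup:

```python
import random
from typing import Optional

# Hypothetical metadata: proxy -> ASN (real values would come from your provider)
proxy_asns = {
    '104.56.25.47:8080': 'AS12345',
    '104.57.15.95:8080': 'AS12345',
    '52.29.127.55:8000': 'AS67890',
}

def subnet(proxy: str) -> str:
    # '104.56.25.47:8080' -> '104.56.25'
    return proxy.split(':')[0].rpartition('.')[0]

def pick_diverse(last_proxy: Optional[str] = None) -> str:
    """Pick a proxy whose subnet and ASN both differ from the previous pick."""
    candidates = [
        p for p in proxy_asns
        if last_proxy is None
        or (subnet(p) != subnet(last_proxy) and proxy_asns[p] != proxy_asns[last_proxy])
    ]
    # fall back to the full pool if nothing satisfies both constraints
    return random.choice(candidates or list(proxy_asns))
```

The fallback matters: with a small pool, insisting on both constraints can leave zero candidates, and it is better to reuse a subnet than to stall.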
Location Diversity
Rotating datacenter proxies across different metropolitan areas improves geo-targeting success. For residential proxies from providers like Bright Data, Smartproxy, Proxy-Seller, and Soax, diverse locations also help avoid regional blocks.
Performance & Health Tracking
Monitoring proxy status (alive/dead) and request latency lets you automatically penalize underperforming proxies and give healthy, responsive ones higher selection priority.
Deliberately rotating across all of these axes makes your proxy usage much harder to detect and block. Next, I'll share some code examples of common rotation approaches.
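As a starting point, here is a minimal sketch of such health tracking; the class name, sample window, and error threshold are illustrative choices rather than any library's API:

```python
from collections import defaultdict
from statistics import mean

class HealthTracker:
    """Track per-proxy latency and errors to demote underperformers."""

    def __init__(self, max_samples: int = 20):
        self.latencies = defaultdict(list)  # proxy -> recent latencies (seconds)
        self.errors = defaultdict(int)      # proxy -> consecutive error count
        self.max_samples = max_samples

    def record_success(self, proxy: str, latency: float) -> None:
        self.errors[proxy] = 0  # a success clears the error streak
        samples = self.latencies[proxy]
        samples.append(latency)
        del samples[:-self.max_samples]  # keep only the most recent samples

    def record_error(self, proxy: str) -> None:
        self.errors[proxy] += 1

    def weight(self, proxy: str) -> float:
        """Selection weight: lower for slow or repeatedly failing proxies."""
        if self.errors[proxy] >= 3:
            return 0.1  # effectively benched until it succeeds again
        samples = self.latencies[proxy]
        avg = mean(samples) if samples else 1.0
        return 1.0 / max(avg, 0.1)
```

Weights like these plug directly into the weighted random selection shown later.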
Basic Random Proxy Rotator
Here is a simple proxy rotator implementing random selection. It keeps track of the last used subnet to avoid consecutive duplicates:
```python
import random

proxies = [
    '104.56.25.47:8080',
    '104.57.15.95:8080',
    '52.29.127.55:8000',
]

last_subnet = None

def get_proxy():
    """Pick a random proxy, re-rolling if its subnet matches the last one used."""
    global last_subnet
    proxy = random.choice(proxies)
    subnet = proxy.split(":")[0].rpartition('.')[0]
    if subnet == last_subnet:
        return get_proxy()  # same subnet as last time - pick again
    last_subnet = subnet
    return proxy
```
This works well for a basic use case. Next, let's look at slightly smarter implementations.
Round-Robin Proxy Rotator with Recovery
Here is an example round-robin proxy rotator that skips dead proxies. Once every proxy has been blocked, dead proxies are moved into a recovering state and rotated back in:
```python
from dataclasses import dataclass
from typing import Dict, Literal

RotationStatus = Literal['alive', 'dead', 'recovering']

@dataclass
class Proxy:
    ip: str

proxies = [
    Proxy('104.56.25.47:8080'),
    Proxy('104.57.15.95:8080'),
    # etc.
]

proxy_status: Dict[str, RotationStatus] = {}
next_index = 0  # where the round-robin scan resumes

def get_proxy() -> Proxy:
    global next_index
    # scan the pool once, starting where the last call left off
    for offset in range(len(proxies)):
        proxy = proxies[(next_index + offset) % len(proxies)]
        status = proxy_status.get(proxy.ip, 'alive')
        if status == 'dead':
            continue  # skip blocked proxies
        if status == 'recovering':
            proxy_status[proxy.ip] = 'alive'  # put proxy back in rotation
        next_index = (next_index + offset + 1) % len(proxies)
        return proxy
    # every proxy is blocked: let dead proxies start recovering
    for proxy in proxies:
        if proxy_status.get(proxy.ip) == 'recovering':
            proxy_status[proxy.ip] = 'alive'
        else:
            proxy_status[proxy.ip] = 'recovering'
    return get_proxy()
```
This allows proxies some recovery time if blocked while continuing to rotate available ones.
Weighted Random Proxy Rotator
Finally, here is an example weighted random proxy rotator. It uses proxy metadata to influence selection probability intelligently:
```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from random import choices
from typing import Literal, Optional

RotationStatus = Literal['alive', 'dead']

@dataclass
class Proxy:
    ip: str
    location: str
    status: RotationStatus = 'alive'
    last_used: Optional[datetime] = None

proxy_pool = [
    Proxy('104.56.25.47:8080', 'United States'),
    Proxy('104.57.15.95:8080', 'Canada'),
]

def weigh_proxy(proxy: Proxy) -> int:
    if proxy.status == 'dead':
        return 1  # tiny weight: dead proxies occasionally get retried
    weight = 10
    if proxy.location == 'United States':
        weight += 3  # boost a preferred target geography
    time_limit = datetime.utcnow() - timedelta(minutes=2)
    if proxy.last_used and proxy.last_used > time_limit:
        weight -= 2  # demote recently used proxies
    return weight

def get_proxy() -> Proxy:
    weights = [weigh_proxy(proxy) for proxy in proxy_pool]
    proxy = choices(proxy_pool, weights=weights)[0]
    proxy.last_used = datetime.utcnow()
    return proxy
```
This allows applying custom logic to make proxy selection appropriately random but also smart.
Using a Proxy API for Proxy Rotation
Some of the proxy providers mentioned earlier also offer rotating proxy APIs. Below, I'll use Bright Data's proxy API as an example of rotating proxies through a provider's API.
To use Bright Data's proxy API for rotating proxies in web scraping, you can follow these steps:
Step 1. Sign up for a Bright Data account and choose the appropriate proxy plan based on your needs.
Step 2. Implement the proxy rotation in your web scraping code. In Python, you can use the `requests` library along with a list of proxies. Here's an example of how to do this:
```python
import random
import requests

proxies = [
    "103.155.217.1:41317",
    "47.91.56.120:8080",
    "103.141.143.102:41516",
    "167.114.96.13:9300",
    "103.83.232.122:80",
]

def scraping_request(url):
    """Send a request through a randomly selected proxy."""
    proxy = random.choice(proxies)
    ips = {"http": proxy, "https": proxy}
    response = requests.get(url, proxies=ips)
    print(f"Proxy currently being used: {ips['https']}")
    return response.text
```
This code selects a random proxy from your list each time it is called and uses it in scraping requests.
Step 3. If you're using Bright Data's proxies, replace the `proxies` list in the code above with the proxies provided by Bright Data.
Step 4. Make sure to handle errors and exceptions in your code to avoid issues with invalid proxies or rate limits.
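As a sketch of Step 4's error handling, one approach is to retry a failed request through a different proxy and temporarily bench proxies that fail repeatedly. The function name and the failure threshold here are illustrative choices:

```python
import random
import requests

proxies = ["103.155.217.1:41317", "47.91.56.120:8080"]
failures = {}  # proxy -> consecutive failure count

def fetch_with_retries(url, max_attempts=3, timeout=10):
    """Retry through different proxies, benching ones that keep failing."""
    last_error = None
    for _ in range(max_attempts):
        # prefer proxies with fewer than 3 consecutive failures
        pool = [p for p in proxies if failures.get(p, 0) < 3] or proxies
        proxy = random.choice(pool)
        try:
            response = requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=timeout
            )
            response.raise_for_status()
            failures[proxy] = 0  # proxy is healthy again
            return response.text
        except requests.RequestException as exc:
            failures[proxy] = failures.get(proxy, 0) + 1
            last_error = exc
    raise last_error
```

Catching `requests.RequestException` covers timeouts, connection errors, and (via `raise_for_status`) HTTP error codes such as 429 rate limits.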
Step 5. If you're using a web scraping framework like Scrapy, you can also integrate the rotating proxies into your Scrapy project.
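For Scrapy, a minimal downloader middleware can assign a proxy per request via `request.meta['proxy']`. The `ROTATING_PROXIES` setting name and middleware placement below are illustrative, not a Scrapy built-in:

```python
import random

# In settings.py you would enable this middleware, e.g.:
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingProxyMiddleware": 610}
# ROTATING_PROXIES = ["103.155.217.1:41317", "47.91.56.120:8080"]

class RotatingProxyMiddleware:
    """Scrapy downloader middleware that assigns a random proxy per request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # read the proxy pool from project settings
        return cls(crawler.settings.getlist("ROTATING_PROXIES"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = "http://" + random.choice(self.proxies)
```

Scrapy's built-in `HttpProxyMiddleware` then routes the request through whatever `request.meta["proxy"]` contains.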
Best Practices for Effective Proxy Rotation
Beyond the various algorithms, here are some guidelines for proxy rotation success:
- Penalize Underperforming Proxies: Monitor latency, errors, and blocks to identify bad proxies automatically. Reduce their selection probability.
- Avoid Same Subnets in Sequence: Check for subnet conflicts on random selection to prevent obvious patterns.
- Smart Recovery Logic: Don't permanently blacklist blocked proxies. Rotate them back in after sufficient cooldown periods.
- Utilize Proxy Metadata: Factor proxy details like ASN, subnets, and locations into selection weighting logic.
- Combine Approaches Thoughtfully: For example, use round-robin selection by location group and random within groups.
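As a sketch of that last point, assuming a hypothetical pool grouped by location, round-robin over the location groups with random selection inside each group could look like this:

```python
import random
from itertools import cycle

# Hypothetical pool grouped by location
proxies_by_location = {
    "US": ["104.56.25.47:8080", "104.57.15.95:8080"],
    "DE": ["52.29.127.55:8000"],
    "CA": ["99.79.12.8:3128"],
}

location_cycle = cycle(proxies_by_location)  # round-robin over location groups

def get_proxy() -> str:
    """Round-robin across locations, random proxy within the chosen location."""
    location = next(location_cycle)
    return random.choice(proxies_by_location[location])
```

This guarantees steady geographic spread while keeping the within-group choice unpredictable.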
Summary
Implementing an intelligent proxy rotation strategy is crucial for stability and scale when web scraping. This guide covered popular algorithms like random, round-robin, and weighted selection, discussing the pros and cons of each. I also shared code examples for common rotation approaches you can reference.
I hope these actionable techniques for proxy rotation help you take your web scraping project to the next level!