How to Block Resources in Selenium with Python?

When using Selenium for web scraping, it often loads many unnecessary resources like images, videos, ads, and tracking scripts. This slows down scraping and wastes bandwidth. In this guide, we'll see how to configure Selenium to block these unnecessary resources and speed up scraping.

Why Block Resources in Selenium?

Here are some key benefits of blocking resources in Selenium:

  • Faster page loads: Pages load much faster when images, videos, ads, etc. aren't fetched. This significantly speeds up scraping.
  • Lower bandwidth usage: Blocking resources can reduce bandwidth usage by 2-10x or more. This helps avoid hitting rate limits.
  • Avoid anti-scraping: Some sites track scraping through resources like analytics scripts. Blocking them helps avoid detection.
  • Cleaner DOM: The DOM will contain only the essential content without unnecessary clutter from media and ads. This makes scraping easier.

Overall, blocking resources results in faster, more efficient, and more reliable web scraping with Selenium.

How to Block Resources in Selenium

Selenium doesn't provide built-in support for blocking requests. But we can use a proxy like mitmproxy to intercept requests and block them before they reach the browser. Here are the steps:

1. Install mitmproxy

We'll use mitmproxy since it lets us intercept traffic and write custom blocking logic in Python. Install it with:

pip install mitmproxy

2. Create a blocking script

Create a Python script called block.py with the following code:

from mitmproxy import http

# resource types to block, matched against the Sec-Fetch-Dest header
# that modern browsers attach to every request
BLOCK_RESOURCE_TYPES = [
    "image",
    "style",   # stylesheets
    "audio",
    "video",
    "font",
    "script",
    "empty",   # fetch/XHR requests
    "object",
]

def request(flow: http.HTTPFlow) -> None:
    # block the resource if the request type is in BLOCK_RESOURCE_TYPES
    resource_type = flow.request.headers.get("sec-fetch-dest", "")
    if resource_type in BLOCK_RESOURCE_TYPES:
        flow.response = http.Response.make(
            404,  # (optional) custom response code
            b"Blocked",  # (optional) custom response content
            {"Content-Type": "text/html"}  # (optional) custom response headers
        )

This script infers the resource type from the Sec-Fetch-Dest header that modern browsers attach to each request and blocks types like image, script, etc. Note that http.Response.make is the mitmproxy 7+ API. You can also block by domain, URL patterns, file extensions, and so on.

3. Start the proxy

Run the proxy with the block script:

mitmproxy -s block.py

This will start mitmproxy's interactive console on its default port, 8080. (Use mitmdump -s block.py to run it without the interactive UI.)

4. Configure Selenium to use the proxy

Now we need to make Selenium use this proxy. When creating the webdriver, set the proxy; for example, with Chrome options (a minimal sketch, assuming Chrome and a local mitmproxy on port 8080):

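from selenium import webdriver

# a minimal sketch, assuming Chrome and mitmproxy on localhost:8080
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://localhost:8080")
# mitmproxy re-signs HTTPS traffic with its own CA certificate; either
# install that certificate or ignore TLS errors while testing
options.add_argument("--ignore-certificate-errors")
driver = webdriver.Chrome(options=options)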

This routes all traffic from Selenium through mitmproxy where our blocking rules are applied.

5. Test it out

Let's try it out on a product page:

driver.get("https://web-scraping.dev/product/1")
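
To gauge the speedup, a rough check (a sketch, not a rigorous benchmark) is to time the same load with and without the proxy configured:

import time

start = time.perf_counter()
driver.get("https://web-scraping.dev/product/1")
print(f"page loaded in {time.perf_counter() - start:.2f}s")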

All resources matching the block rules like images and scripts won't load. The page will be lean and load faster. And that's it! With these steps, you can block unnecessary resources in Selenium to speed up and optimize your web scraping.

Customizing Blocking Rules

The block.py script gives you full control to implement any blocking logic you need:

  • Block by domain names like *.google-analytics.com
  • Block by file extensions like .jpg, .png
  • Block by path patterns like /tracking/, /ads/
  • Block by request type like image, media
  • Allow some resources like CSS/JS using regex
  • And many other combinations

Here are some examples of different blocking approaches:

Block by domain name

BLOCK_DOMAINS = [
    "www.google-analytics.com",
    "stats.g.doubleclick.net",
    "widget.intercom.io",
]
# suffix matching stands in for wildcards like *.tracker.com
BLOCK_DOMAIN_SUFFIXES = (".tracker.com",)

host = flow.request.pretty_host  # pretty_host is a property, not a method
if host in BLOCK_DOMAINS or host.endswith(BLOCK_DOMAIN_SUFFIXES):
    flow.response = http.Response.make(404, b"Blocked")

This is useful for blocking tracking/analytics scripts.

Block images by extension

# endswith() accepts a tuple of suffixes, not a list
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".svg", ".gif", ".webp")

if flow.request.pretty_url.endswith(IMAGE_EXTENSIONS):
    flow.response = http.Response.make(404, b"Blocked")

Speeds up scraping by not loading any images.

Block media resources

# Sec-Fetch-Dest values for media resources
MEDIA_TYPES = ["image", "font", "audio", "video"]

if flow.request.headers.get("sec-fetch-dest") in MEDIA_TYPES:
    flow.response = http.Response.make(404, b"Blocked")

Good for avoiding media files like images, videos, fonts.

Allow CSS/JS

import re  # place this import at the top of block.py

if flow.request.headers.get("sec-fetch-dest") in ["style", "script"]:
    if not re.search(r"\.(css|js)$", flow.request.pretty_url):
        flow.response = http.Response.make(404, b"Blocked")

This blocks script and stylesheet requests except those whose URL ends in .css or .js. The possibilities are endless for creating targeted blocking rules.

Optimizing Blocking Rules

When using blocking, keep these tips in mind:

  • Start with the broadest blocking by type, like images, scripts, etc. Then selectively allow what's needed.
  • Block third-party domains aggressively. Most are unnecessary for scraping.
  • Use blocking by URL patterns, not just domain names. Subdomains can bypass domain blocking.
  • Continuously fine-tune rules. Analyze network traffic to identify unnecessary resources to block (see the audit sketch after this list).
  • Allow minimal CSS/JS. Use regex only to allow essential static assets.
  • For heavy pages, block partially, then unblock selectively to identify only required resources.
  • Balance blocking against allowing. Avoid breaking page functionality.
  • Test extensively after each change. Some resources can be necessary for the page to work properly.
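
For that traffic analysis, a small mitmproxy addon can help. The sketch below (assuming browsers send the Sec-Fetch-Dest header) counts requests by host and resource type so you can see what's worth blocking:

from collections import Counter

from mitmproxy import http

seen = Counter()

def request(flow: http.HTTPFlow) -> None:
    # tally each (host, resource type) pair as traffic passes through
    kind = flow.request.headers.get("sec-fetch-dest", "unknown")
    seen[(flow.request.pretty_host, kind)] += 1

def done():
    # mitmproxy calls done() on shutdown - print the top traffic sources
    for (host, kind), count in seen.most_common(20):
        print(f"{count:5d}  {kind:<10}  {host}")

Save it as, say, audit.py, run mitmdump -s audit.py, browse the target site through the proxy, then stop it to see the summary.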

With some trial and error, you'll be able to optimize the blocking rules to maximize speed and efficiency for each site.

Pros and Cons of Blocking Resources

Blocking unnecessary resources has significant benefits but also some downsides to keep in mind:

Pros

  • Faster page loads – Speeds up scraping significantly
  • Lower bandwidth – Reduces data usage by 2-10x
  • Avoid tracking – Blocks analytics scripts and tracking pixels
  • Cleaner DOM – Removes unnecessary clutter from DOM
  • No changes to scraping code required – Just configure blocking rules once

Cons

  • Configuration required – Need to configure blocking rules for each site
  • Can break pages – Overblocking can cause pages to break
  • Extra dependency – Need to install and run the proxy alongside scraper
  • Debugging harder – No access to blocked requests for debugging
  • Extra server resource – Proxy needs extra CPU/memory resources

In summary, resource blocking can massively improve scraping performance when configured properly for each site. It requires some extra setup, but the benefits are worth the effort in most cases.

Tools for Blocking Resources

There are a few different tools you can use for request blocking:

  • mitmproxy – Python-based intercepting proxy with custom scripting.
  • Fiddler – Popular Windows web debugging proxy with blocking support.
  • BrowserMob Proxy – Open-source Java proxy with Selenium integration and a REST API.
  • PhantomJS – Headless browser with API for request blocking. No longer maintained.

mitmproxy is recommended for its flexibility, Python scripting, and seamless integration with Selenium. Alternatives like Fiddler work too, but may require more setup. The specific proxy doesn't matter as long as it allows blocking requests programmatically.

Blocking Resources on Other Web Scraping Tools

Request blocking isn't limited only to Selenium – it can help speed up and optimize any web scraping tool or library. Here's how to use a proxy for blocking with some other popular web scraping tools:

Scrapy

Add a downloader middleware that routes all requests through the proxy (the sketch below assumes a project package named myproject):

# middlewares.py
class ProxyMiddleware:
    def process_request(self, request, spider):
        # route every request through the local mitmproxy instance
        request.meta["proxy"] = "http://localhost:8080"

# settings.py - enable the middleware
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
}

Optionally, set a small DOWNLOAD_DELAY (e.g. 0.5 seconds) in settings.py to avoid overwhelming the proxy.

requests

Pass the proxy URL to requests.get():

proxy = "http://localhost:8080"
# for HTTPS, either trust the mitmproxy CA certificate
# or pass verify=False while testing
requests.get(url, proxies={"http": proxy, "https": proxy})

Puppeteer

Pass the proxy server when launching Puppeteer:

const browser = await puppeteer.launch({
  args: [`--proxy-server=${proxyUrl}`]
});

Playwright

Use a browser context with proxy enabled:

const context = await browser.newContext({
  proxy: {
    server: proxyUrl
  }
});
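
Since this guide is Python-focused, here is the Playwright equivalent in Python (a minimal sketch using the sync API):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # route all browser traffic through the local mitmproxy instance
    browser = p.chromium.launch(proxy={"server": "http://localhost:8080"})
    page = browser.new_page()
    page.goto("https://web-scraping.dev/product/1")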

So blocking can be added to any scraping stack for big performance improvements.

Final Thoughts

Blocking unnecessary resources is a simple but highly effective optimization for Selenium scrapers. When properly implemented, this approach leads to Selenium scrapers that are not only faster but also more efficient in bandwidth usage and more reliable when operating at scale.

However, it's crucial to rigorously test your sites after implementing resource blocking to avoid inadvertently hindering necessary functions. Integrating resource blocking can significantly elevate the efficiency of your Selenium web scraping projects.

John Rooney

John Watson Rooney is a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
