When using Selenium for web scraping, the browser often loads many unnecessary resources such as images, videos, ads, and tracking scripts. This slows down scraping and wastes bandwidth. In this guide, we'll see how to configure Selenium to block these unnecessary resources and speed up scraping.
Why Block Resources in Selenium?
Here are some key benefits of blocking resources in Selenium:
- Faster page loads: By not loading images, videos, ads, etc., pages load much faster. This significantly speeds up scraping.
- Lower bandwidth usage: Blocking resources can reduce bandwidth usage by 2-10x or more. This helps avoid hitting rate limits.
- Avoid anti-scraping: Some sites track scraping through resources like analytics scripts. Blocking them helps avoid detection.
- Cleaner DOM: The DOM will contain only the essential content without unnecessary clutter from media and ads. This makes scraping easier.
Overall, blocking resources results in faster, more efficient, and more reliable web scraping with Selenium.
How to Block Resources in Selenium
Selenium doesn't provide built-in support for blocking requests. But we can use a proxy like mitmproxy to intercept requests and block them before they reach the browser. Here are the steps:
1. Install mitmproxy
We'll use mitmproxy since it lets us intercept traffic and write custom blocking logic in Python. Install it with:
pip install mitmproxy
2. Create a blocking script
Create a Python script called block.py with the following code:
from mitmproxy import http

# resource types to block, matched against the Sec-Fetch-Dest header
# that modern Chromium-based browsers attach to every request
BLOCK_RESOURCE_TYPES = [
    "image",
    "style",   # stylesheets
    "font",
    "script",
    "audio",
    "video",
    "object",
    "empty",   # fetch/XHR requests
]

def request(flow: http.HTTPFlow) -> None:
    resource_type = flow.request.headers.get("Sec-Fetch-Dest", "")
    if resource_type in BLOCK_RESOURCE_TYPES:
        # short-circuit the request with a stub response so the
        # browser never downloads the real resource
        flow.response = http.Response.make(
            404,                            # (optional) custom response code
            b"Blocked",                     # (optional) custom response content
            {"Content-Type": "text/html"},  # (optional) custom response headers
        )
This script blocks requests based on their type (image, script, etc.) as reported by the browser. You can also block by domain, URL patterns, file extensions, and more.
3. Start the proxy
Run the proxy with the block script:
mitmproxy -s block.py
This will start mitmproxy on port 8080.
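Note that the mitmproxy command opens an interactive console UI. For headless runs (e.g. on a server or in CI), the mitmdump command from the same package runs the script without the UI:

mitmdump -s block.py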
4. Configure Selenium to use the proxy
Now we need to make Selenium use this proxy. When creating the webdriver, set the proxy like:
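Here's a minimal sketch for Chrome (assuming mitmproxy is on localhost:8080 as above; the --ignore-certificate-errors flag is needed because mitmproxy re-signs HTTPS traffic with its own certificate authority):

from selenium import webdriver

options = webdriver.ChromeOptions()
# route all browser traffic through the local mitmproxy instance
options.add_argument("--proxy-server=http://localhost:8080")
# accept mitmproxy's self-signed certificates (or install its CA instead)
options.add_argument("--ignore-certificate-errors")

driver = webdriver.Chrome(options=options)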
This routes all traffic from Selenium through mitmproxy where our blocking rules are applied.
5. Test it out
Let's try it out on a product page:
driver.get("https://web-scraping.dev/product/1")
All resources matching the block rules, such as images and scripts, won't load, so the page stays lean and loads faster.
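To sanity-check the speedup, a rough sketch is to time the load with the proxy-configured driver from above and compare against a run with the proxy flags removed:

import time

start = time.time()
driver.get("https://web-scraping.dev/product/1")
print(f"Page loaded in {time.time() - start:.2f}s")

And that's it! With these steps, you can block unnecessary resources in Selenium to speed up and optimize your web scraping.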
Customizing Blocking Rules
The block.py script gives you full control to implement any blocking logic you need:
- Block by domain names like *.google-analytics.com
- Block by file extensions like .jpg, .png
- Block by path patterns like /tracking, /ads
- Block by request type like image, media
- Allow some resources like CSS/JS using regex
- And many other combinations
Here are some examples of different blocking approaches:
Block by domain name
from fnmatch import fnmatch

BLOCK_DOMAINS = [
    "www.google-analytics.com",
    "stats.g.doubleclick.net",
    "widget.intercom.io",
    "*.tracker.com",  # wildcard
]

# pretty_host is a property (not a method); fnmatch handles the wildcard entry
if any(fnmatch(flow.request.pretty_host, pattern) for pattern in BLOCK_DOMAINS):
    flow.response = http.Response.make(404, b"Blocked")
This is useful for blocking tracking/analytics scripts.
Block images by extension
# endswith() accepts a tuple (not a list) of suffixes
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".svg", ".gif", ".webp")

if flow.request.pretty_url.endswith(IMAGE_EXTENSIONS):
    flow.response = http.Response.make(404, b"Blocked")
Speeds up scraping by not loading any images.
Block media resources
# Sec-Fetch-Dest values for heavy media resources
MEDIA_TYPES = ["image", "audio", "video", "font"]

if flow.request.headers.get("Sec-Fetch-Dest", "") in MEDIA_TYPES:
    flow.response = http.Response.make(404, b"Blocked")
Good for avoiding media files like images, videos, fonts.
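Block by path pattern
Path-based rules follow the same shape. A sketch using illustrative /tracking and /ads paths (substitute patterns you observe on your target site):

BLOCK_PATH_PATTERNS = ["/tracking", "/ads", "/pixel"]

if any(pattern in flow.request.path for pattern in BLOCK_PATH_PATTERNS):
    flow.response = http.Response.make(404, b"Blocked")

Useful for tracking endpoints that get called regardless of which domain serves them.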
Allow CSS/JS
import re

# "style" and "script" are the Sec-Fetch-Dest values for stylesheets and scripts
if flow.request.headers.get("Sec-Fetch-Dest", "") in ("style", "script"):
    # allow plain .css/.js files, block everything else of these types
    if not re.search(r"\.(css|js)$", flow.request.pretty_url):
        flow.response = http.Response.make(404, b"Blocked")
Blocks scripts and stylesheets except for plain .css and .js files. The possibilities are endless for creating targeted blocking rules.
Optimizing Blocking Rules
When using blocking, keep these tips in mind:
- Start with the broadest blocking by type, like images, scripts, etc. Then selectively allow what's needed.
- Block third-party domains aggressively. Most are unnecessary for scraping.
- Use blocking by URL patterns, not just domain names. Subdomains can bypass domain blocking.
- Continuously fine-tune rules. Analyze network traffic to identify unnecessary resources to block (see the traffic-summary sketch after these tips).
- Allow minimal CSS/JS. Use regex only to allow essential static assets.
- For heavy pages, block partially, then unblock selectively to identify only required resources.
- Balance blocking against unblocking to avoid breaking page functionality.
- Test extensively after each change. Some resources can be necessary for the page to work properly.
With some trial and error, you'll be able to optimize the blocking rules to maximize speed and efficiency for each site.
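One way to do the traffic analysis mentioned above is a small mitmproxy addon that tallies requests per host. A sketch using mitmproxy's standard request and done event hooks:

from collections import Counter
from mitmproxy import http

seen = Counter()

def request(flow: http.HTTPFlow) -> None:
    # count every request by host to see where traffic actually goes
    seen[flow.request.pretty_host] += 1

def done():
    # print a traffic summary when the proxy shuts down
    for host, count in seen.most_common(20):
        print(f"{count:5d}  {host}")

Run a few representative pages through the proxy, then review the summary to decide which hosts and resource types to block.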
Pros and Cons of Blocking Resources
Blocking unnecessary resources has significant benefits but also some downsides to keep in mind:
Pros
- Faster page loads – Speeds up scraping significantly
- Lower bandwidth – Reduces data usage by 2-10x
- Avoid tracking – Blocks analytics scripts and tracking pixels
- Cleaner DOM – Removes unnecessary clutter from DOM
- No changes to scraping code required – Just configure blocking rules once
Cons
- Configuration required – Need to configure blocking rules for each site
- Can break pages – Overblocking can cause pages to break
- Extra dependency – Need to install and run the proxy alongside scraper
- Debugging harder – No access to blocked requests for debugging
- Extra server resource – Proxy needs extra CPU/memory resources
So, in summary, resource blocking can massively improve scraping performance if configured properly for each site. It requires some extra setup but gives benefits worth the effort in most cases.
Tools for Blocking Resources
There are a few different tools you can use for request blocking:
- mitmproxy – Python-based intercepting proxy with custom scripting.
- Fiddler – Popular Windows web debugging proxy with blocking support.
- Browser Mob Proxy – Open-source proxy with Selenium integration. Allows scripting in JavaScript.
- PhantomJS – Headless browser with API for request blocking. No longer maintained.
mitmproxy is recommended for its flexibility, Python scripting, and seamless integration with Selenium. Alternatives like Fiddler work too, but may require more setup. The proxy itself doesn't matter as long as it allows blocking requests programmatically.
Blocking Resources on Other Web Scraping Tools
Request blocking isn't limited only to Selenium – it can help speed up and optimize any web scraping tool or library. Here's how to use a proxy for blocking with some other popular web scraping tools:
Scrapy
Add a downloader middleware to route all requests through the proxy:
class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://localhost:8080"

Then enable it in settings.py (adjust the module path to match your project):

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 500,
}
Then set a DOWNLOAD_DELAY of ~0.5 seconds to avoid overwhelming the proxy.
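In Scrapy, this is just a project setting:

# settings.py
DOWNLOAD_DELAY = 0.5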
requests
Pass the proxy URL to requests.get():
proxy = "http://localhost:8080" requests.get(url, proxies={"http": proxy, "https": proxy})
Puppeteer
Pass the proxy server when launching Puppeteer:
const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`]
});
Playwright
Use a browser context with proxy enabled:
const context = await browser.newContext({
    proxy: { server: proxyUrl }
});
So blocking can be added to any scraping stack for big performance improvements.
Final Thoughts
Blocking unnecessary resources is a simple but highly effective optimization for Selenium scrapers. When properly implemented, this approach leads to Selenium scrapers that are not only faster but also more efficient in bandwidth usage and more reliable when operating at scale.
However, it's crucial to rigorously test your target pages after implementing resource blocking so you don't inadvertently break functionality your scraper depends on. Done carefully, resource blocking can significantly elevate the efficiency of your Selenium web scraping projects.