Playwright is a powerful web automation library for Python that allows you to control Chromium, Firefox, and WebKit browsers with a single API. One of its most useful features is the ability to intercept and block requests, which can significantly improve web scraping performance.
In this comprehensive guide, we'll cover how to use Playwright's request interception API to block resources like images, stylesheets, scripts, fonts, and more. We'll also look at common techniques for blocking third-party resources like trackers and ads.
Why Block Requests?
When scraping pages, we're primarily interested in the HTML document itself. All the extra requests for images, ads, analytics, etc. just slow down the scraping process while providing little value.
By blocking unnecessary resources, we can speed up page loads 2-10x in some cases! This is because the browser makes fewer requests overall, reducing bandwidth usage and load on the target site.
Some key benefits of blocking requests during web scraping:
- Faster page loads – fewer requests means less time waiting
- Lower bandwidth usage – important when scraping large sites
- Avoid anti-scraping defenses – blocks common bot detection scripts
- Cleaner HTML – less noise from ads/trackers cluttering the DOM
Using Playwright's Request Interception
Playwright allows intercepting requests via the route() method on the page instance. This registers a callback that gets invoked for every request made by the page. Inside the callback, we can analyze the request and decide whether to abort() it or continue_() loading normally.
Here is a simple example that blocks all image requests:
from playwright.sync_api import sync_playwright

def block_images(route):
    if route.request.resource_type == "image":
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.firefox.launch()
    page = browser.new_page()
    page.route("**/*", block_images)  # match all requests
    page.goto("https://example.com")
When we run this, any route where resource_type is "image" will call route.abort(), blocking the request. All other resources will load normally.
This is just a simple example, but shows the basic usage pattern:
- Register a route handler with page.route()
- Inspect the incoming request via the route object
- Call route.abort() to block, or route.continue_() to allow
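For simple cases you don't even need a handler function: page.route() also accepts a URL glob pattern, so matching requests can be aborted directly with a lambda. A small example (the extension list is just an illustration):

# block common image formats by URL pattern instead of resource type
page.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda route: route.abort())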
Next let's look at various techniques for actually blocking requests efficiently during scrapes.
Blocking by Resource Type
The most straightforward way to block requests is checking route.request.resource_type, which tells us the type of resource. Common resource types we typically want to block include:
- images – image
- videos – media
- fonts – font
- stylesheets – stylesheet
- scripts – script
- XHR/fetch requests – xhr, fetch
- websockets – websocket
For example, to block images, media, and fonts:
BLOCK_TYPES = ["image", "media", "font"]

def block_resources(route):
    if route.request.resource_type in BLOCK_TYPES:
        route.abort()
    else:
        route.continue_()
Blocking images, videos, fonts, and other media can improve page load times significantly without affecting the HTML content.
Be a bit careful with stylesheets and scripts: blocking these can sometimes break page functionality and JavaScript rendering. I'd recommend blocking them only if you've verified the content you need still renders without them, or if specific anti-bot scripts are causing problems.
Blocking by URL Patterns
Another useful technique is to block requests based on matching the URL. This allows the blocking of specific third-party domains like ad networks, social widgets, etc. For example, we can block requests that contain certain keywords:
BLOCK_URLS = ["ad", "analytics", "tracking", "log"]

def block_resources(route):
    # note: short keywords like "ad" match aggressively (e.g. "upload"), so tune the list per site
    if any(keyword in route.request.url for keyword in BLOCK_URLS):
        route.abort()
    else:
        route.continue_()
This will match any URLs containing “ad”, “analytics”, “tracking”, etc. We can also block specific domain names:
from urllib.parse import urlparse

BLOCK_DOMAINS = ["doubleclick.net", "google-analytics.com"]

def block_resources(route):
    # route.request.url is a plain string, so parse out the hostname first
    hostname = urlparse(route.request.url).hostname or ""
    if any(hostname == domain or hostname.endswith("." + domain) for domain in BLOCK_DOMAINS):
        route.abort()
    else:
        route.continue_()
Some common third-party domains to block during web scraping:
- Google Analytics – google-analytics.com
- Google Tag Manager – googletagmanager.com
- Google Ads – doubleclick.net
- Facebook – connect.facebook.net
- Twitter – platform.twitter.com
- Ad networks – adzerk.net, adrta.com, etc.
This can help avoid anti-bot services and reduce clutter in the HTML from ad/tracking markup.
Blocking Based on Resource Size
For some scrapers, downloading large media files isn't necessary. Playwright doesn't expose a request's size up front, but we can fetch the response, check its Content-Length header, and drop anything over a threshold:

MAX_RESOURCE_SIZE = 200 * 1024  # 200 KB

def block_large_resources(route):
    # fetch the response so we can inspect its Content-Length header
    response = route.fetch()
    size = int(response.headers.get("content-length", 0))
    if size > MAX_RESOURCE_SIZE:
        route.abort()
    else:
        route.fulfill(response=response)

This keeps very large audio/video files, PDFs, etc. out of the page. Note that route.fetch() still downloads the response, so the saving is mainly in rendering and memory rather than raw bandwidth.
Blocking Browser Pre-Fetching
Browsers do some clever "pre-fetching" where they speculate on resources the user may need next (like <link rel="preload">). This leads to extra requests. One blunt way to cut these out, along with every other sub-resource, is to allow only top-level navigation requests:

def block_non_navigation(route):
    if route.request.is_navigation_request():
        # navigation requests load the actual page
        route.continue_()
    else:
        # everything else, including speculative pre-fetches, is aborted
        route.abort()

This is an aggressive rule, but it may slightly improve speeds by avoiding speculative resource loading.
Allow-Listing Essential Requests
Instead of trying to block every possible resource, we can take the opposite approach: only allow the bare minimum requests needed to render the page. For example:
# domains or URL substrings to allow; everything else is blocked
ALLOWED = ["example.com", ".css", ".js"]

def allowlist_resources(route):
    if any(pattern in route.request.url for pattern in ALLOWED):
        route.continue_()
    else:
        route.abort()
This keeps requests limited to the site itself plus minimal CSS/JS. The downside is more maintenance as the site's resources change over time.
Blocking Requests in Headless Mode
All the examples so far assume running Playwright in headless mode. If you want to see the results of blocking requests visually, launch the browser in headed mode:

browser = p.chromium.launch(headless=False)
Now we can see the browser loading pages normally, but with resources blocked as desired. This helps debug scraping issues caused by excessive blocking. I'd recommend testing rules in a real browser first before running headless.
Measuring Bandwidth Savings
When launching the browser, we can pass devtools=True to enable the DevTools panel, which gives useful metrics:

browser = p.chromium.launch(devtools=True)  # devtools is a Chromium-only launch option
Now in the Network tab, we can see total requests, data transferred, load timing, etc. Compare with and without request blocking to quantify the exact bandwidth reduction.
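If you'd rather measure from code than read the DevTools panel, one rough approach (a sketch that sums Content-Length headers, which some responses omit, so treat the total as an estimate) is to attach a response listener:

from playwright.sync_api import sync_playwright

total_bytes = 0

def track_size(response):
    global total_bytes
    # responses without a Content-Length header count as zero
    total_bytes += int(response.headers.get("content-length", 0))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", track_size)
    page.goto("https://example.com")
    print(f"Transferred roughly {total_bytes / 1024:.1f} KB")
    browser.close()

Run it once with your blocking rules registered and once without to compare the totals.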
Avoid Breaking Page Functionality
Blocking requests can sometimes break pages by removing key resources they depend on. Here are some tips to avoid issues:
- Only block media/images initially, not CSS/JS
- Test blocked and allowed rules extensively
- Analyze the DevTools Network panel for missing resources
- Run the browser in headed mode (headless=False) to debug scraping issues
Start by only blocking safe resources like images, videos, and fonts. Slowly add stricter blocking, testing frequently to avoid going too far.
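One simple debugging aid, sketched below reusing the BLOCK_TYPES list from earlier, is to log every request you abort so a missing resource can be traced back to the rule that blocked it:

def block_and_log(route):
    request = route.request
    if request.resource_type in BLOCK_TYPES:
        # print what we block so breakage is easy to trace to a rule
        print(f"blocked [{request.resource_type}] {request.url}")
        route.abort()
    else:
        route.continue_()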
Automating Blocking Rules at Scale
Once you have optimized blocking rules for a site, you'll want to apply them automatically across all pages. Here is a simple Playwright scraper template with automated blocking:
from playwright.sync_api import sync_playwright

# route handlers to register on every page (e.g. the block_images-style
# handlers defined in the sections above)
BLOCK_RULES = [block_images, block_media, block_fonts, block_ads]

def scrape_page(page):
    # extract page content here...
    print(page.content())

with sync_playwright() as p:
    browser = p.chromium.launch()
    for page_url in URLS:  # URLS is your list of page URLs to scrape
        page = browser.new_page()
        for rule in BLOCK_RULES:
            page.route("**/*", rule)
        page.goto(page_url)
        scrape_page(page)
        page.close()
This will apply our blocking logic to every new page while iterating through a list of URLs.
Some tips for scaling:
- Run Playwright in headless mode once tested
- Use asynchronous Playwright if speed is critical (see the sketch after this list)
- Rotate IPs to avoid traffic limits
- Schedule scrapes during off-peak hours
- Deploy scraper to distributed nodes
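For the asynchronous option, here is a minimal sketch (the URLs are placeholders and the resource-type handler mirrors the sync examples above):

import asyncio
from playwright.async_api import async_playwright

BLOCK_TYPES = {"image", "media", "font"}

async def block_resources(route):
    if route.request.resource_type in BLOCK_TYPES:
        await route.abort()
    else:
        await route.continue_()

async def scrape(browser, url):
    page = await browser.new_page()
    await page.route("**/*", block_resources)
    await page.goto(url)
    html = await page.content()
    await page.close()
    return html

async def main():
    urls = ["https://example.com", "https://example.org"]  # placeholder URLs
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # scrape several pages concurrently instead of one at a time
        results = await asyncio.gather(*(scrape(browser, u) for u in urls))
        await browser.close()
        print([len(html) for html in results])

asyncio.run(main())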
With the right architecture, you can leverage Playwright to scrape even the largest sites while optimizing bandwidth.
Caveats of Request Blocking
While request blocking provides large speed boosts, be aware it can also cause the following issues:
- May break page functionality if overdone
- Blocked resources may be re-requested via JS
- Can trigger bot/DDoS protections if abused
- Ethics concerns around ad blocking
Make sure to throttle requests, randomize delays, and rotate IPs/proxies to scrape responsibly.
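As a minimal sketch of throttling, reusing the browser, URLS, BLOCK_RULES, and scrape_page() from the template above (the delay bounds are arbitrary and should be tuned per site):

import random
import time

MIN_DELAY, MAX_DELAY = 2.0, 6.0  # seconds; example bounds only

for page_url in URLS:
    page = browser.new_page()
    for rule in BLOCK_RULES:
        page.route("**/*", rule)
    page.goto(page_url)
    scrape_page(page)
    page.close()
    # randomized pause between pages to throttle the request rate
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))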
Conclusion
Blocking unnecessary requests is one of the most effective ways to improve web scraping performance with Playwright. By tuning which resources you block, you speed up the scraping process and cut bandwidth costs.
Playwright's request interception API gives you precise control over which resources load. Combine it with the responsible blocking practices above to capture the performance gains without breaking the pages you scrape.