How to Block Resources in Playwright?

Playwright is a powerful web automation library for Python that allows you to control Chromium, Firefox, and WebKit browsers with a single API. One of its most useful features is the ability to intercept and block requests, which can significantly improve web scraping performance.

In this comprehensive guide, we'll cover how to use Playwright's request interception API to block resources like images, stylesheets, scripts, fonts, and more. We'll also look at common techniques for blocking third-party resources like trackers and ads.

Why Block Requests?

When scraping pages, we're primarily interested in the HTML of the main document. All the extra requests for images, ads, analytics, and so on just slow down the scraping process while providing little value.

By blocking unnecessary resources, we can speed up page loads 2-10x in some cases! This is because the browser makes fewer requests overall, reducing bandwidth usage and load on the target site.

Some key benefits of blocking requests during web scraping:

  • Faster page loads – fewer requests means less time waiting
  • Lower bandwidth usage – important when scraping large sites
  • Avoid anti-scraping defenses – blocks common bot detection scripts
  • Cleaner HTML – less noise from ads/trackers cluttering the DOM

Using Playwright's Request Interception

Playwright allows intercepting requests via the route() method on the page instance. This registers a callback that gets invoked for every request made by the page. Inside the callback, we can analyze the request and decide whether to abort() it or continue_() loading normally (the trailing underscore is there because continue is a reserved word in Python).

Here is a simple example that blocks all image requests:

from playwright.sync_api import sync_playwright

def block_images(route):
  if route.request.resource_type == "image":
    route.abort()
  else:
    route.continue_()

with sync_playwright() as p:
  browser = p.firefox.launch()
  page = browser.new_page()

  page.route("**/*", block_images) # match all requests

  page.goto("https://example.com")
  print(page.content()) # the page HTML, minus blocked images
  browser.close()

When we run this, any route where resource_type is "image" will call route.abort(), blocking the request. All other resources will load normally.

This is just a simple example, but it shows the basic usage pattern:

  1. Register a route handler with page.route()
  2. Inspect incoming request via the route object
  3. Call route.abort() to block, or route.continue_() to allow
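
For quick one-off rules, the handler can also be an inline lambda, since page.route() accepts any callable:

page.route("**/*", lambda route: route.abort()
           if route.request.resource_type == "image"
           else route.continue_())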

Next let's look at various techniques for actually blocking requests efficiently during scrapes.

Blocking by Resource Type

The most straightforward way to block requests is checking route.request.resource_type which tells us the type of resource. Common resource types we typically want to block include:

  • images – image
  • videos – media
  • fonts – font
  • stylesheets – stylesheet
  • scripts – script
  • XHR/fetch requests – xhr and fetch (two separate types)
  • websockets – websocket

For example, to block images, media, and fonts:

BLOCK_TYPES = ["image", "media", "font"]

def block_resources(route):
  if route.request.resource_type in BLOCK_TYPES:
    route.abort()
  else: 
    route.continue_()
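
The handler still needs to be registered on the page, exactly as in the first example:

page.route("**/*", block_resources)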

Blocking images, videos, fonts, and other media can improve page load times significantly without affecting the HTML content.

Be a bit careful with stylesheets and scripts – blocking these can sometimes break page functionality and JavaScript rendering. I'd recommend blocking them only when you specifically need to, for example to stop known anti-bot scripts from loading.

Blocking by URL Patterns

Another useful technique is to block requests based on matching the URL. This lets us block specific third-party domains like ad networks, social widgets, etc. For example, we can block requests whose URLs contain certain keywords:

BLOCK_URLS = ["ad", "analytics", "tracking", "log"]

def block_resources(route):
  if any(keyword in route.request.url for keyword in BLOCK_URLS):
    route.abort()
  else:
    route.continue_()

This will match any URLs containing “ad”, “analytics”, “tracking”, etc. We can also block specific domain names:

from urllib.parse import urlparse

BLOCK_DOMAINS = ["doubleclick.net", "google-analytics.com"]

def block_resources(route):
  # request.url is a plain string, so parse out the hostname first
  hostname = urlparse(route.request.url).hostname or ""
  if any(hostname == d or hostname.endswith("." + d) for d in BLOCK_DOMAINS):
    route.abort()
  else:
    route.continue_()

Some common third-party domains to block during web scraping:

  • Google Analytics – google-analytics.com
  • Google Tag Manager – googletagmanager.com
  • Google Ads – doubleclick.net
  • Facebook – connect.facebook.net
  • Twitter – platform.twitter.com
  • Ad networks – adzerk.net, adrta.com, etc.

This can help avoid anti-bot services and reduce clutter in the HTML from ad/tracking markup.
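
In practice it's common to combine both checks – resource types and tracker domains – in a single handler. A minimal sketch (the handler name block_noise is just illustrative):

from urllib.parse import urlparse

BLOCK_TYPES = ["image", "media", "font"]
BLOCK_DOMAINS = ["google-analytics.com", "googletagmanager.com", "doubleclick.net"]

def block_noise(route):
  request = route.request
  hostname = urlparse(request.url).hostname or ""
  if request.resource_type in BLOCK_TYPES:
    route.abort()
  elif any(hostname == d or hostname.endswith("." + d) for d in BLOCK_DOMAINS):
    route.abort()
  else:
    route.continue_()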

Blocking Based on Resource Size

For some scrapers, downloading large media files isn't necessary. Playwright doesn't expose a resource's size before the request runs, so one approach is to perform the request ourselves with route.fetch(), inspect the Content-Length header, and only hand small responses back to the page:

MAX_RESOURCE_SIZE = 200 * 1024 # 200 KB

def block_large_resources(route):
  response = route.fetch() # perform the request ourselves
  size = int(response.headers.get("content-length", 0))
  if size > MAX_RESOURCE_SIZE:
    route.abort()
  else:
    route.fulfill(response=response) # hand the response to the page

Since route.fetch() still downloads the body, this keeps very large audio/video files, PDFs, etc. out of the page rather than saving bandwidth outright.

Blocking Browser Pre-Fetching

Browsers do some clever “pre-fetching” where they speculate on resources the user may need next (like <link rel="preload">). This leads to extra requests. There's no resource type that singles prefetches out, but one blunt approach is to allow only navigation requests (the ones that load the actual page document) and abort everything else:

def block_non_navigation(route):
  if route.request.is_navigation_request():
    # navigation requests load the actual page document
    route.continue_()
  else:
    # abort everything else, including speculative prefetches
    route.abort()

This avoids speculative resource loading, but note it is very aggressive – it blocks every ordinary sub-resource too, so it only suits scrapes where the raw HTML is all you need.

Allow-Listing Essential Requests

Instead of trying to block every possible resource, we can take the opposite approach: only allow the bare minimum requests needed to render the page. For example:

from fnmatch import fnmatch

# list of domains or URL patterns to allow
ALLOWED = ["*example.com*", "*.css", "*.js"]

def allowlist_resources(route):
  # fnmatch treats * as a wildcard, so patterns match full URLs
  if any(fnmatch(route.request.url, pattern) for pattern in ALLOWED):
    route.continue_()
  else:
    route.abort()

This keeps the scrape predictable by ensuring we only load resources from the site itself plus minimal CSS/JS. The downside is more maintenance work as the site's resources change over time.

Blocking Requests in Headless Mode

All the examples so far assume running Playwright in headless mode. If you want to visually see the results of blocking requests, we can launch the browser headed:

browser = p.chromium.launch(headless=False)

Now we can see the browser loading pages normally, but with resources blocked as desired. This helps debug scraping issues caused by excessive blocking. I'd recommend testing rules in a headed browser first before running headless.
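
For example, launching headed with the slow_mo option (which delays each operation by the given number of milliseconds) makes it easy to watch what loads and what doesn't:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  # headed browser, slowed down so each request is easy to observe
  browser = p.chromium.launch(headless=False, slow_mo=200)
  page = browser.new_page()
  page.route("**/*", lambda route: route.abort()
             if route.request.resource_type == "image"
             else route.continue_())
  page.goto("https://example.com")
  page.wait_for_timeout(5000) # keep the window open briefly
  browser.close()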

Measuring Bandwidth Savings

When launching Chromium, we can pass devtools=True to open the DevTools panel, which gives useful metrics:

browser = p.chromium.launch(devtools=True)

Now in the Network tab, we can see total requests, data transferred, load timing, etc. Compare with and without request blocking to quantify the exact bandwidth reduction.
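
If you prefer to measure programmatically, a rough approach is summing the Content-Length of every response the page receives. A minimal sketch (responses without the header, e.g. chunked transfers, are counted as zero):

from playwright.sync_api import sync_playwright

total_bytes = 0

def track_size(response):
  global total_bytes
  total_bytes += int(response.headers.get("content-length", 0))

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()
  page.on("response", track_size)
  page.goto("https://example.com")
  print(f"Transferred roughly {total_bytes / 1024:.1f} KB")
  browser.close()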

Avoid Breaking Page Functionality

Blocking requests can sometimes break pages by removing key resources they depend on. Here are some tips to avoid issues:

  • Only block media/images initially, not CSS/JS
  • Test blocked and allowed rules extensively
  • Analyze the DevTools Network panel for missing resources
  • Run the browser headed (headless=False) to debug scraping issues

Start by only blocking safe resources like images, videos, and fonts. Slowly add stricter blocking, testing frequently to avoid going too far.
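
One habit that helps: log every request you abort so that missing-resource problems can be traced back to a rule. A minimal sketch:

def block_and_log(route):
  if route.request.resource_type in ("image", "media", "font"):
    print("blocked:", route.request.url)
    route.abort()
  else:
    route.continue_()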

Automating Blocked Rules at Scale

Once you have optimized blocking rules for a site, you'll want to apply them automatically across all pages. One thing to know: when several handlers are registered for the same pattern, Playwright invokes only the most recently registered one unless it calls route.fallback(), so each rule below falls back rather than continuing. Here is a simple Playwright scraper template with automated blocking:

from playwright.sync_api import sync_playwright

URLS = ["https://example.com"] # pages to scrape

def block_images(route):
  if route.request.resource_type == "image":
    route.abort()
  else:
    # fallback() hands the request to the next registered handler;
    # continue_() here would skip the remaining rules entirely
    route.fallback()

def block_media_and_fonts(route):
  if route.request.resource_type in ("media", "font"):
    route.abort()
  else:
    route.fallback()

# list of rules to block requests
BLOCK_RULES = [
  block_images,
  block_media_and_fonts,
]

def scrape_page(page):
  # extract page content here...
  print(page.content())

with sync_playwright() as p:
  browser = p.chromium.launch()

  for page_url in URLS:
    page = browser.new_page()

    for rule in BLOCK_RULES:
      page.route("**/*", rule)

    page.goto(page_url)
    scrape_page(page)
    page.close()

  browser.close()

This will apply our blocking logic to every new page while iterating through a list of URLs.

Some tips for scaling:

  • Run Playwright in headless mode once tested
  • Use asynchronous Playwright if speed is critical (see the sketch below)
  • Rotate IPs to avoid traffic limits
  • Schedule scrapes during off-peak hours
  • Deploy the scraper to distributed nodes
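
As a sketch of the async approach (the URLs are placeholders), several pages can be scraped concurrently with asyncio.gather:

import asyncio
from playwright.async_api import async_playwright

URLS = ["https://example.com", "https://example.org"] # placeholder URLs

async def block_heavy(route):
  if route.request.resource_type in ("image", "media", "font"):
    await route.abort()
  else:
    await route.continue_()

async def scrape(browser, url):
  page = await browser.new_page()
  await page.route("**/*", block_heavy)
  await page.goto(url)
  print(await page.content())
  await page.close()

async def main():
  async with async_playwright() as p:
    browser = await p.chromium.launch()
    # scrape all URLs concurrently on one browser instance
    await asyncio.gather(*(scrape(browser, url) for url in URLS))
    await browser.close()

asyncio.run(main())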

With the right architecture, you can leverage Playwright to scrape even the largest sites while optimizing bandwidth.

Caveats of Request Blocking

While request blocking provides large speed boosts, be aware it can also cause the following issues:

  • May break page functionality if overdone
  • Blocked resources may be re-requested via JS
  • Can trigger bot/DDoS protections if abused
  • Ethical concerns around ad blocking

Make sure to throttle requests, randomize delays, and rotate IPs/proxies to scrape responsibly.
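
For the delay part, even a simple randomized sleep between page visits goes a long way (the helper name is just illustrative):

import random
import time

def polite_pause(min_s=1.0, max_s=3.0):
  # sleep a random interval between page visits to avoid hammering the site
  time.sleep(random.uniform(min_s, max_s))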

Conclusion

Blocking unnecessary requests effectively is one of the best ways to enhance web scraping performance with Playwright. By strategically tuning the resources you block, you not only accelerate the scraping process but also cut down on bandwidth expenses.

Playwright offers robust and flexible APIs for intercepting web requests, granting you precise control over the loading of resources. Adhering to best practices for responsible request blocking is key to optimizing efficiency and maximizing performance gains.

John Rooney

John Watson Rooney is a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
