Web Scraping Background Requests with Headless Browsers and Python

Background requests power much of the modern web and present a powerful opportunity for web scraping. In this comprehensive guide, we'll cover how to leverage headless browser tools like Playwright, Selenium, and Puppeteer to scrape data from background requests using Python.

Introduction to Web Scraping Techniques

There are a few primary approaches to web scraping:

Parsing HTML

This involves using an HTTP client to download a page's HTML and then extracting the desired data using parsers like BeautifulSoup. Pros are it's fast and simple. Cons are it can break easily if the site changes and is limited to data in the HTML source.
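
For example, a minimal sketch using requests and BeautifulSoup might look like this (the CSS selector is illustrative, not taken from a specific page):

import requests
from bs4 import BeautifulSoup

# download the page HTML with a plain HTTP client
response = requests.get("https://web-scraping.dev/product/1")
soup = BeautifulSoup(response.text, "html.parser")

# extract data with a CSS selector (illustrative)
title = soup.select_one("h3.product-title")
print(title.text if title else "title element not found")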

API Scraping

For sites with open APIs, you can reverse engineer and call them directly to scrape structured data. Pros are it's robust and fast. Cons are APIs can be difficult to find and scrape at scale.
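
As a sketch, if you discover a JSON endpoint behind a page (the URL and parameters below are hypothetical), you can call it directly with an HTTP client:

import requests

# hypothetical JSON endpoint found via the browser's network tab
api_url = "https://example.com/api/reviews"
params = {"product_id": 1, "page": 1}

response = requests.get(api_url, params=params)
response.raise_for_status()
data = response.json()  # structured data, no HTML parsing needed
print(data)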

Capturing Background Requests

This technique loads pages in a real browser like Chrome and captures AJAX/XHR requests made in the background. Pros are it's robust and can access hidden data. Cons are it requires more complex setup and can have performance limitations.

Each approach has tradeoffs but scraping background requests strikes a nice balance of speed, power, and reliability.

Headless Browsers for Scraping

Headless browsers provide an automated way to scrape web pages using real browser engines like Chromium and Firefox. Here are some popular options:

Playwright

  • Created by Microsoft as an open source alternative to Puppeteer.
  • Supports Python, JavaScript, C#, and Java.
  • Very fast and reliable browser control.
  • Strong community and documentation.

Selenium

  • The most established browser automation tool.
  • Supports many languages like Python, Java, C#, etc.
  • Very extensible via a wide range of drivers.
  • Large user base and support available.

Puppeteer

  • Created by Google specifically for Chrome.
  • JavaScript API with good TypeScript support.
  • Fast and lightweight browser automation.
  • Largest ecosystem of related tools and libraries.

All three provide capabilities to intercept background requests, which we'll leverage for web scraping. They also allow executing actions like clicks, scrolls, and form fills to trigger XHRs.
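
For instance, filling a form is often enough to trigger a background search request. Here's a hedged Playwright sketch (the page URL and selectors are illustrative, not confirmed):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()

  # log every background request triggered by our actions
  page.on("request", lambda req: print(req.method, req.url))

  page.goto("https://web-scraping.dev/")  # illustrative URL

  # fill a search box and submit to fire an XHR (selectors are illustrative)
  page.fill("input[name=q]", "chocolate")
  page.keyboard.press("Enter")
  page.wait_for_timeout(1000)

  browser.close()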

Benefits of Scraping Background Requests

Here are some of the advantages of using headless browsers and capturing background requests for web scraping:

Access Hidden Data

Often the HTML of a page will only contain a portion of the available data. Background requests can reveal additional data that's loaded asynchronously. This opens up new scraping possibilities.

Robust to Site Changes

Unlike parsing fixed HTML, background requests return structured data that often remains stable even when parts of the site's markup change. The scrapers follow user flows instead of fixed parsing rules.

Mimic User Actions

Headless browsers allow simulating user actions like clicks, scrolls, and more to trigger additional requests. This makes scraping reactive sites with endless scrolling, popups, etc. much more reliable.

How to Capture Background Requests

Here is the high-level approach to intercepting background requests with a headless browser:

1. Initiate the Browser

Launch a browser instance like Chromium using Playwright, Selenium, or Puppeteer and navigate to the target page.

2. Enable Request Interception

Call the browser's request interception API to set up a callback handler.

3. Load Target Page

Navigate the browser to the target URL to scrape. Wait for page load if required.

4. Execute Actions

Perform user actions like clicks, scrolls, form fills, etc. to trigger requests.

5. Parse Requests

In the request handler callback, inspect responses and extract the data.

Now let's see how this works with Python examples.

Scraping Example Projects

To demonstrate scraping background requests, we'll walk through two common examples using both Playwright and Selenium with Python. The examples use web-scraping.dev, a site designed specifically for demonstrating and testing scrapers.

Button Click to Load More Reviews

For this example, we'll scrape a product page that uses “Load More” button clicks to fetch additional reviews via background requests. Here's how it looks:

# Playwright Example

from playwright.sync_api import sync_playwright

url = "https://web-scraping.dev/product/1"

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()

  # collect background responses as they arrive
  captured = []
  page.on("response", lambda r: captured.append(r))

  page.goto(url)

  # click "load more" until it stops triggering new requests
  while True:
    prev_count = len(captured)
    page.click("[id^=load-more-reviews]")
    page.wait_for_timeout(1000)
    if len(captured) == prev_count:
      break

  browser.close()

# Selenium Example

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(options=options)

url = "https://web-scraping.dev/product/1"
driver.get(url)

# click "load more" until no new background requests are recorded
while True:
  prev_requests = len(driver.execute_script("return performance.getEntriesByType('resource')"))

  driver.find_element(By.ID, "load-more-reviews").click()
  time.sleep(1)  # give the triggered request time to complete

  new_requests = len(driver.execute_script("return performance.getEntriesByType('resource')"))
  if prev_requests == new_requests:
    break

driver.quit()

This handles clicking the button until no new requests are captured, scraping all the review data.
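
With the responses collected (the captured list in the Playwright version above), you can filter for the review API calls and parse their JSON bodies. A minimal sketch, assuming the reviews arrive as JSON from an endpoint whose URL contains "reviews":

# run inside the `with sync_playwright()` block, before browser.close()
reviews = []
for response in captured:
  if "reviews" in response.url and "json" in response.headers.get("content-type", ""):
    data = response.json()
    # the exact payload shape is an assumption; adapt it to the real response
    reviews.extend(data if isinstance(data, list) else data.get("results", []))

print(f"captured {len(reviews)} reviews")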

Scrolling to Trigger Pagination

Another common pattern is endless scroll pagination, where sites load content as you scroll down. Here's an example script to scrape paginated testimonials using scrolling:

# Playwright

from playwright.sync_api import sync_playwright

url = "https://web-scraping.dev/testimonials"

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()

  # collect background responses triggered by scrolling
  captured = []
  page.on("response", lambda r: captured.append(r))

  page.goto(url)

  prev_height = -1

  # scroll to the bottom until the page height stops growing
  while True:
    curr_height = page.evaluate('document.body.scrollHeight')
    page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
    page.wait_for_timeout(1000)

    if curr_height == prev_height:
      break

    prev_height = curr_height

  browser.close()


# Selenium

import time

from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

url = "https://web-scraping.dev/testimonials"
driver.get(url)

prev_height = -1

# scroll to the bottom until the page height stops growing
while True:
  curr_height = driver.execute_script("return document.body.scrollHeight")
  driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
  time.sleep(1)  # allow newly triggered content to load

  if curr_height == prev_height:
    break

  prev_height = curr_height

driver.quit()

This handles scrolling until no new content is loaded, capturing all background requests along the way.
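
To turn the captured responses into data, filter for the pagination requests and parse their bodies. A hedged sketch using the captured list from the Playwright version, assuming the paginated responses are HTML fragments served from URLs containing "testimonials":

from bs4 import BeautifulSoup

# run inside the `with sync_playwright()` block, before browser.close()
testimonials = []
for response in captured:
  if "testimonials" in response.url and response.request.resource_type == "xhr":
    soup = BeautifulSoup(response.text(), "html.parser")
    # the .testimonial selector is an assumption
    testimonials += [el.get_text(strip=True) for el in soup.select(".testimonial")]

print(f"collected {len(testimonials)} testimonials")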

Common Challenges

There are a couple of common challenges you may encounter when scraping background requests:

Waiting for Page Load

Browsers load content asynchronously, so you need to properly wait for elements or requests to finish loading before interacting with the page. Each browser tool provides robust wait/timeout APIs to handle this.
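
In Selenium, for example, an explicit wait is usually preferable to a fixed sleep. A minimal sketch (the selector is illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://web-scraping.dev/product/1")

# wait up to 10 seconds for the reviews container to appear (selector is illustrative)
WebDriverWait(driver, 10).until(
  EC.presence_of_element_located((By.CSS_SELECTOR, "#reviews"))
)

driver.quit()

Playwright offers similar helpers such as page.wait_for_selector() and page.wait_for_load_state().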

Scrolling Correctly

To trigger scroll-based pagination, you need to replicate the browser's native scrolling closely. Just using window.scrollTo() often isn't sufficient to simulate a user's scroll behavior. Libraries like Autoscroll.js can help here, providing a more realistic scroll motion.
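
One alternative is to scroll in smaller increments so scroll events fire closer to how they would for a real user. A minimal Playwright sketch using mouse wheel steps (the step size and count are arbitrary):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()
  page.goto("https://web-scraping.dev/testimonials")

  # scroll down in small wheel increments instead of jumping to the bottom
  for _ in range(20):
    page.mouse.wheel(0, 500)     # scroll 500 pixels down
    page.wait_for_timeout(200)   # let content load between steps

  browser.close()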

Best Practices

Here are some tips for smoothly scraping background requests at scale:

  • Block unnecessary requests – Improves speed and reduces bandwidth costs. Browser tools allow blocking CSS, media, fonts, etc. (see the sketch after this list).
  • Use proxy rotation – Rotate IPs to avoid blocks and scrape more reliably. Integrations like Bright Data, Smartproxy, Proxy-Seller, and Soax work well.
  • Retry failed requests – Network errors are common. Add retry logic and exponential backoff to handle flakiness.
  • Export results incrementally – For large scrapes, save results as you go instead of keeping them all in memory.
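
As an example of the first tip, here's a hedged Playwright sketch that blocks images, fonts, stylesheets, and media via route interception:

from playwright.sync_api import sync_playwright

BLOCKED_TYPES = {"image", "font", "stylesheet", "media"}

def block_unwanted(route):
  # abort requests for resource types we don't need; let everything else through
  if route.request.resource_type in BLOCKED_TYPES:
    route.abort()
  else:
    route.continue_()

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()
  page.route("**/*", block_unwanted)  # intercept every request
  page.goto("https://web-scraping.dev/product/1")
  browser.close()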

Advanced Techniques

You can further optimize and scale background request scraping using:

  • Scraping SPAs – Use Puppeteer, Playwright, or Selenium to render client-side pages and capture their XHR calls.
  • Calling APIs directly – Reverse engineer and directly call REST or GraphQL APIs for higher throughput.
  • Performance tuning – Tweak browser instances for higher concurrency and throughput via flags and configs (see the sketch after this list).
  • Distributed scraping – Spread jobs across scrapers on multiple machines for faster crawling.
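
For the performance tuning point above, a hedged sketch that passes extra Chromium flags at launch (these particular flags are common choices, not requirements):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  # extra Chromium flags to trim resource usage (illustrative)
  browser = p.chromium.launch(
    headless=True,
    args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
  )
  page = browser.new_page()
  page.goto("https://web-scraping.dev/product/1")
  browser.close()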

Comparing Playwright, Selenium, and Puppeteer

While Playwright, Selenium, and Puppeteer all provide the ability to intercept requests, there are some key differences between the tools:

Speed and Rendering

Playwright and Puppeteer use Chromium directly, so they tend to have faster page loads and lower resource usage. Selenium supports many browsers, which can impact performance.

Languages and Platforms

Playwright supports Python, C#, Java, and JavaScript. Puppeteer is JavaScript only. Selenium has APIs for Python, Java, C#, Ruby, PHP, and more.

Extensibility

Selenium has a very rich ecosystem of browser drivers and extensions. Playwright and Puppeteer provide robust APIs that cover most use cases.

Documentation and Community

Due to its longevity, Selenium has more available content and support. But Playwright and Puppeteer have excellent official docs and growing communities.

FAQ

What is the difference between XHR and API?

XHR (XMLHttpRequest) is the front-end technique used to call APIs from the browser. The API (Application Programming Interface) is the server-side endpoint that returns the data.

What are the pros and cons of Playwright vs Selenium?

Playwright is faster while Selenium supports more languages. Playwright's API covers most common needs while Selenium provides more extensibility.

How do I parse JSON data from a request?

In Python you can use the json module to parse a captured response body:

import json

json_data = json.loads(response.text())  # Playwright responses also offer response.json()

What status codes should I retry requests for?

5XX errors indicate server issues, so those are good candidates for retries. 429 Too Many Requests responses may also benefit from retrying after delaying.
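
For example, a minimal retry helper with exponential backoff that retries the status codes above (the fetch is a plain requests call for illustration):

import time
import requests

RETRY_STATUSES = {429, 500, 502, 503, 504}

def get_with_retries(url, max_attempts=5):
  """Fetch a URL, retrying transient failures with exponential backoff."""
  for attempt in range(max_attempts):
    try:
      response = requests.get(url, timeout=10)
      if response.status_code not in RETRY_STATUSES:
        return response
    except requests.RequestException:
      pass  # network errors are retried too
    time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
  raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")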

How can I prevent getting blocked while scraping?

Using proxies and random user agents helps distribute the load, which looks less bot-like. Starting with low concurrency and slowly ramping up request volume can also help avoid spikes that trigger blocks.
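
A hedged Playwright sketch that routes traffic through a proxy and overrides the user agent (the proxy address and UA string are placeholders):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  # proxy server address is a placeholder
  browser = p.chromium.launch(proxy={"server": "http://proxy.example.com:8000"})

  # a browser context lets you set a custom user agent per session (placeholder UA)
  context = browser.new_context(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
  page = context.new_page()
  page.goto("https://web-scraping.dev/product/1")

  browser.close()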

Conclusion

Capturing network traffic from headless browsers is a powerful technique to build robust web scrapers in Python. Libraries like Playwright, Selenium, and Puppeteer include many tools to intercept requests and retrieve data loaded in the background.

While more complex than parsing HTML, this approach provides access to hidden page data and maintains resilience even if sites change over time. With the right patterns and tools, you can scrape complex user flows and extract data not otherwise available.

John Rooney

John Watson Rooney is a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
