Puppeteer is a popular Node.js library created by Google for controlling Chrome and Chromium, typically in headless mode. It provides a high-level API for automating browser actions such as navigation, user input, and JavaScript execution.
One very useful but lesser-known Puppeteer feature is its ability to intercept and analyze background requests and responses. These are the XHR, Fetch, and other requests that webpages make in the background to load additional resources such as images, JSON data, and ad scripts. Capturing and analyzing these requests can be invaluable when web scraping.
In this comprehensive guide you'll learn:
- What are background requests and why analyze them for web scraping?
- How to capture background requests and responses in Puppeteer?
- Tips for filtering and processing these requests.
- How to block requests to reduce bandwidth.
- Example use cases like capturing API calls.
What are Background Requests?
Modern web pages are rarely static. After the initial HTML is loaded, the browser makes additional requests in the background to fetch more resources needed by the page. For example, a news article may initially return minimal HTML, then make API requests to fetch the full article content, related articles, comments, etc. Similarly, SPAs make API calls to fetch JSON data.
Here are some common types of background requests:
- XHR/Fetch requests – To fetch JSON data from APIs.
- Image requests – For loading images, icons, etc.
- Script requests – To load JavaScript files.
- Style requests – To load CSS files.
- Font requests – For loading font files.
These requests happen automatically in the browser. The page's JavaScript listens for the responses and updates the DOM accordingly.
From a web scraping perspective, the response content from some of these requests may contain the actual data we want to extract. For example, the full article content, product data, API response, etc.
So having visibility into these requests and analyzing them is very valuable for web scraping.
Why Analyze Background Requests?
Here are some of the benefits of capturing and analyzing background requests when web scraping:
1. Understand How the Website Works
By logging and analyzing background requests, you can better understand how the website works:
- Which requests are fetching the actual data?
- Are there any APIs that can be called directly?
- Is the data loaded via multiple API endpoints?
- How are filters and pagination implemented behind the scenes?
This knowledge helps in developing more robust scrapers.
2. Scrape Data from APIs Directly
For some sites, scraping data directly from the APIs may be more efficient than parsing HTML. Analyzing background requests helps discover these APIs. APIs also provide data in a structured format which is easier to parse.
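Once a JSON endpoint has been discovered through request interception, its structured payload is usually much easier to work with than HTML. The sketch below uses a hypothetical payload shape (a `products` array with `title` and `price` fields) to illustrate the idea; the actual shape will depend on the site's API.

```js
// Hypothetical shape of a JSON payload returned by a discovered API endpoint.
// In practice you would obtain this by calling the endpoint directly, or by
// reading an intercepted response body via response.json().
const samplePayload = {
  products: [
    { id: 1, title: 'Widget', price: 9.99 },
    { id: 2, title: 'Gadget', price: 19.99 },
  ],
};

// Extract just the fields we care about from the structured API data –
// no HTML parsing required.
function extractProducts(payload) {
  return payload.products.map(({ title, price }) => ({ title, price }));
}

console.log(extractProducts(samplePayload));
```

Compared to scraping the rendered page, this approach is less fragile because API responses change far less often than page markup.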
3. Fetch Additional Data
The initial HTML often contains minimal data. The full data gets loaded in the background. Analyzing these requests allows scraping additional data from the responses.
For example, on a product page the HTML may contain basic info like the title, image, and pricing. However, the full description, reviews, recommendations, and other metadata may be loaded via background XHR requests.
4. Deal with SPAs
Single Page Apps dynamically fetch content as JSON via APIs. Analyzing network requests is crucial for scraping them.
5. Reduce Bandwidth Usage
Some background requests, such as ads, analytics scripts, and social widgets, are not relevant for scraping. Analyzing requests allows blocking these to reduce bandwidth usage.
How to Capture Background Requests in Puppeteer?
Now that we know why capturing background requests is useful, let's look at how to do it in Puppeteer using these steps:
Step 1. Enable Request Interception
To intercept requests, call `page.setRequestInterception(true)`:

```js
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);
```
This enables request interception mode in Puppeteer.
Step 2. Add Request and Response Event Handlers
Next, add `request` and `response` event handlers to log requests and responses:

```js
page.on('request', request => {
  console.log('Request ', request.url());
});

page.on('response', response => {
  console.log('Response ', response.url());
});
```
The `request` event is emitted before the request is sent. The `response` event fires after the response is received.
Step 3. Continue Intercepted Requests
In interception mode, requests won't go through until we explicitly call `request.continue()`:

```js
page.on('request', request => {
  // log request
  request.continue(); // continue the request
});
```
This allows selectively blocking requests which we'll cover later.
Step 4. Navigate and Interact with Pages
Now navigate to pages and interact to trigger background requests:
```js
// navigate to page
await page.goto('https://example.com');

// click buttons, scroll etc. to trigger requests
await page.click('.load-more-btn');
```
The page interactions will trigger various background requests which the handlers will capture.
Step 5. Filter and Process Requests
The `request` and `response` handlers receive a `Request` and a `Response` object respectively, containing details like the URL, method, headers, and POST data. We can filter and process these requests:

```js
page.on('response', response => {
  if (response.request().resourceType() === 'xhr') {
    // process XHR response
  } else if (response.request().resourceType() === 'fetch') {
    // process fetch response
  }
});
```
Some useful methods:
- `request.resourceType()` – filter by request type such as `xhr`, `fetch`, or `image`.
- `response.status()` – get the response status code.
- `response.statusText()`, `response.ok()` – check the response status.
- `response.headers()` – get the response headers.
- `response.json()` – parse a JSON response body.
- `response.text()` – get the response body as text.
- `response.buffer()` – get the raw response body as a `Buffer`.
This allows capturing and processing specific requests that contain the data we actually want.
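Putting this together, one common pattern is to collect the parsed JSON bodies of matching responses into an array for later processing. The `/api/` URL filter below is an assumption; substitute whatever path pattern your target site's endpoints use.

```js
// Decide whether a response is likely to carry the data we want.
// Both the resource types and the URL pattern here are assumptions to adapt.
function isDataResponse(resourceType, url) {
  return (
    (resourceType === 'xhr' || resourceType === 'fetch') &&
    url.includes('/api/')
  );
}

// Inside a Puppeteer script, the response handler might then look like:
// const captured = [];
// page.on('response', async response => {
//   const type = response.request().resourceType();
//   if (isDataResponse(type, response.url())) {
//     try {
//       captured.push(await response.json());
//     } catch (e) {
//       // body was not valid JSON; skip it
//     }
//   }
// });
```

Keeping the filtering logic in a standalone predicate like this also makes it easy to unit-test separately from the browser automation.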
Step 6. Persist Intercepted Requests
To save intercepted requests for analysis:
- Log requests/responses to a file or database.
- Or save responses to disk – `fs.writeFileSync('./response.json', await response.text())`
Example: Capturing API Calls
Let's see an example of intercepting and capturing API calls made by a sample SPA:
```js
// enable interception
await page.setRequestInterception(true);

page.on('request', request => {
  const url = request.url();

  // intercept API requests
  if (url.includes('/api/')) {
    console.log('Captured API request to ', url);
  }

  request.continue();
});

// navigate page
await page.goto('https://spa.example.com');

// click buttons etc. to trigger API calls
await page.click('.load-data');
```
This will log any requests made to `/api/` endpoints. The response bodies from these requests can then be parsed to extract relevant data.
Tips for Processing Background Requests
Here are some additional tips for handling background requests:
- Set the `Cache-Control` header to `no-cache` in request headers to disable caching and ensure you capture all requests.
- Redirect URLs to a mock server under your control to analyze responses.
- Override the browser's fetch handler to change request behavior.
- Run Puppeteer in headless mode when you only care about the network traffic; rendering a visible browser window adds overhead.
- Follow robots.txt directives and avoid making requests too frequently.
- Use proxies and rotation to minimize blocking risks when making large volumes of requests.
How to Block Requests?
Blocking specific requests can help reduce bandwidth usage and speed up page processing. To block a request, call `request.abort()` instead of `request.continue()` (a request that is neither continued nor aborted will simply hang):

```js
page.on('request', request => {
  if (request.resourceType() === 'image') {
    request.abort(); // block image requests
  } else {
    request.continue();
  }
});
```
Other ways to block requests:
- Block by type – scripts, styles, fonts, images, etc.
- Block by URL patterns.
- Block by domains like APIs, ads, etc.
- Block images to save bandwidth for text scraping.
But don't block all requests indiscriminately, as it may break pages.
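These rules can be combined into a single predicate that the request handler consults. The blocked resource types and domain fragments below are illustrative defaults, not a recommendation for every site; tune them per target.

```js
// Resource types and domain fragments to block; adjust per target site.
const BLOCKED_TYPES = new Set(['image', 'font', 'stylesheet']);
const BLOCKED_DOMAINS = ['doubleclick.net', 'google-analytics.com'];

function shouldBlock(resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  return BLOCKED_DOMAINS.some(domain => url.includes(domain));
}

// In a Puppeteer request handler this would be used as:
// page.on('request', request => {
//   if (shouldBlock(request.resourceType(), request.url())) request.abort();
//   else request.continue();
// });
```

Centralizing the rules in one function keeps the handler simple and makes the blocklist easy to adjust when a blocked resource turns out to be needed.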
Conclusion
Analyzing background requests helps you understand how modern web pages fetch and load data. Capturing network requests in Puppeteer provides invaluable insight for building robust web scrapers. The examples in this post should help you get started with intercepting requests using the Puppeteer API.
What other use cases have you worked on that involve analyzing network requests? I'd love to hear other examples in the comments!