Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium. It allows you to programmatically navigate pages, interact with elements, execute JavaScript, and more.
A common use case of Puppeteer is web scraping – where you want to extract data from web pages. To scrape the data, you often need the full HTML source of the rendered web page. In this guide, we'll see different ways to get the page source in Puppeteer.
Using page.content()
The easiest way to get the full page source in Puppeteer is the page.content() method. For example:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const html = await page.content(); // get full page HTML
  console.log(html);

  await browser.close();
})();
```
page.content() returns a Promise that resolves to the full HTML source of the page, including the doctype and the html, head, and body tags. This is the simplest way to get the page source in Puppeteer.
However, there are a couple of caveats:
- page.content() serializes the DOM as it exists at the moment you call it. If you navigate with the default settings, Puppeteer waits for the full load event first, which can be slow on pages with lots of media/images.
- The HTML returned may not match the server's original source if the page relies heavily on JavaScript to modify the DOM.
To mitigate the first issue, we can pass a waitUntil option – note that it belongs to page.goto(), not to page.content(), which takes no options:

```javascript
// Resolve navigation as soon as the DOM is parsed, without waiting
// for other resources like images and stylesheets
await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });
const html = await page.content();
```

This returns the HTML sooner, though images and stylesheets may not have loaded yet. For the second issue, we need the other approaches outlined in the rest of this guide.
Serializing the DOM
Instead of getting the entire source HTML, we can serialize the current state of the DOM using page.evaluate():
```javascript
const html = await page.evaluate(() => {
  return document.documentElement.outerHTML;
});
```
This executes a JS function within the page context, allowing us to retrieve the current DOM's outer HTML. The advantage over page.content() is that it reflects any changes JavaScript has made since the page loaded.
However, this approach also has some downsides to be aware of:
- It does not include the doctype, since outerHTML starts at the <html> tag
- Some runtime state, such as current input values or styles applied through the CSSOM, is not reflected in the serialized HTML
- Serializing large DOM trees can be slow
So page.content() is generally better if you need the full HTML document, while page.evaluate() is useful when working with pages that heavily manipulate the DOM.
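To work around the missing doctype, we can rebuild it from document.doctype before serializing. The serializeWithDoctype helper below is a hypothetical sketch, written as a plain function over a Document-like object so the logic is easy to follow; in practice it would run inside page.evaluate():

```javascript
// Hypothetical helper: outerHTML starts at <html>, so we re-create
// the doctype ourselves and prepend it to the serialized DOM.
function serializeWithDoctype(doc) {
  const doctype = doc.doctype ? `<!DOCTYPE ${doc.doctype.name}>` : '';
  return doctype + doc.documentElement.outerHTML;
}

// With Puppeteer, the same logic runs in the page context:
// const html = await page.evaluate(() => {
//   const doctype = document.doctype ? `<!DOCTYPE ${document.doctype.name}>` : '';
//   return doctype + document.documentElement.outerHTML;
// });
```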
Accessing Response Objects
As the page loads, Puppeteer fires a response event for every network response, each carrying a Response object:

```javascript
page.on('response', async response => {
  const url = response.url();
  const html = await response.text(); // text() returns a Promise
});
```
We can check the response URL to find the HTML response and get the full source. This approach captures the raw HTML served by the server, before any client-side modifications. However, a single page load fires many response events – one per resource, including XHR/fetch requests and iframes – so we need to filter for the main document response we want.
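One way to filter is to check the request's resource type, which Puppeteer reports as 'document' for top-level navigations. The isMainDocument helper below is a hypothetical sketch built on the real Response.url() and HTTPRequest.resourceType() methods:

```javascript
// Hypothetical helper that picks out the response carrying the main
// document HTML. resourceType() is 'document' for top-level navigations.
function isMainDocument(response, pageUrl) {
  return (
    response.request().resourceType() === 'document' &&
    response.url() === pageUrl
  );
}

// Usage with Puppeteer:
// page.on('response', async response => {
//   if (isMainDocument(response, 'https://example.com/')) {
//     const html = await response.text(); // raw HTML from the server
//   }
// });
```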
Accessing Frame Sources
For pages with iframes, we can get the HTML source of individual frames:
```javascript
// Assuming an iframe exists with name="frame1"
const frame = page.frames().find(f => f.name() === 'frame1');
const html = await frame.content();
```
The frame.content() method returns the full HTML source of any frame. This is useful for scraping pages that embed content in multiple cross-domain iframes.
Saving Files to Disk
Instead of accessing HTML programmatically, another approach is to save the files to disk directly:
```javascript
const fs = require('fs');

// Intercept requests so we can handle them ourselves
await page.setRequestInterception(true);

page.on('request', request => {
  const url = request.url();
  // Filter for the main HTML document
  if (request.resourceType() === 'document') {
    request.abort(); // abort the browser's request
    saveFileToDisk(url); // fetch and save it ourselves instead
  } else {
    request.continue(); // let other requests through
  }
});

async function saveFileToDisk(url) {
  const path = `./files/${getFileName(url)}`;
  // fetch is global in Node 18+; use node-fetch on older versions
  const data = await fetch(url).then(r => r.text());
  fs.writeFileSync(path, data);
}
```
By aborting the browser's request and fetching the URL ourselves, we can save the raw file straight from the network. Note that aborting the main document request means the page itself won't render in the browser.
This approach is a little more complex, but it captures the unmodified source HTML.
Scraping API Alternatives
While Puppeteer provides full control, sometimes using a dedicated scraping API service can be easier. Many cloud scraping tools like Bright Data, Apify, or ScraperAPI allow fetching pages and getting full HTML just through an API call.
For example, with Bright Data (the snippet below is illustrative only – consult the provider's documentation for the actual client API):

```javascript
// Illustrative only: the real client API differs
const brightdata = require('brightdata');
const html = await brightdata.fetch('https://example.com');
```
No need to deal with browsers or proxies. The APIs abstract away all that complexity with just a single method call. Scraping APIs are simpler to use, but Puppeteer provides more customization options if needed.
Handling Dynamic Content
For sites relying heavily on JavaScript to render content, the basic techniques above may not be sufficient. Some common cases where the HTML source won't match the final rendered page:
- Content loaded dynamically via AJAX/fetch requests after page load
- Pages rendered fully client-side using frameworks like React or Vue
- Extensive DOM manipulation via JavaScript
In these cases, we need to wait for the JavaScript to execute before getting the HTML source. Here are a couple of ways to deal with dynamic content:
Wait for Network Idle
Use the 'networkidle0' or 'networkidle2' waitUntil values to wait for network activity to settle before getting content:

```javascript
// 'networkidle0': no network connections for at least 500 ms
// ('networkidle2' tolerates up to 2 long-lived connections)
await page.goto('https://example.com', {
  waitUntil: 'networkidle0',
  timeout: 5000, // fail if navigation takes longer than 5 seconds
});

// HTML will reflect any additional fetches
const html = await page.content();
```
This allows time for any asynchronous fetches and DOM changes to occur before we retrieve the HTML.
Wait for Selector
If the page updates after a specific element appears, we can wait for that element first:
```javascript
// Wait for content to load
await page.waitForSelector('.content-loaded');

// Get HTML after the selector appears
const html = await page.content();
```
This ensures that dynamically loaded content has been added to the DOM before getting the source.
Wait for Navigation
For single-page apps that render routes client-side, we may need to click links and wait for 'virtual' navigation to finish:
```javascript
// Start waiting for navigation before clicking, to avoid a race
await Promise.all([
  page.waitForNavigation(), // resolves on route change
  page.click('.nav-users'), // navigate to the #users page
]);

// Get HTML for the client-rendered page
const html = await page.content();
```
Executing Scripts
As a last resort, we can inject JavaScript to trigger events and manipulate the DOM manually:
```javascript
// Click a button
await page.evaluate(() => {
  document.querySelector('button').click();
});

// Start listening for the XHR before triggering it, so we can't miss it
const responsePromise = page.waitForResponse('https://api.example.com/data');

// Trigger dynamic content load by scrolling to the bottom
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight);
});

// Wait for the XHR request to complete
await responsePromise;

// Get HTML after our custom steps
const html = await page.content();
```
Executing scripts this way provides full control to mimic user interactions and force content to load. The downside is increased fragility and complexity compared to just waiting for the network to idle.
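When none of the built-in waits fit, a small generic polling helper can wrap any condition, including one evaluated in the page. The waitFor function below is a hypothetical sketch, not part of Puppeteer's API:

```javascript
// Generic polling helper – repeatedly evaluates an async predicate
// until it returns a truthy value or the timeout elapses.
async function waitFor(predicate, { timeout = 5000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return;
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error(`waitFor: condition not met within ${timeout} ms`);
}

// With Puppeteer, the predicate can wrap page.evaluate(), e.g.:
// await waitFor(() =>
//   page.evaluate(() => document.querySelectorAll('.item').length > 10)
// );
```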
Final Thoughts
Obtaining the page source is crucial in web scraping, and Puppeteer offers several methods to accomplish this. This article outlines the primary techniques for retrieving the page source using Puppeteer. For dynamic websites, it's important to delay the process until the network is idle, wait for specific selectors, complete navigation, or inject scripts to guarantee that all content is loaded before extracting the HTML.
Select the most suitable method depending on your particular requirements. May this guide assist you in efficiently extracting HTML from websites with Puppeteer for your web scraping endeavors!