Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium. It allows you to programmatically navigate pages, interact with elements, execute JavaScript, and more.
A common use case of Puppeteer is web scraping – where you want to extract data from web pages. To scrape the data, you often need the full HTML source of the rendered web page. In this guide, we'll see different ways to get the page source in Puppeteer.
Using page.content()
The easiest way to get the full page source in Puppeteer is the page.content() method. For example:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const html = await page.content(); // get full page HTML
  console.log(html);

  await browser.close();
})();
```
page.content() returns a Promise that resolves to the full HTML source of the page, including the doctype and the html, head, and body tags. This is the simplest way to get the page source in Puppeteer.
However, there are a couple of caveats:
- page.content() serializes the DOM as it exists at the moment you call it. If you navigate with the default settings, Puppeteer waits for the full load event first, which can be slow on pages with lots of media/images.
- The HTML returned may not match the server's original source if the page relies heavily on JavaScript to modify the DOM.
To mitigate the first issue, we can pass a waitUntil option – note that it belongs to page.goto(), not to page.content(), which takes no options:

```javascript
// Resolve navigation as soon as the DOM is parsed, without waiting
// for other resources like images and stylesheets
await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });
const html = await page.content();
```

This returns the HTML sooner, though images and stylesheets may not have loaded yet. For the second issue, we need the other approaches outlined in the rest of this guide.
Serializing the DOM
Instead of getting the entire source HTML, we can serialize the current state of the DOM using page.evaluate():
```javascript
const html = await page.evaluate(() => {
  return document.documentElement.outerHTML;
});
```
This executes a JS function within the page context, allowing us to retrieve the current DOM's outer HTML. The advantage over page.content() is that it reflects any changes JavaScript has made since the page loaded.
However, this approach also has some downsides to be aware of:
- It does not include the doctype, since outerHTML starts at the <html> tag
- Some runtime state, such as current input values or styles applied through the CSSOM, is not reflected in the serialized HTML
- Serializing large DOM trees can be slow
So page.content() is generally better if you need the full HTML document, while page.evaluate() is useful when working with pages that heavily manipulate the DOM.
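To work around the missing doctype, we can rebuild it from document.doctype before serializing. The serializeWithDoctype helper below is a hypothetical sketch, written as a plain function over a Document-like object so the logic is easy to follow; in practice it would run inside page.evaluate():

```javascript
// Hypothetical helper: outerHTML starts at <html>, so we re-create
// the doctype ourselves and prepend it to the serialized DOM.
function serializeWithDoctype(doc) {
  const doctype = doc.doctype ? `<!DOCTYPE ${doc.doctype.name}>` : '';
  return doctype + doc.documentElement.outerHTML;
}

// With Puppeteer, the same logic runs in the page context:
// const html = await page.evaluate(() => {
//   const doctype = document.doctype ? `<!DOCTYPE ${document.doctype.name}>` : '';
//   return doctype + document.documentElement.outerHTML;
// });
```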
Accessing Response Objects
As the page loads, Puppeteer fires a response event for every network response, each carrying a Response object:

```javascript
page.on('response', async response => {
  const url = response.url();
  const html = await response.text(); // text() returns a Promise
});
```
We can check the response URL to find the HTML response and get the full source. This approach captures the raw HTML served by the server, before any client-side modifications. However, a single page load fires many response events – one per resource, including XHR/fetch requests and iframes – so we need to filter for the main document response we want.
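One way to filter is to check the request's resource type, which Puppeteer reports as 'document' for top-level navigations. The isMainDocument helper below is a hypothetical sketch built on the real Response.url() and HTTPRequest.resourceType() methods:

```javascript
// Hypothetical helper that picks out the response carrying the main
// document HTML. resourceType() is 'document' for top-level navigations.
function isMainDocument(response, pageUrl) {
  return (
    response.request().resourceType() === 'document' &&
    response.url() === pageUrl
  );
}

// Usage with Puppeteer:
// page.on('response', async response => {
//   if (isMainDocument(response, 'https://example.com/')) {
//     const html = await response.text(); // raw HTML from the server
//   }
// });
```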
Accessing Frame Sources
For pages with iframes, we can get the HTML source of individual frames:
```javascript
// Assuming an iframe exists with name="frame1"
const frame = page.frames().find(f => f.name() === 'frame1');
const html = await frame.content();
```
The frame.content() method returns the full HTML source of any frame. This is useful for scraping pages that embed content in multiple cross-domain iframes.
Saving Files to Disk
Instead of accessing HTML programmatically, another approach is to save the files to disk directly:
```javascript
const fs = require('fs');

// Intercept requests so we can handle them ourselves
await page.setRequestInterception(true);

page.on('request', request => {
  const url = request.url();
  // Filter for the main HTML document
  if (request.resourceType() === 'document') {
    request.abort(); // abort the browser's request
    saveFileToDisk(url); // fetch and save it ourselves instead
  } else {
    request.continue(); // let other requests through
  }
});

async function saveFileToDisk(url) {
  const path = `./files/${getFileName(url)}`;
  // fetch is global in Node 18+; use node-fetch on older versions
  const data = await fetch(url).then(r => r.text());
  fs.writeFileSync(path, data);
}
```
By aborting the browser's request and fetching the URL ourselves, we can save the raw file straight from the network. Note that aborting the main document request means the page itself won't render in the browser.
This approach is a little more complex, but it captures the unmodified source HTML.
Scraping API Alternatives
While Puppeteer provides full control, sometimes using a dedicated scraping API service can be easier. Many cloud scraping tools like Bright Data, Apify, or ScraperAPI allow fetching pages and getting full HTML just through an API call.
For example, with Bright Data (the snippet below is illustrative only – consult the provider's documentation for the actual client API):

```javascript
// Illustrative only: the real client API differs
const brightdata = require('brightdata');
const html = await brightdata.fetch('https://example.com');
```
No need to deal with browsers or proxies. The APIs abstract away all that complexity with just a single method call. Scraping APIs are simpler to use, but Puppeteer provides more customization options if needed.
Handling Dynamic Content
For sites relying heavily on JavaScript to render content, the basic techniques above may not be sufficient. Some common cases where the HTML source won't match the final rendered page:
- Content loaded dynamically via AJAX/fetch requests after page load
- Pages rendered fully client-side using frameworks like React or Vue
- Extensive DOM manipulation via JavaScript
In these cases, we need to wait for the JavaScript to execute before getting the HTML source. Here are a couple of ways to deal with dynamic content:
Wait for Network Idle
Use the 'networkidle0' or 'networkidle2' waitUntil values to wait for network activity to settle before getting content:

```javascript
// 'networkidle0': no network connections for at least 500 ms
// ('networkidle2' tolerates up to 2 long-lived connections)
await page.goto('https://example.com', {
  waitUntil: 'networkidle0',
  timeout: 5000, // fail if navigation takes longer than 5 seconds
});

// HTML will reflect any additional fetches
const html = await page.content();
```
This allows time for any asynchronous fetches and DOM changes to occur before we retrieve the HTML.
Wait for Selector
If the page updates after a specific element appears, we can wait for that element first:
```javascript
// Wait for content to load
await page.waitForSelector('.content-loaded');

// Get HTML after the selector appears
const html = await page.content();
```
This ensures that dynamically loaded content has been added to the DOM before getting the source.
Wait for Navigation
For single-page apps that render routes client-side, we may need to click links and wait for 'virtual' navigation to finish:
```javascript
// Start waiting for navigation before clicking, to avoid a race
await Promise.all([
  page.waitForNavigation(), // resolves on route change
  page.click('.nav-users'), // navigate to the #users page
]);

// Get HTML for the client-rendered page
const html = await page.content();
```
Executing Scripts
As a last resort, we can inject JavaScript to trigger events and manipulate the DOM manually:
```javascript
// Click a button
await page.evaluate(() => {
  document.querySelector('button').click();
});

// Start listening for the XHR before triggering it, so we can't miss it
const responsePromise = page.waitForResponse('https://api.example.com/data');

// Trigger dynamic content load by scrolling to the bottom
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight);
});

// Wait for the XHR request to complete
await responsePromise;

// Get HTML after our custom steps
const html = await page.content();
```
Executing scripts this way provides full control to mimic user interactions and force content to load. The downside is increased fragility and complexity compared to just waiting for the network to idle.
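When none of the built-in waits fit, a small generic polling helper can wrap any condition, including one evaluated in the page. The waitFor function below is a hypothetical sketch, not part of Puppeteer's API:

```javascript
// Generic polling helper – repeatedly evaluates an async predicate
// until it returns a truthy value or the timeout elapses.
async function waitFor(predicate, { timeout = 5000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return;
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error(`waitFor: condition not met within ${timeout} ms`);
}

// With Puppeteer, the predicate can wrap page.evaluate(), e.g.:
// await waitFor(() =>
//   page.evaluate(() => document.querySelectorAll('.item').length > 10)
// );
```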
Final Thoughts
Obtaining the page source is crucial in web scraping, and Puppeteer offers several methods to accomplish this. This article outlines the primary techniques for retrieving the page source using Puppeteer. For dynamic websites, it's important to delay the process until the network is idle, wait for specific selectors, complete navigation, or inject scripts to guarantee that all content is loaded before extracting the HTML.
Select the most suitable method depending on your particular requirements. May this guide assist you in efficiently extracting HTML from websites with Puppeteer for your web scraping endeavors!