Web Scraping With a Headless Browser – Puppeteer

Web scraping dynamic websites can be challenging with traditional HTTP clients like fetch and axios. Browser automation tools like Puppeteer render JavaScript-heavy pages the way a real user sees them, which makes them far easier to scrape. In this guide, we’ll cover web scraping dynamic pages with Puppeteer and Node.js, including:

  • How headless browser automation works
  • Core API overview with examples
  • Waiting for page loads and content rendering
  • Selecting elements and extracting data
  • Scraping profile data from TikTok
  • Optimization and anti-blocking techniques
  • Scaling up with cloud services

This will provide a solid foundation for using Puppeteer for scraping. Let’s jump in!

What is Puppeteer?

Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Firefox over the DevTools Protocol. With traditional HTTP clients like axios and node-fetch, scraping dynamic JavaScript content can be very challenging. Puppeteer spins up a real browser that renders everything just like a normal user would see.

This makes scraping much easier. For example, here's how to extract the text content from a page:

const puppeteer = require('puppeteer');

// Start a headless browser and open a new tab
const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto('https://example.com');

// Run JavaScript inside the page to grab the rendered text
const text = await page.evaluate(() => {
  return document.body.innerText;
});

console.log(text);

await browser.close();

Puppeteer also allows:

  • Executing JavaScript in the browser context
  • Filling out forms and clicking elements
  • Generating screenshots of pages
  • Routing traffic through rotating proxies

In other words, it operates a full simulated browser, giving us complete access to scrape dynamic sites. The main downside is that it's more resource-intensive and complex than a simple HTTP request, but the scraping superpowers gained are worth it for many applications.

Overview of Puppeteer API

Let's look at some of the key concepts and classes in Puppeteer’s API.

First install it:

npm install puppeteer

Then include it in your script:

const puppeteer = require('puppeteer');

Launching the browser

The entry point is the puppeteer.launch() method which boots up a browser instance:

const browser = await puppeteer.launch();

Pass options like headless: false to disable headless mode and see the browser.
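
For instance, here's a short sketch of launching a visible browser with a custom window size (the flag shown is a common choice, not a requirement):

// Show the browser window and set its size via a Chromium flag
const browser = await puppeteer.launch({
  headless: false,
  args: ['--window-size=1280,800']
});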

Creating pages

To control tabs use browser.newPage():

const page = await browser.newPage();

Navigating

The page.goto() method loads a URL in the tab:

await page.goto('https://example.com');

Extracting content

Use page.content() to get the full HTML:

const html = await page.content();

Or page.evaluate() to run browser JavaScript:

const title = await page.evaluate(() => {
  return document.querySelector('title').textContent;
});

Emulating interactions

Click elements and submit forms:

await page.click('#submit-button');
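
The same approach handles forms: page.type() enters text into an input before the click. A short sketch, with the selectors assumed for an imaginary login form:

// Fill in a hypothetical login form and submit it
await page.type('#username', 'my-user');
await page.type('#password', 'my-pass');
await page.click('#submit-button');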

Closing

Don't forget to close the browser:

await browser.close();

This covers the basics of controlling a page! Now we can start scraping.

Waiting For Content To Load

Here we encounter the first major snag – how do we know when a page has fully loaded before scraping it? With simple static sites, the initial HTML download completes the loading. But modern dynamic pages continue assembling content even after the base HTML arrives.

For example, it may fire off fetch() requests in JavaScript to populate data. So we need to wait for all network requests to complete before scraping to avoid missing data.

DOMContentLoaded vs NetworkIdle

The page.goto() method accepts a waitUntil option to define landing conditions. Two common choices are:

  • domcontentloaded – waits for the initial HTML structure to complete
  • networkidle0 – waits for all network connections to go idle

For example:

await page.goto(url, {
  waitUntil: 'networkidle0'
});

domcontentloaded fires sooner, but networkidle waits for outstanding network requests to finish, so it is safer for dynamic sites. The number in networkidle0 and networkidle2 defines how many network connections may still be active when the page is considered loaded (zero or two, respectively).

Waiting For Selectors

However, even networkidle can be unreliable in some cases. A better method is explicitly waiting for elements to appear before continuing. Use page.waitForSelector() to wait for a CSS selector:

// Wait for header to load
await page.waitForSelector('header');

Or page.waitForXPath() for an XPath query:

// Wait for first product
await page.waitForXPath('//div[@class="product"]');

This way, we can confirm page readiness based on elements we want to interact with or scrape. We can even wait for multiple selectors to combine conditions:

await Promise.all([
  page.waitForSelector('header'),
  page.waitForSelector('.product-list'), 
  page.waitForSelector('footer') 
]);

Next, let's look at techniques for extracting loaded data.

Selecting and Extracting Content

Once the page has loaded, we can parse and extract the information we want. Puppeteer gives us full access to the DOM through selectors and browser evaluation.

Using CSS Selectors

To scrape content based on CSS selectors:

Get a single element

Use the page.$eval() helper:

const title = await page.$eval('h1', el => el.textContent);

Get multiple elements

Use page.$$eval() to return an array:

const productPrices = await page.$$eval('.product-price', nodes => {
  return nodes.map(n => n.innerText);  
});

Querying

The page.$() and page.$$() methods return element handles rather than values. We can call .evaluate() on each handle to extract data:

const links = await page.$$('a.product-link');

for(let link of links) {
  const href = await link.evaluate(el => el.getAttribute('href'));
  console.log(href);
}

This allows iterating through the matching results.

Using XPath

Besides CSS, we can use XPath expressions with page.$x(), which returns element handles we can then evaluate:

// Get the src of the first image
const [img] = await page.$x('//img');
const imgUrl = await img.evaluate(el => el.getAttribute('src'));

// Get all product headings
const productNames = await page.$x('//*[@class="product"]//h2');

XPath is helpful when CSS selectors get too complex.
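
One thing XPath can do that CSS cannot is match elements by their visible text; a small sketch (the button text is just an example):

// Find a button by its text, then click it
const [buyButton] = await page.$x("//button[contains(text(), 'Add to cart')]");

if (buyButton) {
  await buyButton.click();
}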

Scraping Attributes, HTML, and More

In addition to textContent we can retrieve:

  • innerHTML – full inner HTML
  • outerHTML – full outer HTML
  • getAttribute() – value of attributes like href
  • value – current value of form fields

For example:

// Get href attribute
const link = await page.$eval('a', el => el.getAttribute('href')); 

// Get outer html
const html = await page.$eval('header', el => el.outerHTML);

This gives us multiple options for extracting data!

Real-World Example: Scraping TikTok Profiles

Now that we understand the basics, let's walk through a real-world scraper for TikTok profiles using the techniques covered. We'll get profile details and video metadata by:

  1. Searching for a hashtag like #cats
  2. Getting top video creators from the results
  3. Visiting each profile
  4. Extracting their profile info
  5. Going through their recent videos
  6. Scraping video view counts and descriptions

Launching the Browser

First, we'll launch a visible browser (rather than default headless mode) to see what's happening:

const browser = await puppeteer.launch({
  headless: false
});

Searching for a Hashtag

Next, we'll open a new page, go to TikTok.com, and search for a hashtag:

const page = await browser.newPage();

await page.goto('https://tiktok.com'); 

await page.type('input[type="search"]', '#cats');

await page.keyboard.press('Enter');

We locate the search input using its type="search" attribute, enter the text, and press Enter.
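
If the search input occasionally misses characters, the same step can be made more deliberate by typing with a small per-keystroke delay (the 100 ms value is arbitrary):

// A slower, more human-like variant of the same search step
await page.type('input[type="search"]', '#cats', { delay: 100 });
await page.keyboard.press('Enter');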

Getting Top Profiles

Now we can wait for some profiles to load and extract their page URLs:

// Wait for profile links to exist
await page.waitForSelector('a[data-e2e="user-item-author-avatar"]');

// Get href for each
const urls = await page.$$eval('a[data-e2e="user-item-author-avatar"]', links => {
  return links.map(link => link.href);
});

We locate profile links through their unique data-e2e attribute.
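
Since the same creator can appear in several result cards, it can also help to de-duplicate the URLs and cap how many profiles we visit; the cleaned list below (the cap of 10 is arbitrary) could stand in for urls in the next step:

// Remove duplicate profile URLs and keep at most 10
const profileUrls = [...new Set(urls)].slice(0, 10);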

Visiting Profile Pages

Next, we'll loop through the profile URLs, visit each page, and extract info:

const profiles = [];

for (let url of urls) {

  // Visit profile page
  await page.goto(url);

  // Wait for username to exist 
  await page.waitForSelector('.share-title');

  // Get profile details
  const username = await page.$eval('.share-title', el => el.textContent);
  const followerCount = await page.$eval('.follower-count', el => el.textContent);

  // Store 
  profiles.push({
    username, 
    followerCount
  });
  
}

We wait for the username element, scrape details, and add it to the profiles array.
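
Selectors on real profiles sometimes fail (private accounts, layout changes), so it can be worth wrapping page.$eval() in a small helper that returns null instead of throwing; a sketch under that assumption:

// Returns null when the selector is missing instead of throwing
async function safeEval(page, selector, fn) {
  try {
    return await page.$eval(selector, fn);
  } catch {
    return null;
  }
}

// e.g. const followerCount = await safeEval(page, '.follower-count', el => el.textContent);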

Getting Video Data

Now we can scrape videos from each user. We'll get the 5 most recent:

for (let profile of profiles) {

  const page = await browser.newPage();
  await page.goto('https://tiktok.com/@' + profile.username);

  const videos = [];

  // Wait for video links to load
  await page.waitForSelector('.tiktok-avatar');

  // Get href for first 5
  const urls = await page.$$eval('.tiktok-avatar', elements => {
    return elements.map(el => el.href).slice(0, 5);
  });

  // Visit and parse each video page
  for (let url of urls) {
    const page = await browser.newPage();
    await page.goto(url);

    // Get view and comment counts
    const viewCount = await page.$eval('.view-count', el => el.textContent); 
    const commentCount = await page.$eval('.comment-count', el => el.textContent);

    videos.push({
      viewCount, 
      commentCount  
    });

    await page.close();
  }

  profile.videos = videos;

  // Close the profile page before moving to the next one
  await page.close();

}

This gives us video metadata for each profile to go alongside the profile info! While more logic is needed for a production scraper, it demonstrates core Puppeteer techniques.
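
To keep the combined data around, one simple option is writing it to disk as JSON (the file name is arbitrary):

const fs = require('fs');

// Persist the scraped profiles together with their video stats
fs.writeFileSync('tiktok-profiles.json', JSON.stringify(profiles, null, 2));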

Common Challenges

So far, we have focused on happy path scraping. But in reality, you often run into obstacles like:

  • Rate limiting and blocking
  • Slow performance
  • Bot detection

Let's discuss some best practices to handle them.

Making Scrapers Faster

Puppeteer spins up an entire browser, so it's resource intensive. Here are some optimizations:

  • Run headless – eliminates render overhead
  • Disable JavaScript – page.setJavaScriptEnabled(false)
  • Block requests – don't load unnecessary assets like images, CSS, etc.

For example, block images with:

// Enable request interception, then abort image requests
await page.setRequestInterception(true);

page.on('request', request => {
  if (request.resourceType() === 'image') {
    request.abort();
  } else {
    request.continue();
  }
});

This saves bandwidth and speeds up requests. Specific domains such as ad trackers can also be blocked. Test different blocking rules and measure the speedup; a 2-5x boost is common, depending on the site.
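
The same interception hook can block several resource types and third-party domains at once; a sketch, where the blocked lists are examples rather than recommendations:

// Block heavyweight resource types and a couple of example tracker domains
await page.setRequestInterception(true);

const blockedTypes = ['image', 'stylesheet', 'font', 'media'];
const blockedDomains = ['googletagmanager.com', 'doubleclick.net'];

page.on('request', request => {
  const url = request.url();
  const shouldBlock =
    blockedTypes.includes(request.resourceType()) ||
    blockedDomains.some(domain => url.includes(domain));

  if (shouldBlock) {
    request.abort();
  } else {
    request.continue();
  }
});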

Preventing Bot Detection

Sites don't like getting scraped. So they deploy various bot detection and mitigation systems:

  • Browser fingerprinting
  • IP rate limiting
  • CAPTCHAs

Here are ways to help avoid them:

  • Rotate user agents – Don't use default Puppeteer agent
  • Randomize delays – Vary timing between actions
  • Use proxies – Rotate IPs with each request; providers such as Bright Data, Smartproxy, Proxy-Seller, and Soax offer rotating pools
  • Browser extensions – Change fingerprint with plugins

For example, a plugin like Puppeteer Stealth helps mask the headless browser fingerprint. With a robust setup, Puppeteer can avoid many protections. However, it requires constant maintenance to keep up with the latest bot mitigations.
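
Here is a minimal sketch wiring several of these ideas together with the puppeteer-extra stealth plugin (installed via npm install puppeteer-extra puppeteer-extra-plugin-stealth), a proxy flag, and a custom user agent; the proxy URL and UA string are placeholders:

// puppeteer-extra wraps Puppeteer and lets plugins patch the browser fingerprint
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({
  headless: true,
  // Placeholder proxy; swap in a real rotating proxy endpoint
  args: ['--proxy-server=http://proxy.example.com:8080']
});

// Replace the default headless Chrome user agent string
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');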

Scaling Puppeteer Scrapers

When it's time to scale scraping, challenges multiply. Running distributed scrapers while managing proxies, throttling, and detection avoidance becomes a major undertaking. Cloud scraping services like ScraperAPI can provide hosted solutions without the headache.

For example, here is the general shape of a call to such a rendering API (the endpoint and parameters are illustrative):

const axios = require('axios');

const { data: html } = await axios.get('https://api.scraper.com/render', {
  params: {
    api_key: API_KEY,
    url: 'https://www.example.com'
  }
});

Benefits include:

  • Instantly distributed across IP addresses
  • Built-in proxy rotation
  • Fingerprint masking defenses
  • Bypasses common bot mitigations
  • Detailed usage metrics and logs

This lifts the operational overhead to scale while you focus on writing the scraping logic.

Conclusion

Puppeteer provides an excellent on-ramp for getting started with headless browser scraping. But handling proxies, fingerprinting defenses, and other modern bot mitigations ultimately requires significant maintenance. For production scraping, cloud providers like ScrapFly offer a more scalable and maintainable solution. This allows focusing development on the unique scraping needs rather than operational challenges.

Of course, always make sure to comply with site terms of service and scrape ethically. I hope this post helps level up your JavaScript web scraping skills!

John Rooney

I'm John Watson Rooney, a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
