Web scraping dynamic websites can be challenging with traditional HTTP clients like fetch and axios. Browser automation tools like Puppeteer make scraping modern JavaScript-heavy sites much easier. In this guide, we’ll cover web scraping dynamic pages with Puppeteer and Node.js, including:
- How headless browser automation works
- Core API overview with examples
- Waiting for page loads and content rendering
- Selecting elements and extracting data
- Scraping profile data from TikTok
- Optimization and anti-blocking techniques
- Scaling up with cloud services
This will provide a solid foundation for using Puppeteer for scraping. Let’s jump in!
What is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Firefox over the DevTools Protocol. With traditional HTTP clients like axios and node-fetch, scraping dynamic JavaScript content can be very challenging. Puppeteer spins up a real browser that renders everything just like a normal user would see.
This makes scraping much easier. For example, here's how to extract the text content from a page:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

const text = await page.evaluate(() => {
  return document.body.innerText;
});

console.log(text);

await browser.close();
Puppeteer also allows:
- Executing JavaScript in the browser context
- Filling out forms and clicking elements
- Generating screenshots of pages
- Routing traffic through rotating proxies
In other words, it operates a full simulated browser, giving us complete access to scrape dynamic sites. The main downsides are that it is more resource-intensive and complex than a simple HTTP request. But the scraping superpowers gained are worth it for many applications.
Overview of Puppeteer API
Let's look at some of the key concepts and classes in Puppeteer’s API.
First install it:
npm install puppeteer
Then include it in your script:
const puppeteer = require('puppeteer');
Launching the browser
The entry point is the puppeteer.launch() method, which boots up a browser instance:
const browser = await puppeteer.launch();
Pass options like headless: false to disable headless mode and see the browser.
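For example, a launch call with a couple of common options might look like this (slowMo is optional and simply slows each action down, which is handy while debugging):

const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 50       // slow each Puppeteer action down by 50ms
});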
Creating pages
To control tabs, use browser.newPage():
const page = await browser.newPage();
Navigating
The page.goto() method loads a URL in the tab:
await page.goto('https://example.com');
Extracting content
Use page.content() to get the full HTML:
const html = await page.content();
Or page.evaluate() to run browser JavaScript:
const title = await page.evaluate(() => {
  return document.querySelector('title').textContent;
});
Emulating interactions
Click elements and submit forms:
await page.click('#submit-button');
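Typing into form fields works the same way. A small sketch, where the selector is just an illustrative placeholder:

// Type a query into a search box, then submit with the keyboard
await page.type('input[name="q"]', 'puppeteer scraping');
await page.keyboard.press('Enter');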
Closing
Don't forget to close the browser:
await browser.close();
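A common pattern is wrapping the work in try/finally so the browser is closed even if a step throws. A minimal sketch:

const browser = await puppeteer.launch();

try {
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ... scraping logic ...
} finally {
  // Always release the browser process
  await browser.close();
}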
This covers the basics of controlling a page! Now we can start scraping.
Waiting For Content To Load
Here we encounter the first major snag – how do we know when a page has fully loaded before scraping it? With simple static sites, the initial HTML download completes the loading. But modern dynamic pages continue assembling content even after the base HTML arrives.
For example, it may fire off fetch() requests in JavaScript to populate data. So we need to wait for all network requests to complete before scraping to avoid missing data.
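If you know which request carries the data, one targeted option is waiting for that specific response. A sketch, assuming a hypothetical /api/products endpoint:

// Wait for the JSON response that populates the page (endpoint is hypothetical)
await page.goto('https://example.com/products');
await page.waitForResponse(response =>
  response.url().includes('/api/products') && response.status() === 200
);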
DOMContentLoaded vs NetworkIdle
The page.goto() method accepts a waitUntil option to define landing conditions. Two common choices are:
- domcontentloaded – waits for the initial HTML structure to complete
- networkidle0 – waits for all network connections to go idle
For example:
await page.goto(url, { waitUntil: 'networkidle0' });
domcontentloaded fires sooner, but networkidle ensures all network requests are complete, so it is safer for dynamic sites. The number in networkidle0 and networkidle2 defines how many network connections may still be active (zero or two, respectively).
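For example, allowing up to two lingering connections and a longer navigation timeout:

// Wait until at most 2 network connections remain, allowing up to 60 seconds
await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });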
Waiting For Selectors
However, even networkidle can be unreliable in some cases. A better method is explicitly waiting for elements to appear before continuing. Use page.waitForSelector() to wait for a CSS selector:
// Wait for header to load
await page.waitForSelector('header');
Or page.waitForXPath() for an XPath query:
// Wait for first product
await page.waitForXPath('//div[@class="product"]');
This way, we can confirm page readiness based on elements we want to interact with or scrape. We can even wait for multiple selectors to combine conditions:
await Promise.all([
  page.waitForSelector('header'),
  page.waitForSelector('.product-list'),
  page.waitForSelector('footer')
]);
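waitForSelector also accepts options such as visible and timeout, and page.waitForFunction() can wait on an arbitrary condition. A quick sketch using the selectors from above:

// Only continue once the element is actually visible, giving up after 10 seconds
await page.waitForSelector('.product-list', { visible: true, timeout: 10000 });

// Wait for a custom condition evaluated inside the browser
await page.waitForFunction(() => document.querySelectorAll('.product').length >= 10);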
Next, let's look at techniques for extracting loaded data.
Selecting and Extracting Content
Once the page has loaded, we can parse and extract the information we want. Puppeteer gives us full access to the DOM through selectors and browser evaluation.
Using CSS Selectors
To scrape content based on CSS selectors:
Get a single element
Use the page.$eval() helper:
const title = await page.$eval('h1', el => el.textContent);
Get multiple elements
Use page.$$eval() to return an array:
const productPrices = await page.$$eval('.product-price', nodes => {
  return nodes.map(n => n.innerText);
});
Querying
The page.$() and page.$$() methods find matching elements but don't return their values directly. We can call .evaluate() on each handle:
const links = await page.$$('a.product-link');

for (let link of links) {
  const href = await link.evaluate(el => el.getAttribute('href'));
  console.log(href);
}
This allows iterating through the matching results.
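Element handles also expose their own $eval(), which is useful for pulling several fields out of each matched item. A sketch with assumed selectors:

const cards = await page.$$('.product');

for (let card of cards) {
  // Query within this card rather than the whole page
  const name = await card.$eval('h2', el => el.textContent);
  const price = await card.$eval('.product-price', el => el.textContent);
  console.log(name, price);
}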
Using XPath
Besides CSS, we can use XPath expressions with page.$x(), which returns matching element handles:

// Get the src of the first image
const [img] = await page.$x('//img');
const imgUrl = await img.evaluate(el => el.getAttribute('src'));

// Get all product heading elements
const productNames = await page.$x('//*[@class="product"]//h2');
XPath is helpful when CSS selectors get too complex.
Scraping Attributes, HTML, and More
In addition to textContent, we can retrieve:
- innerHTML – full inner HTML
- outerHTML – full outer HTML
- getAttribute() – value of attributes like href
- value – for form fields
For example:
// Get href attribute
const link = await page.$eval('a', el => el.getAttribute('href'));

// Get outer html
const html = await page.$eval('header', el => el.outerHTML);
This gives us multiple options for extracting data!
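These can also be combined in a single $$eval() call to return structured objects; the selectors and fields here are just assumptions for illustration:

const products = await page.$$eval('.product', nodes => {
  return nodes.map(node => ({
    name: node.querySelector('h2') ? node.querySelector('h2').textContent : null,
    url: node.querySelector('a') ? node.querySelector('a').getAttribute('href') : null
  }));
});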
Real-World Example: Scraping TikTok Profiles
Now that we understand the basics, let's walk through a real-world scraper for TikTok profiles using the techniques covered. We'll get profile details and video metadata by:
- Searching for a hashtag like #cats
- Getting top video creators from the results
- Visiting each profile
- Extracting their profile info
- Going through their recent videos
- Scraping video view counts and descriptions
Launching the Browser
First, we'll launch a visible browser (rather than default headless mode) to see what's happening:
const browser = await puppeteer.launch({ headless: false });
Searching for a Hashtag
Next, we'll open a new page, go to TikTok.com, and search for a hashtag:
const page = await browser.newPage();
await page.goto('https://tiktok.com');

await page.type('input[type="search"]', '#cats');
await page.keyboard.press('Enter');
We locate the search input using its type="search" attribute, enter the text, and press Enter.
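Pressing Enter kicks off a navigation to the results page, so a more robust variant waits for that navigation explicitly rather than pressing the key on its own. A sketch:

// Press Enter and wait for the results page in one step
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle2' }),
  page.keyboard.press('Enter')
]);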
Getting Top Profiles
Now we can wait for some profiles to load and extract their page URLs:
// Wait for profile links to exist
await page.waitForSelector('a[data-e2e="user-item-author-avatar"]');

// Get href for each
const urls = await page.$$eval('a[data-e2e="user-item-author-avatar"]', links => {
  return links.map(link => link.href);
});
We locate profile links through their unique data-e2e attribute.
Visiting Profile Pages
Next, we'll loop through the profile URLs, visit each page, and extract info:
const profiles = [];

for (let url of urls) {
  // Visit profile page
  await page.goto(url);

  // Wait for username to exist
  await page.waitForSelector('.share-title');

  // Get profile details
  const username = await page.$eval('.share-title', el => el.textContent);
  const followerCount = await page.$eval('.follower-count', el => el.textContent);

  // Store
  profiles.push({ username, followerCount });
}
We wait for the username element, scrape details, and add it to the profiles array.
Getting Video Data
Now we can scrape videos from each user. We'll get the 5 most recent:
for (let profile of profiles) {
  const profilePage = await browser.newPage();
  await profilePage.goto('https://tiktok.com/@' + profile.username);

  const videos = [];

  // Wait for video links to load
  await profilePage.waitForSelector('.tiktok-avatar');

  // Get href for first 5
  const urls = await profilePage.$$eval('.tiktok-avatar', elements => {
    return elements.map(el => el.href).slice(0, 5);
  });

  // Visit and parse each video page
  for (let url of urls) {
    const videoPage = await browser.newPage();
    await videoPage.goto(url);

    // Get view and comment counts
    const viewCount = await videoPage.$eval('.view-count', el => el.textContent);
    const commentCount = await videoPage.$eval('.comment-count', el => el.textContent);

    videos.push({ viewCount, commentCount });

    await videoPage.close();
  }

  profile.videos = videos;
  await profilePage.close();
}
This gives us video metadata for each profile to go alongside the profile info! While more logic is needed for a production scraper, it demonstrates core Puppeteer techniques.
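In production you would also want basic error handling and retries around navigation. One minimal, illustrative helper:

// Retry page.goto a few times before giving up
async function gotoWithRetries(page, url, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      return;
    } catch (err) {
      console.warn(`Attempt ${attempt} failed for ${url}: ${err.message}`);
      if (attempt === retries) throw err;
    }
  }
}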
Common Challenges
So far, we have focused on happy path scraping. But in reality, you often run into obstacles like:
- Rate limiting and blocking
- Slow performance
- Bot detection
Let's discuss some best practices to handle them.
Making Scrapers Faster
Puppeteer spins up an entire browser, so it's resource intensive. Here are some optimizations:
- Run headless – eliminates render overhead
- Disable JavaScript – page.setJavaScriptEnabled(false)
- Block requests – don't load unnecessary assets like images, CSS, etc.
For example, block images with:
// Request interception must be enabled before handlers can abort/continue
await page.setRequestInterception(true);

// Remove image loading
page.on('request', request => {
  if (request.resourceType() === 'image') {
    request.abort();
  } else {
    request.continue();
  }
});
This saves bandwidth and speeds up page loads. Specific domains like ad trackers can also be blocked. You can test various blocking rules and measure the speedup; 2-5x is common, depending on the site.
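The same interception hook can filter by URL as well. For example, a combined handler that drops both images and requests to tracker domains (the domain list is just an illustration):

// Assumes request interception has been enabled as above
const blockedDomains = ['googletagmanager.com', 'doubleclick.net'];

page.on('request', request => {
  const blocked = request.resourceType() === 'image' ||
    blockedDomains.some(domain => request.url().includes(domain));

  if (blocked) {
    request.abort();
  } else {
    request.continue();
  }
});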
Preventing Bot Detection
Sites don't like getting scraped. So they deploy various bot detection and mitigation systems:
- Browser fingerprinting
- IP rate limiting
- CAPTCHAs
Here are ways to help avoid them:
- Rotate user agents – Don't use default Puppeteer agent
- Randomize delays – Vary timing between actions
- Use proxies – Rotate IPs with each request. Popular providers include Bright Data, Smartproxy, Proxy-Seller, and Soax.
- Browser extensions – Change fingerprint with plugins
For example, a plugin like Puppeteer Stealth helps mask the headless browser fingerprint. With a robust setup, Puppeteer can avoid many protections. However, it requires constant maintenance to keep up with the latest bot mitigations.
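Here is one minimal sketch of wiring that up with the puppeteer-extra stealth plugin, plus a non-default user agent and a randomized delay between actions:

// npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Use a realistic user agent instead of the default headless one
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

// Randomize delays between actions (1-3 seconds here)
await new Promise(resolve => setTimeout(resolve, 1000 + Math.random() * 2000));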
Scaling Puppeteer Scrapers
When it's time to scale scraping, challenges multiply. Running distributed scrapers while managing proxies, throttling, and detection avoidance becomes a major undertaking. Cloud scraping services like ScraperAPI can provide hosted solutions without the headache.
For example, here is ScraperAPI rendering a page using Puppeteer:
const axios = require('axios');

const response = await axios.get('https://api.scraper.com/render', {
  params: {
    api_key: API_KEY,
    url: 'https://www.example.com'
  }
});

const html = response.data;
Benefits include:
- Instantly distributed across IP addresses
- Built-in proxy rotation
- Fingerprint masking defenses
- Bypasses common bot mitigations
- Detailed usage metrics and logs
This lifts the operational overhead to scale while you focus on writing the scraping logic.
Conclusion
Puppeteer provides an excellent on-ramp for getting started with headless browser scraping. But handling proxies, fingerprinting defenses, and other modern bot mitigations ultimately requires significant maintenance. For production scraping, cloud providers like ScrapFly offer a more scalable and maintainable solution. This allows focusing development on the unique scraping needs rather than operational challenges.
Of course, always make sure to comply with site terms of service and scrape ethically. I hope this post helps level up your JavaScript web scraping skills!