How to Save and Load Cookies in Puppeteer?

Cookies are small pieces of data stored in your browser that help websites remember information about you. When web scraping, saving and loading cookies can be extremely useful to maintain session state and continue scraping where you left off.

In this comprehensive guide, I'll explain what cookies are, why saving and loading them matters for web scraping, and provide code examples of how to save and load cookies using Puppeteer in Node.js.

What are Cookies?

Cookies are tiny text files that websites store on your computer. They allow sites to remember information about you and your preferences. Some common examples of data cookies store:

  • Login session info – Keeps you logged into a site without having to re-enter your credentials each time.
  • Shopping cart items – Remembers items you've added to your cart across page visits.
  • User preferences – Such as default language, theme, or location.
  • Tracking data – Data that follows you across sites for advertising and analytics.

When you first visit a website, it can create a cookie file on your machine. On subsequent visits, your browser will send those cookies back to the site. This allows the site to “remember” you. Cookies set by the website domain you are currently visiting are called first-party cookies. However, sites can also set third-party cookies from other domains (like ads and analytics scripts).

By saving and loading cookies, we can maintain this logged-in state in our scrapers as we navigate pages.

Why Save and Load Cookies for Web Scraping?

Saving and reusing cookies is vital for robust web scraping. Here are some key reasons:

Logged-in State

Many sites require you to log in to access content. Cookies store your session ID and other authentication info to keep you logged in as you navigate pages. By saving cookies after logging in, we can regain our authenticated state later by re-loading them. This allows scraping both public and logged-in content without having to fully re-login for every scraper run.

Continue Scraping Where You Left Off

Web scrapers often need to run continuously for days or weeks to get all data. By saving cookies periodically as we scrape, we can pick up where we left off if our scraper crashes or is rate limited. Instead of starting over from the beginning, loading cookies will continue the session right at the last page or API request.

Avoid Re-downloading Resources

Some sites may limit how often you can download data from them. By reusing cookies, we avoid re-triggering any download counters associated with them. This helps prevent scrapers from getting blocked for making too many frequent duplicate requests. Cookies allow each request to be part of the same ongoing session.

Mimic Natural Browsing

Humans don't start their browsers from a blank cookie slate every time they visit a website. By mimicking real user behavior with persistent cookies, scrapers appear more natural and less suspicious to sites. Rotating IP addresses can also help here. But reusing cookies takes it a step further for an authentic user experience.

Saving Cookies with Puppeteer

Now that we know why saving cookies matters, let's go through how to save and load them using Puppeteer in Node.js. Puppeteer is a popular Node library created by Google for browser automation and scraping. It launches a headless Chrome browser to control pages and extract data programmatically.

To save cookies with Puppeteer, we use the page.cookies() method. Here is a basic example:

// Require Puppeteer library 
const puppeteer = require('puppeteer');

// Launch browser and open a new page
const browser = await puppeteer.launch();
const page = await browser.newPage(); 

// Navigate page to target website  
await page.goto('https://example.com');

// Save current cookies to variable
const cookies = await page.cookies(); 

// Close browser  
await browser.close();

The page.cookies() method returns an array of cookie objects for the current page. Each object contains details like name, value, domain, and expiration. To save the cookies to a file, we can stringify the array and write to disk:

// Require 'fs' module to write files
const fs = require('fs').promises;

// Save cookies as JSON 
await fs.writeFile('cookies.json', JSON.stringify(cookies));

Now we have persisted the cookies to a reusable file.

Cookie Scope

One important thing about page.cookies() – it only returns cookies for the current page's URL. So if you want to capture cookies across the entire browser session (like for a multi-domain scraper), ask the DevTools Protocol for all cookies instead:

// Get ALL cookies in the browser via the DevTools Protocol
const client = await page.target().createCDPSession();
const { cookies } = await client.send('Network.getAllCookies');

When to Save Cookies

In a real scraper, we'd want to save cookies as we navigate pages periodically:

// Login
await login(page); 

// Track how many pages we've visited
let pageCount = 0;

// Loop through page links
for (const pageUrl of pageLinks) {

  await page.goto(pageUrl);
  pageCount++;

  // Save cookies every 10 pages
  if (pageCount % 10 === 0) {
    const cookies = await page.cookies();
    await fs.writeFile('cookies.json', JSON.stringify(cookies));
  }

  // Rest of scraping script...
}

This lets us resume from any page or step if the script gets halted.

Cookie Storage Options

While a JSON file works for basic cookie saving, here are some other good storage options:

  • Database – Store cookies in a table to make querying easier. Redis provides a fast key-value store.
  • Object storage – Services like S3 allow saving cookies for distributed scrapers.
  • Data lake – Cookies could be stored in directories in cloud storage like AWS Lake Formation.

Adjust based on your infrastructure and query needs. The key is having them persisted somewhere reusable.

Loading Cookies with Puppeteer

Once we have cookies saved, we can load them to resume or recreate the logged-in state. The main method for loading cookies in Puppeteer is page.setCookie(). It accepts the cookie objects we stored earlier, passed as separate arguments:

// Require Puppeteer
const puppeteer = require('puppeteer');

// Launch browser
const browser = await puppeteer.launch();

// Open page        
const page = await browser.newPage();

// Load saved cookies (note the spread – setCookie takes cookies as separate arguments)
await page.setCookie(...savedCookies);

// Navigate to target site
await page.goto('https://example.com');

page.setCookie() lets you set multiple cookies at once by passing cookie objects as separate arguments, so spread a saved array with page.setCookie(...savedCookies). Behind the scenes, it adds each one back to the browser's cookie jar to reconstruct your session.

Reload Cookies on Each Page

For best results, re-apply cookies before each navigation rather than loading them only once. Cookies set with page.setCookie() do persist in the browser context, but a fresh page, a new context, or a site clearing its own cookies will silently drop your session. Re-setting on each navigation is a cheap safety net:

// Loop through URL list
for (const url of urls) {

  // Re-apply saved cookies (note the spread)
  await page.setCookie(...savedCookies);

  // Navigate page
  await page.goto(url);

  // Rest of scraper...
}

This ensures your cookies persist across multiple page navigations as you scrape through a site.

Check for Expired Cookies

One issue – cookies can expire! The scraper will fail if you try loading an old, expired cookie. A cookie's expires field is a Unix timestamp in seconds (session cookies use -1), while Date.now() returns milliseconds, so convert before comparing and only load non-expired ones:

const nowSeconds = Date.now() / 1000;
const activeCookies = cookies.filter(c => !c.expires || c.expires === -1 || c.expires > nowSeconds);
await page.setCookie(...activeCookies);

Alternatively, you can set an expires value further in the future when saving cookies to keep them alive longer.
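As a sketch, here is a small helper (illustrative, not part of Puppeteer) that pushes each cookie's expires out by a number of days before saving, leaving session cookies (expires of -1) untouched:

```javascript
// Extend each cookie's lifetime by `days`.
// `expires` is a Unix timestamp in seconds; -1 marks a session cookie.
function extendCookieExpiry(cookies, days) {
  const extraSeconds = days * 24 * 60 * 60;
  return cookies.map(cookie => {
    if (!cookie.expires || cookie.expires === -1) return cookie; // session cookie, leave as-is
    return { ...cookie, expires: cookie.expires + extraSeconds };
  });
}
```

Keep in mind the server still decides how long the session on its side stays valid; extending expires only stops the browser from discarding the cookie early.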

Cookie Domains Matter

Pay attention to cookie domains when loading. page.setCookie() will fail if a cookie's domain doesn't match the current page. For example, cookies from siteA.com won't work on siteB.com. Have a strategy for handling multi-domain cookies.
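One simple strategy is to filter saved cookies against the page you are about to visit. The helpers below are a rough sketch of the usual host-matching rule (leading-dot domains match subdomains, host-only domains match exactly); real cookie matching has more edge cases:

```javascript
// Does this cookie's domain apply to the given page URL?
function cookieMatchesUrl(cookie, url) {
  const host = new URL(url).hostname; // already lowercased by URL parsing
  const domain = (cookie.domain || '').toLowerCase();
  if (domain.startsWith('.')) {
    // Leading dot: matches the bare domain and any subdomain
    return host === domain.slice(1) || host.endsWith(domain);
  }
  // Host-only cookie: exact match
  return host === domain;
}

// Keep only the cookies that are valid for the target URL.
function cookiesForUrl(cookies, url) {
  return cookies.filter(c => cookieMatchesUrl(c, url));
}
```

Running saved cookies through cookiesForUrl() before page.setCookie() avoids the domain-mismatch errors described above.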

Example: Login, Save, and Reuse Cookies

Let's put this all together into a full script that logs into a site, saves cookies, and then reloads them on a second run.

const puppeteer = require('puppeteer');
const fs = require('fs').promises;

// Function to handle login
const login = async (page) => {
  // Go to login page
  await page.goto('https://example.com/login');

  // Enter credentials and submit form
  // ...

  // Wait for navigation after login button click
  await Promise.all([
    page.waitForNavigation(), 
    page.click('#login') 
  ]);
}

// Initial script run - get cookies
(async () => {

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await login(page); // Log in to site

  const cookies = await page.cookies(); // Save cookies

  await browser.close();

  await fs.writeFile('cookies.json', JSON.stringify(cookies)); // Save to file

})();


// Second run - load cookies
(async () => {

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const cookiesString = await fs.readFile('cookies.json', 'utf8');
  const cookies = JSON.parse(cookiesString);

  await page.setCookie(...cookies); // Load cookies (spread the array)

  await page.goto('https://example.com/dashboard'); // Browse site logged in
  
  // Rest of script...  

  await browser.close();

})();

This allows us to perform any logged-in scraping tasks after restoring the session without needing to re-login each time. The same flow could be adapted for an infinite looping scraper, saving cookies periodically as it runs and loading them if it gets halted and restarted.

Storing Cookies for Multiple Domains

For scrapers that interact with multiple domains, we need a way to save and associate cookies with their original site.

Approach 1: Separate Cookie Files

One method is saving cookies to separate files named by domain:

siteA-cookies.json
siteB-cookies.json 
siteC-cookies.json

Then when switching sites, load the matching cookie file. The downside is having to manage many cookie files.

Approach 2: Cookie Map

Another strategy is a cookie map that associates domains with their cookie arrays:

const cookieMap = {
  'siteA.com': siteACookies,
  'siteB.com': siteBCookies,
  //...
}

Before visiting each site, look up and load its cookies from the map. This allows storing all cookies in one data structure; a database or object storage works for larger datasets.
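As a sketch of the lookup step, the hypothetical helper below resolves a URL's host against the map, falling back to parent domains so shop.siteA.com finds the siteA.com entry:

```javascript
// Look up the cookie array for a URL's host, falling back to parent
// domains (e.g. "shop.sitea.com" -> "sitea.com"). Map keys are lowercase.
function cookiesForHost(cookieMap, url) {
  const host = new URL(url).hostname;
  if (cookieMap[host]) return cookieMap[host];
  const parts = host.split('.');
  while (parts.length > 2) {
    parts.shift();
    const parent = parts.join('.');
    if (cookieMap[parent]) return cookieMap[parent];
  }
  return []; // no saved cookies for this site
}
```

Note the parent-domain walk is a simplification; multi-part public suffixes like .co.uk would need a proper suffix list.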

Approach 3: Authenticated Sessions

An advanced option is having separate browser instances and pages for each domain. Maintain authenticated sessions for each site that you can switch between as needed in the script:

// Create browsers
const siteABrowser = await puppeteer.launch(); 
const siteBBrowser = await puppeteer.launch();

// Save cookies for each browser  
await saveSiteACookies(siteABrowser);
await saveSiteBCookies(siteBBrowser);

// Reuse browsers to scrape both sites  
await scrapeSiteA(siteABrowser); 
await scrapeSiteB(siteBBrowser);

More memory intensive but ensures clean cookie separation across domains.

Types of Data to Store in Cookies

Cookies can contain a variety of data to save state. Some useful examples for web scraping:

  • Session ID – Identifier for login state on the site. Critical for authentication.
  • CSRF tokens – Tokens that prevent cross-site request forgery attacks. May be required for some actions.
  • User ID – Logged-in user ID to make requests on behalf of.
  • Localization – Site language, region, currency, and other locale preferences.
  • Consents – GDPR, tracking, and other cookie consent values.
  • A/B tests – Assignment to different site experiments.
  • Recent views – IDs for cached or recently viewed items to simulate natural browsing.

Review the site's actual cookies to see what's critical to reproduce the state. Focus on cookies used in authentication, personalization and site mechanics.
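To get a quick picture of a site's cookies, you can bucket the output of page.cookies() by common name patterns. The patterns below are loose heuristics for illustration, not a standard taxonomy:

```javascript
// Rough heuristic: bucket cookies by what they likely control, based on name.
function categorizeCookies(cookies) {
  // Order matters: check CSRF names before the broader session patterns.
  const rules = [
    { category: 'csrf', pattern: /csrf|xsrf/i },
    { category: 'session', pattern: /sess|sid|auth|token/i },
    { category: 'consent', pattern: /consent|gdpr/i },
    { category: 'tracking', pattern: /^_ga|^_gid|track/i },
  ];
  const buckets = { csrf: [], session: [], consent: [], tracking: [], other: [] };
  for (const cookie of cookies) {
    const rule = rules.find(r => r.pattern.test(cookie.name));
    buckets[rule ? rule.category : 'other'].push(cookie.name);
  }
  return buckets;
}
```

A quick pass like this helps you decide which cookies are worth persisting and which (like tracking cookies) you can drop.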

Cookie Privacy Considerations

While reusing cookies makes scraping easier, keep in mind privacy implications:

  • Don't collect unnecessary personal or tracking data without consent.
  • Mask cookies like userID in logging if they identify individuals.
  • Delete cookies when no longer needed for your business purposes.
  • Make sure usage complies with site terms of use.

As always, respect user privacy and minimize data collection to only what's required.

Alternatives to Cookies for State

While cookies are commonly used to manage browser state, here are some other approaches:

  • Local Storage: localStorage stores key-value data in the browser but, unlike cookies, is never sent to the server with requests. Puppeteer can access it via page.evaluate():
// Get localStorage data
const data = await page.evaluate(() => {
  return JSON.parse(window.localStorage.getItem('key')); 
})

// Set localStorage data
await page.evaluate((key, data) => {
  window.localStorage.setItem(key, JSON.stringify(data));
}, 'key', 'value');

Helpful for state the site keeps client-side rather than sending with each request.

  • Browser Cache: Storing response data in the browser cache can avoid re-fetching resources. It's enabled via cache-control response headers, and Puppeteer respects caching automatically. The downside is that the cache resets when the browser exits, so requests must be replayed to repopulate it.
  • Session Replay: Some tools like SessionBox and SiteTruth record browser sessions to replay later. Allows recreating state without needing direct cookie access. But a more complex setup.
  • Headless Browser Containers: An advanced option is running each browser session in a separate Docker container. Cookies and states are encapsulated in the container. Just restart existing containers to resume sessions. Takes more infrastructure, but isolated contexts avoid cookie conflicts.

Troubleshooting Cookie Issues

Here are some common cookie troubles and how to resolve them:

  • Blank Cookie Value: Your cookie file may contain a cookie with a blank value. This usually means it expired. Filter out cookies with blank values before loading to avoid errors:
const activeCookies = cookies.filter(c => c.value !== '');
  • Cookie Domain Mismatch: Loading a cookie from one domain onto a page from another domain causes errors. Double-check your cookie file domains match the site you are visiting.
  • Expired Cookies: Cookies past their expiration date will fail to load. Check for the expires field and filter out expired entries before restoring.
  • Max Cookies Exceeded: Browsers limit total cookies per domain. Chrome allows up to 180. If your page errors on cookie sets, try deleting old cookies to free up slots.
  • Cookies Not Sent Cross-Site: Browsers only attach cookies to cross-site requests if they are marked SameSite=None and Secure. If restored cookies aren't being sent on cross-origin requests, check their sameSite and secure attributes.
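Pulling these checks together, a single helper can drop blank, expired, and mismatched-domain cookies before you call page.setCookie(). This is a sketch; adapt the rules to your target site:

```javascript
// Drop cookies that commonly break page.setCookie(): blank values,
// expired entries (`expires` is in epoch seconds), and domain mismatches.
function sanitizeCookies(cookies, targetUrl) {
  const host = new URL(targetUrl).hostname;
  const nowSeconds = Date.now() / 1000;
  return cookies.filter(cookie => {
    if (!cookie.value) return false; // blank value
    if (cookie.expires && cookie.expires !== -1 && cookie.expires < nowSeconds) {
      return false; // expired
    }
    const domain = (cookie.domain || '').toLowerCase();
    if (domain.startsWith('.')) {
      return host === domain.slice(1) || host.endsWith(domain); // subdomain match
    }
    return host === domain; // host-only cookie
  });
}
```

Then load with `await page.setCookie(...sanitizeCookies(savedCookies, url))` so only valid cookies reach the browser.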

Conclusion

Efficient cookie management is essential in web scraping to avoid redundant logins and repeated data collection. By simulating the behavior of a regular user, scrapers can operate more discreetly. However, cookies are just one facet of state management: LocalStorage, browser caching, session recordings, and headless containers can also enhance scraping efficiency. The key is to tailor a combination of these techniques to your specific scraping scenario.

This discussion aims to highlight the importance of cookies in web scraping and guide you on their proper handling with Puppeteer. Remember to collect only the data you absolutely need and always prioritize user privacy. When used wisely, cookies are a powerful ally in achieving dependable and respectful web scraping.

John Rooney

I'm John Watson Rooney, a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
