Cookies are small pieces of data stored in your browser that help websites remember information about you. When web scraping, saving and loading cookies can be extremely useful to maintain session state and continue scraping where you left off.
In this comprehensive guide, I'll explain what cookies are, why saving and loading them matters for web scraping, and provide code examples of how to save and load cookies using Puppeteer in Node.js.
What are Cookies?
Cookies are tiny text files that websites store on your computer. They allow sites to remember information about you and your preferences. Some common examples of data cookies store:
- Login session info – Keeps you logged into a site without having to re-enter your credentials each time.
- Shopping cart items – Remembers items you've added to your cart across page visits.
- User preferences – Such as default language, theme, or location.
- Tracking data – Data that follows you across sites for advertising and analytics.
When you first visit a website, it can create a cookie file on your machine. On subsequent visits, your browser will send those cookies back to the site. This allows the site to “remember” you. Cookies set by the website domain you are currently visiting are called first-party cookies. However, sites can also set third-party cookies from other domains (like ads and analytics scripts).
By saving and loading cookies, we can maintain this logged-in state in our scrapers as we navigate pages.
Why Save and Load Cookies for Web Scraping?
Saving and reusing cookies is vital for robust web scraping. Here are some key reasons:
Logged-in State
Many sites require you to log in to access content. Cookies store your session ID and other authentication info to keep you logged in as you navigate pages. By saving cookies after logging in, we can regain our authenticated state later by re-loading them. This allows scraping both public and logged-in content without having to fully re-login for every scraper run.
Continue Scraping Where You Left Off
Web scrapers often need to run continuously for days or weeks to get all data. By saving cookies periodically as we scrape, we can pick up where we left off if our scraper crashes or is rate limited. Instead of starting over from the beginning, loading cookies will continue the session right at the last page or API request.
Avoid Re-downloading Resources
Some sites may limit how often you can download data from them. By reusing cookies, we avoid re-triggering any download counters associated with them. This helps prevent scrapers from getting blocked for making too many frequent duplicate requests. Cookies allow each request to be part of the same ongoing session.
Mimic Natural Browsing
Humans don't start their browsers from a blank cookie slate every time they visit a website. By mimicking real user behavior with persistent cookies, scrapers appear more natural and less suspicious to sites. Rotating IP addresses helps here too, but reusing cookies takes the disguise a step further toward an authentic browsing profile.
Saving Cookies with Puppeteer
Now that we know why saving cookies matters, let's go through how to save and load them using Puppeteer in Node.js. Puppeteer is a popular Node library created by Google for browser automation and scraping. It launches a headless Chrome browser so you can control pages and extract data programmatically.
To save cookies with Puppeteer, we use the `page.cookies()` method. Here is a basic example:
```js
// Require Puppeteer library
const puppeteer = require('puppeteer');

// Launch browser and open a new page
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Navigate page to target website
await page.goto('https://example.com');

// Save current cookies to variable
const cookies = await page.cookies();

// Close browser
await browser.close();
```
The `page.cookies()` method returns an array of cookie objects for the current page. Each object contains details like name, value, domain, and expiration. To save the cookies to a file, we can stringify the array and write it to disk:
```js
// Require the promise-based 'fs' module to write files
const fs = require('fs').promises;

// Save cookies as JSON
await fs.writeFile('cookies.json', JSON.stringify(cookies));
```
Now we have persisted the cookies to a reusable file.
Cookie Scope
One important thing about `page.cookies()`: it only returns cookies for the current page's domain. If you want to capture cookies across the entire browser session (for example, in a multi-domain scraper), use `browser.cookies()` instead, available in newer Puppeteer releases:
```js
// Get ALL cookies from the browser context
const cookies = await browser.cookies();
```
When to Save Cookies
In a real scraper, we'd want to save cookies periodically as we navigate pages:
```js
// Log in first
await login(page);

let pageCount = 0;

// Loop through page links
for (const pageUrl of pageLinks) {
  await page.goto(pageUrl);
  pageCount++;

  // Save cookies every ~10 pages
  if (pageCount % 10 === 0) {
    const cookies = await page.cookies();
    await fs.writeFile('cookies.json', JSON.stringify(cookies));
  }

  // Rest of scraping script...
}
```
This lets us resume from any page or step if the script gets halted.
Cookie Storage Options
While a JSON file works for basic cookie saving, here are some other good storage options:
- Database – Store cookies in a table for easier querying, or use a fast key-value store like Redis.
- Object storage – Services like S3 allow saving cookies for distributed scrapers.
- Data lake – Cookies could be stored alongside scraped data in a data lake built on cloud storage (for example, managed with AWS Lake Formation).
Adjust based on your infrastructure and query needs. The key is having them persisted somewhere reusable.
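For example, here is a minimal sketch of persisting cookies in Redis, assuming the `redis` npm client and a Redis instance running locally (the key naming scheme is just an illustration):

```js
// Sketch: persist cookies in Redis, keyed by domain
const { createClient } = require('redis');

async function saveCookiesToRedis(domain, cookies) {
  const client = createClient(); // defaults to redis://localhost:6379
  await client.connect();
  // Key cookies by domain so a multi-site scraper can look them up later
  await client.set(`cookies:${domain}`, JSON.stringify(cookies));
  await client.quit();
}

async function loadCookiesFromRedis(domain) {
  const client = createClient();
  await client.connect();
  const raw = await client.get(`cookies:${domain}`);
  await client.quit();
  return raw ? JSON.parse(raw) : [];
}
```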
Loading Cookies with Puppeteer
Once we have cookies saved, we can load them to resume or recreate the logged-in state. The main method for loading cookies in Puppeteer is `page.setCookie()`. It accepts the cookie objects we stored earlier, passed as individual arguments:
```js
// Require Puppeteer
const puppeteer = require('puppeteer');

// Launch browser
const browser = await puppeteer.launch();

// Open page
const page = await browser.newPage();

// Load saved cookies (note the spread -- setCookie takes individual cookie objects)
await page.setCookie(...savedCookies);

// Navigate to target site
await page.goto('https://example.com');
```
`page.setCookie()` sets multiple cookies when you spread an array of cookie objects into it. Behind the scenes, it adds each one back to the browser's cookie jar to reconstruct your session.
Reload Cookies on Each Page
For best results, re-apply your saved cookies as you navigate rather than loading them only once. Sites can clear or overwrite cookies as you browse, so re-setting them before each navigation keeps the session intact:
```js
// Loop through URL list
for (const url of urls) {
  // Re-apply saved cookies
  await page.setCookie(...savedCookies);

  // Navigate page
  await page.goto(url);

  // Rest of scraper...
}
```
This ensures your cookies persist across multiple page navigations as you scrape through a site.
Check for Expired Cookies
One issue – cookies can expire! The session won't be restored if you load old, expired cookies. Check the `expires` field and only load non-expired ones:
```js
// expires is a Unix timestamp in seconds; -1 marks a session cookie
const nowSeconds = Date.now() / 1000;
const activeCookies = cookies.filter(
  c => !c.expires || c.expires === -1 || c.expires > nowSeconds
);

await page.setCookie(...activeCookies);
```
Alternatively, you can set the `expires` value further in the future when saving cookies to keep them alive longer.
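As a rough sketch (the site's server-side session may still time out regardless of what the cookie says), you could push each cookie's `expires` timestamp out before persisting it. Here `cookies` and `fs` come from the earlier examples:

```js
// Sketch: extend cookie lifetimes before saving (expires is in Unix seconds)
const THIRTY_DAYS = 30 * 24 * 60 * 60;

const longLivedCookies = cookies.map(cookie => ({
  ...cookie,
  // Leave session cookies (expires === -1) alone; otherwise push expiry 30 days out
  expires: cookie.expires === -1
    ? cookie.expires
    : Math.floor(Date.now() / 1000) + THIRTY_DAYS,
}));

await fs.writeFile('cookies.json', JSON.stringify(longLivedCookies));
```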
Cookie Domains Matter
Pay attention to cookie domains when loading. `page.setCookie()` will fail if a cookie's domain doesn't match the current page. For example, cookies saved from `siteA.com` won't work on `siteB.com`. Have a strategy for handling multi-domain cookies, such as filtering by domain as sketched below.
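One simple strategy is to filter the saved cookies down to those whose `domain` matches the host you are about to visit. A minimal sketch (the helper name and `savedCookies` variable are illustrative):

```js
// Sketch: only load cookies whose domain matches the target URL's hostname
function cookiesForUrl(cookies, url) {
  const hostname = new URL(url).hostname; // e.g. 'www.siteA.com'
  return cookies.filter(cookie => {
    const domain = cookie.domain.replace(/^\./, ''); // '.siteA.com' -> 'siteA.com'
    return hostname === domain || hostname.endsWith(`.${domain}`);
  });
}

const targetUrl = 'https://siteA.com/products';
await page.setCookie(...cookiesForUrl(savedCookies, targetUrl));
await page.goto(targetUrl);
```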
Example: Login, Save, and Reuse Cookies
Let's put this all together into a full script that logs into a site, saves cookies, and then reloads them on a second run.
```js
const puppeteer = require('puppeteer');
const fs = require('fs').promises;

// Function to handle login
const login = async (page) => {
  // Go to login page
  await page.goto('https://example.com/login');

  // Enter credentials and submit form
  // ...

  // Wait for navigation after login button click
  await Promise.all([
    page.waitForNavigation(),
    page.click('#login')
  ]);
};

// Initial script run - log in and get cookies
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await login(page); // Log in to site

  const cookies = await page.cookies(); // Save cookies
  await browser.close();

  await fs.writeFile('cookies.json', JSON.stringify(cookies)); // Save to file
})();

// Second run - load cookies
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const cookiesString = await fs.readFile('cookies.json');
  const cookies = JSON.parse(cookiesString);
  await page.setCookie(...cookies); // Load cookies

  await page.goto('https://example.com/dashboard'); // Browse site logged in

  // Rest of script...

  await browser.close();
})();
```
This allows us to perform any logged-in scraping tasks after restoring the session without needing to re-login each time. The same flow could be adapted for an infinite looping scraper, saving cookies periodically as it runs and loading them if it gets halted and restarted.
Storing Cookies for Multiple Domains
For scrapers that interact with multiple domains, we need a way to save and associate cookies with their original site.
Approach 1: Separate Cookie Files
One method is saving cookies to separate files named by domain:
```
siteA-cookies.json
siteB-cookies.json
siteC-cookies.json
```
Then when switching sites, load the matching cookie file. The downside is having to manage many cookie files.
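A small sketch of that per-domain file approach (the helper names are illustrative):

```js
// Sketch: save and load cookies in per-domain JSON files
const fs = require('fs').promises;

async function saveCookiesForDomain(page, domain) {
  const cookies = await page.cookies();
  await fs.writeFile(`${domain}-cookies.json`, JSON.stringify(cookies));
}

async function loadCookiesForDomain(page, domain) {
  try {
    const cookies = JSON.parse(await fs.readFile(`${domain}-cookies.json`, 'utf8'));
    await page.setCookie(...cookies);
  } catch (err) {
    // No cookie file yet for this domain -- start with a fresh session
  }
}
```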
Approach 2: Cookie Map
Another strategy is a cookie map that associates domains with their cookie arrays:
```js
const cookieMap = {
  'siteA.com': siteACookies,
  'siteB.com': siteBCookies,
  // ...
};
```
Before visiting each site, look up and load its cookies from the map. This keeps all cookies in one data structure; a database or object storage works better for larger datasets.
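For instance, a rough sketch of the lookup-and-load step before each visit (the site URLs are placeholders):

```js
// Sketch: look up cookies in the map before visiting each site
const sites = ['https://siteA.com', 'https://siteB.com'];

for (const url of sites) {
  const domain = new URL(url).hostname;
  const cookies = cookieMap[domain] || [];

  if (cookies.length) {
    await page.setCookie(...cookies);
  }
  await page.goto(url);

  // ...scrape, then refresh the map with any new cookies
  cookieMap[domain] = await page.cookies();
}
```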
Approach 3: Authenticated Sessions
An advanced option is having separate browser instances and pages for each domain. Maintain authenticated sessions for each site that you can switch between as needed in the script:
```js
// Create a separate browser for each site
const siteABrowser = await puppeteer.launch();
const siteBBrowser = await puppeteer.launch();

// Save cookies for each browser
await saveSiteACookies(siteABrowser);
await saveSiteBCookies(siteBBrowser);

// Reuse browsers to scrape both sites
await scrapeSiteA(siteABrowser);
await scrapeSiteB(siteBBrowser);
```
This is more memory intensive, but it ensures clean cookie separation across domains.
Types of Data to Store in Cookies
Cookies can contain a variety of data to save state. Some useful examples for web scraping:
- Session ID – Identifier for login state on the site. Critical for authentication.
- CSRF tokens – Tokens that prevent cross-site request forgery attacks. May be required for some actions.
- User ID – Logged-in user ID used to make requests on that user's behalf.
- Localization – Site language, region, currency, and other locale preferences.
- Consents – GDPR, tracking, and other cookie consent values.
- A/B tests – Assignment to different site experiments.
- Recent views – IDs of cached or recently viewed items, useful for simulating natural browsing.
Review the site's actual cookies to see what's critical to reproduce the state. Focus on cookies used in authentication, personalization and site mechanics.
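A quick way to review them, sketched below, is to dump each cookie's name, domain, and expiration right after logging in:

```js
// Sketch: inspect which cookies a site actually sets after login
const cookies = await page.cookies();

for (const { name, domain, expires, httpOnly, secure } of cookies) {
  console.log(
    `${name} (${domain})`,
    expires === -1 ? 'session cookie' : `expires ${new Date(expires * 1000).toISOString()}`,
    { httpOnly, secure }
  );
}
```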
Cookie Privacy Considerations
While reusing cookies makes scraping easier, keep in mind privacy implications:
- Don't collect unnecessary personal or tracking data without consent.
- Mask cookies like `userID` in logging if they identify individuals.
- Delete cookies when no longer needed for your business purposes.
- Make sure usage complies with site terms of use.
As always, respect user privacy and minimize data collection to only what's required.
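For example, a minimal sketch of masking sensitive cookie values before they hit your logs (the list of sensitive names is purely illustrative):

```js
// Sketch: mask sensitive cookie values before writing them to logs
const SENSITIVE_NAMES = ['userID', 'sessionid', 'auth_token']; // illustrative names

function maskCookiesForLogging(cookies) {
  return cookies.map(cookie =>
    SENSITIVE_NAMES.includes(cookie.name)
      ? { ...cookie, value: '***redacted***' }
      : cookie
  );
}

console.log(JSON.stringify(maskCookiesForLogging(cookies), null, 2));
```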
Alternatives to Cookies for State
While cookies are commonly used to manage browser state, here are some other approaches:
- Local Storage: localStorage works like cookies but keeps data in the browser only; it is never sent to the server with requests. Puppeteer can access it via `page.evaluate()`:
```js
// Get localStorage data
const data = await page.evaluate(() => {
  return JSON.parse(window.localStorage.getItem('key'));
});

// Set localStorage data
await page.evaluate((key, data) => {
  window.localStorage.setItem(key, JSON.stringify(data));
}, 'key', 'value');
```
Helpful for state that doesn't need to persist across scraper runs.
- Browser Cache: Storing response data in the browser cache avoids re-fetching resources. Caching is controlled by the site's `cache-control` response headers, and Puppeteer respects it automatically. The downside is that the cache resets after the browser exits, so requests must be replayed to repopulate it.
- Session Replay: Some tools, such as SessionBox, record browser sessions to replay later. This allows recreating state without direct cookie access, but it is a more complex setup.
- Headless Browser Containers: An advanced option is running each browser session in a separate Docker container. Cookies and state are encapsulated in the container, so you simply restart an existing container to resume its session. It takes more infrastructure, but the isolated contexts avoid cookie conflicts (see the sketch below).
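As a hedged sketch of that pattern, you can attach to a long-running browser container over its DevTools WebSocket endpoint instead of launching a fresh browser each run (the endpoint URL is illustrative):

```js
// Sketch: reuse a browser running in a container via puppeteer.connect()
const puppeteer = require('puppeteer');

(async () => {
  // Attach to a browser already running in a container (endpoint URL is illustrative)
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://localhost:3000',
  });

  const page = await browser.newPage();
  await page.goto('https://example.com/dashboard'); // cookies live in the container's profile

  // Disconnect without closing the browser so the session (and its cookies) survives
  await browser.disconnect();
})();
```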
Troubleshooting Cookie Issues
Here are some common cookie troubles and how to resolve them:
- Blank Cookie Value: Your cookie file may contain a cookie with a blank value. This usually means it expired. Filter out cookies with blank values before loading to avoid errors:
```js
const activeCookies = cookies.filter(c => c.value !== '');
```
- Cookie Domain Mismatch: Loading a cookie from one domain onto a page from another domain causes errors. Double-check your cookie file domains match the site you are visiting.
- Expired Cookies: Cookies past their expiration date won't restore a working session. Check the `expires` field and filter out expired entries before restoring.
- Max Cookies Exceeded: Browsers limit the total cookies per domain (Chrome allows up to 180). If setting cookies starts failing, try deleting old cookies to free up slots, as sketched after this list.
- Cross-Site Restrictions: Browsers restrict when cookies are sent on cross-site requests based on their `SameSite` attribute. If a cookie needs to work in cross-site contexts, save it with `{ sameSite: 'none', secure: true }` so it isn't silently dropped when loaded across domains.
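And if you do hit the per-domain limit, one option (sketched here, with `savedCookies` standing in for your saved set) is to clear the page's existing cookies before restoring:

```js
// Sketch: clear the page's current cookies before restoring a saved set
const existing = await page.cookies();
if (existing.length) {
  await page.deleteCookie(...existing); // frees up slots under the per-domain limit
}

// Now restore the cookies loaded from cookies.json
await page.setCookie(...savedCookies);
```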
Conclusion
Efficient cookie management is essential in web scraping to avoid redundant logins and repeated data collection. By simulating the behavior of a regular user, scrapers can operate more discreetly. However, cookies are just one facet of state management—LocalStorage, browser caching, session recordings, and headless containers can also enhance scraping efficiency. The key is to tailor a combination of these techniques to your specific scraping scenario.
This discussion aims to highlight the importance of cookies in web scraping and guide you on their proper handling with Puppeteer. Remember to collect only the data you absolutely need and always prioritize user privacy. When used wisely, cookies are a powerful ally in achieving dependable and respectful web scraping.