How to Download a File with Puppeteer?

Downloading files from websites is a common need for many web scrapers and automation scripts. Whether you need to collect reports, extract data from documents, or back up resources, programmatically downloading files can save hours of tedious manual work. This is where Puppeteer comes in handy!

Puppeteer is a powerful Node.js library created by the Google Chrome team. It allows you to control headless Chrome and automate web browsers using code. While it doesn't have built-in methods for downloading files, it provides all the tools you need to build custom file download solutions. You get complete access to emulate browser interactions like clicking links and fetching resources.

In this comprehensive guide, you'll learn how to leverage Puppeteer to download files from the web using JavaScript.

Why Download Files with Puppeteer?

Before we dive into the how-to, let's briefly look at why you may need to download files from websites in the first place.

  • Web Scraping: A common need is to collect data from files like PDF reports, spreadsheet exports, etc. Manually downloading these files would be tedious. With Puppeteer, you can programmatically find download links and save the files for later data extraction.
  • Offline Backups: Websites often use downloads to provide access to resources like documents, videos, imagery, and more. You could use Puppeteer to back up these files to your local machine for offline access later.
  • Application Testing: QA engineers often need to verify file downloads are working as expected on a web app. Puppeteer provides an automated way to test downloading different file types.
  • Content Migration: For migrating a website to a new platform, you may need to bulk download all its files. Puppeteer can speed this up by scraping links and downloading in parallel.

In summary, anytime you need to get files from websites programmatically, Puppeteer likely has you covered. Let's look at how to set it up.

Setting Up Puppeteer for File Downloads

To start downloading files, we first need to install Puppeteer and launch a browser instance to control.

Installing Puppeteer

Since Puppeteer is a Node.js package, we can install it using npm:

npm install puppeteer

This will add puppeteer to your package.json file and install it locally.

Note: Installing puppeteer also downloads a compatible version of Chrome for Testing. If you want to drive an existing browser installation instead, install puppeteer-core and pass an executablePath to launch(). See the docs for details.

Launching the Browser

Now we can require and launch a Puppeteer browser instance:

const puppeteer = require('puppeteer');

(async () => {

  const browser = await puppeteer.launch();
  
  // rest of code...
  
})();

This launches a headless Chrome instance by default. There are also many options we can pass to puppeteer.launch():

await puppeteer.launch({
  headless: false, // open browser visibly
  defaultViewport: null, // don't set viewport size  
  args: ['--start-maximized'] // open maximized
})

While debugging file downloads, it's often useful to set headless: false so we can actually see the browser initiate and complete the download; in headless mode you will typically need to configure a download behavior explicitly, as shown below. Now that we have Puppeteer set up, let's look at approaches to download files.

Saving Downloads to the Filesystem

The most straightforward way to download a file is to save it to your filesystem directly. Puppeteer gives us full control over the browser, so we can automate the steps to:

  1. Navigate to a page
  2. Click on a file download link/button
  3. Specify a download path

Puppeteer will then handle saving the file in the background. Here is an example:

// Require filesystem modules
const fs = require('fs'); 
const os = require('os');
const path = require('path');

// Launch headless browser and open a page
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Create folder to store downloads
const downloadsFolder = path.join(os.tmpdir(), 'puppeteer-downloads');
fs.mkdirSync(downloadsFolder, { recursive: true });

// Set download path over a DevTools Protocol session
// (page._client is a private API and no longer works in recent
// Puppeteer versions — create a CDP session instead)
const client = await page.createCDPSession();
await client.send('Browser.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: downloadsFolder  
});

// Go to website
await page.goto('https://example.com'); 

// Click on download button
await page.click('#download-button');

Here are some key points about this approach:

  • We use Node.js filesystem modules to prepare a folder for downloads.
  • A CDP session's send() call sets the browser download path. This is how Puppeteer knows where to save files. (Older tutorials use the private page._client.send('Page.setDownloadBehavior', ...); _client has been removed and Page.setDownloadBehavior is deprecated in favor of Browser.setDownloadBehavior.)
  • We navigate to a page and click on a download button/link to initiate the file download.
  • Puppeteer will now handle downloading the file in the background and save it to our specified path.

This provides the simplest way to automate file downloads with Puppeteer. But it has some limitations:

Downsides:

  • We don't get access to the file contents directly in our code.
  • We have to hardcode the download button/link selector, which may change.
  • The download happens asynchronously, so we don't know exactly when it completes.

In the following sections, we'll look at techniques to get the file contents into a variable in our Node.js code.

Downloading Files to a Buffer in Memory

In many cases, we want to directly access the downloaded file content from our Node.js code rather than saving it to a separate file. This allows us to extract data or parse the file contents without having to read from the filesystem.

To do this, we can leverage Puppeteer's powerful page.evaluate() method. This allows us to execute code within the browser context. Inside page.evaluate(), we can use the standard browser Fetch API to download a file and then pass the contents back to Node.js.

Here is an example:

// Get download link URL 
const url = await page.$eval('a.download', el => el.href); 

// Use evaluate() to fetch file
const content = await page.evaluate(async url => {

  // Fetch file contents 
  const response = await fetch(url);
  const buffer = await response.arrayBuffer();
  
  // Convert to base64 in chunks to return from evaluate()
  // (a single String.fromCharCode.apply() call can exceed the
  // argument limit for large files)
  const bytes = new Uint8Array(buffer);
  let binary = '';
  const chunkSize = 0x8000;
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);

}, url);

// Convert base64 to Buffer in Node.js
const buffer = Buffer.from(content, 'base64');

Let's break down what's happening:

  1. We grab the download URL from the link selector using $eval().
  2. Inside evaluate(), we fetch the file contents using the standard Fetch API. This returns an ArrayBuffer.
  3. We convert the ArrayBuffer to base64 encoding so it can be passed out of the browser context back to Node.js.
  4. Finally, we convert the base64 string back into a Buffer that contains the downloaded file contents.

Now we have the entire file in a Buffer variable in our code!

Benefits:

  • We get the file contents directly without saving to the filesystem.
  • It works for any file download link on the page.
  • We await the fetch inside evaluate(), so we know exactly when the download completes.

Downsides:

  • Loads the entire file contents into memory, so large files may exceed limits.

Overall this is a very flexible approach for most use cases where the files are reasonably sized.

Streaming Downloads for Large Files

When downloading very large files, buffering the entire contents into memory is impractical. In these cases, we need to stream the download directly to disk rather than loading it into a variable.

One tool Puppeteer does provide is page.createPDFStream(), which streams a PDF rendering of the current page. Despite how it is sometimes described, it does not fetch arbitrary URLs — it prints whatever page the browser is currently on — but it lets us write even a very large PDF to disk without holding it in memory. For large non-PDF files, you can stream the HTTP response with Node's own tools instead.

Here is an example:

// Navigate to the page to capture as a PDF
await page.goto('https://example.com');

// Create write stream to a local file
const fileStream = fs.createWriteStream('./page.pdf'); 

// Generate a PDF stream of the current page
const pdfStream = await page.createPDFStream({
  displayHeaderFooter: false,
  printBackground: true
});

// Puppeteer v22+ returns a web ReadableStream, so convert it to a
// Node.js stream before piping (older versions return a Node.js
// Readable you can pipe directly)
const { Readable } = require('stream');
Readable.fromWeb(pdfStream).pipe(fileStream);

// Wait for the write to complete 
await new Promise((resolve, reject) => {
  fileStream.on('finish', resolve);
  fileStream.on('error', reject);
});

The key points:

  • We navigate to the page we want to capture, since createPDFStream() renders the current page.
  • The createPDFStream() method returns a readable stream of the generated PDF instead of buffering it all at once.
  • We pipe() the browser's stream into a writable file stream created with fs.createWriteStream(), which writes the content to disk as it arrives.
  • We wait for the finish event to know when the write has completed.

This lets us save a large PDF without buffering it in memory. The streams handle piping the content to disk efficiently.

Benefits:

  • Can download files of any size without memory limits.
  • Streams handle backpressure and smooth downloads.

Downsides:

  • Requires more code to handle streams properly.
  • Only provides raw file content, not a usable variable.

Overall, streaming is preferred for large files where buffering the entire contents is not practical.

Handling Multiple Parallel Downloads

Sometimes clicking a download button triggers several files to download at the same time. In these cases, we need to wait for all downloads to complete before continuing our script.

Here is an example to handle parallel downloads:

// Enable download events via the DevTools Protocol
const client = await page.createCDPSession();
await client.send('Browser.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: downloadsFolder,
  eventsEnabled: true
});

// Track each in-flight download by its guid
const inProgress = new Set();
let resolveAll;
const allDone = new Promise(resolve => { resolveAll = resolve; });

client.on('Browser.downloadWillBegin', event => {
  inProgress.add(event.guid);
});

client.on('Browser.downloadProgress', event => {
  if (event.state === 'completed' || event.state === 'canceled') {
    inProgress.delete(event.guid);
    // Assumes every download has begun before the last one finishes
    if (inProgress.size === 0) resolveAll();
  }
});

// Click download button to trigger multiple downloads  
await page.click('#download-btn');

// Wait for all downloads to complete
await allDone;

The key steps are:

  • We enable download events by sending Browser.setDownloadBehavior with eventsEnabled: true over a CDP session.
  • Browser.downloadWillBegin fires once per download, letting us record each download's guid.
  • Browser.downloadProgress reports state changes; once every tracked guid has reached completed (or canceled), we resolve.

This ensures we wait for all parallel downloads to finish before continuing. Note that Puppeteer has no page.waitForEvent('download') or Download object — those are Playwright APIs. The CDP events above are the Puppeteer equivalent, and Browser.downloadProgress also carries receivedBytes and totalBytes if you want to report progress.
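Waiting for several simultaneous downloads can be factored into a reusable helper driven by Chrome's DevTools Protocol Browser.downloadProgress events. A sketch — the expected count is something you supply (CDP cannot know in advance how many files a click will trigger), and the helper works with anything exposing a CDP-style on() interface:

```javascript
// Resolve with the guids of `expected` downloads once each one
// reaches a terminal state ('completed' or 'canceled')
function waitForDownloads(client, expected) {
  return new Promise(resolve => {
    const done = [];
    client.on('Browser.downloadProgress', event => {
      if (event.state === 'completed' || event.state === 'canceled') {
        done.push(event.guid);
        if (done.length === expected) resolve(done);
      }
    });
  });
}
```

Call it before clicking so no events are missed, then await the returned promise after the click.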

Troubleshooting: Common Download Issues

There are a few common issues that may arise when downloading files with Puppeteer:

  • Browser blocks downloads in headless mode: Older headless Chrome refused downloads unless a download behavior was configured. Set a download path with Browser.setDownloadBehavior as shown earlier; recent Puppeteer releases default to the new headless mode, which handles downloads much better.
  • PDF downloads don't work: Navigating straight to a PDF URL tends to open Chrome's built-in viewer (or do nothing in headless) rather than trigger a download. Set a download behavior first, or fetch the URL directly as in the buffer approach.
  • Can't download cross-origin resources: fetch() calls made inside page.evaluate() are subject to the page's CORS policy. The --disable-web-security launch flag works around this, but it disables same-origin protections for the whole browser, so only use it against pages you trust.
  • Authenticated or cookie-based downloads failing: You may need to forward the browser's cookies or authentication headers manually before fetching protected resources.
  • Downloads hang or time out: There are many reasons downloads can hang. Restarting the browser instance often fixes it. You can also pause briefly before checking results, e.g. await new Promise(r => setTimeout(r, 250)) — note the old page.waitFor() helper has been removed from Puppeteer.
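For the authenticated-download case, you can read the browser's cookies with page.cookies() and forward them in a Cookie header when fetching from Node. Building that header is plain string work; a sketch (the cookie objects follow Puppeteer's {name, value} shape):

```javascript
// Serialize Puppeteer cookie objects into a Cookie request header
function cookieHeader(cookies) {
  return cookies
    .map(cookie => `${cookie.name}=${cookie.value}`)
    .join('; ');
}

// Usage inside a Puppeteer script (sketch):
//   const cookies = await page.cookies();
//   const response = await fetch(url, {
//     headers: { Cookie: cookieHeader(cookies) }
//   });
```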

Comparing Puppeteer Download Techniques

Throughout this guide, we covered several different approaches to downloading files with Puppeteer:

Method               How it Works                                      When to Use
Save to Filesystem   Click download links and save directly to disk    Simple background downloads
Download to Buffer   Fetch in page.evaluate() and return to memory     Get file contents immediately
Stream Downloads     Pipe a browser stream to a writable file stream   Large downloads without memory limits
Handle Multiple      Wait for all download events to complete          When clicking triggers multiple downloads

Here is a quick comparison of the pros and cons of each approach:

Save to Filesystem

  • + Simplest approach
  • + Downloads in background
  • − No access to file content
  • − Async, so unsure when finished

Download to Buffer

  • + Get file contents immediately
  • + Works with any download link
  • − Loads fully into memory

Stream Downloads

  • + No memory size limits
  • + Efficient piping to disk
  • − More complex code
  • − No buffer variable

Handle Multiple

  • + Wait for all downloads to complete
  • + Can process each file separately
  • − More logic to handle download events

Consider which benefits and limitations make sense for your specific use case when deciding on an approach.

Final Thoughts

Downloading files from the web is a common need for automation and scraping projects. No matter what your use case is, Puppeteer likely provides a way to download files from websites programmatically. With these approaches, you should have all the tools needed to integrate file downloading into your Puppeteer web scraper or automation script.

I hope this guide provided you with a comprehensive overview of the various techniques and best practices for downloading files with Puppeteer. Automated downloading will take your web scraping and testing projects to the next level.

John Rooney

John Watson Rooney is a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
