Downloading files from websites is a common need for web scrapers and automation scripts. Whether you need to collect reports, extract data from documents, or back up resources, programmatically downloading files can save hours of tedious manual work. This is where Puppeteer comes in handy!
Puppeteer is a powerful Node.js library created by the Google Chrome team. It allows you to control headless Chrome and automate web browsers using code. While it doesn't have built-in methods for downloading files, it provides all the tools you need to build custom file download solutions. You get complete access to emulate browser interactions like clicking links and fetching resources.
In this comprehensive guide, you'll learn how to leverage Puppeteer to download files from the web using JavaScript.
Why Download Files with Puppeteer?
Before we dive into the how-to, let's briefly look at why you may need to download files from websites in the first place.
- Web Scraping: A common need is to collect data from files like PDF reports, spreadsheet exports, etc. Manually downloading these files would be tedious. With Puppeteer, you can programmatically find download links and save the files for later data extraction.
- Offline Backups: Websites often use downloads to provide access to resources like documents, videos, imagery, and more. You could use Puppeteer to backup these files to your local machine for offline access later.
- Application Testing: QA engineers often need to verify file downloads are working as expected on a web app. Puppeteer provides an automated way to test downloading different file types.
- Content Migration: For migrating a website to a new platform, you may need to bulk download all its files. Puppeteer can speed this up by scraping links and downloading in parallel.
In summary, anytime you need to get files from websites programmatically, Puppeteer likely has you covered. Let's look at how to set it up.
Setting Up Puppeteer for File Downloads
To start downloading files, we first need to install Puppeteer and launch a browser instance to control.
Installing Puppeteer
Since Puppeteer is a Node.js package, we can install it using npm:

npm install puppeteer

This will add puppeteer to your package.json file and install it locally.
Note: To use Puppeteer in the browser instead of Node.js, you can also load it as a standalone script tag. See the docs for details.
Launching the Browser
Now we can require and launch a Puppeteer browser instance:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  // rest of code...
})();
This launches a headless Chrome instance by default. There are also many options we can pass to puppeteer.launch():
await puppeteer.launch({
  headless: false,             // open browser visibly
  defaultViewport: null,       // don't set viewport size
  args: ['--start-maximized']  // open maximized
});
For file downloads, it's often useful to set headless: false so we can actually watch the browser initiate and complete the download. Now that Puppeteer is set up, let's look at approaches to downloading files.
Saving Downloads to the Filesystem
The most straightforward way to download a file is to save it to your filesystem directly. Puppeteer gives us full control over the browser, so we can automate the steps to:
- Navigate to a page
- Click on a file download link/button
- Specify a download path
Puppeteer will then handle saving the file in the background. Here is an example:
// Require Puppeteer and filesystem modules
const puppeteer = require('puppeteer');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Launch headless browser and open a page
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Create folder to store downloads
const downloadsFolder = path.join(os.tmpdir(), 'puppeteer-downloads');
if (!fs.existsSync(downloadsFolder)) {
  fs.mkdirSync(downloadsFolder, { recursive: true });
}

// Set download path via the DevTools Protocol
// (page._client is a private property; a CDP session is the supported route)
const client = await page.target().createCDPSession();
await client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: downloadsFolder
});

// Go to website
await page.goto('https://example.com');

// Click on download button
await page.click('#download-button');
Here are some key points about this approach:
- We use Node.js filesystem modules to prepare a folder for downloads.
- Sending the Page.setDownloadBehavior command over a CDP session sets the browser download path. This is how Chrome knows where to save files.
- We navigate to a page and click on a download button/link to initiate the file download.
- Puppeteer will now handle downloading the file in the background and save it to our specified path.
This provides the simplest way to automate file downloads with Puppeteer. But there are some downsides:
Downsides:
- We don't get access to the file contents directly in our code.
- We have to hardcode the download button/link selector, which may change.
- The download happens asynchronously, so we don't know exactly when it completes (one workaround is sketched below).
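One way to soften that last downside is to poll the download folder until Chrome's temporary .crdownload files disappear. A minimal sketch, assuming the downloadsFolder from the example above and an illustrative 30-second timeout:

const fs = require('fs');

// Resolve once the folder contains only finished files (no .crdownload temp files)
async function waitForDownload(folder, timeoutMs = 30000) {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const files = fs.readdirSync(folder);
    const finished = files.filter(f => !f.endsWith('.crdownload'));
    if (finished.length > 0 && finished.length === files.length) return finished;
    await new Promise(resolve => setTimeout(resolve, 250)); // poll every 250ms
  }
  throw new Error('Download timed out');
}

const files = await waitForDownload(downloadsFolder);
console.log('Downloaded:', files);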
In the following sections, we'll look at techniques to get the file contents into a variable in our Node.js code.
Downloading Files to a Buffer in Memory
In many cases, we want to directly access the downloaded file content from our Node.js code rather than saving it to a separate file. This allows us to extract data or parse the file contents without having to read from the filesystem.
To do this, we can leverage Puppeteer's powerful page.evaluate() method, which allows us to execute code within the browser context. Inside page.evaluate(), we can use the standard browser Fetch API to download a file and then pass the contents back to Node.js.
Here is an example:
// Get download link URL
const url = await page.$eval('a.download', el => el.href);

// Use evaluate() to fetch the file inside the browser
const content = await page.evaluate(async url => {
  // Fetch file contents
  const response = await fetch(url);
  const buffer = await response.arrayBuffer();

  // Convert to base64 so it can be returned from evaluate()
  const base64 = btoa(String.fromCharCode.apply(null, new Uint8Array(buffer)));
  return base64;
}, url);

// Convert base64 back to a Buffer in Node.js
const buffer = Buffer.from(content, 'base64');
Let's break down what's happening:
- We grab the download URL from the link selector using $eval().
- Inside evaluate(), we fetch the file contents using the standard Fetch API. This returns an ArrayBuffer.
- We convert the ArrayBuffer to base64 encoding so it can be passed out of the browser context back to Node.js.
- Finally, we convert the base64 string back into a Buffer that contains the downloaded file contents.
Now we have the entire file in a Buffer variable in our code!
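From here the Buffer can be used like any other — for example, written to disk or decoded as text. A quick sketch (the file name is just for illustration):

const fs = require('fs');

// Persist the downloaded bytes (hypothetical file name)
fs.writeFileSync('report.pdf', buffer);

// Or decode text-based formats such as CSV directly
const text = buffer.toString('utf8');
console.log(text.slice(0, 200));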
Benefits:
- We get the file contents directly without saving to the filesystem.
- It works for any file download link on the page.
- The fetch is awaited inside evaluate(), so we know exactly when the download has completed.
Downsides:
- Loads the entire file contents into memory, so large files may exceed limits.
Overall this is a very flexible approach for most use cases where the files are reasonably sized.
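A related trick for the same goal: when clicking a link triggers the request, you can listen for the matching network response with page.waitForResponse() and read its body with response.buffer(). This only works when the file arrives as a response Puppeteer can observe (not every browser-managed download does), and the URL fragment and selector below are assumptions:

// Start listening before clicking so the response isn't missed
const responsePromise = page.waitForResponse(res => res.url().includes('/report.pdf'));
await page.click('a.download');

// Read the response body straight into a Buffer
const response = await responsePromise;
const buffer = await response.buffer();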
Streaming Downloads for Large Files
When downloading very large files, buffering the entire contents into memory is impractical. In these cases, we need to stream the download directly to disk rather than loading it into a variable.
Puppeteer does ship one built-in streaming API: page.createPDFStream(). It doesn't fetch arbitrary URLs — it renders the current page to a PDF and returns the result as a stream rather than a buffer. So the pattern is to navigate to the target page first, then pipe its PDF stream to a file, which avoids holding the whole document in memory.
Here is an example:
// Get download link URL and navigate to the page we want to capture
const url = await page.$eval('a.download', el => el.href);
await page.goto(url);

// Create write stream to a local file
const fileStream = fs.createWriteStream('./file.pdf');

// Generate a PDF stream of the current page
const pdfStream = await page.createPDFStream({
  displayHeaderFooter: false,
  printBackground: true
});

// Recent Puppeteer versions return a web ReadableStream here, so convert
// it to a Node stream before piping (older versions returned a Node Readable)
const { Readable } = require('stream');
const nodeStream = pdfStream.pipe ? pdfStream : Readable.fromWeb(pdfStream);
nodeStream.pipe(fileStream);

// Wait for the write to complete
await new Promise(resolve => fileStream.on('finish', resolve));
The key points:
- We create a writable stream to a local file using fs.createWriteStream().
- The createPDFStream() method returns a readable stream of the current page rendered as a PDF.
- We pipe() the browser stream into the writable file stream, sending the content straight to disk.
- We wait for the finish event to know when the write has completed.
This lets us save arbitrarily large rendered pages without buffering the whole PDF in memory; the streams pipe the content to disk efficiently.
Benefits:
- Can download files of any size without memory limits.
- Streams handle backpressure and smooth downloads.
Downsides:
- Requires more code to handle streams properly.
- Only captures what Chrome can render as a PDF, and the raw content goes to disk rather than into a variable.
Overall, streaming is preferred for large files where buffering the entire contents in memory is impractical.
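For large files that aren't rendered pages at all (ZIP archives, CSV exports, video), a common workaround is to reuse the browser's session cookies and stream the download with Node's built-in fetch instead of going through Chrome. A sketch, assuming Node 18+ and a hypothetical download URL:

const fs = require('fs');
const { Readable } = require('stream');
const { pipeline } = require('stream/promises');

// Copy the page's cookies so the request stays authenticated
const cookies = await page.cookies();
const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join('; ');

// Stream the response body straight to disk without buffering it
const response = await fetch('https://example.com/files/big-export.zip', {
  headers: { cookie: cookieHeader }
});
await pipeline(Readable.fromWeb(response.body), fs.createWriteStream('./big-export.zip'));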
Handling Multiple Parallel Downloads
Sometimes clicking a single download button triggers several files to download at once. Puppeteer has no high-level download event (page.waitForEvent('download') is a Playwright API, not a Puppeteer one), but the DevTools Protocol emits download events we can listen for. We need to wait for all of them to report completion before continuing our script.
Here is an example to handle parallel downloads:
// Attach a CDP session at the browser level and enable download events
const client = await browser.target().createCDPSession();
await client.send('Browser.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: downloadsFolder, // from the earlier example
  eventsEnabled: true
});

// Track downloads by GUID until all report completion
const inProgress = new Set();
let resolveAll;
const allDone = new Promise(resolve => { resolveAll = resolve; });

client.on('Browser.downloadWillBegin', e => inProgress.add(e.guid));
client.on('Browser.downloadProgress', e => {
  if (e.state === 'completed') {
    inProgress.delete(e.guid);
    // Naive check: assumes all downloads begin before the first one finishes
    if (inProgress.size === 0) resolveAll();
  }
});

// Click download button to trigger multiple downloads
await page.click('#download-btn');

// Wait for every download to finish
await allDone;
The key steps are:
- We send Browser.setDownloadBehavior with eventsEnabled: true over a browser-level CDP session so Chrome emits download events.
- Browser.downloadWillBegin fires once per download, letting us register each file by its GUID.
- Browser.downloadProgress fires as each download advances; when its state reaches completed, we mark that file done.
- Once every registered download has completed, the promise resolves and the script continues. The finished files land in downloadsFolder under their suggestedFilename.

This ensures we wait for all parallel downloads to finish before continuing. The same Browser.downloadProgress events also carry receivedBytes and totalBytes, so you can use them to track progress, as shown below.
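For example, a minimal progress logger on the same CDP session (a sketch; totalBytes can be 0 when the server doesn't send a content length):

client.on('Browser.downloadProgress', e => {
  if (e.state === 'inProgress' && e.totalBytes > 0) {
    const pct = ((e.receivedBytes / e.totalBytes) * 100).toFixed(1);
    console.log(`Download ${e.guid}: ${pct}%`);
  }
});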
Troubleshooting: Common Download Issues
There are a few common issues that may arise when downloading files with Puppeteer:
- Browser blocks downloads in headless mode: Older headless Chrome refuses downloads unless a download behavior is configured. Set headless: false when launching, or explicitly allow downloads with the setDownloadBehavior command shown earlier.
- PDF links open in the viewer instead of downloading: Chrome renders PDFs in its built-in viewer rather than saving them. Fetch the PDF's URL directly (the buffer approach above), or use page.pdf() / createPDFStream() when you want a PDF of the rendered page itself.
- Can't fetch cross-origin resources: Fetching inside page.evaluate() is subject to CORS. Launching Chromium with the --disable-web-security flag can work around this.
- Authenticated or cookie-based downloads failing: You may need to set cookies or authentication headers before requesting protected resources (see the sketch after this list).
- Downloads hang or time out: There are many reasons downloads can hang; restarting the browser instance often fixes it. Also try pausing briefly before checking for the file, e.g. await new Promise(r => setTimeout(r, 250)) — note that page.waitFor() has been removed from recent Puppeteer versions.
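For the authenticated-downloads case, a minimal sketch using page.setCookie() and page.setExtraHTTPHeaders() (the cookie and token values are placeholders):

// Set a session cookie before visiting the protected page
await page.setCookie({
  name: 'session_id',          // placeholder cookie name
  value: 'your-session-value', // placeholder value
  domain: 'example.com'
});

// Or attach an Authorization header to every request the page makes
await page.setExtraHTTPHeaders({
  Authorization: 'Bearer your-token' // placeholder token
});

await page.goto('https://example.com/protected-files');
await page.click('#download-button');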
Comparing Puppeteer Download Techniques
Throughout this guide, we covered several different approaches to downloading files with Puppeteer:
| Method | How it Works | When to Use |
|---|---|---|
| Save to Filesystem | Click download links and save directly to disk | Simple background downloads |
| Download to Buffer | Fetch in page.evaluate() and return to memory | Get file contents immediately |
| Stream Downloads | Pipe a browser PDF stream to a writable file stream | Large downloads without memory limits |
| Handle Multiple | Wait for all CDP download events to report completion | When one click triggers multiple downloads |
Here is a quick comparison of the pros and cons of each approach:
Save to Filesystem
- + Simplest approach
- + Downloads in background
- – No access to file content
- – Async, so it's unclear when the download finishes
Download to Buffer
- + Get file contents immediately
- + Works with any download link
- – Loads fully into memory
Stream Downloads
- + No memory size limits
- + Efficient piping to disk
- – More complex code
- – No buffer variable
Handle Multiple
- + Wait for all downloads to complete
- + Can process each file separately
- – More event-handling logic required
Consider which benefits and limitations make sense for your specific use case when deciding on an approach.
Final Thoughts
Downloading files from the web is a common need for automation and scraping projects. No matter what your use case is, Puppeteer likely provides a way to download files from websites programmatically. With these approaches, you should have all the tools needed to integrate file downloading into your Puppeteer web scraper or automation script.
I hope this guide provided you with a comprehensive overview of the various techniques and best practices for downloading files with Puppeteer. Automated downloading will take your web scraping and testing projects to the next level.