Web Scraping with Node.js and JavaScript

Web scraping is the process of extracting data from websites automatically through code. With the rise of dynamic, JavaScript-heavy websites, Node.js has become a popular platform for building scrapers thanks to its asynchronous architecture and vast ecosystem of scraping packages.

In this comprehensive guide, we'll cover the key steps and tools for building a robust web scraper with Node.js and JavaScript. Whether you're new to web scraping or looking to improve your Node.js scraping skills, this guide will provide the knowledge to build production-ready scrapers. Let's get started!

The Rising Role of Web Scraping Across Industries

Web scraping facilitates data gathering for a widening range of industries and use cases:

  • E-commerce – Monitoring competitor pricing, inventory, and customer reviews
  • Travel – Aggregating flight/hotel deals, availability from OTAs
  • Finance – Analyzing earnings transcripts, executive changes, market news
  • Automotive – Compiling used car classifieds, dealer inventories
  • Real Estate – Assembling rental listings, agent contact info
  • Marketing – Building targeted prospect lists, email databases
  • Supply Chain – Tracking inventory levels, logistics data

This data provides vital signals for business strategy and operations. At the same time, the shift toward dynamic web apps driven by JavaScript has made scraping more challenging: on many sites, static server-side rendering has given way to client-side JavaScript that modifies page structure on the fly.

Node.js has become a top choice for scraping today's complex sites thanks to:

  • Asynchronous Architecture – Node.js utilizes asynchronous I/O and an event loop, allowing many requests and processing steps to run concurrently without the overhead of managing threads – a good fit for I/O-heavy scraping workloads.
  • npm Ecosystem – npm, the Node.js package manager, provides thousands of modules covering every scraping need – HTTP clients, HTML parsers, proxy management, and more.
  • JS Runtime – Running JavaScript end-to-end means scrapers can directly execute page code for fully rendered HTML.

Let's explore the key steps and tools for building robust scrapers in Node.js's flexible environment.

Making Robust HTTP Requests with Axios

The first step in any scraping workflow is making HTTP requests to target sites and downloading the response content. Node.js offers many HTTP client options, including the built-in http/https modules, the now-deprecated Request, Got, and of course Axios.

Axios stands out for its approachable API, robust feature set and active maintenance. It simplifies both making requests and handling responses across Node.js and browsers.

HTTP Request Types

Axios supports all commonly used HTTP request types:

GET – Retrieve a resource from the server. This is the most common request type used in scraping.

const axios = require('axios');

const { data } = await axios.get('https://api.example.com/products');

POST – Send data to the server, often used for form submissions.

const payload = {
  email: 'user@example.com',
  password: 'securepass123'
};

await axios.post('/login', payload);

PUT – Fully update/replace a resource on the server.

const update = {
  title: 'New ebook title'  
}

await axios.put('/books/123', update);

PATCH – Partially update a resource on the server.

await axios.patch('/users/123', {
  email: 'newemail@example.com'
});

DELETE – Delete a resource from the server.

await axios.delete('/posts/456');

Axios includes built-in support for serializing JavaScript objects into JSON for sending request data. For non-trivial scraping pipelines, you'll likely leverage a mix of GETs for page downloads and POSTs for data submissions like searches, filters, and logins.
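
For example, a search form might be submitted as a POST while result pages are downloaded with GETs. A minimal sketch, with an illustrative endpoint and field names:

// Hypothetical search endpoint – Axios serializes the object to JSON
// and sets the Content-Type header automatically
const { data: results } = await axios.post('https://www.example-site.com/search', {
  query: 'wireless headphones',
  page: 1
});

console.log(results);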

Setting Request Headers

HTTP headers provide metadata about the client making the request. Mimicking a real web browser's headers is vital for avoiding scraper detection. Common headers that need to be set properly include:

  • User-Agent – Identifies the browser name, engine, OS, etc.
  • Accept – The content-types the client can accept.
  • Accept-Encoding – Supported compression formats.
  • Accept-Language – Preferred language for response text.

With Axios we can easily set default headers that apply to all requests:

const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
  'Accept': 'application/json, text/plain, */*' 
};

const client = axios.create({
  headers: headers
});

Now any requests through our client will contain these headers by default. I recommend maintaining a list of current browser User-Agents to randomly select from. Sites may blacklist specific agents after a period. Rotating the User-Agent and other headers helps ensure your scraper stays stealthy.
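
A minimal sketch of that rotation using an Axios request interceptor on the client created above (the truncated User-Agent strings are illustrative):

const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  'Mozilla/5.0 (X11; Linux x86_64)...'
];

// Pick a random User-Agent for every outgoing request
client.interceptors.request.use(config => {
  config.headers['User-Agent'] =
    userAgents[Math.floor(Math.random() * userAgents.length)];
  return config;
});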

Managing Cookies and Sessions

Many sites involve workflows that span multiple pages and require maintaining a session state. For example, adding items to a cart, proceeding to checkout, and completing a transaction. HTTP cookies store session data between requests. Handling cookies is essential for scraping sites with login functionality or flows across pages.

By default, Axios does not automatically handle cookies. We need to integrate an additional module like axios-cookiejar-support which adds this capability:

const { wrapper } = require('axios-cookiejar-support');
const { CookieJar } = require('tough-cookie');

const cookieJar = new CookieJar();

// wrapper() teaches the Axios instance to read and write the cookie jar
const client = wrapper(axios.create({
  jar: cookieJar
}));

Now any cookies set by the server will be persisted in cookieJar and sent with follow-up requests.
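
For example, a minimal sketch of a login flow built on the cookie-aware client above (the URLs and form fields are placeholders):

// Log in once – the session cookie is stored in cookieJar
await client.post('https://www.example-site.com/login', {
  email: 'user@example.com',
  password: 'securepass123'
});

// Follow-up requests automatically send the session cookie
const { data } = await client.get('https://www.example-site.com/account/orders');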

For scraping contexts like ecommerce or classifieds that involve traversing catalog pages, item details, search filters, and checkout workflows – properly tracking cookies is required.

Optimizing Performance

Two key facets of request performance are:

Concurrency – Node.js's asynchronous nature allows requests to execute concurrently. Scraping logic should take advantage of this via Promise.all:

const urls = [/* list of 100 urls */]; 

const requests = urls.map(url => axios.get(url));

await Promise.all(requests);

This kicks off all requests in parallel rather than waiting for each to finish.
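
One caveat: Promise.all rejects as soon as any single request fails, discarding the rest of the batch. When partial results are acceptable, Promise.allSettled keeps the successful responses. A small sketch:

const results = await Promise.allSettled(requests);

// Keep only the requests that succeeded
const successful = results
  .filter(result => result.status === 'fulfilled')
  .map(result => result.value.data);

console.log(`Fetched ${successful.length} of ${urls.length} pages`);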

Retries – Requests often fail due to network issues. Retrying with backoff is essential:

const request = async (url, retries = 3, delay = 5000) => {

  try {
    return await axios.get(url);
  } catch (e) {
    console.log(e.message);

    if (retries === 0) throw e; // out of attempts – surface the error

    // Wait, then retry with double the delay (exponential backoff)
    await new Promise(resolve => setTimeout(resolve, delay));

    return request(url, retries - 1, delay * 2);
  }

};

await request('https://example.com/page');

This recursive approach retries a failed request up to three times, starting with a 5-second delay and doubling it after each attempt. Robust error handling and retries will avoid losing data to transient issues.

Now that we know how to make optimized, production-ready requests, let's look at processing the response HTML.

Parsing Complex HTML with Cheerio and XPath

To extract the data points we need, we must parse the HTML programmatically. Node.js provides two main approaches:

  • CSS selectors – Query DOM elements using selector syntax. The Cheerio library enables this style of parsing.
  • XPath – Traverse DOM structures using XPath query language. Modules like xpath and xmldom provide XPath parsing.

Let's explore examples of each approach…

Cheerio – Fast CSS Selectors

Cheerio allows parsing HTML using the same familiar CSS selector API as jQuery:

const cheerio = require('cheerio');

const html = `
  <div class="post">
    <h2 class="title">Post Title</h2> 
    <div class="metadata">
      <div class="views">1,337 views</div>
      <div class="comments">10 comments</div>
    </div>
  </div>
`;

const $ = cheerio.load(html);

const title = $('.post .title').text(); // Post Title

Selectors provide a concise, expressive way to query elements. Some useful selectors include:

  • el – Tag name
  • .class – Class name
  • #id – ID
  • el1 el2 – Descendant
  • el1 > el2 – Direct child
  • [attr=value] – Attribute equality

We can also call useful DOM manipulation methods like:

  • text() – Get inner text
  • html() – Get inner HTML
  • attr(name) – Get attribute value

Cheerio is essentially server-side jQuery implemented in Node.js. The familiar API lowers the learning curve compared to other parsers.
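
As a quick illustration, a short sketch combining selectors with these methods against the sample HTML loaded above:

// Extract single values by class
const views = $('.post .metadata .views').text();       // 1,337 views
const comments = $('.post .metadata .comments').text(); // 10 comments

// .map() iterates matched elements; .get() converts the result to a plain array
const metadata = $('.post .metadata div')
  .map((i, el) => $(el).text())
  .get(); // [ '1,337 views', '10 comments' ]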

XPath – Advanced Querying

For complex page structures, XPath provides extremely flexible querying capabilities using navigational paths and predicates. The xpath and xmldom modules let us run XPath queries in Node.js:

const xpath = require('xpath');
const dom = require('xmldom').DOMParser;

// Parse an HTML string (e.g. the sample post markup from the Cheerio section)
const doc = new dom().parseFromString(html);

xpath.select('//h2[@class="title"]/text()', doc);

XPath offers a rich set of functions, axes, and abbreviations, including:

  • node() – Matches any node type
  • text() – Selects text nodes
  • @attr – Selects an attribute
  • .. – The parent of the current node
  • // – Recursive descent through all descendants

And expressions like:

  • nodename – Tag name
  • [expr] – Predicate
  • a | b – Union
  • a / b – Hierarchy

This provides extreme flexibility for handling even the most complex page markup.
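
A short sketch against the same sample post HTML from the Cheerio section, using a class predicate and the string() function (the markup is assumed to be well-formed):

const xpath = require('xpath');
const { DOMParser } = require('xmldom');

const doc = new DOMParser().parseFromString(html);

// Predicate on the class attribute, then take the string value of the node
const views = xpath.select('string(//div[@class="views"])', doc);
const comments = xpath.select('string(//div[@class="comments"])', doc);

console.log(views, comments); // 1,337 views 10 comments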

When to Use Each Approach

  • Cheerio is great for straightforward use cases with simple, stable selectors. It has a gentler learning curve.
  • XPath shines on complex or unstable page structures, where its predicates and axes can express queries that plain CSS selectors cannot.

I typically recommend starting with Cheerio for most scraping tasks and then utilizing XPath for the minority of cases that require advanced selectors. Now that we can sling HTTP requests and parse the ensuing HTML, our scrapers are ready to run! But when operating at scale, additional challenges arise…

Handling Common Scraping Challenges

While HTTP requests and HTML parsing form the core of any scraper, additional challenges frequently impede scraping success:

  • Bot protection services identify and block scrapers
  • Captchas and cookie consent gates prevent access
  • JavaScript rendering is required for dynamic SPAs
  • Rate limits need management to avoid bans

Let's explore industry-standard techniques to overcome each obstacle.

Bypassing Cloudflare, Distil Networks, and Others

Top sites are shielded by anti-bot services like Cloudflare, Distil Networks, and Imperva. They fingerprint visitor characteristics to distinguish scrapers from real users. Common protections include:

  • Browser validation – Testing for real browser headers, WebGL, fonts, etc.
  • Behavior analysis – Detecting non-human click patterns
  • JavaScript challenges – Requiring code execution to pass
  • Rate limiting – Banning scrapers exceeding thresholds

Residential proxy services like Bright Data, Smartproxy, Proxy-Seller, and Soax provide thousands of fresh IP addresses from real home networks every month. This allows your scraper to appear as many different residential users.

Here is how to configure proxies with Axios:

// Proxy endpoint and credentials from your provider (placeholder values below)
const client = axios.create({
  proxy: {
    protocol: 'http',
    host: 'proxy.example-provider.com',
    port: 22225,
    auth: {
      username: 'user',
      password: 'pass'
    }
  }
});

Now requests will be routed through proxy IPs.
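
When your provider exposes a pool of endpoints rather than a single gateway, a common pattern is to rotate them per request. A minimal sketch with placeholder hosts and credentials:

const proxies = [
  { host: 'proxy1.example-provider.com', port: 22225 },
  { host: 'proxy2.example-provider.com', port: 22225 },
  { host: 'proxy3.example-provider.com', port: 22225 }
];

// Pick a random proxy from the pool for each request
const fetchViaProxy = url => {
  const { host, port } = proxies[Math.floor(Math.random() * proxies.length)];

  return axios.get(url, {
    proxy: { protocol: 'http', host, port, auth: { username: 'user', password: 'pass' } }
  });
};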

Residential proxies cost more than datacenter proxies, but they are far more effective at bypassing fingerprinting defenses, and renting access is typically far cheaper than building and maintaining your own proxy infrastructure.

Solving CAPTCHAs and Cookie Consent

Emerging privacy regulations like GDPR have led sites to add reCAPTCHA and cookie consent gates – extra hoops for scrapers to jump through. Thankfully services like BrightData offer integrations to automatically:

  • Solve Google reCAPTCHA v2 and v3 challenges
  • Consent to cookie tracking restrictions
  • Manage consent withdrawal after scraping

This seamlessly bypasses compliance gates without any extra coding.

Rendering JavaScript for SPAs

Modern sites rely heavily on JavaScript to render content on the client side rather than server side. Scrapers must execute JavaScript to retrieve the full post-processed HTML. Headless browsers like Puppeteer provide a complete runtime for dynamic execution:

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Wait until network activity settles so client-side rendering completes
await page.goto('https://spa.example.com', { waitUntil: 'networkidle0' });

const html = await page.content(); // fully rendered HTML

await browser.close();
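
The rendered HTML can then be handed straight to Cheerio for extraction. A small sketch, reusing the selectors from earlier:

const cheerio = require('cheerio');

// Parse the Puppeteer-rendered HTML with the same Cheerio workflow as before
const $ = cheerio.load(html);

const titles = $('.post .title').map((i, el) => $(el).text()).get();
console.log(titles);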

Services like BrightData offer a JavaScript rendering API requiring no complex Puppeteer setup.

Managing Rate Limits

To prevent overwhelming their servers, sites enforce rate limits – a maximum request ceiling per time window like 10 requests/second. Scrapers should implement throttling to respect these restrictions using modules like Bottleneck:

const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({
  minTime: 100 // at most 10 requests per second
});

const scrape = async () => {
  // scraping logic...
};

for (let i = 0; i < 1000; i++) {
  await limiter.schedule(scrape);
}

This enforces a minimum 100ms delay between each request. Scraping responsibly within defined rate limits is key to maintaining site access long-term.
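
Bottleneck also composes naturally with the concurrent request pattern from earlier: schedule every URL up front and let the limiter pace the actual requests. A sketch reusing the urls list and Axios client from the previous sections:

// All requests are queued immediately but dispatched at most 10 per second
const limited = urls.map(url => limiter.schedule(() => client.get(url)));

const responses = await Promise.all(limited);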

Comprehensive Scraping Script Walkthrough

Let's tie together everything we've covered by walking through a complete scraping script:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');
const { stringify } = require('csv-stringify/sync');

// Configure default headers
const headers = {
  'User-Agent': 'Mozilla/5.0...',
};

const client = axios.create({
  headers,

  // Base URL
  baseURL: 'https://www.example-site.com/'
});

const scrapeProducts = async () => {

  // Load homepage
  const { data } = await client.get('/');

  // Parse HTML
  const $ = cheerio.load(data);

  // Extract product links
  const links = $('.product a').map((i, el) => {
    return $(el).attr('href');
  }).get();

  // Scrape each product page
  const products = [];

  for (const link of links) {

    const { data } = await client.get(link);

    const $ = cheerio.load(data);

    // Extract fields
    const product = {
      title: $('.product-title').text(),
      description: $('.product-description').text(),
      //...
    };

    products.push(product);
  }

  // Output CSV with a header row
  fs.writeFileSync('products.csv', stringify(products, { header: true }));
};

scrapeProducts();

This covers the key steps:

  • Initialize request client – With appropriate headers and base URL
  • Fetch homepage – Make initial request to get links
  • Parse HTML – Use Cheerio to extract product URLs
  • Iterate product pages – Loop through URLs, fetching and parsing each
  • Extract fields – Use Cheerio to parse page and get product details
  • Output results – Here saving to a CSV file for further analysis

While basic, this illustrates the end-to-end scraping process combining the foundations covered in this guide.

Conclusion

In this guide we covered fundamental concepts like making HTTP requests with Axios, parsing HTML with Cheerio and XPath, and overcoming common scraping challenges with proxies, headless rendering, and throttling.

There are certainly more complexities that arise, but this should provide a solid foundation for building robust scrapers with Node.js and JavaScript. The asynchronous event-driven architecture of Node.js combined with its vast ecosystem make it an ideal environment for today's dynamic websites.

To learn more, refer to the Axios, Cheerio, and Bottleneck documentation. Services like BrightData can also provide valuable tools and proxies to simplify and scale your scraping projects.

John Rooney

I'm John Watson Rooney, a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
