TypeScript has rapidly matured into a robust platform for developing large-scale web scrapers. With its optional typing and seamless integration with Node.js, TypeScript makes it possible to craft complex crawlers that handle sizable data extraction jobs with reliability and performance.
In this guide, we will dive deep into common web scraping concepts and patterns using TypeScript. We'll cover the key techniques you need to create scrapers that can power data pipelines at any scale.
Getting Started with a TypeScript Scraping Project
Let's first look at how to set up a scraping project in TypeScript:
Installing Required Packages
To start, we need to create a Node.js project and install a couple of essential packages:
```bash
npm init -y
npm install axios cheerio
```
This will initialize a `package.json` file and install:
- Axios: Popular HTTP client for making web requests
- Cheerio: HTML parsing and manipulation library for extracting data
There are alternatives like node-fetch or request for HTTP requests and jsdom or x-ray for parsing. However, axios and cheerio are the most battle-tested combination for web scraping work.
Running TypeScript Code
We have two options to run TypeScript code:
- Transpile to JavaScript: We can compile `.ts` code to `.js` using `tsc` and then run it on Node.js normally.
- Use ts-node: This allows directly running `.ts` files without transpiling by wrapping Node.js.
For development simplicity, we'll use `ts-node`:
```bash
npm install -D ts-node
```
Now we can run TypeScript code directly with `npx ts-node`.
Writing Our First Scraper
Let's create an `index.ts` file and write our starter scraper:
```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

async function main() {
  const url = 'https://example.com';
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  const title = $('h1').text();
  console.log({ title });
}

main();
```
Here's what it does:
- Uses axios to make a GET request to the URL
- Loads the returned HTML into Cheerio
- Extracts the `<h1>` element's text
- Prints out the title
We can run it with:
```bash
npx ts-node index.ts
```
And we have a simple TypeScript scraper ready! Now let's dive deeper into robust scraping approaches.
Making Reliable Web Requests
To scrape at scale, we need proper request logic to handle errors and avoid blocks. Let's go over some best practices.
Auto Retrying Failed Requests
Network requests often fail randomly in complex cloud environments. The failure could be anything: DNS issues, stale sockets, read timeouts, unstable connections, and so on. According to Cloudflare, the chance of any given request failing is 1-2% on average, and for JavaScript clients specifically the failure rate can reach 3-4%.
Retrying failed requests drastically improves reliability. We should retry up to 3-5 times with delays before giving up. The axios-retry package makes this easy:
```typescript
import axios from 'axios';
import axiosRetry from 'axios-retry';

axiosRetry(axios, { retries: 3 });

const response = await axios.get(url);
```
Now all requests made via this axios instance will automatically retry up to 3 times on failure. We can also configure:
- Exponential backoff between retries
- Conditionally retrying only for certain errors
- Retry counts per request
Proper retry handling keeps sporadic issues from derailing a job, so we end up scraping as much data as possible.
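Putting those options together, here's a sketch based on the axios-retry README (the retry condition shown is just one possible policy, not the only sensible one):

```typescript
import axios from 'axios';
import axiosRetry from 'axios-retry';

const url = 'https://example.com';

axiosRetry(axios, {
  retries: 3,
  // Exponential backoff between attempts (~1s, ~2s, ~4s)
  retryDelay: axiosRetry.exponentialDelay,
  // One possible policy: retry network errors, rate limits, and 5xx responses
  retryCondition: (error) =>
    !error.response || error.response.status === 429 || error.response.status >= 500,
});

// The retry count can also be overridden for a single request
const response = await axios.get(url, {
  'axios-retry': { retries: 5 },
});
```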
Setting Optimal Request Headers
Request headers contain important metadata about the client making the request. We need to set proper headers to ensure sites see requests coming from a real browser. Some crucial headers are:
- User-Agent: Probably the most important header; it identifies the client browser, OS, and version. Without a realistic browser User-Agent, we can easily be detected as a scraper.
- Accept: Specifies accepted content types like `text/html`. Needs to match real browsers.
- Accept-Language: Browser language preferences for localization.
- Referer: Contains previous page URL, often required to avoid blocks.
There are many other headers like Accept-Encoding, DNT, and Cache-Control that may be needed. In axios, we can set custom headers like so:
```typescript
const headers = {
  'User-Agent':
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
  'Accept': 'text/html',
};

await axios.get(url, { headers });
```
Mimicking headers from a real browser is essential for avoiding bot blocks while scraping.
Rotating Random Proxy Servers
Scraping from a single IP risks getting blocked, especially when making many requests quickly. Proxy servers spread requests across multiple IPs, making detection harder. Popular rotating proxy providers include Bright Data, Smartproxy, Proxy-Seller, and Soax.
Here's how to route requests via a proxy with axios:
```typescript
const response = await axios.get(url, {
  proxy: {
    protocol: 'http',
    host: '192.168.1.42',
    port: 3128,
  },
});
```
This tunnels the request via the proxy IP.
Key considerations for proxies:
- Multiple countries – Using different geographic regions reduces suspicion
- ASN diversity – Spread across multiple Autonomous Systems and providers
- IP rotation – Each proxy IP should only be used a few times
- Proxy types – Residential proxies work better than datacenter IPs
Regularly rotating thousands of proxies makes scraping consistently robust at scale.
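A minimal rotation sketch, assuming a hypothetical pool of proxy endpoints (real providers usually hand you a rotating gateway or an API instead):

```typescript
import axios from 'axios';

// Hypothetical pool of proxy endpoints; in practice these come from your provider
const proxies = [
  { host: '192.168.1.42', port: 3128 },
  { host: '192.168.1.43', port: 3128 },
  { host: '192.168.1.44', port: 3128 },
];

function randomProxy() {
  return proxies[Math.floor(Math.random() * proxies.length)];
}

async function fetchViaProxy(url: string) {
  // Pick a different proxy for every request to spread out the traffic
  return axios.get(url, {
    proxy: { protocol: 'http', ...randomProxy() },
  });
}

const response = await fetchViaProxy('https://example.com');
```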
Parsing Page Content
Once we get page HTML, we need to extract the data we need intelligently. Let's explore common parsing approaches.
CSS Selector-Based Extraction
This is the most popular approach, using jQuery-style selector queries. Cheerio allows running CSS selectors on HTML. Consider this sample HTML:
<div class="product"> <h3 class="name">XPhone Ultra</h3> <p class="description"> A high-end phone with ultimate features </p> <div class="pricing"> <span class="price">$899</span> </div> </div>
Here are some ways to extract data using Cheerio selectors:
```typescript
const $ = cheerio.load(html);

const name = $('.product .name').text(); // XPhone Ultra
const description = $('.description').text(); // A high-end phone...
const price = $('.pricing .price').text(); // $899
```
This makes parsing arbitrarily complex HTML very easy and concise. Some things to watch out for are handling duplicate fields, missing fields, and nested sub-data. Overall, Cheerio makes HTML scraping hassle-free, especially for well-structured sites.
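For instance, handling repeated product blocks and possibly missing fields on a listing page might look like this (a sketch reusing the sample markup above; `html` holds the page source):

```typescript
import * as cheerio from 'cheerio';

// html holds the listing page markup (several .product blocks like the sample above)
const $ = cheerio.load(html);

const products = $('.product')
  .map((_, el) => {
    const product = $(el);
    return {
      name: product.find('.name').text().trim(),
      // Missing fields come back as empty strings, so fall back to null explicitly
      description: product.find('.description').text().trim() || null,
      price: product.find('.pricing .price').text().trim() || null,
    };
  })
  .get();
```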
Using Regular Expressions
For scraping challenging HTML, regular expressions can sometimes be easier than selectors. Let's extract the price from this messy HTML:
```html
<div>
  Price only <b>$899</b> for today!
</div>
```
Using a regex with capture groups lets us cleanly extract the price:
```typescript
const html = `<div>Price only <b>$899</b> for today!</div>`;

const match = html.match(/Price only <b>(\$\d+)<\/b>/);
const price = match?.[1]; // $899
```
For robust regex parsing, we need to:
- Use capture groups properly
- Avoid greedy matching
- Handle optional parts
- Take care of whitespace and newlines
Overall, selectors are usually a better option but regexes can help in one-off scenarios.
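When you do reach for a regex, a slightly more defensive version of the price pattern might look like this (a sketch, not a general-purpose HTML parser):

```typescript
const html = `<div>
  Price   only <b> $899 </b> for today!
</div>`;

// Tolerate variable whitespace/newlines and make the cents part of the price optional
const priceRegex = /Price\s+only\s*<b>\s*(\$\d+(?:\.\d{2})?)\s*<\/b>/;

const price = html.match(priceRegex)?.[1]; // "$899"
```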
Parsing via the DOM
Instead of using string manipulation, we can parse HTML by modeling it as a DOM tree. The jsdom package allows navigating HTML as DOM nodes:
```typescript
import { JSDOM } from 'jsdom';

const dom = new JSDOM(html);
const price = dom.window.document.querySelector('.price')?.textContent;
```
This provides native browser DOM APIs for scraping directly. Some key benefits of the DOM approach are:
- More intuitive than text processing
- Naturally handles nested structures
- Enables clicking, typing, events etc. programmatically
The downside is DOM manipulation can be verbose and complex for large scraping jobs.
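Verbosity aside, here's what walking nested structures looks like with standard DOM APIs (a sketch reusing the earlier product markup; `html` holds the page source):

```typescript
import { JSDOM } from 'jsdom';

// html holds the listing page markup from the earlier product sample
const dom = new JSDOM(html);
const { document } = dom.window;

// Standard DOM traversal copes naturally with nested markup
const products = Array.from(document.querySelectorAll('.product')).map((el) => ({
  name: el.querySelector('.name')?.textContent?.trim(),
  price: el.querySelector('.pricing .price')?.textContent?.trim(),
}));
```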
Scraping Helpers
Writing parsers for real-world sites with lots of dynamic data can be challenging. Scraping helpers like Apify and ScraperAPI provide declarative APIs to extract data without needing to use selectors, regexes or DOM manipulation directly.
For example, such a helper might expose a declarative API along these lines (illustrative; the exact API varies by library):
```typescript
import { scrapePage } from 'page-scraper';

const data = await scrapePage({
  url: 'https://www.example-shop.com/products/abc123',
  fields: {
    title: 'h1',
    description: { sel: 'div.description' },
    price: { sel: 'span.price', how: 'text' },
    image: { sel: 'img.main-image', attr: 'src' },
  },
});

console.log(data);
/*
{
  title: "Green Shirt",
  description: "High quality green shirt...",
  price: "$29.99",
  image: "images/green-shirt.jpg"
}
*/
```
Scraping helpers greatly simplify extracting even complex nested data. Under the hood they still use some combination of selectors, regexes, and DOM scraping but expose a higher-level API. They also handle common needs like:
- Scraping paginated listings across multiple pages
- Following links from page to page
- Reading JSON and JavaScript
- Handling client side rendering
For complex scraping jobs, they are highly recommended over writing everything from scratch.
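To appreciate what these helpers save you, here's a minimal hand-rolled pagination loop with axios and cheerio (hypothetical URL and selectors):

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

// Hypothetical listing URL and selectors; adjust to the target site
async function scrapeAllPages(startUrl: string) {
  const names: string[] = [];
  let nextUrl: string | undefined = startUrl;

  while (nextUrl) {
    const { data } = await axios.get(nextUrl);
    const $ = cheerio.load(data);

    // Collect the fields we care about on the current page
    $('.product .name').each((_, el) => {
      names.push($(el).text().trim());
    });

    // Follow the "next page" link until there isn't one
    const nextHref = $('a.next-page').attr('href');
    nextUrl = nextHref ? new URL(nextHref, nextUrl).toString() : undefined;
  }

  return names;
}

const allNames = await scrapeAllPages('https://www.example-shop.com/products?page=1');
```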
Composing Robust and Scalable Scrapers
Now that we have seen core scraping techniques let's look at best practices for architecting full-featured scrapers.
Separation of Concerns
For any non-trivial scraper, we need to break it down into logical components:
- Requester: Handles making HTTP requests with retries, headers etc.
- Parser: Extracts data from HTML content
- Storage: Persists scraped data to databases
- Job Queue: Coordinates parsing jobs for scaling
- Web Application: Provides monitoring capabilities
This separation of concerns makes robust scrapers easier to maintain and scale.
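One way to express this split in TypeScript is with small interfaces per concern (a sketch; the exact shapes will vary by pipeline):

```typescript
// A rough sketch of the component boundaries
interface Requester {
  fetch(url: string): Promise<string>; // raw HTML, with retries/headers handled internally
}

interface Parser<T> {
  parse(html: string): T[]; // typed records extracted from a page
}

interface Storage<T> {
  save(records: T[]): Promise<void>; // persist results to a database
}

interface JobQueue {
  enqueue(url: string): Promise<void>;
  next(): Promise<string | undefined>;
}

// A scraping job is then just a composition of these pieces
async function runJob<T>(url: string, requester: Requester, parser: Parser<T>, storage: Storage<T>) {
  const html = await requester.fetch(url);
  await storage.save(parser.parse(html));
}
```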
Asynchronous Coordination
Scraping typically involves multiple sequential or concurrent actions like:
- Requesting many pages asynchronously
- Parsing each one after another
- Persisting data in the background
In TypeScript, `async/await` syntax makes coordination simple:
```typescript
const htmls = await Promise.all(urls.map(fetchPage));

const data = [];
for (const html of htmls) {
  const scraped = await parsePage(html);
  data.push(scraped);
}

await persistData(data);
```
This allows us to chain scraping steps easily. For even more complex flows, libraries like Bull provide asynchronous queues and jobs.
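For illustration, a minimal Bull setup might look like this (a sketch assuming a local Redis instance and the fetchPage/parsePage/persistData helpers used above):

```typescript
import Queue from 'bull';

// Assumes Redis is running locally; fetchPage/parsePage/persistData come from the snippet above
const scrapeQueue = new Queue('scrape', 'redis://127.0.0.1:6379');

// Worker: each job scrapes one URL
scrapeQueue.process(async (job) => {
  const html = await fetchPage(job.data.url);
  const scraped = await parsePage(html);
  await persistData([scraped]);
});

// Producer: enqueue URLs to be scraped
await scrapeQueue.add({ url: 'https://example.com/page/1' });
```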
Handling Errors
In complex, long-running scraping jobs, errors can happen at any time. We need to handle them carefully to avoid losing data.
```typescript
try {
  const html = await fetchPage(url);
  const data = parsePage(html);
  await storeData(data);
} catch (err) {
  console.error(err);

  // Retry on transient errors
  if (isTransientError(err)) {
    queuePageForRetry(url);
  }
}
```
Robust error handling requires:
- Logging all errors
- Classification into transient and non-transient
- Retrying transient failures
This prevents scraper crashes and data loss. Additionally, we should track analytics like URLs scraped, data extracted, failures etc. to monitor scraper health.
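As an example, the `isTransientError` helper from the snippet above might classify axios errors roughly like this (a sketch; tune the rules to your targets):

```typescript
import { AxiosError } from 'axios';

// Hypothetical classifier for the isTransientError() check used above:
// network failures, timeouts, rate limits, and 5xx responses are worth retrying
function isTransientError(err: unknown): boolean {
  if (err instanceof AxiosError) {
    if (!err.response) return true; // DNS failures, resets, timeouts carry no response
    const status = err.response.status;
    return status === 429 || status >= 500;
  }
  return false;
}
```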
Scaling up Scrapers
To scrape truly large sites with millions of pages, we need to scale up our scraper:
- Run distributed – Spread across multiple servers using containers/Kubernetes
- Use caches – Cache page data, HTTP requests etc. where possible
- Partition work – Parallelize by geography, topics etc.
- Monitor queues – Watch job queues for optimal concurrency
There are also managed services like ScrapeOps that provide infrastructure to scale scrapers. With the right architecture, we can scale TypeScript scrapers to any needed capacity.
Going Beyond Basics with Advanced Scraping Techniques
Let's explore some advanced scraping capabilities by leveraging additional libraries.
Browser Automation for JavaScript Sites
A huge portion of sites rely heavily on JavaScript to render content. Plain HTTP requests are not enough to scrape these sites; we need to execute JavaScript by controlling an actual browser. Libraries like Playwright and Puppeteer provide APIs to drive real Chromium and Firefox browsers programmatically.
Here is an example with Playwright:
```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

await page.goto('https://dynamicpage.com');

// Wait for JS content to load
await page.waitForSelector('.loaded');

const html = await page.content(); // Fully rendered HTML

await browser.close();
```
This gives us the HTML after all JavaScript has executed, allowing us to scrape highly dynamic sites. Some things to watch out for are:
- Resource usage of browsers
- Navigation timings to avoid early HTML
- Handling browser fingerprints to avoid detection
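On the resource-usage point, one common trick is to block heavy assets before navigating (a sketch using Playwright's request interception; adjust the patterns to what your pages actually need):

```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Skip images, fonts, and media we don't need, which cuts bandwidth and memory use
await page.route('**/*.{png,jpg,jpeg,gif,webp,woff,woff2,mp4}', (route) => route.abort());

await page.goto('https://dynamicpage.com');
const html = await page.content();

await browser.close();
```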
Working Around CAPTCHAs
Large scrapers often have to deal with CAPTCHAs. Simple OCR CAPTCHAs can be automatically solved, but advanced ones like reCAPTCHA require human input. To handle these, we need integration with captcha solving services:
```typescript
// 1. Extract the captcha image URL from the page
const captchaUrl = $('.captcha img').attr('src');

// 2. Send it to a solving service like AntiCaptcha
const captchaText = await solveCaptcha(captchaUrl);

// 3. Submit the form with the solved text
await submitLoginForm(email, password, captchaText);
```
Key factors to consider are:
- Pricing models – per CAPTCHA, monthly subscriptions etc.
- Solution accuracy and speed
- Integrations – APIs, browser extensions etc.
Reliably solving CAPTCHAs removes a major scraping bottleneck.
Scraping Data from APIs
Web APIs provide structured data that does not require HTML parsing. Scraping APIs is much simpler in many cases. Here is an example of scraping a product API:
```typescript
const response = await axios.get('/api/v1/products?category=electronics');
const products = response.data;

for (const product of products) {
  console.log(product.name, product.price);
}
```
APIs have many advantages compared to HTML scraping:
- Structured data – JSON, CSV, XML etc. are easier to work with
- No rendering – No need to execute JavaScript
- Caching – API responses can be aggressively cached
- Pagination – Steady pagination via URL parameters, cursors etc.
- Documentation – Clear specification of fields, data types etc.
- Authentication – Can limit scraping through API keys
However, there are some downsides as well:
- Rate limiting – APIs tend to limit requests more aggressively
- Cost – APIs may have usage charges, paid tiers
- No user interface – Can't see data layouts visually
Overall, preferring API data over page scraping when possible makes building data pipelines easier. But HTML scraping is still needed where rich UIs exist without API access.
Scraping Real-Time Data Feeds
To get up-to-date data as it changes, we can tap into live data streams. WebSockets allow subscribing to real-time feeds from servers:
```typescript
import WebSocket from 'ws';

const socket = new WebSocket('wss://data.com/live');

socket.onmessage = (event) => {
  const data = event.data; // Live updates
  scrape(data);
};
```
WebSockets enable scraping dashboards, chat data, and other real-time sources. Server-Sent Events (SSE) are a simpler, unidirectional stream delivered over plain HTTP:
```typescript
// EventSource is built into browsers; in Node, use the 'eventsource' package
const stream = new EventSource('/updates');

stream.onmessage = (event) => {
  const data = event.data; // Live feed
  processData(data);
};
```
SSE streams work well for live server updates. GraphQL Subscriptions are WebSocket-based subscriptions popular in the GraphQL ecosystem, for example via the graphql-ws package:
```typescript
// Using the graphql-ws client
import { createClient } from 'graphql-ws';

const client = createClient({ url });

client.subscribe(
  {
    query: `
      subscription {
        newProducts {
          name
          price
        }
      }
    `,
  },
  {
    next: (data) => {
      // Handle new products as they arrive
    },
    error: console.error,
    complete: () => {},
  }
);
```
GraphQL provides a typed query language for both real-time and historical data. For truly up-to-date scraping, tapping into live data streams like these goes beyond what plain HTTP requests can offer.
Scraping Single Page Apps
Modern web apps rely heavily on JavaScript frameworks like React and Vue. These Single Page Apps (SPAs) pose a challenge for scraping since the content is managed client-side. Scraping libraries like Puppeteer and Playwright can drive SPAs like a real user:
```typescript
// Navigate to the SPA route (a full URL is required)
await page.goto('https://example.com/products');

// Click elements that trigger client-side loading, then wait for the new content
await page.click('.load-more');
await page.waitForSelector('.product'); // whatever selector marks the loaded items

// Extract SPA content after interactions
const html = await page.content();
```
Unlike static, SEO-friendly pages, an SPA's initial HTML contains little content, but this approach captures the data that only appears after JavaScript execution. Some best practices are:
- Use React/Vue devtools to understand data flow
- Reverse engineer network requests
- Watch out for loading indicators
- Mimic user actions like typing and clicking
Scraping JavaScript SPAs requires browser automation, unlike traditional multi-page sites.
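As an example of reverse engineering network requests, Playwright can capture the JSON the SPA fetches for itself instead of parsing the rendered DOM (a sketch reusing the `page` object from the earlier Playwright example; the `/api/products` filter is hypothetical):

```typescript
// Listen for the SPA's own API responses instead of scraping the rendered HTML
page.on('response', async (response) => {
  if (response.url().includes('/api/products') && response.ok()) {
    const payload = await response.json();
    console.log('API payload:', payload);
  }
});

await page.goto('https://example.com/products');
```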
Conclusion
And that concludes our comprehensive guide to professional web scraping with TypeScript and Node.js! With these patterns, we can build TypeScript scrapers of any complexity to power data projects. Robust coding practices keep our scrapers scalable and resilient, TypeScript's optional typing makes them sturdier than plain JavaScript, and the npm ecosystem supplies thousands of packages for everything from parsing to distributed job queues.
Overall, TypeScript strikes a great balance of productivity and scale to cover scraping needs from simple to enterprise.