TypeScript has rapidly matured into a robust platform for developing large-scale web scrapers. With its optional typing and seamless integration with Node.js, TypeScript makes it possible to craft complex crawlers that handle sizable data extraction jobs with reliability and performance.
In this guide, we will dive deep into common web scraping concepts and patterns using TypeScript. We'll cover the key techniques you need to create scrapers that can power data pipelines at any scale.
Getting Started with a TypeScript Scraping Project
Let's first look at how to set up a scraping project in TypeScript:
Installing Required Packages
To start, we need to create a Node.js project and install a couple of essential packages:
```bash
npm init -y
npm install axios cheerio
```
This will initialize a `package.json` file and install:
- Axios: Popular HTTP client for making web requests
- Cheerio: HTML parsing and manipulation library for extracting data
There are alternatives like node-fetch or request for HTTP requests and jsdom or x-ray for parsing. However, axios and cheerio are the most battle-tested combination for web scraping work.
Running TypeScript Code
We have two options to run TypeScript code:
- Transpile to JavaScript: We can compile `.ts` code to `.js` using `tsc` and then run it on Node.js normally.
- Use ts-node: This allows directly running `.ts` files without transpiling by wrapping Node.js.
For development simplicity, we'll use `ts-node`:
```bash
npm install -D ts-node
```
Now we can run TypeScript code directly with `npx ts-node`.
Writing Our First Scraper
Let's create an `index.ts` file and write our starter scraper:
```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

async function main() {
  const url = 'https://example.com';
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  const title = $('h1').text();
  console.log({ title });
}

main();
```
Here's what it does:
- Uses axios to make a GET request to the URL
- Loads the returned HTML into Cheerio
- Extracts the `<h1>` element's text
- Prints out the title
We can run it with:
```bash
npx ts-node index.ts
```
And we have a simple TypeScript scraper ready! Now let's dive deeper into robust scraping approaches.
Making Reliable Web Requests
To scrape at scale, we need proper request logic to handle errors and avoid blocks. Let's go over some best practices.
Auto Retrying Failed Requests
Network requests often fail randomly in complex cloud environments. The failure could be anything: DNS issues, stale sockets, read timeouts, unstable connections, and so on. According to Cloudflare, the chance of any given request failing is 1-2% on average, and for JavaScript clients specifically the failure rate can reach 3-4%.
Retrying failed requests drastically improves reliability. We should retry up to 3-5 times with delays before giving up. The axios-retry package makes this easy:
```typescript
import axios from 'axios';
import axiosRetry from 'axios-retry';

axiosRetry(axios, { retries: 3 });

const response = await axios.get(url);
```
Now all requests made via this axios instance will automatically retry up to 3 times on failure. We can also configure:
- Exponential backoff between retries
- Conditionally retrying only for certain errors
- Retry counts per request
Proper retry handling keeps sporadic issues from derailing a job, so we end up scraping as much data as possible.
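Putting those options together, here's a sketch based on the axios-retry README (the retry condition shown is just one possible policy, not the only sensible one):

```typescript
import axios from 'axios';
import axiosRetry from 'axios-retry';

const url = 'https://example.com';

axiosRetry(axios, {
  retries: 3,
  // Exponential backoff between attempts (~1s, ~2s, ~4s)
  retryDelay: axiosRetry.exponentialDelay,
  // One possible policy: retry network errors, rate limits, and 5xx responses
  retryCondition: (error) =>
    !error.response || error.response.status === 429 || error.response.status >= 500,
});

// The retry count can also be overridden for a single request
const response = await axios.get(url, {
  'axios-retry': { retries: 5 },
});
```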
Setting Optimal Request Headers
Request headers contain important metadata about the client making the request. We need to set proper headers to ensure sites see requests coming from a real browser. Some crucial headers are:
- User-Agent: Probably the most important header; it identifies the client browser, OS, and version. Without a realistic browser User-Agent, we can easily be detected as a scraper.
- Accept: Specifies accepted content types like `text/html`. Needs to match real browsers.
- Accept-Language: Browser language preferences for localization.
- Referer: Contains previous page URL, often required to avoid blocks.
There are many other headers like Accept-Encoding, DNT, and Cache-Control that may be needed. In axios, we can set custom headers like so:
```typescript
const headers = {
  'User-Agent':
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
  'Accept': 'text/html',
};

await axios.get(url, { headers });
```
Mimicking headers from a real browser is essential for avoiding bot blocks while scraping.
Rotating Random Proxy Servers
Scraping from a single IP risks getting blocked, especially when making many requests quickly. Proxy servers spread requests across multiple IPs, making detection harder. Popular rotating proxy providers include Bright Data, Smartproxy, Proxy-Seller, and Soax.
Here's how to route requests via a proxy with axios:
```typescript
const response = await axios.get(url, {
  proxy: {
    protocol: 'http',
    host: '192.168.1.42',
    port: 3128,
  },
});
```
This tunnels the request via the proxy IP.
Key considerations for proxies:
- Multiple countries – Using different geographic regions reduces suspicion
- ASN diversity – Spread across multiple Autonomous Systems and providers
- IP rotation – Each proxy IP should only be used a few times
- Proxy types – Residential proxies work better than datacenter IPs
Regularly rotating thousands of proxies makes scraping consistently robust at scale.
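A minimal rotation sketch, assuming a hypothetical pool of proxy endpoints (real providers usually hand you a rotating gateway or an API instead):

```typescript
import axios from 'axios';

// Hypothetical pool of proxy endpoints; in practice these come from your provider
const proxies = [
  { host: '192.168.1.42', port: 3128 },
  { host: '192.168.1.43', port: 3128 },
  { host: '192.168.1.44', port: 3128 },
];

function randomProxy() {
  return proxies[Math.floor(Math.random() * proxies.length)];
}

async function fetchViaProxy(url: string) {
  // Pick a different proxy for every request to spread out the traffic
  return axios.get(url, {
    proxy: { protocol: 'http', ...randomProxy() },
  });
}

const response = await fetchViaProxy('https://example.com');
```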
Parsing Page Content
Once we get page HTML, we need to extract the data we need intelligently. Let's explore common parsing approaches.
CSS Selector-Based Extraction
This is the most popular approach, using jQuery-style selector queries. Cheerio allows running CSS selectors on HTML. Consider this sample HTML:
<div class="product"> <h3 class="name">XPhone Ultra</h3> <p class="description"> A high-end phone with ultimate features </p> <div class="pricing"> <span class="price">$899</span> </div> </div>
Here are some ways to extract data using Cheerio selectors:
```typescript
const $ = cheerio.load(html);

const name = $('.product .name').text(); // XPhone Ultra
const description = $('.description').text(); // A high-end phone...
const price = $('.pricing .price').text(); // $899
```
This makes parsing arbitrarily complex HTML very easy and concise. Some things to watch out for are handling duplicate fields, missing fields, and nested sub-data. Overall, Cheerio makes HTML scraping hassle-free, especially for well-structured sites.
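For instance, handling repeated product blocks and possibly missing fields on a listing page might look like this (a sketch reusing the sample markup above; `html` holds the page source):

```typescript
import * as cheerio from 'cheerio';

// html holds the listing page markup (several .product blocks like the sample above)
const $ = cheerio.load(html);

const products = $('.product')
  .map((_, el) => {
    const product = $(el);
    return {
      name: product.find('.name').text().trim(),
      // Missing fields come back as empty strings, so fall back to null explicitly
      description: product.find('.description').text().trim() || null,
      price: product.find('.pricing .price').text().trim() || null,
    };
  })
  .get();
```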
Using Regular Expressions
For scraping challenging HTML, regular expressions can sometimes be easier than selectors. Let's extract the price from this messy HTML:
```html
<div>
  Price only <b>$899</b> for today!
</div>
```
Using a regex with capture groups lets us cleanly extract the price:
```typescript
const html = `<div>Price only <b>$899</b> for today!</div>`;

const match = html.match(/Price only <b>(\$\d+)<\/b>/);
const price = match?.[1]; // $899
```
For robust regex parsing, we need to:
- Use capture groups properly
- Avoid greedy matching
- Handle optional parts
- Take care of whitespace and newlines
Overall, selectors are usually a better option but regexes can help in one-off scenarios.
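When you do reach for a regex, a slightly more defensive version of the price pattern might look like this (a sketch, not a general-purpose HTML parser):

```typescript
const html = `<div>
  Price   only <b> $899 </b> for today!
</div>`;

// Tolerate variable whitespace/newlines and make the cents part of the price optional
const priceRegex = /Price\s+only\s*<b>\s*(\$\d+(?:\.\d{2})?)\s*<\/b>/;

const price = html.match(priceRegex)?.[1]; // "$899"
```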
Parsing via the DOM
Instead of using string manipulation, we can parse HTML by modeling it as a DOM tree. The jsdom package allows navigating HTML as DOM nodes:
```typescript
import { JSDOM } from 'jsdom';

const dom = new JSDOM(html);
const price = dom.window.document.querySelector('.price')?.textContent;
```
This provides native browser DOM APIs for scraping directly. Some key benefits of the DOM approach are:
- More intuitive than text processing
- Naturally handles nested structures
- Enables clicking, typing, events etc. programmatically
The downside is DOM manipulation can be verbose and complex for large scraping jobs.
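Verbosity aside, here's what walking nested structures looks like with standard DOM APIs (a sketch reusing the earlier product markup; `html` holds the page source):

```typescript
import { JSDOM } from 'jsdom';

// html holds the listing page markup from the earlier product sample
const dom = new JSDOM(html);
const { document } = dom.window;

// Standard DOM traversal copes naturally with nested markup
const products = Array.from(document.querySelectorAll('.product')).map((el) => ({
  name: el.querySelector('.name')?.textContent?.trim(),
  price: el.querySelector('.pricing .price')?.textContent?.trim(),
}));
```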
Scraping Helpers
Writing parsers for real-world sites with lots of dynamic data can be challenging. Scraping helpers like Apify and ScraperAPI provide declarative APIs to extract data without needing to use selectors, regexes or DOM manipulation directly.
For example, such a helper might expose a declarative API along these lines (illustrative; the exact API varies by library):
```typescript
import { scrapePage } from 'page-scraper';

const data = await scrapePage({
  url: 'https://www.example-shop.com/products/abc123',
  fields: {
    title: 'h1',
    description: { sel: 'div.description' },
    price: { sel: 'span.price', how: 'text' },
    image: { sel: 'img.main-image', attr: 'src' },
  },
});

console.log(data);
/*
{
  title: "Green Shirt",
  description: "High quality green shirt...",
  price: "$29.99",
  image: "images/green-shirt.jpg"
}
*/
```
Scraping helpers greatly simplify extracting even complex nested data. Under the hood they still use some combination of selectors, regexes, and DOM scraping but expose a higher-level API. They also handle common needs like:
- Scraping paginated listings across multiple pages
- Following links from page to page
- Reading JSON and JavaScript
- Handling client side rendering
For complex scraping jobs, they are highly recommended over writing everything from scratch.
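To appreciate what these helpers save you, here's a minimal hand-rolled pagination loop with axios and cheerio (hypothetical URL and selectors):

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

// Hypothetical listing URL and selectors; adjust to the target site
async function scrapeAllPages(startUrl: string) {
  const names: string[] = [];
  let nextUrl: string | undefined = startUrl;

  while (nextUrl) {
    const { data } = await axios.get(nextUrl);
    const $ = cheerio.load(data);

    // Collect the fields we care about on the current page
    $('.product .name').each((_, el) => {
      names.push($(el).text().trim());
    });

    // Follow the "next page" link until there isn't one
    const nextHref = $('a.next-page').attr('href');
    nextUrl = nextHref ? new URL(nextHref, nextUrl).toString() : undefined;
  }

  return names;
}

const allNames = await scrapeAllPages('https://www.example-shop.com/products?page=1');
```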
Composing Robust and Scalable Scrapers
Now that we have seen core scraping techniques let's look at best practices for architecting full-featured scrapers.
Separation of Concerns
For any non-trivial scraper, we need to break it down into logical components:
- Requester: Handles making HTTP requests with retries, headers etc.
- Parser: Extracts data from HTML content
- Storage: Persists scraped data to databases
- Job Queue: Coordinates parsing jobs for scaling
- Web Application: Provides monitoring capabilities
This separation of concerns makes robust scrapers easier to maintain and scale.
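One way to express this split in TypeScript is with small interfaces per concern (a sketch; the exact shapes will vary by pipeline):

```typescript
// A rough sketch of the component boundaries
interface Requester {
  fetch(url: string): Promise<string>; // raw HTML, with retries/headers handled internally
}

interface Parser<T> {
  parse(html: string): T[]; // typed records extracted from a page
}

interface Storage<T> {
  save(records: T[]): Promise<void>; // persist results to a database
}

interface JobQueue {
  enqueue(url: string): Promise<void>;
  next(): Promise<string | undefined>;
}

// A scraping job is then just a composition of these pieces
async function runJob<T>(url: string, requester: Requester, parser: Parser<T>, storage: Storage<T>) {
  const html = await requester.fetch(url);
  await storage.save(parser.parse(html));
}
```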
Asynchronous Coordination
Scraping typically involves multiple sequential or concurrent actions like:
- Requesting many pages asynchronously
- Parsing each one after another
- Persisting data in the background
In TypeScript, `async/await` syntax makes coordination simple:
```typescript
const htmls = await Promise.all(urls.map(fetchPage));

const data = [];
for (const html of htmls) {
  const scraped = await parsePage(html);
  data.push(scraped);
}

await persistData(data);
```
This allows us to chain scraping steps easily. For even more complex flows, libraries like Bull provide asynchronous queues and jobs.
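For illustration, a minimal Bull setup might look like this (a sketch assuming a local Redis instance and the fetchPage/parsePage/persistData helpers used above):

```typescript
import Queue from 'bull';

// Assumes Redis is running locally; fetchPage/parsePage/persistData come from the snippet above
const scrapeQueue = new Queue('scrape', 'redis://127.0.0.1:6379');

// Worker: each job scrapes one URL
scrapeQueue.process(async (job) => {
  const html = await fetchPage(job.data.url);
  const scraped = await parsePage(html);
  await persistData([scraped]);
});

// Producer: enqueue URLs to be scraped
await scrapeQueue.add({ url: 'https://example.com/page/1' });
```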
Handling Errors
In complex, long-running scraping jobs, errors can happen at any time. We need to handle them carefully to avoid losing data.
```typescript
try {
  const html = await fetchPage(url);
  const data = parsePage(html);
  await storeData(data);
} catch (err) {
  console.error(err);

  // Retry on transient errors
  if (isTransientError(err)) {
    queuePageForRetry(url);
  }
}
```
Robust error handling requires:
- Logging all errors
- Classification into transient and non-transient
- Retrying transient failures
This prevents scraper crashes and data loss. Additionally, we should track analytics like URLs scraped, data extracted, failures etc. to monitor scraper health.
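As an example, the `isTransientError` helper from the snippet above might classify axios errors roughly like this (a sketch; tune the rules to your targets):

```typescript
import { AxiosError } from 'axios';

// Hypothetical classifier for the isTransientError() check used above:
// network failures, timeouts, rate limits, and 5xx responses are worth retrying
function isTransientError(err: unknown): boolean {
  if (err instanceof AxiosError) {
    if (!err.response) return true; // DNS failures, resets, timeouts carry no response
    const status = err.response.status;
    return status === 429 || status >= 500;
  }
  return false;
}
```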
Scaling up Scrapers
To scrape truly large sites with millions of pages, we need to scale up our scraper:
- Run distributed – Spread across multiple servers using containers/Kubernetes
- Use caches – Cache page data, HTTP requests etc. where possible
- Partition work – Parallelize by geography, topics etc.
- Monitor queues – Watch job queues for optimal concurrency
There are also managed services like ScrapeOps that provide infrastructure to scale scrapers. With the right architecture, we can scale TypeScript scrapers to any needed capacity.
Going Beyond Basics with Advanced Scraping Techniques
Let's explore some advanced scraping capabilities by leveraging additional libraries.
Browser Automation for JavaScript Sites
A huge portion of sites rely heavily on JavaScript to render content. Plain HTTP requests are not enough to scrape these sites; we need to execute JavaScript by controlling an actual browser. Libraries like Playwright and Puppeteer provide APIs to drive real Chromium and Firefox browsers programmatically.
Here is an example with Playwright:
```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

await page.goto('https://dynamicpage.com');

// Wait for JS content to load
await page.waitForSelector('.loaded');

const html = await page.content(); // Fully rendered HTML

await browser.close();
```
This gives us the HTML after all JavaScript has executed, allowing us to scrape highly dynamic sites. Some things to watch out for are:
- Resource usage of browsers
- Navigation timings to avoid early HTML
- Handling browser fingerprints to avoid detection
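On the resource-usage point, one common trick is to block heavy assets before navigating (a sketch using Playwright's request interception; adjust the patterns to what your pages actually need):

```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Skip images, fonts, and media we don't need, which cuts bandwidth and memory use
await page.route('**/*.{png,jpg,jpeg,gif,webp,woff,woff2,mp4}', (route) => route.abort());

await page.goto('https://dynamicpage.com');
const html = await page.content();

await browser.close();
```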
Working Around CAPTCHAs
Large scrapers often have to deal with CAPTCHAs. Simple OCR CAPTCHAs can be automatically solved, but advanced ones like reCAPTCHA require human input. To handle these, we need integration with captcha solving services:
```typescript
// 1. Extract the captcha image URL from the page
const captchaUrl = $('.captcha img').attr('src');

// 2. Send it to a solving service like AntiCaptcha
const captchaText = await solveCaptcha(captchaUrl);

// 3. Submit the form with the solved text
await submitLoginForm(email, password, captchaText);
```
Key factors to consider are:
- Pricing models – per CAPTCHA, monthly subscriptions etc.
- Solution accuracy and speed
- Integrations – APIs, browser extensions etc.
Reliably solving CAPTCHAs removes a major scraping bottleneck.
Scraping Data from APIs
Web APIs provide structured data that does not require HTML parsing. Scraping APIs is much simpler in many cases. Here is an example of scraping a product API:
```typescript
const response = await axios.get('/api/v1/products?category=electronics');
const products = response.data;

for (const product of products) {
  console.log(product.name, product.price);
}
```
APIs have many advantages compared to HTML scraping:
- Structured data – JSON, CSV, XML etc. are easier to work with
- No rendering – No need to execute JavaScript
- Caching – API responses can be aggressively cached
- Pagination – Steady pagination via URL parameters, cursors etc.
- Documentation – Clear specification of fields, data types etc.
- Authentication – Can limit scraping through API keys
However, there are some downsides as well:
- Rate limiting – APIs tend to limit requests more aggressively
- Cost – APIs may have usage charges, paid tiers
- No user interface – Can't see data layouts visually
Overall, preferring API data over page scraping when possible makes building data pipelines easier. But HTML scraping is still needed where rich UIs exist without API access.
Scraping Real-Time Data Feeds
To get up-to-date data as it changes, we can tap into live data streams. WebSockets allow subscribing to real-time feeds from servers:
```typescript
import WebSocket from 'ws';

const socket = new WebSocket('wss://data.com/live');

socket.onmessage = (event) => {
  const data = event.data; // Live updates
  scrape(data);
};
```
WebSockets enable scraping dashboards, chat data, and other real-time sources. Server-Sent Events (SSE) are a simpler, unidirectional stream delivered over plain HTTP:
```typescript
// EventSource is built into browsers; in Node, use the 'eventsource' package
const stream = new EventSource('/updates');

stream.onmessage = (event) => {
  const data = event.data; // Live feed
  processData(data);
};
```
SSE streams work well for live server updates. GraphQL Subscriptions are WebSocket-based subscriptions popular in the GraphQL ecosystem, for example via the graphql-ws package:
```typescript
// Using the graphql-ws client
import { createClient } from 'graphql-ws';

const client = createClient({ url });

client.subscribe(
  {
    query: `
      subscription {
        newProducts {
          name
          price
        }
      }
    `,
  },
  {
    next: (data) => {
      // Handle new products as they arrive
    },
    error: console.error,
    complete: () => {},
  }
);
```
GraphQL provides a typed query language for both real-time and historical data. For truly up-to-date scraping, tapping into live data streams like these goes beyond what plain HTTP requests can offer.
Scraping Single Page Apps
Modern web apps rely heavily on JavaScript frameworks like React and Vue. These Single Page Apps (SPAs) pose a challenge for scraping since the content is managed client-side. Scraping libraries like Puppeteer and Playwright can drive SPAs like a real user:
```typescript
// Navigate to the SPA route (a full URL is required)
await page.goto('https://example.com/products');

// Click elements that trigger client-side loading, then wait for the new content
await page.click('.load-more');
await page.waitForSelector('.product'); // whatever selector marks the loaded items

// Extract SPA content after interactions
const html = await page.content();
```
Unlike static, SEO-friendly pages, an SPA's initial HTML contains little content, but this approach captures the data that only appears after JavaScript execution. Some best practices are:
- Use React/Vue devtools to understand data flow
- Reverse engineer network requests
- Watch out for loading indicators
- Mimic user actions like typing and clicking
Scraping JavaScript SPAs requires browser automation, unlike traditional multi-page sites.
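As an example of reverse engineering network requests, Playwright can capture the JSON the SPA fetches for itself instead of parsing the rendered DOM (a sketch reusing the `page` object from the earlier Playwright example; the `/api/products` filter is hypothetical):

```typescript
// Listen for the SPA's own API responses instead of scraping the rendered HTML
page.on('response', async (response) => {
  if (response.url().includes('/api/products') && response.ok()) {
    const payload = await response.json();
    console.log('API payload:', payload);
  }
});

await page.goto('https://example.com/products');
```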
Conclusion
And that concludes our comprehensive guide to professional web scraping with TypeScript and Node.js! With these patterns, we can build TypeScript scrapers of any complexity to power data projects. Robust coding practices keep our scrapers scalable and resilient, TypeScript's optional typing makes them sturdier than plain JavaScript, and the npm ecosystem supplies thousands of packages for everything from parsing to distributed job queues.
Overall, TypeScript strikes a great balance of productivity and scale to cover scraping needs from simple to enterprise.