Web Scraping With Node-Unblocker

Web scraping is a useful technique for extracting data from websites, but scrapers often face challenges like blocking or restrictions based on geography. Proxy tools like Node-Unblocker can help bypass these issues for scrapers written in NodeJS.

In this comprehensive guide, we'll explore how to use Node-Unblocker as an effective proxy solution for your NodeJS web scraping projects.

Introduction to Web Scraping

Web scraping refers to automated techniques for extracting information from websites. Scrapers simulate human web browsing using tools like headless browsers and HTTP clients. Some common use cases include:

Extracting product data for price comparison sites
Gathering contact information for sales leads
Compiling data for research purposes
Archiving or backing up websites

Web scraping can be accomplished in languages such as Python, NodeJS, and C#. In this guide, we'll look specifically at integrating proxies into NodeJS-based web scrapers using Node-Unblocker.

The Need for Proxies in Web Scraping

While useful, web scraping also comes with some difficulties:

Blocking – Many sites try to detect and block scrapers. This can happen at the IP, account or regional level.
Geo-Restrictions – Sites may restrict content/access based on geographical location. For example, Netflix libraries vary by country.
Rate Limiting – Limits on requests per time period to prevent overload.

To overcome these issues, scrapers commonly route traffic through proxies, either self-hosted or rented from providers such as Bright Data, Smartproxy, Proxy-Seller, and Soax. Proxies supply different IP addresses and locations, which masks where the scraper's requests originate.

Introducing Node-Unblocker

Node-Unblocker is an open-source NodeJS web proxy designed for bypassing internet censorship and geo-blocks.

Key features:

Native NodeJS implementation and API
Easy to set up as a local proxy server
Request & response middlewares for modifying traffic
Hackable to add custom extensions

For NodeJS-based web scraping, Node-Unblocker is a lightweight way to add proxy capabilities. Let's look at how it works.

Basic Setup and Usage

To get started, install Node-Unblocker (published on npm as unblocker) together with Express:

npm install unblocker express

Then we can create an Express server with Unblocker middleware:

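const express = require("express");
const Unblocker = require("unblocker");

const app = express();

// URLs under /proxy/ are fetched by the proxy on the client's behalf
const unblocker = new Unblocker({ prefix: "/proxy/" });

// register Unblocker before any other middleware or routes
app.use(unblocker);

// redirect homepage to Wikipedia via proxy
app.get("/", (req, res) =>
  res.redirect("/proxy/https://en.wikipedia.org/wiki/Main_Page")
);

const port = 8080;
app.listen(port, () => console.log(`Proxy listening on port ${port}`));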

This gives us a basic local proxy server. We can make requests through it:

$ curl "http://localhost:8080/proxy/https://api.ipify.org?format=json"

// Response: 
{
  "ip": "<proxy server IP>" 
}

The request first hits our proxy server, which forwards it to the target URL and relays the response back. The target server sees the request coming from the proxy server rather than from the scraper. Note that this only hides the scraper's IP once the proxy runs on a different machine; on localhost, both share the same address.
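
The path-prefix convention used in the curl call above can be wrapped in a small helper so that scrapers do not concatenate strings by hand. This is only a sketch: toProxyUrl is a hypothetical name, not part of Node-Unblocker, and the default prefix assumes the /proxy/ path used in this guide.

```javascript
// Hypothetical helper (not part of Node-Unblocker): builds the URL a client
// should request so that the proxy fetches `targetUrl` on its behalf.
// Assumes the proxy serves proxied pages under the "/proxy/" prefix.
function toProxyUrl(proxyBase, targetUrl, prefix = '/proxy/') {
  // Validate the target early; `new URL` throws a TypeError on malformed input.
  const target = new URL(targetUrl);
  // Strip trailing slashes from the base so we never emit "//proxy/".
  const base = proxyBase.replace(/\/+$/, '');
  return base + prefix + target.href;
}

console.log(toProxyUrl('http://localhost:8080', 'https://example.com/page?id=1'));
```

Validating the target before building the URL means a typo surfaces in the scraper rather than as a confusing error from the proxy.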

Using With a Web Scraper

To use Node-Unblocker with an existing scraper, we simply need to forward requests through it. For example with the Puppeteer headless browser library:

const puppeteer = require('puppeteer');

const proxyUrl = 'http://localhost:8080/proxy/';

async function scrape(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    // pass request through proxy
    await page.goto(proxyUrl + url);

    // rest of scraping logic...
  } finally {
    // close the browser even if navigation or scraping throws
    await browser.close();
  }
}

We can deploy and scale this setup using multiple proxy servers to distribute requests across different IPs.

Avoiding Blocks with Multiple Proxies

A single proxy IP will still look suspicious if it sends too many requests. To further avoid blocks, we can deploy Node-Unblocker on multiple servers and rotate between them. For example:

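const puppeteer = require('puppeteer');

// Node-Unblocker instances on separate servers (placeholder addresses);
// plain HTTP, matching the Express setup shown earlier
const proxyPool = [
  'http://111.222.22.33:8080',
  'http://111.222.22.34:8080',
  'http://111.222.22.35:8080',
];

function getRandomProxy() {
  return proxyPool[Math.floor(Math.random() * proxyPool.length)];
}

async function scrape(url) {
  const proxy = getRandomProxy();
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    // route this request through the randomly chosen proxy
    await page.goto(proxy + '/proxy/' + url);

    // ... scraping logic
  } finally {
    await browser.close();
  }
}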

By spreading requests across different IPs, each proxy handles a smaller number of requests. This helps avoid triggering blanket IP blocks.
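
Random selection can send several requests in a row to the same server. If an even spread matters, a round-robin picker is one alternative. The sketch below is an illustration under assumptions: createRoundRobin is a hypothetical helper, and the hostnames are placeholders.

```javascript
// Hypothetical helper: returns a function that cycles through `proxies`
// in order, wrapping around at the end, so each proxy gets an equal share.
function createRoundRobin(proxies) {
  if (!Array.isArray(proxies) || proxies.length === 0) {
    throw new Error('proxy pool must be a non-empty array');
  }
  let index = 0;
  return function nextProxy() {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

// Placeholder addresses for illustration only.
const nextProxy = createRoundRobin([
  'http://proxy-1.example.com:8080',
  'http://proxy-2.example.com:8080',
  'http://proxy-3.example.com:8080',
]);

for (let i = 0; i < 4; i++) {
  console.log(nextProxy());
}
```

In a scraper, a call to nextProxy() would take the place of getRandomProxy(). The counter lives in a single process, so separate scraper processes would each keep their own rotation.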

Deploying Proxies Globally

Another benefit of proxies is being able to route requests through different countries and regions. Many sites restrict content and access based on the visitor's geographical location. For example, Netflix and Hulu have different media libraries depending on the country you connect from.

By deploying proxies globally, we can make region-specific requests:

// US based proxy
const US_PROXY = 'http://us-server.example.com:8080';

// India based proxy
const IN_PROXY = 'http://in-server.example.com:8080';

// `page` is an already opened Puppeteer page
async function scrapeIndiaLibrary(page) {
  await page.goto(IN_PROXY + '/proxy/https://www.hotstar.com/in');

  // scrape India Hotstar content...
}

async function scrapeUSLibrary(page) {
  await page.goto(US_PROXY + '/proxy/https://www.hulu.com');

  // scrape US Hulu content...
}

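This makes it easy to target regional versions of sites. Note that the URL-prefix pattern shown here is specific to Node-Unblocker servers. A conventional HTTP proxy is configured in the client instead, for example through Chromium's --proxy-server launch argument in Puppeteer.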
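
As the number of regions grows, one constant per country becomes awkward. A lookup table is one way to keep this manageable. The following is a sketch under assumptions: REGION_PROXIES and proxiedUrlForRegion are hypothetical names, and the hostnames are the placeholders used above.

```javascript
// Hypothetical region-to-proxy table; the hostnames are placeholders.
const REGION_PROXIES = new Map([
  ['us', 'http://us-server.example.com:8080'],
  ['in', 'http://in-server.example.com:8080'],
]);

// Builds the proxied URL for `targetUrl` through the proxy deployed in
// `region`. Assumes each server exposes Node-Unblocker under "/proxy/".
function proxiedUrlForRegion(region, targetUrl) {
  const proxy = REGION_PROXIES.get(region.toLowerCase());
  if (!proxy) {
    // Fail loudly instead of silently scraping from the wrong region.
    throw new Error(`no proxy configured for region "${region}"`);
  }
  return `${proxy}/proxy/${targetUrl}`;
}

console.log(proxiedUrlForRegion('in', 'https://www.hotstar.com/in'));
```

Throwing on an unknown region is a deliberate choice here: falling back to a default proxy would return content for a different country without any warning.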

Advanced: Using Middlewares

One powerful feature of Node-Unblocker is the ability to modify requests and responses using custom middleware functions. Middlewares sit between the scraper and target site, allowing us to manipulate traffic. Some examples of using middlewares:

Add request headers

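// Attach authentication header
// Unblocker passes each middleware a data object describing the outgoing
// request (including url and headers), not the Express request itself
const authMiddleware = (data) => {
  if (data.url.includes('api.example.com')) {
    // placeholder token; Node lowercases incoming header names
    data.headers['authorization'] = 'Bearer xxxyyy';
  }
};

const unblocker = new Unblocker({
  requestMiddleware: [authMiddleware]
});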

Filter response cookies
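
A response middleware can drop cookies the scraper does not need, such as tracking cookies, before they reach the client. The sketch below makes several assumptions: it expects the data object to expose the upstream response headers as data.headers, with set-cookie as an array of strings (the shape Node's http module uses), and ALLOWED_COOKIES with its cookie names is purely hypothetical.

```javascript
// Hypothetical allow-list: names of cookies the scraper actually needs.
const ALLOWED_COOKIES = new Set(['sessionid', 'csrftoken']);

// Response middleware sketch. Assumes `data.headers['set-cookie']` is an
// array of raw Set-Cookie strings, as produced by Node's http module.
function filterCookies(data) {
  const cookies = data.headers['set-cookie'];
  if (!Array.isArray(cookies)) return; // nothing to filter

  const kept = cookies.filter((cookie) => {
    // The cookie name is everything before the first "=".
    const name = cookie.split('=')[0].trim();
    return ALLOWED_COOKIES.has(name);
  });

  if (kept.length > 0) {
    data.headers['set-cookie'] = kept;
  } else {
    // Remove the header entirely rather than sending an empty list.
    delete data.headers['set-cookie'];
  }
}

// Quick demonstration with a hand-built data object.
const example = {
  headers: { 'set-cookie': ['sessionid=abc; Path=/', 'tracker=1; Path=/'] },
};
filterCookies(example);
console.log(example.headers['set-cookie']);
```

The function would be registered through the responseMiddleware option, in the same way authMiddleware is passed to requestMiddleware above. Since it only mutates a plain object, it can be unit-tested without starting a proxy.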

We can use middlewares to add any custom logic around requests and responses. For example, scraping-related logic such as session handling and authentication can live in the proxy layer instead of being reimplemented in each scraper.

Deployment Tips

Node-Unblocker provides a convenient NodeJS proxy server, but we still have to deploy and host it somewhere to be useful. Here are some tips:

Use a cloud hosting provider such as AWS, GCP, or Azure for easy global deployment, and spin up servers in the regions you need.
Set up Docker containers for easy deployment and scaling. The Node-Unblocker server can be packaged with a Dockerfile.
Use residential proxy services that provide backconnect rotating IPs to avoid blocks. Some offer NodeJS libraries for easy integration.
Monitor server logs for traffic metrics, errors, and blocks to identify issues proactively.
Implement a proxy manager microservice to control proxy distribution across scrapers dynamically.
Set up uptime monitoring with health checks and alarms to catch outages quickly.
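
The proxy manager idea from the list above can start as a small in-memory class before it grows into a separate service. This is a minimal sketch, not a production design: ProxyManager, its method names, and the failure threshold are all assumptions made for illustration.

```javascript
// Hypothetical in-memory proxy manager. A proxy is taken out of rotation
// after `maxFailures` consecutive failures and returns once a success
// (for example from a separate health check) resets its counter.
class ProxyManager {
  constructor(proxies, maxFailures = 3) {
    this.maxFailures = maxFailures;
    // Map of proxy URL -> consecutive failure count.
    this.failures = new Map(proxies.map((proxy) => [proxy, 0]));
  }

  // Proxies whose failure count is still below the threshold.
  healthy() {
    return [...this.failures]
      .filter(([, count]) => count < this.maxFailures)
      .map(([proxy]) => proxy);
  }

  // Picks a random healthy proxy, or throws if none remain.
  pick() {
    const pool = this.healthy();
    if (pool.length === 0) throw new Error('no healthy proxies available');
    return pool[Math.floor(Math.random() * pool.length)];
  }

  reportFailure(proxy) {
    if (!this.failures.has(proxy)) return; // ignore unknown proxies
    this.failures.set(proxy, this.failures.get(proxy) + 1);
  }

  reportSuccess(proxy) {
    if (!this.failures.has(proxy)) return;
    this.failures.set(proxy, 0);
  }
}
```

A scraper would call pick() before each request and report the outcome afterwards, treating a block page or timeout as a failure. Exposing the same logic over HTTP would let several scrapers share one view of proxy health.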

Limitations of Node-Unblocker

While useful, Node-Unblocker does have some limitations:

Basic proxy features only – no browser emulation or other evasion techniques.
You need to deploy and manage your own proxy servers.
Limited IP pool – blocks remain a risk if traffic is concentrated on a few addresses.
No support for complex sites such as Google and Instagram, which implement advanced bot protection.
No built-in integration with commercial proxy services that offer large residential IP pools.

So while Node-Unblocker serves basic proxying needs, other tools may be required for large-scale scraping of complex sites.

Alternatives and Complements

ScraperAPI

For more advanced needs, proxy APIs like ScraperAPI provide additional capabilities:

Millions of residential IPs to prevent blocks even for heavy scraping.
JavaScript rendering to scrape complex sites.
Built-in support for sites like Instagram and Google.
Geo-targeting, browser fingerprinting and other evasion techniques.
High availability infrastructure.
Easy to integrate – no server management overhead.

ScraperAPI has clients for NodeJS and Python, making it easy to add to existing scrapers.

Commercial Proxy Services

Specialized commercial providers like Bright Data (formerly Luminati) and Oxylabs provide residential IPs along with libraries and clients for various languages. Compared to DIY proxies, these services offer larger, higher-quality IP pools that are less likely to be blocked, at a higher cost.

Blending Solutions

In practice, a combination of self-hosted and commercial proxies often works best:

Use commercial proxy pools for baseline scraping to avoid common blocks.
Deploy own custom proxies with advanced scraping logic for specific sites.
Implement proxy rotation across both for maximum scale and evasion.

Depending on the use case, a blended setup can balance cost against features.

Conclusion

Node-Unblocker provides a handy way to set up a proxy through a simple NodeJS-based server. It serves as a flexible building block in your web scraping stack. However, for fully reliable large-scale scraping, some additional tools may be required. Integrating with commercial proxy pools and services can help take it to the next level.

I hope this guide gave you some ideas on how Node-Unblocker can help your web scraping projects!