Web scraping is a useful technique for extracting data from websites, but scrapers often face challenges like blocking or restrictions based on geography. Proxy tools like Node-Unblocker can help bypass these issues for scrapers written in NodeJS.
In this comprehensive guide, we'll explore how to use Node-Unblocker as an effective proxy solution for your NodeJS web scraping projects.
Introduction to Web Scraping
Web scraping refers to automated techniques for extracting information from websites. Scrapers simulate human web browsing using tools like headless browsers and HTTP clients. Some common use cases include:
- Extracting product data for price comparison sites
- Gathering contact information for sales leads
- Compiling data for research purposes
- Archiving/backup of websites
Web scraping can be done in many languages and runtimes, such as Python, NodeJS, and C#. In this guide we'll look specifically at integrating proxies using Node-Unblocker in NodeJS-based web scrapers.
The Need for Proxies in Web Scraping
While useful, web scraping also comes with some difficulties:
- Blocking – Many sites try to detect and block scrapers. This can happen at the IP, account or regional level.
- Geo-Restrictions – Sites may restrict content/access based on geographical location. For example, Netflix libraries vary by country.
- Rate Limiting – Limits on requests per time period to prevent overload.
To overcome these issues, scrapers commonly route their traffic through proxies, either self-hosted or from providers such as BrightData, Smartproxy, Proxy-Seller, and Soax. Proxies provide new IP addresses and locations that mask the scraper's origin.
Introducing Node-Unblocker
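Node-Unblocker is an open source web proxy for NodeJS, designed for bypassing internet censorship and geo-blocks. Clients reach a target site by prefixing its URL with the proxy's address, rather than by configuring the proxy as a standard HTTP proxy.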
Key features:
- Native NodeJS implementation and API
- Easy to set up as a local proxy server
- Request & response middlewares for modifying traffic
- Hackable to add custom extensions
For NodeJS-based web scraping, Node-Unblocker is a lightweight way to add proxy capabilities. Let's look at how it works.
Basic Setup and Usage
To get started, install Node-Unblocker from npm. The package is published as unblocker, which is also the name used in require():

    npm install unblocker express
Then we can create an Express server with Unblocker middleware:

    const express = require("express");
    const Unblocker = require("unblocker");

    const app = express();
    // Serve proxied pages under /proxy/ (this is also the library's default prefix)
    const unblocker = new Unblocker({ prefix: "/proxy/" });

    // Mount the proxy before any other routes
    app.use(unblocker);

    // Redirect the homepage to Wikipedia via the proxy
    app.get("/", (req, res) =>
      res.redirect("/proxy/https://en.wikipedia.org/wiki/Main_Page")
    );

    const port = 8080;
    app.listen(port);

This gives us a basic local proxy server. We can make requests through it:

    # Quote the URL so the shell does not interpret the "?"
    $ curl "http://localhost:8080/proxy/https://api.ipify.org?format=json"
    # Response: {"ip": "<proxy server IP>"}

The request first hits our proxy server, which forwards it to the target URL and relays the response back. The target server sees the request coming from the proxy server's IP instead of the scraper's. Note that a proxy running on localhost shares the scraper's IP, so the masking only takes effect once the proxy is deployed on a separate machine.
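To make the forwarding step concrete, here is a minimal sketch of how a client might build the URL it sends to the proxy. The helper name `toProxiedUrl` is hypothetical (it is not part of Node-Unblocker), and the sketch assumes the proxy is mounted under the `/proxy/` prefix.

```javascript
// Hypothetical helper (not part of Node-Unblocker): builds the URL a client
// requests from the proxy. Assumes the proxy is mounted under "/proxy/".
function toProxiedUrl(proxyBase, targetUrl, prefix = "/proxy/") {
  // Parsing validates the target and rejects relative or malformed URLs
  const target = new URL(targetUrl);
  if (target.protocol !== "http:" && target.protocol !== "https:") {
    throw new Error("Only http(s) targets are supported: " + targetUrl);
  }
  // Strip trailing slashes from the base so the prefix is not doubled
  const base = proxyBase.replace(/\/+$/, "");
  return base + prefix + target.href;
}

// Usage: the resulting URL can be passed to curl, fetch, or page.goto()
const proxied = toProxiedUrl("http://localhost:8080/", "https://example.com/a?b=1");
console.log(proxied);
```

Validating the target up front keeps malformed URLs from reaching the proxy, where they would be harder to diagnose.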
Using With a Web Scraper
To use Node-Unblocker with an existing scraper, we simply need to forward requests through it. For example with the Puppeteer headless browser library:

    const puppeteer = require("puppeteer");

    const proxyUrl = "http://localhost:8080/proxy/";

    async function scrape(url) {
      const browser = await puppeteer.launch();
      try {
        const page = await browser.newPage();
        // Pass the request through the proxy by prefixing the target URL
        await page.goto(proxyUrl + url);
        // rest of scraping logic...
      } finally {
        // Close the browser even if navigation or scraping throws
        await browser.close();
      }
    }

We can deploy and scale this setup using multiple proxy servers to distribute requests across different IPs.
Avoiding Blocks with Multiple Proxies
A single proxy IP will still look suspicious if it sends too many requests. To further avoid blocks, we can deploy Node-Unblocker on multiple servers and rotate between them. For example:
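
    const puppeteer = require("puppeteer");

    // Placeholder addresses. Plain http is used here because a basic
    // Express server does not serve TLS unless it is configured to.
    const proxyPool = [
      "http://111.222.22.33:8080",
      "http://111.222.22.34:8080",
      "http://111.222.22.35:8080",
    ];

    // Pick a proxy uniformly at random
    function getRandomProxy() {
      return proxyPool[Math.floor(Math.random() * proxyPool.length)];
    }

    async function scrape(url) {
      const proxy = getRandomProxy();
      const browser = await puppeteer.launch();
      try {
        const page = await browser.newPage();
        await page.goto(proxy + "/proxy/" + url);
        // ... scraping logic
      } finally {
        await browser.close();
      }
    }
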
By spreading requests across different IPs, each proxy handles a smaller number of requests. This helps avoid triggering blanket IP blocks.
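Random selection can pick the same proxy several times in a row, and it keeps using a proxy that has just been blocked. One possible refinement is a round-robin rotator that temporarily skips blocked proxies. This is a sketch under assumptions: the `ProxyRotator` class, its method names, and the 60-second cooldown are all hypothetical, not part of Node-Unblocker.

```javascript
// Hypothetical rotator: cycles through proxies in order and skips any that
// were recently reported as blocked. The default cooldown is an assumption.
class ProxyRotator {
  constructor(proxies, cooldownMs = 60000) {
    if (proxies.length === 0) throw new Error("Proxy pool is empty");
    this.proxies = proxies;
    this.cooldownMs = cooldownMs;
    this.blockedUntil = new Map(); // proxy URL -> timestamp in ms
    this.index = 0;
  }

  // Returns the next proxy that is not cooling down, or null if all are
  next(now = Date.now()) {
    for (let i = 0; i < this.proxies.length; i++) {
      const proxy = this.proxies[this.index];
      this.index = (this.index + 1) % this.proxies.length;
      if ((this.blockedUntil.get(proxy) || 0) <= now) return proxy;
    }
    return null;
  }

  // Call when a proxy receives a block response (for example HTTP 403 or 429)
  reportBlocked(proxy, now = Date.now()) {
    this.blockedUntil.set(proxy, now + this.cooldownMs);
  }
}

// Usage: ask for a proxy before each request and report blocks afterwards
const rotator = new ProxyRotator([
  "http://111.222.22.33:8080",
  "http://111.222.22.34:8080",
]);
const chosen = rotator.next();
```

Round-robin spreads load evenly across the pool, while the cooldown gives a blocked IP time to recover before it is reused.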
Deploying Proxies Globally
Another benefit of proxies is being able to route requests through different countries and regions. Many sites restrict content and access based on the visitor's geographical location. For example, Netflix and Hulu have different media libraries depending on the country you connect from.
By deploying proxies globally, we can make region-specific requests:

    // US-based proxy
    const US_PROXY = "http://us-server.example.com:8080";
    // India-based proxy
    const IN_PROXY = "http://in-server.example.com:8080";

    // `page` is an open Puppeteer page created by the caller
    async function scrapeIndiaLibrary(page) {
      await page.goto(IN_PROXY + "/proxy/https://www.hotstar.com/in");
      // scrape India Hotstar content...
    }

    async function scrapeUSLibrary(page) {
      await page.goto(US_PROXY + "/proxy/https://www.hulu.com");
      // scrape US Hulu content...
    }

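This makes it easy to target regional versions of sites. The URL-prefix approach shown here works with Node-Unblocker servers. A standard HTTP proxy can serve the same purpose, but it has to be configured in the client (for example with Chromium's --proxy-server launch argument) instead of being prepended to the URL.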
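As the number of regions grows, one function per region becomes repetitive. A lookup table keyed by region code is one way to generalize it. This is a sketch only: the `REGION_PROXIES` table and `proxiedUrlForRegion` helper are hypothetical names, and the hostnames are placeholders.

```javascript
// Hypothetical region table; the hostnames are placeholders
const REGION_PROXIES = new Map([
  ["us", "http://us-server.example.com:8080"],
  ["in", "http://in-server.example.com:8080"],
]);

// Returns the URL to request so that the target is fetched from `region`.
// Assumes each proxy is a Node-Unblocker server mounted under "/proxy/".
function proxiedUrlForRegion(region, targetUrl) {
  const proxy = REGION_PROXIES.get(region.toLowerCase());
  if (!proxy) {
    throw new Error(`No proxy configured for region "${region}"`);
  }
  return proxy + "/proxy/" + targetUrl;
}

// Usage with Puppeteer (page created elsewhere):
// await page.goto(proxiedUrlForRegion("in", "https://www.hotstar.com/in"));
```

Failing loudly on an unknown region avoids silently scraping the wrong regional version of a site.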
Advanced: Using Middlewares
One powerful feature of Node-Unblocker is the ability to modify requests and responses using custom middleware functions. Middlewares sit between the scraper and target site, allowing us to manipulate traffic. Some examples of using middlewares:
Add request headers
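
    const Unblocker = require("unblocker");

    // Attach an authentication header to requests bound for api.example.com.
    // Middleware receives a data object describing the proxied request.
    const authMiddleware = (data) => {
      if (data.url.includes("api.example.com")) {
        data.headers["Authorization"] = "Bearer xxxyyy";
      }
    };

    const unblocker = new Unblocker({
      requestMiddleware: [authMiddleware],
    });
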
Filter response cookies
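The original example for this case is not available, so the following is a hedged sketch of what a cookie-filtering response middleware could look like. It assumes that, in response middleware, `data.headers` holds the remote server's response headers with `set-cookie` as an array of strings (Node's convention). The allowlist and the function name `filterResponseCookies` are hypothetical. How this interacts with Node-Unblocker's built-in cookie handling depends on middleware order, which is worth checking against the library's documentation.

```javascript
// Hypothetical allowlist of cookie names the scraper actually needs
const ALLOWED_COOKIES = new Set(["sessionid", "csrftoken"]);

// Response middleware sketch: drops Set-Cookie headers not on the allowlist.
// Assumes data.headers["set-cookie"] is an array of header strings.
function filterResponseCookies(data) {
  const setCookie = data.headers["set-cookie"];
  if (!setCookie) return;

  const list = Array.isArray(setCookie) ? setCookie : [setCookie];
  const kept = list.filter((cookie) => {
    // The cookie name is everything before the first "="
    const name = cookie.split("=")[0].trim();
    return ALLOWED_COOKIES.has(name);
  });

  if (kept.length > 0) {
    data.headers["set-cookie"] = kept;
  } else {
    delete data.headers["set-cookie"];
  }
}

// Registration, assuming the unblocker package is installed:
// const unblocker = new Unblocker({ responseMiddleware: [filterResponseCookies] });
```

Keeping the filter as a plain function of `data` makes it easy to test with a mock object before wiring it into the proxy.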
We can use middlewares to add any custom logic around requests and responses. For example, scraping-related logic like session handling and authentication can be moved into the proxy layer so that each scraper does not have to reimplement it.
Deployment Tips
Node-Unblocker provides a convenient NodeJS proxy server, but we still have to deploy and host it somewhere to be useful. Here are some tips:
- Use a cloud hosting provider like AWS, GCP, or Azure for easy global deployment. Spin up servers in the regions you need.
- Set up Docker containers for easy deployment and scaling. We can Dockerize the Node-Unblocker server using a Dockerfile.
- Use residential proxy services which provide backconnect rotating IPs to avoid blocks. They usually offer NodeJS libraries for easy integration.
- Monitor server logs for traffic metrics, errors and blocks to identify issues proactively.
- Implement a proxy manager microservice to control proxy distribution across scrapers dynamically.
- Set up monitoring for server uptime to catch issues immediately. Use health checks and alarms.
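The health-check tip above can be sketched as a small function that requests a known URL through each proxy and records which ones respond. This is an illustration under assumptions: `checkProxies`, its option names, and the probe URL are hypothetical, the proxies are assumed to be Node-Unblocker servers mounted under `/proxy/`, and the global `fetch` and `AbortSignal.timeout` require Node 18 or later.

```javascript
// Hypothetical health check: requests a probe URL through each proxy and
// reports which ones answer with a 2xx status within the timeout.
async function checkProxies(proxies, options = {}) {
  const {
    probeUrl = "https://example.com/", // placeholder probe target
    timeoutMs = 5000,
    fetchImpl = fetch, // global fetch, available in Node 18+
  } = options;

  return Promise.all(
    proxies.map(async (proxy) => {
      const started = Date.now();
      try {
        const res = await fetchImpl(proxy + "/proxy/" + probeUrl, {
          signal: AbortSignal.timeout(timeoutMs),
        });
        return { proxy, healthy: res.ok, status: res.status, ms: Date.now() - started };
      } catch (err) {
        // Network errors and timeouts both count as unhealthy
        return { proxy, healthy: false, error: String(err), ms: Date.now() - started };
      }
    })
  );
}

// Usage (run periodically, for example from a scheduler):
// const report = await checkProxies(["http://111.222.22.33:8080"]);
// const healthy = report.filter((r) => r.healthy).map((r) => r.proxy);
```

Accepting the fetch implementation as an option lets the check be exercised with a stub, without touching the network.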
Limitations of Node-Unblocker
While useful, Node-Unblocker does have some limitations:
- Basic proxy features only – no browser emulation or other evasion techniques.
- Need to deploy and manage your own proxy servers.
- Limited IP pool. Blocks are still likely if traffic is concentrated on a few IPs.
- No support for complex sites like Google and Instagram, which implement advanced bot protection.
- No integration for commercial proxy services with large residential IP pools.
So while Node-Unblocker serves basic proxying needs, other tools might be required for large scale scraping of complex sites.
Alternatives and Complements
ScraperAPI
For more advanced needs, proxy APIs like ScraperAPI provide additional capabilities:
- Millions of residential IPs to prevent blocks even for heavy scraping.
- JavaScript rendering to scrape complex sites.
- Built-in support for sites like Instagram and Google.
- Geo-targeting, browser fingerprinting and other evasion techniques.
- High availability infrastructure.
- Easy to integrate – no server management overhead.
ScraperAPI has clients for NodeJS and Python, making it easy to add to existing scrapers.
Commercial Proxy Services
Specialized commercial proxy providers like Bright Data (formerly Luminati) and Oxylabs provide residential IPs along with libraries and clients for various languages. Compared to DIY proxies, these services offer larger and higher-quality IP pools that are less likely to be blocked. However, costs are higher.
Blending Solutions
In practice, a combination of self-hosted and commercial proxies often works best:
- Use commercial proxy pools for baseline scraping to avoid common blocks.
- Deploy your own custom proxies with advanced scraping logic for specific sites.
- Implement proxy rotation across both for maximum scale and evasion.
Depending on the use case, a blended setup can balance cost against features.
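One way to picture a blended setup is a small routing function that sends selected hosts through self-hosted Node-Unblocker servers and everything else through a commercial gateway. Everything named here is a hypothetical placeholder: the addresses, the gateway URL, the host list, and the `pickRoute` function. The commercial gateway is assumed to be a standard HTTP proxy that the client configures separately.

```javascript
// Hypothetical configuration; all addresses and hostnames are placeholders
const SELF_HOSTED = ["http://111.222.22.33:8080", "http://111.222.22.34:8080"];
const COMMERCIAL_GATEWAY = "http://gateway.proxy-provider.example:8000";
const CUSTOM_HOSTS = new Set(["shop.example.com"]); // sites needing custom logic

let cursor = 0; // round-robin position within SELF_HOSTED

// Decides how a target URL should be fetched
function pickRoute(targetUrl) {
  const host = new URL(targetUrl).hostname;
  if (CUSTOM_HOSTS.has(host)) {
    // Self-hosted Node-Unblocker: prefix the target with the proxy address
    const proxy = SELF_HOSTED[cursor];
    cursor = (cursor + 1) % SELF_HOSTED.length;
    return { kind: "self-hosted", url: proxy + "/proxy/" + targetUrl };
  }
  // Commercial pool: request the target directly and configure the gateway
  // as an HTTP proxy in the client
  return { kind: "commercial", url: targetUrl, proxyServer: COMMERCIAL_GATEWAY };
}
```

Keeping the routing decision in one place makes it easier to shift a site from one pool to the other as blocking behaviour or costs change.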
Conclusion
Node-Unblocker provides a handy way to set up a proxy through a simple NodeJS-based server. It serves as a flexible building block in your web scraping stack. However, for fully reliable large-scale scraping, some additional tools may be required. Integrating with commercial proxy pools and services can help take it to the next level.
I hope this guide gave you some ideas on how Node-Unblocker can help your web scraping projects!