Datadome is a popular anti-scraping solution used by many sites to detect and block scrapers and bots. Getting past Datadome's protections can be challenging, but is possible with the right techniques.
In this comprehensive guide, we'll cover everything you need to know to bypass Datadome. Equipped with this information, you'll be able to scrape Datadome-protected sites much more effectively. Let's get started!
Understanding Datadome’s Trust Scoring
Datadome does not make the exact details of its algorithm public. Based on testing and monitoring traffic patterns, researchers have mapped out the key components of how it likely calculates trust:
- IP Reputation (30% weight) – IPs known to belong to cloud providers, hosting services, and Tor exit nodes are heavily penalized. Residential and mobile network IPs score highly.
- TLS Fingerprint (20% weight) – Cipher suites that don't match those of major browsers are penalized as out of spec. Anomalies around SSL session IDs are also flagged.
- HTTP Headers (15% weight) – Deviations from Chrome or Firefox user agent strings and header ordering are discounted. Missing headers like Accept-Language or malformed values are also detection signals.
- JavaScript Rendering (10% weight) – Full support for ES6 features and WebGL is expected; errors and missing capabilities are detected.
- Behavior Analysis (25% weight) – Beyond purely technical factors, patterns like access frequency, variance in locations, and the potential impact on the site are considered.
Based on the weighting of these key components, the algorithm makes an allow/block decision and assigns challenge difficulty proportional to the risk level assessed.
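To make the weighting concrete, here is a minimal sketch of how an allow/block decision could be derived from these components. The weights come from the list above; the per-signal scores and the 0.5 threshold are illustrative assumptions, not Datadome's actual values:

```python
# Illustrative only: combine per-signal scores (0.0 = bot-like, 1.0 = human-like)
# using the approximate weights described above. The 0.5 threshold is an assumption.
WEIGHTS = {
    "ip_reputation": 0.30,
    "tls_fingerprint": 0.20,
    "http_headers": 0.15,
    "js_rendering": 0.10,
    "behavior": 0.25,
}

def trust_score(signals: dict) -> float:
    """Weighted sum of per-signal scores, each in the range 0.0-1.0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

# Example: clean residential IP, but a headless-looking TLS fingerprint
signals = {
    "ip_reputation": 0.9,
    "tls_fingerprint": 0.3,
    "http_headers": 0.8,
    "js_rendering": 0.7,
    "behavior": 0.6,
}

score = trust_score(signals)
print(f"trust score: {score:.2f}")            # 0.67
print("allow" if score >= 0.5 else "challenge/block")
```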
Maximizing Trust Scores to Bypass Datadome
With visibility into how Datadome calculates trust, we can now look at techniques to optimize signals from our scrapers.
Configuring Residential Proxies
The easiest quick win is routing traffic through residential proxy services like BrightData, SmartProxy and others to inherit clean IP reputations. These providers maintain millions of IPs globally through ISP partnerships:
| Provider | Proxy Count | Locations | Pricing |
|---|---|---|---|
| Bright Data | 72M+ | Worldwide | $500/mo |
| Smartproxy | 55M+ | 195+ countries | $14/mo |
| Proxy-Seller | 72M+ | Worldwide | $500+/mo |
| Soax | 72M+ | Worldwide | $10+/GB |
Here is a sample Python request routing via BrightData residential proxies:
```python
import requests

# BrightData residential proxy endpoint; add your zone credentials, e.g.
# "http://<username>:<password>@zproxy.lum-superproxy.io:22225"
proxy = "http://zproxy.lum-superproxy.io:22225"
proxies = {"http": proxy, "https": proxy}

response = requests.get("https://target.com", proxies=proxies)
print(response.status_code)
```
Geo-targeting proxies to match the target site's region can often further improve relevance, as sketched below.
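For example, Bright Data lets you pin a country by adding a flag to the proxy username. The exact username format depends on your account and zone, so treat the credentials below as placeholders:

```python
import requests

# Placeholder credentials: Bright Data zones support country targeting by
# appending a country flag (e.g. "-country-us") to the proxy username.
username = "brd-customer-<id>-zone-<zone>-country-us"
password = "<password>"
proxy = f"http://{username}:{password}@zproxy.lum-superproxy.io:22225"

proxies = {"http": proxy, "https": proxy}
response = requests.get("https://target.com", proxies=proxies)
print(response.status_code)
```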
Browser Engine Mimicry
Beyond IP reputation, we also need to mimic browser capabilities, using headless automation tools like Playwright, Puppeteer, and Selenium together with stealth plugins that spoof automation giveaways such as the navigator.webdriver flag. Here is how the main tools compare:
| Tool | Render Speed | Ease of Use | Platforms |
|---|---|---|---|
| Playwright | Fastest | Simple APIs | Cross-platform |
| Puppeteer | Very fast | More configuration | Chrome-based |
| Selenium | Moderate | Very configurable | Any browser |
Here is an example leveraging Playwright via playwright-extra with the puppeteer-extra stealth plugin:
```javascript
// playwright-extra lets Playwright load puppeteer-extra plugins such as stealth
const { chromium } = require('playwright-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

chromium.use(StealthPlugin());

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Pass all requests through unchanged (hook point for request interception)
  await page.route('**/*', route => route.continue());

  // Mask the webdriver flag before any page script runs
  await page.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
  });

  await page.goto('https://targetwebsite.com/');
  // Extract data here
  await browser.close();
})();
```
Beyond just mimicking headers, full browser rendering provides complete JavaScript support and leaves fewer anomalies for behavioral analysis to pick up on.
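The same approach works from Python with Playwright's sync API. Here is a minimal sketch; the user agent string and target URL are placeholders:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(
        # Placeholder user agent: match it to the browser build you actually run
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        locale="en-US",
    )
    # Mask the webdriver flag before any page script runs
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page.goto("https://targetwebsite.com/")
    html = page.content()  # fully rendered DOM, with JavaScript executed
    browser.close()
```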
Distributing Requests
To avoid suspicious request patterns, distributing traffic across many proxies, browsers and machines is key. This is where orchestration frameworks like Scrapy come in handy:
```python
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = "my_spider"
    # Proxy rotation is delegated to downloader middleware (see settings sketch below)
    start_urls = ["https://target.com"]

    def parse(self, response):
        # Replace with your own extraction logic
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
```
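Scrapy does not rotate proxies by itself; a downloader middleware takes care of that. Here is a minimal settings sketch assuming the third-party scrapy-rotating-proxies package; the proxy URLs are placeholders, and if you use a different middleware the setting names will differ:

```python
# settings.py sketch, assuming scrapy-rotating-proxies is installed
# (pip install scrapy-rotating-proxies); the proxy URLs are placeholders.
ROTATING_PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}

# Spread requests out to avoid burst patterns
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```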
Some commercial services, like ScraperAPI and ProxyCrawl, also provide out-of-the-box distribution starting at around $100/mo. These solutions maintain the proxy rotation infrastructure themselves, so developers can focus their time on data tasks.
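To illustrate, routing a request through such a service usually amounts to one extra parameter. This sketch uses ScraperAPI's HTTP endpoint; the API key is a placeholder and available parameters vary by plan:

```python
import requests

API_KEY = "YOUR_SCRAPERAPI_KEY"  # placeholder
target = "https://target.com"

# ScraperAPI fetches the page on our behalf and rotates IPs server-side
response = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": API_KEY, "url": target},
)
print(response.status_code)
```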
Caching as an Alternate Source
Major search engines like Google keep cached copies of billions of pages that can serve as a proxy source for content protected by Datadome, since these crawlers tend to be whitelisted. A cache reflects the state of a site at the last crawl, so the data can be stale, but for relatively static content it proves very valuable.
Here is an example of fetching content from Google Cache if the live site is blocked:
```python
import requests

URL = "https://targetwebsite.com"

try:
    resp = requests.get(URL)
    if resp.status_code == 200:
        # Success: got updated live content
        doc = resp.text
    else:
        # Blocked: fall back to Google's cached copy
        CACHE_URL = "https://webcache.googleusercontent.com/search?q=cache:" + URL
        cache_resp = requests.get(CACHE_URL)
        doc = cache_resp.text
except Exception as e:
    print(e)
```
The Internet Archive, which also serves cached pages, estimates that it preserves over 384 billion pages, giving developers a data source unaffected by anti-scraping protections, with some modest tradeoffs in freshness.
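The Wayback Machine exposes a simple availability API for locating the closest archived snapshot of a URL. Here is a minimal sketch; the target URL is a placeholder, and not every URL has a snapshot:

```python
import requests

URL = "https://targetwebsite.com"

# Ask the Wayback Machine for the closest archived snapshot of the URL
resp = requests.get("https://archive.org/wayback/available", params={"url": URL})
snapshot = resp.json().get("archived_snapshots", {}).get("closest")

if snapshot and snapshot.get("available"):
    doc = requests.get(snapshot["url"]).text  # fetch the archived copy
else:
    print("No archived snapshot available")
```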
Legal Considerations
In the United States, no law expressly prohibits circumventing bot-detection measures like Datadome. The precedents that do exist, like hiQ v. LinkedIn, have indicated that scraping publicly accessible data does not run afoul of the Computer Fraud and Abuse Act (CFAA), in line with long-standing expectations of open access.
That said, techniques that exploit vulnerabilities or directly attack servers do violate the CFAA, and causing measurable harm to a site's operation can also trigger liability.
As always, much comes down to what the data is used for: commercial resale, internal analytics, and price aggregation have at least some legal grounding in the US under fair use provisions, while scraping content for competitive republishing starts to cross ethical and legal lines.
Conclusion
With techniques like leveraging reputable residential proxies, mimicking browser signatures through headless tools, distributing requests, and tapping alternate sources, we found Datadome can indeed be circumvented reliably.
As anti-scraping solutions evolve, so must scrapers themselves. But with all the tools now available and the precedent set around fair use, developers have many options for continuing to gather valuable public data.
The key is just staying vigilant around changes, leaning on commercial services to abstract complexity, and establishing business practices compliant with norms and regulations around sourcing, competitiveness, attribution, and transparency. With that discipline, data aggregation can still thrive no matter how pervasive anti-scraping efforts become!