Datadome is a popular anti-scraping solution used by many sites to detect and block scrapers and bots. Getting past Datadome's protections can be challenging, but is possible with the right techniques.
In this comprehensive guide, we'll cover everything you need to know to bypass Datadome. Equipped with this information, you'll be able to scrape Datadome-protected sites much more effectively. Let's get started!
Understanding Datadome’s Trust Scoring
Datadome does not make the exact details of its algorithm public. Based on testing and monitoring traffic patterns, researchers have mapped out the key components of how it likely calculates trust:
- IP Reputation (30% weight) – IPs known to belong to cloud providers, hosting services, and Tor exit nodes are heavily penalized. Residential and mobile network IPs score highly.
- TLS Fingerprint (20% weight) – Cipher suites that don't match those of major browsers are penalized as out of spec. Anomalies around SSL session IDs are also flagged.
- HTTP Headers (15% weight) – Deviations from Chrome or Firefox user agent strings and header ordering are discounted. Missing headers like Accept-Language or malformed values are also detection signals.
- JavaScript Rendering (10% weight) – Full support for ES6 features and WebGL is expected; errors and missing capabilities are detected.
- Behavior Analysis (25% weight) – Beyond purely technical factors, patterns like access frequency, variance in locations, and the potential impact on the site are considered.
Based on the weighting of these key components, the algorithm makes an allow/block decision and assigns challenge difficulty proportional to the risk level assessed.
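To make the weighting concrete, here is a minimal sketch of how an allow/block decision could be derived from these components. The weights come from the list above; the per-signal scores and the 0.5 threshold are illustrative assumptions, not Datadome's actual values:

```python
# Illustrative only: combine per-signal scores (0.0 = bot-like, 1.0 = human-like)
# using the approximate weights described above. The 0.5 threshold is an assumption.
WEIGHTS = {
    "ip_reputation": 0.30,
    "tls_fingerprint": 0.20,
    "http_headers": 0.15,
    "js_rendering": 0.10,
    "behavior": 0.25,
}

def trust_score(signals: dict) -> float:
    """Weighted sum of per-signal scores, each in the range 0.0-1.0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

# Example: clean residential IP, but a headless-looking TLS fingerprint
signals = {
    "ip_reputation": 0.9,
    "tls_fingerprint": 0.3,
    "http_headers": 0.8,
    "js_rendering": 0.7,
    "behavior": 0.6,
}

score = trust_score(signals)
print(f"trust score: {score:.2f}")            # 0.67
print("allow" if score >= 0.5 else "challenge/block")
```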
Maximizing Trust Scores to Bypass Datadome
With visibility into how Datadome calculates trust, we can now look at techniques to optimize signals from our scrapers.
Configuring Residential Proxies
The easiest quick win is routing traffic through residential proxy services like BrightData, SmartProxy and others to inherit clean IP reputations. These providers maintain millions of IPs globally through ISP partnerships:
| Provider | Proxy Count | Locations | Pricing |
|---|---|---|---|
| Bright Data | 72M+ | Worldwide | $500/mo |
| Smartproxy | 55M+ | 195+ countries | $14/mo |
| Proxy-Seller | 72M+ | Worldwide | $500+/mo |
| Soax | 72M+ | Worldwide | $10+/GB |
Here is a sample Python request routing via BrightData residential proxies:
```python
import requests

# BrightData residential proxy endpoint; add your zone credentials, e.g.
# "http://<username>:<password>@zproxy.lum-superproxy.io:22225"
proxy = "http://zproxy.lum-superproxy.io:22225"
proxies = {"http": proxy, "https": proxy}

response = requests.get("https://target.com", proxies=proxies)
print(response.status_code)
```
Geo-targeting proxies to match the target site's region can often further improve relevance, as sketched below.
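For example, Bright Data lets you pin a country by adding a flag to the proxy username. The exact username format depends on your account and zone, so treat the credentials below as placeholders:

```python
import requests

# Placeholder credentials: Bright Data zones support country targeting by
# appending a country flag (e.g. "-country-us") to the proxy username.
username = "brd-customer-<id>-zone-<zone>-country-us"
password = "<password>"
proxy = f"http://{username}:{password}@zproxy.lum-superproxy.io:22225"

proxies = {"http": proxy, "https": proxy}
response = requests.get("https://target.com", proxies=proxies)
print(response.status_code)
```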
Browser Engine Mimicry
Beyond IP reputation, we also need to mimic browser capabilities, using headless automation tools like Playwright, Puppeteer, and Selenium together with stealth plugins that spoof automation giveaways such as the navigator.webdriver flag. Here is how the main tools compare:
| Tool | Render Speed | Ease of Use | Platforms |
|---|---|---|---|
| Playwright | Fastest | Simple APIs | Cross-platform |
| Puppeteer | Very fast | More configuration | Chrome-based |
| Selenium | Moderate | Very configurable | Any browser |
Here is an example leveraging Playwright via playwright-extra with the puppeteer-extra stealth plugin:
```javascript
// playwright-extra lets Playwright load puppeteer-extra plugins such as stealth
const { chromium } = require('playwright-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

chromium.use(StealthPlugin());

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Pass all requests through unchanged (hook point for request interception)
  await page.route('**/*', route => route.continue());

  // Mask the webdriver flag before any page script runs
  await page.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
  });

  await page.goto('https://targetwebsite.com/');
  // Extract data here
  await browser.close();
})();
```
Beyond just mimicking headers, full browser rendering provides complete JavaScript support and leaves fewer anomalies for behavioral analysis to pick up on.
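The same approach works from Python with Playwright's sync API. Here is a minimal sketch; the user agent string and target URL are placeholders:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(
        # Placeholder user agent: match it to the browser build you actually run
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        locale="en-US",
    )
    # Mask the webdriver flag before any page script runs
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page.goto("https://targetwebsite.com/")
    html = page.content()  # fully rendered DOM, with JavaScript executed
    browser.close()
```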
Distributing Requests
To avoid suspicious request patterns, distributing traffic across many proxies, browsers and machines is key. This is where orchestration frameworks like Scrapy come in handy:
```python
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = "my_spider"
    # Proxy rotation is delegated to downloader middleware (see settings sketch below)
    start_urls = ["https://target.com"]

    def parse(self, response):
        # Replace with your own extraction logic
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
```
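Scrapy does not rotate proxies by itself; a downloader middleware takes care of that. Here is a minimal settings sketch assuming the third-party scrapy-rotating-proxies package; the proxy URLs are placeholders, and if you use a different middleware the setting names will differ:

```python
# settings.py sketch, assuming scrapy-rotating-proxies is installed
# (pip install scrapy-rotating-proxies); the proxy URLs are placeholders.
ROTATING_PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}

# Spread requests out to avoid burst patterns
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```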
Some commercial services, like ScraperAPI and ProxyCrawl, also provide out-of-the-box distribution starting at around $100/mo. These solutions maintain the proxy rotation infrastructure themselves, so developers can focus their time on data tasks.
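To illustrate, routing a request through such a service usually amounts to one extra parameter. This sketch uses ScraperAPI's HTTP endpoint; the API key is a placeholder and available parameters vary by plan:

```python
import requests

API_KEY = "YOUR_SCRAPERAPI_KEY"  # placeholder
target = "https://target.com"

# ScraperAPI fetches the page on our behalf and rotates IPs server-side
response = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": API_KEY, "url": target},
)
print(response.status_code)
```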
Caching as an Alternate Source
Major search engines like Google keep cached copies of billions of pages that can serve as a proxy source for content protected by Datadome, since these crawlers tend to be whitelisted. A cache reflects the state of a site at the last crawl, so the data can be stale, but for relatively static content it proves very valuable.
Here is an example of fetching content from Google Cache if the live site is blocked:
```python
import requests

URL = "https://targetwebsite.com"

try:
    resp = requests.get(URL)
    if resp.status_code == 200:
        # Success: got updated live content
        doc = resp.text
    else:
        # Blocked: fall back to Google's cached copy
        CACHE_URL = "https://webcache.googleusercontent.com/search?q=cache:" + URL
        cache_resp = requests.get(CACHE_URL)
        doc = cache_resp.text
except Exception as e:
    print(e)
```
The Internet Archive, which also serves cached pages, estimates that it preserves over 384 billion pages, giving developers a data source unaffected by anti-scraping protections, with some modest tradeoffs in freshness.
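The Wayback Machine exposes a simple availability API for locating the closest archived snapshot of a URL. Here is a minimal sketch; the target URL is a placeholder, and not every URL has a snapshot:

```python
import requests

URL = "https://targetwebsite.com"

# Ask the Wayback Machine for the closest archived snapshot of the URL
resp = requests.get("https://archive.org/wayback/available", params={"url": URL})
snapshot = resp.json().get("archived_snapshots", {}).get("closest")

if snapshot and snapshot.get("available"):
    doc = requests.get(snapshot["url"]).text  # fetch the archived copy
else:
    print("No archived snapshot available")
```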
Legal Considerations
In the United States, no law expressly prohibits circumventing bot-detection measures like Datadome. The precedents that do exist, like hiQ v. LinkedIn, have indicated that scraping publicly accessible data does not run afoul of the Computer Fraud and Abuse Act (CFAA), in line with long-standing expectations of open access.
That said, techniques that exploit vulnerabilities or directly attack servers do violate the CFAA, and causing measurable harm to a site's operation can also trigger liability.
As always, much comes down to what the data is used for: commercial resale, internal analytics, and price aggregation have at least some legal grounding in the US under fair use provisions, while scraping content for competitive republishing starts to cross ethical and legal lines.
Conclusion
With techniques like leveraging reputable residential proxies, mimicking browser signatures through headless tools, distributing requests, and tapping alternate sources, we found Datadome can indeed be circumvented reliably.
As anti-scraping solutions evolve, so must scrapers themselves. But with all the tools now available and the precedent set around fair use, developers have many options for continuing to gather valuable public data.
The key is just staying vigilant around changes, leaning on commercial services to abstract complexity, and establishing business practices compliant with norms and regulations around sourcing, competitiveness, attribution, and transparency. With that discipline, data aggregation can still thrive no matter how pervasive anti-scraping efforts become!