As a web scraping expert with over 5 years of experience, I've seen firsthand how TLS fingerprinting has become an increasingly popular technique to identify and block scrapers. In this comprehensive guide, I'll explain what TLS fingerprinting is, how it works, and, most importantly – how to properly configure your scrapers to avoid getting blocked.
What Is TLS Fingerprinting?
TLS (Transport Layer Security) fingerprinting is a technique used to identify a client based on the fields in its Client Hello message during a TLS handshake. The TLS handshake is a process that occurs before any actual data is transmitted between a client and a server, and it involves the exchange of information about the types of encryption that each party supports.
The Client Hello message, sent by the client, lists the encryption methods it supports (also known as cipher suites), the TLS versions it can speak, and a set of extensions.
The specific combination of these details forms a unique pattern or “fingerprint” that can be used to identify the client's software or device, as different software and devices will have different TLS implementations. By examining and categorizing these fingerprints, network administrators or security systems can gain insights into the types and quantities of different clients accessing their servers.
TLS fingerprinting is commonly used for various purposes, such as:
- Gathering information about a client on the web, such as operating system or browser version.
- Analyzing encrypted TLS traffic to guess which websites a user is visiting and what actions they take while on the web.
- Identifying a remote server's operating system or server software.
- Detecting potentially malicious traffic and blocking it before it reaches the network.
- Identifying unauthorized access.
- Assisting in capacity planning and performance optimization.
Anatomy of TLS & Fingerprint Generation
To understand precisely how TLS handshakes get converted into fingerprints used in scraper detection, we need to unpack the handshake itself:
TLS Powers HTTPS Encryption
Transport Layer Security (TLS) enables full encryption for all HTTPS website connections. This powers secure communication across billions of requests daily.
The “ClientHello” Kicks Off the TLS Handshake
This initial message opens the negotiation of connection parameters between client and server.
TLS configuration details leaked during this process are used to create fingerprints. Let's take a closer look:
Parameter | Description |
---|---|
Version | Supported TLS protocol versions (1.0, 1.2 etc.) |
Cipher Suites | Prioritized encryption algorithms supported |
Compression Methods | Data compression schemes supported |
Extensions | Additional capabilities like renegotiation, SNI |
As highlighted – clients send significant identifiable data. But how much variability exists across clients?
Drastic Differences Across Browsers
Different browsers and operating systems use different TLS libraries, which in turn support different cipher suites. For example,
- Firefox uses the NSS library
- Microsoft's legacy browsers (Internet Explorer and pre-Chromium Edge) use SChannel
- Apple Safari uses Apple Secure Transport
- Google Chrome uses BoringSSL.
This means that the TLS fingerprint can vary significantly across different browsers and operating systems. For instance, Chrome on Linux and Safari on an iPhone would have drastically different TLS fingerprints due to the different TLS libraries and cipher suites they use.
This forms the basis of fingerprint generation – modeling the differences.
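To see how this applies to a scraper, it helps to look at what your own runtime offers. The sketch below uses only Python's standard `ssl` module to list the TLS library, version bounds, and cipher suites of a default context. The function name `local_tls_profile` is made up for illustration, and the output covers only part of what ends up in a Client Hello (extensions are not visible this way).

```python
import ssl


def local_tls_profile() -> dict:
    """Summarize the TLS settings of a default Python client context."""
    ctx = ssl.create_default_context()
    return {
        # The underlying TLS library build, e.g. an OpenSSL version string.
        "library": ssl.OPENSSL_VERSION,
        # Version bounds are enum members; .name gives a readable label.
        "min_version": ctx.minimum_version.name,
        "max_version": ctx.maximum_version.name,
        # Cipher suites enabled on this context, in the library's own order.
        "ciphers": [c["name"] for c in ctx.get_ciphers()],
    }


if __name__ == "__main__":
    profile = local_tls_profile()
    print(profile["library"])
    print(profile["min_version"], "to", profile["max_version"])
    for name in profile["ciphers"]:
        print(" ", name)
```

Comparing this list with what a browser sends (for example, in a Wireshark capture) usually makes the gap between a script and a real browser obvious.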
Fingerprinting Clients via JA3
The most widely used method is JA3, which encodes five Client Hello fields into a comma-separated string for easy analysis. The MD5 hash of that string is the JA3 fingerprint:
TLSVersion,Ciphers,Extensions,EllipticCurves,EllipticCurvePointFormats
For example, Chrome 91 on Windows 10 (abbreviated, with elided values):
771,4865-4866-4867,...-52393-52392,0-11-10-35,...,29-23
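As a rough sketch of how such a string is built and hashed: the published JA3 method joins the values within each field with `-`, joins the fields with `,` (including a fifth field for elliptic curve point formats), skips GREASE placeholder values, and takes the MD5 of the result. The helper names and the sample values below are illustrative assumptions, not data from a real capture.

```python
import hashlib

# GREASE values (0x0A0A, 0x1A1A, ..., 0xFAFA) are random placeholders that
# some browsers insert; JA3 ignores them so the fingerprint stays stable.
GREASE = {(i << 12) | 0x0A00 | (i << 4) | 0x0A for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string from decimal Client Hello field values."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])


def ja3_hash(ja3: str) -> str:
    """Return the 32-character MD5 hex digest used as the JA3 fingerprint."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Hypothetical field values chosen only to show the format.
    s = ja3_string(771, [0x0A0A, 4865, 4866, 4867], [0, 11, 10, 35], [29, 23], [0])
    print(s)
    print(ja3_hash(s))
```

Because the hash covers the order of values as well as their presence, two clients that support the same ciphers in a different order produce different fingerprints.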
Sites and services process billions of handshakes into JA3 form, building large datasets that catalog the fingerprints expected from real browsers. When a Python script shows up with a very different set of supported ciphers and extensions, it is trivially easy to flag as a likely scraper for further investigation or an outright block.
A companion method, JA3S, fingerprints the server's Server Hello response in the same way. Pairing it with JA3 helps characterize a specific client and server combination. JA3 remains the popular standard.
Running Extensive Fingerprint Analysis
To better understand precisely how sites evaluate TLS handshakes for blocking decisions, I leveraged a combination of open source fingerprint datasets and contributions from proprietary sources to conduct an extensive analysis.
Conducting Your Own Analysis
The Client Hello is normally sent unencrypted, so the fields used for fingerprinting can be read from a plain packet capture. Reading the rest of a TLS (Transport Layer Security) session involves decrypting it, since TLS is a cryptographic protocol designed to secure communications over a computer network. The most common tool for this purpose is Wireshark, a free and open-source packet analyzer.
To decrypt TLS data in Wireshark when RSA key exchange is used, you need the server's private key (a *.pem file). For ephemeral Diffie-Hellman key exchange, which is more prevalent today, the private key is not enough: you need to record the session keys while capturing, because they are temporary and change for every connection.
To decrypt a PCAP (packet capture) with Wireshark, you need an SSLKEYLOGFILE. This file can be created in a variety of ways depending on which device you control. You must configure the client to log its encryption keys to the SSLKEYLOGFILE before you start capturing network traffic, or you won't be able to decrypt the capture.
Here's how to set the SSLKEYLOGFILE environment variable:
- On Windows CMD:
    C:\> set SSLKEYLOGFILE=%USERPROFILE%\Desktop\sslkeylog.log
    C:\> echo %SSLKEYLOGFILE%
- On Windows PowerShell:
    PS C:\> $env:SSLKEYLOGFILE = "$env:USERPROFILE\sslkeylog.txt"
    PS C:\> $env:SSLKEYLOGFILE
- On Linux / macOS:
    export SSLKEYLOGFILE=$HOME/sslkeylog.log
    echo $SSLKEYLOGFILE
The SSLKEYLOGFILE variable is supported by Firefox, Chrome, curl, mitmproxy, and Exim.
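Before opening Wireshark, it can be worth checking that the key log actually contains secrets. The sketch below assumes the usual NSS key log layout, where each non-comment line is a label, the client random in hex, and the secret in hex, separated by spaces. The function name `summarize_keylog` is hypothetical.

```python
import os
from collections import Counter


def summarize_keylog(text: str) -> Counter:
    """Count key log entries per label, skipping comments and malformed lines."""
    counts = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            continue
        label, client_random, secret = parts
        try:
            # Both values must be valid hex; anything else is skipped.
            bytes.fromhex(client_random)
            bytes.fromhex(secret)
        except ValueError:
            continue
        counts[label] += 1
    return counts


if __name__ == "__main__":
    path = os.environ.get("SSLKEYLOGFILE")
    if path and os.path.exists(path):
        with open(path, encoding="utf-8", errors="replace") as fh:
            for label, n in summarize_keylog(fh.read()).items():
                print(f"{label}: {n}")
    else:
        print("SSLKEYLOGFILE is not set or the file does not exist yet")
```

TLS 1.2 sessions typically show up as `CLIENT_RANDOM` entries, while TLS 1.3 sessions produce several traffic secret labels per connection. An empty summary usually means the client was started before the variable was set.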
Once you have both the PCAP and the SSLKEYLOGFILE, you can decrypt the TLS data using Wireshark or editcap. A nice trick is to use the editcap tool to inject the keylog file into the PCAP file. With the PCAPNG format, it is possible to create a bundle that merges the two files (pcap and keylog files) into a single file. Opening the new file, you can inspect the decrypted traffic in Wireshark without having to configure anything else.
Please note that the following TCP protocol preferences are also required to enable TLS decryption in Wireshark:
- Allow subdissector to reassemble TCP streams (enabled by default).
- Reassemble out-of-order segments (available since Wireshark 3.0, disabled by default).
Remember, the decrypted contents of your PCAP should not be shared with anyone who shouldn't have access to them.
Reviewing Commercial Fingerprint Volumes
Based on anonymized discussions of commercial dataset sizes, the major bot mitigation vendors appear to have made tremendous investments (the figures below are unverified estimates):
- Akamai – Billions of fingerprints and counting based on Kona Site Defender platform
- Imperva – Estimated at over 15 billion fingerprints across 3 data centers
- DataDome – 5 billion fingerprints analyzed daily
These enable incredibly accurate real-time classification and blocking.
Advanced Evasion Tactics
Now equipped with greater visibility into fingerprint generation and database analysis – let's explore advanced tactics leveraged to avoid easy classification and blocking.
Configuring Scrapers for Evasion
How much of the TLS configuration you can change natively varies significantly by language and by how the scraper is built.
Python
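The requests library offers only restricted control, mainly over the TLS version and cipher suites:
    import requests
    from urllib3.util import ssl_

    # Works with urllib3 1.x only; urllib3 2.x removed DEFAULT_CIPHERS.
    # OpenSSL cipher strings are colon-separated.
    ssl_.DEFAULT_CIPHERS += ":ECDHE-RSA-AES128-GCM-SHA256"
    requests.get("https://site.com")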
There is no way to spoof the extension list or its order, so the resulting fingerprint still differs from a real browser's, and those anomalies can undermine impersonation efforts.
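The same two knobs can also be exercised with the standard library alone, which avoids depending on a particular urllib3 version. This is a minimal sketch: `build_context` and `fetch` are hypothetical helper names, and the cipher list is only an example. Note that OpenSSL's `set_ciphers` governs TLS 1.2 and earlier suites; TLS 1.3 suites are configured separately and stay enabled.

```python
import ssl
import urllib.request


def build_context(ciphers, min_version=ssl.TLSVersion.TLSv1_2) -> ssl.SSLContext:
    """Create a client context restricted to the given OpenSSL cipher names."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = min_version
    # OpenSSL expects a colon-separated cipher string.
    ctx.set_ciphers(":".join(ciphers))
    return ctx


def fetch(url: str, ctx: ssl.SSLContext, timeout: float = 10.0) -> bytes:
    """Fetch a URL over HTTPS using the supplied TLS context."""
    opener = urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    ctx = build_context([
        "ECDHE-ECDSA-AES128-GCM-SHA256",
        "ECDHE-RSA-AES128-GCM-SHA256",
    ])
    for cipher in ctx.get_ciphers():
        print(cipher["name"])
```

A context built this way can usually be handed to requests as well, through a custom transport adapter that passes it as `ssl_context` to the pool manager. Either way, only the version and cipher fields change; the extension list is still whatever the underlying TLS library emits.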
Go
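In contrast, the utls Go library (github.com/refraction-networking/utls) enables full TLS handshake control, including ready-made browser Client Hello presets:
    package main

    import (
        "net"

        utls "github.com/refraction-networking/utls"
    )

    func main() {
        conn, err := net.Dial("tcp", "site.com:443")
        if err != nil {
            panic(err)
        }
        // HelloChrome_Auto replays the Client Hello of a recent Chrome build,
        // including its cipher suites, curves, and extension order.
        uconn := utls.UClient(conn, &utls.Config{ServerName: "site.com"}, utls.HelloChrome_Auto)
        if err := uconn.Handshake(); err != nil {
            panic(err)
        }
    }
Browser parity is far easier to achieve this way.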
Infrastructure Management
Sophisticated operations rely on orchestration infrastructure that manages many varied configurations at scale:
- Browser Automation – Puppeteer, Playwright
- Residential Proxies – Bright Data, Smartproxy, Proxy-Seller, Soax
- Fingerprint Analysis Tools
- Dynamic Job Assignment
Regularly cycling fingerprints across tens of thousands of proxies minimizes how often a site sees the same fingerprint and IP combination.
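The cycling idea can be sketched in a few lines. This is a toy illustration under stated assumptions: the profile labels and proxy URLs are placeholders, `rotation_schedule` is a hypothetical helper, and a real system would also weigh proxy health and per-site history.

```python
import itertools
import random


def rotation_schedule(profiles, proxies, seed=None):
    """Yield (profile, proxy) pairs endlessly, visiting every pair once per cycle."""
    pairs = list(itertools.product(profiles, proxies))
    # Shuffle so consecutive requests do not walk the pool in a predictable order.
    random.Random(seed).shuffle(pairs)
    return itertools.cycle(pairs)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    profiles = ["chrome-like", "firefox-like", "safari-like"]
    proxies = ["http://proxy-a.example.com:8000", "http://proxy-b.example.com:8000"]
    schedule = rotation_schedule(profiles, proxies, seed=42)
    for profile, proxy in itertools.islice(schedule, 6):
        print(profile, "via", proxy)
```

Each pair appears once before any pair repeats, which spreads requests evenly over the available combinations.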
Specialized Evasion Services
Vendors such as ScrapingBee offer managed evasion, including:
- Baselining site fingerprints
- Custom browser pool configuration
- Residential proxy sources
- Automated cycling routines
Offloading setup complexity while experts handle updates is compelling for many.
Consequences & Remediation of Blocks
Despite advanced precautions, blocks still frequently happen, requiring various remediation techniques depending on severity.
Bot Mitigation Flows
Upon fingerprint-driven classification, additional protections deploy:
- Browser validation JS challenges
- Varying gateway access tokens
- Rate limiting – from invisible through full blocks
IP Block Remediation
Class C residential proxies often get burned:
- Temporary cool down period
- Permanent IP ban requiring pool refresh
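Both outcomes can be modeled with a small pool that rests a proxy after a temporary block and drops it after a permanent one. This is a minimal sketch: the `ProxyPool` class, its method names, and the 300 second default are assumptions for illustration, not values taken from any vendor.

```python
import time


class ProxyPool:
    """Track which proxies are usable, resting or removing blocked ones."""

    def __init__(self, proxies, cooldown_seconds=300.0, clock=time.monotonic):
        self._clock = clock
        self._cooldown = cooldown_seconds
        # Map each proxy to the time at which it may be used again.
        self._available_at = {p: float("-inf") for p in proxies}

    def acquire(self):
        """Return the first proxy that is not cooling down, or None."""
        now = self._clock()
        for proxy, ready_at in self._available_at.items():
            if ready_at <= now:
                return proxy
        return None

    def report_block(self, proxy, permanent=False):
        """Rest a proxy after a temporary block, or drop it after a ban."""
        if permanent:
            self._available_at.pop(proxy, None)
        elif proxy in self._available_at:
            self._available_at[proxy] = self._clock() + self._cooldown

    def __len__(self):
        return len(self._available_at)


if __name__ == "__main__":
    pool = ProxyPool(["http://proxy-a.example.com:8000"], cooldown_seconds=60)
    proxy = pool.acquire()
    pool.report_block(proxy)           # temporary block: proxy rests for 60 s
    print(pool.acquire())              # None while the proxy is cooling down
```

When `len(pool)` drops too low after permanent bans, that is the signal to refresh the pool from your provider.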
Legal Options
Some international scraping projects lean on lawyers to contest blocks, for example in response to cease-and-desist letters. This forces sites to evaluate their legal risks.
Shared Platform Risks
Scrapers hosted on centralized clouds like AWS run a higher risk of being blocked because of their neighbors, since a ban on a shared IP range causes collateral damage by association. Keeping scraping infrastructure isolated is recommended where feasible.
Conclusion
I hope this guide gave you a solid understanding of how TLS fingerprinting works and how to properly configure your web scrapers for success!