How Is TLS Fingerprinting Used to Block Web Scrapers?

As a web scraping expert with over 5 years of experience, I've seen firsthand how TLS fingerprinting has become an increasingly popular technique to identify and block scrapers. In this comprehensive guide, I'll explain what TLS fingerprinting is, how it works, and, most importantly – how to properly configure your scrapers to avoid getting blocked.

What Is TLS Fingerprinting?

TLS (Transport Layer Security) fingerprinting is a technique used to identify a client based on the fields in its Client Hello message during a TLS handshake. The TLS handshake is a process that occurs before any actual data is transmitted between a client and a server, and it involves the exchange of information about the types of encryption that each party supports.

The Client Hello message, sent by the client, lists the TLS version it supports, its supported encryption methods (also known as cipher suites), its compression methods, and a set of extensions.

The specific combination of these details forms a unique pattern or “fingerprint” that can be used to identify the client's software or device, as different software and devices will have different TLS implementations. By examining and categorizing these fingerprints, network administrators or security systems can gain insights into the types and quantities of different clients accessing their servers.

TLS fingerprinting is commonly used for various purposes, such as:

Gathering information about a client on the web, such as operating system or browser version.
Analyzing encrypted TLS traffic to guess which websites a user is visiting and what actions they take while on the web.
Identifying a remote server's operating system or server software.
Detecting potentially malicious traffic and blocking it before it reaches the network.
Identifying unauthorized access.
Assisting in capacity planning and performance optimization.

Anatomy of TLS & Fingerprint Generation

To understand precisely how TLS handshakes get converted into fingerprints used in scraper detection, we need to unpack the handshake itself:

TLS Powers HTTPS Encryption

Transport Layer Security (TLS) enables full encryption for all HTTPS website connections. This powers secure communication across billions of requests daily.

The “ClientHello” Kicks Off the TLS Handshake

The handshake opens with the client sending a Client Hello to the server, proposing the connection parameters it supports.

TLS configuration details leaked during this process are used to create fingerprints. Let's take a closer look:

Parameter	Description
Version	Supported TLS protocol versions (1.0, 1.2 etc.)
Cipher Suites	Prioritized encryption algorithms supported
Compression Methods	Data compression schemes supported
Extensions	Additional capabilities like renegotiation, SNI
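
To make the table concrete, here is a minimal sketch of pulling these fields out of a raw Client Hello in Python. It assumes a single, unfragmented TLS record and does no validation beyond two type checks; the function name parse_client_hello is illustrative and not taken from any library.

```python
import struct


def parse_client_hello(record: bytes) -> dict:
    """Extract fingerprint-relevant fields from one TLS record.

    Assumes the record holds a complete, unfragmented Client Hello.
    """
    if record[0] != 0x16:  # record content type 22 = handshake
        raise ValueError("not a TLS handshake record")
    if record[5] != 0x01:  # handshake message type 1 = Client Hello
        raise ValueError("not a Client Hello")

    pos = 9  # 5-byte record header + 4-byte handshake header
    (version,) = struct.unpack_from("!H", record, pos)
    pos += 2 + 32  # client version, then 32 bytes of random
    pos += 1 + record[pos]  # session ID length byte + session ID

    (cipher_bytes,) = struct.unpack_from("!H", record, pos)
    pos += 2
    ciphers = list(struct.unpack_from(f"!{cipher_bytes // 2}H", record, pos))
    pos += cipher_bytes

    comp_count = record[pos]
    compression = list(record[pos + 1 : pos + 1 + comp_count])
    pos += 1 + comp_count

    extensions = []
    if pos + 2 <= len(record):  # the extensions block is optional
        (ext_bytes,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = pos + ext_bytes
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            extensions.append(ext_type)  # keep the type, skip the payload
            pos += 4 + ext_len

    return {
        "version": version,
        "cipher_suites": ciphers,
        "compression_methods": compression,
        "extensions": extensions,
    }
```

The order of the returned lists is preserved on purpose: fingerprinting schemes treat the ordering of cipher suites and extensions as part of the signal, not just their presence.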

As highlighted – clients send significant identifiable data. But how much variability exists across clients?

Drastic Differences Across Browsers

Different browsers and operating systems use different TLS libraries, which in turn support different cipher suites. For example,

Firefox uses the NSS library
Microsoft's Internet Explorer and legacy Edge use SChannel
Apple Safari uses Apple Secure Transport
Google Chrome uses BoringSSL.

This means that the TLS fingerprint can vary significantly across different browsers and operating systems. For instance, Chrome on Linux and Safari on an iPhone would have drastically different TLS fingerprints due to the different TLS libraries and cipher suites they use.

This forms the basis of fingerprint generation – modeling the differences.

Fingerprinting Clients via JA3

The most widely used method is JA3, which defines a spec for encoding Client Hello data into a comma-separated string that is then MD5-hashed for easy storage and comparison:

TLSVersion,Ciphers,Extensions,EllipticCurves,EllipticCurvePointFormats

For example Chrome 91 on Windows 10:

771,4865-4866-4867,...-52393-52392,0-11-10-35,...,29-23

Sites and services process billions of handshakes into JA3 form, building massive datasets that catalog expected browser fingerprints. When an unfamiliar Python script connects with a very different set of supported parameters, it is trivially easy to flag as a likely scraper for further investigation or an outright block.
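
As a rough illustration of the encoding, the sketch below builds a JA3 string and its MD5 hash from already-parsed Client Hello values. It follows the published JA3 layout of five comma-separated fields (version, ciphers, extensions, elliptic curves, elliptic curve point formats) with values inside a field joined by dashes, and drops GREASE placeholder values as JA3 implementations do. The function names and the sample numbers in the usage note are illustrative, not captured from a real browser.

```python
import hashlib

# GREASE placeholders (0x0A0A, 0x1A1A, ... 0xFAFA) are ignored by JA3.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string from decimal field values."""

    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [
            str(version),
            join(ciphers),
            join(extensions),
            join(curves),
            join(point_formats),
        ]
    )


def ja3_hash(ja3: str) -> str:
    """Return the 32-character MD5 hex digest used as the JA3 fingerprint."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()
```

For instance, ja3_string(771, [0x0A0A, 4865, 4866], [0, 11, 10], [29, 23], [0]) yields "771,4865-4866,0-11-10,29-23,0". Changing the order of any list changes the hash, which is why two clients with the same capabilities can still be told apart.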

The companion JA3S method applies the same idea to the server's Server Hello response, so a JA3 and JA3S pair characterizes a specific client-server negotiation. JA3 remains the popular standard for client fingerprinting.

Running Extensive Fingerprint Analysis

To better understand precisely how sites evaluate TLS handshakes for blocking decisions, I leveraged a combination of open source fingerprint datasets and contributions from proprietary sources to conduct an extensive analysis.

Conducting Your Own Analysis

The Client Hello itself is sent unencrypted, so the fingerprint fields can be read straight from a packet capture. Reading the rest of a TLS session involves decrypting it, since TLS is a cryptographic protocol designed to provide communications security over a computer network. The most common tool for both tasks is Wireshark, a free and open-source packet analyzer.

To decrypt TLS data in Wireshark, you need the server's private key (a *.pem file) when RSA key exchange is used. For ephemeral Diffie-Hellman key exchange, which is far more prevalent today, the private key is not enough: you need to record the session keys while capturing, because they are temporary and change for every connection.

To decrypt a PCAP (packet capture) with Wireshark, you need an SSLKEYLOGFILE. This file can be created in a variety of ways depending on which device you control. The client must be configured to log encryption keys to the SSLKEYLOGFILE before you start capturing network traffic, or you won't be able to decrypt the captured traffic.

Here's how to set the SSLKEYLOGFILE environment variable:

On Windows CMD:

C:\> set SSLKEYLOGFILE=%USERPROFILE%\Desktop\sslkeylog.log
C:\> echo %SSLKEYLOGFILE%

On Windows PowerShell:

PS C:\> $env:SSLKEYLOGFILE = "$env:USERPROFILE\sslkeylog.txt"
PS C:\> $env:SSLKEYLOGFILE

On Linux / macOS:

export SSLKEYLOGFILE=$HOME/sslkeylog.log
echo $SSLKEYLOGFILE

The variable is honored by Firefox, Chrome, curl, mitmproxy, and Exim.
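
A Python scraper can produce the same kind of key log. The sketch below assumes Python 3.8 or newer built against OpenSSL 1.1.1 or newer, where ssl.SSLContext exposes a keylog_filename attribute. The helper names are illustrative, and parse_keylog_line only handles the basic "label client_random secret" layout of the NSS key log format.

```python
import ssl


def context_with_keylog(path: str) -> ssl.SSLContext:
    """Create a default client context that logs TLS secrets to `path`."""
    ctx = ssl.create_default_context()
    ctx.keylog_filename = path  # secrets are appended for every handshake
    return ctx


def parse_keylog_line(line: str):
    """Split one key log line into (label, client_random, secret).

    Returns None for blank lines, comments, and malformed lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split()
    if len(parts) != 3:
        return None
    return tuple(parts)
```

Any connection made with the returned context, for example through urllib.request with an HTTPSHandler, writes its secrets to the file, which Wireshark can then load alongside the capture.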

Once you have both the PCAP and the SSLKEYLOGFILE, you can decrypt the TLS data using Wireshark or editcap. A nice trick is to use the editcap tool to inject the keylog file into the PCAP file. With the PCAPNG format, it is possible to create a bundle that merges the two files (pcap and keylog files) into a single file. Opening the new file, you can inspect the decrypted traffic in Wireshark without having to configure anything else.

Please note that the following TCP protocol preferences are also required to enable TLS decryption in Wireshark:

Allow subdissector to reassemble TCP streams. Enabled by default.
Reassemble out-of-order segments (since Wireshark 3.0, disabled by default).
Remember, the decrypted contents of your PCAP should not be shared with anyone who shouldn't have access to them.

Reviewing Commercial Fingerprint Volumes

Drawing from anonymized discussions around commercial dataset sizes, major Bot Mitigation vendors have made tremendous investments:

Akamai – Billions of fingerprints and counting based on Kona Site Defender platform
Imperva – Estimated at over 15 billion fingerprints across 3 data centers
DataDome – 5 billion fingerprints analyzed daily

These enable incredibly accurate real-time classification and blocking.

Advanced Evasion Tactics

Now equipped with greater visibility into fingerprint generation and database analysis – let's explore advanced tactics leveraged to avoid easy classification and blocking.

Configuring Scrapers for Evasion

Natively modifying TLS configuration varies significantly across languages and scraper construction approaches.

Python

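The requests library offers restricted control, mainly over the TLS version and cipher suites, and both are set on an ssl.SSLContext (the ssl module has no public DEFAULT_CIPHERS attribute to append to):

import ssl

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.set_ciphers("ECDHE-RSA-AES128-GCM-SHA256")  # affects TLS 1.2 suites only
# requests uses ctx when an HTTPAdapter passes ssl_context=ctx to its pool manager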

There is no way to spoof the extension list or its order from here, so the remaining anomalies can still undermine impersonation efforts.
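
For reference, a commonly used pattern for getting a custom cipher list into requests is a transport adapter that hands an SSLContext to urllib3. The sketch below is one way to do it, not the only one: build_context and make_session are illustrative names, requests is imported lazily so the context helper works on its own, and behavior may differ between requests and urllib3 versions.

```python
import ssl


def build_context(ciphers: str) -> ssl.SSLContext:
    """Default client context restricted to the given OpenSSL cipher string."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.set_ciphers(ciphers)  # TLS 1.3 suites are configured separately by OpenSSL
    return ctx


def make_session(ctx: ssl.SSLContext):
    """Return a requests.Session whose HTTPS connections use `ctx`."""
    # Imported here so build_context() is usable without requests installed.
    import requests
    from requests.adapters import HTTPAdapter

    class ContextAdapter(HTTPAdapter):
        def init_poolmanager(self, *args, **kwargs):
            kwargs["ssl_context"] = ctx  # forwarded to urllib3's PoolManager
            return super().init_poolmanager(*args, **kwargs)

    session = requests.Session()
    session.mount("https://", ContextAdapter())
    return session


if __name__ == "__main__":
    ctx = build_context("ECDHE-RSA-AES128-GCM-SHA256")
    # Inspect what this context would offer before sending any traffic.
    print([c["name"] for c in ctx.get_ciphers()])
```

Even with this in place, only the cipher list and version bounds change. The extension set and ordering still come from OpenSSL, so the resulting fingerprint is a modified Python fingerprint rather than a browser one.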

Go

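In contrast, the utls Go library (github.com/refraction-networking/utls) enables full TLS handshake control. Its UClient constructor takes a ClientHelloID preset that replays a browser's Client Hello, including cipher suites, extensions, and curves; a fully custom list can be supplied through a ClientHelloSpec instead:

package main

import (
    "log"
    "net"

    utls "github.com/refraction-networking/utls"
)

func main() {
    conn, err := net.Dial("tcp", "site.com:443")
    if err != nil {
        log.Fatal(err)
    }
    // HelloChrome_Auto selects the library's current Chrome Client Hello preset.
    uconn := utls.UClient(conn, &utls.Config{ServerName: "site.com"}, utls.HelloChrome_Auto)
    if err := uconn.Handshake(); err != nil {
        log.Fatal(err)
    }
}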

Browser parity is far easier to achieve this way.

Infrastructure Management

Sophisticated operations rely on orchestration infrastructure to manage many varied configurations at scale:

Browser Automation – Puppeteer, Playwright
Residential Proxies – Bright Data, Smartproxy, Proxy-Seller, and Soax
Fingerprint Analysis Tools
Dynamic Job Assignment

Regularly cycling fingerprints across tens of thousands of proxies minimizes how often a site sees the same fingerprint and IP combination.
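
The cycling idea can be sketched as a simple round-robin assignment that shifts by one position each rotation period. Everything here is hypothetical: the profile labels, the proxy addresses, and the notion of an integer "epoch" counter are placeholders for whatever the orchestration layer actually tracks.

```python
def profile_for(proxy_index: int, epoch: int, profiles: list) -> str:
    """Pick the profile for a proxy, shifting by one position each epoch."""
    return profiles[(proxy_index + epoch) % len(profiles)]


def assignments(proxies: list, profiles: list, epoch: int) -> dict:
    """Map every proxy to exactly one profile for the given epoch."""
    return {
        proxy: profile_for(index, epoch, profiles)
        for index, proxy in enumerate(proxies)
    }


if __name__ == "__main__":
    proxies = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]  # placeholders
    profiles = ["chrome", "firefox", "safari"]  # hypothetical profile labels
    for epoch in range(2):
        print(epoch, assignments(proxies, profiles, epoch))
```

Keeping each proxy on a single profile within an epoch matters: one IP presenting several different browser fingerprints in a short window is itself an anomaly.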

Specialized Evasion Services

Vendors such as ScrapingBee offer managed evasion, including:

Baselining site fingerprints
Custom browser pool configuration
Residential proxy sources
Automated cycling routines

Offloading setup complexity while experts handle updates is compelling for many.

Consequences & Remediation of Blocks

Despite advanced precautions, blocks still frequently happen, requiring various remediation techniques depending on severity.

Bot Mitigation Flows

Upon fingerprint-driven classification, additional protections deploy:

Browser validation JS challenges
Varying gateway access tokens
Rate limiting – from invisible through full blocks
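
On the scraper side, the usual response to rate limiting is to slow down rather than retry immediately. Below is a minimal sketch of an exponential backoff schedule with optional jitter; the function name and default values are illustrative and should be tuned to the target site.

```python
import random


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0, jitter=0.0, rng=random):
    """Return the wait time in seconds before each retry attempt.

    Each delay grows by `factor`, is capped at `cap`, and then gets up to
    `jitter` (a fraction of the delay) of random extra time added on top.
    """
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays
```

With jitter left at 0, backoff_delays(4) returns [1.0, 2.0, 4.0, 8.0]. A non-zero jitter spreads retries out so that many workers hitting the same limit do not all come back at the same moment.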

IP Block Remediation

Residential proxy IPs, often entire Class C ranges, regularly get burned. The usual outcomes are:

Temporary cool-down period
Permanent IP ban, requiring a paid pool refresh
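
Both outcomes can be tracked with a small amount of state. The sketch below is a hypothetical in-memory pool, not any vendor's API: a proxy in cool-down is skipped until its timer expires, and a banned proxy is removed for good. Timestamps are passed in explicitly to keep the logic easy to test; in practice time.time() would supply them.

```python
class ProxyPool:
    """Track which proxies are usable, cooling down, or permanently banned."""

    def __init__(self, proxies):
        # Maps each proxy to the timestamp at which it becomes usable again.
        self._usable_from = {proxy: 0.0 for proxy in proxies}

    def cool_down(self, proxy, seconds, now):
        """Rest a proxy for `seconds` after a temporary block."""
        if proxy in self._usable_from:
            self._usable_from[proxy] = now + seconds

    def ban(self, proxy):
        """Drop a permanently banned proxy from the pool."""
        self._usable_from.pop(proxy, None)

    def available(self, now):
        """Return the proxies that can be used at time `now`."""
        return [p for p, t in self._usable_from.items() if t <= now]

    def needs_refresh(self, minimum):
        """True once bans have shrunk the pool below `minimum` proxies."""
        return len(self._usable_from) < minimum
```

The needs_refresh check is the point where the cost shows up: once enough addresses are permanently banned, the only fix is buying or rotating in a fresh set.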

Legal Options

International scraping projects lean on lawyers to contest blocks through cease-and-desist letters. This forces sites to evaluate legal risks.

Shared Platform Risks

Scrapers hosted on centralized clouds like AWS run a higher risk of neighbor blocks: when another tenant's traffic gets a shared IP range blocked, your scraper suffers collateral damage by association. Keeping scraping infrastructure isolated is recommended where feasible.