How to Scrape Algolia Search?

Algolia is a popular search API service used by many websites to implement fast and relevant search functionalities with minimal backend effort. In this comprehensive guide, we'll learn how Algolia search works and how to build a Python scraper for any Algolia-powered website using BrightData residential proxies to avoid getting blocked.

Overview of Algolia Search API

Algolia provides a hosted search engine API that websites can integrate to quickly add fast and relevant searches to their apps and websites. With Algolia, sites don't need to build and maintain their own search infrastructure – they simply send their data to Algolia, which handles indexing and query serving.

Some advantages of using Algolia include:

Blazing fast search built for relevance and speed
Typo-tolerance, synonyms and advanced search features
Real-time updates to keep search results accurate
APIs for easy integration into any tech stack
Hosted infrastructure requiring minimal maintenance

Many popular sites across different industries use Algolia to power their search, including GitHub, Stripe, Medium, Twitch, and Zendesk.

Understanding How Algolia Search Works

When a user performs a search on a site using Algolia, here is what happens behind the scenes:

The site sends the search query text to Algolia's API endpoint. This includes an application ID, API key, and any search parameters.
Algolia looks up the indexed search corpus for the site and returns the matching results.
The website displays the returned results to the user.

Algolia provides client libraries like Algolia for JavaScript that handle the search requests and integration. But as scrapers, we are interested in understanding the underlying API requests to reverse engineer the search functionality.

Let's look at a real example using alternativeto.net – a software directory site that uses Algolia for search.

When searching on their homepage, we see a POST request made to an Algolia URL containing the secret API keys and our search query. With this insight, we can now replicate search requests directly in Python without needing to use their website UI.

Scraping Algolia Search in Python

To scrape Algolia, we need 3 key pieces of information:

The Application ID
The API Search Key
The Index Name

The App ID and API key allow us to authenticate with Algolia's API. The index name indicates which search corpus we want to query. For any Algolia-powered site, we first need to find these credentials. They are usually exposed somewhere in the JavaScript code of the website. We'll cover techniques for discovering these keys later.

For now, let's use the keys we found for alternativeto.net:

ALGOLIA_APP_ID = "ABC123APPID" 
ALGOLIA_API_KEY = "acbd1234api456key"
ALGOLIA_INDEX_NAME = "software_index"

Next, we'll define the search API endpoint URL along with the authentication headers:

import httpx

SEARCH_URL = "https://ABC123APPID.algolia.net/1/indexes/" + ALGOLIA_INDEX_NAME + "/query"

headers = {
    "X-Algolia-Application-Id": ALGOLIA_APP_ID,
    "X-Algolia-API-Key": ALGOLIA_API_KEY
}

Now we can make search queries by sending POST requests:

search_params = {
  "query": "Chrome",
  "hitsPerPage": 5
}

search = httpx.post(SEARCH_URL, json=search_params, headers=headers).json()

print(search['hits'])

This will print out the first 5 matching results for our query! We can add parameters like filters, facets, and pagination to customize the search query further. For example:

search_params = {
  "query": "Chrome",
  "hitsPerPage": 20,
  "filters": "os:windows"  
}

The full list of search parameters is available in the Algolia docs.

Scraping Multiple Pages of Algolia Results

To scrape beyond the first page of results, we need to handle pagination. Algolia returns pagination metadata in the response like page and nbPages which we can use to request subsequent pages.

Here is an example to scrape all pages of results:

from typing import List

def scrape_algolia_search(query: str, index: str, app_id: str, api_key: str) -> List[dict]:
  
  url = f"https://algolia.net/1/indexes/{index}/query"
  
  # Fetch first page
  results = []
  res = httpx.post(url, json={"query": query, "hitsPerPage": 100}, 
                   headers={"X-Algolia-Application-Id": app_id,
                            "X-Algolia-API-Key": api_key}).json()

  results.extend(res["hits"])
  max_pages = res["nbPages"]

  # Fetch remaining pages            
  for page in range(2, max_pages+1):
    res = httpx.post(url, json={"query": query, 
                                "hitsPerPage": 100,
                                "page": page},  
                    headers=headers).json()
    
    results.extend(res["hits"])
  
  return results 
  
results = scrape_algolia_search("Chrome", "software_index", ALGOLIA_APP_ID, ALGOLIA_API_KEY)

This paginated scraper allows us to extract all matching results across all pages for any query term.

Discovering Algolia Credentials from JavaScript

In the above examples, we hardcoded the Algolia credentials, which were discovered through manual analysis. Let's look at how we can programmatically extract these keys from a site's client-side JavaScript code.

The Algolia credentials are usually exposed in JavaScript somewhere, as they need to be available on the front end to make API calls. We can inspect the HTML, JS files, and scripts on the page to try and extract the keys. Some common places to look:

Script tags of the main HTML
External JS files loaded like app.js or algolia.js
Input fields are hidden in the HTML

For example, on alternativeto.net, the Algolia keys are exposed in cleartext in the main app.min.js file:

// app.min.js

var algolia_app_id = "ABC123APPID"; 
var algolia_search_api_key = "acbd1234api456key";
var algolia_index_name = "software_index";

So we can download and parse this JS file to find the credentials. Here is one approach:

import re
import httpx
from urllib.parse import urljoin

def find_algolia_keys(url: str) -> dict:

  response = httpx.get(url)
  
  # Inspect script tags
  for script in response.html.find("script"):
    if match := re.search(r'algolia.{0,30}?("|\')(\w{8,32})("|\')', script.text):
      print(f"Found Algolia key: {match.group(2)}")

  # Look in external scripts   
  for js_file in response.html.xpath("//script/@src"):
    js_url = urljoin(url, js_file)
    
    js_body = httpx.get(js_url).text
    
    if match := re.search(r'algolia.{0,30}?("|\')(\w{8,32})("|\')', js_body):
      print(f"Found Algolia key: {match.group(2)} in {js_url}") 

  # Look in input fields
  for input_tag in response.html.xpath("//input"):
    if input_tag.attrs.get("name") == "algolia_key":
       algolia_key = input_tag.attrs["value"]
       print(f"Found algolia key in input field: {algolia_key}")
  
  # Return all found keys  
  return {
    "app_id": ...,
    "api_key": ...,
    # ...
  }

This scans the main HTML, external scripts, and input fields, looking for signatures of Algolia keys. With some tuning based on the target site, we can reliably extract the credentials.

Avoiding Blocking with BrightData Proxies

A challenge with scraping sites like Algolia is that they actively block scrapers and bots. If you send too many requests from a single IP address, you will start seeing captchas, blocking, and rate limiting. To avoid this, we can route our scraper through Bright Data residential proxy servers. BrightData provides access to a large, rotating pool of residential IP addresses that mimic real users.

Here is how we would configure our Python scraper to use BrightData proxies:

from brightdata import BrightDataClient
from brightdata.types import Options

proxy = BrightDataClient(YOUR_BRIGHTDATA_KEY)

options = Options(country="US") 

with proxy.scraper(options) as scraper:

  for page in range(1, 10):
     search_params = {
       "query": "Chrome",  
       "page": page
     }
     
     headers = {
       "X-Algolia-Application-Id": ...,
       "X-Algolia-API-Key": ...
     }
     
     # Proxied request
     response = scraper.post(SEARCH_URL, json=search_params, headers=headers) 
     
     results = response.json()
     print(results)

By wrapping our scraper in a context manager, all requests will be routed through BrightData's pool of residential IPs. Some key advantages of BrightData proxies for avoiding blocks:

72M+ IPs – A very large pool of IPs across countries and ISPs to prevent easy blocking based on IP patterns.
Real browsers – Options like full Javascript rendering using real Chrome browsers to mimic authentic users.
Anti-blocking tools – Automatically handle and bypass anti-bot services.
Fast speeds – Proxies are optimized for speed and minimal latency overhead.
Easy integration – Simple APIs for all languages.

Using a proxy solution like BrightData allows building an Algolia scraper that can run 24/7 without blocks or interference.

Scraping Best Practices

Here are some best practices to follow when scraping Algolia or other commercial APIs:

Restrict usage: Retrieve only the data you need, avoid unnecessary queries, and limit request concurrency/volume to reasonable levels.
Use proxies: Route through residential proxies like BrightData to mimic real users and avoid blocks.
Cache data: Store any data you retrieve to avoid repeated queries. Most sites forbid reusing scraped data publicly/commercially, but caching for your own internal use is recommended.
Rotate keys: If you need to scrape an API continuously, rotate between different API keys to distribute the load.
Check Terms of Service: Understand any restrictions specified by the site's ToS. Scraping for non-competitive personal use cases is generally acceptable.
Make ethical use: Consider the impact of your scraping on the site and avoid disrupting their services.

Conclusion

Algolia provides a convenient hosted API for implementing site search functionality. By reverse engineering the front-end search requests, we can build scrapers that extract data through Algolia's search APIs in Python. To avoid blocks, routing through proxy services like BrightData is recommended. As always, when web scraping, be sure to follow ethical practices and honor reasonable terms of service.