Goat – the largest sneaker and apparel marketplace – has seen explosive growth, with over 30 million users and more than $2 billion in transactions in 2021 alone. With over 1.5 million products from 1,200+ brands, Goat offers unparalleled insight into fashion industry demand and trends.
In this comprehensive guide, we'll walk through how developers and data analysts can leverage Goat's open web data by building Python scrapers to extract large-scale product catalogs and search data.
The Value of Goat's Data Treasure Trove
Before we dive into code, it's worth discussing why fashion brands, e-commerce merchants, and data analysts scrape sites like Goat. What can we do with such a huge catalog of apparel data? A few high-impact use cases include:
- Competitive Benchmarking – by analyzing prices and product availability for specific brands, retailers can optimize pricing and assortment strategies. Goat's data helps uncover market opportunities.
- Demand Forecasting – identifying best-selling items on Goat signals rising trends. Fashion brands use this intel to inform product development and inventory planning. For example, if Goat's top sneakers are 90s basketball styles, brands can focus on retro products.
- Price Optimization – by scraping Goat's pricing spectrum, e-commerce sites can build data models correlating price, demand, and product characteristics to optimize pricing for maximum revenue. Enriching these models with data from other marketplaces makes them more robust.
- Ad Targeting – understanding the brands and products engaged users purchase allows precision ad targeting both on and off Goat's platform.
These use cases only scratch the surface of the insights waiting to be unlocked from Goat's catalog, which grows over 50,000 items per month. But how can developers access this data at scale? Enter web scraping.
Scraping GOAT Sites – Finding Hidden Data
If we look at the HTML source of a Goat product page, we won't find the structured product data in plain HTML tags. Instead, it's tucked away in JavaScript script tags. This is because Goat's frontend is built with Next.js, which hydrates pages on the client and ships each page's data as JSON inside a `<script id="__NEXT_DATA__">` tag, so the server HTML contains little more than skeleton markup and scripts.
However, we can easily extract the product JSON from these script tags using a parser like Parsel in Python:
```python
import json

from parsel import Selector

html = ...  # fetch the page HTML first (e.g. with httpx, shown below)
sel = Selector(html)
# the ::text pseudo-element grabs the script body rather than the full tag
json_data = json.loads(sel.css('script#__NEXT_DATA__::text').get())
print(json_data['props']['pageProps']['productTemplate'])
```
Compared to traditional tag-based scraping, this structure requires an extra hop to reach the underlying JSON, but the scraping logic remains straightforward. Now let's walk through expanding this into a complete Goat product scraper.
Scraping Individual Products
To scrape details for a single Goat product like https://www.goat.com/sneakers/air-jordan-1-retro-high-dark-mocha-555088-105, we just need to:
- Fetch the page HTML
- Parse the JSON data from the script tag
- Extract the product attributes we want
Here is a full Python implementation:
```python
import json

import httpx
from parsel import Selector

product_id = 'air-jordan-1-retro-high-dark-mocha-555088-105'
url = f'https://www.goat.com/sneakers/{product_id}'

html = httpx.get(url).text
sel = Selector(html)
json_data = json.loads(sel.css('script#__NEXT_DATA__::text').get())

product = json_data['props']['pageProps']['productTemplate']
product['offers'] = json_data['props']['pageProps']['offers']

print(product['title'])             # "Air Jordan 1 Retro High Dark Mocha"
print(product['retailPriceCents'])  # 190000
print(len(product['images']))       # 8
```
With just a few lines, we've extracted key attributes like the title, price, and images. This scraper can be wrapped in a function to process product IDs in bulk:
```python
# products.py
import json

import httpx
from parsel import Selector


async def scrape_product(client: httpx.AsyncClient, product_id: str) -> dict:
    url = f'https://www.goat.com/sneakers/{product_id}'
    response = await client.get(url)
    sel = Selector(response.text)
    data = json.loads(sel.css('script#__NEXT_DATA__::text').get())
    product = data['props']['pageProps']['productTemplate']
    return {
        'id': product_id,
        'title': product['title'],
        'price': product['retailPriceCents'],
        # etc.
    }
```

```python
# main.py
import asyncio

import httpx

from products import scrape_product

product_ids = ['air-jordan-1', 'yeezy-boost-350', 'adidas-stan-smith']


async def main():
    # share one client across all requests for connection pooling
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(scrape_product(client, product_id) for product_id in product_ids)
        )
    print(results)


asyncio.run(main())
```
This provides a blueprint for scaling up to scrape any number of product pages by distributing the requests concurrently. Next, let's discuss fetching products using Goat's search API.
Scraping Search Results at Scale
In addition to scraping individual pages, we also want to retrieve products matching certain search criteria. Goat's search implementation uses a JSON API. Analyzing the network requests, we can reverse engineer the API structure:
```
GET https://www.goat.com/api/search

Parameters:
- query: search term
- page: result page number
- limit: results per page
```
To scrape results, we'll:
- Call the API for the first page
- Parse the response to get total pages
- Generate URLs for each additional page
- Fetch all pages concurrently
- Combine the results
Here is a sample implementation:
```python
import asyncio
from urllib.parse import urlencode

import httpx

API_URL = 'https://www.goat.com/api/search'
params = {'query': 'air jordan', 'page': 1, 'limit': 100}

# fetch the first page to learn the total page count
data = httpx.get(f'{API_URL}?{urlencode(params)}').json()
pages = data['totalPages']
results = data['products']


async def fetch_page(client, page):
    # copy the params so concurrent calls don't mutate shared state
    page_params = {**params, 'page': page}
    resp = await client.get(f'{API_URL}?{urlencode(page_params)}')
    return resp.json()['products']


async def main():
    async with httpx.AsyncClient() as client:
        scrapers = [fetch_page(client, page) for page in range(2, pages + 1)]
        for result in asyncio.as_completed(scrapers):
            results.extend(await result)


asyncio.run(main())
print(f'Total Results: {len(results)}')
```
This scales seamlessly by spreading the page fetches across concurrent coroutines and aggregating the results. With the full power of asyncio, we can pull thousands of products matching any keywords from Goat with ease.
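One caveat: launching every page request at once can trip rate limits. Here's a minimal sketch of capping concurrency with a semaphore, reusing the fetch_page coroutine above (the limit of 10 is an arbitrary choice):

```python
import asyncio

# cap the number of in-flight requests so large keyword scrapes
# don't hammer the API all at once
semaphore = asyncio.Semaphore(10)


async def fetch_page_limited(client, page):
    async with semaphore:
        return await fetch_page(client, page)
```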
Dodging Blocks with Proxies
When scraping APIs at scale, we'll eventually hit blocks from anti-bot protections like Imperva or Cloudflare. To avoid these, we can make requests through residential proxies – IPs from real consumer devices instead of datacenters. Popular proxy services include Bright Data, Smartproxy, Soax, and Proxy-Seller.
Here's how routing requests through a residential proxy such as Soax might look in Python. The endpoint and credentials below are placeholders – substitute the values from your provider's dashboard:

```python
import httpx

# placeholder endpoint and credentials -- substitute your provider's values;
# note: recent httpx releases take `proxy=`, older ones use `proxies=`
proxy_url = 'http://<USERNAME>:<PASSWORD>@proxy.soax.com:9000'

headers = {'user-agent': 'Mozilla/5.0'}
with httpx.Client(proxy=proxy_url, headers=headers) as client:
    response = client.get('https://www.goat.com/api/search')
```
By pooling millions of IPs, we can mimic organic human traffic patterns and avoid tripping scraping alarms. Integrating this into our scrapers allows for massive scale.
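As a minimal sketch, per-request rotation can be as simple as picking a random proxy from a pool (the pool entries below are hypothetical; many providers also rotate IPs behind a single endpoint):

```python
import random

import httpx

# hypothetical pool of proxy endpoints from your provider
proxy_pool = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]


def rotating_get(url: str) -> httpx.Response:
    # use a different exit IP for each request
    with httpx.Client(proxy=random.choice(proxy_pool)) as client:
        return client.get(url, headers={'user-agent': 'Mozilla/5.0'})
```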
Building the Full Product Catalog
Scraping search is fantastic for gathering products matching keywords. But to construct Goat's complete catalog, we need to extract all 500,000+ product URLs. Rather than paginating search, we can source this list from their sitemap XML index:
```python
from urllib.parse import urlparse

import httpx
from parsel import Selector

# the sitemap index lists child sitemaps; each child lists page URLs in <loc> tags
index = Selector(httpx.get('https://www.goat.com/sitemap_index.xml').text, type='xml')

product_urls = []
for sitemap_url in index.xpath('//*[local-name()="loc"]/text()').getall():
    sitemap = Selector(httpx.get(sitemap_url).text, type='xml')
    for url in sitemap.xpath('//*[local-name()="loc"]/text()').getall():
        path = urlparse(url).path
        if path.startswith('/sneakers/') or path.startswith('/clothing/'):
            product_urls.append(url)

print(len(product_urls))  # 530942
```
We can then feed these URLs into our scrape_product function to build the full catalog. For incremental syncs, services like Diffbot can identify new products as they're added.
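Since scrape_product expects a product slug rather than a full URL, a small conversion step bridges the two – a sketch under that assumption (in practice you'd batch these and cap concurrency, as discussed below):

```python
import asyncio

import httpx

from products import scrape_product

# the slug is the last path segment of each sitemap URL
slugs = [url.rstrip('/').rsplit('/', 1)[-1] for url in product_urls]


async def scrape_catalog():
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(
            *(scrape_product(client, slug) for slug in slugs)
        )
```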
Crawling Best Practices
When running large scraping jobs, it's wise to take precautions (a sketch combining several of these follows the list):
- Add throttling – use time.sleep() (or asyncio.sleep() in async code, so the event loop isn't blocked) to add delays and avoid overloading servers
- Randomize patterns – shuffle page order and add jitter to spacing
- Use multiple keys – rotate through different accounts for APIs and services
- Monitor status codes – track rate limiting and blocks
- Retry failures – use exponential backoff and retry quotas
- Run during off-peak hours – reduce impact when traffic is lower
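Here's a minimal sketch combining jittered throttling, status-code monitoring, and exponential backoff (the delays and thresholds are arbitrary choices):

```python
import asyncio
import random

import httpx


async def polite_get(client: httpx.AsyncClient, url: str, retries: int = 3) -> httpx.Response:
    for attempt in range(retries):
        # jittered delay so request spacing doesn't look robotic
        await asyncio.sleep(random.uniform(1, 3))
        response = await client.get(url)
        if response.status_code == 200:
            return response
        if response.status_code in (403, 429):
            # likely blocked or rate limited: back off exponentially
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f'giving up on {url} after {retries} attempts')
```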
Being courteous ensures we get maximum data while maintaining site health.
Scraping Goat Listings with a Mobile Browser Profile
So far we've focused on scraping Goat's desktop web data, but another approach is rendering pages the way mobile users see them. Services like ScrapingBee provide headless Chrome browsers that can render JavaScript-heavy pages under a mobile device profile and return the resulting content.
Here's a sketch using their Python SDK – render_js and device are ScrapingBee request parameters, and the product slug is illustrative:

```python
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='<API_KEY>')

# render the product page in a headless browser with a mobile profile
response = client.get(
    'https://www.goat.com/sneakers/air-jordan-3-retro-white',
    params={'render_js': True, 'device': 'mobile'},
)
html = response.text  # parse the __NEXT_DATA__ JSON from here as before
```
This mobile rendering path provides an alternative data source when hitting limits on standard web scraping.
Automating Goat Data Pipelines with Airflow
For ongoing data needs, we want to orchestrate our Goat scrapers into scheduled pipelines. Apache Airflow, a Python-based workflow orchestrator, makes it easy to build and monitor data workflows. We can containerize our scraper code and run it from Airflow tasks:
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id='goat_scraper',
    schedule='0 12 * * *',  # daily at noon
    start_date=datetime(2023, 1, 1),
    catchup=False,
    max_active_runs=1,
) as dag:
    scrape_task = DockerOperator(
        task_id='scrape_goat',
        image='goat-scraper',  # our containerized scraper image
        command='python main.py',
    )
```
Chaining scraping, processing, and database tasks in Airflow lets us automate the entire pipeline!
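For example, continuing the DAG above, a downstream processing task (the image and command here are hypothetical) can be chained with Airflow's bitshift syntax:

```python
    # inside the same `with DAG(...)` block as above
    process_task = DockerOperator(
        task_id='process_results',
        image='goat-processor',       # hypothetical processing image
        command='python process.py',  # hypothetical entrypoint
    )

    # run the scrape first, then the processing step
    scrape_task >> process_task
```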
Conclusion and Next Steps
Goat's platform provides invaluable product and market insights. Unlocking them with web scraping delivers powerful competitive intelligence for your business. In this guide, we walked through a variety of techniques for scraping data from Goat at scale using Python.
I'm happy to discuss or provide hands-on help implementing your Goat scraping solution. Feel free to reach out!