Goat – the largest sneaker and apparel marketplace – has seen explosive growth, with over 30 million users and more than $2 billion in transactions in 2021 alone. With over 1.5 million products from 1,200+ brands, Goat offers unparalleled insight into fashion industry demand and trends.
In this comprehensive guide, we'll walk through how developers and data analysts can leverage Goat's open web data by building Python scrapers to extract large-scale product catalogs and search data.
The Value of Goat's Data Treasure Trove
Before we dive into code, it's worth discussing why fashion brands, e-commerce merchants, and data analysts scrape sites like Goat. What can we do with such a huge catalog of apparel data? A few high-impact use cases include:
- Competitive Benchmarking – by analyzing prices and product availability for specific brands, retailers can optimize pricing and assortment strategies. Goat's data helps uncover market opportunities.
- Demand Forecasting – identifying best-selling items on Goat signals rising trends. Fashion brands use this intel to inform product development and inventory planning. For example, if Goat's top sneakers are 90s basketball styles, brands can focus on retro products.
- Price Optimization – by scraping Goat's pricing spectrum, e-commerce sites can build data models correlating price, demand, and product characteristics to optimize pricing for maximum revenue. Enriching these models with data from other marketplaces makes them more robust.
- Ad Targeting – understanding the brands and products engaged users purchase allows precision ad targeting both on and off Goat's platform.
These use cases only scratch the surface of the insights waiting to be unlocked from Goat's catalog, which grows over 50,000 items per month. But how can developers access this data at scale? Enter web scraping.
Scraping GOAT Sites – Finding Hidden Data
If we look at the HTML source of a Goat product page, we won't find the structured product data in plain HTML tags. Instead, it's tucked away in JavaScript script tags. This is because Goat's frontend is built with Next.js, which hydrates pages on the client and ships each page's data as JSON inside a `<script id="__NEXT_DATA__">` tag, so the server HTML contains little more than skeleton markup and scripts.
However, we can easily extract the product JSON from these script tags using a parser like Parsel in Python:
```python
import json

from parsel import Selector

html = ...  # fetch the page HTML first (e.g. with httpx, shown below)
sel = Selector(html)
# the ::text pseudo-element grabs the script body rather than the full tag
json_data = json.loads(sel.css('script#__NEXT_DATA__::text').get())
print(json_data['props']['pageProps']['productTemplate'])
```
Compared to traditional tag-based scraping, this structure requires an extra hop to reach the underlying JSON, but the scraping logic remains straightforward. Now let's walk through expanding this into a complete Goat product scraper.
Scraping Individual Products
To scrape details for a single Goat product like https://www.goat.com/sneakers/air-jordan-1-retro-high-dark-mocha-555088-105, we just need to:
- Fetch the page HTML
- Parse the JSON data from the script tag
- Extract the product attributes we want
Here is a full Python implementation:
```python
import json

import httpx
from parsel import Selector

product_id = 'air-jordan-1-retro-high-dark-mocha-555088-105'
url = f'https://www.goat.com/sneakers/{product_id}'

html = httpx.get(url).text
sel = Selector(html)
json_data = json.loads(sel.css('script#__NEXT_DATA__::text').get())

product = json_data['props']['pageProps']['productTemplate']
product['offers'] = json_data['props']['pageProps']['offers']

print(product['title'])             # "Air Jordan 1 Retro High Dark Mocha"
print(product['retailPriceCents'])  # 190000
print(len(product['images']))       # 8
```
With just a few lines, we've extracted key attributes like the title, price, and images. This scraper can be wrapped in a function to process product IDs in bulk:
```python
# products.py
import json

import httpx
from parsel import Selector


async def scrape_product(client: httpx.AsyncClient, product_id: str) -> dict:
    url = f'https://www.goat.com/sneakers/{product_id}'
    response = await client.get(url)
    sel = Selector(response.text)
    data = json.loads(sel.css('script#__NEXT_DATA__::text').get())
    product = data['props']['pageProps']['productTemplate']
    return {
        'id': product_id,
        'title': product['title'],
        'price': product['retailPriceCents'],
        # etc.
    }
```

```python
# main.py
import asyncio

import httpx

from products import scrape_product

product_ids = ['air-jordan-1', 'yeezy-boost-350', 'adidas-stan-smith']


async def main():
    # share one client across all requests for connection pooling
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(scrape_product(client, product_id) for product_id in product_ids)
        )
    print(results)


asyncio.run(main())
```
This provides a blueprint for scaling up to scrape any number of product pages by distributing the requests concurrently. Next, let's discuss fetching products using Goat's search API.
Scraping Search Results at Scale
In addition to scraping individual pages, we also want to retrieve products matching certain search criteria. Goat's search implementation uses a JSON API. Analyzing the network requests, we can reverse engineer the API structure:
```
GET https://www.goat.com/api/search

Parameters:
- query: search term
- page: result page number
- limit: results per page
```
To scrape results, we'll:
- Call the API for the first page
- Parse the response to get total pages
- Generate URLs for each additional page
- Fetch all pages concurrently
- Combine the results
Here is a sample implementation:
```python
import asyncio
from urllib.parse import urlencode

import httpx

API_URL = 'https://www.goat.com/api/search'
params = {'query': 'air jordan', 'page': 1, 'limit': 100}

# fetch the first page to learn the total page count
data = httpx.get(f'{API_URL}?{urlencode(params)}').json()
pages = data['totalPages']
results = data['products']


async def fetch_page(client, page):
    # copy the params so concurrent calls don't mutate shared state
    page_params = {**params, 'page': page}
    resp = await client.get(f'{API_URL}?{urlencode(page_params)}')
    return resp.json()['products']


async def main():
    async with httpx.AsyncClient() as client:
        scrapers = [fetch_page(client, page) for page in range(2, pages + 1)]
        for result in asyncio.as_completed(scrapers):
            results.extend(await result)


asyncio.run(main())
print(f'Total Results: {len(results)}')
```
This scales seamlessly by spreading the page fetches across concurrent coroutines and aggregating the results. With the full power of asyncio, we can pull thousands of products matching any keywords from Goat with ease.
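One caveat: launching every page request at once can trip rate limits. Here's a minimal sketch of capping concurrency with a semaphore, reusing the fetch_page coroutine above (the limit of 10 is an arbitrary choice):

```python
import asyncio

# cap the number of in-flight requests so large keyword scrapes
# don't hammer the API all at once
semaphore = asyncio.Semaphore(10)


async def fetch_page_limited(client, page):
    async with semaphore:
        return await fetch_page(client, page)
```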
Dodging Blocks with Proxies
When scraping APIs at scale, we'll eventually hit blocks from anti-bot protections like Imperva or Cloudflare. To avoid these, we can make requests through residential proxies – IPs from real consumer devices instead of datacenters. Popular proxy services include Bright Data, Smartproxy, Soax, and Proxy-Seller.
Here's how routing requests through a residential proxy such as Soax might look in Python. The endpoint and credentials below are placeholders – substitute the values from your provider's dashboard:

```python
import httpx

# placeholder endpoint and credentials -- substitute your provider's values;
# note: recent httpx releases take `proxy=`, older ones use `proxies=`
proxy_url = 'http://<USERNAME>:<PASSWORD>@proxy.soax.com:9000'

headers = {'user-agent': 'Mozilla/5.0'}
with httpx.Client(proxy=proxy_url, headers=headers) as client:
    response = client.get('https://www.goat.com/api/search')
```
By pooling millions of IPs, we can mimic organic human traffic patterns and avoid tripping scraping alarms. Integrating this into our scrapers allows for massive scale.
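As a minimal sketch, per-request rotation can be as simple as picking a random proxy from a pool (the pool entries below are hypothetical; many providers also rotate IPs behind a single endpoint):

```python
import random

import httpx

# hypothetical pool of proxy endpoints from your provider
proxy_pool = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]


def rotating_get(url: str) -> httpx.Response:
    # use a different exit IP for each request
    with httpx.Client(proxy=random.choice(proxy_pool)) as client:
        return client.get(url, headers={'user-agent': 'Mozilla/5.0'})
```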
Building the Full Product Catalog
Scraping search is fantastic for gathering products matching keywords. But to construct Goat's complete catalog, we need to extract all 500,000+ product URLs. Rather than paginating search, we can source this list from their sitemap XML index:
```python
from urllib.parse import urlparse

import httpx
from parsel import Selector

# the sitemap index lists child sitemaps; each child lists page URLs in <loc> tags
index = Selector(httpx.get('https://www.goat.com/sitemap_index.xml').text, type='xml')

product_urls = []
for sitemap_url in index.xpath('//*[local-name()="loc"]/text()').getall():
    sitemap = Selector(httpx.get(sitemap_url).text, type='xml')
    for url in sitemap.xpath('//*[local-name()="loc"]/text()').getall():
        path = urlparse(url).path
        if path.startswith('/sneakers/') or path.startswith('/clothing/'):
            product_urls.append(url)

print(len(product_urls))  # 530942
```
We can then feed these URLs into our scrape_product function to build the full catalog. For incremental syncs, services like Diffbot can identify new products as they're added.
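Since scrape_product expects a product slug rather than a full URL, a small conversion step bridges the two – a sketch under that assumption (in practice you'd batch these and cap concurrency, as discussed below):

```python
import asyncio

import httpx

from products import scrape_product

# the slug is the last path segment of each sitemap URL
slugs = [url.rstrip('/').rsplit('/', 1)[-1] for url in product_urls]


async def scrape_catalog():
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(
            *(scrape_product(client, slug) for slug in slugs)
        )
```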
Crawling Best Practices
When running large scraping jobs, it's wise to take precautions (a sketch combining several of these follows the list):
- Add throttling – use time.sleep() (or asyncio.sleep() in async code, so the event loop isn't blocked) to add delays and avoid overloading servers
- Randomize patterns – shuffle page order and add jitter to spacing
- Use multiple keys – rotate through different accounts for APIs and services
- Monitor status codes – track rate limiting and blocks
- Retry failures – use exponential backoff and retry quotas
- Run during off-peak hours – reduce impact when traffic is lower
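Here's a minimal sketch combining jittered throttling, status-code monitoring, and exponential backoff (the delays and thresholds are arbitrary choices):

```python
import asyncio
import random

import httpx


async def polite_get(client: httpx.AsyncClient, url: str, retries: int = 3) -> httpx.Response:
    for attempt in range(retries):
        # jittered delay so request spacing doesn't look robotic
        await asyncio.sleep(random.uniform(1, 3))
        response = await client.get(url)
        if response.status_code == 200:
            return response
        if response.status_code in (403, 429):
            # likely blocked or rate limited: back off exponentially
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f'giving up on {url} after {retries} attempts')
```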
Being courteous ensures we get maximum data while maintaining site health.
Scraping Goat Listings with a Mobile Browser Profile
So far we've focused on scraping Goat's desktop web data, but another approach is rendering pages the way mobile users see them. Services like ScrapingBee provide headless Chrome browsers that can render JavaScript-heavy pages under a mobile device profile and return the resulting content.
Here's a sketch using their Python SDK – render_js and device are ScrapingBee request parameters, and the product slug is illustrative:

```python
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='<API_KEY>')

# render the product page in a headless browser with a mobile profile
response = client.get(
    'https://www.goat.com/sneakers/air-jordan-3-retro-white',
    params={'render_js': True, 'device': 'mobile'},
)
html = response.text  # parse the __NEXT_DATA__ JSON from here as before
```
This mobile rendering path provides an alternative data source when hitting limits on standard web scraping.
Automating Goat Data Pipelines with Airflow
For ongoing data needs, we want to orchestrate our Goat scrapers into scheduled pipelines. Apache Airflow, a Python-based workflow orchestrator, makes it easy to build and monitor data workflows. We can containerize our scraper code and run it from Airflow tasks:
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id='goat_scraper',
    schedule='0 12 * * *',  # daily at noon
    start_date=datetime(2023, 1, 1),
    catchup=False,
    max_active_runs=1,
) as dag:
    scrape_task = DockerOperator(
        task_id='scrape_goat',
        image='goat-scraper',  # our containerized scraper image
        command='python main.py',
    )
```
Chaining scraping, processing, and database tasks in Airflow lets us automate the entire pipeline!
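For example, continuing the DAG above, a downstream processing task (the image and command here are hypothetical) can be chained with Airflow's bitshift syntax:

```python
    # inside the same `with DAG(...)` block as above
    process_task = DockerOperator(
        task_id='process_results',
        image='goat-processor',       # hypothetical processing image
        command='python process.py',  # hypothetical entrypoint
    )

    # run the scrape first, then the processing step
    scrape_task >> process_task
```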
Conclusion and Next Steps
Goat's platform provides invaluable product and market insights. Unlocking them with web scraping delivers powerful competitive intelligence for your business. In this guide, we walked through a variety of techniques for scraping data from Goat at scale using Python.
I'm happy to discuss or provide hands-on help implementing your Goat scraping solution. Feel free to reach out!