How to Web Scrape Yelp.com?

Yelp has become the go-to website for discovering and researching local businesses. With over 200 million reviews spanning restaurants, salons, mechanics, and more, Yelp offers a treasure trove of consumer sentiment data.

For data analysts, scraping Yelp can unlock unique insights around customer satisfaction, common complaints, price sensitivity, and more for a business and entire industry sectors. Brands use web-scraped Yelp data to benchmark performance versus competitors and improve offerings.

However, scraping a site as popular as Yelp brings unique challenges:

Strict anti-scraping measures that throttle and block scrapers
Dynamic, obfuscated HTML and heavy JavaScript usage that needs to be reverse engineered
Scaling difficulties when extracting thousands of listings and reviews

In this comprehensive guide, you'll learn professional techniques to build a robust Yelp web scraper in Python and extract business listings as well as reviews while avoiding blocks.

Overview of Yelp's Structure

Yelp serves as an online yellow pages directory where users can search for businesses in a geographic area and read visitor commentary. For each business, the Yelp listing includes key information like:

Name
Address
Phone number
Website
Opening hours
Photos
Visitor ratings and reviews

Listings are organized by categories like restaurants, hotels, auto shops, etc. Yelp also provides curated editorial content highlighting exceptional local businesses. Behind the scenes, JavaScript rendering and calls to internal APIs power Yelp's search and listings. So scraping involves carefully inspecting network requests to reverse engineer parameters.

Reviews are loaded dynamically via AJAX as the user scrolls down or clicks on pagination links. Each review has metadata like:

Author name and info
Star rating given
Date of review
Text commentary

Now let's see how we can systematically scrape business listings as well as reviews from Yelp.

Setting Up Scraping Environment

For this tutorial, we will use Python since it has a vast ecosystem of scraping libraries and tools. The key packages we need are:

pip install httpx requests parsel beautifulsoup4

We'll use httpx and requests for sending HTTP requests to Yelp's servers, while parsel and beautifulsoup4 help with parsing and extracting data from HTML and API responses.

In addition, it is highly advisable to route scraping traffic through proxies so that repeated requests from a single IP address do not get blocked. We'll integrate proxies using Bright Data later in this guide.

Crafting Yelp Search Queries

The starting point is simulating searches on Yelp to discover matching businesses. Yelp search allows looking up listings by:

Keywords – the category or service to look for, such as “restaurants” or “plumbers”
Location – the area or city to focus the search on

The search request is made to this URL pattern:

https://www.yelp.com/search/snippet?find_desc=KEYWORDS&find_loc=LOCATION&start=0

We need to URL encode the keywords and location parameters. By passing the start parameter we can paginate through multiple pages of search results. Each request returns 10 listings at a time in the JSON response. So we'll need to monitor the total results returned to iterate through all pages.
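As a minimal sketch of the encoding and pagination arithmetic described above, the snippet below builds search URLs with the standard library and lists the start offsets needed to cover a result set. The helper names build_search_url and page_offsets are illustrative, not part of any Yelp API, and the page size of 10 is taken from the description above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.yelp.com/search/snippet"
PAGE_SIZE = 10  # listings per response, as stated in this guide


def build_search_url(keywords, location, start=0):
    """Return the search URL for one page of results."""
    # urlencode escapes spaces, commas and other reserved characters
    query = urlencode({"find_desc": keywords, "find_loc": location, "start": start})
    return f"{SEARCH_ENDPOINT}?{query}"


def page_offsets(total_results, page_size=PAGE_SIZE):
    """Return the start offsets needed to cover total_results listings."""
    return list(range(0, total_results, page_size))


print(build_search_url("movers", "San Diego, CA", start=10))
print(page_offsets(25))
```

Building the query string from a dictionary keeps each parameter in one place, which makes it harder to drop the "=" or repeat the start parameter when paginating.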

Let's create a helper method fetch_search_results() that accepts the search criteria and handles pagination:

import requests
from urllib.parse import quote_plus

KEYWORDS = "movers"
LOCATION = "San Diego, CA"

def fetch_search_results(keywords, location):

  # Encode search criteria; the start offset is appended per page
  base_url = f"https://www.yelp.com/search/snippet?find_desc={quote_plus(keywords)}&find_loc={quote_plus(location)}"

  # Fetch initial results
  response = requests.get(f"{base_url}&start=0")
  data = response.json()

  # Get total businesses found
  total = data["searchPageProps"]["mainContentComponentsListProps"][1]["props"]["resultCount"]

  # Store IDs
  business_ids = []

  # Paginate through all result pages
  for offset in range(0, total, 10):

    # Reuse the first response instead of requesting page one twice
    if offset == 0:
      page_data = data
    else:
      page_data = requests.get(f"{base_url}&start={offset}").json()

    # Extract IDs from each listing; the list also holds other components
    # (such as the one carrying resultCount above), so skip non-business entries
    for listing in page_data["searchPageProps"]["mainContentComponentsListProps"]:
      if "searchResultBusiness" in listing:
        business_ids.append(listing["searchResultBusiness"]["id"])

  return business_ids

This covers the initial step of harvesting business IDs matching a search query across all result pages.

Scraping Business Listing Data

Armed with IDs, we can now iterate through and scrape key details from each business page. The business profile pages have URLs like:

https://www.yelp.com/biz/rhythym-brewing-co-el-cajon

Here rhythym-brewing-co-el-cajon is the unique ID assigned for that business. Let's create another method to scrape data from a listing page:

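import requests
from bs4 import BeautifulSoup

def text_or_none(node):
  # select_one() returns None when a selector no longer matches
  return node.get_text(strip=True) if node else None

def scrape_business(business_id):

  # Build business page URL
  url = f"https://www.yelp.com/biz/{business_id}"

  # Fetch page and fail early on HTTP errors
  response = requests.get(url)
  response.raise_for_status()

  # Parse HTML
  soup = BeautifulSoup(response.content, "html.parser")

  # Class names like lemon--h1__373c0 are auto-generated and change over time;
  # re-inspect the page and update the selectors if a field comes back as None
  stars = soup.select_one("div[class*='i-stars__373c0']")

  data = {
    "id": business_id,
    "name": text_or_none(soup.select_one("h1[class^=lemon--h1__373c0]")),
    "address": text_or_none(soup.select_one("p[class^=lemon--p__373c0][itemprop='address']")),
    # :-soup-contains replaces the deprecated :contains pseudo-class
    "phone": text_or_none(soup.select_one("p[class^=lemon--p__373c0]:-soup-contains('Phone number') + p")),
    # The first token of the aria-label is the numeric rating
    "rating": float(stars["aria-label"].split(" ")[0]) if stars else None,
  }

  return data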

Here we locate key fields in the HTML using CSS selectors and extract the business name, address, phone number and star rating programmatically. To extract opening hours, which is nested tabular data, we can use a small helper function:

def parse_hours(soup):

    hours = {}
    
    for day in soup.select("tr[class*='lemon--tr__373c0']"):
        key = day.select_one(".day-of-the-week").text 
        value = day.select_one(".nowrap").text
        hours[key.strip()] = value  

    return hours

And integrate it:

data["timings"] = parse_hours(soup)

Run these methods in sequence for each ID:

# Search 
ids = fetch_search_results("movers", "San Diego")

# Listing scraper
all_data = [] 

for id in ids:
   business = scrape_business(id)  
   all_data.append(business)

print(all_data)

This produces the listing data for every matching business, ready for analysis.

Scraping Reviews

Now let's tackle harvesting the reviews customers leave on a business's Yelp profile. While basic info is in the HTML, the reviews themselves are loaded via calls to Yelp's internal API.

For a business like:

https://www.yelp.com/biz/underbelly-san-diego?osq=Restaurants

Its reviews API endpoint would be:

https://www.yelp.com/biz/UNDERBELLY_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=0

Where UNDERBELLY_ID is the unique identifier assigned for that listing, which we can find embedded in the HTML as:

<meta name="yelp-biz-id" content="UNDERBELLY_ID">
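To illustrate the lookup, here is a small sketch that pulls the ID out of the meta tag and builds the review feed URL from it. It uses only the standard library's html.parser so it runs without extra packages; the names BizIdParser, extract_biz_id and build_reviews_url are illustrative, and the sample HTML is a stand-in for a real page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class BizIdParser(HTMLParser):
    """Records the content of <meta name="yelp-biz-id"> if the page has one."""

    def __init__(self):
        super().__init__()
        self.biz_id = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.biz_id is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == "yelp-biz-id":
                self.biz_id = attr_map.get("content")


def extract_biz_id(html):
    """Return the business ID from the page HTML, or None if the tag is absent."""
    parser = BizIdParser()
    parser.feed(html)
    return parser.biz_id


def build_reviews_url(biz_id, start=0):
    """Build the review feed URL in the pattern shown above."""
    query = urlencode({"rl": "en", "q": "", "sort_by": "relevance_desc", "start": start})
    return f"https://www.yelp.com/biz/{biz_id}/review_feed?{query}"


sample_html = '<html><head><meta name="yelp-biz-id" content="UNDERBELLY_ID"></head></html>'
biz_id = extract_biz_id(sample_html)
print(build_reviews_url(biz_id))
```

Returning None when the tag is missing lets the caller decide whether to skip the business or treat the page as blocked, rather than failing with an exception.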

Let's create a scrape_reviews() method:

import requests
from bs4 import BeautifulSoup

PAGE_SIZE = 20  # reviews per page, as used in this guide

def scrape_reviews(url):

  # Fetch HTML
  response = requests.get(url)
  soup = BeautifulSoup(response.content, "html.parser")

  # Get business ID meta tag
  business_id = soup.find("meta", {"name": "yelp-biz-id"})["content"]

  # Build reviews API url; the start offset is appended per page
  api_url = f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc"

  # Fetch first page
  api_data = requests.get(f"{api_url}&start=0").json()

  # Get total review count
  total = api_data["pagination"]["totalResults"]

  print(f"Scraping {total} reviews...")

  reviews = api_data["reviews"] # List of reviews

  # Paginate through the remaining pages; the first page is already stored
  for offset in range(PAGE_SIZE, total, PAGE_SIZE):

    # Fetch page
    next_page_data = requests.get(f"{api_url}&start={offset}").json()

    # Extend reviews list
    reviews.extend(next_page_data["reviews"])

  return reviews

Key aspects covered:

Extract business ID from HTML meta tag
Construct API endpoint for reviews
Paginate through all review pages by manipulating offset
Build a complete list of reviews in order
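The offset handling listed above can be checked without touching the network. The sketch below assumes the response shape used in this guide (a "reviews" list plus "pagination" with "totalResults") and takes the page fetcher as a parameter; collect_reviews and fake_fetch_page are hypothetical names. It advances the offset by however many reviews each page actually returns, so it does not depend on a fixed page size.

```python
def collect_reviews(fetch_page):
    """Gather all reviews using fetch_page(offset), which returns one page as a dict."""
    first = fetch_page(0)
    total = first["pagination"]["totalResults"]
    reviews = list(first["reviews"])

    # The next offset is simply the number of reviews collected so far
    while len(reviews) < total:
        page = fetch_page(len(reviews))
        if not page["reviews"]:
            break  # stop if the feed ends earlier than the reported total
        reviews.extend(page["reviews"])

    return reviews


# Stand-in for the real HTTP call: 45 made-up reviews served 20 at a time
FAKE_REVIEWS = [{"text": f"review {i}"} for i in range(45)]


def fake_fetch_page(offset, page_size=20):
    return {
        "reviews": FAKE_REVIEWS[offset:offset + page_size],
        "pagination": {"totalResults": len(FAKE_REVIEWS)},
    }


all_reviews = collect_reviews(fake_fetch_page)
print(len(all_reviews))  # 45
```

In the real scraper, fetch_page would wrap the HTTP request to the review feed for a given start offset.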

Let's retrieve reviews for a restaurant:

url = "https://www.yelp.com/biz/underbelly-san-diego?osq=Restaurants"
reviews = scrape_reviews(url)

print(len(reviews)) # 152
print(reviews[0]["text"]) # Sample review text

And that's it – we can now harvest all user reviews for any given Yelp business URL!

Avoiding Blocks with Proxies

While our scraping logic works, sending thousands of requests from a single IP will likely get flagged by Yelp, leading to throttling or blocking. To maximize uptime, it is highly recommended to route requests through proxies. Proxies provide alternate IP addresses across different geographic locations and ISPs.
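To make the idea concrete, here is a provider-agnostic sketch of round-robin proxy rotation using only the standard library. The proxy URLs are placeholders and the ProxyRotator class is a hypothetical helper, not any vendor's API.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Cycles through a list of proxy URLs, using the next one for each request."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def fetch(self, url, timeout=10):
        # Route both HTTP and HTTPS traffic through the selected proxy
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        opener = urllib.request.build_opener(handler)
        with opener.open(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")


# Placeholder endpoints; substitute the ones issued by your proxy provider
rotator = ProxyRotator([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
])
print([rotator.next_proxy() for _ in range(3)])
```

With the requests library, the same effect comes from passing a proxies dictionary to requests.get(). A managed proxy service typically performs this rotation on its side, so the scraper only needs a single endpoint.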

We will use Bright Data's Python library to route the scraper's requests through its pool of rotating proxies.

First, install the package:

pip install brightdata

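Then swap out the requests module for Bright Data's proxy-enabled client (check the provider's current documentation for the exact package, class and method names, which may differ from those shown here):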

from brightdata.client import BrightDataClient
bd = BrightDataClient(YOUR_API_KEY)

response = bd.get(url)
html = response.text

That's it! Each request now goes through rotating proxies, with automatic retries when a request is blocked. Here is the listings scraper adapted to use Bright Data:

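import time
from brightdata.client import BrightDataClient

bd = BrightDataClient(YOUR_API_KEY)
MAX_RETRIES = 3

def fetch_search_results(keywords, location):

  # Search query...

  # Retry a bounded number of times instead of looping forever
  for attempt in range(MAX_RETRIES):

    try:
      response = bd.get(url)
      data = response.json()
      break

    except Exception as e:
      print(f"Error: {e}")
      time.sleep(2 ** attempt)  # wait 1s, 2s, 4s before the next attempt

  else:
    raise RuntimeError("Search request failed after all retries")

  # Remainder of method...

def scrape_business(id):

  # Fetch page through the proxy client
  response = bd.get(url)
  html = response.text

  # Remainder of method...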

By handling errors and retries, we can keep scraping reliably.
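The retry logic can also be factored into a reusable helper. The following is a minimal sketch of retrying with exponential backoff; with_retries is a hypothetical name, and the sleep function is injectable so the demo runs instantly.

```python
import time


def with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts calls have failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Demo with a stand-in for a flaky request: fails twice, then succeeds
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("blocked")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = with_retries(flaky_request, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the scraper, func would be a small callable that wraps one request, for example a lambda that fetches a URL and parses its JSON. Capping the attempts keeps a permanently blocked URL from stalling the whole run.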

Storing Scraped Data

As the scraper harvests reviews and business info, we need a database to accumulate and query the Yelp data. For storage, MySQL works well since we need to index fields like business name and location. Other relational databases such as PostgreSQL are equally good choices.

First, create tables to model the entities and relationships:

CREATE TABLE businesses (
  id VARCHAR(100) PRIMARY KEY,
  name VARCHAR(200),
  address VARCHAR(500),
  phone VARCHAR(20),
  rating FLOAT
);

CREATE TABLE reviews (
  id INT AUTO_INCREMENT PRIMARY KEY, 
  business_id VARCHAR(100),
  user_name VARCHAR(100),
  text TEXT,
  rating TINYINT,  
  FOREIGN KEY (business_id) REFERENCES businesses(id)
);

Then insert scraped data:

import mysql.connector

# Database connection
mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  password="password",
  database="yelp_scrape"
)

cursor = mydb.cursor()

# Persist business
cursor.execute('''INSERT INTO businesses 
  (id, name, address, phone, rating) 
  VALUES (%s, %s, %s, %s, %s)''', 
  (business["id"], business["name"], 
   business["address"], business["phone"],  
   business["rating"]))

# Persist reviews
for review in reviews:

  cursor.execute('''INSERT INTO reviews
    (business_id, user_name, text, rating)  
    VALUES (%s, %s, %s, %s)''',  
    (business["id"], review["user"]["name"],  
     review["text"], review["rating"])) 


mydb.commit()

Now the data is available for SQL analysis and reporting!

SELECT * FROM businesses; 

SELECT name, COUNT(*) AS review_count
FROM businesses b
JOIN reviews r ON b.id = r.business_id
GROUP BY b.id
ORDER BY review_count DESC;
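To try the schema and the review-count query without setting up a MySQL server, the sketch below swaps in SQLite's in-memory database from the standard library. The column types are adapted to SQLite, the placeholders are "?" instead of "%s", and the rows are made-up sample data rather than scraped results.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with the two tables and a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE businesses (
          id TEXT PRIMARY KEY, name TEXT, address TEXT, phone TEXT, rating REAL
        );
        CREATE TABLE reviews (
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          business_id TEXT REFERENCES businesses(id),
          user_name TEXT, text TEXT, rating INTEGER
        );
    """)
    conn.executemany("INSERT INTO businesses VALUES (?, ?, ?, ?, ?)", [
        ("sample-movers", "Sample Movers", "1 Example St", "555-0100", 4.5),
        ("example-cafe", "Example Cafe", "2 Example St", "555-0101", 4.0),
    ])
    conn.executemany(
        "INSERT INTO reviews (business_id, user_name, text, rating) VALUES (?, ?, ?, ?)",
        [
            ("sample-movers", "Ana", "Fast and careful.", 5),
            ("sample-movers", "Ben", "Arrived late.", 3),
            ("example-cafe", "Cy", "Good coffee.", 4),
        ],
    )
    conn.commit()
    return conn


def review_counts(conn):
    """Return (business name, review count) pairs, most reviewed first."""
    return conn.execute("""
        SELECT b.name, COUNT(*) AS review_count
        FROM businesses b
        JOIN reviews r ON b.id = r.business_id
        GROUP BY b.id
        ORDER BY review_count DESC
    """).fetchall()


conn = build_demo_db()
print(review_counts(conn))  # [('Sample Movers', 2), ('Example Cafe', 1)]
```

Parameterized queries like these let the database driver escape review text safely, which matters because scraped reviews routinely contain quotes and other special characters.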

Final Notes

And there we have it: a recipe for scraping business listings as well as reviews from Yelp while keeping the risk of blocks low. With a bit of refining, you should be able to scrape thousands of Yelp ratings and reviews reliably. The resulting business insights provide competitive intelligence that is otherwise hard to obtain.