Yelp has become the go-to website for discovering and researching local businesses. With over 200 million reviews spanning restaurants, salons, mechanics, and more, Yelp offers a treasure trove of consumer sentiment data.
For data analysts, scraping Yelp can surface insights into customer satisfaction, common complaints, and price sensitivity, both for a single business and across entire industry sectors. Brands use scraped Yelp data to benchmark their performance against competitors and improve their offerings.
However, scraping a site as popular as Yelp brings unique challenges:
- Strict anti-scraping measures that throttle and block scrapers
- Dynamic, obfuscated HTML and heavy JavaScript usage that need to be reverse engineered
- Scaling difficulties when extracting thousands of listings and reviews
In this comprehensive guide, you'll learn professional techniques to build a robust Yelp web scraper in Python and extract business listings as well as reviews while avoiding blocks.
Overview of Yelp's Structure
Yelp works like an online Yellow Pages where users can search for businesses in a geographic area and read customer reviews. For each business, the Yelp listing includes key information such as:
- Name
- Address
- Phone number
- Website
- Opening hours
- Photos
- Visitor ratings and reviews
Listings are organized by categories like restaurants, hotels, auto shops, etc. Yelp also provides curated editorial content highlighting exceptional local businesses. Behind the scenes, JavaScript rendering and calls to internal APIs power Yelp's search and listings. So scraping involves carefully inspecting network requests to reverse engineer parameters.
Reviews are loaded dynamically via AJAX as the user scrolls down or clicks on pagination links. Each review has metadata like:
- Author name and info
- Star rating given
- Date of review
- Text commentary
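To keep the fields above organized while scraping, it can help to model them explicitly. The following is a minimal sketch using dataclasses; the class and field names are illustrative choices, not Yelp's own schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Review:
    """One customer review (field names are illustrative)."""
    author: str
    rating: int   # star rating given, 1-5
    date: str     # kept as the raw string shown on the page
    text: str


@dataclass
class Business:
    """One listing, holding the fields described above."""
    name: str
    address: Optional[str] = None
    phone: Optional[str] = None
    website: Optional[str] = None
    hours: dict = field(default_factory=dict)     # day -> opening hours
    photos: list = field(default_factory=list)    # photo URLs
    rating: Optional[float] = None
    reviews: list = field(default_factory=list)   # list of Review objects

    def average_review_rating(self) -> Optional[float]:
        """Mean star rating over the collected reviews, or None if empty."""
        if not self.reviews:
            return None
        return sum(r.rating for r in self.reviews) / len(self.reviews)
```

Optional fields default to None or an empty container, so a partially scraped listing can still be stored and filled in later.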
Now let's see how we can systematically scrape business listings as well as reviews from Yelp.
Setting Up Scraping Environment
For this tutorial, we will use Python since it has a vast ecosystem of scraping libraries and tools. The key packages we need are:
pip install httpx requests parsel beautifulsoup4
httpx and requests both send HTTP requests to Yelp's servers, while parsel and beautifulsoup4 parse and extract data from the HTML and API responses. The examples below use requests and beautifulsoup4.
In addition, it is highly advisable to use proxies for scraping projects, since repeated requests from a single IP address tend to get blocked. We'll integrate proxies using Bright Data later in this guide.
Crafting Yelp Search Queries
The starting point is simulating searches on Yelp to discover matching businesses. Yelp search allows looking up listings by:
- Keywords – the category or service, such as “restaurants” or “plumbers”
- Location – the area or city to focus the search on
The search request is made to this URL pattern:
https://www.yelp.com/search/snippet?find_desc=KEYWORDS&find_loc=LOCATION&start=0
We need to URL-encode the keywords and location parameters. Passing the start parameter lets us paginate through the search results. Each request returns 10 listings in the JSON response, so we read the total result count from the first response and step the offset until every page has been fetched.
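As a small illustration of the encoding and offset arithmetic, here is a sketch that builds the search URL with the standard library. The function names build_search_url and page_offsets are hypothetical helpers, and the page size of 10 is taken from the description above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.yelp.com/search/snippet"
PAGE_SIZE = 10  # listings per response, as described above


def build_search_url(keywords: str, location: str, start: int = 0) -> str:
    """Return the search URL with all parameters URL-encoded."""
    query = urlencode({"find_desc": keywords, "find_loc": location, "start": start})
    return f"{SEARCH_ENDPOINT}?{query}"


def page_offsets(total: int, page_size: int = PAGE_SIZE) -> list:
    """Return the start offsets needed to cover `total` results."""
    return list(range(0, total, page_size))
```

For example, build_search_url("movers", "San Diego, CA", 10) encodes the space as "+" and the comma as "%2C", and page_offsets(35) yields the offsets 0, 10, 20, and 30.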
Let's create a helper method fetch_search_results() that accepts the search criteria and handles pagination:
import requests
from urllib.parse import quote_plus

KEYWORDS = "movers"
LOCATION = "San Diego, CA"
SEARCH_PAGE_SIZE = 10  # listings returned per request

def fetch_search_results(keywords, location):
    # Encode search criteria; the start offset is appended per page
    base_url = (
        "https://www.yelp.com/search/snippet"
        f"?find_desc={quote_plus(keywords)}&find_loc={quote_plus(location)}"
    )
    # Fetch initial results
    response = requests.get(f"{base_url}&start=0")
    data = response.json()
    # Get total businesses found
    total = data["searchPageProps"]["mainContentComponentsListProps"][1]["props"]["resultCount"]
    # Store IDs
    business_ids = []
    # Paginate through all result pages
    for offset in range(0, total, SEARCH_PAGE_SIZE):
        # Build paginated URL
        url = f"{base_url}&start={offset}"
        # Fetch page
        response = requests.get(url)
        page_data = response.json()
        # Extract IDs, skipping components that are not business results
        for listing in page_data["searchPageProps"]["mainContentComponentsListProps"]:
            if "searchResultBusiness" in listing:
                business_ids.append(listing["searchResultBusiness"]["id"])
    return business_ids

This covers the initial step of harvesting business IDs matching a search query across all result pages.
Scraping Business Listing Data
Armed with IDs, we can now iterate through and scrape key details from each business page. The business profile pages have URLs like:
https://www.yelp.com/biz/rhythym-brewing-co-el-cajon
Here rhythym-brewing-co-el-cajon is the unique ID assigned for that business. Let's create another method to scrape data from a listing page:
import requests
from bs4 import BeautifulSoup

def scrape_business(business_id):
    # Build business page URL
    url = f"https://www.yelp.com/biz/{business_id}"
    # Fetch page
    response = requests.get(url)
    # Parse HTML
    soup = BeautifulSoup(response.content, "html.parser")
    # The generated class names below may change when Yelp updates its front end
    data = {
        "id": business_id,
        "name": soup.select_one("h1[class^=lemon--h1__373c0]").text,
        "address": soup.select_one("p[class^=lemon--p__373c0][itemprop='address']").text,
        # :-soup-contains replaces the deprecated :contains selector
        "phone": soup.select_one("p[class^=lemon--p__373c0]:-soup-contains('Phone number') + p").text,
        # aria-label looks like "4.5 star rating"; keep the leading number
        "rating": float(soup.select_one("div[class*='i-stars__373c0']").attrs["aria-label"].split(" ")[0]),
    }
    return data

Here we locate key fields in the HTML using CSS selectors and extract the business name, address, phone number, and star rating. To extract opening hours, which are nested tabular data, we can use a small helper function:
def parse_hours(soup):
    hours = {}
    for day in soup.select("tr[class*='lemon--tr__373c0']"):
        key = day.select_one(".day-of-the-week").text
        value = day.select_one(".nowrap").text
        hours[key.strip()] = value
    return hours

And integrate it:
data["timings"] = parse_hours(soup)
Run these methods in sequence for each ID:
# Search
ids = fetch_search_results("movers", "San Diego")
# Listing scraper
all_data = []
for business_id in ids:
    business = scrape_business(business_id)
    all_data.append(business)
print(all_data)

This yields complete listing data, ready for analysis.
Scraping Reviews
Now let's tackle harvesting the reviews left by customers on a business's Yelp profile. While basic info is in the HTML, the actual reviews are loaded via calls to Yelp's internal API.
For a business like:
https://www.yelp.com/biz/underbelly-san-diego?osq=Restaurants
Its reviews API endpoint would be:
https://www.yelp.com/biz/UNDERBELLY_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=0
Where UNDERBELLY_ID is the unique identifier assigned for that listing, which we can find embedded in the HTML as:
<meta name="yelp-biz-id" content="UNDERBELLY_ID">
Let's create a scrape_reviews() method:
import requests
from bs4 import BeautifulSoup

REVIEWS_PAGE_SIZE = 20  # reviews per page assumed by this example

def scrape_reviews(url):
    # Fetch HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # Get business ID meta tag
    business_id = soup.find("meta", {"name": "yelp-biz-id"})["content"]
    # Build reviews API URL; the start offset is appended per page
    api_url = f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc"
    # Fetch first page
    api_data = requests.get(f"{api_url}&start=0").json()
    # Get total review count
    total = api_data["pagination"]["totalResults"]
    print(f"Scraping {total} reviews...")
    reviews = api_data["reviews"]  # List of reviews
    # Paginate through the remaining pages (the first is already stored)
    for offset in range(REVIEWS_PAGE_SIZE, total, REVIEWS_PAGE_SIZE):
        # Build paginated URL
        next_page = f"{api_url}&start={offset}"
        # Fetch page
        next_page_data = requests.get(next_page).json()
        # Extend reviews list
        reviews.extend(next_page_data["reviews"])
    return reviews

Key aspects covered:
- Extract business ID from HTML meta tag
- Construct API endpoint for reviews
- Paginate through all review pages by manipulating offset
- Build a complete list of reviews in order
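The pagination step can be isolated from the HTTP layer, which makes it easy to test without touching Yelp. Below is a minimal sketch; collect_all and fetch_page are hypothetical names, the response shape mirrors the "reviews" and "pagination" keys used above, and the page size of 20 is the same assumption the scraper makes.

```python
from typing import Callable


def collect_all(fetch_page: Callable[[int], dict], page_size: int = 20) -> list:
    """Collect the items from every page of an offset-paginated feed.

    fetch_page(offset) is assumed to return a dict shaped like
    {"reviews": [...], "pagination": {"totalResults": N}}.
    """
    first = fetch_page(0)
    total = first["pagination"]["totalResults"]
    items = list(first["reviews"])
    # Start at the second page, since the first is already stored
    for offset in range(page_size, total, page_size):
        page = fetch_page(offset)
        if not page["reviews"]:  # stop early if the feed runs dry
            break
        items.extend(page["reviews"])
    return items
```

Starting the loop at page_size rather than 0 avoids fetching the first page twice and duplicating its reviews.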
Let's retrieve reviews for a restaurant:
url = "https://www.yelp.com/biz/underbelly-san-diego?osq=Restaurants"
reviews = scrape_reviews(url)
print(len(reviews))  # 152
print(reviews[0]["text"])  # Sample review text
And that's it – we can now harvest all user reviews for any given Yelp business URL!
Avoiding Blocks with Proxies
While our scraping logic works, sending thousands of requests from a single IP will likely get flagged by Yelp, leading to throttling or blocking. To maximize uptime, it is highly recommended to route requests through proxies. Proxies provide alternate IP addresses across different geographic locations and ISPs.
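The rotation idea itself is independent of any provider. The sketch below cycles through a pool of proxy endpoints and yields mappings in the format the requests library accepts; the proxy URLs are placeholders and proxy_rotation is a hypothetical helper name.

```python
from itertools import cycle

# Placeholder proxy endpoints; replace with the ones from your provider
PROXY_URLS = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


def proxy_rotation(proxy_urls):
    """Yield requests-style proxy mappings, cycling through the pool."""
    for proxy_url in cycle(proxy_urls):
        # The same proxy handles both plain and TLS traffic
        yield {"http": proxy_url, "https": proxy_url}
```

Each mapping can be passed as the proxies argument of requests.get, so consecutive requests leave from different IP addresses.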
We will use Bright Data's Python library to route the scraper's requests through its large pool of rotating proxy IPs.
First, install the package:
pip install brightdata
Then swap out the requests module for Bright Data's proxy-enabled client:
from brightdata.client import BrightDataClient

bd = BrightDataClient(YOUR_API_KEY)
response = bd.get(url)
html = response.text
That's it! Each request will now use automated proxy rotation, with automatic retries when a request is blocked. Here is the full listings scraper wrapped to use Bright Data:
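from brightdata.client import BrightDataClient

bd = BrightDataClient(YOUR_API_KEY)
MAX_RETRIES = 3  # give up after this many failed attempts

def fetch_search_results(keywords, location):
    # Search query...
    for attempt in range(MAX_RETRIES):
        try:
            response = bd.get(url)
            data = response.json()
            break
        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
    else:
        # Runs only if no attempt succeeded, instead of looping forever
        raise RuntimeError("Search request failed after all retries")
    # Remainder of method...

def scrape_business(business_id):
    # Fetch page
    response = bd.get(url)
    # On failure, retry automatically
    html = response.text
    # Remainder of method...

By handling errors and retries, we can keep scraping reliably.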
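Retrying immediately after a block often fails again, so a common refinement is to wait longer between attempts. Here is a minimal sketch of retries with exponential backoff; with_retries is a hypothetical helper that wraps any fetch callable, and the default attempt count and delay are arbitrary.

```python
import time


def with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to your HTTP client's errors
            last_error = exc
            if attempt < max_attempts - 1:
                # Wait 1s, 2s, 4s, ... before the next attempt
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts") from last_error
```

The sleep parameter exists so that tests can substitute a no-op instead of actually waiting.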
Storing Scraped Data
As the scraper harvests reviews and business info, we need a database to accumulate and query the Yelp data. For storage, MySQL works well since we need to index fields like business name and location. Other relational databases such as PostgreSQL work equally well.
First, create tables to model the entities and relationships:
CREATE TABLE businesses (
    id VARCHAR(100) PRIMARY KEY,
    name VARCHAR(200),
    address VARCHAR(500),
    phone VARCHAR(20),
    rating FLOAT
);

CREATE TABLE reviews (
    id INT AUTO_INCREMENT PRIMARY KEY,
    business_id VARCHAR(100),
    user_name VARCHAR(100),
    text TEXT,
    rating TINYINT,
    FOREIGN KEY (business_id) REFERENCES businesses(id)
);
Then insert scraped data:
import mysql.connector

# Database connection
mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="password",
    database="yelp_scrape"
)
cursor = mydb.cursor()

# Persist business
cursor.execute('''INSERT INTO businesses
    (id, name, address, phone, rating)
    VALUES (%s, %s, %s, %s, %s)''',
    (business["id"], business["name"],
     business["address"], business["phone"],
     business["rating"]))

# Persist reviews
for review in reviews:
    cursor.execute('''INSERT INTO reviews
        (business_id, user_name, text, rating)
        VALUES (%s, %s, %s, %s)''',
        (business["id"], review["user"]["name"],
         review["text"], review["rating"]))

mydb.commit()

Now the data is available for SQL analysis and reporting!
SELECT * FROM businesses;

SELECT name, COUNT(*) AS review_count
FROM businesses b
JOIN reviews r ON b.id = r.business_id
GROUP BY b.id
ORDER BY review_count DESC;
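To try the reporting query without a MySQL server, the same idea can be run against an in-memory SQLite database from Python's standard library. This sketch swaps SQLite in for MySQL, uses a trimmed schema, and inserts made-up sample rows purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE businesses (id TEXT PRIMARY KEY, name TEXT, rating REAL);
CREATE TABLE reviews (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    business_id TEXT REFERENCES businesses(id),
    user_name TEXT,
    text TEXT,
    rating INTEGER
);
""")

# Made-up sample rows, not real Yelp data
conn.executemany(
    "INSERT INTO businesses VALUES (?, ?, ?)",
    [("a", "Alpha Movers", 4.5), ("b", "Beta Movers", 3.0)],
)
conn.executemany(
    "INSERT INTO reviews (business_id, user_name, text, rating) VALUES (?, ?, ?, ?)",
    [("a", "Ann", "Fast and careful", 5),
     ("a", "Bob", "Good value", 4),
     ("b", "Cy", "Arrived late", 3)],
)

# Review count and average rating per business, busiest first
rows = conn.execute("""
    SELECT b.name, COUNT(*) AS review_count, AVG(r.rating) AS avg_rating
    FROM businesses b
    JOIN reviews r ON b.id = r.business_id
    GROUP BY b.id
    ORDER BY review_count DESC
""").fetchall()
```

SQLite uses ? placeholders where the MySQL connector uses %s, but the join and aggregation are the same.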
Final Notes
And there we have it – a robust recipe for scraping business listings as well as reviews from Yelp without getting blocked. With a bit of refining, you should be able to scrape thousands of Yelp ratings and reviews reliably. The resulting data provides competitive intelligence that would otherwise be hard to obtain.