Yelp has become the go-to website for discovering and researching local businesses. With over 200 million reviews spanning restaurants, salons, mechanics, and more, Yelp offers a treasure trove of consumer sentiment data.
For data analysts, scraping Yelp can unlock unique insights into customer satisfaction, common complaints, and price sensitivity, both for individual businesses and for entire industry sectors. Brands use web-scraped Yelp data to benchmark performance against competitors and improve their offerings.
However, scraping a site as popular as Yelp brings unique challenges:
- Strict anti-scraping measures that throttle and block scrapers
- Dynamic, obfuscated HTML and heavy JavaScript usage that must be reverse engineered
- Scaling difficulties when extracting thousands of listings and reviews
In this comprehensive guide, you'll learn professional techniques to build a robust Yelp web scraper in Python and extract business listings as well as reviews while avoiding blocks.
Overview of Yelp's Structure
Yelp serves as an online yellow pages where users can search for businesses in a geographic area and read visitor commentary. For a business, the Yelp listing includes key information like:
- Name
- Address
- Phone number
- Website
- Opening hours
- Photos
- Visitor ratings and reviews
Listings are organized by categories like restaurants, hotels, auto shops, etc. Yelp also provides curated editorial content highlighting exceptional local businesses. Behind the scenes, JavaScript rendering and calls to internal APIs power Yelp's search and listings. So scraping involves carefully inspecting network requests to reverse engineer parameters.
Reviews are loaded dynamically via AJAX as the user scrolls down or clicks on pagination links. Each review carries metadata like the following (a rough sketch of the object shape appears after this list):
- Author name and info
- Star rating given
- Date of review
- Text commentary
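Based on the fields our scraper relies on later in this guide (user, text, rating), a single review object from the reviews API looks roughly like this sketch; the remaining key names are assumptions about the response shape, not confirmed field names:

# Rough shape of one review object from the reviews API.
# Only "user", "text" and "rating" are relied on later in this guide;
# the other key names here are assumptions, not confirmed field names.
sample_review = {
    "user": {"name": "Jane D."},        # author name and info
    "rating": 5,                        # star rating given
    "date": "1/15/2024",                # date of review (exact key may differ)
    "text": "Great tacos, long wait.",  # text commentary
}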
Now let's see how we can systematically scrape business listings as well as reviews from Yelp.
Setting Up Scraping Environment
For this tutorial, we will use Python since it has a vast ecosystem of scraping libraries and tools. The key packages we need are:
pip install httpx requests parsel beautifulsoup4
We'll use httpx and requests for sending HTTP requests to Yelp's servers, while parsel and beautifulsoup4 will help in parsing and extracting data from HTML and API responses.
In addition, it is highly advisable to use proxies in scraping projects to prevent blocks caused by repeated requests from a single IP address. We'll integrate proxies using Bright Data's API later in this guide.
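Since Yelp inspects request headers as well as IPs, it also helps to send browser-like headers on every request. Here is a minimal sketch using httpx; the header values are illustrative, not values Yelp requires:

import httpx

# Browser-like headers reduce the chance of an immediate block;
# the exact values below are illustrative, not required by Yelp.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

response = httpx.get("https://www.yelp.com", headers=HEADERS)
print(response.status_code)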
Crafting Yelp Search Queries
The starting point is simulating searches on Yelp to discover matching businesses. Yelp search allows looking up listings by:
- Keywords – The category or service, e.g. “restaurants”, “plumbers”
- Location – Area or city to focus the search on
The search request is made to this URL pattern:
https://www.yelp.com/search/snippet?find_desc=KEYWORDS&find_loc=LOCATION&start=0
We need to URL-encode the keywords and location parameters. By passing the start parameter we can paginate through multiple pages of search results. Each request returns 10 listings at a time in the JSON response, so we'll need to monitor the total results returned to iterate through all pages.
Let's create a helper method fetch_search_results() that accepts the search criteria and handles pagination:
import urllib.parse

import requests

KEYWORDS = "movers"
LOCATION = "San Diego, CA"

def fetch_search_results(keywords, location):
    # Encode search criteria into the snippet endpoint URL
    base_url = (
        "https://www.yelp.com/search/snippet"
        f"?find_desc={urllib.parse.quote_plus(keywords)}"
        f"&find_loc={urllib.parse.quote_plus(location)}"
    )

    # Fetch initial results
    response = requests.get(base_url + "&start=0")
    data = response.json()

    # Get total businesses found
    total = data["searchPageProps"]["mainContentComponentsListProps"][1]["props"]["resultCount"]

    # Store IDs
    business_ids = []

    # Paginate through all result pages, 10 listings per page
    for offset in range(0, total, 10):
        # Build paginated URL
        url = base_url + f"&start={offset}"

        # Fetch page
        response = requests.get(url)
        page_data = response.json()

        # Extract IDs from each listing, skipping non-listing components
        for listing in page_data["searchPageProps"]["mainContentComponentsListProps"]:
            if "searchResultBusiness" in listing:
                business_ids.append(listing["searchResultBusiness"]["id"])

    return business_ids
This covers the initial step of harvesting business IDs matching a search query across all result pages.
Scraping Business Listing Data
Armed with IDs, we can now iterate through and scrape key details from each business page. The business profile pages have URLs like:
https://www.yelp.com/biz/rhythym-brewing-co-el-cajon
Here rhythym-brewing-co-el-cajon is the unique ID assigned to that business. Let's create another method to scrape data from a listing page:
import requests
from bs4 import BeautifulSoup

def scrape_business(id):
    # Build business page URL
    url = f"https://www.yelp.com/biz/{id}"

    # Fetch page
    response = requests.get(url)

    # Parse HTML
    soup = BeautifulSoup(response.content, "html.parser")

    data = {
        "id": id,
        "name": soup.select_one("h1[class^=lemon--h1__373c0]").text,
        "address": soup.select_one("p[class^=lemon--p__373c0][itemprop='address']").text,
        "phone": soup.select_one("p[class^=lemon--p__373c0]:-soup-contains('Phone number') + p").text,
        "rating": float(soup.select_one("div[class*='i-stars__373c0']").attrs["aria-label"].split(" ")[0]),
    }

    return data
Here we locate key fields in the HTML using CSS selectors and extract the business name, address, phone number and star rating programmatically. To extract opening hours, which is nested tabular data, we can use a small helper function:
def parse_hours(soup):
    hours = {}
    for day in soup.select("tr[class*='lemon--tr__373c0']"):
        key = day.select_one(".day-of-the-week").text
        value = day.select_one(".nowrap").text
        hours[key.strip()] = value
    return hours
And integrate it:
data["timings"] = parse_hours(soup)
Run these methods in sequence for each ID:
# Search
ids = fetch_search_results("movers", "San Diego")

# Listing scraper
all_data = []
for id in ids:
    business = scrape_business(id)
    all_data.append(business)

print(all_data)
This extracts complete listing data ready for analysis!
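Before moving on, you may want to dump the collected listings to disk. Here is a minimal sketch that writes all_data to a CSV file using only the standard library:

import csv

# Write the scraped listings to a CSV file for later analysis.
# Assumes every dict in all_data shares the same keys; nested values
# like the "timings" dict are written as their string representation.
with open("businesses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=all_data[0].keys())
    writer.writeheader()
    writer.writerows(all_data)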
Scraping Reviews
Now let's tackle harvesting reviews left by customers on a business' Yelp profile. While basic info is in the HTML, the actual reviews are loaded via calls to Yelp's internal API.
For a business like:
https://www.yelp.com/biz/underbelly-san-diego?osq=Restaurants
Its reviews API endpoint would be:
https://www.yelp.com/biz/UNDERBELLY_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=0
Where UNDERBELLY_ID is the unique identifier assigned to that listing, which we can find embedded in the HTML as:
<meta name="yelp-biz-id" content="UNDERBELLY_ID">
Let's create a scrape_reviews() method:
import requests
from bs4 import BeautifulSoup

def scrape_reviews(url):
    # Fetch HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Get business ID meta tag
    business_id = soup.find("meta", {"name": "yelp-biz-id"})["content"]

    # Build reviews API URL, leaving the start parameter off for pagination
    api_url = f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc"

    # Fetch first page
    api_response = requests.get(api_url + "&start=0")
    api_data = api_response.json()

    # Get total review count
    total = api_data["pagination"]["totalResults"]
    print(f"Scraping {total} reviews...")

    reviews = api_data["reviews"]  # Reviews from the first page

    # Paginate through the remaining review pages, 20 reviews per page
    for offset in range(20, total, 20):
        # Build paginated URL
        next_page = api_url + f"&start={offset}"

        # Fetch page
        next_response = requests.get(next_page)
        next_page_data = next_response.json()

        # Extend reviews list
        reviews.extend(next_page_data["reviews"])

    return reviews
Key aspects covered:
- Extract business ID from HTML meta tag
- Construct API endpoint for reviews
- Paginate through all review pages by manipulating offset
- Build a complete list of reviews in order
Let's retrieve reviews for a restaurant:
url = "https://www.yelp.com/biz/underbelly-san-diego?osq=Restaurants"
reviews = scrape_reviews(url)

print(len(reviews))        # 152
print(reviews[0]["text"])  # Sample review text
And that's it – we can now harvest all user reviews for any given Yelp business URL!
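For downstream analysis it helps to flatten each raw review object into a simple row. A small sketch using only the fields this guide relies on elsewhere (user name, text, star rating); any other keys in the response are left out:

# Flatten raw review objects into simple rows, keeping only the fields
# used later in this guide (author name, star rating, review text).
rows = [
    {
        "user_name": review["user"]["name"],
        "rating": review["rating"],
        "text": review["text"],
    }
    for review in reviews
]
print(rows[0])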
Avoiding Blocks with Proxies
While our scraping logic works, sending thousands of requests from a single IP will likely get flagged by Yelp, leading to throttling or blocking. To maximize uptime, it is highly recommended to route requests through proxies. Proxies provide alternate IP addresses across different geographic locations and ISPs.
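For reference, routing a request through a proxy with plain requests only takes a proxies mapping; the proxy URL below is a placeholder, not a working endpoint:

import requests

# Placeholder proxy endpoint; substitute the credentials and host
# from your proxy provider.
PROXY = "http://username:password@proxy.example.com:8000"

proxies = {
    "http": PROXY,
    "https": PROXY,
}

# This request now exits via the proxy IP instead of your own.
response = requests.get("https://www.yelp.com", proxies=proxies)
print(response.status_code)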
We will use Bright Data's Python library to seamlessly route our scraper through its rotating pool of over 40M proxy IPs.
First, install the package:
pip install brightdata
Then swap out the requests module with Bright Data's proxy-enabled client:
from brightdata.client import BrightDataClient

bd = BrightDataClient(YOUR_API_KEY)

response = bd.get(url)
html = response.text
That's it! Each request will now use automated proxy rotation, with automatic retries when a request gets blocked. Here is the full listings scraper wrapped to use Bright Data:
from brightdata.client import BrightDataClient

bd = BrightDataClient(YOUR_API_KEY)

def fetch_search_results(keywords, location):
    # Search query...
    while True:
        try:
            response = bd.get(url)
            data = response.json()
            break
        except Exception as e:
            print(f"Error: {e}")

    # Remainder of method...

def scrape_business(id):
    # Fetch page
    response = bd.get(url)  # On failure, retry automatically
    html = response.text

    # Remainder of method...
By handling errors and retries, we can keep scraping reliably.
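If you prefer to manage retries yourself, a simple exponential-backoff wrapper around any fetch function works too. This is a generic sketch, not part of the Bright Data client:

import time

def fetch_with_retries(fetch, url, max_attempts=5):
    # Retry a fetch function with exponential backoff between attempts.
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as e:
            wait = 2 ** attempt  # 1s, 2s, 4s, 8s, ...
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"All {max_attempts} attempts failed for {url}")

# Usage: response = fetch_with_retries(requests.get, url)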
Storing Scraped Data
As the scraper harvests reviews and business info, we need a database to accumulate and query the Yelp data. MySQL works well for storage since we need to index fields like business name and location; alternatives like PostgreSQL are equally good.
First, create tables to model the entities and relationships:
CREATE TABLE businesses (
    id VARCHAR(100) PRIMARY KEY,
    name VARCHAR(200),
    address VARCHAR(500),
    phone VARCHAR(20),
    rating FLOAT
);

CREATE TABLE reviews (
    id INT AUTO_INCREMENT PRIMARY KEY,
    business_id VARCHAR(100),
    user_name VARCHAR(100),
    text TEXT,
    rating TINYINT,
    FOREIGN KEY (business_id) REFERENCES businesses(id)
);
Then insert scraped data:
import mysql.connector

# Database connection
mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="password",
    database="yelp_scrape"
)
cursor = mydb.cursor()

# Persist business
cursor.execute(
    '''INSERT INTO businesses (id, name, address, phone, rating)
       VALUES (%s, %s, %s, %s, %s)''',
    (business["id"], business["name"], business["address"],
     business["phone"], business["rating"])
)

# Persist reviews
for review in reviews:
    cursor.execute(
        '''INSERT INTO reviews (business_id, user_name, text, rating)
           VALUES (%s, %s, %s, %s)''',
        (business["id"], review["user"]["name"], review["text"], review["rating"])
    )

mydb.commit()
Now the data is available for SQL analysis and reporting!
SELECT * FROM businesses;

SELECT name, COUNT(*) AS review_count
FROM businesses b
JOIN reviews r ON b.id = r.business_id
GROUP BY b.id
ORDER BY review_count DESC;
Final Notes
And there we have it – a robust recipe for scraping business listings as well as reviews from Yelp without getting blocked. With a bit of refinement, you should be able to scrape thousands of Yelp ratings and reviews reliably. The business insights unlocked provide powerful competitive intelligence that is otherwise inaccessible!