YellowPages.com is one of the largest online business directories in the US, containing millions of listings and reviews for local businesses across the country. For data scientists, marketers, or business analysts, scraping YellowPages can provide a wealth of data for research, lead generation, and competitive analysis.
In this comprehensive guide, we'll walk through step-by-step how to build a robust web scraper to extract key business information and reviews from YellowPages using Python. Let's get started!
Why Scrape YellowPages?
Here are some of the key reasons you may want to scrape data from YellowPages:
- Market research – Analyze competitors, pricing, locations, categories
- Lead generation – Contact info for sales outreach
- Reputation monitoring – Track customer sentiment and ratings
- Data enrichment – Enhance business databases with additional info
- SEO monitoring – Track online presence of competitors
- Offline to online mapping – Connect offline business data to online profiles
This data can provide powerful insights for a wide range of applications in sales, marketing, and analytics.
Is it Legal to Scrape YellowPages.com?
An important question that comes up is whether scraping YellowPages is legal. The short answer is that scraping publicly accessible data is generally permissible, as long as you follow ethical scraping practices:
- Scrape publicly accessible data only
- Use the data for internal analysis or non-commercial purposes
- Scrape at a reasonable rate (e.g., a request every few seconds)
- Provide value and do not overburden the site
- Don't violate YellowPages's Terms of Service
That said, always consult an attorney to understand the legal risks for your specific use case and jurisdiction. Web scraping law is still evolving, and some commercial applications fall into gray areas; internal analytics and research are typically lower-risk.
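One practical first step is checking the site's robots.txt before crawling. Here's a minimal sketch using the standard library's `urllib.robotparser` (the user-agent string is a placeholder):

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://www.yellowpages.com/robots.txt")
robots.read()

# Check whether our user agent may fetch a given path before requesting it
print(robots.can_fetch("my-scraper", "https://www.yellowpages.com/search"))
```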
Tools You'll Need
To follow this guide and build a YellowPages scraper, you'll need:
- Python 3 – The code examples use Python 3.6+
- requests – For sending HTTP requests to crawl pages
- BeautifulSoup – Popular Python HTML parsing library
- proxies – To avoid blocks when scraping at scale
You can install the Python dependencies via pip:

```
pip install requests beautifulsoup4
```
Let's now dive into the scraper code!
Searching for Businesses on YellowPages
Our first goal is to search for and find businesses we want to scrape. YellowPages provides a search engine to look up businesses by category, name, or location. For example, to find Japanese restaurants in San Francisco, we could search on YellowPages.com.
The search results page contains business names, categories, locations, and other info – exactly what we want to extract. Let's see how to replicate the YellowPages search in Python. The search URL format is:
```
https://www.yellowpages.com/search?search_terms=[QUERY]&geo_location_terms=[LOCATION]
```
We can search by filling in the `search_terms` and `geo_location_terms` parameters. Here's a simple function that sends a search request:

```python
import requests

BASE_URL = "https://www.yellowpages.com/search"

def yellowpages_search(query, location):
    """Send a search request to YellowPages and return the response."""
    params = {
        "search_terms": query,
        "geo_location_terms": location,
    }
    return requests.get(BASE_URL, params=params)
```
To search for "Japanese Restaurants" in "San Francisco", we would call:

```python
response = yellowpages_search("Japanese Restaurants", "San Francisco")
```

This returns a `Response` object containing the HTML of the results page. Next, we need to parse the HTML to extract the businesses.
Extracting Search Results
To parse the search results, we can use Beautiful Soup, which we installed earlier. We parse the HTML like so:

```python
from bs4 import BeautifulSoup

html = response.text
soup = BeautifulSoup(html, "html.parser")
```
The search results are contained in `<div class="result">` tags. We can find all of them with:

```python
results = soup.find_all("div", class_="result")
```
Each result div contains the key details we want – name, categories, address, etc. We can loop through and extract them:

```python
businesses = []

for result in results:
    name = result.find("a", class_="business-name").text
    categories = [tag.text for tag in result.find_all("a", class_="category-link")]
    address = result.find("div", class_="street-address").text
    # And so on for other fields...

    businesses.append({
        "name": name,
        "categories": categories,
        "address": address,
    })

print(businesses)
```
This will extract a list of businesses with their key info! We can also paginate through multiple search pages to extract all the results.
Paginating Through Search Pages
By default, YellowPages displays 27 results per page. To get all matching businesses, we need to paginate through the additional pages. The page number is controlled by a `page` parameter in the URL:

```
https://www.yellowpages.com/search?page=2
```
To find the total number of pages, we can parse the result count from the HTML and divide by the page size, rounding up:

```python
import math

result_count = soup.find("span", class_="pagination-result-count").text
# Assumes the total appears as the fourth word of the count string;
# adjust the index if YellowPages changes the format
total_results = int(result_count.split(" ")[3])
pages = math.ceil(total_results / 27)  # 27 results per page
```
We can then loop from 1 to `pages` to fetch every page:

```python
all_businesses = []
query, location = "Japanese Restaurants", "San Francisco"

for page in range(1, pages + 1):
    params = {
        "search_terms": query,
        "geo_location_terms": location,
        "page": page,
    }
    response = requests.get(BASE_URL, params=params)

    # Parse the businesses out of this page of results
    businesses = parse_search_results(response)
    all_businesses.extend(businesses)
```
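The `parse_search_results` helper used above just wraps the parsing logic from the previous section. A minimal sketch (the function name is our own; the selectors mirror the earlier example):

```python
from bs4 import BeautifulSoup

def parse_search_results(response):
    """Extract business records from one page of search results."""
    soup = BeautifulSoup(response.text, "html.parser")
    businesses = []
    for result in soup.find_all("div", class_="result"):
        businesses.append({
            "name": result.find("a", class_="business-name").text,
            "categories": [tag.text for tag in result.find_all("a", class_="category-link")],
            "address": result.find("div", class_="street-address").text,
        })
    return businesses
```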
And that's it! With these concepts you can build a complete YellowPages search scraper to find matching businesses. Next, let's look at extracting the full business details.
Scraping Business Listing Pages
For each search result, YellowPages provides a link to the business's full listing page containing additional info like hours, services, photos and reviews. To extract this data, we need to:
- Get listing page URLs from search results
- Send requests to listing URLs
- Parse listing page HTML
Let's go through each step.
First, when parsing search results, we can also extract the listing URL. The `href` attribute is a relative path, so join it against the site root and store it alongside the other fields (e.g., under a `listing_url` key):

```python
from urllib.parse import urljoin

link = result.find("a", class_="business-name")["href"]  # relative path
listing_url = urljoin("https://www.yellowpages.com", link)
```
Then we can iterate through these URLs and send requests:
```python
# After searching
listing_urls = [result["listing_url"] for result in all_businesses]

for url in listing_urls:
    listing_page = requests.get(url)
```
Finally, we can use a similar parsing pattern to extract data fields:
```python
soup = BeautifulSoup(listing_page.content, "html.parser")

name = soup.find("h1", class_="business-name").text
phone = soup.find("p", class_="phone").text
address = soup.find("p", class_="address").text
# And so on...
```
Bringing it all together:
```python
businesses = []

# Search logic...

for url in listing_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    name = soup.find("h1", class_="business-name").text
    phone = soup.find("p", class_="phone").text

    business = {
        "name": name,
        "phone": phone,
    }
    businesses.append(business)
```
This will give you a complete dataset of YellowPages business info!
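Before moving on, it's worth persisting what we've scraped. Here's a minimal sketch writing the business dicts to CSV with the standard library (the field names assume the keys used above):

```python
import csv

with open("yellowpages_businesses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "phone"])
    writer.writeheader()
    writer.writerows(businesses)
```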
Scraping YellowPages Reviews
In addition to business info, YellowPages also contains detailed customer reviews – which can provide invaluable insights into reputation, sentiment, and more. Reviews are contained in `<article>` tags on each listing page. To extract them, we can add on to our existing parser:
```python
from bs4 import BeautifulSoup

# After fetching listing page
soup = BeautifulSoup(page.content, "html.parser")

reviews = []
for review in soup.find_all("article", class_="review"):
    title = review.find("div", class_="review-title").text
    body = review.find("p", class_="review-body").text
    rating = len(review.find_all("span", class_="rating-star"))

    reviews.append({
        "title": title,
        "body": body,
        "rating": rating,
    })

print(reviews)
```
This will extract each review's title, body text, and star rating (out of 5) for analysis.
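As a quick example of the kind of analysis this enables, here's a small sketch computing the average star rating and counting low-rated reviews from the `reviews` list built above:

```python
if reviews:
    avg_rating = sum(r["rating"] for r in reviews) / len(reviews)
    negative = [r for r in reviews if r["rating"] <= 2]

    print(f"Average rating: {avg_rating:.1f} stars")
    print(f"{len(negative)} reviews rated 2 stars or below")
```

Now let's look at managing this at scale.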
Scraping YellowPages at Scale
While our scraper works well for small datasets, scraping large volumes of listings and reviews from YellowPages requires a more robust strategy. Here are some tips for managing YellowPages scraping at scale:
- Use threading/async – Utilize parallelism by using threads or async code to send concurrent requests. This speeds up scraping versus fetching pages one at a time (see the sketch after this list).
- Random delays – Add random delays between requests and paginate slowly to mimic human browsing patterns. This helps avoid detection.
- Proxy rotation – Rotate different proxies for each request. Proxies like Smartproxy, Proxy-Seller, Bright Data, and Soax help mask scrapers and prevent IP blocks.
- Store incrementally – Instead of storing everything in memory, incrementally save scraped data to disk/database. This allows resuming if errors occur.
- Error handling – Robustly handle HTTP errors, blocks, captchas, and the like with retries and exception handling.
- Deploy on cloud servers – Run scrapers on cloud servers to leverage more bandwidth, IPs and processing power.
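Here's a minimal sketch combining several of these tips – a thread pool for concurrency, random delays, simple retries, and incremental newline-delimited JSON storage. The helper names are our own, and a production scraper would add proxy rotation and smarter backoff:

```python
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_with_retries(url, retries=3):
    """Fetch a URL, retrying on errors, with a random pre-request delay."""
    for attempt in range(retries):
        try:
            time.sleep(random.uniform(1, 3))  # mimic human browsing pace
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise

def scrape_listing(url):
    response = fetch_with_retries(url)
    # Parse with the listing-page logic from earlier; a stub record for now
    return {"url": url, "html_length": len(response.text)}

# A small thread pool caps concurrency; results are appended to disk as they
# arrive, so a crash doesn't lose everything already scraped
with ThreadPoolExecutor(max_workers=5) as pool, \
        open("listings.ndjson", "a", encoding="utf-8") as f:
    for record in pool.map(scrape_listing, listing_urls):
        f.write(json.dumps(record) + "\n")
```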
Let's look at integrating proxies more closely.
Avoiding Blocks with Proxies
One key technique for stable large-scale scraping is using proxies. A rotating proxy routes each request through a different IP address, preventing your own IP from being blocked. Some popular paid proxy services include:
- Smartproxy – Proxy manager API with real-time control
- Proxy-Seller – Flexible payment plans with affordable prices
- Bright Data – Reliable residential proxies with worldwide locations
- Soax – Millions of rotating proxies globally
Here's a sketch of routing requests through a rotating proxy gateway using `requests`. The gateway address and credentials below are placeholders – each provider (including Bright Data) documents its own endpoint and authentication scheme:

```python
import requests

# Placeholder gateway URL -- substitute your provider's rotating endpoint and credentials
PROXY = "http://USERNAME:PASSWORD@proxy.example.com:8000"
proxies = {"http": PROXY, "https": PROXY}

for url in listing_urls:
    page = requests.get(url, proxies=proxies, timeout=10)  # request routed through the proxy
    # Parse page...
```

With a rotating gateway, the provider assigns a fresh IP from its pool to each request, preventing IP blocks. For more advanced configuration, providers like Bright Data also offer proxy-manager tooling for granular control over IP groups, geo-targeting, and sessions.
YellowPages Scraping Recap
And that wraps up our guide on scraping YellowPages.com with Python! Here are some key takeaways:
- YellowPages provides a wealth of data for business intelligence and research
- Build scrapers to extract search results, business details, reviews, and more
- Carefully paginate through all search pages
- Manage large-scale scraping with threads, delays, proxies
- Integrate with paid proxy tools like Bright Data to avoid blocks
Hopefully, this provides a comprehensive blueprint for rolling your own YellowPages.com scraper! Please reach out if you have any other questions.