YellowPages.com is one of the largest online business directories in the US, containing millions of listings and reviews for local businesses across the country. For data scientists, marketers, or business analysts, scraping YellowPages can provide a wealth of data for research, lead generation, and competitive analysis.
In this comprehensive guide, we'll walk through step-by-step how to build a robust web scraper to extract key business information and reviews from YellowPages using Python. Let's get started!
Why Scrape YellowPages?
Here are some of the key reasons you may want to scrape data from YellowPages:
- Market research – Analyze competitors, pricing, locations, categories
- Lead generation – Contact info for sales outreach
- Reputation monitoring – Track customer sentiment and ratings
- Data enrichment – Enhance business databases with additional info
- SEO monitoring – Track online presence of competitors
- Offline to online mapping – Connect offline business data to online profiles
This data can provide powerful insights for a wide range of applications in sales, marketing, and analytics.
Is it Legal to Scrape YellowPages.com?
An important question that comes up is whether scraping YellowPages is legal. The short answer is that scraping publicly accessible data is generally permissible, as long as you follow ethical scraping practices:
- Scrape publicly accessible data only
- Use the data for internal analysis or non-commercial purposes
- Scrape at a reasonable rate (e.g., a request every few seconds)
- Provide value and do not overburden the site
- Don't violate YellowPages's Terms of Service
That said, always consult an attorney to understand the legal risks for your specific use case and jurisdiction. Web scraping law is still evolving, and some commercial applications fall into gray areas; internal analytics and research are typically lower-risk.
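One practical first step is checking the site's robots.txt before crawling. Here's a minimal sketch using the standard library's `urllib.robotparser` (the user-agent string is a placeholder):

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://www.yellowpages.com/robots.txt")
robots.read()

# Check whether our user agent may fetch a given path before requesting it
print(robots.can_fetch("my-scraper", "https://www.yellowpages.com/search"))
```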
Tools You'll Need
To follow this guide and build a YellowPages scraper, you'll need:
- Python 3 – The code examples use Python 3.6+
- requests – For sending HTTP requests to crawl pages
- BeautifulSoup – Popular Python HTML parsing library
- proxies – To avoid blocks when scraping at scale
You can install the Python dependencies via pip:

```
pip install requests beautifulsoup4
```
Let's now dive into the scraper code!
Searching for Businesses on YellowPages
Our first goal is to search for and find businesses we want to scrape. YellowPages provides a search engine to look up businesses by category, name, or location. For example, to find Japanese restaurants in San Francisco, we could search on YellowPages.com.
The search results page contains business names, categories, locations, and other info – exactly what we want to extract. Let's see how to replicate the YellowPages search in Python. The search URL format is:
```
https://www.yellowpages.com/search?search_terms=[QUERY]&geo_location_terms=[LOCATION]
```
We can search by filling in the `search_terms` and `geo_location_terms` parameters. Here's a simple function that sends a search request:

```python
import requests

BASE_URL = "https://www.yellowpages.com/search"

def yellowpages_search(query, location):
    """Send a search request to YellowPages and return the response."""
    params = {
        "search_terms": query,
        "geo_location_terms": location,
    }
    return requests.get(BASE_URL, params=params)
```
To search for "Japanese Restaurants" in "San Francisco", we would call:

```python
response = yellowpages_search("Japanese Restaurants", "San Francisco")
```

This returns a `Response` object containing the HTML of the results page. Next, we need to parse the HTML to extract the businesses.
Extracting Search Results
To parse the search results, we can use Beautiful Soup, which we installed earlier. We parse the HTML like so:

```python
from bs4 import BeautifulSoup

html = response.text
soup = BeautifulSoup(html, "html.parser")
```
The search results are contained in `<div class="result">` tags. We can find all of them with:

```python
results = soup.find_all("div", class_="result")
```
Each result div contains the key details we want – name, categories, address, etc. We can loop through and extract them:

```python
businesses = []

for result in results:
    name = result.find("a", class_="business-name").text
    categories = [tag.text for tag in result.find_all("a", class_="category-link")]
    address = result.find("div", class_="street-address").text
    # And so on for other fields...

    businesses.append({
        "name": name,
        "categories": categories,
        "address": address,
    })

print(businesses)
```
This will extract a list of businesses with their key info! We can also paginate through multiple search pages to extract all the results.
Paginating Through Search Pages
By default, YellowPages displays 27 results per page. To get all matching businesses, we need to paginate through the additional pages. The page number is controlled by a `page` parameter in the URL:

```
https://www.yellowpages.com/search?page=2
```
To find the total number of pages, we can parse the result count from the HTML and divide by the page size, rounding up:

```python
import math

result_count = soup.find("span", class_="pagination-result-count").text
# Assumes the total appears as the fourth word of the count string;
# adjust the index if YellowPages changes the format
total_results = int(result_count.split(" ")[3])
pages = math.ceil(total_results / 27)  # 27 results per page
```
We can then loop from 1 to `pages` to fetch every page:

```python
all_businesses = []
query, location = "Japanese Restaurants", "San Francisco"

for page in range(1, pages + 1):
    params = {
        "search_terms": query,
        "geo_location_terms": location,
        "page": page,
    }
    response = requests.get(BASE_URL, params=params)

    # Parse the businesses out of this page of results
    businesses = parse_search_results(response)
    all_businesses.extend(businesses)
```
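The `parse_search_results` helper used above just wraps the parsing logic from the previous section. A minimal sketch (the function name is our own; the selectors mirror the earlier example):

```python
from bs4 import BeautifulSoup

def parse_search_results(response):
    """Extract business records from one page of search results."""
    soup = BeautifulSoup(response.text, "html.parser")
    businesses = []
    for result in soup.find_all("div", class_="result"):
        businesses.append({
            "name": result.find("a", class_="business-name").text,
            "categories": [tag.text for tag in result.find_all("a", class_="category-link")],
            "address": result.find("div", class_="street-address").text,
        })
    return businesses
```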
And that's it! With these concepts you can build a complete YellowPages search scraper to find matching businesses. Next, let's look at extracting the full business details.
Scraping Business Listing Pages
For each search result, YellowPages provides a link to the business's full listing page containing additional info like hours, services, photos and reviews. To extract this data, we need to:
- Get listing page URLs from search results
- Send requests to listing URLs
- Parse listing page HTML
Let's go through each step.
First, when parsing search results, we can also extract the listing URL. The `href` attribute is a relative path, so join it against the site root and store it alongside the other fields (e.g., under a `listing_url` key):

```python
from urllib.parse import urljoin

link = result.find("a", class_="business-name")["href"]  # relative path
listing_url = urljoin("https://www.yellowpages.com", link)
```
Then we can iterate through these URLs and send requests:
```python
# After searching
listing_urls = [result["listing_url"] for result in all_businesses]

for url in listing_urls:
    listing_page = requests.get(url)
```
Finally, we can use a similar parsing pattern to extract data fields:
```python
soup = BeautifulSoup(listing_page.content, "html.parser")

name = soup.find("h1", class_="business-name").text
phone = soup.find("p", class_="phone").text
address = soup.find("p", class_="address").text
# And so on...
```
Bringing it all together:
```python
businesses = []

# Search logic...

for url in listing_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    name = soup.find("h1", class_="business-name").text
    phone = soup.find("p", class_="phone").text

    business = {
        "name": name,
        "phone": phone,
    }
    businesses.append(business)
```
This will give you a complete dataset of YellowPages business info!
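Before moving on, it's worth persisting what we've scraped. Here's a minimal sketch writing the business dicts to CSV with the standard library (the field names assume the keys used above):

```python
import csv

with open("yellowpages_businesses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "phone"])
    writer.writeheader()
    writer.writerows(businesses)
```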
Scraping YellowPages Reviews
In addition to business info, YellowPages also contains detailed customer reviews – which can provide invaluable insights into reputation, sentiment, and more. Reviews are contained in `<article>` tags on each listing page. To extract them, we can add on to our existing parser:
```python
from bs4 import BeautifulSoup

# After fetching listing page
soup = BeautifulSoup(page.content, "html.parser")

reviews = []
for review in soup.find_all("article", class_="review"):
    title = review.find("div", class_="review-title").text
    body = review.find("p", class_="review-body").text
    rating = len(review.find_all("span", class_="rating-star"))

    reviews.append({
        "title": title,
        "body": body,
        "rating": rating,
    })

print(reviews)
```
This will extract each review's title, body text, and star rating (out of 5) for analysis.
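As a quick example of the kind of analysis this enables, here's a small sketch computing the average star rating and counting low-rated reviews from the `reviews` list built above:

```python
if reviews:
    avg_rating = sum(r["rating"] for r in reviews) / len(reviews)
    negative = [r for r in reviews if r["rating"] <= 2]

    print(f"Average rating: {avg_rating:.1f} stars")
    print(f"{len(negative)} reviews rated 2 stars or below")
```

Now let's look at managing this at scale.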
Scraping YellowPages at Scale
While our scraper works well for small datasets, scraping large volumes of listings and reviews from YellowPages requires a more robust strategy. Here are some tips for managing YellowPages scraping at scale:
- Use threading/async – Utilize parallelism by using threads or async code to send concurrent requests. This speeds up scraping versus fetching pages one at a time (see the sketch after this list).
- Random delays – Add random delays between requests and paginate slowly to mimic human browsing patterns. This helps avoid detection.
- Proxy rotation – Rotate different proxies for each request. Proxies like Smartproxy, Proxy-Seller, Bright Data, and Soax help mask scrapers and prevent IP blocks.
- Store incrementally – Instead of storing everything in memory, incrementally save scraped data to disk/database. This allows resuming if errors occur.
- Error handling – Robustly handle HTTP errors, blocks, captchas, and the like with retries and exception handling.
- Deploy on cloud servers – Run scrapers on cloud servers to leverage more bandwidth, IPs and processing power.
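Here's a minimal sketch combining several of these tips – a thread pool for concurrency, random delays, simple retries, and incremental newline-delimited JSON storage. The helper names are our own, and a production scraper would add proxy rotation and smarter backoff:

```python
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_with_retries(url, retries=3):
    """Fetch a URL, retrying on errors, with a random pre-request delay."""
    for attempt in range(retries):
        try:
            time.sleep(random.uniform(1, 3))  # mimic human browsing pace
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise

def scrape_listing(url):
    response = fetch_with_retries(url)
    # Parse with the listing-page logic from earlier; a stub record for now
    return {"url": url, "html_length": len(response.text)}

# A small thread pool caps concurrency; results are appended to disk as they
# arrive, so a crash doesn't lose everything already scraped
with ThreadPoolExecutor(max_workers=5) as pool, \
        open("listings.ndjson", "a", encoding="utf-8") as f:
    for record in pool.map(scrape_listing, listing_urls):
        f.write(json.dumps(record) + "\n")
```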
Let's look at integrating proxies more closely.
Avoiding Blocks with Proxies
One key technique for stable large-scale scraping is using proxies. A rotating proxy routes each request through a different IP address, preventing your own IP from being blocked. Some popular paid proxy services include:
- Smartproxy – Proxy manager API with real-time control
- Proxy-Seller – Flexible payment plans with affordable prices
- Bright Data – Reliable residential proxies with worldwide locations
- Soax – Millions of rotating proxies globally
Here's a sketch of routing requests through a rotating proxy gateway using `requests`. The gateway address and credentials below are placeholders – each provider (including Bright Data) documents its own endpoint and authentication scheme:

```python
import requests

# Placeholder gateway URL -- substitute your provider's rotating endpoint and credentials
PROXY = "http://USERNAME:PASSWORD@proxy.example.com:8000"
proxies = {"http": PROXY, "https": PROXY}

for url in listing_urls:
    page = requests.get(url, proxies=proxies, timeout=10)  # request routed through the proxy
    # Parse page...
```

With a rotating gateway, the provider assigns a fresh IP from its pool to each request, preventing IP blocks. For more advanced configuration, providers like Bright Data also offer proxy-manager tooling for granular control over IP groups, geo-targeting, and sessions.
YellowPages Scraping Recap
And that wraps up our guide on scraping YellowPages.com with Python! Here are some key takeaways:
- YellowPages provides a wealth of data for business intelligence and research
- Build scrapers to extract search results, business details, reviews, and more
- Carefully paginate through all search pages
- Manage large-scale scraping with threads, delays, proxies
- Integrate with paid proxy tools like Bright Data to avoid blocks
Hopefully, this provides a comprehensive blueprint for rolling your own YellowPages.com scraper! Please reach out if you have any other questions.