Booking.com is one of the largest online travel agencies with listings for hotels, apartments, resorts and other accommodations across the globe. The site is a treasure trove of valuable data on properties like prices, availability, reviews, amenities and more.
In this comprehensive guide, we will walk through the steps to build a robust web scraper to extract and collect data from Booking.com using Python.
Why Scrape Booking.com Data?
Here are some potential applications of scraped Booking.com data:
- Travel research – Analyze hotel prices across destinations to plan vacations and trips. Monitor prices and availability for booking at optimal times.
- Competitive intelligence – Track prices and occupancy of rival hotels in a region. Identify competitive strengths and weaknesses.
- Market research – Gain insights on tourism trends and analyze changing demand across seasons. Identify most popular destinations.
- Price monitoring – Build a price tracking system and get alerts for price drops on target hotels. Help travelers book at lowest prices.
- Meta-search integration – Populate your own travel meta-search engine with Booking listings. Provide comparison across Booking and other sites.
- Review analysis – Extract and analyze reviews to identify traveller sentiment and service quality across properties.
The data can also be combined with information from other travel sites like Airbnb, Expedia, and others to provide a comprehensive overview.
Challenges in Scraping Booking.com
However, scraping Booking at scale comes with some key technical challenges:
- Heavy use of JavaScript – Site content is loaded dynamically via JS. Scrapers have to execute JS to get complete data.
- Anti-scraping mechanisms – Blocking of bots, CAPTCHAs and other roadblocks. Need workarounds to avoid getting blocked.
- Large data volumes – Millions of listings across countries, languages and filters. Need to manage scale.
- Multiple endpoints – Details, pricing and reviews live across different URLs. Need orchestration.
We'll explore solutions for each of these challenges as we build our scraper.
Overview of Our Booking.com Scraper
At a high level, we will:
- Set up our Python environment and dependencies.
- Find entry points into Booking's hotel listings – via search or sitemaps.
- Scrape search results page to extract hotel URLs, names, addresses etc.
- Loop through each hotel URL to scrape key details like description, amenities etc.
- Make additional API calls to fetch pricing and availability data for each hotel.
- Pull reviews and ratings for each hotel by hitting separate endpoints.
- Store scraped data in a database or CSV file for further analysis.
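The overall pipeline can be sketched as plain function stubs, so the flow is visible before we fill in each step (all names below are placeholders, not working implementations):

```python
def find_hotel_urls(destination):
    # Steps 2-3: search or sitemaps -> list of hotel page URLs (stubbed)
    return ["https://www.booking.com/hotel/us/example.html"]

def scrape_hotel_page(url):
    # Step 4: parse description, amenities, etc. (stubbed)
    return {"url": url, "name": "Example Hotel"}

def get_pricing(hotel):
    # Step 5: availability API call (stubbed)
    return {"total": None}

def get_reviews(hotel):
    # Step 6: paginated review endpoints (stubbed)
    return []

def scrape_booking(destination):
    hotels = [scrape_hotel_page(u) for u in find_hotel_urls(destination)]
    for hotel in hotels:
        hotel["pricing"] = get_pricing(hotel)
        hotel["reviews"] = get_reviews(hotel)
    return hotels  # Step 7: hand off to storage

hotels = scrape_booking("New York")
```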
Let's get started!
Step 1 – Setup Python Environment
We'll use Python 3.8 or above for our scraper. Here are the key dependencies we'll need:
```
pip install requests lxml beautifulsoup4 aiohttp
```
- Requests – For sending HTTP requests to Booking.com pages.
- lxml – Faster XML and HTML parsing.
- BeautifulSoup4 – HTML parsing and extraction of data.
- aiohttp – Asynchronous requests to speed up scraping.
Additionally, we'll use:
- Proxies – To rotate IP addresses and bypass blocks. Providers such as BrightData, Smartproxy, Proxy-Seller, and Soax are popular choices.
- Asyncio – For performance gains through asynchronous requests.
Let's start by importing the libraries we need:
```python
import requests
from bs4 import BeautifulSoup
import aiohttp
import asyncio
import json
```
And define a headers dictionary to mimic a browser:

```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
```
Step 2 – Find Hotel Listing Pages
The first step is identifying the pages where we can find hotel listings to scrape. There are two main approaches here:
1. Using Booking.com Sitemaps
Booking.com provides sitemap files that list all pages on the site. The sitemaps can be accessed programmatically by hitting this URL:
https://www.booking.com/hotel/sitemap-index.xml
This provides sitemap indexes, each linking to thousands of actual sitemap files containing URLs of hotel, destination and other pages. Here's how we can use the sitemaps in our scraper:
```python
import requests
import lxml.etree as etree

# Sitemap files use the standard sitemap XML namespace, so we must
# register it for the XPath queries to match.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_index = requests.get("https://www.booking.com/hotel/sitemap-index.xml")

# Parse index to get all sitemap file URLs
index_xml = etree.fromstring(sitemap_index.content)
sitemaps = [loc.text for loc in index_xml.findall(".//sm:sitemap/sm:loc", namespaces=NS)]

# Download each sitemap and extract hotel URLs
for sitemap in sitemaps:
    xml = requests.get(sitemap).content
    sitemap_xml = etree.fromstring(xml)
    hotel_urls = [loc.text for loc in sitemap_xml.findall(".//sm:url/sm:loc", namespaces=NS)]
    # Now scrape each hotel_url...
```
This gives us a comprehensive list of hotel pages to feed into our scraper.
2. Scrape Search Results Pages
Alternatively, we can scrape Booking's search endpoint to find hotels matching keywords, locations, dates, and other search filters. For example:
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ&lang=en-us&sid=d208f9e19693dd7b6bc484a17edd95c7&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ%26lang%3Den-us%26sid%3Dd208f9e19693dd7b6bc484a17edd95c7%26sb_price_type%3Dtotal%26%26&ss=New+York%2C+New+York+State%2C+USA&is_ski_area=0&ssne=New+York&ssne_untouched=New+York&city=-2140479&checkin_year=2023&checkin_month=2&checkin_monthday=26&checkout_year=2023&checkout_month=3&checkout_monthday=5&group_adults=2&group_children=0&no_rooms=1&from_sf=1&search_pageview_id=7a286162a5020001
We can replicate these search queries in Python:
```python
import requests
from urllib.parse import urlencode

base_url = "https://www.booking.com/searchresults.html?"

# Parameter names mirror those visible in the search URL above
params = {
    "ss": "New York",
    "checkin_year": 2023,
    "checkin_month": 2,
    "checkin_monthday": 26,
    "checkout_year": 2023,
    "checkout_month": 3,
    "checkout_monthday": 5,
    "group_adults": 2,
    "no_rooms": 1,
}

final_url = base_url + urlencode(params)
response = requests.get(final_url, headers=headers)
```
This searches Booking.com and retrieves the HTML of the search results page for scraping. The search approach provides more flexibility than sitemaps to target specific destinations, dates and filters.
Step 3 – Scrape Search Results Page
With the search results page HTML obtained in the previous step, we can now parse it to extract hotel URLs and key metadata like name, address, and rating. We'll use BeautifulSoup for parsing:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")

hotels = []
for result in soup.find_all("div", class_="sr_item_content"):
    name = result.find("span", class_="sr-hotel__name").text.strip()
    url = result.find("a")["href"]
    # The rating badge may be absent for unrated properties, so guard it
    rating_badge = result.find("div", class_="bui-review-score__badge")
    rating = rating_badge.text.strip() if rating_badge else None
    address = result.find("span", class_="hp_address_subtitle").text.strip()

    hotels.append({
        "name": name,
        "url": url,
        "rating": rating,
        "address": address,
    })
```
This gives us a list of dicts containing hotel URLs and metadata ready for the next stage. The process can be repeated across multiple paginated search results by incrementing page number in the search query.
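As a sketch of that pagination, assuming the results endpoint accepts an offset-style parameter (the parameter name "offset" and a page size of 25 are assumptions to verify against live search URLs):

```python
from urllib.parse import urlencode

def search_page_urls(destination, pages, per_page=25):
    # Build one search URL per results page by stepping an offset parameter.
    base = "https://www.booking.com/searchresults.html?"
    urls = []
    for page_num in range(pages):
        params = {"ss": destination, "offset": page_num * per_page}
        urls.append(base + urlencode(params))
    return urls

urls = search_page_urls("New York", pages=3)
```

Each URL can then be fetched and parsed with the same BeautifulSoup code as the first page.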
Step 4 – Scrape Hotel Page Data
Now we loop through the list of hotel URLs, and send requests to scrape key data from each page:
```python
for hotel in hotels:
    page = requests.get(hotel["url"], headers=headers)
    soup = BeautifulSoup(page.text, "lxml")

    description = soup.find("div", id="property_description_content").text
    amenities = [item.text for item in soup.find_all("li", class_="facilitiesChecklistItem")]

    hotel["description"] = description
    hotel["amenities"] = amenities
```
Some other details we can extract include:
- Hotel name, address
- Latitude and longitude
- Number and types of rooms
- List of available amenities
- High-res photos
- TripAdvisor rating
These can be extracted by parsing different id and class attributes on the page.
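For example, the latitude and longitude are often embedded in a data attribute on the map link. A minimal sketch (the attribute name data-atlas-latlng is an assumption to check against the live page markup):

```python
import re

# Sample snippet standing in for the fetched hotel page HTML
html = '<a class="show_map" data-atlas-latlng="40.7580,-73.9855">Show map</a>'

# Pull the comma-separated coordinate pair out of the attribute
match = re.search(r'data-atlas-latlng="([-\d.]+),([-\d.]+)"', html)
if match:
    latitude, longitude = float(match.group(1)), float(match.group(2))
```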
Step 5 – Get Hotel Price & Availability Data
The hotel pages themselves do not contain pricing and availability information. To get that, we need to:
- Fetch variables like hotel ID, csrf token etc. from page HTML
- Make a JSON API request to Booking.com's Availability API by populating these variables.
Here is a sample Availability API request:
```
POST https://www.booking.com/graphql/availability

{
    "hotel_id": "20472",
    "language": "en-gb",
    "csrf_token": "abCDef1234",
    "variables": {
        "search": {
            "rooms": [{"adults": 2, "children": []}],
            "stay": {"checkin": "2023-03-10", "checkout": "2023-03-15"}
        }
    }
}
```
And sample Python code to make this API request:
```python
api_url = "https://www.booking.com/graphql/availability"

payload = {
    "hotel_id": "20472",
    "csrf_token": "abCDef1234",  # extract token from page HTML
    "variables": {
        "search": {
            "rooms": [{"adults": 2, "children": []}],
            "stay": {"checkin": "2023-03-10", "checkout": "2023-03-15"}
        }
    }
}

response = requests.post(api_url, json=payload, headers=headers)
data = response.json()

print(data["data"]["property"]["rooms"][0]["availability"]["price"]["total"])
```
This returns the total price for the specified stay. The API can be queried with different date ranges to build a price calendar.
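A simple way to drive that calendar is to generate one (checkin, checkout) pair per day in the window and drop each pair into the "stay" block of the payload above:

```python
from datetime import date, timedelta

def stay_windows(start, days, nights=1):
    # One (checkin, checkout) ISO-date pair per day in the calendar window
    windows = []
    for i in range(days):
        checkin = start + timedelta(days=i)
        checkout = checkin + timedelta(days=nights)
        windows.append((checkin.isoformat(), checkout.isoformat()))
    return windows

windows = stay_windows(date(2023, 3, 10), days=5)
```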
Step 6 – Scrape Reviews for Each Hotel
Reviews and ratings provide crucial information for travelers. To get reviews for a hotel, we need to:
- Navigate to its reviews page e.g.
www.booking.com/hotel/reviews/20472.en-gb.html
- Extract number of review pages from pagination
- Hit the paginated API endpoint for each page
https://www.booking.com/reviewlist.en-gb.html?pagename=20472&rows=10&order=f_recent_desc&offset=0
Here's a Python function to scrape all reviews for a hotel:
```python
def get_reviews(hotel_id):
    # Get the first page to read the total page count from pagination
    url = f"https://www.booking.com/reviewlist.en-gb.html?pagename={hotel_id}&rows=20&offset=0"
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, "lxml")

    total_pages = soup.find("li", class_="bui-pagination__pages").text
    total_pages = int(total_pages.split()[-1])

    reviews = []

    # Hit paginated URLs
    for page_num in range(total_pages):
        offset = page_num * 20  # 20 reviews per page
        url = f"https://www.booking.com/reviewlist.en-gb.html?pagename={hotel_id}&rows=20&offset={offset}"
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, "lxml")

        # Extract reviews from page
        reviews.extend(extract_reviews(soup))

    return reviews

def extract_reviews(soup):
    # Parse HTML to get info like review text, username, date etc.
    # ...
    # return list of review dicts
    pass
```
This collects reviews across all pages for a hotel ID into a consolidated list.
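As a starting point for the parsing step left as a stub above, here is a minimal extract_reviews() sketch; the class names used as selectors are assumptions that need checking against the live review markup:

```python
from bs4 import BeautifulSoup

def extract_reviews(soup):
    # Selector class names below are assumptions, not documented values
    reviews = []
    for block in soup.find_all("div", class_="review_item"):
        user = block.find("span", class_="reviewer_name")
        body = block.find("p", class_="review_item_review_content")
        reviews.append({
            "user": user.get_text(strip=True) if user else None,
            "text": body.get_text(strip=True) if body else None,
        })
    return reviews

# Inline sample standing in for a fetched review-list page
sample = """
<div class="review_item">
  <span class="reviewer_name">Ann</span>
  <p class="review_item_review_content">Great location, friendly staff.</p>
</div>
"""
reviews = extract_reviews(BeautifulSoup(sample, "html.parser"))
```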
Step 7 – Store Scraped Hotel Data
As a final step, we can persist the scraped hotel data for easier access and analysis. Some options are:
- JSON – Store key-value serialized JSON objects on disk, one file per hotel.
- CSV – Output a CSV file with one row per hotel and columns for all attributes.
- Database – Insert hotels into a relational database like Postgres for complex querying.
- Elasticsearch – For full-text search and analytics. Could build hotel search engine!
CSV or JSON files provide a simple format for offline analysis in Python/pandas. A database would enable building a production app on top of the scraped content.
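The CSV option can be sketched with the standard library's csv module; the column names here simply mirror the fields collected in the earlier steps:

```python
import csv

hotels = [{
    "name": "Example Hotel",
    "url": "https://www.booking.com/hotel/us/example.html",
    "rating": "8.5",
    "address": "123 Example St",
    "amenities": ["WiFi", "Parking"],
}]

with open("hotels.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "url", "rating", "address", "amenities"])
    writer.writeheader()
    for hotel in hotels:
        # Flatten the amenities list so it fits a single CSV cell
        writer.writerow(dict(hotel, amenities="; ".join(hotel["amenities"])))
```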
Enhancements and Optimizations
Here are some tips to make our Booking scraper more robust and production-ready:
- Asynchronous requests with aiohttp to speed up scraping and make it concurrent.
- Proxy rotation to prevent IP blocks by distributing requests across different IPs and geolocations.
- Random delays between requests to mimic human behavior and avoid bot detections.
- User-Agents rotation through a pre-defined list of desktop/mobile browser agents, so all requests are not from the same user-agent.
- Scraping in batches to handle large volumes without memory overhead.
- Separate scrapers for listings, hotel details, pricing, and reviews to decouple components.
- Containerization with Docker for easier deployment and scaling.
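Several of these tips can be combined in one pattern: capped concurrency, random delays, and user-agent rotation. The fetch below is stubbed so the structure is visible without hitting the network; a real version would make the request with aiohttp inside the semaphore:

```python
import asyncio
import random

# Placeholder pool of user agents to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

async def fetch(url, sem):
    async with sem:                                       # cap concurrent requests
        await asyncio.sleep(random.uniform(0.01, 0.05))   # random human-like delay
        headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate user agent
        return url, headers  # a real fetch would await an aiohttp request here

async def scrape_all(urls, max_concurrent=5):
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

results = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(10)]))
```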
In summary, a production-grade scraper should combine:
- Asynchronous scraping
- Random delays
- Proxy rotation
- Batching
- Microservice architecture
- Containerization
These best practices help circumvent anti-scraping measures while also improving performance and reliability.
Legal and Ethical Considerations
When building scrapers, it's important we consider potential legal and ethical implications:
- Only scrape publicly accessible data that does not require authentication or payment. Avoid scraping non-public or personal information.
- Understand the target site's Terms of Service and robots.txt to ensure your scraping activities are permitted and non-disruptive. For Booking.com, some key points from their ToS:
- Publicly available data is the safest target, but terms change over time, so verify the current ToS yourself before scraping.
- No hitting secured areas of site or user accounts.
- Employ rate limiting to minimize load.
- Use proxies and limit request volume to reduce load on target site. Distribute requests over long durations.
- Do not flood a site with scraping requests; excessive volume can amount to an abuse or denial-of-service attack.
- Respect opt-outs like CAPTCHAs and employ workarounds sparingly.
- Cache scraped data locally and monitor changes to avoid re-scraping unchanged pages.
- Credit sources appropriately and backlink when publishing data.
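The caching tip above can be sketched as a small on-disk cache keyed by a hash of the URL, so unchanged pages are never fetched twice (in real use you would pass a callable wrapping requests.get):

```python
import hashlib
import os

CACHE_DIR = "page_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_get(url, fetch):
    # fetch is any callable url -> str, e.g. lambda u: requests.get(u).text
    key = hashlib.sha256(url.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".html")
    if os.path.exists(path):                 # cache hit: skip the network
        with open(path, encoding="utf-8") as f:
            return f.read()
    body = fetch(url)                        # cache miss: fetch and store
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    return body

first = cached_get("https://example.com/a", lambda u: "<html>v1</html>")
second = cached_get("https://example.com/a", lambda u: "<html>v2</html>")
```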
The key principles are ensuring you have permission, minimizing disruption to the site, and avoiding excessive load through rate limiting and caching. While most travel sites allow some scraping, it should only be done in moderation. Make reasonable efforts to minimize inconvenience and be transparent about your activities.
Conclusion
Above, we detailed how to create a Booking.com scraper with Python. We emphasized ethical practices like using proxies, caching, and adhering to Booking's terms. This base scraper can be enhanced to gather details like images, room types, and more. It's a solid start for travel analytics, price tracking, or hotel searches. The same methods can be tweaked for sites like Airbnb, Expedia, and TripAdvisor. I hope you found this guide useful.