Realtor.com is one of the largest real estate listing sites in the US, with over 5 million active property listings. As an immense public data source, Realtor.com provides a wealth of information for data scientists, investors, urban planners, and anyone looking to analyze real estate market trends.
In this comprehensive technical guide, we'll walk through how to build advanced web scrapers to extract Realtor.com real estate data using Python. By the end, you'll have a complete blueprint for scraping large-scale real estate data through proxies, tracking listing changes, avoiding blocks, and deploying scrapers at scale.
Why Scrape Realtor.com?
Before we dive in, let's discuss why you may want to scrape Realtor.com data:
- Market research – analyze real estate supply, demand, and pricing trends. See which neighborhoods and property types are heating up.
- Investment analysis – research properties for renovation potential. Look for rising or falling areas.
- Academic studies – extract large datasets for urban planning, economics, or public policy research.
- Competitor tracking – follow listings added by other brokers for competitive intelligence.
- Building apps – create real estate search tools, pricing estimators, and investment calculators.
Many industries utilize data from Realtor.com through web scraping to power their applications, analytics, and decision-making. Realtor.com doesn't offer a public API, but with the right techniques we can extract large volumes of listing data through scraping.
Setting up Your Python Environment
Before diving into the code, let's look at getting your Python environment configured correctly:
Virtual Environments
It's recommended to use a virtual environment for each Python project to isolate dependencies. Some popular options:
- venv – Python's built-in virtual environment module:
python3 -m venv myenv
- virtualenv – An external package to create virtualenvs:
pip install virtualenv
virtualenv myenv
- pipenv – A package that combines virtualenvs and dependency management:
pip install pipenv
pipenv shell
- conda – Anaconda's virtual env system for data science:
conda create -n myenv
conda activate myenv
Activate your environment before installing packages.
Packages
Next, install the key packages we'll use:
pip install requests BeautifulSoup4 selenium pandas
This gives us:
- requests – for making HTTP requests to the website
- BeautifulSoup – for parsing HTML and extracting data
- Selenium – for browser automation to bypass JavaScript-based protections
- pandas – for data analysis and CSV storage
We may also use packages like lxml for faster HTML parsing, the built-in multiprocessing module for parallel scraping, and pyppeteer for headless Chrome automation.
IDEs and Tools
For development, we recommend using:
- Jupyter Notebooks – great for experimenting with scraping code
- VS Code – robust code editor with Python debugging
- PyCharm – full Python IDE with auto-complete and tools
Some other useful tools include:
- scrapy – a web scraping framework for large scrapers
- Postman – for manually testing API requests
- mitmproxy – for inspecting HTTP traffic when reverse engineering
With the environment set up, let's now dive into our Realtor.com scraper.
Scraping Real Estate Listing Details
Our first scraper will extract all the data from a single real estate listing page like this example. Viewing the page source, we can see the property details are conveniently loaded in a JSON object called __NEXT_DATA__.
Let's extract it:
import json

import requests
from bs4 import BeautifulSoup

url = "https://www.realtor.com/realestateandhomes-detail/49-Mariner-Green-Dr_Copiague_NY_11726_M54679-01315"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The property details are embedded in the __NEXT_DATA__ JSON blob
data = soup.find("script", id="__NEXT_DATA__")
json_data = json.loads(data.contents[0])

property_details = json_data["props"]["pageProps"]["property"]
print(property_details["price"])  # 335000
The property_details object contains a rich set of attributes about the listing:
{ "property_id": "6789123", "status": "FOR_SALE", "price": 335000, "baths": 2, "beds": 4, "area": 2215, "address": "49 Mariner Green Dr", "city": "Copiague", "state": "NY", "zipcode": "11726", "latitude": 40.843123, "longitude": -73.410989, "photos": [ "https://ap.rdcpix.com/74ca71da924c63a1afc0d785279f042cl-m1660895823xd-w1020_h770_q80.jpg", ... ], "description": "Lovely move-in ready ranch. Freshly painted interior...", "schools": [...], "features": ["Waterfront", "Garage"], "year_built": 1960, "lot_size": 6534, "listing_agent": { "name": "John Smith", "phone": "555-555-1234", "email": "[email protected]", } ... }
That gives us tons of fields we can extract – everything from pricing, photos, and geolocation to amenities! With over 5 million listings, this data can power all sorts of real estate analysis and tools.
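Before storing or analyzing listings, it often helps to flatten this nested JSON into a simple record. Here's a minimal sketch assuming the field names shown in the sample payload above (the exact keys can vary between listings):

# Minimal sketch: flatten the nested property_details payload into a flat
# record for storage. Keys follow the sample above and may differ per listing,
# so .get() is used throughout.
def flatten_listing(property_details):
    agent = property_details.get("listing_agent") or {}
    return {
        "id": property_details.get("property_id"),
        "price": property_details.get("price"),
        "beds": property_details.get("beds"),
        "baths": property_details.get("baths"),
        "area": property_details.get("area"),
        "address": property_details.get("address"),
        "city": property_details.get("city"),
        "state": property_details.get("state"),
        "agent": agent.get("name"),
    }

house_data = flatten_listing(property_details)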
Extracting Search Results at Scale
Now that we can scrape details for individual listings, let's see how we can extract search results at scale. Realtor.com's search URL looks like this:
https://realtor.com/realestateandhomes-search/new-york-ny/pg-1
It supports pagination with the pg-X parameter. So we'll need to:
- Scrape the first page to get the total result count
- Calculate the number of pages needed
- Loop through each page scraping results
Here's how to implement it:
import json
import math

import requests
from bs4 import BeautifulSoup

search_url = "https://realtor.com/realestateandhomes-search/new-york-ny/pg-1"

first_page = requests.get(search_url)
soup = BeautifulSoup(first_page.text, 'html.parser')
data = json.loads(soup.find("script", id="__NEXT_DATA__").contents[0])

total_count = data["props"]["pageProps"]["searchResults"]["home_search"]["total"]
per_page = data["props"]["pageProps"]["searchResults"]["home_search"]["count"]
total_pages = math.ceil(total_count / per_page)

print(f"Found {total_count} results across {total_pages} pages")

all_results = []

for page in range(1, total_pages + 1):
    url = search_url.replace("pg-1", f"pg-{page}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    data = json.loads(soup.find("script", id="__NEXT_DATA__").contents[0])
    results = data["props"]["pageProps"]["searchResults"]["home_search"]["results"]
    all_results.extend(results)

print(len(all_results))  # 23452 results
By iterating through each page, we can extract all the matching results – up to 500,000 for some locations! The result data contains a subset of attributes like price, beds, area, and agent info, but not the full details we scraped earlier. To get those, we would:
- Extract the permalink for each listing
- Feed the listing URLs into our detail scraper
This gives us both a wide set of search results as well as the deep property details.
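As a rough sketch of that pipeline (assuming each search result carries a relative permalink field, as the search payload did at the time of writing), the two scrapers can be chained like this:

import json

import requests
from bs4 import BeautifulSoup

# Sketch: feed search-result permalinks into the detail scraper.
# Adjust the "permalink" key if the payload structure differs.
BASE_URL = "https://www.realtor.com/realestateandhomes-detail/"

all_details = []
for result in all_results:  # all_results from the search scraper above
    permalink = result.get("permalink")
    if not permalink:
        continue
    response = requests.get(BASE_URL + permalink)
    soup = BeautifulSoup(response.text, 'html.parser')
    data = json.loads(soup.find("script", id="__NEXT_DATA__").contents[0])
    all_details.append(data["props"]["pageProps"]["property"])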
Tracking Real Estate Listing Changes
In addition to scraping snapshots, we also want to keep our database updated with the latest listing changes. Fortunately, Realtor provides several RSS feeds covering:
- New listings
- Price changes
- Status changes
- Open houses
- Recently sold
For example, here is Realtor's California price change feed. It's an XML file containing listing URLs and publish dates whenever a price is updated:
<item>
  <link>realtor.com/ABC</link>
  <pubDate>Sun, 26 Feb 2023 11:00:00 EST</pubDate>
</item>
We can poll these feeds to pick up new changes by:
- Fetching the XML feeds on a schedule
- Parsing out the updated listing URLs
- Scraping each listing URL to get the latest details
- Saving the fresh data to our database
Here's sample code to parse the feed XML and extract listings:
import feedparser

feed_url = "https://www.realtor.com/realestateandhomes-detail/sitemap-rss-price/rss-price-ca.xml"

feed_data = feedparser.parse(feed_url)

for entry in feed_data.entries:
    url = entry.link
    published = entry.published

    # Scrape url to get updated price
    scrape_listing(url)
By running this regularly (say every 6 hours), we can keep our real estate data current!
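A bare-bones way to run this on a schedule is a simple polling loop – a sketch only, where scrape_price_feed() is a placeholder for the feed-parsing code above; in production you would more likely use cron or a task queue:

import time

POLL_INTERVAL = 6 * 60 * 60  # 6 hours, in seconds

while True:
    # scrape_price_feed() stands in for the feedparser logic above
    scrape_price_feed()
    time.sleep(POLL_INTERVAL)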
Avoiding Web Scraping Blocks
A common issue when scraping at scale is getting blocked by the website's protections. Realtor.com actively tries to prevent scraping through:
- IP Rate Limiting – blocking requests after a volume threshold
- Captchas – requiring solving images to prove you are human
- Block Pages – serving an “Access Denied” page
There are several techniques we can use to avoid triggering these protections:
Randomized User Agents
Websites track the User-Agent header to identify bots vs. real users. Setting a random desktop user agent makes our scraper appear like a normal browser:
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}

requests.get(url, headers=headers)
Rotating user agents helps avoid IP rate limits.
Residential Proxies
Routing traffic through residential proxies masks our scraper IP and gives us fresh IPs to rotate through. Popular proxy providers include Bright Data, Smartproxy, Proxy-Seller, and Soax.
Here's how to make requests through proxies:
import requests

proxy_url = "http://username:password@residential-proxy:8080"

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

requests.get(url, proxies=proxies)
Residential proxies simulate real home users to appear more human.
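A simple rotation pattern is to pick a random proxy from a pool on every request. Here's a minimal sketch with placeholder proxy URLs – substitute the endpoints your provider gives you:

import random

import requests

# Placeholder proxy endpoints – replace with your provider's URLs
PROXY_POOL = [
    "http://username:password@proxy-1.example.com:8080",
    "http://username:password@proxy-2.example.com:8080",
]

def get_with_proxy(url, **kwargs):
    # Pick a fresh proxy for every request
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, **kwargs)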
Browser Automation
Browser automation tools like Selenium and Puppeteer drive a real (optionally headless) browser that renders JavaScript just like a real user:
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

driver.get('https://realtor.com')
Browser automation is slower but can bypass more advanced protections.
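For example, the same __NEXT_DATA__ extraction can be driven through headless Chrome – a sketch that may need extra waits or stealth tweaks in practice:

import json

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.realtor.com/realestateandhomes-search/new-york-ny/pg-1")
    # Reuse the same __NEXT_DATA__ parsing on the rendered HTML
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    data = json.loads(soup.find("script", id="__NEXT_DATA__").contents[0])
finally:
    driver.quit()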
Handling Captchas
To automate solving Realtor's captchas, we can use services like AntiCaptcha and 2Captcha which solve image and ReCAPTCHA challenges:
import anticaptcha

solver = anticaptcha.PythonSolver('api_key')

# when a captcha is encountered:
solver.recaptcha(site_key=sitekey, url=page_url)
These services employ human workers to solve the captchas and return the solutions to your code.
Using Scraping APIs
Tools like ScrapingBee, ScraperAPI, and PromptCloud offer managed scraping through proxies, browsers, and captcha solving in the cloud.
import scrapingbee

client = scrapingbee.ScrapingBeeClient('API_KEY')

data = client.get(url)
APIs spare you from managing proxies and Selenium, but charge per request. By combining these tactics, we can scrape Realtor.com sustainably at scale.
Storing Scraped Real Estate Data
Once we've extracted Realtor data, we need to store it somewhere for analysis and use. Here are some good options:
CSV Files
For simple datasets, we can write to CSV files using the csv module:
import csv

with open('realtor-data.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerow(["Price", "Beds", "Area", "Agent"])
    writer.writerow([house_data['price'], house_data['beds'], ...])
CSVs work well for under 100,000 rows. They can be opened in Excel or imported to databases.
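Since we already installed pandas, writing a batch of listings through a DataFrame is often more convenient than the csv module. A quick sketch, where all_records stands for a list of flattened listing dicts like house_data:

import pandas as pd

# all_records: assumed list of flattened listing dicts (e.g. produced per listing)
df = pd.DataFrame(all_records)
df.to_csv('realtor-data.csv', index=False)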
JSON Lines
For larger datasets, the JSON Lines format is fast to append to – we simply write one JSON object per line:
import json

with open('realtor.jsonl', 'a') as f:
    f.write(json.dumps(house_data) + "\n")
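The same file loads straight back into pandas for analysis (assuming each record carries a price field):

import pandas as pd

df = pd.read_json('realtor.jsonl', lines=True)
print(df['price'].describe())  # assumes a "price" key in each record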
SQLite Database
For more sophisticated relational queries, we can use SQLite – a lightweight SQL database in Python.
First, define schema:
import sqlite3

conn = sqlite3.connect('realtor.db')
c = conn.cursor()

c.execute('''
    CREATE TABLE listings
    (id TEXT, address TEXT, price INT, baths INT, beds INT, area INT)
''')
Then insert scraped data:
c.execute(
    "INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?)",
    (house_data['id'], house_data['address'], house_data['price'], ...)
)
conn.commit()
SQLite handles indexing, filtering and joining millions of rows.
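For example, once rows are inserted we can add an index and filter listings directly:

# Speed up price filters with an index, then query the table
c.execute("CREATE INDEX IF NOT EXISTS idx_price ON listings (price)")

c.execute(
    "SELECT address, price FROM listings WHERE price < ? ORDER BY price",
    (400000,)
)
for address, price in c.fetchall():
    print(address, price)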
MongoDB
NoSQL databases like MongoDB are great for unstructured nested JSON data:
from pymongo import MongoClient

client = MongoClient()
db = client.realtor

db.listings.insert_one(house_data)
It allows storing objects as-is without schema migrations. For maximum scalability, real estate datasets can be loaded into data warehouses like Google BigQuery. The data can then power dashboards, maps, and market analysis.
Containerizing Scrapers for Scalable Deployment
To scale up our scrapers, we can containerize them using Docker for easy deployment:
1. Dockerfile
First, we define a Dockerfile to create an image:
FROM python:3.8

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "scraper.py"]
This installs our scraper's dependencies on a Python base image, copies in the code, and sets the run command.
2. Build Image
Next, we build the Docker image from the directory with the Dockerfile:
docker build -t realtor-scraper .
This produces an image we can instantiate containers from.
3. Run Container
To launch the scraper, we run it as a container:
docker run -it realtor-scraper
The scraper runs in an isolated container environment.
4. Docker Compose
For multi-container systems, Docker Compose defines the services and networking:
version: "3" services: scraper: image: realtor-scraper db: image: mysql ...
This coordinates scraper, databases, proxies etc.
5. Kubernetes
Finally, we can use Kubernetes to orchestrate containers across servers:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper
  template:
    metadata:
      labels:
        app: scraper
    spec:
      containers:
      - name: scraper
        image: realtor-scraper
Kubernetes handles scaling, failovers, load balancing and more! By containerizing scrapers, we can easily scale up to multiple servers. The containers abstract away dependencies, making deployment seamless.
Conclusion
In this guide, we walked through building a complete Python system for scraping Realtor.com listings – extracting property details and search results, tracking changes via RSS feeds, avoiding blocks, storing the data, and deploying scrapers in containers. Use these techniques to build large real estate datasets for your own applications and analysis.
If you found this guide useful, share it with others who want to tap into the wealth of data on Realtor.com through web scraping!