How to Scrape Realtor.com?

Realtor.com is one of the largest real estate listing sites in the US, with over 5 million active property listings. As an immense public data source, Realtor.com provides a wealth of information for data scientists, investors, urban planners, and anyone looking to analyze real estate market trends.

In this comprehensive technical guide, we'll walk through how to build advanced web scrapers to extract Realtor.com real estate data using Python. By the end, you'll have a complete blueprint for scraping large-scale real estate data through proxies, tracking listing changes, avoiding blocks, and deploying scrapers at scale.

Why Scrape Realtor.com?

Before we dive in, let's discuss why you may want to scrape Realtor.com data:

  • Market research – analyze real estate supply, demand, and pricing trends. See which neighborhoods and property types are heating up.
  • Investment analysis – research properties for renovation potential. Look for rising or falling areas.
  • Academic studies – extract large datasets for urban planning, economics, or public policy research.
  • Competitor tracking – follow listings added by other brokers for competitive intelligence.
  • Building apps – create real estate search tools, pricing estimators, and investment calculators.

Many industries utilize data from Realtor.com through web scraping to power their applications, analytics, and decision-making. While Realtor.com doesn't have a public API, with the right technique, we can extract large volumes of listing data through scraping.

Setting up Your Python Environment

Before diving into the code, let's look at getting your Python environment configured correctly:

Virtual Environments

It's recommended to use a virtual environment for each Python project to isolate dependencies. Some popular options:

  • venv – Python's built-in virtual environment module:
python3 -m venv myenv
  • virtualenv – An external package to create virtualenvs:
pip install virtualenv
virtualenv myenv
  • pipenv – A package that combines virtualenvs and dependency management:
pip install pipenv
pipenv shell
  • conda – Anaconda's virtual env system for data science:
conda create -n myenv
conda activate myenv

Activate your environment before installing packages.
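
For example, a venv environment is activated like this (the command differs slightly on Windows):

# macOS / Linux
source myenv/bin/activate

# Windows (PowerShell)
myenv\Scripts\Activate.ps1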

Packages

Next, install the key packages we'll use:

pip install requests beautifulsoup4 selenium pandas

This gives us:

  • requests – for making HTTP requests to the website
  • BeautifulSoup – for parsing HTML and extracting data
  • Selenium – for browser automation to render JavaScript-heavy pages
  • pandas – for data analysis and CSV storage

We may also use packages like lxml for faster HTML parsing, the standard library's concurrent.futures for parallelizing requests, and pyppeteer for headless Chrome automation.

IDEs and Tools

For development, we recommend using:

  • Jupyter Notebooks – great for experimenting with scraping code
  • VS Code – robust code editor with Python debugging
  • PyCharm – full Python IDE with auto-complete and tools

Some other useful tools include:

  • Scrapy – a web scraping framework for building large crawlers
  • Postman – for manually testing API requests
  • mitmproxy – for inspecting HTTP traffic when reverse engineering

With the environment setup, let's now dive into our Realtor.com scraper.

Scraping Real Estate Listing Details

Our first scraper will extract all the data from a single real estate listing page (the example URL used in the code below). Viewing the page source, we can see the property details are conveniently loaded in a JSON object called __NEXT_DATA__.

Let's extract it:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.realtor.com/realestateandhomes-detail/49-Mariner-Green-Dr_Copiague_NY_11726_M54679-01315"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = soup.find("script", id="__NEXT_DATA__")
json_data = json.loads(data.contents[0])

property_details = json_data["props"]["pageProps"]["property"]

print(property_details["price"])
# 335000

The property_details contains a rich set of attributes about the listing:

{
  "property_id": "6789123",
  "status": "FOR_SALE", 
  "price": 335000,
  "baths": 2,
  "beds": 4,
  "area": 2215,
  
  "address": "49 Mariner Green Dr",
  "city": "Copiague",
  "state": "NY",
  "zipcode": "11726",
  
  "latitude": 40.843123,
  "longitude": -73.410989,

  "photos": [
    "https://ap.rdcpix.com/74ca71da924c63a1afc0d785279f042cl-m1660895823xd-w1020_h770_q80.jpg",
    ...
  ],

  "description": "Lovely move-in ready ranch. Freshly painted interior...",
  
  "schools": [...],

  "features": ["Waterfront", "Garage"],

  "year_built": 1960,
  "lot_size": 6534,
  
  "listing_agent": {
    "name": "John Smith",
    "phone": "555-555-1234",
    "email": "[email protected]",    
  }
  ...
}

That gives us tons of fields we can extract – everything from pricing, photos, and geolocation to amenities! With over 5 million listings, this data can power all sorts of real estate analysis and tools.

Extracting Search Results at Scale

Now that we can scrape details for individual listings, let's see how we can extract search results at scale. Realtor.com's search URL looks like this:

https://realtor.com/realestateandhomes-search/new-york-ny/pg-1

It supports pagination via the pg-X path segment. So we'll need to:

  1. Scrape the first page to get the total result count
  2. Calculate the number of pages needed
  3. Loop through each page scraping results

Here's how to implement it:

import json
import math

import requests
from bs4 import BeautifulSoup

search_url = "https://realtor.com/realestateandhomes-search/new-york-ny/pg-1"

first_page = requests.get(search_url)
soup = BeautifulSoup(first_page.text, 'html.parser')

data = json.loads(soup.find("script", id="__NEXT_DATA__").contents[0])

total_count = data["props"]["pageProps"]["searchResults"]["home_search"]["total"]
per_page = data["props"]["pageProps"]["searchResults"]["home_search"]["count"]

total_pages = math.ceil(total_count / per_page)
print(f"Found {total_count} results across {total_pages} pages")

all_results = []

for page in range(1, total_pages+1):
  url = search_url.replace("pg-1", f"pg-{page}")
  response = requests.get(url)
  
  soup = BeautifulSoup(response.text, 'html.parser')
  data = json.loads(soup.find("script", id="__NEXT_DATA__").contents[0])
  
  results = data["props"]["pageProps"]["searchResults"]["home_search"]["results"]
  
  all_results.extend(results)

print(len(all_results))
# 23452 results

By iterating through each page, we can extract all the matching results – up to 500,000 for some locations! The result data contains a subset of attributes like price, beds, area, and agent info, but not the full details we scraped earlier. To get those, we would:

  1. Extract the permalink for each listing
  2. Feed the listing URLs into our detail scraper

This gives us both a wide set of search results as well as the deep property details.
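
Here's a minimal sketch of that hand-off, assuming each search result exposes a permalink slug that appends to the /realestateandhomes-detail/ path, and wrapping our earlier detail scraper in a scrape_listing() helper:

def scrape_listing(url):
  # Fetch a listing page and return its property details (the detail scraper from earlier)
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  data = json.loads(soup.find("script", id="__NEXT_DATA__").contents[0])
  return data["props"]["pageProps"]["property"]

detail_urls = [
  f"https://www.realtor.com/realestateandhomes-detail/{result['permalink']}"
  for result in all_results
]

all_details = [scrape_listing(url) for url in detail_urls]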

Tracking Real Estate Listing Changes

In addition to scraping snapshots, we also want to keep our database updated with the latest listing changes. Fortunately, Realtor provides several RSS feeds covering:

  • New listings
  • Price changes
  • Status changes
  • Open houses
  • Recently sold

For example, Realtor's California price-change feed (the URL used in the code below) is an XML file containing listing URLs and publish dates whenever a price is updated:

<item>
  <link>realtor.com/ABC</link> 
  <pubDate>Sun, 26 Feb 2023 11:00:00 EST</pubDate>
</item>

We can poll these feeds to pick up new changes by:

  1. Fetching the XML feeds on a schedule
  2. Parsing out the updated listing URLs
  3. Scraping each listing URL to get the latest details
  4. Saving the fresh data to our database

Here's sample code to parse the feed XML and extract listings:

import feedparser

feed_url = "https://www.realtor.com/realestateandhomes-detail/sitemap-rss-price/rss-price-ca.xml"

feed_data = feedparser.parse(feed_url)

for entry in feed_data.entries:
  url = entry.link
  published = entry.published
  
  # Scrape the URL to get the updated price (reusing our scrape_listing detail scraper)
  scrape_listing(url)

By running this regularly (say every 6 hours), we can keep our real estate data current!
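
A bare-bones polling loop might look like this (a cron job or task scheduler works just as well; poll_feed() here is a hypothetical wrapper around the feedparser routine above):

import time

FEED_URLS = [
  "https://www.realtor.com/realestateandhomes-detail/sitemap-rss-price/rss-price-ca.xml",
  # add more state / category feeds here
]

while True:
  for feed_url in FEED_URLS:
    poll_feed(feed_url)    # parse the feed and re-scrape each updated listing
  time.sleep(6 * 60 * 60)  # wait 6 hours between polls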

Avoiding Web Scraping Blocks

A common issue when scraping at scale is getting blocked by the website's protections. Realtor.com actively tries to prevent scraping through:

  • IP Rate Limiting – blocking requests after a volume threshold
  • Captchas – requiring solving images to prove you are human
  • Block Pages – serving an “Access Denied” page

There are several techniques we can use to avoid triggering these protections:

Randomized User Agents

Websites track the User-Agent header to identify bots vs real users. Setting a random desktop user agent makes our scraper appear like a browser:

import requests
from fake_useragent import UserAgent

ua = UserAgent()

headers = {
  'User-Agent': ua.random
}

requests.get(url, headers=headers)

Rotating user agents makes requests look like they come from a variety of browsers, which helps avoid simple bot fingerprinting.

Residential Proxies

Routing traffic through residential proxies masks our scraper IP and gives us fresh IPs to rotate through. Popular proxy providers include Bright Data, Smartproxy, Proxy-Seller, and Soax.

Here's how to make requests through proxies:

import requests

proxy_url = "http://username:password@residential-proxy:8080"

proxies = {
  'http': proxy_url,
  'https': proxy_url
}

requests.get(url, proxies=proxies)

Residential proxies route requests through real household IP addresses, making our traffic look like it comes from ordinary home users.
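
To actually rotate through a pool of proxies, we can pick a different endpoint for every request. Here's a minimal sketch, assuming a list of placeholder endpoints from your provider:

import random
import requests

PROXY_POOL = [
  "http://username:password@residential-proxy-1:8080",  # placeholder endpoints
  "http://username:password@residential-proxy-2:8080",
]

def fetch(url):
  proxy = random.choice(PROXY_POOL)  # fresh exit IP for each request
  return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)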

Browser Automation

Browser automation tools like Selenium and Puppeteer drive a real (optionally headless) browser that renders JavaScript just like a real user:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

driver.get('https://realtor.com')

Browser automation is slower but can bypass more advanced protections.
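
Once the page has rendered, we can hand the HTML back to BeautifulSoup and reuse the same __NEXT_DATA__ parsing as before:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")  # fully rendered HTML
script = soup.find("script", id="__NEXT_DATA__")
if script:
  data = json.loads(script.string)
driver.quit()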

Handling Captchas

To automate solving Realtor's captchas, we can use services like Anti-Captcha and 2Captcha, which solve image and reCAPTCHA challenges:

# Example with the 2Captcha Python SDK (pip install 2captcha-python)
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('API_KEY')

# when a captcha is encountered:
result = solver.recaptcha(sitekey=site_key, url=page_url)

These services employ human solvers whose answers are fed back into your code.

Using Scraping APIs

Tools like ScrapingBee, ScraperAPI, and PromptCloud offer managed scraping through proxies, browsers, and captcha solving in the cloud.

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='API_KEY')
response = client.get(url)

Scraping APIs spare you from managing proxies and Selenium yourself, but they charge per request. By combining these tactics, we can scrape Realtor.com sustainably at scale.

Storing Scraped Real Estate Data

Once we've extracted Realtor data, we need to store it somewhere for analysis and use. Here are some good options:

CSV Files

For simple datasets, we can write to CSV files using the csv module:

import csv

with open('realtor-data.csv', 'w', newline='') as file:
  writer = csv.writer(file)
  writer.writerow(["Price", "Beds", "Area", "Agent"])
  writer.writerow([house_data['price'], house_data['beds'],...])

CSVs work well for under 100,000 rows. They can be opened in Excel or imported to databases.

JSON Lines

For larger datasets, the JSON Lines format is fast to append to – we just write one JSON object per line:

import json 

with open('realtor.jsonl', 'a') as f:
  f.write(json.dumps(house_data) + "\n")
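
Since pandas is already installed, reading the file back for analysis takes one line; the column selection assumes the price, beds, and area fields shown earlier:

import pandas as pd

df = pd.read_json('realtor.jsonl', lines=True)   # one JSON object per line
print(df[['price', 'beds', 'area']].describe())  # quick summary statistics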

SQLite Database

For more sophisticated relational queries, we can use SQLite – a lightweight SQL database in Python.

First, define schema:

import sqlite3

conn = sqlite3.connect('realtor.db')
c = conn.cursor()

c.execute('''
  CREATE TABLE IF NOT EXISTS listings
  (id TEXT, address TEXT, price INT, baths INT, beds INT, area INT)
''')

Then insert scraped data:

c.execute(
  "INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?)",
  (house_data['id'], house_data['address'], house_data['price'], ...)
)

conn.commit()

SQLite handles indexing, filtering and joining millions of rows.
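
For example, a quick query against the schema above, using the same cursor:

# Cheapest four-bedroom listings first
for row in c.execute(
  "SELECT address, price FROM listings WHERE beds >= 4 ORDER BY price LIMIT 10"
):
  print(row)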

MongoDB

NoSQL databases like MongoDB are great for unstructured nested JSON data:

from pymongo import MongoClient

client = MongoClient()
db = client.realtor

db.listings.insert_one(house_data)
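
Nested fields can then be queried directly. For example, using the same db handle (field names follow the listing JSON shown earlier):

# Listings under $400k with at least 3 bedrooms
for listing in db.listings.find({"price": {"$lt": 400000}, "beds": {"$gte": 3}}):
  print(listing["address"], listing["price"])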

MongoDB allows storing objects as-is without schema migrations.

For maximum scalability, real estate datasets can be loaded into data warehouses like Google BigQuery. The data can then power dashboards, maps, and market analysis.

Containerizing Scrapers for Scalable Deployment

To scale up our scrapers, we can containerize them using Docker for easy deployment:

1. Dockerfile

First, we define a Dockerfile to create an image:

FROM python:3.8

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD [ "python", "scraper.py

This starts from a Python base image, installs our scraper's dependencies, then copies in the code and sets the run command.
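
The requirements.txt referenced above simply lists the packages we installed earlier (add version pins as needed), for example:

requests
beautifulsoup4
selenium
pandas
lxml
feedparser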

2. Build Image

Next, we build the Docker image from the directory with the Dockerfile:

docker build -t realtor-scraper .

This produces an image we can instantiate containers from.

3. Run Container

To launch the scraper, we run it as a container:

docker run -it realtor-scraper

The scraper runs in an isolated container environment.

4. Docker Compose

For multi-container systems, Docker Compose defines the services and networking:

version: "3"
services:

  scraper:
    image: realtor-scraper
  
  db:
    image: mysql
    ...

This coordinates the scraper, databases, proxies, and other services.

5. Kubernetes

Finally, we can use Kubernetes to orchestrate containers across servers:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper
  
spec:
  replicas: 3
  
  selector:
    matchLabels:
      app: scraper

  template:
    metadata:
      labels:
        app: scraper
        
    spec:
      containers:
      - name: scraper
        image: realtor-scraper

Kubernetes handles scaling, failovers, load balancing and more! By containerizing scrapers, we can easily scale up to multiple servers. The containers abstract away dependencies, making deployment seamless.
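
Assuming the manifest above is saved as scraper-deployment.yaml, it can be applied and checked with:

kubectl apply -f scraper-deployment.yaml
kubectl get pods -l app=scraper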

Conclusion

In this guide, we walked through building a complete Python system for scraping Realtor.com listings: extracting property details and search results, tracking changes via RSS feeds, avoiding blocks, storing the data, and deploying scrapers in containers. Use these techniques to build large real estate datasets that can power your applications and analysis.

If you found this guide useful, please share it with others who could benefit from tapping into Realtor.com's data through web scraping!

John Rooney

I'm John Watson Rooney, a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
