Indeed.com is one of the largest job search engines on the web, with millions of job listings across countless industries and career levels. As such, it's an extremely valuable resource for gaining insights into the job market through data scraping and analysis.
In this comprehensive guide, we'll walk through scraping Indeed.com job listings using Python, handling pagination, extracting key details, avoiding blocks, and more.
Scraping Indeed.com Listings – Is it Allowed?
First, let's quickly discuss the legality and ethics of scraping Indeed. According to their terms of service, web scraping is permitted on Indeed.com with reasonable usage limits. They specifically call out that:
“Data mining or scraping Indeed webpages in order to collect job postings in bulk is allowed in accordance with the restrictions below…”
The main restrictions are to limit the frequency of requests, to use a delay of at least one second between requests, and to avoid adversely impacting their systems.
In other words, light to moderate scraping for research or personal use cases is allowed. Just be sure to implement throttling in your scraper code to avoid overloading their servers. Commercial use cases may require their written permission.
Scraping Stack – Python & Friends
For this web scraping project, we'll utilize the following core Python libraries:
- requests – to make HTTP requests and fetch page content
- BeautifulSoup – for parsing and extracting data from HTML
- pandas – to structure and store extracted job data
- Rotating Residential Proxies – (optional) to implement automated IP rotation and avoid blocks
There are certainly other options we could use, like Scrapy for spiders, Selenium for browser automation, or aiohttp for async HTTP. But requests + BeautifulSoup offers a good balance: simple to work with while still giving us full control.
We'll also define some common headers to send with each request so we mimic a real web browser:
```python
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html',
}
```
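Later snippets call a fetch_page helper that we should define up front. Here's a minimal sketch, assuming requests plus the headers above and the one-second delay Indeed's terms call for; error handling beyond a status check is omitted:

```python
import time
import requests

def fetch_page(url):
    # Throttle: at least one second between requests per Indeed's terms
    time.sleep(1)
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text
```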
Now let's dive into the specifics of scraping Indeed pages!
Building Optimized Indeed Search URLs
The foundation of any Indeed scraper is constructing optimized search URLs to source relevant job listings. Indeed's main search page can be accessed at indeed.com/jobs. From there, users can enter keywords, locations, and other filters to search for openings. For example, to find remote Python developer jobs, one might search for:
- Keywords: python developer remote
- Location: (blank)
And end up on a URL like:
https://www.indeed.com/jobs?q=python+developer+remote
The key parameters we can see are:
- q – the search query keywords
- l – the location filter, if specified
Thus, to programmatically generate a search URL, we construct the query string and append it to the base jobs page URL:
```python
import urllib.parse

def get_search_url(keywords, location=''):
    query_params = {
        'q': keywords,
        'l': location,
    }
    query_string = urllib.parse.urlencode(query_params)
    return f'https://www.indeed.com/jobs?{query_string}'
```
We can then generate optimized Indeed search URLs for any query:
```python
search_url = get_search_url('python developer', 'Los Angeles')
# https://www.indeed.com/jobs?q=python+developer&l=Los+Angeles
```
This provides the base search page – next we'll look at extracting the actual job results!
Parsing Job Listings from the Search Page
When we visit our generated search URL, the first ~15 matching job results are embedded in the HTML response. We could extract them by parsing the DOM and pulling data from elements like:
```html
<div class="job_seen_beacon">
  <h2>Python Developer</h2>
  <span class="companyName">CoolStartup</span>
  <div class="job-snippet">
    <ul>
      <li>Build scalable backend Python services</li>
      <li>REST API and database experience</li>
    </ul>
  </div>
  ...
</div>
```
However, there's an even easier way! Indeed conveniently provides all the job data we need in a JSON object embedded directly in the page:
```javascript
// Formatted for readability
var mosaicProviderJobcards = {
  "results": [
    {
      "jobId": "abc123",
      "jobTitle": "Python Developer",
      "companyName": "CoolStartup"
      // ...other fields
    },
    {
      "jobId": "def456",
      "jobTitle": "Junior Python Developer",
      "companyName": "BigEnterpriseCompany"
    }
  ],
  "queries": [...],
  "meta": {
    "totalResults": 368,
    "formattedLocation": "Los Angeles, CA"
  }
};
```
Rather than parse the DOM, we can extract this object using a simple regex:
```python
import re
import json

# Assumes the jobcard object is serialized on a single line;
# pass re.DOTALL to re.search if it spans multiple lines
JOBCARD_REGEX = r'mosaicProviderJobcards\s*=\s*(\{.+\});'

def extract_search_results(page_html):
    match = re.search(JOBCARD_REGEX, page_html)
    if match:
        data = json.loads(match.group(1))
        results = data['results']
        meta = data['meta']
    else:
        results = []
        meta = {}
    return results, meta
```
We can then fetch our search URL and pass the HTML to this function to extract structured results and metadata. Here's a quick end-to-end check, using the fetch_page helper sketched earlier:
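```python
search_url = get_search_url('python developer', 'Los Angeles')
html = fetch_page(search_url)
results, meta = extract_search_results(html)

# Values below correspond to the sample jobcard object shown above
print(meta.get('totalResults'))    # e.g. 368
print(results[0]['jobTitle'])      # e.g. "Python Developer"
```

Next we'll look at paginating through all available search pages.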
Paginating Through Search Results
By default, each Indeed search page contains only ~15 job listings. To retrieve all matching listings, we'll need to paginate through multiple search result pages. In the `meta` data extracted above, the `totalResults` field indicates how many total jobs were found for our query across all pages.

We can use this result count along with the page size to calculate how many paginated requests we'll need to make. Indeed search allows us to paginate via a `start` parameter indicating the offset index (stepped by the page size):

```
https://www.indeed.com/jobs?q=python&l=Austin&start=0
https://www.indeed.com/jobs?q=python&l=Austin&start=15
https://www.indeed.com/jobs?q=python&l=Austin&start=30
```
Let's put this all together:
```python
from math import ceil

RESULTS_PER_PAGE = 15

def paginate_search(query, location):
    # Fetch the initial page (offset 0)
    url = get_search_url(query, location)
    html = fetch_page(url)
    results, meta = extract_search_results(html)

    total_results = meta['totalResults']
    print(f'Found {total_results} results')

    # Calculate pagination
    num_pages = ceil(total_results / RESULTS_PER_PAGE)

    # Start at page 1: page 0 was already fetched above
    for page in range(1, num_pages):
        offset = page * RESULTS_PER_PAGE

        # Fetch page with offset
        next_url = f'{url}&start={offset}'
        next_html = fetch_page(next_url)
        next_results, _ = extract_search_results(next_html)

        # Add to master results list
        results.extend(next_results)

    return results
```
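A quick usage check, assuming the helpers above are in scope:

```python
all_results = paginate_search('python developer', 'Austin')
print(len(all_results))  # should roughly match the totalResults count
```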
This paginates through every search result page until we have extracted every job listing matching our keyword and location filters. Now let's look at scraping additional details from job post pages.
Scraping Additional Data from Job Pages
Our search results provide useful data like job titles, companies, locations, and short summaries. However, details like the full description, responsibilities, salary range, and more reside on each listing's unique job post page.
To access the individual job page, we can use the `jobId` provided in each search result:
```python
JOB_PAGE_URL = 'https://www.indeed.com/viewjob?jk={jobId}'

page_url = JOB_PAGE_URL.format(jobId=result['jobId'])
```
We can then scrape these URLs in a loop to extract additional data points:
```python
def scrape_job_page(jobId):
    url = JOB_PAGE_URL.format(jobId=jobId)
    page_html = fetch_page(url)

    # Parse the HTML to extract the description, salary, responsibilities,
    # qualifications, etc. find_title / find_description are placeholder
    # extractors (one is sketched below).
    job_data = {
        'title': find_title(page_html),
        'description': find_description(page_html),
        # ...additional fields
    }
    return job_data

all_job_data = []
for result in search_results:
    job_data = scrape_job_page(result['jobId'])

    # Combine with the original search result data
    all_job_data.append({**result, **job_data})
```
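The find_title and find_description extractors are left as placeholders above. As one hedged sketch, a description extractor might use BeautifulSoup; the jobDescriptionText id is an assumption about Indeed's markup, which can change at any time:

```python
from bs4 import BeautifulSoup

def find_description(page_html):
    # Assumes the description lives in <div id="jobDescriptionText">;
    # this selector is illustrative and may break if Indeed changes markup
    soup = BeautifulSoup(page_html, 'html.parser')
    container = soup.find(id='jobDescriptionText')
    return container.get_text(separator='\n', strip=True) if container else ''
```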
Analyzing this supplemental data along with our core search results allows for rich insights into job openings. Now let's look at some best practices for avoiding blocks when scraping at scale.
Avoiding Blocks with Proxies and Throttling
When scraping large volumes, it's important to implement precautions to avoid overloading Indeed's servers and getting blocked. Some common anti-scraping techniques Indeed may employ include:
- IP Rate Limiting – blocking scraping from fixed IP addresses
- CAPTCHAs – challenging users to prove they aren't bots
- Blocking User Agents – banning common scraper user agent strings
To avoid blocks, here are some tips (a combined code sketch follows the list):
- Use Proxies – Route requests through residential proxies like Bright Data, Smartproxy, Proxy-Seller, or Soax to prevent fixed IP blocks.
- Custom User-Agents – Set realistic browser user agents and routinely rotate them.
- Implement Throttling – Limit request frequency; Indeed's terms specifically call for a delay of at least one second between requests.
- Monitor Performance – Watch for increasing CAPTCHAs or errors suggesting blocks.
- Refine Over Time – Continuously tweak parameters to balance performance and avoidance.
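Here's a minimal sketch combining these tips; the user-agent pool, proxy endpoint, and delay bounds are placeholder assumptions, not tested values:

```python
import random
import time
import requests

# Placeholder user-agent pool -- swap in a rotating list of real browser strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

# Hypothetical proxy gateway -- substitute your provider's credentials and host
PROXIES = {
    'http': 'http://user:pass@proxy.example.com:8000',
    'https': 'http://user:pass@proxy.example.com:8000',
}

def polite_fetch(url):
    # Throttle with jitter so request timing looks less robotic
    time.sleep(random.uniform(1.0, 2.5))
    response = requests.get(
        url,
        headers={'User-Agent': random.choice(USER_AGENTS)},
        proxies=PROXIES,
        timeout=30,
    )
    response.raise_for_status()
    return response.text
```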
With the right balance of proxies, throttling, and tuning, you can extract large volumes of listings data without issue. Now let's look at analyzing and storing scraped data.
Storing and Analyzing Indeed Listings Data
Once we've built our optimized Indeed scraper, we can accumulate hundreds of thousands or even millions of job listings over time. But data is only valuable if we can unlock insights! Let's discuss some ideas for storage and analysis:
- SQL Database – For structured analysis, store scraped listings in a relational database like PostgreSQL. This allows filtering, aggregation, and joining with other datasets (a minimal loading sketch follows this list).
- NoSQL Database – To handle scale, a NoSQL database like MongoDB can store semi-structured JSON data with high throughput.
- Object Storage – For cold storage of raw listing data, cheap object stores like S3 can serve as a data lake.
- Business Intelligence – Connect your database to BI tools like Tableau to build interactive dashboards visualizing trends.
- Model Training – Feed your listings into machine learning models to uncover non-obvious insights and predictions.
- ETL Pipelines – Use frameworks like Airflow to schedule scraping workflows and transform the data for downstream needs.
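Since pandas is already in our stack, here's a minimal sketch of structuring and persisting the scraped listings; the column names mirror the jobcard fields shown earlier, and the PostgreSQL connection string is a placeholder:

```python
import pandas as pd
from sqlalchemy import create_engine

# Structure the combined listing records into a DataFrame
df = pd.DataFrame(all_job_data)
print(df[['jobTitle', 'companyName']].head())

# Flat-file export for cold storage...
df.to_csv('indeed_listings.csv', index=False)

# ...or load into PostgreSQL for structured analysis
engine = create_engine('postgresql://user:pass@localhost:5432/jobs')  # placeholder
df.to_sql('listings', engine, if_exists='append', index=False)
```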
The possibilities are truly endless given enough quality job listings data!
Scraping Indeed Listings: Next Steps
The full code for a production-grade scraper would require significantly more logic for robustness. I'd recommend starting with a simplified prototype first, then evolving to handle scale and complexity. I hope this guide provides a solid basis for you to start collecting and analyzing Indeed job listing data at scale.