How to Scrape Glassdoor?

Glassdoor is one of the largest websites used by job seekers, employees, recruiters and researchers to find jobs, company information, salaries, reviews, interviews and workplace insights. With over 59 million reviews and data on over 1 million companies in 190 countries, Glassdoor contains a vast and comprehensive dataset on the global job market.

In this detailed guide, we'll dive into techniques to scrape and extract data from Glassdoor using Python and proxies.

Scraping Environment Setup

Let's look at the core tools and libraries we'll utilize for scraping Glassdoor:

Python 3 – Our language of choice for scraping due to its rich ecosystem of libraries and tools.
Requests – A very popular Python module for fetching web pages via HTTP requests.
BeautifulSoup – A battle-tested Python library for parsing and extracting data from HTML and XML documents.
Scrapy – A powerful web crawling framework for building robust, high-performance scrapers.
Proxy Service – A paid proxy API for residential IPs to mask scrapers and avoid blocks.
MongoDB – For storing and querying our scraped Glassdoor data.

For browser automation scenarios, we may also leverage tools like Selenium, Playwright, or Puppeteer depending on our specific needs. But for most data scraping, Requests + BeautifulSoup provides a simple and fast solution.

Let's start by installing the key packages:

pip install requests beautifulsoup4 scrapy pymongo

We'll also need an account with a proxy service (such as Bright Data or GeoSurf) and a MongoDB database to store scraped results.

Obtaining Company IDs

To scrape company-specific pages on Glassdoor, we first need each company's unique identifier. These numeric IDs are not shown prominently on the Glassdoor site, but can be retrieved by calling the autocomplete/search API:

import requests
import json

search_url = "https://www.glassdoor.com/api/api.htm?t=TYPEAHEAD&action=employers&q=microsoft"

response = requests.get(search_url)
data = json.loads(response.text)

for result in data:
    print(result['id'], result['name'])

This searches for “microsoft” and prints out:

16752 Microsoft

We can then construct company specific URLs using this ID, like:

https://www.glassdoor.com/Overview/Working-at-Microsoft-EI_IE16752.11,19.htm

To collect IDs in bulk instead of one search at a time, we can also extract them from Glassdoor's sitemap:

import requests
from bs4 import BeautifulSoup
import re

response = requests.get("https://www.glassdoor.com/Sitemap/Company-Sitemap.xml")

soup = BeautifulSoup(response.text, 'lxml')

for url in soup.find_all('url'):
  match = re.search(r"Working-at-(.+)-EI_IE(\d+)", url.find('loc').text)
  
  if match:
    company_name = match.group(1)
    company_id = match.group(2)
    
    print(company_name, company_id)

This fetches the sitemap XML, then extracts the company name and ID from each URL, giving us IDs for every company listed in the sitemap. When we only care about a handful of companies, changing the q parameter of the search API call is the quicker route. With IDs in hand, we can now start scraping company-specific pages.
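Both lookups can be wrapped in small helpers. The following is a minimal sketch that reuses the search endpoint and the regular expression from the snippets above; the function names build_search_url and parse_company_url are ours, not part of any Glassdoor API, and the endpoint may change or be restricted at any time.

```python
import re
from urllib.parse import urlencode

# Endpoint and pattern taken from the examples above
SEARCH_ENDPOINT = "https://www.glassdoor.com/api/api.htm"
COMPANY_URL_RE = re.compile(r"Working-at-(.+)-EI_IE(\d+)")


def build_search_url(query):
    """Build the typeahead search URL for an arbitrary company name."""
    params = {"t": "TYPEAHEAD", "action": "employers", "q": query}
    # urlencode escapes spaces and special characters in the query
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def parse_company_url(url):
    """Return (company_name, company_id) from an overview URL, or None."""
    match = COMPANY_URL_RE.search(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(build_search_url("Goldman Sachs"))
print(parse_company_url(
    "https://www.glassdoor.com/Overview/Working-at-Microsoft-EI_IE16752.11,19.htm"
))
```

Building the URL with urlencode avoids broken requests when a company name contains spaces or ampersands.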

Bypassing Anti-Scraping Mechanisms

Like most major websites, Glassdoor employs a series of anti-scraping mechanisms to detect and block bots and scrapers. These include:

IP rate limiting – Banning IPs that send too many requests in a period of time.
CAPTCHAs – Challenging suspect IPs to solve images/text captchas.
Blocking user-agents – Banning common scraper user-agents like Python-urllib.
Cookies/JavaScript – Requiring cookies and JS rendering to access some pages.

Here are some techniques we can use to bypass these protections and scrape effectively:

Use proxies: By routing requests through residential proxy IPs, we can appear as many different users, avoiding IP blocks. Proxy services like Bright Data offer millions of IPs to cycle through.
Rotate user-agents: We can randomly select a user-agent from a list of real browser UAs on each request, preventing blocks by scraper user-agents.
Add delays: Introducing 3-5 second delays between requests and limiting request rates helps avoid triggering abuse detection.
JS rendering: For pages that require JavaScript, we can drive a headless browser with Playwright or Selenium so that page scripts are executed.
Solve CAPTCHAs: If encountering occasional CAPTCHAs, we can programmatically parse out the image/audio challenge and solve it using a CAPTCHA solving service to continue scraping.
Mimic human behavior: Clicking elements, scrolling pages, and moving the mouse in random patterns helps evade bot protections.

These tricks allow us to scrape Glassdoor reliably at scale without getting blocked. Next let's see how to extract data from company pages.
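Before moving on, here is a minimal sketch of the proxy, user-agent, and delay techniques listed above. The later snippets in this guide call a get_proxy() helper without defining it, so this shows one possible shape for it. The proxy endpoints are hypothetical placeholders, and random_headers and pick_delay are names we made up for illustration.

```python
import random

# Hypothetical proxy endpoints; substitute the gateway and credentials
# supplied by your proxy provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

# A small sample of desktop browser user-agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0",
]


def get_proxy():
    """Pick a random proxy in the mapping format that requests expects."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}


def random_headers():
    """Return request headers with a randomly chosen user-agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def pick_delay(min_seconds=3.0, max_seconds=5.0):
    """Return a random pause length, in seconds, within the given range."""
    return random.uniform(min_seconds, max_seconds)
```

A request would then look like requests.get(url, proxies=get_proxy(), headers=random_headers()), preceded by time.sleep(pick_delay()).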

Scraping Company Overview Pages

Every company on Glassdoor has a dedicated overview page with useful metadata like:

Company description
Headquarters location
Industry and sector
Company size
Revenue
Founded date
CEO approval rating

To extract this information, we'll make a request to fetch the page HTML, then parse out the data we want:

import requests
from bs4 import BeautifulSoup

company_id = "16752" 

url = f"https://www.glassdoor.com/Overview/Working-at-Microsoft-EI_IE{company_id}.11,19.htm"

# get_proxy() is assumed to return a requests-style proxies dict
response = requests.get(url, proxies=get_proxy())
soup = BeautifulSoup(response.text, 'html.parser')

desc = soup.find("div", {"data-test": "employerDescription"}).text
ceo_rating = soup.find("span", {"data-test": "ceo-rating"}).get('title')  
size = soup.find("div", {"data-test": "employer-size"}).text
revenue = soup.find("div", {"data-test": "employer-revenue"}).text

print(desc)
print(ceo_rating) 
print(size)
print(revenue)

Here we:

Fetch the page HTML
Parse out the key fields we want
Print the extracted data

We can extract further data points such as headquarters, type, industry, and website by targeting additional elements on the page with their own selectors. Now let's look at scaling this up to extract overview data across many companies using Scrapy.
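One caveat first: soup.find() returns None when an element is missing, so calling .text on the result raises AttributeError as soon as Glassdoor changes its markup or serves a block page. A minimal sketch of two guard helpers follows; the names safe_text and safe_attr are ours, and they rely only on the get_text() and get() methods that BeautifulSoup tags provide.

```python
def safe_text(node, default=None):
    """Return the stripped text of a parsed element, or default if it is missing."""
    if node is None:
        return default
    return node.get_text(strip=True)


def safe_attr(node, name, default=None):
    """Return an attribute value of a parsed element, or default if it is missing."""
    if node is None:
        return default
    return node.get(name, default)
```

With these, size = safe_text(soup.find("div", {"data-test": "employer-size"})) yields None instead of crashing when the field is absent, and safe_attr(soup.find("span", {"data-test": "ceo-rating"}), "title") does the same for the CEO rating.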

Scraping Company Overviews at Scale with Scrapy

Scrapy is a popular Python scraping framework optimized for crawling many pages quickly and efficiently. To leverage Scrapy, we can define a Spider subclass to scrape and parse overview pages:

import scrapy
from scrapy.crawler import CrawlerProcess

class OverviewSpider(scrapy.Spider):
  name = 'overview'
  
  custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
  }

  def start_requests(self):
    # Plain strings are enough here; both URLs follow the EI_IE<id> pattern
    urls = [
      "https://www.glassdoor.com/Overview/Working-at-Microsoft-EI_IE16752.11,19.htm",
      "https://www.glassdoor.com/Overview/Working-at-Facebook-EI_IE40772.11,18.htm"
    ]

    for url in urls:
      yield scrapy.Request(url=url, callback=self.parse)

  def parse(self, response):
    desc = response.css("div[data-test='employerDescription'] ::text").get()
    ceo_rating = response.css("span[data-test='ceo-rating']::attr(title)").get()

    yield {
        'description': desc,
        'ceo_rating': ceo_rating
    }

process = CrawlerProcess()
process.crawl(OverviewSpider)
process.start()

This defines a simple Spider to scrape and parse 2 example companies. To scale it up, we can pass many more URLs or generate them dynamically from company IDs. We can also add more CSS selectors to parse additional fields, export results to MongoDB, add proxies, etc. Scrapy provides a very fast and convenient way to scrape data across thousands of overview pages.
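When scaling up, the spider's custom_settings is also the natural place for throttling, retries, and export. The sketch below uses standard Scrapy setting names, but the values are illustrative starting points rather than tuned recommendations, and the output filename is an assumption.

```python
# Candidate values for the spider's custom_settings
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 3,                  # base delay in seconds between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the delay between 0.5x and 1.5x
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # keep parallelism low
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 3,
    "AUTOTHROTTLE_MAX_DELAY": 30,
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,                     # retries per failed request
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
    # Write every yielded item to a JSON Lines file
    "FEEDS": {"overviews.jsonl": {"format": "jsonlines"}},
}
```

These can be merged with the existing user-agent entry, for example custom_settings = {"USER_AGENT": "...", **POLITE_SETTINGS}.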

Extracting Job Listings

Glassdoor hosts millions of job listings aggregated from company sites and job boards all over the web. These job postings provide useful data like:

Job title
Company name
Location
Job description
Salary estimate
Skills required

To extract all job listings for a given company, we'll need to paginate through the multiple pages of results. Here's an approach:

import math
import json 
import requests
from bs4 import BeautifulSoup

company_id = "16752" # Microsoft 

def extract_listings(page):
  
  url = f"https://www.glassdoor.com/Job/microsoft-jobs-SRCH_KE0,9_IP{page}.htm"
   
  response = requests.get(url, proxies=get_proxy()) 
  soup = BeautifulSoup(response.text, 'html.parser')

  total = int(soup.select_one(".paginationFooter").getText().split()[-1])
  max_pages = math.ceil(total / 20) # 20 jobs per page
  
  jobs = []

  for el in soup.select(".jl"):
    
    title = el.select_one(".jobLink").getText()
    location = el.select_one(".subtleloc").getText()

    jobs.append({
      'title': title,
      'location': location
    })

  return jobs, max_pages

# Fetch page 1 first to learn how many pages exist
results, max_pages = extract_listings(1)

for page in range(2, max_pages + 1):

  jobs, _ = extract_listings(page)
  results.extend(jobs)

print(json.dumps(results, indent=2))

This paginates through each listing page, extracting the title and location of every job. To further enrich the data, we can parse additional details from each job page like description, department, salary range, and skills.
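Since the same fetch-first-page-then-loop pattern recurs for reviews and interviews, it can be factored into a reusable helper. This is a sketch under the assumption that the page function returns an (items, max_pages) pair, as extract_listings does; scrape_all_pages and fake_fetch are hypothetical names, and fake_fetch only stands in for a real fetcher so the helper can be tried offline.

```python
def scrape_all_pages(fetch_page, page_limit=None):
    """Collect items from every page.

    fetch_page(page) must return (items, max_pages), like extract_listings.
    """
    # The first page tells us how many pages there are in total
    items, max_pages = fetch_page(1)
    results = list(items)
    if page_limit is not None:
        max_pages = min(max_pages, page_limit)
    for page in range(2, max_pages + 1):
        page_items, _ = fetch_page(page)
        results.extend(page_items)
    return results


# Stand-in for a real page fetcher: three pages, one job per page
def fake_fetch(page):
    return [{"title": f"Job {page}"}], 3


print(scrape_all_pages(fake_fetch))
```

The optional page_limit argument is handy for test runs, where fetching only the first few pages keeps the request volume low.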

Parsing Company Reviews

Reviews from current and past employees can provide tremendous insight into company culture, sentiment, management styles, and more. Glassdoor contains over 59 million reviews with breakdowns by department, job role, pros/cons, ratings, and other metadata.

Let's look at how we can extract all reviews for a company. The approach is similar to jobs – paginate over review pages and parse each one:

import math
import json
import requests 
from bs4 import BeautifulSoup

company_id = "16752"

def extract_reviews(page):

  url = f"https://www.glassdoor.com/Reviews/Microsoft-Reviews-E16752_P{page}.htm"

  response = requests.get(url, proxies=get_proxy())
  soup = BeautifulSoup(response.text, 'html.parser')

  total = int(soup.select_one(".pagination").getText().split()[-1])
  max_pages = math.ceil(total / 20)

  reviews = []

  for div in soup.select(".review"):
   
    title = div.select_one(".summary").getText()
    rating = div.find("span", {"class": "rating"}).get("title")

    reviews.append({
      'title': title,
      'rating': rating  
    })

  return reviews, max_pages

# Fetch page 1 first to learn how many pages exist
results, max_pages = extract_reviews(1)

for page in range(2, max_pages + 1):

  reviews, _ = extract_reviews(page)
  results.extend(reviews)
  
print(json.dumps(results, indent=2))

This extracts the title and rating of each review. We can also parse pros, cons, advice to management, and other fields from each review. Sentiment analysis of these reviews can identify companies with exceptionally happy or dissatisfied employees.
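Short of full sentiment analysis, even the scraped ratings give a first signal. The sketch below assumes that the rating field holds a numeric string such as "4.0", which depends on what the title attribute actually contains on the live page; summarize_ratings is a hypothetical helper and the sample reviews are made up.

```python
from collections import Counter
from statistics import mean


def summarize_ratings(reviews):
    """Return (average, distribution) for reviews whose rating parses as a number."""
    values = []
    for review in reviews:
        try:
            values.append(float(review["rating"]))
        except (KeyError, TypeError, ValueError):
            continue  # skip reviews with a missing or non-numeric rating
    if not values:
        return None, Counter()
    return mean(values), Counter(values)


# Made-up sample in the same shape as the scraped results
sample = [
    {"title": "Great team", "rating": "5.0"},
    {"title": "Long hours", "rating": "3.0"},
    {"title": "No rating", "rating": None},
]
average, distribution = summarize_ratings(sample)
print(average, dict(distribution))
```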

Gathering Salary Data

In addition to job listings, Glassdoor allows users to self-report salary details anonymously. These crowdsourced salary reports can provide pay range insights by:

Job title
Location
Years of experience
Company
Size
Revenue
Gender

Let's look at an approach to extract reported salaries for a company:

import json
import requests
from bs4 import BeautifulSoup

company_id = "16752" 

url = f"https://www.glassdoor.com/Salary/Microsoft-Salaries-E16752.htm"

response = requests.get(url, proxies=get_proxy())
soup = BeautifulSoup(response.text, 'html.parser')

salaries = []

for row in soup.select(".compensationRow"):
  
  role = row.select_one(".jobTitle").getText()
  salary = row.select_one(".gray").getText()

  salaries.append({
     'role': role.strip(),
     'salary': salary
  })
  
print(json.dumps(salaries, indent=2))

This scrapes each salary report row, extracting the job title and reported salary into our salaries list. We can further enrich this data by also scraping the location, date submitted, years of experience, and other attributes from each report. Glassdoor contains hundreds of thousands of salary reports across companies and provides localized pay info globally.
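The scraped salary values are display strings, so they need converting before any numeric analysis. Here is a minimal sketch, assuming formats such as "$120,000/yr" or "$85K - $120K"; the real page text may differ, and parse_salary is a name we chose for illustration.

```python
import re

# A number with optional thousands separators and decimals,
# optionally followed by a standalone K suffix
_SALARY_RE = re.compile(r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(?:([kK])\b)?")


def parse_salary(text):
    """Return every dollar amount found in text as a list of floats."""
    amounts = []
    for number, kilo in _SALARY_RE.findall(text):
        value = float(number.replace(",", ""))
        if kilo:
            value *= 1000  # "85K" means 85,000
        amounts.append(value)
    return amounts


print(parse_salary("$120,000/yr"))
print(parse_salary("$85K - $120K"))
```

A range yields two values, so a caller can store them as minimum and maximum or average them, depending on the analysis.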

Scraping Interview Details

Interview reviews and questions posed to candidates can help job seekers prepare and know what to expect during the process. Glassdoor has a section dedicated to user-submitted interview details such as:

Interview questions asked
Process difficulty rating
Interview experience reviews
Tips from candidates
Offer and rejection stats

Let's look at how we can scrape and extract these interview insights. Similarly to jobs and reviews, we'll need to paginate over multiple pages and parse each one:

import math
import json
import requests
from bs4 import BeautifulSoup 

company_id = "16752"

# Base URL without the ".htm" suffix so the page suffix can be appended
base_url = "https://www.glassdoor.com/Interview/Microsoft-Interview-Questions-E16752"

response = requests.get(f"{base_url}.htm", proxies=get_proxy())
soup = BeautifulSoup(response.text, 'html.parser')

total = int(soup.select_one(".pagination").getText().split()[-1])
max_pages = math.ceil(total / 20)

questions = []

for page in range(1, max_pages + 1):

  response = requests.get(f"{base_url}_P{page}.htm", proxies=get_proxy())
  soup = BeautifulSoup(response.text, 'html.parser')
    
  for li in soup.select(".interviewQuestion"):
    question = li.select_one(".questionText").getText()  
    questions.append(question)

print(json.dumps(questions, indent=2))

This extracts just the interview question text, but we can also grab difficulty ratings, candidate tips, and other details. Analyzing these questions can help surface the most common ones to expect for a given company or role.

Storing Scraped Data

Now that we know how to scrape pages and extract data, let's look at how to store it for easier analysis and querying. Here are some good storage options for scraped Glassdoor data:

JSON Files

For simple datasets, we can save results to JSON files:

import json

jobs = []  # replace with the list of scraped job dicts

with open('jobs.json', 'w') as f:
  json.dump(jobs, f)

This writes our jobs list to jobs.json on disk. Easy to parse locally but doesn't scale well.
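A single JSON file has to be rewritten in full on every save. One common workaround for longer crawls is the JSON Lines layout, where each record sits on its own line and new results are simply appended. The sketch below uses only the standard library; append_jsonl and read_jsonl are hypothetical helper names, and the temporary path is there just to keep the demo self-contained.

```python
import json
import os
import tempfile


def append_jsonl(path, records):
    """Append each record to the file as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load every record back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo: two separate batches end up in the same file
path = os.path.join(tempfile.mkdtemp(), "jobs.jsonl")
append_jsonl(path, [{"title": "Engineer", "location": "Redmond, WA"}])
append_jsonl(path, [{"title": "Designer", "location": "Remote"}])
loaded = read_jsonl(path)
print(len(loaded))
```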

MySQL

For relational data, MySQL is a solid option:

import mysql.connector  # pip install mysql-connector-python

db = mysql.connector.connect(
  host="localhost",
  user="root",
  password="password123",
  database="glassdoor" 
)

cursor = db.cursor()

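for job in jobs:
  sql = "INSERT INTO jobs (title, company, location) VALUES (%s, %s, %s)"
  # .get() stores NULL when a field was not scraped (e.g. 'company')
  cursor.execute(sql, (job.get('title'), job.get('company'), job.get('location')))

db.commit()
cursor.close()
db.close()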

This inserts each job into an existing jobs table (with title, company, and location columns) for structured querying.

MongoDB

For more flexibility, we can use a document store like MongoDB:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["glassdoor"]
jobs_col = db["jobs"]

jobs_col.insert_many(jobs)

MongoDB stores JSON-like documents without a fixed schema, which suits scraped records whose fields vary from page to page. There are many other options like PostgreSQL, Elasticsearch, and cloud databases to consider as well, depending on our needs.

Analyzing Scraped Data

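Now that we have Glassdoor data in a structured database, what can we do with it? Here are some examples of insights to extract. The snippets assume MongoDB collections named interviews_col, overviews_col, and salaries_col, created the same way as jobs_col above, and they need pandas and scipy installed.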

Top company interview questions

from collections import Counter

counts = Counter()

for interview in interviews_col.find({"company": "Facebook"}):
  counts.update(interview['questions'])
  
print(counts.most_common(5))

Prints the 5 most common Facebook interview questions.

Top rated technology companies

import pandas as pd

df = pd.DataFrame(list(overviews_col.find())) 

tech_df = df[df['industry'] == 'Technology']
top_comps = tech_df.nlargest(10, 'overall_rating')

print(top_comps)

Prints the top 10 highest rated tech companies.

Salary difference between genders

from scipy.stats import mannwhitneyu

female_salaries = []
male_salaries = []

for salary in salaries_col.find():
  if salary['gender'] == 'Female':
    female_salaries.append(salary['amount'])
  elif salary['gender'] == 'Male':  
    male_salaries.append(salary['amount'])

# mannwhitneyu returns both the U statistic and the p-value
statistic, p_value = mannwhitneyu(female_salaries, male_salaries)

print(p_value)

This runs a Mann-Whitney U test on the two salary samples; a small p-value suggests that the two distributions differ. The possibilities are endless!

Avoiding Blocks

When scraping Glassdoor at scale, blocks and captchas are inevitable. Here are some tips to minimize disruptions:

Use proxy rotation – Rotate IP addresses with each request to appear as distinct users. Bright Data, Smartproxy, Proxy-Seller, and Soax are among the providers that offer this feature.
Limit request rates – Add delays and throttle requests to reasonable limits.
Randomize user-agents – Use a variety of browser user-agents.
Retry blocked requests – Retry fetching a page that got blocked after a delay.
Solve captchas – Parse out captcha images/audio and pass them to a solving service.
Use residential proxies – Avoid blocks by using residential IPs with ISP-level diversity.

Proper precautions can minimize scraping interruptions and keep data flowing 24/7.
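The retry tip can be sketched with exponential backoff, where each wait is twice as long as the previous one. This is a minimal sketch: BlockedError and fetch_with_retry are hypothetical names, and it is up to the fetch function we pass in to decide what counts as a block (for example, an HTTP 403 or 429 status).

```python
import time


class BlockedError(Exception):
    """Raised by a fetch function when the response looks like a block page."""


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on BlockedError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except BlockedError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Wait 5s, 10s, 20s, ... before trying again
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter exists so the wait can be swapped out in tests; in normal use the default time.sleep applies. Combining this with a fresh proxy on each attempt makes a repeat block less likely.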

Scraping Ethically

Scraping publicly available data is often legal, but that depends on jurisdiction and on how the data is used. Either way, it's important that we scrape ethically:

Limit scrape volume to reasonable levels
Avoid aggressively scraping at speeds that may cause infrastructure issues
Attribute Glassdoor when publishing data in reports
Don't scrape private user info like names and contact details
Respect Glassdoor's Terms of Service
Implement polite crawling with delays to reduce load
Communicate with Glassdoor to resolve any concerns

By respecting site policies and scraping responsibly, we can avoid problems down the road.
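One concrete way to respect site policies in code is to check a site's robots.txt before fetching a URL, using Python's built-in urllib.robotparser. The rules below are a made-up sample for illustration, not Glassdoor's actual file, and build_robot_parser is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; fetch the real
# file from the site before relying on any rule.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_robot_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = build_robot_parser(SAMPLE_ROBOTS_TXT)
print(robots.can_fetch("my-scraper", "https://www.example.com/private/page.htm"))
print(robots.can_fetch("my-scraper", "https://www.example.com/public/page.htm"))
print(robots.crawl_delay("my-scraper"))
```

For a live site, parser.set_url() followed by parser.read() downloads the file instead of parsing a string, and the crawl delay, when present, is a sensible lower bound for the pause between requests.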

Conclusion

In this guide, we covered a variety of techniques for robust Glassdoor scraping using Python. The methods here can serve as templates to extract large amounts of data from Glassdoor with proper care and responsibility. With some customization for your specific needs, you can build powerful Glassdoor scrapers to fuel research, recruiting, data science, and more.