Glassdoor is one of the largest websites used by job seekers, employees, recruiters and researchers to find jobs, company information, salaries, reviews, interviews and workplace insights. With over 59 million reviews and data on over 1 million companies in 190 countries, Glassdoor contains a vast and comprehensive dataset on the global job market.
In this detailed guide, we'll dive into techniques to scrape and extract data from Glassdoor using Python and proxies.
Scraping Environment Setup
Let's look at the core tools and libraries we'll utilize for scraping Glassdoor:
- Python 3 – Our language of choice for scraping due to its rich ecosystem of libraries and tools.
- Requests – A very popular Python module for fetching web pages via HTTP requests.BeautifulSoup – A battle-tested Python library for parsing and extracting data from HTML and XML
- documents.
- Scrapy – A powerful web crawling framework for building robust, high-performance scrapers.
- Proxy Service – A paid proxy API for residential IPs to mask scrapers and avoid blocks.
- MongoDB – For storing and querying our scraped Glassdoor data.
For browser automation scenarios, we may also leverage tools like Selenium, Playwright, or Puppeteer depending on our specific needs. But for most data scraping, Requests + BeautifulSoup provides a simple and fast solution.
Let's start by installing the key packages:
pip install requests beautifulsoup4 scrapy pymongo
We'll also need accounts with a proxy service (like BrightData, GeoSurf, etc) and a MongoDB database to store scraped results.
Obtaining Company IDs
To scrape company specific pages on Glassdoor, we first need to obtain their unique company identifiers. These alphanumeric IDs are not exposed directly on the Glassdoor site, but can be retrieved by calling the autocomplete/search API:
import requests import json search_url = "https://www.glassdoor.com/api/api.htm?t=TYPEAHEAD&action=employers&q=microsoft" response = requests.get(search_url) data = json.loads(response.text) for result in data: print(result['id'], result['name'])
This searches for “microsoft” and prints out:
16752 Microsoft
We can then construct company specific URLs using this ID, like:
https://www.glassdoor.com/Overview/Working-at-Microsoft-EI_IE16752.11,19.htm
To get the IDs for our target companies, we can extract them directly from Glassdoor's sitemap:
import requests from bs4 import BeautifulSoup import re response = requests.get("https://www.glassdoor.com/Sitemap/Company-Sitemap.xml") soup = BeautifulSoup(response.text, 'lxml') for url in soup.find_all('url'): match = re.search(r"Working-at-(.+)-EI_IE(\d+)", url.find('loc').text) if match: company_name = match.group(1) company_id = match.group(2) print(company_name, company_id)
This fetches the sitemap XML, then extracts out the company name and ID from each url, getting us IDs for all companies on Glassdoor. We can also customize the search API call to retrieve IDs for specific companies we want to target. With IDs in hand, we can now start scraping company specific pages.
Bypassing Anti-Scraping Mechanisms
Like most major websites, Glassdoor employs a series of anti-scraping mechanisms to detect and block bots and scrapers. These include:
- IP rate limiting – Banning IPs that send too many requests in a period of time.
- CAPTCHAs – Challenging suspect IPs to solve images/text captchas.
- Blocking user-agents – Banning common scraper user-agents like
Python-urllib
. - Cookies/JavaScript – Requiring cookies and JS rendering to access some pages.
Here are some techniques we can use to bypass these protections and scrape effectively:
- Use proxies: By routing requests through residential proxy IPs, we can appear as many different users, avoiding IP blocks. Proxy services like Bright Data offer millions of IPs to cycle through.
- Rotate user-agents: We can randomly select a user-agent from a list of real browser UAs on each request, preventing blocks by scraper user-agents.
- Add delays: Introducing 3-5 second delays between requests and limiting request rates helps avoid triggering abusive scraping detections.
- JS rendering: We can integrate headless browsers like Puppeteer for sites requiring JS to evaluate page scripts.
- Solve CAPTCHAs: If encountering occasional CAPTCHAs, we can programmatically parse out the image/audio challenge and solve it using a CAPTCHA solving service to continue scraping.
- Mimic human behavior: Clicking elements, scrolling pages, and moving the mouse in random patterns helps evade bot protections. These tricks allow us to scrape Glassdoor reliably at scale without getting blocked. Next let's see how to extract data from company pages.
Scraping Company Overview Pages
Every company on Glassdoor has a dedicated overview page with useful metadata like:
- Company description
- Headquarters location
- Industry and sector
- Company size
- Revenue
- Founded date
- CEO approval rating
To extract this information, we'll make a request to fetch the page HTML, then parse out the data we want:
import requests from bs4 import BeautifulSoup company_id = "16752" url = f"https://www.glassdoor.com/Overview/Working-at-Microsoft-EI_IE{company_id}.11,19.htm" response = requests.get(url, proxies=get_proxy()) soup = BeautifulSoup(response.text, 'html.parser') desc = soup.find("div", {"data-test": "employerDescription"}).text ceo_rating = soup.find("span", {"data-test": "ceo-rating"}).get('title') size = soup.find("div", {"data-test": "employer-size"}).text revenue = soup.find("div", {"data-test": "employer-revenue"}).text print(desc) print(ceo_rating) print(size) print(revenue)
Here we:
- Fetch the page HTML
- Parse out the key fields we want
- Print the extracted data
We can extract dozens more useful data points like headquarters, type, industry, website, and more by selecting additional CSS elements from the page. Now let's look at scaling this up to extract overview data across many companies using Scrapy.
Scraping Company Overviews at Scale with Scrapy
Scrapy is a popular Python scraping framework optimized for crawling many pages quickly and efficiently. To leverage Scrapy, we can define a Spider subclass to scrape and parse overview pages:
import scrapy from scrapy.crawler import CrawlerProcess class OverviewSpider(scrapy.Spider): name = 'overview' custom_settings = { 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36' } def start_requests(self): urls = [ f"https://www.glassdoor.com/Overview/Working-at-Microsoft-EI_IE16752.11,19.htm", f"https://www.glassdoor.com/Overview/Working-at-Facebook-E40772.11,18.htm" ] for url in urls: yield scrapy.Request(url=url, callback=self.parse) def parse(self, response): desc = response.css("div[data-test='employerDescription'] ::text").get() ceo_rating = response.css("span[data-test='ceo-rating']::attr(title)").get() yield { 'description': desc, 'ceo_rating': ceo_rating } process = CrawlerProcess() process.crawl(OverviewSpider) process.start()
This defines a simple Spider to scrape and parse 2 example companies. To scale it up, we can pass many more URLs or generate them dynamically from company IDs. We can also add more CSS selectors to parse additional fields, export results to MongoDB, add proxies, etc. Scrapy provides a very fast and convenient way to scrape data across thousands of overview pages.
Extracting Job Listings
Glassdoor hosts millions of job listings aggregated from company sites and job boards all over the web. These job postings provide useful data like:
- Job title
- Company name
- Location
- Job description
- Salary estimate
- Skills required
To extract all job listings for a given company, we'll need to paginate through the multiple pages of results. Here's an approach:
import math import json import requests from bs4 import BeautifulSoup company_id = "16752" # Microsoft def extract_listings(page): url = f"https://www.glassdoor.com/Job/microsoft-jobs-SRCH_KE0,9_IP{page}.htm" response = requests.get(url, proxies=get_proxy()) soup = BeautifulSoup(response.text, 'html.parser') total = int(soup.select_one(".paginationFooter").getText().split()[-1]) max_pages = math.ceil(total / 20) # 20 jobs per page jobs = [] for el in soup.select(".jl"): title = el.select_one(".jobLink").getText() location = el.select_one(".subtleloc").getText() jobs.append({ 'title': title, 'location': location }) return jobs, max_pages results = [] for page in range(1, max_pages + 1): jobs, max_pages = extract_listings(page) results.extend(jobs) print(json.dumps(results, indent=2))
This paginates through each listing page, extracting job titles, location, and other attributes. To further enrich the data, we can parse additional details from each job page like description, department, salary range, and skills.
Parsing Company Reviews
Reviews from current and past employees can provide tremendous insight into company culture, sentiment, management styles, and more. Glassdoor contains over 59 million reviews with breakdowns by department, job role, pros/cons, ratings, and other metadata.
Let's look at how we can extract all reviews for a company. The approach is similar to jobs – paginate over review pages and parse each one:
import math import json import requests from bs4 import BeautifulSoup company_id = "16752" def extract_reviews(page): url = f"https://www.glassdoor.com/Reviews/Microsoft-Reviews-E16752_P{page}.htm" response = requests.get(url, proxies=get_proxy()) soup = BeautifulSoup(response.text, 'html.parser') total = int(soup.select_one(".pagination").getText().split()[-1]) max_pages = math.ceil(total / 20) reviews = [] for div in soup.select(".review"): title = div.select_one(".summary").getText() rating = div.find("span", {"class": "rating"}).get("title") reviews.append({ 'title': title, 'rating': rating }) return reviews, max_pages results = [] for page in range(1, max_pages + 1): reviews, max_pages = extract_reviews(page) results.extend(reviews) print(json.dumps(results, indent=2))
This extracts the review title and rating for each one. We can also parse pros, cons, advisor ratings, and other fields from each review. Sentiment analysis of these reviews can identify companies with exceptionally happy or dissatisfied employees.
Gathering Salary Data
In addition to job listings, Glassdoor allows users to self-report salary details anonymously. These crowdsourced salary reports can provide pay range insights by:
- Job title
- Location
- Years of experience
- Company
- Size
- Revenue
- Gender
Let's look at an approach to extract reported salaries for a company:
import json import requests from bs4 import BeautifulSoup company_id = "16752" url = f"https://www.glassdoor.com/Salary/Microsoft-Salaries-E16752.htm" response = requests.get(url, proxies=get_proxy()) soup = BeautifulSoup(response.text, 'html.parser') salaries = [] for row in soup.select(".compensationRow"): role = row.select_one(".jobTitle").getText() salary = row.select_one(".gray").getText() salaries.append({ 'role': role.strip(), 'salary': salary }) print(json.dumps(salaries, indent=2))
This scrapes each salary report row, extracting the job title and reported salary into our salaries list. We can further enrich this data by also scraping the location, date submitted, years of experience, and other attributes from each report. Glassdoor contains hundreds of thousands of salary reports across companies and provides localized pay info globally.
Scraping Interview Details
Interview reviews and questions posed to candidates can help job seekers prepare and know what to expect during the process. Glassdoor has a section dedicated to user-submitted interview details such as:
- Interview questions asked
- Process difficulty rating
- Interview experience reviews
- Tips from candidates
- Offer and rejection stats
Let's look at how we can scrape and extract these interview insights. Similarly to jobs and reviews, we'll need to paginate over multiple pages and parse each one:
import math import json import requests from bs4 import BeautifulSoup company_id = "16752" url = f"https://www.glassdoor.com/Interview/Microsoft-Interview-Questions-E16752.htm" response = requests.get(url, proxies=get_proxy()) soup = BeautifulSoup(response.text, 'html.parser') total = int(soup.select_one(".pagination").getText().split()[-1]) max_pages = math.ceil(total / 20) questions = [] for page in range(1, max_pages + 1): response = requests.get(f"{url}_P{page}.htm", proxies=get_proxy()) soup = BeautifulSoup(response.text, 'html.parser') for li in soup.select(".interviewQuestion"): question = li.select_one(".questionText").getText() questions.append(question) print(json.dumps(questions, indent=2))
This extracts just the interview question text, but we can also grab difficulty ratings, candidate tips, and other details. Analyzing these questions can help surface the most common ones to expect for a given company or role.
Storing Scraped Data
Now that we know how to scrape pages and extract data, let's look at how to store it for easier analysis and querying. Here are some good storage options for scraped Glassdoor data:
JSON Files
For simple datasets, we can save results to JSON files:
import json jobs = # scraped jobs with open('jobs.json', 'w') as f: json.dump(jobs, f)
This writes our jobs list to jobs.json
on disk. Easy to parse locally but doesn't scale well.
MySQL
For relational data, MySQL is a solid option:
import mysql.connector db = mysql.connector.connect( host="localhost", user="root", password="password123", database="glassdoor" ) cursor = db.cursor() for job in jobs: sql = "INSERT INTO jobs (title, company, location) VALUES (%s, %s, %s)" cursor.execute(sql, (job['title'], job['company'], job['location'])) db.commit()
This inserts each job into a MySQL table for structured querying.
MongoDB
For more flexibility, we can use a document store like MongoDB:
from pymongo import MongoClient client = MongoClient("localhost", 27017) db = client["glassdoor"] jobs_col = db["jobs"] jobs_col.insert_many(jobs)
MongoDB allows storing unstructured JSON data easily with querying flexibility. Great for web scraping! There are many other options like PostgreSQL, Elasticsearch, and cloud databases to consider as well based on needs.
Analyzing Scraped Data
Now that we have Glassdoor data in a structured database, what can we do with it? Here are some examples of insights to extract:
Top company interview questions
from collections import Counter counts = Counter() for interview in interviews_col.find({"company": "Facebook"}): counts.update(interview['questions']) print(counts.most_common(5))
Prints the 5 most common Facebook interview questions.
Top rated technology companies
import pandas as pd df = pd.DataFrame(list(overviews_col.find())) tech_df = df[df['industry'] == 'Technology'] top_comps = tech_df.nlargest(10, 'overall_rating') print(top_comps)
Prints the top 10 highest rated tech companies.
Salary difference between genders
from scipy.stats import mannwhitneyu female_salaries = [] male_salaries = [] for salary in salaries_col.find(): if salary['gender'] == 'Female': female_salaries.append(salary['amount']) elif salary['gender'] == 'Male': male_salaries.append(salary['amount']) p_value = mannwhitneyu(female_salaries, male_salaries) print(p_value)
Statistical significance test for gender pay gaps. The possibilities are endless!
Avoiding Blocks
When scraping Glassdoor at scale, blocks and captchas are inevitable. Here are some tips to minimize disruptions:
- Use a proxy rotation – Rotate IP addresses with each request to appear as distinct users. Bright Data, Smartproxy, Proxy-Seller, and Soax are the best proxy providers with this feature.
- Limit request rates – Add delays and throttle requests to reasonable limits.
- Randomize user-agents – Use a variety of browser user-agents.
- Retry blocked requests – Retry fetching a page that got blocked after a delay.
- Solve captchas manually – Parse out captcha images/audio to be solved by a service.
- Use residential proxies – Avoid blocks by using residential IPs with ISP-level diversity.
Proper precautions can minimize scraping interruptions and keep data flowing 24/7.
Scraping Ethically
While most Glassdoor data scraping is perfectly legal, it's important we ethically:
- Limit scrape volume to reasonable levels
- Avoid aggressively scraping at speeds that may cause infrastructure issues
- Attribute Glassdoor when publishing data in reports
- Don't scrape private user info like names and contact details
- Respect Glassdoor's Terms of Service
- Implement polite crawling with delays to reduce load
- Communicate with Glassdoor to resolve any concerns
By respecting site policies and scraping responsibly, we can avoid problems down the road.
Conclusion
In this guide, we covered a variety of techniques for robust Glassdoor scraping using Python. The methods here can serve as templates to extract large amounts of data from Glassdoor with proper care and responsibility. With some customization for your specific needs, you can build powerful Glassdoor scrapers to fuel research, recruiting, data science, and more.