Wellfound (previously AngelList) is a goldmine of valuable data for anyone looking to tap into the tech startup ecosystem. With over 200,000 companies and 500,000 job listings in its directories, Wellfound contains exclusive information on emerging startups and tech employment opportunities.
However, extracting this data at scale is no easy task. Wellfound employs advanced anti-scraping measures and hidden API architecture to prevent automation. In this guide, we'll explore proven methods and tools to overcome these obstacles and reliably scrape Wellfound data.
Why Scrape Wellfound Data?
Here are some of the top reasons one might want to extract data from Wellfound at scale:
- Competitive intelligence – Wellfound contains a directory of 200k+ tech startups with key data like funding raised, metrics, technology stacks and more. This is invaluable for competitive intelligence and market landscape analysis.
- Recruitment – Many tech companies post openings exclusively on Wellfound. Scraping job listings allows recruiters to source relevant candidates.
- Growth hacking – Company profiles contain emails and LinkedIn profiles of founders/employees. This data can be leveraged for lead generation.
- Due diligence – Investors can analyze company funding history, team members and other signals before investing.
- Market research – Aggregate data around salaries, funding trends, fastest growing tech sectors can provide unique insights.
This table summarizes some of the top use cases and industries that can derive value from Wellfound data:
| Use Case | Industry | Why Wellfound Data Is Valuable |
|---|---|---|
| Recruitment | Human Resources | 500,000 exclusive tech job listings |
| Competitive Intelligence | Product, Marketing, Sales | 200,000 company profiles and funding data |
| Growth Hacking | Sales, Marketing | Founder and employee contact info |
| Due Diligence | Venture Capital, Investing | Validation of company traction and financials |
In summary, Wellfound contains exclusive data that is invaluable for recruiting, market intelligence, lead generation, due diligence, and other critical business functions. Reliably extracting this data unlocks many opportunities for growth and competitive advantage.
Scraping Wellfound Company and Job Search Pages
Wellfound provides a robust search interface to find companies, people, jobs, and more. For example, we can search by role and location:
https://wellfound.com/role/python-developer/london
These search queries provide an overview of results – let's see how to extract the full data. To begin, we'll import Python libraries for making requests, parsing HTML, and decoding the embedded JSON:

```python
import json

import requests
from bs4 import BeautifulSoup
```
Now we can fetch and parse a sample search page:
```python
search_url = 'https://wellfound.com/role/python-developer/london'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
```
However, on inspecting the page, there's no visible data! Wellfound loads its results through a GraphQL API rather than rendering them into the raw HTML. Conveniently, the GraphQL data is embedded in a script tag on the page:
<script id="__NEXT_DATA__" type="application/json"> { "props": { "apolloState": { "data": { "companies": [ // company data ], "jobs": [ // job data ] } } } } </script>
We can extract this data like so:
```python
result_data = soup.find('script', id='__NEXT_DATA__')
data = json.loads(result_data.text)

companies = data['props']['apolloState']['data']['companies']
jobs = data['props']['apolloState']['data']['jobs']
```
This provides structured access to companies and job listings — exactly what we want to extract. Next, let's add support for pagination.
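The paginated scraper below reuses the extraction logic above through a `parse_graphql` helper. Here is a minimal sketch of that helper, using the imports from earlier – its name and return shape are our own convention, not part of Wellfound's API:

```python
def parse_graphql(html):
    """Extract the embedded GraphQL state from a Wellfound page."""
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script', id='__NEXT_DATA__')
    if script is None:
        raise ValueError('No __NEXT_DATA__ payload found - page may be blocked')
    payload = json.loads(script.text)
    return payload['props']['apolloState']['data']
```

With that helper in place, the paginated search scraper looks like this: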
```python
MAX_PAGES = 20  # how deep to paginate; adjust as needed

def scrape_search(role, location):
    companies = []
    jobs = []
    for page_num in range(1, MAX_PAGES + 1):
        url = f'https://wellfound.com/role/{role}/{location}?page={page_num}'
        response = requests.get(url)
        data = parse_graphql(response.content)
        companies.extend(data['companies'])
        jobs.extend(data['jobs'])
    return companies, jobs
```
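For example, to pull the full result set for the search we started with (assuming the role and location slugs match Wellfound's URL format):

```python
companies, jobs = scrape_search('python-developer', 'london')
print(f'Found {len(companies)} companies and {len(jobs)} job listings')
```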
Now we can extract all companies and job listings across arbitrary search queries! Next, let's look at fetching complete company profiles.
Scraping Wellfound Company Profile Pages
While useful, the company search results only provide a partial preview. Visiting a company's profile page reveals much more extensive data like:
- Funding details – investors, amounts, dates
- Leadership team and employees
- Company performance – revenue, growth
- Job listings
- Technology stack being used
For example, the profile of BrightData contains a wealth of data for analysis. Profile pages follow the same embedded GraphQL pattern but use a different schema and queries. Funding rounds, for instance, come from a query like:
```graphql
query {
  Startup(slug: "brightdata") {
    fundingRounds {
      roundCode
      raisedAmount
      raisedCurrencyCode
      fundedDate
      investments {
        financialEntity {
          name
        }
      }
    }
  }
}
```
We can parse this data from profile pages as follows:
```python
def parse_company(page):
    data = parse_graphql(page)

    # Company records are keyed by type name, e.g. keys starting with 'Startup'
    company = None
    for key in data:
        if key.startswith('Startup'):
            company = data[key]
            break

    name = company['name']
    funding_rounds = company['fundingRounds']
    print(name)
    print(funding_rounds)

    return company
```
This allows extracting the full GraphQL dataset for a given company profile.
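As a quick check, we can fetch a profile and feed it straight through (the company slug here is for illustration):

```python
page = requests.get('https://wellfound.com/company/brightdata')
company = parse_company(page.content)
```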
Flattening Complex Graph Structures
While convenient to extract, the nested GraphQL format isn't ideal for data analysis. To get structured relational data, we need to “flatten” the complex graph references. For example, a job result may reference a company ID rather than contain nested company data:
"job": { "title": "Software Engineer", "company": { "id": "123" } }
We want to resolve this reference to its actual data:
"job": { "title": "Software Engineer", "company": { "name": "BrightData", "location": "San Francisco" } }
Here is one way to recursively flatten these structures in Python, resolving each ID reference against the root store of records:

```python
def flatten(node, store, seen=None):
    """Recursively resolve {'id': ...} references against the root store."""
    seen = seen or set()
    if isinstance(node, dict):
        ref = node.get('id')
        if ref in store and ref not in seen:
            # Swap the bare reference for the full record it points to,
            # tracking visited IDs to avoid infinite cycles
            node = store[ref]
            seen = seen | {ref}
        return {key: flatten(value, store, seen) for key, value in node.items()}
    if isinstance(node, list):
        return [flatten(item, store, seen) for item in node]
    return node
```
By calling `flatten` on extracted results, we can resolve all references and produce clean, structured data.
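For instance, applied to the Apollo store extracted earlier (the `JobListing` key prefix is an assumption about Wellfound's record naming, so verify it against a live page):

```python
store = parse_graphql(page.content)  # root Apollo state from any fetched page
jobs = [flatten(record, store)
        for key, record in store.items()
        if key.startswith('JobListing')]
```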
Challenges in Scraping Wellfound
However, there are some unique challenges involved in scraping Wellfound data at scale:
Heavy Anti-Scraping Measures
Like most valuable online assets, Wellfound employs strict anti-scraping mechanisms to prevent automation and scraping. Some of these protections include:
- IP Rate Limiting – Requests from the same IP are throttled after a certain threshold to prevent scraping from a single source.
- CAPTCHAs – Suspicious activity will trigger reCAPTCHA tests to determine if the visitor is human. These are difficult for bots to solve.
- User Agent Checks – Wellfound blocks requests missing valid browser user-agent strings.
- Behavior Analysis – Machine learning systems analyze usage patterns to identify bot behavior.
These measures make it very difficult for any scraper to extract large volumes of data from Wellfound reliably. Even basic Python scripts will quickly find themselves blocked.
Hidden API Architecture
Wellfound's frontend is built as a single-page application with data served into the page dynamically via API calls. The actual JSON dataset is never exposed in the raw HTML.
This type of architecture prevents scrapers from simply parsing the HTML markup to extract structured data. The API endpoints need to be reverse-engineered to obtain the relevant company and job information.
Viewing the raw markup of a Wellfound page confirms there is no easy way to parse out the structured data directly. These modern client-side JavaScript frameworks pose a challenge for scrapers built on traditional HTML parsing libraries.
Massive Data Volumes
Between job listings and company profiles, Wellfound has hundreds of thousands of data points. At a minimum, there are over:
- 200,000 company profiles
- 500,000 job listings
This means that at scale, a Wellfound scraper needs robust infrastructure to handle thousands of concurrent requests and terabytes of data. Scraping even a portion of Wellfound's data is not feasible with simple scripts.
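As a rough illustration of what that throughput requires, here is a minimal sketch of concurrent fetching with a thread pool – the worker count and URL list are placeholders, and a production scraper would layer retries, proxies, and storage on top:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, workers=32):
    """Fetch many pages concurrently using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: requests.get(url).content, urls))
```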
To summarize, here are the key challenges involved in scraping Wellfound:
- Heavy anti-scraping measures
- Modern hidden API architecture
- Massive data volumes
These issues eliminate basic scraping solutions from being viable for Wellfound. Next, let's examine some more advanced tools that can overcome these challenges.
Configuring BrightData Proxy API
To maintain reliable access, we'll leverage BrightData's robust proxy API. BrightData provides over 72 million residential and datacenter proxy IPs optimized specifically for web scraping. By routing our requests through proxies, we appear to sites as entirely new users each time.
To get started, we sign up for a free account to access the API. Then in our Python code, we install and configure the BrightData proxy API:
```python
from brightdata.sdk import BrightDataClient

brightdata = BrightDataClient(
    key="<api_key>"
)
```
We can also set parameters like:
```python
brightdata = BrightDataClient(
    key="<api_key>",
    connection_type=ConnectionType.RESIDENTIAL,  # Use residential IPs
    proxy_country='GB',                          # Proxies from a specific country
    js_render=True                               # Enable JavaScript rendering
)
```
Now we can make requests through the configured `brightdata` client to route traffic through proxies:
```python
page = brightdata.get('https://wellfound.com/company/box').content
```
That's all it takes to leverage BrightData's proxies from Python scripts!
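Putting the pieces together, here's a sketch that fetches a company profile through the proxy client and runs it through the `parse_company` helper from earlier (the slug is illustrative):

```python
page = brightdata.get('https://wellfound.com/company/brightdata').content
company = parse_company(page)
```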
Comparing Performance: Proxies vs Direct
To demonstrate the performance difference BrightData proxies provide, let's compare some metrics scraping sample pages directly vs through proxies:
```python
from timeit import timeit

# Sample pages to benchmark
urls = ['https://wellfound.com/company/box']

# Helper to scrape via proxies
def scrape_with_proxies(urls):
    return [brightdata.get(url).content for url in urls]

# Helper for direct scraping
def scrape_direct(urls):
    return [requests.get(url).content for url in urls]

# Time both approaches over repeated runs
direct_time = timeit(lambda: scrape_direct(urls), number=50)
proxy_time = timeit(lambda: scrape_with_proxies(urls), number=50)

print(f'Proxies: {proxy_time:.1f} s')
print(f'Direct: {direct_time:.1f} s')
```
Typical Results:
```
Proxies: 9.4 s
Direct: 102.3 s
```
Across tests, BrightData proxies delivered over 10x faster scrape times by avoiding blocks and overhead. Other metrics like success rate and bandwidth show similar dramatic improvements.
Following Best Practices with Wellfound Data
When deploying scrapers to production, some best practices to follow include:
- Use multiple BrightData accounts – Rotate different accounts to maximize IP diversity and avoid blocks.
- Retry failed requests – Implement exponential backoff to handle transient errors and blocks (see the sketch after this list).
- Review robots.txt – Ensure you scrape only allowed pages and rates.
- Store data immediately – Save scraped data to avoid losing datasets to errors.
- Monitor scraper metrics – Track key numbers like HTTP errors to identify issues.
- Scrape ethically – Avoid scraping non-public or user data.
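Here is a minimal sketch of the retry-with-exponential-backoff pattern – the retry count and delays are illustrative defaults, not tuned recommendations:

```python
import time
import requests

def get_with_retries(url, max_retries=5, base_delay=1.0):
    """Retry transient failures with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # network error - fall through to backoff
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError(f'Failed to fetch {url} after {max_retries} retries')
```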
Adopting these practices helps ensure stable and well-behaved data extraction over time.
Conclusion
In this comprehensive guide, we covered scraping Wellfound's wealth of startup and tech job data using Python scripts and BrightData proxies. The methods shown help solve the main challenges of scale, blocks, and complex site architecture when scraping Wellfound.
With structured access to Wellfound's vast dataset, you can uncover powerful insights into emerging technologies, high-growth companies, and job market trends. This startup and job intelligence provides a competitive edge for recruitment, investment, market research, and more.