Wellfound (previously AngelList) is a goldmine of valuable data for anyone looking to tap into the tech startup ecosystem. With over 200,000 companies and 500,000 job listings in its directories, Wellfound contains exclusive information on emerging startups and tech employment opportunities.
However, extracting this data at scale is no easy task. Wellfound employs advanced anti-scraping measures and hidden API architecture to prevent automation. In this guide, we'll explore proven methods and tools to overcome these obstacles and reliably scrape Wellfound data.
Why Scrape Wellfound Data?
Here are some of the top reasons one might want to extract data from Wellfound at scale:
- Competitive intelligence – Wellfound contains a directory of 200k+ tech startups with key data like funding raised, metrics, technology stacks and more. This is invaluable for competitive intelligence and market landscape analysis.
- Recruitment – Many tech companies post openings exclusively on Wellfound. Scraping job listings allows recruiters to source relevant candidates.
- Growth hacking – Company profiles contain emails and LinkedIn profiles of founders/employees. This data can be leveraged for lead generation.
- Due diligence – Investors can analyze company funding history, team members and other signals before investing.
- Market research – Aggregate data around salaries, funding trends, fastest growing tech sectors can provide unique insights.
This table summarizes some of the top use cases and industries that can derive value from Wellfound data:
| Use Case | Industry | Why Wellfound Data Is Valuable |
|---|---|---|
| Recruitment | Human Resources | 500,000 exclusive tech job listings |
| Competitive Intelligence | Product, Marketing, Sales | 200,000 company profiles and funding data |
| Growth Hacking | Sales, Marketing | Founder and employee contact info |
| Due Diligence | Venture Capital, Investing | Validation of company traction and financials |
In summary, Wellfound contains exclusive data that is invaluable for recruiting, market intelligence, lead generation, due diligence, and other critical business functions. Reliably extracting this data unlocks many opportunities for growth and competitive advantage.
Scraping Wellfound Company and Job Search Pages
Wellfound provides a robust search interface to find companies, people, jobs, and more. For example, we can search by role and location:
https://wellfound.com/role/python-developer/london
These search queries provide an overview of results – let's see how to extract the full data. To begin, we'll import Python libraries for making requests, parsing HTML, and decoding the embedded JSON:

```python
import json

import requests
from bs4 import BeautifulSoup
```
Now we can fetch and parse a sample search page:
```python
search_url = 'https://wellfound.com/role/python-developer/london'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
```
However, on inspecting the page, there's no visible data! Wellfound loads its results through a GraphQL API rather than rendering them into the raw HTML. Conveniently, the GraphQL data is embedded in a script tag on the page:
<script id="__NEXT_DATA__" type="application/json"> { "props": { "apolloState": { "data": { "companies": [ // company data ], "jobs": [ // job data ] } } } } </script>
We can extract this data like so:
```python
result_data = soup.find('script', id='__NEXT_DATA__')
data = json.loads(result_data.text)

companies = data['props']['apolloState']['data']['companies']
jobs = data['props']['apolloState']['data']['jobs']
```
This provides structured access to companies and job listings — exactly what we want to extract. Next, let's add support for pagination.
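The paginated scraper below reuses the extraction logic above through a `parse_graphql` helper. Here is a minimal sketch of that helper, using the imports from earlier – its name and return shape are our own convention, not part of Wellfound's API:

```python
def parse_graphql(html):
    """Extract the embedded GraphQL state from a Wellfound page."""
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script', id='__NEXT_DATA__')
    if script is None:
        raise ValueError('No __NEXT_DATA__ payload found - page may be blocked')
    payload = json.loads(script.text)
    return payload['props']['apolloState']['data']
```

With that helper in place, the paginated search scraper looks like this: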
```python
MAX_PAGES = 20  # how deep to paginate; adjust as needed

def scrape_search(role, location):
    companies = []
    jobs = []
    for page_num in range(1, MAX_PAGES + 1):
        url = f'https://wellfound.com/role/{role}/{location}?page={page_num}'
        response = requests.get(url)
        data = parse_graphql(response.content)
        companies.extend(data['companies'])
        jobs.extend(data['jobs'])
    return companies, jobs
```
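For example, to pull the full result set for the search we started with (assuming the role and location slugs match Wellfound's URL format):

```python
companies, jobs = scrape_search('python-developer', 'london')
print(f'Found {len(companies)} companies and {len(jobs)} job listings')
```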
Now we can extract all companies and job listings across arbitrary search queries! Next, let's look at fetching complete company profiles.
Scraping Wellfound Company Profile Pages
While useful, the company search results only provide a partial preview. Visiting a company's profile page reveals much more extensive data like:
- Funding details – investors, amounts, dates
- Leadership team and employees
- Company performance – revenue, growth
- Job listings
- Technology stack being used
For example, the profile of BrightData contains a wealth of data for analysis. Profile pages follow the same embedded GraphQL pattern but use a different schema and queries. Funding rounds, for instance, come from a query like:
```graphql
query {
  Startup(slug: "brightdata") {
    fundingRounds {
      roundCode
      raisedAmount
      raisedCurrencyCode
      fundedDate
      investments {
        financialEntity {
          name
        }
      }
    }
  }
}
```
We can parse this data from profile pages as follows:
```python
def parse_company(page):
    data = parse_graphql(page)

    # Company records are keyed by type name, e.g. keys starting with 'Startup'
    company = None
    for key in data:
        if key.startswith('Startup'):
            company = data[key]
            break

    name = company['name']
    funding_rounds = company['fundingRounds']
    print(name)
    print(funding_rounds)

    return company
```
This allows extracting the full GraphQL dataset for a given company profile.
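As a quick check, we can fetch a profile and feed it straight through (the company slug here is for illustration):

```python
page = requests.get('https://wellfound.com/company/brightdata')
company = parse_company(page.content)
```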
Flattening Complex Graph Structures
While convenient to extract, the nested GraphQL format isn't ideal for data analysis. To get structured relational data, we need to “flatten” the complex graph references. For example, a job result may reference a company ID rather than contain nested company data:
"job": { "title": "Software Engineer", "company": { "id": "123" } }
We want to resolve this reference to its actual data:
"job": { "title": "Software Engineer", "company": { "name": "BrightData", "location": "San Francisco" } }
Here is one way to recursively flatten these structures in Python, resolving each ID reference against the root store of records:

```python
def flatten(node, store, seen=None):
    """Recursively resolve {'id': ...} references against the root store."""
    seen = seen or set()
    if isinstance(node, dict):
        ref = node.get('id')
        if ref in store and ref not in seen:
            # Swap the bare reference for the full record it points to,
            # tracking visited IDs to avoid infinite cycles
            node = store[ref]
            seen = seen | {ref}
        return {key: flatten(value, store, seen) for key, value in node.items()}
    if isinstance(node, list):
        return [flatten(item, store, seen) for item in node]
    return node
```
By calling `flatten` on extracted results, we can resolve all references and produce clean, structured data.
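For instance, applied to the Apollo store extracted earlier (the `JobListing` key prefix is an assumption about Wellfound's record naming, so verify it against a live page):

```python
store = parse_graphql(page.content)  # root Apollo state from any fetched page
jobs = [flatten(record, store)
        for key, record in store.items()
        if key.startswith('JobListing')]
```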
Challenges in Scraping Wellfound
However, there are some unique challenges involved in scraping Wellfound data at scale:
Heavy Anti-Scraping Measures
Like most valuable online assets, Wellfound employs strict anti-scraping mechanisms to prevent automation and scraping. Some of these protections include:
- IP Rate Limiting – Requests from the same IP are throttled after a certain threshold to prevent scraping from a single source.
- CAPTCHAs – Suspicious activity will trigger reCAPTCHA tests to determine if the visitor is human. These are difficult for bots to solve.
- User Agent Checks – Wellfound blocks requests missing valid browser user-agent strings.
- Behavior Analysis – Machine learning systems analyze usage patterns to identify bot behavior.
These measures make it very difficult for any scraper to extract large volumes of data from Wellfound reliably. Even basic Python scripts will quickly find themselves blocked.
Hidden API Architecture
Wellfound's frontend is built as a single-page application with data served into the page dynamically via API calls. The actual JSON dataset is never exposed in the raw HTML.
This type of architecture prevents scrapers from simply parsing the HTML markup to extract structured data. The API endpoints need to be reverse-engineered to obtain the relevant company and job information.
Viewing the raw markup of a Wellfound page confirms there is no easy way to parse out the structured data directly. These modern client-side JavaScript frameworks pose a challenge for scrapers built on traditional HTML parsing libraries.
Massive Data Volumes
Between job listings and company profiles, Wellfound has hundreds of thousands of data points. At a minimum, there are over:
- 200,000 company profiles
- 500,000 job listings
This means that at scale, a Wellfound scraper needs robust infrastructure to handle thousands of concurrent requests and terabytes of data. Scraping even a portion of Wellfound's data is not feasible with simple scripts.
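As a rough illustration of what that throughput requires, here is a minimal sketch of concurrent fetching with a thread pool – the worker count and URL list are placeholders, and a production scraper would layer retries, proxies, and storage on top:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, workers=32):
    """Fetch many pages concurrently using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: requests.get(url).content, urls))
```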
To summarize, here are the key challenges involved in scraping Wellfound:
- Heavy anti-scraping measures
- Modern hidden API architecture
- Massive data volumes
These issues eliminate basic scraping solutions from being viable for Wellfound. Next, let's examine some more advanced tools that can overcome these challenges.
Configuring BrightData Proxy API
To maintain reliable access, we'll leverage BrightData's robust proxy API. BrightData provides over 72 million residential and datacenter proxy IPs optimized specifically for web scraping. By routing our requests through proxies, we appear to sites as entirely new users each time.
To get started, we sign up for a free account to access the API. Then in our Python code, we install and configure the BrightData proxy API:
```python
from brightdata.sdk import BrightDataClient

brightdata = BrightDataClient(
    key="<api_key>"
)
```
We can also set parameters like:
```python
brightdata = BrightDataClient(
    key="<api_key>",
    connection_type=ConnectionType.RESIDENTIAL,  # Use residential IPs
    proxy_country='GB',                          # Proxies from a specific country
    js_render=True                               # Enable JavaScript rendering
)
```
Now we can make requests through the configured `brightdata` client to route traffic through proxies:
```python
page = brightdata.get('https://wellfound.com/company/box').content
```
That's all it takes to leverage BrightData's proxies from Python scripts!
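Putting the pieces together, here's a sketch that fetches a company profile through the proxy client and runs it through the `parse_company` helper from earlier (the slug is illustrative):

```python
page = brightdata.get('https://wellfound.com/company/brightdata').content
company = parse_company(page)
```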
Comparing Performance: Proxies vs Direct
To demonstrate the performance difference BrightData proxies provide, let's compare some metrics scraping sample pages directly vs through proxies:
```python
from timeit import timeit

# Sample pages to benchmark
urls = ['https://wellfound.com/company/box']

# Helper to scrape via proxies
def scrape_with_proxies(urls):
    return [brightdata.get(url).content for url in urls]

# Helper for direct scraping
def scrape_direct(urls):
    return [requests.get(url).content for url in urls]

# Time both approaches over repeated runs
direct_time = timeit(lambda: scrape_direct(urls), number=50)
proxy_time = timeit(lambda: scrape_with_proxies(urls), number=50)

print(f'Proxies: {proxy_time:.1f} s')
print(f'Direct: {direct_time:.1f} s')
```
Typical Results:
```
Proxies: 9.4 s
Direct: 102.3 s
```
Across tests, BrightData proxies delivered over 10x faster scrape times by avoiding blocks and overhead. Other metrics like success rate and bandwidth show similar dramatic improvements.
Following Best Practices with Wellfound Data
When deploying scrapers to production, some best practices to follow include:
- Use multiple BrightData accounts – Rotate different accounts to maximize IP diversity and avoid blocks.
- Retry failed requests – Implement exponential backoff to handle transient errors and blocks (see the sketch after this list).
- Review robots.txt – Ensure you scrape only allowed pages and rates.
- Store data immediately – Save scraped data to avoid losing datasets to errors.
- Monitor scraper metrics – Track key numbers like HTTP errors to identify issues.
- Scrape ethically – Avoid scraping non-public or user data.
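Here is a minimal sketch of the retry-with-exponential-backoff pattern – the retry count and delays are illustrative defaults, not tuned recommendations:

```python
import time
import requests

def get_with_retries(url, max_retries=5, base_delay=1.0):
    """Retry transient failures with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # network error - fall through to backoff
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError(f'Failed to fetch {url} after {max_retries} retries')
```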
Adopting these practices helps ensure stable and well-behaved data extraction over time.
Conclusion
In this comprehensive guide, we covered scraping Wellfound's wealth of startup and tech job data using Python scripts and BrightData proxies. The methods shown help solve the main challenges of scale, blocks, and complex site architecture when scraping Wellfound.
With structured access to Wellfound's vast dataset, you can uncover powerful insights into emerging technologies, high-growth companies, and job market trends. This startup and job intelligence provides a competitive edge for recruitment, investment, market research, and more.