Crunchbase is a massive database containing detailed information on companies, investors, funding rounds, and more. With over 700k company profiles, 300k people profiles, and 200k funding round details, Crunchbase is a goldmine for market research, lead generation, and competitive intelligence.
In this comprehensive guide, we'll cover multiple methods and best practices for scraping Crunchbase data at scale.
Why Scrape Crunchbase Data?
Here are some of the top reasons you may want to extract data from Crunchbase:
- Company profiling – Crunchbase has expansive datasets on private and public companies, including funding, leadership, technology stack, descriptions, locations, and more. Ideal for lead generation, recruiting and market research.
- Funding/M&A research – Details on startup funding rounds, investors, acquisitions, partnerships, valuations, and exit events. Useful for investment research and analysis.
- People/recruitment data – Contact details, backgrounds and career histories of executives, founders, developers and more. Great for recruiting and sales prospecting.
- Industry/market analytics – Aggregate data on companies, people, funding, acquisitions, and news by industry, category, location etc. For market size analysis, investment strategy, and more.
- Alternative data – Novel data points not found in company filings or other sources, like the technology used, team size estimates, and more. For alpha generation in investing.
In summary, Crunchbase contains expansive, unique datasets useful for various business intelligence, analytics, and market research purposes.
Is it Legal to Scrape Crunchbase?
Broadly speaking, scraping public information from Crunchbase in a non-disruptive manner is legal in most jurisdictions.
Crunchbase's Terms of Service do not expressly prohibit scraping or text/data mining activities. As long as you scrape respectfully and do not use Crunchbase data for spam or illegal purposes, web scraping can be conducted legally.
That said, always consult the laws in your local jurisdiction before web scraping any website. When handling personal data, pay special attention to privacy regulations like GDPR. And remember to throttle scrape requests to minimize load on Crunchbase's servers.
Scraping Crunchbase with Python
For this guide, we'll use Python for web scraping Crunchbase, as it's one of the most popular languages for scraping and data analysis. We'll also leverage a few key Python libraries:
- Requests – for making HTTP requests to fetch pages
- Beautiful Soup – for extracting data from HTML and XML
- Scrapy – for building more advanced web crawlers (optional)
In addition, we'll use proxies to avoid IP blocks and bypass Crunchbase bot detection. Proxies also help distribute the load over multiple IPs. Some good commercial proxy providers include BrightData, Proxy-Seller, Soax, and Smartproxy. Residential proxies often work best for mimicking real user traffic.
Let's dive in and cover various methods for effectively scraping Crunchbase pages and data!
Scraping Crunchbase Company Profiles
Crunchbase company profiles (for example: Tesla Motors profile) contain a wealth of data, including:
- Descriptions, summaries
- Locations, addresses
- Leadership teams
- Funding details
- Technology used
- News, events
- Financials, revenue, valuation estimates
To extract company data from Crunchbase, we'll utilize the following workflow:
- Discover the list of company profile URLs to scrape
- Fetch page HTML for each company profile
- Parse page HTML to extract key data points
- Store scraped company data
Let's go through each step…
Step 1: Discover Company Profile URLs
We first need a list of company profile URLs to feed into our scraper. There are a couple of approaches for this:
a) Scrape sitemaps
Crunchbase provides sitemap XML files containing all URLs indexed on their site:
- Main sitemap index: www.crunchbase.com/www-sitemaps/sitemap-index.xml
- Company sitemap: www.crunchbase.com/www-sitemaps/sitemap-organizations-XX.xml.gz
We can parse these sitemaps to extract all company profile URLs:
```python
import gzip

import requests
from bs4 import BeautifulSoup

sitemap_index = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

# Fetch sitemap index
index_xml = requests.get(sitemap_index)

# Parse sitemap index to get the list of organization sitemaps
soup = BeautifulSoup(index_xml.content, "xml")
org_sitemaps = [url.text for url in soup.find_all("loc") if "organizations" in url.text]

# Iterate through organization sitemaps and extract company URLs
company_urls = []
for sitemap in org_sitemaps:
    # Fetch sitemap (the .xml.gz files may need decompressing)
    sitemap_xml = requests.get(sitemap)
    content = sitemap_xml.content
    if sitemap.endswith(".gz"):
        try:
            content = gzip.decompress(content)
        except OSError:
            pass  # already decompressed by requests

    # Parse sitemap to extract company URLs
    soup = BeautifulSoup(content, "xml")
    urls = [url.text for url in soup.find_all("loc")]

    # Add company URLs to list
    company_urls.extend(urls)

print(len(company_urls))  # 735,624 company URLs!
```
This gives us 700k+ clean company profile URLs to feed into our scraper.
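Before feeding these into the scraper, it's worth deduplicating the list and persisting it to disk so the crawl can be resumed later. A minimal sketch (the company_urls.txt filename is just an example):

```python
# Deduplicate and persist the URL list for the scraper to consume later
unique_urls = sorted(set(company_urls))

with open("company_urls.txt", "w") as f:
    f.write("\n".join(unique_urls))

print(f"Saved {len(unique_urls)} unique company URLs")
```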
b) Search API
For smaller datasets, we can use the Crunchbase search API to look up companies by keyword, category, location, etc. This returns company identifiers which we can use to construct profile URLs.
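Here's a rough sketch of that approach. It assumes access to Crunchbase's v4 REST API with a valid user key; the exact endpoint, query schema, and response shape may differ from what's shown, so treat this as illustrative only:

```python
import requests

API_KEY = "YOUR_CRUNCHBASE_API_KEY"  # hypothetical placeholder

# Search organizations by keyword (endpoint and payload are assumptions based on the v4 API)
resp = requests.post(
    "https://api.crunchbase.com/api/v4/searches/organizations",
    params={"user_key": API_KEY},
    json={
        "field_ids": ["identifier", "short_description"],
        "query": [
            {"type": "predicate", "field_id": "name",
             "operator_id": "contains", "values": ["electric"]}
        ],
        "limit": 50,
    },
)

# Build profile URLs from the returned permalinks
company_urls = [
    f"https://www.crunchbase.com/organization/{e['properties']['identifier']['permalink']}"
    for e in resp.json().get("entities", [])
]
print(company_urls[:5])
```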
c) Website crawl
For complete, up-to-date coverage, we can crawl the Crunchbase website to discover new and updated company profiles. We'll leave this more advanced approach to a future guide on crawling Crunchbase with Scrapy.
Step 2: Fetch Company Profile Pages
Next, we'll fetch the HTML page content for each company profile URL. To do this efficiently and avoid blocks, we'll use the Python Requests library together with rotating proxies from providers such as Bright Data, Smartproxy, Proxy-Seller, and Soax:
```python
import requests
from random import choice

# List of residential proxy IPs (placeholders)
proxies = [
    "http://123.123.123.123:8080",
    "http://111.111.111.111:8080",
    # ...
]

# Function to fetch a page with a random proxy
def fetch(url):
    proxy = choice(proxies)
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers=headers,
            timeout=30,
        )
        return response.text
    except Exception as e:
        print(e)
        return None

# Fetch company profile
tesla_url = "https://www.crunchbase.com/organization/tesla-motors"
html = fetch(tesla_url)
```
We randomize the proxy on each request to reduce the chance of IP blocks and add a User-Agent header to mimic a real browser. This gives us the raw HTML content for each company profile page.
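With hundreds of thousands of URLs, fetching them one at a time would be slow. Here's a minimal sketch of parallelizing the fetch() function above with a thread pool; the worker count, delay range, and batch size are arbitrary examples:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from random import uniform

def fetch_politely(url):
    # Small random delay per request to spread out the load
    time.sleep(uniform(1, 3))
    return url, fetch(url)

results = {}
with ThreadPoolExecutor(max_workers=10) as pool:
    # Fetch a batch of company profile pages in parallel
    for url, html in pool.map(fetch_politely, company_urls[:100]):
        if html:
            results[url] = html

print(f"Fetched {len(results)} pages")
```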
Step 3: Parse Company Data
With the page HTML fetched, we can now parse and extract fields of interest using Beautiful Soup:
```python
from bs4 import BeautifulSoup

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Extract data points
name = soup.find("h1", {"class": "name"}).text.strip()
description = soup.select_one(".break-words").text
website = soup.select_one("a[data-article='website']")["href"]
linkedin = soup.select_one("a[data-article='linkedin']")["href"]

# Get leadership team data
leaders = []
for person in soup.select(".management-team-member"):
    person_name = person.select_one(".name").text
    person_title = person.select_one(".title").text
    leaders.append({"name": person_name, "title": person_title})

print(name)
print(description)
print(website)
print(leaders)  # list of dicts containing each person's name and title
```
This extracts the key fields from the page HTML using CSS selectors. We can extend this to capture other datasets like funding, technology, revenue, locations, news and more.
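Crunchbase's markup changes periodically, so selectors like these can break without warning. One defensive pattern is a small helper that returns None instead of raising when an element is missing; a sketch:

```python
def safe_text(soup, selector):
    """Return the stripped text for a CSS selector, or None if it's not on the page."""
    element = soup.select_one(selector)
    return element.text.strip() if element else None

def safe_attr(soup, selector, attr):
    """Return an attribute value for a CSS selector, or None if missing."""
    element = soup.select_one(selector)
    return element.get(attr) if element else None

# Example usage with the selectors from above
name = safe_text(soup, "h1.name")
website = safe_attr(soup, "a[data-article='website']", "href")
```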
Step 4: Store Scraped Company Data
Finally, we'll want to store the scraped company data. For example, saving to a JSON file:
```python
import json

# Scraped data for company
data = {
    "name": name,
    "description": description,
    "website": website,
    "leaders": leaders,
}

# Append to a JSON Lines file (one JSON object per line)
with open("data.json", "a") as f:
    json.dump(data, f)
    f.write("\n")
```
For larger datasets, we would save to a database or data warehouse instead.
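For instance, here's a minimal sketch of writing the same fields into a local SQLite table instead; the schema is illustrative only:

```python
import sqlite3

conn = sqlite3.connect("crunchbase.db")
c = conn.cursor()

# Simple illustrative schema for company records
c.execute("""CREATE TABLE IF NOT EXISTS companies
             (name TEXT, description TEXT, website TEXT, leaders TEXT)""")

c.execute("INSERT INTO companies VALUES (?, ?, ?, ?)",
          (name, description, website, json.dumps(leaders)))

conn.commit()
conn.close()
```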
This covers the core workflow for scraping company profiles at scale from Crunchbase. Next we'll look at extracting Crunchbase funding data.
Scraping Crunchbase Funding Rounds
In addition to company profiles, Crunchbase contains funding details across 1M+ rounds globally. Key data points include:
- Company funded
- Investors
- Amount raised
- Announced date
- Stage (Series A, B, C etc)
- Post-money valuation
This data can help analyze startup funding trends, track valuations, research investors, and more. To extract Crunchbase funding data at scale, we'll utilize a similar methodology:
- Discover funding round URLs
- Fetch page HTML
- Parse funding details
- Store in database
Let's go through it…
Step 1: Discover Funding URLs
We can source a list of funding round URLs from the Crunchbase sitemaps:
```python
import gzip

import requests
from bs4 import BeautifulSoup

sitemap_index = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

# Fetch sitemap index
index_xml = requests.get(sitemap_index)
soup = BeautifulSoup(index_xml.content, "xml")

# Get funding sitemap URLs
funding_sitemaps = [url.text for url in soup.find_all("loc") if "funding_rounds" in url.text]

# Parse funding sitemaps to get the list of funding URLs
funding_urls = []
for sitemap in funding_sitemaps:
    xml = requests.get(sitemap)
    content = xml.content
    if sitemap.endswith(".gz"):
        try:
            content = gzip.decompress(content)
        except OSError:
            pass  # already decompressed
    soup = BeautifulSoup(content, "xml")
    urls = [url.text for url in soup.find_all("loc")]
    funding_urls.extend(urls)

print(len(funding_urls))  # 1,000,000+ funding round URLs!
```
Again, this gives us a comprehensive list of URLs to feed into our funding data scraper.
Step 2: Fetch Pages
We'll fetch the page HTML for each funding URL similarly to how we did for companies:
```python
import requests
from random import choice

proxies = [
    "http://123.123.123.123:8080",
    "http://111.111.111.111:8080",
    # ...
]

def fetch(url):
    proxy = choice(proxies)
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers=headers,
            timeout=30,
        )
        return response.text
    except Exception as e:
        print(e)
        return None

url = "https://www.crunchbase.com/funding_round/tesla-motors-series-c--8d79d327"
html = fetch(url)
```
Again we use proxies and randomize them to avoid blocks.
Step 3: Parse Funding Data
We'll parse the key details from the page HTML:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Extract funding data
company = soup.select_one(".component--field-formatter-nested-link")["title"]

investors = []
for investor in soup.select(".funding_round_investor"):
    investor_name = investor.select_one(".name").text
    investors.append(investor_name)

amount = soup.select_one(".component--field-formatter-number")["title"]
date = soup.select_one(".component--field-formatter-date")["title"]
stage = soup.select_one("h4.startup_stage").text

print(company)
print(investors)
print(amount)
print(date)
print(stage)
```
Again, we can extend this to capture other data points like valuation, images, and descriptions.
Step 4: Store in Database
Finally, we would save the structured funding data into our database:
```python
import sqlite3

# Connect to sqlite DB
conn = sqlite3.connect("crunchbase.db")
c = conn.cursor()

# Create fundings table
c.execute("""CREATE TABLE IF NOT EXISTS fundings
             (company TEXT, investors TEXT, amount INTEGER, date DATE, stage TEXT)""")

# Insert funding data
c.execute("INSERT INTO fundings VALUES (?, ?, ?, ?, ?)",
          (company, str(investors), amount, date, stage))

# Commit and close
conn.commit()
conn.close()
```
For larger datasets, we would use PostgreSQL, MySQL, Amazon Redshift or another production-grade database. This covers a basic workflow for extracting Crunchbase funding data at scale. Let's now look at scraping Crunchbase people and investor profiles.
Scraping Crunchbase People Profiles
In addition to companies and investments, Crunchbase has 300k+ profiles on startup founders, executives, investors and developers. These contain contact details, work histories, investments, education and more.
Key fields we can extract include:
- Name
- Title/role
- Company
- Location
- Bio
- Work history
- Investments
- Education
- Links/social profiles
Scraping people profiles follows a similar process:
- Discover list of people profile URLs
- Fetch profile page HTML
- Parse details from HTML
- Save structured data
Step 1: Discover People Profile URLs
Again we can source people profile URLs from the Crunchbase sitemaps:
```python
import requests
from bs4 import BeautifulSoup

sitemap_index = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

# Get list of people sitemaps
index = requests.get(sitemap_index)
soup = BeautifulSoup(index.text, "xml")
people_sitemaps = [url.text for url in soup.find_all("loc") if "people" in url.text]

# Extract profile URLs
people_urls = []
for sitemap in people_sitemaps:
    xml = requests.get(sitemap)
    soup = BeautifulSoup(xml.text, "xml")
    urls = [url.text for url in soup.find_all("loc")]
    people_urls.extend(urls)

print(len(people_urls))  # 300,000+ people profile URLs
```
This gives us a comprehensive list of people profile URLs to feed into our scraper.
Step 2: Fetch Profile Pages
We'll fetch the page HTML for each profile URL using proxies:
```python
import requests
from random import choice

proxies = [
    "http://123.123.123.123:8080",
    "http://111.111.111.111:8080",
    # ...
]

def fetch(url):
    proxy = choice(proxies)
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers=headers,
            timeout=30,
        )
        return response.text
    except Exception as e:
        print(e)
        return None

url = "https://www.crunchbase.com/person/elon-musk"
html = fetch(url)
```
Step 3: Parse Profile Data
We'll parse details from the page HTML:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Core profile fields
name = soup.select_one("h1.name").text.strip()
title = soup.select_one(".title").text
company = soup.select_one("a[data-section='current-positions']")["title"]
location = soup.select_one("a.locality").text

# Social links
socials = {}
for item in soup.select("ul.social-links li"):
    key = item.select_one("img")["alt"]
    link = item.select_one("a")["href"]
    socials[key] = link

bio = soup.select_one("div.bio").text.strip()

# Work history
work_history = []
for position in soup.select("section.past-positions ul li"):
    position_title = position.select_one("div.title").text
    position_company = position.select_one("div.company a")["title"]
    period = position.select_one("div.period").text
    work_history.append({"title": position_title, "company": position_company, "period": period})

# Education
education = []
for school in soup.select("section.education ul li"):
    degree = school.select_one("div.degree").text
    field = school.select_one("div.field").text
    institution = school.select_one("div.institution").text
    education.append({"degree": degree, "field": field, "institution": institution})

# Investments
investments = []
for investment in soup.select("section.investments ul li"):
    inv_company = investment.select_one(".company a")["title"]
    inv_amount = investment.select_one(".money").text
    investments.append({"company": inv_company, "amount": inv_amount})

print(name)
print(title)
print(location)
print(socials)
print(bio)
print(work_history)
print(education)
print(investments)
```
This extracts all the key details from a Crunchbase person profile using CSS selectors. We capture name, role, company, location, social links, bio, work history, education, investments and more.
Step 4: Store in Database
Finally, we would save the structured people data to our database:
```python
import sqlite3

conn = sqlite3.connect("crunchbase.db")
c = conn.cursor()

c.execute("""CREATE TABLE IF NOT EXISTS people
             (id INTEGER PRIMARY KEY, name TEXT, role TEXT, company TEXT, location TEXT,
              socials TEXT, bio TEXT, work_history TEXT, education TEXT, investments TEXT)""")

# Insert person data
c.execute("INSERT INTO people VALUES (NULL, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
          (name, title, company, location, str(socials), bio,
           str(work_history), str(education), str(investments)))

conn.commit()
conn.close()
```
And that covers a workflow for scraping Crunchbase people and investor data at scale!
Scraping Strategies to Avoid Blocks
When scraping large volumes of pages from Crunchbase, you'll likely encounter bot mitigation and blocks. Here are some tips to scrape effectively:
- Use proxies – Rotate residential proxies on each request to distribute load and mask your scraper's fingerprint.
- Random delays – Add random delays between requests to mimic human behavior.
- Throttle requests – Limit the request rate to a few per second to respect the target site.
- Rotate User-Agents – Change the user agent on each request to vary your scraper's signature (see the sketch after this list).
- Handle captchas – Detect captchas, pause the crawl, and solve them manually or with an automated solving service.
- Scrape during low-traffic hours – Hit the site mostly at night, when there are fewer human visitors.
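As a concrete example of the delay and User-Agent tips above, here's a minimal sketch that wraps the earlier fetch logic. The User-Agent strings and delay range are arbitrary examples, and proxies refers to the proxy list defined earlier:

```python
import time
from random import choice, uniform

import requests

# A small pool of example User-Agent strings to rotate through
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def polite_fetch(url):
    headers = {"User-Agent": choice(user_agents)}  # rotate user agent per request
    proxy = choice(proxies)                        # rotate proxy per request
    time.sleep(uniform(2, 6))                      # random delay to mimic human browsing
    response = requests.get(url, proxies={"http": proxy, "https": proxy},
                            headers=headers, timeout=30)
    return response.text
```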
Proper proxies and thoughtful throttling/pausing are key for sustainable Crunchbase scraping. An automated proxy rotation service like Smartproxy, Bright Data, Proxy-Seller, or Soax can make proxy management seamless.
Scraping Crunchbase at Scale Using Scrapy
For large scale crawling of Crunchbase data, the Scrapy framework is a great option. Scrapy provides:
- Asynchronous crawling for fast parallel scraping
- Built-in throttling, retries, cookie handling etc
- Easy integration of proxies, random delays, user-agent rotation
- Powerful parsing selectors like XPath and CSS
- Pipeline for processing scraped items
- Integration with databases and data pipelines
Here is a sample Scrapy spider to crawl Crunchbase company profiles:
```python
import scrapy
from scrapy.crawler import CrawlerProcess

class CrunchbaseSpider(scrapy.Spider):
    name = "crunchbase"

    # Start URLs (in practice, feed these in from the sitemap)
    start_urls = [
        "https://www.crunchbase.com/organization/spacex",
        "https://www.crunchbase.com/organization/coinbase",
    ]

    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        "HTTPCACHE_ENABLED": True,
    }

    def parse(self, response):
        # Parse company name, leaders, etc.
        yield {
            "name": response.css("h1.name::text").get(),
            "leaders": response.css(".management-team-member .name::text").getall(),
            # etc...
        }

# Run spider
process = CrawlerProcess()
process.crawl(CrunchbaseSpider)
process.start()
```
Key points:
- Can set a download delay and concurrency to throttle politely.
- Enabling caching avoids re-scraping duplicate content.
- Easy to integrate proxies, user-agent rotation, etc. (see the sketch after this list).
- Useful for large-scale crawling of all Crunchbase companies, people, and funding data.
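As an example of that last point on proxies and user-agent rotation, here's a minimal sketch using Scrapy's built-in request meta and headers. The proxy list and User-Agent strings are placeholders, and a dedicated middleware such as scrapy-rotating-proxies can handle this more robustly:

```python
from random import choice

import scrapy

# Placeholder proxy and User-Agent pools
PROXIES = ["http://123.123.123.123:8080", "http://111.111.111.111:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class RotatingCrunchbaseSpider(scrapy.Spider):
    name = "crunchbase_rotating"
    start_urls = ["https://www.crunchbase.com/organization/spacex"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={"User-Agent": choice(USER_AGENTS)},
                # Scrapy's built-in HttpProxyMiddleware picks the proxy up from request meta
                meta={"proxy": choice(PROXIES)},
            )

    def parse(self, response):
        yield {"name": response.css("h1.name::text").get()}
```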
Final Thoughts
In this comprehensive guide, we covered various techniques for effectively scraping company, funding, and people data from Crunchbase at scale using Python.
The key steps for each page type are:
- Discover the list of URLs to scrape
- Fetch page HTML using proxies
- Parse and extract fields of interest
- Store structured data in the database
Crawl politely using rotating proxies, random delays, and throttling. Leveraging a framework like Scrapy helps for large-scale crawling. Scraped Crunchbase data can provide unique insights for investment analysis, business intelligence, recruiting, and more.