Crunchbase is a massive database containing detailed information on companies, investors, funding rounds, and more. With over 700k company profiles, 300k people profiles, and 200k funding round details, Crunchbase is a goldmine for market research, lead generation, and competitive intelligence.
In this comprehensive guide, we'll cover multiple methods and best practices for scraping Crunchbase data at scale.
Why Scrape Crunchbase Data?
Here are some of the top reasons you may want to extract data from Crunchbase:
- Company profiling – Crunchbase has expansive datasets on private and public companies, including funding, leadership, technology stack, descriptions, locations, and more. Ideal for lead generation, recruiting and market research.
- Funding/M&A research – Details on startup funding rounds, investors, acquisitions, partnerships, valuations, and exit events. Useful for investment research and analysis.
- People/recruitment data – Contact details, backgrounds and career histories of executives, founders, developers and more. Great for recruiting and sales prospecting.
- Industry/market analytics – Aggregate data on companies, people, funding, acquisitions, and news by industry, category, location etc. For market size analysis, investment strategy, and more.
- Alternative data – Novel data points not found in company filings or other sources, like the technology used, team size estimates, and more. For alpha generation in investing.
In summary, Crunchbase contains expansive, unique datasets useful for various business intelligence, analytics, and market research purposes.
Is it Legal to Scrape Crunchbase?
Broadly speaking, scraping public information from Crunchbase in a non-disruptive manner is legal in most jurisdictions.
Crunchbase's Terms of Service do not expressly prohibit scraping or text/data mining activities. As long as you scrape respectfully and do not use Crunchbase data for spam or illegal purposes, web scraping can be conducted legally.
That said, always consult the laws in your local jurisdiction before web scraping any website. When handling personal data, pay special attention to privacy regulations like GDPR. And remember to throttle scrape requests to minimize load on Crunchbase's servers.
Scraping Crunchbase with Python
For this guide, we'll use Python for web scraping Crunchbase, as it's one of the most popular languages for scraping and data analysis. We'll also leverage a few key Python libraries:
- Requests – for making HTTP requests to fetch pages
- Beautiful Soup – for extracting data from HTML and XML
- Scrapy – for building more advanced web crawlers (optional)
In addition, we'll use proxies to avoid IP blocks and bypass Crunchbase bot detection. Proxies also help distribute the load over multiple IPs. Some good commercial proxy providers include BrightData, Proxy-Seller, Soax, and Smartproxy. Residential proxies often work best for mimicking real user traffic.
Let's dive in and cover various methods for effectively scraping Crunchbase pages and data!
Scraping Crunchbase Company Profiles
Crunchbase company profiles (for example: Tesla Motors profile) contain a wealth of data, including:
- Descriptions, summaries
- Locations, addresses
- Leadership teams
- Funding details
- Technology used
- News, events
- Financials, revenue, valuation estimates
To extract company data from Crunchbase, we'll utilize the following workflow:
- Discover the list of company profile URLs to scrape
- Fetch page HTML for each company profile
- Parse page HTML to extract key data points
- Store scraped company data
Let's go through each step…
Step 1: Discover Company Profile URLs
We first need a list of company profile URLs to feed into our scraper. There are a couple of approaches for this:
a) Scrape sitemaps
Crunchbase provides sitemap XML files containing all URLs indexed on their site:
- Main sitemap index: www.crunchbase.com/www-sitemaps/sitemap-index.xml
- Company sitemap: www.crunchbase.com/www-sitemaps/sitemap-organizations-XX.xml.gz
We can parse these sitemaps to extract all company profile URLs:
```python
import gzip

import requests
from bs4 import BeautifulSoup

sitemap_index = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

# Fetch sitemap index
index_xml = requests.get(sitemap_index)

# Parse sitemap index to get the list of organization sitemaps
soup = BeautifulSoup(index_xml.content, "xml")
org_sitemaps = [url.text for url in soup.find_all("loc") if "organizations" in url.text]

# Iterate through organization sitemaps and extract company URLs
company_urls = []
for sitemap in org_sitemaps:
    # Fetch sitemap (the .xml.gz files may need decompressing)
    sitemap_xml = requests.get(sitemap)
    content = sitemap_xml.content
    if sitemap.endswith(".gz"):
        try:
            content = gzip.decompress(content)
        except OSError:
            pass  # already decompressed by requests

    # Parse sitemap to extract company URLs
    soup = BeautifulSoup(content, "xml")
    urls = [url.text for url in soup.find_all("loc")]

    # Add company URLs to list
    company_urls.extend(urls)

print(len(company_urls))  # 735,624 company URLs!
```
This gives us 700k+ clean company profile URLs to feed into our scraper.
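Before feeding these into the scraper, it's worth deduplicating the list and persisting it to disk so the crawl can be resumed later. A minimal sketch (the company_urls.txt filename is just an example):

```python
# Deduplicate and persist the URL list for the scraper to consume later
unique_urls = sorted(set(company_urls))

with open("company_urls.txt", "w") as f:
    f.write("\n".join(unique_urls))

print(f"Saved {len(unique_urls)} unique company URLs")
```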
b) Search API
For smaller datasets, we can use the Crunchbase search API to look up companies by keyword, category, location, etc. This returns company identifiers which we can use to construct profile URLs.
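Here's a rough sketch of that approach. It assumes access to Crunchbase's v4 REST API with a valid user key; the exact endpoint, query schema, and response shape may differ from what's shown, so treat this as illustrative only:

```python
import requests

API_KEY = "YOUR_CRUNCHBASE_API_KEY"  # hypothetical placeholder

# Search organizations by keyword (endpoint and payload are assumptions based on the v4 API)
resp = requests.post(
    "https://api.crunchbase.com/api/v4/searches/organizations",
    params={"user_key": API_KEY},
    json={
        "field_ids": ["identifier", "short_description"],
        "query": [
            {"type": "predicate", "field_id": "name",
             "operator_id": "contains", "values": ["electric"]}
        ],
        "limit": 50,
    },
)

# Build profile URLs from the returned permalinks
company_urls = [
    f"https://www.crunchbase.com/organization/{e['properties']['identifier']['permalink']}"
    for e in resp.json().get("entities", [])
]
print(company_urls[:5])
```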
c) Website crawl
For complete, up-to-date coverage, we can crawl the Crunchbase website to discover new and updated company profiles. We'll leave this more advanced approach to a future guide on crawling Crunchbase with Scrapy.
Step 2: Fetch Company Profile Pages
Next, we'll fetch the HTML page content for each company profile URL. To do this efficiently and avoid blocks, we'll use the Python Requests library together with rotating proxies from providers such as Bright Data, Smartproxy, Proxy-Seller, and Soax:
```python
import requests
from random import choice

# List of residential proxy IPs (placeholders)
proxies = [
    "http://123.123.123.123:8080",
    "http://111.111.111.111:8080",
    # ...
]

# Function to fetch a page with a random proxy
def fetch(url):
    proxy = choice(proxies)
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers=headers,
            timeout=30,
        )
        return response.text
    except Exception as e:
        print(e)
        return None

# Fetch company profile
tesla_url = "https://www.crunchbase.com/organization/tesla-motors"
html = fetch(tesla_url)
```
We randomize the proxy on each request to reduce the chance of IP blocks and add a User-Agent header to mimic a real browser. This gives us the raw HTML content for each company profile page.
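With hundreds of thousands of URLs, fetching them one at a time would be slow. Here's a minimal sketch of parallelizing the fetch() function above with a thread pool; the worker count, delay range, and batch size are arbitrary examples:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from random import uniform

def fetch_politely(url):
    # Small random delay per request to spread out the load
    time.sleep(uniform(1, 3))
    return url, fetch(url)

results = {}
with ThreadPoolExecutor(max_workers=10) as pool:
    # Fetch a batch of company profile pages in parallel
    for url, html in pool.map(fetch_politely, company_urls[:100]):
        if html:
            results[url] = html

print(f"Fetched {len(results)} pages")
```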
Step 3: Parse Company Data
With the page HTML fetched, we can now parse and extract fields of interest using Beautiful Soup:
```python
from bs4 import BeautifulSoup

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Extract data points
name = soup.find("h1", {"class": "name"}).text.strip()
description = soup.select_one(".break-words").text
website = soup.select_one("a[data-article='website']")["href"]
linkedin = soup.select_one("a[data-article='linkedin']")["href"]

# Get leadership team data
leaders = []
for person in soup.select(".management-team-member"):
    person_name = person.select_one(".name").text
    person_title = person.select_one(".title").text
    leaders.append({"name": person_name, "title": person_title})

print(name)
print(description)
print(website)
print(leaders)  # list of dicts containing each person's name and title
```
This extracts the key fields from the page HTML using CSS selectors. We can extend this to capture other datasets like funding, technology, revenue, locations, news and more.
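Crunchbase's markup changes periodically, so selectors like these can break without warning. One defensive pattern is a small helper that returns None instead of raising when an element is missing; a sketch:

```python
def safe_text(soup, selector):
    """Return the stripped text for a CSS selector, or None if it's not on the page."""
    element = soup.select_one(selector)
    return element.text.strip() if element else None

def safe_attr(soup, selector, attr):
    """Return an attribute value for a CSS selector, or None if missing."""
    element = soup.select_one(selector)
    return element.get(attr) if element else None

# Example usage with the selectors from above
name = safe_text(soup, "h1.name")
website = safe_attr(soup, "a[data-article='website']", "href")
```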
Step 4: Store Scraped Company Data
Finally, we'll want to store the scraped company data. For example, saving to a JSON file:
```python
import json

# Scraped data for company
data = {
    "name": name,
    "description": description,
    "website": website,
    "leaders": leaders,
}

# Append to a JSON Lines file (one JSON object per line)
with open("data.json", "a") as f:
    json.dump(data, f)
    f.write("\n")
```
For larger datasets, we would save to a database or data warehouse instead.
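For instance, here's a minimal sketch of writing the same fields into a local SQLite table instead; the schema is illustrative only:

```python
import sqlite3

conn = sqlite3.connect("crunchbase.db")
c = conn.cursor()

# Simple illustrative schema for company records
c.execute("""CREATE TABLE IF NOT EXISTS companies
             (name TEXT, description TEXT, website TEXT, leaders TEXT)""")

c.execute("INSERT INTO companies VALUES (?, ?, ?, ?)",
          (name, description, website, json.dumps(leaders)))

conn.commit()
conn.close()
```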
This covers the core workflow for scraping company profiles at scale from Crunchbase. Next we'll look at extracting Crunchbase funding data.
Scraping Crunchbase Funding Rounds
In addition to company profiles, Crunchbase contains funding details across 1M+ rounds globally. Key data points include:
- Company funded
- Investors
- Amount raised
- Announced date
- Stage (Series A, B, C etc)
- Post-money valuation
This data can help analyze startup funding trends, track valuations, research investors, and more. To extract Crunchbase funding data at scale, we'll utilize a similar methodology:
- Discover funding round URLs
- Fetch page HTML
- Parse funding details
- Store in database
Let's go through it…
Step 1: Discover Funding URLs
We can source a list of funding round URLs from the Crunchbase sitemaps:
```python
import gzip

import requests
from bs4 import BeautifulSoup

sitemap_index = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

# Fetch sitemap index
index_xml = requests.get(sitemap_index)
soup = BeautifulSoup(index_xml.content, "xml")

# Get funding sitemap URLs
funding_sitemaps = [url.text for url in soup.find_all("loc") if "funding_rounds" in url.text]

# Parse funding sitemaps to get the list of funding URLs
funding_urls = []
for sitemap in funding_sitemaps:
    xml = requests.get(sitemap)
    content = xml.content
    if sitemap.endswith(".gz"):
        try:
            content = gzip.decompress(content)
        except OSError:
            pass  # already decompressed
    soup = BeautifulSoup(content, "xml")
    urls = [url.text for url in soup.find_all("loc")]
    funding_urls.extend(urls)

print(len(funding_urls))  # 1,000,000+ funding round URLs!
```
Again, this gives us a comprehensive list of URLs to feed into our funding data scraper.
Step 2: Fetch Pages
We'll fetch the page HTML for each funding URL similarly to how we did for companies:
```python
import requests
from random import choice

proxies = [
    "http://123.123.123.123:8080",
    "http://111.111.111.111:8080",
    # ...
]

def fetch(url):
    proxy = choice(proxies)
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers=headers,
            timeout=30,
        )
        return response.text
    except Exception as e:
        print(e)
        return None

url = "https://www.crunchbase.com/funding_round/tesla-motors-series-c--8d79d327"
html = fetch(url)
```
Again we use proxies and randomize them to avoid blocks.
Step 3: Parse Funding Data
We'll parse the key details from the page HTML:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Extract funding data
company = soup.select_one(".component--field-formatter-nested-link")["title"]

investors = []
for investor in soup.select(".funding_round_investor"):
    investor_name = investor.select_one(".name").text
    investors.append(investor_name)

amount = soup.select_one(".component--field-formatter-number")["title"]
date = soup.select_one(".component--field-formatter-date")["title"]
stage = soup.select_one("h4.startup_stage").text

print(company)
print(investors)
print(amount)
print(date)
print(stage)
```
Again, we can extend this to capture other data points like valuation, images, and descriptions.
Step 4: Store in Database
Finally, we would save the structured funding data into our database:
```python
import sqlite3

# Connect to sqlite DB
conn = sqlite3.connect("crunchbase.db")
c = conn.cursor()

# Create fundings table
c.execute("""CREATE TABLE IF NOT EXISTS fundings
             (company TEXT, investors TEXT, amount INTEGER, date DATE, stage TEXT)""")

# Insert funding data
c.execute("INSERT INTO fundings VALUES (?, ?, ?, ?, ?)",
          (company, str(investors), amount, date, stage))

# Commit and close
conn.commit()
conn.close()
```
For larger datasets, we would use PostgreSQL, MySQL, Amazon Redshift or another production-grade database. This covers a basic workflow for extracting Crunchbase funding data at scale. Let's now look at scraping Crunchbase people and investor profiles.
Scraping Crunchbase People Profiles
In addition to companies and investments, Crunchbase has 300k+ profiles on startup founders, executives, investors and developers. These contain contact details, work histories, investments, education and more.
Key fields we can extract include:
- Name
- Title/role
- Company
- Location
- Bio
- Work history
- Investments
- Education
- Links/social profiles
Scraping people profiles follows a similar process:
- Discover list of people profile URLs
- Fetch profile page HTML
- Parse details from HTML
- Save structured data
Step 1: Discover People Profile URLs
Again we can source people profile URLs from the Crunchbase sitemaps:
```python
import requests
from bs4 import BeautifulSoup

sitemap_index = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

# Get list of people sitemaps
index = requests.get(sitemap_index)
soup = BeautifulSoup(index.text, "xml")
people_sitemaps = [url.text for url in soup.find_all("loc") if "people" in url.text]

# Extract profile URLs
people_urls = []
for sitemap in people_sitemaps:
    xml = requests.get(sitemap)
    soup = BeautifulSoup(xml.text, "xml")
    urls = [url.text for url in soup.find_all("loc")]
    people_urls.extend(urls)

print(len(people_urls))  # 300,000+ people profile URLs
```
This gives us a comprehensive list of people profile URLs to feed into our scraper.
Step 2: Fetch Profile Pages
We'll fetch the page HTML for each profile URL using proxies:
```python
import requests
from random import choice

proxies = [
    "http://123.123.123.123:8080",
    "http://111.111.111.111:8080",
    # ...
]

def fetch(url):
    proxy = choice(proxies)
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers=headers,
            timeout=30,
        )
        return response.text
    except Exception as e:
        print(e)
        return None

url = "https://www.crunchbase.com/person/elon-musk"
html = fetch(url)
```
Step 3: Parse Profile Data
We'll parse details from the page HTML:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Core profile fields
name = soup.select_one("h1.name").text.strip()
title = soup.select_one(".title").text
company = soup.select_one("a[data-section='current-positions']")["title"]
location = soup.select_one("a.locality").text

# Social links
socials = {}
for item in soup.select("ul.social-links li"):
    key = item.select_one("img")["alt"]
    link = item.select_one("a")["href"]
    socials[key] = link

bio = soup.select_one("div.bio").text.strip()

# Work history
work_history = []
for position in soup.select("section.past-positions ul li"):
    position_title = position.select_one("div.title").text
    position_company = position.select_one("div.company a")["title"]
    period = position.select_one("div.period").text
    work_history.append({"title": position_title, "company": position_company, "period": period})

# Education
education = []
for school in soup.select("section.education ul li"):
    degree = school.select_one("div.degree").text
    field = school.select_one("div.field").text
    institution = school.select_one("div.institution").text
    education.append({"degree": degree, "field": field, "institution": institution})

# Investments
investments = []
for investment in soup.select("section.investments ul li"):
    inv_company = investment.select_one(".company a")["title"]
    inv_amount = investment.select_one(".money").text
    investments.append({"company": inv_company, "amount": inv_amount})

print(name)
print(title)
print(location)
print(socials)
print(bio)
print(work_history)
print(education)
print(investments)
```
This extracts all the key details from a Crunchbase person profile using CSS selectors. We capture name, role, company, location, social links, bio, work history, education, investments and more.
Step 4: Store in Database
Finally, we would save the structured people data to our database:
```python
import sqlite3

conn = sqlite3.connect("crunchbase.db")
c = conn.cursor()

c.execute("""CREATE TABLE IF NOT EXISTS people
             (id INTEGER PRIMARY KEY, name TEXT, role TEXT, company TEXT, location TEXT,
              socials TEXT, bio TEXT, work_history TEXT, education TEXT, investments TEXT)""")

# Insert person data
c.execute("INSERT INTO people VALUES (NULL, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
          (name, title, company, location, str(socials), bio,
           str(work_history), str(education), str(investments)))

conn.commit()
conn.close()
```
And that covers a workflow for scraping Crunchbase people and investor data at scale!
Scraping Strategies to Avoid Blocks
When scraping large volumes of pages from Crunchbase, you'll likely encounter bot mitigation and blocks. Here are some tips to scrape effectively:
- Use proxies – Rotate residential proxies on each request to distribute load and mask your scraper's fingerprint.
- Random delays – Add random delays between requests to mimic human behavior.
- Throttle requests – Limit the request rate to a few per second to respect the target site.
- Rotate User-Agents – Change the user agent on each request to vary your scraper's signature (see the sketch after this list).
- Handle captchas – Detect captchas, pause the crawl, and solve them manually or with an automated solving service.
- Scrape during low-traffic hours – Hit the site mostly at night, when there are fewer human visitors.
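As a concrete example of the delay and User-Agent tips above, here's a minimal sketch that wraps the earlier fetch logic. The User-Agent strings and delay range are arbitrary examples, and proxies refers to the proxy list defined earlier:

```python
import time
from random import choice, uniform

import requests

# A small pool of example User-Agent strings to rotate through
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def polite_fetch(url):
    headers = {"User-Agent": choice(user_agents)}  # rotate user agent per request
    proxy = choice(proxies)                        # rotate proxy per request
    time.sleep(uniform(2, 6))                      # random delay to mimic human browsing
    response = requests.get(url, proxies={"http": proxy, "https": proxy},
                            headers=headers, timeout=30)
    return response.text
```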
Proper proxies and thoughtful throttling/pausing are key for sustainable Crunchbase scraping. An automated proxy rotation service like Smartproxy, Bright Data, Proxy-Seller, or Soax can make proxy management seamless.
Scraping Crunchbase at Scale Using Scrapy
For large scale crawling of Crunchbase data, the Scrapy framework is a great option. Scrapy provides:
- Asynchronous crawling for fast parallel scraping
- Built-in throttling, retries, cookie handling etc
- Easy integration of proxies, random delays, user-agent rotation
- Powerful parsing selectors like XPath and CSS
- Pipeline for processing scraped items
- Integration with databases and data pipelines
Here is a sample Scrapy spider to crawl Crunchbase company profiles:
```python
import scrapy
from scrapy.crawler import CrawlerProcess

class CrunchbaseSpider(scrapy.Spider):
    name = "crunchbase"

    # Start URLs (in practice, feed these in from the sitemap)
    start_urls = [
        "https://www.crunchbase.com/organization/spacex",
        "https://www.crunchbase.com/organization/coinbase",
    ]

    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        "HTTPCACHE_ENABLED": True,
    }

    def parse(self, response):
        # Parse company name, leaders, etc.
        yield {
            "name": response.css("h1.name::text").get(),
            "leaders": response.css(".management-team-member .name::text").getall(),
            # etc...
        }

# Run spider
process = CrawlerProcess()
process.crawl(CrunchbaseSpider)
process.start()
```
Key points:
- Can set a download delay and concurrency to throttle politely.
- Enabling caching avoids re-scraping duplicate content.
- Easy to integrate proxies, user-agent rotation, etc. (see the sketch after this list).
- Useful for large-scale crawling of all Crunchbase companies, people, and funding data.
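As an example of that last point on proxies and user-agent rotation, here's a minimal sketch using Scrapy's built-in request meta and headers. The proxy list and User-Agent strings are placeholders, and a dedicated middleware such as scrapy-rotating-proxies can handle this more robustly:

```python
from random import choice

import scrapy

# Placeholder proxy and User-Agent pools
PROXIES = ["http://123.123.123.123:8080", "http://111.111.111.111:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class RotatingCrunchbaseSpider(scrapy.Spider):
    name = "crunchbase_rotating"
    start_urls = ["https://www.crunchbase.com/organization/spacex"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={"User-Agent": choice(USER_AGENTS)},
                # Scrapy's built-in HttpProxyMiddleware picks the proxy up from request meta
                meta={"proxy": choice(PROXIES)},
            )

    def parse(self, response):
        yield {"name": response.css("h1.name::text").get()}
```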
Final Thoughts
In this comprehensive guide, we covered various techniques for effectively scraping company, funding, and people data from Crunchbase at scale using Python.
The key steps for each page type are:
- Discover the list of URLs to scrape
- Fetch page HTML using proxies
- Parse and extract fields of interest
- Store structured data in the database
Crawl politely using rotating proxies, random delays, and throttling. Leveraging a framework like Scrapy helps for large-scale crawling. Scraped Crunchbase data can provide unique insights for investment analysis, business intelligence, recruiting, and more.