How to Crawl the Web with Python?

Web crawling allows you to automatically scrape and navigate websites at scale. With Python, you can build flexible web crawlers for data mining, archiving, search engines, and more. So if you want to master large-scale data extraction from websites, you’ll love this in-depth crawl guide!

What Exactly is Web Crawling?

Web crawling refers to the automated traversal and scraping of websites by recursively following hyperlinks. The crawler starts from an initial list of seed URLs, downloads the content, extracts any links, and repeats this process for those links in turn. This allows the crawler to explore and scrape entire websites by navigating between pages automatically.

The key advantage of crawling compared to regular web scraping is the automatic discovery of new URLs to scrape based on links on each page. With traditional web scraping, you manually find and define the list of URLs to scrape. The scraper then visits each URL to extract data. Crawlers automatically build this list by parsing links from pages as they go. This enables scraping at much larger scales across entire sites.

Some key differences between traditional web scraping and web crawling:

Web Scraping                      | Web Crawling
Manual list of URLs to scrape     | Automatically parses links to find URLs
Focused on extracting data        | Focused on exploration and link graph
Lower complexity                  | Handles complex traversal logic
Lower scale (100s of pages)       | Higher scale (1000s of pages)

So in summary, web crawling involves automatically navigating between pages using link extraction and recursive traversal logic. This allows you to scrape very large websites without needing to manually find all links.

Building a Simple Crawler in Python

Let's now walk through a hands-on example to create a basic web crawler in Python.

We'll use some common scraping libraries like requests, BeautifulSoup and tldextract. Here are the dependencies we'll need:

pip install requests beautifulsoup4 tldextract

The goal is to build a crawler that extracts links from pages of a specific website domain.

First, we need a function to download a page and extract the links:

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

def crawl_page(url):
    print(f'Crawling {url}')
    resp = requests.get(url)

    soup = BeautifulSoup(resp.content, 'html.parser')
    links = []

    for link in soup.find_all('a'):
        href = link.get('href')
        if href:
            # Resolve relative links like '/about' into absolute URLs
            links.append(urljoin(url, href))

    return links

This uses requests to fetch the page content and BeautifulSoup to parse the HTML. We find all <a> anchor tags, extract their href attributes, and resolve any relative links against the page URL to get a list of absolute links on the page.

Next, we need to filter these links to focus on our target domain. For this, we'll use the tldextract module:

from tldextract import extract

def filter_links(links, domain):
    filtered = []

    for url in links:
        try:
            # registered_domain combines domain and suffix, e.g. 'techbeamers.com'
            if extract(url).registered_domain == domain:
                filtered.append(url)
        except Exception as e:
            print(e)

    return filtered

This extracts the registered domain (e.g. techbeamers.com) from each URL and discards any links that aren't part of the target domain we want to crawl.

Now we can put everything together into a simple crawler:

from collections import deque

def crawl(urls, max_pages=100):
    visited = set()
    pages_crawled = 0
    
    queue = deque()
    for url in urls:
        queue.append(url)
        
    while queue and pages_crawled < max_pages:
                
        url = queue.popleft() # get next url
        
        if url in visited:
            continue
            
        links = crawl_page(url) # download and extract links
        links = filter_links(links, 'techbeamers.com') # filter
        
        queue.extend(links) # add new links to queue
        visited.add(url) # mark as visited
        
        pages_crawled += 1
        
    print(f'Crawled {pages_crawled} pages')
        
crawl(['https://techbeamers.com'])

This implements a basic breadth-first crawl of up to 100 pages from a seed URL. It uses a queue to manage the frontier of URLs to crawl and deduplicates visited pages.

And that's it! In about 50 lines we have a functional web crawler using common Python libraries like requests and BeautifulSoup.

While basic, this covers the core concepts like:

  • Downloading pages with requests
  • Using BeautifulSoup to parse and extract links
  • Filtering for a specific domain with tldextract
  • Maintaining frontier queue of URLs to crawl
  • Storing visited pages to avoid duplicates

There are definitely ways to extend this crawler further:

  • Storing scraped page data in a database
  • Using multithreading for faster crawling
  • Implementing depth limits, politeness policies (see the sketch after this list)
  • Adding proxy rotation to avoid bans
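For example, depth limits can be added by storing each URL's depth alongside it in the queue. Below is a minimal sketch that builds on the crawl_page() and filter_links() helpers above; crawl_with_depth and max_depth are names introduced here for illustration, not part of the original code.

from collections import deque

def crawl_with_depth(seed_urls, max_depth=2, max_pages=100):
    visited = set()
    pages_crawled = 0

    # Store (url, depth) pairs so we know how far each page is from the seeds
    queue = deque((url, 0) for url in seed_urls)

    while queue and pages_crawled < max_pages:
        url, depth = queue.popleft()

        if url in visited or depth > max_depth:
            continue

        links = crawl_page(url)                         # reuse the earlier helper
        links = filter_links(links, 'techbeamers.com')  # stay on the target domain

        # Child links sit one level deeper than the current page
        queue.extend((link, depth + 1) for link in links)
        visited.add(url)
        pages_crawled += 1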

However, I hope this gives a good overview of how to approach building a simple crawler in Python. Next let's talk about some more advanced considerations.

Advanced Topics for Robust Crawlers

While we built a basic crawler above, here are some more nuanced topics to make your crawlers more robust:

Handling JavaScript Sites

A major challenge today is that many sites rely heavily on JavaScript to render content. Normal requests won't properly scrape these dynamic pages.

To crawl JavaScript sites, you need to use a headless browser like Selenium, Playwright or Puppeteer. These drive a real browser in the background to execute JavaScript.

Here is an example with Playwright in Python:

from playwright.sync_api import sync_playwright

def crawl_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page_html = page.content()  # Gets the fully rendered HTML
        browser.close()

    # Pass page_html to your parsing code to extract links etc.
    return page_html

This launches a headless Chromium browser using Playwright, loads the URL, and returns the fully rendered post-JavaScript HTML. Much more robust!

So using a headless browser library is highly recommended if you need to crawl complex JavaScript-heavy sites.

Stateful Crawling with Logins

Many sites require cookies, sessions or logins to access certain pages. For example, scraping an ecommerce website after adding items to a cart.

Crawlers need to be able to maintain state like cookies across requests and handle any required logins.

Here is one way to share state using the requests library:

import requests

session = requests.Session()

# Log in once; the session keeps the returned cookies
# (the login URL and credentials below are just placeholders)
session.post('https://example.com/login', data={'username': 'foo', 'password': 'bar'})

# Crawl pages as the logged-in user
response = session.get('https://example.com/account')

This persists cookies across requests by using a requests.Session. You can also directly use Selenium/Playwright to get a persistent browser state.

Overall, your crawler needs some notion of sessions and the ability to perform any logins required.

Crawl Delays and Throttling

You don't want to overload target sites with requests and get banned. Some tips:

  • Use random delays of 1-5+ seconds between requests
  • Limit request rates to a reasonable level
  • Leverage proxy rotation services to distribute load

Here is an example of implementing throttling in a Python crawler:

import time
from random import randint

for url in urls:
    crawl_page(url)

    # Random delay of 1-5 seconds between requests to throttle the crawl
    delay = randint(1, 5)
    time.sleep(delay)

This adds randomized delays between requests to limit request rates. You can also track server response times and adapt your throttling accordingly.
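As a rough sketch of adaptive throttling (not part of the original example), you can scale the pause based on the response time that requests reports for each page:

import time
import requests

def polite_get(url, base_delay=1.0):
    resp = requests.get(url, timeout=10)

    # requests records how long the server took to respond
    server_time = resp.elapsed.total_seconds()

    # Wait longer when the server is slow, never less than base_delay seconds
    time.sleep(max(base_delay, server_time * 2))
    return resp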

Advanced commercial proxy solutions like BrightData offer thousands of IPs to rotate through. This can help minimize IP based blocks.

Respecting robots.txt

It's good practice to respect websites' robots.txt files which provide crawl delay rules. For example:

# robots.txt

User-agent: *  
Crawl-delay: 10

This asks crawlers to wait at least 10 seconds between requests. Make sure to inspect the robots.txt and adapt your crawler accordingly.
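Python's standard library ships urllib.robotparser for reading these rules. A minimal sketch (the page URL checked below is just a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://techbeamers.com/robots.txt')
rp.read()

# Check whether our crawler is allowed to fetch a given URL
if rp.can_fetch('*', 'https://techbeamers.com/some-page'):
    print('Allowed to crawl')

# Honor any Crawl-delay directive (returns None if the site doesn't set one)
delay = rp.crawl_delay('*')
print(f'Requested crawl delay: {delay}')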

Storing Crawl Results

In most cases, you'll want to persist the pages, links and data scraped by your crawler somewhere.

Flat files or a database like PostgreSQL are good options. Here is an example storing links in PostgreSQL:

import psycopg2

# Connect to the database
conn = psycopg2.connect(...)
cursor = conn.cursor()

# Create the links table
cursor.execute('''CREATE TABLE links (
    id serial PRIMARY KEY,
    url text NOT NULL
)''')

# Save a link and commit the transaction
cursor.execute('INSERT INTO links (url) VALUES (%s)', (url,))
conn.commit()

Proper storage lets you analyze and query the crawl results after the crawl is complete.

So in summary, topics like browser automation, throttling, and data storage are key for feature-rich production web crawlers.

There are certainly many more nuances around proxy rotation, URL normalization, frontier management, and so on. But this covers the most essential aspects of robust crawling. Next, let's look at taking things to the next level with asynchronous crawling and Scrapy.

Asynchronous Crawling for Speed and Scale

The sequential nature of most basic crawlers leads to slow performance and bottlenecks. Asynchronous concurrency can provide huge speedups.

Asyncio in Python

The asyncio module built into Python allows writing asynchronous code using async/await syntax.

Multiple coroutines (async functions) can execute concurrently within the same thread using an event loop. Perfect for I/O bound tasks like network requests in a crawler.

Here is a basic async crawler example:

import asyncio

async def crawl(url):
    print(f'Crawling {url}')
    # async version of the crawl logic goes here

async def main():
    urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']

    # Schedule the coroutines to run concurrently
    await asyncio.gather(*(crawl(url) for url in urls))

asyncio.run(main())

By running multiple crawl() coroutines concurrently, you can process many more pages per second compared to sequential crawling.

The async approach also helps avoid wasted idle time from network I/O delays. The crawler can always have multiple requests in flight simultaneously.

For CPU-bound tasks like parsing, you would use multiprocessing instead. But asyncio is ideal for scaling I/O heavy activities like downloading and network calls.
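To make the download step itself non-blocking, asyncio is usually paired with an async HTTP client such as aiohttp. Here is a minimal sketch, assuming aiohttp is installed (pip install aiohttp) and using a semaphore to cap concurrency:

import asyncio
import aiohttp

async def fetch(session, url, sem):
    # The semaphore limits how many requests are in flight at once
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def crawl_all(urls, max_concurrency=10):
    sem = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url, sem) for url in urls))

# pages = asyncio.run(crawl_all(['https://example.com']))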

Scrapy Framework for Production Crawling

Scrapy is a popular web crawling framework written in Python. It implements all the key components of a crawler we've discussed:

  • Robust HTTP client with proxy and cookie support
  • Flexible mechanisms for parsing responses
  • Link extraction, URL normalization, and more
  • Asynchronous concurrency (built on Twisted, with asyncio support)
  • Powerful frontier management and scheduling
  • Integrations for JavaScript rendering via Playwright/Selenium plugins
  • Easy ways to store output data
  • Tons of customization options

In other words, Scrapy provides a production-ready web crawling solution right out of the box. The architecture is highly scalable and integrates well with databases and other data processing systems.

Some key advantages of Scrapy over a DIY crawler include:

  • Battle-tested robustness from real-world use
  • More efficient frontier management
  • Easy integration for JavaScript rendering
  • Advanced middleware and built-in throttling
  • Great for large scale distributed crawls

So I highly recommend Scrapy once you outgrow basic self-made crawlers and need enterprise-grade capabilities.
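To give a feel for the framework, here is a minimal sketch of a Scrapy spider that follows links within one domain and records page titles (the spider name and selectors are illustrative, not from the original article):

import scrapy

class SiteSpider(scrapy.Spider):
    name = 'site'
    allowed_domains = ['techbeamers.com']
    start_urls = ['https://techbeamers.com']

    def parse(self, response):
        # Store the URL and title of every crawled page
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }

        # Follow each link; Scrapy deduplicates and filters off-site URLs for you
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

You could run a spider like this with scrapy runspider spider.py -o pages.json to collect the results as JSON.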

Why are Web Crawlers Useful?

Here are some of the most common use cases and applications of web crawlers:

Search Engine Indexing

Search engines like Google use web crawlers to discover and index websites. The crawler starts from some seed URLs and recursively follows links to find new web pages. It extracts information like page titles, content, and other metadata to build the search index.

Google alone is estimated to crawl over 20 billion pages per day! Without crawlers, search engines would not be able to index the scale of the modern web.

Archiving Websites

Web crawlers can create local copies or archives of websites by scraping and saving all content. The Internet Archive project uses crawlers to archive petabytes of internet content. The crawler saves HTML pages, links, images, documents, and other media from websites.

This is useful for historians, researchers, digital preservationists, and anyone interested in studying older versions of the web.

Data Aggregation and Mining

Crawlers allow aggregating structured data from across different sites. For example, scraping all product listings from ecommerce websites, extracting research paper metadata from journal sites, or compiling statistics from sports sites.

This data can then be used for price monitoring, machine learning training data, research datasets, and more. Manually scraping at this scale would be infeasible.

Monitoring and Change Detection

You can use web crawlers to monitor websites for new content or changes. For example, a news crawler that checks sites for new articles containing specific keywords every hour. Or competitive price monitoring by crawling ecommerce sites daily.

This allows you to get alerts or trigger workflows in response to changes detected on websites by the crawler.
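A simple way to detect changes is to hash each page's content and compare it against the hash stored from the previous crawl. A rough sketch (how you persist the old fingerprints is up to you):

import hashlib
import requests

def page_fingerprint(url):
    resp = requests.get(url, timeout=10)
    # Hash the raw page bytes; any change in content produces a new digest
    return hashlib.sha256(resp.content).hexdigest()

def has_changed(url, previous_fingerprint):
    return page_fingerprint(url) != previous_fingerprint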

Web Graph Analysis

Academic researchers often use web crawlers to analyze link graphs between websites. This can reveal valuable insights about popular sites, influencers, related domains, and more. The connectivity of the web forms a treasure trove of data.

As you can see, the automated scraping capabilities of web crawlers enable data mining at a massive scale across thousands of sites. Next, let's understand technically how crawlers work their magic.

How do Web Crawlers Work?

The web crawler recursively follows links between pages to traverse entire websites. But how does it actually work under the hood?

The typical workflow of a crawler is:

  1. Takes a starting list of seed URLs for crawling
  2. Downloads the content of a URL using an HTTP client
  3. Parses the content to extract all links and data
  4. Filters links to focus crawling on a specific domain
  5. Adds the filtered links to the crawl frontier
  6. Repeats the process for each link in frontier
  7. Stops when thresholds like max pages reached

Here is a more detailed overview of the key components of a web crawler:

HTTP Client

The crawler needs the ability to download the HTML content for each URL it wants to crawl. For this purpose, it uses a web scraping HTTP client like requests in Python. The HTTP client handles sending GET requests to URLs and returning the response content to the crawler. It also handles cookies, proxies, authentication, retries, timeouts and other request functionality.
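For example, with requests you can attach automatic retries and timeouts like this (the retry settings are just illustrative):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry failed requests a few times with an increasing back-off
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

resp = session.get('https://techbeamers.com', timeout=10)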

Parsing Logic

Once the HTML response is downloaded, the crawler must parse the content to extract links, titles, data, and other information. This parsing is commonly done using DOM parsing libraries like BeautifulSoup or lxml in Python. The parsing logic finds and extracts specific DOM elements like anchor tags.

URL Filtering

The crawler picks up all links from a page, but we need to filter out irrelevant or invalid URLs to focus the crawl.

For example, removing off-site links, image/media links, pages blocked by robots.txt etc. This helps guide the crawler. URL normalization and deduplication also happen during filtering to handle relative vs absolute URLs and avoid re-crawling duplicates.
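Python's urllib.parse covers most of this. A short sketch of resolving relative links and stripping fragments before deduplication (normalize is a helper name introduced here):

from urllib.parse import urljoin, urldefrag

def normalize(base_url, href):
    # Resolve relative links like '/about' against the page they came from
    absolute = urljoin(base_url, href)

    # Drop '#section' fragments so the same page isn't queued twice
    absolute, _fragment = urldefrag(absolute)
    return absolute

print(normalize('https://techbeamers.com/python/', '../about#team'))
# -> https://techbeamers.com/about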

Frontier/Queue

The frontier or queue data structure stores all the URLs that remain to be crawled. It manages what links get crawled next. Common implementations are breadth-first, depth-first or using priority queues. The order of links in the frontier affects efficiency and crawling behavior.
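As a sketch, a priority-based frontier can be built with Python's heapq, where a lower score means the URL gets crawled sooner (the scores below are purely illustrative):

import heapq

frontier = []

def push(url, score):
    # heapq keeps the entry with the smallest score at the front
    heapq.heappush(frontier, (score, url))

def pop():
    score, url = heapq.heappop(frontier)
    return url

push('https://techbeamers.com/python/a-deep-page', 2)
push('https://techbeamers.com/python/', 1)
print(pop())  # -> https://techbeamers.com/python/ (lowest score first)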

Data Storage

In most cases, the scraper will want to store the scraped page data, links, and metadata somewhere. A database like PostgreSQL is commonly used for this purpose. Storing crawl results allows further data mining and analysis.

Coordination

For large scale crawling, you need components to coordinate crawler instances across multiple servers and processes. This includes managing crawler workload, deduplicating effort, and dividing up the crawl frontier.
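One common pattern is to keep the frontier and the visited set in a shared store such as Redis, so multiple crawler processes or machines can work on the same crawl. A rough sketch, assuming a Redis server is running locally and the redis Python package is installed:

import redis

r = redis.Redis(host='localhost', port=6379)

def push_url(url):
    # Only enqueue URLs no worker has seen before (sadd returns 1 for new members)
    if r.sadd('crawler:seen', url):
        r.rpush('crawler:frontier', url)

def pop_url():
    raw = r.lpop('crawler:frontier')
    return raw.decode() if raw else None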

So in summary, the main capabilities of a crawler architecture are HTTP downloading, parsing, URL management, data storage, and coordination. Many frameworks like Scrapy provide implementations of these common components. But you can build a basic crawler fairly easily just using a few Python libraries.

Key Takeaways

Web crawling with Python involves using automated tools to traverse and scrape data from websites. These crawlers work by downloading pages, parsing links, and then repeating the process. Their applications range from fueling search engines to data mining and website monitoring.

To enhance their efficiency, techniques like browser automation and throttling are essential. Libraries such as Requests further ease this endeavor, empowering you to efficiently explore the digital landscape with Python.

Leon Petrou