How to Web Scrape with Scrapy

Scrapy is one of the most popular open source web scraping frameworks used by over 12,000 companies including NASA, Mozilla, and MIT. This comprehensive guide will provide a deep dive into Scrapy fundamentals and advanced techniques to help you scrape data at scale.

Introduction to Scrapy Architecture

Scrapy is built on top of Twisted, an asynchronous networking framework written in Python. This allows Scrapy to handle multiple requests concurrently and achieve high performance. The key components are:

  • Scheduler – Queues up requests and handles order of page crawling
  • Downloader – Sends HTTP requests and receives responses
  • Spiders – Contains scraping logic and callbacks
  • Item Pipeline – Handles processing and storing scraped items
  • Downloader middlewares – Pre-process and post-process requests/responses
  • Spider middlewares – Pre-process and post-process spider input/output

When you run a spider, the engine takes the initial requests from the spider, queues them in the scheduler, and dispatches them to the downloader through the downloader middlewares. The downloader fetches each page and passes the response back to the spider for processing. The spider parses the response and yields either scraped items or further requests. Scraped items go through the item pipeline, while new requests go back to the scheduler.

This architecture makes Scrapy:

  • Fast – Asynchronous requests and non-blocking IO
  • Robust – Automatic retrying of failed requests
  • Scalable – Handle hundreds of requests concurrently
  • Extensible – Pluggable components and extensions

Next, let's see how to create Scrapy scrapers to leverage these capabilities.

Creating Your First Scrapy Project

Scrapy requires Python 3.6+ and can be installed easily using pip:

pip install scrapy

To create a new Scrapy project:

scrapy startproject myproject

This generates a project directory containing:

  • scrapy.cfg – Deployment configuration
  • myproject/ – Project's python module
    • items.py – Definition of scraped items
    • pipelines.py – Item pipelines
    • settings.py – Configuration settings
    • spiders/ – Location of spiders
    • __init__.py – Makes the spiders folder a Python package

Spiders contain the scraping logic. To add a spider:

cd myproject
scrapy genspider mydomain mydomain.com

This creates a mydomain.py file containing a template spider class:

import scrapy

class MydomainSpider(scrapy.Spider):
    name = 'mydomain'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    def parse(self, response):
        # Extract data here
        pass

  • name defines the spider's name
  • allowed_domains restricts domains to scrape
  • start_urls lists the initial URLs to crawl
  • parse() is a callback method for processing responses

Now let's see how to implement scraping logic inside spiders.

Writing Scraping Spiders

Spiders generate Requests to crawl pages and parse Responses to extract data:

1. Generate Requests

Let's scrape quotes from http://quotes.toscrape.com:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

  • Override start_requests() to dynamically generate requests.
  • yield a scrapy.Request object to schedule requests.
  • Pass a callback function to handle responses.

2. Parse Responses

In parse() callback, we can extract data using CSS selectors:

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
        }

  • CSS selectors extract data from HTML tags.
  • Yield Python dicts to return scraped items.

3. Recursive Crawling

To crawl pages recursively:

def parse(self, response):
    # Scrape current page
    ...

    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

  • response.follow makes a request using the extracted URL.
  • Call parse() as callback to continue crawling.

This simple 3-step pattern allows us to scrape entire websites with ease!
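Putting the three steps together, here is a minimal, self-contained version of the quotes spider (the file name is just a convention):

# spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Extract every quote block on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        # Follow the "Next" link until there are no more pages
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Run it with scrapy crawl quotes -o quotes.json to write the scraped items to a JSON file.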

Storing Scraped Data

By itself, Scrapy will only output scraped items to the console. To store contents, we can use Feed Exports and Item Pipelines.

Feed Exports

Enable a JSON feed export in settings.py:

FEED_FORMAT = "json"
FEED_URI = "quotes.json"

This will store all scraped items into a JSON file. Other supported formats include JSON Lines, CSV, XML etc. Feed exports provide a quick way to save scraped data.
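On newer Scrapy releases (2.1+), the FEEDS setting is the preferred way to configure exports, and FEED_FORMAT/FEED_URI are deprecated aliases for it. A minimal equivalent:

# settings.py
FEEDS = {
    "quotes.json": {"format": "json"},
}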

Item Pipelines

For more advanced item handling, we can use item pipelines. Pipelines are classes that implement a process_item() method. Each item goes through the pipelines in order:

# pipelines.py
from scrapy.exceptions import DropItem

class ValidateQuotesPipeline:

    def process_item(self, item, spider):
        if 'text' not in item:
            raise DropItem("Missing quote text!")
        return item

Enable it in settings.py:

ITEM_PIPELINES = {
     'myproject.pipelines.ValidateQuotesPipeline': 300
}

Multiple pipelines can be enabled and run in order of their priority numbers. They allow cleaning, validation, deduplication and so on (a deduplication sketch follows at the end of this section). There are many other options for storing Scrapy items:

  • Relational databases like MySQL, Postgres
  • NoSQL databases like MongoDB
  • Amazon S3, Google Cloud Storage
  • Bulk import into Excel, CSV
  • JSON, XML files
  • Data libraries like pandas and SQLAlchemy

This makes Scrapy very flexible when it comes to handling scraped data.
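As an illustration of the deduplication case mentioned above, here is a minimal sketch of a pipeline that drops quotes whose text has already been seen (the class name is hypothetical); enable it in ITEM_PIPELINES just like the validation pipeline:

# pipelines.py
from scrapy.exceptions import DropItem

class DuplicatesPipeline:

    def __init__(self):
        # Track quote texts seen during the crawl
        self.seen_texts = set()

    def process_item(self, item, spider):
        text = item.get('text')
        if text in self.seen_texts:
            raise DropItem(f"Duplicate quote: {text!r}")
        self.seen_texts.add(text)
        return item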

Fine-tuning Scrapy Settings

Scrapy comes with good defaults, but tuning these settings can greatly improve scraping performance.

1. Handling Robots.txt

By default, Scrapy obeys robots.txt rules. To ignore them:

ROBOTSTXT_OBEY = False

2. Increase Concurrency

Control the number of concurrent requests (the default is 16):

CONCURRENT_REQUESTS = 100

3. Auto-throttle Speed

Enable AutoThrottle to adapt the crawl rate automatically and reduce the risk of bans at high concurrency:

AUTOTHROTTLE_ENABLED = True

4. Enable Caching

Speed up debugging and development:

HTTPCACHE_ENABLED = True

5. Set User-Agent

Set a custom user-agent so your requests identify themselves (or mimic a browser):

USER_AGENT = 'MyCustomAgent v1.2'

Scrapy's built-in UserAgentMiddleware applies this single value to every request. Rotating user-agents to prevent blocks requires a custom downloader middleware or a third-party package; a sketch follows.
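A minimal sketch of such a rotating middleware is shown below; the class name, module path, and user-agent strings are illustrative placeholders, not part of Scrapy itself:

# myproject/middlewares.py
import random

# Placeholder user-agent strings; use real, current browser strings in practice
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

class RotateUserAgentMiddleware:

    def process_request(self, request, spider):
        # Assign a random user agent to every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

Enable it in settings.py and disable the built-in UserAgentMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}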

Take a look at the built-in settings reference to see all available options. Tweaking these settings goes a long way in building robust scrapers.

Scraping JavaScript Websites

By default, Scrapy only sees static HTML content. To scrape dynamic JavaScript sites, we need to integrate a browser rendering engine. Here are 3 popular options:

Splash

Splash is a lightweight JavaScript rendering service that integrates with Scrapy through the scrapy-splash package:

from scrapy_splash import SplashRequest

yield SplashRequest(url="http://quotes.toscrape.com/js", callback=self.parse,
    args={'wait': 2})

  • SplashRequest routes the request through Splash so JavaScript is rendered.
  • wait=2 pauses rendering to allow content to load.
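Splash runs as a separate service (commonly started via Docker), and scrapy-splash needs a few settings to hook into Scrapy. The snippet below follows the scrapy-splash documentation; verify the exact values against the version you install:

# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'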

Playwright

Playwright provides full browser automation through Chromium or Firefox:

from playwright.sync_api import sync_playwright

def start_requests(self):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://quotes.toscrape.com/js")
        html = page.content()
        browser.close()
        # Pass html to Scrapy
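The synchronous example above blocks Scrapy's event loop, so for anything beyond quick experiments the scrapy-playwright package is usually a better fit, since it lets Scrapy itself drive Playwright. A rough sketch based on that project's documentation (check its README for the settings required by your version):

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, request pages through Playwright via the request meta
yield scrapy.Request("http://quotes.toscrape.com/js", meta={"playwright": True},
                     callback=self.parse)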

Selenium

Selenium drives a real Chrome/Firefox browser:

from selenium import webdriver

def start_requests(self):
    driver = webdriver.Chrome()
    driver.get("http://quotes.toscrape.com/js")
    html = driver.page_source
    driver.quit()
    # Pass html to Scrapy
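One way to "pass the HTML to Scrapy" in the Playwright and Selenium examples above is to wrap it in a Selector, so the usual CSS/XPath extraction keeps working. A brief sketch:

from scrapy.selector import Selector

sel = Selector(text=html)
for quote in sel.css('div.quote'):
    yield {'text': quote.css('span.text::text').get()}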

While slower, browser automation enables the scraping of complex JavaScript sites.

Common Scraping Challenges

Here are some common challenges when scraping websites:

Handling Cookies

  • Extract cookies from responses
  • Store in cookiejar and resend on requests
  • Use the cookiejar meta key on Request to keep separate sessions (see the sketch below)
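Scrapy stores and resends cookies automatically by default. The cookiejar meta key lets one spider keep several independent cookie sessions; a minimal sketch of two spider methods (parse_page is a hypothetical follow-up callback):

def start_requests(self):
    # Give each session its own cookie jar
    for i, url in enumerate(self.start_urls):
        yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse)

def parse(self, response):
    # The jar id must be passed along on follow-up requests
    yield response.follow('/page/2/', callback=self.parse_page,
                          meta={'cookiejar': response.meta['cookiejar']})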

Dealing with Logins

  • Submit login form request
  • Pass cookies to authenticated area
  • Use FormRequest and Scrapy's built-in cookie handling (see the sketch below)
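A minimal login sketch using FormRequest.from_response, which fills in the form found in the response; the form field names and credentials are assumptions about the target site:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        # Submit the login form; Scrapy keeps the session cookies afterwards
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Continue scraping pages that require authentication
        ...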

Scraper Blocking

  • Randomize user-agents
  • Use proxy rotation middleware
  • Slow down with DOWNLOAD_DELAY

JS Heavy Sites

  • Integrate Splash/Playwright/Selenium
  • Isolate API calls
  • Reverse engineer frontend code

Limited Crawl Depth

  • Raise DEPTH_LIMIT in settings
  • Bypass the duplicate filter for specific requests with dont_filter=True (see the snippet below)
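For reference, the relevant knobs look like this (DEPTH_LIMIT defaults to 0, which means no limit):

# settings.py
DEPTH_LIMIT = 5

# In a spider: re-request a URL even if it was already seen
yield scrapy.Request(url, callback=self.parse, dont_filter=True)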

Scraping API Data

  • Study API calls made by site
  • Reverse engineer and replicate
  • Use JSON, XML processors for parsing
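Once an internal API endpoint has been identified (the URL below is hypothetical), the JSON response can be parsed directly. response.json() is available on Scrapy 2.2+; older versions can use json.loads(response.text):

def start_requests(self):
    # Hypothetical JSON endpoint discovered in the browser's network tab
    yield scrapy.Request('http://example.com/api/items?page=1',
                         callback=self.parse_api)

def parse_api(self, response):
    data = response.json()
    for item in data.get('results', []):
        yield {'name': item.get('name')}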

Getting scrapers to work reliably takes experience. Consulting Scrapy's extensive documentation and community resources is highly recommended when tackling these common challenges.

Production Scrapy Deployments

Here are some best practices for production scrapy deployments:

  • Multiple Spiders – Break up scraping activities into multiple smaller spiders
  • Scrapyd – Deploy spiders to production server via Scrapyd service
  • Scale Horizontally – Distribute spiders across multiple processes/servers
  • Persist Logs – Write spider logs to a file (LOG_FILE) for debugging
  • Monitor Performance – Track metrics like pages scraped, errors etc
  • Use Redis Queue – Coordinate distributed spiders via a shared queue (e.g. scrapy-redis)
  • Regular Restarts – Periodically restart long-running spiders
  • Deploy on Scrapy Cloud – Leverage Zyte's managed cloud platform (formerly Scrapinghub) designed for Scrapy

Following these practices allows reliably scaling Scrapy spiders and maintaining robust scraping infrastructure.

Scrapy vs Other Web Scraping Libraries

                      Scrapy               Requests       Beautiful Soup   Selenium
Type                  Crawling Framework   HTTP Library   HTML Parser      Browser Automation
Speed                 Very Fast            Fast           Fast             Slow
Proxy Support         Yes                  Yes            No               Yes
JavaScript            No                   No             No               Yes
Concurrency           High                 Medium         Low              Medium
Scalability           Excellent            Good           Fair             Difficult
Scraper Development   Medium               Simple         Simple           Complex
Anti-Scraping         Hard                 Hard           Hard             Medium

As shown in this comparison, Scrapy provides the best combination of speed, scalability, and robustness for large web scraping projects.

Conclusion

Scrapy stands out as a high-grade framework for crafting web scrapers, blending an asynchronous architecture, modular components, and a rich ecosystem to streamline large-scale data scraping. While mastering Scrapy involves a learning curve, the investment pays off in reliably handling complex scraping tasks at scale.

This guide is tailored to offer practical insights for Scrapy beginners while also delving into more sophisticated aspects, making it an essential resource for those looking to enhance their web scraping expertise with Scrapy.

John Rooney

I am John Watson Rooney, a self-taught Python developer and content creator focused on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
