Web Scraping with Playwright and Python

Web scraping is the process of extracting data from websites automatically. It allows you to collect large amounts of data that would be difficult or impossible to gather manually. Python is one of the most popular languages for web scraping due to its simplicity and vast ecosystem of scraping tools. Playwright is a modern browser automation library that makes it easy to scrape dynamic websites built with JavaScript.

In this comprehensive guide, you'll learn how to use Playwright with Python to scrape complex web pages that traditional scraping libraries struggle with.

Why Playwright for Web Scraping?

Modern websites are highly dynamic – content loads asynchronously via AJAX requests, infinite scrolling fetches new data as you reach the bottom, and DOM elements update without page reloads. Traditional Python scraping libraries like Beautiful Soup and Scrapy are designed for static content and struggle with dynamic pages that require JavaScript execution.

Playwright controls an actual browser such as Chromium or Firefox. It executes JavaScript fully and renders pages just as a real user's browser would. This makes it ideal for scraping dynamic websites. Other advantages of using Playwright for web scraping:

  • Works across browsers (Chromium, Firefox, WebKit)
  • Supports multiple languages (Python, JavaScript, C#, Java)
  • Interacts with pages by clicking buttons and filling forms
  • Mocks geolocation, device types, and network throttling
  • Pairs with stealth plugins to avoid bot detection
  • Intercepts network requests

Overall, Playwright provides a very robust browser automation solution for Python web scraping.

Installation

Install Playwright for Python with:

pip install playwright

Then download the browser binaries – Chromium, Firefox, and WebKit are available, but we only need Chromium for this tutorial:

playwright install chromium

The playwright package already ships with a synchronous API that allows simpler scraping scripts, so there's nothing extra to install.

Optionally install IPython for experimenting in an interactive shell:

pip install ipython

Scraping Basics

The basic steps for scraping with Playwright:

  1. Launch a browser
  2. Navigate to URL
  3. Wait for page load
  4. Extract data
  5. Rinse and repeat

Let's go through a simple example to scrape the Hacker News homepage.

Launch Browser

Launch a headless Chromium browser:

from playwright.sync_api import sync_playwright

browser = sync_playwright().start().chromium.launch(headless=True)

Headless mode runs the browser in the background without opening a GUI window.

Create Page

Open a new browser page/tab to navigate:

page = browser.new_page()

We'll execute all our scraping code in the context of this page.

Navigate to URL

Use page.goto() to navigate to a URL:

page.goto("https://news.google.com/")

This loads www.news.google.com on our browser page.

Wait for Page Load

After navigation, we need to wait for the page to fully load before scraping:

page.wait_for_load_state('networkidle') # wait for AJAX/XHR requests

Other wait options:

  • load – initial HTML document
  • domcontentloaded – HTML & DOM parsed
  • networkidle – no network connections for 500ms
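
On dynamic pages it's often more dependable to also wait for the specific element you plan to scrape. A small sketch, pairing a load state with a selector wait (the selector matches the story links used below):

page.wait_for_load_state('domcontentloaded') # DOM parsed
page.wait_for_selector('.titleline > a')     # wait until stories actually render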

Extract Data

We can now extract page data using Playwright's query selectors:

for link in page.query_selector_all('.titleline > a'):
  title = link.text_content()
  url = link.get_attribute('href')
  print(title, url)

This prints all the story titles and URLs into the terminal. The full code so far:

from playwright.sync_api import sync_playwright

browser = sync_playwright().start().chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://news.google.com/")
page.wait_for_load_state('networkidle') 

for link in page.query_selector_all('.storylink'):
  title = link.text_content()
  url = link.get_attribute('href')
  print(title, url)

This covers the scraping basics with Playwright – launch the browser, navigate to URLs, wait for load, and extract data. Pretty straightforward! Next, let's look at more advanced scraping capabilities.

Scraping Dynamic Content

A key strength of Playwright is interacting with dynamic JavaScript-heavy sites, like those using infinite scroll, dropdowns, popups, etc. Let's scrape an infinite scroll page – the Unsplash photo feed. As you scroll down, it lazily loads more images via AJAX requests. Playwright can automate the scrolling to scrape all items.

Infinite Scroll Scraping

The main steps:

  1. Scroll to bottom of page
  2. Wait for new photos to load
  3. Extract updated photo data
  4. Repeat until no more photos

Here's how it looks in Python:

from playwright.sync_api import sync_playwright
import time

browser = sync_playwright().start().chromium.launch()
page = browser.new_page()
page.goto("https://unsplash.com")

scroll_delay = 2
prev_count = 0

while True:
  print("Scrolling...")

  page.evaluate("""
    window.scrollBy(0, window.innerHeight);
  """)

  time.sleep(scroll_delay)

  photos = page.query_selector_all('.photo') # illustrative selector
  print(f"Found {len(photos)} photos")

  # stop once scrolling no longer loads new photos
  if len(photos) == prev_count:
    break
  prev_count = len(photos)

print("Done!")
browser.close()

The main logic:

  • page.evaluate() runs JavaScript in the browser context, so we can call window.scrollBy() to scroll down one viewport height
  • Scrolling toward the bottom triggers the lazy loading mechanism
  • After each scroll, wait a few seconds with time.sleep() for images to lazy-load
  • Count the photos with query_selector_all() and compare against the previous count
  • If the count didn't grow, we've reached the end and can break the loop

This allows scraping infinitely long pages by automating the scrolling!

Handling Dropdowns

Another common dynamic element is the dropdown selection. Let's see how to deal with those using the Airbnb site. Here's how to open the guests dropdown and print the options (Airbnb's obfuscated class names change often, so treat these selectors as illustrative):

page.goto("https://www.airbnb.com")

page.click("._1k463rt") # click guests field

options = page.query_selector_all("._gig1e7")

for option in options:
  print(option.text_content())

The steps:

  • Use page.click() to open the dropdown
  • Fetch all the ._gig1e7 options
  • Print text content of each option

This allows interaction with dropdowns and selecting options programmatically.
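
For native <select> elements, Playwright also provides a dedicated select_option() method. A minimal sketch, where "#guests" is a hypothetical selector:

page.select_option("#guests", "2")               # select by option value
page.select_option("#guests", label="2 guests")  # or by visible label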

Scraping Iframes

Some sites load content in iframes that are embedded as separate documents. To scrape iframes, we first need to switch page context into the iframe:

page.goto("https://somesite.com") 

# wait for iframe to load
iframe = page.wait_for_selector("iframe")

# switch context into iframe
frame = iframe.content_frame()

# now can extract data from inside frame
texts = frame.query_selector_all("p")

The key points:

  • Wait for iframe to load using wait_for_selector()
  • Get the iframe content frame with content_frame()
  • Switch context to interact inside the iframe
  • Query selectors now run within iframe document

This allows scraping data from complex pages with nested iframe documents.

Dealing with Bot Detection

A downside of automation is that websites can detect scraping bots and block them. Playwright, together with a few techniques and community plugins, can mimic human behavior and be less detectable when scraping.

Stealth Mode

Playwright has no built-in stealth flag; instead, the community playwright-stealth plugin (pip install playwright-stealth) patches the most common automation signals. Apply it to each page, alongside a realistic browser context:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

browser = sync_playwright().start().chromium.launch(headless=False)
context = browser.new_context(
  viewport={"width": 1920, "height": 1080},
  color_scheme="light", # or "dark"
  ignore_https_errors=True, # skip cert checks
)
page = context.new_page()
stealth_sync(page) # patch automation fingerprints

With the stealth patches applied, the browser will:

  • Mask the headless user agent / Chrome version
  • Override device descriptions
  • Disable extensions, plugins, and web notifications
  • Hide the navigator.webdriver flag and related APIs
  • Spoof or block sensors like motion, touch, etc.

This makes Playwright automation blend in more like a real user browser.

Slowing Down Interactions

Bots can be detected by fast, inhuman interaction speeds. Add delays to mimic human-level pacing:

from time import sleep

sleep(1) # delay between actions

page.click("button", delay=500) # hold the click for 500ms (mousedown to mouseup)

page.type("input", "text", delay=100) # delay between key presses

  • Add sleeps between scraping steps
  • Use the delay arguments of click and type to slow them down
  • Vary the delays randomly to add natural human chaos (see the sketch below)

Slowing down makes scraping behavior appear more life-like.
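
To vary the delays, random jitter works well. A small sketch (the bounds are arbitrary):

import random
from time import sleep

def human_delay(min_s=0.5, max_s=2.5):
  # random pause so actions don't fire at a fixed, robotic interval
  sleep(random.uniform(min_s, max_s))

page.click("button")
human_delay()
page.type("input", "text", delay=random.randint(50, 200))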

Scrolling and Mouse Movement

Smooth natural scrolling and mouse movements also help avoid bot patterns:

# smooth scroll 
page.evaluate("""
  window.scrollTo({
    top: document.body.scrollHeight,
    left: 0, 
    behavior: 'smooth'
  });
""")

# humanized mouse movement broken into many small steps
page.mouse.move(200, 300, steps=100)

  • Use behavior: 'smooth' for natural scrolling
  • Slow down mouse movements over multiple micro-steps

Browser Profiles

Having many scrapers share one browser profile is suspicious. Create a new user profile for each page instance:

context = sync_playwright().start().chromium.launch_persistent_context(
  user_data_dir="/tmp/new_profile", # unique dir per scraper
)

page = context.new_page()

  • Set a custom user_data_dir for a new profile
  • launch_persistent_context() returns a browser context backed by that profile, which persists across runs
  • Now each scraper instance has its own profile

Dedicated profiles mimic real users better.

Proxy Servers & Residential IPs

Using proxy servers and residential IP addresses helps avoid IP based blocking. Here's how to route Playwright through a proxy:

browser = sync_playwright().start().chromium.launch(
  proxy={
    "server": "http://proxy:8080",
    "username": "user",
    "password": "pass"
  }
)

  • Set the proxy dict with your provider's credentials
  • Requests will route through the proxy server

Scraping from many different IPs and locations makes you harder to detect and block.
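
Here's a sketch of rotating through a small proxy pool, launching one browser per proxy (the server addresses and credentials are placeholders):

from playwright.sync_api import sync_playwright

proxies = ["http://proxy1:8080", "http://proxy2:8080"]

pw = sync_playwright().start()
for server in proxies:
  browser = pw.chromium.launch(
    proxy={"server": server, "username": "user", "password": "pass"}
  )
  # ... scrape a batch of pages through this proxy's IP ...
  browser.close()
pw.stop()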

Advanced Techniques

Let's look at some advanced tricks to extend Playwright scraping capabilities.

Executing Custom JavaScript

For maximum flexibility, we can directly run any JavaScript code:

result = page.evaluate("""
  // can access full JS DOM API
  const links = Array.from(document.querySelectorAll('a'));
  return links.map(link => link.href)
""")

print(result)

Advantages:

  • Access any JavaScript API from browser context
  • Return data back to Python context
  • Avoid limitations of Playwright's built-in selectors

Allows doing virtually anything a browser can do!

Blocking Resources

By default Playwright loads all assets – images, CSS, ads, trackers, etc. We can block requests to speed up scraping and reduce bandwidth:

# block images
page.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())

# block analytics scripts
page.route(
  "https://www.google-analytics.com/analytics.js",
  lambda route: route.abort()
)

Some ways to filter requests:

  • File types – .png, .jpg, etc.
  • Domain names – google-analytics.com
  • URL patterns – /tracking/, /advert/
  • Resource types – image, media, font (see the sketch below)

Streamlines scraping by avoiding unnecessary downloads.
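
Resource-type filtering, for example, fits in a single route handler. A sketch that blocks all images, media, and fonts:

BLOCKED_TYPES = {"image", "media", "font"}

def block_heavy(route):
  # resource_type reports what the browser requested (image, script, font, ...)
  if route.request.resource_type in BLOCKED_TYPES:
    route.abort()
  else:
    route.continue_()

page.route("**/*", block_heavy)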

Intercepting Requests

Intercept network requests to monitor or modify them:

def intercept(route):
  request = route.request

  # log the request
  print(request.method, request.url)

  # mock a response for the token endpoint
  if request.url == 'https://api.site.com/token':
    route.fulfill(body='{"access": "1234"}')
  else:
    # headers can't be mutated in place – pass a modified copy to continue_()
    headers = {**request.headers, 'user-agent': 'MyBot 1.0'}
    route.continue_(headers=headers)

page.route('**/*', intercept)

Use cases:

  • Log requests for debugging
  • Modify headers
  • Mock API responses
  • Replay requests to cache data
  • Decode encrypted requests

Powerful for analyzing network activity.

Browser Contexts

Playwright pages created in the same browser context share state like cookies and local storage. For isolation, create each page in its own context:

# each new_context() call is isolated from the others
context = browser.new_context()
page = context.new_page()

# a context can also start from saved cookies/local storage
context = browser.new_context(storage_state="/tmp/storage.json")

Benefits:

  • Dedicated cookies, caches, settings per context
  • Simulate different users/sessions
  • Switch contexts to reuse browser

Helpful when you need stronger separation between page instances.
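
Contexts also make it easy to persist and reuse a session, e.g. saving cookies after a login. A brief sketch (the file path is arbitrary):

# save the current context's cookies and local storage to disk
context.storage_state(path="/tmp/session.json")

# later, restore the session in a fresh context
context = browser.new_context(storage_state="/tmp/session.json")
page = context.new_page()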

Python Integration

While Playwright provides the browser automation, we'll want to integrate it with Python libraries to build robust scraping pipelines.

Downloading Files

Use Python libraries like requests for file downloads:

import shutil
import requests

urls = page.evaluate("""() => {
  const images = document.querySelectorAll('img');
  return Array.from(images).map(img => img.src);
}""")

for url in urls:
  # stream image downloads
  response = requests.get(url, stream=True)
  response.raw.decode_content = True

  with open(f"images/{url.split('/')[-1]}", 'wb') as file:
    shutil.copyfileobj(response.raw, file)

  • Use JS to extract resource URLs from the page
  • Stream the downloads using requests
  • Save the resulting files to disk

This leverages Python's better file-handling capabilities.

Parsing Data

For parsing HTML we can use Python libraries like Beautiful Soup:

from bs4 import BeautifulSoup

html = page.content()
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('h1').text)

Benefits of Beautiful Soup for parsing:

  • A mature and robust parsing library
  • Very flexible querying API
  • Ability to manipulate the DOM tree

Great combination for extraction + parsing.

Asynchronous Support

Playwright has async APIs for concurrent scraping:

import asyncio
from playwright.async_api import async_playwright

async def scrape(playwright):
  browser = await playwright.chromium.launch()
  page = await browser.new_page()
  # ... scraping logic ...
  await browser.close()

async def main():
  async with async_playwright() as playwright: 
    await scrape(playwright)

asyncio.run(main())

With asyncio:

  • Launch browsers concurrently
  • Scale up to many pages
  • Faster overall throughput

Integrates well with other Python async libraries.
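
For example, several pages can be scraped concurrently from a single browser with asyncio.gather (the URLs are placeholders):

import asyncio
from playwright.async_api import async_playwright

async def scrape_title(browser, url):
  page = await browser.new_page()
  await page.goto(url)
  title = await page.title()
  await page.close()
  return title

async def main():
  async with async_playwright() as pw:
    browser = await pw.chromium.launch()
    urls = ["https://example.com", "https://example.org"]
    # all pages run concurrently within one browser
    titles = await asyncio.gather(*(scrape_title(browser, u) for u in urls))
    print(titles)
    await browser.close()

asyncio.run(main())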

Retries & Failures

Robust scrapers need to handle intermittent failures and retry:

import time
from playwright.sync_api import Error

retry_count = 0

while True:
  try:
    # ... scraping logic ...
    break # success
  except Error as exc:
    print(f"Error: {exc}")

    if retry_count < 3:
      retry_count += 1
      print(f"Retrying {retry_count}...")
      time.sleep(5 * 2 ** retry_count) # exponential backoff
    else:
      print("Max retries reached")
      raise

Key points for resilience:

  • Wrap scraping in try/except
  • Catch Playwright's Error exceptions
  • Track retry counts
  • Exponential backoff sleeps
  • Reraise after max retries

This provides a robust scraping loop with failure handling.

Conclusion

Some key takeaways:

  • Playwright provides a powerful Python browser automation solution for dynamic scraping.
  • It launches real Chrome/Firefox browsers and executes JavaScript.
  • Interact with pages by scrolling, clicking, typing, etc.
  • Extract data using query selectors or JavaScript.
  • Options for stealth, throttling, mobile simulation.
  • Mature API for flexibility.
  • Build robust pipelines by integrating Playwright with Python data science and web scraping stacks.

Hopefully, this gives you a comprehensive overview of web scraping in Python with Playwright!

John Rooney

John Watson Rooney is a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
