Web scraping is the process of extracting data from websites automatically. It allows you to collect large amounts of data that would be difficult or impossible to gather manually. Python is one of the most popular languages for web scraping due to its simplicity and vast ecosystem of scraping tools. Playwright is a modern browser automation library that makes it easy to scrape dynamic websites built with JavaScript.
In this comprehensive guide, you'll learn how to use Playwright with Python to scrape complex web pages that traditional scraping libraries struggle with.
Why Playwright for Web Scraping?
Modern websites are highly dynamic – content loads asynchronously via AJAX requests, infinite scrolling fetches new data as you reach the bottom, and DOM elements update without page reloads. Traditional Python scraping libraries like Beautiful Soup and Scrapy are designed for static content. They struggle with dynamic pages that require JavaScript execution.
Playwright controls an actual browser such as Chrome or Firefox. It executes full JavaScript and renders pages the way a real user sees them. This makes it ideal for scraping dynamic websites. Other advantages of using Playwright for web scraping:
- Works across browsers (Chrome, Firefox, WebKit)
- Supports multiple languages (Python, JavaScript, C#, Java)
- Interact with pages by clicking buttons, filling forms
- Mock geo-location, device types, throttling
- Stealth patches (via community plugins) to avoid bot detection
- Network request interception
Overall, Playwright provides a very robust browser automation solution for Python web scraping.
Installation
Install Playwright for Python with:
pip install playwright
This installs the Playwright package but not the browsers themselves. Download a browser binary next – we only need Chromium for this tutorial:
playwright install chromium
The synchronous API used throughout this guide ships with the playwright package itself, so no separate install is required.
Optionally install IPython for experimenting in an interactive shell:
pip install ipython
Scraping Basics
The basic steps for scraping with Playwright:
- Launch a browser
- Navigate to URL
- Wait for page load
- Extract data
- Rinse and repeat
Let's go through a simple example to scrape the Hacker News homepage.
Launch Browser
Launch a headless Chromium browser:
from playwright.sync_api import sync_playwright

browser = sync_playwright().start().chromium.launch(headless=True)
Headless mode runs the browser in the background without opening a GUI window.
Create Page
Open a new browser page/tab to navigate:
page = browser.new_page()
We'll execute all our scraping code in the context of this page.
Navigate to URL
Use page.goto() to navigate to a URL:
page.goto("https://news.google.com/")
This loads www.news.google.com on our browser page.
Wait for Page Load
After navigation, we need to wait for the page to fully load before scraping:
page.wait_for_load_state('networkidle') # wait for AJAX/XHR requests
Other wait options:
- load – initial HTML document loaded
- domcontentloaded – HTML parsed & DOM built
- networkidle – no network connections for 500 ms
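In practice you can also skip load-state heuristics and wait for a concrete element instead; a minimal sketch using the Hacker News selector from the example below:
# wait until the DOM is parsed (subresources may still be loading)
page.wait_for_load_state('domcontentloaded')

# or wait for a specific element to appear
page.wait_for_selector('.storylink')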
Extract Data
We can now extract page data using Playwright's query selectors:
for link in page.query_selector_all('.storylink'):
    title = link.text_content()
    url = link.get_attribute('href')
    print(title, url)
This prints all the story titles and URLs into the terminal. The full code so far:
from playwright.sync_api import sync_playwright

browser = sync_playwright().start().chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://news.ycombinator.com/")
page.wait_for_load_state('networkidle')

for link in page.query_selector_all('.storylink'):
    title = link.text_content()
    url = link.get_attribute('href')
    print(title, url)
This covers the scraping basics with Playwright – launch the browser, navigate to URLs, wait for load, and extract data. Pretty straightforward! Next, let's look at more advanced scraping capabilities.
Scraping Dynamic Content
A key strength of Playwright is interacting with dynamic JavaScript-heavy sites, like those using infinite scroll, dropdowns, popups, etc. Let's scrape an infinite scroll page – Unsplash photo feed. As you scroll down it lazily loads more images via AJAX requests. Playwright can automate scrolling to scrape all items.
Infinite Scroll Scraping
The main steps:
- Scroll to bottom of page
- Wait for new photos to load
- Extract updated photo data
- Repeat until no more photos
Here's how it looks in Python:
import time

from playwright.sync_api import sync_playwright

browser = sync_playwright().start().chromium.launch()
page = browser.new_page()
page.goto("https://unsplash.com")

scroll_delay = 2
photo_count = 0

while True:
    print("Scrolling...")
    page.evaluate("window.scrollBy(0, window.innerHeight);")
    time.sleep(scroll_delay)

    photos = page.query_selector_all('.photo')
    print(f"Found {len(photos)} photos")

    # stop once a scroll yields no new photos
    if len(photos) == photo_count:
        break
    photo_count = len(photos)

print("Done!")
The main logic:
- Use JavaScript window.scrollBy() to scroll down the length of the current viewport
- Wait a few seconds for images to lazy-load
- Check how many photos are now on the page using query_selector_all()
- If the count didn't grow, we reached the end and can break the loop
Some points:
- page.evaluate() runs JS in the browser context, so we can access the page's window object
- Scrolling to the bottom triggers the lazy-loading mechanism
- After each scroll, wait briefly for loading using time.sleep()
- Check whether new photos appeared with query_selector_all()
- If the photo count stops growing, we scrolled to the end
This allows scraping infinitely long pages by automating the scrolling!
Handling Dropdowns
Another common dynamic element is dropdown selections. Let's see how to deal with those using the Airbnb site. Here's how to open the guests dropdown and print the options:
page.goto("https://www.airbnb.com") page.click("._1k463rt") # click guests field options = page.query_selector_all("._gig1e7") for option in options: print(option.text_content())
The steps:
- Use page.click() to open the dropdown
- Fetch all the ._gig1e7 options
- Print the text content of each option
This allows interaction with dropdowns and selecting options programmatically.
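For native <select> elements there's also a built-in helper; a small sketch with a hypothetical selector and value:
# choose an option from a native <select> (selector/value are placeholders)
page.select_option("select#guests", "2")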
Scraping Iframes
Some sites load content in iframes that are embedded as separate documents. To scrape iframes, we first need to switch page context into the iframe:
page.goto("https://somesite.com") # wait for iframe to load iframe = page.wait_for_selector("iframe") # switch context into iframe frame = iframe.content_frame() # now can extract data from inside frame texts = frame.query_selector_all("p")
The key points:
- Wait for the iframe to load using wait_for_selector()
- Get the iframe's content frame with content_frame()
- Switch context to interact inside the iframe
- Query selectors now run within iframe document
This allows scraping data from complex pages with nested iframe documents.
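Alternatively, you can look a frame up directly on the page; a brief sketch, assuming the iframe is named "content":
# fetch a frame by name (hypothetical name)
frame = page.frame(name="content")
if frame:
    texts = [p.text_content() for p in frame.query_selector_all("p")]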
Dealing with Bot Detection
A downside of automation is that websites can detect scraping bots and block them. Playwright provides ways to mimic human behavior and be less detectable when scraping.
Stealth Mode
Playwright has no built-in stealth switch, but you can reduce obvious automation signals through context options and apply community stealth patches on top:
browser = sync_playwright().start().chromium.launch(headless=False)
context = browser.new_context(
    viewport={"width": 1920, "height": 1080},  # common desktop resolution
    color_scheme="light",  # or "dark"
    ignore_https_errors=True,  # skip cert checks
)
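The stealth patches themselves come from the community playwright-stealth package (pip install playwright-stealth), not from Playwright itself; a minimal sketch assuming that package is installed:
from playwright_stealth import stealth_sync

page = context.new_page()
stealth_sync(page)  # patches navigator.webdriver and other automation tells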
With stealth patches applied, the browser will typically:
- Mask its user agent / Chrome version
- Override device descriptions
- Disable extensions, plugins, web notifications
- Hide WebDriver flags and APIs
- Spoof/block sensors like motion, touch, etc
This makes Playwright automation blend in more like a real user browser.
Slowing Down Interactions
Bots can be detected by fast, inhuman interaction speeds. Add delays to mimic human-level speeds:
from time import sleep

sleep(1)  # delay between actions

page.click("button", delay=500)  # hold mousedown for 500 ms before releasing
page.type("input", "text", delay=100)  # delay between key presses
- Add sleeps between scraping steps
- Use delay args for click, type to slow them down
- Randomize delays so actions aren't evenly spaced (see the sketch below)
Slowing down makes scraping behavior appear more life-like.
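A small sketch of randomized pacing (the ranges are arbitrary):
import random
import time

def human_pause(low=0.5, high=2.0):
    # sleep a random interval so actions aren't evenly spaced
    time.sleep(random.uniform(low, high))

page.click("button")
human_pause()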
Scrolling and Mouse Movement
Smooth natural scrolling and mouse movements also help avoid bot patterns:
# smooth scroll
page.evaluate("""
    window.scrollTo({
        top: document.body.scrollHeight,
        left: 0,
        behavior: 'smooth'
    });
""")

# humanized mouse movement
page.mouse.move(200, 300, steps=100)
- Use behavior: 'smooth' for natural scrolling
- Slow down mouse movements over multiple micro-steps
Browser Profiles
Having many scrapers share one browser profile is suspicious. Create a new user profile for each scraper instance:
# launch_persistent_context returns a BrowserContext tied to the profile
context = sync_playwright().start().chromium.launch_persistent_context(
    user_data_dir="/tmp/new_profile",  # unique temp dir
)
page = context.new_page()
- Set a custom user_data_dir for the new profile
- Launch a persistent context to reuse the profile
- Now each scraper instance has its own profile
Dedicated profiles mimic real users better.
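To guarantee unique profile directories, one option is Python's tempfile module; a brief sketch:
import tempfile

# fresh profile directory per scraper instance
profile_dir = tempfile.mkdtemp(prefix="pw_profile_")
context = sync_playwright().start().chromium.launch_persistent_context(
    user_data_dir=profile_dir,
)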
Proxy Servers & Residential IPs
Using proxy servers and residential IP addresses helps avoid IP based blocking. Here's how to route Playwright through a proxy:
browser = sync_playwright().start().chromium.launch(
    proxy={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
- Set the proxy dict with your provider's credentials
- Requests will route through the proxy server
Scraping from many different IPs and locations makes you harder to detect and block.
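Proxies can also be set per context, which lets you rotate IPs within a single browser; a sketch with placeholder addresses (note that with Chromium, older Playwright versions required launching the browser with a global proxy for per-context proxies to work):
# each context routes through its own proxy (addresses are placeholders)
context_us = browser.new_context(proxy={"server": "http://us-proxy:8080"})
context_eu = browser.new_context(proxy={"server": "http://eu-proxy:8080"})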
Advanced Techniques
Let's look at some advanced tricks to extend Playwright scraping capabilities.
Executing Custom JavaScript
For maximum flexibility, we can directly run any JavaScript code:
result = page.evaluate("""() => {
    // can access full JS DOM API
    const links = Array.from(document.querySelectorAll('a'));
    return links.map(link => link.href);
}""")
print(result)
Advantages:
- Access any JavaScript API from browser context
- Return data back to Python context
- Avoid limitations of Playwright's built-in selectors
Allows doing virtually anything a browser can do!
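page.evaluate() also accepts an argument passed from Python, which avoids fragile string interpolation; a small sketch:
# pass a Python value into the browser as the function argument
count = page.evaluate("sel => document.querySelectorAll(sel).length", "a")
print(count)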
Blocking Resources
By default Playwright loads all assets – images, CSS, ads, trackers, etc. We can block requests to speed up scraping and reduce bandwidth:
# block images
page.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())

# block analytics script
page.route(
    "https://www.google-analytics.com/analytics.js",
    lambda route: route.abort()
)
Some ways to filter requests:
- File types – .png, .jpg, etc.
- Domain names – google-analytics.com
- URL patterns – /tracking, /advert
- Resource types – image, media, font (see the sketch below)
Streamlines scraping by avoiding unnecessary downloads.
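For example, filtering on resource type blocks whole classes of requests regardless of URL; a short sketch:
def block_heavy(route):
    # abort images/media/fonts, let everything else through
    if route.request.resource_type in ("image", "media", "font"):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_heavy)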
Intercepting Requests
Intercept network requests to monitor or modify them:
def intercept(route):
    request = route.request

    # log request
    print(request.url)

    if request.url == 'https://api.site.com/token':
        # mock response
        route.fulfill(body='{"access": "1234"}')
    else:
        # forward with an updated header (headers can't be mutated in place)
        headers = {**request.headers, 'user-agent': 'MyBot 1.0'}
        route.continue_(headers=headers)

page.route('**/*', intercept)
Use cases:
- Log requests for debugging
- Modify headers
- Mock API responses
- Replay requests to cache data
- Decode encrypted requests
Powerful for analyzing network activity.
Browser Contexts
Pages within the same Playwright browser context share state like cookies and local storage. For isolation, create pages in separate contexts:
# fresh, isolated context (own cookies and storage)
context = browser.new_context()

# context pre-seeded from a saved storage-state file
context = browser.new_context(storage_state="/tmp/new_storage.json")
Benefits:
- Dedicated cookies, caches, settings per context
- Simulate different users/sessions
- Switch contexts to reuse browser
Helpful when you need stronger separation between page instances.
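To reuse a session (e.g., a login) across runs, a context's state can be saved to JSON; a short sketch, with an example path:
context = browser.new_context()
# ... log in, collect cookies ...
context.storage_state(path="/tmp/new_storage.json")  # reload later via storage_state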
Python Integration
While Playwright provides the browser automation, we'll want to integrate it with Python libraries to build robust scraping pipelines.
Downloading Files
Use Python libraries like requests for file downloads:
import shutil

import requests

urls = page.evaluate("""() => {
    const images = document.querySelectorAll('img');
    return Array.from(images).map(img => img.src);
}""")

for url in urls:
    # stream image downloads
    response = requests.get(url, stream=True)
    response.raw.decode_content = True
    with open(f"images/{url.split('/')[-1]}", 'wb') as file:
        shutil.copyfileobj(response.raw, file)
- Use JS to extract resource URLs from the page
- Stream downloads using requests
- Save the resulting files
This leverages Python's better file-handling capabilities.
Parsing Data
For parsing HTML we can use Python libraries like Beautiful Soup:
from bs4 import BeautifulSoup

html = page.content()
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)
Benefits of Beautiful Soup for parsing:
- A more mature and robust library
- Very flexible querying API
- Ability to manipulate DOM tree
Great combination for extraction + parsing.
Asynchronous Support
Playwright has async APIs for concurrent scraping:
import asyncio

from playwright.async_api import async_playwright

async def scrape(playwright):
    browser = await playwright.chromium.launch()
    page = await browser.new_page()
    # ... scraping logic ...
    await browser.close()

async def main():
    async with async_playwright() as playwright:
        await scrape(playwright)

asyncio.run(main())
With asyncio:
- Launch browsers concurrently
- Scale up to many pages
- Faster overall throughput
Integrates well with other Python async libraries.
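Building on that skeleton, here's a sketch of scraping several pages concurrently with asyncio.gather (the URLs are placeholders):
import asyncio

from playwright.async_api import async_playwright

async def scrape_title(browser, url):
    page = await browser.new_page()
    await page.goto(url)
    title = await page.title()
    await page.close()
    return title

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        urls = ["https://example.com", "https://example.org"]
        titles = await asyncio.gather(*(scrape_title(browser, u) for u in urls))
        await browser.close()
        print(titles)

asyncio.run(main())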
Retries & Failures
Robust scrapers need to handle intermittent failures and retries.
import time

from playwright.sync_api import Error  # public import, not playwright._impl

max_retries = 3
for attempt in range(1, max_retries + 1):
    try:
        # ... scraping logic ...
        break  # success, exit retry loop
    except Error as exc:
        print(f"Error: {exc}")
        if attempt == max_retries:
            print("Max retries reached")
            raise
        print(f"Retrying {attempt}...")
        time.sleep(5 * 2 ** (attempt - 1))  # exponential backoff
Key points for resilience:
- Wrap scraping in try/except
- Catch playwright Error exceptions (imported from playwright.sync_api)
- Track retry counts
- Use exponential backoff sleeps
- Re-raise after max retries
This provides a robust scraping loop with failure handling.
Conclusion
Some key takeaways:
- Playwright provides a powerful Python browser automation solution for dynamic scraping.
- It launches real Chrome/Firefox browsers and executes JavaScript.
- Interact with pages by scrolling, clicking, typing, etc.
- Extract data using query selectors or JavaScript.
- Options for stealth, throttling, mobile simulation.
- Mature API for flexibility.
- Build robust pipelines by integrating Playwright with Python data science and web scraping stacks.
Hopefully, this gives you a comprehensive overview of web scraping in Python with Playwright!