Why Playwright for Web Scraping?
Traditional scraping tools like requests and BeautifulSoup work well for static HTML pages. But many modern websites are JavaScript-heavy single-page applications (SPAs) that render content dynamically, so the HTML returned by a plain HTTP request often lacks the data you want. That's where Playwright comes in.
Playwright drives a real browser engine (Chromium, Firefox, or WebKit) that executes JavaScript, handles SPAs, and renders pages the way a user's browser would. It's fast, reliable, and has a Python API that's a pleasure to use.
Key advantages over Selenium
- Auto-wait mechanisms: Playwright automatically waits for elements to be ready before interacting with them
- Multiple browser contexts: run isolated parallel sessions inside a single browser instance
- Network interception: capture API calls and modify requests on the fly
- Better performance: typically faster execution and lower resource usage, since contexts share one browser process
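The network interception point can be illustrated with a response listener. This is a minimal sketch rather than a full recipe: `watch_json_responses` and `seen_urls` are names made up for the example, and it assumes a Playwright `page` object created elsewhere. It relies on `page.on("response", ...)`, `response.request.resource_type`, `response.headers`, and `response.url` from Playwright's Python API.

```python
from typing import Any


def watch_json_responses(page: Any, seen_urls: list[str]) -> None:
    """Record the URL of every XHR/fetch response that returns JSON."""

    def on_response(response: Any) -> None:
        # Only XHR/fetch traffic counts as an API call here
        is_api_call = response.request.resource_type in ("xhr", "fetch")
        content_type = response.headers.get("content-type", "")
        if is_api_call and "application/json" in content_type:
            seen_urls.append(response.url)

    # Playwright calls the handler once per response the page receives
    page.on("response", on_response)
```

Register the listener before navigating so that responses triggered by the initial page load are captured as well.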
Setting Up Your Project
First, install the dependencies:
pip install playwright
playwright install chromium

Here's a basic scraper structure:
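from playwright.sync_api import sync_playwright


def scrape_page(url: str) -> dict:
    """Load a page in headless Chromium and return its title and main text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
            )
            page = context.new_page()
            # "networkidle" waits until the network has been quiet for a short
            # period, which gives client-side rendering a chance to finish
            page.goto(url, wait_until="networkidle")

            # Extract data
            title = page.title()
            content = page.inner_text("main")
        finally:
            # Close the browser even if navigation or extraction fails
            browser.close()

    return {"title": title, "content": content}

Handling Anti-Bot Protection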
Modern websites employ various anti-bot measures. Here's how to handle the most common ones.
Rotating User Agents
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

# Assumes `browser` was launched as in scrape_page above
context = browser.new_context(
    user_agent=random.choice(USER_AGENTS),
    viewport={"width": 1920, "height": 1080},
)

Adding Delays and Human-Like Behavior
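Pauses between page loads are the simplest form of human-like behavior. The following is a minimal sketch of a randomized delay; the helper name `jittered_delay` and its `base` and `jitter` parameters are invented for this example, and sensible values depend on the site you are scraping.

```python
import random


def jittered_delay(base: float, jitter: float, rng: random.Random) -> float:
    """Return a pause of `base` seconds plus or minus up to `jitter` seconds."""
    # Clamp at zero so a large jitter can never produce a negative sleep
    return max(0.0, base + rng.uniform(-jitter, jitter))


rng = random.Random()
delay = jittered_delay(base=2.0, jitter=1.0, rng=rng)  # somewhere in [1.0, 3.0]
```

Pass the result to `time.sleep` in synchronous code or to `asyncio.sleep` in async code. Scrolling can be randomized in the same spirit: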
import asyncio
import random


async def human_like_scroll(page):
    """Simulate natural scrolling behavior."""
    for _ in range(random.randint(3, 7)):
        await page.evaluate(
            f"window.scrollBy(0, {random.randint(100, 300)})"
        )
        await asyncio.sleep(random.uniform(0.5, 1.5))

Proxy Rotation
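Rotation means handing a different proxy to each browser launch. One way to do that, sketched here under the assumption of a small fixed pool, is to cycle through the servers round-robin. The hostnames in `PROXY_SERVERS` and the helper name `proxy_settings` are hypothetical placeholders.

```python
from itertools import cycle
from typing import Iterator

# Hypothetical proxy endpoints; replace with your own pool
PROXY_SERVERS = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
]


def proxy_settings(servers: list[str]) -> Iterator[dict]:
    """Yield proxy dicts, cycling through `servers` indefinitely."""
    for server in cycle(servers):
        yield {"server": server}


proxies = proxy_settings(PROXY_SERVERS)
first = next(proxies)   # {"server": "http://proxy-1.example.com:8080"}
second = next(proxies)  # {"server": "http://proxy-2.example.com:8080"}
```

Each dict has the shape that the `proxy` argument expects, as in the launch call that follows.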
# `p` is the handle yielded by sync_playwright()
browser = p.chromium.launch(
    headless=True,
    proxy={"server": "http://proxy-server:8080"}
)

Error Handling and Retry Logic
Production scrapers need robust error handling:
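The core idea is retrying with exponential backoff. As a rough illustration of what a retry library does for you, here is a standard-library sketch. The name `retry_with_backoff` and its parameters are invented for the example, and this is not how tenacity is implemented.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            # Double the wait after each failure, capped at max_delay
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the waiting can be swapped out in tests. The tenacity version that follows expresses a similar policy declaratively with a decorator.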
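import logging

# Playwright raises its own TimeoutError, which the builtin one does not catch
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def scrape_with_retry(url: str) -> dict:
    try:
        return scrape_page(url)
    except PlaywrightTimeoutError:
        raise  # Timeouts are expected now and then; retry without logging
    except Exception as e:
        logger.error(f"Failed to scrape {url}: {e}")
        raise  # Re-raised so tenacity retries this as well

Structuring Extracted Data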
Always validate and clean your extracted data before storing it:
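Cleaning often starts before the data reaches a typed structure, because scraped prices usually arrive as display strings. A minimal sketch of that step, assuming US-style formatting (comma as thousands separator, dot as decimal point) and a helper name, `parse_price`, invented for the example:

```python
import re
from typing import Optional


def parse_price(raw: str) -> Optional[float]:
    """Convert text like "$1,299.99" to a float; return None if no number is found."""
    # Drop thousands separators, then take the first decimal number
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None
```

A `None` result is a signal to skip or flag the record rather than store a bad price. The cleaned values can then go into a typed container: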
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    name: str
    price: float
    url: str
    description: Optional[str] = None

    def to_dict(self) -> dict:
        return {
            "name": self.name.strip(),
            "price": round(self.price, 2),
            "url": self.url,
            "description": self.description,
        }

Scaling Your Scraper
For large-scale scraping, use async Playwright with multiple browser contexts:
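The heart of this pattern is a semaphore that caps how many pages are in flight at once. The stripped-down sketch below involves no browser at all: `run_bounded` and `fake_fetch` are hypothetical stand-ins, with a short sleep in place of real page work, so the cap can be observed directly.

```python
import asyncio


async def run_bounded(urls: list[str], max_concurrent: int = 5) -> tuple[list[str], int]:
    """Process `urls` concurrently and report the highest concurrency observed."""
    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def fake_fetch(url: str) -> str:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # Stand-in for page.goto + extraction
            active -= 1
            return f"scraped {url}"

    # gather returns results in the same order as the input URLs
    results = await asyncio.gather(*(fake_fetch(u) for u in urls))
    return list(results), peak


results, peak = asyncio.run(
    run_bounded([f"https://example.com/{i}" for i in range(20)])
)
# peak never exceeds max_concurrent, however many URLs are queued
```

The Playwright version has the same shape, with a fresh browser context taking the place of the sleep: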
import asyncio

from playwright.async_api import async_playwright


async def scrape_urls(urls: list[str], max_concurrent: int = 5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def scrape_one(url: str):
        async with semaphore:
            # Each task gets its own context
            context = await browser.new_context()
            page = await context.new_page()
            try:
                await page.goto(url, wait_until="networkidle")
                # extract_data is your own coroutine that pulls fields from the page
                return await extract_data(page)
            finally:
                await context.close()

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        results = await asyncio.gather(
            *[scrape_one(url) for url in urls],
            return_exceptions=True
        )
        await browser.close()

    # Failed URLs are dropped here; log the exceptions if you need to know which ones
    return [r for r in results if not isinstance(r, Exception)]

Next Steps
Building production web scrapers requires careful attention to reliability, performance, and legal compliance.
