Web Scraping with Python and Playwright: A Production-Ready Guide

Why Playwright for Web Scraping?

Traditional scraping tools like requests and BeautifulSoup work well for static HTML pages. But many modern websites are JavaScript-heavy single-page applications that render content dynamically, so the HTML you download often contains little of the data you actually want. That's where Playwright comes in.

Playwright gives you a real browser engine (Chromium, Firefox, or WebKit) that executes JavaScript, handles SPAs, and renders pages as a real user sees them. It's fast, reliable, and has a Python API that's a pleasure to use.

Key advantages over Selenium

Auto-wait mechanisms — Playwright automatically waits for elements to be ready before interacting
Multiple browser contexts — Run parallel sessions without separate browser instances
Network interception — Capture API calls and modify requests on the fly
Better performance — Faster execution and lower resource usage
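
As a rough illustration of the network interception point, here is a minimal sketch of a route handler that skips heavy resources a text scraper usually doesn't need. The set of blocked resource types and the names should_block and block_heavy_resources are assumptions for this example, not part of Playwright itself; the handler only relies on the route.request.resource_type, route.abort(), and route.continue_() calls from Playwright's routing API.

```python
# Resource types that rarely matter for text extraction.
# This set is an assumption; adjust it for your target site.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Decide whether a request of this resource type should be skipped."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def block_heavy_resources(route) -> None:
    """Route handler: abort heavy requests, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


# Usage with a Playwright page (sync API):
#     page.route("**/*", block_heavy_resources)
#     page.goto(url)
```

Keeping the decision in a small pure function makes the blocking rule easy to test without launching a browser.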

Setting Up Your Project

First, install the dependencies:

pip install playwright
playwright install chromium

Here's a basic scraper structure:

from playwright.sync_api import sync_playwright

def scrape_page(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
            )
            page = context.new_page()
            # "networkidle" waits until there have been no network connections for at least 500 ms
            page.goto(url, wait_until="networkidle")

            # Extract data
            title = page.title()
            content = page.inner_text("main")  # times out if no <main> element appears
            return {"title": title, "content": content}
        finally:
            # Always release the browser, even if navigation or extraction fails
            browser.close()
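
Not every page has a main element, and page.inner_text("main") will wait for one and eventually time out. One possible way around that, sketched below, is to try a list of selectors and take the first that matches. The helper name first_matching_text and the fallback selectors are assumptions for illustration; the helper only uses page.locator(), Locator.count(), Locator.first, and Locator.inner_text().

```python
from typing import Optional, Sequence


def first_matching_text(page, selectors: Sequence[str]) -> Optional[str]:
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        locator = page.locator(selector)
        # count() checks the current DOM without waiting, so a miss is cheap
        if locator.count() > 0:
            return locator.first.inner_text().strip()
    return None


# Usage (sync API), in place of page.inner_text("main"):
#     content = first_matching_text(page, ["main", "article", "body"])
```

Because count() does not wait, this is best called after the page has finished loading, as in the goto call above.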

Handling Anti-Bot Protection

Modern websites employ various anti-bot measures. Here's how to handle the most common ones.

Rotating User Agents

import random
 
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
 
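# `browser` is the object returned by p.chromium.launch(...) in the basic scraper
context = browser.new_context(
    user_agent=random.choice(USER_AGENTS),  # pick a new one for each context
    viewport={"width": 1920, "height": 1080},
)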

Adding Delays and Human-Like Behavior

import asyncio
import random
 
async def human_like_scroll(page):
    """Simulate natural scrolling behavior."""
    for _ in range(random.randint(3, 7)):
        await page.evaluate(
            f"window.scrollBy(0, {random.randint(100, 300)})"
        )
        await asyncio.sleep(random.uniform(0.5, 1.5))
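
Scrolling covers the human-like behavior; for the delays themselves, a small helper that adds random jitter to a base pause between page loads might look like the sketch below. The function names and the default values (2 seconds plus up to 1 second of jitter) are assumptions chosen for illustration, not recommended settings for any particular site.

```python
import random
import time


def jittered_delay(base: float = 2.0, jitter: float = 1.0, rng=random) -> float:
    """Pick a delay in seconds between base and base + jitter."""
    if base < 0 or jitter < 0:
        raise ValueError("base and jitter must be non-negative")
    return base + rng.uniform(0, jitter)


def polite_pause(base: float = 2.0, jitter: float = 1.0) -> None:
    """Block the current thread for a jittered delay (sync scrapers only)."""
    time.sleep(jittered_delay(base, jitter))
```

In async code, await asyncio.sleep(jittered_delay()) does the same job without blocking the event loop.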

Proxy Rotation

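# `p` is the object yielded by sync_playwright(); this routes all browser traffic through a single proxy
browser = p.chromium.launch(
    headless=True,
    proxy={"server": "http://proxy-server:8080"},
)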
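
The launch call above pins the whole browser to one proxy. To actually rotate, one option is to hand out proxies in round-robin order and pass one to each new browser context, as in this sketch. The proxy URLs are hypothetical placeholders and proxy_settings is a name made up for this example; per-context proxy support can vary between Playwright versions and browsers, so check the documentation for your setup.

```python
from itertools import cycle
from typing import Iterator

# Hypothetical proxy endpoints; replace with your provider's servers.
PROXY_SERVERS = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
]


def proxy_settings(servers: list[str]) -> Iterator[dict]:
    """Return an endless round-robin iterator of Playwright-style proxy dicts."""
    if not servers:
        raise ValueError("at least one proxy server is required")
    return ({"server": server} for server in cycle(servers))


# Usage (sync API):
#     proxies = proxy_settings(PROXY_SERVERS)
#     context = browser.new_context(proxy=next(proxies))
```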

Error Handling and Retry Logic

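Production scrapers need robust error handling. The example below uses the tenacity library (pip install tenacity) to retry timeouts with exponential backoff:

import logging

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(
    retry=retry_if_exception_type(PlaywrightTimeoutError),  # retry only on timeouts
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,  # after the last attempt, raise the original error
)
def scrape_with_retry(url: str) -> dict:
    try:
        return scrape_page(url)
    except PlaywrightTimeoutError:
        # Playwright's TimeoutError is not the built-in TimeoutError
        logger.warning("Timed out scraping %s", url)
        raise  # Will be retried
    except Exception:
        logger.exception("Failed to scrape %s", url)
        raise  # Not retried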

Structuring Extracted Data

Always validate and clean your extracted data before storing it:

from dataclasses import dataclass
from typing import Optional
 
@dataclass
class Product:
    name: str
    price: float
    url: str
    description: Optional[str] = None
 
    def to_dict(self) -> dict:
        return {
            "name": self.name.strip(),
            "price": round(self.price, 2),
            "url": self.url,
            "description": self.description,
        }
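
The dataclass above cleans values on the way out, but scraped text still has to be turned into those typed fields first. A minimal sketch of that validation step follows. The helper names parse_price and is_absolute_http_url are assumptions for this example, and the price parser assumes US-style formatting (comma as thousands separator, dot as decimal point).

```python
import re
from urllib.parse import urlparse

_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(raw: str) -> float:
    """Extract a price from text like "$1,299.99"; raise ValueError if none is found."""
    cleaned = raw.replace(",", "").strip()  # assumes "," is a thousands separator
    match = _PRICE_RE.search(cleaned)
    if match is None:
        raise ValueError(f"no numeric price found in {raw!r}")
    return float(match.group())


def is_absolute_http_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Usage with the Product dataclass:
#     if is_absolute_http_url(link):
#         product = Product(name=name, price=parse_price(price_text), url=link)
```

Raising on a missing price, rather than defaulting to zero, keeps bad rows out of storage and makes selector breakage visible early.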

Scaling Your Scraper

For large-scale scraping, use async Playwright with multiple browser contexts:

import asyncio
from playwright.async_api import Page, async_playwright

async def extract_data(page: Page) -> dict:
    # Placeholder extractor; replace with site-specific selectors
    return {"url": page.url, "title": await page.title()}

async def scrape_urls(urls: list[str], max_concurrent: int = 5) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def scrape_one(url: str) -> dict:
            async with semaphore:
                # Each task gets its own isolated context
                context = await browser.new_context()
                try:
                    page = await context.new_page()
                    await page.goto(url, wait_until="networkidle")
                    return await extract_data(page)
                finally:
                    await context.close()

        try:
            results = await asyncio.gather(
                *(scrape_one(url) for url in urls),
                return_exceptions=True,
            )
        finally:
            # Close the browser even if gathering is cancelled
            await browser.close()

    # Failed URLs are dropped here; only successful results are returned
    return [r for r in results if not isinstance(r, BaseException)]
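
The final filter above throws away every failure, which makes it hard to see which URLs need another attempt. Since asyncio.gather returns results in the same order as its inputs, one possible alternative is to pair each URL with its outcome, as sketched here. The name split_results is an assumption for this example, and duplicate URLs would overwrite each other in the returned dicts.

```python
def split_results(urls: list[str], results: list) -> tuple[dict, dict]:
    """Pair each URL with its outcome and return (successes, failures)."""
    if len(urls) != len(results):
        raise ValueError("urls and results must have the same length")
    successes: dict = {}
    failures: dict = {}
    for url, result in zip(urls, results):
        # With return_exceptions=True, failed tasks show up as exception objects
        if isinstance(result, BaseException):
            failures[url] = result
        else:
            successes[url] = result
    return successes, failures


# Usage inside scrape_urls, in place of the final list comprehension:
#     successes, failures = split_results(urls, results)
```

The failures dict can then be logged or fed back into a retry queue.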

Next Steps

Building production web scrapers requires careful attention to reliability, performance, and legal compliance. If you need help with a scraping project, check out our web scraping services or get in touch to discuss your requirements.
