Web Scraping at Scale with Python: Processing 500+ Records Daily Without Getting Blocked

Battle-tested strategies for building reliable, ethical web scrapers that process hundreds of records daily while respecting rate limits and avoiding blocks.

Dibyank Padhy

Engineering Manager & Full Stack Developer

The Art of Responsible Scraping

Web scraping gets a bad reputation because most people do it wrong. They blast requests at maximum speed, ignore robots.txt, and wonder why they get IP-banned within minutes. When I built the scraping engine for SalesBridge.ai - which processes 500+ defense procurement opportunities from 11 government websites daily - I learned that sustainable scraping is about being a good citizen of the web.

In this guide, I will share the techniques that have kept our scrapers running reliably for months without a single IP ban or cease-and-desist notice.

Principle 1: Respect robots.txt and Rate Limits

This is not just ethical - it is practical. Sites that block you are sites you cannot scrape. Always check robots.txt first, honor Crawl-delay directives, and add your own conservative rate limiting on top.

python
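import asyncio
import time
from urllib.robotparser import RobotFileParser

import aiohttp

USER_AGENT = "SalesBridgeBot/1.0"


class RespectfulScraper:
    def __init__(self, base_url: str, requests_per_minute: int = 10):
        self.base_url = base_url.rstrip("/")
        self.rpm = requests_per_minute
        self.last_request_time = None  # time.monotonic() of the last request
        self.min_interval = 60.0 / requests_per_minute
        self._lock = asyncio.Lock()  # keeps concurrent fetches from racing the limiter

        # Parse robots.txt (a blocking call, acceptable once at startup)
        self.robot_parser = RobotFileParser()
        self.robot_parser.set_url(f"{self.base_url}/robots.txt")
        self.robot_parser.read()

        # Honor Crawl-delay when it is stricter than our own limit
        crawl_delay = self.robot_parser.crawl_delay(USER_AGENT)
        if crawl_delay:
            self.min_interval = max(self.min_interval, float(crawl_delay))

    def can_fetch(self, url: str) -> bool:
        """Check if we're allowed to scrape this URL"""
        return self.robot_parser.can_fetch(USER_AGENT, url)

    async def _wait_for_slot(self) -> None:
        """Sleep until at least min_interval has passed since the last request"""
        async with self._lock:
            if self.last_request_time is not None:
                elapsed = time.monotonic() - self.last_request_time
                if elapsed < self.min_interval:
                    await asyncio.sleep(self.min_interval - elapsed)
            self.last_request_time = time.monotonic()

    async def fetch(self, session: aiohttp.ClientSession, url: str,
                    max_retries: int = 3) -> str:
        """Fetch a URL with rate limiting and polite headers"""
        if not self.can_fetch(url):
            raise PermissionError(f"robots.txt disallows: {url}")

        headers = {
            'User-Agent': f'{USER_AGENT} (+https://salesbridge.ai/bot)',
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        for attempt in range(max_retries + 1):
            await self._wait_for_slot()  # Enforce rate limit on every attempt

            async with session.get(url, headers=headers) as response:
                if response.status == 429 and attempt < max_retries:  # Too Many Requests
                    # Retry-After may be seconds or an HTTP date; fall back to 60s
                    retry_after = response.headers.get('Retry-After', '').strip()
                    delay = int(retry_after) if retry_after.isdigit() else 60
                else:
                    # Raises on the final 429 or any other error status
                    response.raise_for_status()
                    return await response.text()

            await asyncio.sleep(delay)  # Back off, then retry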
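
To see how robots.txt rules and a Crawl-delay directive feed into these checks without touching the network, you can hand the standard-library parser a rules file directly. This is a minimal sketch: the rules text and the example.gov URLs are made up for illustration, and a real site will publish its own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

USER_AGENT = "SalesBridgeBot/1.0"

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

allowed = parser.can_fetch(USER_AGENT, "https://example.gov/opportunities/123")
blocked = parser.can_fetch(USER_AGENT, "https://example.gov/admin/users")
delay = parser.crawl_delay(USER_AGENT)  # seconds, or None if not specified

# Use whichever is slower: the site's Crawl-delay or our own request budget.
own_interval = 60.0 / 10  # 10 requests per minute
min_interval = max(own_interval, float(delay or 0))

print(allowed, blocked, delay, min_interval)
```

Here our own budget of one request every 6 seconds is already slower than the 5-second Crawl-delay, so the stricter value wins.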

Principle 2: Handle JavaScript-Rendered Content

Many modern procurement portals use JavaScript frameworks that render content client-side. BeautifulSoup only parses the HTML the server returns, so on these sites it sees a nearly empty page shell. You need a headless browser to execute the JavaScript first.

python
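from typing import Optional

from playwright.async_api import Route, async_playwright

# Resource types the parser does not need; skipping them speeds up page loads
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


async def _block_unneeded(route: Route) -> None:
    """Abort requests for heavy assets and let everything else through"""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()


class DynamicScraper:
    async def scrape_js_rendered(self, url: str, wait_for: Optional[str] = None) -> str:
        """Scrape content that requires JavaScript rendering"""
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                headless=True,
                args=[
                    '--disable-gpu',
                    '--no-sandbox',  # often needed in containers; drop it where the sandbox works
                    '--disable-dev-shm-usage',
                ]
            )

            try:
                context = await browser.new_context(
                    viewport={'width': 1280, 'height': 720},
                    user_agent='Mozilla/5.0 (compatible; SalesBridgeBot/1.0)',
                )

                page = await context.new_page()

                # Block unnecessary resources to speed up scraping
                await page.route("**/*", _block_unneeded)

                await page.goto(url, wait_until='networkidle')

                if wait_for:
                    await page.wait_for_selector(wait_for, timeout=10000)

                return await page.content()
            finally:
                # Close the browser even if navigation or the selector wait fails
                await browser.close()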

Principle 3: Build Resilient Pipelines

Scrapers break. Websites change their HTML structure, servers go down temporarily, and network issues happen. Your scraping pipeline needs to handle all of these gracefully:

Retry with escalating backoff - up to 3 retries with delays of 1s, 5s, and 30s before giving up

Store raw HTML alongside parsed data - when a parser breaks, you can reprocess stored HTML without re-scraping

Use CSS selectors with fallbacks - have multiple selector strategies for each field and try them in order

Alert on extraction rate drops - if you normally extract 50+ records per run and suddenly get 5, something has changed
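
The retry schedule from the first item can be sketched as a small helper. This is an illustration rather than the production code: `with_retries`, its parameters, and the `flaky` demo function are hypothetical names, and the `sleep` argument is injectable only so the demo runs without waiting. With aiohttp you would likely add `aiohttp.ClientError` to `retry_on`.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")

# Delays from the list above: wait 1s, then 5s, then 30s before giving up.
BACKOFF_DELAYS = (1, 5, 30)


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    delays: Sequence[float] = BACKOFF_DELAYS,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `operation`, retrying on transient errors with the given delays."""
    for delay in delays:
        try:
            return await operation()
        except retry_on:
            await sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return await operation()


async def demo():
    attempts = 0
    waited = []

    async def flaky() -> str:
        # Fails twice, then succeeds, to exercise the retry path.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("temporary failure")
        return "<html>ok</html>"

    async def fake_sleep(seconds: float) -> None:
        waited.append(seconds)  # record the delay instead of sleeping

    result = await with_retries(flaky, sleep=fake_sleep)
    return result, attempts, waited


result, attempts, waited = asyncio.run(demo())
print(result, attempts, waited)
```

Only errors listed in `retry_on` are retried, so a parsing bug or a permission error fails immediately instead of wasting the retry budget.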

Principle 4: Data Validation and Deduplication

Raw scraped data is messy. You need a robust validation layer between scraping and storage:

python
import json
from datetime import date, datetime, timezone
from typing import List, Optional

# pydantic v1-style API; pydantic v2 replaces `validator` with `field_validator`
from pydantic import BaseModel, HttpUrl, validator

# Prefixes we expect on government solicitation numbers
SOLICITATION_PREFIXES = ('W', 'FA', 'N', 'SP', 'HQ')


class Opportunity(BaseModel):
    title: str
    agency: str
    solicitation_number: str
    response_deadline: date
    url: HttpUrl
    naics_codes: List[str] = []
    estimated_value: Optional[float] = None

    @validator('title')
    def title_not_empty(cls, v):
        # Very short titles usually mean the selector matched the wrong element
        if len(v.strip()) < 10:
            raise ValueError('Title too short - likely parsing error')
        return v.strip()

    @validator('response_deadline')
    def deadline_in_future(cls, v):
        if v < date.today():
            raise ValueError('Deadline has already passed')
        return v

    @validator('solicitation_number')
    def valid_solicitation(cls, v):
        # Government solicitation numbers follow specific patterns
        if not v.startswith(SOLICITATION_PREFIXES):
            raise ValueError(f'Unusual solicitation format: {v}')
        return v


class DeduplicationEngine:
    def __init__(self, db_connection):
        # Expects an async MongoDB-style collection exposing find_one/update_one
        self.db = db_connection

    async def is_duplicate(self, opportunity: Opportunity) -> bool:
        """Check if we've already seen this opportunity"""
        existing = await self.db.find_one({
            'solicitation_number': opportunity.solicitation_number
        })
        return existing is not None

    async def upsert(self, opportunity: Opportunity):
        """Insert or update opportunity"""
        # Round-trip through JSON so date and URL fields become plain strings;
        # BSON cannot encode datetime.date values
        doc = json.loads(opportunity.json())
        await self.db.update_one(
            {'solicitation_number': opportunity.solicitation_number},
            {'$set': doc,
             '$setOnInsert': {'first_seen': datetime.now(timezone.utc)}},
            upsert=True
        )
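
Matching on the raw solicitation number only works if every source formats it the same way. As a sketch, assuming sources differ only in case, whitespace, and hyphenation (an assumption worth checking against your own data), you could normalize the key before comparing. The function names and the sample solicitation numbers below are invented for illustration.

```python
import re
from typing import Dict, Iterable, List


def normalize_solicitation_number(raw: str) -> str:
    """Uppercase and drop whitespace and hyphens so variants compare equal."""
    return re.sub(r"[\s\-]+", "", raw).upper()


def dedupe_batch(records: Iterable[dict]) -> List[dict]:
    """Keep the first record seen for each normalized solicitation number."""
    seen: Dict[str, dict] = {}
    for record in records:
        key = normalize_solicitation_number(record["solicitation_number"])
        seen.setdefault(key, record)
    return list(seen.values())


# Made-up records: the first two are the same notice formatted differently.
batch = [
    {"solicitation_number": "W912-DY-24-R-0001", "source": "site-a"},
    {"solicitation_number": "w912dy24r0001 ", "source": "site-b"},
    {"solicitation_number": "FA8650-24-S-1234", "source": "site-a"},
]

unique = dedupe_batch(batch)
print([r["source"] for r in unique])
```

Deduplicating within a batch this way also reduces the number of database lookups before the upsert step.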

Principle 5: Monitoring and Alerting

A scraper running silently in the background is a scraper you have forgotten about. Set up monitoring from day one:

Track records scraped per source per day - any significant deviation triggers an alert

Monitor error rates by type - distinguish between network errors (temporary) and parsing errors (structural change)

Log response times per source - a sudden increase can be an early sign of throttling or an impending block

Set up daily summary emails with key metrics so you can spot trends before they become problems
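
The first check in this list can be as simple as comparing today's count with a recent average. The sketch below is one way to do it, not the production logic: the function name, the source name, the sample history, and the 50% threshold are all assumptions to tune per source.

```python
from statistics import mean
from typing import Optional, Sequence


def check_record_count(
    source: str,
    todays_count: int,
    recent_counts: Sequence[int],
    min_ratio: float = 0.5,
) -> Optional[str]:
    """Return an alert message if today's count fell below min_ratio of the recent average."""
    if not recent_counts:
        return None  # no baseline yet, nothing to compare against
    baseline = mean(recent_counts)
    if baseline > 0 and todays_count < baseline * min_ratio:
        return (
            f"{source}: extracted {todays_count} records, "
            f"recent average is {baseline:.0f}"
        )
    return None


# Made-up history of daily record counts for one source.
history = [52, 48, 55, 50, 51]

alert = check_record_count("agency-portal", 5, history)
ok = check_record_count("agency-portal", 49, history)
print(alert)
print(ok)
```

The returned message can feed whatever channel you already use, such as the daily summary email.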

Building reliable scrapers is not glamorous work, but it is the foundation that makes AI-powered platforms like SalesBridge.ai possible. The data quality of your AI outputs will never exceed the quality of your data inputs, and that starts with a well-built scraper.

About the Author

Dibyank Padhy is an Engineering Manager & Full Stack Developer with 7+ years of experience building scalable software solutions. Passionate about cloud architecture, team leadership, and AI integration.