Web Scraping at Scale with Python: Processing 500+ Records Daily Without Getting Blocked

The Art of Responsible Scraping

Web scraping gets a bad reputation because most people do it wrong. They blast requests at maximum speed, ignore robots.txt, and wonder why they get IP-banned within minutes. When I built the scraping engine for SalesBridge.ai - which processes 500+ defense procurement opportunities from 11 government websites daily - I learned that sustainable scraping is about being a good citizen of the web.

In this guide, I will share the techniques that have kept our scrapers running reliably for months without a single IP ban or cease-and-desist notice.

Principle 1: Respect robots.txt and Rate Limits

This is not just ethical - it is practical. Sites that block you are sites you cannot scrape. Always check robots.txt first, honor Crawl-delay directives, and add your own conservative rate limiting on top.

python

import asyncio
import time

import aiohttp
from urllib.robotparser import RobotFileParser

class RespectfulScraper:
    def __init__(self, base_url: str, requests_per_minute: int = 10):
        self.base_url = base_url.rstrip('/')
        self.rpm = requests_per_minute
        self.min_interval = 60.0 / requests_per_minute
        # Monotonic timestamp of the next free request slot
        self.next_slot = 0.0

        # Parse robots.txt (blocking call, so construct the scraper at startup)
        self.robot_parser = RobotFileParser()
        self.robot_parser.set_url(f"{self.base_url}/robots.txt")
        self.robot_parser.read()

    def can_fetch(self, url: str) -> bool:
        """Check if we're allowed to scrape this URL"""
        return self.robot_parser.can_fetch("SalesBridgeBot/1.0", url)

    async def fetch(self, session: aiohttp.ClientSession, url: str,
                    retries_left: int = 3) -> str:
        """Fetch a URL with rate limiting and polite headers"""
        if not self.can_fetch(url):
            raise PermissionError(f"robots.txt disallows: {url}")

        # Enforce rate limit: reserve the next slot before awaiting, so
        # concurrent calls queue up instead of all firing at once
        now = time.monotonic()
        wait = max(0.0, self.next_slot - now)
        self.next_slot = max(now, self.next_slot) + self.min_interval
        if wait:
            await asyncio.sleep(wait)

        headers = {
            'User-Agent': 'SalesBridgeBot/1.0 (+https://salesbridge.ai/bot)',
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        async with session.get(url, headers=headers) as response:
            if response.status == 429 and retries_left > 0:  # Too Many Requests
                # Retry-After is either seconds or an HTTP date; use 60s for dates
                retry_after = response.headers.get('Retry-After', '60')
                delay = int(retry_after) if retry_after.isdigit() else 60
            else:
                response.raise_for_status()
                return await response.text()

        # Retry only after the throttled response has been released
        await asyncio.sleep(delay)
        return await self.fetch(session, url, retries_left - 1)
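
The RespectfulScraper class checks can_fetch but derives its request interval only from requests_per_minute. A minimal sketch of folding a site's Crawl-delay directive into that interval, using the standard library's RobotFileParser.crawl_delay, might look like the following. The helper name effective_interval and the sample robots.txt lines are illustrative assumptions, not part of the SalesBridge.ai code.

```python
from urllib.robotparser import RobotFileParser


def effective_interval(parser: RobotFileParser, user_agent: str,
                       requests_per_minute: int = 10) -> float:
    """Seconds to wait between requests: the stricter of our limit and Crawl-delay."""
    own_interval = 60.0 / requests_per_minute
    crawl_delay = parser.crawl_delay(user_agent)  # None when no directive applies
    if crawl_delay is None:
        return own_interval
    return max(own_interval, float(crawl_delay))


# Hypothetical robots.txt content, parsed offline for illustration
sample_parser = RobotFileParser()
sample_parser.parse([
    "User-agent: *",
    "Crawl-delay: 15",
    "Disallow: /admin/",
])

interval = effective_interval(sample_parser, "SalesBridgeBot/1.0")
```

Taking the maximum of the two values means a site's stated delay can only slow the scraper down, never speed it up past your own conservative limit.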

Principle 2: Handle JavaScript-Rendered Content

Many modern procurement portals use JavaScript frameworks that render content client-side. BeautifulSoup only parses the HTML the server returns, so on its own it never sees that content. You need a headless browser.

python

from typing import Optional

from playwright.async_api import async_playwright

# Resource types that are not needed to read the rendered HTML
BLOCKED_RESOURCE_TYPES = {'image', 'media', 'font', 'stylesheet'}

class DynamicScraper:
    async def scrape_js_rendered(self, url: str, wait_for: Optional[str] = None) -> str:
        """Scrape content that requires JavaScript rendering"""
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                headless=True,
                args=[
                    '--disable-gpu',
                    '--no-sandbox',  # often needed in containers; drop it where the sandbox works
                    '--disable-dev-shm-usage',
                ]
            )

            try:
                context = await browser.new_context(
                    viewport={'width': 1280, 'height': 720},
                    user_agent='Mozilla/5.0 (compatible; SalesBridgeBot/1.0)',
                )

                page = await context.new_page()

                # Block unnecessary resources to speed up scraping
                async def block_heavy_resources(route):
                    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
                        await route.abort()
                    else:
                        await route.continue_()

                await page.route('**/*', block_heavy_resources)

                await page.goto(url, wait_until='networkidle')

                if wait_for:
                    await page.wait_for_selector(wait_for, timeout=10000)

                return await page.content()
            finally:
                # Close the browser even if navigation or the selector wait fails
                await browser.close()

Principle 3: Build Resilient Pipelines

Scrapers break. Websites change their HTML structure, servers go down temporarily, and network issues happen. Your scraping pipeline needs to handle all of these gracefully:

Retry with escalating backoff - up to 3 retries with delays of 1s, 5s, and 30s before giving up

Store raw HTML alongside parsed data - when a parser breaks, you can reprocess stored HTML without re-scraping

Use CSS selectors with fallbacks - have multiple selector strategies for each field and try them in order

Alert on extraction rate drops - if you normally extract 50+ records per run and suddenly get 5, something has changed
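
Two of the items above, retries with escalating delays and selector fallbacks, can be sketched with the standard library alone. This is an illustration under assumptions rather than the SalesBridge.ai implementation: the names retry_with_backoff and first_match are hypothetical, and the default retry_on tuple is a placeholder (with aiohttp you would likely pass aiohttp.ClientError instead).

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Optional, Sequence, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    delays: Sequence[float] = (1, 5, 30),
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run operation; after each retryable failure wait the next delay and retry."""
    for delay in delays:
        try:
            return await operation()
        except retry_on:
            await sleep(delay)
    # Final attempt after the last delay: any exception propagates to the caller
    return await operation()


def first_match(extractors: Iterable[Callable[[str], Optional[str]]],
                html: str) -> Optional[str]:
    """Try each extraction strategy in order; return the first non-empty result."""
    for extract in extractors:
        value = extract(html)
        if value:
            return value.strip()
    return None
```

In practice each extractor would wrap one CSS selector lookup (for example a BeautifulSoup select_one call), ordered from the most specific selector to the most generic. The sleep parameter exists only so the delays can be replaced in tests.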

Principle 4: Data Validation and Deduplication

Raw scraped data is messy. You need a robust validation layer between scraping and storage:

python

import json
from datetime import date, datetime, timezone
from typing import Optional, List

# Pydantic v1 API; v2 renames validator to field_validator and
# .dict()/.json() to .model_dump()/.model_dump_json()
from pydantic import BaseModel, validator, HttpUrl

class Opportunity(BaseModel):
    title: str
    agency: str
    solicitation_number: str
    response_deadline: date
    url: HttpUrl
    naics_codes: List[str] = []
    estimated_value: Optional[float] = None

    @validator('title')
    def title_not_empty(cls, v):
        v = v.strip()
        if len(v) < 10:
            raise ValueError('Title too short - likely parsing error')
        return v

    @validator('response_deadline')
    def deadline_in_future(cls, v):
        if v < date.today():
            raise ValueError('Deadline has already passed')
        return v

    @validator('solicitation_number')
    def valid_solicitation(cls, v):
        # Government solicitation numbers follow specific patterns
        if not v.startswith(('W', 'FA', 'N', 'SP', 'HQ')):
            raise ValueError(f'Unusual solicitation format: {v}')
        return v

class DeduplicationEngine:
    def __init__(self, db_connection):
        # Assumed to be an async MongoDB-style collection (e.g. Motor)
        self.db = db_connection

    async def is_duplicate(self, opportunity: Opportunity) -> bool:
        """Check if we've already seen this opportunity"""
        existing = await self.db.find_one({
            'solicitation_number': opportunity.solicitation_number
        })
        return existing is not None

    async def upsert(self, opportunity: Opportunity):
        """Insert or update opportunity"""
        # BSON cannot encode datetime.date, so round-trip through JSON to
        # store the deadline and URL as plain strings
        document = json.loads(opportunity.json())
        await self.db.update_one(
            {'solicitation_number': opportunity.solicitation_number},
            {'$set': document,
             '$setOnInsert': {'first_seen': datetime.now(timezone.utc)}},
            upsert=True
        )

Principle 5: Monitoring and Alerting

A scraper running silently in the background is a scraper you have forgotten about. Set up monitoring from day one:

Track records scraped per source per day - any significant deviation triggers an alert

Monitor error rates by type - distinguish between network errors (temporary) and parsing errors (structural change)

Log response times per source - a sudden increase often indicates impending blocks

Set up daily summary emails with key metrics so you can spot trends before they become problems
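
The first two checks in the list above can be sketched as follows. This is a rough illustration, not the monitoring code behind SalesBridge.ai: the function names, the 50% drop threshold, the five-run minimum history, and the mapping of exception types to categories are all assumptions to tune per source. Note that it flags drops only, not spikes.

```python
from statistics import median
from typing import Sequence

# Assumed grouping: errors that usually clear up on their own
NETWORK_ERRORS = (ConnectionError, TimeoutError)
# Assumed grouping: errors that usually mean the page structure changed
PARSING_ERRORS = (ValueError, KeyError, AttributeError)


def is_anomalous(today_count: int, recent_counts: Sequence[int],
                 drop_ratio: float = 0.5, min_history: int = 5) -> bool:
    """Flag today's count if it falls below drop_ratio of the recent median."""
    if len(recent_counts) < min_history:
        return False  # not enough history to judge
    baseline = median(recent_counts)
    return today_count < baseline * drop_ratio


def classify_error(exc: Exception) -> str:
    """Separate temporary network failures from likely structural changes."""
    if isinstance(exc, NETWORK_ERRORS):
        return "network"
    if isinstance(exc, PARSING_ERRORS):
        return "parsing"
    return "other"
```

Using the median rather than the mean keeps one unusually large or small day from shifting the baseline.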

Building reliable scrapers is not glamorous work, but it is the foundation that makes AI-powered platforms like SalesBridge.ai possible. The data quality of your AI outputs will never exceed the quality of your data inputs, and that starts with a well-built scraper.
