Web Scraping at Scale with Python: Processing 500+ Records Daily Without Getting Blocked
Battle-tested strategies for building reliable, ethical web scrapers that process hundreds of records daily while respecting rate limits and avoiding blocks.
Dibyank Padhy
Engineering Manager & Full Stack Developer
The Art of Responsible Scraping
Web scraping gets a bad reputation because most people do it wrong. They blast requests at maximum speed, ignore robots.txt, and wonder why they get IP-banned within minutes. When I built the scraping engine for SalesBridge.ai - which processes 500+ defense procurement opportunities from 11 government websites daily - I learned that sustainable scraping is about being a good citizen of the web.
In this guide, I will share the techniques that have kept our scrapers running reliably for months without a single IP ban or cease-and-desist notice.
Principle 1: Respect robots.txt and Rate Limits
This is not just ethical - it is practical. Sites that block you are sites you cannot scrape. Always check robots.txt first, honor Crawl-delay directives, and add your own conservative rate limiting on top.
import asyncio
import time
from urllib.robotparser import RobotFileParser

import aiohttp

USER_AGENT = "SalesBridgeBot/1.0"


class RespectfulScraper:
    def __init__(self, base_url: str, requests_per_minute: int = 10):
        self.base_url = base_url.rstrip("/")
        self.rpm = requests_per_minute
        self.last_request_time = None
        self.min_interval = 60.0 / requests_per_minute
        # Serializes the rate-limit check so concurrent tasks cannot fire together
        self._lock = asyncio.Lock()

        # Parse robots.txt (blocking call, done once at construction)
        self.robot_parser = RobotFileParser()
        self.robot_parser.set_url(f"{self.base_url}/robots.txt")
        self.robot_parser.read()

        # Honor Crawl-delay when it is stricter than our own limit
        crawl_delay = self.robot_parser.crawl_delay(USER_AGENT)
        if crawl_delay:
            self.min_interval = max(self.min_interval, float(crawl_delay))

    def can_fetch(self, url: str) -> bool:
        """Check if we're allowed to scrape this URL"""
        return self.robot_parser.can_fetch(USER_AGENT, url)

    async def _wait_for_slot(self) -> None:
        """Sleep until at least min_interval has passed since the last request"""
        async with self._lock:
            if self.last_request_time is not None:
                elapsed = time.monotonic() - self.last_request_time
                if elapsed < self.min_interval:
                    await asyncio.sleep(self.min_interval - elapsed)
            self.last_request_time = time.monotonic()

    async def fetch(self, session: aiohttp.ClientSession, url: str,
                    max_retries: int = 3) -> str:
        """Fetch a URL with rate limiting and polite headers"""
        if not self.can_fetch(url):
            raise PermissionError(f"robots.txt disallows: {url}")

        headers = {
            'User-Agent': f'{USER_AGENT} (+https://salesbridge.ai/bot)',
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        for attempt in range(max_retries + 1):
            await self._wait_for_slot()  # Enforce rate limit
            async with session.get(url, headers=headers) as response:
                if response.status != 429 or attempt == max_retries:
                    response.raise_for_status()
                    return await response.text()
                # Too Many Requests: Retry-After may be seconds or an HTTP date,
                # so fall back to 60s when it is not a plain integer
                retry_after = response.headers.get('Retry-After', '')
                delay = int(retry_after) if retry_after.isdigit() else 60
            await asyncio.sleep(delay)  # Then retry, up to max_retries times

Principle 2: Handle JavaScript-Rendered Content
Many modern procurement portals use JavaScript frameworks that render content client-side. BeautifulSoup alone cannot help with these sites, because it only parses the HTML the server returns and never executes scripts. You need a headless browser.
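from typing import Optional

from playwright.async_api import Route, async_playwright

# Resource types that are not needed for text extraction
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


async def block_static_assets(route: Route) -> None:
    """Abort requests for static assets and let everything else through"""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()


class DynamicScraper:
    async def scrape_js_rendered(self, url: str, wait_for: Optional[str] = None) -> str:
        """Scrape content that requires JavaScript rendering"""
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                headless=True,
                args=[
                    '--disable-gpu',
                    '--no-sandbox',  # often needed in containers; drop it if you can
                    '--disable-dev-shm-usage',
                ]
            )
            try:
                context = await browser.new_context(
                    viewport={'width': 1280, 'height': 720},
                    user_agent='Mozilla/5.0 (compatible; SalesBridgeBot/1.0)',
                )
                page = await context.new_page()

                # Block unnecessary resources to speed up scraping.
                # Matching on resource type also catches asset URLs with query strings.
                await page.route("**/*", block_static_assets)

                await page.goto(url, wait_until='networkidle')
                if wait_for:
                    await page.wait_for_selector(wait_for, timeout=10000)

                return await page.content()
            finally:
                # Always release the browser, even if navigation fails
                await browser.close()

Principle 3: Build Resilient Pipelines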
Scrapers break. Websites change their HTML structure, servers go down temporarily, and network issues happen. Your scraping pipeline needs to handle all of these gracefully:
Retry with exponential backoff - up to 3 retries with sharply increasing delays (1s, 5s, then 30s) before giving up
Store raw HTML alongside parsed data - when a parser breaks, you can reprocess stored HTML without re-scraping
Use CSS selectors with fallbacks - have multiple selector strategies for each field and try them in order
Alert on extraction rate drops - if you normally extract 50+ records per run and suddenly get 5, something has changed
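The retry and selector-fallback items above can be sketched as follows. This is a minimal sketch under assumptions, not the SalesBridge.ai implementation: `retry_with_backoff`, `first_match`, and the injectable `sleep` parameter are hypothetical names, and `select_one` stands for any callable that returns a match or None (BeautifulSoup's `soup.select_one` would fit).

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, TypeVar

T = TypeVar("T")

# Delay schedule from the list above: 1s, 5s, then 30s.
RETRY_DELAYS = (1, 5, 30)


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    delays: Sequence[float] = RETRY_DELAYS,
    retry_on: tuple = (ConnectionError, TimeoutError, asyncio.TimeoutError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run operation; on a retryable error, wait and try again.

    Makes one initial attempt plus len(delays) retries, then re-raises.
    """
    for delay in delays:
        try:
            return await operation()
        except retry_on:
            await sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return await operation()


def first_match(
    select_one: Callable[[str], Optional[T]],
    selectors: Sequence[str],
) -> Optional[T]:
    """Try each selector in order and return the first non-None result."""
    for selector in selectors:
        result = select_one(selector)
        if result is not None:
            return result
    return None
```

With aiohttp you would likely pass `retry_on=(aiohttp.ClientError, asyncio.TimeoutError)`, since its exceptions do not all derive from `ConnectionError`. For selectors, a call such as `first_match(soup.select_one, ["h1.opp-title", "h1"])` tries the most specific selector first; the selector strings here are made up for illustration.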
Principle 4: Data Validation and Deduplication
Raw scraped data is messy. You need a robust validation layer between scraping and storage:
from datetime import date, datetime, timezone
from typing import List, Optional

from pydantic import BaseModel, HttpUrl, field_validator  # Pydantic v2 API

# Prefixes we expect on solicitation numbers from our sources
SOLICITATION_PREFIXES = ('W', 'FA', 'N', 'SP', 'HQ')


class Opportunity(BaseModel):
    title: str
    agency: str
    solicitation_number: str
    response_deadline: date
    url: HttpUrl
    naics_codes: List[str] = []
    estimated_value: Optional[float] = None

    @field_validator('title')
    @classmethod
    def title_not_empty(cls, v: str) -> str:
        v = v.strip()
        if len(v) < 10:
            raise ValueError('Title too short - likely parsing error')
        return v

    @field_validator('response_deadline')
    @classmethod
    def deadline_in_future(cls, v: date) -> date:
        if v < date.today():
            raise ValueError('Deadline has already passed')
        return v

    @field_validator('solicitation_number')
    @classmethod
    def valid_solicitation(cls, v: str) -> str:
        # Government solicitation numbers follow specific patterns
        if not v.startswith(SOLICITATION_PREFIXES):
            raise ValueError(f'Unusual solicitation format: {v}')
        return v


class DeduplicationEngine:
    def __init__(self, db_connection):
        # Assumed to be an async MongoDB collection (e.g. from Motor)
        self.db = db_connection

    async def is_duplicate(self, opportunity: Opportunity) -> bool:
        """Check if we've already seen this opportunity"""
        existing = await self.db.find_one({
            'solicitation_number': opportunity.solicitation_number
        })
        return existing is not None

    async def upsert(self, opportunity: Opportunity):
        """Insert or update opportunity"""
        await self.db.update_one(
            {'solicitation_number': opportunity.solicitation_number},
            {
                # mode="json" turns date and HttpUrl into strings that BSON can store
                '$set': opportunity.model_dump(mode="json"),
                # BSON cannot encode datetime.date, so store a full UTC datetime
                '$setOnInsert': {'first_seen': datetime.now(timezone.utc)},
            },
            upsert=True,
        )

Principle 5: Monitoring and Alerting
A scraper running silently in the background is a scraper you have forgotten about. Set up monitoring from day one:
Track records scraped per source per day - any significant deviation triggers an alert
Monitor error rates by type - distinguish between network errors (temporary) and parsing errors (structural change)
Log response times per source - a sudden increase often indicates impending blocks
Set up daily summary emails with key metrics so you can spot trends before they become problems
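The first two checks above can be sketched with the standard library alone. Treat this as an illustration under assumptions rather than the production rules: the function names, the 50% threshold, the median baseline, and the exception-to-bucket mapping are all hypothetical choices.

```python
import asyncio
from statistics import median
from typing import Optional, Sequence


def extraction_alert(
    recent_counts: Sequence[int],
    today_count: int,
    min_ratio: float = 0.5,  # assumed threshold: alert below 50% of baseline
) -> Optional[str]:
    """Return an alert message if today's count is far below the recent median."""
    if not recent_counts:
        return None  # no baseline yet
    baseline = median(recent_counts)
    if baseline > 0 and today_count < baseline * min_ratio:
        return (
            f"Extraction drop: {today_count} records today "
            f"vs. median {baseline:g} over {len(recent_counts)} runs"
        )
    return None


def classify_error(exc: Exception) -> str:
    """Bucket an exception as 'network' (likely temporary) or 'parsing' (likely structural)."""
    if isinstance(exc, (ConnectionError, TimeoutError, asyncio.TimeoutError)):
        return "network"
    if isinstance(exc, (ValueError, KeyError, AttributeError, IndexError)):
        return "parsing"
    return "unknown"
```

A median baseline is used here because a single bad day in the history shifts it less than it would shift a mean. Counts per bucket from `classify_error` could then feed the daily summary.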
Building reliable scrapers is not glamorous work, but it is the foundation that makes AI-powered platforms like SalesBridge.ai possible. The data quality of your AI outputs will never exceed the quality of your data inputs, and that starts with a well-built scraper.