Building Real-Time Messaging Infrastructure: How CloudTxt Handles 10 Million Messages
Behind the scenes of CloudTxt's high-performance messaging system - architecture, scaling strategies, and lessons from processing 10M+ messages.
Dibyank Padhy
Engineering Manager & Full Stack Developer
Table of Contents
The Scale Challenge
CloudTxt is a messaging infrastructure platform that powers communication for businesses at scale. When I joined the project, it was handling about 100,000 messages per day. By the time I was done, it was processing over 10 million. The journey from 100K to 10M was not just about adding more servers - it required fundamental rethinking of the architecture at multiple levels.
Message Flow Architecture
At its core, a messaging system needs to do three things reliably: accept messages, process them (routing, transformation, delivery), and confirm delivery. The challenge is doing all three with low latency and high throughput simultaneously.
// High-level message flow
Client -> API Gateway (rate limiting, auth)
-> Message Queue (SQS FIFO for ordering guarantees)
-> Processing Workers (Lambda fleet)
-> Routing Engine (determine delivery channel)
-> Channel Adapters (SMS via Twilio, Email via SES, Push via FCM)
-> Delivery Confirmation
-> Webhook Notification to Client
// Parallel flows:
Message -> Analytics Pipeline (Kinesis -> S3 -> Athena)
Message -> Compliance Logger (encrypted, immutable audit trail)Strategy 1: Queue-Based Load Leveling
The most important architectural pattern for messaging at scale is queue-based load leveling. Instead of processing messages synchronously through the API, we immediately acknowledge receipt, push the message onto a queue, and process it asynchronously.
This decouples ingestion throughput from processing throughput. During peak hours, when message volume spikes 5-10x above baseline, the queue absorbs the burst while workers process at a steady, sustainable rate. The client sees consistent sub-100ms API response times regardless of backend load.
Strategy 2: Partitioned Processing
Not all messages are equal. A transactional message like a password reset needs to be delivered within seconds. A marketing message can tolerate minutes of delay. We partition messages into priority queues:
Critical Queue: OTPs, password resets, security alerts. Dedicated worker pool with auto-scaling based on queue depth. Target latency: under 5 seconds.
Standard Queue: Notifications, updates, confirmations. Shared worker pool with batch processing. Target latency: under 60 seconds.
Bulk Queue: Marketing messages, newsletters, announcements. Throttled processing with configurable send rates. Target latency: under 30 minutes.
Strategy 3: Idempotent Message Processing
In distributed systems, messages can be delivered more than once. A worker might process a message, crash before acknowledging it, and the queue delivers it again. Without idempotency, the user receives duplicate messages.
class IdempotentMessageProcessor:
def __init__(self, redis_client):
self.redis = redis_client
async def process(self, message):
# Generate idempotency key from message content
idempotency_key = f"msg:{message.id}:{message.attempt}"
# Try to set the key with NX (only if not exists) and TTL
acquired = await self.redis.set(
idempotency_key, "processing",
nx=True, # Only set if not exists
ex=3600, # Expire after 1 hour
)
if not acquired:
# Another worker already processed this message
logger.info(f"Duplicate message detected: {message.id}")
return
try:
result = await self._deliver(message)
await self.redis.set(idempotency_key, "delivered", ex=86400)
return result
except Exception as e:
await self.redis.delete(idempotency_key)
raiseStrategy 4: Connection Pooling and Keep-Alive
When sending millions of messages through third-party APIs like Twilio and AWS SES, connection management becomes critical. Each new HTTPS connection requires a TCP handshake and TLS negotiation - roughly 100ms of overhead. At 10 million messages, that is 277 hours of wasted time just on connection setup.
We implemented aggressive connection pooling with HTTP keep-alive, maintaining warm connections to each provider. This alone reduced our P95 delivery latency by 40%.
Strategy 5: Real-Time Monitoring Dashboard
When you process 10 million messages per day, you need visibility into the system in real-time. We built a custom monitoring dashboard using DataDog that tracks:
Messages per second by channel (SMS, email, push) with anomaly detection
Delivery success rate per provider with automatic failover alerts
Queue depth trends with predictive scaling triggers
Cost per message by channel for financial forecasting
End-to-end latency percentiles (P50, P95, P99) with SLA breach alerts
The Numbers That Matter
Peak throughput: 2,400 messages per second sustained
Average end-to-end latency: 2.3 seconds (from API call to provider delivery)
Delivery success rate: 99.7% (remaining 0.3% are invalid phone numbers or emails)
Monthly infrastructure cost: Approximately $0.001 per message at scale
Zero message loss in 18 months of production operation
Building messaging infrastructure taught me that distributed systems are all about handling failure gracefully. The happy path is easy - it is the edge cases, the retries, the deduplication, and the monitoring that make the difference between a system that works in demos and one that works in production.
Stay Updated
Get notified when I publish new articles on engineering, AI, and leadership. No spam, unsubscribe anytime.