System Design: SMS Gateway
System design for an SMS gateway covering carrier integration via SMPP protocol, message routing, delivery reports, throughput management, and handling billions of SMS messages per day.
Requirements
Functional Requirements:
- Send SMS messages to any mobile number globally via carrier networks
- Receive inbound SMS messages and route to application endpoints (webhooks)
- Delivery receipt tracking: submitted, delivered, failed, rejected
- Support for long messages (concatenated SMS for messages >160 chars)
- Sender ID management: short codes, long codes, alphanumeric sender IDs
- Number lookup (HLR query) to validate mobile numbers before sending
Non-Functional Requirements:
- Process 2 billion outbound SMS messages per day
- End-to-end latency under 5 seconds (gateway to carrier submission)
- 99.9% delivery rate for valid mobile numbers
- 99.99% gateway availability
- Compliance with per-carrier rate limits and regulations (TCPA, GDPR)
Scale Estimation
With 2 billion outbound SMS per day, the system processes approximately 23,150 messages per second sustained, with peaks of 100K/sec during marketing campaign bursts. Each SMS message is up to 160 characters (140 bytes in GSM-7 encoding), plus ~200 bytes of SMPP protocol overhead per PDU. Outbound throughput: ~34 MB/sec at peak (100K messages/sec × ~340 bytes per PDU). Delivery reports arrive asynchronously from carriers, generating another 2 billion events per day. Inbound SMS adds approximately 500 million messages/day. Carrier connections: ~200 direct SMPP connections to carriers across 50+ countries, each supporting 100-1000 messages/sec.
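The estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that uses only the stated figures (2 billion messages/day, 100K/sec peak, 140-byte payload, ~200 bytes of PDU overhead); the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_outbound = 2_000_000_000
peak_rate = 100_000            # messages/sec during campaign bursts
payload_bytes = 140            # 160 GSM-7 characters packed at 7 bits each
pdu_overhead_bytes = 200       # approximate SMPP overhead per PDU

# Average rate if traffic were spread evenly over the day.
sustained_rate = daily_outbound / SECONDS_PER_DAY

# Wire bandwidth at peak, in megabytes per second.
peak_bandwidth_mb = peak_rate * (payload_bytes + pdu_overhead_bytes) / 1_000_000

print(f"sustained: {sustained_rate:,.0f} msg/sec")        # sustained: 23,148 msg/sec
print(f"peak bandwidth: {peak_bandwidth_mb:.0f} MB/sec")  # peak bandwidth: 34 MB/sec
```

Raw bandwidth is therefore modest; the harder constraint is the per-connection message rate that carriers allow.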
High-Level Architecture
The SMS Gateway architecture has three layers: the API layer (customer-facing), the routing layer (internal logic), and the carrier layer (SMPP connections). The API Layer exposes a REST API for customers to submit SMS messages. Each request is validated (sender ID permissions, destination format, content encoding), assigned a unique message_id, and written to a Kafka topic outbound-sms partitioned by destination country code.
The Routing Layer contains the intelligence. A Message Router consumes from the outbound-sms topic and determines the best carrier for each message based on: destination country/network, cost, current carrier availability, and throughput capacity. The router maintains a routing table that maps (country, network_operator) → list of carrier connections ranked by priority. The router also handles number portability lookups — a number originally assigned to Carrier A may have been ported to Carrier B. An MNP (Mobile Number Portability) database is queried for ported numbers.
The Carrier Layer manages persistent SMPP connections to SMS aggregators and direct carrier gateways. Each SMPP Client maintains a TCP connection using the SMPP v3.4 or v5.0 protocol, submitting messages via submit_sm PDUs and receiving delivery receipts via deliver_sm PDUs. The SMPP client handles the windowing protocol (multiple unacknowledged messages in flight), throttling when the carrier returns ESME_RTHROTTLED, and automatic reconnection on connection drops.
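Every SMPP PDU starts with the same 16-byte header: four big-endian 32-bit integers holding the total length, command ID, command status, and sequence number. The sketch below packs and unpacks that header with the standard library. The command and status constants are taken from the SMPP v3.4 specification; the helper names `encode_pdu` and `decode_header` are hypothetical, and the body encoding of `submit_sm` (addresses, data coding, short message) is deliberately left out.

```python
import struct

# Values from the SMPP v3.4 specification.
SUBMIT_SM = 0x00000004
SUBMIT_SM_RESP = 0x80000004
ENQUIRE_LINK = 0x00000015
ESME_RTHROTTLED = 0x00000058

# command_length, command_id, command_status, sequence_number
HEADER = struct.Struct(">IIII")


def encode_pdu(command_id, sequence_number, body=b"", status=0):
    """Prefix an already-encoded body with the 16-byte SMPP header."""
    length = HEADER.size + len(body)
    return HEADER.pack(length, command_id, status, sequence_number) + body


def decode_header(data):
    """Read the header fields from the start of a received PDU."""
    length, command_id, status, sequence_number = HEADER.unpack_from(data)
    return {
        "length": length,
        "command_id": command_id,
        "status": status,
        "sequence_number": sequence_number,
        "throttled": status == ESME_RTHROTTLED,
    }


# An enquire_link PDU has no body, so it is exactly 16 bytes long.
keepalive = encode_pdu(ENQUIRE_LINK, sequence_number=7)
```

A production client would more likely rely on an existing SMPP library than hand-roll the codec; the point here is only to show how little framing the protocol adds per message.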
Core Components
Message Router
The Router is the decision engine. For each outbound message, it: (1) parses the destination phone number to extract country code and determine the Mobile Network Operator (MNO) using a number prefix database (E.164 format + MCC/MNC mapping); (2) queries the MNP database if the country supports number portability; (3) selects the optimal carrier from the routing table based on a weighted score: 0.4*cost + 0.3*delivery_rate + 0.2*latency + 0.1*current_load, where each factor is first normalized to a common scale (cost, latency, and load are lower-is-better, so they are inverted before being combined with delivery rate); (4) applies regulatory checks (time-of-day restrictions, opt-out list verification, TCPA compliance); and (5) enqueues the message for the selected carrier connection.
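Step (3) can be sketched as follows. The weights come from the text; everything else is an assumption made for illustration: the `Carrier` fields, the choice to normalize cost and latency against the worst candidate, and the convention that a higher score is better.

```python
from dataclasses import dataclass


@dataclass
class Carrier:
    carrier_id: str
    cost: float           # cost per segment, same unit for all candidates
    delivery_rate: float  # fraction in [0, 1], higher is better
    latency_ms: float     # lower is better
    load: float           # fraction of throughput limit in use, [0, 1]


WEIGHTS = {"cost": 0.4, "delivery_rate": 0.3, "latency": 0.2, "load": 0.1}


def _inverse_scale(value, worst):
    """Map a lower-is-better value onto [0, 1], where 1 is best."""
    return 1.0 - value / worst if worst > 0 else 1.0


def score(carrier, max_cost, max_latency):
    return (
        WEIGHTS["cost"] * _inverse_scale(carrier.cost, max_cost)
        + WEIGHTS["delivery_rate"] * carrier.delivery_rate
        + WEIGHTS["latency"] * _inverse_scale(carrier.latency_ms, max_latency)
        + WEIGHTS["load"] * (1.0 - carrier.load)
    )


def select_carrier(candidates):
    """Return the highest-scoring carrier among the route's candidates."""
    max_cost = max(c.cost for c in candidates)
    max_latency = max(c.latency_ms for c in candidates)
    return max(candidates, key=lambda c: score(c, max_cost, max_latency))


cheap = Carrier("carrier-a", cost=1.0, delivery_rate=0.95, latency_ms=200, load=0.5)
pricey = Carrier("carrier-b", cost=2.0, delivery_rate=0.96, latency_ms=200, load=0.5)
best = select_carrier([cheap, pricey])
```

With these inputs the cheaper carrier wins, because the 0.4 weight on cost outweighs a one-point difference in delivery rate.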
SMPP Connection Manager
The Connection Manager maintains persistent TCP connections to carriers using the SMPP protocol. Each connection is a bidirectional stream: outbound submit_sm PDUs (sending SMS) and inbound deliver_sm PDUs (receiving delivery reports and MO messages). The manager implements the SMPP windowing protocol: up to W unacknowledged PDUs in flight per connection (typically W=100), with automatic backoff when the window fills. For carrier failover, each route has a primary and secondary carrier; if the primary returns persistent errors or disconnects, traffic is automatically shifted to the secondary within 10 seconds.
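A minimal sketch of the windowing bookkeeping, assuming a single-threaded caller; the class and method names are hypothetical. It tracks which sequence numbers are awaiting a `submit_sm_resp` and refuses to send once W PDUs are outstanding, which is the signal for the caller to back off.

```python
class SendWindow:
    """Tracks unacknowledged PDUs for one SMPP connection."""

    MAX_SEQUENCE = 0x7FFFFFFF  # SMPP sequence numbers run from 1 to 0x7FFFFFFF

    def __init__(self, size=100):
        self.size = size
        self.in_flight = {}  # sequence_number -> message_id
        self._next_seq = 1

    def has_capacity(self):
        return len(self.in_flight) < self.size

    def send(self, message_id):
        """Reserve a slot and return its sequence number, or None if full."""
        if not self.has_capacity():
            return None
        seq = self._next_seq
        self._next_seq = seq % self.MAX_SEQUENCE + 1  # wrap back to 1
        self.in_flight[seq] = message_id
        return seq

    def ack(self, sequence_number):
        """Release the slot when the matching response arrives."""
        return self.in_flight.pop(sequence_number, None)


window = SendWindow(size=2)
first = window.send("msg-1")
second = window.send("msg-2")
blocked = window.send("msg-3")   # None: the window is full
released = window.ack(first)     # "msg-1": one slot is free again
```

Whatever remains in `in_flight` when a connection drops is the set of messages whose outcome is unknown and that need to be resubmitted or reconciled.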
Delivery Report Processor
Carriers send delivery reports (DLRs) asynchronously via SMPP deliver_sm PDUs, typically 5-60 seconds after message submission. The DLR Processor matches each report to the original message using the carrier-assigned message ID (returned in the submit_sm_resp). Reports are classified: DELIVRD (delivered to handset), UNDELIV (permanent failure), EXPIRED (TTL exceeded), REJECTD (carrier rejected). The processor writes the status to the message database and triggers a webhook callback to the customer's configured endpoint with the delivery status.
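Delivery receipts usually carry their fields as text in the `short_message` of the `deliver_sm` PDU, in the layout described in Appendix B of the SMPP v3.4 specification (`id:... sub:... dlvrd:... submit date:... done date:... stat:... err:... text:...`). That layout is a convention rather than a guarantee, and carriers vary it, so the parser below is a sketch for the common case. The mapping onto the stored statuses (delivered/failed) is an assumption, as are the function and field names.

```python
import re

_DLR_RE = re.compile(
    r"id:(?P<id>\S+)\s.*?stat:(?P<stat>[A-Za-z]+)(?:\s+err:(?P<err>\S+))?"
)

# Assumed mapping from carrier receipt states to the message table's status.
STATUS_MAP = {
    "DELIVRD": "delivered",
    "UNDELIV": "failed",
    "EXPIRED": "failed",
    "REJECTD": "failed",
}


def parse_dlr(short_message):
    """Extract the carrier message ID and final state from a receipt."""
    match = _DLR_RE.search(short_message)
    if match is None:
        return None  # not a receipt in the expected layout
    stat = match.group("stat").upper()
    return {
        "carrier_message_id": match.group("id"),
        "carrier_status": stat,
        # None means "unrecognized state, leave the stored status unchanged".
        "status": STATUS_MAP.get(stat),
        "error_code": match.group("err"),
    }


receipt = parse_dlr(
    "id:abc123 sub:001 dlvrd:001 submit date:2401011200 "
    "done date:2401011201 stat:DELIVRD err:000 text:Hello"
)
```

The `carrier_message_id` extracted here is the key used to look up the original message before updating its status and firing the webhook.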
Database Design
The message store uses PostgreSQL (sharded by customer_id) for transactional consistency. The messages table: message_id (UUID), customer_id, from_number, to_number, body, encoding (GSM-7/UCS-2), segment_count, carrier_id, carrier_message_id, status (queued/submitted/delivered/failed), created_at, submitted_at, delivered_at, cost_millicents, error_code. Indexes on (customer_id, created_at) for message history queries and (carrier_message_id) for DLR matching.
The routing table is stored in PostgreSQL and cached in Redis: route:{country_code}:{mno_code} → JSON array of {carrier_id, priority, cost_per_segment, throughput_limit}. The MNP database is a key-value store (Redis or RocksDB): mnp:{phone_number} → {current_operator, port_date}. Carrier connection state (current window size, messages/sec, error rate) is tracked in Redis hashes updated by the SMPP Connection Manager every second.
API Design
- POST /api/v1/messages: Send SMS. Request: {from: '+1234567890', to: '+447911123456', body: 'Hello', webhook_url?: 'https://...', validity_period?: 3600}; returns {message_id, segment_count, status: 'queued'}
- GET /api/v1/messages/{message_id}: Fetch message status; returns the full message record including delivery report details
- POST /api/v1/lookups/{phone_number}: HLR lookup; returns {valid: true, carrier: 'Vodafone UK', number_type: 'mobile', ported: false}
- GET /api/v1/messages?from={date}&to={date}&status={status}&limit=100: List messages with filters for reporting and analytics
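The segment_count returned by the send endpoint depends on the encoding. A single GSM-7 message holds 160 characters and a single UCS-2 message holds 70; once a message is concatenated, the User Data Header takes part of each segment, leaving 153 and 67 characters respectively. The sketch below applies those limits. The function name is hypothetical, and it ignores national language shift tables and the rule that an escape sequence or surrogate pair cannot be split across segments, so it may undercount by one in rare edge cases.

```python
import math

# GSM 03.38 default alphabet (one septet each).
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table characters cost two septets (escape + character).
GSM7_EXTENDED = set("^{}\\[~]|€")


def segment_count(body):
    """Return (encoding, number of SMS segments) for a message body."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in body):
        encoding = "GSM-7"
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in body)
        single_limit, multi_limit = 160, 153
    else:
        encoding = "UCS-2"
        units = len(body.encode("utf-16-be")) // 2  # 16-bit code units
        single_limit, multi_limit = 70, 67
    if units <= single_limit:
        return encoding, 1
    return encoding, math.ceil(units / multi_limit)


print(segment_count("a" * 160))  # ('GSM-7', 1)
print(segment_count("a" * 161))  # ('GSM-7', 2)
print(segment_count("你" * 71))   # ('UCS-2', 2)
```

Since cost is charged per segment, a single non-GSM character such as an emoji can more than double the price of a message by forcing UCS-2 encoding.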
Scaling & Bottlenecks
The primary bottleneck is carrier throughput limits. Each carrier connection typically supports 100-1000 messages/sec, and carriers enforce strict rate limits with ESME_RTHROTTLED responses. To achieve 100K messages/sec globally, the system maintains 200+ carrier connections across 50+ countries. The Router implements per-carrier token bucket rate limiting, automatically distributing traffic across multiple carriers for the same destination. When a carrier is throttled, overflow traffic is routed to backup carriers with slightly higher cost but available capacity.
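Per-carrier token buckets with overflow to backup carriers can be sketched as below. This is a single-process illustration with hypothetical names; at the scale described, the bucket state would more plausibly live in shared storage such as the Redis hashes mentioned earlier. The clock is injectable so the behavior can be tested deterministically.

```python
import time


class TokenBucket:
    """Allows `rate` messages/sec on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, n=1):
        # Refill based on elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def pick_with_overflow(ranked_carrier_ids, buckets):
    """Try carriers in priority order; spill to the next one when throttled."""
    for carrier_id in ranked_carrier_ids:
        if buckets[carrier_id].try_acquire():
            return carrier_id
    return None  # every carrier is at its limit; leave the message queued


fake_now = [0.0]
buckets = {
    "primary": TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0]),
    "backup": TokenBucket(rate=1, capacity=1, clock=lambda: fake_now[0]),
}
choices = [pick_with_overflow(["primary", "backup"], buckets) for _ in range(4)]
# choices == ["primary", "primary", "backup", None]
```

Setting each bucket's rate slightly below the carrier's contracted limit keeps ESME_RTHROTTLED responses rare instead of using them as the primary feedback signal.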
SMPP connection reliability is critical. TCP connections to carriers can drop due to network issues, carrier maintenance, or idle timeouts. The Connection Manager implements: enquire_link heartbeats every 30 seconds (SMPP keepalive), automatic reconnection with exponential backoff (1s, 2s, 4s, up to 60s), and connection pooling (multiple SMPP sessions per carrier for throughput). During reconnection, queued messages remain buffered in Kafka and are replayed once the connection is re-established, so no accepted message is lost. Because replay gives at-least-once delivery, messages that were in flight when the connection dropped may be submitted twice unless they are deduplicated by message_id.
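The reconnection schedule (1s, 2s, 4s, up to 60s) corresponds to a capped doubling sequence. A minimal sketch, with a hypothetical generator name:

```python
import itertools


def backoff_delays(base=1.0, cap=60.0):
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... capped."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


schedule = list(itertools.islice(backoff_delays(), 8))
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A reconnect loop would sleep for each yielded delay between attempts and create a fresh generator after a successful bind, so the next outage starts again from one second.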
Key Trade-offs
- SMPP over HTTP-based carrier APIs: SMPP is the telecom standard with lower overhead per message (binary protocol, persistent connection), but requires managing complex TCP connection state; HTTP APIs are simpler but add latency and overhead per message
- Multi-carrier routing over single carrier: Using multiple carriers per destination improves delivery rates (failover) and reduces cost (competitive routing), but adds routing complexity and requires maintaining multiple carrier relationships and integrations
- PostgreSQL for messages over NoSQL: PostgreSQL provides ACID guarantees needed for billing accuracy (each message has a cost), but sharding adds complexity; NoSQL would simplify scaling but complicate the billing/reporting queries
- Synchronous carrier submission over fire-and-forget: Waiting for the submit_sm_resp before acknowledging the API request gives the customer accurate status, but adds carrier latency to the API response time; an async model would be faster but less reliable