Real-time messaging system handling 100B+ messages per day with end-to-end encryption, presence tracking, and media delivery across billions of devices.
Daily messages: 100B+
Concurrent connections: ~1B
Media per day: ~7B photos, 1B videos
Servers: ~10,000 Erlang servers (historically)
Latency: <200ms message delivery globally
Data Flow
Mobile Client → Load Balancer: Connect
Client establishes a persistent WebSocket connection through the load balancer.
Load Balancer → Chat Server: Route
Load balancer assigns the connection to a chat server, maintaining session affinity.
Mobile Client → Chat Server: Send Message
Client sends an encrypted message over the WebSocket. Chat server validates and processes it.
Chat Server → Presence Service: Check Status
Chat server checks whether the recipient is online and which chat server holds their connection.
Chat Server → Chat Server: Forward
If the recipient is online, the message is forwarded directly to their chat server for immediate delivery.
Chat Server → Message Queue: Queue
If the recipient is offline, the message is queued for later delivery.
Chat Server → Push Notification: Notify
A push notification is sent via APNs/FCM to wake the recipient's device.
Mobile Client → Media Storage: Upload Media
Client uploads encrypted media directly to storage and receives a media ID.
Mobile Client → CDN: Download Media
Recipient downloads the encrypted media from the nearest CDN edge.
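The message-routing steps above (presence check, forward, queue, notify) can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `ChatRouter`, its method names, and the `log` list are hypothetical, a dictionary stands in for the presence service, and a deque stands in for the message queue. It is not the production design.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class ChatRouter:
    """Toy in-memory router; every name here is hypothetical."""

    # user_id -> id of the chat server currently holding that user's connection
    presence: dict = field(default_factory=dict)
    # user_id -> ciphertexts waiting for the user to reconnect
    offline_queue: dict = field(default_factory=lambda: defaultdict(deque))
    # (action, target, payload) tuples so the flow can be inspected
    log: list = field(default_factory=list)

    def connect(self, user_id, server_id):
        """Register the connection, then drain queued messages in arrival order."""
        self.presence[user_id] = server_id
        pending = self.offline_queue.pop(user_id, deque())
        while pending:
            self.log.append(("forward", server_id, pending.popleft()))

    def disconnect(self, user_id):
        """Forget the connection so later messages are queued instead."""
        self.presence.pop(user_id, None)

    def send(self, recipient_id, ciphertext):
        """Route one encrypted message; the payload is treated as opaque bytes."""
        server_id = self.presence.get(recipient_id)
        if server_id is not None:
            # Recipient is online: forward to the server holding the connection.
            self.log.append(("forward", server_id, ciphertext))
            return "forwarded"
        # Recipient is offline: queue the message and request a push wake-up.
        self.offline_queue[recipient_id].append(ciphertext)
        self.log.append(("push", recipient_id, None))
        return "queued"
```

Calling `send` for a connected user returns `"forwarded"`. Calling it for a disconnected user returns `"queued"` and records a push request, and a later `connect` for that user drains the queue in order.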
Key Architectural Decisions
- End-to-end encryption using Signal Protocol means servers never see plaintext — limits server-side features but maximizes privacy
- Erlang/BEAM VM chosen for extreme concurrency — single server handles millions of connections
- Messages stored only until delivered, then deleted from servers — reduces storage but complicates multi-device sync
- Fan-out on write for small groups, fan-out on read for broadcast lists
- WebSocket for persistent connections with fallback to long polling
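To make the fan-out decision above concrete, here is a minimal sketch contrasting the two strategies. `FanOutStore` and its methods are hypothetical names chosen for illustration. Payloads are treated as opaque bytes, and how per-recipient encryption interacts with a single shared copy is deliberately left out of scope.

```python
from collections import defaultdict


class FanOutStore:
    """Toy store contrasting fan-out on write with fan-out on read."""

    def __init__(self):
        self.inboxes = defaultdict(list)        # user_id -> per-recipient copies
        self.broadcasts = defaultdict(list)     # list_id -> single shared copy
        self.subscriptions = defaultdict(set)   # user_id -> followed list ids

    def send_to_group(self, members, sender, ciphertext):
        """Fan-out on write: one inbox write per recipient at send time."""
        writes = 0
        for member in members:
            if member != sender:
                self.inboxes[member].append(ciphertext)
                writes += 1
        return writes

    def subscribe(self, user_id, list_id):
        """Record that a user receives a given broadcast list."""
        self.subscriptions[user_id].add(list_id)

    def send_to_broadcast(self, list_id, ciphertext):
        """Fan-out on read: store once, regardless of audience size."""
        self.broadcasts[list_id].append(ciphertext)
        return 1

    def fetch(self, user_id):
        """Read path: merge the inbox with every followed broadcast list."""
        messages = list(self.inboxes[user_id])
        for list_id in sorted(self.subscriptions[user_id]):
            messages.extend(self.broadcasts[list_id])
        return messages
```

The write cost of `send_to_group` grows with group size, while `send_to_broadcast` always performs one write and shifts the work to `fetch`. That asymmetry is why the write-heavy strategy suits small groups and the read-heavy one suits large audiences.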
Tradeoffs
Strengths
- Extreme efficiency: ~50 engineers served 900M users (as reported in 2015, after the 2014 acquisition)
- End-to-end encryption provides strong privacy guarantees
- Erlang's actor model naturally maps to per-connection processes
- Minimal server-side storage reduces data liability
Weaknesses
- Multi-device support is limited by the encryption model
- No server-side search or message history (by design)
- Group size limits due to fan-out costs
- Media transcoding happens client-side, increasing battery usage
Interview Drilldown Questions
- How would you handle message ordering in a distributed chat system?
- What happens when a user comes online and has 10,000 pending messages?
- How does end-to-end encryption work with group chats?
- How would you design the read receipt system?
- What's the strategy for handling media in regions with poor connectivity?
Components
Presence Service
Tracks online/offline/typing status with heartbeats
Push Notification
APNs/FCM for waking dormant clients
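Heartbeat-based presence, as described for the Presence Service above, can be sketched as a last-seen table with a time-to-live. `PresenceTracker`, the 30-second default TTL, and the injectable `clock` parameter are assumptions made for illustration. A real deployment would presumably shard this state and also track typing status.

```python
import time


class PresenceTracker:
    """Toy heartbeat tracker: a user is online if seen within the TTL."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so tests can control time
        self.last_seen = {}     # user_id -> (timestamp, server_id)

    def heartbeat(self, user_id, server_id):
        """Record that the user's connection on server_id is still alive."""
        self.last_seen[user_id] = (self.clock(), server_id)

    def lookup(self, user_id):
        """Return the server holding the connection, or None if offline."""
        entry = self.last_seen.get(user_id)
        if entry is None:
            return None
        seen_at, server_id = entry
        if self.clock() - seen_at > self.ttl:
            # Heartbeat expired: treat the user as offline and drop the entry.
            del self.last_seen[user_id]
            return None
        return server_id
```

A chat server would call `heartbeat` on each client ping and `lookup` before deciding whether to forward a message or queue it.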
Source: editorial — Synthesized from public engineering talks, Signal Protocol documentation, and system design references