Design Message Notification Service
System Overview
The Message Notification Service is a scalable, multi-channel notification platform designed to handle 10 million messages per day across email, SMS, and WeChat channels. The system employs an event-driven architecture with message queues for decoupling, template-based messaging, and comprehensive delivery tracking.
Interview Insight: When discussing notification systems, emphasize the trade-offs between consistency and availability. For notifications, we typically choose availability over strict consistency since delayed delivery is preferable to no delivery.
graph TB
A[Business Services] --> B[MessageNotificationSDK]
B --> C[API Gateway]
C --> D[Message Service]
D --> E[Message Queue]
E --> F[Channel Processors]
F --> G[Email Service]
F --> H[SMS Service]
F --> I[WeChat Service]
D --> J[Template Engine]
D --> K[Scheduler Service]
F --> L[Delivery Tracker]
L --> M[Analytics DB]
D --> N[Message Store]
Core Architecture Components
Message Notification Service API
The central service provides RESTful APIs for immediate and scheduled notifications.
Interview Insight: Discuss idempotency here - each message should have a unique ID to prevent duplicate sends. This is crucial for financial notifications or critical alerts.
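The idempotency point can be made concrete with a minimal sketch. It assumes an in-memory dictionary standing in for a shared store such as Redis or the message store; the `NotificationRequest` fields and the `NotificationApi` class are hypothetical names for illustration, not the service's actual schema.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class NotificationRequest:
    # Field names are illustrative assumptions, not the real API schema.
    channel: str
    recipient: str
    template_id: str
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class NotificationApi:
    """Accepts each message_id at most once (in-memory stand-in for a shared store)."""

    def __init__(self):
        self._accepted = {}  # message_id -> request
        self.queue = []      # stand-in for the message queue

    def submit(self, request):
        # A repeated message_id is acknowledged but not enqueued a second time.
        if request.message_id in self._accepted:
            return {"message_id": request.message_id, "status": "duplicate"}
        self._accepted[request.message_id] = request
        self.queue.append(request)
        return {"message_id": request.message_id, "status": "accepted"}


api = NotificationApi()
req = NotificationRequest("email", "user@example.com", "welcome",
                          message_id="order-42-confirmation")
first = api.submit(req)
second = api.submit(req)  # e.g. a client retry after a timeout
```

Letting the caller supply the ID (here derived from a business event) is what makes a retried request recognizable as the same message.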
Message Queue Architecture
The system uses Apache Kafka for high-throughput message processing with the following topic structure:
- notification.immediate: Real-time notifications
- notification.scheduled: Scheduled notifications
- notification.retry: Failed message retries
- notification.dlq: Dead letter queue for permanent failures
flowchart LR
A[API Gateway] --> B[Message Validator]
B --> C{Message Type}
C -->|Immediate| D[notification.immediate]
C -->|Scheduled| E[notification.scheduled]
D --> F[Channel Router]
E --> G[Scheduler Service]
G --> F
F --> H[Email Processor]
F --> I[SMS Processor]
F --> J[WeChat Processor]
H --> K[Email Provider]
I --> L[SMS Provider]
J --> M[WeChat API]
Interview Insight: Explain partitioning strategy - partition by user ID for ordered processing per user, or by message type for parallel processing. The choice depends on whether message ordering matters for your use case.
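The two partitioning choices can be sketched as a key-selection function plus a stable hash. This is an illustration only: Kafka's default partitioner uses its own hash, and the MD5-based mapping, the `partition_for` helper, and the message field names here are assumptions made for the example.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def partition_key(message: dict, ordered_per_user: bool) -> str:
    # Keying by user keeps one user's messages in order on a single partition;
    # keying by message type gives up per-user ordering.
    return message["user_id"] if ordered_per_user else message["type"]


message = {"user_id": "user-123", "type": "order_shipped"}
partition = partition_for(partition_key(message, ordered_per_user=True), 12)
```

Because the hash is deterministic, every message for `user-123` lands on the same partition, which is what preserves per-user ordering.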
Template Engine Design
Templates support dynamic content injection with internationalization.
Interview Insight: Template versioning is critical for production systems. Discuss A/B testing capabilities where different template versions can be tested simultaneously to optimize engagement rates.
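As a rough illustration of versioned, localized templates, the sketch below uses Python's standard `string.Template`. The template store layout, the `(template_id, version, locale)` key, and the sample texts are all hypothetical; a production engine would likely use a richer templating library.

```python
from string import Template

# Hypothetical template store keyed by (template_id, version, locale).
TEMPLATES = {
    ("order_shipped", 1, "en"): "Hi ${name}, your order ${order_id} has shipped.",
    ("order_shipped", 2, "en"): "Good news, ${name}! Order ${order_id} is on its way.",
    ("order_shipped", 1, "zh"): "${name}，您的订单 ${order_id} 已发货。",
}


def render(template_id, version, locale, variables, default_locale="en"):
    # Fall back to the default locale when a translation is missing.
    text = TEMPLATES.get((template_id, version, locale))
    if text is None:
        text = TEMPLATES[(template_id, version, default_locale)]
    # substitute() raises KeyError on a missing variable rather than
    # letting a half-filled message go out.
    return Template(text).substitute(variables)


rendered = render("order_shipped", 2, "fr", {"name": "Ada", "order_id": "A-1"})
```

Keeping the version in the lookup key is one simple way to serve two versions side by side for an A/B test.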
Scalability and Performance
High-Volume Message Processing
Handling 10 million messages daily (approximately 116 messages/second on average, 1,000+ messages/second at peak) relies on the following:
Horizontal Scaling Strategy:
- Multiple consumers per Kafka consumer group, spread across topic partitions, for parallel processing
- Channel-specific processors with independent scaling
- Load balancing across processor instances
Performance Optimizations:
- Connection pooling for external APIs
- Batch processing for similar notifications
- Asynchronous processing with circuit breakers
sequenceDiagram
participant BS as Business Service
participant SDK as Notification SDK
participant API as API Gateway
participant MQ as Message Queue
participant CP as Channel Processor
participant EP as Email Provider
BS->>SDK: sendNotification(request)
SDK->>API: POST /notifications
API->>API: Validate & Enrich
API->>MQ: Publish message
API-->>SDK: messageId (async)
SDK-->>BS: messageId
MQ->>CP: Consume message
CP->>CP: Apply template
CP->>EP: Send email
EP-->>CP: Delivery status
CP->>MQ: Update delivery status
Interview Insight: Discuss the CAP theorem application - in notification systems, we choose availability and partition tolerance over consistency. It’s better to potentially send a duplicate notification than to miss sending one entirely.
Caching Strategy
Multi-Level Caching:
- Template Cache: Redis cluster for compiled templates
- User Preference Cache: User notification preferences and contact info
- Rate Limiting Cache: Sliding window counters for rate limiting
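The sliding window counter used for rate limiting can be sketched as follows. This version keeps state in a local dictionary; in the design above the counters would live in the shared cache instead. The class name and its parameters are assumptions for the example.

```python
class SlidingWindowCounter:
    """Approximates a sliding window by weighting the previous fixed window's count."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.state = {}  # key -> (window_index, current_count, previous_count)

    def allow(self, key, now):
        # Assumes `now` (seconds) never goes backwards for a given key.
        index = int(now // self.window)
        last_index, current, previous = self.state.get(key, (index, 0, 0))
        if index == last_index + 1:
            previous, current = current, 0   # rolled into the next window
        elif index > last_index + 1:
            previous, current = 0, 0         # idle for a full window or more
        # The share of the previous window still inside the sliding window.
        elapsed_fraction = (now % self.window) / self.window
        estimated = previous * (1 - elapsed_fraction) + current
        allowed = estimated + 1 <= self.limit
        if allowed:
            current += 1
        self.state[key] = (index, current, previous)
        return allowed


limiter = SlidingWindowCounter(limit=3, window_seconds=60)
burst = [limiter.allow("user-123", t) for t in (0, 1, 2, 3)]
```

The weighting avoids the burst at window boundaries that a plain fixed-window counter allows, at the cost of being an estimate rather than an exact count.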
Channel-Specific Implementations
Email Service
Provider Failover Strategy:
- Primary: AWS SES (high volume, cost-effective)
- Secondary: SendGrid (reliability backup)
- Tertiary: Mailgun (final fallback)
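The failover order above could be expressed as a loop over providers in priority order. The senders below are stubs, not real SES, SendGrid, or Mailgun SDK calls, and `ProviderError` is a hypothetical exception each real client wrapper would need to raise.

```python
class ProviderError(Exception):
    """Raised by a provider wrapper when a send attempt fails."""


def send_with_failover(providers, message):
    """Try providers in priority order; return the name of the one that accepted the message."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def ses_stub(message):
    raise ProviderError("throttled")  # simulate the primary being unavailable


def sendgrid_stub(message):
    return None  # simulate a successful send


def mailgun_stub(message):
    return None


providers = [("ses", ses_stub), ("sendgrid", sendgrid_stub), ("mailgun", mailgun_stub)]
used = send_with_failover(providers, {"to": "user@example.com", "subject": "Hello"})
```

In practice this loop would usually sit behind a circuit breaker per provider, so a provider that is known to be down is skipped rather than tried on every message.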
SMS Service Implementation
Interview Insight: SMS routing is geography-dependent. Different providers have better delivery rates in different regions. Discuss how you’d implement intelligent routing based on phone number analysis.
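A very simple form of routing by phone number analysis is a longest-prefix match on the country calling code. The provider names and the routing table are made up for illustration, and the sketch assumes numbers are already normalized to E.164 format (a leading `+` followed by a 1 to 3 digit country code).

```python
# Hypothetical routing table: country calling code -> preferred provider.
ROUTES = {
    "+1": "provider_us",
    "+86": "provider_cn",
    "+852": "provider_hk",
}
DEFAULT_PROVIDER = "provider_global"


def route_sms(phone_number: str) -> str:
    # Try the longest possible prefix first: "+" plus 3, 2, then 1 digits.
    for length in (4, 3, 2):
        provider = ROUTES.get(phone_number[:length])
        if provider is not None:
            return provider
    return DEFAULT_PROVIDER


chosen = route_sms("+8613800138000")
```

A more intelligent router would replace the static table with per-region delivery rates observed by the Delivery Tracker.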
WeChat Integration
WeChat requires special handling due to its ecosystem.
Scheduling and Delivery Management
Scheduler Service Architecture
flowchart TD
A[Scheduled Messages] --> B[Time-based Partitioner]
B --> C[Quartz Scheduler Cluster]
C --> D[Message Trigger]
D --> E{Delivery Window?}
E -->|Yes| F[Send to Processing Queue]
E -->|No| G[Reschedule]
F --> H[Channel Processors]
G --> A
Delivery Window Management:
- Timezone-aware scheduling
- Business hours enforcement
- Frequency capping to prevent spam
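The delivery window check might look like the sketch below. To stay self-contained it models the recipient's timezone as a fixed UTC offset, which ignores daylight saving time; a real implementation would more likely use IANA zone names. The 09:00 to 21:00 window and the function name are assumptions.

```python
from datetime import datetime, time, timedelta, timezone


def next_delivery_time(now_utc, utc_offset_hours, start=time(9, 0), end=time(21, 0)):
    """Return now_utc if the recipient's local time is in [start, end), else the next window start in UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = now_utc.astimezone(local_zone)
    if start <= local.time() < end:
        return now_utc
    next_start = local.replace(hour=start.hour, minute=start.minute,
                               second=0, microsecond=0)
    if local.time() >= end:
        next_start += timedelta(days=1)  # window already closed today
    return next_start.astimezone(timezone.utc)


# 15:00 UTC is 23:00 at UTC+8, so delivery is deferred to 09:00 local the next day.
deferred = next_delivery_time(datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc), 8)
```

This corresponds to the "Delivery Window?" decision in the scheduler flowchart: a message outside the window is rescheduled for the returned time rather than dropped.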
Retry and Failure Handling
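Exponential Backoff Strategy: transient failures are retried through notification.retry with delays that grow exponentially between attempts; messages that exhaust their retries move to notification.dlq.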
Interview Insight: Discuss the importance of classifying failures - temporary vs permanent. Retrying an invalid email address wastes resources, while network timeouts should be retried with backoff.
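Failure classification and backoff can be combined in one small decision function. The error codes, the attempt limit, and the base and cap delays below are illustrative assumptions; the backoff uses the "full jitter" variant, where the delay is drawn uniformly below the exponential bound.

```python
import random

# Hypothetical error codes; real providers report failures in their own formats.
PERMANENT_ERRORS = {"invalid_address", "unsubscribed", "blocked"}


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def next_action(error_code, attempt, max_attempts=5):
    """Decide what to do after attempt number `attempt` (0-based) failed."""
    if error_code in PERMANENT_ERRORS:
        return ("dead_letter", None)   # retrying cannot succeed
    if attempt + 1 >= max_attempts:
        return ("dead_letter", None)   # retries exhausted
    # Anything not known to be permanent is treated as transient here.
    return ("retry", backoff_delay(attempt))


action, delay = next_action("timeout", attempt=0)
```

The jitter spreads retries out so that many messages failing at once do not all hit the provider again at the same instant.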
MessageNotificationSDK Design
SDK Architecture
SDK Configuration
Interview Insight: The SDK should be resilient to service unavailability. Discuss local queuing, circuit breakers, and graceful degradation strategies.
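One way to picture these resilience strategies is a circuit breaker wrapped around the SDK's send call, with a local queue as the degraded path. This is a sketch under assumptions: the class names, thresholds, and the use of `ConnectionError` as the failure signal are invented for the example and do not describe the actual SDK.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the timeout has passed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


class ResilientClient:
    """Hypothetical SDK wrapper that buffers locally instead of failing the caller."""

    def __init__(self, send, breaker):
        self.send = send
        self.breaker = breaker
        self.local_queue = []

    def notify(self, message):
        if not self.breaker.allow_request():
            self.local_queue.append(message)  # service presumed down, do not call it
            return "queued"
        try:
            self.send(message)
        except ConnectionError:
            self.breaker.record_failure()
            self.local_queue.append(message)
            return "queued"
        self.breaker.record_success()
        return "sent"
```

A background task would later drain `local_queue` once the breaker closes again; a real SDK would also bound that queue so an extended outage cannot exhaust memory.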
Monitoring and Observability
Key Metrics Dashboard
Throughput Metrics:
- Messages processed per second by channel
- Queue depth and processing latency
- Template rendering performance
Delivery Metrics:
- Delivery success rate by channel and provider
- Bounce and failure rates
- Time to delivery distribution
Business Metrics:
- User engagement rates
- Opt-out rates by channel
- Cost per notification by channel
graph LR
A[Notification Service] --> B[Metrics Collector]
B --> C[Prometheus]
C --> D[Grafana Dashboard]
B --> E[Application Logs]
E --> F[ELK Stack]
B --> G[Distributed Tracing]
G --> H[Jaeger]
Alerting Strategy
Critical Alerts:
- Queue depth > 10,000 messages
- Delivery success rate < 95%
- Provider API failure rate > 5%
Warning Alerts:
- Processing latency > 30 seconds
- Template rendering errors
- Unusual bounce rate increases
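The numeric thresholds above can be written down as data and checked against a metrics snapshot. In the stack shown earlier these would more naturally be Prometheus alerting rules; the Python form below is only a sketch, and the metric names are hypothetical.

```python
import operator

# (severity, metric name, comparison, threshold) taken from the lists above.
RULES = [
    ("critical", "queue_depth", operator.gt, 10_000),
    ("critical", "delivery_success_rate", operator.lt, 0.95),
    ("critical", "provider_failure_rate", operator.gt, 0.05),
    ("warning", "processing_latency_seconds", operator.gt, 30),
]


def evaluate(metrics):
    """Return (severity, metric) pairs for every rule the snapshot violates."""
    fired = []
    for severity, name, compare, threshold in RULES:
        value = metrics.get(name)
        if value is not None and compare(value, threshold):
            fired.append((severity, name))
    return fired


alerts = evaluate({
    "queue_depth": 12_000,
    "delivery_success_rate": 0.97,
    "processing_latency_seconds": 45,
})
```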
Security and Compliance
Data Protection
Encryption:
- At-rest: AES-256 encryption for stored messages
- In-transit: TLS 1.3 for all API communications
- PII masking in logs and metrics
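PII masking in logs could start from something as small as the following. The regular expressions are deliberately simple and will miss some formats or over-match others; treat them as placeholders for a vetted masking library or a structured-logging filter.

```python
import re

# Keep the first character of the local part and the domain; hide the rest.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
# Loose phone pattern: optional "+", then at least 8 digits with optional separators.
PHONE_RE = re.compile(r"\+?\d[\d\- ]{6,}\d")


def mask_pii(text: str) -> str:
    text = EMAIL_RE.sub(r"\1***@\2", text)
    # Keep only the last four characters of anything that looks like a phone number.
    return PHONE_RE.sub(lambda m: "*" * (len(m.group()) - 4) + m.group()[-4:], text)


masked = mask_pii("sent to alice@example.com and +8613800138000")
```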
Access Control:
Compliance Considerations
GDPR Compliance:
- Right to be forgotten: Deletion of a user's messages on request, plus automatic deletion after the retention period
- Consent management: Integration with preference center
- Data minimization: Only store necessary message data
CAN-SPAM Act:
- Automatic unsubscribe link injection
- Sender identification requirements
- Opt-out processing within 10 business days
Interview Insight: Security should be built-in, not bolted-on. Discuss defense in depth - encryption, authentication, authorization, input validation, and audit logging at every layer.
Performance Benchmarks and Capacity Planning
Load Testing Results
Target Performance:
- 10M messages/day ≈ 116 messages/second average
- Peak capacity: 1,000 messages/second
- 99th percentile latency: < 100ms for API calls
- 95th percentile delivery time: < 30 seconds
Scaling Calculations:
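A sizing calculation of this kind can be worked through in a few lines. Only the 10M messages/day and 1,000 messages/second figures come from the targets above; the channel mix, per-instance throughput, and headroom are invented purely to show the arithmetic and would need to be replaced with measured values.

```python
MESSAGES_PER_DAY = 10_000_000
PEAK_RATE = 1_000  # messages/second, from the targets above

average_rate = MESSAGES_PER_DAY / 86_400  # 86,400 seconds per day

# Assumed for illustration only: channel mix (%) and messages/second per processor instance.
CHANNEL_SHARE_PCT = {"email": 60, "sms": 30, "wechat": 10}
PER_INSTANCE_RATE = {"email": 100, "sms": 50, "wechat": 50}


def instances_needed(channel, headroom_pct=150):
    """Processor instances needed to absorb the channel's peak share plus headroom."""
    numerator = PEAK_RATE * CHANNEL_SHARE_PCT[channel] * headroom_pct
    denominator = 100 * 100 * PER_INSTANCE_RATE[channel]
    return -(-numerator // denominator)  # ceiling division in integer math


plan = {channel: instances_needed(channel) for channel in CHANNEL_SHARE_PCT}
```

Sizing against the peak rate rather than the average matters here: the average of roughly 116 messages/second is almost an order of magnitude below the peak.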
Database Sizing
Message Storage Requirements:
- 10M messages/day × 2KB average size = 20GB/day
- 90-day retention = 1.8TB storage requirement
- With replication and indexes: 5TB total
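The storage figures follow from straightforward arithmetic, shown here in decimal units (1 GB = 1,000,000 KB). The last line only derives the overhead multiplier implied by the 5TB total; it is not a statement about the actual replication factor.

```python
MESSAGES_PER_DAY = 10_000_000
AVG_MESSAGE_KB = 2
RETENTION_DAYS = 90
TOTAL_TB = 5  # total quoted above, including replication and indexes

daily_gb = MESSAGES_PER_DAY * AVG_MESSAGE_KB / 1_000_000  # KB -> GB
retained_tb = daily_gb * RETENTION_DAYS / 1_000           # GB -> TB
implied_overhead = TOTAL_TB / retained_tb                 # roughly 2.8x raw data
```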
Cost Optimization Strategies
Provider Cost Management
Email Costs:
- AWS SES: $0.10 per 1,000 emails
- SendGrid: $0.20 per 1,000 emails
- Strategy: Primary on SES, failover to SendGrid
SMS Costs:
- Twilio US: $0.0075 per SMS
- International routing for cost optimization
- Bulk messaging discounts negotiation
Infrastructure Costs:
- Kafka cluster: $500/month
- Application servers: $800/month
- Database: $300/month
- Monitoring: $200/month
- Total: ~$1,800/month for 10M messages
Interview Insight: Always discuss cost optimization in system design. Show understanding of the business impact - a 10% improvement in delivery rates might justify 50% higher costs if it drives revenue.
Testing Strategy
Integration Testing
Chaos Engineering
Failure Scenarios:
- Provider API timeouts and rate limiting
- Database connection failures
- Kafka broker failures
- Network partitions between services
Future Enhancements
Advanced Features Roadmap
Machine Learning Integration:
- Optimal send time prediction per user
- Template A/B testing automation
- Delivery success rate optimization
Rich Media Support:
- Image and video attachments
- Interactive email templates
- Push notification rich media
Advanced Analytics:
- User engagement scoring
- Campaign performance analytics
- Predictive churn analysis
Interview Insight: Always end system design discussions with future considerations. This shows forward thinking and understanding that systems evolve. Discuss how your current architecture would accommodate these enhancements.
Conclusion
This Message Notification Service design provides a robust, scalable foundation for high-volume, multi-channel notifications. The architecture emphasizes reliability, observability, and maintainability while meeting the 10 million messages per day requirement with room for growth.
Key design principles applied:
- Decoupling: Message queues separate concerns and enable independent scaling
- Reliability: Multiple failover mechanisms and retry strategies
- Observability: Comprehensive monitoring and alerting
- Security: Built-in encryption, access control, and compliance features
- Cost Efficiency: Provider optimization and resource right-sizing
The system can be deployed incrementally, starting with core notification functionality and adding advanced features as business needs evolve.