Elasticsearch High Availability: Deep Dive Guide
Shards and Replicas: The Foundation of Elasticsearch HA
Understanding Shards
Shards are the fundamental building blocks of Elasticsearch’s distributed architecture. Each index is divided into multiple shards, which are essentially independent Lucene indices that can be distributed across different nodes in a cluster.
Primary Shards:
- Store the original data
- Handle write operations
- Number is fixed at index creation time
- Cannot be changed without reindexing
Shard Sizing Best Practices:
PUT /my_index
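A fuller sketch of such a request, following the sizing practices above (the index name and values are illustrative, not from the source):

```
PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.routing.allocation.total_shards_per_node": 2
  }
}
```

Fixing `number_of_shards` up front matters because it cannot be changed later without reindexing; `total_shards_per_node` spreads the shards across the cluster.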
Replica Strategy for High Availability
Replicas are exact copies of primary shards that provide both redundancy and increased read throughput.
Production Replica Configuration:
PUT /production_logs
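A hedged sketch of a production-grade replica configuration (values illustrative): two replicas tolerate the loss of two nodes, and `wait_for_active_shards` refuses writes unless enough copies are up.

```
PUT /production_logs
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 2,
    "index.write.wait_for_active_shards": 2
  }
}
```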
Real-World Example: E-commerce Platform
Consider an e-commerce platform handling 1TB of product data:
PUT /products
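At roughly 40GB per shard, 1TB of product data maps to about 25 primary shards. One possible configuration (values illustrative):

```
PUT /products
{
  "settings": {
    "number_of_shards": 25,
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}
```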
graph TB
subgraph "Node 1"
P1[Primary Shard 1]
R2[Replica Shard 2]
R3[Replica Shard 3]
end
subgraph "Node 2"
P2[Primary Shard 2]
R1[Replica Shard 1]
R4[Replica Shard 4]
end
subgraph "Node 3"
P3[Primary Shard 3]
P4[Primary Shard 4]
end
P1 -.->|Replicates to| R1
P2 -.->|Replicates to| R2
P3 -.->|Replicates to| R3
P4 -.->|Replicates to| R4
Interview Insight: “How would you determine the optimal number of shards for a 500GB index with expected 50% growth annually?”
Answer: Calculate from target shard size (aim for 10-50GB per shard), node capacity, and expected growth. For 500GB growing to 750GB, that yields 15-75 shards; in practice, 20-30 primary shards with 1-2 replicas is a typical starting point.
TransLog: Ensuring Write Durability
TransLog Mechanism
The Transaction Log (TransLog) is Elasticsearch’s write-ahead log that ensures data durability during unexpected shutdowns or power failures.
How TransLog Works:
- Write operation received
- Data written to in-memory buffer
- Operation logged to TransLog
- Acknowledgment sent to client
- Periodic refresh makes documents searchable; periodic flush commits Lucene segments and truncates the TransLog
TransLog Configuration for High Availability
PUT /critical_data
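A sketch of TransLog settings tuned for durability (index name and values are illustrative; these are real `index.translog.*` settings):

```
PUT /critical_data
{
  "settings": {
    "index.translog.durability": "request",
    "index.translog.flush_threshold_size": "512mb"
  }
}
```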
TransLog Durability Options:
- request: fsync after each request (highest durability, lower performance)
- async: fsync every sync_interval (better performance, slight risk of losing recently acknowledged writes)
Production Example: Financial Trading System
PUT /trading_transactions
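For a financial workload where no acknowledged write may be lost, a maximally conservative configuration might look like this (values illustrative):

```
PUT /trading_transactions
{
  "settings": {
    "number_of_replicas": 2,
    "index.translog.durability": "request",
    "index.write.wait_for_active_shards": "all"
  }
}
```

`"all"` trades availability for safety: a write is rejected unless every copy of the target shard is active.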
sequenceDiagram
participant Client
participant ES_Node
participant TransLog
participant Lucene
Client->>ES_Node: Index Document
ES_Node->>TransLog: Write to TransLog
TransLog-->>ES_Node: Confirm Write
ES_Node->>Lucene: Add to In-Memory Buffer
ES_Node-->>Client: Acknowledge Request
Note over ES_Node: Periodic Refresh
ES_Node->>Lucene: Flush Buffer to Segment
ES_Node->>TransLog: Clear TransLog Entries
Interview Insight: “What happens if a node crashes between TransLog write and Lucene flush?”
Answer: On restart, Elasticsearch replays TransLog entries to recover uncommitted operations. The TransLog ensures no acknowledged writes are lost, maintaining data consistency.
Production HA Challenges and Solutions
Common Production Issues
Split-Brain Syndrome
Problem: Network partitions causing multiple master nodes
Solution:
# elasticsearch.yml
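A sketch of the relevant settings (hostnames illustrative). Note the version split: 7.x+ computes the master quorum automatically once bootstrapped, while 6.x and earlier required setting it by hand.

```yaml
# elasticsearch.yml (7.x+)
cluster.name: production-cluster
discovery.seed_hosts: ["master-1", "master-2", "master-3"]
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]

# 6.x and earlier: set the quorum explicitly
# discovery.zen.minimum_master_nodes: 2   # (3 master-eligible / 2) + 1
```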
Memory Pressure and GC Issues
Problem: Large heaps causing long GC pauses
Solution:
# jvm.options
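A sketch of heap and GC settings (sizes illustrative): pin min and max heap to the same value, keep it at or below ~31GB to retain compressed object pointers, and prefer G1GC for shorter pauses on large heaps.

```
# jvm.options
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
```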
Uneven Shard Distribution
Problem: Hot spots on specific nodes
Solution:
PUT /_cluster/settings
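A sketch of allocation settings that help even out hot spots (values illustrative; these are real `cluster.routing.allocation.*` settings):

```
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
    "cluster.routing.allocation.disk.watermark.high": "85%"
  }
}
```

Per-index, `index.routing.allocation.total_shards_per_node` can additionally cap how many shards of one index land on a single node.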
Real Production Case Study: Log Analytics Platform
Challenge: Processing 100GB/day of application logs with strict SLA requirements
Architecture:
graph LR
subgraph "Hot Tier"
H1[Hot Node 1]
H2[Hot Node 2]
H3[Hot Node 3]
end
subgraph "Warm Tier"
W1[Warm Node 1]
W2[Warm Node 2]
end
subgraph "Cold Tier"
C1[Cold Node 1]
end
Apps[Applications] --> LB[Load Balancer]
LB --> H1
LB --> H2
LB --> H3
H1 -.->|Age-based| W1
H2 -.->|Migration| W2
W1 -.->|Archive| C1
W2 -.->|Archive| C1
Index Template Configuration:
PUT /_index_template/logs_template
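A sketch using the composable index template API (7.8+; pattern and values illustrative), wiring new log indices to an ILM policy so they migrate through the hot/warm/cold tiers automatically:

```
PUT /_index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "production_policy"
    }
  }
}
```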
Interview Insight: “How would you handle a scenario where your Elasticsearch cluster is experiencing high write latency?”
Answer:
- Check TransLog settings (reduce durability if acceptable)
- Optimize refresh intervals
- Implement bulk indexing
- Scale horizontally by adding nodes
- Consider index lifecycle management
Optimization Strategies for Production HA
Rate Limiting Implementation
Circuit Breaker Pattern:
public class ElasticsearchCircuitBreaker { ... }
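A minimal sketch of what such a client-side breaker might look like (this is an assumption, not Elasticsearch's internal circuit breaker): trip after N consecutive failures, reject requests until a cool-down elapses, then allow a trial request through.

```java
import java.time.Duration;
import java.time.Instant;

// Client-side circuit breaker sketch: closed -> open after N consecutive
// failures -> half-open once the cool-down elapses.
public class ElasticsearchCircuitBreaker {
    private final int failureThreshold;
    private final Duration coolDown;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public ElasticsearchCircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public synchronized boolean allowRequest() {
        if (openedAt == null) return true;                    // closed
        // half-open: allow a trial request after the cool-down
        return Instant.now().isAfter(openedAt.plus(coolDown));
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = null;                                      // close the breaker
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = Instant.now();                         // open the breaker
        }
    }
}
```

The caller checks `allowRequest()` before each Elasticsearch call and reports the outcome via `recordSuccess()`/`recordFailure()`.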
Cluster-level Rate Limiting:
PUT /_cluster/settings
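Elasticsearch has no single cluster-wide indexing rate limit; what it does expose are throttles on recovery traffic, which protect indexing and search during rebalancing (values illustrative):

```
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}
```

Actual write-rate limiting is typically enforced on the client side (circuit breakers, bulk sizing) or upstream in a message queue.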
Message Queue Peak Shaving
Kafka Integration Example:
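The core of peak shaving can be sketched with a bounded buffer between the consumer and Elasticsearch (class and method names are illustrative; a real implementation would feed this from a Kafka consumer and flush each batch via the bulk API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Peak-shaving sketch: producers enqueue documents at a bursty rate;
// a drain loop forwards them to Elasticsearch in fixed-size bulk batches.
public class BulkBuffer {
    private final BlockingQueue<String> queue;
    private final int batchSize;

    public BulkBuffer(int capacity, int batchSize) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batchSize = batchSize;
    }

    // Blocks when the buffer is full, applying back-pressure to the producer.
    public void submit(String doc) throws InterruptedException {
        queue.put(doc);
    }

    // Collects up to batchSize docs, waiting briefly if the buffer is empty.
    public List<String> nextBatch(long waitMillis) throws InterruptedException {
        List<String> batch = new ArrayList<>(batchSize);
        String first = queue.poll(waitMillis, TimeUnit.MILLISECONDS);
        if (first == null) return batch;
        batch.add(first);
        queue.drainTo(batch, batchSize - 1);
        return batch;
    }
}
```

The bounded queue is what flattens traffic spikes: when Elasticsearch falls behind, producers block instead of overwhelming the cluster.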
MQ Configuration for Peak Shaving:
spring:
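A sketch of a Spring Kafka consumer tuned for batch consumption (hostnames and values illustrative; property names follow the `spring.kafka` conventions):

```yaml
# application.yml
spring:
  kafka:
    bootstrap-servers: kafka-1:9092,kafka-2:9092
    consumer:
      group-id: es-indexer
      max-poll-records: 500
      enable-auto-commit: false
    listener:
      type: batch
      concurrency: 3
```

Batch listening plus manual commits lets the indexer acknowledge offsets only after a successful bulk write to Elasticsearch.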
Single Role Node Architecture
Dedicated Master Nodes:
# master.yml
Data Nodes Configuration:
# data.yml
Coordinating Nodes:
# coordinator.yml
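With the `node.roles` syntax (7.9+; earlier versions use the `node.master`/`node.data`/`node.ingest` booleans instead), the three configurations above might contain:

```yaml
# master.yml - dedicated master-eligible node
node.roles: [ master ]

# data.yml - data-only node
node.roles: [ data ]

# coordinator.yml - coordinating-only node (empty roles list)
node.roles: [ ]
```

A coordinating-only node holds no data and is never master-eligible; it only routes requests and merges results, which makes it a good target for the load balancer.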
Dual Cluster Deployment Strategy
Active-Passive Setup:
graph TB
subgraph "Primary DC"
P_LB[Load Balancer]
P_C1[Cluster 1 Node 1]
P_C2[Cluster 1 Node 2]
P_C3[Cluster 1 Node 3]
P_LB --> P_C1
P_LB --> P_C2
P_LB --> P_C3
end
subgraph "Secondary DC"
S_C1[Cluster 2 Node 1]
S_C2[Cluster 2 Node 2]
S_C3[Cluster 2 Node 3]
end
P_C1 -.->|Cross Cluster Replication| S_C1
P_C2 -.->|CCR| S_C2
P_C3 -.->|CCR| S_C3
Apps[Applications] --> P_LB
Cross-Cluster Replication Setup:
PUT /_cluster/settings
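CCR is configured in two steps on the follower cluster: register the leader as a remote cluster, then create a follower index (cluster and index names illustrative; note that CCR requires an appropriate Elastic license):

```
# 1. Register the leader as a remote cluster
PUT /_cluster/settings
{
  "persistent": {
    "cluster.remote.primary_dc.seeds": ["primary-node-1:9300"]
  }
}

# 2. Create a follower index replicating from the leader
PUT /products_follower/_ccr/follow
{
  "remote_cluster": "primary_dc",
  "leader_index": "products"
}
```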
Advanced HA Monitoring and Alerting
Key Metrics to Monitor
Cluster Health Script:
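A sketch of such a script (endpoint and function names are illustrative): it fetches `_cluster/health` and maps the status to an exit code suitable for cron or a monitoring agent.

```shell
#!/usr/bin/env bash
# Hypothetical health check; ES_HOST is an assumption, adjust as needed.

check_cluster_health() {
  # Prints the "status" field from /_cluster/health (green/yellow/red).
  curl -s "${1:-http://localhost:9200}/_cluster/health" |
    grep -o '"status":"[a-z]*"' | cut -d'"' -f4
}

status_to_exit_code() {
  case "$1" in
    green)  echo 0 ;;
    yellow) echo 1 ;;
    red)    echo 2 ;;
    *)      echo 3 ;;  # unknown / cluster unreachable
  esac
}

# Usage (requires a running cluster):
#   status=$(check_cluster_health "$ES_HOST")
#   exit "$(status_to_exit_code "$status")"
```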
Critical Metrics:
- Cluster status (green/yellow/red)
- Node availability
- Shard allocation status
- Memory usage and GC frequency
- Search and indexing latency
- TransLog size and flush frequency
Alerting Configuration Example
# alertmanager.yml
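One common setup (an assumption, not from the source) pairs Prometheus alert rules with Alertmanager routing; the rule below assumes the metric names exposed by a typical elasticsearch_exporter:

```yaml
# prometheus rules file; alertmanager.yml then routes the firing
# alerts to e-mail, Slack, PagerDuty, etc.
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster health is RED"
```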
Performance Tuning for HA
Index Lifecycle Management
PUT /_ilm/policy/production_policy
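A sketch of a hot/warm/delete lifecycle (ages and sizes illustrative; the actions shown are standard ILM actions):

```
PUT /_ilm/policy/production_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```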
Hardware Recommendations
Production Hardware Specs:
- CPU: 16+ cores for data nodes
- Memory: 64GB+ RAM (up to 50% for heap, capped at ~31GB; the remainder for filesystem cache)
- Storage: NVMe SSDs for hot data, SATA SSDs for warm/cold
- Network: 10Gbps+ for inter-node communication
Interview Questions and Expert Answers
Q: “How would you recover from a complete cluster failure?”
A:
- Restore from a snapshot if available
- If no snapshots exist, recover using the elasticsearch-node tool
- Implement a proper backup strategy going forward
- Consider cross-cluster replication for future disasters
Q: “Explain the difference between index.refresh_interval and TransLog flush.”
A:
- refresh_interval controls when in-memory documents become searchable
- TransLog flush persists data to disk for durability
- Refresh affects search visibility; flush affects data safety
Q: “How do you handle version conflicts in a distributed environment?”
A:
- Use optimistic concurrency control with version numbers
- Implement retry logic with exponential backoff
- Consider using _seq_no and _primary_term for more granular control
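The retry-with-backoff point above can be sketched as follows (names are illustrative; the RuntimeException stands in for the client's 409 version-conflict error):

```java
import java.util.concurrent.Callable;

// Retry sketch for version-conflict errors: exponential backoff with a
// bounded number of attempts.
public class RetryOnConflict {
    public static <T> T withRetry(Callable<T> op, int maxAttempts, long baseDelayMs)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;   // give up
                // Exponential backoff: base * 2^(attempt-1)
                Thread.sleep(baseDelayMs << (attempt - 1));
            }
        }
    }
}
```

In practice you would catch only the conflict exception type, re-read the document, and reapply the change before retrying.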
Security Considerations for HA
Authentication and Authorization
# elasticsearch.yml
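A sketch of the core security settings (certificate paths are illustrative): enabling authentication plus TLS on the transport layer, which is required for secure inter-node communication.

```yaml
# elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
```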
Role-Based Access Control:
PUT /_security/role/log_reader
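A sketch of a read-only role for log indices (pattern illustrative; the privilege names are standard index privileges):

```
PUT /_security/role/log_reader
{
  "indices": [
    {
      "names": ["logs-*"],
      "privileges": ["read", "view_index_metadata"]
    }
  ]
}
```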
Best Practices Summary
Do’s
- Always use an odd number of master-eligible nodes (3, 5, 7)
- Implement proper monitoring and alerting
- Use index templates for consistent settings
- Regularly test disaster recovery procedures
- Implement proper backup strategies
Don’ts
- Don’t set heap size above ~31GB (compressed object pointers stop working)
- Don’t disable swap without proper configuration
- Don’t ignore yellow cluster status
- Don’t use default settings in production
- Don’t forget to monitor disk space
References and Additional Resources
- Elasticsearch Official Documentation
- Elasticsearch Performance Tuning Guide
- Elastic Cloud Architecture Best Practices
- Production Deployment Considerations
This guide provides a comprehensive foundation for implementing and maintaining highly available Elasticsearch clusters in production environments. Regular updates and testing of these configurations are essential for maintaining optimal performance and reliability.