Design Logs Analysis Platform with ELK Stack

Posted on 2025-06-13 In system-design

Platform Architecture Overview

A logs analysis platform is the backbone of modern observability, enabling organizations to collect, process, store, and analyze massive volumes of log data from distributed systems. This comprehensive guide covers the end-to-end design of a scalable, fault-tolerant logs analysis platform that not only helps with troubleshooting but also enables predictive fault detection.

High-Level Architecture


graph TB
subgraph "Data Sources"
    A[Application Logs]
    B[System Logs]
    C[Security Logs]
    D[Infrastructure Logs]
    E[Database Logs]
end

subgraph "Collection Layer"
    F[Filebeat]
    G[Metricbeat]
    H[Winlogbeat]
    I[Custom Beats]
end

subgraph "Message Queue"
    J[Kafka/Redis]
end

subgraph "Processing Layer"
    K[Logstash]
    L[Elasticsearch Ingest Pipelines]
end

subgraph "Storage Layer"
    M[Elasticsearch Cluster]
    N[Cold Storage S3/HDFS]
end

subgraph "Analytics & Visualization"
    O[Kibana]
    P[Grafana]
    Q[Custom Dashboards]
end

subgraph "AI/ML Layer"
    R[Elasticsearch ML]
    S[External ML Services]
end

A --> F
B --> G
C --> H
D --> I
E --> F

F --> J
G --> J
H --> J
I --> J

J --> K
J --> L

K --> M
L --> M

M --> N
M --> O
M --> P
M --> R

R --> S
O --> Q

Interview Insight: “When designing log platforms, interviewers often ask about handling different log formats and volumes. Emphasize the importance of a flexible ingestion layer and proper data modeling from day one.”

Data Collection Layer

Log Sources Classification

1. Application Logs

Structured Logs: JSON, XML formatted logs
Semi-structured: Key-value pairs, custom formats
Unstructured: Plain text, error dumps

2. Infrastructure Logs

Container logs (Docker, Kubernetes)
Load balancer logs (Nginx, HAProxy)
Web server logs (Apache, IIS)
Network device logs

3. System Logs

Operating system logs (syslog, Windows Event Log)
Authentication logs
Kernel logs

Collection Strategy with Beats

# Example Filebeat Configuration
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  fields:
    service: web-app
    environment: production
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after

- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  processors:
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
      matchers:
      - logs_path:
          logs_path: "/var/lib/docker/containers"

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: 'logs-%{[fields.environment]}'
  partition.round_robin:
    reachable_only: false

Interview Insight: “Discuss the trade-offs between direct shipping to Elasticsearch vs. using a message queue. Kafka provides better reliability and backpressure handling, especially important for high-volume environments.”

Data Processing and Storage

Logstash Processing Pipeline

# Logstash Configuration Example
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics => ["logs-production", "logs-staging"]
    codec => json
  }
}

filter {
  # Parse application logs
  if [fields][service] == "web-app" {
    grok {
      match => { 
        "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:level}\] %{DATA:logger} - %{GREEDYDATA:log_message}" 
      }
    }
    
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    
    # Extract error patterns for ML
    if [level] == "ERROR" {
      mutate {
        add_tag => ["error", "needs_analysis"]
      }
    }
  }
  
  # Enrich with GeoIP for web logs
  if [fields][log_type] == "access" {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
  
  # Remove sensitive data
  mutate {
    remove_field => ["password", "credit_card", "ssn"]
  }
}

output {
  elasticsearch {
    hosts => ["es-node1:9200", "es-node2:9200", "es-node3:9200"]
    index => "logs-%{[fields][service]}-%{+YYYY.MM.dd}"
    template_name => "logs"
    template => "/etc/logstash/templates/logs-template.json"
  }
}

Elasticsearch Index Strategy

Index Lifecycle Management (ILM)

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "10GB",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0,
            "require": {
              "box_type": "cold"
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d"
      }
    }
  }
}

Interview Insight: “Index lifecycle management is crucial for cost control. Explain how you’d balance search performance with storage costs, and discuss the trade-offs of different retention policies.”

Search and Analytics

Query Optimization Strategies

1. Efficient Query Patterns

{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h"
            }
          }
        },
        {
          "term": {
            "service.keyword": "payment-api"
          }
        }
      ],
      "must": [
        {
          "match": {
            "message": "error"
          }
        }
      ]
    }
  },
  "aggs": {
    "error_types": {
      "terms": {
        "field": "error_type.keyword",
        "size": 10
      }
    },
    "error_timeline": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "5m"
      }
    }
  }
}

2. Search Templates for Common Queries

{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "@timestamp": {
                  "gte": "{{start_time}}",
                  "lte": "{{end_time}}"
                }
              }
            },
            {
              "terms": {
                "service.keyword": "{{services}}"
              }
            }
          ]
        }
      }
    }
  }
}

Interview Insight: “Performance optimization questions are common. Discuss field data types (keyword vs text), query caching, and the importance of using filters over queries for better performance.”

Visualization and Monitoring

Kibana Dashboard Design

1. Operational Dashboard Structure


graph LR
subgraph "Executive Dashboard"
    A[System Health Overview]
    B[SLA Metrics]
    C[Cost Analytics]
end

subgraph "Operational Dashboard"
    D[Error Rate Trends]
    E[Service Performance]
    F[Infrastructure Metrics]
end

subgraph "Troubleshooting Dashboard"
    G[Error Investigation]
    H[Trace Analysis]
    I[Log Deep Dive]
end

A --> D
D --> G
B --> E
E --> H
C --> F
F --> I

2. Sample Kibana Visualization Config

{
  "visualization": {
    "title": "Error Rate by Service",
    "type": "line",
    "params": {
      "seriesParams": [
        {
          "data": {
            "id": "1",
            "label": "Error Rate"
          },
          "drawLinesBetweenPoints": true,
          "showCircles": true
        }
      ],
      "categoryAxes": [
        {
          "id": "CategoryAxis-1",
          "type": "category",
          "position": "bottom",
          "show": true,
          "title": {
            "text": "Time"
          }
        }
      ]
    }
  },
  "aggs": [
    {
      "id": "1",
      "type": "count",
      "schema": "metric",
      "params": {}
    },
    {
      "id": "2",
      "type": "date_histogram",
      "schema": "segment",
      "params": {
        "field": "@timestamp",
        "interval": "auto",
        "min_doc_count": 1
      }
    },
    {
      "id": "3",
      "type": "filters",
      "schema": "group",
      "params": {
        "filters": [
          {
            "input": {
              "query": {
                "match": {
                  "level": "ERROR"
                }
              }
            },
            "label": "Errors"
          }
        ]
      }
    }
  ]
}

Fault Prediction and Alerting

Machine Learning Implementation

1. Anomaly Detection Pipeline


flowchart TD
A[Log Ingestion] --> B[Feature Extraction]
B --> C[Anomaly Detection Model]
C --> D{Anomaly Score > Threshold?}
D -->|Yes| E[Generate Alert]
D -->|No| F[Continue Monitoring]
E --> G[Incident Management]
G --> H[Root Cause Analysis]
H --> I[Model Feedback]
I --> C
F --> A

2. Elasticsearch ML Job Configuration

{
  "job_id": "error-rate-anomaly",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "High error rate",
        "function": "high_count",
        "by_field_name": "service.keyword"
      },
      {
        "detector_description": "Response time anomaly",
        "function": "high_mean",
        "field_name": "response_time",
        "by_field_name": "service.keyword"
      }
    ],
    "influencers": ["service.keyword", "host.keyword"]
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "model_plot_config": {
    "enabled": true
  }
}

Alerting Strategy

1. Alert Hierarchy

# Watcher Alert Example
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-5m"
                    }
                  }
                },
                {
                  "term": {
                    "level.keyword": "ERROR"
                  }
                }
              ]
            }
          },
          "aggs": {
            "error_count": {
              "cardinality": {
                "field": "message.keyword"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.aggregations.error_count.value": {
        "gt": 10
      }
    }
  },
  "actions": {
    "send_slack": {
      "webhook": {
        "scheme": "https",
        "host": "hooks.slack.com",
        "port": 443,
        "method": "post",
        "path": "/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX",
        "body": "High error rate detected: {{ctx.payload.aggregations.error_count.value}} unique errors in the last 5 minutes"
      }
    }
  }
}

Interview Insight: “Discuss the difference between reactive and proactive monitoring. Explain how you’d tune alert thresholds to minimize false positives while ensuring critical issues are caught early.”

Security and Compliance

Security Implementation

1. Authentication and Authorization

# Elasticsearch Security Configuration
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

# Role-based Access Control
roles:
  log_reader:
    cluster: ["monitor"]
    indices:
      - names: ["logs-*"]
        privileges: ["read", "view_index_metadata"]
        field_security:
          grant: ["@timestamp", "level", "message", "service"]
          except: ["sensitive_data"]
  
  log_admin:
    cluster: ["all"]
    indices:
      - names: ["*"]
        privileges: ["all"]

2. Data Masking Pipeline

{
  "description": "Mask sensitive data",
  "processors": [
    {
      "gsub": {
        "field": "message",
        "pattern": "\\b\\d{4}[\\s-]?\\d{4}[\\s-]?\\d{4}[\\s-]?\\d{4}\\b",
        "replacement": "****-****-****-****"
      }
    },
    {
      "gsub": {
        "field": "message",
        "pattern": "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b",
        "replacement": "***@***.***"
      }
    }
  ]
}

Interview Insight: “Security questions often focus on PII handling and compliance. Be prepared to discuss GDPR implications, data retention policies, and the right to be forgotten in log systems.”

Scalability and Performance

Cluster Sizing and Architecture

1. Node Roles and Allocation


graph TB
subgraph "Master Nodes (3)"
    M1[Master-1]
    M2[Master-2]
    M3[Master-3]
end

subgraph "Hot Data Nodes (6)"
    H1[Hot-1<br/>High CPU/RAM<br/>SSD Storage]
    H2[Hot-2]
    H3[Hot-3]
    H4[Hot-4]
    H5[Hot-5]
    H6[Hot-6]
end

subgraph "Warm Data Nodes (4)"
    W1[Warm-1<br/>Medium CPU/RAM<br/>HDD Storage]
    W2[Warm-2]
    W3[Warm-3]
    W4[Warm-4]
end

subgraph "Cold Data Nodes (2)"
    C1[Cold-1<br/>Low CPU/RAM<br/>Cheap Storage]
    C2[Cold-2]
end

subgraph "Coordinating Nodes (2)"
    CO1[Coord-1<br/>Query Processing]
    CO2[Coord-2]
end

2. Performance Optimization

# Elasticsearch Configuration for Performance
cluster.name: logs-production
node.name: ${HOSTNAME}

# Memory Settings
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 30%
indices.memory.min_index_buffer_size: 96mb

# Thread Pool Optimization
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 1000

# Index Settings for High Volume
index.refresh_interval: 30s
index.number_of_shards: 3
index.number_of_replicas: 1
index.translog.flush_threshold_size: 1gb

Capacity Planning Model

# Capacity Planning Calculator
def calculate_storage_requirements(
    daily_log_volume_gb,
    retention_days,
    replication_factor,
    compression_ratio=0.7
):
    raw_storage = daily_log_volume_gb * retention_days
    with_replication = raw_storage * (1 + replication_factor)
    compressed_storage = with_replication * compression_ratio
    
    # Add 20% buffer for operations
    total_storage = compressed_storage * 1.2
    
    return {
        "raw_daily": daily_log_volume_gb,
        "total_compressed": compressed_storage,
        "recommended_capacity": total_storage,
        "hot_tier": total_storage * 0.3,  # 30% in hot
        "warm_tier": total_storage * 0.5,  # 50% in warm
        "cold_tier": total_storage * 0.2   # 20% in cold
    }

# Example calculation
requirements = calculate_storage_requirements(
    daily_log_volume_gb=500,
    retention_days=90,
    replication_factor=1
)

Interview Insight: “Capacity planning is a critical skill. Discuss how you’d model growth, handle traffic spikes, and plan for different data tiers. Include both storage and compute considerations.”

Implementation Roadmap

Phase-wise Implementation


gantt
title Logs Analysis Platform Implementation
dateFormat  YYYY-MM-DD
section Phase 1: Foundation
Infrastructure Setup           :done,    phase1a, 2024-01-01, 2024-01-15
Basic ELK Stack Deployment    :done,    phase1b, 2024-01-15, 2024-01-30
Initial Log Collection        :done,    phase1c, 2024-01-30, 2024-02-15

section Phase 2: Core Features
Advanced Processing           :active,  phase2a, 2024-02-15, 2024-03-01
Security Implementation       :         phase2b, 2024-03-01, 2024-03-15
Basic Dashboards             :         phase2c, 2024-03-15, 2024-03-30

section Phase 3: Intelligence
ML/Anomaly Detection         :         phase3a, 2024-03-30, 2024-04-15
Advanced Alerting            :         phase3b, 2024-04-15, 2024-04-30
Predictive Analytics         :         phase3c, 2024-04-30, 2024-05-15

section Phase 4: Optimization
Performance Tuning           :         phase4a, 2024-05-15, 2024-05-30
Cost Optimization            :         phase4b, 2024-05-30, 2024-06-15
Documentation & Training     :         phase4c, 2024-06-15, 2024-06-30

Migration Strategy

1. Parallel Run Approach


sequenceDiagram
participant Legacy as Legacy System
participant New as New ELK Platform
participant Apps as Applications
participant Ops as Operations Team

Note over Legacy, Ops: Phase 1: Parallel Ingestion
Apps->>Legacy: Continue logging
Apps->>New: Start dual logging
New->>Ops: Validation reports

Note over Legacy, Ops: Phase 2: Gradual Migration
Apps->>Legacy: Reduced logging
Apps->>New: Primary logging
New->>Ops: Performance metrics

Note over Legacy, Ops: Phase 3: Full Cutover
Apps->>New: All logging
Legacy->>New: Historical data migration
New->>Ops: Full operational control

Operational Best Practices

Monitoring and Maintenance

1. Platform Health Monitoring

# Metricbeat Configuration for ELK Monitoring
metricbeat.modules:
- module: elasticsearch
  metricsets:
    - node
    - node_stats
    - cluster_stats
    - index
    - index_recovery
    - index_summary
  period: 10s
  hosts: ["http://localhost:9200"]

- module: kibana
  metricsets: ["status"]
  period: 10s
  hosts: ["localhost:5601"]

- module: logstash
  metricsets: ["node", "node_stats"]
  period: 10s
  hosts: ["localhost:9600"]

2. Operational Runbooks

## Incident Response Runbook

### High CPU Usage on Elasticsearch Nodes
1. Check query patterns in slow log
2. Identify expensive aggregations
3. Review recent index changes
4. Scale horizontally if needed

### High Memory Usage
1. Check field data cache size
2. Review mapping for analyzed fields
3. Implement circuit breakers
4. Consider node memory increase

### Disk Space Issues
1. Check ILM policy execution
2. Force merge old indices
3. Move indices to cold tier
4. Delete unnecessary indices

Interview Insight: “Operations questions test your real-world experience. Discuss common failure scenarios, monitoring strategies, and how you’d handle a production outage with logs being critical for troubleshooting.”

Data Quality and Governance

1. Log Quality Metrics

{
  "quality_checks": {
    "completeness": {
      "missing_timestamp": 0.01,
      "missing_service_tag": 0.05,
      "empty_messages": 0.02
    },
    "consistency": {
      "format_compliance": 0.95,
      "schema_violations": 0.03
    },
    "timeliness": {
      "ingestion_delay_p95": "30s",
      "processing_delay_p95": "60s"
    }
  }
}

2. Cost Optimization Strategies

# Cost Optimization Analysis
def analyze_index_costs(indices_stats):
    cost_analysis = {}
    
    for index, stats in indices_stats.items():
        storage_gb = stats['store_size_gb']
        daily_queries = stats['search_count'] / stats['age_days']
        
        # Calculate cost per query
        storage_cost = storage_gb * 0.023  # AWS EBS cost
        compute_cost = daily_queries * 0.0001  # Estimated compute per query
        
        cost_analysis[index] = {
            'storage_cost': storage_cost,
            'compute_cost': compute_cost,
            'cost_per_query': (storage_cost + compute_cost) / max(daily_queries, 1),
            'recommendation': get_tier_recommendation(storage_gb, daily_queries)
        }
    
    return cost_analysis

def get_tier_recommendation(storage_gb, daily_queries):
    if daily_queries > 100:
        return "hot"
    elif daily_queries > 10:
        return "warm"
    else:
        return "cold"

Conclusion

This comprehensive logs analysis platform design provides a robust foundation for enterprise-scale log management, combining the power of the ELK stack with modern best practices for scalability, security, and operational excellence. The platform enables both reactive troubleshooting and proactive fault prediction, making it an essential component of any modern DevOps toolkit.

Key Success Factors

Proper Data Modeling: Design indices and mappings from the start
Scalable Architecture: Plan for growth in both volume and complexity
Security First: Implement proper access controls and data protection
Operational Excellence: Build comprehensive monitoring and alerting
Cost Awareness: Optimize storage tiers and retention policies
Team Training: Ensure proper adoption and utilization

Final Interview Insight: “When discussing log platforms in interviews, emphasize the business value: faster incident resolution, proactive issue detection, and data-driven decision making. Technical excellence should always tie back to business outcomes.”