Kafka Storage Architecture: Deep Dive Guide

Introduction

Apache Kafka’s storage architecture is designed for high-throughput, fault-tolerant, and scalable distributed streaming. Understanding its storage mechanics is crucial for system design, performance tuning, and operational excellence.

Key Design Principles:

  • Append-only logs: Sequential writes for maximum performance
  • Immutable records: Once written, messages are never modified
  • Distributed partitioning: Horizontal scaling across brokers
  • Configurable retention: Time and size-based cleanup policies

Core Storage Components

Log Structure Overview


graph TD
A[Topic] --> B[Partition 0]
A --> C[Partition 1]
A --> D[Partition 2]

B --> E[Segment 0]
B --> F[Segment 1]
B --> G[Active Segment]

E --> H[.log file]
E --> I[.index file]
E --> J[.timeindex file]

subgraph "Broker File System"
    H
    I
    J
    K[.snapshot files]
    L[leader-epoch-checkpoint]
end

File Types and Their Purposes

| File Type    | Extension               | Purpose                                                  | Size Limit                               |
|--------------|-------------------------|----------------------------------------------------------|------------------------------------------|
| Log Segment  | .log                    | Actual message data                                      | log.segment.bytes (1GB default)          |
| Offset Index | .index                  | Maps logical offset to physical position                 | log.index.size.max.bytes (10MB default)  |
| Time Index   | .timeindex              | Maps timestamp to offset                                 | log.index.size.max.bytes                 |
| Snapshot     | .snapshot               | Producer state snapshots (idempotence and transactions)  | Variable                                 |
| Leader Epoch | leader-epoch-checkpoint | Tracks leadership changes                                | Small                                    |
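
The index and time-index files can be inspected with the kafka-dump-log.sh tool that ships with Kafka (exposed as kafka.tools.DumpLogSegments via kafka-run-class.sh in older releases). The paths and topic name below are illustrative:

# Show offset-index entries (relative offset -> physical byte position)
kafka-dump-log.sh --files /var/kafka-logs/my-topic-0/00000000000000000000.index

# Show time-index entries (timestamp -> offset)
kafka-dump-log.sh --files /var/kafka-logs/my-topic-0/00000000000000000000.timeindex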

Log Segments and File Structure

Segment Lifecycle


sequenceDiagram
participant Producer
participant ActiveSegment
participant ClosedSegment
participant CleanupThread

Producer->>ActiveSegment: Write messages
Note over ActiveSegment: Grows until segment.bytes limit
ActiveSegment->>ClosedSegment: Roll to new segment
Note over ClosedSegment: Becomes immutable
CleanupThread->>ClosedSegment: Apply retention policy
ClosedSegment->>CleanupThread: Delete when expired

Internal File Structure Example

Directory Structure:

/var/kafka-logs/
├── my-topic-0/
│   ├── 00000000000000000000.log        # Messages 0-999
│   ├── 00000000000000000000.index      # Offset index
│   ├── 00000000000000000000.timeindex  # Time index
│   ├── 00000000000000001000.log        # Messages 1000-1999
│   ├── 00000000000000001000.index
│   ├── 00000000000000001000.timeindex
│   └── leader-epoch-checkpoint
└── my-topic-1/
    └── ... (similar structure)
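
The same layout can be queried without shell access to the broker via the kafka-log-dirs.sh tool; the bootstrap address and topic name below are placeholders:

# Report per-partition directory sizes and offset lag as JSON
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe --topic-list my-topic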

Message Format Deep Dive

Record Batch Structure (v2):

Record Batch Header:
├── First Offset (8 bytes)
├── Batch Length (4 bytes)
├── Partition Leader Epoch (4 bytes)
├── Magic Byte (1 byte)
├── CRC (4 bytes)
├── Attributes (2 bytes)
├── Last Offset Delta (4 bytes)
├── First Timestamp (8 bytes)
├── Max Timestamp (8 bytes)
├── Producer ID (8 bytes)
├── Producer Epoch (2 bytes)
├── Base Sequence (4 bytes)
├── Records Count (4 bytes)
└── Records Array

Interview Insight: Why does Kafka use batch compression instead of individual message compression?

  • Reduces CPU overhead by compressing multiple messages together
  • Better compression ratios due to similarity between adjacent messages
  • Maintains high throughput by amortizing compression costs
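
To see these batch headers on disk, kafka-dump-log.sh can print per-batch metadata such as the base offset, producer id/epoch, and compression codec (exact field names vary slightly between Kafka versions). The segment path is illustrative:

# Dump record batches, including batch headers and payloads
kafka-dump-log.sh --print-data-log \
  --files /var/kafka-logs/my-topic-0/00000000000000000000.log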

Partition Distribution and Replication

Replica Placement Strategy


graph LR
subgraph "Broker 1"
    A[Topic-A-P0 Leader]
    B[Topic-A-P1 Follower]
    C[Topic-A-P2 Follower]
end

subgraph "Broker 2"
    D[Topic-A-P0 Follower]
    E[Topic-A-P1 Leader]
    F[Topic-A-P2 Follower]
end

subgraph "Broker 3"
    G[Topic-A-P0 Follower]
    H[Topic-A-P1 Follower]
    I[Topic-A-P2 Leader]
end

A -.->|Replication| D
A -.->|Replication| G
E -.->|Replication| B
E -.->|Replication| H
I -.->|Replication| C
I -.->|Replication| F

ISR (In-Sync Replicas) Management

Critical Configuration Parameters:

# Replica lag tolerance
replica.lag.time.max.ms=30000

# Minimum ISR required for writes
min.insync.replicas=2

# Unclean leader election (data loss risk)
unclean.leader.election.enable=false

Interview Question: What happens when ISR shrinks below min.insync.replicas?

  • Producers with acks=all will receive NotEnoughReplicasException
  • Affected partitions reject acks=all writes until the ISR recovers; acks=0/1 writes and reads still succeed
  • This prevents data loss but reduces availability
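
A quick way to check ISR health and the effective min.insync.replicas setting, assuming a broker reachable at localhost:9092 and an illustrative topic name:

# Partitions whose ISR has shrunk below the full replica set
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Effective configuration overrides for a topic
kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name my-topic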

Storage Best Practices

Disk Configuration

Optimal Setup:

Storage Strategy:
Primary: SSD for active segments (faster writes)
Secondary: HDD for older segments (cost-effective)
RAID: RAID-10 for balance of performance and redundancy

File System:
Recommended: XFS or ext4
Mount Options: noatime,nodiratime

Directory Layout:
/var/kafka-logs-1/ # SSD
/var/kafka-logs-2/ # SSD
/var/kafka-logs-3/ # HDD (archive)
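
Kafka spreads partitions across the directories listed in log.dirs, roughly balancing by partition count; it does not migrate older segments to a slower tier on its own, so an SSD/HDD split like the one above requires tiered storage or external tooling. A minimal server.properties sketch with illustrative paths:

# JBOD layout across the mounts shown above
log.dirs=/var/kafka-logs-1,/var/kafka-logs-2,/var/kafka-logs-3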

Retention Policies Showcase


flowchart TD
A[Message Arrives] --> B{Check Active Segment Size}
B -->|< segment.bytes| C[Append to Active Segment]
B -->|>= segment.bytes| D[Roll New Segment]

D --> E[Close Previous Segment]
E --> F{Apply Retention Policy}

F --> G[Time-based: log.retention.hours]
F --> H[Size-based: log.retention.bytes]
F --> I[Compaction: log.cleanup.policy=compact]

G --> J{Segment Age > Retention?}
H --> K{Total Size > Limit?}
I --> L[Run Log Compaction]

J -->|Yes| M[Delete Segment]
K -->|Yes| M
L --> N[Keep Latest Value per Key]
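
Retention and compaction can also be overridden per topic at runtime with kafka-configs.sh; the bootstrap address and topic names below are illustrative:

# Time-based retention: keep 7 days (value in milliseconds)
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.ms=604800000

# Size-based retention: cap each partition at ~10 GiB
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.bytes=10737418240

# Compaction for changelog-style topics
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-changelog-topic \
  --add-config cleanup.policy=compact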

Performance Tuning Configuration

Producer Optimizations:

# Batching for throughput
batch.size=32768
linger.ms=10

# Compression
compression.type=lz4

# Memory allocation
buffer.memory=67108864
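
The effect of these settings can be measured with the bundled kafka-producer-perf-test.sh tool; the topic name, record counts, and bootstrap address below are illustrative:

# Send 1M 1 KB records with the batching/compression settings above, unthrottled
kafka-producer-perf-test.sh --topic perf-test --num-records 1000000 --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 batch.size=32768 linger.ms=10 compression.type=lz4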

Broker Storage Optimizations:

# Segment settings (256 MB segments, weekly rolls)
log.segment.bytes=268435456
log.roll.hours=168

# Flush settings: rely on the OS page cache (these are effectively the defaults)
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=9223372036854775807

# Index settings (10 MB index files)
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760

Performance Optimization

Throughput Optimization Strategies

Read Path Optimization:


graph LR
A[Consumer Request] --> B[Check Page Cache]
B -->|Hit| C[Return from Memory]
B -->|Miss| D[Read from Disk]
D --> E[Zero-Copy Transfer]
E --> F[sendfile System Call]
F --> G[Direct Disk-to-Network]

Write Path Optimization:


graph TD
A[Producer Batch] --> B[Memory Buffer]
B --> C[Page Cache]
C --> D[Async Flush to Disk]
D --> E[Sequential Write]

F[OS Background] --> G[Periodic fsync]
G --> H[Durability Guarantee]

Capacity Planning Formula

Storage Requirements Calculation:

Daily Storage = (Messages/day × Avg Message Size × Replication Factor) / Compression Ratio

Retention Storage = Daily Storage × Retention Days × Growth Factor

Example (compression ratio of 1, 1 KB treated as 1000 bytes):
- 1M messages/day × 1 KB × 3 replicas = 3 GB/day
- 3 GB/day × 7 days retention × 1.2 growth factor = 25.2 GB total
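
A small sketch for plugging in your own numbers (same arithmetic as the example above; all values are illustrative):

awk 'BEGIN {
  msgs_per_day = 1000000; msg_bytes = 1000; replication = 3;
  retention_days = 7; growth = 1.2; compression_ratio = 1.0;
  daily_gb = msgs_per_day * msg_bytes * replication / compression_ratio / 1e9;
  printf "Daily: %.1f GB, Total: %.1f GB\n", daily_gb, daily_gb * retention_days * growth;
}'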

Monitoring and Troubleshooting

Key Metrics Dashboard

| Metric Category | Key Indicators                  | Alert Thresholds   |
|-----------------|---------------------------------|--------------------|
| Storage         | kafka.log.size, disk.free       | < 15% free space   |
| Replication     | UnderReplicatedPartitions       | > 0                |
| Performance     | MessagesInPerSec, BytesInPerSec | Baseline deviation |
| Lag             | ConsumerLag, ReplicaLag         | > 1000 messages    |
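
Two quick checks that cover the most common alerts, assuming a broker at localhost:9092, an illustrative consumer group name, and illustrative mount points:

# Consumer lag per partition for one group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group

# Free space on the Kafka data mounts
df -h /var/kafka-logs-1 /var/kafka-logs-2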

Common Storage Issues and Solutions

Issue 1: Disk Space Exhaustion

# Emergency cleanup - make the log cleaner compact eligible (compacted) topics sooner
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type brokers --entity-name 0 \
  --add-config log.cleaner.min.cleanable.ratio=0.1

# Temporary retention reduction (1 hour)
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.ms=3600000
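
Before changing retention, identify which partitions are actually consuming the space (the log directory path is illustrative):

# Largest partition directories on this broker
du -sh /var/kafka-logs/*/ | sort -rh | head -20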

Issue 2: Slow Consumer Performance

# Check if issue is disk I/O or network
iostat -x 1
iftop

# Verify zero-copy is working (strace writes to stderr)
strace -f -e trace=sendfile -p <kafka-pid>

Interview Questions & Real-World Scenarios

Scenario-Based Questions

Q1: Design Challenge
“You have a Kafka cluster handling 100GB/day with 7-day retention. One broker is running out of disk space. Walk me through your troubleshooting and resolution strategy.”

Answer Framework:

  1. Immediate Actions: Check partition distribution, identify large partitions
  2. Short-term: Reduce retention temporarily, enable log compaction if applicable
  3. Long-term: Rebalance partitions (see the reassignment sketch below), add storage capacity, implement tiered storage
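
For the rebalancing step, a sketch using the stock kafka-reassign-partitions.sh tool; the JSON file names, broker list, and throttle value are placeholders, and older releases use --zookeeper instead of --bootstrap-server:

# 1. Generate a candidate plan for the topics listed in topics.json
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate

# 2. Save the proposed plan as reassignment.json, then execute with a replication throttle (bytes/sec)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --throttle 50000000 --execute

# 3. Verify completion (this also removes the throttle)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify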

Q2: Performance Analysis
“Your Kafka cluster shows decreasing write throughput over time. What could be the causes and how would you investigate?”

Investigation Checklist:

# Check segment distribution
ls -la /var/kafka-logs/*/

# Monitor I/O patterns
iotop -ao

# Analyze JVM garbage collection
jstat -gc <kafka-pid> 1s

# Check network utilization
netstat -i

Q3: Data Consistency
“Explain the trade-offs between acks=0, acks=1, and acks=all in terms of storage and durability.”

| Setting  | Durability | Performance | Use Case                                    |
|----------|------------|-------------|---------------------------------------------|
| acks=0   | Lowest     | Highest     | Metrics, logs where some loss is acceptable |
| acks=1   | Medium     | Medium      | General purpose, balanced approach          |
| acks=all | Highest    | Lowest      | Financial transactions, critical data       |
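
Producer settings that typically accompany acks=all; the values below are illustrative:

# Wait for all in-sync replicas to acknowledge each batch
acks=all
# Idempotence avoids duplicates on retry while preserving ordering
enable.idempotence=true
# Retry transient broker errors instead of dropping records
retries=2147483647
# Overall per-record delivery deadline (2 minutes)
delivery.timeout.ms=120000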

Deep Technical Questions

Q4: Memory Management
“How does Kafka leverage the OS page cache, and why doesn’t it implement its own caching mechanism?”

Answer Points:

  • Kafka relies on OS page cache for read performance
  • Avoids double caching (JVM heap + OS cache)
  • Sequential access patterns work well with OS prefetching
  • Zero-copy transfers (sendfile) possible only with OS cache

Q5: Log Compaction Deep Dive
“Explain how log compaction works and when it might cause issues in production.”


graph TD
A[Original Log] --> B[Compaction Process]
B --> C[Compacted Log]

subgraph "Before Compaction"
    D[Key1:V1] --> E[Key2:V1] --> F[Key1:V2] --> G[Key3:V1] --> H[Key1:V3]
end

subgraph "After Compaction"
    I[Key2:V1] --> J[Key3:V1] --> K[Key1:V3]
end

Potential Issues:

  • Compaction lag during high-write periods
  • Tombstone records not cleaned up properly
  • Consumer offset management with compacted topics
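
Compaction behavior is tuned per topic; the overrides below are illustrative starting points (a lower min.cleanable.dirty.ratio compacts sooner, delete.retention.ms controls how long tombstones survive, and a smaller segment.ms makes segments eligible for compaction earlier):

kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-changelog-topic \
  --add-config cleanup.policy=compact,min.cleanable.dirty.ratio=0.3,delete.retention.ms=86400000,segment.ms=3600000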

Production Scenarios

Q6: Disaster Recovery
“A data center hosting 2 out of 3 Kafka brokers goes offline. Describe the impact and recovery process.”

Impact Analysis:

  • Partitions with min.insync.replicas=2: unavailable for acks=all writes (only one replica survives)
  • Partitions whose leader or an in-sync replica lives on the surviving broker: continue serving reads and acks=0/1 writes
  • Consumer lag increases rapidly as overall throughput drops

Recovery Strategy:

# 1. Assess cluster state
kafka-topics.sh --bootstrap-server localhost:9092 --describe

# 2. Temporarily reduce min.insync.replicas if necessary
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name critical-topic \
  --add-config min.insync.replicas=1

# 3. Monitor under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

Best Practices Summary

Storage Design Principles:

  1. Separate data and logs: Use different disks for Kafka data and application logs
  2. Monitor disk usage trends: Set up automated alerts at 80% capacity
  3. Plan for growth: Account for replication factor and retention policies
  4. Test disaster recovery: Regular drills for broker failures and data corruption
  5. Optimize for access patterns: Hot data on SSD, cold data on HDD

Configuration Management:

# Production-ready storage configuration
log.dirs=/var/kafka-logs-1,/var/kafka-logs-2
# 512 MB segments, 1-week retention
log.segment.bytes=536870912
log.retention.hours=168
log.retention.check.interval.ms=300000
log.cleanup.policy=delete
min.insync.replicas=2
unclean.leader.election.enable=false
auto.create.topics.enable=false

This comprehensive guide provides the foundation for understanding Kafka’s storage architecture while preparing you for both operational challenges and technical interviews. The key is to understand not just the “what” but the “why” behind each design decision.