Redis Distributed Locking: Complete Guide

Introduction

Distributed locking is a critical mechanism for coordinating access to shared resources across multiple processes or services in a distributed system. Redis, with its atomic operations and high performance, has become a popular choice for implementing distributed locks.

Interview Insight: Expect questions like “Why would you use Redis for distributed locking instead of database-based locks?” The key advantages are: Redis operates in memory (faster), provides atomic operations, has built-in TTL support, and offers better performance for high-frequency locking scenarios.

When to Use Distributed Locks

  • Preventing duplicate processing of tasks
  • Coordinating access to external APIs with rate limits
  • Ensuring single leader election in distributed systems
  • Managing shared resource access across microservices
  • Implementing distributed rate limiting

Core Concepts

Lock Properties

A robust distributed lock must satisfy several properties:

  1. Mutual Exclusion: Only one client can hold the lock at any time
  2. Deadlock Free: It is always eventually possible to acquire the lock, even if the client holding it crashes
  3. Fault Tolerance: Lock acquisition and release work even when clients fail
  4. Safety: Lock is not granted to multiple clients simultaneously
  5. Liveness: Requests to acquire/release locks eventually succeed

Interview Insight: Interviewers often ask about the CAP theorem implications. Distributed locks typically favor Consistency and Partition tolerance over Availability - it’s better to fail lock acquisition than to grant locks to multiple clients.


graph TD
A[Client Request] --> B{Lock Available?}
B -->|Yes| C[Acquire Lock with TTL]
B -->|No| D[Wait/Retry]
C --> E[Execute Critical Section]
E --> F[Release Lock]
D --> G[Timeout Check]
G -->|Continue| B
G -->|Timeout| H[Fail]
F --> I[Success]

Redis Atomic Operations

Redis provides several atomic operations crucial for distributed locking:

  • SET key value NX EX seconds - Set if not exists with expiration
  • EVAL - Execute Lua scripts atomically
  • DEL - Delete keys atomically
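
For example, the first primitive maps directly onto a single redis-py call, and its return value tells the caller whether the lock was won. A minimal sketch, assuming a local Redis instance:

import uuid
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
token = str(uuid.uuid4())

# Equivalent to: SET lock:resource <token> NX EX 10 - succeeds only if the key does not exist yet
acquired = r.set("lock:resource", token, nx=True, ex=10)
print("lock acquired" if acquired else "lock already held")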

Single Instance Redis Locking

Basic Implementation

The simplest approach uses a single Redis instance with the SET command:

import redis
import uuid
import time

class RedisLock:
    def __init__(self, redis_client, key, timeout=10, retry_delay=0.1):
        self.redis = redis_client
        self.key = key
        self.timeout = timeout
        self.retry_delay = retry_delay
        self.identifier = str(uuid.uuid4())

    def acquire(self, blocking=True, timeout=None):
        """Acquire the lock"""
        end_time = time.time() + (timeout or self.timeout)

        while True:
            # Try to set the key with our identifier and TTL
            if self.redis.set(self.key, self.identifier, nx=True, ex=self.timeout):
                return True

            if not blocking or time.time() > end_time:
                return False

            time.sleep(self.retry_delay)

    def release(self):
        """Release the lock using a Lua script for atomicity"""
        lua_script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """
        return self.redis.eval(lua_script, 1, self.key, self.identifier)

# Usage example
redis_client = redis.Redis(host='localhost', port=6379, db=0)
lock = RedisLock(redis_client, "my_resource_lock")

if lock.acquire():
    try:
        # Critical section
        print("Lock acquired, executing critical section")
        time.sleep(5)  # Simulate work
    finally:
        lock.release()
        print("Lock released")
else:
    print("Failed to acquire lock")

Interview Insight: A common question is “Why do you need a unique identifier for each lock holder?” The identifier prevents a client from accidentally releasing another client’s lock, especially important when dealing with timeouts and retries.

Context Manager Implementation

from contextlib import contextmanager

@contextmanager
def redis_lock(redis_client, key, timeout=10):
    lock = RedisLock(redis_client, key, timeout)
    acquired = lock.acquire()

    if not acquired:
        raise Exception(f"Could not acquire lock for {key}")

    try:
        yield
    finally:
        lock.release()

# Usage
try:
    with redis_lock(redis_client, "my_resource"):
        # Critical section code here
        process_shared_resource()
except Exception as e:
    print(f"Lock acquisition failed: {e}")

Single Instance Limitations


flowchart TD
A[Client A] --> B[Redis Master]
C[Client B] --> B
B --> D[Redis Slave]
B -->|Fails| E[Data Loss]
E --> F[Both Clients Think They Have Lock]

style E fill:#ff9999
style F fill:#ff9999

Interview Insight: Interviewers will ask about single points of failure. The main issues are: Redis instance failure loses all locks, replication lag can cause multiple clients to acquire the same lock, and network partitions can lead to split-brain scenarios.

The Redlock Algorithm

The Redlock algorithm, proposed by Redis creator Salvatore Sanfilippo, addresses single-instance limitations by using multiple independent Redis instances.

Algorithm Steps


sequenceDiagram
participant C as Client
participant R1 as Redis 1
participant R2 as Redis 2
participant R3 as Redis 3
participant R4 as Redis 4
participant R5 as Redis 5

Note over C: Start timer
C->>R1: SET lock_key unique_id NX EX ttl
C->>R2: SET lock_key unique_id NX EX ttl
C->>R3: SET lock_key unique_id NX EX ttl
C->>R4: SET lock_key unique_id NX EX ttl
C->>R5: SET lock_key unique_id NX EX ttl

R1-->>C: OK
R2-->>C: OK
R3-->>C: FAIL
R4-->>C: OK
R5-->>C: FAIL

Note over C: Check: 3/5 nodes acquired<br/>Time elapsed < TTL<br/>Lock is valid

Redlock Implementation

import redis
import random
import time
import uuid
from typing import List, Optional

class Redlock:
    def __init__(self, redis_instances: List[redis.Redis], retry_count=3, retry_delay=0.2):
        self.redis_instances = redis_instances
        self.retry_count = retry_count
        self.retry_delay = retry_delay
        self.quorum = len(redis_instances) // 2 + 1

    def acquire(self, resource: str, ttl: int = 10000) -> Optional[dict]:
        """
        Acquire a lock on resource with TTL in milliseconds.
        Returns lock info if successful, None if failed.
        """
        identifier = str(uuid.uuid4())

        for _ in range(self.retry_count):
            start_time = int(time.time() * 1000)
            successful_locks = 0

            # Try to acquire the lock on all instances
            for redis_client in self.redis_instances:
                try:
                    if redis_client.set(resource, identifier, nx=True, px=ttl):
                        successful_locks += 1
                except Exception:
                    # Instance is down, continue with the others
                    continue

            # Calculate elapsed time
            elapsed_time = int(time.time() * 1000) - start_time
            remaining_ttl = ttl - elapsed_time

            # Check if we have quorum and enough time left
            if successful_locks >= self.quorum and remaining_ttl > 0:
                return {
                    'resource': resource,
                    'identifier': identifier,
                    'ttl': remaining_ttl,
                    'acquired_locks': successful_locks
                }

            # Failed to acquire a majority, release any acquired locks
            self._release_locks(resource, identifier)

            # Random delay before retry to avoid thundering herd
            time.sleep(self.retry_delay * (0.5 + 0.5 * random.random()))

        return None

    def release(self, lock_info: dict) -> bool:
        """Release the distributed lock"""
        return self._release_locks(lock_info['resource'], lock_info['identifier'])

    def _release_locks(self, resource: str, identifier: str) -> bool:
        """Release locks on all instances"""
        lua_script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """

        released_count = 0
        for redis_client in self.redis_instances:
            try:
                if redis_client.eval(lua_script, 1, resource, identifier):
                    released_count += 1
            except Exception:
                continue

        return released_count >= self.quorum

# Usage example
redis_nodes = [
    redis.Redis(host='redis1.example.com', port=6379),
    redis.Redis(host='redis2.example.com', port=6379),
    redis.Redis(host='redis3.example.com', port=6379),
    redis.Redis(host='redis4.example.com', port=6379),
    redis.Redis(host='redis5.example.com', port=6379),
]

redlock = Redlock(redis_nodes)

# Acquire lock
lock_info = redlock.acquire("shared_resource", ttl=30000)  # 30 seconds
if lock_info:
    try:
        # Critical section
        print(f"Lock acquired with {lock_info['acquired_locks']} nodes")
        # Do work...
    finally:
        redlock.release(lock_info)
        print("Lock released")
else:
    print("Failed to acquire distributed lock")

Interview Insight: Common question: “What’s the minimum number of Redis instances needed for Redlock?” Answer: a minimum of 3 for meaningful fault tolerance; 5 is typically recommended. The formula is N = 2F + 1, where N is the total number of instances and F is the number of failures you want to tolerate - for example, tolerating F = 2 failures requires N = 5 instances, with a quorum of 3.

Redlock Controversy

Martin Kleppmann’s criticism of Redlock highlights important considerations:


graph TD
A[Client Acquires Lock] --> B[GC Pause/Network Delay]
B --> C[Lock Expires]
C --> D[Another Client Acquires Same Lock]
D --> E[Two Clients in Critical Section]

style E fill:#ff9999

Interview Insight: Be prepared to discuss the “Redlock controversy.” Kleppmann argued that Redlock doesn’t provide the safety guarantees it claims due to timing assumptions. The key issues are: clock synchronization requirements, GC pauses can cause timing issues, and fencing tokens provide better safety.
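
A common mitigation Kleppmann describes is a fencing token: a monotonically increasing number issued with every lock grant, which the protected resource records and uses to reject requests from stale lock holders. Below is a minimal sketch of the idea using Redis INCR inside a Lua script; the enforcement on the resource side is assumed to exist and is not something Redis itself can provide:

import uuid

# Grant the lock and issue a monotonically increasing fencing token in one atomic step
FENCING_LOCK_SCRIPT = """
if redis.call("SET", KEYS[1], ARGV[1], "NX", "EX", ARGV[2]) then
    return redis.call("INCR", KEYS[2])
else
    return nil
end
"""

def acquire_with_fencing_token(redis_client, key, ttl=30):
    """Returns (identifier, fencing_token) on success, (None, None) otherwise."""
    identifier = str(uuid.uuid4())
    token = redis_client.eval(FENCING_LOCK_SCRIPT, 2, key, f"{key}:fencing", identifier, ttl)
    return (identifier, token) if token else (None, None)

The crucial check lives on the resource side: it must remember the highest token it has seen and refuse writes carrying a smaller one, which is why fencing requires cooperation from the protected system rather than from the lock service alone.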

Best Practices and Common Pitfalls

1. Appropriate TTL Selection

import uuid

class AdaptiveLock:
    def __init__(self, redis_client, base_ttl=10):
        self.redis = redis_client
        self.base_ttl = base_ttl
        self.execution_times = []

    def acquire_with_adaptive_ttl(self, key, expected_execution_time=None):
        """Acquire the lock with a TTL based on expected execution time"""
        if expected_execution_time:
            # TTL should be significantly longer than the expected execution time
            ttl = max(expected_execution_time * 3, self.base_ttl)
        else:
            # Use historical data to estimate
            if self.execution_times:
                avg_time = sum(self.execution_times) / len(self.execution_times)
                ttl = max(avg_time * 2, self.base_ttl)
            else:
                ttl = self.base_ttl

        return self.redis.set(key, str(uuid.uuid4()), nx=True, ex=int(ttl))
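
A brief usage sketch of the class above, reusing the redis_client connection from the earlier examples; the 4-second estimate is illustrative:

adaptive = AdaptiveLock(redis_client, base_ttl=10)

# The caller expects the job to take ~4 seconds, so the lock gets a ~12 second TTL
if adaptive.acquire_with_adaptive_ttl("report_job_lock", expected_execution_time=4):
    print("Lock acquired with adaptive TTL")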

Interview Insight: TTL selection is a classic interview topic. Too short risks premature expiration; too long delays recovery from failures. Best practice: set the TTL to 2-3x your expected critical section execution time (for example, a task that normally takes 5 seconds should use a 10-15 second TTL).

2. Lock Extension for Long Operations

import threading
import uuid

class ExtendableLock:
    def __init__(self, redis_client, key, initial_ttl=30):
        self.redis = redis_client
        self.key = key
        self.ttl = initial_ttl
        self.identifier = str(uuid.uuid4())
        self.extend_timer = None
        self.acquired = False

    def acquire(self):
        """Acquire the lock and start auto-extension"""
        if self.redis.set(self.key, self.identifier, nx=True, ex=self.ttl):
            self.acquired = True
            self._start_extension_timer()
            return True
        return False

    def _start_extension_timer(self):
        """Start a timer to extend the lock before it expires"""
        if self.acquired:
            # Extend at 1/3 of the TTL interval
            extend_interval = self.ttl / 3
            self.extend_timer = threading.Timer(extend_interval, self._extend_lock)
            self.extend_timer.start()

    def _extend_lock(self):
        """Extend the lock TTL if it is still held by us"""
        lua_script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            redis.call("EXPIRE", KEYS[1], ARGV[2])
            return 1
        else
            return 0
        end
        """

        if self.redis.eval(lua_script, 1, self.key, self.identifier, self.ttl):
            self._start_extension_timer()  # Schedule the next extension
        else:
            self.acquired = False

    def release(self):
        """Release the lock and stop extensions"""
        if self.extend_timer:
            self.extend_timer.cancel()

        if self.acquired:
            lua_script = """
            if redis.call("GET", KEYS[1]) == ARGV[1] then
                return redis.call("DEL", KEYS[1])
            else
                return 0
            end
            """
            return self.redis.eval(lua_script, 1, self.key, self.identifier)
        return False
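
A usage sketch for long-running work, reusing the redis_client connection from earlier; run_long_batch_job is a hypothetical placeholder for the actual work:

lock = ExtendableLock(redis_client, "nightly_batch_lock", initial_ttl=30)

if lock.acquire():
    try:
        run_long_batch_job()  # may run for minutes; the background timer keeps extending the TTL
    finally:
        lock.release()
else:
    print("Another worker is already running the batch job")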

3. Retry Strategies

import random
import time
import uuid

class RetryStrategy:
    @staticmethod
    def exponential_backoff(attempt, base_delay=0.1, max_delay=60):
        """Exponential backoff with jitter"""
        delay = min(base_delay * (2 ** attempt), max_delay)
        # Add jitter to prevent thundering herd
        jitter = delay * 0.1 * random.random()
        return delay + jitter

    @staticmethod
    def linear_backoff(attempt, base_delay=0.1, increment=0.1):
        """Linear backoff"""
        return base_delay + (attempt * increment)

class RobustRedisLock:
    def __init__(self, redis_client, key, max_retries=10):
        self.redis = redis_client
        self.key = key
        self.max_retries = max_retries
        self.identifier = str(uuid.uuid4())

    def acquire(self, timeout=30):
        """Acquire the lock with a retry strategy"""
        start_time = time.time()

        for attempt in range(self.max_retries):
            if time.time() - start_time > timeout:
                break

            if self.redis.set(self.key, self.identifier, nx=True, ex=30):
                return True

            # Wait before retrying
            delay = RetryStrategy.exponential_backoff(attempt)
            time.sleep(delay)

        return False
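
A usage sketch for the retrying lock above (redis_client as before, update_inventory is hypothetical); since the class does not define a release method, the atomic Lua release from the earlier RedisLock example is reused inline:

robust_lock = RobustRedisLock(redis_client, "inventory_update_lock", max_retries=10)

if robust_lock.acquire(timeout=30):
    try:
        update_inventory()  # hypothetical critical-section work
    finally:
        # Atomic check-and-delete, same pattern as RedisLock.release()
        redis_client.eval(
            'if redis.call("GET", KEYS[1]) == ARGV[1] then '
            'return redis.call("DEL", KEYS[1]) else return 0 end',
            1, robust_lock.key, robust_lock.identifier,
        )
else:
    print("Could not acquire the lock within the retry budget")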

Interview Insight: Retry strategy questions are common. Key points: exponential backoff prevents overwhelming the system, jitter prevents thundering herd, and you need maximum retry limits to avoid infinite loops.

4. Common Pitfalls

Pitfall 1: Race Condition in Release

# WRONG - race condition
def bad_release(redis_client, key, identifier):
    if redis_client.get(key) == identifier:
        # Another process could acquire the lock here!
        redis_client.delete(key)

# CORRECT - atomic release using a Lua script
def good_release(redis_client, key, identifier):
    lua_script = """
    if redis.call("GET", KEYS[1]) == ARGV[1] then
        return redis.call("DEL", KEYS[1])
    else
        return 0
    end
    """
    return redis_client.eval(lua_script, 1, key, identifier)

Pitfall 2: Clock Drift Issues


graph TD
A[Server A Clock: 10:00:00] --> B[Acquires Lock TTL=10s]
C[Server B Clock: 10:00:05] --> D[Sees Lock Will Expire at 10:00:15]
B --> E[Server A Clock Drifts Behind]
E --> F[Lock Actually Expires Earlier]
D --> G[Server B Acquires Lock Prematurely]

style F fill:#ff9999
style G fill:#ff9999

Interview Insight: Clock drift is a subtle but important issue. Solutions include: using relative timeouts instead of absolute timestamps, implementing clock synchronization (NTP), and considering logical clocks for ordering.
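
The Redlock specification accounts for drift explicitly: a lock's usable validity time is the requested TTL minus the time spent acquiring it minus a drift allowance. A minimal sketch of that calculation, using the drift factor and small fixed offset commonly seen in Redlock client implementations:

def lock_validity_ms(ttl_ms, elapsed_ms, clock_drift_factor=0.01):
    """Remaining time the lock can safely be assumed valid, per the Redlock algorithm."""
    # Allowance for clock drift between Redis nodes, plus a small fixed offset
    drift = int(ttl_ms * clock_drift_factor) + 2
    return ttl_ms - elapsed_ms - drift

# Example: a 10,000 ms TTL that took 150 ms to acquire leaves 10,000 - 150 - 102 = 9,748 ms
remaining = lock_validity_ms(10000, 150)
assert remaining > 0, "Too little validity time left - treat the acquisition as failed"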

Production Considerations

Monitoring and Observability

import logging
import time
import uuid
from dataclasses import dataclass

@dataclass
class LockMetrics:
    acquisition_time: float
    hold_time: float
    success: bool
    retries: int

class MonitoredRedisLock:
    def __init__(self, redis_client, key, metrics_collector=None):
        self.redis = redis_client
        self.key = key
        self.identifier = str(uuid.uuid4())
        self.metrics = metrics_collector
        self.acquire_start_time = None
        self.lock_acquired_time = None

    def acquire(self, timeout=30):
        """Acquire the lock with metrics collection"""
        self.acquire_start_time = time.time()
        retries = 0

        while time.time() - self.acquire_start_time < timeout:
            if self.redis.set(self.key, self.identifier, nx=True, ex=30):
                self.lock_acquired_time = time.time()
                acquisition_time = self.lock_acquired_time - self.acquire_start_time

                # Log successful acquisition
                logging.info(f"Lock acquired for {self.key} after {acquisition_time:.3f}s and {retries} retries")

                if self.metrics:
                    self.metrics.record_acquisition(self.key, acquisition_time, retries)

                return True

            retries += 1
            time.sleep(0.1 * retries)  # Progressive backoff

        # Log acquisition failure
        logging.warning(f"Failed to acquire lock for {self.key} after {timeout}s and {retries} retries")
        return False

    def release(self):
        """Release the lock with metrics"""
        if self.lock_acquired_time:
            hold_time = time.time() - self.lock_acquired_time

            lua_script = """
            if redis.call("GET", KEYS[1]) == ARGV[1] then
                return redis.call("DEL", KEYS[1])
            else
                return 0
            end
            """

            success = bool(self.redis.eval(lua_script, 1, self.key, self.identifier))

            logging.info(f"Lock released for {self.key} after {hold_time:.3f}s hold time")

            if self.metrics:
                self.metrics.record_release(self.key, hold_time, success)

            return success

        return False
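
The metrics_collector interface above is only implied by the record_* calls; a minimal in-memory collector that satisfies it might look like the sketch below (in production this would forward to Prometheus, StatsD, or similar):

class InMemoryLockMetrics:
    def __init__(self):
        self.acquisitions = []
        self.releases = []

    def record_acquisition(self, key, acquisition_time, retries):
        self.acquisitions.append((key, acquisition_time, retries))

    def record_release(self, key, hold_time, success):
        self.releases.append((key, hold_time, success))

lock = MonitoredRedisLock(redis_client, "payment_lock", metrics_collector=InMemoryLockMetrics())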

Health Checks and Circuit Breakers

import logging
import time
import uuid
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        """Execute a function with circuit breaker protection"""
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

class ResilientRedisLock:
    def __init__(self, redis_client, circuit_breaker=None):
        self.redis = redis_client
        self.circuit_breaker = circuit_breaker or CircuitBreaker()

    def acquire(self, key, timeout=30):
        """Acquire a lock with circuit breaker protection"""
        def _acquire():
            return self.redis.set(key, str(uuid.uuid4()), nx=True, ex=timeout)

        try:
            return self.circuit_breaker.call(_acquire)
        except Exception:
            # Fallback: maybe use local locking or skip the operation
            logging.error(f"Lock acquisition failed for {key}, circuit breaker activated")
            return False
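
A usage sketch for the resilient wrapper (redis_client as before, process_orders is hypothetical):

resilient_lock = ResilientRedisLock(redis_client, CircuitBreaker(failure_threshold=3, timeout=30))

if resilient_lock.acquire("order_processing_lock", timeout=15):
    process_orders()  # hypothetical critical-section work
else:
    # Degrade gracefully: skip this run, queue it for later, or fall back to a local lock
    logging.warning("Skipping order processing run; lock unavailable")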

Interview Insight: Production readiness questions often focus on: How do you monitor lock performance? What happens when Redis is down? How do you handle lock contention? Be prepared to discuss circuit breakers, fallback strategies, and metrics collection.

Alternative Approaches

Database-Based Locking

-- Simple database lock table
CREATE TABLE distributed_locks (
    lock_name VARCHAR(255) PRIMARY KEY,
    owner_id VARCHAR(255) NOT NULL,
    acquired_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP NOT NULL,
    INDEX idx_expires_at (expires_at)
);

-- Acquire lock with timeout
INSERT INTO distributed_locks (lock_name, owner_id, expires_at)
VALUES ('resource_lock', 'client_123', DATE_ADD(NOW(), INTERVAL 30 SECOND))
ON DUPLICATE KEY UPDATE
    owner_id = CASE
        WHEN expires_at < NOW() THEN VALUES(owner_id)
        ELSE owner_id
    END,
    expires_at = CASE
        WHEN expires_at < NOW() THEN VALUES(expires_at)
        ELSE expires_at
    END;
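
To show how application code would drive this table, here is a minimal sketch assuming a PyMySQL-style DB-API connection (conn) and the schema above; ownership is confirmed by re-reading the row after the upsert:

import uuid

def acquire_db_lock(conn, lock_name, ttl_seconds=30):
    """Try to take the row-based lock; returns the owner_id on success, None otherwise."""
    owner_id = str(uuid.uuid4())
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO distributed_locks (lock_name, owner_id, expires_at)
            VALUES (%s, %s, DATE_ADD(NOW(), INTERVAL %s SECOND))
            ON DUPLICATE KEY UPDATE
                owner_id   = IF(expires_at < NOW(), VALUES(owner_id), owner_id),
                expires_at = IF(expires_at < NOW(), VALUES(expires_at), expires_at)
            """,
            (lock_name, owner_id, ttl_seconds),
        )
        conn.commit()
        # Re-read to see whether our owner_id actually won the row
        cur.execute("SELECT owner_id FROM distributed_locks WHERE lock_name = %s", (lock_name,))
        row = cur.fetchone()
    return owner_id if row and row[0] == owner_id else None

def release_db_lock(conn, lock_name, owner_id):
    """Delete the lock row only if we still own it."""
    with conn.cursor() as cur:
        cur.execute(
            "DELETE FROM distributed_locks WHERE lock_name = %s AND owner_id = %s",
            (lock_name, owner_id),
        )
        conn.commit()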

Consensus-Based Solutions


graph TD
A[Client Request] --> B[Raft Leader]
B --> C[Propose Lock Acquisition]
C --> D[Replicate to Majority]
D --> E[Commit Lock Entry]
E --> F[Respond to Client]

G[etcd/Consul] --> H[Strong Consistency]
H --> I[Partition Tolerance]
I --> J[Higher Latency]

Interview Insight: When asked about alternatives, discuss trade-offs: database locks provide ACID guarantees but are slower; consensus systems like etcd/Consul provide stronger consistency but higher latency; ZooKeeper offers hierarchical locks but adds operational complexity.
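
For comparison, a consensus-backed lock often requires only a few lines of application code. The sketch below assumes the python-etcd3 client and a local etcd endpoint; treat the exact API surface as illustrative rather than authoritative:

import etcd3

client = etcd3.client(host='localhost', port=2379)

# The lock is backed by a lease, so it is released automatically if this client dies
with client.lock('shared_resource', ttl=20):
    process_shared_resource()  # same hypothetical work as in the earlier examples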

Comparison Matrix

Solution        Consistency   Performance   Complexity   Fault Tolerance
Single Redis    Weak          High          Low          Poor
Redlock         Medium        Medium        Medium       Good
Database        Strong        Low           Low          Good
etcd/Consul     Strong        Medium        High         Excellent
ZooKeeper       Strong        Medium        High         Excellent

Conclusion

Distributed locking with Redis offers a pragmatic balance between performance and consistency for many use cases. The key takeaways are:

  1. Single Redis instance is suitable for non-critical applications where performance matters more than absolute consistency
  2. Redlock algorithm provides better fault tolerance but comes with complexity and timing assumptions
  3. Proper implementation requires attention to atomicity, TTL management, and retry strategies
  4. Production deployment needs monitoring, circuit breakers, and fallback mechanisms
  5. Alternative solutions like consensus systems may be better for critical applications requiring strong consistency

Final Interview Insight: The most important interview question is often: “When would you NOT use Redis for distributed locking?” Be ready to discuss scenarios requiring strong consistency (financial transactions), long-running locks (batch processing), or hierarchical locking (resource trees) where other solutions might be more appropriate.

Remember: distributed locking is fundamentally about trade-offs between consistency, availability, and partition tolerance. Choose the solution that best fits your specific requirements and constraints.