
File Storage Service Design Guide

System Architecture Overview

A production-ready File Storage Service requires careful consideration of scalability, reliability, security, and performance. This service acts as a centralized file management system that abstracts the complexity of cloud storage while providing rich metadata capabilities through PostgreSQL.


graph TB
A[Client Applications] --> B[Load Balancer]
B --> C[File Storage API Gateway]
C --> D[File Storage Service Cluster]
D --> E[PostgreSQL Cluster]
D --> F[Google Cloud Storage]
D --> G[Redis Cache]
H[File Storage SDK] --> C
I[Monitoring & Logging] --> D
J[CDN] --> F

The architecture follows a microservices pattern where the File Storage Service is independently deployable and scalable. The service handles both file operations and metadata management, ensuring data consistency and providing high availability.

Interview Insight: When asked about system architecture decisions, emphasize the separation of concerns - metadata in PostgreSQL for complex queries and actual files in GCS for scalability and durability.

Database Schema Design

The PostgreSQL schema is designed to support various file types, track upload progress for chunked uploads, and maintain audit trails.

-- Files table for storing file metadata
CREATE TABLE files (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
filename VARCHAR(255) NOT NULL,
original_filename VARCHAR(255) NOT NULL,
file_size BIGINT NOT NULL,
mime_type VARCHAR(100) NOT NULL,
file_hash VARCHAR(64) NOT NULL, -- SHA-256 hash for deduplication
gcs_bucket VARCHAR(100) NOT NULL,
gcs_object_key VARCHAR(500) NOT NULL,
upload_status VARCHAR(20) NOT NULL DEFAULT 'PENDING', -- PENDING, UPLOADING, COMPLETED, FAILED, EXPIRED
created_by UUID NOT NULL,
project_id UUID,
tags JSONB,
metadata JSONB, -- Extensible metadata storage
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
expires_at TIMESTAMP WITH TIME ZONE, -- For temporary files

CONSTRAINT valid_upload_status CHECK (upload_status IN ('PENDING', 'UPLOADING', 'COMPLETED', 'FAILED', 'EXPIRED'))
);

-- Chunked uploads table for managing large file uploads
CREATE TABLE chunked_uploads (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
file_id UUID REFERENCES files(id) ON DELETE CASCADE,
upload_id VARCHAR(255) NOT NULL, -- GCS multipart upload ID
total_chunks INTEGER NOT NULL,
completed_chunks INTEGER DEFAULT 0,
chunk_size INTEGER NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
expires_at TIMESTAMP WITH TIME ZONE NOT NULL -- Cleanup incomplete uploads
);

-- Chunks table for tracking individual chunk uploads
CREATE TABLE chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
chunked_upload_id UUID REFERENCES chunked_uploads(id) ON DELETE CASCADE,
chunk_number INTEGER NOT NULL,
chunk_size INTEGER NOT NULL,
etag VARCHAR(255), -- GCS ETag for the chunk
uploaded_at TIMESTAMP WITH TIME ZONE,

UNIQUE(chunked_upload_id, chunk_number)
);

-- File access logs for audit and analytics
CREATE TABLE file_access_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
file_id UUID REFERENCES files(id) ON DELETE SET NULL,
access_type VARCHAR(20) NOT NULL, -- UPLOAD, DOWNLOAD, DELETE, VIEW
user_id UUID NOT NULL,
ip_address INET,
user_agent TEXT,
accessed_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_files_project_id ON files(project_id);
CREATE INDEX idx_files_created_by ON files(created_by);
CREATE INDEX idx_files_file_hash ON files(file_hash);
CREATE INDEX idx_files_upload_status ON files(upload_status);
CREATE INDEX idx_files_created_at ON files(created_at DESC);
CREATE INDEX idx_chunked_uploads_file_id ON chunked_uploads(file_id);
CREATE INDEX idx_chunks_chunked_upload_id ON chunks(chunked_upload_id);
CREATE INDEX idx_file_access_logs_file_id ON file_access_logs(file_id);
CREATE INDEX idx_file_access_logs_user_id ON file_access_logs(user_id);

Interview Insight: The schema design demonstrates understanding of ACID properties, referential integrity, and performance optimization. The JSONB fields provide flexibility while maintaining query performance.
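
To make that concrete, below is a hedged sketch of how the tags column could be filtered with a native JSONB containment query from Spring Data; the repository name, the method signature, and the GIN index mentioned in the comment are illustrative assumptions, not part of the schema above.

// Illustrative repository: filter on the JSONB tags column with the @> containment
// operator. For this to stay fast at scale, a GIN index would be needed, e.g.
//   CREATE INDEX idx_files_tags ON files USING GIN (tags);
// (that index is an assumption, not part of the schema above)
public interface FileTagQueryRepository extends JpaRepository<FileEntity, UUID> {

    @Query(value = """
            SELECT * FROM files
            WHERE tags @> CAST(:tagsJson AS jsonb)
              AND upload_status = 'COMPLETED'
            ORDER BY created_at DESC
            """, nativeQuery = true)
    List<FileEntity> findCompletedByTags(@Param("tagsJson") String tagsJson);
}

A caller would pass a JSON fragment such as {"category": "documents"}; rows whose tags contain that key/value pair are returned.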

Core Service Implementation

File Service Interface

public interface FileStorageService {

/**
* Initiate a single file upload
*/
FileUploadResponse uploadFile(FileUploadRequest request);

/**
* Initiate a chunked upload for large files
*/
ChunkedUploadResponse initiateChunkedUpload(ChunkedUploadRequest request);

/**
* Upload a single chunk
*/
ChunkUploadResponse uploadChunk(ChunkUploadRequest request);

/**
* Complete chunked upload
*/
FileUploadResponse completeChunkedUpload(CompleteChunkedUploadRequest request);

/**
* Generate signed download URL
*/
FileDownloadResponse getDownloadUrl(String fileId, Duration expiration);

/**
* Query files with filtering and pagination
*/
FileQueryResponse queryFiles(FileQueryRequest request);

/**
* Delete a file
*/
void deleteFile(String fileId, String userId);

/**
* Get file metadata
*/
FileMetadata getFileMetadata(String fileId);
}

File Storage Service Implementation

@Service
@Transactional
public class FileStorageServiceImpl implements FileStorageService {

private final FileRepository fileRepository;
private final ChunkedUploadRepository chunkedUploadRepository;
private final GcsStorageClient gcsClient;
private final FileAccessLogger accessLogger;
private final RedisTemplate<String, Object> redisTemplate;

@Override
public FileUploadResponse uploadFile(FileUploadRequest request) {
validateUploadRequest(request);

// Check for file deduplication
Optional<FileEntity> existingFile = fileRepository
.findByFileHashAndProjectId(request.getFileHash(), request.getProjectId());

if (existingFile.isPresent() && request.isAllowDeduplication()) {
return createDeduplicationResponse(existingFile.get());
}

// Create file entity
FileEntity fileEntity = FileEntity.builder()
.filename(generateUniqueFilename(request.getOriginalFilename()))
.originalFilename(request.getOriginalFilename())
.fileSize(request.getFileSize())
.mimeType(request.getMimeType())
.fileHash(request.getFileHash())
.gcsBucket(gcsClient.getDefaultBucket())
.gcsObjectKey(generateObjectKey(request))
.uploadStatus(UploadStatus.PENDING)
.createdBy(request.getUserId())
.projectId(request.getProjectId())
.tags(request.getTags())
.metadata(request.getMetadata())
.build();

fileEntity = fileRepository.save(fileEntity);

try {
// Generate signed upload URL
String signedUrl = gcsClient.generateSignedUploadUrl(
fileEntity.getGcsBucket(),
fileEntity.getGcsObjectKey(),
request.getMimeType(),
Duration.ofHours(1)
);

// Cache upload session
cacheUploadSession(fileEntity.getId(), request.getUserId());

return FileUploadResponse.builder()
.fileId(fileEntity.getId())
.uploadUrl(signedUrl)
.expiresAt(Instant.now().plus(Duration.ofHours(1)))
.build();

} catch (Exception e) {
fileEntity.setUploadStatus(UploadStatus.FAILED);
fileRepository.save(fileEntity);
throw new FileUploadException("Failed to generate upload URL", e);
}
}

@Override
public ChunkedUploadResponse initiateChunkedUpload(ChunkedUploadRequest request) {
validateChunkedUploadRequest(request);

// Create file entity
FileEntity fileEntity = createFileEntity(request);
fileEntity = fileRepository.save(fileEntity);

try {
// Initiate multipart upload in GCS
String uploadId = gcsClient.initiateMultipartUpload(
fileEntity.getGcsBucket(),
fileEntity.getGcsObjectKey(),
request.getMimeType()
);

// Create chunked upload entity
ChunkedUploadEntity chunkedUpload = ChunkedUploadEntity.builder()
.fileId(fileEntity.getId())
.uploadId(uploadId)
.totalChunks(request.getTotalChunks())
.chunkSize(request.getChunkSize())
.expiresAt(Instant.now().plus(Duration.ofDays(1)))
.build();

chunkedUpload = chunkedUploadRepository.save(chunkedUpload);

return ChunkedUploadResponse.builder()
.fileId(fileEntity.getId())
.uploadId(chunkedUpload.getId())
.expiresAt(chunkedUpload.getExpiresAt())
.build();

} catch (Exception e) {
fileEntity.setUploadStatus(UploadStatus.FAILED);
fileRepository.save(fileEntity);
throw new FileUploadException("Failed to initiate chunked upload", e);
}
}

@Override
public ChunkUploadResponse uploadChunk(ChunkUploadRequest request) {
ChunkedUploadEntity chunkedUpload = chunkedUploadRepository
.findById(request.getUploadId())
.orElseThrow(() -> new ChunkedUploadNotFoundException("Upload not found"));

validateChunkUploadRequest(request, chunkedUpload);

try {
// Generate signed URL for chunk upload
String signedUrl = gcsClient.generateSignedChunkUploadUrl(
chunkedUpload.getUploadId(),
request.getChunkNumber(),
Duration.ofHours(1)
);

return ChunkUploadResponse.builder()
.chunkUploadUrl(signedUrl)
.expiresAt(Instant.now().plus(Duration.ofHours(1)))
.build();

} catch (Exception e) {
throw new ChunkUploadException("Failed to generate chunk upload URL", e);
}
}

@Override
public FileDownloadResponse getDownloadUrl(String fileId, Duration expiration) {
FileEntity fileEntity = fileRepository.findById(UUID.fromString(fileId))
.orElseThrow(() -> new FileNotFoundException("File not found"));

if (fileEntity.getUploadStatus() != UploadStatus.COMPLETED) {
throw new FileNotAvailableException("File is not available for download");
}

try {
String signedUrl = gcsClient.generateSignedDownloadUrl(
fileEntity.getGcsBucket(),
fileEntity.getGcsObjectKey(),
expiration
);

// Log access
accessLogger.logAccess(fileId, AccessType.DOWNLOAD);

return FileDownloadResponse.builder()
.downloadUrl(signedUrl)
.filename(fileEntity.getOriginalFilename())
.fileSize(fileEntity.getFileSize())
.mimeType(fileEntity.getMimeType())
.expiresAt(Instant.now().plus(expiration))
.build();

} catch (Exception e) {
throw new FileDownloadException("Failed to generate download URL", e);
}
}
}

Interview Insight: This implementation demonstrates proper error handling, transaction management, and separation of concerns. The use of signed URLs ensures security while offloading bandwidth from your servers.
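
The GcsStorageClient helper referenced throughout is not shown in this guide; the sketch below is one plausible implementation of generateSignedUploadUrl on top of the standard google-cloud-storage Java client (a V4-signed PUT URL). The property name and constructor wiring are assumptions.

@Component
public class GcsStorageClient {

    private final Storage storage; // com.google.cloud.storage.Storage
    private final String defaultBucket;

    public GcsStorageClient(Storage storage,
                            @Value("${file-storage.gcs.bucket}") String defaultBucket) { // property name assumed
        this.storage = storage;
        this.defaultBucket = defaultBucket;
    }

    public String getDefaultBucket() {
        return defaultBucket;
    }

    /**
     * Possible implementation of the signed-URL generation used by uploadFile():
     * a V4-signed PUT URL the client uploads to directly, so file bytes never
     * pass through the API servers.
     */
    public String generateSignedUploadUrl(String bucket, String objectKey,
                                          String mimeType, Duration expiration) {
        BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of(bucket, objectKey))
                .setContentType(mimeType)
                .build();

        URL signedUrl = storage.signUrl(
                blobInfo,
                expiration.toMinutes(), TimeUnit.MINUTES,
                Storage.SignUrlOption.httpMethod(HttpMethod.PUT),
                Storage.SignUrlOption.withContentType(), // binds Content-Type into the signature
                Storage.SignUrlOption.withV4Signature());

        return signedUrl.toString();
    }
}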

API Endpoints Design

REST API Specification

@RestController
@RequestMapping("/api/v1/files")
@Validated
public class FileController {

private final FileStorageService fileStorageService;

/**
* Upload a single file (< 10MB)
* POST /api/v1/files/upload
*/
@PostMapping("/upload")
public ResponseEntity<FileUploadResponse> uploadFile(
@Valid @RequestBody FileUploadRequest request) {

FileUploadResponse response = fileStorageService.uploadFile(request);
return ResponseEntity.ok(response);
}

/**
* Initiate chunked upload for large files
* POST /api/v1/files/upload/chunked
*/
@PostMapping("/upload/chunked")
public ResponseEntity<ChunkedUploadResponse> initiateChunkedUpload(
@Valid @RequestBody ChunkedUploadRequest request) {

ChunkedUploadResponse response = fileStorageService.initiateChunkedUpload(request);
return ResponseEntity.status(HttpStatus.CREATED).body(response);
}

/**
* Get upload URL for a specific chunk
* POST /api/v1/files/upload/chunked/{uploadId}/chunks/{chunkNumber}
*/
@PostMapping("/upload/chunked/{uploadId}/chunks/{chunkNumber}")
public ResponseEntity<ChunkUploadResponse> getChunkUploadUrl(
@PathVariable String uploadId,
@PathVariable Integer chunkNumber) {

ChunkUploadRequest request = ChunkUploadRequest.builder()
.uploadId(uploadId)
.chunkNumber(chunkNumber)
.build();

ChunkUploadResponse response = fileStorageService.uploadChunk(request);
return ResponseEntity.ok(response);
}

/**
* Complete chunked upload
* POST /api/v1/files/upload/chunked/{uploadId}/complete
*/
@PostMapping("/upload/chunked/{uploadId}/complete")
public ResponseEntity<FileUploadResponse> completeChunkedUpload(
@PathVariable String uploadId,
@Valid @RequestBody CompleteChunkedUploadRequest request) {

request.setUploadId(uploadId);
FileUploadResponse response = fileStorageService.completeChunkedUpload(request);
return ResponseEntity.ok(response);
}

/**
* Get file download URL
* GET /api/v1/files/{fileId}/download
*/
@GetMapping("/{fileId}/download")
public ResponseEntity<FileDownloadResponse> getDownloadUrl(
@PathVariable String fileId,
@RequestParam(defaultValue = "3600") Long expirationSeconds) {

Duration expiration = Duration.ofSeconds(expirationSeconds);
FileDownloadResponse response = fileStorageService.getDownloadUrl(fileId, expiration);
return ResponseEntity.ok(response);
}

/**
* Query files with filtering and pagination
* GET /api/v1/files
*/
@GetMapping
public ResponseEntity<FileQueryResponse> queryFiles(
@RequestParam(required = false) String projectId,
@RequestParam(required = false) List<String> mimeTypes,
@RequestParam(required = false) String createdBy,
@RequestParam(required = false) @DateTimeFormat(iso = DateTimeFormat.ISO.DATE_TIME) Instant createdAfter,
@RequestParam(required = false) @DateTimeFormat(iso = DateTimeFormat.ISO.DATE_TIME) Instant createdBefore,
@RequestParam(defaultValue = "0") Integer page,
@RequestParam(defaultValue = "20") Integer size,
@RequestParam(defaultValue = "createdAt,desc") String sort) {

FileQueryRequest request = FileQueryRequest.builder()
.projectId(projectId)
.mimeTypes(mimeTypes)
.createdBy(createdBy)
.createdAfter(createdAfter)
.createdBefore(createdBefore)
.page(page)
.size(size)
.sort(sort)
.build();

FileQueryResponse response = fileStorageService.queryFiles(request);
return ResponseEntity.ok(response);
}

/**
* Get file metadata
* GET /api/v1/files/{fileId}
*/
@GetMapping("/{fileId}")
public ResponseEntity<FileMetadata> getFileMetadata(@PathVariable String fileId) {
FileMetadata metadata = fileStorageService.getFileMetadata(fileId);
return ResponseEntity.ok(metadata);
}

/**
* Delete file
* DELETE /api/v1/files/{fileId}
*/
@DeleteMapping("/{fileId}")
public ResponseEntity<Void> deleteFile(
@PathVariable String fileId,
@RequestHeader("X-User-Id") String userId) {

fileStorageService.deleteFile(fileId, userId);
return ResponseEntity.noContent().build();
}
}

API Request/Response Models

// Request models
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class FileUploadRequest {
@NotNull
@Size(min = 1, max = 255)
private String originalFilename;

@NotNull
@Min(1)
private Long fileSize;

@NotNull
@Size(min = 1, max = 100)
private String mimeType;

@NotNull
@Size(min = 64, max = 64)
private String fileHash; // SHA-256

@NotNull
private String userId;

private String projectId;

@Valid
private Map<String, String> tags;

@Valid
private Map<String, Object> metadata;

@Builder.Default
private Boolean allowDeduplication = true;
}

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class ChunkedUploadRequest {
@NotNull
@Size(min = 1, max = 255)
private String originalFilename;

@NotNull
@Min(1)
private Long fileSize;

@NotNull
@Size(min = 1, max = 100)
private String mimeType;

@NotNull
@Size(min = 64, max = 64)
private String fileHash;

@NotNull
private String userId;

private String projectId;

@NotNull
@Min(2)
@Max(10000)
private Integer totalChunks;

@NotNull
@Min(1048576) // 1MB minimum
@Max(104857600) // 100MB maximum
private Integer chunkSize;

private Map<String, String> tags;
private Map<String, Object> metadata;
}

// Response models
@Data
@Builder
public class FileUploadResponse {
private String fileId;
private String uploadUrl;
private Instant expiresAt;
private Boolean isDuplicate;
}

@Data
@Builder
public class ChunkedUploadResponse {
private String fileId;
private String uploadId;
private Instant expiresAt;
}

@Data
@Builder
public class FileDownloadResponse {
private String downloadUrl;
private String filename;
private Long fileSize;
private String mimeType;
private Instant expiresAt;
}

File Storage SDK Implementation

SDK Architecture

The SDK provides a clean abstraction layer for client applications, taking care of authentication, retries, and error handling automatically.

public class FileStorageSDK {

private final FileStorageClient client;
private final RetryPolicy retryPolicy;
private final AuthenticationProvider authProvider;

public FileStorageSDK(FileStorageConfig config) {
this.client = new FileStorageClient(config);
this.retryPolicy = createRetryPolicy(config);
this.authProvider = new AuthenticationProvider(config);
}

/**
* Upload a single file
*/
public CompletableFuture<FileUploadResult> uploadFile(File file, UploadOptions options) {
return CompletableFuture.supplyAsync(() -> {
try {
validateFile(file);

if (file.length() > options.getChunkThreshold()) {
return uploadFileInChunks(file, options);
} else {
return uploadSingleFile(file, options);
}
} catch (Exception e) {
throw new FileUploadException("Upload failed", e);
}
});
}

/**
* Download a file
*/
public CompletableFuture<File> downloadFile(String fileId, File destination) {
return CompletableFuture.supplyAsync(() -> {
try {
FileDownloadResponse response = client.getDownloadUrl(fileId);
return downloadFromSignedUrl(response.getDownloadUrl(), destination);
} catch (Exception e) {
throw new FileDownloadException("Download failed", e);
}
});
}

/**
* Upload file with progress callback
*/
public CompletableFuture<FileUploadResult> uploadFileWithProgress(
File file,
UploadOptions options,
ProgressCallback callback) {

if (file.length() <= options.getChunkThreshold()) {
return uploadSingleFileWithProgress(file, options, callback);
} else {
return uploadFileInChunksWithProgress(file, options, callback);
}
}

private FileUploadResult uploadFileInChunksWithProgress(
File file,
UploadOptions options,
ProgressCallback callback) {

try {
// Calculate file hash
String fileHash = calculateSHA256(file);

// Calculate chunks
long chunkSize = options.getChunkSize();
int totalChunks = (int) Math.ceil((double) file.length() / chunkSize);

// Initiate chunked upload
ChunkedUploadRequest request = ChunkedUploadRequest.builder()
.originalFilename(file.getName())
.fileSize(file.length())
.mimeType(detectMimeType(file))
.fileHash(fileHash)
.userId(options.getUserId())
.projectId(options.getProjectId())
.totalChunks(totalChunks)
.chunkSize((int) chunkSize)
.tags(options.getTags())
.metadata(options.getMetadata())
.build();

ChunkedUploadResponse uploadResponse = client.initiateChunkedUpload(request);

// Upload chunks
List<ChunkInfo> chunkInfos = new ArrayList<>();
try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
for (int i = 0; i < totalChunks; i++) {
long offset = (long) i * chunkSize;
long currentChunkSize = Math.min(chunkSize, file.length() - offset);

ChunkUploadResponse chunkResponse = client.getChunkUploadUrl(
uploadResponse.getUploadId(), i + 1);

// Upload chunk to GCS
byte[] chunkData = new byte[(int) currentChunkSize];
raf.seek(offset);
raf.readFully(chunkData);

String etag = uploadChunkToGCS(chunkResponse.getChunkUploadUrl(), chunkData);

chunkInfos.add(new ChunkInfo(i + 1, etag));

// Update progress
if (callback != null) {
callback.onProgress((double) (i + 1) / totalChunks);
}
}
}

// Complete upload
CompleteChunkedUploadRequest completeRequest = CompleteChunkedUploadRequest.builder()
.uploadId(uploadResponse.getUploadId())
.chunks(chunkInfos)
.build();

FileUploadResponse finalResponse = client.completeChunkedUpload(completeRequest);

if (callback != null) {
callback.onComplete();
}

return FileUploadResult.builder()
.fileId(finalResponse.getFileId())
.filename(file.getName())
.fileSize(file.length())
.uploadedAt(Instant.now())
.build();

} catch (Exception e) {
if (callback != null) {
callback.onError(e);
}
throw new FileUploadException("Chunked upload failed", e);
}
}

/**
* Query files with fluent API
*/
public FileQuery queryFiles() {
return new FileQuery(client);
}

/**
* Fluent query builder
*/
public static class FileQuery {
private final FileStorageClient client;
private final FileQueryRequest.FileQueryRequestBuilder builder;

public FileQuery(FileStorageClient client) {
this.client = client;
this.builder = FileQueryRequest.builder();
}

public FileQuery inProject(String projectId) {
builder.projectId(projectId);
return this;
}

public FileQuery withMimeTypes(String... mimeTypes) {
builder.mimeTypes(Arrays.asList(mimeTypes));
return this;
}

public FileQuery createdBy(String userId) {
builder.createdBy(userId);
return this;
}

public FileQuery createdAfter(Instant after) {
builder.createdAfter(after);
return this;
}

public FileQuery pageSize(int size) {
builder.size(size);
return this;
}

public FileQuery page(int page) {
builder.page(page);
return this;
}

public CompletableFuture<FileQueryResult> execute() {
return CompletableFuture.supplyAsync(() -> {
try {
FileQueryResponse response = client.queryFiles(builder.build());
return FileQueryResult.from(response);
} catch (Exception e) {
throw new FileQueryException("Query failed", e);
}
});
}
}
}

SDK Usage Examples

// Initialize SDK
FileStorageConfig config = FileStorageConfig.builder()
.baseUrl("https://api.example.com")
.apiKey("your-api-key")
.timeout(Duration.ofMinutes(5))
.chunkSize(5 * 1024 * 1024) // 5MB chunks
.build();

FileStorageSDK sdk = new FileStorageSDK(config);

// Simple file upload
File document = new File("document.pdf");
UploadOptions options = UploadOptions.builder()
.userId("user-123")
.projectId("project-456")
.tag("category", "documents")
.metadata("department", "engineering")
.build();

CompletableFuture<FileUploadResult> uploadFuture = sdk.uploadFile(document, options);
FileUploadResult result = uploadFuture.join();
System.out.println("File uploaded: " + result.getFileId());

// Upload with progress tracking
sdk.uploadFileWithProgress(document, options, new ProgressCallback() {
@Override
public void onProgress(double progress) {
System.out.printf("Upload progress: %.2f%%\n", progress * 100);
}

@Override
public void onComplete() {
System.out.println("Upload completed!");
}

@Override
public void onError(Exception error) {
System.err.println("Upload failed: " + error.getMessage());
}
});

// Query files
CompletableFuture<FileQueryResult> queryFuture = sdk.queryFiles()
.inProject("project-456")
.withMimeTypes("application/pdf", "image/jpeg")
.createdAfter(Instant.now().minus(Duration.ofDays(30)))
.pageSize(50)
.execute();

FileQueryResult queryResult = queryFuture.join();
queryResult.getFiles().forEach(file -> {
System.out.println("File: " + file.getFilename() + " (" + file.getFileSize() + " bytes)");
});

// Download file
CompletableFuture<File> downloadFuture = sdk.downloadFile(
result.getFileId(),
new File("downloads/document.pdf")
);
File downloadedFile = downloadFuture.join();

Interview Insight: The SDK design demonstrates understanding of asynchronous programming, builder patterns, and clean API design. The fluent query interface shows advanced Java skills.
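
The examples rely on a ProgressCallback interface and an UploadOptions builder that the SDK listing never defines. The shapes below are assumptions that are merely consistent with how they are used above; field names, defaults, and the Lombok annotations are illustrative.

// Assumed shape of the callback used by uploadFileWithProgress()
public interface ProgressCallback {
    void onProgress(double progress); // 0.0 .. 1.0
    void onComplete();
    void onError(Exception error);
}

// Assumed shape of UploadOptions, consistent with the builder calls in the examples;
// @Singular generates the .tag(k, v) / .metadata(k, v) adders used above
@Data
@Builder
public class UploadOptions {
    private String userId;
    private String projectId;

    @Builder.Default
    private long chunkThreshold = 10 * 1024 * 1024; // switch to chunked uploads above 10MB

    @Builder.Default
    private long chunkSize = 5 * 1024 * 1024;       // 5MB chunks

    @Singular("tag")
    private Map<String, String> tags;

    @Singular("metadata")
    private Map<String, Object> metadata;
}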

Chunked Upload Flow

The chunked upload mechanism handles large files efficiently by splitting them into manageable chunks and uploading them in parallel when possible.


sequenceDiagram
participant C as Client
participant API as File API
participant DB as PostgreSQL
participant GCS as Google Cloud Storage

Note over C,GCS: Large File Upload (>10MB)

C->>API: POST /files/upload/chunked
API->>DB: Create file metadata (PENDING)
API->>GCS: Initiate multipart upload
GCS-->>API: Upload ID
API->>DB: Create chunked_upload record
API-->>C: Return upload ID & file ID

loop For each chunk
    C->>API: POST /upload/chunked/{uploadId}/chunks/{n}
    API-->>C: Return signed upload URL
    C->>GCS: PUT chunk data to signed URL
    GCS-->>C: ETag response
    C->>API: Notify chunk completion (ETag)
    API->>DB: Update chunk status
end

C->>API: POST /upload/chunked/{uploadId}/complete
API->>GCS: Complete multipart upload
GCS-->>API: Final object info
API->>DB: Update file status to COMPLETED
API-->>C: Return download URL
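
The SDK shown earlier uploads chunks one at a time. Because each chunk gets its own signed URL, a bounded executor can push several chunks concurrently; the sketch below assumes the client, uploadId, file, totalChunks, and chunkSize variables plus the uploadChunkToGCS helper from the SDK section, and a hypothetical readChunk helper for slicing the file.

// Sketch: upload chunks in parallel with a bounded thread pool.
ExecutorService pool = Executors.newFixedThreadPool(4); // at most 4 chunks in flight
try {
    List<CompletableFuture<ChunkInfo>> futures = new ArrayList<>();
    for (int i = 0; i < totalChunks; i++) {
        final int chunkNumber = i + 1;
        futures.add(CompletableFuture.supplyAsync(() -> {
            ChunkUploadResponse resp = client.getChunkUploadUrl(uploadId, chunkNumber);
            byte[] data = readChunk(file, chunkNumber, chunkSize); // hypothetical helper
            String etag = uploadChunkToGCS(resp.getChunkUploadUrl(), data);
            return new ChunkInfo(chunkNumber, etag);
        }, pool));
    }

    // Block until every chunk is done, then complete the upload in chunk order
    List<ChunkInfo> chunkInfos = futures.stream()
            .map(CompletableFuture::join)
            .sorted(Comparator.comparingInt(ChunkInfo::getChunkNumber)) // getter name assumed
            .collect(Collectors.toList());

    client.completeChunkedUpload(CompleteChunkedUploadRequest.builder()
            .uploadId(uploadId)
            .chunks(chunkInfos)
            .build());
} finally {
    pool.shutdown();
}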

Chunked Upload Implementation Details

@Component
@Slf4j
public class ChunkedUploadHandler {

private final ChunkedUploadRepository chunkedUploadRepository;
private final ChunkRepository chunkRepository;
private final GcsStorageClient gcsClient;
private final TaskExecutor taskExecutor;
private final ApplicationEventPublisher applicationEventPublisher;

/**
* Process chunk completion asynchronously
*/
@Async
public CompletableFuture<Void> handleChunkCompletion(
String uploadId,
Integer chunkNumber,
String etag) {

return CompletableFuture.runAsync(() -> {
try {
ChunkedUploadEntity upload = chunkedUploadRepository.findById(uploadId)
.orElseThrow(() -> new ChunkedUploadNotFoundException("Upload not found"));

// Update or create chunk record
ChunkEntity chunk = chunkRepository
.findByChunkedUploadIdAndChunkNumber(uploadId, chunkNumber)
.orElse(ChunkEntity.builder()
.chunkedUploadId(uploadId)
.chunkNumber(chunkNumber)
.build());

chunk.setEtag(etag);
chunk.setUploadedAt(Instant.now());
chunkRepository.save(chunk);

// Update completed chunks count
upload.setCompletedChunks(upload.getCompletedChunks() + 1);
chunkedUploadRepository.save(upload);

// Check if all chunks are completed
if (upload.getCompletedChunks().equals(upload.getTotalChunks())) {
notifyUploadCompletion(uploadId);
}

} catch (Exception e) {
log.error("Failed to handle chunk completion for upload: {}, chunk: {}",
uploadId, chunkNumber, e);
throw new ChunkProcessingException("Chunk processing failed", e);
}
}, taskExecutor);
}

/**
* Auto-complete upload when all chunks are received
*/
private void notifyUploadCompletion(String uploadId) {
// Use event-driven approach for loose coupling
applicationEventPublisher.publishEvent(
new ChunkedUploadCompletedEvent(uploadId)
);
}
}

Interview Insight: The asynchronous processing of chunk completions demonstrates understanding of event-driven architecture and prevents blocking the upload API.
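
For completeness, a possible listener for the ChunkedUploadCompletedEvent published above: it finalizes the multipart upload in GCS and marks the file COMPLETED. The repository and client method names flagged in the comments are assumptions layered on the earlier service code.

@Component
@Slf4j
@RequiredArgsConstructor
public class ChunkedUploadCompletionListener {

    private final ChunkedUploadRepository chunkedUploadRepository;
    private final ChunkRepository chunkRepository;
    private final FileRepository fileRepository;
    private final GcsStorageClient gcsClient;

    @EventListener
    @Transactional
    public void onUploadCompleted(ChunkedUploadCompletedEvent event) {
        ChunkedUploadEntity upload = chunkedUploadRepository.findById(event.getUploadId())
                .orElseThrow(() -> new ChunkedUploadNotFoundException("Upload not found"));

        // Collect the stored ETags in chunk order and ask GCS to assemble the object
        List<ChunkEntity> chunks = chunkRepository
                .findByChunkedUploadIdOrderByChunkNumber(upload.getId()); // assumed query method

        gcsClient.completeMultipartUpload(upload.getUploadId(), chunks); // assumed signature

        fileRepository.findById(upload.getFileId()).ifPresent(file -> {
            file.setUploadStatus(UploadStatus.COMPLETED);
            fileRepository.save(file);
        });

        log.info("Chunked upload {} finalized", event.getUploadId());
    }
}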

Security and Authentication

API Security Implementation

@Component
public class FileSecurityService {

private final JwtTokenProvider jwtTokenProvider;
private final PermissionService permissionService;
private final RateLimitService rateLimitService;
private final FileRepository fileRepository;

/**
* Validate file access permissions
*/
public void validateFileAccess(String fileId, String userId, FileOperation operation) {
// Check rate limits
if (!rateLimitService.isAllowed(userId, operation)) {
throw new RateLimitExceededException("Rate limit exceeded");
}

// Validate file ownership or permissions
FileEntity file = fileRepository.findById(UUID.fromString(fileId))
.orElseThrow(() -> new FileNotFoundException("File not found"));

if (!canUserAccessFile(file, userId, operation)) {
throw new AccessDeniedException("Insufficient permissions");
}

// Validate file status for downloads
if (operation == FileOperation.DOWNLOAD &&
file.getUploadStatus() != UploadStatus.COMPLETED) {
throw new FileNotAvailableException("File not available");
}

// Check file expiration
if (file.getExpiresAt() != null &&
file.getExpiresAt().isBefore(Instant.now())) {
throw new FileExpiredException("File has expired");
}
}

private boolean canUserAccessFile(FileEntity file, String userId, FileOperation operation) {
// Owner always has full access
if (file.getCreatedBy().equals(UUID.fromString(userId))) {
return true;
}

// Check project-level permissions
if (file.getProjectId() != null) {
return permissionService.hasProjectPermission(
userId,
file.getProjectId().toString(),
operation.toPermission()
);
}

// Check organization-level permissions for non-project files
return permissionService.hasOrganizationPermission(
userId,
operation.toPermission()
);
}

/**
* Generate secure signed URLs with additional validation
*/
public String generateSecureSignedUrl(String fileId, String userId, Duration expiration) {
validateFileAccess(fileId, userId, FileOperation.DOWNLOAD);

FileEntity file = fileRepository.findById(UUID.fromString(fileId))
.orElseThrow(() -> new FileNotFoundException("File not found"));

// Generate signed URL with custom headers for validation
Map<String, String> customHeaders = Map.of(
"x-user-id", userId,
"x-file-id", fileId,
"x-timestamp", String.valueOf(Instant.now().getEpochSecond())
);

return gcsClient.generateSignedDownloadUrl(
file.getGcsBucket(),
file.getGcsObjectKey(),
expiration,
customHeaders
);
}
}

/**
* Security interceptor for file operations
*/
@Component
public class FileSecurityInterceptor implements HandlerInterceptor {

private final FileSecurityService securityService;
private final JwtTokenProvider jwtTokenProvider;

@Override
public boolean preHandle(HttpServletRequest request,
HttpServletResponse response,
Object handler) throws Exception {

String token = extractTokenFromRequest(request);
if (token == null || !jwtTokenProvider.validateToken(token)) {
response.setStatus(HttpStatus.UNAUTHORIZED.value());
return false;
}

String userId = jwtTokenProvider.getUserIdFromToken(token);

// Extract file ID from path
String fileId = extractFileIdFromPath(request.getRequestURI());
if (fileId != null) {
FileOperation operation = determineOperation(request.getMethod(), request.getRequestURI());
securityService.validateFileAccess(fileId, userId, operation);
}

// Set user context for the request
SecurityContextHolder.getContext().setUserId(userId);

return true;
}
}
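
The RateLimitService injected above is referenced but never defined. A minimal sketch using a fixed-window counter in Redis follows; the window length and per-window limit are assumptions.

@Component
public class RateLimitService {

    private final StringRedisTemplate redisTemplate;

    // Assumed limits: 100 operations per user, per operation type, per minute
    private static final int MAX_OPERATIONS_PER_WINDOW = 100;
    private static final Duration WINDOW = Duration.ofMinutes(1);

    public RateLimitService(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public boolean isAllowed(String userId, FileOperation operation) {
        // Fixed-window counter: one key per user/operation/window
        long window = Instant.now().getEpochSecond() / WINDOW.getSeconds();
        String key = "ratelimit:" + userId + ":" + operation + ":" + window;

        Long count = redisTemplate.opsForValue().increment(key);
        if (count != null && count == 1L) {
            // First hit in this window: set the expiry so the key cleans itself up
            redisTemplate.expire(key, WINDOW);
        }
        return count != null && count <= MAX_OPERATIONS_PER_WINDOW;
    }
}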

Performance Optimization

Caching Strategy

@Service
public class FileCacheService {

private final RedisTemplate<String, Object> redisTemplate;
private final CacheManager cacheManager;
private final FileRepository fileRepository;

private static final String METADATA_CACHE_PREFIX = "file:metadata:";
private static final String DOWNLOAD_URL_CACHE_PREFIX = "file:download:";
private static final Duration METADATA_CACHE_TTL = Duration.ofMinutes(30);
private static final Duration DOWNLOAD_URL_CACHE_TTL = Duration.ofMinutes(5);

/**
* Cache file metadata with intelligent TTL
*/
@Cacheable(value = "fileMetadata", key = "#fileId")
public FileMetadata getFileMetadata(String fileId) {
// Cache miss - load from database
FileEntity file = fileRepository.findById(UUID.fromString(fileId))
.orElseThrow(() -> new FileNotFoundException("File not found"));

FileMetadata metadata = FileMetadata.from(file);

// Cache with adaptive TTL based on file age
Duration ttl = calculateAdaptiveTTL(file.getCreatedAt());
cacheMetadataWithTTL(fileId, metadata, ttl);

return metadata;
}

/**
* Cache download URLs with short TTL for security
*/
public String getCachedDownloadUrl(String fileId, String userId) {
String cacheKey = DOWNLOAD_URL_CACHE_PREFIX + fileId + ":" + userId;
return (String) redisTemplate.opsForValue().get(cacheKey);
}

public void cacheDownloadUrl(String fileId, String userId, String url, Duration expiration) {
String cacheKey = DOWNLOAD_URL_CACHE_PREFIX + fileId + ":" + userId;
Duration cacheTTL = expiration.compareTo(DOWNLOAD_URL_CACHE_TTL) < 0 ?
expiration : DOWNLOAD_URL_CACHE_TTL;

redisTemplate.opsForValue().set(cacheKey, url, cacheTTL);
}

/**
* Invalidate cache when file is updated or deleted
*/
@CacheEvict(value = "fileMetadata", key = "#fileId")
public void invalidateFileCache(String fileId) {
// Also clear download URL caches for this file
Set<String> keys = redisTemplate.keys(DOWNLOAD_URL_CACHE_PREFIX + fileId + ":*");
if (keys != null && !keys.isEmpty()) {
redisTemplate.delete(keys);
}
}

/**
* Calculate adaptive TTL based on file characteristics
*/
private Duration calculateAdaptiveTTL(Instant createdAt) {
Duration age = Duration.between(createdAt, Instant.now());

// Older files get longer cache TTL as they're less likely to change
if (age.toDays() > 30) {
return Duration.ofHours(2);
} else if (age.toDays() > 7) {
return Duration.ofMinutes(60);
} else {
return Duration.ofMinutes(15);
}
}
}

/**
* Database query optimization
*/
@Repository
public interface OptimizedFileRepository extends JpaRepository<FileEntity, UUID> {

/**
* Optimized query for file listing with filtering. Declared on a Spring Data
* repository interface so the @Query annotations below are actually honored.
*/
@Query(value = """
SELECT f.*,
COUNT(*) OVER() as total_count
FROM files f
WHERE (?1 IS NULL OR f.project_id = ?1::uuid)
AND (?2 IS NULL OR f.mime_type = ANY(?2))
AND (?3 IS NULL OR f.created_by = ?3::uuid)
AND (?4 IS NULL OR f.created_at >= ?4)
AND (?5 IS NULL OR f.created_at <= ?5)
AND f.upload_status = 'COMPLETED'
ORDER BY f.created_at DESC
LIMIT ?6 OFFSET ?7
""", nativeQuery = true)
List<FileProjection> findFilesOptimized(
String projectId,
String[] mimeTypes,
String createdBy,
Instant createdAfter,
Instant createdBefore,
int limit,
int offset
);

/**
* Bulk operations for cleanup
*/
@Modifying
@Query("DELETE FROM FileEntity f WHERE f.uploadStatus = 'FAILED' AND f.createdAt < :cutoffDate")
int deleteFailedUploadsOlderThan(@Param("cutoffDate") Instant cutoffDate);

@Modifying
@Query("""
UPDATE FileEntity f
SET f.uploadStatus = 'EXPIRED'
WHERE f.expiresAt < :now AND f.uploadStatus = 'COMPLETED'
""")
int markExpiredFiles(@Param("now") Instant now);
}

Monitoring and Observability

@Component
public class FileStorageMetrics {

private final MeterRegistry meterRegistry;
private final ChunkedUploadRepository chunkedUploadRepository;
private final Timer uploadTimer;
private final DistributionSummary fileSizeDistribution;
private final AtomicLong activeUploads = new AtomicLong();

public FileStorageMetrics(MeterRegistry meterRegistry,
ChunkedUploadRepository chunkedUploadRepository) {
this.meterRegistry = meterRegistry;
this.chunkedUploadRepository = chunkedUploadRepository;

this.uploadTimer = Timer.builder("file_upload_duration")
.description("File upload duration")
.register(meterRegistry);

this.fileSizeDistribution = DistributionSummary.builder("file_size_bytes")
.description("Distribution of file sizes")
.register(meterRegistry);

// Register the gauge once; the scheduled task below only updates the backing value
Gauge.builder("file_uploads_active", activeUploads, AtomicLong::get)
.description("Number of active uploads")
.register(meterRegistry);
}

public void recordUpload(String mimeType, long fileSize, Duration duration, boolean successful) {
// Counters with per-request tags are resolved through the registry, which caches
// one counter per unique tag combination
meterRegistry.counter("file_uploads_total",
"mime_type", mimeType,
"status", successful ? "success" : "failure")
.increment();

if (successful) {
uploadTimer.record(duration);
fileSizeDistribution.record(fileSize);
}
}

public void recordDownload(String mimeType) {
meterRegistry.counter("file_downloads_total", "mime_type", mimeType).increment();
}

@Scheduled(fixedRate = 30000) // Every 30 seconds
public void updateActiveUploadsGauge() {
activeUploads.set(chunkedUploadRepository.countByStatusIn(
Arrays.asList(UploadStatus.PENDING, UploadStatus.UPLOADING)
));
}
}

Error Handling and Resilience

Comprehensive Error Handling

@ControllerAdvice
public class FileStorageExceptionHandler {

private static final Logger log = LoggerFactory.getLogger(FileStorageExceptionHandler.class);

@ExceptionHandler(FileNotFoundException.class)
public ResponseEntity<ErrorResponse> handleFileNotFound(FileNotFoundException e) {
log.warn("File not found: {}", e.getMessage());
return ResponseEntity.status(HttpStatus.NOT_FOUND)
.body(ErrorResponse.builder()
.error("FILE_NOT_FOUND")
.message(e.getMessage())
.timestamp(Instant.now())
.build());
}

@ExceptionHandler(FileUploadException.class)
public ResponseEntity<ErrorResponse> handleUploadException(FileUploadException e) {
log.error("File upload failed", e);
return ResponseEntity.status(HttpStatus.BAD_REQUEST)
.body(ErrorResponse.builder()
.error("UPLOAD_FAILED")
.message("File upload failed: " + e.getMessage())
.timestamp(Instant.now())
.details(extractErrorDetails(e))
.build());
}

@ExceptionHandler(RateLimitExceededException.class)
public ResponseEntity<ErrorResponse> handleRateLimit(RateLimitExceededException e) {
log.warn("Rate limit exceeded: {}", e.getMessage());
return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS)
.header("Retry-After", "60")
.body(ErrorResponse.builder()
.error("RATE_LIMIT_EXCEEDED")
.message("Too many requests. Please try again later.")
.timestamp(Instant.now())
.build());
}

@ExceptionHandler(StorageQuotaExceededException.class)
public ResponseEntity<ErrorResponse> handleQuotaExceeded(StorageQuotaExceededException e) {
log.warn("Storage quota exceeded: {}", e.getMessage());
return ResponseEntity.status(HttpStatus.INSUFFICIENT_STORAGE)
.body(ErrorResponse.builder()
.error("QUOTA_EXCEEDED")
.message("Storage quota exceeded. Please upgrade your plan or delete some files.")
.timestamp(Instant.now())
.build());
}
}

/**
* Circuit breaker for GCS operations
*/
@Component
public class ResilientGcsClient {

private final Storage storage;
private final CircuitBreaker circuitBreaker;
private final RetryTemplate retryTemplate;

public ResilientGcsClient(Storage storage) {
this.storage = storage;
// Builder-style configuration shown for illustration; adapt to the concrete
// circuit breaker library in use (e.g. Resilience4j configures this via CircuitBreakerConfig)
this.circuitBreaker = CircuitBreaker.builder("gcs-operations")
.slidingWindow(10, 5, CircuitBreaker.SlidingWindowType.COUNT_BASED)
.failureThreshold(50.0f)
.waitInterval(Duration.ofSeconds(30))
.build();

this.retryTemplate = RetryTemplate.builder()
.maxAttempts(3)
.exponentialBackoff(1000, 2, 10000)
.retryOn(StorageException.class)
.build();
}

public String generateSignedUrl(String bucket, String objectName, Duration expiration) {
return circuitBreaker.supply(() ->
retryTemplate.execute(context -> {
try {
BlobInfo blobInfo = BlobInfo.newBuilder(bucket, objectName).build();
return storage.signUrl(blobInfo, expiration.toMillis(), TimeUnit.MILLISECONDS)
.toString();
} catch (StorageException e) {
log.error("Failed to generate signed URL for {}/{}", bucket, objectName, e);
throw e;
}
})
);
}
}

Deployment and Infrastructure

Docker Configuration

# Multi-stage build for production optimization
FROM openjdk:17-jdk-slim AS builder

WORKDIR /app
COPY gradlew .
COPY gradle gradle
COPY build.gradle settings.gradle ./
COPY src src

RUN ./gradlew build -x test --no-daemon

# The openjdk image line has no 17 JRE-only variant; use a Temurin JRE base for the runtime stage
FROM eclipse-temurin:17-jre-jammy

# Security: Create non-root user
RUN groupadd -r fileservice && useradd -r -g fileservice fileservice

# Install required packages
RUN apt-get update && apt-get install -y \
curl \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy application jar
COPY --from=builder /app/build/libs/file-storage-service-*.jar app.jar

# Copy configuration files
COPY --chown=fileservice:fileservice docker/application-docker.yml application.yml

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8080/actuator/health || exit 1

# Switch to non-root user
USER fileservice

# JVM optimization for containers
ENV JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"

EXPOSE 8080

ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar app.jar"]

Kubernetes Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: file-storage-service
namespace: file-storage
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
selector:
matchLabels:
app: file-storage-service
template:
metadata:
labels:
app: file-storage-service
spec:
containers:
- name: file-storage-service
image: file-storage-service:latest
ports:
- containerPort: 8080
env:
- name: SPRING_PROFILES_ACTIVE
value: "kubernetes"
- name: DB_HOST
valueFrom:
secretKeyRef:
name: postgres-secret
key: host
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
- name: GOOGLE_APPLICATION_CREDENTIALS
value: "/etc/gcp/service-account.json"
volumeMounts:
- name: gcp-service-account
mountPath: /etc/gcp
readOnly: true
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
volumes:
- name: gcp-service-account
secret:
secretName: gcp-service-account

---
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: file-storage-service
namespace: file-storage
spec:
selector:
app: file-storage-service
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: ClusterIP

---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: file-storage-hpa
namespace: file-storage
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: file-storage-service
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80

Testing Strategy

Integration Testing

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@TestPropertySource(locations = "classpath:application-test.properties")
@Testcontainers
class FileStorageIntegrationTest {

@Container
static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:13")
.withDatabaseName("filetest")
.withUsername("test")
.withPassword("test");

@Container
static GenericContainer<?> redis = new GenericContainer<>("redis:6-alpine")
.withExposedPorts(6379);

@DynamicPropertySource
static void registerContainerProperties(DynamicPropertyRegistry registry) {
registry.add("spring.datasource.url", postgres::getJdbcUrl);
registry.add("spring.datasource.username", postgres::getUsername);
registry.add("spring.datasource.password", postgres::getPassword);
registry.add("spring.redis.host", redis::getHost);
registry.add("spring.redis.port", () -> redis.getMappedPort(6379));
}

@Autowired
private TestRestTemplate restTemplate;

@MockBean
private GcsStorageClient gcsClient;

@Test
void shouldUploadFileSuccessfully() {
// Given
FileUploadRequest request = FileUploadRequest.builder()
.originalFilename("test.pdf")
.fileSize(1024L)
.mimeType("application/pdf")
.fileHash("abcd1234")
.userId("user-123")
.projectId("project-456")
.build();

when(gcsClient.generateSignedUploadUrl(any(), any(), any(), any()))
.thenReturn("https://storage.googleapis.com/signed-url");

// When
ResponseEntity<FileUploadResponse> response = restTemplate.postForEntity(
"/api/v1/files/upload",
request,
FileUploadResponse.class
);

// Then
assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
assertThat(response.getBody().getFileId()).isNotNull();
assertThat(response.getBody().getUploadUrl()).contains("signed-url");
}

@Test
void shouldHandleChunkedUploadFlow() {
// Given
ChunkedUploadRequest request = ChunkedUploadRequest.builder()
.originalFilename("large-file.zip")
.fileSize(50_000_000L)
.mimeType("application/zip")
.fileHash("efgh5678")
.userId("user-123")
.totalChunks(10)
.chunkSize(5_000_000)
.build();

when(gcsClient.initiateMultipartUpload(any(), any(), any()))
.thenReturn("multipart-upload-id");

// When - Initiate chunked upload
ResponseEntity<ChunkedUploadResponse> initResponse = restTemplate.postForEntity(
"/api/v1/files/upload/chunked",
request,
ChunkedUploadResponse.class
);

// Then
assertThat(initResponse.getStatusCode()).isEqualTo(HttpStatus.CREATED);
String uploadId = initResponse.getBody().getUploadId();

// When - Get chunk upload URLs
for (int i = 1; i <= 10; i++) {
when(gcsClient.generateSignedChunkUploadUrl(any(), eq(i), any()))
.thenReturn("https://storage.googleapis.com/chunk-" + i);

ResponseEntity<ChunkUploadResponse> chunkResponse = restTemplate.postForEntity(
"/api/v1/files/upload/chunked/" + uploadId + "/chunks/" + i,
null,
ChunkUploadResponse.class
);

assertThat(chunkResponse.getStatusCode()).isEqualTo(HttpStatus.OK);
}

// When - Complete upload
CompleteChunkedUploadRequest completeRequest = CompleteChunkedUploadRequest.builder()
.uploadId(uploadId)
.chunks(IntStream.rangeClosed(1, 10)
.mapToObj(i -> new ChunkInfo(i, "etag-" + i))
.collect(Collectors.toList()))
.build();

when(gcsClient.completeMultipartUpload(any(), any()))
.thenReturn("final-object-key");

ResponseEntity<FileUploadResponse> completeResponse = restTemplate.postForEntity(
"/api/v1/files/upload/chunked/" + uploadId + "/complete",
completeRequest,
FileUploadResponse.class
);

// Then
assertThat(completeResponse.getStatusCode()).isEqualTo(HttpStatus.OK);
assertThat(completeResponse.getBody().getFileId()).isNotNull();
}
}

Common Interview Questions & Answers

Q: How do you handle concurrent uploads of the same file?

A: We implement file deduplication using SHA-256 hashes. When a file upload request comes in, we first check if a file with the same hash already exists for that project. If deduplication is enabled and the file exists, we return a reference to the existing file instead of creating a duplicate. For concurrent uploads of different files, we use database transactions and optimistic locking to handle race conditions.
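
A sketch of the two mechanisms that answer leans on, both assumptions layered on the earlier schema: a partial unique index so two concurrent uploads of the same hash cannot both insert a COMPLETED row, and a JPA @Version column for optimistic locking on updates.

// Illustrative: optimistic locking on the file row plus a DB-level guard against
// duplicate (file_hash, project_id) pairs. The unique index would be created as:
//   CREATE UNIQUE INDEX uq_files_hash_project
//     ON files(file_hash, project_id) WHERE upload_status = 'COMPLETED';
@Entity
@Table(name = "files")
public class FileEntity {

    @Id
    private UUID id;

    @Version
    private long version; // JPA optimistic locking: concurrent updates fail fast

    @Column(name = "file_hash", nullable = false, length = 64)
    private String fileHash;

    @Column(name = "project_id")
    private UUID projectId;

    // ... remaining columns as in the schema above
}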

Q: How do you ensure data consistency during chunked uploads?

A: We use a multi-phase approach:

  1. Database transactions ensure atomicity of metadata operations
  2. Each chunk upload is tracked individually with ETags from GCS
  3. We implement a completion verification step that validates all chunks before finalizing (see the sketch after this list)
  4. Failed uploads are automatically cleaned up using scheduled tasks
  5. We use optimistic locking on the chunked_uploads table to prevent race conditions
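
A hedged sketch of the verification step from point 3, intended to run inside completeChunkedUpload before the GCS completion call; the repository query method and the exception type are assumptions.

// Sketch: verify every chunk before finalizing the multipart upload
private void verifyAllChunksUploaded(ChunkedUploadEntity upload) {
    List<ChunkEntity> chunks = chunkRepository
            .findByChunkedUploadIdOrderByChunkNumber(upload.getId()); // assumed query method

    if (chunks.size() != upload.getTotalChunks()) {
        throw new IncompleteUploadException(
                "Expected " + upload.getTotalChunks() + " chunks, found " + chunks.size());
    }

    for (int i = 0; i < chunks.size(); i++) {
        ChunkEntity chunk = chunks.get(i);
        // Every chunk must be contiguous and carry the ETag returned by GCS
        if (chunk.getChunkNumber() != i + 1 || chunk.getEtag() == null) {
            throw new IncompleteUploadException(
                    "Chunk " + (i + 1) + " is missing or was never confirmed");
        }
    }
}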

Q: How do you handle failures during large file uploads?

A: Our resilience strategy includes:

  1. Retry mechanisms: Exponential backoff for transient failures
  2. Circuit breakers: Prevent cascading failures to GCS
  3. Cleanup jobs: Remove orphaned uploads after expiration (sketched after this list)
  4. Resume capability: Clients can query upload status and resume from the last successful chunk
  5. Monitoring: Real-time alerts for high failure rates
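
A possible shape for the cleanup job from point 3, reusing the bulk queries shown earlier (deleteFailedUploadsOlderThan, markExpiredFiles); the schedule, the retention window, and the derived delete query on the session repository are assumptions.

@Component
@Slf4j
public class FileCleanupJob {

    private final OptimizedFileRepository fileRepository;
    private final ChunkedUploadRepository chunkedUploadRepository;

    public FileCleanupJob(OptimizedFileRepository fileRepository,
                          ChunkedUploadRepository chunkedUploadRepository) {
        this.fileRepository = fileRepository;
        this.chunkedUploadRepository = chunkedUploadRepository;
    }

    // Assumed schedule: hourly, keeping failed uploads for 24 hours
    @Scheduled(cron = "0 0 * * * *")
    @Transactional
    public void cleanupOrphanedUploads() {
        Instant cutoff = Instant.now().minus(Duration.ofHours(24));

        int failedRemoved = fileRepository.deleteFailedUploadsOlderThan(cutoff);
        int expiredMarked = fileRepository.markExpiredFiles(Instant.now());
        int sessionsRemoved = chunkedUploadRepository.deleteByExpiresAtBefore(Instant.now()); // assumed derived query

        log.info("Cleanup: removed {} failed uploads, marked {} expired files, dropped {} stale sessions",
                failedRemoved, expiredMarked, sessionsRemoved);
    }
}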

Q: How do you scale the service horizontally?

A: The service is designed to be stateless and cloud-native:

  1. Stateless design: All upload state is stored in PostgreSQL/Redis
  2. Load balancing: Multiple service instances behind a load balancer
  3. Database connection pooling: Efficient resource utilization
  4. Caching: Redis for frequently accessed metadata
  5. Auto-scaling: Kubernetes HPA based on CPU/memory metrics
  6. CDN integration: CloudFront for global file delivery

Q: How do you handle different file types and validation?

A: We implement a comprehensive validation framework:

@Component
public class FileValidator {

private final Map<String, FileTypeValidator> validators;
private final VirusScanner virusScanner;

public ValidationResult validateFile(FileUploadRequest request, InputStream fileStream) {
ValidationResult result = ValidationResult.success();

// Basic validations
result.merge(validateFileSize(request.getFileSize()));
result.merge(validateMimeType(request.getMimeType()));
result.merge(validateFilename(request.getOriginalFilename()));

// Content-specific validation
FileTypeValidator validator = validators.get(request.getMimeType());
if (validator != null) {
result.merge(validator.validate(fileStream));
}

// Security scan
if (result.isValid()) {
result.merge(virusScanner.scan(fileStream));
}

return result;
}
}

Q: How do you implement file versioning?

A: File versioning can be implemented by extending the schema:

-- Add version tracking to files table
ALTER TABLE files ADD COLUMN version_number INTEGER DEFAULT 1;
ALTER TABLE files ADD COLUMN parent_file_id UUID REFERENCES files(id);
ALTER TABLE files ADD COLUMN is_latest_version BOOLEAN DEFAULT true;

-- Index for efficient version queries
CREATE INDEX idx_files_parent_version ON files(parent_file_id, version_number);

This allows tracking file history while maintaining backward compatibility with existing APIs.
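
One way to create a new version under that schema is to demote the current row and insert a successor inside a single transaction. The sketch below assumes setters on FileEntity matching the new columns; it is illustrative, not the guide's canonical implementation.

@Service
public class FileVersioningService {

    private final FileRepository fileRepository;

    public FileVersioningService(FileRepository fileRepository) {
        this.fileRepository = fileRepository;
    }

    @Transactional
    public FileEntity createNewVersion(UUID parentFileId, FileEntity newVersion) {
        FileEntity current = fileRepository.findById(parentFileId)
                .orElseThrow(() -> new FileNotFoundException("File not found"));

        // Demote the current row; the full history stays queryable via parent_file_id
        current.setIsLatestVersion(false);
        fileRepository.save(current);

        newVersion.setParentFileId(parentFileId);
        newVersion.setVersionNumber(current.getVersionNumber() + 1);
        newVersion.setIsLatestVersion(true);
        return fileRepository.save(newVersion);
    }
}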

Best Practices and Recommendations

Production Deployment Checklist

  • Security: Implement proper authentication, authorization, and input validation
  • Monitoring: Set up comprehensive logging, metrics, and alerting
  • Performance: Implement caching, connection pooling, and query optimization
  • Reliability: Use circuit breakers, retry mechanisms, and graceful degradation
  • Scalability: Design for horizontal scaling with stateless services
  • Data Protection: Implement backup strategies and disaster recovery
  • Compliance: Meet GDPR and other regulatory requirements for file metadata retention and deletion

External Resources

This comprehensive guide provides a production-ready foundation for building a scalable, secure, and maintainable file storage service. The modular design allows for easy extension and customization based on specific business requirements.

System Overview

The File Storage Service is a robust, scalable solution for handling file uploads, storage, and retrieval operations. Built on Google Cloud Storage (GCS) and PostgreSQL, it provides enterprise-grade file management capabilities with comprehensive SDK support for seamless integration.

Architecture Goals

  • Scalability: Handle millions of files with horizontal scaling
  • Reliability: 99.9% uptime with fault tolerance
  • Security: End-to-end encryption and access control
  • Performance: Sub-second response times for file operations
  • Cost Efficiency: Intelligent storage tiering and lifecycle management

High-Level Architecture


graph TB
subgraph "Client Layer"
    A[Web Client] 
    B[Mobile Client]
    C[Service Integration]
end

subgraph "API Gateway"
    D[Load Balancer]
    E[Authentication]
    F[Rate Limiting]
end

subgraph "File Storage Service"
    G[Upload Controller]
    H[Download Controller]
    I[Metadata Service]
    J[Chunk Manager]
end

subgraph "Storage Layer"
    K[PostgreSQL<br/>Metadata DB]
    L[Redis<br/>Cache Layer]
    M[Google Cloud Storage<br/>File Storage]
end

subgraph "External Services"
    N[Notification Service]
    O[Audit Service]
    P[Monitoring]
end

A --> D
B --> D
C --> D
D --> E
E --> F
F --> G
F --> H
G --> I
H --> I
G --> J
I --> K
I --> L
J --> M
G --> M
H --> M
I --> N
I --> O
G --> P
H --> P

Core Components Deep Dive

File Upload Service

The upload service handles both single-shot uploads and chunked uploads for large files, implementing resumable upload patterns for reliability.

@RestController
@RequestMapping("/api/v1/files")
public class FileUploadController {

private final FileUploadService fileUploadService;
private final ChunkManagerService chunkManagerService;
private final FileValidationService validationService;

private static final long MAX_SINGLE_UPLOAD_SIZE = 10 * 1024 * 1024; // 10MB direct-upload cutoff

@PostMapping("/upload")
public ResponseEntity<FileUploadResponse> uploadFile(
@RequestParam("file") MultipartFile file,
@RequestParam(value = "projectId", required = false) String projectId,
@RequestParam(value = "metadata", required = false) String metadata) {

// Validate file
validationService.validateFile(file);

// Determine upload strategy; the private initiateChunkedUpload/uploadDirectly
// helpers delegate to the injected services and are omitted here for brevity
if (file.getSize() > MAX_SINGLE_UPLOAD_SIZE) {
return initiateChunkedUpload(file, projectId, metadata);
} else {
return uploadDirectly(file, projectId, metadata);
}
}

@PostMapping("/upload/chunked/initiate")
public ResponseEntity<ChunkedUploadResponse> initiateChunkedUpload(
@RequestBody ChunkedUploadRequest request) {

String uploadId = UUID.randomUUID().toString();
ChunkedUploadSession session = chunkManagerService.createUploadSession(
uploadId, request.getFileName(), request.getTotalSize(),
request.getChunkSize(), request.getProjectId()
);

return ResponseEntity.ok(ChunkedUploadResponse.builder()
.uploadId(uploadId)
.totalChunks(session.getTotalChunks())
.chunkSize(session.getChunkSize())
.build());
}

@PutMapping("/upload/chunked/{uploadId}/chunk/{chunkNumber}")
public ResponseEntity<ChunkUploadResponse> uploadChunk(
@PathVariable String uploadId,
@PathVariable int chunkNumber,
@RequestParam("chunk") MultipartFile chunk) {

ChunkUploadResult result = chunkManagerService.uploadChunk(
uploadId, chunkNumber, chunk
);

if (result.isComplete()) {
// All chunks uploaded, finalize the file
FileMetadata fileMetadata = chunkManagerService.finalizeUpload(uploadId);
return ResponseEntity.ok(ChunkUploadResponse.builder()
.chunkNumber(chunkNumber)
.completed(true)
.fileUrl(fileMetadata.getDownloadUrl())
.fileId(fileMetadata.getFileId())
.build());
}

return ResponseEntity.ok(ChunkUploadResponse.builder()
.chunkNumber(chunkNumber)
.completed(false)
.build());
}
}

Chunk Management System

For files larger than 10MB, the system implements intelligent chunking with resumable uploads:

@Service
public class ChunkManagerService {

private final RedisTemplate<String, Object> redisTemplate;
private final GcsService gcsService;
private final FileMetadataRepository metadataRepository;

private static final int DEFAULT_CHUNK_SIZE = 5 * 1024 * 1024; // 5MB
private static final String CHUNK_SESSION_PREFIX = "chunk_session:";

public ChunkedUploadSession createUploadSession(String uploadId,
String fileName, long totalSize, int chunkSize, String projectId) {

int totalChunks = (int) Math.ceil((double) totalSize / chunkSize);

ChunkedUploadSession session = ChunkedUploadSession.builder()
.uploadId(uploadId)
.fileName(fileName)
.totalSize(totalSize)
.chunkSize(chunkSize)
.totalChunks(totalChunks)
.projectId(projectId)
.uploadedChunks(new BitSet(totalChunks))
.createdAt(Instant.now())
.expiresAt(Instant.now().plus(Duration.ofHours(24)))
.build();

// Store session in Redis with TTL
redisTemplate.opsForValue().set(
CHUNK_SESSION_PREFIX + uploadId,
session,
Duration.ofHours(24)
);

return session;
}

@Transactional
public ChunkUploadResult uploadChunk(String uploadId, int chunkNumber,
MultipartFile chunk) {

ChunkedUploadSession session = getUploadSession(uploadId);
if (session == null) {
throw new UploadSessionNotFoundException("Upload session not found: " + uploadId);
}

// Validate chunk
validateChunk(session, chunkNumber, chunk);

// Upload chunk to GCS with temporary naming
String tempChunkPath = generateTempChunkPath(uploadId, chunkNumber);
try {
gcsService.uploadChunk(chunk.getInputStream(), tempChunkPath);
} catch (IOException e) {
throw new FileStorageException("Failed to read chunk " + chunkNumber, e);
}

// Update session state
session.getUploadedChunks().set(chunkNumber - 1);
updateUploadSession(session);

// Check if all chunks are uploaded
boolean isComplete = session.getUploadedChunks().cardinality() == session.getTotalChunks();

return ChunkUploadResult.builder()
.chunkNumber(chunkNumber)
.complete(isComplete)
.build();
}

@Transactional
public FileMetadata finalizeUpload(String uploadId) {
ChunkedUploadSession session = getUploadSession(uploadId);

// Combine all chunks into final file
String finalFilePath = generateFinalFilePath(session);
List<String> chunkPaths = generateChunkPaths(uploadId, session.getTotalChunks());

gcsService.combineChunks(chunkPaths, finalFilePath);

// Create file metadata
FileMetadata metadata = FileMetadata.builder()
.fileId(UUID.randomUUID().toString())
.originalFileName(session.getFileName())
.gcsPath(finalFilePath)
.fileSize(session.getTotalSize())
.projectId(session.getProjectId())
.uploadedAt(Instant.now())
.downloadUrl(gcsService.generateSignedUrl(finalFilePath, Duration.ofDays(365)))
.build();

metadataRepository.save(metadata);

// Clean up chunks and session
cleanupChunks(chunkPaths);
deleteUploadSession(uploadId);

return metadata;
}
}
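
The service above relies on getUploadSession, updateUploadSession, and deleteUploadSession helpers that are not shown. A minimal sketch, assuming the same RedisTemplate and key prefix, could look like this (simplified; a production version would guard against a negative remaining TTL):

private ChunkedUploadSession getUploadSession(String uploadId) {
    return (ChunkedUploadSession) redisTemplate.opsForValue()
        .get(CHUNK_SESSION_PREFIX + uploadId);
}

private void updateUploadSession(ChunkedUploadSession session) {
    // Keep the session's original expiry window instead of resetting the TTL
    redisTemplate.opsForValue().set(
        CHUNK_SESSION_PREFIX + session.getUploadId(),
        session,
        Duration.between(Instant.now(), session.getExpiresAt())
    );
}

private void deleteUploadSession(String uploadId) {
    redisTemplate.delete(CHUNK_SESSION_PREFIX + uploadId);
}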

Database Schema Design

The PostgreSQL schema is optimized for query performance and supports file versioning:

-- Files table for storing file metadata
CREATE TABLE files (
file_id VARCHAR(36) PRIMARY KEY,
original_filename VARCHAR(255) NOT NULL,
content_type VARCHAR(100),
file_size BIGINT NOT NULL,
gcs_bucket VARCHAR(100) NOT NULL,
gcs_path VARCHAR(500) NOT NULL,
project_id VARCHAR(36),
upload_session_id VARCHAR(36),
checksum_md5 VARCHAR(32),
checksum_sha256 VARCHAR(64),
encryption_key_id VARCHAR(36),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
deleted_at TIMESTAMP WITH TIME ZONE,
created_by VARCHAR(36),
tags JSONB,
metadata JSONB
);

-- File versions for supporting file versioning
CREATE TABLE file_versions (
version_id VARCHAR(36) PRIMARY KEY,
file_id VARCHAR(36) REFERENCES files(file_id),
version_number INTEGER NOT NULL,
gcs_path VARCHAR(500) NOT NULL,
file_size BIGINT NOT NULL,
checksum_md5 VARCHAR(32),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
created_by VARCHAR(36),
change_description TEXT
);

-- Upload sessions for chunked uploads
CREATE TABLE upload_sessions (
session_id VARCHAR(36) PRIMARY KEY,
original_filename VARCHAR(255) NOT NULL,
total_size BIGINT NOT NULL,
chunk_size INTEGER NOT NULL,
total_chunks INTEGER NOT NULL,
uploaded_chunks_bitmap TEXT,
project_id VARCHAR(36),
status VARCHAR(20) DEFAULT 'IN_PROGRESS',
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
expires_at TIMESTAMP WITH TIME ZONE NOT NULL,
created_by VARCHAR(36)
);

-- Indexes for performance optimization
CREATE INDEX idx_files_project_id ON files(project_id);
CREATE INDEX idx_files_created_at ON files(created_at);
CREATE INDEX idx_files_deleted_at ON files(deleted_at) WHERE deleted_at IS NULL;
CREATE INDEX idx_files_tags ON files USING GIN(tags);
CREATE INDEX idx_file_versions_file_id ON file_versions(file_id);
CREATE INDEX idx_upload_sessions_expires_at ON upload_sessions(expires_at);

-- Partitioning for large datasets (optional; requires the parent table to be
-- created with PARTITION BY RANGE (created_at))
CREATE TABLE files_2024 PARTITION OF files
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
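
For reference, a minimal JPA entity mapping for the files table above might look like the sketch below; the column list is abridged and the class name is an assumption, not part of the schema itself.

import jakarta.persistence.*;
import java.time.OffsetDateTime;

@Entity
@Table(name = "files")
public class FileEntity {

    @Id
    @Column(name = "file_id", length = 36)
    private String fileId;

    @Column(name = "original_filename", nullable = false)
    private String originalFilename;

    @Column(name = "file_size", nullable = false)
    private long fileSize;

    @Column(name = "gcs_bucket", nullable = false)
    private String gcsBucket;

    @Column(name = "gcs_path", nullable = false)
    private String gcsPath;

    @Column(name = "project_id")
    private String projectId;

    @Column(name = "deleted_at")
    private OffsetDateTime deletedAt; // soft-delete marker; NULL means the file is active

    // getters/setters omitted for brevity
}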

GCS Integration Service

The GCS service handles actual file storage with intelligent lifecycle management:

@Service
public class GcsService {

private final Storage storage;
private final String bucketName;
private final String cdnDomain;

private static final Duration DEFAULT_SIGNED_URL_DURATION = Duration.ofHours(1);

public String uploadFile(InputStream inputStream, String fileName,
String contentType, Map<String, String> metadata) {

String gcsPath = generateGcsPath(fileName);

BlobId blobId = BlobId.of(bucketName, gcsPath);
BlobInfo.Builder blobInfoBuilder = BlobInfo.newBuilder(blobId)
.setContentType(contentType)
.setCacheControl("public, max-age=31536000"); // 1 year cache

// Add custom metadata
if (metadata != null) {
blobInfoBuilder.setMetadata(metadata);
}

// Set storage class based on file characteristics
StorageClass storageClass = determineStorageClass(fileName, metadata);
blobInfoBuilder.setStorageClass(storageClass);

BlobInfo blobInfo = blobInfoBuilder.build();

try {
Blob blob = storage.create(blobInfo, inputStream);
return blob.getName();
} catch (Exception e) {
throw new FileStorageException("Failed to upload file to GCS", e);
}
}

public String generateSignedUrl(String gcsPath, Duration duration) {
BlobId blobId = BlobId.of(bucketName, gcsPath);

URL signedUrl = storage.signUrl(
BlobInfo.newBuilder(blobId).build(),
duration.toMillis(),
TimeUnit.MILLISECONDS,
Storage.SignUrlOption.httpMethod(HttpMethod.GET)
);

return signedUrl.toString();
}

public void combineChunks(List<String> chunkPaths, String finalPath) {
BlobId finalBlobId = BlobId.of(bucketName, finalPath);

// Use GCS compose operation for efficient chunk combination
Storage.ComposeRequest.Builder composeBuilder = Storage.ComposeRequest.newBuilder()
.setTarget(BlobInfo.newBuilder(finalBlobId).build());

for (String chunkPath : chunkPaths) {
composeBuilder.addSource(chunkPath); // compose sources are object names within the target bucket
}

storage.compose(composeBuilder.build());
}

private StorageClass determineStorageClass(String fileName, Map<String, String> metadata) {
// Intelligent storage class selection based on file type and metadata
String fileExtension = getFileExtension(fileName);

// Archive files go to coldline storage
if (isArchiveFile(fileExtension)) {
return StorageClass.COLDLINE;
}

// Frequently accessed files use standard storage
return StorageClass.STANDARD;
}

// Lifecycle management
@Scheduled(cron = "0 0 2 * * *") // Daily at 2 AM (Spring cron uses six fields, starting with seconds)
public void optimizeStorageClasses() {
// Move infrequently accessed files to cheaper storage classes
// This could be based on access patterns, file age, etc.
}
}
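
One caveat for the compose-based chunk merge above: GCS compose accepts at most 32 source objects per request, so uploads with more than 32 chunks need the composition done in stages. A hedged sketch of one way to do that with the same Storage client is shown below; the batching method and the intermediate-object naming are assumptions, not part of the original service.

private static final int MAX_COMPOSE_SOURCES = 32; // GCS limit per compose request

public void combineManyChunks(List<String> chunkPaths, String finalPath) {
    List<String> sources = new ArrayList<>(chunkPaths);

    // Fold the chunk list down in batches of 32 until a single final compose remains
    while (sources.size() > MAX_COMPOSE_SOURCES) {
        List<String> intermediates = new ArrayList<>();
        for (int i = 0; i < sources.size(); i += MAX_COMPOSE_SOURCES) {
            List<String> batch = sources.subList(i, Math.min(i + MAX_COMPOSE_SOURCES, sources.size()));
            String intermediatePath = finalPath + ".part-" + UUID.randomUUID();
            composeObjects(batch, intermediatePath);
            intermediates.add(intermediatePath);
        }
        sources = intermediates; // intermediate objects should be deleted after the final compose
    }
    composeObjects(sources, finalPath);
}

private void composeObjects(List<String> sourceNames, String targetPath) {
    Storage.ComposeRequest.Builder builder = Storage.ComposeRequest.newBuilder()
        .setTarget(BlobInfo.newBuilder(BlobId.of(bucketName, targetPath)).build());
    sourceNames.forEach(builder::addSource); // sources must live in the target's bucket
    storage.compose(builder.build());
}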

File Storage SDK Design

The SDK provides a clean, intuitive interface for client applications:

public class FileStorageClient {

private final String baseUrl;
private final String apiKey;
private final OkHttpClient httpClient;
private final ObjectMapper objectMapper;

public FileStorageClient(String baseUrl, String apiKey) {
this.baseUrl = baseUrl;
this.apiKey = apiKey;
this.httpClient = createHttpClient();
this.objectMapper = new ObjectMapper();
}

/**
* Upload a single file
*/
public FileUploadResult uploadFile(File file, UploadOptions options) {
if (file.length() > options.getChunkThreshold()) {
return uploadFileChunked(file, options, null); // no progress callback for the simple variant
} else {
return uploadFileDirect(file, options);
}
}

/**
* Upload file with progress callback
*/
public FileUploadResult uploadFile(File file, UploadOptions options,
ProgressCallback progressCallback) {

return uploadFileChunked(file, options, progressCallback);
}

/**
* Download file by ID
*/
public InputStream downloadFile(String fileId) throws IOException {
FileMetadata metadata = getFileMetadata(fileId);
return downloadFileByUrl(metadata.getDownloadUrl());
}

/**
* Get file metadata
*/
public FileMetadata getFileMetadata(String fileId) throws IOException {
Request request = new Request.Builder()
.url(baseUrl + "/api/v1/files/" + fileId + "/metadata")
.addHeader("Authorization", "Bearer " + apiKey)
.build();

try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful()) {
throw new IOException("Failed to get file metadata: " + response.code());
}

return objectMapper.readValue(response.body().string(), FileMetadata.class);
}
}

/**
* List files with pagination and filtering
*/
public PagedResult<FileMetadata> listFiles(FileQuery query) throws IOException {
HttpUrl.Builder urlBuilder = HttpUrl.parse(baseUrl + "/api/v1/files").newBuilder();

if (query.getProjectId() != null) {
urlBuilder.addQueryParameter("projectId", query.getProjectId());
}
if (query.getFileType() != null) {
urlBuilder.addQueryParameter("fileType", query.getFileType());
}
urlBuilder.addQueryParameter("page", String.valueOf(query.getPage()));
urlBuilder.addQueryParameter("size", String.valueOf(query.getSize()));

Request request = new Request.Builder()
.url(urlBuilder.build())
.addHeader("Authorization", "Bearer " + apiKey)
.build();

try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful()) {
throw new IOException("Failed to list files: " + response.code());
}

TypeReference<PagedResult<FileMetadata>> typeRef =
new TypeReference<PagedResult<FileMetadata>>() {};
return objectMapper.readValue(response.body().string(), typeRef);
}
}

/**
* Delete file (soft delete)
*/
public void deleteFile(String fileId) throws IOException {
Request request = new Request.Builder()
.url(baseUrl + "/api/v1/files/" + fileId)
.delete()
.addHeader("Authorization", "Bearer " + apiKey)
.build();

try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful()) {
throw new IOException("Failed to delete file: " + response.code());
}
}
}

private FileUploadResult uploadFileChunked(File file, UploadOptions options,
ProgressCallback progressCallback) {

try {
// Initiate chunked upload
ChunkedUploadResponse initResponse = initiateChunkedUpload(file, options);

long uploadedBytes = 0;
int chunkSize = initResponse.getChunkSize();

try (FileInputStream fis = new FileInputStream(file)) {
for (int chunkNumber = 1; chunkNumber <= initResponse.getTotalChunks(); chunkNumber++) {
byte[] buffer = new byte[Math.min(chunkSize, (int) (file.length() - uploadedBytes))];
int bytesRead = fis.readNBytes(buffer, 0, buffer.length); // fill the whole chunk (Java 9+)

if (bytesRead > 0) {
uploadChunk(initResponse.getUploadId(), chunkNumber,
Arrays.copyOf(buffer, bytesRead));
uploadedBytes += bytesRead;

if (progressCallback != null) {
progressCallback.onProgress(uploadedBytes, file.length());
}
}
}
}

// All chunks uploaded successfully
return FileUploadResult.builder()
.fileId(initResponse.getFinalFileId())
.downloadUrl(initResponse.getFinalDownloadUrl())
.build();

} catch (Exception e) {
throw new FileUploadException("Failed to upload file", e);
}
}
}

Security Implementation

Authentication and Authorization

@Component
public class FileAccessSecurityService {

private final JwtTokenProvider tokenProvider;
private final UserService userService;
private final ProjectService projectService;

public boolean canAccessFile(String userId, String fileId, FileOperation operation) {
FileMetadata fileMetadata = getFileMetadata(fileId);
User user = userService.getUser(userId);

// Check project-level permissions
if (fileMetadata.getProjectId() != null) {
ProjectPermission permission = projectService.getUserPermission(
userId, fileMetadata.getProjectId()
);

return hasRequiredPermission(permission, operation);
}

// Check file-level permissions
return fileMetadata.getCreatedBy().equals(userId) || user.isAdmin();
}

@PreAuthorize("@fileAccessSecurityService.canAccessFile(authentication.name, #fileId, 'READ')")
public FileMetadata getFile(String fileId) {
return fileMetadataRepository.findById(fileId)
.orElseThrow(() -> new FileNotFoundException("File not found: " + fileId));
}

private boolean hasRequiredPermission(ProjectPermission permission, FileOperation operation) {
switch (operation) {
case READ:
return permission.canRead();
case WRITE:
return permission.canWrite();
case DELETE:
return permission.canDelete();
default:
return false;
}
}
}

File Encryption

@Service
public class FileEncryptionService {

private final GoogleKmsService kmsService;
private final String keyRingName;

public EncryptedUploadStream encryptFile(InputStream inputStream, String fileId) {
try {
// Generate data encryption key (DEK)
byte[] dek = generateDataEncryptionKey();

// Encrypt DEK with KEK (Key Encryption Key) using Google KMS
String encryptedDek = kmsService.encrypt(keyRingName, dek);

// Create cipher for file encryption (GCM needs a unique IV per file, which must be persisted)
byte[] iv = new byte[12];
new SecureRandom().nextBytes(iv);
Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
SecretKeySpec keySpec = new SecretKeySpec(dek, "AES");
cipher.init(Cipher.ENCRYPT_MODE, keySpec, new GCMParameterSpec(128, iv));

// Store encryption metadata, including the IV needed for decryption
FileEncryptionMetadata encryptionMetadata = FileEncryptionMetadata.builder()
.fileId(fileId)
.encryptedDek(encryptedDek)
.iv(Base64.getEncoder().encodeToString(iv))
.algorithm("AES-256-GCM")
.keyVersion(kmsService.getCurrentKeyVersion())
.build();

return new EncryptedUploadStream(
new CipherInputStream(inputStream, cipher),
encryptionMetadata
);

} catch (Exception e) {
throw new EncryptionException("Failed to encrypt file", e);
}
}

public InputStream decryptFile(InputStream encryptedStream, String fileId) {
try {
FileEncryptionMetadata metadata = getEncryptionMetadata(fileId);

// Decrypt DEK using KMS
byte[] dek = kmsService.decrypt(metadata.getEncryptedDek());

// Create cipher for decryption, re-using the IV stored at encryption time
byte[] iv = Base64.getDecoder().decode(metadata.getIv());
Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
SecretKeySpec keySpec = new SecretKeySpec(dek, "AES");
cipher.init(Cipher.DECRYPT_MODE, keySpec, new GCMParameterSpec(128, iv));

return new CipherInputStream(encryptedStream, cipher);

} catch (Exception e) {
throw new DecryptionException("Failed to decrypt file", e);
}
}
}

Performance Optimization Strategies

Caching Layer Implementation

@Service
public class FileCacheService {

private final RedisTemplate<String, Object> redisTemplate;
private final GcsService gcsService;
private final FileMetadataRepository fileMetadataRepository;

private static final String FILE_METADATA_CACHE_PREFIX = "file_metadata:";
private static final String FILE_CONTENT_CACHE_PREFIX = "file_content:";
private static final Duration METADATA_TTL = Duration.ofHours(1);
private static final Duration SMALL_FILE_TTL = Duration.ofMinutes(30);

@Cacheable(value = "file_metadata", key = "#fileId")
public FileMetadata getFileMetadata(String fileId) {
String cacheKey = FILE_METADATA_CACHE_PREFIX + fileId;

FileMetadata cached = (FileMetadata) redisTemplate.opsForValue().get(cacheKey);
if (cached != null) {
return cached;
}

FileMetadata metadata = fileMetadataRepository.findById(fileId)
.orElseThrow(() -> new FileNotFoundException("File not found: " + fileId));

// Cache metadata
redisTemplate.opsForValue().set(cacheKey, metadata, METADATA_TTL);

return metadata;
}

public InputStream getFileContent(String fileId) {
FileMetadata metadata = getFileMetadata(fileId);

// Cache small files (< 1MB) in Redis
if (metadata.getFileSize() < 1024 * 1024) {
String cacheKey = FILE_CONTENT_CACHE_PREFIX + fileId;
byte[] cachedContent = (byte[]) redisTemplate.opsForValue().get(cacheKey);

if (cachedContent != null) {
return new ByteArrayInputStream(cachedContent);
}

// Load from GCS and cache
byte[] content = gcsService.downloadFileAsBytes(metadata.getGcsPath());
redisTemplate.opsForValue().set(cacheKey, content, SMALL_FILE_TTL);

return new ByteArrayInputStream(content);
}

// Large files: direct stream from GCS
return gcsService.downloadFile(metadata.getGcsPath());
}
}

Database Query Optimization

@Repository
public class FileMetadataRepository {

@PersistenceContext
private EntityManager entityManager;

/**
* Optimized query for file listing with filtering and pagination
*/
public Page<FileMetadata> findFilesWithFilters(FileSearchCriteria criteria,
Pageable pageable) {

CriteriaBuilder cb = entityManager.getCriteriaBuilder();
CriteriaQuery<FileMetadata> query = cb.createQuery(FileMetadata.class);
Root<FileMetadata> root = query.from(FileMetadata.class);

List<Predicate> predicates = new ArrayList<>();

// Active files only (not soft deleted)
predicates.add(cb.isNull(root.get("deletedAt")));

// Project filter
if (criteria.getProjectId() != null) {
predicates.add(cb.equal(root.get("projectId"), criteria.getProjectId()));
}

// Content type filter
if (criteria.getContentType() != null) {
predicates.add(cb.like(root.get("contentType"), criteria.getContentType() + "%"));
}

// File size range filter
if (criteria.getMinSize() != null) {
predicates.add(cb.greaterThanOrEqualTo(root.get("fileSize"), criteria.getMinSize()));
}
if (criteria.getMaxSize() != null) {
predicates.add(cb.lessThanOrEqualTo(root.get("fileSize"), criteria.getMaxSize()));
}

// Date range filter
if (criteria.getCreatedAfter() != null) {
predicates.add(cb.greaterThanOrEqualTo(root.get("createdAt"), criteria.getCreatedAfter()));
}
if (criteria.getCreatedBefore() != null) {
predicates.add(cb.lessThanOrEqualTo(root.get("createdAt"), criteria.getCreatedBefore()));
}

// Tags filter (using JSONB operations)
if (criteria.getTags() != null && !criteria.getTags().isEmpty()) {
for (String tag : criteria.getTags()) {
predicates.add(cb.isTrue(
cb.function("jsonb_exists", Boolean.class, root.get("tags"), cb.literal(tag))
));
}
}

query.where(predicates.toArray(new Predicate[0]));

// Default ordering by creation date (newest first)
query.orderBy(cb.desc(root.get("createdAt")));

TypedQuery<FileMetadata> typedQuery = entityManager.createQuery(query);
typedQuery.setFirstResult((int) pageable.getOffset());
typedQuery.setMaxResults(pageable.getPageSize());

List<FileMetadata> results = typedQuery.getResultList();

// Count query for pagination
CriteriaQuery<Long> countQuery = cb.createQuery(Long.class);
Root<FileMetadata> countRoot = countQuery.from(FileMetadata.class);
countQuery.select(cb.count(countRoot));
countQuery.where(predicates.toArray(new Predicate[0]));

Long totalCount = entityManager.createQuery(countQuery).getSingleResult();

return new PageImpl<>(results, pageable, totalCount);
}
}

Monitoring and Observability

Comprehensive Monitoring Setup

@Component
public class FileStorageMetrics {

private final MeterRegistry meterRegistry;
private final Timer uploadTimer;
private final Timer downloadTimer;
private final Counter uploadSuccessCounter;
private final Counter uploadFailureCounter;
private final Gauge activeUploadsGauge;

public FileStorageMetrics(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;

this.uploadTimer = Timer.builder("file_upload_duration")
.description("Time taken to upload files")
.tag("service", "file-storage")
.register(meterRegistry);

this.downloadTimer = Timer.builder("file_download_duration")
.description("Time taken to download files")
.tag("service", "file-storage")
.register(meterRegistry);

this.uploadSuccessCounter = Counter.builder("file_upload_success_total")
.description("Total number of successful file uploads")
.tag("service", "file-storage")
.register(meterRegistry);

this.uploadFailureCounter = Counter.builder("file_upload_failure_total")
.description("Total number of failed file uploads")
.tag("service", "file-storage")
.register(meterRegistry);

this.activeUploadsGauge = Gauge.builder("file_upload_active_count", this, FileStorageMetrics::getActiveUploadsCount)
.description("Number of currently active uploads")
.tag("service", "file-storage")
.register(meterRegistry);
}

public void recordUploadSuccess(String fileType, long fileSizeBytes, Duration duration) {
// Counter.increment() takes no tags; the per-type breakdown is captured by the
// size distribution below (or by registering counters per tag combination)
uploadSuccessCounter.increment();

uploadTimer.record(duration);

// Record file size distribution
DistributionSummary.builder("file_upload_size_bytes")
.tag("file_type", fileType)
.register(meterRegistry)
.record(fileSizeBytes);
}

public void recordUploadFailure(String fileType, String errorType, Duration duration) {
uploadFailureCounter.increment();
}

private String categorizeFileSize(long sizeBytes) {
if (sizeBytes < 1024 * 1024) return "small"; // < 1MB
if (sizeBytes < 10 * 1024 * 1024) return "medium"; // < 10MB
if (sizeBytes < 100 * 1024 * 1024) return "large"; // < 100MB
return "xlarge"; // >= 100MB
}

private double getActiveUploadsCount() {
// Implementation to get current active upload count
return uploadSessionService.getActiveUploadCount();
}
}

// Health Check Implementation
@Component
public class FileStorageHealthIndicator implements HealthIndicator {

private final GcsService gcsService;
private final DataSource dataSource;
private final RedisTemplate<String, Object> redisTemplate;

@Override
public Health health() {
Health.Builder healthBuilder = Health.up();

// Check PostgreSQL connectivity
try (Connection connection = dataSource.getConnection()) {
if (!connection.isValid(5)) {
return Health.down()
.withDetail("database", "Connection validation failed")
.build();
}
healthBuilder.withDetail("database", "UP");
} catch (Exception e) {
return Health.down()
.withDetail("database", "Connection failed: " + e.getMessage())
.build();
}

// Check Redis connectivity
try {
redisTemplate.execute((RedisCallback<String>) connection -> {
connection.ping();
return "PONG";
});
healthBuilder.withDetail("redis", "UP");
} catch (Exception e) {
healthBuilder.withDetail("redis", "DOWN: " + e.getMessage());
}

// Check GCS connectivity
try {
boolean gcsHealthy = gcsService.checkHealth();
healthBuilder.withDetail("gcs", gcsHealthy ? "UP" : "DOWN");
if (!gcsHealthy) {
return healthBuilder.down().build();
}
} catch (Exception e) {
return Health.down()
.withDetail("gcs", "Health check failed: " + e.getMessage())
.build();
}

return healthBuilder.build();
}
}

Error Handling and Resilience

Comprehensive Error Handling Strategy

@ControllerAdvice
public class FileStorageExceptionHandler {

private static final Logger logger = LoggerFactory.getLogger(FileStorageExceptionHandler.class);

@ExceptionHandler(FileNotFoundException.class)
@ResponseStatus(HttpStatus.NOT_FOUND)
public ErrorResponse handleFileNotFound(FileNotFoundException e) {
logger.warn("File not found: {}", e.getMessage());
return ErrorResponse.builder()
.code("FILE_NOT_FOUND")
.message("The requested file was not found")
.timestamp(Instant.now())
.build();
}

@ExceptionHandler(FileSizeExceededException.class)
@ResponseStatus(HttpStatus.PAYLOAD_TOO_LARGE)
public ErrorResponse handleFileSizeExceeded(FileSizeExceededException e) {
logger.warn("File size exceeded: {}", e.getMessage());
return ErrorResponse.builder()
.code("FILE_SIZE_EXCEEDED")
.message("File size exceeds the maximum allowed limit")
.details(Map.of("maxSize", e.getMaxAllowedSize(), "actualSize", e.getActualSize()))
.timestamp(Instant.now())
.build();
}

@ExceptionHandler(UnsupportedFileTypeException.class)
@ResponseStatus(HttpStatus.UNSUPPORTED_MEDIA_TYPE)
public ErrorResponse handleUnsupportedFileType(UnsupportedFileTypeException e) {
logger.warn("Unsupported file type: {}", e.getMessage());
return ErrorResponse.builder()
.code("UNSUPPORTED_FILE_TYPE")
.message("The file type is not supported")
.details(Map.of("supportedTypes", e.getSupportedTypes()))
.timestamp(Instant.now())
.build();
}

@ExceptionHandler(StorageException.class)
@ResponseStatus(HttpStatus.INTERNAL_SERVER_ERROR)
public ErrorResponse handleStorageException(StorageException e) {
logger.error("Storage operation failed", e);
return ErrorResponse.builder()
.code("STORAGE_ERROR")
.message("An error occurred while processing the file")
.timestamp(Instant.now())
.build();
}

@ExceptionHandler(RateLimitExceededException.class)
@ResponseStatus(HttpStatus.TOO_MANY_REQUESTS)
public ErrorResponse handleRateLimit(RateLimitExceededException e) {
logger.warn("Rate limit exceeded for user: {}", e.getUserId());
return ErrorResponse.builder()
.code("RATE_LIMIT_EXCEEDED")
.message("Upload rate limit exceeded")
.details(Map.of("retryAfter", e.getRetryAfterSeconds()))
.timestamp(Instant.now())
.build();
}
}

// Retry Mechanism for GCS Operations
@Component
public class ResilientGcsService {

private static final Logger logger = LoggerFactory.getLogger(ResilientGcsService.class);

private final Storage storage;
private final RetryTemplate retryTemplate;
private final CircuitBreaker circuitBreaker;

public ResilientGcsService(Storage storage) {
this.storage = storage;
this.retryTemplate = createRetryTemplate();
this.circuitBreaker = createCircuitBreaker();
}

// Programmatic retries wrapped in a circuit breaker; stacking @Retryable on top
// would multiply the attempts, so a single retry mechanism is used here
public String uploadWithRetry(InputStream inputStream, String path, String contentType) {
return circuitBreaker.executeSupplier(() ->
retryTemplate.execute(context -> {
logger.info("Attempting upload, attempt: {}", context.getRetryCount() + 1);
return doUpload(inputStream, path, contentType);
})
);
}

private RetryTemplate createRetryTemplate() {
RetryTemplate template = new RetryTemplate();

FixedBackOffPolicy backOffPolicy = new FixedBackOffPolicy();
backOffPolicy.setBackOffPeriod(2000); // 2 seconds
template.setBackOffPolicy(backOffPolicy);

SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy();
retryPolicy.setMaxAttempts(3);
template.setRetryPolicy(retryPolicy);

return template;
}

private CircuitBreaker createCircuitBreaker() {
return CircuitBreaker.ofDefaults("gcs-upload");
}
}
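
CircuitBreaker.ofDefaults("gcs-upload") uses the Resilience4j defaults. In practice the breaker is usually tuned explicitly; the sketch below is a tuned variant of the createCircuitBreaker() factory above, and the specific thresholds are illustrative values rather than part of this design.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

private CircuitBreaker createCircuitBreaker() {
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                        // open the breaker at a 50% failure rate
        .slidingWindowSize(20)                           // measured over the last 20 calls
        .waitDurationInOpenState(Duration.ofSeconds(30)) // probe GCS again after 30 seconds
        .permittedNumberOfCallsInHalfOpenState(5)        // trial calls before closing again
        .build();
    return CircuitBreaker.of("gcs-upload", config);
}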

Production Deployment Architecture

Kubernetes Deployment Configuration

# file-storage-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-storage-service
  labels:
    app: file-storage-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: file-storage-service
  template:
    metadata:
      labels:
        app: file-storage-service
    spec:
      containers:
      - name: file-storage-service
        image: file-storage-service:latest
        ports:
        - containerPort: 8080
        env:
        - name: SPRING_PROFILES_ACTIVE
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: file-storage-secrets
              key: database-url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: file-storage-secrets
              key: redis-url
        - name: GCS_BUCKET_NAME
          valueFrom:
            configMapKeyRef:
              name: file-storage-config
              key: gcs-bucket-name
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /actuator/health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        volumeMounts:
        - name: gcs-key
          mountPath: "/etc/gcs"
          readOnly: true
      volumes:
      - name: gcs-key
        secret:
          secretName: gcs-service-account-key

---
apiVersion: v1
kind: Service
metadata:
  name: file-storage-service
spec:
  selector:
    app: file-storage-service
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: file-storage-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
spec:
  tls:
  - hosts:
    - api.fileservice.com
    secretName: file-storage-tls
  rules:
  - host: api.fileservice.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: file-storage-service
            port:
              number: 80

Infrastructure as Code (Terraform)

# GCS Bucket Configuration
resource "google_storage_bucket" "file_storage_bucket" {
name = "company-file-storage-${var.environment}"
location = "US"
force_destroy = false

versioning {
enabled = true
}

lifecycle_rule {
condition {
age = 30
}
action {
type = "SetStorageClass"
storage_class = "NEARLINE"
}
}

lifecycle_rule {
condition {
age = 90
}
action {
type = "SetStorageClass"
storage_class = "COLDLINE"
}
}

lifecycle_rule {
condition {
age = 365
}
action {
type = "SetStorageClass"
storage_class = "ARCHIVE"
}
}

cors {
origin = ["https://app.company.com"]
method = ["GET", "HEAD", "PUT", "POST", "DELETE"]
response_header = ["*"]
max_age_seconds = 3600
}
}

# Cloud SQL PostgreSQL Instance
resource "google_sql_database_instance" "file_storage_db" {
name = "file-storage-db-${var.environment}"
database_version = "POSTGRES_13"
region = "us-central1"

settings {
tier = "db-custom-2-4096"

backup_configuration {
enabled = true
start_time = "03:00"
location = "us"
point_in_time_recovery_enabled = true
backup_retention_settings {
retained_backups = 30
}
}

ip_configuration {
ipv4_enabled = false
private_network = google_compute_network.vpc.id
}

database_flags {
name = "max_connections"
value = "200"
}

database_flags {
name = "shared_preload_libraries"
value = "pg_stat_statements"
}
}
}

# Redis Instance for Caching
resource "google_redis_instance" "file_storage_cache" {
name = "file-storage-cache-${var.environment}"
tier = "STANDARD_HA"
memory_size_gb = 4
region = "us-central1"

redis_version = "REDIS_6_X"
display_name = "File Storage Cache"
authorized_network = google_compute_network.vpc.id
}

Performance Benchmarks and SLAs

Service Level Objectives

# SLO Configuration
slo_targets:
  availability:
    target: 99.9%
    measurement_window: 30_days

  latency:
    upload_p95: 5s       # 95% of uploads complete within 5 seconds
    upload_p99: 15s      # 99% of uploads complete within 15 seconds
    download_p95: 500ms  # 95% of downloads start within 500ms
    download_p99: 2s     # 99% of downloads start within 2 seconds

  throughput:
    max_concurrent_uploads: 1000
    max_upload_rate: 10GB/s
    max_download_rate: 50GB/s

  error_rate:
    target: 0.1%         # Less than 0.1% error rate
    measurement_window: 24_hours

Load Testing Strategy

@Component
public class FileStorageLoadTester {

private final FileStorageClient client;
private final ExecutorService executorService;

public LoadTestResult runLoadTest(LoadTestConfig config) {
List<CompletableFuture<UploadResult>> futures = new ArrayList<>();

for (int i = 0; i < config.getConcurrentUsers(); i++) {
CompletableFuture<UploadResult> future = CompletableFuture.supplyAsync(() -> {
return simulateUserUpload(config);
}, executorService);

futures.add(future);
}

// Collect results
List<UploadResult> results = futures.stream()
.map(CompletableFuture::join)
.collect(Collectors.toList());

return analyzeResults(results);
}

private UploadResult simulateUserUpload(LoadTestConfig config) {
long startTime = System.currentTimeMillis();

try {
// Generate test file
byte[] testData = generateTestFile(config.getFileSize());

// Upload file
FileUploadResult result = client.uploadFile(
new ByteArrayInputStream(testData),
"test-file-" + UUID.randomUUID().toString(),
"application/octet-stream"
);

long duration = System.currentTimeMillis() - startTime;

return UploadResult.builder()
.success(true)
.duration(duration)
.fileSize(testData.length)
.fileId(result.getFileId())
.build();

} catch (Exception e) {
long duration = System.currentTimeMillis() - startTime;

return UploadResult.builder()
.success(false)
.duration(duration)
.error(e.getMessage())
.build();
}
}
}

Interview Questions and Insights

System Design Questions

Q: How would you handle a scenario where files need to be replicated across multiple regions for disaster recovery?

A: I would implement a multi-region replication strategy:

  1. Primary-Secondary Pattern: Use GCS multi-region buckets for automatic replication
  2. Database Replication: Set up PostgreSQL read replicas in different regions
  3. Metadata Consistency: Implement eventual consistency with conflict resolution
  4. Failover Logic: Automatic failover with health checks and circuit breakers
@Service
public class MultiRegionReplicationService {

private final Map<String, GcsService> regionalGcsServices;
private final CircuitBreaker circuitBreaker;

public FileUploadResult uploadWithReplication(MultipartFile file, String primaryRegion) {
// Upload to primary region first
String primaryPath = uploadToPrimary(file, primaryRegion);

// Async replication to secondary regions
CompletableFuture.runAsync(() -> replicateToSecondaryRegions(file, primaryPath, primaryRegion));

return buildUploadResult(primaryPath);
}

private void replicateToSecondaryRegions(MultipartFile file, String primaryPath, String primaryRegion) {
regionalGcsServices.entrySet().parallelStream()
.filter(entry -> !entry.getKey().equals(primaryRegion))
.forEach(entry -> {
try {
circuitBreaker.executeSupplier(() ->
entry.getValue().replicateFile(primaryPath, file)
);
} catch (Exception e) {
logger.error("Failed to replicate to region: {}", entry.getKey(), e);
// Schedule retry or alert operations team
}
});
}
}

Q: How would you optimize the system for handling millions of small files vs. thousands of large files?

A: Different optimization strategies are needed:

For Small Files (< 1MB):

  • Batch multiple small files into larger objects
  • Use aggressive caching in Redis
  • Implement file bundling/archiving
  • Use CDN for frequently accessed files

For Large Files (> 100MB):

  • Mandatory chunked uploads with resumability
  • Implement progressive download with range requests
  • Use streaming processing to avoid memory issues
  • Implement intelligent storage class selection
@Service
public class FileOptimizationService {

private static final long SMALL_FILE_THRESHOLD = 1024 * 1024;        // 1 MB, per the guidance above
private static final long LARGE_FILE_THRESHOLD = 100L * 1024 * 1024; // 100 MB

public StorageStrategy determineOptimalStrategy(FileMetadata file) {
if (file.getFileSize() < SMALL_FILE_THRESHOLD) {
return StorageStrategy.builder()
.cacheInRedis(true)
.bundleWithOthers(shouldBundle(file))
.storageClass(StorageClass.STANDARD)
.cdnEnabled(true)
.build();
} else if (file.getFileSize() > LARGE_FILE_THRESHOLD) {
return StorageStrategy.builder()
.chunkedUpload(true)
.resumableUpload(true)
.storageClass(determineStorageClass(file))
.compressionEnabled(shouldCompress(file))
.build();
}

return StorageStrategy.defaultStrategy();
}
}

Q: How would you implement file deduplication to save storage costs?

A: Implement content-based deduplication using cryptographic hashing:

@Service
public class FileDeduplicationService {

private final FileHashRepository hashRepository;
private final GcsService gcsService;

@Transactional
public FileUploadResult uploadWithDeduplication(MultipartFile file, String projectId) {
// Calculate file hash
String sha256Hash = calculateSHA256(file);
String md5Hash = calculateMD5(file);

// Check if file already exists
Optional<FileMetadata> existingFile = hashRepository.findByHash(sha256Hash);

if (existingFile.isPresent()) {
// File exists, create reference instead of duplicate
return createFileReference(existingFile.get(), projectId);
}

// New file, upload and store hash
String gcsPath = gcsService.uploadFile(file.getInputStream(),
generateFileName(), file.getContentType());

FileMetadata metadata = FileMetadata.builder()
.fileId(UUID.randomUUID().toString())
.originalFileName(file.getOriginalFilename())
.gcsPath(gcsPath)
.fileSize(file.getSize())
.projectId(projectId)
.sha256Hash(sha256Hash)
.md5Hash(md5Hash)
.build();

FileMetadata saved = fileMetadataRepository.save(metadata);

return FileUploadResult.builder()
.fileId(saved.getFileId())
.downloadUrl(generateDownloadUrl(saved))
.deduplicated(false)
.build();
}

private FileUploadResult createFileReference(FileMetadata originalFile, String projectId) {
// Create a new metadata entry that references the same GCS object
FileMetadata referenceMetadata = originalFile.toBuilder()
.fileId(UUID.randomUUID().toString())
.projectId(projectId)
.createdAt(Instant.now())
.isReference(true)
.originalFileId(originalFile.getFileId())
.build();

fileMetadataRepository.save(referenceMetadata);

return FileUploadResult.builder()
.fileId(referenceMetadata.getFileId())
.downloadUrl(generateDownloadUrl(referenceMetadata))
.deduplicated(true)
.originalFileId(originalFile.getFileId())
.build();
}
}
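
calculateSHA256 is referenced but not shown. A minimal streaming sketch with java.security.MessageDigest (so large uploads are hashed without being held in memory) might look like this; error handling is simplified.

private String calculateSHA256(MultipartFile file) {
    try (InputStream in = file.getInputStream()) {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read); // hash the file in 8 KB increments
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    } catch (IOException | NoSuchAlgorithmException e) {
        throw new FileStorageException("Failed to hash file", e);
    }
}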

Scalability Questions

Q: How would you handle rate limiting to prevent abuse while maintaining good user experience?

A: Implement a multi-tier rate limiting strategy:

@Component
public class FileUploadRateLimiter {

private final RedisTemplate<String, String> redisTemplate;

// Different limits for different user tiers
private final Map<UserTier, RateLimitConfig> tierLimits = Map.of(
UserTier.FREE, new RateLimitConfig(10, Duration.ofMinutes(1), 100 * 1024 * 1024L), // 10 files/min, 100MB/day
UserTier.PREMIUM, new RateLimitConfig(100, Duration.ofMinutes(1), 10 * 1024 * 1024 * 1024L), // 100 files/min, 10GB/day
UserTier.ENTERPRISE, new RateLimitConfig(1000, Duration.ofMinutes(1), Long.MAX_VALUE) // 1000 files/min, unlimited
);

public boolean allowUpload(String userId, long fileSize) {
UserTier userTier = getUserTier(userId);
RateLimitConfig config = tierLimits.get(userTier);

// Check request rate limit
String rateLimitKey = "rate_limit:" + userId;
if (!checkRateLimit(rateLimitKey, config.getRequestsPerWindow(), config.getTimeWindow())) {
throw new RateLimitExceededException("Request rate limit exceeded");
}

// Check bandwidth limit
String bandwidthKey = "bandwidth:" + userId + ":" + LocalDate.now();
long currentUsage = getCurrentBandwidthUsage(bandwidthKey);

if (currentUsage + fileSize > config.getDailyBandwidthLimit()) {
throw new BandwidthLimitExceededException("Daily bandwidth limit exceeded");
}

// Update bandwidth usage
redisTemplate.opsForValue().increment(bandwidthKey, fileSize);
redisTemplate.expire(bandwidthKey, Duration.ofDays(1));

return true;
}

private boolean checkRateLimit(String key, int limit, Duration window) {
String script = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

redis.call('zremrangebyscore', key, 0, now - window)
local current = redis.call('zcard', key)

if current < limit then
redis.call('zadd', key, now, now)
redis.call('pexpire', key, window) -- window is in milliseconds
return 1
else
return 0
end
""";

Long result = redisTemplate.execute(
RedisScript.of(script, Long.class),
List.of(key),
String.valueOf(limit),
String.valueOf(window.toMillis()),
String.valueOf(System.currentTimeMillis())
);

return result != null && result == 1;
}
}

SDK Usage Examples

Java SDK Usage

public class FileStorageExamples {

public static void main(String[] args) {
// Initialize the client
FileStorageClient client = new FileStorageClient(
"https://api.fileservice.com",
"your-api-key"
);

// Example 1: Simple file upload
File file = new File("document.pdf");
UploadOptions options = UploadOptions.builder()
.projectId("project-123")
.tags(Set.of("document", "pdf"))
.metadata(Map.of("department", "engineering"))
.build();

try {
FileUploadResult result = client.uploadFile(file, options);
System.out.println("File uploaded successfully: " + result.getFileId());
System.out.println("Download URL: " + result.getDownloadUrl());
} catch (FileUploadException e) {
System.err.println("Upload failed: " + e.getMessage());
}

// Example 2: Large file upload with progress tracking
File largeFile = new File("large-video.mp4");

FileUploadResult result = client.uploadFile(largeFile, options, (uploaded, total) -> {
double progress = (double) uploaded / total * 100;
System.out.printf("Upload progress: %.2f%%\n", progress);
});

// Example 3: File download
try (InputStream inputStream = client.downloadFile(result.getFileId())) {
Files.copy(inputStream, Paths.get("downloaded-file.mp4"));
System.out.println("File downloaded successfully");
} catch (IOException e) {
System.err.println("Download failed: " + e.getMessage());
}

// Example 4: List files with filtering
FileQuery query = FileQuery.builder()
.projectId("project-123")
.fileType("image/*")
.createdAfter(LocalDateTime.now().minusDays(7))
.tags(Set.of("processed"))
.page(0)
.size(20)
.build();

PagedResult<FileMetadata> files = client.listFiles(query);
files.getContent().forEach(file ->
System.out.println("File: " + file.getOriginalFileName() + " - " + file.getFileSize() + " bytes")
);
}
}

React SDK Usage

import { useState } from 'react';
import { FileStorageClient } from '@company/file-storage-sdk';

const App = () => {
const [client] = useState(() => new FileStorageClient({
baseUrl: 'https://api.fileservice.com',
apiKey: process.env.REACT_APP_API_KEY
}));

const [uploadProgress, setUploadProgress] = useState(0);

const handleFileUpload = async (file) => {
try {
const result = await client.uploadFile(file, {
projectId: 'project-123',
onProgress: (loaded, total) => {
setUploadProgress((loaded / total) * 100);
}
});

console.log('Upload successful:', result);
} catch (error) {
console.error('Upload failed:', error);
}
};

const handleFileDrop = (acceptedFiles) => {
acceptedFiles.forEach(file => {
handleFileUpload(file);
});
};

return (
<div>
<FileDropzone onDrop={handleFileDrop} />
{uploadProgress > 0 && (
<ProgressBar value={uploadProgress} />
)}
</div>
);
};

Security Best Practices

Content Validation and Sanitization

@Service
public class FileSecurityService {

private final Set<String> allowedContentTypes = Set.of(
"image/jpeg", "image/png", "image/gif", "image/webp",
"application/pdf", "text/plain", "application/json",
"video/mp4", "video/webm"
);

private final Set<String> dangerousExtensions = Set.of(
".exe", ".bat", ".cmd", ".scr", ".pif", ".com", ".vbs", ".js"
);

public void validateFile(MultipartFile file) {
// Check file size
if (file.getSize() > MAX_FILE_SIZE) {
throw new FileSizeExceededException("File size exceeds maximum allowed size");
}

// Validate content type
String contentType = file.getContentType();
if (contentType == null || !allowedContentTypes.contains(contentType.toLowerCase())) {
throw new UnsupportedFileTypeException("File type not allowed: " + contentType);
}

// Check file extension
String filename = file.getOriginalFilename();
if (filename != null && hasDangerousExtension(filename)) {
throw new SecurityException("File extension not allowed");
}

// Scan file content for malware (integrate with antivirus service)
scanForMalware(file);

// Validate file signature matches declared content type
validateFileSignature(file, contentType);
}

private void scanForMalware(MultipartFile file) {
// Integration with antivirus scanning service
try {
AntivirusResult result = antivirusService.scanFile(file.getBytes());
if (!result.isClean()) {
throw new SecurityException("File contains malware: " + result.getThreatName());
}
} catch (Exception e) {
throw new SecurityException("Unable to scan file for malware", e);
}
}

private void validateFileSignature(MultipartFile file, String declaredContentType) {
try {
byte[] header = new byte[12];
file.getInputStream().read(header);

String detectedType = detectContentType(header);
if (!declaredContentType.equals(detectedType)) {
throw new SecurityException("File signature doesn't match declared content type");
}
} catch (IOException e) {
throw new SecurityException("Unable to validate file signature", e);
}
}
}
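
detectContentType(header) is referenced but not implemented. A minimal magic-byte sketch covering a few of the allowed types is shown below; a production service would typically delegate this to a library such as Apache Tika rather than hand-rolling signatures.

private String detectContentType(byte[] header) {
    // PNG: 89 50 4E 47
    if (header.length >= 4 && (header[0] & 0xFF) == 0x89
            && header[1] == 'P' && header[2] == 'N' && header[3] == 'G') {
        return "image/png";
    }
    // JPEG: FF D8 FF
    if (header.length >= 3 && (header[0] & 0xFF) == 0xFF
            && (header[1] & 0xFF) == 0xD8 && (header[2] & 0xFF) == 0xFF) {
        return "image/jpeg";
    }
    // PDF: "%PDF"
    if (header.length >= 4 && header[0] == '%'
            && header[1] == 'P' && header[2] == 'D' && header[3] == 'F') {
        return "application/pdf";
    }
    // Unknown signature; let the caller treat it as a mismatch
    return "application/octet-stream";
}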

Access Control Implementation

@Service
public class FileAccessControlService {

private final JwtTokenProvider tokenProvider;
private final PermissionEvaluator permissionEvaluator;

@PreAuthorize("@fileAccessControlService.canAccessFile(authentication, #fileId, 'READ')")
public FileMetadata getFileMetadata(String fileId) {
return fileMetadataRepository.findById(fileId)
.orElseThrow(() -> new FileNotFoundException("File not found: " + fileId));
}

public boolean canAccessFile(Authentication authentication, String fileId, String operation) {
String userId = authentication.getName();
FileMetadata file = getFileMetadata(fileId);

// Owner can do everything
if (file.getCreatedBy().equals(userId)) {
return true;
}

// Check project-level permissions
if (file.getProjectId() != null) {
return permissionEvaluator.hasPermission(authentication, file.getProjectId(), "Project", operation);
}

// Check if file is public
if (file.getVisibility() == FileVisibility.PUBLIC && "READ".equals(operation)) {
return true;
}

// Check explicit file permissions
return hasExplicitFilePermission(userId, fileId, operation);
}

private boolean hasExplicitFilePermission(String userId, String fileId, String operation) {
// Check file-specific permissions in ACL table
return filePermissionRepository.existsByFileIdAndUserIdAndPermission(
fileId, userId, FilePermission.valueOf(operation)
);
}
}

Disaster Recovery and Backup Strategy

Automated Backup System

@Component
public class FileBackupService {

private final GcsService primaryGcsService;
private final GcsService backupGcsService;
private final FileMetadataRepository fileMetadataRepository;

@Scheduled(cron = "0 0 2 * * *") // Daily at 2 AM
public void performIncrementalBackup() {
LocalDateTime lastBackup = getLastBackupTime();
LocalDateTime now = LocalDateTime.now();

// Find files modified since last backup
List<FileMetadata> modifiedFiles = fileMetadataRepository
.findByUpdatedAtBetween(lastBackup, now);

BackupResult result = BackupResult.builder()
.startTime(now)
.totalFiles(modifiedFiles.size())
.build();

try {
// Backup files in parallel
List<CompletableFuture<FileBackupResult>> futures = modifiedFiles.stream()
.map(file -> CompletableFuture.supplyAsync(() -> backupFile(file)))
.collect(Collectors.toList());

List<FileBackupResult> backupResults = futures.stream()
.map(CompletableFuture::join)
.collect(Collectors.toList());

result = result.toBuilder()
.endTime(LocalDateTime.now())
.successCount(backupResults.stream().mapToInt(r -> r.isSuccess() ? 1 : 0).sum())
.failureCount(backupResults.stream().mapToInt(r -> r.isSuccess() ? 0 : 1).sum())
.build();

// Record backup completion
recordBackupCompletion(result);

} catch (Exception e) {
logger.error("Backup process failed", e);
alertService.sendBackupFailureAlert(e);
}
}

private FileBackupResult backupFile(FileMetadata fileMetadata) {
try {
// Generate backup path
String backupPath = generateBackupPath(fileMetadata);

// Copy file to backup bucket
primaryGcsService.copyFile(fileMetadata.getGcsPath(),
backupGcsService, backupPath);

// Create backup metadata record
FileBackupMetadata backupMetadata = FileBackupMetadata.builder()
.originalFileId(fileMetadata.getFileId())
.backupPath(backupPath)
.backupTime(LocalDateTime.now())
.backupSize(fileMetadata.getFileSize())
.checksum(fileMetadata.getChecksumSha256())
.build();

fileBackupRepository.save(backupMetadata);

return FileBackupResult.success(fileMetadata.getFileId());

} catch (Exception e) {
logger.error("Failed to backup file: {}", fileMetadata.getFileId(), e);
return FileBackupResult.failure(fileMetadata.getFileId(), e.getMessage());
}
}

@Scheduled(cron = "0 0 3 * * 0") // Weekly on Sunday at 3 AM
public void performFullBackup() {
// Full backup implementation
logger.info("Starting full backup process");

// Export database schema and data
exportDatabase();

// Backup all files
performFullFileBackup();

// Verify backup integrity
verifyBackupIntegrity();
}

public RestoreResult restoreFromBackup(RestoreRequest request) {
try {
if (request.getRestoreType() == RestoreType.POINT_IN_TIME) {
return restoreToPointInTime(request.getTargetTime());
} else if (request.getRestoreType() == RestoreType.SPECIFIC_FILES) {
return restoreSpecificFiles(request.getFileIds());
} else {
return restoreFullSystem(request.getBackupId());
}
} catch (Exception e) {
logger.error("Restore operation failed", e);
return RestoreResult.failure(e.getMessage());
}
}
}

Multi-Region Deployment Flow


graph TB
subgraph "Primary Region (US-Central)"
    A[API Gateway] --> B[File Service Instances]
    B --> C[Primary PostgreSQL]
    B --> D[Primary GCS Bucket]
    B --> E[Redis Cache]
end

subgraph "Secondary Region (Europe-West)"
    F[API Gateway] --> G[File Service Instances]
    G --> H[Read Replica PostgreSQL]
    G --> I[Secondary GCS Bucket]
    G --> J[Redis Cache]
end

subgraph "Disaster Recovery Region (Asia-Southeast)"
    K[Standby API Gateway] --> L[Standby File Service]
    L --> M[Backup PostgreSQL]
    L --> N[Archive GCS Bucket]
end

C --> H
D --> I
D --> N

O[Global Load Balancer] --> A
O --> F
O --> K

P[Health Checker] --> O
Q[Failover Controller] --> O

style A fill:#e1f5fe
style F fill:#e8f5e8
style K fill:#fff3e0

Cost Optimization Strategies

Intelligent Storage Class Management

@Service
public class StorageCostOptimizer {

private final GcsService gcsService;
private final FileMetadataRepository fileMetadataRepository;
private final FileAccessLogRepository accessLogRepository;

@Scheduled(cron = "0 0 1 * * *") // Daily at 1 AM
public void optimizeStorageClasses() {
LocalDateTime cutoffDate = LocalDateTime.now().minusDays(30);

// Find files that haven't been accessed in 30 days
List<FileMetadata> candidates = fileMetadataRepository.findInfrequentlyAccessedFiles(cutoffDate);

for (FileMetadata file : candidates) {
StorageOptimizationDecision decision = analyzeFile(file);

if (decision.shouldChangeStorageClass()) {
try {
gcsService.changeStorageClass(file.getGcsPath(), decision.getTargetStorageClass());

// Update metadata
file.setStorageClass(decision.getTargetStorageClass());
file.setLastOptimized(LocalDateTime.now());
fileMetadataRepository.save(file);

// Record cost savings
recordCostSavings(file, decision);

} catch (Exception e) {
logger.error("Failed to optimize storage for file: {}", file.getFileId(), e);
}
}
}
}

private StorageOptimizationDecision analyzeFile(FileMetadata file) {
// Analyze access patterns
long accessCount = accessLogRepository.countByFileIdAndAccessTimeAfter(
file.getFileId(), LocalDateTime.now().minusDays(30)
);

LocalDateTime lastAccess = accessLogRepository.findLastAccessTime(file.getFileId());
long daysSinceLastAccess = ChronoUnit.DAYS.between(lastAccess, LocalDateTime.now());

// Decision matrix for storage class optimization
if (daysSinceLastAccess > 365) {
return StorageOptimizationDecision.moveToArchive();
} else if (daysSinceLastAccess > 90 && accessCount < 5) {
return StorageOptimizationDecision.moveToColdline();
} else if (daysSinceLastAccess > 30 && accessCount < 10) {
return StorageOptimizationDecision.moveToNearline();
}

return StorageOptimizationDecision.noChange();
}

private void recordCostSavings(FileMetadata file, StorageOptimizationDecision decision) {
BigDecimal originalCost = calculateMonthlyCost(file.getFileSize(), file.getCurrentStorageClass());
BigDecimal newCost = calculateMonthlyCost(file.getFileSize(), decision.getTargetStorageClass());
BigDecimal savings = originalCost.subtract(newCost);

CostOptimizationRecord record = CostOptimizationRecord.builder()
.fileId(file.getFileId())
.optimizationDate(LocalDateTime.now())
.fromStorageClass(file.getCurrentStorageClass())
.toStorageClass(decision.getTargetStorageClass())
.fileSize(file.getFileSize())
.monthlySavings(savings)
.build();

costOptimizationRepository.save(record);

// Update metrics (Counter.increment() only takes an amount; a per-storage-class
// breakdown would need counters registered per tag combination)
costSavingsCounter.increment(savings.doubleValue());
}
}

Usage Analytics and Reporting

@Service
public class FileStorageAnalyticsService {

public StorageAnalyticsReport generateMonthlyReport(String projectId, YearMonth month) {
LocalDateTime startOfMonth = month.atDay(1).atStartOfDay();
LocalDateTime endOfMonth = month.atEndOfMonth().atTime(23, 59, 59);

// Storage usage metrics
StorageUsageMetrics usage = StorageUsageMetrics.builder()
.totalFiles(fileMetadataRepository.countByProjectIdAndCreatedAtBetween(
projectId, startOfMonth, endOfMonth))
.totalStorageBytes(fileMetadataRepository.sumFileSizeByProjectIdAndDateRange(
projectId, startOfMonth, endOfMonth))
.averageFileSize(calculateAverageFileSize(projectId, startOfMonth, endOfMonth))
.build();

// Access patterns
AccessPatternMetrics access = AccessPatternMetrics.builder()
.totalDownloads(accessLogRepository.countDownloadsByProjectAndDateRange(
projectId, startOfMonth, endOfMonth))
.totalUploads(uploadLogRepository.countUploadsByProjectAndDateRange(
projectId, startOfMonth, endOfMonth))
.uniqueUsers(accessLogRepository.countUniqueUsersByProjectAndDateRange(
projectId, startOfMonth, endOfMonth))
.build();

// Cost analysis
CostAnalysisMetrics cost = CostAnalysisMetrics.builder()
.storageCost(calculateStorageCost(usage))
.bandwidthCost(calculateBandwidthCost(access))
.operationsCost(calculateOperationsCost(access))
.totalCost(calculateTotalCost(usage, access))
.projectedCost(projectCostForNextMonth(usage, access))
.build();

// Top files by size and access
List<FileUsageStats> topFilesBySize = getTopFilesBySize(projectId, 10);
List<FileUsageStats> topFilesByAccess = getTopFilesByAccess(projectId, 10);

return StorageAnalyticsReport.builder()
.projectId(projectId)
.reportMonth(month)
.generatedAt(LocalDateTime.now())
.storageUsage(usage)
.accessPatterns(access)
.costAnalysis(cost)
.topFilesBySize(topFilesBySize)
.topFilesByAccess(topFilesByAccess)
.recommendations(generateRecommendations(projectId, usage, access, cost))
.build();
}

private List<CostOptimizationRecommendation> generateRecommendations(
String projectId, StorageUsageMetrics usage, AccessPatternMetrics access, CostAnalysisMetrics cost) {

List<CostOptimizationRecommendation> recommendations = new ArrayList<>();

// Recommendation 1: Storage class optimization
if (cost.getStorageCost().compareTo(cost.getBandwidthCost()) > 0) {
recommendations.add(CostOptimizationRecommendation.builder()
.type(RecommendationType.STORAGE_CLASS_OPTIMIZATION)
.description("Consider moving infrequently accessed files to cheaper storage classes")
.potentialSavings(estimateStorageClassSavings(usage))
.priority(RecommendationPriority.HIGH)
.build());
}

// Recommendation 2: File lifecycle management
long oldFilesCount = fileMetadataRepository.countOldFiles(
projectId, LocalDateTime.now().minusYears(1)
);

if (oldFilesCount > 1000) {
recommendations.add(CostOptimizationRecommendation.builder()
.type(RecommendationType.LIFECYCLE_POLICY)
.description("Implement automatic archival for files older than 1 year")
.potentialSavings(estimateLifecycleSavings(oldFilesCount))
.priority(RecommendationPriority.MEDIUM)
.build());
}

return recommendations;
}
}
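
A possible way to trigger the report above on a schedule; the bean wiring is a hedged sketch and the project identifier is a placeholder:

@Component
public class MonthlyReportJob {

    private final FileStorageAnalyticsService analyticsService;

    public MonthlyReportJob(FileStorageAnalyticsService analyticsService) {
        this.analyticsService = analyticsService;
    }

    // Runs early on the 1st of each month and reports on the month that just ended
    @Scheduled(cron = "0 0 3 1 * *")
    public void generateReports() {
        YearMonth previousMonth = YearMonth.now().minusMonths(1);
        StorageAnalyticsReport report =
                analyticsService.generateMonthlyReport("demo-project", previousMonth);
        // Persist the report or push it to a notification channel as needed
    }
}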

Testing Strategy

Integration Testing

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@TestPropertySource(properties = {
"spring.datasource.url=jdbc:h2:mem:testdb",
"gcs.bucket-name=test-bucket",
"redis.host=localhost"
})
class FileStorageIntegrationTest {

@Autowired
private TestRestTemplate restTemplate;

@Autowired
private FileMetadataRepository fileMetadataRepository;

@MockBean
private GcsService gcsService;

@Test
void shouldUploadFileSuccessfully() throws Exception {
// Arrange
MockMultipartFile file = new MockMultipartFile(
"file", "test.txt", "text/plain", "Hello World".getBytes()
);

when(gcsService.uploadFile(any(), any(), any(), any()))
.thenReturn("test-bucket/files/test.txt");
when(gcsService.generateSignedUrl(any(), any()))
.thenReturn("https://storage.googleapis.com/test-bucket/files/test.txt");

// Act
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", file.getResource());
body.add("projectId", "test-project");

HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.MULTIPART_FORM_DATA);
headers.setBearerAuth("test-token");

ResponseEntity<FileUploadResponse> response = restTemplate.exchange(
"/api/v1/files/upload",
HttpMethod.POST,
new HttpEntity<>(body, headers),
FileUploadResponse.class
);

// Assert
assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
assertThat(response.getBody().getFileId()).isNotNull();
assertThat(response.getBody().getDownloadUrl()).contains("storage.googleapis.com");

// Verify database record
Optional<FileMetadata> savedFile = fileMetadataRepository.findById(response.getBody().getFileId());
assertThat(savedFile).isPresent();
assertThat(savedFile.get().getOriginalFileName()).isEqualTo("test.txt");
}

@Test
void shouldHandleChunkedUploadCorrectly() throws Exception {
// Arrange
String uploadId = UUID.randomUUID().toString();
int totalChunks = 3;
int chunkSize = 1024;

// Initiate chunked upload
ChunkedUploadRequest initRequest = ChunkedUploadRequest.builder()
.fileName("large-file.mp4")
.totalSize(3072)
.chunkSize(chunkSize)
.projectId("test-project")
.build();

ResponseEntity<ChunkedUploadResponse> initResponse = restTemplate.postForEntity(
"/api/v1/files/upload/chunked/initiate",
initRequest,
ChunkedUploadResponse.class
);

assertThat(initResponse.getStatusCode()).isEqualTo(HttpStatus.OK);
String actualUploadId = initResponse.getBody().getUploadId();

// Upload chunks
for (int i = 1; i <= totalChunks; i++) {
MockMultipartFile chunk = new MockMultipartFile(
"chunk", "chunk" + i, "application/octet-stream",
new byte[chunkSize] // every chunk is 1024 bytes, matching the declared total of 3072
);

MultiValueMap<String, Object> chunkBody = new LinkedMultiValueMap<>();
chunkBody.add("chunk", chunk.getResource());

ResponseEntity<ChunkUploadResponse> chunkResponse = restTemplate.exchange(
"/api/v1/files/upload/chunked/" + actualUploadId + "/chunk/" + i,
HttpMethod.PUT,
new HttpEntity<>(chunkBody, createAuthHeaders()),
ChunkUploadResponse.class
);

assertThat(chunkResponse.getStatusCode()).isEqualTo(HttpStatus.OK);

if (i == totalChunks) {
// Last chunk should complete the upload
assertThat(chunkResponse.getBody().isCompleted()).isTrue();
assertThat(chunkResponse.getBody().getFileId()).isNotNull();
}
}
}

@Test
void shouldEnforceRateLimitsCorrectly() {
// Test rate limiting behavior
String userId = "test-user";

// Simulate exceeding rate limit
for (int i = 0; i < 15; i++) { // Assuming limit is 10 requests per minute
MockMultipartFile file = new MockMultipartFile(
"file", "test" + i + ".txt", "text/plain", ("Content " + i).getBytes()
);

MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", file.getResource());

ResponseEntity<String> response = restTemplate.exchange(
"/api/v1/files/upload",
HttpMethod.POST,
new HttpEntity<>(body, createAuthHeaders(userId)),
String.class
);

if (i < 10) {
assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
} else {
assertThat(response.getStatusCode()).isEqualTo(HttpStatus.TOO_MANY_REQUESTS);
}
}
}
}
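
The createAuthHeaders helpers referenced in the chunked-upload and rate-limit tests are not shown above. A minimal sketch that would live inside FileStorageIntegrationTest; the bearer-token values are placeholders, not a real authentication flow:

private HttpHeaders createAuthHeaders() {
    return createAuthHeaders("test-user");
}

private HttpHeaders createAuthHeaders(String userId) {
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.MULTIPART_FORM_DATA);
    headers.setBearerAuth("test-token-" + userId); // placeholder token per user
    return headers;
}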

Load Testing with JMeter Configuration

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2">
<hashTree>
<TestPlan>
<stringProp name="TestPlan.comments">File Storage Service Load Test</stringProp>
<boolProp name="TestPlan.functional_mode">false</boolProp>
<boolProp name="TestPlan.tearDown_on_shutdown">true</boolProp>
<boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
<elementProp name="TestPlan.arguments" elementType="Arguments" guiclass="ArgumentsPanel">
<collectionProp name="Arguments.arguments">
<elementProp name="base_url" elementType="Argument">
<stringProp name="Argument.name">base_url</stringProp>
<stringProp name="Argument.value">https://api.fileservice.com</stringProp>
</elementProp>
</collectionProp>
</elementProp>
<stringProp name="TestPlan.user_define_classpath"></stringProp>
</TestPlan>
<hashTree>
<!-- Thread Group for File Upload Load Test -->
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="File Upload Load Test">
<stringProp name="ThreadGroup.on_sample_error">continue</stringProp>
<elementProp name="ThreadGroup.main_controller" elementType="LoopController">
<boolProp name="LoopController.continue_forever">false</boolProp>
<stringProp name="LoopController.loops">10</stringProp>
</elementProp>
<stringProp name="ThreadGroup.num_threads">100</stringProp>
<stringProp name="ThreadGroup.ramp_time">60</stringProp>
<boolProp name="ThreadGroup.scheduler">false</boolProp>
<stringProp name="ThreadGroup.duration"></stringProp>
<stringProp name="ThreadGroup.delay"></stringProp>
</ThreadGroup>
<hashTree>
<!-- HTTP Request for File Upload -->
<HTTPSamplerProxy>
<elementProp name="HTTPsampler.Files" elementType="HTTPFileArgs">
<collectionProp name="HTTPFileArgs.files">
<elementProp name="" elementType="HTTPFileArg">
<stringProp name="File.path">${__P(test_file_path)}</stringProp>
<stringProp name="File.paramname">file</stringProp>
<stringProp name="File.mimetype">application/octet-stream</stringProp>
</elementProp>
</collectionProp>
</elementProp>
<elementProp name="HTTPsampler.Arguments" elementType="Arguments">
<collectionProp name="Arguments.arguments">
<elementProp name="projectId" elementType="HTTPArgument">
<boolProp name="HTTPArgument.always_encode">false</boolProp>
<stringProp name="Argument.value">load-test-project</stringProp>
<stringProp name="Argument.name">projectId</stringProp>
</elementProp>
</collectionProp>
</elementProp>
<stringProp name="HTTPSampler.domain">${base_url}</stringProp>
<stringProp name="HTTPSampler.port"></stringProp>
<stringProp name="HTTPSampler.protocol">https</stringProp>
<stringProp name="HTTPSampler.contentEncoding"></stringProp>
<stringProp name="HTTPSampler.path">/api/v1/files/upload</stringProp>
<stringProp name="HTTPSampler.method">POST</stringProp>
<boolProp name="HTTPSampler.follow_redirects">true</boolProp>
<boolProp name="HTTPSampler.auto_redirects">false</boolProp>
<boolProp name="HTTPSampler.use_keepalive">true</boolProp>
<boolProp name="HTTPSampler.DO_MULTIPART_POST">true</boolProp>
</HTTPSamplerProxy>
</hashTree>
</hashTree>
</hashTree>
</jmeterTestPlan>

External Resources and Documentation

Essential Resources

Google Cloud Storage Documentation:

PostgreSQL Performance:

Spring Boot Integration:

Monitoring and Observability:

Security Best Practices:

This comprehensive file storage service design provides a production-ready, scalable solution that handles millions of files efficiently while maintaining high availability, security, and performance standards. The architecture supports both small and large file operations, implements intelligent cost optimization, and provides robust monitoring and disaster recovery capabilities.

What is Kubernetes and why would you use it for Java applications?

Reference Answer

Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It acts as a distributed operating system for containerized workloads.

Key Benefits for Java Applications:

  • Microservices Architecture: Enables independent scaling and deployment of Java services
  • Service Discovery: Built-in DNS-based service discovery eliminates hardcoded endpoints
  • Load Balancing: Automatic distribution of traffic across healthy instances
  • Rolling Deployments: Zero-downtime deployments with gradual traffic shifting
  • Configuration Management: Externalized configuration through ConfigMaps and Secrets
  • Resource Management: Optimal JVM performance through resource quotas and limits
  • Self-Healing: Automatic restart of failed containers and rescheduling on healthy nodes
  • Horizontal Scaling: Auto-scaling based on CPU, memory, or custom metrics

Architecture Overview


graph TB
subgraph "Kubernetes Cluster"
    subgraph "Master Node"
        API[API Server]
        ETCD[etcd]
        SCHED[Scheduler]
        CM[Controller Manager]
    end
    
    subgraph "Worker Node 1"
        KUBELET1[Kubelet]
        PROXY1[Kube-proxy]
        subgraph "Pods"
            POD1[Java App Pod 1]
            POD2[Java App Pod 2]
        end
    end
    
    subgraph "Worker Node 2"
        KUBELET2[Kubelet]
        PROXY2[Kube-proxy]
        POD3[Java App Pod 3]
    end
end

USERS[Users] --> API
API --> ETCD
API --> SCHED
API --> CM
SCHED --> KUBELET1
SCHED --> KUBELET2


Explain the difference between Pods, Services, and Deployments

Reference Answer

These are fundamental Kubernetes resources that work together to run and expose applications.

Pod

  • Definition: Smallest deployable unit containing one or more containers
  • Characteristics: Shared network and storage, ephemeral, single IP address
  • Java Context: Typically one JVM per pod for resource isolation
apiVersion: v1
kind: Pod
metadata:
name: java-app-pod
labels:
app: java-app
spec:
containers:
- name: java-container
image: openjdk:11-jre-slim
ports:
- containerPort: 8080
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"

Service

  • Definition: Abstraction that defines access to a logical set of pods
  • Types: ClusterIP (internal), NodePort (external via node), LoadBalancer (cloud LB)
  • Purpose: Provides stable networking and load balancing
apiVersion: v1
kind: Service
metadata:
name: java-app-service
spec:
selector:
app: java-app
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: ClusterIP

Deployment

  • Definition: Higher-level resource managing ReplicaSets and pod lifecycles
  • Features: Rolling updates, rollbacks, replica management, declarative updates
apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app-deployment
spec:
replicas: 3
selector:
matchLabels:
app: java-app
template:
metadata:
labels:
app: java-app
spec:
containers:
- name: java-app
image: my-java-app:1.0
ports:
- containerPort: 8080

Relationship Diagram


graph LR
DEPLOY[Deployment] --> RS[ReplicaSet]
RS --> POD1[Pod 1]
RS --> POD2[Pod 2]
RS --> POD3[Pod 3]

SVC[Service] --> POD1
SVC --> POD2
SVC --> POD3

USERS[External Users] --> SVC


How do you handle configuration management for Java applications in Kubernetes?

Reference Answer

Configuration management in Kubernetes separates configuration from application code using ConfigMaps and Secrets, following the twelve-factor app methodology.

ConfigMaps (Non-sensitive data)

apiVersion: v1
kind: ConfigMap
metadata:
name: java-app-config
data:
application.properties: |
server.port=8080
spring.profiles.active=production
logging.level.com.example=INFO
database.pool.size=10
app.env: "production"
debug.enabled: "false"

Secrets (Sensitive data)

apiVersion: v1
kind: Secret
metadata:
name: java-app-secrets
type: Opaque
data:
database-username: dXNlcm5hbWU= # base64 encoded
database-password: cGFzc3dvcmQ= # base64 encoded
api-key: YWJjZGVmZ2hpams= # base64 encoded

Using Configuration in Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app
spec:
template:
spec:
containers:
- name: app
image: my-java-app:1.0
# Environment variables from ConfigMap
env:
- name: SPRING_PROFILES_ACTIVE
valueFrom:
configMapKeyRef:
name: java-app-config
key: app.env
- name: DB_USERNAME
valueFrom:
secretKeyRef:
name: java-app-secrets
key: database-username
# Mount ConfigMap as volume
volumeMounts:
- name: config-volume
mountPath: /app/config
readOnly: true
- name: secret-volume
mountPath: /app/secrets
readOnly: true
volumes:
- name: config-volume
configMap:
name: java-app-config
- name: secret-volume
secret:
secretName: java-app-secrets

Spring Boot Integration

@Configuration
@ConfigurationProperties(prefix = "app")
public class AppConfig {
private String environment;
private boolean debugEnabled;

// getters and setters
}

@RestController
public class ConfigController {

@Value("${database.pool.size:5}")
private int poolSize;

@Autowired
private AppConfig appConfig;
}

Configuration Hot-Reloading with Spring Cloud Kubernetes

<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-kubernetes-config</artifactId>
</dependency>
spring:
cloud:
kubernetes:
config:
enabled: true
sources:
- name: java-app-config
namespace: default
reload:
enabled: true
mode: event
strategy: refresh
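
With the refresh reload strategy, only beans that participate in Spring's refresh scope pick up changed values at runtime. A hedged sketch (assumes spring-cloud-context is on the classpath via the starter above; the controller and property names are illustrative):

@RefreshScope
@RestController
public class FeatureFlagController {

    @Value("${debug.enabled:false}")
    private boolean debugEnabled;

    @GetMapping("/flags/debug")
    public boolean isDebugEnabled() {
        // Re-reads the ConfigMap-backed property after a refresh event
        return debugEnabled;
    }
}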

Describe resource management and JVM tuning in Kubernetes

Reference Answer

Resource management in Kubernetes involves setting appropriate CPU and memory requests and limits, while JVM tuning ensures optimal performance within container constraints.

Resource Requests vs Limits

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app
spec:
template:
spec:
containers:
- name: java-app
image: my-java-app:1.0
resources:
requests:
memory: "1Gi" # Guaranteed memory
cpu: "500m" # Guaranteed CPU (0.5 cores)
limits:
memory: "2Gi" # Maximum memory
cpu: "1000m" # Maximum CPU (1 core)

JVM Container Awareness

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app
spec:
template:
spec:
containers:
- name: java-app
image: openjdk:11-jre-slim
env:
- name: JAVA_OPTS
value: >-
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=75.0
-XX:+UseG1GC
-XX:+UseStringDeduplication
-XX:+OptimizeStringConcat
-Djava.security.egd=file:/dev/./urandom
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"

Memory Calculation Strategy


graph TD
CONTAINER[Container Memory Limit: 2Gi] --> JVM[JVM Heap: ~75% = 1.5Gi]
CONTAINER --> NONHEAP[Non-Heap: ~20% = 400Mi]
CONTAINER --> OS[OS/Buffer: ~5% = 100Mi]

JVM --> HEAP_YOUNG[Young Generation]
JVM --> HEAP_OLD[Old Generation]

NONHEAP --> METASPACE[Metaspace]
NONHEAP --> CODECACHE[Code Cache]
NONHEAP --> STACK[Thread Stacks]
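
To confirm that MaxRAMPercentage actually resolved against the container limit rather than host RAM, the JVM's own view can be logged at startup. A small standalone snippet using standard java.lang.management APIs:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class MemoryReport {
    public static void main(String[] args) {
        // Max heap should be roughly 75% of the container limit with the flags above
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        System.out.printf("Max heap: %d MiB%n", maxHeapBytes / (1024 * 1024));
        System.out.printf("Heap usage: %s%n", memory.getHeapMemoryUsage());
        System.out.printf("Non-heap usage: %s%n", memory.getNonHeapMemoryUsage());
        System.out.printf("Available processors: %d%n",
                Runtime.getRuntime().availableProcessors());
    }
}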

Advanced JVM Configuration

FROM openjdk:11-jre-slim

# JVM tuning for containers
ENV JAVA_OPTS="-server \
-XX:+UseContainerSupport \
-XX:MaxRAMPercentage=75.0 \
-XX:InitialRAMPercentage=50.0 \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:+UseStringDeduplication \
-XX:+OptimizeStringConcat \
-XX:+UseCompressedOops \
-XX:+UseCompressedClassPointers \
-Djava.security.egd=file:/dev/./urandom \
-Dfile.encoding=UTF-8 \
-Duser.timezone=UTC"

COPY app.jar /app/app.jar
EXPOSE 8080
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar /app/app.jar"]

Resource Monitoring

apiVersion: v1
kind: Pod
metadata:
name: java-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/path: "/actuator/prometheus"
prometheus.io/port: "8080"
spec:
containers:
- name: java-app
image: my-java-app:1.0
ports:
- containerPort: 8080
name: http

Vertical Pod Autoscaler (VPA) Configuration

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: java-app-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: java-app
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: java-app
maxAllowed:
memory: 4Gi
cpu: 2000m
minAllowed:
memory: 512Mi
cpu: 250m

How do you implement health checks for Java applications?

Reference Answer

Health checks in Kubernetes use three types of probes to ensure application reliability and proper traffic routing.

Probe Types


graph LR
subgraph "Pod Lifecycle"
    START[Pod Start] --> STARTUP{Startup Probe}
    STARTUP -->|Pass| READY{Readiness Probe}
    STARTUP -->|Fail| RESTART[Restart Container]
    READY -->|Pass| TRAFFIC[Receive Traffic]
    READY -->|Fail| NO_TRAFFIC[No Traffic]
    TRAFFIC --> LIVE{Liveness Probe}
    LIVE -->|Pass| TRAFFIC
    LIVE -->|Fail| RESTART
end

Spring Boot Actuator Health Endpoints

@RestController
public class HealthController {

@Autowired
private DataSource dataSource;

@GetMapping("/health/live")
public ResponseEntity<Map<String, String>> liveness() {
Map<String, String> status = new HashMap<>();
status.put("status", "UP");
status.put("timestamp", Instant.now().toString());
return ResponseEntity.ok(status);
}

@GetMapping("/health/ready")
public ResponseEntity<Map<String, Object>> readiness() {
Map<String, Object> health = new HashMap<>();
health.put("status", "UP");

// Check database connectivity
try {
dataSource.getConnection().close();
health.put("database", "UP");
} catch (Exception e) {
health.put("database", "DOWN");
health.put("status", "DOWN");
return ResponseEntity.status(503).body(health);
}

// Check external dependencies
health.put("externalAPI", checkExternalAPI());

return ResponseEntity.ok(health);
}

private String checkExternalAPI() {
// Implementation to check external dependencies
return "UP";
}
}

Kubernetes Deployment with Health Checks

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app
spec:
template:
spec:
containers:
- name: java-app
image: my-java-app:1.0
ports:
- containerPort: 8080

# Startup probe for slow-starting applications
startupProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30 # 30 * 5 = 150s max startup time

# Liveness probe
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3 # Restart after 3 consecutive failures

# Readiness probe
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3 # Remove from service after 3 failures

Custom Health Indicators

@Component
public class DatabaseHealthIndicator implements HealthIndicator {

@Autowired
private DataSource dataSource;

@Override
public Health health() {
// try-with-resources so the connection, statement, and result set are always closed
try (Connection connection = dataSource.getConnection();
PreparedStatement statement = connection.prepareStatement("SELECT 1");
ResultSet resultSet = statement.executeQuery()) {

if (resultSet.next()) {
return Health.up()
.withDetail("database", "Available")
.withDetail("connectionPool", getConnectionPoolInfo())
.build();
}
} catch (SQLException e) {
return Health.down()
.withDetail("database", "Unavailable")
.withDetail("error", e.getMessage())
.build();
}

return Health.down().build();
}

private Map<String, Object> getConnectionPoolInfo() {
// Return connection pool metrics
Map<String, Object> poolInfo = new HashMap<>();
poolInfo.put("active", 5);
poolInfo.put("idle", 3);
poolInfo.put("max", 10);
return poolInfo;
}
}
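
The hard-coded pool numbers above are placeholders. If the DataSource is HikariCP (the Spring Boot default), the real figures can be read from HikariPoolMXBean; a sketch, with the cast guarded in case another pool implementation is in use:

private Map<String, Object> getConnectionPoolInfo() {
    Map<String, Object> poolInfo = new HashMap<>();
    if (dataSource instanceof HikariDataSource) {
        HikariDataSource hikari = (HikariDataSource) dataSource;
        HikariPoolMXBean pool = hikari.getHikariPoolMXBean();
        poolInfo.put("active", pool.getActiveConnections());
        poolInfo.put("idle", pool.getIdleConnections());
        poolInfo.put("max", hikari.getMaximumPoolSize());
    }
    return poolInfo;
}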

Application Properties Configuration

# Health endpoint configuration
management.endpoints.web.exposure.include=health,info,metrics,prometheus
management.endpoint.health.show-details=always
management.endpoint.health.show-components=always
management.health.db.enabled=true
management.health.diskspace.enabled=true

# Custom health check paths
management.server.port=8081
management.endpoints.web.base-path=/actuator

TCP and Command Probes

# TCP probe example
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 30
periodSeconds: 10

# Command probe example
livenessProbe:
exec:
command:
- /bin/sh
- -c
- "ps aux | grep java"
initialDelaySeconds: 30
periodSeconds: 10

Explain how to handle persistent data in Java applications on Kubernetes

Reference Answer

Persistent data in Kubernetes requires understanding storage abstractions and choosing appropriate patterns based on application requirements.

Storage Architecture


graph TB
subgraph "Storage Layer"
    SC[StorageClass] --> PV[PersistentVolume]
    PVC[PersistentVolumeClaim] --> PV
end

subgraph "Application Layer"
    POD[Pod] --> PVC
    STATEFULSET[StatefulSet] --> PVC
    DEPLOYMENT[Deployment] --> PVC
end

subgraph "Physical Storage"
    PV --> DISK[Physical Disk]
    PV --> NFS[NFS Server]
    PV --> CLOUD[Cloud Storage]
end

StorageClass Definition

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp3
fsType: ext4
encrypted: "true"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: java-app-storage
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 10Gi

Java Application with File Storage

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-file-processor
spec:
replicas: 1 # Single replica for ReadWriteOnce
template:
spec:
containers:
- name: app
image: my-java-app:1.0
volumeMounts:
- name: app-storage
mountPath: /app/data
- name: logs-storage
mountPath: /app/logs
env:
- name: DATA_DIR
value: "/app/data"
- name: LOG_DIR
value: "/app/logs"
volumes:
- name: app-storage
persistentVolumeClaim:
claimName: java-app-storage
- name: logs-storage
persistentVolumeClaim:
claimName: java-app-logs

StatefulSet for Database Applications

apiVersion: apps/v1
kind: StatefulSet
metadata:
name: java-database-app
spec:
serviceName: "java-db-service"
replicas: 3
template:
spec:
containers:
- name: app
image: my-java-db-app:1.0
ports:
- containerPort: 8080
volumeMounts:
- name: data-volume
mountPath: /app/data
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeClaimTemplates:
- metadata:
name: data-volume
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "fast-ssd"
resources:
requests:
storage: 20Gi

Java File Processing Example

@Service
public class FileProcessingService {

private static final Logger logger = LoggerFactory.getLogger(FileProcessingService.class);

@Value("${app.data.dir:/app/data}")
private String dataDirectory;

@PostConstruct
public void init() {
// Ensure data directory exists
Path dataPath = Paths.get(dataDirectory);
if (!Files.exists(dataPath)) {
try {
Files.createDirectories(dataPath);
logger.info("Created data directory: {}", dataPath);
} catch (IOException e) {
logger.error("Failed to create data directory", e);
}
}
}

public void processFile(MultipartFile uploadedFile) {
try {
String filename = UUID.randomUUID().toString() + "_" + uploadedFile.getOriginalFilename();
Path filePath = Paths.get(dataDirectory, filename);

// Save uploaded file
uploadedFile.transferTo(filePath.toFile());

// Process file
processFileContent(filePath);

// Move to processed directory
Path processedDir = Paths.get(dataDirectory, "processed");
Files.createDirectories(processedDir);
Files.move(filePath, processedDir.resolve(filename));

} catch (IOException e) {
logger.error("Error processing file", e);
throw new FileProcessingException("Failed to process file", e);
}
}

private void processFileContent(Path filePath) {
// File processing logic
try (BufferedReader reader = Files.newBufferedReader(filePath)) {
reader.lines()
.filter(line -> !line.trim().isEmpty())
.forEach(this::processLine);
} catch (IOException e) {
logger.error("Error reading file: " + filePath, e);
}
}
}
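
processLine and FileProcessingException are referenced above but not defined in this guide; minimal placeholders so the example stands on its own (the line-level logic is left open):

// Inside FileProcessingService
private void processLine(String line) {
    logger.debug("Processing line of length {}", line.length());
}

// Simple unchecked wrapper used by processFile above
public class FileProcessingException extends RuntimeException {
    public FileProcessingException(String message, Throwable cause) {
        super(message, cause);
    }
}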

Backup Strategy with CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
name: data-backup
spec:
schedule: "0 2 * * *" # Daily at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: alpine:latest
command:
- /bin/sh
- -c
- |
apk add --no-cache tar gzip
DATE=$(date +%Y%m%d_%H%M%S)
tar -czf /backup/data_backup_$DATE.tar.gz -C /app/data .
# Keep only last 7 backups
find /backup -name "data_backup_*.tar.gz" -mtime +7 -delete
volumeMounts:
- name: app-data
mountPath: /app/data
readOnly: true
- name: backup-storage
mountPath: /backup
volumes:
- name: app-data
persistentVolumeClaim:
claimName: java-app-storage
- name: backup-storage
persistentVolumeClaim:
claimName: backup-storage
restartPolicy: OnFailure

Database Connection with Persistent Storage

@Configuration
public class DatabaseConfig {

@Value("${spring.datasource.url}")
private String databaseUrl;

@Bean
@Primary
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(databaseUrl);
config.setUsername("${DB_USERNAME}");
config.setPassword("${DB_PASSWORD}");
config.setMaximumPoolSize(20);
config.setMinimumIdle(5);
config.setConnectionTimeout(30000);
config.setIdleTimeout(600000);
config.setMaxLifetime(1800000);

return new HikariDataSource(config);
}

@Bean
public PlatformTransactionManager transactionManager() {
return new DataSourceTransactionManager(dataSource());
}
}

How do you implement service discovery and communication between Java microservices?

Reference Answer

Kubernetes provides built-in service discovery through DNS, while Java applications can leverage Spring Cloud Kubernetes for enhanced integration.

Service Discovery Architecture


graph TB
subgraph "Kubernetes Cluster"
    subgraph "Namespace: default"
        SVC1[user-service]
        SVC2[order-service]
        SVC3[payment-service]
        
        POD1[User Service Pods]
        POD2[Order Service Pods]
        POD3[Payment Service Pods]
        
        SVC1 --> POD1
        SVC2 --> POD2
        SVC3 --> POD3
    end
    
    DNS[CoreDNS] --> SVC1
    DNS --> SVC2
    DNS --> SVC3
end

POD2 -->|user-service.default.svc.cluster.local| POD1
POD2 -->|payment-service.default.svc.cluster.local| POD3

Service Definitions

# User Service
apiVersion: v1
kind: Service
metadata:
name: user-service
labels:
app: user-service
spec:
selector:
app: user-service
ports:
- name: http
port: 80
targetPort: 8080
type: ClusterIP

---
# Order Service
apiVersion: v1
kind: Service
metadata:
name: order-service
labels:
app: order-service
spec:
selector:
app: order-service
ports:
- name: http
port: 80
targetPort: 8080
type: ClusterIP

---
# Payment Service
apiVersion: v1
kind: Service
metadata:
name: payment-service
labels:
app: payment-service
spec:
selector:
app: payment-service
ports:
- name: http
port: 80
targetPort: 8080
type: ClusterIP

Spring Cloud Kubernetes Configuration

<dependencies>
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-kubernetes-client</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-loadbalancer</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
</dependencies>

Service Discovery Configuration

@Configuration
@EnableDiscoveryClient
public class ServiceDiscoveryConfig {

@Bean
@LoadBalanced
public WebClient.Builder webClientBuilder() {
return WebClient.builder();
}

@Bean
public WebClient webClient(WebClient.Builder builder) {
return builder
.codecs(configurer -> configurer.defaultCodecs().maxInMemorySize(1024 * 1024))
.build();
}
}
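
Besides load-balanced WebClient calls, the underlying DiscoveryClient can be queried directly, which is handy for diagnostics. A short hedged example using the spring-cloud-commons API; the controller and path are illustrative:

@RestController
public class DiscoveryController {

    private final DiscoveryClient discoveryClient;

    public DiscoveryController(DiscoveryClient discoveryClient) {
        this.discoveryClient = discoveryClient;
    }

    @GetMapping("/discovery/{serviceId}")
    public List<String> instances(@PathVariable String serviceId) {
        // Lists host:port for every ready endpoint behind the named Service
        return discoveryClient.getInstances(serviceId).stream()
                .map(instance -> instance.getHost() + ":" + instance.getPort())
                .collect(Collectors.toList());
    }
}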

Inter-Service Communication

@Service
public class OrderService {

private final WebClient webClient;

public OrderService(WebClient webClient) {
this.webClient = webClient;
}

public Mono<UserDto> getUserDetails(String userId) {
return webClient
.get()
.uri("http://user-service/api/users/{userId}", userId)
.retrieve()
.onStatus(HttpStatus::isError, response -> {
return Mono.error(new ServiceException("User service error: " + response.statusCode()));
})
.bodyToMono(UserDto.class)
.timeout(Duration.ofSeconds(5))
.retry(3);
}

public Mono<PaymentResponse> processPayment(PaymentRequest request) {
return webClient
.post()
.uri("http://payment-service/api/payments")
.body(Mono.just(request), PaymentRequest.class)
.retrieve()
.bodyToMono(PaymentResponse.class)
.timeout(Duration.ofSeconds(10));
}

@Transactional
public Mono<OrderDto> createOrder(CreateOrderRequest request) {
return getUserDetails(request.getUserId())
.flatMap(user -> {
Order order = new Order();
order.setUserId(user.getId());
order.setAmount(request.getAmount());
order.setStatus(OrderStatus.PENDING);

return Mono.fromCallable(() -> orderRepository.save(order));
})
.flatMap(order -> {
PaymentRequest paymentRequest = new PaymentRequest();
paymentRequest.setOrderId(order.getId());
paymentRequest.setAmount(order.getAmount());

return processPayment(paymentRequest)
.map(paymentResponse -> {
order.setStatus(paymentResponse.isSuccessful() ?
OrderStatus.CONFIRMED : OrderStatus.FAILED);
return orderRepository.save(order);
});
})
.map(this::toOrderDto);
}
}

Circuit Breaker with Resilience4j

@Component
public class PaymentServiceClient {

private final WebClient webClient;
private final CircuitBreaker circuitBreaker;

public PaymentServiceClient(WebClient webClient) {
this.webClient = webClient;
this.circuitBreaker = CircuitBreaker.ofDefaults("payment-service");
}

public Mono<PaymentResponse> processPayment(PaymentRequest request) {
Supplier<Mono<PaymentResponse>> decoratedSupplier = CircuitBreaker
.decorateSupplier(circuitBreaker, () -> {
return webClient
.post()
.uri("http://payment-service/api/payments")
.body(Mono.just(request), PaymentRequest.class)
.retrieve()
.bodyToMono(PaymentResponse.class)
.timeout(Duration.ofSeconds(5));
});

return decoratedSupplier.get()
.onErrorResume(CallNotPermittedException.class, ex -> {
// Circuit breaker is open
return Mono.just(PaymentResponse.failed("Service temporarily unavailable"));
})
.onErrorResume(TimeoutException.class, ex -> {
return Mono.just(PaymentResponse.failed("Payment service timeout"));
});
}
}
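
Note that the client above builds its breaker with CircuitBreaker.ofDefaults(), which does not read the resilience4j.* instance settings shown in the next snippet. With the resilience4j Spring Boot starter, the annotation style below does pick those settings up; this is a sketch, and the class and fallback names are illustrative:

@Component
public class AnnotatedPaymentServiceClient {

    private final WebClient webClient;

    public AnnotatedPaymentServiceClient(WebClient webClient) {
        this.webClient = webClient;
    }

    // Instance name must match resilience4j.circuitbreaker.instances.payment-service
    @CircuitBreaker(name = "payment-service", fallbackMethod = "paymentFallback")
    @Retry(name = "payment-service")
    public Mono<PaymentResponse> processPayment(PaymentRequest request) {
        return webClient
                .post()
                .uri("http://payment-service/api/payments")
                .body(Mono.just(request), PaymentRequest.class)
                .retrieve()
                .bodyToMono(PaymentResponse.class);
    }

    // Fallback keeps the original signature plus a trailing Throwable parameter
    private Mono<PaymentResponse> paymentFallback(PaymentRequest request, Throwable t) {
        return Mono.just(PaymentResponse.failed("Payment service unavailable: " + t.getMessage()));
    }
}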

Application Properties

spring:
cloud:
kubernetes:
discovery:
enabled: true
all-namespaces: false
wait-cache-ready: true
client:
namespace: default
loadbalancer:
ribbon:
enabled: false

resilience4j:
circuitbreaker:
instances:
payment-service:
registerHealthIndicator: true
slidingWindowSize: 10
minimumNumberOfCalls: 3
failureRateThreshold: 50
waitDurationInOpenState: 10s
permittedNumberOfCallsInHalfOpenState: 2
retry:
instances:
payment-service:
maxAttempts: 3
waitDuration: 1s
exponentialBackoffMultiplier: 2
timelimiter:
instances:
payment-service:
timeoutDuration: 5s

Service Mesh Integration (Istio)

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: payment-service
spec:
hosts:
- payment-service
http:
- match:
- uri:
prefix: /api/payments
route:
- destination:
host: payment-service
port:
number: 80
timeout: 10s
retries:
attempts: 3
perTryTimeout: 3s
retryOn: 5xx,reset,connect-failure,refused-stream

---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: payment-service
spec:
host: payment-service
trafficPolicy:
loadBalancer:
simple: LEAST_CONN
connectionPool:
tcp:
maxConnections: 10
http:
http1MaxPendingRequests: 10
maxRequestsPerConnection: 2
circuitBreaker:
consecutiveErrors: 3
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50

Describe different deployment strategies in Kubernetes

Reference Answer

Kubernetes supports various deployment strategies to minimize downtime and reduce risk during application updates.

Deployment Strategies Overview


graph TB
subgraph "Rolling Update"
    RU1[v1 Pod] --> RU2[v1 Pod]
    RU2 --> RU3[v1 Pod]
    RU4[v2 Pod] --> RU5[v2 Pod]
    RU5 --> RU6[v2 Pod]
end

subgraph "Blue-Green"
    BG1[Blue Environment<br/>v1 Pods] 
    BG2[Green Environment<br/>v2 Pods]
    LB[Load Balancer] --> BG1
    LB -.-> BG2
end

subgraph "Canary"
    C1[v1 Pods - 90%]
    C2[v2 Pods - 10%]
    CLB[Load Balancer] --> C1
    CLB --> C2
end

Rolling Update (Default Strategy)

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app-rolling
spec:
replicas: 6
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # Max pods that can be unavailable
maxSurge: 2 # Max pods that can be created above desired replica count
template:
spec:
containers:
- name: java-app
image: my-java-app:v2
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10

Blue-Green Deployment

# Blue deployment (current version)
apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app-blue
labels:
version: blue
spec:
replicas: 3
selector:
matchLabels:
app: java-app
version: blue
template:
metadata:
labels:
app: java-app
version: blue
spec:
containers:
- name: java-app
image: my-java-app:v1

---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app-green
labels:
version: green
spec:
replicas: 3
selector:
matchLabels:
app: java-app
version: green
template:
metadata:
labels:
app: java-app
version: green
spec:
containers:
- name: java-app
image: my-java-app:v2

---
# Service pointing to blue (active) version
apiVersion: v1
kind: Service
metadata:
name: java-app-service
spec:
selector:
app: java-app
version: blue # Switch to 'green' when ready
ports:
- port: 80
targetPort: 8080

Canary Deployment with Istio

# Primary deployment (90% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app-primary
spec:
replicas: 9
selector:
matchLabels:
app: java-app
version: primary
template:
metadata:
labels:
app: java-app
version: primary
spec:
containers:
- name: java-app
image: my-java-app:v1

---
# Canary deployment (10% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app-canary
spec:
replicas: 1
selector:
matchLabels:
app: java-app
version: canary
template:
metadata:
labels:
app: java-app
version: canary
spec:
containers:
- name: java-app
image: my-java-app:v2

---
# VirtualService for traffic splitting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: java-app
spec:
hosts:
- java-app
http:
- match:
- headers:
canary:
exact: "true"
route:
- destination:
host: java-app
subset: canary
- route:
- destination:
host: java-app
subset: primary
weight: 90
- destination:
host: java-app
subset: canary
weight: 10

---
# DestinationRule defining subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: java-app
spec:
host: java-app
subsets:
- name: primary
labels:
version: primary
- name: canary
labels:
version: canary

Recreate Strategy

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app-recreate
spec:
replicas: 3
strategy:
type: Recreate # Terminates all old pods before creating new ones
template:
spec:
containers:
- name: java-app
image: my-java-app:v2

Deployment Automation Script

#!/bin/bash

DEPLOYMENT_NAME="java-app"
NEW_IMAGE="my-java-app:v2"
NAMESPACE="default"

# Rolling update
kubectl set image deployment/$DEPLOYMENT_NAME java-app=$NEW_IMAGE -n $NAMESPACE

# Wait for rollout to complete
kubectl rollout status deployment/$DEPLOYMENT_NAME -n $NAMESPACE

# Check if rollout was successful
if [ $? -eq 0 ]; then
echo "Deployment successful"

# Optional: Run smoke tests
kubectl run smoke-test --rm -i --restart=Never --image=curlimages/curl -- \
curl -f http://java-app-service/health/ready

if [ $? -eq 0 ]; then
echo "Smoke tests passed"
else
echo "Smoke tests failed, rolling back"
kubectl rollout undo deployment/$DEPLOYMENT_NAME -n $NAMESPACE
fi
else
echo "Deployment failed, rolling back"
kubectl rollout undo deployment/$DEPLOYMENT_NAME -n $NAMESPACE
fi

Spring Boot Graceful Shutdown

@Component
public class GracefulShutdownHook {

private static final Logger logger = LoggerFactory.getLogger(GracefulShutdownHook.class);

@EventListener
public void onApplicationEvent(ContextClosedEvent event) {
logger.info("Received shutdown signal, starting graceful shutdown...");

// Allow ongoing requests to complete
try {
Thread.sleep(5000); // Grace period
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}

logger.info("Graceful shutdown completed");
}
}
# application.properties
server.shutdown=graceful
spring.lifecycle.timeout-per-shutdown-phase=30s

How do you handle logging and monitoring for Java applications in Kubernetes?

Reference Answer

Comprehensive logging and monitoring in Kubernetes requires centralized log aggregation, metrics collection, and distributed tracing.

Logging Architecture


graph TB
subgraph "Kubernetes Cluster"
    subgraph "Application Pods"
        APP1[Java App 1] --> LOGS1[stdout/stderr]
        APP2[Java App 2] --> LOGS2[stdout/stderr]
        APP3[Java App 3] --> LOGS3[stdout/stderr]
    end
    
    subgraph "Log Collection"
        FLUENTD[Fluentd DaemonSet]
        LOGS1 --> FLUENTD
        LOGS2 --> FLUENTD
        LOGS3 --> FLUENTD
    end
    
    subgraph "Monitoring"
        PROMETHEUS[Prometheus]
        APP1 --> PROMETHEUS
        APP2 --> PROMETHEUS
        APP3 --> PROMETHEUS
    end
end

subgraph "External Systems"
    FLUENTD --> ELASTICSEARCH[Elasticsearch]
    ELASTICSEARCH --> KIBANA[Kibana]
    PROMETHEUS --> GRAFANA[Grafana]
end

Structured Logging Configuration

<!-- logback-spring.xml -->
<configuration>
<springProfile name="!local">
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
<providers>
<timestamp/>
<logLevel/>
<loggerName/>
<message/>
<mdc/>
<arguments/>
<stackTrace/>
<pattern>
<pattern>
{
"service": "java-app",
"version": "${APP_VERSION:-unknown}",
"pod": "${HOSTNAME:-unknown}",
"namespace": "${POD_NAMESPACE:-default}"
}
</pattern>
</pattern>
</providers>
</encoder>
</appender>
</springProfile>

<springProfile name="local">
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
</springProfile>

<root level="INFO">
<appender-ref ref="STDOUT"/>
</root>

<logger name="com.example" level="DEBUG"/>
<logger name="org.springframework.web" level="DEBUG"/>
</configuration>

Application Logging Code

@RestController
@Slf4j
public class OrderController {

@Autowired
private OrderService orderService;

@PostMapping("/orders")
public ResponseEntity<OrderDto> createOrder(@RequestBody CreateOrderRequest request) {
String correlationId = UUID.randomUUID().toString();

// Add correlation ID to MDC for request tracing
MDC.put("correlationId", correlationId);
MDC.put("operation", "createOrder");
MDC.put("userId", request.getUserId());

try {
log.info("Creating order for user: {}", request.getUserId());

OrderDto order = orderService.createOrder(request);

log.info("Order created successfully: orderId={}, amount={}",
order.getId(), order.getAmount());

return ResponseEntity.ok(order);

} catch (Exception e) {
log.error("Failed to create order: {}", e.getMessage(), e);
throw e;
} finally {
MDC.clear();
}
}
}
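
Rather than managing MDC keys in every controller, the same correlation-ID pattern can be applied once in a filter. A hedged sketch; the header name is a common convention rather than a standard, and the servlet imports depend on the Spring Boot generation in use:

@Component
public class CorrelationIdFilter extends OncePerRequestFilter {

    private static final String CORRELATION_HEADER = "X-Correlation-Id";

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain)
            throws ServletException, IOException {
        String correlationId = request.getHeader(CORRELATION_HEADER);
        if (correlationId == null || correlationId.isEmpty()) {
            correlationId = UUID.randomUUID().toString();
        }
        MDC.put("correlationId", correlationId);
        response.setHeader(CORRELATION_HEADER, correlationId);
        try {
            filterChain.doFilter(request, response);
        } finally {
            MDC.clear(); // never leak MDC state across pooled request threads
        }
    }
}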

Fluentd Configuration

apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*java-app*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
format json
read_from_head true
</source>

<filter kubernetes.**>
@type kubernetes_metadata
</filter>

<filter kubernetes.**>
@type parser
key_name log
reserve_data true
<parse>
@type json
</parse>
</filter>

<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
index_name kubernetes
type_name _doc
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.system.buffer
flush_mode interval
retry_type exponential_backoff
flush_thread_count 2
flush_interval 5s
retry_forever
retry_max_interval 30
chunk_limit_size 2M
queue_limit_length 8
overflow_action block
</buffer>
</match>

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: kube-system
spec:
selector:
matchLabels:
name: fluentd
template:
metadata:
labels:
name: fluentd
spec:
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: config-volume
mountPath: /fluentd/etc
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: config-volume
configMap:
name: fluentd-config

Prometheus Metrics with Micrometer

@Component
public class MetricsConfig {

@Bean
public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
return registry -> registry.config().commonTags(
"application", "java-app",
"version", System.getProperty("app.version", "unknown")
);
}

@Bean
public TimedAspect timedAspect(MeterRegistry registry) {
return new TimedAspect(registry);
}
}

@Service
@Slf4j
public class OrderService {

private final Counter orderCreatedCounter;
private final Timer orderProcessingTimer;
private final Gauge activeOrdersGauge;

public OrderService(MeterRegistry meterRegistry) {
this.orderCreatedCounter = Counter.builder("orders.created")
.description("Number of orders created")
.register(meterRegistry);

this.orderProcessingTimer = Timer.builder("orders.processing.time")
.description("Order processing time")
.register(meterRegistry);

// Gauge.builder takes the state object and value function up front
this.activeOrdersGauge = Gauge.builder("orders.active", this, OrderService::getActiveOrderCount)
.description("Number of active orders")
.register(meterRegistry);
}

@Timed(value = "orders.create", description = "Create order operation")
public OrderDto createOrder(CreateOrderRequest request) {
// Timer.record(Supplier) measures the whole call; counter tags are fixed at
// build time, so failures would need their own pre-built counter
return orderProcessingTimer.record(() -> {
try {
// Order creation logic
OrderDto order = processOrder(request);
orderCreatedCounter.increment();
return order;
} catch (Exception e) {
// increment a dedicated failure counter here if one is registered
throw e;
}
});
}

public double getActiveOrderCount() {
// Return current active order count
return orderRepository.countByStatus(OrderStatus.ACTIVE);
}
}

Deployment with Monitoring Annotations

apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/path: "/actuator/prometheus"
prometheus.io/port: "8080"
labels:
app: java-app
spec:
containers:
- name: java-app
image: my-java-app:1.0
ports:
- containerPort: 8080
name: http
env:
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: APP_VERSION
value: "1.0"

Distributed Tracing with Jaeger

<dependency>
<groupId>io.opentracing.contrib</groupId>
<artifactId>opentracing-spring-jaeger-cloud-starter</artifactId>
</dependency>
@RestController
public class OrderController {

@Autowired
private Tracer tracer;

@PostMapping("/orders")
public ResponseEntity<OrderDto> createOrder(@RequestBody CreateOrderRequest request) {
// io.opentracing API, matching the opentracing-spring-jaeger-cloud-starter dependency above
Span span = tracer.buildSpan("create-order")
.withTag("user.id", request.getUserId())
.withTag("order.amount", String.valueOf(request.getAmount()))
.start();

try (Scope scope = tracer.activateSpan(span)) {
OrderDto order = orderService.createOrder(request);
span.setTag("order.id", order.getId());
return ResponseEntity.ok(order);
} catch (Exception e) {
span.setTag("error", true);
span.setTag("error.message", e.getMessage());
throw e;
} finally {
span.finish();
}
}
}

Application Properties for Monitoring

management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus,loggers
endpoint:
health:
show-details: always
metrics:
enabled: true
metrics:
export:
prometheus:
enabled: true
distribution:
percentiles-histogram:
http.server.requests: true
percentiles:
http.server.requests: 0.5, 0.95, 0.99
sla:
http.server.requests: 100ms, 200ms, 500ms

opentracing:
jaeger:
service-name: java-app
sampler:
type: const
param: 1
log-spans: true

logging:
level:
io.jaeger: INFO
io.opentracing: INFO

What are the security considerations for Java applications in Kubernetes?

Reference Answer

Security in Kubernetes requires a multi-layered approach covering container security, network policies, RBAC, and secure configuration management.

Security Architecture


graph TB
subgraph "Cluster Security"
    RBAC[RBAC] --> API[API Server]
    PSP[Pod Security Standards] --> PODS[Pods]
    NP[Network Policies] --> NETWORK[Pod Network]
end

subgraph "Pod Security"
    SECCTX[Security Context] --> CONTAINER[Container]
    SECRETS[Secrets] --> CONTAINER
    SA[Service Account] --> CONTAINER
end

subgraph "Image Security"
    SCAN[Image Scanning] --> REGISTRY[Container Registry]
    SIGN[Image Signing] --> REGISTRY
end

Pod Security Context

apiVersion: apps/v1
kind: Deployment
metadata:
name: secure-java-app
spec:
template:
spec:
# Pod-level security context
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault

containers:
- name: java-app
image: my-java-app:1.0
# Container-level security context
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE # Only if needed for port < 1024

volumeMounts:
- name: tmp-volume
mountPath: /tmp
- name: cache-volume
mountPath: /app/cache

resources:
limits:
memory: "1Gi"
cpu: "500m"
requests:
memory: "512Mi"
cpu: "250m"

volumes:
- name: tmp-volume
emptyDir: {}
- name: cache-volume
emptyDir: {}

Secure Dockerfile

FROM openjdk:11-jre-slim

# Create non-root user
RUN groupadd -r appgroup && useradd -r -g appgroup -u 1000 appuser

# Create app directory
RUN mkdir -p /app/logs /app/cache && \
chown -R appuser:appgroup /app

# Copy application
COPY --chown=appuser:appgroup target/app.jar /app/app.jar

# Switch to non-root user
USER appuser

WORKDIR /app

# Expose port
EXPOSE 8080

# Health check (curl must be present in the image; the slim base does not ship it by default)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8080/actuator/health || exit 1

ENTRYPOINT ["java", "-jar", "app.jar"]

Network Policies

# Deny all ingress traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
spec:
podSelector: {}
policyTypes:
- Ingress

---
# Allow specific ingress traffic to Java app
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: java-app-ingress
spec:
podSelector:
matchLabels:
app: java-app
policyTypes:
- Ingress
- Egress
ingress:
# Allow traffic from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 8080
# Allow traffic from same namespace
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
# Allow DNS resolution
- to: []
ports:
- protocol: UDP
port: 53
# Allow database access
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
# Allow external API calls
- to: []
ports:
- protocol: TCP
port: 443

RBAC Configuration

# Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
name: java-app-sa
namespace: default

---
# Role with minimal permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: java-app-role
namespace: default
rules:
- apiGroups: [""]
resources: ["configmaps", "secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]

---
# Role Binding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: java-app-binding
namespace: default
subjects:
- kind: ServiceAccount
name: java-app-sa
namespace: default
roleRef:
kind: Role
name: java-app-role
apiGroup: rbac.authorization.k8s.io

---
# Deployment using Service Account
apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app
spec:
template:
spec:
serviceAccountName: java-app-sa
automountServiceAccountToken: false # Disable if not needed
containers:
- name: java-app
image: my-java-app:1.0

Secrets Management

# Create secret from command line (better than YAML)
# kubectl create secret generic db-credentials \
# --from-literal=username=dbuser \
# --from-literal=password=securepassword

apiVersion: v1
kind: Secret
metadata:
name: db-credentials
type: Opaque
data:
username: ZGJ1c2Vy # base64 encoded
password: c2VjdXJlcGFzc3dvcmQ= # base64 encoded

---
apiVersion: apps/v1
kind: Deployment
metadata:
name: java-app
spec:
template:
spec:
containers:
- name: java-app
image: my-java-app:1.0
env:
- name: DB_USERNAME
valueFrom:
secretKeyRef:
name: db-credentials
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-credentials
key: password
# Or mount as files
volumeMounts:
- name: db-credentials
mountPath: /etc/secrets
readOnly: true
volumes:
- name: db-credentials
secret:
secretName: db-credentials
defaultMode: 0400 # Read-only for owner

Pod Security Standards

apiVersion: v1
kind: Namespace
metadata:
name: secure-namespace
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted

Security Configuration in Java

@Configuration
@EnableWebSecurity
public class SecurityConfig {

@Bean
public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
http
.sessionManagement()
.sessionCreationPolicy(SessionCreationPolicy.STATELESS)
.and()
.headers()
.frameOptions().deny()
.contentTypeOptions()
.and()
.httpStrictTransportSecurity(hstsConfig -> hstsConfig
.maxAgeInSeconds(31536000)
.includeSubdomains(true))
.and()
.csrf().disable()
.authorizeHttpRequests(authz -> authz
.requestMatchers("/actuator/health", "/actuator/info").permitAll()
.requestMatchers("/actuator/**").hasRole("ADMIN")
.anyRequest().authenticated()
)
.oauth2ResourceServer(OAuth2ResourceServerConfigurer::jwt);

return http.build();
}
}

@RestController
@Slf4j
public class SecureController {

@GetMapping("/secure-endpoint")
@PreAuthorize("hasRole('USER')")
public ResponseEntity<String> secureEndpoint(Authentication authentication) {
// Log access attempt
log.info("Secure endpoint accessed by: {}", authentication.getName());

return ResponseEntity.ok("Secure data");
}
}

Image Scanning with Trivy

apiVersion: batch/v1
kind: Job
metadata:
name: image-scan
spec:
template:
spec:
restartPolicy: Never
containers:
- name: trivy
image: aquasec/trivy:latest
command:
- trivy
- image
- --exit-code
- "1"
- --severity
- HIGH,CRITICAL
- my-java-app:1.0

Admission Controller (OPA Gatekeeper)

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredsecuritycontext
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredSecurityContext
      validation:
        properties:
          runAsNonRoot:
            type: boolean
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8srequiredsecuritycontext

      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        not container.securityContext.runAsNonRoot
        msg := "Container must run as non-root user"
      }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredSecurityContext
metadata:
  name: must-run-as-non-root
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment"]
  parameters:
    runAsNonRoot: true

How do you debug issues in Java applications running on Kubernetes?

Reference Answer

Debugging Kubernetes applications requires understanding both Kubernetes diagnostics and Java-specific debugging techniques.

Debugging Workflow


graph TD
ISSUE[Application Issue] --> CHECK1{Pod Status}
CHECK1 -->|Running| CHECK2{Logs Analysis}
CHECK1 -->|Pending| EVENTS[Check Events]
CHECK1 -->|Failed| DESCRIBE[Describe Pod]

CHECK2 -->|App Logs| APPLOGS[Application Logs]
CHECK2 -->|System Logs| SYSLOGS[System Logs]

EVENTS --> RESOURCES{Resource Issues}
DESCRIBE --> CONFIG{Config Issues}

APPLOGS --> METRICS[Check Metrics]
SYSLOGS --> NETWORK[Network Debug]

RESOURCES --> SCALE[Scale Resources]
CONFIG --> FIX[Fix Configuration]

METRICS --> PROFILE[Java Profiling]
NETWORK --> CONNECTIVITY[Test Connectivity]

Basic Kubernetes Debugging Commands

# Check pod status
kubectl get pods -l app=java-app

# Describe pod for detailed information
kubectl describe pod <pod-name>

# Get pod logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous # Previous container logs
kubectl logs <pod-name> -c <container-name> # Multi-container pod

# Follow logs in real-time
kubectl logs -f <pod-name>

# Get logs from all pods with label
kubectl logs -l app=java-app --tail=100

# Check events
kubectl get events --sort-by=.metadata.creationTimestamp

# Execute commands in pod
kubectl exec -it <pod-name> -- /bin/bash
kubectl exec -it <pod-name> -- ps aux
kubectl exec -it <pod-name> -- netstat -tlnp

Java Application Debugging

@RestController
public class DebugController {

private static final Logger logger = LoggerFactory.getLogger(DebugController.class);

@Autowired
private MeterRegistry meterRegistry;

@GetMapping("/debug/health")
public Map<String, Object> getDetailedHealth() {
Map<String, Object> health = new HashMap<>();

// JVM Memory info
MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();
MemoryUsage nonHeapUsage = memoryBean.getNonHeapMemoryUsage();

Map<String, Object> memory = new HashMap<>();
memory.put("heap", Map.of(
"used", heapUsage.getUsed(),
"committed", heapUsage.getCommitted(),
"max", heapUsage.getMax()
));
memory.put("nonHeap", Map.of(
"used", nonHeapUsage.getUsed(),
"committed", nonHeapUsage.getCommitted(),
"max", nonHeapUsage.getMax()
));

// Thread info
ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
Map<String, Object> threads = new HashMap<>();
threads.put("count", threadBean.getThreadCount());
threads.put("peak", threadBean.getPeakThreadCount());
threads.put("daemon", threadBean.getDaemonThreadCount());

// GC info
List<GarbageCollectorMXBean> gcBeans = ManagementFactory.getGarbageCollectorMXBeans();
List<Map<String, Object>> gcInfo = gcBeans.stream()
.map(gc -> Map.of(
"name", gc.getName(),
"collections", gc.getCollectionCount(),
"time", gc.getCollectionTime()
))
.collect(Collectors.toList());

health.put("timestamp", Instant.now());
health.put("memory", memory);
health.put("threads", threads);
health.put("gc", gcInfo);
health.put("uptime", ManagementFactory.getRuntimeMXBean().getUptime());

return health;
}

@GetMapping("/debug/threads")
public Map<String, Object> getThreadDump() {
ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
ThreadInfo[] threadInfos = threadBean.dumpAllThreads(true, true);

Map<String, Object> dump = new HashMap<>();
dump.put("timestamp", Instant.now());
dump.put("threadCount", threadInfos.length);

List<Map<String, Object>> threads = Arrays.stream(threadInfos)
.map(info -> {
Map<String, Object> thread = new HashMap<>();
thread.put("name", info.getThreadName());
thread.put("state", info.getThreadState().toString());
thread.put("blocked", info.getBlockedCount());
thread.put("waited", info.getWaitedCount());

if (info.getLockInfo() != null) {
thread.put("lock", info.getLockInfo().toString());
}

return thread;
})
.collect(Collectors.toList());

dump.put("threads", threads);
return dump;
}

@PostMapping("/debug/gc")
public String triggerGC() {
logger.warn("Manually triggering garbage collection - this should not be done in production");
System.gc();
return "GC triggered";
}
}

Remote Debugging Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app-debug
spec:
  replicas: 1  # Single replica for debugging
  template:
    spec:
      containers:
      - name: java-app
        image: my-java-app:1.0
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 5005
          name: debug
        env:
        - name: JAVA_OPTS
          value: >-
            -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005
            -XX:+UseContainerSupport
            -XX:MaxRAMPercentage=75.0
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"

---
apiVersion: v1
kind: Service
metadata:
  name: java-app-debug-service
spec:
  selector:
    app: java-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080
  - name: debug
    port: 5005
    targetPort: 5005
  type: ClusterIP

Port Forwarding for Local Debugging

# Forward application port
kubectl port-forward deployment/java-app 8080:8080

# Forward debug port
kubectl port-forward deployment/java-app-debug 5005:5005

# Forward multiple ports
kubectl port-forward pod/<pod-name> 8080:8080 5005:5005

# Connect your IDE debugger to localhost:5005

Performance Debugging with JVM Tools

# Execute JVM diagnostic commands in pod
kubectl exec -it <pod-name> -- jps -l
kubectl exec -it <pod-name> -- jstat -gc <pid> 5s
kubectl exec -it <pod-name> -- jstack <pid>
kubectl exec -it <pod-name> -- jmap -histo <pid>

# Create heap dump
kubectl exec -it <pod-name> -- jcmd <pid> GC.run   # optionally trigger a GC first
kubectl exec -it <pod-name> -- jcmd <pid> GC.heap_dump /tmp/heapdump.hprof

# Copy heap dump to local machine
kubectl cp <pod-name>:/tmp/heapdump.hprof ./heapdump.hprof

Network Debugging

# Network debugging pod
apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  containers:
  - name: network-tools
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "while true; do ping localhost; sleep 30; done"]
  restartPolicy: Always
# Test network connectivity
kubectl exec -it network-debug -- nslookup java-app-service
kubectl exec -it network-debug -- curl -v http://java-app-service:8080/health
kubectl exec -it network-debug -- telnet java-app-service 8080

# Check DNS resolution
kubectl exec -it network-debug -- dig java-app-service.default.svc.cluster.local

# Test external connectivity
kubectl exec -it network-debug -- curl -v https://api.external-service.com

# Network policy testing
kubectl exec -it network-debug -- nc -zv java-app-service 8080

Debugging Init Containers

apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app-with-init
spec:
  template:
    spec:
      initContainers:
      - name: wait-for-db
        image: busybox:1.28
        command: ['sh', '-c']
        args:
        - |
          echo "Waiting for database..."
          until nc -z database-service 5432; do
            echo "Database not ready, waiting..."
            sleep 2
          done
          echo "Database is ready!"
      - name: migrate-db
        image: migrate/migrate
        command: ["/migrate"]
        args:
        - "-path=/migrations"
        - "-database=postgresql://user:pass@database-service:5432/db?sslmode=disable"
        - "up"
        volumeMounts:
        - name: migrations
          mountPath: /migrations
      containers:
      - name: java-app
        image: my-java-app:1.0
      volumes:
      - name: migrations
        configMap:
          name: db-migrations
# Check init container logs
kubectl logs <pod-name> -c wait-for-db
kubectl logs <pod-name> -c migrate-db

# Describe pod to see init container status
kubectl describe pod <pod-name>

Application Metrics Debugging

@Component
public class CustomMetrics {

    private final MeterRegistry meterRegistry;
    private final Timer httpRequestDuration;

    public CustomMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        this.httpRequestDuration = Timer.builder("http_request_duration")
            .description("HTTP request duration")
            .register(meterRegistry);

        // Gauges are registered against a state object and a value function
        Gauge.builder("active_connections", this, CustomMetrics::getActiveConnections)
            .description("Active database connections")
            .register(meterRegistry);
    }

    public void recordRequest(String method, String endpoint, long duration) {
        // Counters with dynamic tags are resolved per method/endpoint combination
        meterRegistry.counter("http_requests_total", "method", method, "endpoint", endpoint)
            .increment();
        httpRequestDuration.record(duration, TimeUnit.MILLISECONDS);
    }

    public double getActiveConnections() {
        // Return actual active connection count
        return 10.0; // Placeholder
    }
}

Debugging Persistent Volumes

# Check PV and PVC status
kubectl get pv
kubectl get pvc

# Describe PVC for detailed info
kubectl describe pvc java-app-storage

# Check mounted volumes in pod
kubectl exec -it <pod-name> -- df -h
kubectl exec -it <pod-name> -- ls -la /app/data

# Check file permissions
kubectl exec -it <pod-name> -- ls -la /app/data
kubectl exec -it <pod-name> -- id

# Test file creation
kubectl exec -it <pod-name> -- touch /app/data/test.txt
kubectl exec -it <pod-name> -- sh -c 'echo "test" > /app/data/test.txt'  # redirection must run inside the pod

Resource Usage Investigation

# Check resource usage
kubectl top pods
kubectl top nodes

# Get detailed resource information
kubectl describe node <node-name>

# Check resource quotas
kubectl get resourcequota
kubectl describe resourcequota

# Check limit ranges
kubectl get limitrange
kubectl describe limitrange

Debugging ConfigMaps and Secrets

# Check ConfigMap content
kubectl get configmap java-app-config -o yaml

# Check Secret content (base64 encoded)
kubectl get secret java-app-secrets -o yaml

# Decode secret values
kubectl get secret java-app-secrets -o jsonpath='{.data.database-password}' | base64 --decode

# Check mounted config in pod
kubectl exec -it <pod-name> -- cat /app/config/application.properties
kubectl exec -it <pod-name> -- env | grep -i database

Automated Debugging Script

#!/bin/bash

APP_NAME="java-app"
NAMESPACE="default"

echo "=== Kubernetes Debugging Report for $APP_NAME ==="
echo "Timestamp: $(date)"
echo

echo "=== Pod Status ==="
kubectl get pods -l app=$APP_NAME -n $NAMESPACE
echo

echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp -n $NAMESPACE | tail -10
echo

echo "=== Pod Description ==="
POD_NAME=$(kubectl get pods -l app=$APP_NAME -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')
kubectl describe pod $POD_NAME -n $NAMESPACE
echo

echo "=== Application Logs (last 50 lines) ==="
kubectl logs $POD_NAME -n $NAMESPACE --tail=50
echo

echo "=== Resource Usage ==="
kubectl top pod $POD_NAME -n $NAMESPACE
echo

echo "=== Service Status ==="
kubectl get svc -l app=$APP_NAME -n $NAMESPACE
echo

echo "=== ConfigMap Status ==="
kubectl get configmap -l app=$APP_NAME -n $NAMESPACE
echo

echo "=== Secret Status ==="
kubectl get secret -l app=$APP_NAME -n $NAMESPACE
echo

echo "=== Network Connectivity Test ==="
kubectl run debug-pod --rm -i --restart=Never --image=nicolaka/netshoot -- \
/bin/bash -c "nslookup $APP_NAME-service.$NAMESPACE.svc.cluster.local && \
curl -s -o /dev/null -w '%{http_code}' http://$APP_NAME-service.$NAMESPACE.svc.cluster.local:8080/health"

Explain Ingress and how to expose Java applications externally

Reference Answer

Ingress provides HTTP and HTTPS routing to services within a Kubernetes cluster, acting as a reverse proxy and load balancer for external traffic.

Ingress Architecture


graph TB
INTERNET[Internet] --> LB[Load Balancer]
LB --> INGRESS[Ingress Controller]

subgraph "Kubernetes Cluster"
    INGRESS --> INGRESS_RULES[Ingress Rules]
    INGRESS_RULES --> SVC1[Java App Service]
    INGRESS_RULES --> SVC2[API Service]
    INGRESS_RULES --> SVC3[Frontend Service]
    
    SVC1 --> POD1[Java App Pods]
    SVC2 --> POD2[API Pods]
    SVC3 --> POD3[Frontend Pods]
end

Basic Ingress Configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: java-app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: java-app-service
            port:
              number: 80

Advanced Ingress with Path-based Routing

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: microservices-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls
  rules:
  - host: api.example.com
    http:
      paths:
      # User service
      - path: /api/users(/|$)(.*)
        pathType: Prefix
        backend:
          service:
            name: user-service
            port:
              number: 80
      # Order service
      - path: /api/orders(/|$)(.*)
        pathType: Prefix
        backend:
          service:
            name: order-service
            port:
              number: 80
      # Payment service
      - path: /api/payments(/|$)(.*)
        pathType: Prefix
        backend:
          service:
            name: payment-service
            port:
              number: 80
      # Default fallback
      - path: /(.*)
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80

Java Application Configuration for Ingress

@RestController
@RequestMapping("/api/orders")
public class OrderController {

// Handle base path properly
@GetMapping("/health")
public ResponseEntity<Map<String, String>> health(HttpServletRequest request) {
Map<String, String> health = new HashMap<>();
health.put("status", "UP");
health.put("service", "order-service");
health.put("path", request.getRequestURI());
health.put("forwardedPath", request.getHeader("X-Forwarded-Prefix"));

return ResponseEntity.ok(health);
}

@GetMapping
public ResponseEntity<List<OrderDto>> getOrders(
@RequestParam(defaultValue = "0") int page,
@RequestParam(defaultValue = "10") int size,
HttpServletRequest request) {

// Log forwarded headers for debugging
String forwardedFor = request.getHeader("X-Forwarded-For");
String forwardedProto = request.getHeader("X-Forwarded-Proto");
String forwardedHost = request.getHeader("X-Forwarded-Host");

log.info("Request from: {} via {} to {}", forwardedFor, forwardedProto, forwardedHost);

List<OrderDto> orders = orderService.getOrders(page, size);
return ResponseEntity.ok(orders);
}
}

Spring Boot Configuration for Proxy Headers

server:
  port: 8080
  servlet:
    context-path: /
  forward-headers-strategy: native
  tomcat:
    remoteip:
      remote-ip-header: X-Forwarded-For
      protocol-header: X-Forwarded-Proto
      port-header: X-Forwarded-Port

management:
  server:
    port: 8081
  endpoints:
    web:
      base-path: /actuator
      exposure:
        include: health,info,metrics,prometheus

SSL/TLS Certificate Management

# Using cert-manager for automatic SSL certificates
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: java-app-ssl-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls-auto  # Will be created by cert-manager
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: java-app-service
            port:
              number: 80

Ingress with Authentication

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: java-app-auth-ingress
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required'
    # Or OAuth2 authentication
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.com/oauth2/start"
spec:
  ingressClassName: nginx
  rules:
  - host: secure.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: java-app-service
            port:
              number: 80

---
# Basic auth secret
apiVersion: v1
kind: Secret
metadata:
  name: basic-auth
type: Opaque
data:
  auth: YWRtaW46JGFwcjEkSDY1dnVhNU8kblNEOC9ObDBINFkwL3pmWUZOcUI4MQ==  # admin:admin

Custom Error Pages

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-error-pages
data:
  404.html: |
    <!DOCTYPE html>
    <html>
    <head>
      <title>Page Not Found</title>
      <style>
        body { font-family: Arial, sans-serif; text-align: center; margin-top: 50px; }
        .error-code { font-size: 72px; color: #e74c3c; }
        .error-message { font-size: 24px; color: #7f8c8d; }
      </style>
    </head>
    <body>
      <div class="error-code">404</div>
      <div class="error-message">The page you're looking for doesn't exist.</div>
      <p><a href="/">Go back to homepage</a></p>
    </body>
    </html>
  500.html: |
    <!DOCTYPE html>
    <html>
    <head>
      <title>Internal Server Error</title>
      <style>
        body { font-family: Arial, sans-serif; text-align: center; margin-top: 50px; }
        .error-code { font-size: 72px; color: #e74c3c; }
        .error-message { font-size: 24px; color: #7f8c8d; }
      </style>
    </head>
    <body>
      <div class="error-code">500</div>
      <div class="error-message">Something went wrong on our end.</div>
      <p>Please try again later.</p>
    </body>
    </html>

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: java-app-custom-errors
  annotations:
    nginx.ingress.kubernetes.io/custom-http-errors: "404,500,503"
    nginx.ingress.kubernetes.io/default-backend: error-pages-service
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: java-app-service
            port:
              number: 80

Load Balancing and Session Affinity

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: java-app-sticky-sessions
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "JSESSIONID"
    nginx.ingress.kubernetes.io/session-cookie-expires: "86400"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "86400"
    nginx.ingress.kubernetes.io/session-cookie-path: "/"
    nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: java-app-service
            port:
              number: 80

Ingress Health Checks and Monitoring

@RestController
public class IngressHealthController {

@GetMapping("/health/ingress")
public ResponseEntity<Map<String, Object>> ingressHealth(HttpServletRequest request) {
Map<String, Object> health = new HashMap<>();
health.put("status", "UP");
health.put("timestamp", Instant.now());

// Include request information for debugging
Map<String, String> requestInfo = new HashMap<>();
requestInfo.put("remoteAddr", request.getRemoteAddr());
requestInfo.put("forwardedFor", request.getHeader("X-Forwarded-For"));
requestInfo.put("forwardedProto", request.getHeader("X-Forwarded-Proto"));
requestInfo.put("forwardedHost", request.getHeader("X-Forwarded-Host"));
requestInfo.put("userAgent", request.getHeader("User-Agent"));

health.put("request", requestInfo);

return ResponseEntity.ok(health);
}

@GetMapping("/ready")
public ResponseEntity<String> readiness() {
// Perform readiness checks
if (isApplicationReady()) {
return ResponseEntity.ok("Ready");
} else {
return ResponseEntity.status(503).body("Not Ready");
}
}

private boolean isApplicationReady() {
// Check database connectivity, external services, etc.
return true;
}
}

Ingress Controller Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  # Global settings
  proxy-connect-timeout: "60"
  proxy-send-timeout: "60"
  proxy-read-timeout: "60"
  proxy-body-size: "100m"

  # Performance tuning
  worker-processes: "auto"
  worker-connections: "1024"
  keepalive-timeout: "65"
  keepalive-requests: "100"

  # Security headers
  add-headers: "ingress-nginx/custom-headers"

  # Compression
  enable-gzip: "true"
  gzip-types: "text/plain text/css application/json application/javascript text/xml application/xml"

  # Rate limiting
  rate-limit: "1000"
  rate-limit-window: "1m"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-headers
  namespace: ingress-nginx
data:
  X-Content-Type-Options: "nosniff"
  X-Frame-Options: "DENY"
  X-XSS-Protection: "1; mode=block"
  Strict-Transport-Security: "max-age=31536000; includeSubDomains"
  Content-Security-Policy: "default-src 'self'"

Testing Ingress Configuration

# Test basic connectivity
curl -H "Host: myapp.example.com" http://<ingress-ip>/health

# Test SSL
curl -H "Host: myapp.example.com" https://<ingress-ip>/health

# Test with custom headers
curl -H "Host: myapp.example.com" -H "X-Custom-Header: test" http://<ingress-ip>/api/orders

# Test different paths
curl -H "Host: api.example.com" http://<ingress-ip>/api/users/health
curl -H "Host: api.example.com" http://<ingress-ip>/api/orders/health

# Debug ingress controller logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller -f

# Check ingress status
kubectl get ingress
kubectl describe ingress java-app-ingress

This comprehensive guide covers the essential Kubernetes concepts and practical implementations that senior Java developers need to understand when working with containerized applications in production environments.

Core Java Concepts

What’s the difference between == and .equals() in Java?

Reference Answer:

  • == compares references (memory addresses) for objects and values for primitives
  • .equals() compares the actual content/state of objects
  • By default, .equals() uses == (reference comparison) unless overridden
  • When overriding .equals(), you must also override .hashCode() to maintain the contract: if two objects are equal according to .equals(), they must have the same hash code
  • String pool example: "hello" == "hello" is true due to string interning, but new String("hello") == new String("hello") is false
String s1 = "hello";
String s2 = "hello";
String s3 = new String("hello");

s1 == s2; // true (same reference in string pool)
s1 == s3; // false (different references)
s1.equals(s3); // true (same content)
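
To illustrate the equals()/hashCode() contract mentioned above, a minimal sketch of a class that overrides both consistently (the class and fields are illustrative):

import java.util.Objects;

public final class Coordinates {
    private final int x;
    private final int y;

    public Coordinates(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Coordinates)) return false;
        Coordinates other = (Coordinates) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Must agree with equals(): equal objects produce equal hash codes
        return Objects.hash(x, y);
    }
}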

Explain the Java memory model and garbage collection.

Reference Answer:
Memory Areas:

  • Heap: Object storage, divided into Young Generation (Eden, S0, S1) and Old Generation
  • Stack: Method call frames, local variables, partial results
  • Method Area/Metaspace: Class metadata, constant pool
  • PC Register: Current executing instruction
  • Native Method Stack: Native method calls

Garbage Collection Process:

  1. Objects created in Eden space
  2. When Eden fills, minor GC moves surviving objects to Survivor space
  3. After several GC cycles, long-lived objects promoted to Old Generation
  4. Major GC cleans Old Generation (more expensive)

Common GC Algorithms:

  • Serial GC: Single-threaded, suitable for small applications
  • Parallel GC: Multi-threaded, good for throughput
  • G1GC: Low-latency, good for large heaps
  • ZGC/Shenandoah: Ultra-low latency collectors

What are the differences between abstract classes and interfaces?

Reference Answer:

Aspect comparison (abstract class vs. interface):

  • Inheritance: an abstract class supports single inheritance; a class can implement multiple interfaces
  • Methods: an abstract class can have concrete methods; interface methods were abstract only (before Java 8)
  • Variables: an abstract class can have instance variables; interface fields are implicitly public static final
  • Constructors: an abstract class can have constructors; an interface cannot
  • Access modifiers: abstract class members can use any access modifier; interface members are public by default

Modern Java (8+) additions:

  • Interfaces can have default and static methods
  • Private methods in interfaces (Java 9+)

When to use:

  • Abstract Class: When you have common code to share and “is-a” relationship
  • Interface: When you want to define a contract and “can-do” relationship
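
To illustrate the Java 8+ interface additions mentioned above, a small sketch (the interface and method names are illustrative):

public interface Greeter {

    // Abstract method: the contract
    String name();

    // default method (Java 8+): shared behavior with an implementation
    default String greet() {
        return prefix() + name();
    }

    // static method (Java 8+): factory-style helpers on the interface itself
    static Greeter of(String name) {
        return () -> name;
    }

    // private method (Java 9+): helper shared by default methods
    private String prefix() {
        return "Hello, ";
    }
}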

Concurrency and Threading

How does the volatile keyword work?

Reference Answer:
Purpose: Ensures visibility of variable changes across threads and prevents instruction reordering.

Memory Effects:

  • Reads/writes to volatile variables are directly from/to main memory
  • Creates a happens-before relationship
  • Prevents compiler optimizations that cache variable values

When to use:

  • Simple flags or state variables
  • Single writer, multiple readers scenarios
  • Not sufficient for compound operations (like increment)
public class VolatileExample {
private volatile boolean flag = false;

// Thread 1
public void setFlag() {
flag = true; // Immediately visible to other threads
}

// Thread 2
public void checkFlag() {
while (!flag) {
// Will see the change immediately
}
}
}

Limitations: Doesn’t provide atomicity for compound operations. Use AtomicBoolean, AtomicInteger, etc., for atomic operations.
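
As noted above, compound operations such as increment need atomic classes rather than volatile; a minimal sketch:

import java.util.concurrent.atomic.AtomicInteger;

public class RequestCounter {
    // A volatile int would still race on count++ (read-modify-write)
    private final AtomicInteger count = new AtomicInteger();

    public void increment() {
        count.incrementAndGet(); // atomic read-modify-write
    }

    public int current() {
        return count.get();
    }
}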

Explain different ways to create threads and their trade-offs.

Reference Answer:

1. Extending Thread class:

class MyThread extends Thread {
public void run() { /* implementation */ }
}
new MyThread().start();
  • Pros: Simple, direct control
  • Cons: Single inheritance limitation, tight coupling

2. Implementing Runnable:

class MyTask implements Runnable {
public void run() { /* implementation */ }
}
new Thread(new MyTask()).start();
  • Pros: Better design, can extend other classes
  • Cons: Still creates OS threads

3. ExecutorService:

ExecutorService executor = Executors.newFixedThreadPool(10);
executor.submit(() -> { /* task */ });
  • Pros: Thread pooling, resource management
  • Cons: More complex, need proper shutdown

4. CompletableFuture:

CompletableFuture.supplyAsync(() -> { /* computation */ })
.thenApply(result -> { /* transform */ });
  • Pros: Asynchronous composition, functional style
  • Cons: Learning curve, can be overkill for simple tasks

5. Virtual Threads (Java 19+):

Thread.startVirtualThread(() -> { /* task */ });
  • Pros: Lightweight, millions of threads possible
  • Cons: New feature, limited adoption
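
A brief sketch of running many blocking tasks on virtual threads through an executor (Java 21 API; the task is illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        // Each submitted task runs on its own lightweight virtual thread
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 10_000).forEach(i ->
                executor.submit(() -> {
                    Thread.sleep(100); // blocking is cheap on virtual threads
                    return i;
                }));
        } // close() waits for submitted tasks to complete
    }
}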

What’s the difference between synchronized and ReentrantLock?

Reference Answer:

Feature comparison (synchronized vs. ReentrantLock):

  • Type: intrinsic/implicit lock vs. explicit lock
  • Acquisition: automatic vs. manual (lock/unlock)
  • Fairness: no fairness guarantee vs. optional fairness
  • Interruptibility: not interruptible vs. interruptible (lockInterruptibly)
  • Try lock: not available vs. available (tryLock)
  • Condition variables: wait/notify vs. multiple Condition objects
  • Performance: JVM optimized vs. slightly more overhead

ReentrantLock Example:

private final ReentrantLock lock = new ReentrantLock(true); // fair lock

public void performTask() {
lock.lock();
try {
// critical section
} finally {
lock.unlock(); // Must be in finally block
}
}

public boolean tryPerformTask() {
if (lock.tryLock()) {
try {
// critical section
return true;
} finally {
lock.unlock();
}
}
return false;
}

Collections and Data Structures

How does HashMap work internally?

Reference Answer:
Internal Structure:

  • Array of buckets (Node<K,V>[] table)
  • Each bucket can contain a linked list or red-black tree
  • Default initial capacity: 16, load factor: 0.75

Hash Process:

  1. Calculate hash code of key using hashCode()
  2. Apply hash function: hash(key) = key.hashCode() ^ (key.hashCode() >>> 16)
  3. Find bucket: index = hash & (capacity - 1)

Collision Resolution:

  • Chaining: Multiple entries in same bucket form linked list
  • Treeification (Java 8+): When bucket size ≥ 8, convert to red-black tree
  • Untreeification: When bucket size ≤ 6, convert back to linked list

Resizing:

  • When size > capacity × load factor, capacity doubles
  • All entries rehashed to new positions
  • Expensive operation, can cause performance issues

Poor hashCode() Impact:
If hashCode() always returns same value, all entries go to one bucket, degrading performance to O(n) for operations.

// Simplified internal structure
static class Node<K,V> {
final int hash;
final K key;
V value;
Node<K,V> next;
}

When would you use ConcurrentHashMap vs Collections.synchronizedMap()?

Reference Answer:

Collections.synchronizedMap():

  • Wraps existing map with synchronized methods
  • Synchronization: Entire map locked for each operation
  • Performance: Poor in multi-threaded scenarios
  • Iteration: Requires external synchronization
  • Fail-fast: Iterators can throw ConcurrentModificationException

ConcurrentHashMap:

  • Synchronization: Segment-based locking (Java 7) or CAS operations (Java 8+)
  • Performance: Excellent concurrent read performance
  • Iteration: Weakly consistent iterators, no external sync needed
  • Fail-safe: Iterators reflect state at creation time
  • Atomic operations: putIfAbsent(), replace(), computeIfAbsent()
// ConcurrentHashMap example
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
map.putIfAbsent("key", 1);
map.computeIfPresent("key", (k, v) -> v + 1);

// Safe iteration without external synchronization
for (Map.Entry<String, Integer> entry : map.entrySet()) {
// No ConcurrentModificationException
}

Use ConcurrentHashMap when:

  • High concurrent access
  • More reads than writes
  • Need atomic operations
  • Want better performance

Design Patterns and Architecture

Implement the Singleton pattern and discuss its problems.

Reference Answer:

1. Eager Initialization:

public class EagerSingleton {
private static final EagerSingleton INSTANCE = new EagerSingleton();

private EagerSingleton() {}

public static EagerSingleton getInstance() {
return INSTANCE;
}
}
  • Pros: Thread-safe, simple
  • Cons: Creates instance even if never used

2. Lazy Initialization (Thread-unsafe):

public class LazySingleton {
private static LazySingleton instance;

private LazySingleton() {}

public static LazySingleton getInstance() {
if (instance == null) {
instance = new LazySingleton(); // Race condition!
}
return instance;
}
}

3. Thread-safe Lazy (Double-checked locking):

public class ThreadSafeSingleton {
private static volatile ThreadSafeSingleton instance;

private ThreadSafeSingleton() {}

public static ThreadSafeSingleton getInstance() {
if (instance == null) {
synchronized (ThreadSafeSingleton.class) {
if (instance == null) {
instance = new ThreadSafeSingleton();
}
}
}
return instance;
}
}

4. Enum Singleton (Recommended):

public enum EnumSingleton {
INSTANCE;

public void doSomething() {
// business logic
}
}

Problems with Singleton:

  • Testing: Difficult to mock, global state
  • Coupling: Tight coupling throughout application
  • Scalability: Global bottleneck
  • Serialization: Need special handling
  • Reflection: Can break private constructor
  • Classloader: Multiple instances with different classloaders
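
A minimal sketch of how the serialization issue listed above is typically handled for a non-enum singleton (the enum form avoids the problem entirely):

import java.io.Serializable;

public class SerializableSingleton implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final SerializableSingleton INSTANCE = new SerializableSingleton();

    private SerializableSingleton() {}

    public static SerializableSingleton getInstance() {
        return INSTANCE;
    }

    // Without readResolve, deserialization would create a second instance
    private Object readResolve() {
        return INSTANCE;
    }
}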

Explain dependency injection and inversion of control.

Reference Answer:

Inversion of Control (IoC):
Principle where control of object creation and lifecycle is transferred from the application code to an external framework.

Dependency Injection (DI):
A technique to implement IoC where dependencies are provided to an object rather than the object creating them.

Types of DI:

1. Constructor Injection:

public class UserService {
private final UserRepository userRepository;

public UserService(UserRepository userRepository) {
this.userRepository = userRepository;
}
}

2. Setter Injection:

public class UserService {
private UserRepository userRepository;

public void setUserRepository(UserRepository userRepository) {
this.userRepository = userRepository;
}
}

3. Field Injection:

public class UserService {
@Inject
private UserRepository userRepository;
}

Benefits:

  • Testability: Easy to inject mock dependencies
  • Flexibility: Change implementations without code changes
  • Decoupling: Reduces tight coupling between classes
  • Configuration: Centralized dependency configuration

Without DI:

public class UserService {
private UserRepository userRepository = new DatabaseUserRepository(); // Tight coupling
}

With DI:

public class UserService {
private final UserRepository userRepository;

public UserService(UserRepository userRepository) { // Loose coupling
this.userRepository = userRepository;
}
}

Performance and Optimization

How would you identify and resolve a memory leak in a Java application?

Reference Answer:

Identification Tools:

  1. JVisualVM: Visual profiler, heap dumps
  2. JProfiler: Commercial profiler
  3. Eclipse MAT: Memory Analyzer Tool
  4. JConsole: Built-in monitoring
  5. Application metrics: OutOfMemoryError frequency

Detection Signs:

  • Gradual memory increase over time
  • OutOfMemoryError exceptions
  • Increasing GC frequency/duration
  • Application slowdown

Analysis Process:

1. Heap Dump Analysis:

jcmd <pid> GC.run                          # optionally trigger a GC first
jcmd <pid> GC.heap_dump heapdump.hprof     # or use jmap:
jmap -dump:format=b,file=heapdump.hprof <pid>

2. Common Leak Scenarios:

Static Collections:

public class LeakyClass {
private static List<Object> cache = new ArrayList<>(); // Never cleared

public void addToCache(Object obj) {
cache.add(obj); // Memory leak!
}
}

Listener Registration:

public class EventPublisher {
private List<EventListener> listeners = new ArrayList<>();

public void addListener(EventListener listener) {
listeners.add(listener); // If not removed, leak!
}

public void removeListener(EventListener listener) {
listeners.remove(listener); // Often forgotten
}
}

ThreadLocal Variables:

public class ThreadLocalLeak {
private static ThreadLocal<ExpensiveObject> threadLocal = new ThreadLocal<>();

public void setThreadLocalValue() {
threadLocal.set(new ExpensiveObject()); // Clear when done!
}

public void cleanup() {
threadLocal.remove(); // Essential in long-lived threads
}
}

Resolution Strategies:

  • Use weak references where appropriate
  • Implement proper cleanup in finally blocks
  • Clear collections when no longer needed
  • Remove listeners in lifecycle methods
  • Use try-with-resources for automatic cleanup
  • Monitor object creation patterns
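
For the weak-reference suggestion above, a minimal sketch of a cache whose keys do not prevent garbage collection (the class name is illustrative):

import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyCache<K, V> {
    // Entries become collectable once the key is no longer strongly referenced elsewhere
    private final Map<K, V> cache = Collections.synchronizedMap(new WeakHashMap<>());

    public void put(K key, V value) {
        cache.put(key, value);
    }

    public V get(K key) {
        return cache.get(key);
    }
}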

What are some JVM tuning parameters you’ve used?

Reference Answer:

Heap Memory:

-Xms2g          # Initial heap size
-Xmx8g # Maximum heap size
-XX:NewRatio=3 # Old/Young generation ratio
-XX:MaxMetaspaceSize=256m # Metaspace limit

Garbage Collection:

# G1GC (recommended for large heaps)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=16m

# Parallel GC (good throughput)
-XX:+UseParallelGC
-XX:ParallelGCThreads=8

# ZGC (ultra-low latency)
-XX:+UseZGC
-XX:+UnlockExperimentalVMOptions

GC Logging:

# Java 8 and earlier
-Xloggc:gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=100M

# Java 9+ (unified logging)
-Xlog:gc*:file=gc.log:time,tags:filecount=5,filesize=100m

Performance Monitoring:

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/path/to/dumps/

JIT Compilation:

-XX:+TieredCompilation
-XX:CompileThreshold=10000
-XX:+PrintCompilation

Common Tuning Scenarios:

  • High throughput: Parallel GC, larger heap
  • Low latency: G1GC or ZGC, smaller pause times
  • Memory constrained: Smaller heap, compressed OOPs
  • CPU intensive: More GC threads, tiered compilation

Modern Java Features

Explain streams and when you’d use them vs traditional loops.

Reference Answer:

Stream Characteristics:

  • Functional: Declarative programming style
  • Lazy: Operations executed only when terminal operation called
  • Immutable: Original collection unchanged
  • Chainable: Fluent API for operation composition

Stream Example:

List<String> names = Arrays.asList("Alice", "Bob", "Charlie", "David");

// Traditional loop
List<String> result = new ArrayList<>();
for (String name : names) {
if (name.length() > 4) {
result.add(name.toUpperCase());
}
}

// Stream approach
List<String> result = names.stream()
.filter(name -> name.length() > 4)
.map(String::toUpperCase)
.collect(Collectors.toList());

When to use Streams:

  • Data transformation pipelines
  • Complex filtering/mapping operations
  • Parallel processing (.parallelStream())
  • Functional programming style preferred
  • Readability over performance for complex operations

When to use Traditional Loops:

  • Simple iterations without transformations
  • Performance critical tight loops
  • Early termination needed
  • State modification during iteration
  • Index-based operations

Performance Considerations:

// Stream overhead for simple operations
list.stream().forEach(System.out::println); // Slower
list.forEach(System.out::println); // Faster

// Streams excel at complex operations
list.stream()
.filter(complex_predicate)
.map(expensive_transformation)
.sorted()
.limit(10)
.collect(Collectors.toList()); // More readable than equivalent loop

What are records in Java 14+ and when would you use them?

Reference Answer:

Records Definition:
Records are immutable data carriers that automatically generate boilerplate code.

Basic Record:

public record Person(String name, int age, String email) {}

// Automatically generates:
// - Constructor: Person(String name, int age, String email)
// - Accessors: name(), age(), email()
// - equals(), hashCode(), toString()
// - All fields are private final

Custom Methods:

public record Point(double x, double y) {
// Custom constructor with validation
public Point {
if (x < 0 || y < 0) {
throw new IllegalArgumentException("Coordinates must be non-negative");
}
}

// Additional methods
public double distanceFromOrigin() {
return Math.sqrt(x * x + y * y);
}

// Static factory method
public static Point origin() {
return new Point(0, 0);
}
}

When to Use Records:

  • Data Transfer Objects (DTOs)
  • Configuration objects
  • API response/request models
  • Value objects in domain modeling
  • Tuple-like data structures
  • Database result mapping

Example Use Cases:

API Response:

public record UserResponse(Long id, String username, String email, LocalDateTime createdAt) {}

// Usage
return users.stream()
.map(user -> new UserResponse(user.getId(), user.getUsername(),
user.getEmail(), user.getCreatedAt()))
.collect(Collectors.toList());

Configuration:

public record DatabaseConfig(String url, String username, String password, 
int maxConnections, Duration timeout) {}

Limitations:

  • Cannot extend other classes (can implement interfaces)
  • All fields are implicitly final
  • Cannot declare instance fields beyond record components
  • Less flexibility than regular classes

Records vs Classes:

  • Use Records: Immutable data, minimal behavior
  • Use Classes: Mutable state, complex behavior, inheritance needed

System Design Integration

How would you design a thread-safe cache with TTL (time-to-live)?

Reference Answer:

Design Requirements:

  • Thread-safe concurrent access
  • Automatic expiration based on TTL
  • Efficient cleanup of expired entries
  • Good performance for reads and writes

Implementation:

public class TTLCache<K, V> {
private static class CacheEntry<V> {
final V value;
final long expirationTime;

CacheEntry(V value, long ttlMillis) {
this.value = value;
this.expirationTime = System.currentTimeMillis() + ttlMillis;
}

boolean isExpired() {
return System.currentTimeMillis() > expirationTime;
}
}

private final ConcurrentHashMap<K, CacheEntry<V>> cache = new ConcurrentHashMap<>();
private final ScheduledExecutorService cleanupExecutor;
private final long defaultTTL;

public TTLCache(long defaultTTLMillis, long cleanupIntervalMillis) {
this.defaultTTL = defaultTTLMillis;
this.cleanupExecutor = Executors.newSingleThreadScheduledExecutor(r -> {
Thread t = new Thread(r, "TTLCache-Cleanup");
t.setDaemon(true);
return t;
});

// Schedule periodic cleanup
cleanupExecutor.scheduleAtFixedRate(this::cleanup,
cleanupIntervalMillis, cleanupIntervalMillis, TimeUnit.MILLISECONDS);
}

public void put(K key, V value) {
put(key, value, defaultTTL);
}

public void put(K key, V value, long ttlMillis) {
cache.put(key, new CacheEntry<>(value, ttlMillis));
}

public V get(K key) {
CacheEntry<V> entry = cache.get(key);
if (entry == null || entry.isExpired()) {
cache.remove(key); // Clean up expired entry
return null;
}
return entry.value;
}

public boolean containsKey(K key) {
return get(key) != null;
}

public void remove(K key) {
cache.remove(key);
}

public void clear() {
cache.clear();
}

public int size() {
cleanup(); // Clean expired entries first
return cache.size();
}

private void cleanup() {
cache.entrySet().removeIf(entry -> entry.getValue().isExpired());
}

public void shutdown() {
cleanupExecutor.shutdown();
try {
if (!cleanupExecutor.awaitTermination(5, TimeUnit.SECONDS)) {
cleanupExecutor.shutdownNow();
}
} catch (InterruptedException e) {
cleanupExecutor.shutdownNow();
Thread.currentThread().interrupt();
}
}
}

Usage Example:

// Create cache with 5-minute default TTL, cleanup every minute
TTLCache<String, UserData> userCache = new TTLCache<>(5 * 60 * 1000, 60 * 1000);

// Store with default TTL
userCache.put("user123", userData);

// Store with custom TTL (10 minutes)
userCache.put("session456", sessionData, 10 * 60 * 1000);

// Retrieve
UserData user = userCache.get("user123");

Alternative Approaches:

  • Caffeine Cache: Production-ready with advanced features
  • Guava Cache: Google’s caching library
  • Redis: External cache for distributed systems
  • Chronicle Map: Off-heap storage for large datasets
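
For comparison, a minimal sketch of the same TTL behavior using Caffeine (assuming the Caffeine dependency is on the classpath):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.time.Duration;

public class CaffeineTtlExample {
    public static void main(String[] args) {
        Cache<String, String> cache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofMinutes(5)) // per-entry TTL
            .maximumSize(10_000)                     // bounded size with eviction
            .build();

        cache.put("user123", "userData");
        System.out.println(cache.getIfPresent("user123"));
    }
}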

Explain how you’d handle database connections in a high-traffic application.

Reference Answer:

Connection Pooling Strategy:

1. HikariCP Configuration (Recommended):

@Configuration
public class DatabaseConfig {

@Bean
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb");
config.setUsername("user");
config.setPassword("password");

// Pool sizing
config.setMaximumPoolSize(20); // Max connections
config.setMinimumIdle(5); // Min idle connections
config.setConnectionTimeout(30000); // 30 seconds timeout
config.setIdleTimeout(600000); // 10 minutes idle timeout
config.setMaxLifetime(1800000); // 30 minutes max lifetime

// Performance tuning
config.setLeakDetectionThreshold(60000); // 1 minute leak detection
// Statement-cache settings are JDBC-driver properties, not HikariConfig setters
config.addDataSourceProperty("cachePrepStmts", "true");
config.addDataSourceProperty("prepStmtCacheSize", "250");
config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");

return new HikariDataSource(config);
}
}

2. Connection Pool Sizing:

connections = ((core_count * 2) + effective_spindle_count)

For CPU-intensive: core_count * 2
For I/O-intensive: higher multiplier (3-4x)
Monitor and adjust based on actual usage

3. Transaction Management:

@Service
@Transactional
public class UserService {

@Autowired
private UserRepository userRepository;

@Transactional(readOnly = true)
public User findById(Long id) {
return userRepository.findById(id);
}

@Transactional(propagation = Propagation.REQUIRES_NEW)
public void updateUserAsync(Long id, UserData data) {
// Runs in separate transaction
User user = userRepository.findById(id);
user.update(data);
userRepository.save(user);
}

@Transactional(timeout = 30) // 30 seconds timeout
public void bulkOperation(List<User> users) {
users.forEach(userRepository::save);
}
}

4. Read/Write Splitting:

@Configuration
public class DatabaseRoutingConfig {

@Bean
@Primary
public DataSource routingDataSource() {
RoutingDataSource routingDataSource = new RoutingDataSource();

Map<Object, Object> targetDataSources = new HashMap<>();
targetDataSources.put("write", writeDataSource());
targetDataSources.put("read", readDataSource());

routingDataSource.setTargetDataSources(targetDataSources);
routingDataSource.setDefaultTargetDataSource(writeDataSource());

return routingDataSource;
}

@Bean
public DataSource writeDataSource() {
// Master database configuration
return createDataSource("jdbc:postgresql://master:5432/mydb");
}

@Bean
public DataSource readDataSource() {
// Replica database configuration
return createDataSource("jdbc:postgresql://replica:5432/mydb");
}
}

public class RoutingDataSource extends AbstractRoutingDataSource {
@Override
protected Object determineCurrentLookupKey() {
return TransactionSynchronizationManager.isCurrentTransactionReadOnly() ? "read" : "write";
}
}

5. Monitoring and Health Checks:

@Component
public class DatabaseHealthIndicator implements HealthIndicator {

@Autowired
private DataSource dataSource;

@Override
public Health health() {
try (Connection connection = dataSource.getConnection()) {
if (connection.isValid(2)) { // 2 second timeout
return Health.up()
.withDetail("database", "Available")
.withDetail("active-connections", getActiveConnections())
.build();
}
} catch (SQLException e) {
return Health.down()
.withDetail("database", "Unavailable")
.withException(e)
.build();
}
return Health.down().withDetail("database", "Connection invalid").build();
}

private int getActiveConnections() {
if (dataSource instanceof HikariDataSource) {
return ((HikariDataSource) dataSource).getHikariPoolMXBean().getActiveConnections();
}
return -1;
}
}

6. Best Practices for High Traffic:

Connection Management:

  • Always use connection pooling
  • Set appropriate timeouts
  • Monitor pool metrics
  • Use read replicas for read-heavy workloads

Query Optimization:

  • Use prepared statements
  • Implement proper indexing
  • Cache frequently accessed data
  • Use batch operations for bulk updates

Resilience Patterns:

  • Circuit breaker for database failures
  • Retry logic with exponential backoff
  • Graceful degradation when database unavailable
  • Database failover strategies
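
A minimal sketch of the circuit breaker and retry patterns mentioned above using Resilience4j (one common choice; the names and settings are illustrative):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class ResilientDbAccess {

    private final CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("database");
    private final Retry retry = Retry.of("database", RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofMillis(500))
            .build());

    public <T> T execute(Supplier<T> query) {
        // Retry wraps the circuit breaker; repeated failures open the breaker and fail fast
        Supplier<T> decorated = Retry.decorateSupplier(retry,
                CircuitBreaker.decorateSupplier(circuitBreaker, query));
        return decorated.get();
    }
}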

Performance Monitoring:

@EventListener
public void handleConnectionPoolMetrics(ConnectionPoolMetricsEvent event) {
logger.info("Active connections: {}, Idle: {}, Waiting: {}",
event.getActive(), event.getIdle(), event.getWaiting());

if (event.getActive() > event.getMaxPool() * 0.8) {
alertingService.sendAlert("High database connection usage");
}
}

This comprehensive approach ensures database connections are efficiently managed in high-traffic scenarios while maintaining performance and reliability.

Overview

Introduction

The FinTech AI Workflow and Chat System represents a comprehensive lending platform that combines traditional workflow automation with artificial intelligence capabilities. This system streamlines the personal loan application process through intelligent automation while maintaining human oversight at critical decision points.

The architecture employs a microservices approach, integrating multiple AI technologies including Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and intelligent agents to create a seamless lending experience. The system processes over 2000 concurrent conversations with an average response time of 30 seconds, demonstrating enterprise-grade performance.

Key Business Benefits:

  • Reduced Processing Time: From days to minutes for loan approvals
  • Enhanced Accuracy: AI-powered risk assessment reduces default rates
  • Improved Customer Experience: 24/7 availability with multi-modal interaction
  • Regulatory Compliance: Built-in compliance checks and audit trails
  • Cost Efficiency: Automated workflows reduce operational costs by 60%

Key Interview Question: “How would you design a scalable FinTech system that balances automation with regulatory compliance?”

Reference Answer: The system employs a layered architecture with clear separation of concerns. The workflow engine handles business logic while maintaining audit trails for regulatory compliance. AI components augment human decision-making rather than replacing it entirely, ensuring transparency and accountability. The microservices architecture allows for independent scaling of components based on demand.

Architecture Design


flowchart TB
subgraph "Frontend Layer"
    A[ChatWebUI] --> B[React/Vue Components]
    B --> C[Multi-Modal Input Handler]
end

subgraph "Gateway Layer"
    D[Higress AI Gateway] --> E[Load Balancer]
    E --> F[Multi-Model Provider]
    F --> G[Context Memory - mem0]
end

subgraph "Service Layer"
    H[ConversationService] --> I[AIWorkflowEngineService]
    I --> J[WorkflowEngineService]
    H --> K[KnowledgeBaseService]
end

subgraph "AI Layer"
    L[LLM Providers] --> M[ReAct Pattern Engine]
    M --> N[MCP Server Agents]
    N --> O[RAG System]
end

subgraph "External Systems"
    P[BankCreditSystem]
    Q[TaxSystem]
    R[SocialSecuritySystem]
    S[Rule Engine]
end

subgraph "Configuration"
    T[Nacos Config Center]
    U[Prompt Templates]
end

A --> D
D --> H
H --> L
I --> P
I --> Q
I --> R
J --> S
K --> O
T --> U
U --> L

The architecture follows a distributed microservices pattern with clear separation between presentation, business logic, and data layers. The AI Gateway serves as the entry point for all AI-related operations, providing load balancing and context management across multiple LLM providers.

Core Components

WorkflowEngineService

The WorkflowEngineService serves as the backbone of the lending process, orchestrating the three-stage review workflow: Initial Review, Review, and Final Review.

Core Responsibilities:

  • Workflow orchestration and state management
  • External system integration
  • Business rule execution
  • Audit trail maintenance
  • SLA monitoring and enforcement

Implementation Architecture:

@Service
@Transactional
public class WorkflowEngineService {

@Autowired
private LoanApplicationRepository loanRepo;

@Autowired
private ExternalIntegrationService integrationService;

@Autowired
private RuleEngineService ruleEngine;

@Autowired
private NotificationService notificationService;

public WorkflowResult processLoanApplication(LoanApplication application) {
try {
// Initialize workflow
WorkflowInstance workflow = initializeWorkflow(application);

// Execute initial review
InitialReviewResult initialResult = executeInitialReview(application);
workflow.updateStage(WorkflowStage.INITIAL_REVIEW, initialResult);

if (initialResult.isApproved()) {
// Proceed to detailed review
DetailedReviewResult detailedResult = executeDetailedReview(application);
workflow.updateStage(WorkflowStage.DETAILED_REVIEW, detailedResult);

if (detailedResult.isApproved()) {
// Final review
FinalReviewResult finalResult = executeFinalReview(application);
workflow.updateStage(WorkflowStage.FINAL_REVIEW, finalResult);

return WorkflowResult.builder()
.status(finalResult.isApproved() ?
WorkflowStatus.APPROVED : WorkflowStatus.REJECTED)
.workflowId(workflow.getId())
.build();
}
}

return WorkflowResult.builder()
.status(WorkflowStatus.REJECTED)
.workflowId(workflow.getId())
.build();

} catch (Exception e) {
log.error("Workflow processing failed", e);
return handleWorkflowError(application, e);
}
}

private InitialReviewResult executeInitialReview(LoanApplication application) {
// Validate basic information
ValidationResult validation = validateBasicInfo(application);
if (!validation.isValid()) {
return InitialReviewResult.rejected(validation.getErrors());
}

// Check credit score
CreditScoreResult creditScore = integrationService.getCreditScore(
application.getApplicantId());

// Apply initial screening rules
RuleResult ruleResult = ruleEngine.evaluateInitialRules(
application, creditScore);

return InitialReviewResult.builder()
.approved(ruleResult.isApproved())
.creditScore(creditScore.getScore())
.reasons(ruleResult.getReasons())
.build();
}
}

Three-Stage Review Process:

  1. Initial Review: Automated screening based on basic criteria

    • Identity verification
    • Credit score check
    • Basic eligibility validation
    • Fraud detection algorithms
  2. Detailed Review: Comprehensive analysis of financial capacity

    • Income verification through tax systems
    • Employment history validation
    • Debt-to-income ratio calculation
    • Collateral assessment (if applicable)
  3. Final Review: Human oversight and final approval

    • Risk assessment confirmation
    • Regulatory compliance check
    • Manual review of edge cases
    • Final approval or rejection

External System Integration:

@Component
public class ExternalIntegrationService {

@Autowired
private BankCreditSystemClient bankCreditClient;

@Autowired
private TaxSystemClient taxClient;

@Autowired
private SocialSecuritySystemClient socialSecurityClient;

@Retryable(value = {Exception.class}, maxAttempts = 3)
public CreditScoreResult getCreditScore(String applicantId) {
return bankCreditClient.getCreditScore(applicantId);
}

@Retryable(value = {Exception.class}, maxAttempts = 3)
public TaxInformationResult getTaxInformation(String applicantId, int years) {
return taxClient.getTaxInformation(applicantId, years);
}

@Retryable(value = {Exception.class}, maxAttempts = 3)
public SocialSecurityResult getSocialSecurityInfo(String applicantId) {
return socialSecurityClient.getSocialSecurityInfo(applicantId);
}
}

Key Interview Question: “How do you handle transaction consistency across multiple external system calls in a workflow?”

Reference Answer: The system uses the Saga pattern for distributed transactions. Each step in the workflow is designed as a compensable transaction. If a step fails, the system executes compensation actions to maintain consistency. For example, if the final review fails after initial approvals, the system automatically triggers cleanup processes to revert any provisional approvals.
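
To make the compensation idea concrete, below is a minimal sketch of a compensable-step runner. The SagaStep record and the comments about provisional approvals are illustrative assumptions; the actual workflow engine's saga wiring is not shown in this post.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class LoanApplicationSaga {

    /** A workflow step paired with the action that undoes it. */
    public record SagaStep(String name, Runnable action, Runnable compensation) {}

    public void execute(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        try {
            for (SagaStep step : steps) {
                step.action().run();   // e.g. record a provisional approval
                completed.push(step);  // remember it so it can be undone later
            }
        } catch (RuntimeException failure) {
            // Compensate in reverse order to restore consistency,
            // e.g. revert provisional approvals after a failed final review
            while (!completed.isEmpty()) {
                completed.pop().compensation().run();
            }
            throw failure;
        }
    }
}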

AIWorkflowEngineService

The AIWorkflowEngineService leverages Spring AI to provide intelligent automation of the lending process, reducing manual intervention while maintaining accuracy.

@Service
@Slf4j
public class AIWorkflowEngineService {

@Autowired
private ChatModel chatModel;

@Autowired
private PromptTemplateService promptTemplateService;

@Autowired
private WorkflowEngineService traditionalWorkflowService;

public AIWorkflowResult processLoanApplicationWithAI(LoanApplication application) {
// First, gather all relevant data
ApplicationContext context = gatherApplicationContext(application);

// Use AI to perform initial assessment
AIAssessmentResult aiAssessment = performAIAssessment(context);

// Decide whether to proceed with full automated flow or human review
if (aiAssessment.getConfidenceScore() > 0.85) {
return processAutomatedFlow(context, aiAssessment);
} else {
return processHybridFlow(context, aiAssessment);
}
}

private AIAssessmentResult performAIAssessment(ApplicationContext context) {
String promptTemplate = promptTemplateService.getTemplate("loan_assessment");

Map<String, Object> variables = Map.of(
"applicantData", context.getApplicantData(),
"creditHistory", context.getCreditHistory(),
"financialData", context.getFinancialData()
);

Prompt prompt = new PromptTemplate(promptTemplate, variables).create();
ChatResponse response = chatModel.call(prompt);

return parseAIResponse(response.getResult().getOutput().getContent());
}

private AIAssessmentResult parseAIResponse(String aiResponse) {
// Parse structured AI response
ObjectMapper mapper = new ObjectMapper();
try {
return mapper.readValue(aiResponse, AIAssessmentResult.class);
} catch (JsonProcessingException e) {
log.error("Failed to parse AI response", e);
return AIAssessmentResult.lowConfidence();
}
}
}

Key Interview Question: “How do you ensure AI decisions are explainable and auditable in a regulated financial environment?”

Reference Answer: The system maintains detailed audit logs for every AI decision, including the input data, prompt templates used, model responses, and confidence scores. Each AI assessment includes reasoning chains that explain the decision logic. For regulatory compliance, the system can replay any decision by re-running the same prompt with the same input data, ensuring reproducibility and transparency.
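
As one possible shape for such an audit trail, the record below captures everything needed to replay a decision. The field names are assumptions made for illustration rather than the project's actual schema.

import java.time.Instant;
import java.util.Map;

/** Illustrative audit entry persisted for every AI assessment. */
public record AIDecisionAuditEntry(
        String applicationId,
        String promptTemplateName,
        String promptTemplateVersion,      // pin the exact template version used
        Map<String, Object> inputVariables,
        String modelName,
        String rawModelResponse,
        double confidenceScore,
        String reasoningSummary,           // the model's stated reasoning chain
        Instant decidedAt) {
}

Storing the template version together with the input variables is what makes replay practical: re-running the same versioned prompt over the same variables should reproduce the decision for auditors.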

ChatWebUI

The ChatWebUI serves as the primary interface for user interaction, supporting multi-modal communication including text, files, images, and audio.

Key Features:

  • Multi-Modal Input: Text, voice, image, and document upload
  • Real-Time Chat: WebSocket-based instant messaging
  • Progressive Web App: Mobile-responsive design
  • Accessibility: WCAG 2.1 compliant interface
  • Internationalization: Multi-language support

Backend API Implementation (consumed by the React-based web UI):

@RestController
@RequestMapping("/api/chat")
public class ChatController {

@Autowired
private ConversationService conversationService;

@Autowired
private FileProcessingService fileProcessingService;

@PostMapping("/message")
public ResponseEntity<ChatResponse> sendMessage(@RequestBody ChatRequest request) {
try {
ChatResponse response = conversationService.processMessage(request);
return ResponseEntity.ok(response);
} catch (Exception e) {
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
.body(ChatResponse.error("Failed to process message"));
}
}

@PostMapping("/upload")
public ResponseEntity<FileUploadResponse> uploadFile(
@RequestParam("file") MultipartFile file,
@RequestParam("conversationId") String conversationId) {

try {
FileProcessingResult result = fileProcessingService.processFile(
file, conversationId);

return ResponseEntity.ok(FileUploadResponse.builder()
.fileId(result.getFileId())
.extractedText(result.getExtractedText())
.processingStatus(result.getStatus())
.build());

} catch (Exception e) {
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
.body(FileUploadResponse.error("File processing failed"));
}
}

@GetMapping("/conversation/{id}")
public ResponseEntity<ConversationHistory> getConversationHistory(
@PathVariable String id) {

ConversationHistory history = conversationService.getConversationHistory(id);
return ResponseEntity.ok(history);
}
}

ConversationService

The ConversationService handles multi-modal customer interactions, supporting text, file uploads, images, and audio processing.

@Service
@Slf4j
public class ConversationService {

@Autowired
private KnowledgeBaseService knowledgeBaseService;

@Autowired
private AIWorkflowEngineService aiWorkflowService;

@Autowired
private ContextMemoryService contextMemoryService;

public ConversationResponse processMessage(ConversationRequest request) {
// Retrieve conversation context
ConversationContext context = contextMemoryService.getContext(
request.getSessionId());

// Process multi-modal input
ProcessedInput processedInput = processMultiModalInput(request);

// Classify intent using ReAct pattern
IntentClassification intent = classifyIntent(processedInput, context);

switch (intent.getType()) {
case LOAN_APPLICATION:
return handleLoanApplication(processedInput, context);
case KNOWLEDGE_QUERY:
return handleKnowledgeQuery(processedInput, context);
case DOCUMENT_UPLOAD:
return handleDocumentUpload(processedInput, context);
default:
return handleGeneralChat(processedInput, context);
}
}

private ProcessedInput processMultiModalInput(ConversationRequest request) {
ProcessedInput.Builder builder = ProcessedInput.builder()
.sessionId(request.getSessionId())
.timestamp(Instant.now());

// Process text
if (request.getText() != null) {
builder.text(request.getText());
}

// Process files
if (request.getFiles() != null) {
List<ProcessedFile> processedFiles = request.getFiles().stream()
.map(this::processFile)
.collect(Collectors.toList());
builder.files(processedFiles);
}

// Process images
if (request.getImages() != null) {
List<ProcessedImage> processedImages = request.getImages().stream()
.map(this::processImage)
.collect(Collectors.toList());
builder.images(processedImages);
}

return builder.build();
}
}

KnowledgeBaseService

The KnowledgeBaseService implements a comprehensive RAG system for financial domain knowledge, supporting various document formats and providing contextually relevant responses.

@Service
@Slf4j
public class KnowledgeBaseService {

@Autowired
private VectorStoreService vectorStoreService;

@Autowired
private DocumentParsingService documentParsingService;

@Autowired
private EmbeddingModel embeddingModel;

@Autowired
private ChatModel chatModel;

public KnowledgeResponse queryKnowledge(String query, ConversationContext context) {
// Generate embedding for the query
EmbeddingRequest embeddingRequest = new EmbeddingRequest(
List.of(query), EmbeddingOptions.EMPTY);
EmbeddingResponse embeddingResponse = embeddingModel.call(embeddingRequest);

// Retrieve relevant documents
List<Document> relevantDocs = vectorStoreService.similaritySearch(
SearchRequest.query(query)
.withTopK(5)
.withSimilarityThreshold(0.7));

// Generate contextual response
return generateContextualResponse(query, relevantDocs, context);
}

public void indexDocument(MultipartFile file) {
try {
// Parse document based on format
ParsedDocument parsedDoc = documentParsingService.parse(file);

// Split into chunks
List<DocumentChunk> chunks = splitDocument(parsedDoc);

// Generate embeddings and store
for (DocumentChunk chunk : chunks) {
EmbeddingRequest embeddingRequest = new EmbeddingRequest(
List.of(chunk.getContent()), EmbeddingOptions.EMPTY);
EmbeddingResponse embeddingResponse = embeddingModel.call(embeddingRequest);

Document document = new Document(chunk.getContent(),
Map.of("source", file.getOriginalFilename(),
"chunk_id", chunk.getId()));
document.setEmbedding(embeddingResponse.getResults().get(0).getOutput());

vectorStoreService.add(List.of(document));
}
} catch (Exception e) {
log.error("Failed to index document: {}", file.getOriginalFilename(), e);
throw new DocumentIndexingException("Failed to index document", e);
}
}

private List<DocumentChunk> splitDocument(ParsedDocument parsedDoc) {
// Implement intelligent chunking based on document structure
return DocumentChunker.builder()
.chunkSize(1000)
.chunkOverlap(200)
.respectSentenceBoundaries(true)
.respectParagraphBoundaries(true)
.build()
.split(parsedDoc);
}
}

Key Technologies

LLM fine-tuning with Financial data

Fine-tuning Large Language Models with domain-specific financial data enhances their understanding of financial concepts, regulations, and terminology.

Fine-tuning Strategy:

  • Base Model Selection: Choose appropriate foundation models (GPT-4, Claude, or Llama)
  • Dataset Preparation: Curate high-quality financial datasets
  • Training Pipeline: Implement efficient fine-tuning workflows
  • Evaluation Metrics: Define domain-specific evaluation criteria
  • Continuous Learning: Update models with new financial data

Implementation Example:

@Component
@Slf4j
public class FinancialLLMFineTuner {

@Autowired
private ModelTrainingService trainingService;

@Autowired
private DatasetManager datasetManager;

@Autowired
private ModelEvaluationService evaluationService;

@Scheduled(cron = "0 0 2 * * SUN") // Weekly training
public void scheduledFineTuning() {
try {
// Prepare training dataset
FinancialDataset dataset = datasetManager.prepareFinancialDataset();

// Configure training parameters
TrainingConfig config = TrainingConfig.builder()
.baseModel("gpt-4")
.learningRate(0.0001)
.batchSize(16)
.epochs(3)
.warmupSteps(100)
.evaluationStrategy(EvaluationStrategy.STEPS)
.evaluationSteps(500)
.build();

// Start fine-tuning
TrainingResult result = trainingService.fineTuneModel(config, dataset);

// Evaluate model performance
EvaluationResult evaluation = evaluationService.evaluate(
result.getModelId(), dataset.getTestSet());

// Deploy if performance meets criteria
if (evaluation.getFinancialAccuracy() > 0.95) {
deployModel(result.getModelId());
}

} catch (Exception e) {
log.error("Fine-tuning failed", e);
}
}

private void deployModel(String modelId) {
// Implement model deployment logic
// Include A/B testing for gradual rollout
}
}

Multi-Modal Message Processing

The system processes diverse input types, including text, images, audio, and documents. Each modality is handled by specialized processors that extract relevant information and convert it into a unified format.

MultiModalProcessor

@Component
public class MultiModalProcessor {

@Autowired
private AudioTranscriptionService audioTranscriptionService;

@Autowired
private ImageAnalysisService imageAnalysisService;

@Autowired
private DocumentExtractionService documentExtractionService;

public ProcessedInput processInput(MultiModalInput input) {
ProcessedInput.Builder builder = ProcessedInput.builder();

// Process audio to text
if (input.hasAudio()) {
String transcription = audioTranscriptionService.transcribe(input.getAudio());
builder.transcription(transcription);
}

// Process images
if (input.hasImages()) {
List<ImageAnalysisResult> imageResults = input.getImages().stream()
.map(imageAnalysisService::analyzeImage)
.collect(Collectors.toList());
builder.imageAnalysis(imageResults);
}

// Process documents
if (input.hasDocuments()) {
List<ExtractedContent> documentContents = input.getDocuments().stream()
.map(documentExtractionService::extractContent)
.collect(Collectors.toList());
builder.documentContents(documentContents);
}

return builder.build();
}
}

Multi-Format Document Processing

@Component
public class DocumentProcessor {

@Autowired
private PdfProcessor pdfProcessor;

@Autowired
private ExcelProcessor excelProcessor;

@Autowired
private WordProcessor wordProcessor;

@Autowired
private TextSplitter textSplitter;

public List<Document> processDocument(MultipartFile file) throws IOException {
String filename = file.getOriginalFilename();
String contentType = file.getContentType();

String content = switch (contentType) {
case "application/pdf" -> pdfProcessor.extractText(file);
case "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" ->
excelProcessor.extractText(file);
case "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ->
wordProcessor.extractText(file);
case "text/plain" -> new String(file.getBytes(), StandardCharsets.UTF_8);
default -> throw new UnsupportedFileTypeException("Unsupported file type: " + contentType);
};

// Split content into chunks
List<String> chunks = textSplitter.splitText(content);

// Create documents
return chunks.stream()
.map(chunk -> Document.builder()
.content(chunk)
.metadata(Map.of(
"filename", filename,
"content_type", contentType,
"chunk_size", String.valueOf(chunk.length())
))
.build())
.collect(Collectors.toList());
}
}

Key Interview Question: “How do you handle different file formats and ensure consistent processing across modalities?”

Reference Answer: The system uses a plugin-based architecture where each file type has a dedicated processor. Common formats like PDF, DOCX, and images are handled by specialized libraries (Apache PDFBox, Apache POI, etc.). For audio, we use speech-to-text services. All processors output to a common ProcessedInput format, ensuring consistency downstream. The system is extensible - new processors can be added without modifying core logic.
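
A minimal sketch of that plugin-style registry is shown below. The FileTypeProcessor interface and the registry class are assumptions introduced for illustration, while injecting every matching bean into a list reflects standard Spring behavior.

import java.util.List;
import java.util.Optional;
import org.springframework.stereotype.Component;
import org.springframework.web.multipart.MultipartFile;

interface FileTypeProcessor {
    boolean supports(String contentType);
    String extractText(MultipartFile file);
}

@Component
class FileProcessorRegistry {

    private final List<FileTypeProcessor> processors;

    // Spring injects every FileTypeProcessor bean; supporting a new format
    // only requires adding a new bean, the core logic stays untouched
    FileProcessorRegistry(List<FileTypeProcessor> processors) {
        this.processors = processors;
    }

    public Optional<String> process(MultipartFile file) {
        return processors.stream()
                .filter(p -> p.supports(file.getContentType()))
                .findFirst()
                .map(p -> p.extractText(file));
    }
}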

RAG Implementation for Knowledge Base

The RAG system combines vector search with contextual generation to provide accurate, relevant responses about financial topics.

RAGService

@Service
public class RAGService {

@Autowired
private VectorStoreService vectorStore;

@Autowired
private ChatModel chatModel;

@Autowired
private PromptTemplateService promptTemplateService;

public RAGResponse generateResponse(String query, ConversationContext context) {
// Step 1: Retrieve relevant documents
List<Document> relevantDocs = retrieveRelevantDocuments(query);

// Step 2: Rank and filter documents
List<Document> rankedDocs = rankDocuments(relevantDocs, query, context);

// Step 3: Generate response with context
return generateWithContext(query, rankedDocs, context);
}

private List<Document> rankDocuments(List<Document> documents,
String query,
ConversationContext context) {
// Implement re-ranking based on:
// - Semantic similarity
// - Recency of information
// - User's conversation history
// - Domain-specific relevance

return documents.stream()
.sorted((doc1, doc2) -> {
double score1 = calculateRelevanceScore(doc1, query, context);
double score2 = calculateRelevanceScore(doc2, query, context);
return Double.compare(score2, score1);
})
.limit(3)
.collect(Collectors.toList());
}

private double calculateRelevanceScore(Document doc, String query, ConversationContext context) {
double semanticScore = calculateSemanticSimilarity(doc, query);
double contextScore = calculateContextualRelevance(doc, context);
double freshnessScore = calculateFreshnessScore(doc);

return 0.5 * semanticScore + 0.3 * contextScore + 0.2 * freshnessScore;
}
}

Multi-Vector RAG for Financial Documents

@Service
public class AdvancedRAGService {

@Autowired
private VectorStore semanticVectorStore;

@Autowired
private VectorStore keywordVectorStore;

@Autowired
private GraphRAGService graphRAGService;

public RAGResponse queryWithMultiVectorRAG(String query, ConversationContext context) {
// Semantic search
List<Document> semanticResults = semanticVectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);

// Keyword search
List<Document> keywordResults = keywordVectorStore.similaritySearch(
SearchRequest.query(extractKeywords(query)).withTopK(5)
);

// Graph-based retrieval for relationship context
List<Document> graphResults = graphRAGService.retrieveRelatedDocuments(query);

// Combine and re-rank results
List<Document> combinedResults = reRankDocuments(
Arrays.asList(semanticResults, keywordResults, graphResults),
query
);

// Generate response with multi-vector context
return generateEnhancedResponse(query, combinedResults, context);
}

private List<Document> reRankDocuments(List<List<Document>> documentLists, String query) {
// Implement reciprocal rank fusion (RRF)
Map<String, Double> documentScores = new HashMap<>();

for (List<Document> documents : documentLists) {
for (int i = 0; i < documents.size(); i++) {
Document doc = documents.get(i);
String docId = doc.getId();
double score = 1.0 / (i + 1); // Reciprocal rank
documentScores.merge(docId, score, Double::sum);
}
}

// Sort by combined score and return top results
return documentScores.entrySet().stream()
.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
.limit(10)
.map(entry -> findDocumentById(entry.getKey()))
.filter(Objects::nonNull)
.collect(Collectors.toList());
}
}

Financial Domain-Specific Text Splitter

@Component
public class FinancialTextSplitter {

private static final Pattern FINANCIAL_SECTION_PATTERN =
Pattern.compile("(?=INCOME|EXPENSES|ASSETS|LIABILITIES|CASH FLOW|CREDIT HISTORY)",
Pattern.CASE_INSENSITIVE); // zero-width lookahead keeps each section heading with its chunk

private static final Pattern CURRENCY_PATTERN =
Pattern.compile("\\$[0-9,]+\\.?[0-9]*|[0-9,]+\\.[0-9]{2}");

public List<String> splitFinancialDocument(String text) {
List<String> chunks = new ArrayList<>();

// Split by financial sections first
String[] sections = FINANCIAL_SECTION_PATTERN.split(text);

for (String section : sections) {
if (section.length() > 2000) {
// Further split large sections while preserving financial context
chunks.addAll(splitLargeSection(section));
} else {
chunks.add(section.trim());
}
}

return chunks.stream()
.filter(chunk -> !chunk.isEmpty())
.collect(Collectors.toList());
}

private List<String> splitLargeSection(String section) {
List<String> chunks = new ArrayList<>();
String[] sentences = section.split("\\.");

StringBuilder currentChunk = new StringBuilder();
boolean previousHadCurrency = false;

for (String sentence : sentences) {
// Don't start a new chunk right after a sentence containing a currency amount,
// so financial figures stay together with their surrounding context
if (!previousHadCurrency && currentChunk.length() + sentence.length() > 1500) {
if (currentChunk.length() > 0) {
chunks.add(currentChunk.toString().trim());
currentChunk = new StringBuilder();
}
}

currentChunk.append(sentence).append(".");
previousHadCurrency = CURRENCY_PATTERN.matcher(sentence).find();
}

if (currentChunk.length() > 0) {
chunks.add(currentChunk.toString().trim());
}

return chunks;
}
}

MCP Server and Agent-to-Agent Communication

The Model Context Protocol (MCP) enables seamless communication between specialized agents, each handling specific domain expertise.

MCPServerManager

@Component
public class MCPServerManager {

private final Map<String, MCPAgent> agents = new ConcurrentHashMap<>();

@PostConstruct
public void initializeAgents() {
// Initialize specialized agents
agents.put("credit_agent", new CreditAnalysisAgent());
agents.put("risk_agent", new RiskAssessmentAgent());
agents.put("compliance_agent", new ComplianceAgent());
agents.put("document_agent", new DocumentAnalysisAgent());
}

public AgentResponse routeToAgent(String agentType, AgentRequest request) {
MCPAgent agent = agents.get(agentType);
if (agent == null) {
throw new AgentNotFoundException("Agent not found: " + agentType);
}

return agent.process(request);
}

public CompoundResponse processWithMultipleAgents(List<String> agentTypes,
AgentRequest request) {
CompoundResponse.Builder responseBuilder = CompoundResponse.builder();

// Process with multiple agents in parallel
List<CompletableFuture<AgentResponse>> futures = agentTypes.stream()
.map(agentType -> CompletableFuture.supplyAsync(() ->
routeToAgent(agentType, request)))
.collect(Collectors.toList());

CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
.join();

// Combine responses
List<AgentResponse> responses = futures.stream()
.map(CompletableFuture::join)
.collect(Collectors.toList());

return responseBuilder.agentResponses(responses).build();
}
}

Credit Analysis Agent Example

@Component
public class CreditAnalysisAgent {

@MCPMethod("analyze_credit_profile")
public CreditAnalysisResult analyzeCreditProfile(CreditAnalysisRequest request) {
// Specialized credit analysis logic
CreditProfile profile = request.getCreditProfile();

// Calculate various credit metrics
double debtToIncomeRatio = calculateDebtToIncomeRatio(profile);
double creditUtilization = calculateCreditUtilization(profile);
int paymentHistory = analyzePaymentHistory(profile);

// Generate risk score
double riskScore = calculateCreditRiskScore(debtToIncomeRatio, creditUtilization, paymentHistory);

// Provide recommendations
List<String> recommendations = generateCreditRecommendations(profile, riskScore);

return CreditAnalysisResult.builder()
.riskScore(riskScore)
.debtToIncomeRatio(debtToIncomeRatio)
.creditUtilization(creditUtilization)
.paymentHistoryScore(paymentHistory)
.recommendations(recommendations)
.analysisTimestamp(Instant.now())
.build();
}

private double calculateCreditRiskScore(double dtiRatio, double utilization, int paymentHistory) {
// Weighted scoring algorithm
double dtiWeight = 0.35;
double utilizationWeight = 0.30;
double paymentHistoryWeight = 0.35;

double dtiScore = Math.max(0, 100 - (dtiRatio * 2)); // Lower DTI = higher score
double utilizationScore = Math.max(0, 100 - (utilization * 100)); // Lower utilization = higher score
double paymentScore = paymentHistory; // Already normalized to 0-100

return (dtiScore * dtiWeight) + (utilizationScore * utilizationWeight) + (paymentScore * paymentHistoryWeight);
}
}

Session Memory and Context Caching with mem0

The mem0 solution provides sophisticated context management, maintaining conversation state and user preferences across sessions.

@Service
public class ContextMemoryService {

@Autowired
private Mem0Client mem0Client;

@Autowired
private RedisTemplate<String, Object> redisTemplate;

public ConversationContext getContext(String sessionId) {
// Try L1 cache first (Redis)
ConversationContext context = (ConversationContext)
redisTemplate.opsForValue().get("context:" + sessionId);

if (context == null) {
// Fall back to mem0 for persistent context
context = mem0Client.getContext(sessionId);
if (context != null) {
// Cache in Redis for quick access
redisTemplate.opsForValue().set("context:" + sessionId,
context, Duration.ofMinutes(30));
}
}

return context != null ? context : new ConversationContext(sessionId);
}

public void updateContext(String sessionId, ConversationContext context) {
// Update both caches
redisTemplate.opsForValue().set("context:" + sessionId,
context, Duration.ofMinutes(30));
mem0Client.updateContext(sessionId, context);
}

public void addMemory(String sessionId, Memory memory) {
mem0Client.addMemory(sessionId, memory);

// Invalidate cache to force refresh
redisTemplate.delete("context:" + sessionId);
}
}

Key Interview Question: “How do you handle context windows and memory management in long conversations?”

Reference Answer: The system uses a hierarchical memory approach. Short-term context is kept in Redis for quick access, while long-term memories are stored in mem0. We implement context window management by summarizing older parts of conversations and keeping only the most relevant recent exchanges. The system also uses semantic clustering to group related memories and retrieves them based on relevance to the current conversation.
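
The summarize-then-truncate part of that approach could look roughly like the sketch below. The class, the threshold of ten exchanges, and the summary prompt are illustrative assumptions; the ChatModel call style follows the Spring AI usage elsewhere in this post, and the import paths assume a recent Spring AI release.

import java.util.ArrayList;
import java.util.List;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;

class ContextWindowManager {

    private static final int RECENT_EXCHANGES_TO_KEEP = 10; // assumed threshold

    private final ChatModel chatModel;

    ContextWindowManager(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    List<String> compact(List<String> exchanges) {
        if (exchanges.size() <= RECENT_EXCHANGES_TO_KEEP) {
            return exchanges;
        }
        int cutoff = exchanges.size() - RECENT_EXCHANGES_TO_KEEP;
        List<String> older = exchanges.subList(0, cutoff);
        List<String> recent = exchanges.subList(cutoff, exchanges.size());

        // Collapse the older exchanges into a single synthetic "memory" message
        String summary = chatModel.call(new Prompt(
                "Summarize the following conversation, keeping facts relevant to the loan application:\n"
                        + String.join("\n", older)))
                .getResult().getOutput().getContent();

        List<String> compacted = new ArrayList<>();
        compacted.add("Summary of earlier conversation: " + summary);
        compacted.addAll(recent);
        return compacted;
    }
}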

LLM ReAct Pattern Implementation

The ReAct (Reasoning + Acting) pattern enables the system to break down complex queries into reasoning steps and actions.

@Component
public class ReActEngine {

@Autowired
private ChatModel chatModel;

@Autowired
private ToolRegistry toolRegistry;

private static final int MAX_STEPS = 10; // safety bound on reasoning/acting iterations

public ReActResponse process(String query, ConversationContext context) {
ReActState state = new ReActState(query, context);

while (!state.isComplete() && state.getStepCount() < MAX_STEPS) {
// Reasoning step
ReasoningResult reasoning = performReasoning(state);
state.addReasoning(reasoning);

// Action step
if (reasoning.requiresAction()) {
ActionResult action = performAction(reasoning.getAction(), state);
state.addAction(action);

// Observation step
if (action.hasObservation()) {
state.addObservation(action.getObservation());
}
}

// Check if we have enough information to provide final answer
if (reasoning.canProvideAnswer()) {
state.setComplete(true);
}
}

return generateFinalResponse(state);
}

private ReasoningResult performReasoning(ReActState state) {
String reasoningPrompt = buildReasoningPrompt(state);
ChatResponse response = chatModel.call(new Prompt(reasoningPrompt));

return parseReasoningResponse(response.getResult().getOutput().getContent());
}

private ActionResult performAction(Action action, ReActState state) {
Tool tool = toolRegistry.getTool(action.getToolName());
if (tool == null) {
return ActionResult.error("Tool not found: " + action.getToolName());
}

return tool.execute(action.getParameters(), state.getContext());
}
}

LLM Planning Pattern Implementation

The Planning pattern enables the system to create and execute complex multi-step plans for loan processing workflows.

Planning Implementation:

@Service
public class PlanningAgent {

@Autowired
private ChatClient chatClient;

@Autowired
private TaskExecutor taskExecutor;

@Autowired
private PlanValidator planValidator;

public PlanExecutionResult executePlan(String objective, PlanningConfig config) {
// Step 1: Generate plan
Plan plan = generatePlan(objective, config);

// Step 2: Validate plan
ValidationResult validation = planValidator.validate(plan);
if (!validation.isValid()) {
return PlanExecutionResult.failed(validation.getErrors());
}

// Step 3: Execute plan
return executePlanSteps(plan);
}

private Plan generatePlan(String objective, PlanningConfig config) {
String prompt = String.format(
"Create a detailed plan to achieve the following objective: %s\n\n" +
"Available capabilities: %s\n\n" +
"Constraints: %s\n\n" +
"Generate a step-by-step plan with the following format:\n" +
"1. Step description\n" +
" - Required tools: [tool1, tool2]\n" +
" - Expected output: description\n" +
" - Dependencies: [step numbers]\n\n" +
"Plan:",
objective,
config.getAvailableCapabilities(),
config.getConstraints());

ChatResponse response = chatClient.call(new Prompt(prompt));
return parsePlan(response.getResult().getOutput().getContent());
}

private PlanExecutionResult executePlanSteps(Plan plan) {
List<StepResult> stepResults = new ArrayList<>();
Map<String, Object> context = new HashMap<>();

for (PlanStep step : plan.getSteps()) {
try {
// Check dependencies
if (!areDependenciesMet(step, stepResults)) {
return PlanExecutionResult.failed("Dependencies not met for step: " + step.getId());
}

// Execute step
StepResult result = executeStep(step, context);
stepResults.add(result);

// Update context with results
context.put(step.getId(), result.getOutput());

// Check if step failed
if (!result.isSuccess()) {
return PlanExecutionResult.failed("Step failed: " + step.getId());
}

} catch (Exception e) {
return PlanExecutionResult.failed("Step execution error: " + e.getMessage());
}
}

return PlanExecutionResult.success(stepResults);
}

private StepResult executeStep(PlanStep step, Map<String, Object> context) {
return taskExecutor.execute(TaskExecution.builder()
.stepId(step.getId())
.description(step.getDescription())
.tools(step.getRequiredTools())
.context(context)
.build());
}
}

// Example: Loan processing planning
@Component
public class LoanProcessingPlanner {

@Autowired
private PlanningAgent planningAgent;

public LoanProcessingResult processLoanWithPlanning(LoanApplication application) {
String objective = String.format(
"Process loan application for %s requesting $%,.2f. " +
"Complete all required verifications and make final decision.",
application.getApplicantName(),
application.getRequestedAmount());

PlanningConfig config = PlanningConfig.builder()
.availableCapabilities(Arrays.asList(
"document_verification", "credit_check", "income_verification",
"employment_verification", "risk_assessment", "compliance_check"))
.constraints(Arrays.asList(
"Must complete within 30 minutes",
"Must verify all required documents",
"Must comply with lending regulations"))
.build();

PlanExecutionResult result = planningAgent.executePlan(objective, config);

return LoanProcessingResult.builder()
.application(application)
.executionResult(result)
.decision(extractDecisionFromPlan(result))
.processingTime(result.getExecutionTime())
.build();
}
}

Model Providers Routing with Higress AI gateway

Higress AI Gateway provides intelligent routing and load balancing across multiple LLM providers, ensuring optimal performance and cost efficiency.

Gateway Configuration:

@Configuration
public class HigressAIGatewayConfig {

@Bean
public ModelProviderRouter modelProviderRouter() {
return ModelProviderRouter.builder()
.addProvider("openai", OpenAIProvider.builder()
.apiKey("${openai.api-key}")
.models(Arrays.asList("gpt-4", "gpt-3.5-turbo"))
.rateLimits(RateLimits.builder()
.requestsPerMinute(60)
.tokensPerMinute(150000)
.build())
.build())
.addProvider("anthropic", AnthropicProvider.builder()
.apiKey("${anthropic.api-key}")
.models(Arrays.asList("claude-3-opus", "claude-3-sonnet"))
.rateLimits(RateLimits.builder()
.requestsPerMinute(50)
.tokensPerMinute(100000)
.build())
.build())
.addProvider("azure", AzureProvider.builder()
.apiKey("${azure.api-key}")
.endpoint("${azure.endpoint}")
.models(Arrays.asList("gpt-4", "gpt-35-turbo"))
.rateLimits(RateLimits.builder()
.requestsPerMinute(100)
.tokensPerMinute(200000)
.build())
.build())
.routingStrategy(RoutingStrategy.WEIGHTED_ROUND_ROBIN)
.fallbackStrategy(FallbackStrategy.CASCADE)
.build();
}
}

@Service
@Slf4j
public class IntelligentModelRouter {

@Autowired
private ModelProviderRouter router;

@Autowired
private ModelPerformanceMonitor monitor;

@Autowired
private CostOptimizer costOptimizer;

public ModelResponse routeRequest(ModelRequest request) {
// Determine optimal provider based on request characteristics
ProviderSelection selection = selectOptimalProvider(request);

try {
// Route to selected provider
ModelResponse response = router.route(request, selection.getProvider());

// Update performance metrics
monitor.recordSuccess(selection.getProvider(), response.getLatency());

return response;

} catch (Exception e) {
// Handle failures with fallback
return handleFailureWithFallback(request, selection, e);
}
}

private ProviderSelection selectOptimalProvider(ModelRequest request) {
// Analyze request characteristics
RequestAnalysis analysis = analyzeRequest(request);

// Consider multiple factors for provider selection
List<ProviderScore> scores = new ArrayList<>();

for (String provider : router.getAvailableProviders()) {
double score = calculateProviderScore(provider, analysis);
scores.add(new ProviderScore(provider, score));
}

// Select provider with highest score
ProviderScore best = scores.stream()
.max(Comparator.comparingDouble(ProviderScore::getScore))
.orElse(scores.get(0));

return ProviderSelection.builder()
.provider(best.getProvider())
.confidence(best.getScore())
.reasoning(generateSelectionReasoning(best, analysis))
.build();
}

private double calculateProviderScore(String provider, RequestAnalysis analysis) {
double score = 0.0;

// Factor 1: Model capability match
score += calculateCapabilityScore(provider, analysis) * 0.4;

// Factor 2: Performance (latency, availability)
score += calculatePerformanceScore(provider) * 0.3;

// Factor 3: Cost efficiency
score += calculateCostScore(provider, analysis) * 0.2;

// Factor 4: Current load
score += calculateLoadScore(provider) * 0.1;

return score;
}

private ModelResponse handleFailureWithFallback(
ModelRequest request, ProviderSelection selection, Exception error) {

log.warn("Provider {} failed, attempting fallback", selection.getProvider(), error);

// Get fallback providers
List<String> fallbackProviders = router.getFallbackProviders(selection.getProvider());

for (String fallbackProvider : fallbackProviders) {
try {
ModelResponse response = router.route(request, fallbackProvider);
monitor.recordFallbackSuccess(fallbackProvider);
return response;
} catch (Exception fallbackError) {
log.warn("Fallback provider {} also failed", fallbackProvider, fallbackError);
}
}

// All providers failed
throw new ModelRoutingException("All providers failed for request", error);
}
}

LLM Prompt Templates via Nacos

Dynamic prompt management through Nacos configuration center enables hot-swapping of prompts without system restart.

@Component
@Slf4j
@ConfigurationProperties(prefix = "prompts")
public class PromptTemplateService {

@NacosValue("${prompts.loan-assessment}")
private String loanAssessmentTemplate;

@NacosValue("${prompts.risk-analysis}")
private String riskAnalysisTemplate;

@NacosValue("${prompts.knowledge-query}")
private String knowledgeQueryTemplate;

private final Map<String, String> templateCache = new ConcurrentHashMap<>();

@PostConstruct
public void initializeTemplates() {
templateCache.put("loan_assessment", loanAssessmentTemplate);
templateCache.put("risk_analysis", riskAnalysisTemplate);
templateCache.put("knowledge_query", knowledgeQueryTemplate);
}

public String getTemplate(String templateName) {
return templateCache.getOrDefault(templateName, getDefaultTemplate());
}

@NacosConfigListener(dataId = "prompts", type = ConfigType.YAML)
public void onConfigChange(String configInfo) {
// Hot reload templates when configuration changes
log.info("Prompt templates updated, reloading...");
// Parse new configuration and update cache
updateTemplateCache(configInfo);
}

private void updateTemplateCache(String configInfo) {
try {
ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
Map<String, String> newTemplates = mapper.readValue(configInfo,
new TypeReference<Map<String, String>>() {});

templateCache.clear();
templateCache.putAll(newTemplates);

log.info("Successfully updated {} prompt templates", newTemplates.size());
} catch (Exception e) {
log.error("Failed to update prompt templates", e);
}
}
}

Monitoring and Observability with OpenTelemetry

OpenTelemetry provides comprehensive observability for the AI system, enabling performance monitoring, error tracking, and optimization insights.

OpenTelemetry Configuration:

@Configuration
@EnableAutoConfiguration
public class OpenTelemetryConfig {

@Bean
public OpenTelemetry openTelemetry() {
return OpenTelemetrySdk.builder()
.setTracerProvider(
SdkTracerProvider.builder()
.addSpanProcessor(BatchSpanProcessor.builder(
OtlpGrpcSpanExporter.builder()
.setEndpoint("http://jaeger:14250")
.build())
.build())
.setResource(Resource.getDefault()
.merge(Resource.builder()
.put(ResourceAttributes.SERVICE_NAME, "fintech-ai-system")
.put(ResourceAttributes.SERVICE_VERSION, "1.0.0")
.build()))
.build())
.setMeterProvider(
SdkMeterProvider.builder()
.registerMetricReader(
PeriodicMetricReader.builder(
OtlpGrpcMetricExporter.builder()
.setEndpoint("http://prometheus:9090")
.build())
.setInterval(Duration.ofSeconds(30))
.build())
.build())
.build();
}
}

@Component
public class AISystemObservability {

private final Tracer tracer;
private final Meter meter;

// Metrics
private final Counter requestCounter;
private final Histogram responseTime;
private final Gauge activeConnections;

public AISystemObservability(OpenTelemetry openTelemetry) {
this.tracer = openTelemetry.getTracer("fintech-ai-system");
this.meter = openTelemetry.getMeter("fintech-ai-system");

// Initialize metrics
this.requestCounter = meter.counterBuilder("ai_requests_total")
.setDescription("Total number of AI requests")
.build();

this.responseTime = meter.histogramBuilder("ai_response_time_seconds")
.setDescription("AI response time in seconds")
.build();

this.activeConnections = meter.gaugeBuilder("ai_active_connections")
.setDescription("Number of active AI connections")
.buildObserver();
}

public <T> T traceAIOperation(String operationName, Supplier<T> operation) {
Span span = tracer.spanBuilder(operationName)
.setSpanKind(SpanKind.INTERNAL)
.startSpan();

try (Scope scope = span.makeCurrent()) {
long startTime = System.nanoTime();

// Execute operation
T result = operation.get();

// Record metrics
long duration = System.nanoTime() - startTime;
responseTime.record(duration / 1_000_000_000.0);
requestCounter.add(1);

// Add span attributes
span.setStatus(StatusCode.OK);
span.setAttribute("operation.success", true);

return result;

} catch (Exception e) {
span.setStatus(StatusCode.ERROR, e.getMessage());
span.setAttribute("operation.success", false);
span.setAttribute("error.type", e.getClass().getSimpleName());
throw e;
} finally {
span.end();
}
}

public void recordLLMMetrics(String provider, String model, long tokens,
double latency, boolean success) {

Attributes attributes = Attributes.builder()
.put("provider", provider)
.put("model", model)
.put("success", success)
.build();

meter.counterBuilder("llm_requests_total")
.build()
.add(1, attributes);

meter.histogramBuilder("llm_token_usage")
.build()
.record(tokens, attributes);

meter.histogramBuilder("llm_latency_seconds")
.build()
.record(latency, attributes);
}
}

// Usage in services
@Service
public class MonitoredAIService {

@Autowired
private AISystemObservability observability;

@Autowired
private ChatClient chatClient;

public String processWithMonitoring(String query) {
return observability.traceAIOperation("llm_query_processing", () -> {
long startTime = System.currentTimeMillis();

try {
ChatResponse response = chatClient.call(new Prompt(query));

// Record success metrics
long latency = System.currentTimeMillis() - startTime;
observability.recordLLMMetrics("openai", "gpt-4",
response.getMetadata().getUsage().getTotalTokens(),
latency / 1000.0, true);

return response.getResult().getOutput().getContent();

} catch (Exception e) {
// Record failure metrics
long latency = System.currentTimeMillis() - startTime;
observability.recordLLMMetrics("openai", "gpt-4", 0,
latency / 1000.0, false);
throw e;
}
});
}
}

Use Cases and Examples

Use Case 1: Automated Loan Application Processing

Scenario: A customer applies for a $50,000 personal loan through the chat interface.

Flow:

  1. Customer initiates conversation: “I’d like to apply for a personal loan”
  2. AI classifies intent as LOAN_APPLICATION
  3. System guides customer through document collection
  4. AI processes submitted documents using OCR and NLP
  5. Automated workflow calls external systems for verification
  6. AI makes preliminary assessment with 92% confidence
  7. System auto-approves loan with conditions
// Example implementation
@Test
public void testAutomatedLoanFlow() {
// Simulate customer input
ConversationRequest request = ConversationRequest.builder()
.text("I need a $50,000 personal loan")
.sessionId("session-123")
.build();

// Process through conversation service
ConversationResponse response = conversationService.processMessage(request);

assertThat(response.getIntent()).isEqualTo(IntentType.LOAN_APPLICATION);
assertThat(response.getNextSteps()).contains("document_collection");

// Simulate document upload
ConversationRequest docRequest = ConversationRequest.builder()
.files(Arrays.asList(mockPayStub, mockBankStatement))
.sessionId("session-123")
.build();

ConversationResponse docResponse = conversationService.processMessage(docRequest);

// Verify AI processing
assertThat(docResponse.getProcessingResult().getConfidence()).isGreaterThan(0.9);
}

Use Case 2: Multi-Modal Customer Support

Scenario: Customer uploads a photo of their bank statement and asks about eligibility.

Flow:

  1. Customer uploads bank statement image
  2. OCR extracts text and financial data
  3. AI analyzes income patterns and expenses
  4. System queries knowledge base for eligibility criteria
  5. AI provides personalized eligibility assessment

Use Case 3: Complex Financial Query Resolution

Scenario: “What are the tax implications of early loan repayment?”

Flow:

  1. ReAct engine breaks down the query
  2. System retrieves relevant tax documents from knowledge base
  3. AI reasons through tax implications step by step
  4. System provides comprehensive answer with citations

Performance Optimization and Scalability

Caching Strategy

The system implements a multi-level caching strategy to achieve sub-30-second response times:

@Service
public class CachingService {

@Autowired
private RedisTemplate<String, Object> redisTemplate;

@Cacheable(value = "llm-responses", key = "#promptHash")
public String getCachedResponse(String promptHash) {
return (String) redisTemplate.opsForValue().get("llm:" + promptHash);
}

@CachePut(value = "llm-responses", key = "#promptHash")
public String cacheResponse(String promptHash, String response) {
redisTemplate.opsForValue().set("llm:" + promptHash, response,
Duration.ofHours(1));
return response;
}

@Cacheable(value = "embeddings", key = "#text.hashCode()")
public List<Float> getCachedEmbedding(String text) {
return (List<Float>) redisTemplate.opsForValue().get("embedding:" + text.hashCode());
}
}

Load Balancing and Horizontal Scaling


flowchart LR
A[Load Balancer] --> B[Service Instance 1]
A --> C[Service Instance 2]
A --> D[Service Instance 3]

B --> E[LLM Provider 1]
B --> F[LLM Provider 2]
C --> E
C --> F
D --> E
D --> F

E --> G[Redis Cache]
F --> G

B --> H[Vector DB]
C --> H
D --> H

Database Optimization

@Entity
@Table(name = "loan_applications", indexes = {
@Index(name = "idx_applicant_status", columnList = "applicant_id, status"),
@Index(name = "idx_created_date", columnList = "created_date")
})
public class LoanApplication {

@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;

@Column(name = "applicant_id", nullable = false)
private String applicantId;

@Enumerated(EnumType.STRING)
@Column(name = "status", nullable = false)
private ApplicationStatus status;

@Column(name = "created_date", nullable = false)
private LocalDateTime createdDate;

// Optimized for queries
@Column(name = "search_vector", columnDefinition = "tsvector")
private String searchVector;
}

Key Interview Question: “How do you ensure the system can handle 2000+ concurrent users while maintaining response times?”

Reference Answer: The system uses several optimization techniques: 1) Multi-level caching with Redis for frequently accessed data, 2) Connection pooling for database and external service calls, 3) Asynchronous processing for non-critical operations, 4) Load balancing across multiple LLM providers, 5) Database query optimization with proper indexing, 6) Context caching to avoid repeated LLM calls for similar queries, and 7) Horizontal scaling of microservices based on demand.
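
As one concrete piece of point 3 (asynchronous processing of non-critical operations), a bounded executor configuration might look like the sketch below. The pool sizes and bean name are assumptions, not tuned production values.

import java.util.concurrent.Executor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncProcessingConfig {

    @Bean(name = "nonCriticalExecutor")
    public Executor nonCriticalExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(16);       // steady-state workers
        executor.setMaxPoolSize(64);        // burst capacity under load
        executor.setQueueCapacity(2000);    // roughly one queued task per concurrent user
        executor.setThreadNamePrefix("async-noncritical-");
        executor.initialize();
        return executor;
    }
}

Notification delivery and audit logging can then be annotated with @Async("nonCriticalExecutor") so they never hold up the request thread serving the chat response.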

Conclusion

The FinTech AI Workflow and Chat System represents a sophisticated integration of traditional financial workflows with cutting-edge AI technologies. By combining the reliability of established banking processes with the intelligence of modern AI systems, the platform delivers a superior user experience while maintaining the security and compliance requirements essential in financial services.

The architecture’s microservices design ensures scalability and maintainability, while the AI components provide intelligent automation that reduces processing time and improves accuracy. The system’s ability to handle over 2000 concurrent conversations with rapid response times demonstrates its enterprise readiness.

Key success factors include:

  • Seamless integration between traditional and AI-powered workflows
  • Robust multi-modal processing capabilities
  • Intelligent context management and memory systems
  • Flexible prompt template management for rapid iteration
  • Comprehensive performance optimization strategies

The system sets a new standard for AI-powered financial services, combining the best of human expertise with artificial intelligence to create a truly intelligent lending platform.


System Architecture Overview

A distributed pressure testing system leverages multiple client nodes coordinated through Apache Zookeeper to simulate high-load scenarios against target services. This architecture provides horizontal scalability, centralized coordination, and real-time monitoring capabilities.


graph TB
subgraph "Control Layer"
    Master[MasterTestNode]
    Dashboard[Dashboard Website]
    ZK[Zookeeper Cluster]
end

subgraph "Execution Layer"
    Client1[ClientTestNode 1]
    Client2[ClientTestNode 2]
    Client3[ClientTestNode N]
end

subgraph "Target Layer"
    Service[Target Microservice]
    DB[(Database)]
end

Master --> ZK
Dashboard --> Master
Client1 --> ZK
Client2 --> ZK
Client3 --> ZK
Client1 --> Service
Client2 --> Service
Client3 --> Service
Service --> DB

ZK -.-> Master
ZK -.-> Client1
ZK -.-> Client2
ZK -.-> Client3

Interview Question: Why choose Zookeeper for coordination instead of a message queue like Kafka or RabbitMQ?

Answer: Zookeeper provides strong consistency guarantees, distributed configuration management, and service discovery capabilities essential for test coordination. Unlike message queues that focus on data streaming, Zookeeper excels at maintaining cluster state, leader election, and distributed locks - critical for coordinating test execution phases and preventing race conditions.
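
To illustrate those primitives, the sketch below uses Apache Curator for leader election and a distributed lock; the library choice and the znode paths are assumptions, since the post does not prescribe a specific Zookeeper client.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class TestCoordinationSketch {

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Leader election: only one master node drives the test lifecycle
        LeaderLatch leaderLatch = new LeaderLatch(client, "/test/master-election");
        leaderLatch.start();
        leaderLatch.await();   // blocks until this node becomes the leader

        // Distributed lock: prevents two coordinators from starting a run concurrently
        InterProcessMutex startLock = new InterProcessMutex(client, "/test/start-lock");
        startLock.acquire();
        try {
            // publish task configuration, signal clients to begin, etc.
        } finally {
            startLock.release();
        }
    }
}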

Core Components Design

ClientTestNode Architecture

The ClientTestNode is the workhorse of the system, responsible for generating load and collecting metrics. Built on Netty for high-performance HTTP communication.

@Component
public class ClientTestNode {
private final ZookeeperClient zkClient;
private final NettyHttpClient httpClient;
private final MetricsCollector metricsCollector;
private final TaskConfiguration taskConfig;

@PostConstruct
public void initialize() {
// Register with Zookeeper
zkClient.registerNode(getNodeInfo());

// Initialize Netty client
httpClient.initialize(taskConfig.getNettyConfig());

// Start metrics collection
metricsCollector.startCollection();
}

public void executeTest() {
TestTask task = zkClient.getTestTask();

EventLoopGroup group = new NioEventLoopGroup(task.getThreadCount());
try {
Bootstrap bootstrap = new Bootstrap()
.group(group)
.channel(NioSocketChannel.class)
.handler(new HttpClientInitializer(metricsCollector));

// Execute concurrent requests
IntStream.range(0, task.getConcurrency())
.parallel()
.forEach(i -> executeRequest(bootstrap, task));

} finally {
group.shutdownGracefully();
}
}

private void executeRequest(Bootstrap bootstrap, TestTask task) {
long startTime = System.nanoTime();

ChannelFuture future = bootstrap.connect(task.getTargetHost(), task.getTargetPort());
future.addListener((ChannelFutureListener) channelFuture -> {
if (channelFuture.isSuccess()) {
Channel channel = channelFuture.channel();

// Build HTTP request
FullHttpRequest request = new DefaultFullHttpRequest(
HTTP_1_1, HttpMethod.valueOf(task.getMethod()), task.getPath());
request.headers().set(HttpHeaderNames.HOST, task.getTargetHost());
request.headers().set(HttpHeaderNames.CONNECTION, HttpHeaderValues.KEEP_ALIVE);

// Send request and handle response
channel.writeAndFlush(request);
}
});
}
}

MasterTestNode Coordination

The MasterTestNode orchestrates the entire testing process, manages client lifecycle, and aggregates results.

@Service
public class MasterTestNode {
private final ZookeeperClient zkClient;
private final TestTaskManager taskManager;
private final ResultAggregator resultAggregator;

public void startTest(TestConfiguration config) {
// Create test task in Zookeeper
String taskPath = zkClient.createTestTask(config);

// Wait for client nodes to register
waitForClientNodes(config.getRequiredClientCount());

// Distribute task configuration
distributeTaskConfiguration(taskPath, config);

// Monitor test execution
monitorTestExecution(taskPath);
}

private void waitForClientNodes(int requiredCount) {
CountDownLatch latch = new CountDownLatch(requiredCount);

zkClient.watchChildren("/test/clients", (event) -> {
List<String> children = zkClient.getChildren("/test/clients");
if (children.size() >= requiredCount) {
latch.countDown();
}
});

try {
latch.await(30, TimeUnit.SECONDS);
} catch (InterruptedException e) {
throw new TestExecutionException("Timeout waiting for client nodes");
}
}

public TestResult aggregateResults() {
List<String> clientNodes = zkClient.getChildren("/test/clients");
List<ClientMetrics> allMetrics = new ArrayList<>();

for (String clientNode : clientNodes) {
ClientMetrics metrics = zkClient.getData("/test/results/" + clientNode, ClientMetrics.class);
allMetrics.add(metrics);
}

return resultAggregator.aggregate(allMetrics);
}
}

Task Configuration Management

Configuration Structure

@Data
@JsonSerialize
public class TaskConfiguration {
private String testId;
private String targetUrl;
private HttpMethod method;
private Map<String, String> headers;
private String requestBody;
private LoadPattern loadPattern;
private Duration duration;
private int concurrency;
private int qps;
private RetryPolicy retryPolicy;
private NettyConfiguration nettyConfig;

@Data
public static class LoadPattern {
private LoadType type; // CONSTANT, RAMP_UP, SPIKE, STEP
private List<LoadStep> steps;

@Data
public static class LoadStep {
private Duration duration;
private int targetQps;
private int concurrency;
}
}

@Data
public static class NettyConfiguration {
private int connectTimeoutMs = 5000;
private int readTimeoutMs = 10000;
private int maxConnections = 1000;
private boolean keepAlive = true;
private int workerThreads = Runtime.getRuntime().availableProcessors() * 2;
}
}

Dynamic Configuration Updates

@Component
public class DynamicConfigurationManager {
private final ZookeeperClient zkClient;
private volatile TaskConfiguration currentConfig;
private volatile RateLimiter rateLimiter; // active limiter, swapped on hot configuration changes

@PostConstruct
public void initialize() {
String configPath = "/test/config";

// Watch for configuration changes
zkClient.watchData(configPath, (event) -> {
if (event.getType() == EventType.NodeDataChanged) {
updateConfiguration(zkClient.getData(configPath, TaskConfiguration.class));
}
});
}

private void updateConfiguration(TaskConfiguration newConfig) {
TaskConfiguration oldConfig = this.currentConfig;
this.currentConfig = newConfig;

// Apply hot configuration changes
if (oldConfig != null && !Objects.equals(oldConfig.getQps(), newConfig.getQps())) {
adjustLoadRate(newConfig.getQps());
}

if (oldConfig != null && !Objects.equals(oldConfig.getConcurrency(), newConfig.getConcurrency())) {
adjustConcurrency(newConfig.getConcurrency());
}
}

private void adjustLoadRate(int newQps) {
// Swap in a new limiter so subsequent requests honor the updated QPS target
this.rateLimiter = RateLimiter.create(newQps);
}
}

Interview Question: How do you handle configuration consistency across distributed nodes during runtime updates?

Answer: We use Zookeeper’s atomic operations and watches to ensure configuration consistency. When the master updates configuration, it uses conditional writes (compare-and-swap) to prevent conflicts. Client nodes register watches on configuration znodes and receive immediate notifications. We implement a two-phase commit pattern: first distribute the new configuration, then send an activation signal once all nodes acknowledge receipt.
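
The conditional-write part of that answer can be expressed with Zookeeper's versioned setData, shown here through Curator (the client library is an assumption). If another writer has modified the configuration znode since it was read, the update fails and the caller retries against the fresh version.

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.Stat;

public class ConfigPublisher {

    private final CuratorFramework client;

    public ConfigPublisher(CuratorFramework client) {
        this.client = client;
    }

    /** Compare-and-swap style update: succeeds only if the znode is unchanged since the read. */
    public boolean publish(String path, byte[] newConfig) throws Exception {
        Stat stat = new Stat();
        client.getData().storingStatIn(stat).forPath(path);   // read current data and version

        try {
            client.setData().withVersion(stat.getVersion()).forPath(path, newConfig);
            return true;
        } catch (KeeperException.BadVersionException conflict) {
            return false;   // someone else won the race; caller re-reads and retries
        }
    }
}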

Metrics Collection and Statistics

Real-time Metrics Collection

@Component
public class MetricsCollector {
private final Timer responseTimer;
private final Counter requestCounter;
private final Counter errorCounter;
private final Histogram responseSizeHistogram;
private final ScheduledExecutorService scheduler;
private final long startTime = System.currentTimeMillis(); // collection start, used for QPS calculation
private ZookeeperClient zkClient; // set at node registration, used to publish per-node snapshots
private String nodeId; // unique node identifier, assigned at registration

public MetricsCollector() {
MetricRegistry registry = new MetricRegistry();
this.responseTimer = registry.timer("http.response.time");
this.requestCounter = registry.counter("http.requests.total");
this.errorCounter = registry.counter("http.errors.total");
this.responseSizeHistogram = registry.histogram("http.response.size");
this.scheduler = Executors.newScheduledThreadPool(2);
}

public void recordRequest(long responseTimeNanos, int statusCode, int responseSize) {
responseTimer.update(responseTimeNanos, TimeUnit.NANOSECONDS);
requestCounter.inc();

if (statusCode >= 400) {
errorCounter.inc();
}

responseSizeHistogram.update(responseSize);
}

public MetricsSnapshot getSnapshot() {
Snapshot timerSnapshot = responseTimer.getSnapshot();

return MetricsSnapshot.builder()
.timestamp(System.currentTimeMillis())
.totalRequests(requestCounter.getCount())
.totalErrors(errorCounter.getCount())
.qps(calculateQPS())
.avgResponseTime(timerSnapshot.getMean())
.p95ResponseTime(timerSnapshot.get95thPercentile())
.p99ResponseTime(timerSnapshot.get99thPercentile())
.errorRate(calculateErrorRate())
.build();
}

private double calculateQPS() {
long elapsedMs = Math.max(1, System.currentTimeMillis() - startTime);
return requestCounter.getCount() / (elapsedMs / 1000.0);
}

private double calculateErrorRate() {
long total = requestCounter.getCount();
return total == 0 ? 0.0 : (double) errorCounter.getCount() / total;
}

@Scheduled(fixedRate = 1000) // Report every second
public void reportMetrics() {
MetricsSnapshot snapshot = getSnapshot();
zkClient.updateData("/test/metrics/" + nodeId, snapshot);
}
}

Advanced Statistical Calculations

@Service
public class StatisticalAnalyzer {

public TestResult calculateDetailedStatistics(List<MetricsSnapshot> snapshots) {
if (snapshots.isEmpty()) {
return TestResult.empty();
}

// Calculate aggregated metrics
DoubleSummaryStatistics responseTimeStats = snapshots.stream()
.mapToDouble(MetricsSnapshot::getAvgResponseTime)
.summaryStatistics();

// Calculate percentiles using HdrHistogram for accuracy
Histogram histogram = new Histogram(3);
snapshots.forEach(snapshot ->
histogram.recordValue((long) snapshot.getAvgResponseTime()));

// Throughput analysis
double totalQps = snapshots.stream()
.mapToDouble(MetricsSnapshot::getQps)
.sum();

// Error rate analysis
double totalRequests = snapshots.stream()
.mapToDouble(MetricsSnapshot::getTotalRequests)
.sum();
double totalErrors = snapshots.stream()
.mapToDouble(MetricsSnapshot::getTotalErrors)
.sum();
double overallErrorRate = totalRequests == 0 ? 0.0 : totalErrors / totalRequests * 100;

// Stability analysis
double responseTimeStdDev = calculateStandardDeviation(
snapshots.stream()
.mapToDouble(MetricsSnapshot::getAvgResponseTime)
.toArray());

return TestResult.builder()
.totalQps(totalQps)
.avgResponseTime(responseTimeStats.getAverage())
.minResponseTime(responseTimeStats.getMin())
.maxResponseTime(responseTimeStats.getMax())
.p50ResponseTime(histogram.getValueAtPercentile(50))
.p95ResponseTime(histogram.getValueAtPercentile(95))
.p99ResponseTime(histogram.getValueAtPercentile(99))
.p999ResponseTime(histogram.getValueAtPercentile(99.9))
.errorRate(overallErrorRate)
.responseTimeStdDev(responseTimeStdDev)
.stabilityScore(calculateStabilityScore(responseTimeStdDev, overallErrorRate))
.build();
}

private double calculateStabilityScore(double stdDev, double errorRate) {
// Custom stability scoring algorithm
double variabilityScore = Math.max(0, 100 - (stdDev / 10)); // Lower std dev = higher score
double reliabilityScore = Math.max(0, 100 - (errorRate * 2)); // Lower error rate = higher score

return (variabilityScore + reliabilityScore) / 2;
}
}

Interview Question: How do you ensure accurate percentile calculations in a distributed environment?

Answer: We use HdrHistogram library for accurate percentile calculations with minimal memory overhead. Each client node maintains local histograms and periodically serializes them to Zookeeper. The master node deserializes and merges histograms using HdrHistogram’s built-in merge capabilities, which maintains accuracy across distributed measurements. This approach is superior to simple averaging and provides true percentile values across the entire distributed system.
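
Since the merge step is central to that answer, here is a small sketch of how client histograms could be serialized and combined using HdrHistogram's built-in compressed encoding and add() support (class and method names are illustrative):

public class HistogramMerger {

    // Client side: serialize the local histogram before writing it to a Zookeeper znode
    public static byte[] serialize(Histogram local) {
        ByteBuffer buffer = ByteBuffer.allocate(local.getNeededByteBufferCapacity());
        int length = local.encodeIntoCompressedByteBuffer(buffer);
        return Arrays.copyOf(buffer.array(), length);
    }

    // Master side: decode each client's histogram and merge into a single aggregate
    public static Histogram merge(List<byte[]> encodedHistograms) throws DataFormatException {
        Histogram aggregate = new Histogram(3); // 3 significant digits, matching the analyzer above
        for (byte[] encoded : encodedHistograms) {
            Histogram clientHistogram =
                    Histogram.decodeFromCompressedByteBuffer(ByteBuffer.wrap(encoded), 0);
            aggregate.add(clientHistogram);
        }
        return aggregate; // aggregate.getValueAtPercentile(99) is now a true cluster-wide P99
    }
}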

Zookeeper Integration Patterns

Service Discovery and Registration

@Component
public class ZookeeperServiceRegistry {
private final CuratorFramework client;
private final ServiceDiscovery<TestNodeMetadata> serviceDiscovery;

public ZookeeperServiceRegistry() {
this.client = CuratorFrameworkFactory.newClient(
"localhost:2181",
new ExponentialBackoffRetry(1000, 3)
);

this.serviceDiscovery = ServiceDiscoveryBuilder.builder(TestNodeMetadata.class)
.client(client)
.basePath("/test/services")
.build();
}

public void registerTestNode(TestNodeInfo nodeInfo) {
try {
ServiceInstance<TestNodeMetadata> instance = ServiceInstance.<TestNodeMetadata>builder()
.name("test-client")
.id(nodeInfo.getNodeId())
.address(nodeInfo.getHost())
.port(nodeInfo.getPort())
.payload(new TestNodeMetadata(nodeInfo))
.build();

serviceDiscovery.registerService(instance);

// Create ephemeral sequential node for load balancing
client.create()
.withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
.forPath("/test/clients/client-", nodeInfo.serialize());

} catch (Exception e) {
throw new ServiceRegistrationException("Failed to register test node", e);
}
}

public List<TestNodeInfo> discoverAvailableNodes() {
try {
Collection<ServiceInstance<TestNodeMetadata>> instances =
serviceDiscovery.queryForInstances("test-client");

return instances.stream()
.map(instance -> instance.getPayload().getNodeInfo())
.collect(Collectors.toList());
} catch (Exception e) {
throw new ServiceDiscoveryException("Failed to discover test nodes", e);
}
}
}

Distributed Coordination and Synchronization

@Service
public class DistributedTestCoordinator {
private final CuratorFramework client;
private final DistributedBarrier startBarrier;
private final DistributedBarrier endBarrier;
private final InterProcessMutex configLock;

public DistributedTestCoordinator(CuratorFramework client) {
this.client = client;
this.startBarrier = new DistributedBarrier(client, "/test/barriers/start");
this.endBarrier = new DistributedBarrier(client, "/test/barriers/end");
this.configLock = new InterProcessMutex(client, "/test/locks/config");
}

public void coordinateTestStart(int expectedClients) throws Exception {
// Wait for all clients to be ready
CountDownLatch clientReadyLatch = new CountDownLatch(expectedClients);

PathChildrenCache clientCache = new PathChildrenCache(client, "/test/clients", true);
clientCache.getListenable().addListener((cache, event) -> {
if (event.getType() == PathChildrenCacheEvent.Type.CHILD_ADDED) {
clientReadyLatch.countDown();
}
});
clientCache.start();

// Wait for all clients with timeout
boolean allReady = clientReadyLatch.await(30, TimeUnit.SECONDS);
if (!allReady) {
throw new TestCoordinationException("Not all clients ready within timeout");
}

// Set start barrier to begin test
startBarrier.setBarrier();

// Signal all clients to start
client.setData().forPath("/test/control/command", "START".getBytes());
}

public void waitForTestCompletion() throws Exception {
// Wait for end barrier
endBarrier.waitOnBarrier();

// Cleanup
cleanupTestResources();
}

public void updateConfigurationSafely(TaskConfiguration newConfig) throws Exception {
// Acquire distributed lock
if (configLock.acquire(10, TimeUnit.SECONDS)) {
try {
// Atomic configuration update
String configPath = "/test/config";
Stat stat = client.checkExists().forPath(configPath);

client.setData()
.withVersion(stat.getVersion())
.forPath(configPath, JsonUtils.toJson(newConfig).getBytes());

} finally {
configLock.release();
}
} else {
throw new ConfigurationException("Failed to acquire configuration lock");
}
}
}

Interview Question: How do you handle network partitions and split-brain scenarios in your distributed testing system?

Answer: We implement several safeguards: 1) Use Zookeeper’s session timeouts to detect node failures quickly. 2) Implement a master election process using Curator’s LeaderSelector to prevent split-brain. 3) Use distributed barriers to ensure synchronized test phases. 4) Implement exponential backoff retry policies for transient network issues. 5) Set minimum quorum requirements - tests only proceed if sufficient client nodes are available. 6) Use Zookeeper’s strong consistency guarantees to maintain authoritative state.
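
For the master election mentioned above, a minimal Curator LeaderSelector sketch might look like the following (the master duties themselves are only stubbed out):

public class MasterElection extends LeaderSelectorListenerAdapter {
    private final LeaderSelector selector;

    public MasterElection(CuratorFramework client) {
        this.selector = new LeaderSelector(client, "/test/master-election", this);
        this.selector.autoRequeue(); // re-enter the election if leadership is lost
    }

    public void start() {
        selector.start();
    }

    @Override
    public void takeLeadership(CuratorFramework client) throws Exception {
        // This node is the master only while this method is running; returning
        // (or losing the Zookeeper session) relinquishes leadership to another node
        runMasterDuties(client);
    }

    private void runMasterDuties(CuratorFramework client) throws Exception {
        // Coordinate test phases, aggregate metrics, etc. (illustrative placeholder)
        Thread.currentThread().join();
    }
}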

High-Performance Netty Implementation

Netty HTTP Client Configuration

@Configuration
public class NettyHttpClientConfig {

@Bean
public NettyHttpClient createHttpClient(TaskConfiguration config, MetricsCollector metricsCollector) {
NettyConfiguration nettyConfig = config.getNettyConfig();

EventLoopGroup workerGroup = new NioEventLoopGroup(nettyConfig.getWorkerThreads());

Bootstrap bootstrap = new Bootstrap()
.group(workerGroup)
.channel(NioSocketChannel.class)
.option(ChannelOption.SO_KEEPALIVE, nettyConfig.isKeepAlive())
.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, nettyConfig.getConnectTimeoutMs())
.option(ChannelOption.SO_REUSEADDR, true)
.option(ChannelOption.TCP_NODELAY, true)
.option(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT)
.handler(new ChannelInitializer<SocketChannel>() {
@Override
protected void initChannel(SocketChannel ch) {
ChannelPipeline pipeline = ch.pipeline();

// HTTP codec
pipeline.addLast(new HttpClientCodec());
pipeline.addLast(new HttpObjectAggregator(1048576)); // 1MB max

// Compression
pipeline.addLast(new HttpContentDecompressor());

// Timeout handlers
pipeline.addLast(new ReadTimeoutHandler(nettyConfig.getReadTimeoutMs(), TimeUnit.MILLISECONDS));

// Custom handler for metrics and response processing
pipeline.addLast(new HttpResponseHandler(metricsCollector));
}
});

return new NettyHttpClient(bootstrap, workerGroup);
}
}

High-Performance Request Execution

public class HttpResponseHandler extends SimpleChannelInboundHandler<FullHttpResponse> {
private static final Logger logger = LoggerFactory.getLogger(HttpResponseHandler.class);
private final MetricsCollector metricsCollector;
private final AtomicLong requestStartTime = new AtomicLong();

public HttpResponseHandler(MetricsCollector metricsCollector) {
this.metricsCollector = metricsCollector;
}

@Override
public void channelActive(ChannelHandlerContext ctx) {
// Note: channelActive fires once per connection; with keep-alive connections the
// start time should be reset when each request is written
requestStartTime.set(System.nanoTime());
}

@Override
protected void channelRead0(ChannelHandlerContext ctx, FullHttpResponse response) {
long responseTime = System.nanoTime() - requestStartTime.get();
int statusCode = response.status().code();
int responseSize = response.content().readableBytes();

// Record metrics
metricsCollector.recordRequest(responseTime, statusCode, responseSize);

// Handle response based on status
if (statusCode >= 200 && statusCode < 300) {
handleSuccessResponse(response);
} else {
handleErrorResponse(response, statusCode);
}

// Close connection if not keep-alive
if (!HttpUtil.isKeepAlive(response)) {
ctx.close();
}
}

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
long responseTime = System.nanoTime() - requestStartTime.get();

// Record error metrics
metricsCollector.recordRequest(responseTime, 0, 0);

logger.error("Request failed", cause);
ctx.close();
}

private void handleSuccessResponse(FullHttpResponse response) {
// Process successful response
String contentType = response.headers().get(HttpHeaderNames.CONTENT_TYPE);
ByteBuf content = response.content();

// Optional: Validate response content
if (contentType != null && contentType.contains("application/json")) {
validateJsonResponse(content.toString(StandardCharsets.UTF_8));
}
}
}

Connection Pool Management

@Component
public class NettyConnectionPoolManager {
private final Map<String, Channel> connectionPool = new ConcurrentHashMap<>();
private final AtomicInteger connectionCount = new AtomicInteger(0);
private final int maxConnections;
private final Bootstrap bootstrap; // pre-configured client bootstrap (see NettyHttpClientConfig)

public NettyConnectionPoolManager(NettyConfiguration config, Bootstrap bootstrap) {
this.maxConnections = config.getMaxConnections();
this.bootstrap = bootstrap;
}

public Channel getConnection(String host, int port) {
String key = host + ":" + port;

return connectionPool.computeIfAbsent(key, k -> {
if (connectionCount.get() >= maxConnections) {
throw new ConnectionPoolExhaustedException("Connection pool exhausted");
}

return createNewConnection(host, port);
});
}

private Channel createNewConnection(String host, int port) {
try {
ChannelFuture future = bootstrap.connect(host, port);
Channel channel = future.sync().channel();

connectionCount.incrementAndGet();

// Add close listener to update connection count
channel.closeFuture().addListener(closeFuture -> {
connectionCount.decrementAndGet();
connectionPool.remove(host + ":" + port);
});

return channel;
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new ConnectionException("Failed to create connection", e);
}
}

public void closeAllConnections() {
connectionPool.values().forEach(Channel::close);
connectionPool.clear();
connectionCount.set(0);
}
}

Dashboard and Visualization

Real-time Dashboard Backend

@RestController
@RequestMapping("/api/dashboard")
public class DashboardController {
private final TestResultService testResultService;
private final SimpMessagingTemplate messagingTemplate;

@GetMapping("/tests/{testId}/metrics")
public ResponseEntity<TestMetrics> getCurrentMetrics(@PathVariable String testId) {
TestMetrics metrics = testResultService.getCurrentMetrics(testId);
return ResponseEntity.ok(metrics);
}

@GetMapping("/tests/{testId}/timeline")
public ResponseEntity<List<TimelineData>> getMetricsTimeline(
@PathVariable String testId,
@RequestParam(defaultValue = "300") int seconds) {

List<TimelineData> timeline = testResultService.getMetricsTimeline(testId, seconds);
return ResponseEntity.ok(timeline);
}

@EventListener
public void handleMetricsUpdate(MetricsUpdateEvent event) {
// Broadcast real-time metrics to WebSocket clients
messagingTemplate.convertAndSend(
"/topic/metrics/" + event.getTestId(),
event.getMetrics()
);
}

@GetMapping("/tests/{testId}/report")
public ResponseEntity<TestReport> generateReport(@PathVariable String testId) {
TestReport report = testResultService.generateComprehensiveReport(testId);
return ResponseEntity.ok(report);
}
}

WebSocket Configuration for Real-time Updates

@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig implements WebSocketMessageBrokerConfigurer {

@Override
public void configureMessageBroker(MessageBrokerRegistry config) {
config.enableSimpleBroker("/topic");
config.setApplicationDestinationPrefixes("/app");
}

@Override
public void registerStompEndpoints(StompEndpointRegistry registry) {
registry.addEndpoint("/websocket")
.setAllowedOriginPatterns("*")
.withSockJS();
}
}

Frontend Dashboard Components

// Real-time metrics dashboard component
class MetricsDashboard {
constructor(testId) {
this.testId = testId;
this.socket = new SockJS('/websocket');
this.stompClient = Stomp.over(this.socket);
this.charts = {};

this.initializeCharts();
this.connectWebSocket();
}

initializeCharts() {
// QPS Chart
this.charts.qps = new Chart(document.getElementById('qpsChart'), {
type: 'line',
data: {
labels: [],
datasets: [{
label: 'QPS',
data: [],
borderColor: 'rgb(75, 192, 192)',
tension: 0.1
}]
},
options: {
responsive: true,
scales: {
y: {
beginAtZero: true
}
},
plugins: {
title: {
display: true,
text: 'Queries Per Second'
}
}
}
});

// Response Time Chart
this.charts.responseTime = new Chart(document.getElementById('responseTimeChart'), {
type: 'line',
data: {
labels: [],
datasets: [
{
label: 'Average',
data: [],
borderColor: 'rgb(54, 162, 235)'
},
{
label: 'P95',
data: [],
borderColor: 'rgb(255, 206, 86)'
},
{
label: 'P99',
data: [],
borderColor: 'rgb(255, 99, 132)'
}
]
},
options: {
responsive: true,
scales: {
y: {
beginAtZero: true,
title: {
display: true,
text: 'Response Time (ms)'
}
}
}
}
});
}

connectWebSocket() {
this.stompClient.connect({}, (frame) => {
console.log('Connected: ' + frame);

this.stompClient.subscribe(`/topic/metrics/${this.testId}`, (message) => {
const metrics = JSON.parse(message.body);
this.updateCharts(metrics);
this.updateMetricCards(metrics);
});
});
}

updateCharts(metrics) {
const timestamp = new Date(metrics.timestamp).toLocaleTimeString();

// Update QPS chart
this.addDataPoint(this.charts.qps, timestamp, metrics.qps);

// Update Response Time chart
this.addDataPoint(this.charts.responseTime, timestamp, [
metrics.avgResponseTime,
metrics.p95ResponseTime,
metrics.p99ResponseTime
]);
}

addDataPoint(chart, label, data) {
chart.data.labels.push(label);

if (Array.isArray(data)) {
data.forEach((value, index) => {
chart.data.datasets[index].data.push(value);
});
} else {
chart.data.datasets[0].data.push(data);
}

// Keep only last 50 data points
if (chart.data.labels.length > 50) {
chart.data.labels.shift();
chart.data.datasets.forEach(dataset => dataset.data.shift());
}

chart.update('none'); // No animation for better performance
}

updateMetricCards(metrics) {
document.getElementById('currentQps').textContent = metrics.qps.toFixed(0);
document.getElementById('avgResponseTime').textContent = metrics.avgResponseTime.toFixed(2) + ' ms';
document.getElementById('errorRate').textContent = (metrics.errorRate * 100).toFixed(2) + '%';
document.getElementById('activeConnections').textContent = metrics.activeConnections;
}
}

Production Deployment Considerations

Docker Configuration

# ClientTestNode Dockerfile
FROM eclipse-temurin:17-jre

WORKDIR /app

# Install monitoring tools
RUN apt-get update && apt-get install -y \
curl \
netcat-openbsd \
htop \
&& rm -rf /var/lib/apt/lists/*

COPY target/client-test-node.jar app.jar

# JVM optimization for load testing
ENV JAVA_OPTS="-Xms2g -Xmx4g -XX:+UseG1GC -XX:+UseStringDeduplication -XX:MaxGCPauseMillis=200 -Dio.netty.allocator.type=pooled -Dio.netty.allocator.numDirectArenas=8"

EXPOSE 8080 8081

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8080/actuator/health || exit 1

ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar app.jar"]

Kubernetes Deployment

# client-test-node-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: client-test-node
  labels:
    app: client-test-node
spec:
  replicas: 5
  selector:
    matchLabels:
      app: client-test-node
  template:
    metadata:
      labels:
        app: client-test-node
    spec:
      containers:
        - name: client-test-node
          image: your-registry/client-test-node:latest
          ports:
            - containerPort: 8080
            - containerPort: 8081
          env:
            - name: ZOOKEEPER_HOSTS
              value: "zookeeper:2181"
            - name: NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 60
          readinessProbe:
            httpGet:
              path: /actuator/ready
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: client-test-node-service
spec:
  selector:
    app: client-test-node
  ports:
    - name: http
      port: 8080
      targetPort: 8080
    - name: metrics
      port: 8081
      targetPort: 8081
  type: ClusterIP

Monitoring and Observability

@Component
public class SystemMonitor {
private final MeterRegistry meterRegistry;
private final ScheduledExecutorService scheduler;

public SystemMonitor(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
this.scheduler = Executors.newScheduledThreadPool(2);
initializeMetrics();
}

private void initializeMetrics() {
// JVM metrics
Metrics.gauge("jvm.memory.heap.used", this, monitor -> getHeapMemoryUsed());
Metrics.gauge("jvm.memory.heap.max", this, monitor -> getHeapMemoryMax());
Metrics.gauge("jvm.gc.pause", this, monitor -> getGCPauseTime());

// Netty metrics
Metrics.gauge("netty.connections.active", this, monitor -> getActiveConnections());
Metrics.gauge("netty.buffer.memory.used", this, monitor -> getBufferMemoryUsed());

// System metrics
Metrics.gauge("system.cpu.usage", this, monitor -> getCpuUsage());
Metrics.gauge("system.memory.usage", this, monitor -> getSystemMemoryUsage());

// Custom application metrics
scheduler.scheduleAtFixedRate(this::collectCustomMetrics, 0, 5, TimeUnit.SECONDS);
}

private void collectCustomMetrics() {
// Network byte counters are not exposed by java.net.NetworkInterface; in a real
// implementation they would come from an OS-level library (e.g. OSHI).
try {
Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
while (interfaces.hasMoreElements()) {
NetworkInterface ni = interfaces.nextElement();
if (ni.isUp() && !ni.isLoopback()) {
// Placeholder: register per-interface throughput gauges here
}
}
} catch (SocketException e) {
// Interface enumeration failed; skip network metrics for this cycle
}

// Thread pool metrics
ScheduledThreadPoolExecutor executor = (ScheduledThreadPoolExecutor) scheduler;
Metrics.gauge("thread.pool.active", executor.getActiveCount());
Metrics.gauge("thread.pool.queue.size", executor.getQueue().size());
}

@EventListener
public void handleTestEvent(TestEvent event) {
Metrics.counter("test.events",
Tags.of("type", event.getType().name(),
"status", event.getStatus().name()))
.increment();
}
}

Interview Question: How do you handle resource management and prevent memory leaks in a long-running load testing system?

Answer: We implement comprehensive resource management: 1) Use Netty’s pooled allocators to reduce GC pressure. 2) Configure appropriate JVM heap sizes and use G1GC for low-latency collection. 3) Implement proper connection lifecycle management with connection pooling. 4) Use weak references for caches and implement cache eviction policies. 5) Monitor memory usage through JMX and set up alerts for memory leaks. 6) Implement graceful shutdown procedures to clean up resources. 7) Use profiling tools like async-profiler to identify memory hotspots.
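
As a concrete illustration of the graceful-shutdown point, here is a sketch of a shutdown hook that drains schedulers, closes pooled channels, and then releases Netty's event loops; the wiring of the injected beans is assumed, not taken from the original code:

@Component
public class GracefulShutdownManager {
    private final EventLoopGroup workerGroup;
    private final ScheduledExecutorService scheduler;
    private final NettyConnectionPoolManager connectionPool;

    public GracefulShutdownManager(EventLoopGroup workerGroup,
                                   ScheduledExecutorService scheduler,
                                   NettyConnectionPoolManager connectionPool) {
        this.workerGroup = workerGroup;
        this.scheduler = scheduler;
        this.connectionPool = connectionPool;
    }

    @PreDestroy
    public void shutdown() throws InterruptedException {
        // Stop scheduling new work first, then give in-flight tasks a chance to finish
        scheduler.shutdown();
        if (!scheduler.awaitTermination(10, TimeUnit.SECONDS)) {
            scheduler.shutdownNow();
        }

        // Close pooled channels before releasing Netty's event loops
        connectionPool.closeAllConnections();
        workerGroup.shutdownGracefully(2, 15, TimeUnit.SECONDS).syncUninterruptibly();
    }
}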

Advanced Use Cases and Examples

Scenario 1: E-commerce Flash Sale Testing

@Component
public class FlashSaleTestScenario {

public TaskConfiguration createFlashSaleTest() {
return TaskConfiguration.builder()
.testId("flash-sale-2024")
.targetUrl("https://api.ecommerce.com/products/flash-sale")
.method(HttpMethod.POST)
.headers(Map.of(
"Content-Type", "application/json",
"User-Agent", "LoadTester/1.0"
))
.requestBody(generateRandomPurchaseRequest())
.loadPattern(LoadPattern.builder()
.type(LoadType.SPIKE)
.steps(Arrays.asList(
LoadStep.of(Duration.ofMinutes(2), 100, 10), // Warm-up
LoadStep.of(Duration.ofMinutes(1), 5000, 500), // Spike
LoadStep.of(Duration.ofMinutes(5), 2000, 200), // Sustained
LoadStep.of(Duration.ofMinutes(2), 100, 10) // Cool-down
))
.build())
.duration(Duration.ofMinutes(10))
.retryPolicy(RetryPolicy.builder()
.maxRetries(3)
.backoffStrategy(BackoffStrategy.EXPONENTIAL)
.build())
.build();
}

private String generateRandomPurchaseRequest() {
return """
{
"productId": "%s",
"quantity": %d,
"userId": "%s",
"paymentMethod": "credit_card",
"shippingAddress": {
"street": "123 Test St",
"city": "Test City",
"zipCode": "12345"
}
}
""".formatted(
generateRandomProductId(),
ThreadLocalRandom.current().nextInt(1, 5),
generateRandomUserId()
);
}
}

Scenario 2: Gradual Ramp-up Testing

@Component
public class GradualRampUpTestScenario {

public TaskConfiguration createRampUpTest() {
List<LoadStep> rampUpSteps = IntStream.range(0, 10)
.mapToObj(i -> LoadStep.of(
Duration.ofMinutes(2),
100 + (i * 200), // QPS: 100, 300, 500, 700, 900...
10 + (i * 20) // Concurrency: 10, 30, 50, 70, 90...
))
.collect(Collectors.toList());

return TaskConfiguration.builder()
.testId("gradual-ramp-up")
.targetUrl("https://api.service.com/endpoint")
.method(HttpMethod.GET)
.loadPattern(LoadPattern.builder()
.type(LoadType.RAMP_UP)
.steps(rampUpSteps)
.build())
.duration(Duration.ofMinutes(20))
.build();
}
}

Scenario 3: API Rate Limiting Validation

@Component
public class RateLimitingTestScenario {

public void testRateLimiting() {
TaskConfiguration config = TaskConfiguration.builder()
.testId("rate-limiting-validation")
.targetUrl("https://api.service.com/rate-limited-endpoint")
.method(HttpMethod.GET)
.headers(Map.of("API-Key", "test-key"))
.qps(1000) // Exceed rate limit intentionally
.concurrency(100)
.duration(Duration.ofMinutes(5))
.build();

// Custom result validator
TestResultValidator validator = new TestResultValidator() {
@Override
public ValidationResult validate(TestResult result) {
double rateLimitErrorRate = result.getErrorsByStatus().get(429) /
(double) result.getTotalRequests() * 100;

if (rateLimitErrorRate < 10) {
return ValidationResult.failed("Rate limiting not working properly");
}

if (result.getP99ResponseTime() > 5000) {
return ValidationResult.failed("Response time too high under rate limiting");
}

return ValidationResult.passed();
}
};

executeTestWithValidation(config, validator);
}
}

Error Handling and Resilience

Circuit Breaker Implementation

@Component
public class CircuitBreakerTestClient {
private final CircuitBreaker circuitBreaker;
private final MetricsCollector metricsCollector;

public CircuitBreakerTestClient() {
this.circuitBreaker = CircuitBreaker.ofDefaults("test-circuit-breaker");
this.circuitBreaker.getEventPublisher()
.onStateTransition(event ->
metricsCollector.recordCircuitBreakerEvent(event));
}

public CompletableFuture<HttpResponse> executeRequest(HttpRequest request) {
Supplier<CompletableFuture<HttpResponse>> decoratedSupplier =
CircuitBreaker.decorateSupplier(circuitBreaker, () -> {
try {
return httpClient.execute(request);
} catch (Exception e) {
throw new RuntimeException("Request failed", e);
}
});

return Try.ofSupplier(decoratedSupplier)
.recover(throwable -> {
if (throwable instanceof CallNotPermittedException) {
// Circuit breaker is open
metricsCollector.recordCircuitBreakerOpen();
return CompletableFuture.completedFuture(
HttpResponse.builder()
.statusCode(503)
.body("Circuit breaker open")
.build()
);
}
return CompletableFuture.failedFuture(throwable);
})
.get();
}
}

Retry Strategy with Backoff

@Component
public class RetryableTestClient {
private final Retry retry;
private final TimeLimiter timeLimiter;

public RetryableTestClient(RetryPolicy retryPolicy) {
this.retry = Retry.of("test-retry", RetryConfig.custom()
.maxAttempts(retryPolicy.getMaxRetries())
.waitDuration(Duration.ofMillis(retryPolicy.getBaseDelayMs()))
.intervalFunction(IntervalFunction.ofExponentialBackoff(
retryPolicy.getBaseDelayMs(),
retryPolicy.getMultiplier()))
.retryOnException(throwable ->
throwable instanceof IOException ||
throwable instanceof TimeoutException)
.build());

this.timeLimiter = TimeLimiter.of("test-timeout", TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(30))
.build());
}

public CompletableFuture<HttpResponse> executeWithRetry(HttpRequest request) {
Supplier<CompletableFuture<HttpResponse>> decoratedSupplier =
Decorators.ofSupplier(() -> httpClient.execute(request))
.withRetry(retry)
.withTimeLimiter(timeLimiter)
.decorate();

return decoratedSupplier.get();
}
}

Graceful Degradation

@Service
public class GracefulDegradationService {
private final HealthIndicator healthIndicator;
private final AlertService alertService;

@EventListener
public void handleHighErrorRate(HighErrorRateEvent event) {
if (event.getErrorRate() > 50) {
// Reduce load automatically
reduceTestLoad(event.getTestId(), 0.5); // Reduce to 50%
alertService.sendAlert("High error rate detected, reducing load");
}

if (event.getErrorRate() > 80) {
// Stop test to prevent damage
stopTest(event.getTestId());
alertService.sendCriticalAlert("Critical error rate, test stopped");
}
}

@EventListener
public void handleResourceExhaustion(ResourceExhaustionEvent event) {
switch (event.getResourceType()) {
case MEMORY:
// Trigger garbage collection and reduce batch sizes
System.gc();
adjustBatchSize(event.getTestId(), 0.7);
break;
case CPU:
// Reduce thread pool size
adjustThreadPoolSize(event.getTestId(), 0.8);
break;
case NETWORK:
// Implement connection throttling
enableConnectionThrottling(event.getTestId());
break;
}
}

private void reduceTestLoad(String testId, double factor) {
TaskConfiguration currentConfig = getTestConfiguration(testId);
TaskConfiguration reducedConfig = currentConfig.toBuilder()
.qps((int) (currentConfig.getQps() * factor))
.concurrency((int) (currentConfig.getConcurrency() * factor))
.build();

updateTestConfiguration(testId, reducedConfig);
}
}

Security and Authentication

Secure Test Execution

@Component
public class SecureTestExecutor {
private final JwtTokenProvider tokenProvider;
private final CertificateManager certificateManager;

public TaskConfiguration createSecureTestConfig() {
return TaskConfiguration.builder()
.testId("secure-api-test")
.targetUrl("https://secure-api.company.com/endpoint")
.method(HttpMethod.POST)
.headers(Map.of(
"Authorization", "Bearer " + tokenProvider.generateTestToken(),
"X-API-Key", getApiKey(),
"Content-Type", "application/json"
))
.sslConfig(SslConfig.builder()
.trustStore(certificateManager.getTrustStore())
.keyStore(certificateManager.getClientKeyStore())
.verifyHostname(false) // Only for testing
.build())
.build();
}

@Scheduled(fixedRate = 300000) // Refresh every 5 minutes
public void refreshSecurityTokens() {
String newToken = tokenProvider.refreshToken();
updateAllActiveTestsWithNewToken(newToken);
}

private void updateAllActiveTestsWithNewToken(String newToken) {
List<String> activeTests = getActiveTestIds();

for (String testId : activeTests) {
TaskConfiguration config = getTestConfiguration(testId);
Map<String, String> updatedHeaders = new HashMap<>(config.getHeaders());
updatedHeaders.put("Authorization", "Bearer " + newToken);

TaskConfiguration updatedConfig = config.toBuilder()
.headers(updatedHeaders)
.build();

updateTestConfiguration(testId, updatedConfig);
}
}
}

SSL/TLS Configuration

@Configuration
public class SSLConfiguration {

@Bean
public SslContext createSslContext() throws Exception {
return SslContextBuilder.forClient()
.trustManager(createTrustManagerFactory())
.keyManager(createKeyManagerFactory())
.protocols("TLSv1.2", "TLSv1.3")
.ciphers(Arrays.asList(
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
"TLS_DHE_RSA_WITH_AES_256_GCM_SHA384"
))
.build();
}

private TrustManagerFactory createTrustManagerFactory() throws Exception {
KeyStore trustStore = KeyStore.getInstance("JKS");
try (InputStream trustStoreStream = getClass()
.getResourceAsStream("/ssl/truststore.jks")) {
trustStore.load(trustStoreStream, "changeit".toCharArray());
}

TrustManagerFactory trustManagerFactory =
TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
trustManagerFactory.init(trustStore);

return trustManagerFactory;
}
}
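
The builder above also calls createKeyManagerFactory(), which the original snippet does not show. A minimal version mirroring the trust store loading, with placeholder path and password, could be:

private KeyManagerFactory createKeyManagerFactory() throws Exception {
    KeyStore keyStore = KeyStore.getInstance("PKCS12");
    try (InputStream keyStoreStream = getClass()
            .getResourceAsStream("/ssl/client-keystore.p12")) {
        keyStore.load(keyStoreStream, "changeit".toCharArray());
    }

    KeyManagerFactory keyManagerFactory =
            KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
    keyManagerFactory.init(keyStore, "changeit".toCharArray());

    return keyManagerFactory;
}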

Performance Optimization Techniques

Memory Management

@Component
public class MemoryOptimizedTestClient {
private final ObjectPool<ByteBuf> bufferPool;
private final ObjectPool<StringBuilder> stringBuilderPool;

public MemoryOptimizedTestClient() {
// Use Netty's pooled allocator
this.bufferPool = new DefaultObjectPool<>(
new PooledObjectFactory<ByteBuf>() {
@Override
public ByteBuf create() {
return PooledByteBufAllocator.DEFAULT.directBuffer(1024);
}

@Override
public void destroy(ByteBuf buffer) {
buffer.release();
}

@Override
public void reset(ByteBuf buffer) {
buffer.clear();
}
}
);

// String builder pool for JSON construction
this.stringBuilderPool = new DefaultObjectPool<>(
new PooledObjectFactory<StringBuilder>() {
@Override
public StringBuilder create() {
return new StringBuilder(512);
}

@Override
public void destroy(StringBuilder sb) {
// No explicit destruction needed
}

@Override
public void reset(StringBuilder sb) {
sb.setLength(0);
}
}
);
}

public HttpRequest createOptimizedRequest(RequestTemplate template) {
StringBuilder sb = stringBuilderPool.borrowObject();
ByteBuf buffer = bufferPool.borrowObject();

try {
// Build JSON request body efficiently
sb.append("{")
.append("\"timestamp\":").append(System.currentTimeMillis()).append(",")
.append("\"data\":\"").append(template.getData()).append("\"")
.append("}");

// Copy the body out before the pooled buffer is returned in the finally block;
// passing buffer.nioBuffer() directly would hand out memory that gets reset on return
byte[] body = sb.toString().getBytes(StandardCharsets.UTF_8);
buffer.writeBytes(body);

return HttpRequest.builder()
.uri(template.getUri())
.method(template.getMethod())
.body(ByteBuffer.wrap(body))
.build();

} finally {
stringBuilderPool.returnObject(sb);
bufferPool.returnObject(buffer);
}
}
}

CPU Optimization

@Component
public class CPUOptimizedTestExecutor {
private final DisruptorEventBus eventBus;
private final AffinityExecutor affinityExecutor;

public CPUOptimizedTestExecutor() {
// Use Disruptor for lock-free event processing
this.eventBus = new DisruptorEventBus("test-events", 1024 * 1024);

// CPU affinity for better cache locality
this.affinityExecutor = new AffinityExecutor("test-executor");
}

public void executeHighPerformanceTest(TaskConfiguration config) {
// Partition work across CPU cores
int coreCount = Runtime.getRuntime().availableProcessors();
int requestsPerCore = config.getQps() / coreCount;

List<CompletableFuture<Void>> futures = IntStream.range(0, coreCount)
.mapToObj(coreId ->
CompletableFuture.runAsync(
() -> executeOnCore(coreId, requestsPerCore, config),
affinityExecutor.getExecutor(coreId)
)
)
.collect(Collectors.toList());

// Wait for all cores to complete
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
.join();
}

private void executeOnCore(int coreId, int requestCount, TaskConfiguration config) {
// Pin thread to specific CPU core for better cache performance
AffinityLock lock = AffinityLock.acquireLock(coreId);
try {
RateLimiter rateLimiter = RateLimiter.create(requestCount);

for (int i = 0; i < requestCount; i++) {
rateLimiter.acquire();

// Execute request with minimal object allocation
executeRequestOptimized(config);
}
} finally {
lock.release();
}
}
}

Troubleshooting Common Issues

Connection Pool Exhaustion

@Component
public class ConnectionPoolMonitor {
private final ConnectionPool connectionPool;
private final AlertService alertService;

@Scheduled(fixedRate = 10000) // Check every 10 seconds
public void monitorConnectionPool() {
ConnectionPoolStats stats = connectionPool.getStats();

double utilizationRate = (double) stats.getActiveConnections() /
stats.getMaxConnections();

if (utilizationRate > 0.8) {
alertService.sendWarning("Connection pool utilization high: " +
(utilizationRate * 100) + "%");
}

if (utilizationRate > 0.95) {
// Emergency action: increase pool size or throttle requests
connectionPool.increasePoolSize(stats.getMaxConnections() * 2);
alertService.sendCriticalAlert("Connection pool nearly exhausted, " +
"increasing pool size");
}

// Monitor for connection leaks
if (stats.getLeakedConnections() > 0) {
alertService.sendAlert("Connection leak detected: " +
stats.getLeakedConnections() + " connections");
connectionPool.closeLeakedConnections();
}
}
}

Memory Leak Detection

@Component
public class MemoryLeakDetector {
private final MBeanServer mBeanServer;
private final List<MemorySnapshot> snapshots = new ArrayList<>();

@Scheduled(fixedRate = 60000) // Check every minute
public void checkMemoryUsage() {
MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();

MemorySnapshot snapshot = new MemorySnapshot(
System.currentTimeMillis(),
heapUsage.getUsed(),
heapUsage.getMax(),
heapUsage.getCommitted()
);

snapshots.add(snapshot);

// Keep only last 10 minutes of data
snapshots.removeIf(s ->
System.currentTimeMillis() - s.getTimestamp() > 600000);

// Detect memory leak pattern
if (snapshots.size() >= 10) {
boolean possibleLeak = detectMemoryLeakPattern();
if (possibleLeak) {
triggerMemoryDump();
alertService.sendCriticalAlert("Possible memory leak detected");
}
}
}

private boolean detectMemoryLeakPattern() {
// Simple heuristic: memory usage consistently increasing
List<Long> memoryUsages = snapshots.stream()
.map(MemorySnapshot::getUsedMemory)
.collect(Collectors.toList());

// Check if memory usage is consistently increasing
int increasingCount = 0;
for (int i = 1; i < memoryUsages.size(); i++) {
if (memoryUsages.get(i) > memoryUsages.get(i - 1)) {
increasingCount++;
}
}

return increasingCount > (memoryUsages.size() * 0.8);
}

private void triggerMemoryDump() {
try {
MBeanServer server = ManagementFactory.getPlatformMBeanServer();
HotSpotDiagnosticMXBean hotspotMXBean =
ManagementFactory.newPlatformMXBeanProxy(
server, "com.sun.management:type=HotSpotDiagnostic",
HotSpotDiagnosticMXBean.class);

String dumpFile = "/tmp/memory-dump-" +
System.currentTimeMillis() + ".hprof";
hotspotMXBean.dumpHeap(dumpFile, true);

logger.info("Memory dump created: " + dumpFile);
} catch (Exception e) {
logger.error("Failed to create memory dump", e);
}
}
}

Interview Questions and Insights

Q: How do you handle the coordination of thousands of concurrent test clients?

A: We use Zookeeper’s hierarchical namespace and watches for efficient coordination. Clients register as ephemeral sequential nodes under /test/clients/, allowing automatic discovery and cleanup. We implement a master-slave pattern where the master uses distributed barriers to synchronize test phases. For large-scale coordination, we use consistent hashing to partition clients into groups, with sub-masters coordinating each group to reduce the coordination load on the main master.
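
A minimal sketch of the consistent-hashing assignment described above, using a virtual-node ring to map clients to sub-masters (the class name and hash function choice are illustrative):

public class ClientGroupAssigner {
    private static final int VIRTUAL_NODES = 100;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ClientGroupAssigner(List<String> subMasterIds) {
        for (String subMaster : subMasterIds) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(subMaster + "#" + i), subMaster);
            }
        }
    }

    // Each client is routed to the first sub-master clockwise on the ring
    public String assignGroup(String clientId) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(clientId));
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    private long hash(String key) {
        // FNV-1a; any stable hash with good distribution works here
        long h = 0xcbf29ce484222325L;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= b;
            h *= 0x100000001b3L;
        }
        return h;
    }
}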

Q: What strategies do you use to ensure test result accuracy in a distributed environment?

A: We implement several accuracy measures: 1) Use NTP for time synchronization across all nodes. 2) Implement vector clocks for ordering distributed events. 3) Use HdrHistogram for accurate percentile calculations. 4) Implement consensus algorithms for critical metrics aggregation. 5) Use statistical sampling techniques for large datasets. 6) Implement outlier detection to identify and handle anomalous results. 7) Cross-validate results using multiple measurement techniques.

Q: How do you prevent your load testing from affecting production systems?

A: We implement multiple safeguards: 1) Circuit breakers to automatically stop testing when error rates exceed thresholds. 2) Rate limiting with gradual ramp-up to detect capacity limits early. 3) Monitoring dashboards with automatic alerts for abnormal patterns. 4) Separate network segments or VPCs for testing. 5) Database read replicas for read-heavy tests. 6) Feature flags to enable/disable test-specific functionality. 7) Graceful degradation mechanisms that reduce load automatically.

Q: How do you handle test data management in distributed testing?

A: We use a multi-layered approach: 1) Synthetic data generation using libraries like Faker for realistic test data. 2) Data partitioning strategies to avoid hotspots (e.g., user ID sharding). 3) Test data pools with automatic refresh mechanisms. 4) Database seeding scripts for consistent test environments. 5) Data masking for production-like datasets. 6) Cleanup procedures to maintain test data integrity. 7) Version control for test datasets to ensure reproducibility.
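
To make the data-pool idea concrete, here is a small sketch of a pre-filled pool with periodic refresh and sharded user IDs; PurchaseRequest, the pool sizes, and the generator are placeholders, and a library such as Faker could replace the hand-rolled generation:

public class TestDataPool {
    private final BlockingQueue<PurchaseRequest> pool = new LinkedBlockingQueue<>(10_000);
    private final ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();

    public TestDataPool() {
        // Top the pool up periodically so clients never block on data generation mid-test
        refresher.scheduleAtFixedRate(this::refill, 0, 5, TimeUnit.SECONDS);
    }

    private void refill() {
        while (pool.remainingCapacity() > 0) {
            pool.offer(generate());
        }
    }

    private PurchaseRequest generate() {
        // Spread user IDs across a wide range to avoid hotspotting one shard of the system under test
        long userId = ThreadLocalRandom.current().nextLong(1, 1_000_000);
        String productId = "P-" + ThreadLocalRandom.current().nextInt(1, 500);
        return new PurchaseRequest(userId, productId, ThreadLocalRandom.current().nextInt(1, 5)); // hypothetical DTO
    }

    public PurchaseRequest next() throws InterruptedException {
        return pool.take();
    }
}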

Best Practices and Recommendations

Test Planning and Design

  1. Start Small, Scale Gradually: Begin with single-node tests before scaling to distributed scenarios
  2. Realistic Load Patterns: Use production traffic patterns rather than constant load
  3. Comprehensive Monitoring: Monitor both client and server metrics during tests
  4. Baseline Establishment: Establish performance baselines before load testing
  5. Test Environment Isolation: Ensure test environments closely match production

Production Readiness Checklist

  • Comprehensive error handling and retry mechanisms
  • Resource leak detection and prevention
  • Graceful shutdown procedures
  • Monitoring and alerting integration
  • Security hardening (SSL/TLS, authentication)
  • Configuration management and hot reloading
  • Backup and disaster recovery procedures
  • Documentation and runbooks
  • Load testing of the load testing system itself

Scalability Considerations


graph TD
A[Client Requests] --> B{Load Balancer}
B --> C[Client Node 1]
B --> D[Client Node 2]
B --> E[Client Node N]

C --> F[Zookeeper Cluster]
D --> F
E --> F

F --> G[Master Node]
G --> H[Results Aggregator]
G --> I[Dashboard]

J[Auto Scaler] --> B
K[Metrics Monitor] --> J
H --> K

External Resources

This comprehensive guide provides a production-ready foundation for building a distributed pressure testing system using Zookeeper. The architecture balances performance, reliability, and scalability while providing detailed insights for system design interviews and real-world implementation.

Core Underlying Principles

Spring Security is built on several fundamental principles that form the backbone of its architecture and functionality. Understanding these principles is crucial for implementing robust security solutions.

Authentication vs Authorization

Authentication answers “Who are you?” while Authorization answers “What can you do?” Spring Security treats these as separate concerns, allowing for flexible security configurations.

// Authentication - verifying identity
@Override
protected void configure(AuthenticationManagerBuilder auth) throws Exception {
auth.inMemoryAuthentication()
.withUser("user")
.password(passwordEncoder().encode("password"))
.roles("USER");
}

// Authorization - defining access rules
@Override
protected void configure(HttpSecurity http) throws Exception {
http.authorizeRequests()
.antMatchers("/admin/**").hasRole("ADMIN")
.antMatchers("/user/**").hasRole("USER")
.anyRequest().authenticated();
}

Security Filter Chain

Spring Security operates through a chain of filters that intercept HTTP requests. Each filter has a specific responsibility and can either process the request or pass it to the next filter.


flowchart TD
A[HTTP Request] --> B[Security Filter Chain]
B --> C[SecurityContextPersistenceFilter]
C --> D[UsernamePasswordAuthenticationFilter]
D --> E[ExceptionTranslationFilter]
E --> F[FilterSecurityInterceptor]
F --> G[Application Controller]

style B fill:#e1f5fe
style G fill:#e8f5e8

SecurityContext and SecurityContextHolder

The SecurityContext stores security information for the current thread of execution. The SecurityContextHolder provides access to this context.

// Getting current authenticated user
Authentication authentication = SecurityContextHolder.getContext().getAuthentication();
String username = authentication.getName();
Collection<? extends GrantedAuthority> authorities = authentication.getAuthorities();

// Setting security context programmatically
UsernamePasswordAuthenticationToken token =
new UsernamePasswordAuthenticationToken(user, null, authorities);
SecurityContextHolder.getContext().setAuthentication(token);

Interview Insight: “How does Spring Security maintain security context across requests?”

Spring Security uses ThreadLocal to store security context, ensuring thread safety. The SecurityContextPersistenceFilter loads the context from HttpSession at the beginning of each request and clears it at the end.
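
One practical consequence: because the context is thread-bound, it does not automatically propagate to worker threads. A short sketch of the two common remedies (auditService is a placeholder):

// By default the context lives in a plain ThreadLocal, so child threads do not see it.
// Option 1: switch the holder strategy at startup, before any authentication happens
SecurityContextHolder.setStrategyName(SecurityContextHolder.MODE_INHERITABLETHREADLOCAL);

// Option 2: wrap an executor so the caller's context is copied into each submitted task
ExecutorService delegate = Executors.newFixedThreadPool(4);
ExecutorService securityAwareExecutor = new DelegatingSecurityContextExecutorService(delegate);

securityAwareExecutor.submit(() -> {
    // Runs with the submitting thread's Authentication available
    Authentication auth = SecurityContextHolder.getContext().getAuthentication();
    auditService.record(auth.getName()); // hypothetical service
});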

Principle of Least Privilege

Spring Security encourages granting minimal necessary permissions. This is implemented through role-based and method-level security.

@PreAuthorize("hasRole('ADMIN') or (hasRole('USER') and #username == authentication.name)")
public User getUserDetails(@PathVariable String username) {
return userService.findByUsername(username);
}

When to Use Spring Security Framework

Enterprise Applications

Spring Security is ideal for enterprise applications requiring:

  • Complex authentication mechanisms (LDAP, OAuth2, SAML)
  • Fine-grained authorization
  • Audit trails and compliance requirements
  • Integration with existing identity providers

Web Applications with User Management

Perfect for applications featuring:

  • User registration and login
  • Role-based access control
  • Session management
  • CSRF protection
@Configuration
@EnableWebSecurity
public class WebSecurityConfig extends WebSecurityConfigurerAdapter {

@Override
protected void configure(HttpSecurity http) throws Exception {
http
.authorizeRequests()
.antMatchers("/register", "/login").permitAll()
.antMatchers("/admin/**").hasRole("ADMIN")
.anyRequest().authenticated()
.and()
.formLogin()
.loginPage("/login")
.defaultSuccessUrl("/dashboard")
.and()
.logout()
.logoutSuccessUrl("/login?logout")
.and()
.csrf().csrfTokenRepository(CookieCsrfTokenRepository.withHttpOnlyFalse());
}
}

REST APIs and Microservices

Essential for securing REST APIs with:

  • JWT token-based authentication
  • Stateless security
  • API rate limiting
  • Cross-origin resource sharing (CORS)
@Configuration
@EnableWebSecurity
public class JwtSecurityConfig {

@Bean
public JwtAuthenticationEntryPoint jwtAuthenticationEntryPoint() {
return new JwtAuthenticationEntryPoint();
}

@Bean
public JwtRequestFilter jwtRequestFilter() {
return new JwtRequestFilter();
}

@Override
protected void configure(HttpSecurity http) throws Exception {
http.csrf().disable()
.authorizeRequests()
.antMatchers("/api/auth/**").permitAll()
.anyRequest().authenticated()
.and()
.exceptionHandling().authenticationEntryPoint(jwtAuthenticationEntryPoint)
.and()
.sessionManagement().sessionCreationPolicy(SessionCreationPolicy.STATELESS);

http.addFilterBefore(jwtRequestFilter, UsernamePasswordAuthenticationFilter.class);
}
}

When NOT to Use Spring Security

  • Simple applications with basic authentication needs
  • Applications with custom security requirements that conflict with Spring Security’s architecture
  • Performance-critical applications where the filter chain overhead is unacceptable
  • Applications requiring non-standard authentication flows

User Login, Logout, and Session Management

Login Process Flow


sequenceDiagram
participant U as User
participant B as Browser
participant S as Spring Security
participant A as AuthenticationManager
participant P as AuthenticationProvider
participant D as UserDetailsService

U->>B: Enter credentials
B->>S: POST /login
S->>A: Authenticate request
A->>P: Delegate authentication
P->>D: Load user details
D-->>P: Return UserDetails
P-->>A: Authentication result
A-->>S: Authenticated user
S->>B: Redirect to success URL
B->>U: Display protected resource

Custom Login Implementation

@Configuration
@EnableWebSecurity
public class LoginConfig extends WebSecurityConfigurerAdapter {

@Autowired
private CustomUserDetailsService userDetailsService;

@Autowired
private CustomAuthenticationSuccessHandler successHandler;

@Autowired
private CustomAuthenticationFailureHandler failureHandler;

@Override
protected void configure(HttpSecurity http) throws Exception {
http
.formLogin()
.loginPage("/custom-login")
.loginProcessingUrl("/perform-login")
.usernameParameter("email")
.passwordParameter("pwd")
.successHandler(successHandler)
.failureHandler(failureHandler)
.and()
.logout()
.logoutUrl("/perform-logout")
.logoutSuccessHandler(customLogoutSuccessHandler())
.deleteCookies("JSESSIONID")
.invalidateHttpSession(true);
}

@Bean
public CustomLogoutSuccessHandler customLogoutSuccessHandler() {
return new CustomLogoutSuccessHandler();
}
}

Custom Authentication Success Handler

@Component
public class CustomAuthenticationSuccessHandler implements AuthenticationSuccessHandler {

private final Logger logger = LoggerFactory.getLogger(CustomAuthenticationSuccessHandler.class);

@Override
public void onAuthenticationSuccess(HttpServletRequest request,
HttpServletResponse response,
Authentication authentication) throws IOException {

// Log successful login
logger.info("User {} logged in successfully", authentication.getName());

// Update last login timestamp
updateLastLoginTime(authentication.getName());

// Redirect based on role
String redirectUrl = determineTargetUrl(authentication);
response.sendRedirect(redirectUrl);
}

private String determineTargetUrl(Authentication authentication) {
boolean isAdmin = authentication.getAuthorities().stream()
.anyMatch(authority -> authority.getAuthority().equals("ROLE_ADMIN"));

return isAdmin ? "/admin/dashboard" : "/user/dashboard";
}

private void updateLastLoginTime(String username) {
// Implementation to update user's last login time
}
}

Session Management

Spring Security provides comprehensive session management capabilities:

@Override
protected void configure(HttpSecurity http) throws Exception {
http
.sessionManagement()
.sessionCreationPolicy(SessionCreationPolicy.IF_REQUIRED)
.maximumSessions(1)
.maxSessionsPreventsLogin(false)
.sessionRegistry(sessionRegistry())
.and()
.sessionFixation().migrateSession()
.invalidSessionUrl("/login?expired");
}

@Bean
public HttpSessionEventPublisher httpSessionEventPublisher() {
return new HttpSessionEventPublisher();
}

@Bean
public SessionRegistry sessionRegistry() {
return new SessionRegistryImpl();
}

Interview Insight: “How does Spring Security handle concurrent sessions?”

Spring Security can limit concurrent sessions per user through SessionRegistry. When maximum sessions are exceeded, it can either prevent new logins or invalidate existing sessions based on configuration.
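
Building on that, a small sketch of how the SessionRegistry can be queried and used to expire a user's other sessions (the class name is illustrative):

@Service
public class ActiveSessionService {

    @Autowired
    private SessionRegistry sessionRegistry;

    // List the non-expired sessions currently held by a given user
    public List<SessionInformation> activeSessionsFor(String username) {
        return sessionRegistry.getAllPrincipals().stream()
                .filter(principal -> principal instanceof UserDetails)
                .filter(principal -> ((UserDetails) principal).getUsername().equals(username))
                .flatMap(principal -> sessionRegistry.getAllSessions(principal, false).stream())
                .collect(Collectors.toList());
    }

    // Force-expire every other session for a user (e.g. after a password change)
    public void expireOtherSessions(String username, String currentSessionId) {
        activeSessionsFor(username).stream()
                .filter(session -> !session.getSessionId().equals(currentSessionId))
                .forEach(SessionInformation::expireNow);
    }
}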

Session Timeout Configuration

// In application.properties
server.servlet.session.timeout=30m

// Programmatic configuration
@Override
protected void configure(HttpSecurity http) throws Exception {
http
.sessionManagement()
.sessionCreationPolicy(SessionCreationPolicy.IF_REQUIRED)
.and()
.rememberMe()
.key("uniqueAndSecret")
.tokenValiditySeconds(86400) // 24 hours
.userDetailsService(userDetailsService);
}

Remember Me Functionality

@Configuration
public class RememberMeConfig {

@Bean
public PersistentTokenRepository persistentTokenRepository() {
JdbcTokenRepositoryImpl tokenRepository = new JdbcTokenRepositoryImpl();
tokenRepository.setDataSource(dataSource);
return tokenRepository;
}

@Override
protected void configure(HttpSecurity http) throws Exception {
http
.rememberMe()
.rememberMeParameter("remember-me")
.tokenRepository(persistentTokenRepository())
.tokenValiditySeconds(86400)
.userDetailsService(userDetailsService);
}
}

Logout Process

Proper logout implementation is essential for security, ensuring complete cleanup of user sessions and security contexts.

Comprehensive Logout Configuration

@Configuration
public class LogoutConfig {

@Bean
public SecurityFilterChain logoutFilterChain(HttpSecurity http) throws Exception {
return http
.logout(logout -> logout
.logoutUrl("/logout")
.logoutRequestMatcher(new AntPathRequestMatcher("/logout", "POST"))
.logoutSuccessUrl("/login?logout=true")
.logoutSuccessHandler(customLogoutSuccessHandler())
.invalidateHttpSession(true)
.clearAuthentication(true)
.deleteCookies("JSESSIONID", "remember-me")
.addLogoutHandler(customLogoutHandler())
)
.build();
}

@Bean
public LogoutSuccessHandler customLogoutSuccessHandler() {
return new CustomLogoutSuccessHandler();
}

@Bean
public LogoutHandler customLogoutHandler() {
return new CustomLogoutHandler();
}
}

Custom Logout Handlers

@Component
public class CustomLogoutHandler implements LogoutHandler {

private static final Logger logger = LoggerFactory.getLogger(CustomLogoutHandler.class);

@Autowired
private SessionRegistry sessionRegistry;

@Autowired
private RedisTemplate<String, Object> redisTemplate;

@Override
public void logout(HttpServletRequest request, HttpServletResponse response,
Authentication authentication) {

if (authentication != null) {
String username = authentication.getName();

// Clear user-specific cache
redisTemplate.delete("user:cache:" + username);
redisTemplate.delete("user:permissions:" + username);

// Log logout event
logger.info("User {} logged out from IP: {}", username, getClientIP(request));

// Invalidate all sessions for this user (optional)
sessionRegistry.getAllPrincipals().stream()
.filter(principal -> principal instanceof UserDetails)
.filter(principal -> ((UserDetails) principal).getUsername().equals(username))
.forEach(principal ->
sessionRegistry.getAllSessions(principal, false)
.forEach(SessionInformation::expireNow)
);
}

// Clear security context
SecurityContextHolder.clearContext();
}
}

@Component
public class CustomLogoutSuccessHandler implements LogoutSuccessHandler {

@Override
public void onLogoutSuccess(HttpServletRequest request, HttpServletResponse response,
Authentication authentication) throws IOException, ServletException {

// Add logout timestamp to response headers
response.addHeader("Logout-Time", Instant.now().toString());

// Redirect based on user agent or request parameter
String redirectUrl = "/login?logout=true";
String userAgent = request.getHeader("User-Agent");

if (userAgent != null && userAgent.contains("Mobile")) {
redirectUrl = "/mobile/login?logout=true";
}

response.sendRedirect(redirectUrl);
}
}

Logout Flow Diagram


sequenceDiagram
participant User
participant Browser
participant LogoutFilter
participant LogoutHandler
participant SessionRegistry
participant RedisCache
participant Database

User->>Browser: Click logout
Browser->>LogoutFilter: POST /logout
LogoutFilter->>LogoutHandler: Handle logout
LogoutHandler->>SessionRegistry: Invalidate sessions
LogoutHandler->>RedisCache: Clear user cache
LogoutHandler->>Database: Log logout event
LogoutHandler-->>LogoutFilter: Cleanup complete
LogoutFilter->>LogoutFilter: Clear SecurityContext
LogoutFilter-->>Browser: Redirect to login
Browser-->>User: Login page with logout message

Advanced Authentication Mechanisms

JWT Token-Based Authentication

@Component
public class JwtTokenUtil {

private static final String SECRET = "mySecretKey";
private static final int JWT_TOKEN_VALIDITY = 5 * 60 * 60; // 5 hours

public String generateToken(UserDetails userDetails) {
Map<String, Object> claims = new HashMap<>();
return createToken(claims, userDetails.getUsername());
}

private String createToken(Map<String, Object> claims, String subject) {
return Jwts.builder()
.setClaims(claims)
.setSubject(subject)
.setIssuedAt(new Date(System.currentTimeMillis()))
.setExpiration(new Date(System.currentTimeMillis() + JWT_TOKEN_VALIDITY * 1000))
.signWith(SignatureAlgorithm.HS512, SECRET)
.compact();
}

public Boolean validateToken(String token, UserDetails userDetails) {
final String username = getUsernameFromToken(token);
return (username.equals(userDetails.getUsername()) && !isTokenExpired(token));
}
}

OAuth2 Integration

@Configuration
@EnableOAuth2Client
public class OAuth2Config {

@Bean
public OAuth2RestTemplate oauth2RestTemplate(OAuth2ClientContext oauth2ClientContext) {
return new OAuth2RestTemplate(googleOAuth2ResourceDetails(), oauth2ClientContext);
}

@Bean
public OAuth2ProtectedResourceDetails googleOAuth2ResourceDetails() {
AuthorizationCodeResourceDetails details = new AuthorizationCodeResourceDetails();
details.setClientId("your-client-id");
details.setClientSecret("your-client-secret");
details.setAccessTokenUri("https://oauth2.googleapis.com/token");
details.setUserAuthorizationUri("https://accounts.google.com/o/oauth2/auth");
details.setScope(Arrays.asList("email", "profile"));
return details;
}
}

Method-Level Security

Enabling Method Security

@Configuration
@EnableGlobalMethodSecurity(
prePostEnabled = true,
securedEnabled = true,
jsr250Enabled = true
)
public class MethodSecurityConfig extends GlobalMethodSecurityConfiguration {

@Override
protected MethodSecurityExpressionHandler createExpressionHandler() {
DefaultMethodSecurityExpressionHandler expressionHandler =
new DefaultMethodSecurityExpressionHandler();
expressionHandler.setPermissionEvaluator(new CustomPermissionEvaluator());
return expressionHandler;
}
}

Security Annotations in Action

@Service
public class DocumentService {

@PreAuthorize("hasRole('ADMIN')")
public void deleteDocument(Long documentId) {
// Only admins can delete documents
}

@PreAuthorize("hasRole('USER') and #document.owner == authentication.name")
public void editDocument(@P("document") Document document) {
// Users can only edit their own documents
}

@PostAuthorize("returnObject.owner == authentication.name or hasRole('ADMIN')")
public Document getDocument(Long documentId) {
return documentRepository.findById(documentId).orElseThrow();
}

@PreFilter("filterObject.owner == authentication.name")
public void processDocuments(List<Document> documents) {
// Process only documents owned by the current user
}
}

Interview Insight: “What’s the difference between @PreAuthorize and @Secured?”

@PreAuthorize supports SpEL expressions for complex authorization logic, while @Secured only supports role-based authorization. @PreAuthorize is more flexible and powerful.
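
To make the contrast concrete, here is a minimal, hedged sketch (the service and method names are illustrative, and method security is assumed to be enabled as in the MethodSecurityConfig above):

import org.springframework.security.access.annotation.Secured;
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.stereotype.Service;

@Service
public class ReportService {

    // @Secured supports only simple role checks - no expressions
    @Secured("ROLE_ADMIN")
    public void purgeReports() {
        // admin-only maintenance task
    }

    // @PreAuthorize evaluates a SpEL expression, so it can reference
    // method arguments and the current Authentication
    @PreAuthorize("hasRole('USER') and #ownerId == authentication.name")
    public void archiveReports(String ownerId) {
        // users may only archive reports they own
    }
}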

Security Best Practices

Password Security

@Configuration
public class PasswordConfig {

@Bean
public PasswordEncoder passwordEncoder() {
return new BCryptPasswordEncoder(12);
}

@Bean
public PasswordValidator passwordValidator() {
return new PasswordValidator(Arrays.asList(
new LengthRule(8, 30),
new CharacterRule(EnglishCharacterData.UpperCase, 1),
new CharacterRule(EnglishCharacterData.LowerCase, 1),
new CharacterRule(EnglishCharacterData.Digit, 1),
new CharacterRule(EnglishCharacterData.Special, 1),
new WhitespaceRule()
));
}
}
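
A hedged usage sketch of the two beans above during user registration; the RegistrationService name is illustrative, and the Passay PasswordData/RuleResult types are assumed to be on the classpath alongside the rules configured earlier:

import org.passay.PasswordData;
import org.passay.PasswordValidator;
import org.passay.RuleResult;
import org.springframework.security.crypto.password.PasswordEncoder;

public class RegistrationService {

    private final PasswordEncoder passwordEncoder;
    private final PasswordValidator passwordValidator;

    public RegistrationService(PasswordEncoder passwordEncoder, PasswordValidator passwordValidator) {
        this.passwordEncoder = passwordEncoder;
        this.passwordValidator = passwordValidator;
    }

    public String validateAndHash(String rawPassword) {
        // Reject weak passwords before they are ever hashed or stored
        RuleResult result = passwordValidator.validate(new PasswordData(rawPassword));
        if (!result.isValid()) {
            throw new IllegalArgumentException("Password does not meet policy requirements");
        }
        // Persist only the BCrypt hash, never the raw password
        return passwordEncoder.encode(rawPassword);
    }
}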

CSRF Protection

@Override
protected void configure(HttpSecurity http) throws Exception {
http
.csrf()
.csrfTokenRepository(CookieCsrfTokenRepository.withHttpOnlyFalse())
.ignoringAntMatchers("/api/public/**")
.and()
.headers()
.frameOptions().deny()
.contentTypeOptions().and()
.httpStrictTransportSecurity(hstsConfig -> hstsConfig
.maxAgeInSeconds(31536000)
.includeSubDomains(true));
}

Input Validation and Sanitization

@RestController
@Validated
public class UserController {

@PostMapping("/users")
public ResponseEntity<User> createUser(@Valid @RequestBody CreateUserRequest request) {
// Validation handled by @Valid annotation
User user = userService.createUser(request);
return ResponseEntity.ok(user);
}
}

@Data
public class CreateUserRequest {

@NotBlank(message = "Username is required")
@Size(min = 3, max = 20, message = "Username must be between 3 and 20 characters")
@Pattern(regexp = "^[a-zA-Z0-9._-]+$", message = "Username contains invalid characters")
private String username;

@NotBlank(message = "Email is required")
@Email(message = "Invalid email format")
private String email;

@NotBlank(message = "Password is required")
@Size(min = 8, message = "Password must be at least 8 characters")
private String password;
}

Common Security Vulnerabilities and Mitigation

SQL Injection Prevention

@Repository
public class UserRepository {

@Autowired
private JdbcTemplate jdbcTemplate;

// Vulnerable code (DON'T DO THIS): user input concatenated directly into SQL
public User findByUsernameUnsafe(String username) {
String sql = "SELECT * FROM users WHERE username = '" + username + "'";
return jdbcTemplate.queryForObject(sql, new BeanPropertyRowMapper<>(User.class));
}

// Secure code (DO THIS): parameter binding prevents SQL injection
public User findByUsernameSafe(String username) {
String sql = "SELECT * FROM users WHERE username = ?";
return jdbcTemplate.queryForObject(sql, new BeanPropertyRowMapper<>(User.class), username);
}
}

XSS Prevention

@Configuration
public class SecurityHeadersConfig {

@Bean
public FilterRegistrationBean<XSSFilter> xssPreventFilter() {
FilterRegistrationBean<XSSFilter> registrationBean = new FilterRegistrationBean<>();
registrationBean.setFilter(new XSSFilter());
registrationBean.addUrlPatterns("/*");
return registrationBean;
}
}

public class XSSFilter implements Filter {

@Override
public void doFilter(ServletRequest request, ServletResponse response,
FilterChain chain) throws IOException, ServletException {

XSSRequestWrapper wrappedRequest = new XSSRequestWrapper((HttpServletRequest) request);
chain.doFilter(wrappedRequest, response);
}
}
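
The XSSRequestWrapper referenced above is not shown in the snippet; below is a minimal sketch, assuming simple HTML escaping via Spring's HtmlUtils (production filters often delegate to a dedicated sanitizer library instead):

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;
import org.springframework.web.util.HtmlUtils;

public class XSSRequestWrapper extends HttpServletRequestWrapper {

    public XSSRequestWrapper(HttpServletRequest request) {
        super(request);
    }

    @Override
    public String getParameter(String name) {
        return sanitize(super.getParameter(name));
    }

    @Override
    public String[] getParameterValues(String name) {
        String[] values = super.getParameterValues(name);
        if (values == null) {
            return null;
        }
        String[] sanitized = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            sanitized[i] = sanitize(values[i]);
        }
        return sanitized;
    }

    private String sanitize(String value) {
        // HTML-escape user input so any injected markup is rendered inert
        return (value == null) ? null : HtmlUtils.htmlEscape(value);
    }
}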

Testing Spring Security

Security Testing with MockMvc

@RunWith(SpringRunner.class)
@WebMvcTest(UserController.class)
public class UserControllerSecurityTest {

@Autowired
private MockMvc mockMvc;

@Test
@WithMockUser(roles = "ADMIN")
public void testAdminAccessToUserData() throws Exception {
mockMvc.perform(get("/admin/users"))
.andExpect(status().isOk());
}

@Test
@WithMockUser(roles = "USER")
public void testUserAccessToAdminEndpoint() throws Exception {
mockMvc.perform(get("/admin/users"))
.andExpect(status().isForbidden());
}

@Test
public void testUnauthenticatedAccess() throws Exception {
mockMvc.perform(get("/user/profile"))
.andExpect(status().isUnauthorized());
}
}

Integration Testing with TestContainers

@SpringBootTest
@Testcontainers
public class SecurityIntegrationTest {

@Container
static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:13")
.withDatabaseName("testdb")
.withUsername("test")
.withPassword("test");

@Autowired
private TestRestTemplate restTemplate;

@Test
public void testFullAuthenticationFlow() {
// Test user registration
ResponseEntity<String> registerResponse = restTemplate.postForEntity(
"/api/auth/register",
new RegisterRequest("test@example.com", "password123"),
String.class
);

assertThat(registerResponse.getStatusCode()).isEqualTo(HttpStatus.CREATED);

// Test user login
ResponseEntity<LoginResponse> loginResponse = restTemplate.postForEntity(
"/api/auth/login",
new LoginRequest("test@example.com", "password123"),
LoginResponse.class
);

assertThat(loginResponse.getStatusCode()).isEqualTo(HttpStatus.OK);
assertThat(loginResponse.getBody().getToken()).isNotNull();
}
}
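
One detail the test above leaves implicit is wiring the container into the Spring context. A common approach (assuming Spring Boot 2.2+ with its @DynamicPropertySource support) is to add a block like the following inside the test class:

// Registers the container's connection details before the application context starts
@DynamicPropertySource
static void registerDatasourceProperties(DynamicPropertyRegistry registry) {
    registry.add("spring.datasource.url", postgres::getJdbcUrl);
    registry.add("spring.datasource.username", postgres::getUsername);
    registry.add("spring.datasource.password", postgres::getPassword);
}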

Performance Optimization

Security Filter Chain Optimization

@Configuration
public class OptimizedSecurityConfig extends WebSecurityConfigurerAdapter {

@Override
protected void configure(HttpSecurity http) throws Exception {
http
// Disable unnecessary features for API-only applications
.csrf().disable()
.sessionManagement().sessionCreationPolicy(SessionCreationPolicy.STATELESS)
.and()
// Order matters - put most specific patterns first
.authorizeRequests()
.antMatchers("/api/public/**").permitAll()
.antMatchers(HttpMethod.GET, "/api/products/**").permitAll()
.antMatchers("/api/admin/**").hasRole("ADMIN")
.anyRequest().authenticated();
}
}

Caching Security Context

@Configuration
@EnableCaching
public class SecurityCacheConfig {

@Bean
public CacheManager cacheManager() {
return new ConcurrentMapCacheManager("userCache", "permissionCache");
}
}

@Service
public class CachedUserDetailsService implements UserDetailsService {

@Autowired
private UserRepository userRepository;

@Cacheable(value = "userCache", key = "#username")
@Override
public UserDetails loadUserByUsername(String username) throws UsernameNotFoundException {
return userRepository.findByUsername(username)
.map(this::createUserPrincipal)
.orElseThrow(() -> new UsernameNotFoundException("User not found: " + username));
}
}

Troubleshooting Common Issues

Debug Security Configuration

@Configuration
@EnableWebSecurity
@EnableGlobalMethodSecurity(prePostEnabled = true)
public class DebugSecurityConfig extends WebSecurityConfigurerAdapter {

@Override
public void configure(WebSecurity web) throws Exception {
web.debug(true); // Enable security debugging
}

@Bean
public Logger securityLogger() {
Logger logger = LoggerFactory.getLogger("org.springframework.security");
((ch.qos.logback.classic.Logger) logger).setLevel(Level.DEBUG);
return logger;
}
}

Common Configuration Mistakes

// WRONG: Ordering matters in security configuration
http.authorizeRequests()
.anyRequest().authenticated() // This catches everything
.antMatchers("/public/**").permitAll(); // This never gets reached

// CORRECT: Specific patterns first
http.authorizeRequests()
.antMatchers("/public/**").permitAll()
.anyRequest().authenticated();

Interview Insight: “What happens when Spring Security configuration conflicts occur?”

Spring Security evaluates rules in order. The first matching rule wins, so specific patterns must come before general ones. Always place more restrictive rules before less restrictive ones.

Monitoring and Auditing

Security Events Logging

@Component
public class SecurityEventListener {

private final Logger logger = LoggerFactory.getLogger(SecurityEventListener.class);

@EventListener
public void handleAuthenticationSuccess(AuthenticationSuccessEvent event) {
Object details = event.getAuthentication().getDetails();
String clientIp = (details instanceof WebAuthenticationDetails)
? ((WebAuthenticationDetails) details).getRemoteAddress()
: "unknown";
logger.info("User '{}' logged in successfully from IP: {}",
event.getAuthentication().getName(),
clientIp);
}

@EventListener
public void handleAuthenticationFailure(AbstractAuthenticationFailureEvent event) {
logger.warn("Authentication failed for user '{}': {}",
event.getAuthentication().getName(),
event.getException().getMessage());
}

@EventListener
public void handleAuthorizationFailure(AuthorizationFailureEvent event) {
logger.warn("Authorization failed for user '{}' accessing secured object: {}",
event.getAuthentication().getName(),
event.getSource());
}
}

Metrics and Monitoring

@Component
public class SecurityMetrics {

private final MeterRegistry meterRegistry;
private final Counter loginAttempts;
private final Counter loginFailures;

public SecurityMetrics(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
this.loginAttempts = Counter.builder("security.login.attempts")
.description("Total login attempts")
.register(meterRegistry);
this.loginFailures = Counter.builder("security.login.failures")
.description("Failed login attempts")
.register(meterRegistry);
}

@EventListener
public void onLoginAttempt(AuthenticationSuccessEvent event) {
loginAttempts.increment();
}

@EventListener
public void onLoginFailure(AbstractAuthenticationFailureEvent event) {
loginAttempts.increment();
loginFailures.increment();
}
}

Interview Questions and Answers

Technical Deep Dive Questions

Q: Explain the difference between authentication and authorization in Spring Security.
A: Authentication verifies identity (“who are you?”) while authorization determines permissions (“what can you do?”). Spring Security separates these concerns - AuthenticationManager handles authentication, while AccessDecisionManager handles authorization decisions.

Q: How does Spring Security handle stateless authentication?
A: For stateless authentication, Spring Security doesn’t maintain session state. Instead, it uses tokens (like JWT) passed with each request. Configure with SessionCreationPolicy.STATELESS and implement token-based filters.
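
A minimal configuration sketch of that setup, assuming a JwtRequestFilter bean like the one sketched later in this guide is injected as jwtRequestFilter (the endpoint patterns are illustrative):

@Override
protected void configure(HttpSecurity http) throws Exception {
    http
        .csrf().disable()
        // No HttpSession: every request must carry its own credentials (e.g. a JWT)
        .sessionManagement().sessionCreationPolicy(SessionCreationPolicy.STATELESS)
        .and()
        .authorizeRequests()
        .antMatchers("/api/auth/**").permitAll()
        .anyRequest().authenticated()
        .and()
        // Validate the token before the username/password filter runs
        .addFilterBefore(jwtRequestFilter, UsernamePasswordAuthenticationFilter.class);
}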

Q: What is the purpose of SecurityContextHolder?
A: SecurityContextHolder provides access to the SecurityContext, which stores authentication information for the current thread. It uses ThreadLocal to ensure thread safety and provides three strategies: ThreadLocal (default), InheritableThreadLocal, and Global.
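
For illustration, a small hedged snippet showing how the current principal is typically read and how the storage strategy can be switched:

import org.springframework.security.core.Authentication;
import org.springframework.security.core.context.SecurityContextHolder;

public class CurrentUserUtil {

    public static String currentUsername() {
        // Reads the Authentication stored (by default) in a ThreadLocal
        Authentication authentication = SecurityContextHolder.getContext().getAuthentication();
        return (authentication != null) ? authentication.getName() : null;
    }

    public static void propagateToChildThreads() {
        // Switch strategy so child threads inherit the parent's SecurityContext
        SecurityContextHolder.setStrategyName(SecurityContextHolder.MODE_INHERITABLETHREADLOCAL);
    }
}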

Q: How do you implement custom authentication in Spring Security?
A: Implement custom authentication by the following steps (a minimal provider sketch follows the list):

  1. Creating a custom AuthenticationProvider
  2. Implementing authenticate() method
  3. Registering the provider with AuthenticationManager
  4. Optionally creating custom Authentication tokens
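
A minimal provider sketch along those lines; the credential check is a placeholder (a real provider would delegate to a UserDetailsService and PasswordEncoder):

@Component
public class CustomAuthenticationProvider implements AuthenticationProvider {

    @Override
    public Authentication authenticate(Authentication authentication) throws AuthenticationException {
        String username = authentication.getName();
        String credentials = String.valueOf(authentication.getCredentials());

        // Placeholder check - replace with a real user lookup and password comparison
        if (!"expected-secret".equals(credentials)) {
            throw new BadCredentialsException("Invalid credentials for " + username);
        }

        List<GrantedAuthority> authorities =
                Collections.singletonList(new SimpleGrantedAuthority("ROLE_USER"));
        return new UsernamePasswordAuthenticationToken(username, credentials, authorities);
    }

    @Override
    public boolean supports(Class<?> authentication) {
        return UsernamePasswordAuthenticationToken.class.isAssignableFrom(authentication);
    }
}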

Practical Implementation Questions

Q: How would you secure a REST API with JWT tokens?
A: Implement JWT security by the following steps (a minimal filter sketch follows the list):

  1. Creating JWT utility class for token generation/validation
  2. Implementing JwtAuthenticationEntryPoint for unauthorized access
  3. Creating JwtRequestFilter to validate tokens
  4. Configuring HttpSecurity with stateless session management
  5. Adding JWT filter before UsernamePasswordAuthenticationFilter
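
A hedged filter sketch following those steps; it reuses the JwtTokenUtil shown earlier (getUsernameFromToken is the helper referenced in its validateToken method) and assumes a UserDetailsService bean is available:

@Component
public class JwtRequestFilter extends OncePerRequestFilter {

    @Autowired
    private JwtTokenUtil jwtTokenUtil;

    @Autowired
    private UserDetailsService userDetailsService;

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {

        String header = request.getHeader("Authorization");

        if (header != null && header.startsWith("Bearer ")) {
            String token = header.substring(7);
            String username = jwtTokenUtil.getUsernameFromToken(token);

            // Only authenticate if nothing is authenticated yet for this request
            if (username != null && SecurityContextHolder.getContext().getAuthentication() == null) {
                UserDetails userDetails = userDetailsService.loadUserByUsername(username);
                if (jwtTokenUtil.validateToken(token, userDetails)) {
                    UsernamePasswordAuthenticationToken authentication =
                            new UsernamePasswordAuthenticationToken(
                                    userDetails, null, userDetails.getAuthorities());
                    SecurityContextHolder.getContext().setAuthentication(authentication);
                }
            }
        }

        chain.doFilter(request, response);
    }
}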

Q: What are the security implications of CSRF and how does Spring Security handle it?
A: CSRF attacks trick users into performing unwanted actions. Spring Security provides CSRF protection by:

  1. Generating unique tokens for each session
  2. Validating tokens on state-changing requests
  3. Storing tokens in HttpSession or cookies
  4. Automatically including tokens in forms via Thymeleaf integration

Conclusion

Spring Security provides a comprehensive, flexible framework for securing Java applications. Its architecture based on filters, authentication managers, and security contexts allows for sophisticated security implementations while maintaining clean separation of concerns. Success with Spring Security requires understanding its core principles, proper configuration, and adherence to security best practices.

The framework’s strength lies in its ability to handle complex security requirements while providing sensible defaults for common use cases. Whether building traditional web applications or modern microservices, Spring Security offers the tools and flexibility needed to implement robust security solutions.

Memory Management Fundamentals

Java’s automatic memory management through garbage collection is one of its key features that differentiates it from languages like C and C++. The JVM automatically handles memory allocation and deallocation, freeing developers from manual memory management while preventing memory leaks and dangling pointer issues.

Memory Layout Overview

The JVM heap is divided into several regions, each serving specific purposes in the garbage collection process:


flowchart TB
subgraph "JVM Memory Structure"
    subgraph "Heap Memory"
        subgraph "Young Generation"
            Eden["Eden Space"]
            S0["Survivor 0"]
            S1["Survivor 1"]
        end
        
        subgraph "Old Generation"
            OldGen["Old Generation (Tenured)"]
        end
        
        MetaSpace["Metaspace (Java 8+)"]
    end
    
    subgraph "Non-Heap Memory"
        PC["Program Counter"]
        Stack["Java Stacks"]
        Native["Native Method Stacks"]
        Direct["Direct Memory"]
    end
end

Interview Insight: “Can you explain the difference between heap and non-heap memory in JVM?”

Answer: Heap memory stores object instances and arrays, managed by GC. Non-heap includes method area (storing class metadata), program counter registers, and stack memory (storing method calls and local variables). Only heap memory is subject to garbage collection.
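
A small hedged example of observing both pools at runtime through the standard MemoryMXBean:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryPools {
    public static void main(String[] args) {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();

        // Heap: object instances and arrays, reclaimed by the garbage collector
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();
        // Non-heap: class metadata, code cache, etc. - not reclaimed by normal GC cycles
        MemoryUsage nonHeap = memoryBean.getNonHeapMemoryUsage();

        System.out.printf("Heap used: %d MB, Non-heap used: %d MB%n",
                heap.getUsed() / (1024 * 1024),
                nonHeap.getUsed() / (1024 * 1024));
    }
}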

GC Roots and Object Reachability

Understanding GC Roots

GC Roots are the starting points for garbage collection algorithms to determine object reachability. An object is considered “reachable” if there’s a path from any GC Root to that object.

Primary GC Roots include:

  • Local Variables: Variables in currently executing methods
  • Static Variables: Class-level static references
  • JNI References: Objects referenced from native code
  • Monitor Objects: Objects used for synchronization
  • Thread Objects: Active thread instances
  • Class Objects: Loaded class instances in Metaspace

flowchart TD
subgraph "GC Roots"
    LV["Local Variables"]
    SV["Static Variables"]
    JNI["JNI References"]
    TO["Thread Objects"]
end

subgraph "Heap Objects"
    A["Object A"]
    B["Object B"]
    C["Object C"]
    D["Object D (Unreachable)"]
end

LV --> A
SV --> B
A --> C
B --> C

style D fill:#ff6b6b
style A fill:#51cf66
style B fill:#51cf66
style C fill:#51cf66

Object Reachability Algorithm

The reachability analysis works through a mark-and-sweep approach:

  1. Mark Phase: Starting from GC Roots, mark all reachable objects
  2. Sweep Phase: Reclaim memory of unmarked (unreachable) objects
// Example: Object Reachability
public class ReachabilityExample {
private static Object staticRef; // GC Root

static class Holder { Object someField; }

public void demonstrateReachability() {
Holder localRef = new Holder(); // GC Root (local variable)
Object chainedObj = new Object();

// Creating reference chain
localRef.someField = chainedObj; // chainedObj is reachable through localRef

// Breaking reference chain
localRef = null; // chainedObj becomes unreachable
}
}

Interview Insight: “How does JVM determine if an object is eligible for garbage collection?”

Answer: JVM uses reachability analysis starting from GC Roots. If an object cannot be reached through any path from GC Roots, it becomes eligible for GC. This is more reliable than reference counting as it handles circular references correctly.

Object Reference Types

Java provides different reference types that interact with garbage collection in distinct ways:

Strong References

Default reference type that prevents garbage collection:

Object obj = new Object();  // Strong reference
// obj will not be collected while this reference exists

Weak References

Allow garbage collection even when references exist:

import java.lang.ref.WeakReference;

WeakReference<Object> weakRef = new WeakReference<>(new Object());
Object obj = weakRef.get(); // May return null if collected

// Common use case: Cache implementation
public class WeakCache<K, V> {
private Map<K, WeakReference<V>> cache = new HashMap<>();

public V get(K key) {
WeakReference<V> ref = cache.get(key);
return (ref != null) ? ref.get() : null;
}
}

Soft References

More aggressive than weak references, collected only when memory is low:

import java.lang.ref.SoftReference;

SoftReference<LargeObject> softRef = new SoftReference<>(new LargeObject());
// Collected only when JVM needs memory

Phantom References

Used for cleanup operations, cannot retrieve the object:

import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

ReferenceQueue<Object> queue = new ReferenceQueue<>();
PhantomReference<Object> phantomRef = new PhantomReference<>(obj, queue);
// Used for resource cleanup notification

Interview Insight: “When would you use WeakReference vs SoftReference?”

Answer: Use WeakReference for cache entries that can be recreated easily (like parsed data). Use SoftReference for memory-sensitive caches where you want to keep objects as long as possible but allow collection under memory pressure.
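
As a hedged illustration of that guideline, a soft-reference-backed cache that transparently recomputes values the JVM has reclaimed under memory pressure:

import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SoftCache<K, V> {

    private final Map<K, SoftReference<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref != null) ? ref.get() : null;
        if (value == null) {
            // Entry was never cached or was cleared under memory pressure - reload it
            value = loader.apply(key);
            cache.put(key, new SoftReference<>(value));
        }
        return value;
    }
}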

Generational Garbage Collection

The Generational Hypothesis

Most objects die young - this fundamental observation drives generational GC design:


flowchart LR
subgraph "Object Lifecycle"
    A["Object Creation"] --> B["Short-lived Objects (90%+)"]
    A --> C["Long-lived Objects (<10%)"]
    B --> D["Die in Young Generation"]
    C --> E["Promoted to Old Generation"]
end

Young Generation Structure

Eden Space: Where new objects are allocated
Survivor Spaces (S0, S1): Hold objects that survived at least one GC cycle

// Example: Object allocation flow
public class AllocationExample {
private final List<Object> longLivedList = new ArrayList<>();

public void demonstrateAllocation() {
// Objects allocated in Eden space
for (int i = 0; i < 1000; i++) {
Object obj = new Object(); // Allocated in Eden

if (i % 100 == 0) {
// Some objects may survive longer
longLivedList.add(obj); // May get promoted to Old Gen
}
}
}
}

Minor GC Process

  1. Allocation: New objects go to Eden
  2. Eden Full: Triggers Minor GC
  3. Survival: Live objects move to Survivor space
  4. Age Increment: Survivor objects get age incremented
  5. Promotion: Old enough objects move to Old Generation

sequenceDiagram
participant E as Eden Space
participant S0 as Survivor 0
participant S1 as Survivor 1
participant O as Old Generation

E->>S0: First GC: Move live objects
Note over S0: Age = 1
E->>S0: Second GC: New objects to S0
S0->>S1: Move aged objects
Note over S1: Age = 2
S1->>O: Promotion (Age >= threshold)

Major GC and Old Generation

Old Generation uses different algorithms optimized for long-lived objects:

  • Concurrent Collection: Minimize application pause times
  • Compaction: Reduce fragmentation
  • Different Triggers: Based on Old Gen occupancy or allocation failure

Interview Insight: “Why is Minor GC faster than Major GC?”

Answer: Minor GC only processes Young Generation (smaller space, most objects are dead). Major GC processes entire heap or Old Generation (larger space, more live objects), often requiring more complex algorithms like concurrent marking or compaction.

Garbage Collection Algorithms

Mark and Sweep

The fundamental GC algorithm:

Mark Phase: Identify live objects starting from GC Roots
Sweep Phase: Reclaim memory from dead objects


flowchart TD
subgraph "Mark Phase"
    A["Start from GC Roots"] --> B["Mark Reachable Objects"]
    B --> C["Traverse Reference Graph"]
end

subgraph "Sweep Phase"
    D["Scan Heap"] --> E["Identify Unmarked Objects"]
    E --> F["Reclaim Memory"]
end

C --> D

Advantages: Simple, handles circular references
Disadvantages: Stop-the-world pauses, fragmentation
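
A conceptual sketch of the two phases (the HeapObject type and its fields are illustrative, in the same spirit as the copying-collector sketch below):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class MarkSweepGC {

    static class HeapObject {
        boolean marked;
        List<HeapObject> references;
    }

    // Mark phase: flag every object reachable from the GC roots
    void mark(List<HeapObject> gcRoots) {
        Deque<HeapObject> worklist = new ArrayDeque<>(gcRoots);
        while (!worklist.isEmpty()) {
            HeapObject obj = worklist.pop();
            if (!obj.marked) {
                obj.marked = true;
                if (obj.references != null) {
                    worklist.addAll(obj.references);
                }
            }
        }
    }

    // Sweep phase: reclaim everything left unmarked, then reset marks for the next cycle
    void sweep(List<HeapObject> heap) {
        heap.removeIf(obj -> !obj.marked);
        heap.forEach(obj -> obj.marked = false);
    }
}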

Copying Algorithm

Used primarily in Young Generation:

// Conceptual representation
public class CopyingGC {
private Space fromSpace;
private Space toSpace;

public void collect() {
// Copy live objects from 'from' to 'to' space
for (Object obj : fromSpace.getLiveObjects()) {
toSpace.copy(obj);
updateReferences(obj);
}

// Swap spaces
Space temp = fromSpace;
fromSpace = toSpace;
toSpace = temp;

// Clear old space
temp.clear();
}
}

Advantages: No fragmentation, fast allocation
Disadvantages: Requires double memory, inefficient for high survival rates

Mark-Compact Algorithm

Combines marking with compaction:

  1. Mark: Identify live objects
  2. Compact: Move live objects to eliminate fragmentation

flowchart LR
subgraph "Before Compaction"
    A["Live"] --> B["Dead"] --> C["Live"] --> D["Dead"] --> E["Live"]
end


flowchart LR
subgraph "After Compaction"
    F["Live"] --> G["Live"] --> H["Live"] --> I["Free Space"]
end

Interview Insight: “Why doesn’t Young Generation use Mark-Compact algorithm?”

Answer: Young Generation has high mortality rate (90%+ objects die), making copying algorithm more efficient. Mark-Compact is better for Old Generation where most objects survive and fragmentation is a concern.

Incremental and Concurrent Algorithms

Incremental GC: Breaks GC work into small increments
Concurrent GC: Runs GC concurrently with application threads

// Tri-color marking for concurrent GC
public enum ObjectColor {
WHITE, // Not visited
GRAY, // Visited but children not processed
BLACK // Visited and children processed
}

public class ConcurrentMarking {
public void concurrentMark() {
// Mark roots as gray
for (Object root : gcRoots) {
root.color = GRAY;
grayQueue.add(root);
}

// Process gray objects concurrently
while (!grayQueue.isEmpty() && !shouldYield()) {
Object obj = grayQueue.poll();
for (Object child : obj.getReferences()) {
if (child.color == WHITE) {
child.color = GRAY;
grayQueue.add(child);
}
}
obj.color = BLACK;
}
}
}

Garbage Collectors Evolution

Serial GC (-XX:+UseSerialGC)

Characteristics: Single-threaded, stop-the-world
Best for: Small applications, client-side applications
JVM Versions: All versions

# JVM flags for Serial GC
java -XX:+UseSerialGC -Xmx512m MyApplication

Use Case Example:

// Small desktop application
public class CalculatorApp {
public static void main(String[] args) {
// Serial GC sufficient for small heap sizes
SwingUtilities.invokeLater(() -> new Calculator().setVisible(true));
}
}

Parallel GC (-XX:+UseParallelGC)

Characteristics: Multi-threaded, throughput-focused
Best for: Batch processing, throughput-sensitive applications
Default: Java 8 (server-class machines)

# Parallel GC configuration
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx2g MyBatchJob

Production Example:

// Data processing application
public class DataProcessor {
public void processBatch(List<Record> records) {
// High throughput processing
records.parallelStream()
.map(this::transform)
.collect(Collectors.toList());
}
}

CMS GC (-XX:+UseConcMarkSweepGC) [Deprecated in Java 14]

Phases:

  1. Initial Mark (STW)
  2. Concurrent Mark
  3. Concurrent Preclean
  4. Remark (STW)
  5. Concurrent Sweep

Characteristics: Concurrent, low-latency focused
Best for: Web applications requiring low pause times

# CMS configuration (legacy)
java -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Xmx4g WebApp

G1 GC (-XX:+UseG1GC)

Characteristics: Low-latency, region-based, predictable pause times
Best for: Large heaps (>4GB), latency-sensitive applications
Default: Java 9+

# G1 GC tuning
java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=16m -Xmx8g

Region-based Architecture:


flowchart TB
subgraph "G1 Heap Regions"
    subgraph "Young Regions"
        E1["Eden 1"]
        E2["Eden 2"]
        S1["Survivor 1"]
    end
    
    subgraph "Old Regions"
        O1["Old 1"]
        O2["Old 2"]
        O3["Old 3"]
    end
    
    subgraph "Special Regions"
        H["Humongous"]
        F["Free"]
    end
end

Interview Insight: “When would you choose G1 over Parallel GC?”

Answer: Choose G1 for applications requiring predictable low pause times (<200ms) with large heaps (>4GB). Use Parallel GC for batch processing where throughput is more important than latency.

ZGC (-XX:+UseZGC) [Java 11+]

Characteristics: Ultra-low latency (<10ms), colored pointers
Best for: Applications requiring consistent low latency

# ZGC configuration
java -XX:+UseZGC -XX:+UseTransparentHugePages -Xmx32g LatencyCriticalApp

Shenandoah GC (-XX:+UseShenandoahGC) [Java 12+]

Characteristics: Low pause times, concurrent collection
Best for: Applications with large heaps requiring consistent performance

# Shenandoah configuration
-XX:+UseShenandoahGC
-XX:ShenandoahGCHeuristics=adaptive

Collector Comparison

Collector Comparison Table:

| Collector  | Java Version    | Best Heap Size | Pause Time  | Throughput  | Use Case                      |
|------------|-----------------|----------------|-------------|-------------|-------------------------------|
| Serial     | All             | < 100MB        | High        | Low         | Single-core, client apps      |
| Parallel   | All (default 8) | < 8GB          | Medium-High | High        | Multi-core, batch processing  |
| G1         | 7+ (default 9+) | > 4GB          | Low-Medium  | Medium-High | Server applications           |
| ZGC        | 11+             | > 8GB          | Ultra-low   | Medium      | Latency-critical applications |
| Shenandoah | 12+             | > 8GB          | Ultra-low   | Medium      | Real-time applications        |

GC Tuning Parameters and Best Practices

Heap Sizing Parameters

# Basic heap configuration
-Xms2g # Initial heap size
-Xmx8g # Maximum heap size
-XX:NewRatio=3 # Old/Young generation ratio
-XX:SurvivorRatio=8 # Eden/Survivor ratio

Young Generation Tuning

# Young generation specific tuning
-Xmn2g # Set young generation size
-XX:MaxTenuringThreshold=7 # Promotion threshold
-XX:TargetSurvivorRatio=90 # Survivor space target utilization

Real-world Example:

// Web application tuning scenario
public class WebAppTuning {
/*
* Application characteristics:
* - High request rate
* - Short-lived request objects
* - Some cached data
*
* Tuning strategy:
* - Larger young generation for short-lived objects
* - G1GC for predictable pause times
* - Monitoring allocation rate
*/
}

// JVM flags:
// -XX:+UseG1GC -Xmx4g -XX:MaxGCPauseMillis=100
// -XX:G1HeapRegionSize=8m -XX:NewRatio=2

Monitoring and Logging

# GC logging (Java 8)
-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

# GC logging (Java 9+)
-Xlog:gc*:gc.log:time,tags

# Additional monitoring
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintStringDeduplicationStatistics (G1)

Production Tuning Checklist

Memory Allocation:

// Monitor allocation patterns
public class AllocationMonitoring {
public void trackAllocationRate() {
MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();

long beforeGC = memoryBean.getHeapMemoryUsage().getUsed();
// ... application work
long afterGC = memoryBean.getHeapMemoryUsage().getUsed();

long allocatedBytes = calculateAllocationRate(beforeGC, afterGC);
}
}

GC Overhead Analysis:

// Acceptable GC overhead typically < 5%
public class GCOverheadCalculator {
public double calculateGCOverhead(List<GCEvent> events, long totalTime) {
long gcTime = events.stream()
.mapToLong(GCEvent::getDuration)
.sum();
return (double) gcTime / totalTime * 100;
}
}

Advanced GC Concepts

Escape Analysis and TLAB

Thread Local Allocation Buffers (TLAB) optimize object allocation:

public class TLABExample {
public void demonstrateTLAB() {
// Objects allocated in thread-local buffer
for (int i = 0; i < 1000; i++) {
Object obj = new Object(); // Fast TLAB allocation
}
}

// Escape analysis may eliminate allocation entirely
public String noEscapeAllocation() {
StringBuilder sb = new StringBuilder(); // May be stack-allocated
sb.append("Hello");
return sb.toString(); // Object doesn't escape method
}
}

String Deduplication (G1)

# Enable string deduplication
-XX:+UseG1GC -XX:+UseStringDeduplication
// String deduplication example
public class StringDeduplication {
public void demonstrateDeduplication() {
List<String> strings = new ArrayList<>();

// These strings have same content but different instances
for (int i = 0; i < 1000; i++) {
strings.add(new String("duplicate content")); // Candidates for deduplication
}
}
}

Compressed OOPs

# Enable compressed ordinary object pointers (default on 64-bit with heap < 32GB)
-XX:+UseCompressedOops
-XX:+UseCompressedClassPointers

Interview Questions and Advanced Scenarios

Scenario-Based Questions

Question: “Your application experiences long GC pauses during peak traffic. How would you diagnose and fix this?”

Answer:

  1. Analysis: Enable GC logging, analyze pause times and frequency
  2. Identification: Check if Major GC is causing long pauses
  3. Solutions:
    • Switch to G1GC for predictable pause times
    • Increase heap size to reduce GC frequency
    • Tune young generation size
    • Consider object pooling for frequently allocated objects
// Example diagnostic approach
public class GCDiagnostics {
public void diagnoseGCIssues() {
// Monitor GC metrics
List<GarbageCollectorMXBean> gcBeans =
ManagementFactory.getGarbageCollectorMXBeans();

for (GarbageCollectorMXBean gcBean : gcBeans) {
System.out.printf("GC Name: %s, Collections: %d, Time: %d ms%n",
gcBean.getName(),
gcBean.getCollectionCount(),
gcBean.getCollectionTime());
}
}
}

Question: “Explain the trade-offs between throughput and latency in GC selection.”

Answer:

  • Throughput-focused: Parallel GC maximizes application processing time
  • Latency-focused: G1/ZGC minimizes pause times but may reduce overall throughput
  • Choice depends on: Application requirements, SLA constraints, heap size

Memory Leak Detection

// Common memory leak patterns
public class MemoryLeakExamples {
private static Set<Object> cache = new HashSet<>(); // Static collection

public void potentialLeak() {
// Listeners not removed
someComponent.addListener(event -> {});

// ThreadLocal not cleaned
ThreadLocal<ExpensiveObject> threadLocal = new ThreadLocal<>();
threadLocal.set(new ExpensiveObject());
// threadLocal.remove(); // Missing cleanup
}

// Proper cleanup
public void properCleanup() {
try {
// Use try-with-resources
try (AutoCloseable resource = createResource()) {
// Work with resource
}
} catch (Exception e) {
// Handle exception
}
}
}

Production Best Practices

Monitoring and Alerting

// JMX-based GC monitoring
public class GCMonitor {
private final List<GarbageCollectorMXBean> gcBeans;

public GCMonitor() {
this.gcBeans = ManagementFactory.getGarbageCollectorMXBeans();
}

public void setupAlerts() {
// Alert if GC overhead > 5%
// Alert if pause times > SLA limits
// Monitor allocation rate trends
}

public GCMetrics collectMetrics() {
return new GCMetrics(
getTotalGCTime(),
getGCFrequency(),
getLongestPause(),
getAllocationRate()
);
}
}

Capacity Planning

// Capacity planning calculations
public class CapacityPlanning {
public HeapSizeRecommendation calculateHeapSize(
long allocationRate,
int targetGCFrequency,
double survivorRatio) {

// Rule of thumb: Heap size should accommodate
// allocation rate * GC interval * safety factor
long recommendedHeap = allocationRate * targetGCFrequency * 3;

return new HeapSizeRecommendation(
recommendedHeap,
calculateYoungGenSize(recommendedHeap, survivorRatio),
calculateOldGenSize(recommendedHeap, survivorRatio)
);
}
}

Performance Testing

// GC performance testing framework
public class GCPerformanceTest {
public void runGCStressTest() {
// Measure allocation patterns
AllocationProfiler profiler = new AllocationProfiler();

// Simulate production load
for (int iteration = 0; iteration < 1000; iteration++) {
simulateWorkload();

if (iteration % 100 == 0) {
profiler.recordMetrics();
}
}

// Analyze results
profiler.generateReport();
}

private void simulateWorkload() {
// Create realistic object allocation patterns
List<Object> shortLived = createShortLivedObjects();
Object longLived = createLongLivedObject();

// Process data
processData(shortLived, longLived);
}
}

Conclusion and Future Directions

Java’s garbage collection continues to evolve with new collectors like ZGC and Shenandoah pushing the boundaries of low-latency collection. Understanding GC fundamentals, choosing appropriate collectors, and proper tuning remain critical for production Java applications.

Key Takeaways:

  • Choose GC based on application requirements (throughput vs latency)
  • Monitor and measure before optimizing
  • Understand object lifecycle and allocation patterns
  • Use appropriate reference types for memory-sensitive applications
  • Regular capacity planning and performance testing

Future Trends:

  • Ultra-low latency collectors (sub-millisecond pauses)
  • Better integration with container environments
  • Machine learning-assisted GC tuning
  • Region-based collectors becoming mainstream

The evolution of GC technology continues to make Java more suitable for a wider range of applications, from high-frequency trading systems requiring microsecond latencies to large-scale data processing systems prioritizing throughput.

Overview of Cache Expiration Strategies

Redis implements multiple expiration deletion strategies to efficiently manage memory and ensure optimal performance. Understanding these mechanisms is crucial for building scalable, high-performance applications.

Interview Insight: “How does Redis handle expired keys?” - Redis uses a combination of lazy deletion and active deletion strategies. It doesn’t immediately delete expired keys but employs intelligent algorithms to balance performance and memory usage.

Core Expiration Deletion Policies

Lazy Deletion (Passive Expiration)

Lazy deletion is the primary mechanism where expired keys are only removed when they are accessed.

How it works:

  • When a client attempts to access a key, Redis checks if it has expired
  • If expired, the key is immediately deleted and NULL is returned
  • No background scanning or proactive deletion occurs
# Example: Lazy deletion in action
import redis
import time

r = redis.Redis()

# Set a key with 2-second expiration
r.setex('temp_key', 2, 'temporary_value')

# Key exists initially
print(r.get('temp_key')) # b'temporary_value'

# Wait for expiration
time.sleep(3)

# Key is deleted only when accessed (lazy deletion)
print(r.get('temp_key')) # None

Advantages:

  • Minimal CPU overhead
  • No background processing required
  • Perfect for frequently accessed keys

Disadvantages:

  • Memory waste if expired keys are never accessed
  • Unpredictable memory usage patterns

Active Deletion (Proactive Scanning)

Redis periodically scans and removes expired keys to prevent memory bloat.

Algorithm Details:

  1. Redis runs expiration cycles approximately 10 times per second
  2. Each cycle samples 20 random keys from the expires dictionary
  3. If more than 25% are expired, repeat the process
  4. Maximum execution time per cycle is limited to prevent blocking

flowchart TD
A[Start Expiration Cycle] --> B[Sample 20 Random Keys]
B --> C{More than 25% expired?}
C -->|Yes| D[Delete Expired Keys]
D --> E{Time limit reached?}
E -->|No| B
E -->|Yes| F[End Cycle]
C -->|No| F
F --> G[Wait ~100ms]
G --> A

Configuration Parameters:

# Redis configuration for active expiration
hz 10 # Frequency of background tasks (10 Hz = 10 times/second)
active-expire-effort 1 # CPU effort for active expiration (1-10)

Timer-Based Deletion

While Redis doesn’t implement traditional timer-based deletion, you can simulate it using sorted sets:

import redis
import time
import threading

class TimerCache:
def __init__(self):
self.redis_client = redis.Redis()
self.timer_key = "expiration_timer"

def set_with_timer(self, key, value, ttl):
"""Set key-value with custom timer deletion"""
expire_time = time.time() + ttl

# Store the actual data
self.redis_client.set(key, value)

# Add to timer sorted set
self.redis_client.zadd(self.timer_key, {key: expire_time})

def cleanup_expired(self):
"""Background thread to clean expired keys"""
current_time = time.time()
expired_keys = self.redis_client.zrangebyscore(
self.timer_key, 0, current_time
)

if expired_keys:
# Remove expired keys
for key in expired_keys:
self.redis_client.delete(key.decode())

# Remove from timer set
self.redis_client.zremrangebyscore(self.timer_key, 0, current_time)

# Usage example
cache = TimerCache()
cache.set_with_timer('user:1', 'John Doe', 60) # 60 seconds TTL

Delay Queue Deletion

Implement a delay queue pattern for complex expiration scenarios:

import redis
import json
import time
from datetime import datetime, timedelta

class DelayQueueExpiration:
def __init__(self):
self.redis_client = redis.Redis()
self.queue_key = "delay_expiration_queue"

def schedule_deletion(self, key, delay_seconds):
"""Schedule key deletion after specified delay"""
execution_time = time.time() + delay_seconds
task = {
'key': key,
'scheduled_time': execution_time,
'action': 'delete'
}

self.redis_client.zadd(
self.queue_key,
{json.dumps(task): execution_time}
)

def process_delayed_deletions(self):
"""Process pending deletions"""
current_time = time.time()

# Get tasks ready for execution
ready_tasks = self.redis_client.zrangebyscore(
self.queue_key, 0, current_time, withscores=True
)

for task_json, score in ready_tasks:
task = json.loads(task_json)

# Execute deletion
self.redis_client.delete(task['key'])

# Remove from queue
self.redis_client.zrem(self.queue_key, task_json)

print(f"Deleted key: {task['key']} at {datetime.now()}")

# Usage
delay_queue = DelayQueueExpiration()
delay_queue.schedule_deletion('temp_data', 300) # Delete after 5 minutes

Interview Insight: “What’s the difference between active and passive expiration?” - Passive (lazy) expiration only occurs when keys are accessed, while active expiration proactively scans and removes expired keys in background cycles to prevent memory bloat.

Redis Expiration Policies (Eviction Policies)

When Redis reaches memory limits, it employs eviction policies to free up space:

Available Eviction Policies

# Configuration in redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru

Policy Types:

  1. noeviction (default)

    • No keys are evicted
    • Write operations return errors when memory limit reached
    • Use case: Critical data that cannot be lost
  2. allkeys-lru

    • Removes least recently used keys from all keys
    • Use case: General caching scenarios
  3. allkeys-lfu

    • Removes least frequently used keys
    • Use case: Applications with distinct access patterns
  4. volatile-lru

    • Removes LRU keys only from keys with expiration set
    • Use case: Mixed persistent and temporary data
  5. volatile-lfu

    • Removes LFU keys only from keys with expiration set
  6. allkeys-random

    • Randomly removes keys
    • Use case: When access patterns are unpredictable
  7. volatile-random

    • Randomly removes keys with expiration set
  8. volatile-ttl

    • Removes keys with shortest TTL first
    • Use case: Time-sensitive data prioritization

Policy Selection Guide


flowchart TD
A[Memory Pressure] --> B{All data equally important?}
B -->|Yes| C[allkeys-lru/lfu]
B -->|No| D{Temporary vs Persistent data?}
D -->|Mixed| E[volatile-lru/lfu]
D -->|Time-sensitive| F[volatile-ttl]
C --> G[High access pattern variance?]
G -->|Yes| H[allkeys-lfu]
G -->|No| I[allkeys-lru]
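
Whichever policy the decision tree points to, it can also be applied at runtime without a restart. A hedged sketch using the Jedis Java client (an assumption; redis-cli CONFIG SET achieves the same thing):

import redis.clients.jedis.Jedis;

public class EvictionPolicySetup {
    public static void main(String[] args) {
        // Assumes a Redis instance on localhost and the Jedis client on the classpath
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Cap memory and apply the eviction policy chosen above
            jedis.configSet("maxmemory", "2gb");
            jedis.configSet("maxmemory-policy", "allkeys-lru");

            // Keys written with a TTL are also eligible for the volatile-* policies
            jedis.setex("session:42", 1800, "session-data");
        }
    }
}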

Master-Slave Cluster Expiration Mechanisms

Replication of Expiration

In Redis clusters, expiration handling follows specific patterns:

Master-Slave Expiration Flow:

  1. Only masters perform active expiration
  2. Masters send explicit DEL commands to slaves
  3. Slaves don’t independently expire keys (except for lazy deletion)

sequenceDiagram
participant M as Master
participant S1 as Slave 1
participant S2 as Slave 2
participant C as Client

Note over M: Active expiration cycle
M->>M: Check expired keys
M->>S1: DEL expired_key
M->>S2: DEL expired_key

C->>S1: GET expired_key
S1->>S1: Lazy expiration check
S1->>C: NULL (key expired)

Cluster Configuration for Expiration

# Master configuration
bind 0.0.0.0
port 6379
maxmemory 1gb
maxmemory-policy allkeys-lru
hz 10

# Slave configuration
bind 0.0.0.0
port 6380
slaveof 127.0.0.1 6379
slave-read-only yes

Production Example - Redis Sentinel with Expiration:

import redis.sentinel

# Sentinel configuration for high availability
sentinels = [('localhost', 26379), ('localhost', 26380), ('localhost', 26381)]
sentinel = redis.sentinel.Sentinel(sentinels)

# Get master and slave connections
master = sentinel.master_for('mymaster', socket_timeout=0.1)
slave = sentinel.slave_for('mymaster', socket_timeout=0.1)

# Write to master with expiration
master.setex('session:user:1', 3600, 'session_data')

# Read from slave (expiration handled consistently)
session_data = slave.get('session:user:1')

Interview Insight: “How does Redis handle expiration in a cluster?” - In Redis clusters, only master nodes perform active expiration. When a master expires a key, it sends explicit DEL commands to all slaves to maintain consistency.

Durability and Expired Keys

RDB Persistence

Expired keys are handled during RDB operations:

# RDB configuration
save 900 1 # Save if at least 1 key changed in 900 seconds
save 300 10 # Save if at least 10 keys changed in 300 seconds
save 60 10000 # Save if at least 10000 keys changed in 60 seconds

# Expired keys are not saved to RDB files
rdbcompression yes
rdbchecksum yes

AOF Persistence

AOF handles expiration through explicit commands:

# AOF configuration
appendonly yes
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Expired keys generate explicit DEL commands in AOF
no-appendfsync-on-rewrite no

Example AOF entries for expiration:

*2
$6
SELECT
$1
0
*3
$3
SET
$8
temp_key
$5
value
*3
$6
EXPIRE
$8
temp_key
$2
60
*2
$3
DEL
$8
temp_key

Optimization Strategies

Memory-Efficient Configuration

# redis.conf optimizations
maxmemory 2gb
maxmemory-policy allkeys-lru

# Active deletion tuning
hz 10                  # Background task frequency
active-expire-effort 1 # CPU effort for active expiration (1-10); per-cycle lookup counts are internal constants, not redis.conf directives

# Memory sampling for LRU/LFU
maxmemory-samples 5

Expiration Time Configuration Optimization

Hierarchical TTL Strategy:

class TTLManager:
def __init__(self, redis_client):
self.redis = redis_client

# Define TTL hierarchy
self.ttl_config = {
'hot_data': 300, # 5 minutes - frequently accessed
'warm_data': 1800, # 30 minutes - moderately accessed
'cold_data': 3600, # 1 hour - rarely accessed
'session_data': 7200, # 2 hours - user sessions
'cache_data': 86400 # 24 hours - general cache
}

def set_with_smart_ttl(self, key, value, data_type='cache_data'):
"""Set key with intelligent TTL based on data type"""
ttl = self.ttl_config.get(data_type, 3600)

# Add jitter to prevent thundering herd
import random
jitter = random.randint(-ttl//10, ttl//10)
final_ttl = ttl + jitter

return self.redis.setex(key, final_ttl, value)

def adaptive_ttl(self, key, access_frequency):
"""Adjust TTL based on access patterns"""
base_ttl = 3600 # 1 hour base

if access_frequency > 100: # Hot key
return base_ttl // 4 # 15 minutes
elif access_frequency > 10: # Warm key
return base_ttl // 2 # 30 minutes
else: # Cold key
return base_ttl * 2 # 2 hours

# Usage example
ttl_manager = TTLManager(redis.Redis())
ttl_manager.set_with_smart_ttl('user:profile:123', user_data, 'hot_data')

Production Use Cases

High-Concurrent Idempotent Scenarios

In idempotent (/aɪˈdempətənt/) operations, cache expiration must prevent duplicate processing while maintaining consistency.

import redis
import uuid
import time
import hashlib
import json

class IdempotentCache:
def __init__(self):
self.redis = redis.Redis()
self.default_ttl = 300 # 5 minutes

def generate_idempotent_key(self, operation, params):
"""Generate unique key for operation"""
# Create hash from operation and parameters
content = f"{operation}:{str(sorted(params.items()))}"
return f"idempotent:{hashlib.md5(content.encode()).hexdigest()}"

def execute_idempotent(self, operation, params, executor_func):
"""Execute operation with idempotency guarantee"""
idempotent_key = self.generate_idempotent_key(operation, params)

# Check if operation already executed
result = self.redis.get(idempotent_key)
if result:
return json.loads(result)

# Use distributed lock to prevent concurrent execution
lock_key = f"lock:{idempotent_key}"
lock_acquired = self.redis.set(lock_key, "1", nx=True, ex=60)

if not lock_acquired:
# Wait and check again
time.sleep(0.1)
result = self.redis.get(idempotent_key)
if result:
return json.loads(result)
raise Exception("Operation in progress")

try:
# Execute the actual operation
result = executor_func(params)

# Cache the result
self.redis.setex(
idempotent_key,
self.default_ttl,
json.dumps(result)
)

return result
finally:
# Release lock
self.redis.delete(lock_key)

# Usage example
def process_payment(params):
# Simulate payment processing
return {"status": "success", "transaction_id": str(uuid.uuid4())}

idempotent_cache = IdempotentCache()
result = idempotent_cache.execute_idempotent(
"payment",
{"amount": 100, "user_id": "123"},
process_payment
)

Hot Key Scenarios

Problem: Managing frequently accessed keys that can overwhelm Redis.

import redis
import random
import threading
from collections import defaultdict

class HotKeyManager:
def __init__(self):
self.redis = redis.Redis()
self.access_stats = defaultdict(int)
self.hot_key_threshold = 1000 # Requests per minute

def get_with_hot_key_protection(self, key):
"""Get value with hot key protection"""
self.access_stats[key] += 1

# Check if key is hot
if self.access_stats[key] > self.hot_key_threshold:
return self._handle_hot_key(key)

return self.redis.get(key)

def _handle_hot_key(self, hot_key):
"""Handle hot key with multiple strategies"""
strategies = [
self._local_cache_strategy,
self._replica_strategy,
self._fragmentation_strategy
]

# Choose strategy based on key characteristics
return random.choice(strategies)(hot_key)

def _local_cache_strategy(self, key):
"""Use local cache for hot keys"""
local_cache_key = f"local:{key}"

# Check local cache first (simulate with Redis)
local_value = self.redis.get(local_cache_key)
if local_value:
return local_value

# Get from main cache and store locally
value = self.redis.get(key)
if value:
# Short TTL for local cache
self.redis.setex(local_cache_key, 60, value)

return value

def _replica_strategy(self, key):
"""Create multiple replicas of hot key"""
replica_count = 5
replica_key = f"{key}:replica:{random.randint(1, replica_count)}"

# Try to get from replica
value = self.redis.get(replica_key)
if not value:
# Get from master and update replica
value = self.redis.get(key)
if value:
self.redis.setex(replica_key, 300, value) # 5 min TTL

return value

def _fragmentation_strategy(self, key):
"""Fragment hot key into smaller pieces"""
# For large objects, split into fragments
fragments = []
fragment_index = 0

while True:
fragment_key = f"{key}:frag:{fragment_index}"
fragment = self.redis.get(fragment_key)

if not fragment:
break

fragments.append(fragment)
fragment_index += 1

if fragments:
return b''.join(fragments)

return self.redis.get(key)

# Usage example
hot_key_manager = HotKeyManager()
value = hot_key_manager.get_with_hot_key_protection('popular_product:123')

Pre-Loading and Predictive Caching

import threading

class PredictiveCacheManager:
def __init__(self, redis_client):
self.redis = redis_client

def preload_related_data(self, primary_key, related_keys_func, short_ttl=300):
"""
Pre-load related data with shorter TTL
Useful for pagination, related products, etc.
"""
# Get related keys that might be accessed soon
related_keys = related_keys_func(primary_key)

pipeline = self.redis.pipeline()
for related_key in related_keys:
# Check if already cached
if not self.redis.exists(related_key):
# Pre-load with shorter TTL
related_data = self._fetch_data(related_key)
pipeline.setex(related_key, short_ttl, related_data)

pipeline.execute()

def cache_with_prefetch(self, key, value, ttl=3600, prefetch_ratio=0.1):
"""
Cache data and trigger prefetch when TTL is near expiration
"""
self.redis.setex(key, ttl, value)

# Set a prefetch trigger at 90% of TTL
prefetch_ttl = int(ttl * prefetch_ratio)
prefetch_key = f"prefetch:{key}"
self.redis.setex(prefetch_key, ttl - prefetch_ttl, "trigger")

def check_and_prefetch(self, key, refresh_func):
"""Check if prefetch is needed and refresh in background"""
prefetch_key = f"prefetch:{key}"
if not self.redis.exists(prefetch_key):
# Prefetch trigger expired - refresh in background
threading.Thread(
target=self._background_refresh,
args=(key, refresh_func)
).start()

def _background_refresh(self, key, refresh_func):
"""Refresh data in background before expiration"""
try:
new_value = refresh_func()
current_ttl = self.redis.ttl(key)
if current_ttl > 0:
# Extend current key TTL and set new value
self.redis.setex(key, current_ttl + 3600, new_value)
except Exception as e:
# Log error but don't fail main request
print(f"Background refresh failed for {key}: {e}")

# Example usage for e-commerce
def get_related_product_keys(product_id):
"""Return keys for related products, reviews, recommendations"""
return [
f"product:{product_id}:reviews",
f"product:{product_id}:recommendations",
f"product:{product_id}:similar",
f"category:{get_category(product_id)}:featured"
]

# Pre-load when user views a product
predictive_cache = PredictiveCacheManager(redis_client)
predictive_cache.preload_related_data(
f"product:{product_id}",
get_related_product_keys,
short_ttl=600 # 10 minutes for related data
)

Performance Monitoring and Metrics

Expiration Monitoring

import redis
import time
import json

class ExpirationMonitor:
def __init__(self):
self.redis = redis.Redis()

def get_expiration_stats(self):
"""Get comprehensive expiration statistics"""
info = self.redis.info()

stats = {
'expired_keys': info.get('expired_keys', 0),
'evicted_keys': info.get('evicted_keys', 0),
'keyspace_hits': info.get('keyspace_hits', 0),
'keyspace_misses': info.get('keyspace_misses', 0),
'used_memory': info.get('used_memory', 0),
'maxmemory': info.get('maxmemory', 0),
'memory_usage_percentage': 0
}

if stats['maxmemory'] > 0:
stats['memory_usage_percentage'] = (
stats['used_memory'] / stats['maxmemory'] * 100
)

# Calculate hit ratio
total_requests = stats['keyspace_hits'] + stats['keyspace_misses']
if total_requests > 0:
stats['hit_ratio'] = stats['keyspace_hits'] / total_requests * 100
else:
stats['hit_ratio'] = 0

return stats

def analyze_key_expiration_patterns(self, pattern="*"):
"""Analyze expiration patterns for keys matching pattern"""
keys = self.redis.keys(pattern)
expiration_analysis = {
'total_keys': len(keys),
'keys_with_ttl': 0,
'keys_without_ttl': 0,
'avg_ttl': 0,
'ttl_distribution': {}
}

ttl_values = []

for key in keys:
ttl = self.redis.ttl(key)

if ttl == -1: # No expiration set
expiration_analysis['keys_without_ttl'] += 1
elif ttl >= 0: # Has expiration
expiration_analysis['keys_with_ttl'] += 1
ttl_values.append(ttl)

# Categorize TTL
if ttl < 300: # < 5 minutes
category = 'short_term'
elif ttl < 3600: # < 1 hour
category = 'medium_term'
else: # >= 1 hour
category = 'long_term'

expiration_analysis['ttl_distribution'][category] = \
expiration_analysis['ttl_distribution'].get(category, 0) + 1

if ttl_values:
expiration_analysis['avg_ttl'] = sum(ttl_values) / len(ttl_values)

return expiration_analysis

# Usage
monitor = ExpirationMonitor()
stats = monitor.get_expiration_stats()
print(f"Hit ratio: {stats['hit_ratio']:.2f}%")
print(f"Memory usage: {stats['memory_usage_percentage']:.2f}%")

Configuration Checklist

# Memory management
maxmemory 2gb
maxmemory-policy allkeys-lru

# Expiration tuning
hz 10
active-expire-effort 1

# Persistence (affects expiration)
save 900 1
appendonly yes
appendfsync everysec

# Monitoring
latency-monitor-threshold 100

Interview Questions and Expert Answers

Q: How does Redis handle expiration in a master-slave setup, and what happens during failover?

A: In Redis replication, only the master performs expiration logic. When a key expires on the master (either through lazy or active expiration), the master sends an explicit DEL command to all slaves. Slaves never expire keys independently - they wait for the master’s instruction.

During failover, the promoted slave becomes the new master and starts handling expiration. However, there might be temporary inconsistencies because:

  1. The old master might have expired keys that weren’t yet replicated
  2. Clock differences can cause timing variations
  3. Some keys might appear “unexpired” on the new master

Production applications should handle these edge cases by implementing fallback mechanisms and not relying solely on Redis for strict expiration timing.
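
One pragmatic fallback is to store the write time alongside the value and validate it in the application, so a stale key on a newly promoted master is still treated as expired. A minimal sketch, assuming a redis-py client; the key layout and the loader callback are illustrative, not part of any Redis API:

import json
import time
import redis

r = redis.Redis(decode_responses=True)

def get_with_strict_expiry(key, ttl_seconds, loader):
    """Fallback for replication edge cases: verify expiry in the application,
    not only in Redis, before trusting a cached value."""
    raw = r.get(key)
    if raw is not None:
        entry = json.loads(raw)
        # Treat the stored timestamp as authoritative, even if the key
        # still exists on a freshly promoted master.
        if time.time() - entry['cached_at'] < ttl_seconds:
            return entry['value']
    value = loader(key)  # e.g. a database lookup (hypothetical)
    r.set(key, json.dumps({'value': value, 'cached_at': time.time()}), ex=ttl_seconds)
    return value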

Q: What’s the difference between eviction and expiration, and how do they interact?

A: Expiration is time-based removal of keys that have reached their TTL, while eviction is memory-pressure-based removal when Redis reaches its memory limit.

They interact in several ways:

  • Eviction policies like volatile-lru only consider keys with expiration set
  • Active expiration reduces memory pressure, potentially avoiding eviction
  • The volatile-ttl policy evicts keys with the shortest remaining TTL first
  • Proper TTL configuration can reduce eviction frequency and improve cache performance

Q: How would you optimize Redis expiration for a high-traffic e-commerce site?

A: For high-traffic e-commerce, I’d implement a multi-tier expiration strategy:

  1. Product Catalog: Long TTL (4-24 hours) with background refresh
  2. Inventory Counts: Short TTL (1-5 minutes) with real-time updates
  3. User Sessions: Medium TTL (30 minutes) with sliding expiration
  4. Shopping Carts: Longer TTL (24-48 hours) with cleanup processes
  5. Search Results: Staggered TTL (15-60 minutes) with jitter to prevent thundering herd

Key optimizations:

  • Use allkeys-lru eviction for cache-heavy workloads
  • Implement predictive pre-loading for related products
  • Add jitter to TTL values to prevent simultaneous expiration
  • Monitor hot keys and implement replication strategies
  • Use pipeline operations for bulk TTL updates

The goal is balancing data freshness, memory usage, and system performance while handling traffic spikes gracefully.
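
One of the optimizations above, TTL jitter, is simple to implement. A minimal sketch with redis-py; the 10% jitter ratio is an arbitrary illustrative choice:

import random
import redis

r = redis.Redis()

def set_with_jitter(key, value, base_ttl, jitter_ratio=0.1):
    """Spread expirations so keys cached at the same time do not all expire together."""
    jitter = int(base_ttl * jitter_ratio)
    ttl = base_ttl + random.randint(-jitter, jitter)
    r.set(key, value, ex=ttl)

# Search results cached for ~30 minutes, expiring within a +/- 3 minute window
set_with_jitter("search:laptops:page1", "serialized_results", base_ttl=1800)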


Key Takeaways

Redis expiration deletion policies are crucial for maintaining optimal performance and memory usage in production systems. The combination of lazy deletion, active expiration, and memory eviction policies provides flexible options for different use cases.

Success in production requires understanding the trade-offs between memory usage, CPU overhead, and data consistency, especially in distributed environments. Monitoring expiration efficiency and implementing appropriate TTL strategies based on access patterns is essential for maintaining high-performance Redis deployments.

The key is matching expiration strategies to your specific use case: use longer TTLs with background refresh for stable data, shorter TTLs for frequently changing data, and implement sophisticated hot key handling for high-traffic scenarios.

Overview of Redis Memory Management

Redis is an in-memory data structure store that requires careful memory management to maintain optimal performance. When Redis approaches its memory limit, it must decide which keys to remove to make space for new data. This process is called memory eviction.


flowchart TD
A[Redis Instance] --> B{Memory Usage Check}
B -->|Below maxmemory| C[Accept New Data]
B -->|At maxmemory| D[Apply Eviction Policy]
D --> E[Select Keys to Evict]
E --> F[Remove Selected Keys]
F --> G[Accept New Data]

style A fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#bbf,stroke:#333,stroke-width:2px
style E fill:#fbb,stroke:#333,stroke-width:2px

Interview Insight: Why is memory management crucial in Redis?

  • Redis stores all data in RAM for fast access
  • Uncontrolled memory growth can lead to system crashes
  • Proper eviction prevents OOM (Out of Memory) errors
  • Maintains predictable performance characteristics

Redis Memory Eviction Policies

Redis offers 8 different eviction policies, each serving different use cases:

LRU-Based Policies

allkeys-lru

Evicts the least recently used keys across all keys in the database.

# Configuration
CONFIG SET maxmemory-policy allkeys-lru

# Example scenario
SET user:1001 "John Doe" # Time: T1
GET user:1001 # Access at T2
SET user:1002 "Jane Smith" # Time: T3
# If memory is full, user:1001 is more likely to be evicted (it was used least recently)

Best Practice: Use when you have a natural access pattern where some data is accessed more frequently than others.

volatile-lru

Evicts the least recently used keys only among keys with an expiration set.

# Setup
SET session:abc123 "user_data" EX 3600 # With expiration
SET config:theme "dark" # Without expiration

# Only session:abc123 is eligible for LRU eviction

Use Case: Session management where you want to preserve configuration data.

LFU-Based Policies

allkeys-lfu

Evicts the least frequently used keys across all keys.

# Example: Access frequency tracking
SET product:1 "laptop" # Accessed 100 times
SET product:2 "mouse" # Accessed 5 times
SET product:3 "keyboard" # Accessed 50 times

# product:2 (mouse) would be evicted first due to lowest frequency

volatile-lfu

Evicts the least frequently used keys only among keys with expiration.

Interview Insight: When would you choose LFU over LRU?

  • LFU is better for data with consistent access patterns
  • LRU is better for data with temporal locality
  • LFU prevents cache pollution from occasional bulk operations
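
To see what each policy would base its decision on, you can inspect per-key access metadata. A small sketch using redis-py's generic command interface; note that OBJECT FREQ is only available while an LFU maxmemory-policy is active, and switching the policy on a shared server is shown here purely for illustration:

import redis

r = redis.Redis()
r.set("product:1", "laptop")

# Under LRU-style policies Redis tracks per-key idle time
r.config_set("maxmemory-policy", "allkeys-lru")
idle = r.execute_command("OBJECT", "IDLETIME", "product:1")  # seconds since last access

# Under LFU policies Redis tracks an approximate access-frequency counter instead
r.config_set("maxmemory-policy", "allkeys-lfu")
freq = r.execute_command("OBJECT", "FREQ", "product:1")

print(f"idle seconds: {idle}, LFU counter: {freq}")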

Random Policies

allkeys-random

Randomly selects keys for eviction across all keys.

# Simulation of random eviction
import random

keys = ["user:1", "user:2", "user:3", "config:db", "session:xyz"]
evict_key = random.choice(keys)
print(f"Evicting: {evict_key}")

volatile-random

Randomly selects keys for eviction only among keys with expiration.

When to Use Random Policies:

  • When access patterns are completely unpredictable
  • For testing and development environments
  • When you need simple, fast eviction decisions

TTL-Based Policy

volatile-ttl

Evicts keys with expiration, prioritizing those with shorter remaining TTL.

# Example scenario
SET cache:data1 "value1" EX 3600 # Expires in 1 hour
SET cache:data2 "value2" EX 1800 # Expires in 30 minutes
SET cache:data3 "value3" EX 7200 # Expires in 2 hours

# cache:data2 will be evicted first (shortest TTL)

No Eviction Policy

noeviction

Returns errors when memory limit is reached instead of evicting keys.

CONFIG SET maxmemory-policy noeviction

# When memory is full:
SET new_key "value"
# Error: OOM command not allowed when used memory > 'maxmemory'

Use Case: Critical systems where data loss is unacceptable.
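
With noeviction, writes can start failing at any moment, so callers must handle the OOM error explicitly. A minimal sketch with redis-py; the alerting/fallback behavior is an application-level assumption, not a Redis feature:

import redis

r = redis.Redis()

def safe_set(key, value):
    """Write under a noeviction policy, surfacing OOM instead of crashing."""
    try:
        r.set(key, value)
        return True
    except redis.exceptions.ResponseError as e:
        if "OOM" in str(e):
            # Memory limit reached: alert operators and let the caller decide
            # whether to retry, shed load, or write to a fallback store.
            print(f"Redis is full, write rejected for {key}: {e}")
            return False
        raise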

Memory Limitation Strategies

Why Limit Cache Memory?


flowchart LR
A[Unlimited Memory] --> B[System Instability]
A --> C[Unpredictable Performance]
A --> D[Resource Contention]

E[Limited Memory] --> F[Predictable Behavior]
E --> G[System Stability]
E --> H[Better Resource Planning]

style A fill:#fbb,stroke:#333,stroke-width:2px
style E fill:#bfb,stroke:#333,stroke-width:2px

Production Reasons:

  • System Stability: Prevents Redis from consuming all available RAM
  • Performance Predictability: Maintains consistent response times
  • Multi-tenancy: Allows multiple services to coexist
  • Cost Control: Manages infrastructure costs effectively

Basic Memory Configuration

# Set maximum memory limit (512MB)
CONFIG SET maxmemory 536870912

# Set eviction policy
CONFIG SET maxmemory-policy allkeys-lru

# Check current memory usage
INFO memory

Using Lua Scripts for Advanced Memory Control

Limiting Key-Value Pairs

-- limit_keys.lua: Limit total number of keys
local max_keys = tonumber(ARGV[1])
local current_keys = redis.call('DBSIZE')
local evicted = nil

if current_keys >= max_keys then
    -- Pick a random key and delete it to make room
    local victim = redis.call('RANDOMKEY')
    if victim then
        redis.call('DEL', victim)
        evicted = victim
    end
end

-- Add the new key
redis.call('SET', KEYS[1], ARGV[2])
if evicted then
    return "Evicted key: " .. evicted
end
return "Key added successfully"
# Usage
EVAL "$(cat limit_keys.lua)" 1 "new_key" 1000 "new_value"

Limiting Value Size

-- limit_value_size.lua: Reject large values
local max_size = tonumber(ARGV[2])
local value = ARGV[1]
local value_size = string.len(value)

if value_size > max_size then
    return redis.error_reply("Value size " .. value_size .. " exceeds limit " .. max_size)
end

redis.call('SET', KEYS[1], value)
return "OK"
# Usage: Limit values to 1KB
EVAL "$(cat limit_value_size.lua)" 1 "my_key" "my_value" 1024

Memory-Aware Key Management

-- memory_aware_set.lua: Check memory before setting
local key = KEYS[1]
local value = ARGV[1]
local memory_threshold = tonumber(ARGV[2]) -- percentage, e.g. 90

-- Read current usage from INFO and the configured limit from CONFIG
local info = redis.call('INFO', 'memory')
local used_memory = tonumber(string.match(info, 'used_memory:(%d+)'))
local max_memory = tonumber(redis.call('CONFIG', 'GET', 'maxmemory')[2])

if max_memory > 0 and used_memory > (max_memory * memory_threshold / 100) then
    -- Trigger manual cleanup: drop a random key if it uses more than 1KB
    local candidate = redis.call('RANDOMKEY')
    if candidate then
        local key_memory = redis.call('MEMORY', 'USAGE', candidate)
        if key_memory and key_memory > 1000 then
            redis.call('DEL', candidate)
        end
    end
end

redis.call('SET', key, value)
return "Key set with memory check"

Practical Cache Eviction Solutions

Big Object Evict First Strategy

This strategy prioritizes evicting large objects to free maximum memory quickly.

# Python implementation for big object eviction
import time

import redis

class BigObjectEvictionRedis:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.size_threshold = 10240  # 10KB threshold

    def set_with_size_check(self, key, value):
        # Calculate value size
        value_size = len(str(value).encode('utf-8'))

        # Store size metadata
        self.redis.hset(f"{key}:meta", "size", value_size)
        self.redis.hset(f"{key}:meta", "created", int(time.time()))

        # Set the actual value
        self.redis.set(key, value)

        # Track large objects
        if value_size > self.size_threshold:
            self.redis.sadd("large_objects", key)

    def evict_large_objects(self, target_memory_mb):
        large_objects = self.redis.smembers("large_objects")
        freed_memory = 0
        target_bytes = target_memory_mb * 1024 * 1024

        # Sort by size (largest first)
        objects_with_size = []
        for obj in large_objects:
            size = self.redis.hget(f"{obj}:meta", "size")
            if size:
                objects_with_size.append((obj, int(size)))

        objects_with_size.sort(key=lambda x: x[1], reverse=True)

        for obj, size in objects_with_size:
            if freed_memory >= target_bytes:
                break

            self.redis.delete(obj)
            self.redis.delete(f"{obj}:meta")
            self.redis.srem("large_objects", obj)
            freed_memory += size

        return freed_memory

# Usage example (decode_responses keeps keys as strings for the f-string lookups)
r = redis.Redis(decode_responses=True)
big_obj_redis = BigObjectEvictionRedis(r)

# Set some large objects
big_obj_redis.set_with_size_check("large_data:1", "x" * 50000)
big_obj_redis.set_with_size_check("large_data:2", "y" * 30000)

# Evict to free 100MB
freed = big_obj_redis.evict_large_objects(100)
print(f"Freed {freed} bytes")

Small Object Evict First Strategy

Useful when you want to preserve large, expensive-to-recreate objects.

-- small_object_evict.lua
local function get_object_size(key)
    return redis.call('MEMORY', 'USAGE', key) or 0
end

local function evict_small_objects(count)
    local all_keys = redis.call('KEYS', '*')
    local small_keys = {}

    for i, key in ipairs(all_keys) do
        local size = get_object_size(key)
        if size < 1000 then -- Less than 1KB
            table.insert(small_keys, {key, size})
        end
    end

    -- Sort by size (smallest first)
    table.sort(small_keys, function(a, b) return a[2] < b[2] end)

    local evicted = 0
    for i = 1, math.min(count, #small_keys) do
        redis.call('DEL', small_keys[i][1])
        evicted = evicted + 1
    end

    return evicted
end

return evict_small_objects(tonumber(ARGV[1]))

Low-Cost Evict First Strategy

Evicts data that’s cheap to regenerate or reload.

import time

import redis

class CostBasedEviction:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.cost_factors = {
            'cache:': 1,      # Low cost - can regenerate
            'session:': 5,    # Medium cost - user experience impact
            'computed:': 10,  # High cost - expensive computation
            'external:': 8    # High cost - external API call
        }

    def set_with_cost(self, key, value, custom_cost=None):
        # Determine cost based on key prefix
        cost = custom_cost or self._calculate_cost(key)

        # Store with cost metadata
        pipe = self.redis.pipeline()
        pipe.set(key, value)
        pipe.hset(f"{key}:meta", "cost", cost)
        pipe.hset(f"{key}:meta", "timestamp", int(time.time()))
        pipe.execute()

    def _calculate_cost(self, key):
        for prefix, cost in self.cost_factors.items():
            if key.startswith(prefix):
                return cost
        return 3  # Default medium cost

    def evict_low_cost_items(self, target_count):
        # Get all keys with metadata
        pattern = "*:meta"
        meta_keys = self.redis.keys(pattern)

        items_with_cost = []
        for meta_key in meta_keys:
            original_key = meta_key.replace(':meta', '')
            cost = self.redis.hget(meta_key, 'cost')
            if cost:
                items_with_cost.append((original_key, int(cost)))

        # Sort by cost (lowest first)
        items_with_cost.sort(key=lambda x: x[1])

        evicted = 0
        for key, cost in items_with_cost[:target_count]:
            self.redis.delete(key)
            self.redis.delete(f"{key}:meta")
            evicted += 1

        return evicted

# Usage (decode_responses keeps keys as strings for the replace() call above)
cost_eviction = CostBasedEviction(redis.Redis(decode_responses=True))
cost_eviction.set_with_cost("cache:user:1001", "cached user profile")
cost_eviction.set_with_cost("computed:analytics:daily", "precomputed report")
cost_eviction.evict_low_cost_items(10)

Cold Data Evict First Strategy

import time

import redis

class ColdDataEviction:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.access_tracking_key = "access_log"

    def get_with_tracking(self, key):
        # Record access
        now = int(time.time())
        self.redis.zadd(self.access_tracking_key, {key: now})

        # Get value
        return self.redis.get(key)

    def set_with_tracking(self, key, value):
        now = int(time.time())

        # Set value and track access
        pipe = self.redis.pipeline()
        pipe.set(key, value)
        pipe.zadd(self.access_tracking_key, {key: now})
        pipe.execute()

    def evict_cold_data(self, days_threshold=7, max_evict=100):
        """Evict data not accessed within threshold days"""
        cutoff_time = int(time.time()) - (days_threshold * 24 * 3600)

        # Get cold keys (accessed before cutoff time)
        cold_keys = self.redis.zrangebyscore(
            self.access_tracking_key,
            0,
            cutoff_time,
            start=0,
            num=max_evict
        )

        evicted_count = 0
        if cold_keys:
            pipe = self.redis.pipeline()
            for key in cold_keys:
                pipe.delete(key)
                pipe.zrem(self.access_tracking_key, key)
                evicted_count += 1

            pipe.execute()

        return evicted_count

    def get_access_stats(self):
        """Get access statistics"""
        now = int(time.time())
        day_ago = now - 86400
        week_ago = now - (7 * 86400)

        recent_keys = self.redis.zrangebyscore(self.access_tracking_key, day_ago, now)
        weekly_keys = self.redis.zrangebyscore(self.access_tracking_key, week_ago, now)
        total_keys = self.redis.zcard(self.access_tracking_key)

        return {
            'total_tracked_keys': total_keys,
            'accessed_last_day': len(recent_keys),
            'accessed_last_week': len(weekly_keys),
            'cold_keys': total_keys - len(weekly_keys)
        }

# Usage example
cold_eviction = ColdDataEviction(redis.Redis())

# Use with tracking
cold_eviction.set_with_tracking("user:1001", "user_data")
value = cold_eviction.get_with_tracking("user:1001")

# Evict data not accessed in 7 days
evicted = cold_eviction.evict_cold_data(days_threshold=7)
print(f"Evicted {evicted} cold data items")

# Get statistics
stats = cold_eviction.get_access_stats()
print(f"Access stats: {stats}")

Algorithm Deep Dive

LRU Implementation Details

Redis uses an approximate LRU algorithm for efficiency:


flowchart TD
A[Key Access] --> B[Update LRU Clock]
B --> C{Memory Full?}
C -->|No| D[Operation Complete]
C -->|Yes| E[Sample Random Keys]
E --> F[Calculate LRU Score]
F --> G[Select Oldest Key]
G --> H[Evict Key]
H --> I[Operation Complete]

style E fill:#bbf,stroke:#333,stroke-width:2px
style F fill:#fbb,stroke:#333,stroke-width:2px

Interview Question: Why doesn’t Redis use true LRU?

  • True LRU requires maintaining a doubly-linked list of all keys
  • This would consume significant memory overhead
  • Approximate LRU samples random keys and picks the best candidate
  • Provides good enough results with much better performance
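
A rough Python model of the sampling idea (not Redis's actual C implementation): evict the oldest key out of a small random sample instead of maintaining a global ordering of all keys.

import random
import time

# key -> last access timestamp, a simple stand-in for Redis's per-key LRU clock
last_access = {}

def touch(key):
    last_access[key] = time.time()

def approximate_lru_evict(sample_size=5):
    """Pick the least recently used key among a random sample."""
    sample = random.sample(list(last_access), min(sample_size, len(last_access)))
    victim = min(sample, key=lambda k: last_access[k])
    del last_access[victim]
    return victim

for i in range(1000):
    touch(f"key:{i}")
print("evicting:", approximate_lru_evict())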

LFU Implementation Details

Redis LFU uses a probabilistic counter that decays over time:

# Simplified LFU counter simulation
import time
import random

class LFUCounter:
    def __init__(self):
        self.counter = 0
        self.last_access = time.time()

    def increment(self):
        # Probabilistic increment based on current counter
        # Higher counters increment less frequently
        probability = 1.0 / (self.counter * 10 + 1)
        if random.random() < probability:
            self.counter += 1
            self.last_access = time.time()

    def decay(self, decay_time_minutes=1):
        # Decay counter over time
        now = time.time()
        minutes_passed = (now - self.last_access) / 60

        if minutes_passed > decay_time_minutes:
            decay_amount = int(minutes_passed / decay_time_minutes)
            self.counter = max(0, self.counter - decay_amount)
            self.last_access = now

# Example usage
counter = LFUCounter()
for _ in range(100):
    counter.increment()
print(f"Counter after 100 accesses: {counter.counter}")

Choosing the Right Eviction Policy

Decision Matrix


flowchart TD
A[Choose Eviction Policy] --> B{Data has TTL?}
B -->|Yes| C{Preserve non-expiring data?}
B -->|No| D{Access pattern known?}

C -->|Yes| E[volatile-lru/lfu/ttl]
C -->|No| F[allkeys-lru/lfu]

D -->|Temporal locality| G[allkeys-lru]
D -->|Frequency based| H[allkeys-lfu]
D -->|Unknown/Random| I[allkeys-random]

J{Can tolerate data loss?} -->|No| K[noeviction]
J -->|Yes| L[Choose based on pattern]

style E fill:#bfb,stroke:#333,stroke-width:2px
style G fill:#bbf,stroke:#333,stroke-width:2px
style H fill:#fbb,stroke:#333,stroke-width:2px

Use Case Recommendations

  • Web session store: volatile-lru (sessions have TTL; preserve config data)
  • Cache layer: allkeys-lru (recent data is more likely to be accessed again)
  • Analytics cache: allkeys-lfu (popular queries are accessed frequently)
  • Rate limiting: volatile-ttl (remove expired limits first)
  • Database cache: allkeys-lfu (hot data is accessed repeatedly)

Production Configuration Example

# redis.conf production settings
maxmemory 2gb
maxmemory-policy allkeys-lru
maxmemory-samples 10

# Monitor memory usage from the shell (not part of redis.conf)
redis-cli --latency-history -i 1
redis-cli INFO memory | grep used_memory_human

Performance Monitoring and Tuning

Key Metrics to Monitor

# monitoring_script.py
import redis
import time

def monitor_eviction_performance(redis_client):
    info = redis_client.info('stats')
    memory_info = redis_client.info('memory')

    metrics = {
        'evicted_keys': info.get('evicted_keys', 0),
        'keyspace_hits': info.get('keyspace_hits', 0),
        'keyspace_misses': info.get('keyspace_misses', 0),
        'used_memory': memory_info.get('used_memory', 0),
        'used_memory_peak': memory_info.get('used_memory_peak', 0),
        'mem_fragmentation_ratio': memory_info.get('mem_fragmentation_ratio', 0)
    }

    # Calculate hit ratio
    total_requests = metrics['keyspace_hits'] + metrics['keyspace_misses']
    hit_ratio = metrics['keyspace_hits'] / total_requests if total_requests > 0 else 0

    metrics['hit_ratio'] = hit_ratio

    return metrics

# Usage
r = redis.Redis()
while True:
    stats = monitor_eviction_performance(r)
    print(f"Hit Ratio: {stats['hit_ratio']:.2%}, Evicted: {stats['evicted_keys']}")
    time.sleep(10)

Alerting Thresholds

# alerts.yml (Prometheus/Grafana style)
alerts:
  - name: redis_hit_ratio_low
    condition: redis_hit_ratio < 0.90
    severity: warning

  - name: redis_eviction_rate_high
    condition: rate(redis_evicted_keys[5m]) > 100
    severity: critical

  - name: redis_memory_usage_high
    condition: redis_used_memory / redis_maxmemory > 0.90
    severity: warning

Interview Questions and Answers

Advanced Interview Questions

Q: How would you handle a scenario where your cache hit ratio drops significantly after implementing LRU eviction?

A: This suggests the working set is larger than available memory. Solutions:

  1. Increase memory allocation if possible
  2. Switch to LFU if there’s a frequency-based access pattern
  3. Implement application-level partitioning
  4. Use Redis Cluster for horizontal scaling
  5. Optimize data structures (use hashes for small objects; see the sketch below)
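
As a quick illustration of the last point, packing many small per-user fields into one hash usually costs less memory than one top-level key per field. A hedged sketch; the key names are illustrative:

import redis

r = redis.Redis()

# One top-level key per field: more per-key overhead, more eviction candidates
r.set("user:1001:name", "John Doe")
r.set("user:1001:plan", "pro")
r.set("user:1001:theme", "dark")

# Packed into a single hash: one key, compact encoding for small hashes
r.hset("user:1001", mapping={"name": "John Doe", "plan": "pro", "theme": "dark"})

print(r.execute_command("MEMORY", "USAGE", "user:1001"))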

Q: Explain the trade-offs between different sampling sizes in Redis LRU implementation.

A:

  • Small samples (3-5): Fast eviction, less accurate LRU approximation
  • Large samples (10+): Better LRU approximation, higher CPU overhead
  • Default (5): Good balance for most use cases
  • Monitor evicted_keys and keyspace_misses to tune
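
Tuning the sample size is a one-line config change; pairing it with the eviction and miss counters shows whether a larger sample is worth the extra CPU. A short redis-py sketch:

import redis

r = redis.Redis()

# Larger samples approximate true LRU more closely at higher CPU cost
r.config_set("maxmemory-samples", 10)

stats = r.info("stats")
print("evicted_keys:", stats.get("evicted_keys"),
      "keyspace_misses:", stats.get("keyspace_misses"))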

Q: How would you implement a custom eviction policy for a specific business requirement?

A: Use Lua scripts or application-level logic:

-- Custom: Evict based on business priority
local function business_priority_evict()
    local keys = redis.call('KEYS', '*')
    local priorities = {}

    for i, key in ipairs(keys) do
        local priority = redis.call('HGET', key .. ':meta', 'business_priority')
        if priority then
            table.insert(priorities, {key, tonumber(priority)})
        end
    end

    table.sort(priorities, function(a, b) return a[2] < b[2] end)

    if #priorities > 0 then
        redis.call('DEL', priorities[1][1])
        return priorities[1][1]
    end
    return nil
end

return business_priority_evict()

Best Practices Summary

Configuration Best Practices

  1. Set appropriate maxmemory: 80% of available RAM for dedicated Redis instances
  2. Choose policy based on use case: LRU for temporal, LFU for frequency patterns
  3. Monitor continuously: Track hit ratios, eviction rates, and memory usage
  4. Test under load: Verify eviction behavior matches expectations

Application Integration Best Practices

  1. Graceful degradation: Handle cache misses gracefully
  2. TTL strategy: Set appropriate expiration times
  3. Key naming: Use consistent patterns for better policy effectiveness
  4. Size awareness: Monitor and limit large values

Operational Best Practices

  1. Regular monitoring: Set up alerts for key metrics
  2. Capacity planning: Plan for growth and peak loads
  3. Testing: Regularly test eviction scenarios
  4. Documentation: Document policy choices and rationale


This comprehensive guide provides the foundation for implementing effective memory eviction strategies in Redis production environments. The combination of theoretical understanding and practical implementation examples ensures robust cache management that scales with your application needs.
