Elasticsearch Underlying Principles Deep Dive

Posted on 2025-06-16 In elasticsearch

Understanding Inverted Index: The Heart of Search

The inverted index is Elasticsearch’s fundamental data structure that enables lightning-fast full-text search. Unlike a traditional database index that maps record IDs to field values, an inverted index maps each unique term to a list of documents containing that term.

How Inverted Index Works

Consider a simple example with three documents:

Document 1: “The quick brown fox”
Document 2: “The brown dog”
Document 3: “A quick fox jumps”

The inverted index would look like:

Term     | Document IDs | Positions
---------|-------------|----------
the      | [1, 2]      | [1:0, 2:0]
quick    | [1, 3]      | [1:1, 3:1]
brown    | [1, 2]      | [1:2, 2:1]
fox      | [1, 3]      | [1:3, 3:2]
dog      | [2]         | [2:2]
a        | [3]         | [3:0]
jumps    | [3]         | [3:3]

Implementation Details

Elasticsearch implements inverted indexes using several sophisticated techniques:

Term Dictionary: Stores all unique terms in sorted order
Posting Lists: For each term, maintains a list of documents containing that term
Term Frequencies: Tracks how often each term appears in each document
Positional Information: Stores exact positions for phrase queries

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "description": {
        "type": "text",
        "index_options": "positions"
      }
    }
  }
}

Interview Insight: “Can you explain why Elasticsearch is faster than traditional SQL databases for text search?” The answer lies in the inverted index structure - instead of scanning entire documents, Elasticsearch directly maps search terms to relevant documents.

Text vs Keyword: Understanding Field Types

The distinction between Text and Keyword fields is crucial for proper data modeling and search behavior.

Text Fields

Text fields are analyzed - they go through tokenization, normalization, and other transformations:

{
  "mappings": {
    "properties": {
      "product_description": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Analysis Process for Text Fields:

Tokenization: “iPhone 13 Pro Max” → [“iPhone”, “13”, “Pro”, “Max”]
Lowercase Filter: [“iphone”, “13”, “pro”, “max”]
Stop Words Removal: (if configured)
Stemming: (if configured) [“iphon”, “13”, “pro”, “max”]

Keyword Fields

Keyword fields are stored as-is, without analysis:

{
  "mappings": {
    "properties": {
      "product_id": {
        "type": "keyword"
      },
      "category": {
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text"
          }
        }
      }
    }
  }
}

Use Cases Comparison

Scenario	Text Field	Keyword Field
Full-text search	✅ “search iPhone” matches “iPhone 13”	❌ Exact match only
Aggregations	❌ Analyzed terms cause issues	✅ Perfect for grouping
Sorting	❌ Unreliable due to analysis	✅ Lexicographic sorting
Exact matching	❌ “iPhone-13” ≠ “iPhone 13”	✅ “iPhone-13” = “iPhone-13”

Interview Insight: “When would you use multi-fields?” Multi-fields allow the same data to be indexed in multiple ways - as both text (for search) and keyword (for aggregations and sorting).

Posting Lists, Trie Trees, and FST

Posting Lists

Posting lists are the core data structure that stores document IDs for each term. Elasticsearch optimizes these lists using several techniques:

Delta Compression: Instead of storing absolute document IDs, store differences:

1 2	Original: [1, 5, 8, 12, 15] Compressed: [1, +4, +3, +4, +3]

Variable Byte Encoding: Uses fewer bytes for smaller numbers
Skip Lists: Enable faster intersection operations for AND queries

Trie Trees (Prefix Trees)

Trie trees optimize prefix-based operations and are used in Elasticsearch for:

Autocomplete functionality
Wildcard queries
Range queries on terms


graph TD
A[Root] --> B[c]
A --> C[s]
B --> D[a]
B --> E[o]
D --> F[r]
D --> G[t]
F --> H[car]
G --> I[cat]
E --> J[o]
J --> K[cool]
C --> L[u]
L --> M[n]
M --> N[sun]

Finite State Transducers (FST)

FST is Elasticsearch’s secret weapon for memory-efficient term dictionaries. It combines the benefits of tries with minimal memory usage.

Benefits of FST:

Memory Efficient: Shares common prefixes and suffixes
Fast Lookups: O(k) complexity where k is key length
Ordered Iteration: Maintains lexicographic order

{
  "query": {
    "prefix": {
      "title": "elastics"
    }
  }
}

Interview Insight: “How does Elasticsearch handle memory efficiency for large vocabularies?” FST allows Elasticsearch to store millions of terms using minimal memory by sharing common character sequences.

Data Writing Process in Elasticsearch Cluster

Understanding the write path is crucial for optimizing indexing performance and ensuring data durability.

Write Process Overview


sequenceDiagram
participant Client
participant Coordinating Node
participant Primary Shard
participant Replica Shard
participant Translog
participant Lucene

Client->>Coordinating Node: Index Request
Coordinating Node->>Primary Shard: Route to Primary
Primary Shard->>Translog: Write to Translog
Primary Shard->>Lucene: Add to In-Memory Buffer
Primary Shard->>Replica Shard: Replicate to Replicas
Replica Shard->>Translog: Write to Translog
Replica Shard->>Lucene: Add to In-Memory Buffer
Primary Shard->>Coordinating Node: Success Response
Coordinating Node->>Client: Acknowledge

Detailed Write Steps

Step 1: Document Routing

1 2	shard_id = hash(routing_value) % number_of_primary_shards # Default routing_value is document _id

Step 2: Primary Shard Processing

{
  "index": {
    "_index": "products",
    "_id": "1",
    "_routing": "user123"
  }
}
{
  "name": "iPhone 13",
  "price": 999,
  "category": "electronics"
}

Step 3: Translog Write
The transaction log ensures durability before data reaches disk:

# Translog configuration
PUT /my_index/_settings
{
  "translog": {
    "sync_interval": "5s",
    "durability": "request"
  }
}

Step 4: Replication
Documents are replicated to replica shards for high availability:

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "index.write.wait_for_active_shards": "all"
  }
}

Write Performance Optimization

Bulk Indexing Best Practices:

POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}

Optimal Bulk Size: 5-15 MB per bulk request
Thread Pool Tuning:

thread_pool:
  write:
    size: 8
    queue_size: 200

Interview Insight: “How would you optimize Elasticsearch for high write throughput?” Key strategies include bulk indexing, increasing refresh intervals, using appropriate replica counts, and tuning thread pools.

Refresh, Flush, and Fsync Operations

These operations manage the transition of data from memory to disk and control search visibility.

Refresh Operation

Refresh makes documents searchable by moving them from the in-memory buffer to the filesystem cache.


graph LR
A[In-Memory Buffer] -->|Refresh| B[Filesystem Cache]
B -->|Flush| C[Disk Segments]
D[Translog] -->|Flush| E[Disk]

subgraph "Search Visible"
    B
    C
end

Refresh Configuration:

{
  "settings": {
    "refresh_interval": "30s",
    "index.max_refresh_listeners": 1000
  }
}

Manual Refresh:

1	POST /my_index/_refresh

Real-time Use Case:

PUT /logs/_doc/1?refresh=true
{
  "timestamp": "2024-01-15T10:30:00",
  "level": "ERROR",
  "message": "Database connection failed"
}

Flush Operation

Flush persists the translog to disk and creates new Lucene segments.

Flush Triggers:

Translog size exceeds threshold (default: 512MB)
Translog age exceeds threshold (default: 30 minutes)
Manual flush operation

{
  "settings": {
    "translog.flush_threshold_size": "1gb",
    "translog.sync_interval": "5s",
    "translog.durability": "request"
  }
}

Manual Flush:

1 2	POST /my_index/_flush POST /_flush?wait_if_ongoing=true

Fsync Operation

Fsync ensures data is physically written to disk storage.

Fsync Configuration:

{
  "settings": {
    "translog.durability": "async",
    "translog.sync_interval": "5s"
  }
}

Performance Impact Analysis

Operation	Frequency	Performance Impact	Data Safety
Refresh	High (1s default)	Medium	No durability
Flush	Low (30m or 512MB)	High	Full durability
Fsync	Configurable	High	Hardware dependent

Production Best Practices

High Throughput Indexing:

{
  "settings": {
    "refresh_interval": "60s",
    "translog.durability": "async",
    "translog.sync_interval": "30s",
    "number_of_replicas": 0
  }
}

Near Real-time Search:

{
  "settings": {
    "refresh_interval": "1s",
    "translog.durability": "request"
  }
}

Interview Insight: “Explain the trade-offs between search latency and indexing performance.” Frequent refreshes provide near real-time search but impact indexing throughput. Adjust refresh_interval based on your use case - use longer intervals for high-volume indexing and shorter for real-time requirements.

Advanced Concepts and Optimizations

Segment Merging

Elasticsearch continuously merges smaller segments into larger ones:

{
  "settings": {
    "index.merge.policy.max_merge_at_once": 10,
    "index.merge.policy.segments_per_tier": 10,
    "index.merge.scheduler.max_thread_count": 3
  }
}

Force Merge for Read-Only Indices

1	POST /old_logs/_forcemerge?max_num_segments=1

Circuit Breakers

Prevent OutOfMemory errors during operations:

{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.request.limit": "30%"
  }
}

Monitoring and Troubleshooting

Key Metrics to Monitor

# Index stats
GET /_stats/indexing,search,merge,refresh,flush

# Segment information
GET /my_index/_segments

# Translog stats
GET /_stats/translog

Common Issues and Solutions

Slow Indexing:

Check bulk request size
Monitor merge operations
Verify disk I/O capacity

Memory Issues:

Implement proper mapping
Use appropriate field types
Monitor fielddata usage

Search Latency:

Optimize queries
Check segment count
Monitor cache hit rates

Interview Questions Deep Dive

Q: “How does Elasticsearch achieve near real-time search?”
A: Through the refresh operation that moves documents from in-memory buffers to searchable filesystem cache, typically every 1 second by default.

Q: “What happens when a primary shard fails during indexing?”
A: Elasticsearch promotes a replica shard to primary, replays the translog, and continues operations. The cluster remains functional with potential brief unavailability.

Q: “How would you design an Elasticsearch cluster for a high-write, low-latency application?”
A: Focus on horizontal scaling, optimize bulk operations, increase refresh intervals during high-write periods, use appropriate replica counts, and implement proper monitoring.

Q: “Explain the memory implications of text vs keyword fields.”
A: Text fields consume more memory during analysis and create larger inverted indexes. Keyword fields are more memory-efficient for exact-match scenarios and aggregations.

External References

This deep dive covers the fundamental concepts that power Elasticsearch’s search capabilities. Understanding these principles is essential for building scalable, performant search applications and succeeding in technical interviews.