
Core Storage Engine


The core storage engine is the foundational layer of SIMD R Drive, providing append-only, memory-mapped binary storage with zero-copy access. This document covers the high-level architecture and key components that implement the storage functionality.

For network-based access to the storage engine, see Network Layer and RPC. For Python integration, see Python Integration. For performance optimizations including SIMD acceleration, see Performance Optimizations.

Purpose and Scope

This page introduces the core storage engine’s architecture, primary components, and design principles. Detailed information about specific subsystems is covered in child pages.

Architecture Overview

The storage engine implements an append-only binary store with in-memory indexing and memory-mapped file access. The architecture consists of four primary structures that work together to provide concurrent, high-performance storage operations.

Sources: src/storage_engine/data_store.rs:26-33 src/storage_engine.rs:1-25 src/storage_engine/entry_iterator.rs:21-25

```mermaid
graph TB
    subgraph "Public Traits (src/storage_engine/traits.rs)"
        Reader["DataStoreReader\nread(), exists(), batch_read()\nread_with_key_hash(), iter_entries()"]
        Writer["DataStoreWriter\nwrite(), delete(), batch_write()\nwrite_stream(), rename(), compact()"]
    end

    subgraph "DataStore (src/storage_engine/data_store.rs:27-33)"
        DS["DataStore"]
        FileHandle["file: Arc<RwLock<BufWriter<File>>>\nWrite-locked sequential appends"]
        MmapHandle["mmap: Arc<Mutex<Arc<Mmap>>>\nRemapped after writes"]
        TailOffset["tail_offset: AtomicU64\nAtomic end-of-file offset"]
        KeyIndex["key_indexer: Arc<RwLock<KeyIndexer>>\nHashMap with collision detection"]
        PathField["path: PathBuf\nStorage file location"]
    end

    subgraph "Support Structures"
        EH["EntryHandle\nmmap_arc + range + metadata\n(simd-r-drive-entry-handle)"]
        EM["EntryMetadata\nkey_hash: u64\nprev_offset: u64\nchecksum: [u8; 4]"]
        EI["EntryIterator\nmmap + cursor + seen_keys\n(entry_iterator.rs)"]
        ES["EntryStream\nimpl Read for streaming\n(entry_stream.rs)"]
    end

    Reader -.impl for.-> DS
    Writer -.impl for.-> DS

    DS --> FileHandle
    DS --> MmapHandle
    DS --> TailOffset
    DS --> KeyIndex
    DS --> PathField

    DS --> EH
    DS --> EI
    EH --> EM
    EH --> ES
```

Core Components

DataStore

The DataStore struct is the primary interface to the storage engine. It maintains four essential shared resources protected by different synchronization primitives:

| Field | Type | Purpose | Synchronization |
| --- | --- | --- | --- |
| file | Arc<RwLock<BufWriter<File>>> | Write operations to disk | RwLock for exclusive writes |
| mmap | Arc<Mutex<Arc<Mmap>>> | Memory-mapped view for reads | Mutex for atomic remapping |
| tail_offset | AtomicU64 | Current end-of-file position | Atomic operations |
| key_indexer | Arc<RwLock<KeyIndexer>> | Hash-based key lookup | RwLock for concurrent reads |
| path | PathBuf | Storage file location | Immutable after creation |

Sources: src/storage_engine/data_store.rs:26-33
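A minimal sketch of this layout, assuming only the fields and synchronization wrappers listed above (the real definition lives at src/storage_engine/data_store.rs:26-33; KeyIndexer is stubbed here so the sketch is self-contained):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::BufWriter;
use std::path::PathBuf;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, RwLock};

use memmap2::Mmap;

// Stub for illustration; the real KeyIndexer packs a 16-bit tag and a
// 48-bit offset into each u64 value (see the KeyIndexer section below).
pub struct KeyIndexer(HashMap<u64, u64>);

pub struct DataStore {
    file: Arc<RwLock<BufWriter<File>>>,   // exclusive, buffered sequential appends
    mmap: Arc<Mutex<Arc<Mmap>>>,          // swapped atomically after each write batch
    tail_offset: AtomicU64,               // published end-of-file offset
    key_indexer: Arc<RwLock<KeyIndexer>>, // key_hash -> packed (tag, offset)
    path: PathBuf,                        // storage file location
}
```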

Traits

The storage engine exposes two primary traits, DataStoreReader and DataStoreWriter, that define its public API.

Sources: src/lib.rs:21 src/storage_engine.rs:21
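A hypothetical shape for a subset of these traits, using the method names from the architecture diagram above; the parameter and return types are assumptions, not the crate’s actual signatures:

```rust
use std::io;

pub struct EntryHandle; // stub; see the EntryHandle section below

// Method names come from the traits diagram; signatures are illustrative only.
pub trait DataStoreReader {
    fn read(&self, key: &[u8]) -> io::Result<Option<EntryHandle>>;
    fn exists(&self, key: &[u8]) -> io::Result<bool>;
    fn batch_read(&self, keys: &[&[u8]]) -> io::Result<Vec<Option<EntryHandle>>>;
    fn read_with_key_hash(&self, key_hash: u64) -> io::Result<Option<EntryHandle>>;
}

pub trait DataStoreWriter {
    fn write(&self, key: &[u8], payload: &[u8]) -> io::Result<u64>;
    fn delete(&self, key: &[u8]) -> io::Result<u64>;
    fn batch_write(&self, entries: &[(&[u8], &[u8])]) -> io::Result<u64>;
    fn rename(&self, old_key: &[u8], new_key: &[u8]) -> io::Result<u64>;
}
```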

EntryHandle

EntryHandle provides a zero-copy view into stored data by maintaining a reference to the memory-mapped file and a byte range. It includes the associated EntryMetadata for accessing checksums and hash values without additional lookups.

Sources: src/storage_engine.rs:24 README.md:43-49
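A minimal sketch, assuming the three fields named in the component diagram (mmap_arc, range, metadata); the as_slice accessor is an assumed convenience method, not a confirmed name:

```rust
use std::ops::Range;
use std::sync::Arc;

use memmap2::Mmap;

pub struct EntryMetadata {
    pub key_hash: u64,
    pub prev_offset: u64,
    pub checksum: [u8; 4],
}

pub struct EntryHandle {
    mmap_arc: Arc<Mmap>, // keeps the mapping alive while the handle exists
    range: Range<usize>, // byte range of the payload within the mapping
    metadata: EntryMetadata,
}

impl EntryHandle {
    // Hypothetical accessor: borrows the payload directly out of the mmap,
    // so no bytes are copied.
    pub fn as_slice(&self) -> &[u8] {
        &self.mmap_arc[self.range.clone()]
    }
}
```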

KeyIndexer

KeyIndexer maintains an in-memory hash map from 64-bit XXH3 key hashes to packed values containing a 16-bit collision detection tag and a 48-bit file offset. This structure enables O(1) key lookups while detecting hash collisions.

Sources: src/storage_engine/data_store.rs:31 README.md:158-169
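The 16/48-bit packing described above can be sketched directly from the bit layout shown in the read-path diagram (tag in the upper 16 bits, offset in the lower 48):

```rust
const OFFSET_MASK: u64 = 0x0000_FFFF_FFFF_FFFF; // lower 48 bits

// Pack a 16-bit collision tag and a 48-bit file offset into one u64.
fn pack(tag: u16, offset: u64) -> u64 {
    debug_assert!(offset <= OFFSET_MASK, "offset must fit in 48 bits");
    ((tag as u64) << 48) | (offset & OFFSET_MASK)
}

// Recover the (tag, offset) pair from a packed value.
fn unpack(packed: u64) -> (u16, u64) {
    ((packed >> 48) as u16, packed & OFFSET_MASK)
}
```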

Component Interaction Flow

The following diagram illustrates how the core components interact during typical read and write operations, showing the actual method calls and data flow between structures.

Sources: src/storage_engine/data_store.rs:827-834 src/storage_engine/data_store.rs:847-939 src/storage_engine/data_store.rs:1040-1049 src/storage_engine/data_store.rs:502-565 src/storage_engine/data_store.rs:224-259

```mermaid
sequenceDiagram
    participant Client
    participant DS as "DataStore"
    participant FileHandle as "file: RwLock&lt;BufWriter&lt;File&gt;&gt;"
    participant TailOffset as "tail_offset: AtomicU64"
    participant MmapHandle as "mmap: Mutex&lt;Arc&lt;Mmap&gt;&gt;"
    participant KeyIndexer as "key_indexer: RwLock&lt;KeyIndexer&gt;"

    Note over Client,KeyIndexer: Write Operation (write:827, write_with_key_hash:832)
    Client->>DS: write(key, payload)
    DS->>DS: compute_hash(key)
    DS->>DS: write_with_key_hash(key_hash, payload)
    DS->>DS: batch_write_with_key_hashes()
    DS->>FileHandle: write().unwrap() [line 852]
    DS->>TailOffset: load(Ordering::Acquire) [line 858]
    DS->>DS: prepad_len(tail_offset) [line 909]
    DS->>FileHandle: write_all(&buffer) [line 933]
    DS->>FileHandle: flush() [line 934]
    DS->>DS: reindex() [line 936]
    DS->>FileHandle: init_mmap() [line 231]
    DS->>MmapHandle: lock().unwrap() [line 232]
    DS->>MmapHandle: *guard = Arc::new(new_mmap) [line 255]
    DS->>KeyIndexer: write().unwrap() [line 233]
    DS->>KeyIndexer: insert(key_hash, offset) [line 246]
    DS->>TailOffset: store(tail_offset, Release) [line 256]
    DS-->>Client: Ok(offset)

    Note over Client,KeyIndexer: Read Operation (read:1040)
    Client->>DS: read(key)
    DS->>DS: compute_hash(key) [line 1041]
    DS->>KeyIndexer: read().unwrap() [line 1042]
    DS->>MmapHandle: get_mmap_arc() [line 1046]
    DS->>DS: read_entry_with_context() [line 1048]
    DS->>KeyIndexer: get_packed(&key_hash) [line 509]
    KeyIndexer-->>DS: Some(packed)
    DS->>DS: KeyIndexer::unpack(packed) [line 510]
    DS->>DS: KeyIndexer::tag_from_key() [line 514]
    DS->>DS: EntryMetadata::deserialize() [line 529]
    DS->>DS: prepad_len(prev_tail) [line 534]
    DS->>DS: EntryHandle { mmap_arc, range, metadata }
    DS-->>Client: Ok(Some(EntryHandle))
```

Storage Initialization and Recovery

The DataStore::open() method performs a multi-stage initialization process that ensures data integrity even after crashes or incomplete writes.

```mermaid
flowchart TD
    Start["DataStore::open(path) [line 84]"]
    OpenFile["open_file_in_append_mode() [line 161]\nOpenOptions::new().read(true).write(true)\n.create(true).seek(SeekFrom::End(0))"]
    GetLen["file.get_ref().metadata()?.len() [line 86]"]
    InitMmap["init_mmap(&file) [line 88]\nmemmap2::MmapOptions::new().map()"]
    Recover["recover_valid_chain(&mmap, file_len) [line 89]\nReturns final_len"]
    CheckTrunc{"final_len < file_len? [line 91]"}
    Warn["warn!() + drop(mmap) + drop(file) [lines 92-99]"]
    Truncate["OpenOptions::open(path)\nfile.set_len(final_len)\nfile.sync_all() [lines 100-102]"]
    Reopen["return Self::open(path) [line 103]"]
    BuildIndex["KeyIndexer::build(mmap_for_indexer, final_len) [line 108]\nForward scan populating HashMap"]
    CreateDS["DataStore { file, mmap, tail_offset,\nkey_indexer, path } [lines 110-116]"]

    Start --> OpenFile
    OpenFile --> GetLen
    GetLen --> InitMmap
    InitMmap --> Recover
    Recover --> CheckTrunc
    CheckTrunc -->|Yes - Corruption| Warn
    CheckTrunc -->|No - Valid| BuildIndex
    Warn --> Truncate
    Truncate --> Reopen
    Reopen --> Start
    BuildIndex --> CreateDS
```

The recover_valid_chain() function at src/storage_engine/data_store.rs:383-482 validates the storage file by:

  1. Scanning backward from file_len with a cursor, reading EntryMetadata at each position
  2. Following the prev_offset chain stored in metadata (line 397), validating that each link points to a valid previous tail
  3. Deriving aligned payload starts using prepad_len(prev_tail) for non-tombstone entries (line 401)
  4. Computing the chain size by accumulating entry sizes while traversing (line 426)
  5. Selecting the deepest valid chain that reaches offset 0 with total_size <= file_len (line 473)
  6. Returning the valid offset, or 0 if no valid chain exists (line 481)

Sources: src/storage_engine/data_store.rs:84-117 src/storage_engine/data_store.rs:161-170 src/storage_engine/data_store.rs:383-482
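A condensed sketch of the open/truncate/reopen loop from the flowchart, with recover_valid_chain stubbed out; the real validator performs the backward-chain walk listed above, and zero-length files, locking, and error reporting are elided here:

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Seek, SeekFrom};
use std::path::Path;

use memmap2::Mmap;

// Stub: the real function walks the prev_offset chain backward and returns
// the length of the deepest valid chain (see the numbered steps above).
fn recover_valid_chain(_mmap: &Mmap, file_len: u64) -> u64 {
    file_len
}

fn open_with_recovery(path: &Path) -> io::Result<(File, Mmap, u64)> {
    let mut file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
    file.seek(SeekFrom::End(0))?; // position for appends
    let file_len = file.metadata()?.len();
    let mmap = unsafe { Mmap::map(&file)? };
    let final_len = recover_valid_chain(&mmap, file_len);
    if final_len < file_len {
        // Trailing bytes belong to an incomplete write: truncate, sync, reopen.
        drop(mmap);
        file.set_len(final_len)?;
        file.sync_all()?;
        return open_with_recovery(path);
    }
    Ok((file, mmap, final_len))
}
```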

Write Path Architecture

Write operations follow a specific sequence to maintain consistency and enable zero-copy reads. All payloads are aligned to 64-byte boundaries using pre-padding.

```mermaid
flowchart TD
    WriteCall["write(key, payload) [line 827]"]
    ComputeHash["compute_hash(key) [line 828]\nXXH3_64 via xxhash_rust"]
    CallWriteHash["write_with_key_hash(key_hash, payload) [line 829]"]
    CallBatch["batch_write_with_key_hashes() [line 833]"]
    AcquireWrite["file.write().unwrap() [line 852]\nAcquire RwLock write guard"]
    LoadTail["tail_offset.load(Ordering::Acquire) [line 858]"]
    LoopEntries["For each (key_hash, payload)"]
    CalcPad["prepad_len(tail_offset) [line 909]\n((A - offset % A) & (A-1))"]
    CheckTombstone{"payload == NULL_BYTE?"}
    AppendPad["buffer.resize(old_len + prepad, 0u8) [line 912]"]
    ComputeChecksum["compute_checksum(payload) [line 916]\nCRC32C via crc32fast"]
    BuildMetadata["EntryMetadata { key_hash, prev_offset,\nchecksum } [lines 917-921]"]
    SimdCopy["simd_copy(&mut entry[..], payload) [line 925]"]
    AppendBuffer["buffer.extend_from_slice(&entry) [line 927]"]
    UpdateTailLoop["tail_offset += entry.len() [line 929]"]
    StoreOffset["key_hash_offsets.push() [line 930]"]
    WriteAll["file.write_all(&buffer) [line 933]"]
    Flush["file.flush() [line 934]"]
    Reindex["reindex(&file, &key_hash_offsets,\ntail_offset, deleted_keys) [line 936]"]
    UpdateTail["tail_offset.store(new_offset, Release) [line 256]"]

    WriteCall --> ComputeHash
    ComputeHash --> CallWriteHash
    CallWriteHash --> CallBatch
    CallBatch --> AcquireWrite
    AcquireWrite --> LoadTail
    LoadTail --> LoopEntries
    LoopEntries --> CheckTombstone
    CheckTombstone -->|No| CalcPad
    CheckTombstone -->|Yes| BuildMetadata
    CalcPad --> AppendPad
    AppendPad --> ComputeChecksum
    ComputeChecksum --> BuildMetadata
    BuildMetadata --> SimdCopy
    SimdCopy --> AppendBuffer
    AppendBuffer --> UpdateTailLoop
    UpdateTailLoop --> StoreOffset
    StoreOffset --> LoopEntries
    LoopEntries --> WriteAll
    WriteAll --> Flush
    Flush --> Reindex
    Reindex --> UpdateTail
```


The reindex() method at src/storage_engine/data_store.rs:224-259 performs critical post-write operations:

  1. Memory map update: Calls init_mmap(write_guard) (line 231) to create a new Mmap, then atomically swaps it via *mmap_guard = Arc::new(new_mmap) (line 255)
  2. Index update: Iterates key_hash_offsets, calling key_indexer_guard.insert(*key_hash, *offset) (line 246) or remove(key_hash) for tombstones (line 243)
  3. Collision detection: insert() returns Err on tag mismatch, aborting the batch operation (lines 246-251)
  4. Atomic tail update: tail_offset.store(tail_offset, Ordering::Release) (line 256) publishes the new end position

Sources: src/storage_engine/data_store.rs:847-939 src/storage_engine/data_store.rs:224-259 src/storage_engine/data_store.rs:670-673

Read Path Architecture

Read operations use the hash index for O(1) lookup and return EntryHandle instances that provide zero-copy access to the memory-mapped data.

```mermaid
flowchart TD
    ReadCall["read(key) [line 1040]"]
    ComputeHash["compute_hash(key) [line 1041]\nXXH3_64 via xxhash_rust"]
    AcquireIndex["key_indexer.read().unwrap() [line 1042]\nAcquire RwLock read guard"]
    GetMmap["get_mmap_arc() [line 1046]\nmmap.lock() + clone Arc"]
    ReadContext["read_entry_with_context(Some(key),\nkey_hash, &mmap_arc, &guard) [line 1048]"]
    Lookup["key_indexer_guard.get_packed(&key_hash) [line 509]"]
    CheckFound{"Found?"}
    Unpack["KeyIndexer::unpack(packed) [line 510]\ntag = (packed >> 48) as u16\noffset = packed & 0xFFFF_FFFF_FFFF"]
    CheckNonHashed{"non_hashed_key.is_some()?"}
    ComputeTag["KeyIndexer::tag_from_key(non_hashed_key) [line 514]"]
    VerifyTag{"tag == computed_tag?"}
    BoundsCheck{"offset + METADATA_SIZE > mmap_arc.len()?"}
    ReadMetadata["EntryMetadata::deserialize(\n&mmap_arc[offset..offset+METADATA_SIZE]) [line 529]"]
    GetPrevTail["prev_tail = metadata.prev_offset [line 533]"]
    CalcDerived["derived = prev_tail + prepad_len(prev_tail) [line 534]"]
    CheckTombstonePattern{"entry_end - prev_tail == 1 &&\nmmap_arc[prev_tail..entry_end] == NULL_BYTE?"}
    SetStartPrevTail["entry_start = prev_tail [line 543]"]
    SetStartDerived["entry_start = derived [line 537]"]
    BoundsCheckRange{"entry_start >= entry_end ||\nentry_end > mmap_arc.len()?"}
    CheckTombstone{"entry_end - entry_start == 1 &&\nmmap_arc[entry_start..entry_end] == NULL_BYTE? [line 551]"}
    CreateHandle["EntryHandle { mmap_arc, range:\nentry_start..entry_end, metadata } [line 560]"]
    ReturnNone["Return None"]

    ReadCall --> ComputeHash
    ComputeHash --> AcquireIndex
    AcquireIndex --> GetMmap
    GetMmap --> ReadContext
    ReadContext --> Lookup
    Lookup --> CheckFound
    CheckFound -->|No| ReturnNone
    CheckFound -->|Yes| Unpack
    Unpack --> CheckNonHashed
    CheckNonHashed -->|Yes| ComputeTag
    CheckNonHashed -->|No| BoundsCheck
    ComputeTag --> VerifyTag
    VerifyTag -->|No| ReturnNone
    VerifyTag -->|Yes| BoundsCheck
    BoundsCheck -->|Out of bounds| ReturnNone
    BoundsCheck -->|Valid| ReadMetadata
    ReadMetadata --> GetPrevTail
    GetPrevTail --> CalcDerived
    CalcDerived --> CheckTombstonePattern
    CheckTombstonePattern -->|Yes| SetStartPrevTail
    CheckTombstonePattern -->|No| SetStartDerived
    SetStartPrevTail --> BoundsCheckRange
    SetStartDerived --> BoundsCheckRange
    BoundsCheckRange -->|Invalid| ReturnNone
    BoundsCheckRange -->|Valid| CheckTombstone
    CheckTombstone -->|Yes| ReturnNone
    CheckTombstone -->|No| CreateHandle
```

The tag verification at src/storage_engine/data_store.rs:513-521 prevents hash collision issues:

  1. Extract stored tag: Upper 16 bits of the packed value via (packed >> 48) as u16 (line 510)
  2. Compute expected tag: When a non_hashed_key is provided, calls KeyIndexer::tag_from_key(non_hashed_key) (line 514)
  3. Compare tags: Returns None with a warning if a mismatch is detected (lines 516-520)

This catches collisions where two different keys produce the same 64-bit XXH3_64 hash, using additional key bytes for verification.

Sources: src/storage_engine/data_store.rs:1040-1049 src/storage_engine/data_store.rs:502-565 src/storage_engine/data_store.rs:658-663
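A sketch of this check under the packing scheme shown earlier; tag_from_key here is a stand-in for the crate’s real tag derivation, which is not specified in this document:

```rust
const OFFSET_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

// Stand-in for KeyIndexer::tag_from_key; any stable 16-bit digest of the
// key bytes illustrates the idea.
fn tag_from_key(key: &[u8]) -> u16 {
    key.iter().fold(0u16, |acc, b| acc.rotate_left(5) ^ *b as u16)
}

// Returns the 48-bit offset only if the stored tag matches the key's tag,
// rejecting entries whose 64-bit hash collided with a different key.
fn verify_and_unpack(packed: u64, key: &[u8]) -> Option<u64> {
    let stored_tag = (packed >> 48) as u16;
    if stored_tag != tag_from_key(key) {
        return None; // same XXH3_64 hash, different key
    }
    Some(packed & OFFSET_MASK)
}
```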

Batch Operations

Batch operations reduce lock acquisition overhead by processing multiple entries in a single synchronized operation.

Batch Write

The batch_write() method at src/storage_engine/data_store.rs:838-843 accumulates all entries in an in-memory buffer before acquiring the write lock once:

Sources: src/storage_engine/data_store.rs:838-843 src/storage_engine/data_store.rs:847-939

```mermaid
flowchart TD
    BatchWrite["batch_write(entries: &[(&[u8], &[u8])]) [line 838]"]
    UnzipKeys["(keys, payloads): (Vec<_>, Vec<_>) = entries.iter().unzip() [line 839]"]
    HashAll["compute_hash_batch(&keys) [line 840]\nParallel/SIMD XXH3_64 if available"]
    ZipHashes["hashes.zip(payloads) [line 841]"]
    CallBatchHash["batch_write_with_key_hashes(hashed_entries, false) [line 842]"]
    AcquireWrite["file.write().unwrap() [line 852]"]
    LoadTail["tail_offset.load(Ordering::Acquire) [line 858]"]
    InitBuffer["buffer = Vec::new() [line 857]"]
    InitVecs["key_hash_offsets = Vec::new()\ndeleted_keys = HashSet::new() [lines 860-861]"]
    LoopStart{"For each (key_hash, payload)"}
    CheckNull{"payload == NULL_BYTE?"}
    TombstonePath["Write tombstone:\n1 byte + metadata [lines 872-896]"]
    NormalPath["prepad_len(tail_offset) [line 909]\nbuffer.resize(old_len + prepad, 0u8) [line 912]"]
    ComputeCheck["compute_checksum(payload) [line 916]"]
    BuildMeta["EntryMetadata { key_hash,\nprev_offset, checksum } [lines 917-921]"]
    SimdCopy["simd_copy(&mut entry, payload) [line 925]"]
    AppendEntry["buffer.extend_from_slice(&entry) [line 927]"]
    UpdateTailLocal["tail_offset += entry.len() [line 929]"]
    PushOffset["key_hash_offsets.push((key_hash, offset)) [line 930]"]
    LoopContinue["Continue loop"]
    WriteAll["file.write_all(&buffer) [line 933]"]
    Flush["file.flush() [line 934]"]
    Reindex["reindex(&file, &key_hash_offsets,\ntail_offset, Some(&deleted_keys)) [line 936]"]
    Return["Ok(tail_offset) [line 938]"]

    BatchWrite --> UnzipKeys
    UnzipKeys --> HashAll
    HashAll --> ZipHashes
    ZipHashes --> CallBatchHash
    CallBatchHash --> AcquireWrite
    AcquireWrite --> LoadTail
    LoadTail --> InitBuffer
    InitBuffer --> InitVecs
    InitVecs --> LoopStart
    LoopStart --> CheckNull
    CheckNull -->|Yes| TombstonePath
    CheckNull -->|No| NormalPath
    TombstonePath --> PushOffset
    NormalPath --> ComputeCheck
    ComputeCheck --> BuildMeta
    BuildMeta --> SimdCopy
    SimdCopy --> AppendEntry
    AppendEntry --> UpdateTailLocal
    UpdateTailLocal --> PushOffset
    PushOffset --> LoopContinue
    LoopContinue --> LoopStart
    LoopStart --> WriteAll
    WriteAll --> Flush
    Flush --> Reindex
    Reindex --> Return
```
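A condensed sketch of the front-end stage shown above (hash all keys once, then enter the locked batch path); compute_hash_batch and batch_write_with_key_hashes are stubbed, and xxh3_64 stands in for whatever SIMD-accelerated hasher the crate selects:

```rust
use std::io;

use xxhash_rust::xxh3::xxh3_64;

// Stand-in for the crate's (possibly parallel/SIMD) batch hasher.
fn compute_hash_batch(keys: &[&[u8]]) -> Vec<u64> {
    keys.iter().map(|k| xxh3_64(k)).collect()
}

// Stub for the locked write path diagrammed above.
fn batch_write_with_key_hashes(_entries: Vec<(u64, &[u8])>, _allow_null: bool) -> io::Result<u64> {
    Ok(0)
}

fn batch_write(entries: &[(&[u8], &[u8])]) -> io::Result<u64> {
    // Split keys from payloads, hash every key in one pass, then re-pair
    // the hashes with their payloads for the single locked append.
    let (keys, payloads): (Vec<&[u8]>, Vec<&[u8]>) = entries.iter().copied().unzip();
    let hashes = compute_hash_batch(&keys);
    let hashed: Vec<(u64, &[u8])> = hashes.into_iter().zip(payloads).collect();
    batch_write_with_key_hashes(hashed, false)
}
```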

Batch Read

The batch_read() method at src/storage_engine/data_store.rs:1105-1109 delegates to an implementation (lines 1111-1158) that acquires its locks once for the entire batch and performs all lookups; a generic sketch of this locking pattern follows the list below:

  1. Single mmap clone: get_mmap_arc() clones the Arc<Mmap> once (line 1116)
  2. Single index lock: key_indexer.read().unwrap() is held for the entire batch (line 1117)
  3. Conditional tag verification: When non_hashed_keys are provided, passes Some(key) for collision detection (lines 1134-1145)
  4. Parallel option: With the parallel feature, par_iter_entries() uses Rayon for multi-threaded processing

Sources: src/storage_engine/data_store.rs:1105-1109 src/storage_engine/data_store.rs:1111-1158
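The locking pattern generalizes as follows; this illustrates the single-guard idea with a plain HashMap and is not the crate’s code:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// One read-guard acquisition serves every lookup in the batch,
// instead of re-locking the index once per key.
fn batch_lookup(index: &RwLock<HashMap<u64, u64>>, key_hashes: &[u64]) -> Vec<Option<u64>> {
    let guard = index.read().unwrap(); // single lock acquisition
    key_hashes.iter().map(|h| guard.get(h).copied()).collect()
}
```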

Iteration Support

The storage engine provides multiple iteration mechanisms for different use cases:

| Iterator | Type | Synchronization | Use Case | Source |
| --- | --- | --- | --- | --- |
| EntryIterator | Sequential | Single lock acquisition | Backward chain traversal | entry_iterator.rs:21-128 |
| par_iter_entries() | Parallel (Rayon) | Lock-free after setup | Multi-threaded scanning | data_store.rs:296-361 |
| IntoIterator | Consuming | Single lock acquisition | Owned iteration | data_store.rs:44-51 |
| iter_entries() | Non-consuming | Single lock acquisition | Reference iteration | data_store.rs:276-280 |

EntryIterator Implementation

The EntryIterator at src/storage_engine/entry_iterator.rs:21-128 scans backward through the file.

Key features:

  • Backward traversal: Follows the prev_offset chain from tail_offset to 0 (line 106)
  • Deduplication: seen_keys: HashSet<u64, Xxh3BuildHasher> tracks visited keys (lines 24, 110)
  • Tombstone filtering: Skips single-NULL-byte entries (lines 117-119)
  • Zero-copy: Returns EntryHandle with Arc::clone(&self.mmap) (line 122)

Sources: src/storage_engine/entry_iterator.rs:8-128 src/storage_engine/data_store.rs:276-280 src/storage_engine/data_store.rs:296-361 src/storage_engine/data_store.rs:44-51
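Because traversal runs newest-to-oldest, deduplication reduces to a first-seen-wins membership test, sketched here in isolation:

```rust
use std::collections::HashSet;

// Newest-first traversal: the first occurrence of a key_hash is the live
// version, so any later (older) occurrence is skipped.
fn should_yield(seen_keys: &mut HashSet<u64>, key_hash: u64) -> bool {
    seen_keys.insert(key_hash) // returns false if the key was already seen
}
```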

Key Design Principles

Append-Only Model

All write operations append data to the end of the file. Updates and deletes create new entries rather than modifying existing data. This design ensures:

  • No seeking during writes: Sequential disk I/O for maximum throughput
  • Crash safety: Incomplete writes are detected and truncated during recovery
  • Zero-copy reads: Memory-mapped regions remain valid as data is never modified

Sources: README.md:98-103

Fixed Alignment

Every non-tombstone payload starts on a 64-byte boundary (configurable via the PAYLOAD_ALIGNMENT constant). The prepad_len() function at src/storage_engine/data_store.rs:670-673 computes the padding:
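A sketch matching the formula shown in the write-path diagram, ((A - offset % A) & (A - 1)), assuming A = 64:

```rust
const PAYLOAD_ALIGNMENT: u64 = 64; // A; must be a power of two

// Bytes of zero pre-padding needed so the next payload starts on an
// A-byte boundary; yields 0 when the offset is already aligned.
fn prepad_len(offset: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (offset % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
}

fn main() {
    assert_eq!(prepad_len(0), 0);    // already aligned
    assert_eq!(prepad_len(1), 63);   // pad up to the next boundary
    assert_eq!(prepad_len(64), 0);
    assert_eq!(prepad_len(100), 28); // 100 + 28 = 128
    println!("prepad checks passed");
}
```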

This alignment enables:

  • Cache-line efficiency: Payloads align with 64-byte CPU cache lines
  • SIMD operations: Full-speed AVX2/AVX-512 reads without crossing boundaries
  • Zero-copy typed views: Safe reinterpretation as &[T] without unaligned access

The padding is written as zero bytes (line 769) and not counted in entry.file_size(), making it transparent to users.

Sources: src/storage_engine/data_store.rs:670-673 src/storage_engine/data_store.rs:766-771

Metadata-After-Payload

Each entry stores its metadata immediately after the payload, forming a backward-linked chain. This design:

  • Minimizes seeking: Metadata and payload are adjacent on disk
  • Enables recovery: The chain can be validated backward from any point
  • Supports nesting: Storage containers can be stored within other containers

Sources: README.md:139-148
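The field set shown in the component diagram implies a 20-byte metadata block trailing each payload. The sketch below serializes it; the field order and little-endian encoding are assumptions for illustration, not the confirmed on-disk layout:

```rust
// 8 (key_hash) + 8 (prev_offset) + 4 (checksum) bytes.
const METADATA_SIZE: usize = 8 + 8 + 4;

// Pack the metadata fields into the block written immediately after the
// payload, forming one link of the backward chain described above.
fn serialize_metadata(key_hash: u64, prev_offset: u64, checksum: [u8; 4]) -> [u8; METADATA_SIZE] {
    let mut buf = [0u8; METADATA_SIZE];
    buf[0..8].copy_from_slice(&key_hash.to_le_bytes());
    buf[8..16].copy_from_slice(&prev_offset.to_le_bytes());
    buf[16..20].copy_from_slice(&checksum);
    buf
}
```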

Lock-Free Reads

Read operations do not acquire write locks and perform zero-copy access through memory-mapped views. The AtomicU64 tail offset ensures readers see a consistent view without blocking writes.

Sources: README.md:170-183 src/storage_engine/data_store.rs:1040-1049
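The Release/Acquire pairing behind this guarantee can be illustrated in isolation: the writer publishes the new tail only after the remapped mmap is in place, and readers take a snapshot without ever touching the write lock:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Writer side: called after the mmap swap, as in reindex() above.
fn publish_tail(tail_offset: &AtomicU64, new_tail: u64) {
    tail_offset.store(new_tail, Ordering::Release);
}

// Reader side: a consistent snapshot of the end-of-file position.
fn snapshot_tail(tail_offset: &AtomicU64) -> u64 {
    tail_offset.load(Ordering::Acquire)
}
```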
