Core Storage Engine
Relevant source files
- README.md
- src/lib.rs
- src/storage_engine.rs
- src/storage_engine/data_store.rs
- src/storage_engine/entry_iterator.rs
The core storage engine is the foundational layer of SIMD R Drive, providing append-only, memory-mapped binary storage with zero-copy access. This document covers the high-level architecture and key components that implement the storage functionality.
For network-based access to the storage engine, see Network Layer and RPC. For Python integration, see Python Integration. For performance optimizations including SIMD acceleration, see Performance Optimizations.
Purpose and Scope
This page introduces the core storage engine's architecture, primary components, and design principles. Detailed information about specific subsystems is covered in the child pages of this section.
Architecture Overview
The storage engine implements an append-only binary store with in-memory indexing and memory-mapped file access. The architecture consists of four primary structures that work together to provide concurrent, high-performance storage operations.
Sources: src/storage_engine/data_store.rs:26-33 src/storage_engine.rs:1-25
```mermaid
graph TB
    subgraph "Public Traits"
        Reader["DataStoreReader\nread, exists, batch_read"]
        Writer["DataStoreWriter\nwrite, delete, batch_write"]
    end
    subgraph "DataStore Structure"
        DS["DataStore"]
        FileHandle["Arc<RwLock<BufWriter<File>>>\nfile\nSequential write operations"]
        MmapHandle["Arc<Mutex<Arc<Mmap>>>\nmmap\nMemory-mapped read access"]
        TailOffset["AtomicU64\ntail_offset\nCurrent end position"]
        KeyIndexer["Arc<RwLock<KeyIndexer>>\nkey_indexer\nHash to offset mapping"]
        PathBuf["PathBuf\npath\nFile system location"]
    end
    subgraph "Support Structures"
        EH["EntryHandle\nZero-copy data view"]
        EM["EntryMetadata\nkey_hash, prev_offset, checksum"]
        EI["EntryIterator\nSequential traversal"]
        ES["EntryStream\nStreaming reads"]
    end
    Reader -.implements.-> DS
    Writer -.implements.-> DS
    DS --> FileHandle
    DS --> MmapHandle
    DS --> TailOffset
    DS --> KeyIndexer
    DS --> PathBuf
    DS --> EH
    DS --> EI
    EH --> EM
    EH --> ES
```
Core Components
DataStore
The DataStore struct is the primary interface to the storage engine. It holds five fields: four shared resources, each protected by a synchronization primitive suited to its access pattern, plus the immutable storage path:
| Field | Type | Purpose | Synchronization |
|---|---|---|---|
| `file` | `Arc<RwLock<BufWriter<File>>>` | Write operations to disk | `RwLock` for exclusive writes |
| `mmap` | `Arc<Mutex<Arc<Mmap>>>` | Memory-mapped view for reads | `Mutex` for atomic remapping |
| `tail_offset` | `AtomicU64` | Current end-of-file position | Atomic operations |
| `key_indexer` | `Arc<RwLock<KeyIndexer>>` | Hash-based key lookup | `RwLock` for concurrent reads |
| `path` | `PathBuf` | Storage file location | Immutable after creation |
Sources: src/storage_engine/data_store.rs:26-33
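For orientation, the struct shape implied by the table looks approximately like the sketch below. Field names and types are from the table; the authoritative definition is at src/storage_engine/data_store.rs:26-33, and the `KeyIndexer` body here is a stand-in:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::BufWriter;
use std::path::PathBuf;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, RwLock};

use memmap2::Mmap;

/// Stand-in for the real KeyIndexer (key hash -> packed tag+offset map).
pub struct KeyIndexer(HashMap<u64, u64>);

/// Approximate shape of DataStore as implied by the table above.
pub struct DataStore {
    file: Arc<RwLock<BufWriter<File>>>,   // exclusive, buffered append writes
    mmap: Arc<Mutex<Arc<Mmap>>>,          // swappable zero-copy read view
    tail_offset: AtomicU64,               // current end-of-file position
    key_indexer: Arc<RwLock<KeyIndexer>>, // key hash -> (tag, offset)
    path: PathBuf,                        // storage file location (immutable)
}
```

Note the nested `Arc<Mutex<Arc<Mmap>>>`: the outer handle is the shared slot that gets swapped on remap, while the inner `Arc<Mmap>` lets readers keep the previous map alive after a swap.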
Traits
The storage engine exposes two primary traits that define its public API:
- `DataStoreReader`: Provides zero-copy read operations, including `read()`, `batch_read()`, `exists()`, and iteration methods. Implemented at src/storage_engine/data_store.rs:1027-1182
- `DataStoreWriter`: Provides write operations, including `write()`, `batch_write()`, `delete()`, and streaming writes. Implemented at src/storage_engine/data_store.rs:752-1025
Sources: src/lib.rs:21 src/storage_engine.rs:21
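The two traits are used together through a DataStore instance. Below is a hedged usage sketch; the crate path (`simd_r_drive`), the `io::Result` return types, and `AsRef<[u8]>` access on `EntryHandle` are assumptions based on this page, not confirmed signatures:

```rust
use std::path::Path;

use simd_r_drive::{DataStore, DataStoreReader, DataStoreWriter}; // assumed paths

fn demo() -> std::io::Result<()> {
    // open() runs the recovery and index-build steps described later on
    // this page.
    let store = DataStore::open(Path::new("example.bin"))?;

    // Writes append at the tail; per the sequence diagram below, the new
    // offset is returned on success.
    let offset = store.write(b"user:1", b"alice")?;
    println!("appended at offset {offset}");

    // Reads are zero-copy: the handle borrows the memory-mapped region.
    if let Some(entry) = store.read(b"user:1")? {
        let bytes: &[u8] = entry.as_ref(); // assumes EntryHandle: AsRef<[u8]>
        assert_eq!(bytes, b"alice");
    }
    Ok(())
}
```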
EntryHandle
EntryHandle provides a zero-copy view into stored data by maintaining a reference to the memory-mapped file and a byte range. It includes the associated EntryMetadata for accessing checksums and hash values without additional lookups.
Sources: src/storage_engine.rs:24 README.md:43-49
KeyIndexer
KeyIndexer maintains an in-memory hash map from 64-bit XXH3 key hashes to packed values containing a 16-bit collision detection tag and a 48-bit file offset. This structure enables O(1) key lookups while detecting hash collisions.
Sources: src/storage_engine/data_store.rs:31 README.md:158-169
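The packing scheme amounts to a few lines of bit arithmetic. Here is a minimal sketch of the layout described above (16-bit tag in the high bits, 48-bit offset in the low bits); the function names are illustrative, not the crate's API:

```rust
const OFFSET_BITS: u32 = 48;
const OFFSET_MASK: u64 = (1 << OFFSET_BITS) - 1;

/// Pack a 16-bit collision tag and a 48-bit file offset into one u64.
fn pack(tag: u16, offset: u64) -> u64 {
    debug_assert!(offset <= OFFSET_MASK, "offset must fit in 48 bits");
    ((tag as u64) << OFFSET_BITS) | (offset & OFFSET_MASK)
}

/// Recover the (tag, offset) pair from a packed value.
fn unpack(packed: u64) -> (u16, u64) {
    ((packed >> OFFSET_BITS) as u16, packed & OFFSET_MASK)
}

fn main() {
    let packed = pack(0xBEEF, 1_048_576);
    assert_eq!(unpack(packed), (0xBEEF, 1_048_576));
}
```

A 48-bit offset caps the storage file at 256 TiB, which leaves the remaining 16 bits of the u64 free to serve as the collision tag.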
Component Interaction Flow
The following diagram illustrates how the core components interact during typical read and write operations, showing the actual method calls and data flow between structures.
Sources: src/storage_engine/data_store.rs:752-825 src/storage_engine/data_store.rs:1040-1049 src/storage_engine/data_store.rs:224-259
```mermaid
sequenceDiagram
    participant Client
    participant DS as "DataStore"
    participant FileHandle as "file: RwLock<BufWriter<File>>"
    participant TailOffset as "tail_offset: AtomicU64"
    participant MmapHandle as "mmap: Mutex<Arc<Mmap>>"
    participant KeyIndexer as "key_indexer: RwLock<KeyIndexer>"
    Note over Client,KeyIndexer: Write Operation
    Client->>DS: write(key, payload)
    DS->>DS: compute_hash(key)
    DS->>TailOffset: load(Ordering::Acquire)
    DS->>FileHandle: write().unwrap()
    DS->>FileHandle: write_all(pre_pad + payload + metadata)
    DS->>FileHandle: flush()
    DS->>DS: reindex()
    DS->>FileHandle: init_mmap()
    DS->>MmapHandle: lock().unwrap()
    DS->>MmapHandle: *guard = Arc::new(new_mmap)
    DS->>KeyIndexer: write().unwrap()
    DS->>KeyIndexer: insert(key_hash, offset)
    DS->>TailOffset: store(new_offset, Ordering::Release)
    DS-->>Client: Ok(offset)
    Note over Client,KeyIndexer: Read Operation
    Client->>DS: read(key)
    DS->>DS: compute_hash(key)
    DS->>KeyIndexer: read().unwrap()
    DS->>KeyIndexer: get_packed(key_hash)
    KeyIndexer-->>DS: Some(packed_value)
    DS->>DS: KeyIndexer::unpack(packed)
    DS->>MmapHandle: lock().unwrap()
    DS->>MmapHandle: clone Arc
    DS->>DS: read_entry_with_context()
    DS->>DS: EntryMetadata::deserialize()
    DS->>DS: create EntryHandle
    DS-->>Client: Ok(Some(EntryHandle))
```
Storage Initialization and Recovery
The DataStore::open() method performs a multi-stage initialization process that ensures data integrity even after crashes or incomplete writes.
```mermaid
flowchart TD
    Start["DataStore::open(path)"]
    OpenFile["open_file_in_append_mode()\nOpenOptions::new()\n.read(true).write(true).create(true)"]
    InitMmap["init_mmap()\nmemmap2::MmapOptions::new().map()"]
    Recover["recover_valid_chain(mmap, file_len)\nBackward scan validating prev_offset chain"]
    CheckTrunc{"final_len < file_len?"}
    Truncate["Truncate file to final_len\nfile.set_len(final_len)\nReopen storage"]
    BuildIndex["KeyIndexer::build(mmap, final_len)\nForward scan to populate index"]
    CreateDS["Create DataStore instance\nwith Arc-wrapped fields"]
    Start --> OpenFile
    OpenFile --> InitMmap
    InitMmap --> Recover
    Recover --> CheckTrunc
    CheckTrunc -->|Yes Corruption detected| Truncate
    CheckTrunc -->|No File valid| BuildIndex
    Truncate --> Start
    BuildIndex --> CreateDS
```
The recovery process validates the storage file by:

- Scanning backward from the end of the file, following `prev_offset` links in `EntryMetadata`
- Verifying chain integrity, ensuring each offset points to a valid previous tail
- Truncating corruption if an invalid chain is detected
- Building the index by forward-scanning the validated region
Sources: src/storage_engine/data_store.rs:84-117 src/storage_engine/data_store.rs:383-482
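To make the backward scan concrete, here is a deliberately simplified sketch of the chain check. It assumes little-endian fields and the 20-byte metadata layout given on this page (8-byte key_hash, 8-byte prev_offset, 4-byte checksum); the real recover_valid_chain at src/storage_engine/data_store.rs:383-482 additionally handles pre-padding and torn tail candidates:

```rust
const METADATA_LEN: u64 = 20;

/// Toy chain validation: returns true if prev_offset links walk strictly
/// backward from `file_len` down to the first entry.
fn chain_is_valid(data: &[u8], file_len: u64) -> bool {
    let mut tail = file_len;
    while tail >= METADATA_LEN {
        let meta_start = (tail - METADATA_LEN) as usize;
        // prev_offset is the second 8-byte field of the metadata block.
        let prev = u64::from_le_bytes(
            data[meta_start + 8..meta_start + 16].try_into().unwrap(),
        );
        if prev == 0 {
            return true; // reached the first entry: chain closes cleanly
        }
        if prev >= tail {
            return false; // link does not point strictly backward: corrupt
        }
        tail = prev; // hop to the previous entry's tail
    }
    tail == 0
}
```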
Write Path Architecture

Write operations follow a specific sequence to maintain consistency and enable zero-copy reads. All payloads are aligned to 64-byte boundaries using pre-padding.

```mermaid
flowchart TD
    WriteCall["write(key, payload)"]
    ComputeHash["compute_hash(key)\nXXH3_64 hash"]
    AcquireWrite["file.write().unwrap()\nAcquire write lock"]
    LoadTail["tail_offset.load(Ordering::Acquire)"]
    CalcPad["prepad_len(prev_tail)\n(A - offset % A) & (A-1)"]
    WritePad["write_all(&pad[..prepad])\nZero bytes for alignment"]
    WritePayload["write_all(payload)\nActual data"]
    WriteMetadata["write_all(metadata.serialize())\nkey_hash + prev_offset + checksum"]
    Flush["flush()\nEnsure disk persistence"]
    Reindex["reindex()\nUpdate mmap and index"]
    UpdateTail["tail_offset.store(new_offset)\nOrdering::Release"]
    WriteCall --> ComputeHash
    ComputeHash --> AcquireWrite
    AcquireWrite --> LoadTail
    LoadTail --> CalcPad
    CalcPad --> WritePad
    WritePad --> WritePayload
    WritePayload --> WriteMetadata
    WriteMetadata --> Flush
    Flush --> Reindex
    Reindex --> UpdateTail
```

The reindex() method performs two critical operations after each write:

- Memory map update: Creates a new `Mmap` view of the file and atomically swaps it in
- Index update: Inserts new key-hash-to-offset mappings into `KeyIndexer`
Sources: src/storage_engine/data_store.rs:827-834 src/storage_engine/data_store.rs:758-825 src/storage_engine/data_store.rs:224-259
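The padding formula in the CalcPad step is standard power-of-two alignment arithmetic. A runnable rendering, assuming PAYLOAD_ALIGNMENT is the 64 bytes stated elsewhere on this page:

```rust
/// Alignment for payload starts; 64 bytes per this page.
const PAYLOAD_ALIGNMENT: u64 = 64;

/// Number of zero bytes to write before a payload so that it starts on
/// the next PAYLOAD_ALIGNMENT boundary. Matches (A - offset % A) & (A-1)
/// from the flowchart; valid only for power-of-two A.
fn prepad_len(offset: u64) -> u64 {
    const A: u64 = PAYLOAD_ALIGNMENT;
    (A - offset % A) & (A - 1)
}

fn main() {
    assert_eq!(prepad_len(0), 0);    // already aligned
    assert_eq!(prepad_len(64), 0);   // already aligned
    assert_eq!(prepad_len(65), 63);  // pad to 128
    assert_eq!(prepad_len(100), 28); // 100 + 28 = 128
}
```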
Read Path Architecture
Read operations use the hash index for O(1) lookup and return EntryHandle instances that provide zero-copy access to the memory-mapped data.
```mermaid
flowchart TD
    ReadCall["read(key)"]
    ComputeHash["compute_hash(key)\nXXH3_64 hash"]
    AcquireIndex["key_indexer.read().unwrap()\nAcquire read lock"]
    GetMmap["get_mmap_arc()\nClone Arc<Mmap>"]
    ReadContext["read_entry_with_context()"]
    Lookup["key_indexer.get_packed(key_hash)"]
    CheckFound{"Found?"}
    Unpack["KeyIndexer::unpack(packed)\nExtract tag + offset"]
    VerifyTag{"Tag matches?"}
    ReadMetadata["EntryMetadata::deserialize()\nFrom mmap[offset..offset+20]"]
    CalcRange["Derive entry range\nentry_start..entry_end"]
    CheckTombstone{"Is tombstone?"}
    CreateHandle["Create EntryHandle\nmmap_arc + range + metadata"]
    ReturnNone["Return None"]
    ReadCall --> ComputeHash
    ComputeHash --> AcquireIndex
    AcquireIndex --> GetMmap
    GetMmap --> ReadContext
    ReadContext --> Lookup
    Lookup --> CheckFound
    CheckFound -->|No| ReturnNone
    CheckFound -->|Yes| Unpack
    Unpack --> VerifyTag
    VerifyTag -->|No Collision| ReturnNone
    VerifyTag -->|Yes| ReadMetadata
    ReadMetadata --> CalcRange
    CalcRange --> CheckTombstone
    CheckTombstone -->|Yes| ReturnNone
    CheckTombstone -->|No| CreateHandle
```
The tag verification step at src/storage_engine/data_store.rs:513-521 detects hash collisions by comparing a 16-bit tag derived from the original key against the tag stored in the packed index value; on a mismatch, the lookup is treated as a miss.
Sources: src/storage_engine/data_store.rs:1040-1049 src/storage_engine/data_store.rs:501-565
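A hedged sketch of that guard: the high 16 bits of the packed index value hold the stored tag, and a tag re-derived from the lookup key must match it. The derivation function below is hypothetical; this page only establishes that a mismatch is treated as a miss:

```rust
/// Hypothetical tag derivation; the real one lives in KeyIndexer.
fn tag_from_key(key: &[u8]) -> u16 {
    key.iter().fold(0u16, |acc, &b| acc.rotate_left(5) ^ b as u16)
}

/// Collision guard from the read path: reject the index hit when the
/// stored tag does not match the tag derived from the queried key.
fn tag_matches(key: &[u8], packed: u64) -> bool {
    let stored_tag = (packed >> 48) as u16;
    stored_tag == tag_from_key(key)
}
```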
Batch Operations
Batch operations reduce lock acquisition overhead by processing multiple entries in a single synchronized operation.
```mermaid
flowchart TD
    BatchWrite["batch_write(entries)"]
    HashAll["compute_hash_batch(keys)\nParallel XXH3_64"]
    AcquireWrite["file.write().unwrap()"]
    InitBuffer["buffer = Vec::new()"]
    LoopStart{"For each entry"}
    CalcPad["prepad_len(tail_offset)"]
    BuildEntry["Build entry:\npre_pad + payload + metadata"]
    AppendBuffer["buffer.extend_from_slice(entry)"]
    UpdateTail["tail_offset += entry.len()"]
    LoopEnd["Next entry"]
    WriteAll["file.write_all(&buffer)"]
    Flush["file.flush()"]
    Reindex["reindex(key_hash_offsets)"]
    BatchWrite --> HashAll
    HashAll --> AcquireWrite
    AcquireWrite --> InitBuffer
    InitBuffer --> LoopStart
    LoopStart -->|More entries| CalcPad
    CalcPad --> BuildEntry
    BuildEntry --> AppendBuffer
    AppendBuffer --> UpdateTail
    UpdateTail --> LoopEnd
    LoopEnd --> LoopStart
    LoopStart -->|Done| WriteAll
    WriteAll --> Flush
    Flush --> Reindex
```
Batch Write
The batch_write() method serializes all entries into a single in-memory buffer, then commits the whole batch with one write-lock acquisition, one write_all() call, and one flush().
Sources: src/storage_engine/data_store.rs:838-843 src/storage_engine/data_store.rs:847-939
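A hedged usage sketch, assuming batch_write() accepts a slice of key/payload pairs and returns an io::Result like the single-entry path:

```rust
use simd_r_drive::{DataStore, DataStoreWriter}; // assumed paths

fn write_many(store: &DataStore) -> std::io::Result<()> {
    let entries: &[(&[u8], &[u8])] = &[
        (b"key-1", b"payload-1"),
        (b"key-2", b"payload-2"),
    ];
    // One lock acquisition, one write_all(), one flush(), one reindex()
    // for the whole batch (per the flowchart above).
    store.batch_write(entries)?;
    Ok(())
}
```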
Batch Read
The batch_read() method acquires the index lock once and performs all lookups before releasing it.
Sources: src/storage_engine/data_store.rs:1105-1109 src/storage_engine/data_store.rs:1111-1158
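A matching sketch for reads, assuming batch_read() returns one Option per key in input order:

```rust
use simd_r_drive::{DataStore, DataStoreReader}; // assumed paths

fn read_many(store: &DataStore) -> std::io::Result<()> {
    let keys: &[&[u8]] = &[b"key-1", b"key-2", b"missing"];
    let results = store.batch_read(keys)?; // index lock taken once
    for (key, entry) in keys.iter().zip(results) {
        match entry {
            Some(handle) => {
                let bytes: &[u8] = handle.as_ref(); // assumed AsRef<[u8]>
                println!("{:?}: {} bytes", key, bytes.len());
            }
            None => println!("{:?}: not found", key),
        }
    }
    Ok(())
}
```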
Iteration Support
The storage engine provides multiple iteration mechanisms for different use cases:
| Iterator | Type | Synchronization | Use Case |
|---|---|---|---|
| `EntryIterator` | Sequential | Single lock acquisition | Forward/backward iteration |
| `par_iter_entries()` | Parallel (Rayon) | Lock-free after setup | High-throughput scanning |
| `IntoIterator` | Consuming | Single lock acquisition | Owned iteration |
The EntryIterator scans backward through the file using the prev_offset chain, tracking seen keys in a HashSet to return only the latest version of each key.
Sources: src/storage_engine/entry_iterator.rs:8-128 src/storage_engine/data_store.rs:276-280 src/storage_engine/data_store.rs:296-361
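The latest-version rule is easy to state in isolation. Below is a minimal sketch of the dedup logic with an illustrative entry type; the real iterator works directly over the memory map:

```rust
use std::collections::HashSet;

/// Illustrative stand-in for a decoded entry.
struct RawEntry {
    key_hash: u64,
    payload: Vec<u8>,
}

/// Walk entries newest-to-oldest and keep only the first (latest)
/// version of each key hash, mirroring EntryIterator's HashSet tracking.
fn latest_versions(entries_newest_first: Vec<RawEntry>) -> Vec<RawEntry> {
    let mut seen = HashSet::new();
    entries_newest_first
        .into_iter()
        .filter(|e| seen.insert(e.key_hash)) // insert() is false on repeats
        .collect()
}
```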
Key Design Principles
Append-Only Model
All write operations append data to the end of the file. Updates and deletes create new entries rather than modifying existing data. This design ensures:
- No seeking during writes: Sequential disk I/O for maximum throughput
- Crash safety: Incomplete writes are detected and truncated during recovery
- Zero-copy reads: Memory-mapped regions remain valid as data is never modified
Sources: README.md:98-103
Fixed Alignment
Every non-tombstone payload starts on a 64-byte boundary (configurable via PAYLOAD_ALIGNMENT). This alignment enables:
- Cache-line efficiency: Payloads align with CPU cache lines
- SIMD operations: Full-speed vectorized reads without crossing boundaries
- Zero-copy typed views: Safe reinterpretation as typed slices
Sources: README.md:51-59 src/storage_engine/data_store.rs:670-673
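To illustrate the zero-copy typed view point: a payload that starts on a 64-byte boundary can be reinterpreted as a slice of wider elements without copying. A hedged sketch using only the standard library (this is not the crate's API):

```rust
/// Reinterpret an aligned byte payload as a &[u64] without copying.
/// Returns None if the slice is misaligned or its length is not a
/// multiple of 8 bytes.
fn as_u64_slice(payload: &[u8]) -> Option<&[u64]> {
    // SAFETY: u64 has no invalid bit patterns, and align_to guarantees
    // the middle slice is correctly aligned for u64.
    let (head, body, tail) = unsafe { payload.align_to::<u64>() };
    (head.is_empty() && tail.is_empty()).then_some(body)
}
```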
Metadata-After-Payload
Each entry stores its metadata immediately after the payload, forming a backward-linked chain. This design:
- Minimizes seeking: Metadata and payload are adjacent on disk
- Enables recovery: The chain can be validated backward from any point
- Supports nesting: Storage containers can be stored within other containers
Sources: README.md:139-148
Lock-Free Reads
Read operations do not acquire write locks and perform zero-copy access through memory-mapped views. The AtomicU64 tail offset ensures readers see a consistent view without blocking writes.
Sources: README.md:170-183 src/storage_engine/data_store.rs:1040-1049