Storage Architecture

Relevant source files

Purpose and Scope

This document describes the core storage architecture of SIMD R Drive, covering the append-only design principles, single-file storage model, on-disk layout, validation chain structure, and recovery mechanisms. For details about the programmatic interface, see DataStore API. For specifics on entry structure and metadata format, see Entry Structure and Metadata. For memory mapping and zero-copy access patterns, see Memory Management and Zero-Copy Access.

Append-Only Design

SIMD R Drive implements a strict append-only storage model where data is written sequentially to the end of a single file. Entries are never modified or deleted in place; instead, updates and deletions are represented by appending new entries.

Key Characteristics

Characteristic	Implementation	Benefit
Sequential Writes	All writes append to file end	Minimizes disk seeks, maximizes throughput
Immutable Entries	Existing entries never modified	Enables concurrent reads without locks
Version Chain	Multiple versions stored for same key	Supports time-travel and recovery
Latest-Wins Semantics	Index points to most recent version	Ensures consistency in queries

The append-only model is enforced through the tail_offset field src/storage_engine/data_store.rs30 which tracks the current end of valid data. Write operations acquire an exclusive lock, append data, and atomically update the tail offset src/storage_engine/data_store.rs:816-824

Write Flow

Sources: src/storage_engine/data_store.rs:752-825 src/storage_engine/data_store.rs:827-843

Single-File Storage Container

All data resides in a single binary file managed by the DataStore struct src/storage_engine/data_store.rs:27-33 This design simplifies deployment, backup, and replication while enabling efficient memory-mapped access.

graph TB
    subgraph "DataStore Structure"
        FileHandle["file:\nArc&lt;RwLock&lt;BufWriter&lt;File&gt;&gt;&gt;"]
MmapHandle["mmap:\nArc&lt;Mutex&lt;Arc&lt;Mmap&gt;&gt;&gt;"]
TailOffset["tail_offset:\nAtomicU64"]
Indexer["key_indexer:\nArc&lt;RwLock&lt;KeyIndexer&gt;&gt;"]
PathBuf["path:\nPathBuf"]
end
    
    subgraph "Write Operations"
        WriteOp["Write/Delete Operations"]
end
    
    subgraph "Read Operations"
        ReadOp["Read/Exists Operations"]
end
    
    subgraph "Physical Storage"
        DiskFile["Storage File on Disk"]
end
    
 
   WriteOp --> FileHandle
 
   FileHandle --> DiskFile
    
 
   ReadOp --> MmapHandle
 
   ReadOp --> Indexer
 
   MmapHandle --> DiskFile
    
 
   WriteOp -.->|updates after flush| TailOffset
 
   WriteOp -.->|remaps after flush| MmapHandle
 
   WriteOp -.->|updates after flush| Indexer
    
 
   TailOffset -.->|tracks valid data end| DiskFile

File Handles and Memory Mapping

Sources: src/storage_engine/data_store.rs:27-33 src/storage_engine/data_store.rs:84-117

File Opening and Initialization

The DataStore::open function src/storage_engine/data_store.rs:84-117 performs the following sequence:

Open File : Opens the file in read/write mode, creating it if necessary src/storage_engine/data_store.rs:161-170
Initial Mapping : Creates an initial memory-mapped view src/storage_engine/data_store.rs:172-174
Chain Recovery : Validates the storage chain and determines the valid data extent src/storage_engine/data_store.rs:383-482
Truncation (if needed) : Truncates corrupted data if recovery finds inconsistencies src/storage_engine/data_store.rs:91-104
Index Building : Constructs the in-memory hash index by scanning valid entries src/storage_engine/data_store.rs108

Sources: src/storage_engine/data_store.rs:84-117 src/storage_engine/data_store.rs:141-144

Storage Layout

Entries are stored with 64-byte alignment to optimize cache efficiency and SIMD operations. The alignment constant PAYLOAD_ALIGNMENT is defined as 64 simd-r-drive-entry-handle/src/constants.rs

Non-Tombstone Entry Layout

The pre-padding calculation ensures the payload starts on a 64-byte boundary src/storage_engine/data_store.rs:669-673:

pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)

Where prev_tail is the absolute file offset of the previous entry's metadata end.

graph LR
    subgraph "Tombstone Entry"
        NullByte["NULL Byte\n(1 byte)\n0x00"]
KeyHash["Key Hash\n(8 bytes)\nXXH3_64"]
PrevOff["Prev Offset\n(8 bytes)\nPrevious Tail"]
Checksum["Checksum\n(4 bytes)\nCRC32C"]
end
    
 
   NullByte --> KeyHash
 
   KeyHash --> PrevOff
 
   PrevOff --> Checksum

Tombstone Entry Layout

Tombstones represent deleted keys and use a minimal layout without pre-padding:

The single NULL_BYTE src/storage_engine.rs136 serves as both the tombstone marker and payload. This design minimizes space overhead for deletions src/storage_engine/data_store.rs:864-897

Layout Summary Table

Component	Non-Tombstone Size	Tombstone Size	Purpose
Pre-Pad	0-63 bytes	0 bytes	Align payload to 64-byte boundary
Payload	Variable	1 byte (0x00)	User data or deletion marker
Key Hash	8 bytes	8 bytes	XXH3_64 hash for index lookup
Prev Offset	8 bytes	8 bytes	Points to previous tail (chain link)
Checksum	4 bytes	4 bytes	CRC32C of payload for integrity

Sources: README.md:112-137 src/storage_engine/data_store.rs:669-673 src/storage_engine/data_store.rs:864-897

Validation Chain

Each entry's metadata contains a prev_offset field that points to the absolute file offset of the previous entry's metadata end (the "previous tail"). This creates a backward-linked chain from the end of the file to offset 0.

graph RL
    EOF["End of File\n(tail_offset)"]
Meta3["Entry 3 Metadata\nprev_offset = T2"]
Entry3["Entry 3 Payload\n(aligned)"]
Meta2["Entry 2 Metadata\nprev_offset = T1"]
Entry2["Entry 2 Payload\n(aligned)"]
Meta1["Entry 1 Metadata\nprev_offset = 0"]
Entry1["Entry 1 Payload\n(aligned)"]
Zero["Offset 0\n(File Start)"]
EOF --> Meta3
 
   Meta3 -.->|prev_offset| Meta2
 
   Meta3 --> Entry3
    
 
   Meta2 -.->|prev_offset| Meta1
 
   Meta2 --> Entry2
    
 
   Meta1 -.->|prev_offset| Zero
 
   Meta1 --> Entry1
    
 
   Entry1 --> Zero

Chain Structure

Sources: README.md:139-148 src/storage_engine/data_store.rs:383-482

Chain Validation Properties

The validation chain ensures data integrity through several properties:

Completeness : A valid chain must trace back to offset 0 without gaps src/storage_engine/data_store.rs473
Forward Progress : Each prev_offset must be less than the current metadata offset src/storage_engine/data_store.rs:465-468
Bounds Checking : All offsets must be within valid file boundaries src/storage_engine/data_store.rs:435-438
Size Consistency : Total chain size must not exceed file length src/storage_engine/data_store.rs473

These properties are enforced during the recover_valid_chain function src/storage_engine/data_store.rs:383-482 which validates the chain structure on every file open.

Sources: src/storage_engine/data_store.rs:383-482

Recovery Mechanisms

SIMD R Drive implements automatic recovery to handle incomplete writes from crashes, power loss, or other interruptions. The recovery process occurs transparently during DataStore::open.

graph TB
    Start["Open Storage File"]
CheckSize{{"File Size ≥\nMETADATA_SIZE?"}}
EmptyFile["Return offset 0\n(Empty File)"]
InitCursor["cursor = file_length"]
ReadMeta["Read metadata at\ncursor - METADATA_SIZE"]
ExtractPrev["Extract prev_offset"]
CalcBounds["Calculate entry bounds\nusing prepad_len"]
ValidateBounds{{"Entry bounds valid?"}}
DecrementCursor["cursor -= 1"]
WalkBackward["Walk chain backward\nusing prev_offset"]
CheckChain{{"Chain reaches\noffset 0?"}}
CheckSize2{{"Total size ≤\nfile_length?"}}
FoundValid["Store valid offset\nbest_valid_offset = cursor"]
MoreToCheck{{"cursor ≥\nMETADATA_SIZE?"}}
ReturnBest["Return best_valid_offset"]
Truncate["Truncate file to\nbest_valid_offset"]
Reopen["Recursively reopen\nDataStore::open"]
BuildIndex["Build KeyIndexer\nfrom valid chain"]
Complete["Recovery Complete"]
Start --> CheckSize
 
   CheckSize -->|No| EmptyFile
 
   CheckSize -->|Yes| InitCursor
 
   EmptyFile --> Complete
    
 
   InitCursor --> ReadMeta
 
   ReadMeta --> ExtractPrev
 
   ExtractPrev --> CalcBounds
 
   CalcBounds --> ValidateBounds
    
 
   ValidateBounds -->|No| DecrementCursor
 
   ValidateBounds -->|Yes| WalkBackward
    
 
   WalkBackward --> CheckChain
 
   CheckChain -->|No| DecrementCursor
 
   CheckChain -->|Yes| CheckSize2
    
 
   CheckSize2 -->|No| DecrementCursor
 
   CheckSize2 -->|Yes| FoundValid
    
 
   FoundValid --> ReturnBest
    
 
   DecrementCursor --> MoreToCheck
 
   MoreToCheck -->|Yes| ReadMeta
 
   MoreToCheck -->|No| ReturnBest
    
 
   ReturnBest --> Truncate
 
   Truncate --> Reopen
 
   Reopen --> BuildIndex
 
   BuildIndex --> Complete

Recovery Algorithm

Sources: src/storage_engine/data_store.rs:383-482 src/storage_engine/data_store.rs:91-104

Recovery Steps Detailed

Step 1: Backward Scan

Starting from the end of file, the recovery algorithm scans backward looking for valid metadata entries. For each potential metadata location src/storage_engine/data_store.rs:390-394:

Reads 20 bytes of metadata
Deserializes into EntryMetadata structure
Extracts prev_offset (previous tail)
Calculates entry boundaries using alignment rules

Step 2: Chain Validation

For each candidate entry, the algorithm walks the entire chain backward src/storage_engine/data_store.rs:424-471:

back_cursor = prev_tail
while back_cursor != 0:
    - Read previous entry metadata
    - Validate bounds and size
    - Check forward progress (prev_prev_tail < prev_metadata_offset)
    - Accumulate total chain size
    - Update back_cursor to previous prev_offset

A chain is valid only if it reaches offset 0 and the total size doesn't exceed file length src/storage_engine/data_store.rs473

Step 3: Truncation and Retry

If corruption is detected (file length > valid chain extent), the function:

Logs a warning src/storage_engine/data_store.rs:92-97
Drops existing file handles and memory map src/storage_engine/data_store.rs:98-99
Reopens file with write access src/storage_engine/data_store.rs100
Truncates to valid offset src/storage_engine/data_store.rs101
Syncs to disk src/storage_engine/data_store.rs102
Recursively calls DataStore::open src/storage_engine/data_store.rs103

Step 4: Index Rebuilding

After establishing the valid data extent, KeyIndexer::build scans forward through the file src/storage_engine/data_store.rs108 populating the hash index with key → offset mappings. This forward scan uses the iterator pattern src/storage_engine/entry_iterator.rs:56-127 to process each entry sequentially.

Sources: src/storage_engine/data_store.rs:383-482 src/storage_engine/data_store.rs:91-104 src/storage_engine/entry_iterator.rs:21-54

graph TB
    subgraph "Storage Architecture (2.1)"
        AppendOnly["Append-Only Design"]
SingleFile["Single-File Container"]
Layout["Storage Layout\n(64-byte alignment)"]
Chain["Validation Chain"]
Recovery["Recovery Mechanism"]
end
    
    subgraph "DataStore API (2.2)"
        Write["write/batch_write"]
Read["read/batch_read"]
Delete["delete/batch_delete"]
end
    
    subgraph "Entry Structure (2.3)"
        EntryMetadata["EntryMetadata\n(key_hash, prev_offset, checksum)"]
PrePad["Pre-Padding Logic"]
Tombstone["Tombstone Format"]
end
    
    subgraph "Memory Management (2.4)"
        Mmap["Memory-Mapped File"]
EntryHandle["EntryHandle\n(Zero-Copy View)"]
Alignment["PAYLOAD_ALIGNMENT"]
end
    
    subgraph "Key Indexer (2.6)"
        KeyIndexer["KeyIndexer"]
HashLookup["Hash → Offset Mapping"]
end
    
 
   AppendOnly -.->|enforces| Write
 
   SingleFile -.->|provides| Mmap
 
   Layout -.->|defines| PrePad
 
   Layout -.->|uses| Alignment
 
   Chain -.->|uses| EntryMetadata
 
   Recovery -.->|validates| Chain
 
   Recovery -.->|rebuilds| KeyIndexer
    
 
   Write -.->|appends| Layout
 
   Read -.->|uses| HashLookup
 
   Read -.->|returns| EntryHandle
 
   Delete -.->|writes| Tombstone

Architecture Integration

The storage architecture integrates with other subsystems through well-defined boundaries:

Component Interactions

Sources: src/storage_engine/data_store.rs:1-1183 src/storage_engine.rs:1-25

Key Design Decisions

Decision	Rationale	Trade-off
64-byte alignment	Matches CPU cache lines, enables SIMD	Wastes 0-63 bytes per entry
Backward-linked chain	Efficient recovery from EOF	Cannot validate forward
Single NULL byte tombstones	Minimal deletion overhead	Special case in code
Metadata-last layout	Sequential write pattern	Recovery scans backward
Atomic tail_offset	Lock-free progress tracking	Additional synchronization primitive

Sources: README.md:51-59 README.md:104-148 src/storage_engine/data_store.rs:669-673

Performance Characteristics

The storage architecture exhibits the following performance properties:

Write Performance

Sequential I/O : All writes append to file end, maximizing disk throughput
Batching : Multiple entries written in single lock acquisition src/storage_engine/data_store.rs:847-939
SIMD Acceleration : Payload copying uses platform-specific SIMD instructions src/storage_engine/data_store.rs925
Buffer Flushing : BufWriter reduces system call overhead src/storage_engine/data_store.rs28

Read Performance

Zero-Copy : Reads directly reference memory-mapped regions src/storage_engine/data_store.rs:1040-1049
Index Lookup : O(1) hash table lookup to locate entries src/storage_engine/data_store.rs:1042-1046
Concurrent Reads : Multiple readers access mmap simultaneously without locks
Cache Efficiency : 64-byte alignment ensures single cache line access for aligned data

Recovery Performance

Incremental Validation : Stops at first valid chain found src/storage_engine/data_store.rs:474-476
Backward Scan : Starts from likely-valid tail, avoiding full file scan
Indexed Rebuild : Forward scan populates index in single pass src/storage_engine/entry_iterator.rs:56-127

Sources: src/storage_engine/data_store.rs:752-939 src/storage_engine/data_store.rs:1027-1183 src/storage_engine/data_store.rs:383-482

rust-simd-r-drive Documentation