Storage Architecture
Relevant source files
- README.md
- src/lib.rs
- src/storage_engine.rs
- src/storage_engine/data_store.rs
- src/storage_engine/entry_iterator.rs
Purpose and Scope
This document describes the core storage architecture of SIMD R Drive, covering the append-only design principles, single-file storage model, on-disk layout, validation chain structure, and recovery mechanisms. For details about the programmatic interface, see DataStore API. For specifics on entry structure and metadata format, see Entry Structure and Metadata. For memory mapping and zero-copy access patterns, see Memory Management and Zero-Copy Access.
Append-Only Design
SIMD R Drive implements a strict append-only storage model where data is written sequentially to the end of a single file. Entries are never modified or deleted in place; instead, updates and deletions are represented by appending new entries.
Key Characteristics
| Characteristic | Implementation | Benefit |
|---|---|---|
| Sequential Writes | All writes append to file end | Minimizes disk seeks, maximizes throughput |
| Immutable Entries | Existing entries never modified | Enables concurrent reads without locks |
| Version Chain | Multiple versions stored for same key | Supports time-travel and recovery |
| Latest-Wins Semantics | Index points to most recent version | Ensures consistency in queries |
The append-only model is enforced through the tail_offset field src/storage_engine/data_store.rs:30, which tracks the current end of valid data. Write operations acquire an exclusive lock, append data, and atomically update the tail offset src/storage_engine/data_store.rs:816-824.
Write Flow
Sources: src/storage_engine/data_store.rs:752-825 src/storage_engine/data_store.rs:827-843
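The flow can be sketched compactly. Below is a minimal illustration of the append-then-publish pattern, assuming the field types shown in the DataStore diagram in the next section; the actual write path src/storage_engine/data_store.rs:752-825 additionally remaps the mmap and updates the key index after flushing.

```rust
// Sketch only: exclusive-lock append followed by an atomic tail publish.
use std::fs::File;
use std::io::{BufWriter, Seek, SeekFrom, Write};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

fn append_entry(
    file: &Arc<RwLock<BufWriter<File>>>,
    tail_offset: &AtomicU64,
    bytes: &[u8], // pre-padded payload + metadata, already serialized
) -> std::io::Result<u64> {
    // Exclusive lock: only one writer may append at a time.
    let mut writer = file.write().expect("poisoned write lock");

    // Append at the current tail, then flush so readers can remap.
    // A plain load/store pair is safe here because the write lock is held.
    let prev_tail = tail_offset.load(Ordering::Acquire);
    writer.seek(SeekFrom::Start(prev_tail))?;
    writer.write_all(bytes)?;
    writer.flush()?;

    // Atomically publish the new end of valid data.
    let new_tail = prev_tail + bytes.len() as u64;
    tail_offset.store(new_tail, Ordering::Release);
    Ok(new_tail)
}
```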
Single-File Storage Container
All data resides in a single binary file managed by the DataStore struct src/storage_engine/data_store.rs:27-33. This design simplifies deployment, backup, and replication while enabling efficient memory-mapped access.
graph TB
subgraph "DataStore Structure"
FileHandle["file:\nArc<RwLock<BufWriter<File>>>"]
MmapHandle["mmap:\nArc<Mutex<Arc<Mmap>>>"]
TailOffset["tail_offset:\nAtomicU64"]
Indexer["key_indexer:\nArc<RwLock<KeyIndexer>>"]
PathBuf["path:\nPathBuf"]
end
subgraph "Write Operations"
WriteOp["Write/Delete Operations"]
end
subgraph "Read Operations"
ReadOp["Read/Exists Operations"]
end
subgraph "Physical Storage"
DiskFile["Storage File on Disk"]
end
WriteOp --> FileHandle
FileHandle --> DiskFile
ReadOp --> MmapHandle
ReadOp --> Indexer
MmapHandle --> DiskFile
WriteOp -.->|updates after flush| TailOffset
WriteOp -.->|remaps after flush| MmapHandle
WriteOp -.->|updates after flush| Indexer
TailOffset -.->|tracks valid data end| DiskFile
File Handles and Memory Mapping
Sources: src/storage_engine/data_store.rs:27-33 src/storage_engine/data_store.rs:84-117
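Read literally from the diagram, the struct looks roughly as follows. This is a sketch, not the authoritative definition (see src/storage_engine/data_store.rs:27-33): KeyIndexer is stubbed, and memmap2 is assumed as the mmap crate.

```rust
use std::fs::File;
use std::io::BufWriter;
use std::path::PathBuf;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, RwLock};

use memmap2::Mmap;

pub struct KeyIndexer; // stub standing in for the real hash index type

pub struct DataStore {
    file: Arc<RwLock<BufWriter<File>>>,   // serialized write handle
    mmap: Arc<Mutex<Arc<Mmap>>>,          // swappable read-only mapping
    tail_offset: AtomicU64,               // end of valid data
    key_indexer: Arc<RwLock<KeyIndexer>>, // key hash -> offset index
    path: PathBuf,                        // backing file location
}
```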
File Opening and Initialization
The DataStore::open function src/storage_engine/data_store.rs:84-117 performs the following sequence:
- Open File: Opens the file in read/write mode, creating it if necessary src/storage_engine/data_store.rs:161-170
- Initial Mapping: Creates an initial memory-mapped view src/storage_engine/data_store.rs:172-174
- Chain Recovery: Validates the storage chain and determines the valid data extent src/storage_engine/data_store.rs:383-482
- Truncation (if needed): Truncates corrupted data if recovery finds inconsistencies src/storage_engine/data_store.rs:91-104
- Index Building: Constructs the in-memory hash index by scanning valid entries src/storage_engine/data_store.rs:108
Sources: src/storage_engine/data_store.rs:84-117 src/storage_engine/data_store.rs:141-144
Storage Layout
Entries are stored with 64-byte alignment to optimize cache efficiency and SIMD operations. The alignment constant PAYLOAD_ALIGNMENT is defined as 64 in simd-r-drive-entry-handle/src/constants.rs.
Non-Tombstone Entry Layout
The pre-padding calculation ensures the payload starts on a 64-byte boundary src/storage_engine/data_store.rs:669-673:
pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
Where prev_tail is the absolute file offset of the previous entry's metadata end.
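A few worked values make the formula concrete. This is a self-contained sketch; the real helper lives at src/storage_engine/data_store.rs:669-673.

```rust
const PAYLOAD_ALIGNMENT: u64 = 64;

/// Bytes of pre-padding needed so the next payload starts on a
/// 64-byte boundary, per the formula above.
fn prepad_len(prev_tail: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
}

fn main() {
    assert_eq!(prepad_len(0), 0);    // already aligned: no padding
    assert_eq!(prepad_len(64), 0);   // exact multiple: no padding
    assert_eq!(prepad_len(100), 28); // next boundary is 128, so pad 28 bytes
    assert_eq!(prepad_len(127), 1);  // one byte short of 128
}
```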
graph LR
subgraph "Tombstone Entry"
NullByte["NULL Byte\n(1 byte)\n0x00"]
KeyHash["Key Hash\n(8 bytes)\nXXH3_64"]
PrevOff["Prev Offset\n(8 bytes)\nPrevious Tail"]
Checksum["Checksum\n(4 bytes)\nCRC32C"]
end
NullByte --> KeyHash
KeyHash --> PrevOff
PrevOff --> Checksum
Tombstone Entry Layout
Tombstones represent deleted keys and use a minimal layout without pre-padding:
The single NULL_BYTE src/storage_engine.rs:136 serves as both the tombstone marker and the payload. This design minimizes space overhead for deletions src/storage_engine/data_store.rs:864-897.
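Putting the four fields together, a tombstone occupies 21 bytes. The sketch below assembles one under stated assumptions: the xxhash-rust and crc32c crates stand in for the hashing code, and little-endian field encoding is assumed rather than confirmed by the source.

```rust
use crc32c::crc32c;             // CRC32C (Castagnoli) checksum
use xxhash_rust::xxh3::xxh3_64; // XXH3_64 key hashing

const NULL_BYTE: [u8; 1] = [0x00]; // tombstone marker doubles as the payload

/// Sketch of a 21-byte tombstone record: payload, key hash,
/// previous tail offset, payload checksum.
fn serialize_tombstone(key: &[u8], prev_tail: u64) -> Vec<u8> {
    let mut out = Vec::with_capacity(21);
    out.extend_from_slice(&NULL_BYTE);                         // 1-byte payload
    out.extend_from_slice(&xxh3_64(key).to_le_bytes());       // 8-byte key hash
    out.extend_from_slice(&prev_tail.to_le_bytes());          // 8-byte prev offset
    out.extend_from_slice(&crc32c(&NULL_BYTE).to_le_bytes()); // 4-byte checksum
    out
}
```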
Layout Summary Table
| Component | Non-Tombstone Size | Tombstone Size | Purpose |
|---|---|---|---|
| Pre-Pad | 0-63 bytes | 0 bytes | Align payload to 64-byte boundary |
| Payload | Variable | 1 byte (0x00) | User data or deletion marker |
| Key Hash | 8 bytes | 8 bytes | XXH3_64 hash for index lookup |
| Prev Offset | 8 bytes | 8 bytes | Points to previous tail (chain link) |
| Checksum | 4 bytes | 4 bytes | CRC32C of payload for integrity |
Sources: README.md:112-137 src/storage_engine/data_store.rs:669-673 src/storage_engine/data_store.rs:864-897
Validation Chain
Each entry's metadata contains a prev_offset field that points to the absolute file offset of the previous entry's metadata end (the "previous tail"). This creates a backward-linked chain from the end of the file to offset 0.
graph RL
EOF["End of File\n(tail_offset)"]
Meta3["Entry 3 Metadata\nprev_offset = T2"]
Entry3["Entry 3 Payload\n(aligned)"]
Meta2["Entry 2 Metadata\nprev_offset = T1"]
Entry2["Entry 2 Payload\n(aligned)"]
Meta1["Entry 1 Metadata\nprev_offset = 0"]
Entry1["Entry 1 Payload\n(aligned)"]
Zero["Offset 0\n(File Start)"]
EOF --> Meta3
Meta3 -.->|prev_offset| Meta2
Meta3 --> Entry3
Meta2 -.->|prev_offset| Meta1
Meta2 --> Entry2
Meta1 -.->|prev_offset| Zero
Meta1 --> Entry1
Entry1 --> Zero
Chain Structure
Sources: README.md:139-148 src/storage_engine/data_store.rs:383-482
Chain Validation Properties
The validation chain ensures data integrity through several properties:
- Completeness: A valid chain must trace back to offset 0 without gaps src/storage_engine/data_store.rs:473
- Forward Progress: Each prev_offset must be less than the current metadata offset src/storage_engine/data_store.rs:465-468
- Bounds Checking: All offsets must be within valid file boundaries src/storage_engine/data_store.rs:435-438
- Size Consistency: Total chain size must not exceed file length src/storage_engine/data_store.rs:473
These properties are enforced by the recover_valid_chain function src/storage_engine/data_store.rs:383-482, which validates the chain structure on every file open.
Sources: src/storage_engine/data_store.rs:383-482
Recovery Mechanisms
SIMD R Drive implements automatic recovery to handle incomplete writes from crashes, power loss, or other interruptions. The recovery process occurs transparently during DataStore::open.
graph TB
Start["Open Storage File"]
CheckSize{{"File Size ≥\nMETADATA_SIZE?"}}
EmptyFile["Return offset 0\n(Empty File)"]
InitCursor["cursor = file_length"]
ReadMeta["Read metadata at\ncursor - METADATA_SIZE"]
ExtractPrev["Extract prev_offset"]
CalcBounds["Calculate entry bounds\nusing prepad_len"]
ValidateBounds{{"Entry bounds valid?"}}
DecrementCursor["cursor -= 1"]
WalkBackward["Walk chain backward\nusing prev_offset"]
CheckChain{{"Chain reaches\noffset 0?"}}
CheckSize2{{"Total size ≤\nfile_length?"}}
FoundValid["Store valid offset\nbest_valid_offset = cursor"]
MoreToCheck{{"cursor ≥\nMETADATA_SIZE?"}}
ReturnBest["Return best_valid_offset"]
Truncate["Truncate file to\nbest_valid_offset"]
Reopen["Recursively reopen\nDataStore::open"]
BuildIndex["Build KeyIndexer\nfrom valid chain"]
Complete["Recovery Complete"]
Start --> CheckSize
CheckSize -->|No| EmptyFile
CheckSize -->|Yes| InitCursor
EmptyFile --> Complete
InitCursor --> ReadMeta
ReadMeta --> ExtractPrev
ExtractPrev --> CalcBounds
CalcBounds --> ValidateBounds
ValidateBounds -->|No| DecrementCursor
ValidateBounds -->|Yes| WalkBackward
WalkBackward --> CheckChain
CheckChain -->|No| DecrementCursor
CheckChain -->|Yes| CheckSize2
CheckSize2 -->|No| DecrementCursor
CheckSize2 -->|Yes| FoundValid
FoundValid --> ReturnBest
DecrementCursor --> MoreToCheck
MoreToCheck -->|Yes| ReadMeta
MoreToCheck -->|No| ReturnBest
ReturnBest --> Truncate
Truncate --> Reopen
Reopen --> BuildIndex
BuildIndex --> Complete
Recovery Algorithm
Sources: src/storage_engine/data_store.rs:383-482 src/storage_engine/data_store.rs:91-104
Recovery Steps Detailed
Step 1: Backward Scan
Starting from the end of file, the recovery algorithm scans backward looking for valid metadata entries. For each potential metadata location src/storage_engine/data_store.rs:390-394:
- Reads 20 bytes of metadata
- Deserializes it into an EntryMetadata structure (a parsing sketch follows this list)
- Extracts prev_offset (the previous tail)
- Calculates entry boundaries using alignment rules
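A minimal parsing sketch, assuming the field order from the layout table above and little-endian encoding (an assumption, not something the source confirms):

```rust
const METADATA_SIZE: usize = 20; // 8-byte key hash + 8-byte prev offset + 4-byte checksum

#[derive(Debug)]
struct EntryMetadata {
    key_hash: u64,
    prev_offset: u64,
    checksum: u32,
}

/// Parse one metadata block from a 20-byte slice; None signals that
/// the candidate location cannot hold valid metadata.
fn read_metadata(buf: &[u8]) -> Option<EntryMetadata> {
    if buf.len() < METADATA_SIZE {
        return None;
    }
    Some(EntryMetadata {
        key_hash: u64::from_le_bytes(buf[0..8].try_into().ok()?),
        prev_offset: u64::from_le_bytes(buf[8..16].try_into().ok()?),
        checksum: u32::from_le_bytes(buf[16..20].try_into().ok()?),
    })
}
```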
Step 2: Chain Validation
For each candidate entry, the algorithm walks the entire chain backward src/storage_engine/data_store.rs:424-471:
back_cursor = prev_tail
while back_cursor != 0:
    read the previous entry's metadata
    validate bounds and size
    check forward progress (prev_prev_tail < prev_metadata_offset)
    accumulate the total chain size
    set back_cursor to that entry's prev_offset
A chain is valid only if it reaches offset 0 and the total size does not exceed the file length src/storage_engine/data_store.rs:473.
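Rendered as Rust, the walk looks roughly like this. It is a sketch: read_metadata_at is a hypothetical helper that returns the prev_offset and on-disk size of the entry ending at a given tail.

```rust
/// Walk the chain backward from a candidate tail, rejecting it on any
/// bounds, progress, or size violation.
fn chain_is_valid(
    candidate_tail: u64,
    file_len: u64,
    read_metadata_at: impl Fn(u64) -> Option<(u64, u64)>, // (prev_offset, entry_size)
) -> bool {
    let mut back_cursor = candidate_tail;
    let mut total_size: u64 = 0;
    while back_cursor != 0 {
        let Some((prev_offset, entry_size)) = read_metadata_at(back_cursor) else {
            return false; // unreadable metadata: bounds violation
        };
        // Forward progress: the previous tail must lie strictly before us.
        if prev_offset >= back_cursor {
            return false;
        }
        // Size consistency: the chain may never outgrow the file.
        total_size = total_size.saturating_add(entry_size);
        if total_size > file_len {
            return false;
        }
        back_cursor = prev_offset;
    }
    true // completeness: the chain reached offset 0 without gaps
}
```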
Step 3: Truncation and Retry
If corruption is detected (file length > valid chain extent), the function:
- Logs a warning src/storage_engine/data_store.rs:92-97
- Drops existing file handles and the memory map src/storage_engine/data_store.rs:98-99
- Reopens the file with write access src/storage_engine/data_store.rs:100
- Truncates to the valid offset src/storage_engine/data_store.rs:101 (sketched below)
- Syncs to disk src/storage_engine/data_store.rs:102
- Recursively calls DataStore::open src/storage_engine/data_store.rs:103
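A minimal sketch of the truncate-and-sync portion, assuming best_valid_offset came back from the chain scan:

```rust
use std::fs::OpenOptions;
use std::path::Path;

/// Sketch of the truncate-and-sync step. Existing handles and the mmap
/// must already be dropped; truncating a still-mapped file is unsafe on
/// some platforms.
fn truncate_to_valid(path: &Path, best_valid_offset: u64) -> std::io::Result<()> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    file.set_len(best_valid_offset)?; // discard the corrupted suffix
    file.sync_all()?;                 // make the truncation durable
    Ok(())
    // The caller then recursively invokes DataStore::open on the clean file.
}
```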
Step 4: Index Rebuilding
After establishing the valid data extent, KeyIndexer::build scans forward through the file src/storage_engine/data_store.rs:108, populating the hash index with key → offset mappings. This forward scan uses the iterator pattern src/storage_engine/entry_iterator.rs:56-127 to process each entry sequentially.
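The latest-wins rebuild can be sketched as a single forward fold. The (key_hash, offset, is_tombstone) iterator shape here is assumed for illustration rather than taken from the crate:

```rust
use std::collections::HashMap;

/// Sketch of the forward index-building pass with latest-wins semantics.
fn build_index(entries: impl Iterator<Item = (u64, u64, bool)>) -> HashMap<u64, u64> {
    let mut index = HashMap::new();
    for (key_hash, offset, is_tombstone) in entries {
        if is_tombstone {
            index.remove(&key_hash); // a tombstone shadows earlier versions
        } else {
            index.insert(key_hash, offset); // later entries win
        }
    }
    index
}
```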
Sources: src/storage_engine/data_store.rs:383-482 src/storage_engine/data_store.rs:91-104 src/storage_engine/entry_iterator.rs:21-54
graph TB
subgraph "Storage Architecture (2.1)"
AppendOnly["Append-Only Design"]
SingleFile["Single-File Container"]
Layout["Storage Layout\n(64-byte alignment)"]
Chain["Validation Chain"]
Recovery["Recovery Mechanism"]
end
subgraph "DataStore API (2.2)"
Write["write/batch_write"]
Read["read/batch_read"]
Delete["delete/batch_delete"]
end
subgraph "Entry Structure (2.3)"
EntryMetadata["EntryMetadata\n(key_hash, prev_offset, checksum)"]
PrePad["Pre-Padding Logic"]
Tombstone["Tombstone Format"]
end
subgraph "Memory Management (2.4)"
Mmap["Memory-Mapped File"]
EntryHandle["EntryHandle\n(Zero-Copy View)"]
Alignment["PAYLOAD_ALIGNMENT"]
end
subgraph "Key Indexer (2.6)"
KeyIndexer["KeyIndexer"]
HashLookup["Hash → Offset Mapping"]
end
AppendOnly -.->|enforces| Write
SingleFile -.->|provides| Mmap
Layout -.->|defines| PrePad
Layout -.->|uses| Alignment
Chain -.->|uses| EntryMetadata
Recovery -.->|validates| Chain
Recovery -.->|rebuilds| KeyIndexer
Write -.->|appends| Layout
Read -.->|uses| HashLookup
Read -.->|returns| EntryHandle
Delete -.->|writes| Tombstone
Architecture Integration
The storage architecture integrates with other subsystems through well-defined boundaries:
Component Interactions
Sources: src/storage_engine/data_store.rs:1-1183 src/storage_engine.rs:1-25
Key Design Decisions
| Decision | Rationale | Trade-off |
|---|---|---|
| 64-byte alignment | Matches CPU cache lines, enables SIMD | Wastes 0-63 bytes per entry |
| Backward-linked chain | Efficient recovery from EOF | Cannot validate forward |
| Single NULL byte tombstones | Minimal deletion overhead | Special case in code |
| Metadata-last layout | Sequential write pattern | Recovery scans backward |
| Atomic tail_offset | Lock-free progress tracking | Additional synchronization primitive |
Sources: README.md:51-59 README.md:104-148 src/storage_engine/data_store.rs:669-673
Performance Characteristics
The storage architecture exhibits the following performance properties:
Write Performance
- Sequential I/O: All writes append to the file end, maximizing disk throughput
- Batching: Multiple entries are written within a single lock acquisition src/storage_engine/data_store.rs:847-939
- SIMD Acceleration: Payload copying uses platform-specific SIMD instructions src/storage_engine/data_store.rs:925
- Buffer Flushing: BufWriter reduces system-call overhead src/storage_engine/data_store.rs:28
Read Performance
- Zero-Copy: Reads directly reference memory-mapped regions src/storage_engine/data_store.rs:1040-1049 (see the sketch after this list)
- Index Lookup: O(1) hash table lookup to locate entries src/storage_engine/data_store.rs:1042-1046
- Concurrent Reads: Multiple readers access the mmap simultaneously without locks
- Cache Efficiency: 64-byte alignment ensures single-cache-line access for aligned data
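The zero-copy shape can be sketched as a handle that pairs the current mapping with a payload range. This EntryHandle is illustrative, not the crate's actual definition, and memmap2 is assumed as the mmap crate.

```rust
use std::ops::Range;
use std::sync::Arc;

use memmap2::Mmap;

/// Illustrative zero-copy view: holds the mapping alive and borrows
/// the payload bytes directly from it, with no intermediate copy.
pub struct EntryHandle {
    mmap: Arc<Mmap>,     // keeps the mapping alive while the handle exists
    range: Range<usize>, // payload bounds within the mapped file
}

impl EntryHandle {
    pub fn new(mmap: Arc<Mmap>, range: Range<usize>) -> Self {
        Self { mmap, range }
    }

    /// Borrow the payload straight out of the memory-mapped file.
    pub fn as_slice(&self) -> &[u8] {
        &self.mmap[self.range.clone()]
    }
}
```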
Recovery Performance
- Incremental Validation: Stops at the first valid chain found src/storage_engine/data_store.rs:474-476
- Backward Scan: Starts from the likely-valid tail, avoiding a full file scan
- Indexed Rebuild: A forward scan populates the index in a single pass src/storage_engine/entry_iterator.rs:56-127
Sources: src/storage_engine/data_store.rs:752-939 src/storage_engine/data_store.rs:1027-1183 src/storage_engine/data_store.rs:383-482
Limitations and Considerations
Append-Only Growth
The file grows continuously with updates and deletions. Compaction is available but requires exclusive access src/storage_engine/data_store.rs:706-749.
Alignment Overhead
Each non-tombstone entry may waste up to 63 bytes for alignment padding. For small payloads, this overhead can be significant relative to payload size.
Recovery Cost
Recovery time increases with file size, as chain validation requires traversing all entries. For very large files (exceeding RAM), this may cause extended startup times.
Single-Process Design
While thread-safe within a process, multiple processes cannot safely share the same file without external locking mechanisms README.md:185-206.
Sources: README.md:51-59 README.md:185-206 src/storage_engine/data_store.rs:669-673 src/storage_engine/data_store.rs:706-749