Storage Architecture

Purpose and Scope

This document describes the core storage architecture of SIMD R Drive, covering the append-only design principles, single-file storage model, on-disk layout, validation chain structure, and recovery mechanisms. For details about the programmatic interface, see DataStore API. For specifics on entry structure and metadata format, see Entry Structure and Metadata. For memory mapping and zero-copy access patterns, see Memory Management and Zero-Copy Access.


Append-Only Design

SIMD R Drive implements a strict append-only storage model where data is written sequentially to the end of a single file. Entries are never modified or deleted in place; instead, updates and deletions are represented by appending new entries.

Key Characteristics

| Characteristic | Implementation | Benefit |
|---|---|---|
| Sequential writes | All writes append to the file end | Minimizes disk seeks, maximizes throughput |
| Immutable entries | Existing entries are never modified | Enables concurrent reads without locks |
| Version chain | Multiple versions stored for the same key | Supports time-travel and recovery |
| Latest-wins semantics | Index points to the most recent version | Ensures consistency in queries |

The append-only model is enforced through the tail_offset field src/storage_engine/data_store.rs:30, which tracks the current end of valid data. Write operations acquire an exclusive lock, append data, and atomically update the tail offset src/storage_engine/data_store.rs:816-824.
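As a rough illustration of this pattern (not the actual DataStore implementation), the sketch below serializes writers behind a lock and publishes the new end-of-data position through an AtomicU64; the AppendLog type and its methods are invented for this example.

```rust
use std::fs::OpenOptions;
use std::io::{BufWriter, Write};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Illustrative append-only log; not the real DataStore type.
struct AppendLog {
    file: Mutex<BufWriter<std::fs::File>>, // exclusive lock serializes writers
    tail_offset: AtomicU64,                // current end of valid data
}

impl AppendLog {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = OpenOptions::new()
            .read(true)
            .append(true)
            .create(true)
            .open(path)?;
        let tail = file.metadata()?.len();
        Ok(Self {
            file: Mutex::new(BufWriter::new(file)),
            tail_offset: AtomicU64::new(tail),
        })
    }

    /// Appends `bytes`, flushes, then atomically advances the tail offset.
    fn append(&self, bytes: &[u8]) -> std::io::Result<u64> {
        let mut writer = self.file.lock().unwrap();
        writer.write_all(bytes)?;
        writer.flush()?;
        let new_tail =
            self.tail_offset.fetch_add(bytes.len() as u64, Ordering::Release) + bytes.len() as u64;
        Ok(new_tail)
    }
}

fn main() -> std::io::Result<()> {
    let log = AppendLog::open("example.bin")?;
    let tail = log.append(b"hello")?;
    println!("tail offset is now {tail}");
    Ok(())
}
```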

Write Flow

Sources: src/storage_engine/data_store.rs:752-825 src/storage_engine/data_store.rs:827-843


Single-File Storage Container

All data resides in a single binary file managed by the DataStore struct src/storage_engine/data_store.rs:27-33. This design simplifies deployment, backup, and replication while enabling efficient memory-mapped access.

File Handles and Memory Mapping

graph TB
    subgraph "DataStore Structure"
        FileHandle["file:\nArc<RwLock<BufWriter<File>>>"]
        MmapHandle["mmap:\nArc<Mutex<Arc<Mmap>>>"]
        TailOffset["tail_offset:\nAtomicU64"]
        Indexer["key_indexer:\nArc<RwLock<KeyIndexer>>"]
        PathBuf["path:\nPathBuf"]
    end

    subgraph "Write Operations"
        WriteOp["Write/Delete Operations"]
    end

    subgraph "Read Operations"
        ReadOp["Read/Exists Operations"]
    end

    subgraph "Physical Storage"
        DiskFile["Storage File on Disk"]
    end

    WriteOp --> FileHandle
    FileHandle --> DiskFile

    ReadOp --> MmapHandle
    ReadOp --> Indexer
    MmapHandle --> DiskFile

    WriteOp -.->|updates after flush| TailOffset
    WriteOp -.->|remaps after flush| MmapHandle
    WriteOp -.->|updates after flush| Indexer

    TailOffset -.->|tracks valid data end| DiskFile

Sources: src/storage_engine/data_store.rs:27-33 src/storage_engine/data_store.rs:84-117
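Read together, the fields in the diagram suggest a struct shaped roughly like the sketch below. This is a paraphrase of the diagram for orientation, not the definitive definition from src/storage_engine/data_store.rs:27-33; the memmap2 crate is assumed here for the Mmap type.

```rust
use std::fs::File;
use std::io::BufWriter;
use std::path::PathBuf;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, RwLock};

use memmap2::Mmap; // assumed memory-mapping crate

/// Placeholder for the real key index type (see the Key Indexer section).
#[allow(dead_code)]
struct KeyIndexer;

/// Approximation of the DataStore fields shown in the diagram above.
#[allow(dead_code)]
struct DataStore {
    file: Arc<RwLock<BufWriter<File>>>,   // buffered writer, exclusive for writes
    mmap: Arc<Mutex<Arc<Mmap>>>,          // swappable read-only mapping for readers
    tail_offset: AtomicU64,               // tracks the end of valid data
    key_indexer: Arc<RwLock<KeyIndexer>>, // key hash -> offset index
    path: PathBuf,                        // location of the backing file
}

fn main() {} // type sketch only; opening and mapping are sketched in the next subsection
```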

File Opening and Initialization

The DataStore::open function src/storage_engine/data_store.rs:84-117 performs the following sequence (a minimal sketch of the first two steps follows below):

  1. Open File: Opens the file in read/write mode, creating it if necessary src/storage_engine/data_store.rs:161-170
  2. Initial Mapping: Creates an initial memory-mapped view src/storage_engine/data_store.rs:172-174
  3. Chain Recovery: Validates the storage chain and determines the valid data extent src/storage_engine/data_store.rs:383-482
  4. Truncation (if needed): Truncates corrupted data if recovery finds inconsistencies src/storage_engine/data_store.rs:91-104
  5. Index Building: Constructs the in-memory hash index by scanning valid entries src/storage_engine/data_store.rs:108

Sources: src/storage_engine/data_store.rs:84-117 src/storage_engine/data_store.rs:141-144
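A minimal sketch of steps 1 and 2, assuming the memmap2 crate for the memory map; chain recovery, truncation, and index building (steps 3-5) are intentionally omitted.

```rust
use std::fs::OpenOptions;
use std::path::Path;

use memmap2::{Mmap, MmapOptions}; // assumed memory-mapping crate

/// Opens the storage file read/write (creating it if missing) and builds an
/// initial read-only memory map over it.
fn open_and_map(path: &Path) -> std::io::Result<(std::fs::File, Mmap)> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;

    // SAFETY: readers only ever see appended data through remapped views,
    // mirroring the "remaps after flush" edge in the diagram above.
    let mmap = unsafe { MmapOptions::new().map(&file)? };
    Ok((file, mmap))
}

fn main() -> std::io::Result<()> {
    let (_file, mmap) = open_and_map(Path::new("example.bin"))?;
    println!("mapped {} bytes", mmap.len());
    Ok(())
}
```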


Storage Layout

Entries are stored with 64-byte alignment to optimize cache efficiency and SIMD operations. The alignment constant PAYLOAD_ALIGNMENT is defined as 64 in simd-r-drive-entry-handle/src/constants.rs.

Non-Tombstone Entry Layout

The pre-padding calculation ensures the payload starts on a 64-byte boundary src/storage_engine/data_store.rs:669-673:

pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)

Where prev_tail is the absolute file offset of the previous entry's metadata end.
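The same calculation can be expressed directly in code; the function name below is illustrative and the worked offsets are arbitrary.

```rust
/// Payload alignment used by the storage layout (64 bytes).
const PAYLOAD_ALIGNMENT: u64 = 64;

/// Pre-padding needed so the next payload starts on a 64-byte boundary.
fn prepad_len(prev_tail: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
}

fn main() {
    // Previous tail at offset 100: 100 % 64 = 36, so 28 pad bytes
    // push the payload to offset 128.
    assert_eq!(prepad_len(100), 28);
    // An already aligned tail needs no padding.
    assert_eq!(prepad_len(128), 0);
}
```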

Tombstone Entry Layout

Tombstones represent deleted keys and use a minimal layout without pre-padding:

graph LR
    subgraph "Tombstone Entry"
        NullByte["NULL Byte\n(1 byte)\n0x00"]
        KeyHash["Key Hash\n(8 bytes)\nXXH3_64"]
        PrevOff["Prev Offset\n(8 bytes)\nPrevious Tail"]
        Checksum["Checksum\n(4 bytes)\nCRC32C"]
    end

    NullByte --> KeyHash
    KeyHash --> PrevOff
    PrevOff --> Checksum

The single NULL_BYTE src/storage_engine.rs:136 serves as both the tombstone marker and the payload. This design minimizes the space overhead of deletions src/storage_engine/data_store.rs:864-897.
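A hedged sketch of how such a tombstone record could be laid out in memory, following the field order in the diagram; the xxhash-rust and crc32c crates and the little-endian encoding are assumptions of this example, not facts about the project.

```rust
use crc32c::crc32c;             // assumed CRC32C crate
use xxhash_rust::xxh3::xxh3_64; // assumed XXH3 crate

const NULL_BYTE: u8 = 0x00;

/// Builds the bytes of a tombstone: a 1-byte NULL payload followed by
/// key hash (8 bytes), previous tail offset (8 bytes), and checksum (4 bytes).
fn tombstone_bytes(key: &[u8], prev_tail: u64) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + 8 + 8 + 4);
    out.push(NULL_BYTE);                                        // payload doubles as deletion marker
    out.extend_from_slice(&xxh3_64(key).to_le_bytes());         // key hash
    out.extend_from_slice(&prev_tail.to_le_bytes());            // prev offset (chain link)
    out.extend_from_slice(&crc32c(&[NULL_BYTE]).to_le_bytes()); // checksum of the payload
    out
}

fn main() {
    let record = tombstone_bytes(b"user:42", 4096);
    assert_eq!(record.len(), 21); // 1-byte payload + 20-byte metadata
}
```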

Layout Summary Table

| Component | Non-Tombstone Size | Tombstone Size | Purpose |
|---|---|---|---|
| Pre-Pad | 0-63 bytes | 0 bytes | Align payload to 64-byte boundary |
| Payload | Variable | 1 byte (0x00) | User data or deletion marker |
| Key Hash | 8 bytes | 8 bytes | XXH3_64 hash for index lookup |
| Prev Offset | 8 bytes | 8 bytes | Points to previous tail (chain link) |
| Checksum | 4 bytes | 4 bytes | CRC32C of payload for integrity |

Sources: README.md:112-137 src/storage_engine/data_store.rs:669-673 src/storage_engine/data_store.rs:864-897
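For a concrete sense of the overhead, the small calculation below combines the pre-pad formula with the fixed 20-byte metadata block; the offsets and payload size are arbitrary examples.

```rust
const PAYLOAD_ALIGNMENT: u64 = 64;
const METADATA_SIZE: u64 = 8 + 8 + 4; // key hash + prev offset + checksum

/// Total bytes a non-tombstone entry occupies when appended at `prev_tail`.
fn entry_footprint(prev_tail: u64, payload_len: u64) -> u64 {
    let pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1);
    pad + payload_len + METADATA_SIZE
}

fn main() {
    // A 10-byte payload appended at offset 100: 28 pad + 10 payload + 20 metadata.
    assert_eq!(entry_footprint(100, 10), 58);
    // A tombstone skips the pre-pad: 1-byte payload + 20-byte metadata.
    assert_eq!(1 + METADATA_SIZE, 21);
}
```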


Validation Chain

Each entry's metadata contains a prev_offset field that points to the absolute file offset of the previous entry's metadata end (the "previous tail"). This creates a backward-linked chain from the end of the file to offset 0.
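Based on the field sizes in the layout table above, the per-entry metadata can be modeled roughly as below; the struct and constant names mirror the documentation rather than the actual source.

```rust
/// Approximate shape of the 20-byte metadata block stored after each payload.
#[derive(Debug, Clone, Copy)]
struct EntryMetadata {
    key_hash: u64,    // XXH3_64 of the key (8 bytes)
    prev_offset: u64, // absolute offset of the previous tail (8 bytes)
    checksum: u32,    // CRC32C of the payload (4 bytes)
}

/// Serialized size of the metadata block.
const METADATA_SIZE: usize = 8 + 8 + 4;

fn main() {
    // prev_offset == 0 marks the first entry in the chain.
    let first = EntryMetadata { key_hash: 0xABCD, prev_offset: 0, checksum: 0 };
    assert_eq!(first.prev_offset, 0);
    assert_eq!(METADATA_SIZE, 20);
}
```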

Chain Structure

graph RL
    EOF["End of File\n(tail_offset)"]
    Meta3["Entry 3 Metadata\nprev_offset = T2"]
    Entry3["Entry 3 Payload\n(aligned)"]
    Meta2["Entry 2 Metadata\nprev_offset = T1"]
    Entry2["Entry 2 Payload\n(aligned)"]
    Meta1["Entry 1 Metadata\nprev_offset = 0"]
    Entry1["Entry 1 Payload\n(aligned)"]
    Zero["Offset 0\n(File Start)"]

    EOF --> Meta3
    Meta3 -.->|prev_offset| Meta2
    Meta3 --> Entry3

    Meta2 -.->|prev_offset| Meta1
    Meta2 --> Entry2

    Meta1 -.->|prev_offset| Zero
    Meta1 --> Entry1

    Entry1 --> Zero

Sources: README.md:139-148 src/storage_engine/data_store.rs:383-482

Chain Validation Properties

The validation chain ensures data integrity through several properties:

  1. Completeness: A valid chain must trace back to offset 0 without gaps src/storage_engine/data_store.rs:473
  2. Forward Progress: Each prev_offset must be less than the current metadata offset src/storage_engine/data_store.rs:465-468
  3. Bounds Checking: All offsets must be within valid file boundaries src/storage_engine/data_store.rs:435-438
  4. Size Consistency: Total chain size must not exceed file length src/storage_engine/data_store.rs:473

These properties are enforced by the recover_valid_chain function src/storage_engine/data_store.rs:383-482, which validates the chain structure on every file open.

Sources: src/storage_engine/data_store.rs:383-482


Recovery Mechanisms

SIMD R Drive implements automatic recovery to handle incomplete writes from crashes, power loss, or other interruptions. The recovery process occurs transparently during DataStore::open.

Recovery Algorithm

graph TB
    Start["Open Storage File"]
    CheckSize{{"File Size ≥\nMETADATA_SIZE?"}}
    EmptyFile["Return offset 0\n(Empty File)"]
    InitCursor["cursor = file_length"]
    ReadMeta["Read metadata at\ncursor - METADATA_SIZE"]
    ExtractPrev["Extract prev_offset"]
    CalcBounds["Calculate entry bounds\nusing prepad_len"]
    ValidateBounds{{"Entry bounds valid?"}}
    DecrementCursor["cursor -= 1"]
    WalkBackward["Walk chain backward\nusing prev_offset"]
    CheckChain{{"Chain reaches\noffset 0?"}}
    CheckSize2{{"Total size ≤\nfile_length?"}}
    FoundValid["Store valid offset\nbest_valid_offset = cursor"]
    MoreToCheck{{"cursor ≥\nMETADATA_SIZE?"}}
    ReturnBest["Return best_valid_offset"]
    Truncate["Truncate file to\nbest_valid_offset"]
    Reopen["Recursively reopen\nDataStore::open"]
    BuildIndex["Build KeyIndexer\nfrom valid chain"]
    Complete["Recovery Complete"]

    Start --> CheckSize
    CheckSize -->|No| EmptyFile
    CheckSize -->|Yes| InitCursor
    EmptyFile --> Complete

    InitCursor --> ReadMeta
    ReadMeta --> ExtractPrev
    ExtractPrev --> CalcBounds
    CalcBounds --> ValidateBounds

    ValidateBounds -->|No| DecrementCursor
    ValidateBounds -->|Yes| WalkBackward

    WalkBackward --> CheckChain
    CheckChain -->|No| DecrementCursor
    CheckChain -->|Yes| CheckSize2

    CheckSize2 -->|No| DecrementCursor
    CheckSize2 -->|Yes| FoundValid

    FoundValid --> ReturnBest

    DecrementCursor --> MoreToCheck
    MoreToCheck -->|Yes| ReadMeta
    MoreToCheck -->|No| ReturnBest

    ReturnBest --> Truncate
    Truncate --> Reopen
    Reopen --> BuildIndex
    BuildIndex --> Complete

Sources: src/storage_engine/data_store.rs:383-482 src/storage_engine/data_store.rs:91-104

Recovery Steps Detailed

Step 1: Backward Scan

Starting from the end of file, the recovery algorithm scans backward looking for valid metadata entries. For each potential metadata location src/storage_engine/data_store.rs:390-394:

  • Reads 20 bytes of metadata
  • Deserializes into EntryMetadata structure
  • Extracts prev_offset (previous tail)
  • Calculates entry boundaries using alignment rules

Step 2: Chain Validation

For each candidate entry, the algorithm walks the entire chain backward src/storage_engine/data_store.rs:424-471:

back_cursor = prev_tail
while back_cursor != 0:
    - Read previous entry metadata
    - Validate bounds and size
    - Check forward progress (prev_prev_tail < prev_metadata_offset)
    - Accumulate total chain size
    - Update back_cursor to previous prev_offset

A chain is valid only if it reaches offset 0 and the total size does not exceed the file length src/storage_engine/data_store.rs:473.
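Carrying over the assumptions from the metadata sketch earlier (a 20-byte record with prev_offset in bytes 8..16, little-endian), the walk can be pictured as the following function; it is a simplified stand-in for recover_valid_chain, not the real implementation.

```rust
const METADATA_SIZE: u64 = 20;

/// Reads the prev_offset field (bytes 8..16 of a metadata block) at `meta_start`.
/// The field layout and endianness are assumptions of this sketch.
fn read_prev_offset(data: &[u8], meta_start: u64) -> Option<u64> {
    let start = meta_start as usize + 8;
    let bytes = data.get(start..start + 8)?;
    Some(u64::from_le_bytes(bytes.try_into().ok()?))
}

/// Walks the chain backward from `tail` toward offset 0, enforcing bounds
/// and forward progress. Returns true only if the chain is complete.
fn chain_reaches_zero(data: &[u8], tail: u64) -> bool {
    let mut back_cursor = tail;
    while back_cursor != 0 {
        if back_cursor < METADATA_SIZE || back_cursor > data.len() as u64 {
            return false; // metadata would fall outside the file
        }
        let meta_start = back_cursor - METADATA_SIZE;
        let Some(prev) = read_prev_offset(data, meta_start) else {
            return false;
        };
        if prev >= meta_start {
            return false; // no forward progress: the chain must move toward 0
        }
        back_cursor = prev;
    }
    true
}

fn main() {
    // An empty file has a trivially complete chain at tail 0.
    assert!(chain_reaches_zero(&[], 0));
}
```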

Step 3: Truncation and Retry

If corruption is detected (file length > valid chain extent), the function takes the following steps (a small sketch of the truncation appears after the list):

  1. Logs a warning src/storage_engine/data_store.rs:92-97
  2. Drops existing file handles and the memory map src/storage_engine/data_store.rs:98-99
  3. Reopens the file with write access src/storage_engine/data_store.rs:100
  4. Truncates to the valid offset src/storage_engine/data_store.rs:101
  5. Syncs to disk src/storage_engine/data_store.rs:102
  6. Recursively calls DataStore::open src/storage_engine/data_store.rs:103
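A hedged sketch of the truncate-and-sync portion (steps 3 through 5 of the list above), using only std::fs; the recursive reopen is indicated with a comment rather than real code.

```rust
use std::fs::OpenOptions;
use std::path::Path;

/// Truncates the storage file to the last known-good offset and syncs it.
fn truncate_to_valid(path: &Path, best_valid_offset: u64) -> std::io::Result<()> {
    let file = OpenOptions::new().write(true).open(path)?; // reopen with write access
    file.set_len(best_valid_offset)?;                      // drop the corrupted tail
    file.sync_all()?;                                      // persist the truncation
    // The real code then recursively calls DataStore::open on the shortened file.
    Ok(())
}

fn main() {
    // Example only: truncating a file that does not exist returns an error.
    let _ = truncate_to_valid(Path::new("example.bin"), 0);
}
```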

Step 4: Index Rebuilding

After establishing the valid data extent, KeyIndexer::build scans forward through the file src/storage_engine/data_store.rs:108, populating the hash index with key → offset mappings. This forward scan uses the iterator pattern src/storage_engine/entry_iterator.rs:56-127 to process each entry sequentially.

Sources: src/storage_engine/data_store.rs:383-482 src/storage_engine/data_store.rs:91-104 src/storage_engine/entry_iterator.rs:21-54
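Conceptually, the forward scan builds a latest-wins map from key hash to entry offset. The sketch below uses a plain HashMap and a generic iterator of (key_hash, offset) pairs in place of the real KeyIndexer and EntryIterator types.

```rust
use std::collections::HashMap;

/// Builds a latest-wins index from an iterator of (key_hash, payload_offset)
/// pairs produced by a forward scan of the valid chain. Later entries for the
/// same key hash overwrite earlier ones, matching the latest-wins semantics.
fn build_index(entries: impl Iterator<Item = (u64, u64)>) -> HashMap<u64, u64> {
    let mut index = HashMap::new();
    for (key_hash, offset) in entries {
        index.insert(key_hash, offset); // newer offset replaces older one
    }
    index
}

fn main() {
    // Two versions of key hash 0xAA: the index keeps only the newest offset.
    let scanned = vec![(0xAA, 64), (0xBB, 192), (0xAA, 320)].into_iter();
    let index = build_index(scanned);
    assert_eq!(index[&0xAA], 320);
    assert_eq!(index[&0xBB], 192);
}
```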


Architecture Integration

The storage architecture integrates with other subsystems through well-defined boundaries:

Component Interactions

graph TB
    subgraph "Storage Architecture (2.1)"
        AppendOnly["Append-Only Design"]
        SingleFile["Single-File Container"]
        Layout["Storage Layout\n(64-byte alignment)"]
        Chain["Validation Chain"]
        Recovery["Recovery Mechanism"]
    end

    subgraph "DataStore API (2.2)"
        Write["write/batch_write"]
        Read["read/batch_read"]
        Delete["delete/batch_delete"]
    end

    subgraph "Entry Structure (2.3)"
        EntryMetadata["EntryMetadata\n(key_hash, prev_offset, checksum)"]
        PrePad["Pre-Padding Logic"]
        Tombstone["Tombstone Format"]
    end

    subgraph "Memory Management (2.4)"
        Mmap["Memory-Mapped File"]
        EntryHandle["EntryHandle\n(Zero-Copy View)"]
        Alignment["PAYLOAD_ALIGNMENT"]
    end

    subgraph "Key Indexer (2.6)"
        KeyIndexer["KeyIndexer"]
        HashLookup["Hash → Offset Mapping"]
    end

    AppendOnly -.->|enforces| Write
    SingleFile -.->|provides| Mmap
    Layout -.->|defines| PrePad
    Layout -.->|uses| Alignment
    Chain -.->|uses| EntryMetadata
    Recovery -.->|validates| Chain
    Recovery -.->|rebuilds| KeyIndexer

    Write -.->|appends| Layout
    Read -.->|uses| HashLookup
    Read -.->|returns| EntryHandle
    Delete -.->|writes| Tombstone

Sources: src/storage_engine/data_store.rs:1-1183 src/storage_engine.rs:1-25

Key Design Decisions

| Decision | Rationale | Trade-off |
|---|---|---|
| 64-byte alignment | Matches CPU cache lines, enables SIMD | Wastes 0-63 bytes per entry |
| Backward-linked chain | Efficient recovery from EOF | Cannot validate forward |
| Single NULL-byte tombstones | Minimal deletion overhead | Special case in code |
| Metadata-last layout | Sequential write pattern | Recovery scans backward |
| Atomic tail_offset | Lock-free progress tracking | Additional synchronization primitive |

Sources: README.md:51-59 README.md:104-148 src/storage_engine/data_store.rs:669-673


Performance Characteristics

The storage architecture exhibits the following performance properties:

Write Performance

Writes are strictly sequential appends behind an exclusive write lock, minimizing disk seeks and favoring throughput over random-write latency.

Read Performance

Reads resolve keys through the in-memory hash index and return zero-copy views over the memory-mapped file, so concurrent readers do not block one another.

Recovery Performance

Recovery cost grows with the number of entries, because chain validation walks backward from the tail to offset 0 on every open.

Sources: src/storage_engine/data_store.rs:752-939 src/storage_engine/data_store.rs:1027-1183 src/storage_engine/data_store.rs:383-482


Limitations and Considerations

Append-Only Growth

The file grows continuously with updates and deletions. Compaction is available but requires exclusive access src/storage_engine/data_store.rs:706-749.

Alignment Overhead

Each non-tombstone entry may waste up to 63 bytes for alignment padding. For small payloads, this overhead can be significant relative to payload size.

Recovery Cost

Recovery time increases with file size, as chain validation requires traversing all entries. For very large files (exceeding RAM), this may cause extended startup times.

Single-Process Design

While thread-safe within a process, multiple processes cannot safely share the same file without external locking mechanisms README.md:185-206.

Sources: README.md:51-59 README.md:185-206 src/storage_engine/data_store.rs:669-673 src/storage_engine/data_store.rs:706-749