Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Storage Architecture

Loading…

Storage Architecture

Relevant source files

Purpose and Scope

This document describes the on-disk storage format used by SIMD R Drive, including the physical layout of entries, the alignment strategy, the backward-linked chain structure, and the recovery mechanism that ensures data integrity after crashes or incomplete writes.

For information about the in-memory data structures and API, see DataStore API. For details about entry metadata fields, see Entry Structure and Metadata. For information about memory-mapped access patterns, see Memory Management and Zero-Copy Access.

Single-File Storage Container

SIMD R Drive stores all data in a single binary file with an append-only design. The storage engine writes sequentially to minimize disk seeks and maximize throughput. Each write operation appends a new entry to the end of the file, and the file position is tracked using the AtomicU64 tail_offset field in DataStore.

Sources: README.md:61-97 src/storage_engine/data_store.rs:27-33

graph LR
    FILE["Single Binary File\n(*.simd-r-drive)"]
ENTRY1["Entry 1\n(Pre-pad + Payload + Metadata)"]
ENTRY2["Entry 2\n(Pre-pad + Payload + Metadata)"]
ENTRY3["Entry 3\n(Pre-pad + Payload + Metadata)"]
TAIL["tail_offset\n(AtomicU64)"]
FILE --> ENTRY1
 
   ENTRY1 --> ENTRY2
 
   ENTRY2 --> ENTRY3
 
   ENTRY3 --> TAIL

Entry Layout

Each entry in the storage file consists of three components: optional pre-padding for alignment, the payload data, and metadata. The layout differs between non-tombstone entries (data) and tombstone entries (deletion markers).

Non-Tombstone Entry Structure

Non-tombstone entries store actual data and are aligned to PAYLOAD_ALIGNMENT (64 bytes by default). The alignment ensures cache-line efficiency and enables zero-copy access for typed slices.

Physical Layout Table:

graph LR
    PREV_TAIL["Previous\ntail_offset"]
PREPAD["Pre-Pad\n0-63 bytes\n(zero bytes)"]
PAYLOAD["Payload\nVariable Length\n(actual data)"]
METADATA["EntryMetadata\n20 bytes"]
NEXT_TAIL["New\ntail_offset"]
PREV_TAIL --> PREPAD
 
   PREPAD --> PAYLOAD
 
   PAYLOAD --> METADATA
 
   METADATA --> NEXT_TAIL
Offset RangeFieldSize (Bytes)Description
P .. P+padPre-PadpadZero bytes calculated as (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
P+pad .. NPayloadN-(P+pad)Variable-length data, starts at aligned boundary
N .. N+8key_hash864-bit XXH3 hash of the key
N+8 .. N+16prev_offset8Absolute file offset of previous entry’s tail
N+16 .. N+20checksum4CRC32C checksum of the payload

Where:

  • pad = DataStore::prepad_len(prev_tail) computed at write time
  • PAYLOAD_ALIGNMENT = 64 (defined in simd-r-drive-entry-handle/src/constants.rs)
  • Next entry starts at N + 20

Sources: README.md:112-125 simd-r-drive-entry-handle/src/entry_metadata.rs:11-23 src/storage_engine/data_store.rs:666-673

Entry Metadata Structure

The EntryMetadata struct stores three critical fields in exactly 20 bytes:

Field Purposes:

FieldTypePurpose
key_hashu64XXH3 hash of the key for index lookups
prev_offsetu64Absolute file offset pointing to the previous entry’s tail (forms backward chain)
checksum[u8; 4]CRC32C checksum of payload for integrity verification

The metadata is serialized using little-endian encoding via EntryMetadata::serialize() and deserialized via EntryMetadata::deserialize().

Sources: simd-r-drive-entry-handle/src/entry_metadata.rs:44-113

Tombstone Entry Structure

Tombstone entries mark deleted keys. They consist of a single null byte (NULL_BYTE[0] = 0x00) followed by metadata, with no pre-padding.

Physical Layout Table:

graph LR
    PREV_TAIL["Previous\ntail_offset"]
NULL["NULL_BYTE\n1 byte\n(0x00)"]
METADATA["EntryMetadata\n20 bytes"]
NEXT_TAIL["New\ntail_offset"]
PREV_TAIL --> NULL
 
   NULL --> METADATA
 
   METADATA --> NEXT_TAIL
Offset RangeFieldSize (Bytes)Description
T .. T+1Payload1Single byte 0x00 (NULL_BYTE)
T+1 .. T+9key_hash8Hash of the deleted key
T+9 .. T+17prev_offset8Previous entry’s tail offset
T+17 .. T+21checksum4CRC32C of the null byte

Tombstones are written using DataStore::batch_write_with_key_hashes() with the allow_null_bytes parameter set to true. The deletion logic filters existing keys before writing tombstones to avoid unnecessary I/O.

Sources: README.md:126-132 src/storage_engine/data_store.rs:863-897 src/storage_engine/data_store.rs:990-1024

Alignment Strategy

The pre-padding mechanism ensures that every non-tombstone payload starts at a 64-byte aligned boundary. This alignment is critical for:

  1. Cache-line efficiency : Payloads align with CPU cache lines (typically 64 bytes)
  2. SIMD operations : Vectorized loads/stores can operate without crossing boundaries
  3. Zero-copy typed access : Enables safe reinterpretation as typed slices (e.g., &[u64])

Alignment Calculation

The DataStore::prepad_len() function implements this calculation:

fn prepad_len(offset: u64) -> usize {
    let a = PAYLOAD_ALIGNMENT;
    ((a - (offset % a)) & (a - 1)) as usize
}

During writes, the code checks if pre-padding is needed and writes zero bytes before the payload:

Sources: src/storage_engine/data_store.rs:666-673 README.md:52-59

Backward-Linked Chain Structure

Each entry’s prev_offset field in EntryMetadata points to the absolute file offset of the previous entry’s tail, forming a backward-linked chain. This chain enables:

  1. Iteration : Walking entries from end to beginning
  2. Recovery : Validating chain integrity
  3. Alignment derivation : Computing payload start from previous tail

Chain Traversal During Reads

When reading an entry from the index:

  1. Look up key_hash in KeyIndexer to get metadata offset
  2. Read EntryMetadata at that offset
  3. Extract prev_offset (previous tail)
  4. Calculate payload start: prev_offset + DataStore::prepad_len(prev_offset)
  5. Payload ends at metadata offset

This logic is implemented in DataStore::read_entry_with_context() at src/storage_engine/data_store.rs:501-565

Chain Traversal Diagram:

Sources: src/storage_engine/data_store.rs:501-565 simd-r-drive-entry-handle/src/entry_metadata.rs:40-43

graph TD
    START["Start at file_len"]
CHECK_SIZE{"file_len <\nMETADATA_SIZE?"}
RETURN_ZERO["Return Ok(0)"]
INIT["cursor = file_len\nbest_valid_offset = None"]
LOOP{"cursor >=\nMETADATA_SIZE?"}
READ_META["Read metadata at\n(cursor - METADATA_SIZE)"]
EXTRACT["Extract prev_offset\nDerive entry_start"]
VALIDATE{"entry_start <\nmetadata_offset?"}
WALK["Walk chain backward\nvia prev_offset"]
CHAIN_VALID{"Chain reaches\noffset 0?"}
SET_BEST["best_valid_offset =\ncursor"]
BREAK["Break loop"]
DECREMENT["cursor -= 1"]
RETURN_BEST["Return\nbest_valid_offset\nor 0"]
START --> CHECK_SIZE
 
   CHECK_SIZE -->|Yes| RETURN_ZERO
 
   CHECK_SIZE -->|No| INIT
 
   INIT --> LOOP
 
   LOOP -->|Yes| READ_META
 
   LOOP -->|No| RETURN_BEST
 
   READ_META --> EXTRACT
 
   EXTRACT --> VALIDATE
 
   VALIDATE -->|No| DECREMENT
 
   VALIDATE -->|Yes| WALK
 
   WALK --> CHAIN_VALID
 
   CHAIN_VALID -->|Yes| SET_BEST
 
   CHAIN_VALID -->|No| DECREMENT
 
   SET_BEST --> BREAK
 
   BREAK --> RETURN_BEST
 
   DECREMENT --> LOOP

Recovery Mechanism

The DataStore::recover_valid_chain() function validates chain integrity when opening a file. It scans backward from the file end to find the deepest valid chain that reaches offset 0, automatically recovering from incomplete writes.

Recovery Algorithm

Recovery Process Steps

  1. Initial Check : If file is smaller than METADATA_SIZE (20 bytes), return offset 0
  2. Backward Scan : Start from file_len and scan backward by 1 byte at a time
  3. Metadata Read : At each position, attempt to read metadata at cursor - METADATA_SIZE
  4. Entry Validation :
    • Extract prev_offset from metadata
    • Calculate expected entry start using DataStore::prepad_len(prev_offset)
    • Handle tombstone special case (single null byte without pre-pad)
    • Verify entry_start < metadata_offset
  5. Chain Walk : For valid entry, walk entire chain backward:
    • Follow prev_offset links
    • Validate each link points to earlier offset
    • Track total chain size
    • Stop when prev_offset = 0 (chain start)
  6. Validation : Chain is valid if:
    • All links point backward (no cycles)
    • Chain reaches offset 0
    • Total chain size ≤ file length
  7. Result : Return the first valid chain found (deepest chain)

Tombstone Handling in Recovery

Tombstones have special handling during recovery because they lack pre-padding:

if entry_end > prev_tail
    && entry_end - prev_tail == 1
    && mmap[prev_tail..entry_end] == NULL_BYTE
{
    entry_start = prev_tail  // No pre-pad for tombstone
} else {
    entry_start = prev_tail + prepad_len(prev_tail)
}

This logic appears at:

Sources: src/storage_engine/data_store.rs:363-482 README.md:139-148

Recovery on File Open

The DataStore::open() function performs recovery automatically:

If recovery detects corruption (incomplete chain), the file is truncated to the last valid offset and reopened. This ensures the storage is always in a consistent state.

Sources: src/storage_engine/data_store.rs:84-117 README.md:150-156

Entry Size Calculation

Each entry’s total file size includes pre-padding, payload, and metadata. The calculation depends on entry type:

Non-Tombstone Entry:

total_size = prepad_len(prev_tail) + payload.len() + METADATA_SIZE

Tombstone Entry:

total_size = 1 + METADATA_SIZE  // No pre-pad

The EntryHandle::file_size() method computes this from the entry’s range and metadata. For iteration and compaction, this allows precise tracking of storage usage.

Sources: README.md:112-132 src/storage_engine/data_store.rs:705-749

File Growth and Tail Tracking

The tail_offset field tracks the current end of valid data:

The atomic store uses Ordering::Release to ensure visibility across threads. Writers acquire tail_offset with Ordering::Acquire before computing pre-padding.

Sources: src/storage_engine/data_store.rs256 src/storage_engine/data_store.rs763 src/storage_engine/data_store.rs858

Dismiss

Refresh this wiki

Enter email to refresh