
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Core Storage Engine

Purpose and Scope

This document provides an overview of the core storage engine architecture in SIMD R Drive. It covers the fundamental design principles, main components, and data flow patterns that enable high-performance, append-only binary storage.

For detailed information about specific aspects of the storage engine, see the child pages of this document.

Sources: README.md:1-50 src/lib.rs:1-28


System Architecture

The core storage engine is implemented as a single-file, append-only key-value store. It consists of four primary components that work together to provide high-performance binary storage with zero-copy read access.

Diagram: Core Storage Engine Components

```mermaid
graph TB
    subgraph "Public API Layer"
        DS["DataStore"]
        DSR["DataStoreReader trait"]
        DSW["DataStoreWriter trait"]
    end

    subgraph "Indexing Layer"
        KI["KeyIndexer"]
        DASHMAP["DashMap&lt;u64, u64&gt;"]
        XXH3["xxh3_64 hasher"]
    end

    subgraph "Storage Layer"
        MMAP["Arc&lt;Mmap&gt;"]
        FILE["Single binary file"]
        RWLOCK["RwLock&lt;File&gt;"]
    end

    subgraph "Access Layer"
        EH["EntryHandle"]
        EI["EntryIterator"]
        ES["EntryStream"]
    end

    DS --> DSR
    DS --> DSW
    DS --> KI
    DS --> MMAP
    DS --> RWLOCK

    KI --> DASHMAP
    KI --> XXH3

    DSR --> EH
    DSR --> EI
    DSW --> RWLOCK

    EH --> MMAP
    EI --> MMAP
    ES --> EH

    MMAP --> FILE
    RWLOCK --> FILE
```
The diagram shows the relationship between the main code entities:

| Component | Code Entity | Purpose |
|---|---|---|
| Public API | DataStore | Main interface for storage operations |
| Traits | DataStoreReader, DataStoreWriter | Separate read/write capabilities |
| Indexing | KeyIndexer | Manages in-memory hash index |
| Hash Map | DashMap<u64, u64> | Lock-free concurrent hash map |
| Hashing | xxh3_64 | SIMD-accelerated hash function |
| Memory Mapping | Arc<Mmap> | Shared memory-mapped file reference |
| File Access | RwLock<File> | Synchronized write access |
| Entry Access | EntryHandle | Zero-copy view into mapped memory |
| Iteration | EntryIterator | Backward chain traversal |
| Streaming | EntryStream | Buffered reading for large entries |

Sources: src/storage_engine.rs:1-25 src/lib.rs:129-136 README.md:5-11


Append-Only Design

The storage engine follows a strict append-only model where data is never modified or deleted in place. All operations result in new entries being appended to the end of the file.

Diagram: Write Path Data Flow

```mermaid
graph LR
    subgraph "Write Operations"
        W["write()"]
        BW["batch_write()"]
        WS["write_stream()"]
        DEL["delete()"]
    end

    subgraph "Internal Write Path"
        HASH["Calculate xxh3_64 hash"]
        ALIGN["Calculate prepad for\n64-byte alignment"]
        COPY["simd_copy payload"]
        APPEND["Append to file"]
        META["Write metadata:\nkey_hash, prev_offset, crc32"]
    end

    subgraph "File State"
        TAIL["AtomicU64 tail_offset"]
        CHAIN["Backward-linked chain"]
    end

    W --> HASH
    BW --> HASH
    WS --> HASH
    DEL --> HASH

    HASH --> ALIGN
    ALIGN --> COPY
    COPY --> APPEND
    APPEND --> META
    META --> TAIL
    META --> CHAIN
```

Key Characteristics

| Characteristic | Implementation |
|---|---|
| Immutability | Entries are never modified after writing |
| Overwrites | New entries with the same key supersede old ones |
| Deletions | Append tombstone marker (single 0x00 byte) |
| File Growth | File grows monotonically until compaction |
| Ordering | Maintains temporal order via prev_offset chain |
| Recovery | Incomplete writes detected via chain validation |

The append-only design provides several benefits:

  • Crash safety : Incomplete writes can be detected and discarded
  • Simplified concurrency : No in-place modifications to coordinate
  • Time-travel : Historical data remains until compaction
  • Write performance : Sequential I/O with no seek overhead
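These mechanics can be modeled in miniature with plain std types. The sketch below is illustrative only (the real engine stores entries in a file and indexes them by hash, not in a `Vec`): writes only ever push, an overwrite appends a new entry for the same key, and a delete appends a single 0x00 tombstone byte; the newest entry for a key wins.

```rust
// Miniature model of the append-only log (illustrative; not the crate's code).
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    key_hash: u64,
    payload: Vec<u8>, // a single 0x00 byte marks a tombstone
}

struct Log {
    entries: Vec<Entry>, // grows monotonically; never mutated in place
}

impl Log {
    fn new() -> Self {
        Log { entries: Vec::new() }
    }

    fn write(&mut self, key_hash: u64, payload: &[u8]) {
        self.entries.push(Entry { key_hash, payload: payload.to_vec() });
    }

    fn delete(&mut self, key_hash: u64) {
        // Deletion is just another append: a 1-byte tombstone.
        self.write(key_hash, &[0x00]);
    }

    /// Scan backward from the tail; the newest entry for a key wins.
    fn read(&self, key_hash: u64) -> Option<&[u8]> {
        for e in self.entries.iter().rev() {
            if e.key_hash == key_hash {
                if e.payload == [0x00] {
                    return None; // tombstone: key was deleted
                }
                return Some(&e.payload);
            }
        }
        None
    }
}

fn main() {
    let mut log = Log::new();
    log.write(1, b"v1");
    log.write(1, b"v2"); // supersedes v1 without touching it
    assert_eq!(log.read(1), Some(&b"v2"[..]));
    log.delete(1);
    assert_eq!(log.read(1), None);
    assert_eq!(log.entries.len(), 3); // old data remains until compaction
    println!("append-only model ok");
}
```

Note how superseded and deleted data stays in the log until compaction, which is what enables the crash-safety and time-travel properties above.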

Sources: README.md:98-147 src/lib.rs:3-17


Single-File Storage Container

All data is stored in a single binary file with a specific structure designed for efficient access and validation.

Diagram: File Organization and Backward-Linked Chain

```mermaid
graph TB
    subgraph "File Structure"
        START["File Start: offset 0"]
        E1["Entry 1\nprepad + payload + metadata"]
        E2["Entry 2\nprepad + payload + metadata"]
        E3["Entry 3\nprepad + payload + metadata"]
        EN["Entry N\nprepad + payload + metadata"]
        TAIL["tail_offset\nEnd of valid data"]
    end

    subgraph "Metadata Chain"
        M1["metadata.prev_offset = 0"]
        M2["metadata.prev_offset = end(E1)"]
        M3["metadata.prev_offset = end(E2)"]
        MN["metadata.prev_offset = end(E3)"]
    end

    START --> E1
    E1 --> E2
    E2 --> E3
    E3 --> EN
    EN --> TAIL

    E1 -.-> M1
    E2 -.-> M2
    E3 -.-> M3
    EN -.-> MN

    MN -.backward chain.-> M3
    M3 -.backward chain.-> M2
    M2 -.backward chain.-> M1
```

Storage Properties

| Property | Value | Purpose |
|---|---|---|
| File Type | Single binary file | Simplified management and deployment |
| Entry Alignment | 64-byte boundaries | Cache-line and SIMD optimization |
| Metadata Size | 20 bytes | key_hash (8) + prev_offset (8) + crc32 (4) |
| Chain Direction | Backward (tail to head) | Fast validation and recovery |
| Maximum Size | 256 TiB | 48-bit offset support |
| Format | Schema-less | Raw binary with no interpretation |

Entry Composition

Each entry consists of three parts:

  1. Pre-padding (0-63 bytes): Zero bytes to align payload start to 64-byte boundary
  2. Payload (variable length): Raw binary data
  3. Metadata (20 bytes): Hash, previous offset, checksum

Tombstone entries (deletions) use a minimal format:

  • 1-byte payload (0x00)
  • 20-byte metadata
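The prepad width follows directly from the current write position. The constant name `PAYLOAD_ALIGNMENT` and the 20-byte metadata size come from this document; the helper functions themselves are an illustrative sketch of the arithmetic, not the crate's internals.

```rust
const PAYLOAD_ALIGNMENT: u64 = 64;
const METADATA_LEN: u64 = 20; // key_hash (8) + prev_offset (8) + crc32 (4)

/// Zero bytes needed so the payload starts on the next 64-byte boundary.
fn prepad_len(write_offset: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (write_offset % PAYLOAD_ALIGNMENT)) % PAYLOAD_ALIGNMENT
}

/// Total bytes an entry occupies: prepad + payload + metadata.
fn entry_len(write_offset: u64, payload_len: u64) -> u64 {
    prepad_len(write_offset) + payload_len + METADATA_LEN
}

fn main() {
    assert_eq!(prepad_len(0), 0);    // already aligned
    assert_eq!(prepad_len(1), 63);   // worst case
    assert_eq!(prepad_len(100), 28); // 100 + 28 = 128, a 64-byte boundary
    // A tombstone written at an aligned offset: no prepad,
    // a 1-byte payload, and 20 bytes of metadata.
    assert_eq!(entry_len(128, 1), 21);
    println!("alignment arithmetic ok");
}
```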

Sources: README.md:104-138 README.md:61-97


DataStore: Primary Interface

DataStore is the main public interface providing all storage operations. It implements the DataStoreReader and DataStoreWriter traits to separate read and write capabilities.

Diagram: DataStore Structure and Methods

```mermaid
graph TB
    subgraph "DataStore struct"
        FILE_LOCK["file: RwLock&lt;File&gt;"]
        MMAP_LOCK["mmap: Mutex&lt;Arc&lt;Mmap&gt;&gt;"]
        INDEXER["indexer: KeyIndexer"]
        TAIL["tail_offset: AtomicU64"]
    end

    subgraph "DataStoreWriter methods"
        W1["write(key, value)"]
        W2["batch_write(entries)"]
        W3["write_stream(key, reader)"]
        W4["delete(key)"]
    end

    subgraph "DataStoreReader methods"
        R1["read(key) -> Option&lt;EntryHandle&gt;"]
        R2["batch_read(keys)"]
        R3["iter_entries() -> EntryIterator"]
        R4["contains_key(key) -> bool"]
    end

    subgraph "Maintenance methods"
        M1["compact() -> Stats"]
        M2["estimate_compaction_space()"]
        M3["verify_file_integrity()"]
    end

    FILE_LOCK --> W1
    FILE_LOCK --> W2
    FILE_LOCK --> W3
    FILE_LOCK --> W4

    MMAP_LOCK --> R1
    MMAP_LOCK --> R2
    MMAP_LOCK --> R3

    INDEXER --> R1
    INDEXER --> R2
    INDEXER --> R4

    TAIL --> W1
    TAIL --> W2
    TAIL --> W3
```

Core Fields

The DataStore struct maintains four critical fields:

| Field | Type | Purpose |
|---|---|---|
| file | RwLock<File> | Serializes write operations |
| mmap | Mutex<Arc<Mmap>> | Protects memory map updates |
| indexer | KeyIndexer | Provides O(1) key lookups |
| tail_offset | AtomicU64 | Tracks file end without locks |
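How the four fields cooperate can be shown with std-only stand-ins: a `Vec<u8>` for the file, an `Arc<Vec<u8>>` snapshot for `Arc<Mmap>`, and a `RwLock<HashMap>` in place of the lock-free `DashMap` (which is an external crate). This is a hypothetical miniature of the pattern, not the crate's implementation.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, RwLock};

struct MiniStore {
    file: RwLock<Vec<u8>>,            // serializes writes (RwLock<File> in the engine)
    mmap: Mutex<Arc<Vec<u8>>>,        // published snapshot; readers clone the Arc cheaply
    index: RwLock<HashMap<u64, u64>>, // key_hash -> offset (DashMap in the engine)
    tail_offset: AtomicU64,           // file end, readable without any lock
}

impl MiniStore {
    fn new() -> Self {
        MiniStore {
            file: RwLock::new(Vec::new()),
            mmap: Mutex::new(Arc::new(Vec::new())),
            index: RwLock::new(HashMap::new()),
            tail_offset: AtomicU64::new(0),
        }
    }

    fn write(&self, key_hash: u64, payload: &[u8]) {
        let mut file = self.file.write().unwrap(); // exclusive write access
        let offset = file.len() as u64;
        file.extend_from_slice(payload); // append only
        self.index.write().unwrap().insert(key_hash, offset);
        self.tail_offset.store(file.len() as u64, Ordering::Release);
        // "Remap": publish a fresh snapshot for readers.
        *self.mmap.lock().unwrap() = Arc::new(file.clone());
    }

    fn read(&self, key_hash: u64) -> Option<Vec<u8>> {
        let offset = *self.index.read().unwrap().get(&key_hash)? as usize;
        let snapshot = self.mmap.lock().unwrap().clone(); // brief lock, then lock-free
        Some(snapshot[offset..].to_vec())
    }
}

fn main() {
    let store = MiniStore::new();
    store.write(42, b"hello");
    assert_eq!(store.read(42).as_deref(), Some(&b"hello"[..]));
    assert_eq!(store.tail_offset.load(Ordering::Acquire), 5);
    println!("mini store ok");
}
```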

Sources: src/storage_engine.rs:4-5 src/lib.rs:19-63


Zero-Copy Read Path

Read operations leverage memory-mapped files to provide zero-copy access to stored data without deserialization overhead.

Diagram: Zero-Copy Read Operation Flow

```mermaid
graph LR
    subgraph "Read Request"
        KEY["Key bytes"]
        HASH_KEY["xxh3_64(key)"]
    end

    subgraph "Index Lookup"
        DASHMAP_GET["DashMap.get(hash)"]
        PACKED["Packed value:\n16-bit tag + 48-bit offset"]
        TAG_CHECK["Verify 16-bit tag"]
    end

    subgraph "Memory Access"
        OFFSET["File offset"]
        MMAP_SLICE["Arc&lt;Mmap&gt; slice"]
        METADATA_READ["Read EntryMetadata"]
        PAYLOAD_RANGE["Calculate payload range"]
    end

    subgraph "Result"
        EH["EntryHandle\n(zero-copy view)"]
    end

    KEY --> HASH_KEY
    HASH_KEY --> DASHMAP_GET
    DASHMAP_GET --> PACKED
    PACKED --> TAG_CHECK
    TAG_CHECK --> OFFSET
    OFFSET --> MMAP_SLICE
    MMAP_SLICE --> METADATA_READ
    METADATA_READ --> PAYLOAD_RANGE
    PAYLOAD_RANGE --> EH
```

Read Performance Characteristics

| Operation | Complexity | Notes |
|---|---|---|
| Single read | O(1) | Hash index lookup + memory access |
| Batch read | O(n) | Independent lookups, parallelizable |
| Full iteration | O(m) | m = total entries, follows chain |
| Collision handling | O(1) | 16-bit tag check |

EntryHandle

The EntryHandle struct provides a zero-copy view into the memory-mapped file. Methods like as_slice(), as_bytes(), and streaming conversion allow direct access to payload data without copying.

Sources: README.md:43-50 README.md:228-231 src/storage_engine/entry_iterator.rs:8-21


Key Indexing System

The KeyIndexer maintains an in-memory hash index for O(1) key lookups using the xxh3_64 hashing algorithm with SIMD acceleration.

Diagram: Key Indexing and Collision Detection

```mermaid
graph TB
    subgraph "Key to Hash"
        K1["Key: 'user:1234'"]
        H1["xxh3_64 hash:\n0xABCDEF0123456789"]
    end

    subgraph "Hash to Packed Value"
        TAG["Extract 16-bit tag:\n0xABCD"]
        OFF["Extract file offset:\n0x00EF0123456789"]
        PACKED_VAL["Packed 64-bit value:\ntag:16 / offset:48"]
    end

    subgraph "DashMap Storage"
        DM["DashMap&lt;u64, u64&gt;"]
        ENTRY["hash -> packed_value"]
    end

    subgraph "Collision Detection"
        LOOKUP["On read: lookup hash"]
        VERIFY["Verify 16-bit tag matches"]
        RESOLVE["If mismatch: collision detected"]
    end

    K1 --> H1
    H1 --> TAG
    H1 --> OFF
    TAG --> PACKED_VAL
    OFF --> PACKED_VAL
    PACKED_VAL --> ENTRY
    ENTRY --> DM

    DM --> LOOKUP
    LOOKUP --> VERIFY
    VERIFY --> RESOLVE
```

Index Structure

| Component | Type | Size | Purpose |
|---|---|---|---|
| Hash Map | DashMap<u64, u64> | Dynamic | Lock-free concurrent access |
| Key Hash | u64 | 8 bytes | Full xxh3_64 hash of key |
| Packed Value | u64 | 8 bytes | 16-bit tag + 48-bit offset |
| Tag | u16 | 2 bytes | Collision detection |
| Offset | u48 | 6 bytes | File location (0-256 TiB) |

Hash Algorithm Properties

The xxh3_64 algorithm provides:

  • SIMD acceleration : Uses SSE2/AVX2 (x86_64) or NEON (ARM)
  • High quality : Low collision probability
  • Performance : Optimized for throughput
  • Stability : Consistent across platforms

Sources: README.md:158-168 src/storage_engine.rs:13-14
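The tag/offset packing comes down to a few lines of bit arithmetic. The sketch below assumes the 16-bit tag is the top 16 bits of the key hash, which matches the example values in the diagram; the exact bit layout in the crate may differ.

```rust
const OFFSET_MASK: u64 = (1 << 48) - 1; // low 48 bits: file offset (up to 256 TiB)

/// Pack a 16-bit tag (taken here from the top bits of the key hash)
/// with a 48-bit file offset into one u64.
fn pack(key_hash: u64, offset: u64) -> u64 {
    debug_assert!(offset <= OFFSET_MASK, "offset exceeds 48 bits");
    (key_hash & !OFFSET_MASK) | (offset & OFFSET_MASK)
}

fn tag_of(value: u64) -> u16 {
    (value >> 48) as u16 // works for both a raw hash and a packed value
}

fn offset_of(packed: u64) -> u64 {
    packed & OFFSET_MASK
}

/// On read: the stored tag must match the tag derived from the lookup
/// hash; a mismatch means a hash-slot collision was detected.
fn verify(packed: u64, key_hash: u64) -> Option<u64> {
    if tag_of(packed) == tag_of(key_hash) {
        Some(offset_of(packed))
    } else {
        None // collision detected
    }
}

fn main() {
    let hash: u64 = 0xABCD_EF01_2345_6789; // the diagram's example hash
    let file_offset: u64 = 4096;
    let packed = pack(hash, file_offset);
    assert_eq!(tag_of(packed), 0xABCD);
    assert_eq!(offset_of(packed), 4096);
    assert_eq!(verify(packed, hash), Some(4096));
    // A hash with a different top-16 tag fails verification.
    assert_eq!(verify(packed, 0x1234_0000_0000_0000), None);
    println!("packing ok");
}
```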


Thread Safety Model

The storage engine provides thread-safe concurrent access within a single process using a combination of synchronization primitives.

Diagram: Concurrency Control Mechanisms

```mermaid
graph TB
    subgraph "Concurrent Reads"
        T1["Thread 1 read()"]
        T2["Thread 2 read()"]
        T3["Thread 3 read()"]
        DM_READ["DashMap lock-free read"]
        MMAP_SHARE["Shared Arc&lt;Mmap&gt;"]
    end

    subgraph "Synchronized Writes"
        T4["Thread 4 write()"]
        T5["Thread 5 write()"]
        RWLOCK_ACQUIRE["RwLock write lock"]
        FILE_WRITE["Exclusive file access"]
        INDEX_UPDATE["Update DashMap"]
        ATOMIC_UPDATE["AtomicU64 tail_offset"]
    end

    subgraph "Memory Map Updates"
        REMAP["Remap after write"]
        MUTEX_LOCK["Mutex&lt;Arc&lt;Mmap&gt;&gt;"]
        NEW_ARC["Create new Arc&lt;Mmap&gt;"]
    end

    T1 --> DM_READ
    T2 --> DM_READ
    T3 --> DM_READ
    DM_READ --> MMAP_SHARE

    T4 --> RWLOCK_ACQUIRE
    T5 --> RWLOCK_ACQUIRE
    RWLOCK_ACQUIRE --> FILE_WRITE
    FILE_WRITE --> INDEX_UPDATE
    FILE_WRITE --> ATOMIC_UPDATE
    FILE_WRITE --> REMAP

    REMAP --> MUTEX_LOCK
    MUTEX_LOCK --> NEW_ARC
    NEW_ARC --> MMAP_SHARE
```

Synchronization Primitives

| Primitive | Protects | Access Pattern |
|---|---|---|
| RwLock<File> | File handle | Exclusive writes, no lock for reads |
| Mutex<Arc<Mmap>> | Memory mapping | Locked during remap, readers get Arc clone |
| DashMap<u64, u64> | Key index | Lock-free concurrent reads |
| AtomicU64 | Tail offset | Lock-free updates |

Thread Safety Guarantees

Within a single process:

  • ✅ Multiple concurrent reads (zero-copy, lock-free)
  • ✅ Serialized writes (RwLock ensures ordering)
  • ✅ Consistent index updates (DashMap internal locks)
  • ✅ Safe memory mapping (Arc reference counting)

Across multiple processes:

  • ❌ No cross-process coordination
  • ❌ Requires external file locking (e.g., flock)

Sources: README.md:170-206
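The snapshot-sharing pattern behind `Mutex<Arc<Mmap>>` can be demonstrated with std types alone: readers hold the mutex only long enough to clone the `Arc`, then read the snapshot with no lock held, while a writer publishes a new snapshot without disturbing in-flight readers. A sketch of the pattern, not the crate's code:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Stand-in for Mutex<Arc<Mmap>>: an Arc<Vec<u8>> snapshot behind a Mutex.
    let mmap: Arc<Mutex<Arc<Vec<u8>>>> =
        Arc::new(Mutex::new(Arc::new(vec![1u8, 2, 3, 4])));

    let mut handles = Vec::new();
    for _ in 0..4 {
        let mmap = Arc::clone(&mmap);
        handles.push(thread::spawn(move || {
            // Brief lock: just a cheap refcount bump, then the lock is released.
            let snapshot = mmap.lock().unwrap().clone();
            // All actual reading happens lock-free on the snapshot.
            snapshot.iter().map(|&b| b as u64).sum::<u64>()
        }));
    }

    // A "remap" publishes a new snapshot; readers still holding the old
    // Arc keep a valid view (Arc reference counting frees it later).
    *mmap.lock().unwrap() = Arc::new(vec![1u8, 2, 3, 4, 5]);

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    // Each reader saw either the old snapshot (sum 10) or the new one (sum 15).
    assert!(total >= 40 && total <= 60);
    println!("concurrent read pattern ok");
}
```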


Performance Optimizations

The storage engine incorporates several performance optimizations for high-throughput workloads.

SIMD Acceleration

Diagram: SIMD Operations in Write Path

```mermaid
graph LR
    subgraph "Write Path SIMD"
        PAYLOAD["Payload bytes"]
        SIMD_COPY["simd_copy()"]
        AVX2["x86_64: AVX2\n256-bit vectors"]
        NEON["ARM: NEON\n128-bit vectors"]
        BUFFER["Aligned buffer"]
    end

    subgraph "Hash Path SIMD"
        KEY["Key bytes"]
        XXH3_SIMD["xxh3_64 SIMD"]
        SSE2["x86_64: SSE2/AVX2"]
        NEON2["ARM: NEON"]
        HASH_OUT["64-bit hash"]
    end

    PAYLOAD --> SIMD_COPY
    SIMD_COPY --> AVX2
    SIMD_COPY --> NEON
    AVX2 --> BUFFER
    NEON --> BUFFER

    KEY --> XXH3_SIMD
    XXH3_SIMD --> SSE2
    XXH3_SIMD --> NEON2
    SSE2 --> HASH_OUT
    NEON2 --> HASH_OUT
```

Optimization Features

| Feature | Implementation | Benefit |
|---|---|---|
| SIMD Copy | simd_copy() with AVX2/NEON | Faster memory operations |
| Cache Alignment | 64-byte payload boundaries | Optimal cache-line usage |
| Zero-Copy Reads | mmap + EntryHandle | No deserialization overhead |
| Lock-Free Index | DashMap for reads | Concurrent read scaling |
| Atomic Tracking | AtomicU64 tail offset | No lock contention for offset |
| Sequential Writes | Append-only design | Optimal disk I/O patterns |

Alignment Benefits

The 64-byte PAYLOAD_ALIGNMENT constant ensures:

  1. Cache efficiency : Payloads align with CPU cache lines
  2. SIMD compatibility : Vector loads don't cross boundaries
  3. Predictable performance : Consistent access patterns
  4. Type casting safety : Can reinterpret as typed slices

Sources: README.md:51-59 README.md:248-256
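The "type casting safety" benefit can be seen with std's `slice::align_to`: a 64-byte-aligned payload reinterprets cleanly as a typed slice with no unaligned head or tail. The sketch below fakes the alignment guarantee with `#[repr(align(64))]`; in the engine it comes from the prepad bytes.

```rust
/// A 64-byte-aligned buffer, mimicking the guarantee PAYLOAD_ALIGNMENT
/// gives every payload start (illustrative sketch only).
#[repr(C, align(64))]
struct AlignedBuf([u8; 64]);

fn main() {
    let buf = AlignedBuf([7u8; 64]);
    let bytes: &[u8] = &buf.0;

    // The payload starts on a cache-line boundary...
    assert_eq!(bytes.as_ptr() as usize % 64, 0);

    // ...so it reinterprets as a typed slice with no leading or
    // trailing remainder (in practice, align_to on aligned data
    // yields an empty head and tail).
    let (head, words, tail) = unsafe { bytes.align_to::<u64>() };
    assert!(head.is_empty() && tail.is_empty());
    assert_eq!(words.len(), 8); // 64 bytes = 8 u64 words
    assert_eq!(words[0], u64::from_ne_bytes([7; 8]));
    println!("aligned reinterpretation ok");
}
```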


Operation Modes

The storage engine supports multiple operation modes optimized for different use cases.

Write Modes

| Mode | Method | Use Case | Characteristics |
|---|---|---|---|
| Single | write(key, value) | Individual entries | Immediate flush, atomic |
| Batch | batch_write(entries) | Multiple entries | Single lock, batch flush |
| Streaming | write_stream(key, reader) | Large payloads | No full memory allocation |

Read Modes

| Mode | Method | Use Case | Characteristics |
|---|---|---|---|
| Direct | read(key) | Single entry | Zero-copy EntryHandle |
| Batch | batch_read(keys) | Multiple entries | Independent lookups |
| Iteration | iter_entries() | Full scan | Follows chain backward |
| Parallel | par_iter_entries() | Bulk processing | Rayon-powered (optional) |
| Streaming | EntryStream | Large entries | Buffered, incremental |
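The streaming modes avoid materializing a large entry in memory by moving data through a small fixed buffer. A std-only model of that incremental-read pattern (an `EntryStream`-style reader over `std::io::Read`; not the crate's actual type):

```rust
use std::io::{Cursor, Read};

/// Consume a large payload incrementally through a small fixed buffer,
/// the way a buffered streaming reader avoids holding the whole entry.
fn stream_total_len<R: Read>(mut reader: R) -> std::io::Result<u64> {
    let mut buf = [0u8; 8]; // deliberately tiny chunk for the example
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        total += n as u64; // process each chunk here instead of buffering all
    }
    Ok(total)
}

fn main() {
    let payload = vec![0xAB; 100]; // stands in for a large entry
    let total = stream_total_len(Cursor::new(&payload)).unwrap();
    assert_eq!(total, 100); // consumed in 8-byte chunks, never held whole
    println!("streaming read ok");
}
```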

Sources: README.md:208-246 src/lib.rs:20-115


Summary

The core storage engine provides a high-performance, append-only key-value store built on four main components:

  1. DataStore : Public API implementing reader and writer traits
  2. KeyIndexer : O(1) hash-based index with collision detection
  3. Arc<Mmap> : Zero-copy memory-mapped file access
  4. EntryHandle : View abstraction for payload data

Key architectural decisions:

  • Append-only : Simplifies concurrency and crash recovery
  • Single-file : Easy deployment and management
  • 64-byte alignment : Optimizes cache and SIMD performance
  • Backward chain : Enables fast validation and iteration
  • Zero-copy reads : Eliminates deserialization overhead
  • Lock-free index : Scales read throughput across threads

For implementation details on specific subsystems, refer to the child pages listed at the beginning of this document.

Sources: README.md:1-50 src/lib.rs:1-28 src/storage_engine.rs:1-25
