This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Core Storage Engine
Purpose and Scope
This document provides an overview of the core storage engine architecture in SIMD R Drive. It covers the fundamental design principles, main components, and data flow patterns that enable high-performance, append-only binary storage.
For detailed information about specific aspects of the storage engine, see this document's child pages on the individual subsystems.
Sources: README.md:1-50 src/lib.rs:1-28
System Architecture
The core storage engine is implemented as a single-file, append-only key-value store. It consists of four primary components that work together to provide high-performance binary storage with zero-copy read access.
Diagram: Core Storage Engine Components
```mermaid
graph TB
subgraph "Public API Layer"
DS["DataStore"]
DSR["DataStoreReader trait"]
DSW["DataStoreWriter trait"]
end
subgraph "Indexing Layer"
KI["KeyIndexer"]
DASHMAP["DashMap<u64, u64>"]
XXH3["xxh3_64 hasher"]
end
subgraph "Storage Layer"
MMAP["Arc<Mmap>"]
FILE["Single binary file"]
RWLOCK["RwLock<File>"]
end
subgraph "Access Layer"
EH["EntryHandle"]
EI["EntryIterator"]
ES["EntryStream"]
end
DS --> DSR
DS --> DSW
DS --> KI
DS --> MMAP
DS --> RWLOCK
KI --> DASHMAP
KI --> XXH3
DSR --> EH
DSR --> EI
DSW --> RWLOCK
EH --> MMAP
EI --> MMAP
ES --> EH
MMAP --> FILE
RWLOCK --> FILE
```
The diagram shows the relationship between the main code entities:
| Component | Code Entity | Purpose |
|---|---|---|
| Public API | DataStore | Main interface for storage operations |
| Traits | DataStoreReader, DataStoreWriter | Separate read/write capabilities |
| Indexing | KeyIndexer | Manages in-memory hash index |
| Hash Map | DashMap<u64, u64> | Lock-free concurrent hash map |
| Hashing | xxh3_64 | SIMD-accelerated hash function |
| Memory Mapping | Arc<Mmap> | Shared memory-mapped file reference |
| File Access | RwLock<File> | Synchronized write access |
| Entry Access | EntryHandle | Zero-copy view into mapped memory |
| Iteration | EntryIterator | Backward chain traversal |
| Streaming | EntryStream | Buffered reading for large entries |
Sources: src/storage_engine.rs:1-25 src/lib.rs:129-136 README.md:5-11
Append-Only Design
The storage engine follows a strict append-only model where data is never modified or deleted in place. All operations result in new entries being appended to the end of the file.
Diagram: Write Path Data Flow
```mermaid
graph LR
subgraph "Write Operations"
W["write()"]
BW["batch_write()"]
WS["write_stream()"]
DEL["delete()"]
end
subgraph "Internal Write Path"
HASH["Calculate xxh3_64 hash"]
ALIGN["Calculate prepad for\n64-byte alignment"]
COPY["simd_copy payload"]
APPEND["Append to file"]
META["Write metadata:\nkey_hash, prev_offset, crc32"]
end
subgraph "File State"
TAIL["AtomicU64 tail_offset"]
CHAIN["Backward-linked chain"]
end
W --> HASH
BW --> HASH
WS --> HASH
DEL --> HASH
HASH --> ALIGN
ALIGN --> COPY
COPY --> APPEND
APPEND --> META
META --> TAIL
META --> CHAIN
```
Key Characteristics
| Characteristic | Implementation |
|---|---|
| Immutability | Entries are never modified after writing |
| Overwrites | New entries with same key supersede old ones |
| Deletions | Append tombstone marker (single 0x00 byte) |
| File Growth | File grows monotonically until compaction |
| Ordering | Maintains temporal order via prev_offset chain |
| Recovery | Incomplete writes detected via chain validation |
The append-only design provides several benefits:
- Crash safety : Incomplete writes can be detected and discarded
- Simplified concurrency : No in-place modifications to coordinate
- Time-travel : Historical data remains until compaction
- Write performance : Sequential I/O with no seek overhead
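As a concrete illustration, the sketch below exercises these semantics with the write, delete, and read methods named in this document; the constructor, parameter shapes, and error handling are assumptions, not the crate's confirmed API.

```rust
// Minimal sketch of append-only semantics, assuming a constructor like
// DataStore::open and io::Result-style returns (both hypothetical).
use std::path::PathBuf;
use simd_r_drive::{DataStore, DataStoreReader, DataStoreWriter};

fn append_only_demo(path: PathBuf) -> std::io::Result<()> {
    let store = DataStore::open(&path)?;   // hypothetical constructor
    store.write(b"user:1234", b"v1")?;     // entry appended at the tail
    store.write(b"user:1234", b"v2")?;     // same key: new entry supersedes v1
    store.delete(b"user:1234")?;           // appends a 0x00 tombstone; nothing is erased
    assert!(store.read(b"user:1234").is_none()); // index now resolves to the tombstone
    Ok(())
}
```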
Sources: README.md:98-147 src/lib.rs:3-17
Single-File Storage Container
All data is stored in a single binary file with a specific structure designed for efficient access and validation.
Diagram: File Organization and Backward-Linked Chain
```mermaid
graph TB
subgraph "File Structure"
START["File Start: offset 0"]
E1["Entry 1\nprepad + payload + metadata"]
E2["Entry 2\nprepad + payload + metadata"]
E3["Entry 3\nprepad + payload + metadata"]
EN["Entry N\nprepad + payload + metadata"]
TAIL["tail_offset\nEnd of valid data"]
end
subgraph "Metadata Chain"
M1["metadata.prev_offset = 0"]
M2["metadata.prev_offset = end(E1)"]
M3["metadata.prev_offset = end(E2)"]
MN["metadata.prev_offset = end(E3)"]
end
START --> E1
E1 --> E2
E2 --> E3
E3 --> EN
EN --> TAIL
E1 -.-> M1
E2 -.-> M2
E3 -.-> M3
EN -.-> MN
MN -.backward chain.-> M3
M3 -.backward chain.-> M2
M2 -.backward chain.-> M1
```
Storage Properties
| Property | Value | Purpose |
|---|---|---|
| File Type | Single binary file | Simplified management and deployment |
| Entry Alignment | 64-byte boundaries | Cache-line and SIMD optimization |
| Metadata Size | 20 bytes | key_hash (8) + prev_offset (8) + crc32 (4) |
| Chain Direction | Backward (tail to head) | Fast validation and recovery |
| Maximum Size | 256 TiB | 48-bit offset support |
| Format | Schema-less | Raw binary with no interpretation |
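A minimal sketch of the 20-byte metadata block described in the table; the field order follows the table, but the crate's actual type definition may differ.

```rust
// Illustrative layout of the 20-byte metadata block. `packed` keeps the
// in-memory size at the on-disk 20 bytes (a plain #[repr(C)] struct would
// pad to 24).
#[repr(C, packed)]
struct EntryMetadata {
    key_hash: u64,    // xxh3_64 of the key (8 bytes)
    prev_offset: u64, // where the previous entry ends (8 bytes)
    crc32: u32,       // payload checksum (4 bytes)
}
```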
Entry Composition
Each entry consists of three parts:
- Pre-padding (0-63 bytes): Zero bytes to align payload start to 64-byte boundary
- Payload (variable length): Raw binary data
- Metadata (20 bytes): Hash, previous offset, checksum
Tombstone entries (deletions) use a minimal format:
- 1-byte payload (0x00)
- 20-byte metadata
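The pre-padding length follows directly from the alignment rule above; a small sketch (with an illustrative helper name):

```rust
// Pre-padding needed to align a payload's start to a 64-byte boundary,
// per the entry composition above.
const PAYLOAD_ALIGNMENT: u64 = 64;

fn prepad_len(unaligned_payload_start: u64) -> u64 {
    // 0 when already aligned, otherwise the distance to the next boundary.
    (PAYLOAD_ALIGNMENT - (unaligned_payload_start % PAYLOAD_ALIGNMENT)) % PAYLOAD_ALIGNMENT
}

// e.g. prepad_len(130) == 62, so the payload begins at offset 192.
```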
Sources: README.md:104-138 README.md:61-97
DataStore: Primary Interface
DataStore is the main public interface providing all storage operations. It implements the DataStoreReader and DataStoreWriter traits to separate read and write capabilities.
Diagram: DataStore Structure and Methods
```mermaid
graph TB
subgraph "DataStore struct"
FILE_LOCK["file: RwLock<File>"]
MMAP_LOCK["mmap: Mutex<Arc<Mmap>>"]
INDEXER["indexer: KeyIndexer"]
TAIL["tail_offset: AtomicU64"]
end
subgraph "DataStoreWriter methods"
W1["write(key, value)"]
W2["batch_write(entries)"]
W3["write_stream(key, reader)"]
W4["delete(key)"]
end
subgraph "DataStoreReader methods"
R1["read(key) -> Option<EntryHandle>"]
R2["batch_read(keys)"]
R3["iter_entries() -> EntryIterator"]
R4["contains_key(key) -> bool"]
end
subgraph "Maintenance methods"
M1["compact() -> Stats"]
M2["estimate_compaction_space()"]
M3["verify_file_integrity()"]
end
FILE_LOCK --> W1
FILE_LOCK --> W2
FILE_LOCK --> W3
FILE_LOCK --> W4
MMAP_LOCK --> R1
MMAP_LOCK --> R2
MMAP_LOCK --> R3
INDEXER --> R1
INDEXER --> R2
INDEXER --> R4
TAIL --> W1
TAIL --> W2
TAIL --> W3
```
Core Fields
The DataStore struct maintains four critical fields:
| Field | Type | Purpose |
|---|---|---|
| file | RwLock<File> | Serializes write operations |
| mmap | Mutex<Arc<Mmap>> | Protects memory map updates |
| indexer | KeyIndexer | Provides O(1) key lookups |
| tail_offset | AtomicU64 | Tracks file end without locks |
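Reconstructed from the table above, the struct might look roughly like this; actual field names, types, and visibility in the source may differ:

```rust
// Illustrative shape of the DataStore struct, reconstructed from the
// field table above.
use std::fs::File;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, RwLock};
use dashmap::DashMap;
use memmap2::Mmap;

struct KeyIndexer {
    map: DashMap<u64, u64>, // key hash -> packed (tag | offset)
}

pub struct DataStore {
    file: RwLock<File>,     // serializes write operations
    mmap: Mutex<Arc<Mmap>>, // guards memory-map replacement after writes
    indexer: KeyIndexer,    // O(1) key lookups
    tail_offset: AtomicU64, // end of valid data, updated without locks
}
```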
Sources: src/storage_engine.rs:4-5 src/lib.rs:19-63
Zero-Copy Read Path
Read operations leverage memory-mapped files to provide zero-copy access to stored data without deserialization overhead.
Diagram: Zero-Copy Read Operation Flow
```mermaid
graph LR
subgraph "Read Request"
KEY["Key bytes"]
HASH_KEY["xxh3_64(key)"]
end
subgraph "Index Lookup"
DASHMAP_GET["DashMap.get(hash)"]
PACKED["Packed value:\n16-bit tag + 48-bit offset"]
TAG_CHECK["Verify 16-bit tag"]
end
subgraph "Memory Access"
OFFSET["File offset"]
MMAP_SLICE["Arc<Mmap> slice"]
METADATA_READ["Read EntryMetadata"]
PAYLOAD_RANGE["Calculate payload range"]
end
subgraph "Result"
EH["EntryHandle\n(zero-copy view)"]
end
KEY --> HASH_KEY
HASH_KEY --> DASHMAP_GET
DASHMAP_GET --> PACKED
PACKED --> TAG_CHECK
TAG_CHECK --> OFFSET
OFFSET --> MMAP_SLICE
MMAP_SLICE --> METADATA_READ
METADATA_READ --> PAYLOAD_RANGE
PAYLOAD_RANGE --> EH
```
Read Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Single read | O(1) | Hash index lookup + memory access |
| Batch read | O(n) | Independent lookups, parallelizable |
| Full iteration | O(m) | m = total entries, follows chain |
| Collision handling | O(1) | 16-bit tag check |
EntryHandle
The EntryHandle struct provides a zero-copy view into the memory-mapped file. Methods such as as_slice() and as_bytes(), along with streaming conversion, allow direct access to payload data without copying.
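A minimal sketch of what such a handle can look like, assuming it holds a shared Arc<Mmap> plus a byte range as the read-path diagram suggests:

```rust
// Sketch of a zero-copy handle: the Arc keeps the mapping alive for as
// long as any handle exists, so the borrowed slice stays valid.
use std::ops::Range;
use std::sync::Arc;
use memmap2::Mmap;

pub struct EntryHandle {
    mmap: Arc<Mmap>,      // shared reference to the mapped file
    range: Range<usize>,  // payload byte range within the mapping
}

impl EntryHandle {
    /// Borrow the payload directly from the mapped file: no copy, no parse.
    pub fn as_slice(&self) -> &[u8] {
        &self.mmap[self.range.clone()]
    }
}
```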
Sources: README.md:43-50 README.md:228-231 src/storage_engine/entry_iterator.rs:8-21
Key Indexing System
The KeyIndexer maintains an in-memory hash index for O(1) key lookups using the xxh3_64 hashing algorithm with SIMD acceleration.
Diagram: Key Indexing and Collision Detection
```mermaid
graph TB
subgraph "Key to Hash"
K1["Key: 'user:1234'"]
H1["xxh3_64 hash:\n0xABCDEF0123456789"]
end
subgraph "Hash to Packed Value"
TAG["Extract 16-bit tag:\n0xABCD"]
OFF["Extract file offset:\n0x00EF0123456789"]
PACKED_VAL["Packed 64-bit value:\ntag:16 / offset:48"]
end
subgraph "DashMap Storage"
DM["DashMap<u64, u64>"]
ENTRY["hash -> packed_value"]
end
subgraph "Collision Detection"
LOOKUP["On read: lookup hash"]
VERIFY["Verify 16-bit tag matches"]
RESOLVE["If mismatch: collision detected"]
end
K1 --> H1
H1 --> TAG
H1 --> OFF
TAG --> PACKED_VAL
OFF --> PACKED_VAL
PACKED_VAL --> ENTRY
ENTRY --> DM
DM --> LOOKUP
LOOKUP --> VERIFY
VERIFY --> RESOLVE
```
Index Structure
| Component | Type | Size | Purpose |
|---|---|---|---|
| Hash Map | DashMap<u64, u64> | Dynamic | Lock-free concurrent access |
| Key Hash | u64 | 8 bytes | Full xxh3_64 hash of key |
| Packed Value | u64 | 8 bytes | 16-bit tag + 48-bit offset |
| Tag | u16 | 2 bytes | Collision detection |
| Offset | u48 | 6 bytes | File location (0-256 TiB) |
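A sketch of the tag/offset packing implied by this table; placing the tag in the hash's top 16 bits is an assumption consistent with the collision-detection diagram, and the crate's exact bit layout may differ:

```rust
// 16-bit tag / 48-bit offset packing into a single u64.
const OFFSET_BITS: u32 = 48;
const OFFSET_MASK: u64 = (1u64 << OFFSET_BITS) - 1;

fn pack(key_hash: u64, offset: u64) -> u64 {
    debug_assert!(offset <= OFFSET_MASK); // offsets are limited to 256 TiB
    let tag = key_hash >> OFFSET_BITS;    // reuse the hash's top 16 bits
    (tag << OFFSET_BITS) | offset
}

fn tag_matches(key_hash: u64, packed: u64) -> bool {
    // A mismatch means the slot belongs to a colliding 64-bit hash.
    (key_hash >> OFFSET_BITS) == (packed >> OFFSET_BITS)
}

fn offset_of(packed: u64) -> u64 {
    packed & OFFSET_MASK // low 48 bits locate the entry in the file
}
```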
Hash Algorithm Properties
The xxh3_64 algorithm provides:
- SIMD acceleration : Uses SSE2/AVX2 (x86_64) or NEON (ARM)
- High quality : Low collision probability
- Performance : Optimized for throughput
- Stability : Consistent across platforms
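Hashing a key is a one-liner; the sketch below uses the xxhash-rust crate for illustration (whether the project depends on that exact crate is an assumption):

```rust
// SIMD-accelerated key hashing via the xxhash-rust crate.
use xxhash_rust::xxh3::xxh3_64;

fn hash_key(key: &[u8]) -> u64 {
    xxh3_64(key) // picks a SIMD path (SSE2/AVX2/NEON) where the target supports it
}
```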
Sources: README.md:158-168 src/storage_engine.rs:13-14
Thread Safety Model
The storage engine provides thread-safe concurrent access within a single process using a combination of synchronization primitives.
Diagram: Concurrency Control Mechanisms
```mermaid
graph TB
subgraph "Concurrent Reads"
T1["Thread 1 read()"]
T2["Thread 2 read()"]
T3["Thread 3 read()"]
DM_READ["DashMap lock-free read"]
MMAP_SHARE["Shared Arc<Mmap>"]
end
subgraph "Synchronized Writes"
T4["Thread 4 write()"]
T5["Thread 5 write()"]
RWLOCK_ACQUIRE["RwLock write lock"]
FILE_WRITE["Exclusive file access"]
INDEX_UPDATE["Update DashMap"]
ATOMIC_UPDATE["AtomicU64 tail_offset"]
end
subgraph "Memory Map Updates"
REMAP["Remap after write"]
MUTEX_LOCK["Mutex<Arc<Mmap>>"]
NEW_ARC["Create new Arc<Mmap>"]
end
T1 --> DM_READ
T2 --> DM_READ
T3 --> DM_READ
DM_READ --> MMAP_SHARE
T4 --> RWLOCK_ACQUIRE
T5 --> RWLOCK_ACQUIRE
RWLOCK_ACQUIRE --> FILE_WRITE
FILE_WRITE --> INDEX_UPDATE
FILE_WRITE --> ATOMIC_UPDATE
FILE_WRITE --> REMAP
REMAP --> MUTEX_LOCK
MUTEX_LOCK --> NEW_ARC
NEW_ARC --> MMAP_SHARE
```
Synchronization Primitives
| Primitive | Protects | Access Pattern |
|---|---|---|
| RwLock<File> | File handle | Exclusive writes, no lock for reads |
| Mutex<Arc<Mmap>> | Memory mapping | Locked during remap, readers get Arc clone |
| DashMap<u64, u64> | Key index | Lock-free concurrent reads |
| AtomicU64 | Tail offset | Lock-free updates |
Thread Safety Guarantees
Within single process:
- ✅ Multiple concurrent reads (zero-copy, lock-free)
- ✅ Serialized writes (RwLock ensures ordering)
- ✅ Consistent index updates (DashMap internal locks)
- ✅ Safe memory mapping (Arc reference counting)
Across multiple processes:
- ❌ No cross-process coordination
- ❌ Requires external file locking (e.g., flock)
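The memory-map update pattern from the diagram can be sketched as follows; the function names are illustrative, not the crate's API:

```rust
// Remap-after-write pattern: writers swap in a fresh Arc<Mmap> under a
// Mutex, while readers clone the current Arc and keep reading the old
// mapping safely until they drop it.
use std::sync::{Arc, Mutex};
use memmap2::Mmap;

fn current_map(mmap_slot: &Mutex<Arc<Mmap>>) -> Arc<Mmap> {
    // Readers hold the lock only long enough to clone the Arc.
    Arc::clone(&mmap_slot.lock().unwrap())
}

fn install_new_map(mmap_slot: &Mutex<Arc<Mmap>>, new_map: Mmap) {
    // Writers publish the remapped file; the old mapping is freed only
    // when the last reader drops its Arc.
    *mmap_slot.lock().unwrap() = Arc::new(new_map);
}
```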
Sources: README.md:170-206
Performance Optimizations
The storage engine incorporates several performance optimizations for high-throughput workloads.
SIMD Acceleration
Diagram: SIMD Operations in Write Path
```mermaid
graph LR
subgraph "Write Path SIMD"
PAYLOAD["Payload bytes"]
SIMD_COPY["simd_copy()"]
AVX2["x86_64: AVX2\n256-bit vectors"]
NEON["ARM: NEON\n128-bit vectors"]
BUFFER["Aligned buffer"]
end
subgraph "Hash Path SIMD"
KEY["Key bytes"]
XXH3_SIMD["xxh3_64 SIMD"]
SSE2["x86_64: SSE2/AVX2"]
NEON2["ARM: NEON"]
HASH_OUT["64-bit hash"]
end
PAYLOAD --> SIMD_COPY
SIMD_COPY --> AVX2
SIMD_COPY --> NEON
AVX2 --> BUFFER
NEON --> BUFFER
KEY --> XXH3_SIMD
XXH3_SIMD --> SSE2
XXH3_SIMD --> NEON2
SSE2 --> HASH_OUT
NEON2 --> HASH_OUT
```
Optimization Features
| Feature | Implementation | Benefit |
|---|---|---|
| SIMD Copy | simd_copy() with AVX2/NEON | Faster memory operations |
| Cache Alignment | 64-byte payload boundaries | Optimal cache-line usage |
| Zero-Copy Reads | mmap + EntryHandle | No deserialization overhead |
| Lock-Free Index | DashMap for reads | Concurrent read scaling |
| Atomic Tracking | AtomicU64 tail offset | No lock contention for offset |
| Sequential Writes | Append-only design | Optimal disk I/O patterns |
Alignment Benefits
The 64-byte PAYLOAD_ALIGNMENT constant ensures:
- Cache efficiency : Payloads align with CPU cache lines
- SIMD compatibility : Vector loads don’t cross boundaries
- Predictable performance : Consistent access patterns
- Type casting safety : Can reinterpret as typed slices
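As a sketch of the last point, a 64-byte-aligned payload can be reinterpreted as a typed slice; bytemuck is used here for the checked cast, and the crate may expose its own helper instead:

```rust
// Reinterpreting an aligned payload as a typed slice, one payoff of the
// 64-byte alignment guarantee.
fn as_f32_slice(payload: &[u8]) -> &[f32] {
    // cast_slice verifies alignment and size; 64-byte-aligned payloads
    // satisfy f32's 4-byte alignment requirement by construction.
    bytemuck::cast_slice(payload)
}
```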
Sources: README.md:51-59 README.md:248-256
Operation Modes
The storage engine supports multiple operation modes optimized for different use cases.
Write Modes
| Mode | Method | Use Case | Characteristics |
|---|---|---|---|
| Single | write(key, value) | Individual entries | Immediate flush, atomic |
| Batch | batch_write(entries) | Multiple entries | Single lock, batch flush |
| Streaming | write_stream(key, reader) | Large payloads | No full memory allocation |
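A usage sketch for the three write modes, borrowing the method names from the table; parameter shapes and return types are illustrative:

```rust
// Single, batch, and streaming writes against a DataStore.
use std::io::Cursor;
use simd_r_drive::{DataStore, DataStoreWriter};

fn write_modes(store: &DataStore) -> std::io::Result<()> {
    // Single: one entry, flushed immediately.
    store.write(b"config", b"v1")?;
    // Batch: one lock acquisition and one flush for the whole set.
    store.batch_write(&[(b"a".as_slice(), b"1".as_slice()),
                        (b"b".as_slice(), b"2".as_slice())])?;
    // Streaming: payload is pulled from any Read impl, never fully buffered.
    store.write_stream(b"blob", &mut Cursor::new(vec![0u8; 1 << 20]))?;
    Ok(())
}
```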
Read Modes
| Mode | Method | Use Case | Characteristics |
|---|---|---|---|
| Direct | read(key) | Single entry | Zero-copy EntryHandle |
| Batch | batch_read(keys) | Multiple entries | Independent lookups |
| Iteration | iter_entries() | Full scan | Follows chain backward |
| Parallel | par_iter_entries() | Bulk processing | Rayon-powered (optional) |
| Streaming | EntryStream | Large entries | Buffered, incremental |
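A corresponding sketch for the direct and iteration read modes, assuming the trait methods and return types shown above; exact signatures may differ:

```rust
// Direct zero-copy read and full backward-chain iteration.
use simd_r_drive::{DataStore, DataStoreReader};

fn read_modes(store: &DataStore) {
    // Direct: zero-copy handle into the memory-mapped file.
    if let Some(entry) = store.read(b"config") {
        println!("payload: {} bytes", entry.as_slice().len());
    }
    // Iteration: walks the prev_offset chain from the tail backward.
    for entry in store.iter_entries() {
        let _bytes = entry.as_slice();
    }
}
```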
Sources: README.md:208-246 src/lib.rs:20-115
Summary
The core storage engine provides a high-performance, append-only key-value store built on four main components:
- DataStore : Public API implementing reader and writer traits
- KeyIndexer : O(1) hash-based index with collision detection
- Arc<Mmap> : Zero-copy memory-mapped file access
- EntryHandle : View abstraction for payload data
Key architectural decisions:
- Append-only : Simplifies concurrency and crash recovery
- Single-file : Easy deployment and management
- 64-byte alignment : Optimizes cache and SIMD performance
- Backward chain : Enables fast validation and iteration
- Zero-copy reads : Eliminates deserialization overhead
- Lock-free index : Scales read throughput across threads
For implementation details on specific subsystems, refer to this document's child pages.
Sources: README.md:1-50 src/lib.rs:1-28 src/storage_engine.rs:1-25