This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Storage Architecture
Loading…
Storage Architecture
Relevant source files
- README.md
- simd-r-drive-entry-handle/Cargo.toml
- simd-r-drive-entry-handle/src/entry_metadata.rs
- src/storage_engine/data_store.rs
Purpose and Scope
This document describes the on-disk storage format used by SIMD R Drive, including the physical layout of entries, the alignment strategy, the backward-linked chain structure, and the recovery mechanism that ensures data integrity after crashes or incomplete writes.
For information about the in-memory data structures and API, see DataStore API. For details about entry metadata fields, see Entry Structure and Metadata. For information about memory-mapped access patterns, see Memory Management and Zero-Copy Access.
Single-File Storage Container
SIMD R Drive stores all data in a single binary file with an append-only design. The storage engine writes sequentially to minimize disk seeks and maximize throughput. Each write operation appends a new entry to the end of the file, and the file position is tracked using the AtomicU64 tail_offset field in DataStore.
Sources: README.md:61-97 src/storage_engine/data_store.rs:27-33
graph LR
FILE["Single Binary File\n(*.simd-r-drive)"]
ENTRY1["Entry 1\n(Pre-pad + Payload + Metadata)"]
ENTRY2["Entry 2\n(Pre-pad + Payload + Metadata)"]
ENTRY3["Entry 3\n(Pre-pad + Payload + Metadata)"]
TAIL["tail_offset\n(AtomicU64)"]
FILE --> ENTRY1
ENTRY1 --> ENTRY2
ENTRY2 --> ENTRY3
ENTRY3 --> TAIL
Entry Layout
Each entry in the storage file consists of three components: optional pre-padding for alignment, the payload data, and metadata. The layout differs between non-tombstone entries (data) and tombstone entries (deletion markers).
Non-Tombstone Entry Structure
Non-tombstone entries store actual data and are aligned to PAYLOAD_ALIGNMENT (64 bytes by default). The alignment ensures cache-line efficiency and enables zero-copy access for typed slices.
Physical Layout Table:
graph LR
PREV_TAIL["Previous\ntail_offset"]
PREPAD["Pre-Pad\n0-63 bytes\n(zero bytes)"]
PAYLOAD["Payload\nVariable Length\n(actual data)"]
METADATA["EntryMetadata\n20 bytes"]
NEXT_TAIL["New\ntail_offset"]
PREV_TAIL --> PREPAD
PREPAD --> PAYLOAD
PAYLOAD --> METADATA
METADATA --> NEXT_TAIL
| Offset Range | Field | Size (Bytes) | Description |
|---|---|---|---|
P .. P+pad | Pre-Pad | pad | Zero bytes calculated as (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1) |
P+pad .. N | Payload | N-(P+pad) | Variable-length data, starts at aligned boundary |
N .. N+8 | key_hash | 8 | 64-bit XXH3 hash of the key |
N+8 .. N+16 | prev_offset | 8 | Absolute file offset of previous entry’s tail |
N+16 .. N+20 | checksum | 4 | CRC32C checksum of the payload |
Where:
pad = DataStore::prepad_len(prev_tail)computed at write timePAYLOAD_ALIGNMENT = 64(defined insimd-r-drive-entry-handle/src/constants.rs)- Next entry starts at
N + 20
Sources: README.md:112-125 simd-r-drive-entry-handle/src/entry_metadata.rs:11-23 src/storage_engine/data_store.rs:666-673
Entry Metadata Structure
The EntryMetadata struct stores three critical fields in exactly 20 bytes:
Field Purposes:
| Field | Type | Purpose |
|---|---|---|
key_hash | u64 | XXH3 hash of the key for index lookups |
prev_offset | u64 | Absolute file offset pointing to the previous entry’s tail (forms backward chain) |
checksum | [u8; 4] | CRC32C checksum of payload for integrity verification |
The metadata is serialized using little-endian encoding via EntryMetadata::serialize() and deserialized via EntryMetadata::deserialize().
Sources: simd-r-drive-entry-handle/src/entry_metadata.rs:44-113
Tombstone Entry Structure
Tombstone entries mark deleted keys. They consist of a single null byte (NULL_BYTE[0] = 0x00) followed by metadata, with no pre-padding.
Physical Layout Table:
graph LR
PREV_TAIL["Previous\ntail_offset"]
NULL["NULL_BYTE\n1 byte\n(0x00)"]
METADATA["EntryMetadata\n20 bytes"]
NEXT_TAIL["New\ntail_offset"]
PREV_TAIL --> NULL
NULL --> METADATA
METADATA --> NEXT_TAIL
| Offset Range | Field | Size (Bytes) | Description |
|---|---|---|---|
T .. T+1 | Payload | 1 | Single byte 0x00 (NULL_BYTE) |
T+1 .. T+9 | key_hash | 8 | Hash of the deleted key |
T+9 .. T+17 | prev_offset | 8 | Previous entry’s tail offset |
T+17 .. T+21 | checksum | 4 | CRC32C of the null byte |
Tombstones are written using DataStore::batch_write_with_key_hashes() with the allow_null_bytes parameter set to true. The deletion logic filters existing keys before writing tombstones to avoid unnecessary I/O.
Sources: README.md:126-132 src/storage_engine/data_store.rs:863-897 src/storage_engine/data_store.rs:990-1024
Alignment Strategy
The pre-padding mechanism ensures that every non-tombstone payload starts at a 64-byte aligned boundary. This alignment is critical for:
- Cache-line efficiency : Payloads align with CPU cache lines (typically 64 bytes)
- SIMD operations : Vectorized loads/stores can operate without crossing boundaries
- Zero-copy typed access : Enables safe reinterpretation as typed slices (e.g.,
&[u64])
Alignment Calculation
The DataStore::prepad_len() function implements this calculation:
fn prepad_len(offset: u64) -> usize {
let a = PAYLOAD_ALIGNMENT;
((a - (offset % a)) & (a - 1)) as usize
}
During writes, the code checks if pre-padding is needed and writes zero bytes before the payload:
- At src/storage_engine/data_store.rs:765-771: Streaming write path
- At src/storage_engine/data_store.rs:909-914: Batch write path
Sources: src/storage_engine/data_store.rs:666-673 README.md:52-59
Backward-Linked Chain Structure
Each entry’s prev_offset field in EntryMetadata points to the absolute file offset of the previous entry’s tail, forming a backward-linked chain. This chain enables:
- Iteration : Walking entries from end to beginning
- Recovery : Validating chain integrity
- Alignment derivation : Computing payload start from previous tail
Chain Traversal During Reads
When reading an entry from the index:
- Look up
key_hashinKeyIndexerto get metadata offset - Read
EntryMetadataat that offset - Extract
prev_offset(previous tail) - Calculate payload start:
prev_offset + DataStore::prepad_len(prev_offset) - Payload ends at metadata offset
This logic is implemented in DataStore::read_entry_with_context() at src/storage_engine/data_store.rs:501-565
Chain Traversal Diagram:
Sources: src/storage_engine/data_store.rs:501-565 simd-r-drive-entry-handle/src/entry_metadata.rs:40-43
graph TD
START["Start at file_len"]
CHECK_SIZE{"file_len <\nMETADATA_SIZE?"}
RETURN_ZERO["Return Ok(0)"]
INIT["cursor = file_len\nbest_valid_offset = None"]
LOOP{"cursor >=\nMETADATA_SIZE?"}
READ_META["Read metadata at\n(cursor - METADATA_SIZE)"]
EXTRACT["Extract prev_offset\nDerive entry_start"]
VALIDATE{"entry_start <\nmetadata_offset?"}
WALK["Walk chain backward\nvia prev_offset"]
CHAIN_VALID{"Chain reaches\noffset 0?"}
SET_BEST["best_valid_offset =\ncursor"]
BREAK["Break loop"]
DECREMENT["cursor -= 1"]
RETURN_BEST["Return\nbest_valid_offset\nor 0"]
START --> CHECK_SIZE
CHECK_SIZE -->|Yes| RETURN_ZERO
CHECK_SIZE -->|No| INIT
INIT --> LOOP
LOOP -->|Yes| READ_META
LOOP -->|No| RETURN_BEST
READ_META --> EXTRACT
EXTRACT --> VALIDATE
VALIDATE -->|No| DECREMENT
VALIDATE -->|Yes| WALK
WALK --> CHAIN_VALID
CHAIN_VALID -->|Yes| SET_BEST
CHAIN_VALID -->|No| DECREMENT
SET_BEST --> BREAK
BREAK --> RETURN_BEST
DECREMENT --> LOOP
Recovery Mechanism
The DataStore::recover_valid_chain() function validates chain integrity when opening a file. It scans backward from the file end to find the deepest valid chain that reaches offset 0, automatically recovering from incomplete writes.
Recovery Algorithm
Recovery Process Steps
- Initial Check : If file is smaller than
METADATA_SIZE(20 bytes), return offset 0 - Backward Scan : Start from
file_lenand scan backward by 1 byte at a time - Metadata Read : At each position, attempt to read metadata at
cursor - METADATA_SIZE - Entry Validation :
- Extract
prev_offsetfrom metadata - Calculate expected entry start using
DataStore::prepad_len(prev_offset) - Handle tombstone special case (single null byte without pre-pad)
- Verify
entry_start < metadata_offset
- Extract
- Chain Walk : For valid entry, walk entire chain backward:
- Follow
prev_offsetlinks - Validate each link points to earlier offset
- Track total chain size
- Stop when
prev_offset = 0(chain start)
- Follow
- Validation : Chain is valid if:
- All links point backward (no cycles)
- Chain reaches offset 0
- Total chain size ≤ file length
- Result : Return the first valid chain found (deepest chain)
Tombstone Handling in Recovery
Tombstones have special handling during recovery because they lack pre-padding:
if entry_end > prev_tail
&& entry_end - prev_tail == 1
&& mmap[prev_tail..entry_end] == NULL_BYTE
{
entry_start = prev_tail // No pre-pad for tombstone
} else {
entry_start = prev_tail + prepad_len(prev_tail)
}
This logic appears at:
- src/storage_engine/data_store.rs:404-416: Recovery path
- src/storage_engine/data_store.rs:447-454: Chain walking
Sources: src/storage_engine/data_store.rs:363-482 README.md:139-148
Recovery on File Open
The DataStore::open() function performs recovery automatically:
If recovery detects corruption (incomplete chain), the file is truncated to the last valid offset and reopened. This ensures the storage is always in a consistent state.
Sources: src/storage_engine/data_store.rs:84-117 README.md:150-156
Entry Size Calculation
Each entry’s total file size includes pre-padding, payload, and metadata. The calculation depends on entry type:
Non-Tombstone Entry:
total_size = prepad_len(prev_tail) + payload.len() + METADATA_SIZE
Tombstone Entry:
total_size = 1 + METADATA_SIZE // No pre-pad
The EntryHandle::file_size() method computes this from the entry’s range and metadata. For iteration and compaction, this allows precise tracking of storage usage.
Sources: README.md:112-132 src/storage_engine/data_store.rs:705-749
File Growth and Tail Tracking
The tail_offset field tracks the current end of valid data:
The atomic store uses Ordering::Release to ensure visibility across threads. Writers acquire tail_offset with Ordering::Acquire before computing pre-padding.
Sources: src/storage_engine/data_store.rs256 src/storage_engine/data_store.rs763 src/storage_engine/data_store.rs858
Dismiss
Refresh this wiki
Enter email to refresh