This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Entry Structure and Metadata
Loading…
Entry Structure and Metadata
Relevant source files
- README.md
- simd-r-drive-entry-handle/Cargo.toml
- simd-r-drive-entry-handle/src/entry_handle.rs
- simd-r-drive-entry-handle/src/entry_metadata.rs
Purpose and Scope
This document details the on-disk binary layout of entries in the SIMD R Drive storage engine. It covers the structure of aligned entries, tombstones, metadata fields, and the alignment strategy that enables zero-copy access.
For information about how entries are read and accessed in memory, see Memory Management and Zero-Copy Access. For details on the validation chain and recovery mechanisms, see Storage Architecture.
On-Disk Entry Layout Overview
Every entry written to the storage file consists of three components:
- Pre-Pad Bytes (optional, 0-63 bytes) - Zero bytes inserted to ensure the payload starts at a 64-byte boundary
- Payload - Variable-length binary data
- Metadata - Fixed 20-byte structure containing key hash, previous offset, and checksum
The exception is tombstones (deletion markers), which use a minimal 1-byte payload with no pre-padding.
Sources: README.md:104-137 simd-r-drive-entry-handle/src/entry_metadata.rs:9-37
Aligned Entry Structure
Entry Layout Table
| Offset Range | Field | Size (Bytes) | Description |
|---|---|---|---|
P .. P+pad | Pre-Pad (optional) | pad | Zero bytes to align payload start |
P+pad .. N | Payload | N-(P+pad) | Variable-length data |
N .. N+8 | Key Hash | 8 | 64-bit XXH3 key hash |
N+8 .. N+16 | Prev Offset | 8 | Absolute offset of previous tail |
N+16 .. N+20 | Checksum | 4 | CRC32C of payload |
Where:
pad = (A - (prev_tail % A)) & (A - 1), withA = PAYLOAD_ALIGNMENT(64 bytes)- The next entry starts at offset
N + 20
Aligned Entry Structure Diagram
Sources: README.md:112-137 simd-r-drive-entry-handle/src/entry_metadata.rs:11-23
Tombstone Structure
Tombstones are special deletion markers that do not require payload alignment. They consist of a single zero byte followed by the standard 20-byte metadata structure.
Tombstone Layout Table
| Offset Range | Field | Size (Bytes) | Description |
|---|---|---|---|
T .. T+1 | Payload | 1 | Single byte 0x00 |
T+1 .. T+21 | Metadata | 20 | Key hash, prev, crc32c |
Tombstone Structure Diagram
Sources: README.md:126-131 simd-r-drive-entry-handle/src/entry_metadata.rs:25-30
EntryMetadata Structure
The EntryMetadata struct represents the fixed 20-byte metadata block that follows every payload. It is defined in #[repr(C)] layout to ensure consistent binary representation.
graph TB
subgraph EntryMetadataStruct["EntryMetadata struct"]
field1["key_hash: u64\n8 bytes\nXXH3_64 hash"]
field2["prev_offset: u64\n8 bytes\nbackward chain link"]
field3["checksum: [u8; 4]\n4 bytes\nCRC32C payload checksum"]
end
field1 --> field2
field2 --> field3
note4["Serialized at offset N\nfollowing payload"]
note5["Total: METADATA_SIZE = 20"]
field1 -.-> note4
field3 -.-> note5
Metadata Fields
Field Descriptions
key_hash: u64 (8 bytes, offset N .. N+8)
- 64-bit XXH3 hash of the key
- Used by
KeyIndexerfor O(1) lookups - Combined with a tag for collision detection
- Hardware-accelerated via SSE2/AVX2/NEON
prev_offset: u64 (8 bytes, offset N+8 .. N+16)
- Absolute file offset of the previous entry for this key
- Forms a backward-linked chain for version history
- Set to
0for the first entry of a key - Used during chain validation and recovery
checksum: [u8; 4] (4 bytes, offset N+16 .. N+20)
- CRC32C checksum of the payload
- Provides fast integrity verification
- Not cryptographically secure
- Used during recovery to detect corruption
Serialization and Deserialization
The EntryMetadata struct provides methods for converting to/from bytes:
serialize() -> [u8; 20]- Converts metadata to byte array using little-endian encodingdeserialize(data:&[u8]) -> Self- Reconstructs metadata from byte slice
Sources: simd-r-drive-entry-handle/src/entry_metadata.rs:44-113 README.md:114-120
Pre-Padding and Alignment Strategy
Alignment Purpose
All non-tombstone payloads start at a 64-byte aligned address. This alignment ensures:
- Cache-line efficiency - Matches typical CPU cache line size
- SIMD optimization - Enables full-speed AVX2/AVX-512/NEON operations
- Zero-copy typed views - Allows safe reinterpretation as typed slices (
&[u16],&[u32], etc.)
graph TD
Start["Calculate padding needed"]
GetPrevTail["prev_tail = last written offset"]
CalcPad["pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT))\n& (PAYLOAD_ALIGNMENT - 1)"]
CheckPad{"pad > 0?"}
WritePad["Write pad zero bytes"]
WritePayload["Write payload at aligned offset"]
Start --> GetPrevTail
GetPrevTail --> CalcPad
CalcPad --> CheckPad
CheckPad -->|Yes| WritePad
CheckPad -->|No| WritePayload
WritePad --> WritePayload
The alignment is configured via PAYLOAD_ALIGNMENT constant (64 bytes as of version 0.15.0).
Pre-Padding Calculation
The formula pad = (A - (prev_tail % A)) & (A - 1) where A = PAYLOAD_ALIGNMENT ensures:
- If
prev_tailis already aligned,pad = 0 - Otherwise,
padequals the bytes needed to reach the next aligned boundary - Maximum padding is
A - 1bytes (63 bytes for 64-byte alignment)
Constants
The alignment is defined in simd-r-drive-entry-handle/src/constants.rs:1-20:
| Constant | Value | Description |
|---|---|---|
PAYLOAD_ALIGN_LOG2 | 6 | Log₂ of alignment (2⁶ = 64) |
PAYLOAD_ALIGNMENT | 64 | Actual alignment boundary in bytes |
METADATA_SIZE | 20 | Fixed size of metadata block |
Sources: README.md:51-59 simd-r-drive-entry-handle/src/entry_metadata.rs:22-23 CHANGELOG.md:25-51
Backward Chain Formation
Chain Structure
Each entry’s prev_offset field creates a backward-linked chain that tracks the version history for a given key. This chain is essential for:
- Recovery and validation on file open
- Detecting incomplete writes
- Rebuilding the index
Chain Properties
- Most recent entry is at the end of the file (highest offset)
- Chain traversal moves backward from tail toward offset 0
- First entry for a key has
prev_offset = 0 - Valid chain can be walked all the way back to byte 0 without gaps
- Broken chain indicates corruption or incomplete write
Usage in Recovery
During file open, the system:
- Scans backward from EOF reading metadata
- Follows
prev_offsetlinks to validate chain continuity - Verifies checksums at each step
- Truncates file if corruption is detected
- Scans forward to rebuild the index
Sources: README.md:139-147 simd-r-drive-entry-handle/src/entry_metadata.rs:41-43
Entry Type Comparison
Aligned Entry vs. Tombstone
| Aspect | Aligned Entry (Non-Tombstone) | Tombstone (Deletion Marker) |
|---|---|---|
| Pre-padding | 0-63 bytes (alignment dependent) | None |
| Payload size | Variable (user-defined) | Fixed 1 byte (0x00) |
| Payload alignment | 64-byte boundary | No alignment requirement |
| Metadata size | 20 bytes | 20 bytes |
| Total minimum size | 21 bytes (1-byte payload + metadata) | 21 bytes (1-byte + metadata) |
| Total maximum overhead | 83 bytes (63-byte pad + 20 metadata) | 21 bytes |
| Zero-copy capable | Yes (aligned payload) | No (tombstone flag only) |
When Tombstones Are Used
Tombstones mark key deletions while maintaining chain integrity. They:
- Preserve the backward chain via
prev_offset - Use minimal space (no alignment overhead)
- Are detected during reads and filtered out
- Enable recovery to skip deleted entries
Sources: README.md:112-137 simd-r-drive-entry-handle/src/entry_metadata.rs:9-37
Metadata Serialization Format
Binary Layout in File
Constants for Range Indexing
The simd-r-drive-entry-handle/src/constants.rs:1-20 file defines range constants for metadata field access:
KEY_HASH_RANGE = 0..8PREV_OFFSET_RANGE = 8..16CHECKSUM_RANGE = 16..20METADATA_SIZE = 20
These ranges are used in EntryMetadata::serialize() and deserialize() methods.
Sources: simd-r-drive-entry-handle/src/entry_metadata.rs:62-112
Alignment Evolution and Migration
Version History
v0.14.0-alpha and earlier: Used 16-byte alignment (PAYLOAD_ALIGNMENT = 16)
v0.15.0-alpha onwards: Changed to 64-byte alignment (PAYLOAD_ALIGNMENT = 64)
This change was made to:
- Ensure full cache-line alignment
- Support AVX-512 and future SIMD extensions
- Improve zero-copy performance across modern hardware
Migration Considerations
Storage files created with different alignment values are not compatible :
- v0.14.x readers cannot correctly parse v0.15.x stores
- v0.15.x readers may misinterpret v0.14.x padding
To migrate between versions:
- Read all entries using the old version binary
- Write entries to a new store using the new version binary
- Replace the old file after verification
In multi-service environments, deploy reader upgrades before writer upgrades to avoid mixed-version issues.
Sources: CHANGELOG.md:25-82 README.md:51-59
Debug Assertions for Alignment
Runtime Validation
The codebase includes debug-only alignment assertions that validate both pointer and offset alignment:
debug_assert_aligned(ptr: *const u8, align: usize) - Validates pointer alignment
- Active in debug and test builds
- Zero cost in release/bench builds
- Ensures buffer base address is properly aligned
debug_assert_aligned_offset(off: u64) - Validates file offset alignment
- Checks that derived payload start offset is at
PAYLOAD_ALIGNMENTboundary - Used during entry handle creation
- Defined in simd-r-drive-entry-handle/src/debug_assert_aligned.rs:1-88
These assertions help catch alignment issues during development without imposing runtime overhead in production.
Sources: simd-r-drive-entry-handle/src/debug_assert_aligned.rs:1-88 CHANGELOG.md:33-41
Summary
The SIMD R Drive entry structure uses a carefully designed binary layout that balances efficiency, integrity, and flexibility:
- Fixed 64-byte alignment ensures cache-friendly, SIMD-optimized access
- 20-byte fixed metadata provides fast integrity checks and chain traversal
- Variable pre-padding maintains alignment without complex calculations
- Minimal tombstones mark deletions efficiently
- Backward-linked chain enables robust recovery and validation
This design enables zero-copy reads, high write throughput, and automatic crash recovery while maintaining a simple, append-only storage model.
Dismiss
Refresh this wiki
Enter email to refresh