Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Entry Structure and Metadata

Loading…

Entry Structure and Metadata

Relevant source files

Purpose and Scope

This document details the on-disk binary layout of entries in the SIMD R Drive storage engine. It covers the structure of aligned entries, tombstones, metadata fields, and the alignment strategy that enables zero-copy access.

For information about how entries are read and accessed in memory, see Memory Management and Zero-Copy Access. For details on the validation chain and recovery mechanisms, see Storage Architecture.


On-Disk Entry Layout Overview

Every entry written to the storage file consists of three components:

  1. Pre-Pad Bytes (optional, 0-63 bytes) - Zero bytes inserted to ensure the payload starts at a 64-byte boundary
  2. Payload - Variable-length binary data
  3. Metadata - Fixed 20-byte structure containing key hash, previous offset, and checksum

The exception is tombstones (deletion markers), which use a minimal 1-byte payload with no pre-padding.

Sources: README.md:104-137 simd-r-drive-entry-handle/src/entry_metadata.rs:9-37


Aligned Entry Structure

Entry Layout Table

Offset RangeFieldSize (Bytes)Description
P .. P+padPre-Pad (optional)padZero bytes to align payload start
P+pad .. NPayloadN-(P+pad)Variable-length data
N .. N+8Key Hash864-bit XXH3 key hash
N+8 .. N+16Prev Offset8Absolute offset of previous tail
N+16 .. N+20Checksum4CRC32C of payload

Where:

  • pad = (A - (prev_tail % A)) & (A - 1), with A = PAYLOAD_ALIGNMENT (64 bytes)
  • The next entry starts at offset N + 20

Aligned Entry Structure Diagram

Sources: README.md:112-137 simd-r-drive-entry-handle/src/entry_metadata.rs:11-23


Tombstone Structure

Tombstones are special deletion markers that do not require payload alignment. They consist of a single zero byte followed by the standard 20-byte metadata structure.

Tombstone Layout Table

Offset RangeFieldSize (Bytes)Description
T .. T+1Payload1Single byte 0x00
T+1 .. T+21Metadata20Key hash, prev, crc32c

Tombstone Structure Diagram

Sources: README.md:126-131 simd-r-drive-entry-handle/src/entry_metadata.rs:25-30


EntryMetadata Structure

The EntryMetadata struct represents the fixed 20-byte metadata block that follows every payload. It is defined in #[repr(C)] layout to ensure consistent binary representation.

graph TB
    subgraph EntryMetadataStruct["EntryMetadata struct"]
field1["key_hash: u64\n8 bytes\nXXH3_64 hash"]
field2["prev_offset: u64\n8 bytes\nbackward chain link"]
field3["checksum: [u8; 4]\n4 bytes\nCRC32C payload checksum"]
end
    
 
   field1 --> field2
 
   field2 --> field3
    
    note4["Serialized at offset N\nfollowing payload"]
note5["Total: METADATA_SIZE = 20"]
field1 -.-> note4
 
   field3 -.-> note5

Metadata Fields

Field Descriptions

key_hash: u64 (8 bytes, offset N .. N+8)

  • 64-bit XXH3 hash of the key
  • Used by KeyIndexer for O(1) lookups
  • Combined with a tag for collision detection
  • Hardware-accelerated via SSE2/AVX2/NEON

prev_offset: u64 (8 bytes, offset N+8 .. N+16)

  • Absolute file offset of the previous entry for this key
  • Forms a backward-linked chain for version history
  • Set to 0 for the first entry of a key
  • Used during chain validation and recovery

checksum: [u8; 4] (4 bytes, offset N+16 .. N+20)

  • CRC32C checksum of the payload
  • Provides fast integrity verification
  • Not cryptographically secure
  • Used during recovery to detect corruption

Serialization and Deserialization

The EntryMetadata struct provides methods for converting to/from bytes:

  • serialize() -> [u8; 20] - Converts metadata to byte array using little-endian encoding
  • deserialize(data:&[u8]) -> Self - Reconstructs metadata from byte slice

Sources: simd-r-drive-entry-handle/src/entry_metadata.rs:44-113 README.md:114-120


Pre-Padding and Alignment Strategy

Alignment Purpose

All non-tombstone payloads start at a 64-byte aligned address. This alignment ensures:

  • Cache-line efficiency - Matches typical CPU cache line size
  • SIMD optimization - Enables full-speed AVX2/AVX-512/NEON operations
  • Zero-copy typed views - Allows safe reinterpretation as typed slices (&[u16], &[u32], etc.)
graph TD
    Start["Calculate padding needed"]
GetPrevTail["prev_tail = last written offset"]
CalcPad["pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT))\n& (PAYLOAD_ALIGNMENT - 1)"]
CheckPad{"pad > 0?"}
WritePad["Write pad zero bytes"]
WritePayload["Write payload at aligned offset"]
Start --> GetPrevTail
 
   GetPrevTail --> CalcPad
 
   CalcPad --> CheckPad
 
   CheckPad -->|Yes| WritePad
 
   CheckPad -->|No| WritePayload
 
   WritePad --> WritePayload

The alignment is configured via PAYLOAD_ALIGNMENT constant (64 bytes as of version 0.15.0).

Pre-Padding Calculation

The formula pad = (A - (prev_tail % A)) & (A - 1) where A = PAYLOAD_ALIGNMENT ensures:

  • If prev_tail is already aligned, pad = 0
  • Otherwise, pad equals the bytes needed to reach the next aligned boundary
  • Maximum padding is A - 1 bytes (63 bytes for 64-byte alignment)

Constants

The alignment is defined in simd-r-drive-entry-handle/src/constants.rs:1-20:

ConstantValueDescription
PAYLOAD_ALIGN_LOG26Log₂ of alignment (2⁶ = 64)
PAYLOAD_ALIGNMENT64Actual alignment boundary in bytes
METADATA_SIZE20Fixed size of metadata block

Sources: README.md:51-59 simd-r-drive-entry-handle/src/entry_metadata.rs:22-23 CHANGELOG.md:25-51


Backward Chain Formation

Chain Structure

Each entry’s prev_offset field creates a backward-linked chain that tracks the version history for a given key. This chain is essential for:

  • Recovery and validation on file open
  • Detecting incomplete writes
  • Rebuilding the index

Chain Properties

  • Most recent entry is at the end of the file (highest offset)
  • Chain traversal moves backward from tail toward offset 0
  • First entry for a key has prev_offset = 0
  • Valid chain can be walked all the way back to byte 0 without gaps
  • Broken chain indicates corruption or incomplete write

Usage in Recovery

During file open, the system:

  1. Scans backward from EOF reading metadata
  2. Follows prev_offset links to validate chain continuity
  3. Verifies checksums at each step
  4. Truncates file if corruption is detected
  5. Scans forward to rebuild the index

Sources: README.md:139-147 simd-r-drive-entry-handle/src/entry_metadata.rs:41-43


Entry Type Comparison

Aligned Entry vs. Tombstone

AspectAligned Entry (Non-Tombstone)Tombstone (Deletion Marker)
Pre-padding0-63 bytes (alignment dependent)None
Payload sizeVariable (user-defined)Fixed 1 byte (0x00)
Payload alignment64-byte boundaryNo alignment requirement
Metadata size20 bytes20 bytes
Total minimum size21 bytes (1-byte payload + metadata)21 bytes (1-byte + metadata)
Total maximum overhead83 bytes (63-byte pad + 20 metadata)21 bytes
Zero-copy capableYes (aligned payload)No (tombstone flag only)

When Tombstones Are Used

Tombstones mark key deletions while maintaining chain integrity. They:

  • Preserve the backward chain via prev_offset
  • Use minimal space (no alignment overhead)
  • Are detected during reads and filtered out
  • Enable recovery to skip deleted entries

Sources: README.md:112-137 simd-r-drive-entry-handle/src/entry_metadata.rs:9-37


Metadata Serialization Format

Binary Layout in File

Constants for Range Indexing

The simd-r-drive-entry-handle/src/constants.rs:1-20 file defines range constants for metadata field access:

  • KEY_HASH_RANGE = 0..8
  • PREV_OFFSET_RANGE = 8..16
  • CHECKSUM_RANGE = 16..20
  • METADATA_SIZE = 20

These ranges are used in EntryMetadata::serialize() and deserialize() methods.

Sources: simd-r-drive-entry-handle/src/entry_metadata.rs:62-112


Alignment Evolution and Migration

Version History

v0.14.0-alpha and earlier: Used 16-byte alignment (PAYLOAD_ALIGNMENT = 16)

v0.15.0-alpha onwards: Changed to 64-byte alignment (PAYLOAD_ALIGNMENT = 64)

This change was made to:

  • Ensure full cache-line alignment
  • Support AVX-512 and future SIMD extensions
  • Improve zero-copy performance across modern hardware

Migration Considerations

Storage files created with different alignment values are not compatible :

  • v0.14.x readers cannot correctly parse v0.15.x stores
  • v0.15.x readers may misinterpret v0.14.x padding

To migrate between versions:

  1. Read all entries using the old version binary
  2. Write entries to a new store using the new version binary
  3. Replace the old file after verification

In multi-service environments, deploy reader upgrades before writer upgrades to avoid mixed-version issues.

Sources: CHANGELOG.md:25-82 README.md:51-59


Debug Assertions for Alignment

Runtime Validation

The codebase includes debug-only alignment assertions that validate both pointer and offset alignment:

debug_assert_aligned(ptr: *const u8, align: usize) - Validates pointer alignment

  • Active in debug and test builds
  • Zero cost in release/bench builds
  • Ensures buffer base address is properly aligned

debug_assert_aligned_offset(off: u64) - Validates file offset alignment

These assertions help catch alignment issues during development without imposing runtime overhead in production.

Sources: simd-r-drive-entry-handle/src/debug_assert_aligned.rs:1-88 CHANGELOG.md:33-41


Summary

The SIMD R Drive entry structure uses a carefully designed binary layout that balances efficiency, integrity, and flexibility:

  • Fixed 64-byte alignment ensures cache-friendly, SIMD-optimized access
  • 20-byte fixed metadata provides fast integrity checks and chain traversal
  • Variable pre-padding maintains alignment without complex calculations
  • Minimal tombstones mark deletions efficiently
  • Backward-linked chain enables robust recovery and validation

This design enables zero-copy reads, high write throughput, and automatic crash recovery while maintaining a simple, append-only storage model.

Dismiss

Refresh this wiki

Enter email to refresh