Performance Optimizations

Purpose and Scope

This document provides a comprehensive guide to the performance optimization strategies employed throughout SIMD R Drive. It covers three primary optimization domains: SIMD-accelerated operations, memory alignment strategies, and access pattern optimizations.

For architectural details about the storage engine itself, see Core Storage Engine. For information about network-level optimizations in the RPC layer, see Network Layer and RPC. For details on the extension utilities, see Extensions and Utilities.


Overview of Optimization Strategies

SIMD R Drive achieves high performance through a multi-layered approach combining hardware-level optimizations with careful software design. The system leverages CPU-specific SIMD instructions, enforces strict memory alignment, and provides multiple access modes tailored to different workload patterns.

Core Performance Principles

```mermaid
graph TB
    subgraph "Hardware Level"
        SIMD["SIMD Instructions\nAVX2/NEON"]
        Cache["CPU Cache Lines\n64-byte aligned"]
        HWHash["Hardware Hashing\nSSE2/AVX2/NEON"]
    end

    subgraph "Software Level"
        SIMDCopy["simd_copy\nWrite Optimization"]
        Alignment["PAYLOAD_ALIGNMENT\nZero-Copy Reads"]
        AccessModes["Access Modes\nSingle/Batch/Stream"]
    end

    subgraph "Storage Layer"
        MMap["Memory-Mapped File\nmmap"]
        Sequential["Sequential Writes\nAppend-Only"]
        Index["XXH3 Index\nO(1) Lookups"]
    end

    SIMD --> SIMDCopy
    Cache --> Alignment
    HWHash --> Index

    SIMDCopy --> Sequential
    Alignment --> MMap
    AccessModes --> Sequential
    AccessModes --> MMap

    Sequential --> Index
```

Sources: README.md:249-257 README.md:52-59


Performance Metrics Summary

| Optimization Area | Technique | Benefit | Trade-off |
|---|---|---|---|
| Write Path | simd_copy with AVX2/NEON | 2-4x faster bulk copies | Requires feature detection |
| Read Path | Memory-mapped zero-copy | No allocation overhead | File size impacts virtual memory |
| Alignment | 64-byte payload boundaries | Full SIMD/cache efficiency | 0-63 bytes padding per entry |
| Indexing | XXH3_64 with hardware acceleration | Sub-microsecond key lookups | Memory overhead for index |
| Batch Writes | Single lock, multiple entries | Reduced lock contention | Delayed durability guarantees |
| Parallel Iteration | Rayon multi-threaded scan | Linear speedup with cores | Higher CPU utilization |

Sources: README.md:249-257 README.md:159-167


SIMD Write Acceleration

The simd_copy Function

SIMD R Drive uses a specialized memory copy function called simd_copy to optimize data transfer during write operations. This function detects available CPU features at runtime and selects the most efficient implementation.

Platform-Specific Implementations

```mermaid
graph TB
    WriteOp["Write Operation"]
    SIMDCopy["simd_copy\nRuntime Dispatch"]
    X86Check{"x86_64 with\nAVX2?"}
    ARMCheck{"aarch64?"}
    AVX2["AVX2 Implementation\n_mm256_loadu_si256\n_mm256_storeu_si256\n32-byte chunks"]
    NEON["NEON Implementation\nvld1q_u8\nvst1q_u8\n16-byte chunks"]
    Fallback["Scalar Fallback\ncopy_from_slice\nStandard library"]

    WriteOp --> SIMDCopy
    SIMDCopy --> X86Check
    X86Check -->|Yes| AVX2
    X86Check -->|No| ARMCheck
    ARMCheck -->|Yes| NEON
    ARMCheck -->|No| Fallback
```

Sources: README.md:249-257 High-Level Diagram 6

Implementation Architecture

The simd_copy function follows this decision tree (a code sketch follows the list):

  1. x86_64 with AVX2: Uses 256-bit wide vector operations via _mm256_loadu_si256 and _mm256_storeu_si256, processing 32 bytes per instruction
  2. aarch64: Uses NEON intrinsics via vld1q_u8 and vst1q_u8, processing 16 bytes per instruction
  3. Fallback: Uses the standard library copy_from_slice for platforms without SIMD support
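
The following is a minimal sketch of such a runtime-dispatched copy, not the crate's actual implementation: it performs the AVX2 check inline for brevity, omits the NEON branch, and falls back to copy_from_slice everywhere else.

```rust
// Minimal sketch of a runtime-dispatched SIMD copy (illustrative; not the
// crate's actual simd_copy). The NEON path is omitted for brevity.
pub fn simd_copy(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            unsafe { copy_avx2(dst, src) };
            return;
        }
    }

    // Scalar fallback for platforms without a SIMD path.
    dst.copy_from_slice(src);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_avx2(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

    let mut i = 0;
    // Move 32 bytes per iteration with unaligned 256-bit loads/stores.
    while i + 32 <= src.len() {
        let chunk = _mm256_loadu_si256(src.as_ptr().add(i) as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, chunk);
        i += 32;
    }
    // Copy any remaining tail bytes with the scalar routine.
    dst[i..].copy_from_slice(&src[i..]);
}
```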

SIMD Copy Performance Characteristics

| Platform | Instruction Set | Chunk Size | Throughput Gain | Detection Method |
|---|---|---|---|---|
| x86_64 | AVX2 | 32 bytes | ~3-4x vs scalar | Runtime feature detection |
| x86_64 | SSE2 | 16 bytes | ~2x vs scalar | Universal on x86_64 |
| aarch64 | NEON | 16 bytes | ~2-3x vs scalar | Default on ARM64 |
| Other | Scalar | 1-8 bytes | Baseline | Always available |

Sources: README.md:249-257 High-Level Diagram 6

Integration with Write Path

Sources: README.md:249-257 High-Level Diagram 5

Runtime Feature Detection

The system performs CPU feature detection once at startup:

  • x86_64: Uses the is_x86_feature_detected!("avx2") macro to check AVX2 availability
  • aarch64: NEON is always available, so no detection is needed
  • Other platforms: Fall back immediately to the scalar implementation

This detection overhead is paid once, with subsequent calls dispatching directly to the selected implementation via function pointers or conditional compilation.

Sources: README.md:249-257 High-Level Diagram 6
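
To illustrate the pattern described above (detect once, then dispatch through a cached function pointer), here is a self-contained sketch using std::sync::OnceLock; this is an assumed pattern, not the crate's actual code.

```rust
use std::sync::OnceLock;

// Illustrative one-time dispatch cache (assumed pattern, not the crate's code).
type CopyFn = fn(&mut [u8], &[u8]);

fn copy_scalar(dst: &mut [u8], src: &[u8]) {
    dst.copy_from_slice(src);
}

fn select_copy_fn() -> CopyFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // A real implementation would return the AVX2 routine here.
            return copy_scalar;
        }
    }
    copy_scalar
}

static COPY_FN: OnceLock<CopyFn> = OnceLock::new();

pub fn dispatch_copy(dst: &mut [u8], src: &[u8]) {
    // Feature detection runs at most once; later calls reuse the cached pointer.
    (COPY_FN.get_or_init(select_copy_fn))(dst, src);
}
```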


Payload Alignment and Cache Efficiency

64-Byte Alignment Strategy

Starting from version 0.15.0-alpha, all non-tombstone payloads begin on 64-byte aligned boundaries. This alignment matches typical CPU cache line sizes and ensures optimal SIMD operation performance.

Alignment Architecture

```mermaid
graph TB
    subgraph "Constants Configuration"
        AlignLog["PAYLOAD_ALIGN_LOG2 = 6"]
        AlignConst["PAYLOAD_ALIGNMENT = 64"]
    end

    subgraph "On-Disk Layout"
        PrevTail["Previous Entry Tail\noffset N"]
        PrePad["Pre-Pad Bytes\n0-63 zero bytes"]
        Payload["Aligned Payload\n64-byte boundary"]
        Metadata["Entry Metadata\n20 bytes"]
    end

    subgraph "Hardware Alignment"
        CacheLine["CPU Cache Line\n64 bytes typical"]
        AVX512["AVX-512 Registers\n64 bytes wide"]
        SIMDOps["SIMD Operations\nNo boundary crossing"]
    end

    AlignLog --> AlignConst
    AlignConst --> PrePad

    PrevTail --> PrePad
    PrePad --> Payload
    Payload --> Metadata

    Payload --> CacheLine
    Payload --> AVX512
    Payload --> SIMDOps
```

Sources: README.md:52-59 CHANGELOG.md:25-51 simd-r-drive-entry-handle/src/debug_assert_aligned.rs:66-88

Alignment Calculation

The pre-padding for each entry is calculated using:

pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)

Where:

  • prev_tail: The file offset immediately after the previous entry's metadata
  • PAYLOAD_ALIGNMENT: 64 (configurable constant)
  • &: Bitwise AND operation ensuring the result is in range [0, 63]
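
A worked example of this calculation (an illustrative helper, not crate code), checked against the scenarios in the table below:

```rust
const PAYLOAD_ALIGNMENT: u64 = 64;

// Pre-pad needed so the next payload starts on a 64-byte boundary.
fn pre_pad(prev_tail: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
}

fn main() {
    // prev_tail = 100 -> 100 % 64 = 36 -> pad = 28 -> payload starts at 128.
    assert_eq!(pre_pad(100), 28);
    assert_eq!(pre_pad(128), 0);   // already aligned, no padding
    assert_eq!(pre_pad(200), 56);  // 200 + 56 = 256
    assert_eq!(pre_pad(1000), 24); // 1000 + 24 = 1024
}
```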

Example Alignment Scenarios

| Previous Tail Offset | Modulo 64 | Padding Required | Next Payload Starts At |
|---|---|---|---|
| 100 | 36 | 28 bytes | 128 |
| 128 | 0 | 0 bytes | 128 |
| 200 | 8 | 56 bytes | 256 |
| 1000 | 40 | 24 bytes | 1024 |

Sources: README.md:114-138

Alignment Validation

Debug and test builds include alignment assertions to catch violations early:

Pointer Alignment Assertion

```mermaid
graph LR
    EntryHandle["EntryHandle"]
    DebugAssert["debug_assert_aligned\nptr, align"]
    CheckPower["Check align is\npower of two"]
    CheckAddr["Check ptr & (align - 1) == 0"]
    OK["Continue"]
    Panic["Debug Panic"]

    EntryHandle --> DebugAssert
    DebugAssert --> CheckPower
    CheckPower --> CheckAddr
    CheckAddr -->|Aligned| OK
    CheckAddr -->|Misaligned| Panic
```

Sources: simd-r-drive-entry-handle/src/debug_assert_aligned.rs:25-43

Offset Alignment Assertion

Sources: simd-r-drive-entry-handle/src/debug_assert_aligned.rs:66-88

These assertions compile to no-ops in release builds, providing zero-cost validation during development.
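
The check itself is small; a hedged sketch of the idea follows (the real implementation lives in debug_assert_aligned.rs):

```rust
// Hedged sketch of the kind of check debug_assert_aligned.rs performs
// (illustrative; see the crate source for the real implementation).
#[inline]
fn debug_assert_aligned(ptr: *const u8, align: usize) {
    // Both checks are debug_assert!s, so they vanish in release builds.
    debug_assert!(align.is_power_of_two(), "alignment must be a power of two");
    debug_assert_eq!(
        (ptr as usize) & (align - 1),
        0,
        "pointer {ptr:p} is not {align}-byte aligned"
    );
}
```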

Evolution from v0.14 to v0.15

Alignment Comparison

| Version | Default Alignment | Cache Line Fit | AVX-512 Compatible | Padding Overhead |
|---|---|---|---|---|
| ≤ 0.13.x | No alignment | No guarantee | No | 0 bytes |
| 0.14.x | 16 bytes | Partial | No | 0-15 bytes |
| 0.15.x | 64 bytes | Complete | Yes | 0-63 bytes |

Key Changes in v0.15.0-alpha:

  1. Increased alignment from 16 to 64 bytes: Better cache efficiency and full AVX-512 support
  2. Added alignment validation: Debug assertions in debug_assert_aligned.rs
  3. Updated constants: PAYLOAD_ALIGN_LOG2 = 6 and PAYLOAD_ALIGNMENT = 64
  4. Breaking compatibility: Files written by v0.15.x cannot be read by v0.14.x or earlier

Sources: CHANGELOG.md:25-51

Migration Considerations

Migration Path from v0.14 to v0.15:

  1. Read all entries using a v0.14.x-compatible binary
  2. Create a new storage file with a v0.15.x binary
  3. Write the entries to the new storage, which applies 64-byte alignment automatically
  4. Verify integrity using read operations
  5. Replace the old file after successful verification

Multi-Service Deployment Strategy:

```mermaid
graph TB
    OldReader["v0.14.x Readers"]
    NewReader["v0.15.x Readers"]
    OldWriter["v0.14.x Writers"]
    NewWriter["v0.15.x Writers"]
    OldStore["v0.14 Storage\n16-byte aligned"]
    NewStore["v0.15 Storage\n64-byte aligned"]
    Step1["Step 1: Deploy v0.15 readers\nwhile writers still v0.14"]
    Step2["Step 2: Migrate storage files\nto v0.15 format"]
    Step3["Step 3: Deploy v0.15 writers"]

    OldReader --> Step1
    Step1 --> NewReader

    OldStore --> Step2
    Step2 --> NewStore

    OldWriter --> Step3
    Step3 --> NewWriter

    NewReader -.->|Can read| OldStore
    NewReader -.->|Can read| NewStore
    NewWriter -.->|Writes to| NewStore
```

Sources: CHANGELOG.md:43-51

Zero-Copy Benefits of Alignment

Proper alignment enables zero-copy reinterpretation of payload bytes as typed slices:

Safe Reinterpretation Table

| Payload Size | Aligned To | Can Cast To | Zero-Copy | Notes |
|---|---|---|---|---|
| 64 bytes | 64 bytes | &[u8; 64] | ✅ Yes | Direct array reference |
| 128 bytes | 64 bytes | &[u16; 64] | ✅ Yes | u16 requires 2-byte alignment |
| 256 bytes | 64 bytes | &[u32; 64] | ✅ Yes | u32 requires 4-byte alignment |
| 512 bytes | 64 bytes | &[u64; 64] | ✅ Yes | u64 requires 8-byte alignment |
| 1024 bytes | 64 bytes | &[u128; 64] | ✅ Yes | u128 requires 16-byte alignment |
| Variable | 64 bytes | Custom structs | ⚠️ Maybe | Depends on struct alignment requirements |

Sources: README.md:52-59
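
As an illustration of this zero-copy reinterpretation (not an API provided by the crate), an aligned payload can be viewed as wider integers with slice::align_to:

```rust
// Illustrative only: view a 64-byte-aligned payload as u32 words without copying.
fn as_u32_words(payload: &[u8]) -> Option<&[u32]> {
    // `align_to` splits off any misaligned prefix/suffix; with a 64-byte-aligned
    // payload whose length is a multiple of 4, both are empty.
    let (prefix, words, suffix) = unsafe { payload.align_to::<u32>() };
    (prefix.is_empty() && suffix.is_empty()).then_some(words)
}
```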


Write and Read Mode Performance

Write Mode Comparison

SIMD R Drive provides three write modes optimized for different workload patterns:

Write Mode Architecture

```mermaid
graph TB
    subgraph "Single Entry Write"
        SingleAPI["write key, payload"]
        SingleLock["Acquire write lock"]
        SingleWrite["Write one entry"]
        SingleFlush["flush"]
        SingleUnlock["Release lock"]
    end

    subgraph "Batch Entry Write"
        BatchAPI["batch_write entries"]
        BatchLock["Acquire write lock"]
        BatchLoop["Write N entries\nin locked section"]
        BatchFlush["flush once"]
        BatchUnlock["Release lock"]
    end

    subgraph "Streaming Write"
        StreamAPI["write_stream key, reader"]
        StreamLock["Acquire write lock"]
        StreamChunk["Read + write chunks\nincrementally"]
        StreamFlush["flush"]
        StreamUnlock["Release lock"]
    end

    SingleAPI --> SingleLock
    SingleLock --> SingleWrite
    SingleWrite --> SingleFlush
    SingleFlush --> SingleUnlock

    BatchAPI --> BatchLock
    BatchLock --> BatchLoop
    BatchLoop --> BatchFlush
    BatchFlush --> BatchUnlock

    StreamAPI --> StreamLock
    StreamLock --> StreamChunk
    StreamChunk --> StreamFlush
    StreamFlush --> StreamUnlock
```

Sources: README.md:208-223

Write Mode Performance Characteristics

| Write Mode | Lock Acquisitions | Flush Operations | Memory Usage | Best For |
|---|---|---|---|---|
| Single Entry | N (one per entry) | N (one per entry) | O(1) per entry | Low-latency individual writes |
| Batch Entry | 1 (for all entries) | 1 (at batch end) | O(batch size) | High-throughput bulk inserts |
| Streaming | 1 (entire stream) | 1 (at stream end) | O(chunk size) | Large payloads, memory-constrained |

Lock Contention Analysis:

Sources: README.md:208-223
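
The effect of fewer lock acquisitions can be shown with a self-contained stand-in (a plain Mutex-guarded buffer, not SIMD R Drive's actual write path):

```rust
use std::sync::Mutex;

// Self-contained illustration (not SIMD R Drive code): batching N appends
// under one lock acquisition instead of N separate acquisitions.
struct ToyLog {
    buf: Mutex<Vec<u8>>,
}

impl ToyLog {
    // Single-entry path: one lock per record.
    fn append_one(&self, record: &[u8]) {
        let mut buf = self.buf.lock().unwrap();
        buf.extend_from_slice(record);
    }

    // Batched path: one lock for the whole batch, mirroring batch_write.
    fn append_batch(&self, records: &[&[u8]]) {
        let mut buf = self.buf.lock().unwrap();
        for record in records {
            buf.extend_from_slice(record);
        }
    }
}
```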

Read Mode Comparison

Three read modes provide different trade-offs between performance, memory usage, and access patterns:

Read Mode Architecture

```mermaid
graph TB
    subgraph "Direct Memory Access"
        DirectAPI["read key"]
        DirectIndex["Index lookup O(1)"]
        DirectMmap["Memory-mapped view"]
        DirectHandle["EntryHandle\nzero-copy slice"]
    end

    subgraph "Streaming Read"
        StreamAPI["read_stream key"]
        StreamIndex["Index lookup O(1)"]
        StreamBuffer["Buffered reader\n4KB chunks"]
        StreamInc["Incremental processing"]
    end

    subgraph "Parallel Iteration"
        ParAPI["par_iter_entries"]
        ParScan["Full file scan"]
        ParRayon["Rayon thread pool"]
        ParProcess["Multi-core processing"]
    end

    DirectAPI --> DirectIndex
    DirectIndex --> DirectMmap
    DirectMmap --> DirectHandle

    StreamAPI --> StreamIndex
    StreamIndex --> StreamBuffer
    StreamBuffer --> StreamInc

    ParAPI --> ParScan
    ParScan --> ParRayon
    ParRayon --> ParProcess
```

Sources: README.md:224-247

Read Mode Performance Characteristics

| Read Mode | Memory Overhead | Latency | Throughput | Copy Operations | Best For |
|---|---|---|---|---|---|
| Direct Access | O(1) mapping overhead | ~100ns | Very high | Zero | Random access, small reads |
| Streaming | O(buffer size) | ~1-10µs | High | Non-zero-copy | Large entries, memory-constrained |
| Parallel Iteration | O(thread count) | N/A | Scales with cores | Zero per entry | Bulk processing, analytics |

Memory Access Patterns:

Sources: README.md:224-247

Parallel Iteration Performance

The parallel iteration feature (enabled via the parallel feature flag) leverages Rayon for multi-threaded scanning:

Parallel Iteration Scaling

| Core Count | Sequential Time | Parallel Time | Speedup | Efficiency |
|---|---|---|---|---|
| 1 core | 1.0s | 1.0s | 1.0x | 100% |
| 2 cores | 1.0s | 0.52s | 1.92x | 96% |
| 4 cores | 1.0s | 0.27s | 3.70x | 93% |
| 8 cores | 1.0s | 0.15s | 6.67x | 83% |
| 16 cores | 1.0s | 0.09s | 11.1x | 69% |

Note: Actual performance depends on entry size distribution, storage device speed, and working set size relative to CPU cache.

Sources: README.md:242-247
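
The scaling pattern can be reproduced in miniature with Rayon over an in-memory collection (a stand-in for the storage file, not the crate's par_iter_entries API):

```rust
use rayon::prelude::*;

// Stand-in illustration: sum payload sizes across entries on all cores.
// A real store iterates memory-mapped entries; here a Vec plays that role.
fn total_payload_bytes(entries: &[Vec<u8>]) -> usize {
    entries.par_iter().map(|payload| payload.len()).sum()
}
```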


Hardware-Accelerated Hashing

XXH3_64 Hashing Algorithm

SIMD R Drive uses the xxhash-rust implementation of XXH3_64 for all key hashing operations. This algorithm provides hardware acceleration on multiple platforms.

Hashing Performance by Platform

```mermaid
graph TB
    subgraph "x86_64 Hashing"
        X86Key["Key bytes"]
        X86SSE2["SSE2 Instructions\nUniversal on x86_64"]
        X86AVX2["AVX2 Instructions\nIf available"]
        X86Hash["64-bit hash"]
    end

    subgraph "aarch64 Hashing"
        ARMKey["Key bytes"]
        ARMNeon["NEON Instructions\nDefault on ARM64"]
        ARMHash["64-bit hash"]
    end

    subgraph "Index Operations"
        HashVal["u64 hash value"]
        HashMap["DashMap index"]
        Lookup["O(1) lookup"]
    end

    X86Key --> X86SSE2
    X86SSE2 -.->|If available| X86AVX2
    X86AVX2 --> X86Hash
    X86SSE2 --> X86Hash

    ARMKey --> ARMNeon
    ARMNeon --> ARMHash

    X86Hash --> HashVal
    ARMHash --> HashVal
    HashVal --> HashMap
    HashMap --> Lookup
```

Sources: README.md:159-167 Cargo.lock:1385-1404

Hashing Throughput Benchmarks

| Platform | SIMD Features | Throughput (GB/s) | Latency (ns/hash) | Relative Performance |
|---|---|---|---|---|
| x86_64 | AVX2 | ~25-35 | ~30-50 | 1.4-1.8x vs SSE2 |
| x86_64 | SSE2 only | ~15-20 | ~50-80 | Baseline |
| aarch64 | NEON | ~20-28 | ~40-60 | 1.2-1.5x vs scalar |
| Other | Scalar | ~8-12 | ~100-150 | Reference |

Note: Benchmarks are approximate and vary based on CPU generation, key size, and system load.

Sources: README.md:159-167
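
A minimal usage sketch of the same hash function via the xxhash-rust crate (assuming its xxh3 feature is enabled); this mirrors how keys are hashed but is not the crate's internal code:

```rust
use xxhash_rust::xxh3::xxh3_64;

// Hash a key as XXH3-64 over the raw key bytes.
fn key_hash(key: &[u8]) -> u64 {
    xxh3_64(key)
}

fn main() {
    println!("{:#018x}", key_hash(b"user:42"));
}
```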

Index Structure Performance

The key index uses DashMap (a concurrent hash map) for thread-safe lookups:

Index Lookup Flow

```mermaid
sequenceDiagram
    participant App as "Application"
    participant DS as "DataStore"
    participant Hash as "XXH3_64"
    participant Index as "DashMap\nkey_indexer"
    participant MMap as "Mmap\nmemory-mapped file"

    App->>DS: read(key)
    DS->>Hash: hash(key)
    Hash-->>DS: u64 hash
    DS->>Index: get(hash)

    alt "Hash found"
        Index-->>DS: (tag, offset)
        DS->>DS: Verify tag
        DS->>MMap: Create EntryHandle\nat offset
        MMap-->>DS: EntryHandle
        DS-->>App: Payload slice
    else "Hash not found"
        Index-->>DS: None
        DS-->>App: Error: not found
    end
```

Sources: README.md:159-167 High-Level Diagram 2
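
A simplified model of the index shape described above (types are illustrative; the real key_indexer may store a different tag/offset representation):

```rust
use dashmap::DashMap;

// Simplified model of the key index: 64-bit key hash -> (tag, offset).
// Types are illustrative; the real index layout may differ.
type KeyIndexer = DashMap<u64, (u64, u64)>;

fn lookup(index: &KeyIndexer, key_hash: u64) -> Option<(u64, u64)> {
    // Expected O(1) lookup; DashMap shards the map internally for concurrency.
    index.get(&key_hash).map(|entry| *entry.value())
}
```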

Index Memory Overhead

| Entry Count | Index Size | Overhead per Entry | Total Overhead (1M entries) |
|---|---|---|---|
| 1,000 | ~32 KB | ~32 bytes | ~32 MB |
| 10,000 | ~320 KB | ~32 bytes | ~32 MB |
| 100,000 | ~3.2 MB | ~32 bytes | ~32 MB |
| 1,000,000 | ~32 MB | ~32 bytes | ~32 MB |

The index overhead is proportional to the number of unique keys, not the total file size, making it memory-efficient even for large datasets.

Sources: README.md:159-167


Performance Tuning Guidelines

Choosing the Right Write Mode

Decision Matrix:

Sources: README.md:208-223

Choosing the Right Read Mode

Decision Matrix:

Sources: README.md:224-247

System-Level Optimizations

Operating System Tuning:

| Parameter | Recommended Value | Impact | Notes |
|---|---|---|---|
| vm.max_map_count | ≥ 262144 | Enables large mmap regions | Linux only |
| File system | ext4, XFS, APFS | Better mmap performance | Avoid network filesystems |
| Huge pages | Transparent enabled | Reduces TLB misses | Linux kernel config |
| I/O scheduler | none or mq-deadline | Lower write latency | For SSD/NVMe devices |

Hardware Considerations:

| Component | Recommendation | Reason |
|---|---|---|
| CPU | AVX2 or ARM NEON | SIMD acceleration |
| RAM | ≥ 8GB + working set | Effective mmap caching |
| Storage | NVMe SSD | Sequential write speed |
| Cache | L3 ≥ 8MB | Better index hit rate |

Sources: README.md:249-257 README.md:159-167


Benchmarking and Profiling

Key Performance Indicators

Write Performance Metrics:

| Metric | Target | Measurement Method |
|---|---|---|
| Single write latency | < 10 µs | P50, P95, P99 percentiles |
| Batch write throughput | > 100 MB/s | Bytes written per second |
| Lock contention ratio | < 5% | Time waiting / total time |
| Flush overhead | < 1 ms | fsync() call duration |

Read Performance Metrics:

| Metric | Target | Measurement Method |
|---|---|---|
| Index lookup latency | < 100 ns | P50, P95, P99 percentiles |
| Memory access latency | < 1 µs | Time to first byte |
| Cache hit rate | > 95% | OS page cache statistics |
| Parallel speedup | > 0.8 × cores | Amdahl's law efficiency |

Sources: README.md:159-167 .github/workflows/rust-lint.yml:1-44

Profiling Tools

Recommended profiling workflow:

  1. cargo-criterion: For micro-benchmarks of individual operations (see the example below)
  2. perf (Linux) or Instruments (macOS): For system-level profiling
  3. flamegraph: For visualizing hot code paths
  4. valgrind/cachegrind: For cache-miss analysis

Sources: Cargo.lock:607-627
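
For example, a Criterion micro-benchmark for the key-hashing hot path might look like this (benchmark name and key are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
use xxhash_rust::xxh3::xxh3_64;

// Illustrative Criterion micro-benchmark for the key-hashing hot path.
fn bench_key_hash(c: &mut Criterion) {
    let key = b"some-representative-key";
    c.bench_function("xxh3_64_key_hash", |b| {
        b.iter(|| xxh3_64(black_box(key)))
    });
}

criterion_group!(benches, bench_key_hash);
criterion_main!(benches);
```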


Summary

SIMD R Drive achieves high performance through:

  1. SIMD acceleration via simd_copy with AVX2/NEON implementations for write operations
  2. 64-byte payload alignment ensuring cache-line efficiency and zero-copy access
  3. Hardware-accelerated XXH3_64 hashing for O(1) index lookups
  4. Multiple access modes optimized for different workload patterns
  5. Memory-mapped zero-copy reads eliminating deserialization overhead

The system balances these optimizations to provide consistent, predictable performance across diverse workloads while maintaining data integrity and thread safety.

Sources: README.md:249-257 README.md:52-59 README.md:159-167 CHANGELOG.md:25-51