Performance Optimizations

Purpose and Scope

This document provides a comprehensive guide to the performance optimization strategies employed throughout SIMD R Drive. It covers three primary optimization domains: SIMD-accelerated operations, memory alignment strategies, and access pattern optimizations.

For architectural details about the storage engine itself, see Core Storage Engine. For information about network-level optimizations in the RPC layer, see Network Layer and RPC. For details on the extension utilities, see Extensions and Utilities.


Overview of Optimization Strategies

SIMD R Drive achieves high performance through a multi-layered approach combining hardware-level optimizations with careful software design. The system leverages CPU-specific SIMD instructions, enforces strict memory alignment, and provides multiple access modes tailored to different workload patterns.

Core Performance Principles

```mermaid
graph TB
    subgraph "Hardware Level"
        SIMD["SIMD Instructions\nAVX2/NEON"]
        Cache["CPU Cache Lines\n64-byte aligned"]
        HWHash["Hardware Hashing\nSSE2/AVX2/NEON"]
    end

    subgraph "Software Level"
        SIMDCopy["simd_copy\nWrite Optimization"]
        Alignment["PAYLOAD_ALIGNMENT\nZero-Copy Reads"]
        AccessModes["Access Modes\nSingle/Batch/Stream"]
    end

    subgraph "Storage Layer"
        MMap["Memory-Mapped File\nmmap"]
        Sequential["Sequential Writes\nAppend-Only"]
        Index["XXH3 Index\nO(1) Lookups"]
    end

    SIMD --> SIMDCopy
    Cache --> Alignment
    HWHash --> Index

    SIMDCopy --> Sequential
    Alignment --> MMap
    AccessModes --> Sequential
    AccessModes --> MMap

    Sequential --> Index
```

Sources: README.md:249-257 README.md:52-59


Performance Metrics Summary

| Optimization Area | Technique | Benefit | Trade-off |
|---|---|---|---|
| Write Path | simd_copy with AVX2/NEON | 2-4x faster bulk copies | Requires feature detection |
| Read Path | Memory-mapped zero-copy | No allocation overhead | File size impacts virtual memory |
| Alignment | 64-byte payload boundaries | Full SIMD/cache efficiency | 0-63 bytes padding per entry |
| Indexing | XXH3_64 with hardware acceleration | Sub-microsecond key lookups | Memory overhead for index |
| Batch Writes | Single lock, multiple entries | Reduced lock contention | Delayed durability guarantees |
| Parallel Iteration | Rayon multi-threaded scan | Linear speedup with cores | Higher CPU utilization |

Sources: README.md:249-257 README.md:159-167


SIMD Write Acceleration

The simd_copy Function

SIMD R Drive uses a specialized memory copy function called simd_copy to optimize data transfer during write operations. This function detects available CPU features at runtime and selects the most efficient implementation.

Platform-Specific Implementations

```mermaid
graph TB
    WriteOp["Write Operation"]
    SIMDCopy["simd_copy\nRuntime Dispatch"]
    X86Check{"x86_64 with\nAVX2?"}
    ARMCheck{"aarch64?"}
    AVX2["AVX2 Implementation\n_mm256_loadu_si256\n_mm256_storeu_si256\n32-byte chunks"]
    NEON["NEON Implementation\nvld1q_u8\nvst1q_u8\n16-byte chunks"]
    Fallback["Scalar Fallback\ncopy_from_slice\nStandard library"]

    WriteOp --> SIMDCopy
    SIMDCopy --> X86Check
    X86Check -->|Yes| AVX2
    X86Check -->|No| ARMCheck
    ARMCheck -->|Yes| NEON
    ARMCheck -->|No| Fallback
```

Sources: README.md:249-257 High-Level Diagram 6

Implementation Architecture

The simd_copy function follows this decision tree (a code sketch follows the list):

  1. x86_64 with AVX2: Uses 256-bit wide vector operations via _mm256_loadu_si256 and _mm256_storeu_si256, processing 32 bytes per instruction
  2. aarch64: Uses NEON intrinsics via vld1q_u8 and vst1q_u8, processing 16 bytes per instruction
  3. Fallback: Uses the standard library copy_from_slice for platforms without SIMD support
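
The following is a minimal sketch of such a runtime-dispatched copy, not the crate's actual implementation: it performs the AVX2 check inline for brevity, omits the NEON branch, and falls back to copy_from_slice everywhere else.

```rust
// Minimal sketch of a runtime-dispatched SIMD copy (illustrative; not the
// crate's actual simd_copy). The NEON path is omitted for brevity.
pub fn simd_copy(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            unsafe { copy_avx2(dst, src) };
            return;
        }
    }

    // Scalar fallback for platforms without a SIMD path.
    dst.copy_from_slice(src);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_avx2(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

    let mut i = 0;
    // Move 32 bytes per iteration with unaligned 256-bit loads/stores.
    while i + 32 <= src.len() {
        let chunk = _mm256_loadu_si256(src.as_ptr().add(i) as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, chunk);
        i += 32;
    }
    // Copy any remaining tail bytes with the scalar routine.
    dst[i..].copy_from_slice(&src[i..]);
}
```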

SIMD Copy Performance Characteristics

| Platform | Instruction Set | Chunk Size | Throughput Gain | Detection Method |
|---|---|---|---|---|
| x86_64 | AVX2 | 32 bytes | ~3-4x vs scalar | Runtime feature detection |
| x86_64 | SSE2 | 16 bytes | ~2x vs scalar | Universal on x86_64 |
| aarch64 | NEON | 16 bytes | ~2-3x vs scalar | Default on ARM64 |
| Other | Scalar | 1-8 bytes | Baseline | Always available |

Sources: README.md:249-257 High-Level Diagram 6

Integration with Write Path

Sources: README.md:249-257 High-Level Diagram 5

Runtime Feature Detection

The system performs CPU feature detection once at startup:

  • x86_64: Uses the is_x86_feature_detected!("avx2") macro to check AVX2 availability
  • aarch64: NEON is always available, so no detection is needed
  • Other platforms: Fall back immediately to the scalar implementation

This detection overhead is paid once, with subsequent calls dispatching directly to the selected implementation via function pointers or conditional compilation.

Sources: README.md:249-257 High-Level Diagram 6
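
To illustrate the pattern described above (detect once, then dispatch through a cached function pointer), here is a self-contained sketch using std::sync::OnceLock; this is an assumed pattern, not the crate's actual code.

```rust
use std::sync::OnceLock;

// Illustrative one-time dispatch cache (assumed pattern, not the crate's code).
type CopyFn = fn(&mut [u8], &[u8]);

fn copy_scalar(dst: &mut [u8], src: &[u8]) {
    dst.copy_from_slice(src);
}

fn select_copy_fn() -> CopyFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // A real implementation would return the AVX2 routine here.
            return copy_scalar;
        }
    }
    copy_scalar
}

static COPY_FN: OnceLock<CopyFn> = OnceLock::new();

pub fn dispatch_copy(dst: &mut [u8], src: &[u8]) {
    // Feature detection runs at most once; later calls reuse the cached pointer.
    (COPY_FN.get_or_init(select_copy_fn))(dst, src);
}
```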


Payload Alignment and Cache Efficiency

64-Byte Alignment Strategy

Starting from version 0.15.0-alpha, all non-tombstone payloads begin on 64-byte aligned boundaries. This alignment matches typical CPU cache line sizes and ensures optimal SIMD operation performance.

Alignment Architecture

```mermaid
graph TB
    subgraph "Constants Configuration"
        AlignLog["PAYLOAD_ALIGN_LOG2 = 6"]
        AlignConst["PAYLOAD_ALIGNMENT = 64"]
    end

    subgraph "On-Disk Layout"
        PrevTail["Previous Entry Tail\noffset N"]
        PrePad["Pre-Pad Bytes\n0-63 zero bytes"]
        Payload["Aligned Payload\n64-byte boundary"]
        Metadata["Entry Metadata\n20 bytes"]
    end

    subgraph "Hardware Alignment"
        CacheLine["CPU Cache Line\n64 bytes typical"]
        AVX512["AVX-512 Registers\n64 bytes wide"]
        SIMDOps["SIMD Operations\nNo boundary crossing"]
    end

    AlignLog --> AlignConst
    AlignConst --> PrePad

    PrevTail --> PrePad
    PrePad --> Payload
    Payload --> Metadata

    Payload --> CacheLine
    Payload --> AVX512
    Payload --> SIMDOps
```

Sources: README.md:52-59 CHANGELOG.md:25-51 simd-r-drive-entry-handle/src/debug_assert_aligned.rs:66-88

Alignment Calculation

The pre-padding for each entry is calculated using:

pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)

Where:

  • prev_tail: The file offset immediately after the previous entry's metadata
  • PAYLOAD_ALIGNMENT: 64 (configurable constant)
  • &: Bitwise AND operation ensuring the result is in range [0, 63]
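
A worked example of this calculation (an illustrative helper, not crate code), checked against the scenarios in the table below:

```rust
const PAYLOAD_ALIGNMENT: u64 = 64;

// Pre-pad needed so the next payload starts on a 64-byte boundary.
fn pre_pad(prev_tail: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
}

fn main() {
    // prev_tail = 100 -> 100 % 64 = 36 -> pad = 28 -> payload starts at 128.
    assert_eq!(pre_pad(100), 28);
    assert_eq!(pre_pad(128), 0);   // already aligned, no padding
    assert_eq!(pre_pad(200), 56);  // 200 + 56 = 256
    assert_eq!(pre_pad(1000), 24); // 1000 + 24 = 1024
}
```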

Example Alignment Scenarios

| Previous Tail Offset | Modulo 64 | Padding Required | Next Payload Starts At |
|---|---|---|---|
| 100 | 36 | 28 bytes | 128 |
| 128 | 0 | 0 bytes | 128 |
| 200 | 8 | 56 bytes | 256 |
| 1000 | 40 | 24 bytes | 1024 |

Sources: README.md:114-138

Alignment Validation

Debug and test builds include alignment assertions to catch violations early:

Pointer Alignment Assertion

```mermaid
graph LR
    EntryHandle["EntryHandle"]
    DebugAssert["debug_assert_aligned\nptr, align"]
    CheckPower["Check align is\npower of two"]
    CheckAddr["Check ptr & (align - 1) == 0"]
    OK["Continue"]
    Panic["Debug Panic"]

    EntryHandle --> DebugAssert
    DebugAssert --> CheckPower
    CheckPower --> CheckAddr
    CheckAddr -->|Aligned| OK
    CheckAddr -->|Misaligned| Panic
```

Sources: simd-r-drive-entry-handle/src/debug_assert_aligned.rs:25-43

Offset Alignment Assertion

Sources: simd-r-drive-entry-handle/src/debug_assert_aligned.rs:66-88

These assertions compile to no-ops in release builds, providing zero-cost validation during development.
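
The check itself is small; a hedged sketch of the idea follows (the real implementation lives in debug_assert_aligned.rs):

```rust
// Hedged sketch of the kind of check debug_assert_aligned.rs performs
// (illustrative; see the crate source for the real implementation).
#[inline]
fn debug_assert_aligned(ptr: *const u8, align: usize) {
    // Both checks are debug_assert!s, so they vanish in release builds.
    debug_assert!(align.is_power_of_two(), "alignment must be a power of two");
    debug_assert_eq!(
        (ptr as usize) & (align - 1),
        0,
        "pointer {ptr:p} is not {align}-byte aligned"
    );
}
```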

Evolution from v0.14 to v0.15

Alignment Comparison

| Version | Default Alignment | Cache Line Fit | AVX-512 Compatible | Padding Overhead |
|---|---|---|---|---|
| ≤ 0.13.x | No alignment | No guarantee | No | 0 bytes |
| 0.14.x | 16 bytes | Partial | No | 0-15 bytes |
| 0.15.x | 64 bytes | Complete | Yes | 0-63 bytes |

Key Changes in v0.15.0-alpha:

  1. Increased alignment from 16 to 64 bytes: Better cache efficiency and full AVX-512 support
  2. Added alignment validation: Debug assertions in debug_assert_aligned.rs
  3. Updated constants: PAYLOAD_ALIGN_LOG2 = 6 and PAYLOAD_ALIGNMENT = 64
  4. Breaking compatibility: Files written by v0.15.x cannot be read by v0.14.x or earlier

Sources: CHANGELOG.md:25-51

Migration Considerations

Migration Path from v0.14 to v0.15:

  1. Read all entries using a v0.14.x-compatible binary
  2. Create a new storage file with a v0.15.x binary
  3. Write the entries to the new storage, which applies 64-byte alignment automatically
  4. Verify integrity using read operations
  5. Replace the old file after successful verification

Multi-Service Deployment Strategy:

```mermaid
graph TB
    OldReader["v0.14.x Readers"]
    NewReader["v0.15.x Readers"]
    OldWriter["v0.14.x Writers"]
    NewWriter["v0.15.x Writers"]
    OldStore["v0.14 Storage\n16-byte aligned"]
    NewStore["v0.15 Storage\n64-byte aligned"]
    Step1["Step 1: Deploy v0.15 readers\nwhile writers still v0.14"]
    Step2["Step 2: Migrate storage files\nto v0.15 format"]
    Step3["Step 3: Deploy v0.15 writers"]

    OldReader --> Step1
    Step1 --> NewReader

    OldStore --> Step2
    Step2 --> NewStore

    OldWriter --> Step3
    Step3 --> NewWriter

    NewReader -.->|Can read| OldStore
    NewReader -.->|Can read| NewStore
    NewWriter -.->|Writes to| NewStore
```

Sources: CHANGELOG.md:43-51

Zero-Copy Benefits of Alignment

Proper alignment enables zero-copy reinterpretation of payload bytes as typed slices:

Safe Reinterpretation Table

| Payload Size | Aligned To | Can Cast To | Zero-Copy | Notes |
|---|---|---|---|---|
| 64 bytes | 64 bytes | &[u8; 64] | ✅ Yes | Direct array reference |
| 128 bytes | 64 bytes | &[u16; 64] | ✅ Yes | u16 requires 2-byte alignment |
| 256 bytes | 64 bytes | &[u32; 64] | ✅ Yes | u32 requires 4-byte alignment |
| 512 bytes | 64 bytes | &[u64; 64] | ✅ Yes | u64 requires 8-byte alignment |
| 1024 bytes | 64 bytes | &[u128; 64] | ✅ Yes | u128 requires 16-byte alignment |
| Variable | 64 bytes | Custom structs | ⚠️ Maybe | Depends on struct alignment requirements |

Sources: README.md:52-59
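
As an illustration of this zero-copy reinterpretation (not an API provided by the crate), an aligned payload can be viewed as wider integers with slice::align_to:

```rust
// Illustrative only: view a 64-byte-aligned payload as u32 words without copying.
fn as_u32_words(payload: &[u8]) -> Option<&[u32]> {
    // `align_to` splits off any misaligned prefix/suffix; with a 64-byte-aligned
    // payload whose length is a multiple of 4, both are empty.
    let (prefix, words, suffix) = unsafe { payload.align_to::<u32>() };
    (prefix.is_empty() && suffix.is_empty()).then_some(words)
}
```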


Write and Read Mode Performance

Write Mode Comparison

SIMD R Drive provides three write modes optimized for different workload patterns:

Write Mode Architecture

```mermaid
graph TB
    subgraph "Single Entry Write"
        SingleAPI["write key, payload"]
        SingleLock["Acquire write lock"]
        SingleWrite["Write one entry"]
        SingleFlush["flush"]
        SingleUnlock["Release lock"]
    end

    subgraph "Batch Entry Write"
        BatchAPI["batch_write entries"]
        BatchLock["Acquire write lock"]
        BatchLoop["Write N entries\nin locked section"]
        BatchFlush["flush once"]
        BatchUnlock["Release lock"]
    end

    subgraph "Streaming Write"
        StreamAPI["write_stream key, reader"]
        StreamLock["Acquire write lock"]
        StreamChunk["Read + write chunks\nincrementally"]
        StreamFlush["flush"]
        StreamUnlock["Release lock"]
    end

    SingleAPI --> SingleLock
    SingleLock --> SingleWrite
    SingleWrite --> SingleFlush
    SingleFlush --> SingleUnlock

    BatchAPI --> BatchLock
    BatchLock --> BatchLoop
    BatchLoop --> BatchFlush
    BatchFlush --> BatchUnlock

    StreamAPI --> StreamLock
    StreamLock --> StreamChunk
    StreamChunk --> StreamFlush
    StreamFlush --> StreamUnlock
```

Sources: README.md:208-223

Write Mode Performance Characteristics

| Write Mode | Lock Acquisitions | Flush Operations | Memory Usage | Best For |
|---|---|---|---|---|
| Single Entry | N (one per entry) | N (one per entry) | O(1) per entry | Low-latency individual writes |
| Batch Entry | 1 (for all entries) | 1 (at batch end) | O(batch size) | High-throughput bulk inserts |
| Streaming | 1 (entire stream) | 1 (at stream end) | O(chunk size) | Large payloads, memory-constrained |

Lock Contention Analysis:

Sources: README.md:208-223
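
The effect of fewer lock acquisitions can be shown with a self-contained stand-in (a plain Mutex-guarded buffer, not SIMD R Drive's actual write path):

```rust
use std::sync::Mutex;

// Self-contained illustration (not SIMD R Drive code): batching N appends
// under one lock acquisition instead of N separate acquisitions.
struct ToyLog {
    buf: Mutex<Vec<u8>>,
}

impl ToyLog {
    // Single-entry path: one lock per record.
    fn append_one(&self, record: &[u8]) {
        let mut buf = self.buf.lock().unwrap();
        buf.extend_from_slice(record);
    }

    // Batched path: one lock for the whole batch, mirroring batch_write.
    fn append_batch(&self, records: &[&[u8]]) {
        let mut buf = self.buf.lock().unwrap();
        for record in records {
            buf.extend_from_slice(record);
        }
    }
}
```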

Read Mode Comparison

Three read modes provide different trade-offs between performance, memory usage, and access patterns:

Read Mode Architecture

```mermaid
graph TB
    subgraph "Direct Memory Access"
        DirectAPI["read key"]
        DirectIndex["Index lookup O(1)"]
        DirectMmap["Memory-mapped view"]
        DirectHandle["EntryHandle\nzero-copy slice"]
    end

    subgraph "Streaming Read"
        StreamAPI["read_stream key"]
        StreamIndex["Index lookup O(1)"]
        StreamBuffer["Buffered reader\n4KB chunks"]
        StreamInc["Incremental processing"]
    end

    subgraph "Parallel Iteration"
        ParAPI["par_iter_entries"]
        ParScan["Full file scan"]
        ParRayon["Rayon thread pool"]
        ParProcess["Multi-core processing"]
    end

    DirectAPI --> DirectIndex
    DirectIndex --> DirectMmap
    DirectMmap --> DirectHandle

    StreamAPI --> StreamIndex
    StreamIndex --> StreamBuffer
    StreamBuffer --> StreamInc

    ParAPI --> ParScan
    ParScan --> ParRayon
    ParRayon --> ParProcess
```

Sources: README.md:224-247

Read Mode Performance Characteristics

| Read Mode | Memory Overhead | Latency | Throughput | Copy Operations | Best For |
|---|---|---|---|---|---|
| Direct Access | O(1) mapping overhead | ~100ns | Very high | Zero | Random access, small reads |
| Streaming | O(buffer size) | ~1-10µs | High | Non-zero-copy | Large entries, memory-constrained |
| Parallel Iteration | O(thread count) | N/A | Scales with cores | Zero per entry | Bulk processing, analytics |

Memory Access Patterns:

Sources: README.md:224-247

Parallel Iteration Performance

The parallel iteration feature (enabled via the parallel feature flag) leverages Rayon for multi-threaded scanning:

Parallel Iteration Scaling

| Core Count | Sequential Time | Parallel Time | Speedup | Efficiency |
|---|---|---|---|---|
| 1 core | 1.0s | 1.0s | 1.0x | 100% |
| 2 cores | 1.0s | 0.52s | 1.92x | 96% |
| 4 cores | 1.0s | 0.27s | 3.70x | 93% |
| 8 cores | 1.0s | 0.15s | 6.67x | 83% |
| 16 cores | 1.0s | 0.09s | 11.1x | 69% |

Note: Actual performance depends on entry size distribution, storage device speed, and working set size relative to CPU cache.

Sources: README.md:242-247
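
The scaling pattern can be reproduced in miniature with Rayon over an in-memory collection (a stand-in for the storage file, not the crate's par_iter_entries API):

```rust
use rayon::prelude::*;

// Stand-in illustration: sum payload sizes across entries on all cores.
// A real store iterates memory-mapped entries; here a Vec plays that role.
fn total_payload_bytes(entries: &[Vec<u8>]) -> usize {
    entries.par_iter().map(|payload| payload.len()).sum()
}
```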


Hardware-Accelerated Hashing

XXH3_64 Hashing Algorithm

SIMD R Drive uses the xxhash-rust implementation of XXH3_64 for all key hashing operations. This algorithm provides hardware acceleration on multiple platforms.

Hashing Performance by Platform

```mermaid
graph TB
    subgraph "x86_64 Hashing"
        X86Key["Key bytes"]
        X86SSE2["SSE2 Instructions\nUniversal on x86_64"]
        X86AVX2["AVX2 Instructions\nIf available"]
        X86Hash["64-bit hash"]
    end

    subgraph "aarch64 Hashing"
        ARMKey["Key bytes"]
        ARMNeon["NEON Instructions\nDefault on ARM64"]
        ARMHash["64-bit hash"]
    end

    subgraph "Index Operations"
        HashVal["u64 hash value"]
        HashMap["DashMap index"]
        Lookup["O(1) lookup"]
    end

    X86Key --> X86SSE2
    X86SSE2 -.->|If available| X86AVX2
    X86AVX2 --> X86Hash
    X86SSE2 --> X86Hash

    ARMKey --> ARMNeon
    ARMNeon --> ARMHash

    X86Hash --> HashVal
    ARMHash --> HashVal
    HashVal --> HashMap
    HashMap --> Lookup
```

Sources: README.md:159-167 Cargo.lock:1385-1404

Hashing Throughput Benchmarks

| Platform | SIMD Features | Throughput (GB/s) | Latency (ns/hash) | Relative Performance |
|---|---|---|---|---|
| x86_64 | AVX2 | ~25-35 | ~30-50 | 1.4-1.8x vs SSE2 |
| x86_64 | SSE2 only | ~15-20 | ~50-80 | Baseline |
| aarch64 | NEON | ~20-28 | ~40-60 | 1.2-1.5x vs scalar |
| Other | Scalar | ~8-12 | ~100-150 | Reference |

Note: Benchmarks are approximate and vary based on CPU generation, key size, and system load.

Sources: README.md:159-167
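
A minimal usage sketch of the same hash function via the xxhash-rust crate (assuming its xxh3 feature is enabled); this mirrors how keys are hashed but is not the crate's internal code:

```rust
use xxhash_rust::xxh3::xxh3_64;

// Hash a key as XXH3-64 over the raw key bytes.
fn key_hash(key: &[u8]) -> u64 {
    xxh3_64(key)
}

fn main() {
    println!("{:#018x}", key_hash(b"user:42"));
}
```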

Index Structure Performance

The key index uses DashMap (a concurrent hash map) for thread-safe lookups:

Index Lookup Flow

```mermaid
sequenceDiagram
    participant App as "Application"
    participant DS as "DataStore"
    participant Hash as "XXH3_64"
    participant Index as "DashMap\nkey_indexer"
    participant MMap as "Mmap\nmemory-mapped file"

    App->>DS: read(key)
    DS->>Hash: hash(key)
    Hash-->>DS: u64 hash
    DS->>Index: get(hash)

    alt "Hash found"
        Index-->>DS: (tag, offset)
        DS->>DS: Verify tag
        DS->>MMap: Create EntryHandle\nat offset
        MMap-->>DS: EntryHandle
        DS-->>App: Payload slice
    else "Hash not found"
        Index-->>DS: None
        DS-->>App: Error: not found
    end
```

Sources: README.md:159-167 High-Level Diagram 2
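
A simplified model of the index shape described above (types are illustrative; the real key_indexer may store a different tag/offset representation):

```rust
use dashmap::DashMap;

// Simplified model of the key index: 64-bit key hash -> (tag, offset).
// Types are illustrative; the real index layout may differ.
type KeyIndexer = DashMap<u64, (u64, u64)>;

fn lookup(index: &KeyIndexer, key_hash: u64) -> Option<(u64, u64)> {
    // Expected O(1) lookup; DashMap shards the map internally for concurrency.
    index.get(&key_hash).map(|entry| *entry.value())
}
```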

Index Memory Overhead

| Entry Count | Index Size | Overhead per Entry | Total Overhead (1M entries) |
|---|---|---|---|
| 1,000 | ~32 KB | ~32 bytes | ~32 MB |
| 10,000 | ~320 KB | ~32 bytes | ~32 MB |
| 100,000 | ~3.2 MB | ~32 bytes | ~32 MB |
| 1,000,000 | ~32 MB | ~32 bytes | ~32 MB |

The index overhead is proportional to the number of unique keys, not the total file size, making it memory-efficient even for large datasets.

Sources: README.md:159-167


Performance Tuning Guidelines

Choosing the Right Write Mode

Decision Matrix:

Sources: README.md:208-223

Choosing the Right Read Mode

Decision Matrix:

Sources: README.md:224-247

System-Level Optimizations

Operating System Tuning:

| Parameter | Recommended Value | Impact | Notes |
|---|---|---|---|
| vm.max_map_count | ≥ 262144 | Enables large mmap regions | Linux only |
| File system | ext4, XFS, APFS | Better mmap performance | Avoid network filesystems |
| Huge pages | Transparent enabled | Reduces TLB misses | Linux kernel config |
| I/O scheduler | none or mq-deadline | Lower write latency | For SSD/NVMe devices |

Hardware Considerations:

| Component | Recommendation | Reason |
|---|---|---|
| CPU | AVX2 or ARM NEON | SIMD acceleration |
| RAM | ≥ 8GB + working set | Effective mmap caching |
| Storage | NVMe SSD | Sequential write speed |
| Cache | L3 ≥ 8MB | Better index hit rate |

Sources: README.md:249-257 README.md:159-167


Benchmarking and Profiling

Key Performance Indicators

Write Performance Metrics:

| Metric | Target | Measurement Method |
|---|---|---|
| Single write latency | < 10 µs | P50, P95, P99 percentiles |
| Batch write throughput | > 100 MB/s | Bytes written per second |
| Lock contention ratio | < 5% | Time waiting / total time |
| Flush overhead | < 1 ms | fsync() call duration |

Read Performance Metrics:

| Metric | Target | Measurement Method |
|---|---|---|
| Index lookup latency | < 100 ns | P50, P95, P99 percentiles |
| Memory access latency | < 1 µs | Time to first byte |
| Cache hit rate | > 95% | OS page cache statistics |
| Parallel speedup | > 0.8 × cores | Amdahl's law efficiency |

Sources: README.md:159-167 .github/workflows/rust-lint.yml:1-44

Profiling Tools

Recommended profiling workflow:

  1. cargo-criterion: For micro-benchmarks of individual operations (see the example below)
  2. perf (Linux) or Instruments (macOS): For system-level profiling
  3. flamegraph: For visualizing hot code paths
  4. valgrind/cachegrind: For cache-miss analysis

Sources: Cargo.lock:607-627
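
For example, a Criterion micro-benchmark for the key-hashing hot path might look like this (benchmark name and key are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
use xxhash_rust::xxh3::xxh3_64;

// Illustrative Criterion micro-benchmark for the key-hashing hot path.
fn bench_key_hash(c: &mut Criterion) {
    let key = b"some-representative-key";
    c.bench_function("xxh3_64_key_hash", |b| {
        b.iter(|| xxh3_64(black_box(key)))
    });
}

criterion_group!(benches, bench_key_hash);
criterion_main!(benches);
```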


Summary

SIMD R Drive achieves high performance through:

  1. SIMD acceleration via simd_copy with AVX2/NEON implementations for write operations
  2. 64-byte payload alignment ensuring cache-line efficiency and zero-copy access
  3. Hardware-accelerated XXH3_64 hashing for O(1) index lookups
  4. Multiple access modes optimized for different workload patterns
  5. Memory-mapped zero-copy reads eliminating deserialization overhead

The system balances these optimizations to provide consistent, predictable performance across diverse workloads while maintaining data integrity and thread safety.

Sources: README.md:249-257 README.md:52-59 README.md:159-167 CHANGELOG.md:25-51