Performance Optimizations

Purpose and Scope

This document describes the performance optimization strategies employed by SIMD R Drive to achieve high-throughput storage operations. It covers hardware acceleration through SIMD instructions, cache-efficient memory alignment, zero-copy access patterns, lock-free concurrent operations, and the benchmarking infrastructure used to validate these optimizations.

For implementation details of specific SIMD operations, see SIMD Acceleration. For cache-line alignment specifics, see Payload Alignment and Cache Efficiency. For operation mode characteristics, see Write and Read Modes. For benchmark execution and analysis, see Benchmarking.


Performance Architecture Overview

The performance optimization stack consists of multiple layers, from hardware acceleration at the bottom to application-level operation modes at the top. Each layer contributes to the overall system throughput.

Performance Architecture Stack

graph TB
    subgraph Hardware["Hardware Features"]
CPU["CPU Architecture"]
AVX2["AVX2 Instructions\nx86_64"]
NEON["NEON Instructions\naarch64"]
CACHE["64-byte Cache Lines"]
end
    
    subgraph SIMD["SIMD Optimization Layer"]
SIMD_COPY["simd_copy Function"]
XXH3["xxh3_64 Hashing"]
FEATURE_DETECT["Runtime Feature Detection"]
end
    
    subgraph Memory["Memory Management"]
MMAP["memmap2::Mmap"]
ALIGN["PAYLOAD_ALIGNMENT = 64"]
DASHMAP["DashMap Index"]
end
    
    subgraph Concurrency["Concurrency Primitives"]
ATOMIC["AtomicU64 tail_offset"]
RWLOCK["RwLock File"]
ARC["Arc Mmap Sharing"]
end
    
    subgraph Operations["Operation Modes"]
WRITE_SINGLE["Single Write"]
WRITE_BATCH["Batch Write"]
WRITE_STREAM["Stream Write"]
READ_DIRECT["Direct Read"]
READ_STREAM["Stream Read"]
READ_PARALLEL["Parallel Iteration"]
end
    
 
    CPU --> AVX2
    CPU --> NEON
    CPU --> CACHE

    AVX2 --> SIMD_COPY
    NEON --> SIMD_COPY
    AVX2 --> XXH3
    NEON --> XXH3

    FEATURE_DETECT --> SIMD_COPY

    CACHE --> ALIGN
    SIMD_COPY --> ALIGN

    ALIGN --> MMAP
    MMAP --> ARC
    MMAP --> DASHMAP

    DASHMAP --> ATOMIC
    RWLOCK --> ATOMIC

    SIMD_COPY --> WRITE_SINGLE
    SIMD_COPY --> WRITE_BATCH
    MMAP --> READ_DIRECT
    ARC --> READ_PARALLEL

The diagram shows how hardware features enable SIMD operations, which work with aligned memory to maximize cache efficiency. The memory management layer uses zero-copy access patterns, while concurrency primitives enable safe multi-threaded operations. Application-level operation modes leverage these lower layers for optimal performance.

Sources: README.md:5-7 README.md:249-256 src/storage_engine/simd_copy.rs:1-139


SIMD Acceleration Components

SIMD R Drive uses vectorized instructions to accelerate two critical operations: memory copying during writes and key hashing for indexing. The system detects available CPU features at runtime and selects the optimal implementation.

SIMD Component Architecture

graph TB
    subgraph Detection["Feature Detection"]
RUNTIME["std::is_x86_feature_detected!"]
X86_CHECK["Check avx2 on x86_64"]
ARM_DEFAULT["Default neon on aarch64"]
end
    
    subgraph Implementations["SIMD Implementations"]
SIMD_COPY_X86["simd_copy_x86\n_mm256_loadu_si256\n_mm256_storeu_si256"]
SIMD_COPY_ARM["simd_copy_arm\nvld1q_u8\nvst1q_u8"]
FALLBACK["Scalar copy_from_slice"]
end
    
    subgraph Operations["Accelerated Operations"]
WRITE_OP["Write Operations"]
HASH_OP["xxh3_64 Hashing"]
end
    
    subgraph Characteristics["Performance Characteristics"]
AVX2_SIZE["32-byte chunks AVX2"]
NEON_SIZE["16-byte chunks NEON"]
REMAINDER["Scalar remainder"]
end
    
 
    RUNTIME --> X86_CHECK
    RUNTIME --> ARM_DEFAULT

    X86_CHECK -->|Detected| SIMD_COPY_X86
    X86_CHECK -->|Not Detected| FALLBACK
    ARM_DEFAULT --> SIMD_COPY_ARM

    SIMD_COPY_X86 --> AVX2_SIZE
    SIMD_COPY_ARM --> NEON_SIZE
    AVX2_SIZE --> REMAINDER
    NEON_SIZE --> REMAINDER

    SIMD_COPY_X86 --> WRITE_OP
    SIMD_COPY_ARM --> WRITE_OP
    FALLBACK --> WRITE_OP

    HASH_OP --> WRITE_OP

The simd_copy function performs runtime feature detection to select the appropriate SIMD implementation. On x86_64, it checks for AVX2 support and processes 32 bytes per iteration. On aarch64, it uses NEON instructions to process 16 bytes per iteration. A scalar fallback handles unsupported architectures and the remainder bytes after vectorized processing.

Sources: src/storage_engine/simd_copy.rs:110-138 src/storage_engine/simd_copy.rs:16-62 src/storage_engine/simd_copy.rs:64-108

SIMD Copy Function Details

The simd_copy function provides the core memory copying optimization used during write operations:

| Architecture | SIMD Extension | Chunk Size | Load Instruction | Store Instruction | Feature Detection |
|---|---|---|---|---|---|
| x86_64 | AVX2 | 32 bytes | _mm256_loadu_si256 | _mm256_storeu_si256 | is_x86_feature_detected!("avx2") |
| aarch64 | NEON | 16 bytes | vld1q_u8 | vst1q_u8 | Always enabled |
| Fallback | None | Variable | N/A | N/A | Other architectures |

The function is defined in src/storage_engine/simd_copy.rs:110-138 with platform-specific implementations in src/storage_engine/simd_copy.rs:33-62 (x86_64) and src/storage_engine/simd_copy.rs:81-108 (aarch64).
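
The sketch below illustrates the same dispatch pattern in simplified form. It is not the crate's actual simd_copy implementation; it only demonstrates the runtime AVX2 check, 32-byte chunked copies, and the scalar remainder described above.

```rust
// Copy 32-byte chunks with AVX2; the caller must have verified AVX2 support.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_chunks_avx2(dst: &mut [u8], src: &[u8]) -> usize {
    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};
    let mut i = 0;
    while i + 32 <= src.len() {
        let chunk = _mm256_loadu_si256(src.as_ptr().add(i) as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, chunk);
        i += 32;
    }
    i // number of bytes copied with vector instructions
}

// Runtime dispatch: use AVX2 when detected, fall back to a scalar copy
// otherwise, and finish the sub-32-byte remainder with copy_from_slice.
#[cfg(target_arch = "x86_64")]
fn simd_copy_sketch(dst: &mut [u8], src: &[u8]) {
    assert!(dst.len() >= src.len());
    let copied = if std::is_x86_feature_detected!("avx2") {
        unsafe { copy_chunks_avx2(dst, src) }
    } else {
        0
    };
    dst[copied..src.len()].copy_from_slice(&src[copied..]);
}
```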

Hardware-Accelerated Hashing

The xxh3_64 hashing algorithm used by KeyIndexer leverages SIMD extensions to accelerate key hashing operations. The dependency is configured in Cargo.toml:34 with features ["xxh3", "const_xxh3"].

Hash acceleration characteristics:

  • SSE2 : Universally supported on x86_64, enabled by default
  • AVX2 : Additional performance gains on capable CPUs
  • NEON : Default on aarch64 targets, providing ARM SIMD acceleration
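
For reference, hashing a key with the xxh3 feature of xxhash-rust is a single call. This is a general usage sketch rather than code taken from KeyIndexer:

```rust
use xxhash_rust::xxh3::xxh3_64;

fn main() {
    // The xxh3 implementation uses SIMD code paths (SSE2/AVX2/NEON) where available.
    let key_hash: u64 = xxh3_64(b"user:42:profile");
    println!("{key_hash:#018x}");
}
```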

Sources: README.md:158-166 Cargo.toml:34


Cache-Line Aligned Memory Layout

The storage engine aligns all non-tombstone payloads to 64-byte boundaries, matching typical CPU cache-line sizes. This alignment strategy lets SIMD operations run efficiently without crossing cache-line boundaries.

64-byte Alignment Strategy

graph LR
    subgraph Entry1["Entry N"]
PREPAD1["Pre-Padding\n0-63 bytes"]
PAYLOAD1["Payload\nAligned Start"]
META1["Metadata\n20 bytes"]
end
    
    subgraph Entry2["Entry N+1"]
PREPAD2["Pre-Padding\n0-63 bytes"]
PAYLOAD2["Payload\nAligned Start"]
META2["Metadata\n20 bytes"]
end
    
    subgraph Alignment["Alignment Calculation"]
PREV_TAIL["prev_tail_offset"]
CALC["pad = (A - (prev_tail mod A)) mod A\nA = PAYLOAD_ALIGNMENT = 64"]
NEXT_START["payload_start mod 64 = 0"]
end
    
 
    META1 --> PREV_TAIL
    PREV_TAIL --> CALC
    CALC --> PREPAD2
    PREPAD2 --> PAYLOAD2
    PAYLOAD2 --> NEXT_START

Each payload is preceded by 0-63 bytes of padding to ensure the payload itself starts on a 64-byte boundary. The padding length is calculated based on the previous entry’s tail offset. This enables efficient SIMD loads/stores and ensures optimal cache-line utilization.

Alignment Formula

The pre-padding calculation ensures proper alignment:

pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)

Where:

  • PAYLOAD_ALIGNMENT = 64 (defined in simd-r-drive-entry-handle/src/constants.rs)
  • prev_tail is the absolute file offset after the previous entry’s metadata
  • The bitwise AND with (PAYLOAD_ALIGNMENT - 1) handles the modulo operation efficiently since 64 is a power of 2
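
The formula can be checked against a few concrete offsets. This is a minimal standalone sketch of the calculation, not code from the crate:

```rust
const PAYLOAD_ALIGNMENT: u64 = 64;

/// Pre-padding needed so the next payload starts on a 64-byte boundary.
fn pre_pad(prev_tail: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
}

fn main() {
    assert_eq!(pre_pad(0), 0);    // already aligned, no padding
    assert_eq!(pre_pad(100), 28); // 100 + 28 = 128, a multiple of 64
    assert_eq!(pre_pad(129), 63); // worst case: 63 bytes of pre-padding
}
```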

Sources: README.md:51-59 README.md:114-124

Alignment Benefits

| Benefit | Description | Impact |
|---|---|---|
| SIMD Efficiency | Vectorized operations don’t cross cache-line boundaries | 2-4x speedup on bulk copies |
| Cache Performance | Single payload typically fits within contiguous cache lines | Reduced cache misses |
| Zero-Copy Casting | Aligned payloads can be safely cast to typed slices (&[u32], &[u64]) | No buffer allocation needed |
| Predictable Performance | Consistent access patterns regardless of payload size | Stable latency characteristics |

The alignment is enforced during write operations and verified during entry access through EntryHandle.
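
The zero-copy casting benefit can be illustrated with std's slice::align_to. This is a hypothetical sketch in which payload stands in for the byte slice a handle exposes; it is not an API provided by the crate:

```rust
/// Reinterpret a 64-byte-aligned payload as a typed slice without copying.
/// Returns None if the slice is not cleanly divisible into u64 values.
fn as_u64_slice(payload: &[u8]) -> Option<&[u64]> {
    // SAFETY: every byte pattern is a valid u64; align_to splits off an
    // (empty, for aligned payloads) unaligned prefix and suffix.
    let (prefix, middle, suffix) = unsafe { payload.align_to::<u64>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(middle)
    } else {
        None
    }
}
```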

Sources: README.md:51-59


Zero-Copy Memory Access Patterns

SIMD R Drive achieves zero-copy reads by memory-mapping the entire storage file and providing direct byte slice access to payloads. This eliminates deserialization overhead and reduces memory pressure for large datasets.

Zero-Copy Read Architecture

graph TB
    subgraph File["Storage File"]
        DISK["simd-r-drive.bin"]
    end

    subgraph Mapping["Memory Mapping"]
        MMAP_CREATE["Mmap::map"]
        MMAP_INSTANCE["Mmap Instance"]
        ARC_MMAP["Arc Mmap\nShared Reference"]
    end

    subgraph Index["KeyIndexer"]
        DASHMAP_STRUCT["DashMap key_hash -> packed_value"]
        PACKED["Packed u64\n16-bit tag\n48-bit offset"]
    end

    subgraph Access["Zero-Copy Access"]
        LOOKUP["read key"]
        GET_OFFSET["Extract offset from packed_value"]
        SLICE["Arc Mmap byte range"]
        HANDLE["EntryHandle\nDirect payload reference"]
    end

    subgraph Threading["Multi-threaded Access"]
        CLONE["Arc::clone"]
        THREAD1["Thread 1 read"]
        THREAD2["Thread 2 read"]
        THREAD3["Thread N read"]
    end

    DISK --> MMAP_CREATE
    MMAP_CREATE --> MMAP_INSTANCE
    MMAP_INSTANCE --> ARC_MMAP

    ARC_MMAP --> DASHMAP_STRUCT
    DASHMAP_STRUCT --> PACKED

    LOOKUP --> DASHMAP_STRUCT
    DASHMAP_STRUCT --> GET_OFFSET
    GET_OFFSET --> SLICE
    ARC_MMAP --> SLICE
    SLICE --> HANDLE

    ARC_MMAP --> CLONE
    CLONE --> THREAD1
    CLONE --> THREAD2
    CLONE --> THREAD3
    THREAD1 --> HANDLE
    THREAD2 --> HANDLE
    THREAD3 --> HANDLE

The storage file is memory-mapped once and shared via Arc<Mmap> across threads. Read operations perform hash lookups in DashMap to get file offsets, then return EntryHandle instances that provide direct views into the mapped memory. Multiple threads can safely read concurrently without copying data.

Sources: README.md:43-49 README.md:173-175

Memory-Mapped File Management

The memmap2 crate provides the memory mapping functionality:

  • Configured as a workspace dependency in Cargo.toml:102
  • Used in the DataStore implementation
  • Protected by Mutex<Arc<Mmap>> to prevent unsafe remapping during active reads
  • Automatically remapped when the file grows beyond the current mapping size
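
A minimal mapping step with memmap2 looks like the sketch below. The real DataStore additionally wraps the mapping for remapping and concurrent access; the path handling here is only illustrative:

```rust
use memmap2::Mmap;
use std::fs::File;
use std::sync::Arc;

fn map_storage(path: &str) -> std::io::Result<Arc<Mmap>> {
    let file = File::open(path)?;
    // SAFETY: the underlying file must not be truncated while the map is live.
    let mmap = unsafe { Mmap::map(&file)? };
    // Cloning the Arc gives each reader thread a cheap handle to the same mapping.
    Ok(Arc::new(mmap))
}
```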

EntryHandle Zero-Copy Interface

The EntryHandle type provides zero-copy access to stored payloads without allocating intermediate buffers:

| Method | Return Type | Copy Behavior | Use Case |
|---|---|---|---|
| payload() | &[u8] | Zero-copy reference | Direct access to full payload |
| payload_reader() | impl Read | Buffered reads | Streaming large payloads |
| as_arrow_buffer() | arrow::Buffer | Zero-copy view | Apache Arrow integration |

The handle maintains a reference to the memory-mapped region and calculates the payload range based on entry metadata.
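
A hypothetical read path using the accessors listed above might look like the following. The DataStore::read signature, its return type, and the error type are assumptions for illustration and may differ from the actual API:

```rust
use std::io::Read;

// `store` is assumed to expose read(key) -> Option<EntryHandle>.
fn dump_entry(store: &simd_r_drive::DataStore, key: &[u8]) -> std::io::Result<()> {
    if let Some(handle) = store.read(key) {
        // Zero-copy: a direct view into the memory-mapped file.
        let payload: &[u8] = handle.payload();
        println!("{} bytes read without copying", payload.len());

        // Buffered streaming for very large payloads.
        let mut buf = Vec::new();
        handle.payload_reader().read_to_end(&mut buf)?;
    }
    Ok(())
}
```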

Sources: README.md:228-233


Lock-Free Concurrent Read Operations

The storage engine enables multiple threads to perform concurrent reads without acquiring locks, using DashMap for the in-memory index and atomic operations for metadata tracking.

Lock-Free Read Architecture

graph TB
    subgraph Writes["Write Path Synchronized"]
        WRITE_LOCK["RwLock File write"]
        APPEND["Append entry to file"]
        UPDATE_INDEX["DashMap::insert"]
        UPDATE_TAIL["AtomicU64::store tail_offset"]
    end

    subgraph Reads["Read Path Lock-Free"]
        READ1["Thread 1 read"]
        READ2["Thread 2 read"]
        READN["Thread N read"]
        LOOKUP1["DashMap::get"]
        LOOKUP2["DashMap::get"]
        LOOKUPN["DashMap::get"]
        MMAP_ACCESS["Arc Mmap shared access"]
    end

    subgraph Synchronization["Concurrency Control"]
        RWLOCK_STRUCT["RwLock File"]
        ATOMIC_STRUCT["AtomicU64 tail_offset"]
        DASHMAP_STRUCT["DashMap Index"]
        ARC_STRUCT["Arc Mmap"]
    end

    WRITE_LOCK --> APPEND
    APPEND --> UPDATE_INDEX
    UPDATE_INDEX --> UPDATE_TAIL

    READ1 --> LOOKUP1
    READ2 --> LOOKUP2
    READN --> LOOKUPN

    LOOKUP1 --> MMAP_ACCESS
    LOOKUP2 --> MMAP_ACCESS
    LOOKUPN --> MMAP_ACCESS

    RWLOCK_STRUCT -.controls.-> WRITE_LOCK
    ATOMIC_STRUCT -.updated by.-> UPDATE_TAIL
    DASHMAP_STRUCT -.provides.-> LOOKUP1
    DASHMAP_STRUCT -.provides.-> LOOKUP2
    DASHMAP_STRUCT -.provides.-> LOOKUPN
    ARC_STRUCT -.enables.-> MMAP_ACCESS

Write operations acquire an RwLock to ensure sequential appends, but read operations access the DashMap index without locking. The DashMap data structure provides lock-free reads through internal sharding and fine-grained locking. The memory-mapped file is shared via Arc<Mmap>, allowing concurrent zero-copy access.

Sources: README.md:172-183 Cargo.toml:27

DashMap Index Characteristics

The DashMap dependency is configured in Cargo.toml:27 and provides these characteristics:

  • Lock-free reads : Read operations don’t block each other
  • Sharded locking : Write operations only lock specific shards
  • Concurrent inserts : Multiple threads can update different shards simultaneously
  • Memory overhead : Approximately 64 bytes per entry for hash table overhead
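
The access pattern is easy to see with DashMap directly. This minimal example is not the KeyIndexer itself, but it exercises the same concurrent get path that lets index lookups scale across reader threads:

```rust
use dashmap::DashMap;
use std::sync::Arc;
use std::thread;

fn main() {
    // key_hash -> packed (tag | offset) value, as in the KeyIndexer description.
    let index: Arc<DashMap<u64, u64>> = Arc::new(DashMap::new());
    index.insert(0xDEAD_BEEF, 4096);

    let readers: Vec<_> = (0..4)
        .map(|_| {
            let index = Arc::clone(&index);
            thread::spawn(move || {
                // `get` only touches one shard; readers do not serialize on a global lock.
                assert_eq!(index.get(&0xDEAD_BEEF).map(|e| *e.value()), Some(4096));
            })
        })
        .collect();
    for r in readers {
        r.join().unwrap();
    }
}
```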

Atomic Operations

The AtomicU64 for tail_offset tracking provides:

  • Ordering guarantees : SeqCst ordering ensures consistency across threads
  • Lock-free updates : Writes update the tail without blocking reads
  • Single-word operations : 64-bit atomic operations are efficient on modern CPUs
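
A sketch of the publish/observe pattern for the tail offset, assuming the SeqCst ordering described above; in the real engine the store happens while the write lock is held:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Writer side: publish the new tail after an append completes.
fn publish_tail(tail_offset: &AtomicU64, new_tail: u64) {
    tail_offset.store(new_tail, Ordering::SeqCst);
}

/// Reader side: observe the current end of valid data without taking any lock.
fn valid_region_end(tail_offset: &AtomicU64) -> u64 {
    tail_offset.load(Ordering::SeqCst)
}
```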

Sources: README.md:182-183


Operation Mode Performance Characteristics

SIMD R Drive provides multiple operation modes optimized for different workload patterns. Each mode has specific performance characteristics and resource usage profiles.

Write Operation Modes

| Mode | Method | Lock Duration | I/O Pattern | Flush Behavior | Best For |
|---|---|---|---|---|---|
| Single | write() | Per-entry | Sequential | Immediate | Low-latency single writes |
| Batch | batch_write() | Per-batch | Sequential | After batch | High-throughput bulk writes |
| Stream | write_stream() | Per-entry | Sequential | Immediate | Large entries (>1MB) |

Single Write (README.md:213-215):

  • Acquires RwLock for each entry
  • Flushes to disk immediately after write
  • Suitable for applications requiring durability guarantees per operation

Batch Write (README.md:217-219):

  • Acquires RwLock once for entire batch
  • Flushes to disk after all entries written
  • Reduces syscall overhead for bulk operations
  • Can write thousands of entries in single lock acquisition

Stream Write (README.md:221-223):

  • Accepts impl Read source for payload data
  • Copies data in chunks to avoid full in-memory allocation
  • Suitable for writing multi-megabyte or gigabyte-sized entries
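
Put together, the three modes might be used as follows. The method names come from the table above, while the exact parameter and error types are assumptions for illustration:

```rust
use std::io::Cursor;

fn write_modes(store: &simd_r_drive::DataStore) -> std::io::Result<()> {
    // Single write: one lock acquisition and one flush per entry.
    store.write(b"config", b"v1")?;

    // Batch write: one lock acquisition and one flush for the whole batch.
    store.batch_write(&[
        (b"a".as_slice(), b"1".as_slice()),
        (b"b".as_slice(), b"2".as_slice()),
    ])?;

    // Stream write: the payload is pulled from any `impl Read` in chunks.
    let large = vec![0u8; 4 * 1024 * 1024];
    store.write_stream(b"blob", &mut Cursor::new(large))?;
    Ok(())
}
```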

Sources: README.md:208-223

Read Operation Modes

| Mode | Method | Memory Behavior | Parallelism | Best For |
|---|---|---|---|---|
| Direct | read() | Zero-copy | Single-threaded | Small to medium entries |
| Stream | payload_reader() | Buffered | Single-threaded | Large entries (>10MB) |
| Parallel | par_iter_entries() | Zero-copy | Multi-threaded | Bulk processing entire dataset |

Direct Read (README.md:228-233):

  • Returns EntryHandle with direct memory-mapped payload reference
  • Zero allocation for payload access
  • Full entry must fit in virtual address space
  • Fastest for entries under 10MB

Stream Read (README.md:234-241):

  • Reads payload incrementally through Read trait
  • Uses 8KB buffer internally
  • Avoids memory pressure for large entries
  • Non-zero-copy but memory-efficient

Parallel Iteration (README.md:242-247):

  • Requires the parallel feature flag in Cargo.toml:52
  • Uses Rayon for multi-threaded iteration
  • Processes all valid entries across CPU cores
  • Ideal for building in-memory caches or analytics workloads
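
With the parallel feature enabled, a bulk pass over the dataset could look like this hypothetical sketch; par_iter_entries comes from the table above and its item type is assumed to behave like an EntryHandle:

```rust
use rayon::prelude::*;

// Sum payload sizes across all valid entries using every available core.
fn total_payload_bytes(store: &simd_r_drive::DataStore) -> usize {
    store
        .par_iter_entries()
        .map(|entry| entry.payload().len())
        .sum()
}
```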

Sources: README.md:224-247 Cargo.toml:30 Cargo.toml:52


Benchmarking Infrastructure

SIMD R Drive uses Criterion.rs for statistical benchmarking of performance-critical operations. The benchmark suite validates the effectiveness of SIMD optimizations and concurrent access patterns.

Benchmarking Architecture

graph TB
    subgraph Benchmark_Suite["Benchmark Suite"]
        STORAGE_BENCH["storage_benchmark"]
        CONTENTION_BENCH["contention_benchmark"]
    end

    subgraph Storage_Tests["Storage Benchmark Tests"]
        WRITE_SINGLE["Single Write Throughput"]
        WRITE_BATCH["Batch Write Throughput"]
        READ_SEQ["Sequential Read Throughput"]
        READ_RAND["Random Read Throughput"]
    end

    subgraph Contention_Tests["Contention Benchmark Tests"]
        MULTI_THREAD["Multi-threaded Read Contention"]
        PARALLEL_ITER["Parallel Iteration Performance"]
    end

    subgraph Criterion["Criterion.rs Framework"]
        STATISTICAL["Statistical Analysis"]
        COMPARISON["Baseline Comparison"]
        PLOTS["Performance Plots"]
        REPORTS["HTML Reports"]
    end

    STORAGE_BENCH --> WRITE_SINGLE
    STORAGE_BENCH --> WRITE_BATCH
    STORAGE_BENCH --> READ_SEQ
    STORAGE_BENCH --> READ_RAND

    CONTENTION_BENCH --> MULTI_THREAD
    CONTENTION_BENCH --> PARALLEL_ITER

    WRITE_SINGLE --> STATISTICAL
    WRITE_BATCH --> STATISTICAL
    READ_SEQ --> STATISTICAL
    READ_RAND --> STATISTICAL
    MULTI_THREAD --> STATISTICAL
    PARALLEL_ITER --> STATISTICAL

    STATISTICAL --> COMPARISON
    COMPARISON --> PLOTS
    PLOTS --> REPORTS

The benchmark suite consists of two main harnesses: storage_benchmark measures fundamental read/write throughput, while contention_benchmark measures concurrent access performance. Criterion.rs provides statistical analysis, baseline comparisons, and HTML reports for each benchmark run.

Sources: Cargo.toml:57-63 Cargo.toml:98

Benchmark Configuration

The benchmark harnesses are defined in Cargo.toml:57-63.
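
The exact entries live in the manifest; a typical declaration for the two harnesses named in the diagram above looks roughly like this sketch (paths and layout are illustrative):

```toml
[[bench]]
name = "storage_benchmark"
harness = false

[[bench]]
name = "contention_benchmark"
harness = false
```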

The harness = false setting disables Rust’s default benchmark harness, allowing Criterion.rs to provide its own test runner with statistical analysis capabilities.

Criterion.rs Integration

The Criterion.rs framework is configured as a development dependency in Cargo.toml:39 and provides:

  • Statistical rigor : Multiple iterations with outlier detection
  • Baseline comparison : Compare performance across code changes
  • Regression detection : Automatically detect performance regressions
  • Visualization : Generate performance plots and HTML reports
  • Reproducibility : Consistent measurement methodology across environments
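
For orientation, a minimal Criterion harness has the following shape; the actual storage_benchmark and contention_benchmark exercise DataStore operations rather than this placeholder closure:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_placeholder(c: &mut Criterion) {
    // Criterion runs the closure many times and reports statistics and outliers.
    c.bench_function("placeholder", |b| b.iter(|| black_box(2_u64 + 2)));
}

criterion_group!(benches, bench_placeholder);
criterion_main!(benches);
```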

Benchmarks are executed with cargo bench; the results are stored in target/criterion/ for historical comparison.

Sources: Cargo.toml:39 Cargo.toml:57-63


Performance Feature Summary

The following table summarizes the key performance features and their implementation locations:

| Feature | Implementation | Benefit | Configuration |
|---|---|---|---|
| AVX2 SIMD | simd_copy_x86 in src/storage_engine/simd_copy.rs:33-62 | 32-byte vectorized copies | Runtime feature detection |
| NEON SIMD | simd_copy_arm in src/storage_engine/simd_copy.rs:81-108 | 16-byte vectorized copies | Always enabled on aarch64 |
| 64-byte Alignment | PAYLOAD_ALIGNMENT constant | Cache-line efficiency | Build-time constant |
| Zero-Copy Reads | memmap2::Mmap | No deserialization overhead | Always enabled |
| Lock-Free Reads | DashMap in Cargo.toml:27 | Concurrent read scaling | Always enabled |
| Parallel Iteration | Rayon in Cargo.toml:30 | Multi-core bulk processing | parallel feature flag |
| Hardware Hashing | xxhash-rust in Cargo.toml:34 | SIMD-accelerated indexing | Always enabled |

For detailed information on each feature, see the corresponding child sections 5.1, 5.2, 5.3, and 5.4.

Sources: README.md:5-7 README.md:249-256 Cargo.toml:27 Cargo.toml:30 Cargo.toml:34
