This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Performance Optimizations
Purpose and Scope
This document describes the performance optimization strategies employed by SIMD R Drive to achieve high-throughput storage operations. It covers hardware acceleration through SIMD instructions, cache-efficient memory alignment, zero-copy access patterns, lock-free concurrent operations, and the benchmarking infrastructure used to validate these optimizations.
For implementation details of specific SIMD operations, see SIMD Acceleration. For cache-line alignment specifics, see Payload Alignment and Cache Efficiency. For operation mode characteristics, see Write and Read Modes. For benchmark execution and analysis, see Benchmarking.
Performance Architecture Overview
The performance optimization stack consists of multiple layers, from hardware acceleration at the bottom to application-level operation modes at the top. Each layer contributes to the overall system throughput.
Performance Architecture Stack
```mermaid
graph TB
subgraph Hardware["Hardware Features"]
CPU["CPU Architecture"]
AVX2["AVX2 Instructions\nx86_64"]
NEON["NEON Instructions\naarch64"]
CACHE["64-byte Cache Lines"]
end
subgraph SIMD["SIMD Optimization Layer"]
SIMD_COPY["simd_copy Function"]
XXH3["xxh3_64 Hashing"]
FEATURE_DETECT["Runtime Feature Detection"]
end
subgraph Memory["Memory Management"]
MMAP["memmap2::Mmap"]
ALIGN["PAYLOAD_ALIGNMENT = 64"]
DASHMAP["DashMap Index"]
end
subgraph Concurrency["Concurrency Primitives"]
ATOMIC["AtomicU64 tail_offset"]
RWLOCK["RwLock File"]
ARC["Arc Mmap Sharing"]
end
subgraph Operations["Operation Modes"]
WRITE_SINGLE["Single Write"]
WRITE_BATCH["Batch Write"]
WRITE_STREAM["Stream Write"]
READ_DIRECT["Direct Read"]
READ_STREAM["Stream Read"]
READ_PARALLEL["Parallel Iteration"]
end
CPU --> AVX2
CPU --> NEON
CPU --> CACHE
AVX2 --> SIMD_COPY
NEON --> SIMD_COPY
AVX2 --> XXH3
NEON --> XXH3
FEATURE_DETECT --> SIMD_COPY
CACHE --> ALIGN
SIMD_COPY --> ALIGN
ALIGN --> MMAP
MMAP --> ARC
MMAP --> DASHMAP
DASHMAP --> ATOMIC
RWLOCK --> ATOMIC
SIMD_COPY --> WRITE_SINGLE
SIMD_COPY --> WRITE_BATCH
MMAP --> READ_DIRECT
ARC --> READ_PARALLEL
```
The diagram shows how hardware features enable SIMD operations, which work with aligned memory to maximize cache efficiency. The memory management layer uses zero-copy access patterns, while concurrency primitives enable safe multi-threaded operations. Application-level operation modes leverage these lower layers for optimal performance.
Sources: README.md:5-7 README.md:249-256 src/storage_engine/simd_copy.rs:1-139
SIMD Acceleration Components
SIMD R Drive uses vectorized instructions to accelerate two critical operations: memory copying during writes and key hashing for indexing. The system detects available CPU features at runtime and selects the optimal implementation.
SIMD Component Architecture
```mermaid
graph TB
subgraph Detection["Feature Detection"]
RUNTIME["std::is_x86_feature_detected!"]
X86_CHECK["Check avx2 on x86_64"]
ARM_DEFAULT["Default neon on aarch64"]
end
subgraph Implementations["SIMD Implementations"]
SIMD_COPY_X86["simd_copy_x86\n_mm256_loadu_si256\n_mm256_storeu_si256"]
SIMD_COPY_ARM["simd_copy_arm\nvld1q_u8\nvst1q_u8"]
FALLBACK["Scalar copy_from_slice"]
end
subgraph Operations["Accelerated Operations"]
WRITE_OP["Write Operations"]
HASH_OP["xxh3_64 Hashing"]
end
subgraph Characteristics["Performance Characteristics"]
AVX2_SIZE["32-byte chunks AVX2"]
NEON_SIZE["16-byte chunks NEON"]
REMAINDER["Scalar remainder"]
end
RUNTIME --> X86_CHECK
RUNTIME --> ARM_DEFAULT
X86_CHECK -->|Detected| SIMD_COPY_X86
X86_CHECK -->|Not Detected| FALLBACK
ARM_DEFAULT --> SIMD_COPY_ARM
SIMD_COPY_X86 --> AVX2_SIZE
SIMD_COPY_ARM --> NEON_SIZE
AVX2_SIZE --> REMAINDER
NEON_SIZE --> REMAINDER
SIMD_COPY_X86 --> WRITE_OP
SIMD_COPY_ARM --> WRITE_OP
FALLBACK --> WRITE_OP
HASH_OP --> WRITE_OP
```
The simd_copy function performs runtime feature detection to select the appropriate SIMD implementation. On x86_64, it checks for AVX2 support and processes 32 bytes per iteration. On aarch64, it uses NEON instructions to process 16 bytes per iteration. A scalar fallback handles unsupported architectures and the remainder bytes after vectorized processing.
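The following is a minimal sketch of this dispatch pattern, assuming a standalone copy helper rather than the crate's actual simd_copy signature; it illustrates the runtime AVX2 check, the 32-byte vectorized loop, and the scalar handling of remainder bytes:

```rust
/// Illustrative runtime-dispatch copy; the real `simd_copy` lives in
/// src/storage_engine/simd_copy.rs and also has a NEON path for aarch64.
pub fn copy_with_simd(dst: &mut [u8], src: &[u8]) {
    assert!(dst.len() >= src.len());

    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // Safe to call: AVX2 support was verified at runtime.
            unsafe { copy_avx2(dst, src) };
            return;
        }
    }

    // Scalar fallback for other architectures or older CPUs.
    dst[..src.len()].copy_from_slice(src);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_avx2(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

    let chunks = src.len() / 32; // AVX2 registers hold 32 bytes.
    for i in 0..chunks {
        let offset = i * 32;
        let v = _mm256_loadu_si256(src.as_ptr().add(offset) as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr().add(offset) as *mut __m256i, v);
    }

    // Copy any remainder bytes with a scalar loop.
    let tail = chunks * 32;
    dst[tail..src.len()].copy_from_slice(&src[tail..]);
}
```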
Sources: src/storage_engine/simd_copy.rs:110-138 src/storage_engine/simd_copy.rs:16-62 src/storage_engine/simd_copy.rs:64-108
SIMD Copy Function Details
The simd_copy function provides the core memory copying optimization used during write operations:
| Architecture | SIMD Extension | Chunk Size | Load Instruction | Store Instruction | Feature Detection |
|---|---|---|---|---|---|
| x86_64 | AVX2 | 32 bytes | _mm256_loadu_si256 | _mm256_storeu_si256 | is_x86_feature_detected!("avx2") |
| aarch64 | NEON | 16 bytes | vld1q_u8 | vst1q_u8 | Always enabled |
| Fallback | None | Variable | N/A | N/A | Other architectures |
The function is defined in src/storage_engine/simd_copy.rs:110-138 with platform-specific implementations in src/storage_engine/simd_copy.rs:33-62 (x86_64) and src/storage_engine/simd_copy.rs:81-108 (aarch64).
Hardware-Accelerated Hashing
The xxh3_64 hashing algorithm used by KeyIndexer leverages SIMD extensions to accelerate key hashing operations. The dependency is configured in Cargo.toml:34 with features ["xxh3", "const_xxh3"].
Hash acceleration characteristics:
- SSE2: Universally supported on x86_64, enabled by default
- AVX2: Additional performance gains on capable CPUs
- NEON: Default on aarch64 targets, providing ARM SIMD acceleration
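As a point of reference, hashing a key with the xxhash-rust crate looks like the sketch below; the exact call site inside KeyIndexer may differ:

```rust
use xxhash_rust::xxh3::xxh3_64;

fn main() {
    // Hash a key to the 64-bit value used for index lookups.
    // The SIMD paths (SSE2/AVX2/NEON) are selected inside the crate.
    let key = b"user:1234";
    let key_hash: u64 = xxh3_64(key);
    println!("{key_hash:#018x}");
}
```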
Sources: README.md:158-166 Cargo.toml:34
Cache-Line Aligned Memory Layout
The storage engine aligns all non-tombstone payloads to 64-byte boundaries, matching typical CPU cache-line sizes. This alignment ensures that SIMD loads and stores proceed efficiently without crossing cache-line boundaries.
64-byte Alignment Strategy
```mermaid
graph LR
subgraph Entry1["Entry N"]
PREPAD1["Pre-Padding\n0-63 bytes"]
PAYLOAD1["Payload\nAligned Start"]
META1["Metadata\n20 bytes"]
end
subgraph Entry2["Entry N+1"]
PREPAD2["Pre-Padding\n0-63 bytes"]
PAYLOAD2["Payload\nAligned Start"]
META2["Metadata\n20 bytes"]
end
subgraph Alignment["Alignment Calculation"]
PREV_TAIL["prev_tail_offset"]
CALC["pad = (A - prev_tail mod A) mod A\nA = PAYLOAD_ALIGNMENT = 64"]
NEXT_START["payload_start mod 64 = 0"]
end
META1 --> PREV_TAIL
PREV_TAIL --> CALC
CALC --> PREPAD2
PREPAD2 --> PAYLOAD2
PAYLOAD2 --> NEXT_START
```
Each payload is preceded by 0-63 bytes of padding to ensure the payload itself starts on a 64-byte boundary. The padding length is calculated based on the previous entry’s tail offset. This enables efficient SIMD loads/stores and ensures optimal cache-line utilization.
Alignment Formula
The pre-padding calculation ensures proper alignment:
pad = (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
Where:
- PAYLOAD_ALIGNMENT = 64 (defined in simd-r-drive-entry-handle/src/constants.rs)
- prev_tail is the absolute file offset after the previous entry's metadata
- The bitwise AND with (PAYLOAD_ALIGNMENT - 1) handles the modulo operation efficiently since 64 is a power of 2
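The formula above can be checked with a small standalone sketch; the pre_pad helper name is illustrative:

```rust
const PAYLOAD_ALIGNMENT: u64 = 64;

/// Number of pre-padding bytes needed so the next payload starts on a
/// 64-byte boundary. Works because PAYLOAD_ALIGNMENT is a power of two.
fn pre_pad(prev_tail: u64) -> u64 {
    (PAYLOAD_ALIGNMENT - (prev_tail % PAYLOAD_ALIGNMENT)) & (PAYLOAD_ALIGNMENT - 1)
}

fn main() {
    assert_eq!(pre_pad(0), 0);   // already aligned
    assert_eq!(pre_pad(20), 44); // 20 + 44 = 64
    assert_eq!(pre_pad(64), 0);  // already aligned
    assert_eq!(pre_pad(65), 63); // 65 + 63 = 128
}
```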
Sources: README.md:51-59 README.md:114-124
Alignment Benefits
| Benefit | Description | Impact |
|---|---|---|
| SIMD Efficiency | Vectorized operations don’t cross cache-line boundaries | 2-4x speedup on bulk copies |
| Cache Performance | Single payload typically fits within contiguous cache lines | Reduced cache misses |
| Zero-Copy Casting | Aligned payloads can be safely cast to typed slices (&[u32], &[u64]) | No buffer allocation needed |
| Predictable Performance | Consistent access patterns regardless of payload size | Stable latency characteristics |
The alignment is enforced during write operations and verified during entry access through EntryHandle.
Sources: README.md:51-59
Zero-Copy Memory Access Patterns
SIMD R Drive achieves zero-copy reads by memory-mapping the entire storage file and providing direct byte slice access to payloads. This eliminates deserialization overhead and reduces memory pressure for large datasets.
Zero-Copy Read Architecture
```mermaid
graph TB
subgraph File["Storage File"]
DISK["simd-r-drive.bin"]
end
subgraph Mapping["Memory Mapping"]
MMAP_CREATE["Mmap::map"]
MMAP_INSTANCE["Mmap Instance"]
ARC_MMAP["Arc Mmap\nShared Reference"]
end
subgraph Index["KeyIndexer"]
DASHMAP_STRUCT["DashMap key_hash -> packed_value"]
PACKED["Packed u64\n16-bit tag\n48-bit offset"]
end
subgraph Access["Zero-Copy Access"]
LOOKUP["read key"]
GET_OFFSET["Extract offset from packed_value"]
SLICE["Arc Mmap byte range"]
HANDLE["EntryHandle\nDirect payload reference"]
end
subgraph Threading["Multi-threaded Access"]
CLONE["Arc::clone"]
THREAD1["Thread 1 read"]
THREAD2["Thread 2 read"]
THREAD3["Thread N read"]
end
DISK --> MMAP_CREATE
MMAP_CREATE --> MMAP_INSTANCE
MMAP_INSTANCE --> ARC_MMAP
ARC_MMAP --> DASHMAP_STRUCT
DASHMAP_STRUCT --> PACKED
LOOKUP --> DASHMAP_STRUCT
DASHMAP_STRUCT --> GET_OFFSET
GET_OFFSET --> SLICE
ARC_MMAP --> SLICE
SLICE --> HANDLE
ARC_MMAP --> CLONE
CLONE --> THREAD1
CLONE --> THREAD2
CLONE --> THREAD3
THREAD1 --> HANDLE
THREAD2 --> HANDLE
THREAD3 --> HANDLE
```
The storage file is memory-mapped once and shared via Arc<Mmap> across threads. Read operations perform hash lookups in DashMap to get file offsets, then return EntryHandle instances that provide direct views into the mapped memory. Multiple threads can safely read concurrently without copying data.
Sources: README.md:43-49 README.md:173-175
Memory-Mapped File Management
The memmap2 crate provides the memory mapping functionality:
- Configured as a workspace dependency in Cargo.toml:102
- Used in the DataStore implementation
- Protected by Mutex<Arc<Mmap>> to prevent unsafe remapping during active reads
- Automatically remapped when the file grows beyond the current mapping size
EntryHandle Zero-Copy Interface
The EntryHandle type provides zero-copy access to stored payloads without allocating intermediate buffers:
| Method | Return Type | Copy Behavior | Use Case |
|---|---|---|---|
| payload() | &[u8] | Zero-copy reference | Direct access to full payload |
| payload_reader() | impl Read | Buffered reads | Streaming large payloads |
| as_arrow_buffer() | arrow::Buffer | Zero-copy view | Apache Arrow integration |
The handle maintains a reference to the memory-mapped region and calculates the payload range based on entry metadata.
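A hypothetical usage sketch of these accessors follows; payload() and payload_reader() come from the table above, while the DataStore::read call shape and return type are assumptions, not the verified API:

```rust
use std::io::Read;

// Hypothetical usage sketch; exact DataStore/EntryHandle signatures may differ.
fn inspect(store: &simd_r_drive::DataStore) -> std::io::Result<()> {
    if let Some(handle) = store.read(b"sensor/temperature") {
        // Zero-copy: a &[u8] view directly into the memory-mapped file.
        let bytes: &[u8] = handle.payload();
        println!("payload is {} bytes", bytes.len());

        // Streaming alternative for very large payloads.
        let mut reader = handle.payload_reader();
        let mut first_chunk = [0u8; 8192];
        let n = reader.read(&mut first_chunk)?;
        println!("read {n} bytes from the reader");
    }
    Ok(())
}
```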
Sources: README.md:228-233
Lock-Free Concurrent Read Operations
The storage engine enables multiple threads to perform concurrent reads without acquiring locks, using DashMap for the in-memory index and atomic operations for metadata tracking.
Lock-Free Read Architecture
```mermaid
graph TB
subgraph Writes["Write Path Synchronized"]
WRITE_LOCK["RwLock File write"]
APPEND["Append entry to file"]
UPDATE_INDEX["DashMap::insert"]
UPDATE_TAIL["AtomicU64::store tail_offset"]
end
subgraph Reads["Read Path Lock-Free"]
READ1["Thread 1 read"]
READ2["Thread 2 read"]
READN["Thread N read"]
LOOKUP1["DashMap::get"]
LOOKUP2["DashMap::get"]
LOOKUPN["DashMap::get"]
MMAP_ACCESS["Arc Mmap shared access"]
end
subgraph Synchronization["Concurrency Control"]
RWLOCK_STRUCT["RwLock File"]
ATOMIC_STRUCT["AtomicU64 tail_offset"]
DASHMAP_STRUCT["DashMap Index"]
ARC_STRUCT["Arc Mmap"]
end
WRITE_LOCK --> APPEND
APPEND --> UPDATE_INDEX
UPDATE_INDEX --> UPDATE_TAIL
READ1 --> LOOKUP1
READ2 --> LOOKUP2
READN --> LOOKUPN
LOOKUP1 --> MMAP_ACCESS
LOOKUP2 --> MMAP_ACCESS
LOOKUPN --> MMAP_ACCESS
RWLOCK_STRUCT -.controls.-> WRITE_LOCK
ATOMIC_STRUCT -.updated by.-> UPDATE_TAIL
DASHMAP_STRUCT -.provides.-> LOOKUP1
DASHMAP_STRUCT -.provides.-> LOOKUP2
DASHMAP_STRUCT -.provides.-> LOOKUPN
ARC_STRUCT -.enables.-> MMAP_ACCESS
```
Write operations acquire an RwLock to ensure sequential appends, but read operations access the DashMap index without locking. The DashMap data structure provides lock-free reads through internal sharding and fine-grained locking. The memory-mapped file is shared via Arc<Mmap>, allowing concurrent zero-copy access.
Sources: README.md:172-183 Cargo.toml:27
DashMap Index Characteristics
The DashMap dependency is configured in Cargo.toml:27 and provides these characteristics:
- Lock-free reads: Read operations don't block each other
- Sharded locking: Write operations only lock specific shards
- Concurrent inserts: Multiple threads can update different shards simultaneously
- Memory overhead: Approximately 64 bytes per entry for hash table overhead
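The access pattern can be illustrated with the dashmap crate directly; the sketch below is standalone and does not use the engine's internal types, though the packed key layout mirrors the 16-bit tag / 48-bit offset format described above:

```rust
use dashmap::DashMap;
use std::sync::Arc;
use std::thread;

fn main() {
    // key_hash -> packed value (16-bit tag | 48-bit offset).
    let index: Arc<DashMap<u64, u64>> = Arc::new(DashMap::new());
    index.insert(0xDEAD_BEEFu64, (7u64 << 48) | 4096);

    let readers: Vec<_> = (0..4)
        .map(|_| {
            let index = Arc::clone(&index);
            thread::spawn(move || {
                // Lookups only touch the shard holding the key, so
                // concurrent readers rarely contend with each other.
                if let Some(entry) = index.get(&0xDEAD_BEEFu64) {
                    let offset = *entry & ((1u64 << 48) - 1);
                    assert_eq!(offset, 4096);
                }
            })
        })
        .collect();

    for handle in readers {
        handle.join().unwrap();
    }
}
```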
Atomic Operations
The AtomicU64 used for tail_offset tracking provides:
- Ordering guarantees: SeqCst ordering ensures consistency across threads
- Lock-free updates: Writes update the tail without blocking reads
- Single-word operations: 64-bit atomic operations are efficient on modern CPUs
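A minimal sketch of this pattern with std's AtomicU64 follows; the names are illustrative, not the engine's actual fields:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative tail-offset tracking; the engine's actual layout may differ.
static TAIL_OFFSET: AtomicU64 = AtomicU64::new(0);

/// Writer path: publish the new tail after an entry has been appended.
fn publish_tail(new_tail: u64) {
    TAIL_OFFSET.store(new_tail, Ordering::SeqCst);
}

/// Reader path: observe the current tail without taking any lock.
fn read_tail() -> u64 {
    TAIL_OFFSET.load(Ordering::SeqCst)
}
```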
Sources: README.md:182-183
Operation Mode Performance Characteristics
SIMD R Drive provides multiple operation modes optimized for different workload patterns. Each mode has specific performance characteristics and resource usage profiles.
Write Operation Modes
| Mode | Method | Lock Duration | I/O Pattern | Flush Behavior | Best For |
|---|---|---|---|---|---|
| Single | write() | Per-entry | Sequential | Immediate | Low-latency single writes |
| Batch | batch_write() | Per-batch | Sequential | After batch | High-throughput bulk writes |
| Stream | write_stream() | Per-entry | Sequential | Immediate | Large entries (>1MB) |
Single Write (README.md:213-215):
- Acquires RwLock for each entry
- Flushes to disk immediately after write
- Suitable for applications requiring durability guarantees per operation
Batch Write (README.md:217-219):
- Acquires RwLock once for the entire batch
- Flushes to disk after all entries are written
- Reduces syscall overhead for bulk operations
- Can write thousands of entries in a single lock acquisition
Stream Write (README.md:221-223):
- Accepts an impl Read source for payload data
- Copies data in chunks to avoid full in-memory allocation
- Suitable for writing multi-megabyte or gigabyte-sized entries
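The three modes above map to call shapes roughly like the following hypothetical sketch; the method names come from the table, but the exact signatures and error types are assumptions:

```rust
use std::io::Cursor;

// Hypothetical call shapes for the three write modes; consult the real
// DataStore API docs for the actual signatures.
fn write_modes(store: &simd_r_drive::DataStore) -> std::io::Result<()> {
    // Single write: one lock acquisition per entry, flushed immediately.
    store.write(b"config/version", b"42")?;

    // Batch write: one lock acquisition and one flush for the whole batch.
    store.batch_write(&[
        (b"metrics/cpu".as_slice(), b"0.73".as_slice()),
        (b"metrics/mem".as_slice(), b"1.2e9".as_slice()),
    ])?;

    // Stream write: the payload is supplied through any `impl Read`.
    let large_payload = vec![0u8; 16 * 1024 * 1024];
    store.write_stream(b"blobs/snapshot", &mut Cursor::new(large_payload))?;

    Ok(())
}
```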
Sources: README.md:208-223
Read Operation Modes
| Mode | Method | Memory Behavior | Parallelism | Best For |
|---|---|---|---|---|
| Direct | read() | Zero-copy | Single-threaded | Small to medium entries |
| Stream | payload_reader() | Buffered | Single-threaded | Large entries (>10MB) |
| Parallel | par_iter_entries() | Zero-copy | Multi-threaded | Bulk processing entire dataset |
Direct Read (README.md:228-233):
- Returns EntryHandle with a direct memory-mapped payload reference
- Zero allocation for payload access
- Full entry must fit in the virtual address space
- Fastest for entries under 10MB
Stream Read (README.md:234-241):
- Reads the payload incrementally through the Read trait
- Uses an 8KB buffer internally
- Avoids memory pressure for large entries
- Not zero-copy, but memory-efficient
Parallel Iteration (README.md:242-247):
- Requires the parallel feature flag in Cargo.toml:52
- Uses Rayon for multi-threaded iteration
- Processes all valid entries across CPU cores
- Ideal for building in-memory caches or analytics workloads
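As a hypothetical sketch of parallel iteration with the parallel feature enabled; par_iter_entries comes from the table above, while the closure body and return type are assumptions:

```rust
use rayon::prelude::*;

// Illustrative only: sums payload sizes across all valid entries
// using Rayon worker threads.
fn total_payload_bytes(store: &simd_r_drive::DataStore) -> usize {
    store
        .par_iter_entries()
        .map(|entry| entry.payload().len())
        .sum()
}
```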
Sources: README.md:224-247 Cargo.toml:30 Cargo.toml:52
Benchmarking Infrastructure
SIMD R Drive uses Criterion.rs for statistical benchmarking of performance-critical operations. The benchmark suite validates the effectiveness of SIMD optimizations and concurrent access patterns.
Benchmarking Architecture
```mermaid
graph TB
subgraph Benchmark_Suite["Benchmark Suite"]
STORAGE_BENCH["storage_benchmark"]
CONTENTION_BENCH["contention_benchmark"]
end
subgraph Storage_Tests["Storage Benchmark Tests"]
WRITE_SINGLE["Single Write Throughput"]
WRITE_BATCH["Batch Write Throughput"]
READ_SEQ["Sequential Read Throughput"]
READ_RAND["Random Read Throughput"]
end
subgraph Contention_Tests["Contention Benchmark Tests"]
MULTI_THREAD["Multi-threaded Read Contention"]
PARALLEL_ITER["Parallel Iteration Performance"]
end
subgraph Criterion["Criterion.rs Framework"]
STATISTICAL["Statistical Analysis"]
COMPARISON["Baseline Comparison"]
PLOTS["Performance Plots"]
REPORTS["HTML Reports"]
end
STORAGE_BENCH --> WRITE_SINGLE
STORAGE_BENCH --> WRITE_BATCH
STORAGE_BENCH --> READ_SEQ
STORAGE_BENCH --> READ_RAND
CONTENTION_BENCH --> MULTI_THREAD
CONTENTION_BENCH --> PARALLEL_ITER
WRITE_SINGLE --> STATISTICAL
WRITE_BATCH --> STATISTICAL
READ_SEQ --> STATISTICAL
READ_RAND --> STATISTICAL
MULTI_THREAD --> STATISTICAL
PARALLEL_ITER --> STATISTICAL
STATISTICAL --> COMPARISON
COMPARISON --> PLOTS
PLOTS --> REPORTS
```
The benchmark suite consists of two main harnesses: storage_benchmark measures fundamental read/write throughput, while contention_benchmark measures concurrent access performance. Criterion.rs provides statistical analysis, baseline comparisons, and HTML reports for each benchmark run.
Sources: Cargo.toml:57-63 Cargo.toml:98
Benchmark Configuration
The benchmark harnesses are defined in Cargo.toml:57-63:
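The configuration takes roughly the following shape, reconstructed from the harness names above rather than copied verbatim from the repository:

```toml
[[bench]]
name = "storage_benchmark"
harness = false

[[bench]]
name = "contention_benchmark"
harness = false
```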
The harness = false setting disables Rust’s default benchmark harness, allowing Criterion.rs to provide its own test runner with statistical analysis capabilities.
Criterion.rs Integration
The Criterion.rs framework is configured as a development dependency in Cargo.toml:39 and provides:
- Statistical rigor: Multiple iterations with outlier detection
- Baseline comparison: Compare performance across code changes
- Regression detection: Automatically detect performance regressions
- Visualization: Generate performance plots and HTML reports
- Reproducibility: Consistent measurement methodology across environments
Benchmarks are executed with:
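```sh
cargo bench                            # run all Criterion harnesses
cargo bench --bench storage_benchmark  # or run a single harness
```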
The results are stored in target/criterion/ for historical comparison.
Sources: Cargo.toml:39 Cargo.toml:57-63
Performance Feature Summary
The following table summarizes the key performance features and their implementation locations:
| Feature | Implementation | Benefit | Configuration |
|---|---|---|---|
| AVX2 SIMD | simd_copy_x86 in src/storage_engine/simd_copy.rs:33-62 | 32-byte vectorized copies | Runtime feature detection |
| NEON SIMD | simd_copy_arm in src/storage_engine/simd_copy.rs:81-108 | 16-byte vectorized copies | Always enabled on aarch64 |
| 64-byte Alignment | PAYLOAD_ALIGNMENT constant | Cache-line efficiency | Build-time constant |
| Zero-Copy Reads | memmap2::Mmap | No deserialization overhead | Always enabled |
| Lock-Free Reads | DashMap in Cargo.toml:27 | Concurrent read scaling | Always enabled |
| Parallel Iteration | Rayon in Cargo.toml:30 | Multi-core bulk processing | parallel feature flag |
| Hardware Hashing | xxhash-rust in Cargo.toml:34 | SIMD-accelerated indexing | Always enabled |
For detailed information on each feature, see the corresponding child sections 5.1, 5.2, 5.3, and 5.4.
Sources: README.md:5-7 README.md:249-256 Cargo.toml:27 Cargo.toml:30 Cargo.toml:34