This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
SIMD Acceleration
Relevant source files
- README.md
- benches/storage_benchmark.rs
- src/main.rs
- src/storage_engine/simd_copy.rs
- src/utils/format_bytes.rs
- tests/concurrency_tests.rs
Purpose and Scope
This document details the SIMD (Single Instruction, Multiple Data) acceleration features implemented in simd-r-drive. SIMD optimizations are applied to two critical performance paths: memory copying during write operations and key hashing for indexing. These optimizations leverage hardware-specific vector instructions (AVX2, NEON) to reduce CPU overhead and improve throughput in write-heavy workloads.
For information about the broader performance context including alignment strategies, see Payload Alignment and Cache Efficiency. For details on how these optimizations integrate with different write patterns, see Write and Read Modes.
Sources: README.md:248-257
SIMD Operations Overview
The system employs SIMD acceleration in two distinct areas:
| Operation | Purpose | Instruction Sets | Primary Use Case |
|---|---|---|---|
| Memory Copy (`simd_copy`) | Vectorized data transfer to write buffers | AVX2 (x86_64), NEON (aarch64) | Write operations, payload staging |
| Key Hashing (`xxh3_64`) | Hardware-accelerated hash computation | SSE2, AVX2, NEON | Key indexing, collision detection |
Key Distinction: SIMD is not used for read operations. Zero-copy reads access memory-mapped data directly without transformations, making SIMD unnecessary. The acceleration targets write-path bottlenecks where data must be moved through buffers before disk flush.
Sources: README.md:248-257
Architecture and Implementation Flow
Architecture Decision Flow
The `simd_copy` function performs runtime detection to select the optimal implementation. On x86_64, it checks for AVX2 support using `std::is_x86_feature_detected!("avx2")`. On aarch64, NEON is assumed to be available, since the ARM64 specification mandates it. Unsupported architectures fall back to standard Rust slice operations.
Sources: src/storage_engine/simd_copy.rs:110-138
x86_64 AVX2 Implementation
Memory Copy Mechanism
AVX2 Implementation Details
The `simd_copy_x86` function processes memory in 32-byte chunks using AVX2 vector instructions:
Processing Strategy:
- Computes `chunks = len / 32` to determine the number of full vector iterations
- Uses `_mm256_loadu_si256` for unaligned 32-byte loads from the source
- Uses `_mm256_storeu_si256` for unaligned 32-byte stores to the destination
- Processes the remainder bytes (< 32) with a scalar `copy_from_slice`
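The following is a hedged sketch of that loop, assuming a slice-based signature; the actual `simd_copy_x86` in src/storage_engine/simd_copy.rs may take raw pointers and differ in detail:

```rust
// Illustrative sketch only: 32-byte AVX2 chunks with a scalar tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn simd_copy_x86(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

    let len = src.len().min(dst.len());
    let chunks = len / 32; // number of full 32-byte vector iterations

    for i in 0..chunks {
        let offset = i * 32;
        // Unaligned load/store variants tolerate arbitrary slice alignment.
        let v = _mm256_loadu_si256(src.as_ptr().add(offset) as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr().add(offset) as *mut __m256i, v);
    }

    // Remainder (< 32 bytes) falls back to a scalar copy.
    let tail = chunks * 32;
    dst[tail..len].copy_from_slice(&src[tail..len]);
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let src = vec![7u8; 100];
    let mut dst = vec![0u8; 100];
    if std::is_x86_feature_detected!("avx2") {
        // SAFETY: both slices hold 100 valid bytes and AVX2 was just detected.
        unsafe { simd_copy_x86(&mut dst, &src) };
    } else {
        dst.copy_from_slice(&src);
    }
    assert_eq!(dst, src);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```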
Safety Considerations:
- Marked `unsafe` and requires `#[target_feature(enable = "avx2")]`
- The caller must ensure both `src` and `dst` have at least `len` valid bytes
- Pointer arithmetic is bounds-checked via the chunk calculation
- Unaligned loads/stores handle arbitrary memory alignment
Performance Characteristics:
- AVX2 processes 256 bits (32 bytes) per instruction
- Typical throughput: 4-8x faster than scalar copy on supported CPUs
- Fallback warning logged once if AVX2 unavailable (e.g., virtualized environments)
Sources: src/storage_engine/simd_copy.rs:16-62 src/storage_engine/simd_copy.rs:112-125
AArch64 NEON Implementation
ARM Vector Operations
NEON Implementation Details
The `simd_copy_arm` function implements the copy using ARM Advanced SIMD (NEON):
Processing Strategy:
- Processes 16-byte chunks (NEON vector width: 128 bits)
- Uses `vld1q_u8` for 16-byte vector loads
- Uses `vst1q_u8` for 16-byte vector stores
- Handles the remainder with a scalar copy
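A hedged sketch of the same loop shape for NEON, again assuming a slice-based signature rather than the crate's exact one:

```rust
// Illustrative sketch only: 16-byte NEON chunks with a scalar tail.
#[cfg(target_arch = "aarch64")]
unsafe fn simd_copy_arm(dst: &mut [u8], src: &[u8]) {
    use std::arch::aarch64::{vld1q_u8, vst1q_u8};

    let len = src.len().min(dst.len());
    let chunks = len / 16; // NEON vectors are 128 bits (16 bytes)

    for i in 0..chunks {
        let offset = i * 16;
        let v = vld1q_u8(src.as_ptr().add(offset));
        vst1q_u8(dst.as_mut_ptr().add(offset), v);
    }

    // Remainder (< 16 bytes) is copied with the scalar path.
    let tail = chunks * 16;
    dst[tail..len].copy_from_slice(&src[tail..len]);
}

#[cfg(target_arch = "aarch64")]
fn main() {
    let src = vec![7u8; 50];
    let mut dst = vec![0u8; 50];
    // SAFETY: both slices hold 50 valid bytes; NEON is guaranteed on aarch64,
    // so no runtime feature check is needed.
    unsafe { simd_copy_arm(&mut dst, &src) };
    assert_eq!(dst, src);
}

#[cfg(not(target_arch = "aarch64"))]
fn main() {}
```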
Architecture Notes:
- NEON is mandatory on ARM64 (aarch64), no runtime detection needed
- No fallback warning on ARM; NEON is guaranteed available
- Smaller vector width (16 vs 32 bytes) but consistent across ARM64 devices
Performance Characteristics:
- Typical speedup: 2-4x over scalar copy
- Particularly beneficial on Apple Silicon, AWS Graviton, and other ARM server platforms
Sources: src/storage_engine/simd_copy.rs:64-108 src/storage_engine/simd_copy.rs:127-133
Hardware-Accelerated Hashing (XXH3)
Hashing Infrastructure
XXH3 Acceleration Tiers
The XXH3 hashing algorithm automatically selects hardware acceleration based on available instruction sets:
| Platform | Default Acceleration | Enhanced Acceleration | Availability |
|---|---|---|---|
| x86_64 | SSE2 (universal) | AVX2 (opt-in) | SSE2 guaranteed on all x86_64 |
| aarch64 | NEON (universal) | N/A | NEON guaranteed on all ARM64 |
| Other | Scalar fallback | N/A | Portable implementation |
Integration Points:
- Write Path: Keys are hashed during `write()`, `batch_write()`, and `write_stream()` operations
- Read Path: Keys are hashed during `read()` and `batch_read()` lookups
- Indexing: Hash values populate the `KeyIndexer` HashMap for O(1) lookups
- Collision Detection: The high-order 16 bits serve as a tag for collision verification (see Key Indexing and Hashing)
Performance Impact:
Benchmarking shows XXH3 with hardware acceleration can compute 1 million random key hashes in well under 1 second, enabling high-throughput indexing operations without becoming a bottleneck.
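A rough way to reproduce that kind of measurement, under the same crate assumption as the previous sketch:

```rust
use std::time::Instant;
use xxhash_rust::xxh3::xxh3_64;

// Hash one million 16-byte keys and print the elapsed time. The key shape
// is an arbitrary illustration, not the benchmark's actual key set.
fn main() {
    let keys: Vec<[u8; 16]> = (0u64..1_000_000)
        .map(|i| {
            let mut k = [0u8; 16];
            k[..8].copy_from_slice(&i.to_le_bytes());
            k[8..].copy_from_slice(&i.rotate_left(17).to_le_bytes());
            k
        })
        .collect();

    let start = Instant::now();
    // XOR-fold the hashes so the loop cannot be optimized away.
    let digest = keys.iter().fold(0u64, |acc, k| acc ^ xxh3_64(k));
    println!("hashed {} keys in {:?} (digest {digest:#x})", keys.len(), start.elapsed());
}
```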
Sources: README.md:158-167
Write Path Integration
SIMD in Write Operations
Write Operation Flow with SIMD
The `simd_copy` function is invoked during all write operations to transfer payload data into the `BufWriter<File>` buffer:
- Key Hashing: XXH3 with SIMD acceleration computes the key hash
- Payload Staging: `simd_copy` transfers value bytes to the write buffer
- Metadata Appending: Non-SIMD operations append metadata (hash, offset, checksum)
- Flush: Buffer contents are written to disk in a single system call
- Index Update: The hash is stored in the `KeyIndexer` for subsequent lookups
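A hedged, self-contained sketch of these five steps; the function name, metadata layout, and helper calls are illustrative placeholders rather than the engine's real API, and a plain `copy_from_slice` stands in for `simd_copy`:

```rust
use std::io::{BufWriter, Write};

fn append_entry<W: Write>(
    writer: &mut BufWriter<W>,
    key: &[u8],
    payload: &[u8],
    prev_offset: u64,
) -> std::io::Result<u64> {
    // 1. Key hashing (XXH3, hardware-accelerated in practice; the
    //    xxhash-rust crate is an assumption).
    let key_hash = xxhash_rust::xxh3::xxh3_64(key);

    // 2. Payload staging: the real engine vectorizes this copy with simd_copy.
    let mut staged = vec![0u8; payload.len()];
    staged.copy_from_slice(payload);
    writer.write_all(&staged)?;

    // 3. Metadata appended with scalar writes (hash, offset, checksum).
    let checksum: u32 = 0; // placeholder; the engine computes CRC32C here
    writer.write_all(&key_hash.to_le_bytes())?;
    writer.write_all(&prev_offset.to_le_bytes())?;
    writer.write_all(&checksum.to_le_bytes())?;

    // 4. Flush buffered bytes to the underlying file in one write.
    writer.flush()?;

    // 5. The caller records `key_hash` in the KeyIndexer for later lookups.
    Ok(key_hash)
}

fn main() -> std::io::Result<()> {
    let mut writer = BufWriter::new(Vec::new()); // in-memory stand-in for a File
    let hash = append_entry(&mut writer, b"key", b"value", 0)?;
    println!("indexed key hash: {hash:#x}");
    Ok(())
}
```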
Why SIMD for Writes Only:
- Writes: Data must traverse multiple memory regions (app → buffer → disk), benefiting from vectorized transfer
- Reads: Memory-mapped access provides direct pointer to disk pages, no intermediate copying required
Sources: README.md:248-257 src/storage_engine/simd_copy.rs:1-139
Performance Characteristics
Benchmark Results
The storage benchmark demonstrates SIMD acceleration impact across write-heavy workloads:
Test Configuration:
- Entry Count: 1,000,000 entries
- Entry Size: 8 bytes payload
- Write Batch Size: 1,024 entries per batch
- Platform: Multi-architecture (x86_64 with AVX2, aarch64 with NEON)
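To make the throughput figures below concrete, here is a hedged sketch of how a writes/s number can be derived from this configuration; an in-memory sink stands in for the real engine, so the absolute result says nothing about simd-r-drive itself:

```rust
use std::io::{BufWriter, Write};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    const ENTRIES: usize = 1_000_000; // entry count from the configuration above
    const BATCH: usize = 1_024;       // entries per write batch

    // 8-byte payloads, as in the benchmark configuration.
    let payloads: Vec<[u8; 8]> = (0..ENTRIES as u64).map(|i| i.to_le_bytes()).collect();
    let mut writer = BufWriter::new(Vec::with_capacity(ENTRIES * 8));

    let start = Instant::now();
    for batch in payloads.chunks(BATCH) {
        for p in batch {
            writer.write_all(p)?;
        }
        writer.flush()?; // one flush per batch, mirroring batched writes
    }
    let secs = start.elapsed().as_secs_f64();
    println!("{:.0} writes/s", ENTRIES as f64 / secs);
    Ok(())
}
```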
Typical Performance Metrics:
| Operation | Throughput | Notes |
|---|---|---|
| Batched Writes | 100,000+ writes/s | SIMD copy + XXH3 acceleration |
| Sequential Reads | 1,000,000+ reads/s | Zero-copy, no SIMD needed |
| Random Reads | 500,000+ reads/s | XXH3 hash lookup acceleration |
| Batch Reads | 700,000+ reads/s | Vectorized hash lookups |
SIMD Contribution Analysis:
- Write Throughput: SIMD copy reduces memory transfer overhead by 4-8x on AVX2 systems
- Index Performance: XXH3 SIMD hashing maintains O(1) lookup performance even at high scales
- Memory Efficiency: Vectorized operations reduce cache pressure by minimizing instruction count
Fallback Impact:
On systems without SIMD support (virtualized x86_64, non-ARM/x86 architectures), performance degrades gracefully to scalar implementations. A warning is logged once on x86_64 systems lacking AVX2.
Sources: benches/storage_benchmark.rs:1-234 README.md:166-167
Runtime Feature Detection
Detection Implementation
Detection Strategy
The system uses a tiered approach to SIMD feature selection:
Compile-Time Decisions:
- Target architecture determined by the `rustc` build target (e.g., `x86_64-unknown-linux-gnu`, `aarch64-apple-darwin`)
- Conditional compilation blocks via `#[cfg(target_arch = "x86_64")]` and `#[cfg(target_arch = "aarch64")]`
Runtime Decisions (x86_64 only):
- AVX2 availability checked with `std::is_x86_feature_detected!("avx2")`
- Detection performed on every `simd_copy` call (minimal overhead, inlined)
- Warning logged once via `std::sync::Once` if AVX2 is unavailable
ARM64 Assumption:
- NEON guaranteed on all ARM64 targets, no runtime check needed
- ARM specification mandates Advanced SIMD support
Fallback Behavior:
- Graceful degradation to `copy_from_slice` on unsupported platforms
- No panics or errors; functionality is preserved with reduced performance
Sources: src/storage_engine/simd_copy.rs:4-8 src/storage_engine/simd_copy.rs:110-138
Limitations and Scope
What SIMD Does Not Accelerate
Read Operations:
Zero-copy memory-mapped reads access data directly via pointers into mmap pages. No intermediate copying occurs, making SIMD memory transfer unnecessary. The system optimizes reads through:
- Direct pointer arithmetic into mapped regions
- Cacheline-aligned payloads (64-byte boundaries, see Payload Alignment and Cache Efficiency)
- Memory prefetching handled by OS page fault mechanisms
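For contrast with the write path, a minimal sketch of a zero-copy read, assuming the `memmap2` crate and a hypothetical fixed entry location (real offsets come from the KeyIndexer):

```rust
use std::fs::File;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("storage.bin")?;
    // SAFETY: the file must not be truncated while the mapping is alive.
    let mmap = unsafe { Mmap::map(&file)? };

    let (offset, len) = (0usize, 64usize); // hypothetical entry location
    let end = (offset + len).min(mmap.len());

    // The slice borrows the mapped pages directly: no staging buffer and no
    // simd_copy are involved on this path.
    let payload: &[u8] = &mmap[offset..end];
    println!("read {} bytes without copying: first byte = {:?}", payload.len(), payload.first());
    Ok(())
}
```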
CRC32 Checksums:
The CRC32C checksum calculation in entry validation is not SIMD-accelerated in the current implementation. This operation uses standard library CRC computation.
Metadata Processing:
Metadata writes (20 bytes: key_hash, prev_offset, checksum) use scalar operations. The small, fixed size provides no benefit from vectorization.
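A hedged sketch of such a 20-byte record; the field names, order, and little-endian layout are assumptions for illustration, not the engine's documented format:

```rust
/// Illustrative 20-byte metadata record (8 + 8 + 4 bytes).
struct EntryMetadata {
    key_hash: u64,    // XXH3 hash of the key
    prev_offset: u64, // offset of the previous entry in the chain
    checksum: u32,    // CRC32C of the payload
}

impl EntryMetadata {
    /// Serialize with plain scalar writes; 20 fixed bytes gain nothing
    /// from vectorization.
    fn to_bytes(&self) -> [u8; 20] {
        let mut buf = [0u8; 20];
        buf[0..8].copy_from_slice(&self.key_hash.to_le_bytes());
        buf[8..16].copy_from_slice(&self.prev_offset.to_le_bytes());
        buf[16..20].copy_from_slice(&self.checksum.to_le_bytes());
        buf
    }
}

fn main() {
    let meta = EntryMetadata { key_hash: 0xABCD, prev_offset: 4096, checksum: 0xDEAD_BEEF };
    assert_eq!(meta.to_bytes().len(), 20);
}
```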
Platform Limitations:
| Platform | Limitation | Workaround |
|---|---|---|
| Windows ARM64 (emulated) | AVX2 detection may fail | Automatic fallback to scalar |
| RISC-V, WASM, other | No SIMD implementation | Scalar copy_from_slice |
| x86 (32-bit) | No AVX2 support | Not a target architecture |
Sources: README.md:256-257 src/storage_engine/simd_copy.rs:119-124
Summary
SIMD acceleration in simd-r-drive targets write-path bottlenecks through two mechanisms:
- `simd_copy`: Vectorized memory transfer (AVX2/NEON) for staging write buffers
- XXH3 hashing: Hardware-accelerated key hashing (SSE2/AVX2/NEON) for indexing
These optimizations complement the zero-copy read architecture, where direct memory-mapped access eliminates the need for data transformation. The combination enables high-throughput writes while maintaining sub-microsecond read latencies at scale.
For details on how payload alignment enhances these optimizations, see Payload Alignment and Cache Efficiency. For broader context on write patterns that leverage SIMD acceleration, see Write and Read Modes.
Sources: README.md:248-257 src/storage_engine/simd_copy.rs:1-139