SIMD Acceleration
Relevant source files
- README.md
- benches/storage_benchmark.rs
- experiments/bindings/python_(old_client)/pyproject.toml
- src/main.rs
- src/utils/format_bytes.rs
- tests/concurrency_tests.rs
Purpose and Scope
This page documents the SIMD (Single Instruction, Multiple Data) optimizations implemented in SIMD R Drive. The storage engine uses SIMD acceleration in two critical areas: write operations (via the simd_copy function) and key hashing (via the xxh3_64 algorithm).
SIMD is not used for read operations, which rely on zero-copy memory-mapped access for optimal performance. For information about zero-copy reads and memory management, see Memory Management and Zero-Copy Access. For details on the 64-byte alignment strategy that enables efficient SIMD operations, see Payload Alignment and Cache Efficiency.
Sources: README.md:248-256
Overview
SIMD R Drive leverages hardware SIMD capabilities to accelerate performance-critical operations during write workflows. The storage engine uses platform-specific SIMD instructions to minimize CPU cycles spent on memory operations, resulting in improved write throughput and indexing performance.
The SIMD acceleration strategy consists of two components:
| Component | Purpose | Platforms | Performance Impact |
|---|---|---|---|
| simd_copy Function | Accelerates memory copying during write buffer staging | x86_64 (AVX2), aarch64 (NEON) | Reduces write latency by vectorizing memory transfers |
| xxh3_64 Hashing | Hardware-accelerated key hashing for index operations | x86_64 (SSE2/AVX2), aarch64 (NEON) | Improves key lookup and index building performance |
The system uses runtime CPU feature detection to select the optimal implementation path and automatically falls back to scalar operations when SIMD instructions are unavailable.
Sources: README.md:248-256 High-Level Diagram 6
SIMD Copy Function
Purpose and Role in Write Operations
The simd_copy function is a specialized memory transfer routine used during write operations. When the storage engine writes data to disk, it first stages the payload in an internal buffer. The simd_copy function accelerates this staging process by using vectorized memory operations instead of standard byte-wise copying.
Diagram: SIMD Copy Function Flow in Write Operations
```mermaid
graph TB
WriteAPI["DataStoreWriter::write()"]
Buffer["Internal Write Buffer\n(BufWriter<File>)"]
SIMDCopy["simd_copy Function"]
CPUDetect["CPU Feature Detection\n(Runtime)"]
AVX2Path["AVX2 Implementation\n_mm256_loadu_si256\n_mm256_storeu_si256\n(32-byte chunks)"]
NEONPath["NEON Implementation\nvld1q_u8\nvst1q_u8\n(16-byte chunks)"]
ScalarPath["Scalar Fallback\ncopy_from_slice\n(Standard memcpy)"]
DiskWrite["Write to Disk\n(Buffered I/O)"]
WriteAPI -->|user payload| SIMDCopy
SIMDCopy --> CPUDetect
CPUDetect -->|x86_64 + AVX2| AVX2Path
CPUDetect -->|aarch64| NEONPath
CPUDetect -->|no SIMD| ScalarPath
AVX2Path --> Buffer
NEONPath --> Buffer
ScalarPath --> Buffer
Buffer --> DiskWrite
```
The simd_copy function operates transparently within the write path. Applications using the DataStoreWriter trait do not need to explicitly invoke SIMD operations—the acceleration is automatic and selected based on the host CPU capabilities.
Sources: README.md:252-253 High-Level Diagram 6
Platform-Specific Implementations
x86_64 Architecture: AVX2 Instructions
On x86_64 platforms with AVX2 support, the simd_copy function uses 256-bit Advanced Vector Extensions to process memory in 32-byte chunks. The implementation employs two primary intrinsics:
| Intrinsic | Operation | Description |
|---|---|---|
| _mm256_loadu_si256 | Load | Reads 32 bytes from source memory into a 256-bit register (unaligned load) |
| _mm256_storeu_si256 | Store | Writes 32 bytes from a 256-bit register to destination memory (unaligned store) |
```mermaid
graph LR
SrcMem["Source Memory\n(Payload Data)"]
AVX2Reg["256-bit AVX2 Register\n(__m256i)"]
DstMem["Destination Buffer\n(Write Buffer)"]
SrcMem -->|_mm256_loadu_si256 32 bytes| AVX2Reg
AVX2Reg -->|_mm256_storeu_si256 32 bytes| DstMem
```
The AVX2 path is activated when the CPU supports the avx2 feature flag. Runtime detection ensures that the instruction set is available before executing AVX2 code paths.
Diagram: AVX2 Memory Transfer Operation
Sources: README.md:252-253 High-Level Diagram 6
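Put together, the intrinsics above suggest a copy loop of the following shape. This is a hedged sketch, not the crate's actual source: the `avx2_copy` name and the dispatch wrapper are illustrative, and the real `simd_copy` may handle tails and dispatch differently.

```rust
// Sketch of an AVX2-accelerated copy with a scalar fallback.
// `avx2_copy` is a hypothetical name; only the intrinsics come from the text.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn avx2_copy(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};
    let mut i = 0;
    // Stream 32 bytes per iteration through a 256-bit register (unaligned).
    while i + 32 <= src.len() {
        let v = _mm256_loadu_si256(src.as_ptr().add(i) as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, v);
        i += 32;
    }
    // Copy any tail shorter than one 32-byte chunk.
    dst[i..].copy_from_slice(&src[i..]);
}

fn simd_copy(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was just verified at runtime.
            unsafe { avx2_copy(dst, src) };
            return;
        }
    }
    // Scalar fallback (also taken on non-x86_64 targets).
    dst.copy_from_slice(src);
}
```

The dispatch wrapper keeps all unsafety internal, so callers see an ordinary safe function regardless of which path executes.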
aarch64 Architecture: NEON Instructions
On aarch64 (ARM64) platforms, the simd_copy function uses NEON (Advanced SIMD) instructions to process memory in 16-byte chunks. The implementation uses:
| Intrinsic | Operation | Description |
|---|---|---|
| vld1q_u8 | Load | Reads 16 bytes from source memory into a 128-bit NEON register |
| vst1q_u8 | Store | Writes 16 bytes from a 128-bit NEON register to destination memory |
```mermaid
graph LR
SrcMem["Source Memory\n(Payload Data)"]
NEONReg["128-bit NEON Register\n(uint8x16_t)"]
DstMem["Destination Buffer\n(Write Buffer)"]
SrcMem -->|vld1q_u8 16 bytes| NEONReg
NEONReg -->|vst1q_u8 16 bytes| DstMem
```
NEON is available by default on all aarch64 targets, so no runtime feature detection is required for ARM64 platforms.
Diagram: NEON Memory Transfer Operation
Sources: README.md:252-253 High-Level Diagram 6
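By analogy with the AVX2 path, the NEON loop might look as follows. Again a sketch rather than the crate's source: `neon_copy` is a hypothetical name, and the non-aarch64 fallback branch exists only so the example compiles and runs on every target.

```rust
// Sketch of a NEON-accelerated copy, compiled only on aarch64.
// NEON is mandatory on aarch64, so no runtime feature check is needed.
#[cfg(target_arch = "aarch64")]
unsafe fn neon_copy(dst: &mut [u8], src: &[u8]) {
    use std::arch::aarch64::{vld1q_u8, vst1q_u8};
    let mut i = 0;
    // Stream 16 bytes per iteration through a 128-bit register.
    while i + 16 <= src.len() {
        let v = vld1q_u8(src.as_ptr().add(i));
        vst1q_u8(dst.as_mut_ptr().add(i), v);
        i += 16;
    }
    // Copy any tail shorter than one 16-byte chunk.
    dst[i..].copy_from_slice(&src[i..]);
}

fn simd_copy(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    #[cfg(target_arch = "aarch64")]
    unsafe {
        // SAFETY: NEON is always available on aarch64 targets.
        neon_copy(dst, src)
    }
    #[cfg(not(target_arch = "aarch64"))]
    dst.copy_from_slice(src);
}
```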
Scalar Fallback Implementation
When SIMD instructions are unavailable (e.g., on older CPUs or unsupported architectures), the simd_copy function automatically falls back to the standard Rust copy_from_slice method. This ensures portability across all platforms while still benefiting from compiler optimizations (which may include auto-vectorization in some cases).
The fallback path provides correct functionality on all platforms but does not achieve the same throughput as explicit SIMD implementations.
Sources: README.md:252-253 High-Level Diagram 6
Runtime CPU Feature Detection
Detection Mechanism
The storage engine uses conditional compilation and runtime feature detection to select the appropriate SIMD implementation:
Diagram: SIMD Path Selection Logic
```mermaid
graph TB
Start["Program Execution Start"]
CompileTarget["Compile-Time Target Check"]
X86Check{"Target:\nx86_64?"}
RuntimeAVX2{"Runtime:\nAVX2 available?"}
ARMCheck{"Target:\naarch64?"}
UseAVX2["Use AVX2 Path\n(32-byte chunks)"]
UseNEON["Use NEON Path\n(16-byte chunks)"]
UseScalar["Use Scalar Fallback\n(copy_from_slice)"]
Start --> CompileTarget
CompileTarget --> X86Check
X86Check -->|Yes| RuntimeAVX2
RuntimeAVX2 -->|Yes| UseAVX2
RuntimeAVX2 -->|No| UseScalar
X86Check -->|No| ARMCheck
ARMCheck -->|Yes| UseNEON
ARMCheck -->|No| UseScalar
```
Feature Detection Strategy
| Platform | Detection Method | Feature Flag | Fallback Condition |
|---|---|---|---|
| x86_64 | Runtime is_x86_feature_detected!("avx2") | avx2 | AVX2 not supported → Scalar |
| aarch64 | Compile-time (default available) | N/A | N/A (NEON always present) |
| Other | Compile-time target check | N/A | Always use scalar |
The runtime detection overhead is negligible—the feature check is typically performed once during the first simd_copy invocation and the result is cached for subsequent calls.
Sources: README.md:252-256 High-Level Diagram 6
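A once-and-cache detection helper can be sketched as below. Note that on x86_64 the standard library's is_x86_feature_detected! macro already caches its result internally, so the explicit OnceLock here is purely illustrative of the pattern; the function name is ours, not the crate's.

```rust
use std::sync::OnceLock;

// Illustrative sketch: evaluate the CPU feature check once and cache it.
fn avx2_available() -> bool {
    static AVX2: OnceLock<bool> = OnceLock::new();
    *AVX2.get_or_init(detect)
}

#[cfg(target_arch = "x86_64")]
fn detect() -> bool {
    // Runtime check: the binary may run on CPUs older than the build host.
    is_x86_feature_detected!("avx2")
}

#[cfg(not(target_arch = "x86_64"))]
fn detect() -> bool {
    // Non-x86_64 targets select their path at compile time instead.
    false
}
```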
SIMD Acceleration in Key Hashing
XXH3_64 Algorithm with Hardware Acceleration
In addition to write buffer acceleration, SIMD R Drive uses the xxh3_64 hashing algorithm for key indexing. This algorithm is specifically designed to leverage SIMD instructions for high-speed hash computation.
The hashing subsystem uses hardware acceleration on supported platforms:
| Platform | SIMD Extension | Availability | Performance Benefit |
|---|---|---|---|
| x86_64 | SSE2 | Universal (all 64-bit x86) | Baseline SIMD acceleration |
| x86_64 | AVX2 | Supported on modern CPUs | Additional performance gains |
| aarch64 | NEON | Default on all ARM64 | SIMD acceleration |
Diagram: SIMD-Accelerated Key Hashing Pipeline
The xxh3_64 algorithm is used in multiple operations:
- Key hash computation during write operations
- Index lookups during read operations
- Index building during storage recovery
By using SIMD-accelerated hashing, the storage engine minimizes the CPU overhead of hash computation, which is critical for maintaining high throughput in index-heavy workloads.
Sources: README.md:158-166 README.md:254-255 High-Level Diagram 6
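The index pattern these operations rely on, hashing a key to a 64-bit value that maps to a file offset, can be sketched as follows. The real engine uses xxh3_64 (the document suggests the xxhash_rust crate); std's DefaultHasher stands in here so the example is dependency-free, and the helper names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for xxh3_64: map a raw key to a 64-bit hash.
fn key_hash(key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

// The in-memory index maps a key hash to the entry's byte offset in the file.
fn index_entry(index: &mut HashMap<u64, u64>, key: &[u8], offset: u64) {
    index.insert(key_hash(key), offset);
}
```

During recovery, the same hashing runs over every key found in the forward scan, which is why a SIMD-accelerated hash pays off in index-heavy workloads.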
Performance Characteristics
Write Throughput Improvements
The SIMD-accelerated write path provides measurable performance improvements over scalar memory operations. The performance gain depends on several factors:
| Factor | Impact on Performance |
|---|---|
| Payload size | Larger payloads benefit more from vectorized copying |
| CPU architecture | AVX2 (32-byte) typically faster than NEON (16-byte) |
| Memory alignment | 64-byte aligned payloads maximize cache efficiency |
| Buffer size | Larger write buffers reduce function call overhead |
While exact performance gains vary by workload and hardware, typical scenarios show:
- AVX2 implementations: 2-4× throughput improvement over scalar copy for medium-to-large payloads
- NEON implementations: 1.5-3× throughput improvement over scalar copy
- Hash computation: XXH3 with SIMD is significantly faster than non-accelerated hash functions
Interaction with 64-Byte Alignment
The storage engine's 64-byte payload alignment strategy (see Payload Alignment and Cache Efficiency) synergizes with SIMD operations:
Diagram: Alignment and SIMD Interaction
The 64-byte alignment ensures that:
- Cache line alignment: Payloads start on cache line boundaries, avoiding split-cache-line access penalties
- SIMD-friendly boundaries: Both AVX2 (32-byte) and NEON (16-byte) operations can operate at full speed without crossing alignment boundaries
- Zero-copy efficiency: Memory-mapped reads benefit from predictable alignment (see Memory Management and Zero-Copy Access)
Sources: README.md:51-59 README.md:248-256 High-Level Diagram 6
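Rounding a file offset up to the next 64-byte boundary is simple bit arithmetic; the sketch below illustrates the alignment rule described above (the helper name is ours, and the real engine may compute padding differently).

```rust
// Payloads are aligned to 64-byte (cache line) boundaries.
const ALIGNMENT: u64 = 64;

// Round `offset` up to the next multiple of ALIGNMENT.
// Works because ALIGNMENT is a power of two: adding 63 then
// masking off the low 6 bits lands on the next boundary.
fn align_up(offset: u64) -> u64 {
    (offset + (ALIGNMENT - 1)) & !(ALIGNMENT - 1)
}
```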
When SIMD Is (and Isn't) Used
Operations Using SIMD Acceleration
| Operation | SIMD Usage | Implementation |
|---|---|---|
| Write operations (write, batch_write, write_stream) | ✅ Yes | simd_copy function transfers payload to write buffer |
| Key hashing (all operations requiring key lookup) | ✅ Yes | xxh3_64 with SSE2/AVX2/NEON acceleration |
| Index building (during storage recovery) | ✅ Yes | xxh3_64 hashing during forward scan |
Operations NOT Using SIMD
| Operation | SIMD Usage | Reason |
|---|---|---|
| Read operations (read, batch_read, read_stream) | ❌ No | Zero-copy memory-mapped access—data is accessed directly from mmap without copying |
| Iteration (into_iter, par_iter_entries) | ❌ No | Iterators return references to memory-mapped regions—no data transfer occurs |
| Entry validation (CRC32C checksum) | ❌ No | Uses standard CRC32C implementation (hardware-accelerated on supporting CPUs via separate mechanism) |
Diagram: SIMD Usage in Write vs. Read Paths
Why Reads Don't Need SIMD
Read operations in SIMD R Drive use memory-mapped I/O (mmap), which provides direct access to the storage file's contents without copying data into buffers. Since no memory transfer occurs, there is no opportunity to apply SIMD acceleration.
The zero-copy read strategy is fundamentally faster than any SIMD-accelerated copy operation because it eliminates the copy entirely. The memory-mapped approach allows the operating system's virtual memory subsystem to handle data access, often utilizing hardware-level optimizations like demand paging and read-ahead caching.
For more details on the zero-copy read architecture, see Memory Management and Zero-Copy Access.
Sources: README.md:43-49 README.md:254-256 High-Level Diagram 2
Code Entity Reference
The following table maps the conceptual SIMD components discussed in this document to their likely implementation locations within the codebase:
| Concept | Code Entity | Expected Location |
|---|---|---|
| SIMD copy function | simd_copy() | Core storage engine or utilities |
| AVX2 implementation | Architecture-specific module with _mm256_* intrinsics | Platform-specific code (likely #[cfg(target_arch = "x86_64")]) |
| NEON implementation | Architecture-specific module with vld1q_* / vst1q_* intrinsics | Platform-specific code (likely #[cfg(target_arch = "aarch64")]) |
| Runtime detection | is_x86_feature_detected!("avx2") macro | AVX2 code path guard |
| XXH3 hashing | xxh3_64 function or xxhash_rust crate usage | Key hashing module |
| Write buffer staging | BufWriter<File> with simd_copy integration | DataStore write implementation |
Sources: README.md:248-256 High-Level Diagram 6
Summary
SIMD acceleration in SIMD R Drive focuses on write-path optimization and key hashing performance. The dual-platform implementation (AVX2 for x86_64, NEON for aarch64) with automatic runtime detection ensures optimal performance across diverse hardware while maintaining portability through scalar fallback.
The 64-byte payload alignment strategy complements SIMD operations by ensuring cache-friendly access patterns. However, the storage engine intentionally does not use SIMD for read operations—zero-copy memory-mapped access eliminates the need for data transfer entirely, providing superior read performance without vectorization.
For related performance optimizations, see:
- Payload Alignment and Cache Efficiency for alignment strategy details
- Memory Management and Zero-Copy Access for zero-copy read architecture
- Write and Read Modes for performance comparisons across different access patterns
Sources: README.md:248-256 README.md:43-59 High-Level Diagrams 2, 5, and 6