This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
SIMD Acceleration
Purpose and Scope
This document describes the SIMD (Single Instruction, Multiple Data) acceleration layer used in the SIMD R Drive storage engine. SIMD acceleration provides vectorized memory copy operations that process multiple bytes simultaneously, improving throughput for data write operations. The implementation supports AVX2 instructions on x86_64 architectures and NEON instructions on ARM AArch64 architectures.
For information about payload alignment considerations that complement SIMD operations, see Payload Alignment and Cache Efficiency. For details on how SIMD operations are measured, see Benchmarking.
Architecture Support Matrix
The SIMD acceleration layer provides platform-specific implementations based on available hardware features:
| Architecture | SIMD Technology | Vector Width | Bytes per Operation | Runtime Detection |
|---|---|---|---|---|
| x86_64 | AVX2 | 256-bit | 32 bytes | Yes (is_x86_feature_detected!) |
| aarch64 (ARM) | NEON | 128-bit | 16 bytes | No (always enabled) |
| Other | Scalar fallback | N/A | 1 byte | N/A |
Sources: src/storage_engine/simd_copy.rs:10-138
SIMD Copy Architecture
The simd_copy function serves as the unified entry point for SIMD-accelerated memory operations, dispatching to architecture-specific implementations based on compile-time and runtime feature detection.
SIMD Copy Dispatch Flow
```mermaid
graph TB
    Entry["simd_copy(dst, src)"]
    Check_x86["#[cfg(target_arch = 'x86_64')]\nCompile-time check"]
    Check_arm["#[cfg(target_arch = 'aarch64')]\nCompile-time check"]
    Detect_AVX2["is_x86_feature_detected!('avx2')\nRuntime detection"]
    AVX2_Impl["simd_copy_x86(dst, src)\n32-byte chunks\n_mm256_loadu_si256\n_mm256_storeu_si256"]
    NEON_Impl["simd_copy_arm(dst, src)\n16-byte chunks\nvld1q_u8\nvst1q_u8"]
    Scalar_Fallback["copy_from_slice\nStandard Rust memcpy"]
    Warning["LOG_ONCE.call_once\nWarn: AVX2 not detected"]
    Entry --> Check_x86
    Entry --> Check_arm
    Check_x86 --> Detect_AVX2
    Detect_AVX2 -->|true| AVX2_Impl
    Detect_AVX2 -->|false| Warning
    Warning --> Scalar_Fallback
    Check_arm --> NEON_Impl
    Entry --> Scalar_Fallback
    style Entry fill:#f9f9f9,stroke:#333,stroke-width:2px
    style AVX2_Impl fill:#f0f0f0
    style NEON_Impl fill:#f0f0f0
    style Scalar_Fallback fill:#f0f0f0
```
Sources: src/storage_engine/simd_copy.rs:110-138
x86_64 AVX2 Implementation
The simd_copy_x86 function leverages AVX2 instructions for vectorized memory operations on x86_64 processors.
Function Signature and Safety
src/storage_engine/simd_copy.rs:32-35 defines the function with the #[target_feature(enable = "avx2")] attribute, which enables AVX2 code generation and marks the function as unsafe:
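A minimal sketch of the signature and chunked-copy pattern described in this section (the authoritative version lives in src/storage_engine/simd_copy.rs; bounds handling here follows the safety comments quoted below):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

/// Copy `src` into `dst` in 32-byte AVX2 chunks, with a scalar tail.
///
/// # Safety
/// The caller must have verified AVX2 support at runtime, e.g. with
/// `is_x86_feature_detected!("avx2")`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn simd_copy_x86(dst: &mut [u8], src: &[u8]) {
    // Bound by the shorter of the two buffers.
    let len = dst.len().min(src.len());
    let chunks = len / 32; // number of full 256-bit iterations

    for c in 0..chunks {
        let i = c * 32; // i + 32 <= chunks * 32 <= len, so in bounds
        // Unaligned 256-bit load from the source...
        let v = _mm256_loadu_si256(src.as_ptr().add(i) as *const __m256i);
        // ...and unaligned 256-bit store into the destination.
        _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, v);
    }

    // Scalar copy for the remaining len % 32 bytes.
    dst[chunks * 32..len].copy_from_slice(&src[chunks * 32..len]);
}
```

The `#[target_feature(enable = "avx2")]` attribute is what makes the function `unsafe` to call: the compiler emits AVX2 instructions unconditionally, so invoking it on a CPU without AVX2 is undefined behavior.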
Chunked Copy Strategy
The implementation processes data in 32-byte chunks corresponding to the 256-bit AVX2 register width:
| Step | Operation | Source | Description |
|---|---|---|---|
| 1. Calculate chunks | `len / 32` | N/A | Determines number of full 32-byte iterations |
| 2. Load from source | `_mm256_loadu_si256` | src/storage_engine/simd_copy.rs:47 | Unaligned load of 256 bits |
| 3. Store to destination | `_mm256_storeu_si256` | src/storage_engine/simd_copy.rs:55 | Unaligned store of 256 bits |
| 4. Handle remainder | `copy_from_slice` | src/storage_engine/simd_copy.rs:61 | Scalar copy for remaining bytes |
Memory Safety Guarantees
The implementation includes detailed safety comments (src/storage_engine/simd_copy.rs:42-56) documenting:
- Buffer bounds validation (`len` calculated as the minimum of `dst.len()` and `src.len()`)
- Pointer arithmetic guarantees (`i` bounded by `chunks * 32 <= len`)
- Alignment handling via unaligned load/store instructions
Sources: src/storage_engine/simd_copy.rs:32-62
ARM NEON Implementation
The simd_copy_arm function provides vectorized operations for ARM AArch64 processors using the NEON instruction set.
Function Signature
src/storage_engine/simd_copy.rs:80-83 defines the ARM-specific implementation:
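A minimal sketch of the NEON variant described in this section (a stand-in for the real function in src/storage_engine/simd_copy.rs, mirroring the x86 pattern at 16-byte granularity):

```rust
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::{vld1q_u8, vst1q_u8};

/// Copy `src` into `dst` in 16-byte NEON chunks, with a scalar tail.
///
/// # Safety
/// Uses raw pointer arithmetic; `i + 16 <= len` holds for every load
/// and store. NEON itself is guaranteed on all AArch64 targets.
#[cfg(target_arch = "aarch64")]
unsafe fn simd_copy_arm(dst: &mut [u8], src: &[u8]) {
    // Bound by the shorter of the two buffers.
    let len = dst.len().min(src.len());
    let chunks = len / 16; // number of full 128-bit iterations

    for c in 0..chunks {
        let i = c * 16;
        // Load 16 bytes into a NEON register...
        let v = vld1q_u8(src.as_ptr().add(i));
        // ...and write them out to the destination.
        vst1q_u8(dst.as_mut_ptr().add(i), v);
    }

    // Scalar copy for the remaining len % 16 bytes.
    dst[chunks * 16..len].copy_from_slice(&src[chunks * 16..len]);
}
```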
NEON Operation Pattern
NEON 16-byte Copy Cycle
The implementation (src/storage_engine/simd_copy.rs:83-108):
- Chunk Calculation: Divides length by 16 (NEON register width)
- Load Operation: Uses `vld1q_u8` to read 16 bytes into a NEON register (src/storage_engine/simd_copy.rs:94)
- Store Operation: Uses `vst1q_u8` to write 16 bytes from the register to the destination (src/storage_engine/simd_copy.rs:101)
- Remainder Handling: Scalar copy for any bytes not fitting in 16-byte chunks (src/storage_engine/simd_copy.rs:107)
Sources: src/storage_engine/simd_copy.rs:80-108
Runtime Feature Detection
x86_64 Detection Mechanism
The x86_64 implementation uses Rust’s standard library feature detection:
x86_64 AVX2 Runtime Detection Flow
src/storage_engine/simd_copy.rs:114-124 implements the detection with logging:
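A hedged sketch of the detection-plus-logging pattern (the scalar body stands in for the real AVX2 helper; `LOG_ONCE` mirrors the static in simd_copy.rs):

```rust
use std::sync::Once;

// Mirrors the LOG_ONCE static: the warning fires at most once per process.
static LOG_ONCE: Once = Once::new();

/// Detection sketch: check AVX2 at runtime, warn once on the slow path.
fn simd_copy(dst: &mut [u8], src: &[u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if !std::is_x86_feature_detected!("avx2") {
            // Log only on the first fallback; later calls stay silent.
            LOG_ONCE.call_once(|| {
                eprintln!("warning: AVX2 not detected; using scalar copy");
            });
        }
        // The real implementation dispatches to the AVX2 helper when
        // detection succeeds; this sketch always uses the scalar path.
    }
    // Transparent scalar fallback (and stand-in for the SIMD paths).
    let len = dst.len().min(src.len());
    dst[..len].copy_from_slice(&src[..len]);
}
```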
- The `std::is_x86_feature_detected!("avx2")` macro performs runtime CPUID checks
- The `LOG_ONCE` static variable (src/storage_engine/simd_copy.rs:8) ensures the warning is emitted only once
- Fallback to scalar copy occurs transparently when AVX2 is unavailable
ARM Detection Strategy
ARM AArch64 does not provide standard runtime feature detection. The implementation assumes NEON availability on all AArch64 targets (src/storage_engine/simd_copy.rs:127-133), which is guaranteed by the ARMv8 architecture specification.
Sources: src/storage_engine/simd_copy.rs:4-8 src/storage_engine/simd_copy.rs:110-138
Fallback Behavior
The system provides three fallback layers for environments without SIMD support:
Fallback Hierarchy
- Layer 1: Platform-specific SIMD (AVX2 or NEON) when available
- Layer 2: Runtime detection failure (AVX2 not detected on x86_64)
- Layer 3: Unsupported architecture (neither x86_64 nor aarch64)
Every non-SIMD path resolves to `copy_from_slice`, the compiler-optimized scalar copy.
Fallback Decision Tree
```mermaid
graph TD
    Start["simd_copy invoked"]
    Layer1["Layer 1: Platform-specific SIMD\nAVX2 or NEON if available"]
    Layer2["Layer 2: Runtime detection failure\nAVX2 not detected on x86_64"]
    Layer3["Layer 3: Unsupported architecture\nNeither x86_64 nor aarch64"]
    Scalar["copy_from_slice\nStandard Rust memcpy\nCompiler-optimized"]
    Start --> Layer1
    Layer1 -->|No SIMD available| Layer2
    Layer2 -->|No runtime support| Layer3
    Layer3 --> Scalar
    style Scalar fill:#f0f0f0
```
Scalar Copy Implementation
src/storage_engine/simd_copy.rs:136-137 implements the final fallback:
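The fallback is essentially a one-liner; a minimal stand-alone sketch:

```rust
/// Portable scalar fallback: delegates to the standard library,
/// which LLVM lowers to an optimized memcpy.
fn scalar_copy(dst: &mut [u8], src: &[u8]) {
    // Copy only as many bytes as both slices can hold.
    let len = dst.len().min(src.len());
    dst[..len].copy_from_slice(&src[..len]);
}
```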
This uses Rust’s standard library copy_from_slice, which:
- Relies on LLVM’s optimized `memcpy` implementation
- May use SIMD instructions if the compiler determines it’s beneficial
- Provides a safe, portable baseline for all platforms
Sources: src/storage_engine/simd_copy.rs:136-137
Integration with Storage Engine
The simd_copy function is invoked during write operations to efficiently copy user data into the memory-mapped storage file.
Usage Context
SIMD Integration in Write Path
```mermaid
graph TB
    subgraph "DataStore Write Path"
        Write["write(key, value)"]
        Align["Calculate 64-byte alignment padding"]
        Allocate["Allocate file space"]
        Copy["simd_copy(dst, src)"]
        Metadata["Write metadata\n(hash, prev_offset, crc32)"]
    end
    subgraph "simd_copy Function"
        Dispatch["Platform dispatch"]
        AVX2["AVX2: 32-byte chunks"]
        NEON["NEON: 16-byte chunks"]
        Scalar["Scalar fallback"]
    end
    subgraph "Storage File"
        MMap["Memory-mapped region"]
        Payload["64-byte aligned payload"]
    end
    Write --> Align
    Align --> Allocate
    Allocate --> Copy
    Copy --> Dispatch
    Dispatch --> AVX2
    Dispatch --> NEON
    Dispatch --> Scalar
    AVX2 --> MMap
    NEON --> MMap
    Scalar --> MMap
    MMap --> Payload
    Copy --> Metadata
    style Copy fill:#f9f9f9,stroke:#333,stroke-width:2px
    style Dispatch fill:#f0f0f0
```
The storage engine’s write operations leverage SIMD acceleration when copying payload data into the memory-mapped file. The 64-byte payload alignment (see Payload Alignment and Cache Efficiency) ensures that SIMD operations work with naturally aligned memory boundaries, maximizing cache efficiency.
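As a hedged illustration of this copy step (`write_payload` and the scalar `simd_copy` stand-in below are hypothetical names, not from the source):

```rust
/// Stand-in for the engine's simd_copy (scalar here, for portability).
fn simd_copy(dst: &mut [u8], src: &[u8]) {
    let len = dst.len().min(src.len());
    dst[..len].copy_from_slice(&src[..len]);
}

/// Hypothetical write-path step: copy a payload into a 64-byte-aligned
/// slot of a (memory-mapped) buffer.
fn write_payload(region: &mut [u8], offset: usize, payload: &[u8]) {
    // Payload slots start on 64-byte boundaries per the alignment scheme.
    assert_eq!(offset % 64, 0, "payload offset must be 64-byte aligned");
    simd_copy(&mut region[offset..offset + payload.len()], payload);
}
```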
Performance Impact
SIMD acceleration provides measurable benefits:
- AVX2 (x86_64) : Processes 32 bytes per instruction vs. scalar’s 8 bytes (or less)
- NEON (ARM) : Processes 16 bytes per instruction vs. scalar’s 8 bytes (or less)
- Cache Efficiency : Larger transfer granularity reduces memory access overhead
- Write Throughput: Directly improves `write`, `batch_write`, and `write_stream` performance
The actual performance gains are measured using the Criterion.rs benchmark suite (see Benchmarking).
Sources: src/storage_engine/simd_copy.rs:1-139 Cargo.toml:8
Dependencies and Compiler Support
Architecture-Specific Intrinsics
The implementation imports platform-specific SIMD intrinsics:
| Architecture | Import Statement | Intrinsics Used |
|---|---|---|
| x86_64 | `use std::arch::x86_64::*;` (src/storage_engine/simd_copy.rs:11) | `__m256i`, `_mm256_loadu_si256`, `_mm256_storeu_si256` |
| aarch64 | `use std::arch::aarch64::*;` (src/storage_engine/simd_copy.rs:14) | `vld1q_u8`, `vst1q_u8` |
Build Configuration
The SIMD implementation requires no special feature flags (Cargo.toml:1-113). The code uses:
- Compile-time conditional compilation (`#[cfg(target_arch = "...")]`)
- Runtime feature detection (x86_64 only)
- Standard Rust toolchain support (no nightly features required)
The #[inline] attribute on all SIMD functions encourages the compiler to inline these hot-path operations, reducing function call overhead.
Sources: src/storage_engine/simd_copy.rs:10-14 src/storage_engine/simd_copy.rs:32-35 src/storage_engine/simd_copy.rs:80-83