
SIMD Acceleration


Purpose and Scope

This document describes the SIMD (Single Instruction, Multiple Data) acceleration layer used in the SIMD R Drive storage engine. SIMD acceleration provides vectorized memory copy operations that process multiple bytes simultaneously, improving throughput for data write operations. The implementation supports AVX2 instructions on x86_64 architectures and NEON instructions on ARM AArch64 architectures.

For information about payload alignment considerations that complement SIMD operations, see Payload Alignment and Cache Efficiency. For details on how SIMD operations are measured, see Benchmarking.

Architecture Support Matrix

The SIMD acceleration layer provides platform-specific implementations based on available hardware features:

| Architecture | SIMD Technology | Vector Width | Bytes per Operation | Runtime Detection |
|--------------|-----------------|--------------|---------------------|-------------------|
| x86_64 | AVX2 | 256-bit | 32 bytes | Yes (`is_x86_feature_detected!`) |
| aarch64 (ARM) | NEON | 128-bit | 16 bytes | No (always enabled) |
| Other | Scalar fallback | N/A | 1 byte | N/A |

Sources: src/storage_engine/simd_copy.rs:10-138

SIMD Copy Architecture

The simd_copy function serves as the unified entry point for SIMD-accelerated memory operations, dispatching to architecture-specific implementations based on compile-time and runtime feature detection.

SIMD Copy Dispatch Flow

```mermaid
graph TB
    Entry["simd_copy(dst, src)"]
    Check_x86["#[cfg(target_arch = 'x86_64')]\nCompile-time check"]
    Check_arm["#[cfg(target_arch = 'aarch64')]\nCompile-time check"]
    Detect_AVX2["is_x86_feature_detected!('avx2')\nRuntime detection"]
    AVX2_Impl["simd_copy_x86(dst, src)\n32-byte chunks\n_mm256_loadu_si256\n_mm256_storeu_si256"]
    NEON_Impl["simd_copy_arm(dst, src)\n16-byte chunks\nvld1q_u8\nvst1q_u8"]
    Scalar_Fallback["copy_from_slice\nStandard Rust memcpy"]
    Warning["LOG_ONCE.call_once\nWarn: AVX2 not detected"]

    Entry --> Check_x86
    Entry --> Check_arm
    Check_x86 --> Detect_AVX2
    Detect_AVX2 -->|true| AVX2_Impl
    Detect_AVX2 -->|false| Warning
    Warning --> Scalar_Fallback
    Check_arm --> NEON_Impl
    Entry --> Scalar_Fallback

    style Entry fill:#f9f9f9,stroke:#333,stroke-width:2px
    style AVX2_Impl fill:#f0f0f0
    style NEON_Impl fill:#f0f0f0
    style Scalar_Fallback fill:#f0f0f0
```

Sources: src/storage_engine/simd_copy.rs:110-138

x86_64 AVX2 Implementation

The simd_copy_x86 function leverages AVX2 instructions for vectorized memory operations on x86_64 processors.

Function Signature and Safety

src/storage_engine/simd_copy.rs:32-35 defines the function with the #[target_feature(enable = "avx2")] attribute, which enables AVX2 code generation and marks the function as unsafe.
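
The snippet itself is not reproduced in this page, so the following is a hedged reconstruction of that declaration; the parameter names are assumptions, not copied from the source file:

```rust
// Hedged reconstruction of the declaration at simd_copy.rs:32-35.
// The #[inline] attribute is mentioned later in this document; the
// parameter names are assumptions.
#[inline]
#[target_feature(enable = "avx2")]
unsafe fn simd_copy_x86(dst: &mut [u8], src: &[u8]) {
    // Body shown in the chunked-copy sketch in the next subsection.
}
```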

Chunked Copy Strategy

The implementation processes data in 32-byte chunks corresponding to the 256-bit AVX2 register width:

| Step | Operation / Intrinsic | Source | Description |
|------|-----------------------|--------|-------------|
| 1. Calculate chunks | `len / 32` | N/A | Determines the number of full 32-byte iterations |
| 2. Load from source | `_mm256_loadu_si256` | src/storage_engine/simd_copy.rs:47 | Unaligned load of 256 bits |
| 3. Store to destination | `_mm256_storeu_si256` | src/storage_engine/simd_copy.rs:55 | Unaligned store of 256 bits |
| 4. Handle remainder | `copy_from_slice` | src/storage_engine/simd_copy.rs:61 | Scalar copy for remaining bytes |
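
Putting the table together, a minimal sketch of the chunked strategy, assuming slice inputs; the intrinsic calls match the table, but the variable names and surrounding scaffolding are illustrative:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

// Illustrative sketch only; names and structure are assumptions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn simd_copy_x86(dst: &mut [u8], src: &[u8]) {
    let len = dst.len().min(src.len()); // bound by the smaller buffer
    let chunks = len / 32;              // number of full 256-bit iterations

    for c in 0..chunks {
        let i = c * 32; // i is bounded by chunks * 32 <= len
        // Unaligned 256-bit load from the source buffer.
        let v = _mm256_loadu_si256(src.as_ptr().add(i) as *const __m256i);
        // Unaligned 256-bit store into the destination buffer.
        _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, v);
    }

    // Scalar copy for any remaining tail bytes.
    dst[chunks * 32..len].copy_from_slice(&src[chunks * 32..len]);
}
```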

Memory Safety Guarantees

The implementation includes detailed safety comments (src/storage_engine/simd_copy.rs:42-56) documenting:

  • Buffer bounds validation (len calculated as minimum of dst.len() and src.len())
  • Pointer arithmetic guarantees (i bounded by chunks * 32 <= len)
  • Alignment handling via unaligned load/store instructions

Sources: src/storage_engine/simd_copy.rs:32-62

ARM NEON Implementation

The simd_copy_arm function provides vectorized operations for ARM AArch64 processors using the NEON instruction set.

Function Signature

src/storage_engine/simd_copy.rs:80-83 defines the ARM-specific implementation.
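
As with the x86_64 variant, the snippet is not reproduced here; a hedged reconstruction of the declaration (parameter names are assumptions):

```rust
// Hedged reconstruction of the declaration at simd_copy.rs:80-83;
// parameter names are assumptions.
#[inline]
unsafe fn simd_copy_arm(dst: &mut [u8], src: &[u8]) {
    // Body shown in the sketch after the step list below.
}
```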

NEON Operation Pattern

NEON 16-byte Copy Cycle

The implementation (src/storage_engine/simd_copy.rs:83-108) proceeds in four steps, sketched in code after this list:

  1. Chunk Calculation: Divides length by 16 (NEON register width)
  2. Load Operation: Uses vld1q_u8 to read 16 bytes into a NEON register (src/storage_engine/simd_copy.rs:94)
  3. Store Operation: Uses vst1q_u8 to write 16 bytes from the register to the destination (src/storage_engine/simd_copy.rs:101)
  4. Remainder Handling: Scalar copy for any bytes not fitting in 16-byte chunks (src/storage_engine/simd_copy.rs:107)
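
A minimal sketch of these four steps, mirroring the AVX2 version but with 16-byte NEON registers; names and scaffolding are assumptions:

```rust
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::{vld1q_u8, vst1q_u8};

// Illustrative sketch only; names and structure are assumptions.
#[cfg(target_arch = "aarch64")]
unsafe fn simd_copy_arm(dst: &mut [u8], src: &[u8]) {
    let len = dst.len().min(src.len());
    let chunks = len / 16; // number of full 128-bit iterations

    for c in 0..chunks {
        let i = c * 16;
        // Load 16 bytes from the source into a NEON register (uint8x16_t).
        let v = vld1q_u8(src.as_ptr().add(i));
        // Store the 16 bytes from the register to the destination.
        vst1q_u8(dst.as_mut_ptr().add(i), v);
    }

    // Scalar copy for any remaining tail bytes.
    dst[chunks * 16..len].copy_from_slice(&src[chunks * 16..len]);
}
```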

Sources: src/storage_engine/simd_copy.rs:80-108

Runtime Feature Detection

x86_64 Detection Mechanism

The x86_64 implementation uses Rust’s standard library feature detection:

x86_64 AVX2 Runtime Detection Flow

src/storage_engine/simd_copy.rs:114-124 implements the detection with logging:

  • The std::is_x86_feature_detected!("avx2") macro performs runtime CPUID checks
  • The LOG_ONCE static variable (src/storage_engine/simd_copy.rs:8) ensures the warning is emitted only once
  • Fallback to scalar copy occurs transparently when AVX2 is unavailable
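
A condensed sketch of this dispatch-and-detection logic, assuming the simd_copy_x86 and simd_copy_arm sketches from the earlier sections; the use of std::sync::Once and the log message wording are assumptions:

```rust
use std::sync::Once;

static LOG_ONCE: Once = Once::new();

// Hedged sketch of the unified entry point. The overall shape (cfg
// dispatch, runtime AVX2 check, one-time warning, scalar fallback)
// follows this document; the details are assumptions.
#[allow(unreachable_code)]
pub fn simd_copy(dst: &mut [u8], src: &[u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was just verified at runtime.
            unsafe { simd_copy_x86(dst, src) };
            return;
        }
        // Warn exactly once, then fall through to the scalar path.
        LOG_ONCE.call_once(|| {
            eprintln!("warning: AVX2 not detected, using scalar copy");
        });
    }

    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on AArch64, so no runtime check is needed.
        unsafe { simd_copy_arm(dst, src) };
        return;
    }

    // Final fallback: safe, bounds-checked scalar copy.
    let len = dst.len().min(src.len());
    dst[..len].copy_from_slice(&src[..len]);
}
```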

ARM Detection Strategy

ARM AArch64 does not provide standard runtime feature detection. The implementation assumes NEON availability on all AArch64 targets (src/storage_engine/simd_copy.rs:127-133), which is guaranteed by the ARMv8 architecture specification.

Sources: src/storage_engine/simd_copy.rs:4-8 src/storage_engine/simd_copy.rs:110-138

Fallback Behavior

The system provides three fallback layers for environments without SIMD support:

Fallback Hierarchy

Fallback Decision Tree

```mermaid
graph TD
    Start["simd_copy invoked"]
    Layer1["Layer 1: Platform-specific SIMD\nAVX2 or NEON if available"]
    Layer2["Layer 2: Runtime detection failure\nAVX2 not detected on x86_64"]
    Layer3["Layer 3: Unsupported architecture\nNeither x86_64 nor aarch64"]
    Scalar["copy_from_slice\nStandard Rust memcpy\nCompiler-optimized"]

    Start --> Layer1
    Layer1 -->|No SIMD available| Layer2
    Layer2 -->|No runtime support| Layer3
    Layer3 --> Scalar

    style Scalar fill:#f0f0f0
```

Scalar Copy Implementation

src/storage_engine/simd_copy.rs:136-137 implements the final fallback.
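
For reference, a sketch of what this fallback amounts to; the exact snippet is not reproduced here, and the helper name is hypothetical:

```rust
// Hedged sketch of the final fallback: a safe, bounds-checked slice copy
// that LLVM typically lowers to an optimized memcpy.
fn scalar_copy(dst: &mut [u8], src: &[u8]) {
    let len = dst.len().min(src.len());
    dst[..len].copy_from_slice(&src[..len]);
}
```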

This uses Rust’s standard library copy_from_slice, which:

  • Relies on LLVM’s optimized memcpy implementation
  • May use SIMD instructions if the compiler determines it’s beneficial
  • Provides a safe, portable baseline for all platforms

Sources: src/storage_engine/simd_copy.rs:136-137

Integration with Storage Engine

The simd_copy function is invoked during write operations to efficiently copy user data into the memory-mapped storage file.

Usage Context

SIMD Integration in Write Path

```mermaid
graph TB
    subgraph "DataStore Write Path"
        Write["write(key, value)"]
        Align["Calculate 64-byte alignment padding"]
        Allocate["Allocate file space"]
        Copy["simd_copy(dst, src)"]
        Metadata["Write metadata\n(hash, prev_offset, crc32)"]
    end

    subgraph "simd_copy Function"
        Dispatch["Platform dispatch"]
        AVX2["AVX2: 32-byte chunks"]
        NEON["NEON: 16-byte chunks"]
        Scalar["Scalar fallback"]
    end

    subgraph "Storage File"
        MMap["Memory-mapped region"]
        Payload["64-byte aligned payload"]
    end

    Write --> Align
    Align --> Allocate
    Allocate --> Copy
    Copy --> Dispatch
    Dispatch --> AVX2
    Dispatch --> NEON
    Dispatch --> Scalar
    AVX2 --> MMap
    NEON --> MMap
    Scalar --> MMap
    MMap --> Payload
    Copy --> Metadata

    style Copy fill:#f9f9f9,stroke:#333,stroke-width:2px
    style Dispatch fill:#f0f0f0
```

The storage engine’s write operations leverage SIMD acceleration when copying payload data into the memory-mapped file. The 64-byte payload alignment (see Payload Alignment and Cache Efficiency) ensures that SIMD operations work with naturally aligned memory boundaries, maximizing cache efficiency.
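
To make the call site concrete, a purely illustrative sketch of how a write path might invoke simd_copy; the write_payload name, the slice-based mmap view, and the offset handling are assumptions, not the engine's actual API:

```rust
// Purely illustrative; `write_payload` and the slice-based view of the
// memory map are assumptions. Assumes the `simd_copy` sketch above.
fn write_payload(mmap: &mut [u8], offset: usize, payload: &[u8]) {
    // Destination is a 64-byte aligned region inside the mapped file.
    let dst = &mut mmap[offset..offset + payload.len()];
    simd_copy(dst, payload); // vectorized copy (AVX2/NEON/scalar dispatch)
}
```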

Performance Impact

SIMD acceleration provides measurable benefits:

  • AVX2 (x86_64): Processes 32 bytes per instruction vs. scalar's 8 bytes (or fewer)
  • NEON (ARM): Processes 16 bytes per instruction vs. scalar's 8 bytes (or fewer)
  • Cache Efficiency: Larger transfer granularity reduces memory access overhead
  • Write Throughput: Directly improves write, batch_write, and write_stream performance

The actual performance gains are measured using the Criterion.rs benchmark suite (see Benchmarking).

Sources: src/storage_engine/simd_copy.rs:1-139 Cargo.toml:8

Dependencies and Compiler Support

Architecture-Specific Intrinsics

The implementation imports platform-specific SIMD intrinsics:

| Architecture | Import Statement | Intrinsics Used |
|--------------|------------------|-----------------|
| x86_64 | `use std::arch::x86_64::*;` (src/storage_engine/simd_copy.rs:11) | `__m256i`, `_mm256_loadu_si256`, `_mm256_storeu_si256` |
| aarch64 | `use std::arch::aarch64::*;` (src/storage_engine/simd_copy.rs:14) | `vld1q_u8`, `vst1q_u8` |

Build Configuration

The SIMD implementation requires no special feature flags in Cargo.toml:1-113. The code uses:

  • Compile-time conditional compilation (#[cfg(target_arch = "...")])
  • Runtime feature detection (x86_64 only)
  • Standard Rust toolchain support (no nightly features required)

The #[inline] attribute on all SIMD functions encourages the compiler to inline these hot-path operations, reducing function call overhead.

Sources: src/storage_engine/simd_copy.rs:10-14 src/storage_engine/simd_copy.rs:32-35 src/storage_engine/simd_copy.rs:80-83
