SIMD Acceleration

Purpose and Scope

This page documents the SIMD (Single Instruction, Multiple Data) optimizations implemented in SIMD R Drive. The storage engine uses SIMD acceleration in two critical areas: write operations (via the simd_copy function) and key hashing (via the xxh3_64 algorithm).

SIMD is not used for read operations, which rely on zero-copy memory-mapped access for optimal performance. For information about zero-copy reads and memory management, see Memory Management and Zero-Copy Access. For details on the 64-byte alignment strategy that enables efficient SIMD operations, see Payload Alignment and Cache Efficiency.

Sources: README.md:248-256


Overview

SIMD R Drive leverages hardware SIMD capabilities to accelerate performance-critical operations during write workflows. The storage engine uses platform-specific SIMD instructions to minimize CPU cycles spent on memory operations, resulting in improved write throughput and indexing performance.

The SIMD acceleration strategy consists of two components:

| Component | Purpose | Platforms | Performance Impact |
|---|---|---|---|
| simd_copy function | Accelerates memory copying during write buffer staging | x86_64 (AVX2), aarch64 (NEON) | Reduces write latency by vectorizing memory transfers |
| xxh3_64 hashing | Hardware-accelerated key hashing for index operations | x86_64 (SSE2/AVX2), aarch64 (NEON) | Improves key lookup and index-building performance |

The system uses runtime CPU feature detection to select the optimal implementation path and automatically falls back to scalar operations when SIMD instructions are unavailable.

Sources: README.md:248-256 High-Level Diagram 6


SIMD Copy Function

Purpose and Role in Write Operations

The simd_copy function is a specialized memory transfer routine used during write operations. When the storage engine writes data to disk, it first stages the payload in an internal buffer. The simd_copy function accelerates this staging process by using vectorized memory operations instead of standard byte-wise copying.

Diagram: SIMD Copy Function Flow in Write Operations

```mermaid
graph TB
    WriteAPI["DataStoreWriter::write()"]
    Buffer["Internal Write Buffer\n(BufWriter<File>)"]
    SIMDCopy["simd_copy Function"]
    CPUDetect["CPU Feature Detection\n(Runtime)"]
    AVX2Path["AVX2 Implementation\n_mm256_loadu_si256\n_mm256_storeu_si256\n(32-byte chunks)"]
    NEONPath["NEON Implementation\nvld1q_u8\nvst1q_u8\n(16-byte chunks)"]
    ScalarPath["Scalar Fallback\ncopy_from_slice\n(Standard memcpy)"]
    DiskWrite["Write to Disk\n(Buffered I/O)"]
    WriteAPI -->|user payload| SIMDCopy
    SIMDCopy --> CPUDetect
    CPUDetect -->|x86_64 + AVX2| AVX2Path
    CPUDetect -->|aarch64| NEONPath
    CPUDetect -->|no SIMD| ScalarPath
    AVX2Path --> Buffer
    NEONPath --> Buffer
    ScalarPath --> Buffer
    Buffer --> DiskWrite
```

The simd_copy function operates transparently within the write path. Applications using the DataStoreWriter trait do not need to explicitly invoke SIMD operations—the acceleration is automatic and selected based on the host CPU capabilities.
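To make the staging step concrete, here is a minimal, hypothetical sketch of a write path handing a payload to simd_copy before buffered disk I/O. The stage_payload helper and the Vec-based staging buffer are illustrative assumptions, not code from the SIMD R Drive source:

```rust
// Illustrative sketch (not the actual SIMD R Drive code) of how a write
// stages its payload through simd_copy before buffered disk I/O.

// Stand-in for the real function; a full dispatch sketch appears in the
// "Runtime CPU Feature Detection" section below.
fn simd_copy(src: &[u8], dst: &mut [u8]) {
    dst.copy_from_slice(src);
}

fn stage_payload(payload: &[u8], staging: &mut Vec<u8>) {
    let start = staging.len();
    // Reserve space in the staging buffer, then vector-copy into it.
    staging.resize(start + payload.len(), 0);
    simd_copy(payload, &mut staging[start..]);
}

fn main() {
    let mut staging = Vec::new();
    stage_payload(b"payload bytes", &mut staging);
    assert_eq!(&staging, b"payload bytes");
}
```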

Sources: README.md:252-253 High-Level Diagram 6


Platform-Specific Implementations

x86_64 Architecture: AVX2 Instructions

On x86_64 platforms with AVX2 support, the simd_copy function uses 256-bit Advanced Vector Extensions to process memory in 32-byte chunks. The implementation employs two primary intrinsics:

| Intrinsic | Operation | Description |
|---|---|---|
| _mm256_loadu_si256 | Load | Reads 32 bytes from source memory into a 256-bit register (unaligned load) |
| _mm256_storeu_si256 | Store | Writes 32 bytes from a 256-bit register to destination memory (unaligned store) |

Diagram: AVX2 Memory Transfer Operation

```mermaid
graph LR
    SrcMem["Source Memory\n(Payload Data)"]
    AVX2Reg["256-bit AVX2 Register\n(__m256i)"]
    DstMem["Destination Buffer\n(Write Buffer)"]
    SrcMem -->|_mm256_loadu_si256 32 bytes| AVX2Reg
    AVX2Reg -->|_mm256_storeu_si256 32 bytes| DstMem
```

The AVX2 path is activated when the CPU supports the avx2 feature flag. Runtime detection ensures that the instruction set is available before executing AVX2 code paths.
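A minimal sketch of what such an AVX2 copy loop can look like, using the two intrinsics above with a scalar tail for the final sub-32-byte remainder. The function name and structure are assumptions; the actual simd_copy implementation may differ:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_avx2(src: &[u8], dst: &mut [u8]) {
    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

    debug_assert_eq!(src.len(), dst.len());
    let len = src.len();
    let mut i = 0;

    // Move 32 bytes per iteration with unaligned 256-bit loads/stores.
    while i + 32 <= len {
        let chunk = _mm256_loadu_si256(src.as_ptr().add(i) as *const __m256i);
        _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, chunk);
        i += 32;
    }

    // Copy any remaining tail bytes with a scalar fallback.
    dst[i..len].copy_from_slice(&src[i..len]);
}
```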

Sources: README.md:252-253 High-Level Diagram 6

aarch64 Architecture: NEON Instructions

On aarch64 (ARM64) platforms, the simd_copy function uses NEON (Advanced SIMD) instructions to process memory in 16-byte chunks. The implementation uses:

| Intrinsic | Operation | Description |
|---|---|---|
| vld1q_u8 | Load | Reads 16 bytes from source memory into a 128-bit NEON register |
| vst1q_u8 | Store | Writes 16 bytes from a 128-bit NEON register to destination memory |

Diagram: NEON Memory Transfer Operation

```mermaid
graph LR
    SrcMem["Source Memory\n(Payload Data)"]
    NEONReg["128-bit NEON Register\n(uint8x16_t)"]
    DstMem["Destination Buffer\n(Write Buffer)"]
    SrcMem -->|vld1q_u8 16 bytes| NEONReg
    NEONReg -->|vst1q_u8 16 bytes| DstMem
```

NEON is available by default on all aarch64 targets, so no runtime feature detection is required for ARM64 platforms.
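The NEON counterpart follows the same shape with 16-byte chunks; again, this is a hypothetical sketch rather than the project's actual code:

```rust
#[cfg(target_arch = "aarch64")]
unsafe fn copy_neon(src: &[u8], dst: &mut [u8]) {
    use std::arch::aarch64::{vld1q_u8, vst1q_u8};

    debug_assert_eq!(src.len(), dst.len());
    let len = src.len();
    let mut i = 0;

    // Move 16 bytes per iteration through a 128-bit NEON register.
    while i + 16 <= len {
        let chunk = vld1q_u8(src.as_ptr().add(i));
        vst1q_u8(dst.as_mut_ptr().add(i), chunk);
        i += 16;
    }

    // Copy any remaining tail bytes with a scalar fallback.
    dst[i..len].copy_from_slice(&src[i..len]);
}
```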

Sources: README.md:252-253 High-Level Diagram 6

Scalar Fallback Implementation

When SIMD instructions are unavailable (e.g., on older CPUs or unsupported architectures), the simd_copy function automatically falls back to the standard Rust copy_from_slice method. This ensures portability across all platforms while still benefiting from compiler optimizations (which may include auto-vectorization in some cases).

The fallback path provides correct functionality on all platforms but does not achieve the same throughput as explicit SIMD implementations.

Sources: README.md:252-253 High-Level Diagram 6


Runtime CPU Feature Detection

Detection Mechanism

The storage engine uses conditional compilation and runtime feature detection to select the appropriate SIMD implementation:

Diagram: SIMD Path Selection Logic

```mermaid
graph TB
    Start["Program Execution Start"]
    CompileTarget["Compile-Time Target Check"]
    X86Check{"Target:\nx86_64?"}
    RuntimeAVX2{"Runtime:\nAVX2 available?"}
    ARMCheck{"Target:\naarch64?"}
    UseAVX2["Use AVX2 Path\n(32-byte chunks)"]
    UseNEON["Use NEON Path\n(16-byte chunks)"]
    UseScalar["Use Scalar Fallback\n(copy_from_slice)"]
    Start --> CompileTarget
    CompileTarget --> X86Check
    X86Check -->|Yes| RuntimeAVX2
    RuntimeAVX2 -->|Yes| UseAVX2
    RuntimeAVX2 -->|No| UseScalar
    X86Check -->|No| ARMCheck
    ARMCheck -->|Yes| UseNEON
    ARMCheck -->|No| UseScalar
```

Feature Detection Strategy

| Platform | Detection Method | Feature Flag | Fallback Condition |
|---|---|---|---|
| x86_64 | Runtime is_x86_feature_detected!("avx2") | avx2 | AVX2 not supported → scalar |
| aarch64 | Compile-time (NEON always available) | N/A | N/A (NEON always present) |
| Other | Compile-time target check | N/A | Always use scalar |

The runtime detection overhead is negligible—the feature check is typically performed once during the first simd_copy invocation and the result is cached for subsequent calls.
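Combining the platform sketches above, the selection logic can be expressed as a single dispatch function. This is a hedged sketch under the assumptions already stated (copy_avx2 and copy_neon as defined in the earlier snippets); only the is_x86_feature_detected! macro is confirmed by the table above:

```rust
fn simd_copy(src: &[u8], dst: &mut [u8]) {
    debug_assert_eq!(src.len(), dst.len());

    #[cfg(target_arch = "x86_64")]
    {
        // The standard library macro caches its result internally, so
        // the check is effectively free after the first call.
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: the avx2 feature was just verified at runtime.
            unsafe { copy_avx2(src, dst) };
            return;
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        // NEON is part of the aarch64 baseline, so no runtime check.
        // SAFETY: NEON is always available on aarch64 targets.
        unsafe { copy_neon(src, dst) };
        return;
    }

    // Scalar fallback: x86_64 without AVX2, or any other architecture.
    #[allow(unreachable_code)]
    dst.copy_from_slice(src);
}
```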

Sources: README.md:252-256 High-Level Diagram 6


SIMD Acceleration in Key Hashing

XXH3_64 Algorithm with Hardware Acceleration

In addition to write buffer acceleration, SIMD R Drive uses the xxh3_64 hashing algorithm for key indexing. This algorithm is specifically designed to leverage SIMD instructions for high-speed hash computation.

The hashing subsystem uses hardware acceleration on supported platforms:

| Platform | SIMD Extension | Availability | Performance Benefit |
|---|---|---|---|
| x86_64 | SSE2 | Universal (all 64-bit x86) | Baseline SIMD acceleration |
| x86_64 | AVX2 | Supported on modern CPUs | Additional performance gains |
| aarch64 | NEON | Default on all ARM64 | SIMD acceleration |

Diagram: SIMD-Accelerated Key Hashing Pipeline

The xxh3_64 algorithm is used in multiple operations:

  • Key hash computation during write operations
  • Index lookups during read operations
  • Index building during storage recovery

By using SIMD-accelerated hashing, the storage engine minimizes the CPU overhead of hash computation, which is critical for maintaining high throughput in index-heavy workloads.
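As a usage-level illustration, the snippet below computes an XXH3-64 key hash with the xxhash-rust crate, one plausible binding (the Code Entity Reference table below names it as a likely candidate); the key value and dependency choice are assumptions for the example:

```rust
// Assumes: xxhash-rust = { version = "0.8", features = ["xxh3"] }
use xxhash_rust::xxh3::xxh3_64;

fn main() {
    // xxh3_64 dispatches to SSE2/AVX2/NEON internally where available.
    let key = b"user:42"; // hypothetical key
    let hash: u64 = xxh3_64(key);
    println!("index hash: {hash:#018x}");
}
```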

Sources: README.md:158-166 README.md:254-255 High-Level Diagram 6


Performance Characteristics

Write Throughput Improvements

The SIMD-accelerated write path provides measurable performance improvements over scalar memory operations. The performance gain depends on several factors:

| Factor | Impact on Performance |
|---|---|
| Payload size | Larger payloads benefit more from vectorized copying |
| CPU architecture | AVX2 (32-byte chunks) is typically faster than NEON (16-byte chunks) |
| Memory alignment | 64-byte-aligned payloads maximize cache efficiency |
| Buffer size | Larger write buffers reduce function call overhead |

While exact performance gains vary by workload and hardware, typical scenarios show:

  • AVX2 implementations: 2-4× throughput improvement over scalar copy for medium-to-large payloads
  • NEON implementations: 1.5-3× throughput improvement over scalar copy
  • Hash computation: XXH3 with SIMD is significantly faster than non-accelerated hash functions

Interaction with 64-Byte Alignment

The storage engine's 64-byte payload alignment strategy (see Payload Alignment and Cache Efficiency) synergizes with SIMD operations:

Diagram: Alignment and SIMD Interaction

The 64-byte alignment ensures that:

  1. Cache line alignment: Payloads start on cache-line boundaries, avoiding split-cache-line access penalties
  2. SIMD-friendly boundaries: Both AVX2 (32-byte) and NEON (16-byte) operations can run at full speed without crossing alignment boundaries
  3. Zero-copy efficiency: Memory-mapped reads benefit from predictable alignment (see Memory Management and Zero-Copy Access)
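To make the boundary math concrete, here is a small sketch of the round-up computation a 64-byte alignment strategy implies; align_up is a hypothetical helper, not a name from the codebase:

```rust
// Round an offset up to the next 64-byte boundary (illustrative only).
const ALIGNMENT: u64 = 64;

fn align_up(offset: u64) -> u64 {
    // Works because ALIGNMENT is a power of two.
    (offset + ALIGNMENT - 1) & !(ALIGNMENT - 1)
}

fn main() {
    assert_eq!(align_up(0), 0);
    assert_eq!(align_up(1), 64);
    assert_eq!(align_up(130), 192); // next payload starts on a fresh cache line
}
```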

Sources: README.md:51-59 README.md:248-256 High-Level Diagram 6


When SIMD Is (and Isn't) Used

Operations Using SIMD Acceleration

| Operation | SIMD Usage | Implementation |
|---|---|---|
| Write operations (write, batch_write, write_stream) | ✅ Yes | simd_copy transfers the payload into the write buffer |
| Key hashing (all operations requiring key lookup) | ✅ Yes | xxh3_64 with SSE2/AVX2/NEON acceleration |
| Index building (during storage recovery) | ✅ Yes | xxh3_64 hashing during the forward scan |

Operations NOT Using SIMD

| Operation | SIMD Usage | Reason |
|---|---|---|
| Read operations (read, batch_read, read_stream) | ❌ No | Zero-copy memory-mapped access; data is accessed directly from the mmap without copying |
| Iteration (into_iter, par_iter_entries) | ❌ No | Iterators return references to memory-mapped regions; no data transfer occurs |
| Entry validation (CRC32C checksum) | ❌ No | Uses a standard CRC32C implementation (hardware-accelerated on supporting CPUs via a separate mechanism) |

Diagram: SIMD Usage in Write vs. Read Paths

Why Reads Don't Need SIMD

Read operations in SIMD R Drive use memory-mapped I/O (mmap), which provides direct access to the storage file's contents without copying data into buffers. Since no memory transfer occurs, there is no opportunity to apply SIMD acceleration.

The zero-copy read strategy is fundamentally faster than any SIMD-accelerated copy operation because it eliminates the copy entirely. The memory-mapped approach allows the operating system's virtual memory subsystem to handle data access, often utilizing hardware-level optimizations like demand paging and read-ahead caching.
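For contrast with the write path, a conceptual zero-copy read might look like the following sketch, assuming the memmap2 crate (the project's actual mmap layer is not shown here); indexing into the mapping borrows bytes directly from the page cache with no intermediate copy:

```rust
use memmap2::Mmap;
use std::fs::File;

fn first_payload_byte(path: &str, offset: usize) -> std::io::Result<u8> {
    let file = File::open(path)?;
    // SAFETY: assumes the file is not truncated while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // Indexing the mapping reads through the page cache; no buffer
    // copy occurs, so there is nothing for SIMD to accelerate.
    Ok(mmap[offset])
}
```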

For more details on the zero-copy read architecture, see Memory Management and Zero-Copy Access.

Sources: README.md:43-49 README.md:254-256 High-Level Diagram 2


Code Entity Reference

The following table maps the conceptual SIMD components discussed in this document to their likely implementation locations within the codebase:

| Concept | Code Entity | Expected Location |
|---|---|---|
| SIMD copy function | simd_copy() | Core storage engine or utilities |
| AVX2 implementation | Architecture-specific module with _mm256_* intrinsics | Platform-specific code (likely #[cfg(target_arch = "x86_64")]) |
| NEON implementation | Architecture-specific module with vld1q_* / vst1q_* intrinsics | Platform-specific code (likely #[cfg(target_arch = "aarch64")]) |
| Runtime detection | is_x86_feature_detected!("avx2") macro | Guard on the AVX2 code path |
| XXH3 hashing | xxh3_64 function or xxhash_rust crate usage | Key hashing module |
| Write buffer staging | BufWriter<File> with simd_copy integration | DataStore write implementation |

Sources: README.md:248-256 High-Level Diagram 6


Summary

SIMD acceleration in SIMD R Drive focuses on write-path optimization and key hashing performance. The dual-platform implementation (AVX2 for x86_64, NEON for aarch64) with automatic runtime detection ensures optimal performance across diverse hardware while maintaining portability through scalar fallback.

The 64-byte payload alignment strategy complements SIMD operations by ensuring cache-friendly access patterns. However, the storage engine intentionally does not use SIMD for read operations—zero-copy memory-mapped access eliminates the need for data transfer entirely, providing superior read performance without vectorization.

For related performance optimizations, see Memory Management and Zero-Copy Access and Payload Alignment and Cache Efficiency.

Sources: README.md:248-256 README.md:43-59 High-Level Diagrams 2, 5, and 6