
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

SIMD Acceleration

Purpose and Scope

This document details the SIMD (Single Instruction, Multiple Data) acceleration features implemented in simd-r-drive. SIMD optimizations are applied to two critical performance paths: memory copying during write operations and key hashing for indexing. These optimizations leverage hardware-specific vector instructions (AVX2, NEON) to reduce CPU overhead and improve throughput in write-heavy workloads.

For information about the broader performance context including alignment strategies, see Payload Alignment and Cache Efficiency. For details on how these optimizations integrate with different write patterns, see Write and Read Modes.

Sources: README.md:248-257


SIMD Operations Overview

The system employs SIMD acceleration in two distinct areas:

| Operation | Purpose | Instruction Sets | Primary Use Case |
|---|---|---|---|
| Memory Copy (simd_copy) | Vectorized data transfer to write buffers | AVX2 (x86_64), NEON (aarch64) | Write operations, payload staging |
| Key Hashing (xxh3_64) | Hardware-accelerated hash computation | SSE2, AVX2, NEON | Key indexing, collision detection |
Key Distinction: SIMD is not used for read operations. Zero-copy reads access memory-mapped data directly without transformations, making SIMD unnecessary. The acceleration targets write-path bottlenecks where data must be moved through buffers before disk flush.

Sources: README.md:248-257


Architecture and Implementation Flow

Architecture Decision Flow

The simd_copy function performs runtime detection to select the optimal implementation. On x86_64, it checks for AVX2 support using std::is_x86_feature_detected!("avx2"). On aarch64, NEON is assumed to be available, since the ARM64 specification mandates it. All other architectures fall back to standard Rust slice operations.

Sources: src/storage_engine/simd_copy.rs:110-138


x86_64 AVX2 Implementation

Memory Copy Mechanism

AVX2 Implementation Details

The simd_copy_x86 function processes memory in 32-byte chunks using AVX2 vector instructions:

Processing Strategy:

  • Computes chunks = len / 32 to determine full vector iterations
  • Uses _mm256_loadu_si256 for unaligned 32-byte loads from source
  • Uses _mm256_storeu_si256 for unaligned 32-byte stores to destination
  • Processes remainder bytes (< 32) with scalar copy_from_slice

Safety Considerations:

  • Marked unsafe and requires #[target_feature(enable = "avx2")]
  • Caller must ensure both src and dst have at least len valid bytes
  • Pointer arithmetic is bounds-checked via chunk calculation
  • Unaligned loads/stores handle arbitrary memory alignment

Performance Characteristics:

  • AVX2 processes 256 bits (32 bytes) per instruction
  • Typical throughput: 4-8x faster than scalar copy on supported CPUs
  • Fallback warning logged once if AVX2 unavailable (e.g., virtualized environments)

Sources: src/storage_engine/simd_copy.rs:16-62 src/storage_engine/simd_copy.rs:112-125


AArch64 NEON Implementation

ARM Vector Operations

NEON Implementation Details

The simd_copy_arm function implements ARM Advanced SIMD (NEON):

Processing Strategy:

  • Processes 16-byte chunks (NEON vector width: 128 bits)
  • Uses vld1q_u8 for 16-byte vector loads
  • Uses vst1q_u8 for 16-byte vector stores
  • Handles remainder with scalar copy

Architecture Notes:

  • NEON is mandatory on ARM64 (aarch64), so no runtime detection is needed
  • No fallback warning on ARM; NEON is guaranteed available
  • Smaller vector width (16 vs 32 bytes) but consistent across ARM64 devices

Performance Characteristics:

  • Typical speedup: 2-4x over scalar copy
  • Particularly beneficial on Apple Silicon, AWS Graviton, and other ARM server platforms

Sources: src/storage_engine/simd_copy.rs:64-108 src/storage_engine/simd_copy.rs:127-133


Hardware-Accelerated Hashing (XXH3)

Hashing Infrastructure

XXH3 Acceleration Tiers

The XXH3 hashing algorithm automatically selects hardware acceleration based on available instruction sets:

| Platform | Default Acceleration | Enhanced Acceleration | Availability |
|---|---|---|---|
| x86_64 | SSE2 (universal) | AVX2 (opt-in) | SSE2 guaranteed on all x86_64 |
| aarch64 | NEON (universal) | N/A | NEON guaranteed on all ARM64 |
| Other | Scalar fallback | N/A | Portable implementation |

Integration Points:

  1. Write Path: Keys are hashed during write(), batch_write(), and write_stream() operations
  2. Read Path: Keys are hashed during read() and batch_read() lookups
  3. Indexing: Hash values populate the KeyIndexer HashMap for O(1) lookups
  4. Collision Detection: The high-order 16 bits of the hash serve as a tag for collision verification (see Key Indexing and Hashing)

Performance Impact:

Benchmarking shows XXH3 with hardware acceleration can compute 1 million random key hashes in well under 1 second, enabling high-throughput indexing operations without becoming a bottleneck.

Sources: README.md:158-167


Write Path Integration

SIMD in Write Operations

Write Operation Flow with SIMD

The SIMD copy function is invoked during all write operations to transfer payload data into the BufWriter<File> buffer:

  1. Key Hashing: XXH3 with SIMD acceleration computes the key hash
  2. Payload Staging: simd_copy transfers value bytes to write buffer
  3. Metadata Appending: Non-SIMD operations append metadata (hash, offset, checksum)
  4. Flush: Buffer contents written to disk in single system call
  5. Index Update: Hash stored in KeyIndexer for subsequent lookups

Why SIMD for Writes Only:

  • Writes: Data must traverse multiple memory regions (app → buffer → disk), benefiting from vectorized transfer
  • Reads: Memory-mapped access provides direct pointer to disk pages, no intermediate copying required

Sources: README.md:248-257 src/storage_engine/simd_copy.rs:1-139


Performance Characteristics

Benchmark Results

The storage benchmark demonstrates SIMD acceleration impact across write-heavy workloads:

Test Configuration:

  • Entry Count: 1,000,000 entries
  • Entry Size: 8 bytes payload
  • Write Batch Size: 1,024 entries per batch
  • Platform: Multi-architecture (x86_64 with AVX2, aarch64 with NEON)

Typical Performance Metrics:

| Operation | Throughput | Notes |
|---|---|---|
| Batched Writes | 100,000+ writes/s | SIMD copy + XXH3 acceleration |
| Sequential Reads | 1,000,000+ reads/s | Zero-copy, no SIMD needed |
| Random Reads | 500,000+ reads/s | XXH3 hash lookup acceleration |
| Batch Reads | 700,000+ reads/s | Vectorized hash lookups |

SIMD Contribution Analysis:

  1. Write Throughput: SIMD copy reduces memory transfer overhead by 4-8x on AVX2 systems
  2. Index Performance: XXH3 SIMD hashing maintains O(1) lookup performance even at high scales
  3. Memory Efficiency: Vectorized operations move more bytes per instruction, reducing instruction overhead and cache pressure

Fallback Impact:

On systems without SIMD support (virtualized x86_64, non-ARM/x86 architectures), performance degrades gracefully to scalar implementations. A warning is logged once on x86_64 systems lacking AVX2.

Sources: benches/storage_benchmark.rs:1-234 README.md:166-167


Runtime Feature Detection

Detection Implementation

Detection Strategy

The system uses a tiered approach to SIMD feature selection:

Compile-Time Decisions:

  • Target architecture determined by rustc build target (e.g., x86_64-unknown-linux-gnu, aarch64-apple-darwin)
  • Conditional compilation blocks via #[cfg(target_arch = "x86_64")] and #[cfg(target_arch = "aarch64")]

Runtime Decisions (x86_64 only):

  • AVX2 availability checked with std::is_x86_feature_detected!("avx2")
  • Detection performed on every simd_copy call (minimal overhead, inlined)
  • Warning logged once via std::sync::Once if AVX2 unavailable

ARM64 Assumption:

  • NEON guaranteed on all ARM64 targets, no runtime check needed
  • ARM specification mandates Advanced SIMD support

Fallback Behavior:

  • Graceful degradation to copy_from_slice on unsupported platforms
  • No panics or errors; functionality preserved with reduced performance

Sources: src/storage_engine/simd_copy.rs:4-8 src/storage_engine/simd_copy.rs:110-138


Limitations and Scope

What SIMD Does Not Accelerate

Read Operations:

Zero-copy memory-mapped reads access data directly via pointers into mmap pages. No intermediate copying occurs, making SIMD memory transfer unnecessary. The system optimizes reads through:

  • Direct pointer arithmetic into mapped regions
  • Cacheline-aligned payloads (64-byte boundaries, see Payload Alignment and Cache Efficiency)
  • Memory prefetching handled by OS page fault mechanisms

CRC32 Checksums:

The CRC32C checksum calculation in entry validation is not SIMD-accelerated in the current implementation. This operation uses a standard, scalar CRC implementation.

Metadata Processing:

Metadata writes (20 bytes: key_hash, prev_offset, checksum) use scalar operations. The small, fixed size provides no benefit from vectorization.

Platform Limitations:

| Platform | Limitation | Workaround |
|---|---|---|
| Windows ARM64 (emulated) | AVX2 detection may fail | Automatic fallback to scalar |
| RISC-V, WASM, other | No SIMD implementation | Scalar copy_from_slice |
| x86 (32-bit) | No AVX2 support | Not a target architecture |

Sources: README.md:256-257 src/storage_engine/simd_copy.rs:119-124


Summary

SIMD acceleration in simd-r-drive targets write-path bottlenecks through two mechanisms:

  1. simd_copy: Vectorized memory transfer (AVX2/NEON) for staging write buffers
  2. XXH3 hashing: Hardware-accelerated key hashing (SSE2/AVX2/NEON) for indexing

These optimizations complement the zero-copy read architecture, where direct memory-mapped access eliminates the need for data transformation. The combination enables high-throughput writes while maintaining sub-microsecond read latencies at scale.

For details on how payload alignment enhances these optimizations, see Payload Alignment and Cache Efficiency. For broader context on write patterns that leverage SIMD acceleration, see Write and Read Modes.

Sources: README.md:248-257 src/storage_engine/simd_copy.rs:1-139
