
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

SIMD Acceleration

Purpose and Scope

This document details the SIMD (Single Instruction, Multiple Data) acceleration features implemented in simd-r-drive. SIMD optimizations are applied to two critical performance paths: memory copying during write operations and key hashing for indexing. These optimizations leverage hardware-specific vector instructions (AVX2, NEON) to reduce CPU overhead and improve throughput in write-heavy workloads.

For information about the broader performance context including alignment strategies, see Payload Alignment and Cache Efficiency. For details on how these optimizations integrate with different write patterns, see Write and Read Modes.

Sources: README.md:248-257


SIMD Operations Overview

The system employs SIMD acceleration in two distinct areas:

| Operation | Purpose | Instruction Sets | Primary Use Case |
|---|---|---|---|
| Memory Copy (simd_copy) | Vectorized data transfer to write buffers | AVX2 (x86_64), NEON (aarch64) | Write operations, payload staging |
| Key Hashing (xxh3_64) | Hardware-accelerated hash computation | SSE2, AVX2, NEON | Key indexing, collision detection |
Key Distinction: SIMD is not used for read operations. Zero-copy reads access memory-mapped data directly without transformations, making SIMD unnecessary. The acceleration targets write-path bottlenecks where data must be moved through buffers before disk flush.

Sources: README.md:248-257


Architecture and Implementation Flow

Architecture Decision Flow

The simd_copy function performs runtime detection to select the optimal implementation. On x86_64, it checks for AVX2 support using std::is_x86_feature_detected!("avx2"). On aarch64, NEON is assumed to be available, since the ARM64 specification mandates it. All other architectures fall back to standard Rust slice operations.

Sources: src/storage_engine/simd_copy.rs:110-138


x86_64 AVX2 Implementation

Memory Copy Mechanism

AVX2 Implementation Details

The simd_copy_x86 function processes memory in 32-byte chunks using AVX2 vector instructions:

Processing Strategy:

  • Computes chunks = len / 32 to determine full vector iterations
  • Uses _mm256_loadu_si256 for unaligned 32-byte loads from source
  • Uses _mm256_storeu_si256 for unaligned 32-byte stores to destination
  • Processes remainder bytes (< 32) with scalar copy_from_slice

Safety Considerations:

  • Marked unsafe and requires #[target_feature(enable = "avx2")]
  • Caller must ensure both src and dst have at least len valid bytes
  • Pointer arithmetic is bounds-checked via chunk calculation
  • Unaligned loads/stores handle arbitrary memory alignment

Performance Characteristics:

  • AVX2 processes 256 bits (32 bytes) per instruction
  • Typical throughput: 4-8x faster than scalar copy on supported CPUs
  • Fallback warning logged once if AVX2 unavailable (e.g., virtualized environments)

Sources: src/storage_engine/simd_copy.rs:16-62 src/storage_engine/simd_copy.rs:112-125


AArch64 NEON Implementation

ARM Vector Operations

NEON Implementation Details

The simd_copy_arm function implements ARM Advanced SIMD (NEON):

Processing Strategy:

  • Processes 16-byte chunks (NEON vector width: 128 bits)
  • Uses vld1q_u8 for 16-byte vector loads
  • Uses vst1q_u8 for 16-byte vector stores
  • Handles remainder with scalar copy

Architecture Notes:

  • NEON is mandatory on ARM64 (aarch64), so no runtime detection is needed
  • No fallback warning on ARM; NEON is guaranteed available
  • Smaller vector width (16 vs 32 bytes) but consistent across ARM64 devices

Performance Characteristics:

  • Typical speedup: 2-4x over scalar copy
  • Particularly beneficial on Apple Silicon, AWS Graviton, and other ARM server platforms

Sources: src/storage_engine/simd_copy.rs:64-108 src/storage_engine/simd_copy.rs:127-133


Hardware-Accelerated Hashing (XXH3)

Hashing Infrastructure

XXH3 Acceleration Tiers

The XXH3 hashing algorithm automatically selects hardware acceleration based on available instruction sets:

| Platform | Default Acceleration | Enhanced Acceleration | Availability |
|---|---|---|---|
| x86_64 | SSE2 (universal) | AVX2 (opt-in) | SSE2 guaranteed on all x86_64 |
| aarch64 | NEON (universal) | N/A | NEON guaranteed on all ARM64 |
| Other | Scalar fallback | N/A | Portable implementation |

Integration Points:

  1. Write Path: Keys are hashed during write(), batch_write(), and write_stream() operations
  2. Read Path: Keys are hashed during read() and batch_read() lookups
  3. Indexing: Hash values populate the KeyIndexer HashMap for O(1) lookups
  4. Collision Detection: The high-order 16 bits of the hash serve as a tag for collision verification (see Key Indexing and Hashing)

Performance Impact:

Benchmarking shows XXH3 with hardware acceleration can compute 1 million random key hashes in well under 1 second, enabling high-throughput indexing operations without becoming a bottleneck.

Sources: README.md:158-167


Write Path Integration

SIMD in Write Operations

Write Operation Flow with SIMD

The SIMD copy function is invoked during all write operations to transfer payload data into the BufWriter<File> buffer:

  1. Key Hashing: XXH3 with SIMD acceleration computes the key hash
  2. Payload Staging: simd_copy transfers value bytes to write buffer
  3. Metadata Appending: Non-SIMD operations append metadata (hash, offset, checksum)
  4. Flush: Buffer contents written to disk in single system call
  5. Index Update: Hash stored in KeyIndexer for subsequent lookups

Why SIMD for Writes Only:

  • Writes: Data must traverse multiple memory regions (app → buffer → disk), benefiting from vectorized transfer
  • Reads: Memory-mapped access provides direct pointer to disk pages, no intermediate copying required

Sources: README.md:248-257 src/storage_engine/simd_copy.rs:1-139


Performance Characteristics

Benchmark Results

The storage benchmark demonstrates SIMD acceleration impact across write-heavy workloads:

Test Configuration:

  • Entry Count: 1,000,000 entries
  • Entry Size: 8 bytes payload
  • Write Batch Size: 1,024 entries per batch
  • Platform: Multi-architecture (x86_64 with AVX2, aarch64 with NEON)

Typical Performance Metrics:

| Operation | Throughput | Notes |
|---|---|---|
| Batched Writes | 100,000+ writes/s | SIMD copy + XXH3 acceleration |
| Sequential Reads | 1,000,000+ reads/s | Zero-copy, no SIMD needed |
| Random Reads | 500,000+ reads/s | XXH3 hash lookup acceleration |
| Batch Reads | 700,000+ reads/s | Vectorized hash lookups |

SIMD Contribution Analysis:

  1. Write Throughput: SIMD copy reduces memory transfer overhead by 4-8x on AVX2 systems
  2. Index Performance: XXH3 SIMD hashing maintains O(1) lookup performance even at high scales
  3. Memory Efficiency: Vectorized operations move more bytes per instruction, reducing instruction overhead and cache pressure

Fallback Impact:

On systems without SIMD support (virtualized x86_64, non-ARM/x86 architectures), performance degrades gracefully to scalar implementations. A warning is logged once on x86_64 systems lacking AVX2.

Sources: benches/storage_benchmark.rs:1-234 README.md:166-167


Runtime Feature Detection

Detection Implementation

Detection Strategy

The system uses a tiered approach to SIMD feature selection:

Compile-Time Decisions:

  • Target architecture determined by rustc build target (e.g., x86_64-unknown-linux-gnu, aarch64-apple-darwin)
  • Conditional compilation blocks via #[cfg(target_arch = "x86_64")] and #[cfg(target_arch = "aarch64")]

Runtime Decisions (x86_64 only):

  • AVX2 availability checked with std::is_x86_feature_detected!("avx2")
  • Detection performed on every simd_copy call (minimal overhead, inlined)
  • Warning logged once via std::sync::Once if AVX2 unavailable

ARM64 Assumption:

  • NEON guaranteed on all ARM64 targets, no runtime check needed
  • ARM specification mandates Advanced SIMD support

Fallback Behavior:

  • Graceful degradation to copy_from_slice on unsupported platforms
  • No panics or errors; functionality preserved with reduced performance

Sources: src/storage_engine/simd_copy.rs:4-8 src/storage_engine/simd_copy.rs:110-138


Limitations and Scope

What SIMD Does Not Accelerate

Read Operations:

Zero-copy memory-mapped reads access data directly via pointers into mmap pages. No intermediate copying occurs, making SIMD memory transfer unnecessary. The system optimizes reads through:

  • Direct pointer arithmetic into mapped regions
  • Cacheline-aligned payloads (64-byte boundaries, see Payload Alignment and Cache Efficiency)
  • Memory prefetching handled by OS page fault mechanisms

CRC32 Checksums:

The CRC32C checksum calculation in entry validation is not SIMD-accelerated in the current implementation. This operation uses a standard, scalar CRC implementation.

Metadata Processing:

Metadata writes (20 bytes: key_hash, prev_offset, checksum) use scalar operations. The small, fixed size provides no benefit from vectorization.

Platform Limitations:

| Platform | Limitation | Workaround |
|---|---|---|
| Windows ARM64 (emulated) | AVX2 detection may fail | Automatic fallback to scalar |
| RISC-V, WASM, other | No SIMD implementation | Scalar copy_from_slice |
| x86 (32-bit) | No AVX2 support | Not a target architecture |

Sources: README.md:256-257 src/storage_engine/simd_copy.rs:119-124


Summary

SIMD acceleration in simd-r-drive targets write-path bottlenecks through two mechanisms:

  1. simd_copy: Vectorized memory transfer (AVX2/NEON) for staging write buffers
  2. XXH3 hashing: Hardware-accelerated key hashing (SSE2/AVX2/NEON) for indexing

These optimizations complement the zero-copy read architecture, where direct memory-mapped access eliminates the need for data transformation. The combination enables high-throughput writes while maintaining sub-microsecond read latencies at scale.

For details on how payload alignment enhances these optimizations, see Payload Alignment and Cache Efficiency. For broader context on write patterns that leverage SIMD acceleration, see Write and Read Modes.

Sources: README.md:248-257 src/storage_engine/simd_copy.rs:1-139
