
Benchmarking

This document describes the performance benchmark suite for the SIMD R Drive storage engine. The benchmarks measure write throughput, read throughput (sequential, random, and batch), and concurrent access patterns under contention. For general performance optimization features, see Performance Optimizations. For information about SIMD acceleration techniques, see SIMD Acceleration.


Benchmark Suite Overview

The SIMD R Drive project includes two primary benchmark suites that measure different aspects of storage engine performance:

| Benchmark | File | Purpose | Key Metrics |
|---|---|---|---|
| Storage Benchmark | benches/storage_benchmark.rs | Single-process throughput testing | Writes/sec and reads/sec for sequential, random, and batch operations |
| Contention Benchmark | benches/contention_benchmark.rs | Multi-threaded concurrent access | Performance degradation under write contention |

Both benchmarks are configured in Cargo.toml:57-63 and pull in Criterion.rs as a dependency, enabling detection of performance regressions across code changes (the current implementations use their own timing; see Performance Regression Detection below).

Sources: Cargo.toml:36-63 benches/storage_benchmark.rs:1-234

graph TB
    subgraph "Benchmark Configuration"
        CARGO["Cargo.toml"]
        CRITERION["criterion = 0.6.0"]
        HARNESS["harness = false"]
    end

    subgraph "Storage Benchmark"
        STORAGE_BIN["benches/storage_benchmark.rs"]
        WRITE_BENCH["benchmark_append_entries()"]
        SEQ_BENCH["benchmark_sequential_reads()"]
        RAND_BENCH["benchmark_random_reads()"]
        BATCH_BENCH["benchmark_batch_reads()"]
    end

    subgraph "Contention Benchmark"
        CONTENTION_BIN["benches/contention_benchmark.rs"]
        CONCURRENT_TESTS["Multi-threaded write tests"]
    end

    subgraph "Core Operations Measured"
        BATCH_WRITE["DataStoreWriter::batch_write()"]
        READ["DataStoreReader::read()"]
        BATCH_READ["DataStoreReader::batch_read()"]
        ITER["into_iter()"]
    end

    CARGO --> CRITERION
    CARGO --> STORAGE_BIN
    CARGO --> CONTENTION_BIN
    CARGO --> HARNESS

    STORAGE_BIN --> WRITE_BENCH
    STORAGE_BIN --> SEQ_BENCH
    STORAGE_BIN --> RAND_BENCH
    STORAGE_BIN --> BATCH_BENCH

    WRITE_BENCH --> BATCH_WRITE
    SEQ_BENCH --> ITER
    RAND_BENCH --> READ
    BATCH_BENCH --> BATCH_READ

    CONTENTION_BIN --> CONCURRENT_TESTS
    CONCURRENT_TESTS --> BATCH_WRITE

Storage Benchmark Architecture

The storage benchmark (storage_benchmark.rs) is a standalone binary that measures single-process throughput across four operation types. It writes 1,000,000 entries and then exercises different read access patterns.

Benchmark Configuration Constants

The benchmark behavior is controlled by tunable constants defined at the top of the file:

| Constant | Value | Purpose |
|---|---|---|
| ENTRY_SIZE | 8 bytes | Size of each value payload (stores a u64 in little-endian) |
| WRITE_BATCH_SIZE | 1,024 entries | Number of entries per batch_write call |
| READ_BATCH_SIZE | 1,024 entries | Number of entries per batch_read call |
| NUM_ENTRIES | 1,000,000 | Total entries written during the setup phase |
| NUM_RANDOM_CHECKS | 1,000,000 | Number of random single-key lookups |
| NUM_BATCH_CHECKS | 1,000,000 | Total entries verified via batch reads |
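
The table values translate directly into plain Rust constants. A sketch of how they might be declared (types and exact formatting are assumptions, not a verbatim excerpt of the benchmark source):

```rust
// Illustrative declarations mirroring the tunable constants in the table above.
const ENTRY_SIZE: usize = 8;                // each value holds a u64 in little-endian form
const WRITE_BATCH_SIZE: usize = 1_024;      // entries accumulated per batch_write() call
const READ_BATCH_SIZE: usize = 1_024;       // keys requested per batch_read() call
const NUM_ENTRIES: usize = 1_000_000;       // total entries written during the setup phase
const NUM_RANDOM_CHECKS: usize = 1_000_000; // random single-key lookups performed
const NUM_BATCH_CHECKS: usize = 1_000_000;  // entries verified through batch reads
```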

Sources: benches/storage_benchmark.rs:16-26

Write Benchmark: benchmark_append_entries()

The write benchmark measures append-only write throughput using batched operations.

graph LR
    subgraph "Write Benchmark Flow"
        START["Start: Instant::now()"]
        LOOP["Loop: 0..NUM_ENTRIES"]
        BATCH["Accumulate in batch Vec"]
        FLUSH["flush_batch() every 1,024 entries"]
        BATCH_WRITE["storage.batch_write(&refs)"]
        CALC["Calculate writes/sec"]
    end

    subgraph "Data Generation"
        KEY["Key: bench-key-{i}"]
        VALUE["Value: i.to_le_bytes() in 8-byte buffer"]
    end

    START --> LOOP
    LOOP --> BATCH
    BATCH --> FLUSH
    FLUSH --> BATCH_WRITE
    LOOP --> CALC

    LOOP --> KEY
    LOOP --> VALUE
    KEY --> BATCH
    VALUE --> BATCH

The benchmark creates fixed-width 8-byte payloads containing the loop index as a little-endian u64. Entries are accumulated in a batch and flushed every 1,024 entries via batch_write(). The elapsed time is used to calculate writes per second.
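
A condensed sketch of that loop is shown below. The `batch_write(&[(&[u8], &[u8])])` signature and its `Result` return are assumptions based on the operations named on this page; `storage` stands in for the opened data store.

```rust
use std::time::Instant;

let start = Instant::now();
let mut batch: Vec<(Vec<u8>, Vec<u8>)> = Vec::with_capacity(WRITE_BATCH_SIZE);

for i in 0..NUM_ENTRIES as u64 {
    let key = format!("bench-key-{i}").into_bytes();
    let value = i.to_le_bytes().to_vec(); // fixed 8-byte little-endian payload

    batch.push((key, value));
    if batch.len() == WRITE_BATCH_SIZE {
        // Borrow the owned buffers as byte slices for the batched append.
        let refs: Vec<(&[u8], &[u8])> = batch
            .iter()
            .map(|(k, v)| (k.as_slice(), v.as_slice()))
            .collect();
        storage.batch_write(&refs).expect("batch write failed");
        batch.clear();
    }
}

let elapsed = start.elapsed().as_secs_f64();
println!("{} writes/s", NUM_ENTRIES as f64 / elapsed);
```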

Sources: benches/storage_benchmark.rs:52-83 benches/storage_benchmark.rs:85-92

Sequential Read Benchmark: benchmark_sequential_reads()

The sequential read benchmark measures zero-copy iteration performance by walking the entire storage file from newest to oldest entry.

graph TB
    subgraph "Sequential Read Pattern"
        ITER_START["storage.into_iter()"]
        ENTRY["For each EntryHandle"]
        DEREF["Dereference: &*entry"]
        PARSE["u64::from_le_bytes()"]
        VERIFY["assert_eq!(stored, expected)"]
    end

    subgraph "Memory Access"
        MMAP["Memory-mapped file access"]
        ZERO_COPY["Zero-copy EntryHandle"]
        BACKWARD_CHAIN["Follow prev_offset backward"]
    end

    ITER_START --> ENTRY
    ENTRY --> DEREF
    DEREF --> PARSE
    PARSE --> VERIFY

    ITER_START --> BACKWARD_CHAIN
    BACKWARD_CHAIN --> MMAP
    MMAP --> ZERO_COPY
    ZERO_COPY --> DEREF

This benchmark uses storage.into_iter(), provided by the storage engine's IntoIterator implementation, which traverses the backward-linked chain via prev_offset fields. Each entry is accessed through zero-copy memory mapping and validated by parsing the stored u64 value.
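
A minimal sketch of that pass, assuming the iterator yields zero-copy EntryHandle values that dereference to `&[u8]` (not a verbatim excerpt; the validity check is simplified):

```rust
use std::time::Instant;

let start = Instant::now();
let mut count: u64 = 0;

for entry in storage.into_iter() {
    let bytes: &[u8] = &*entry; // zero-copy view into the memory-mapped file
    let stored = u64::from_le_bytes(bytes[..8].try_into().unwrap());
    assert!(stored < NUM_ENTRIES as u64); // simplified validity check
    count += 1;
}

println!("{} reads/s", count as f64 / start.elapsed().as_secs_f64());
```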

Sources: benches/storage_benchmark.rs:98-118

Random Read Benchmark: benchmark_random_reads()

The random read benchmark measures hash index lookup performance by performing 1,000,000 random single-key reads.

graph LR
    subgraph "Random Read Flow"
        RNG["rng.random_range(0..NUM_ENTRIES)"]
        KEY_GEN["Generate key: bench-key-{i}"]
        LOOKUP["storage.read(key.as_bytes())"]
        UNWRAP["unwrap() -> EntryHandle"]
        VALIDATE["Parse u64 and assert_eq!"]
    end

    subgraph "Index Lookup Path"
        XXH3["XXH3 hash of key"]
        DASHMAP["DashMap index lookup"]
        TAG_CHECK["16-bit tag collision check"]
        OFFSET["Extract 48-bit offset"]
        MMAP_ACCESS["Memory-mapped access at offset"]
    end

    RNG --> KEY_GEN
    KEY_GEN --> LOOKUP
    LOOKUP --> UNWRAP
    UNWRAP --> VALIDATE

    LOOKUP --> XXH3
    XXH3 --> DASHMAP
    DASHMAP --> TAG_CHECK
    TAG_CHECK --> OFFSET
    OFFSET --> MMAP_ACCESS
    MMAP_ACCESS --> UNWRAP

Each iteration generates a random index, constructs the corresponding key, and performs a single read() operation. This exercises the O(1) hash index lookup path including XXH3 hashing, DashMap access, tag-based collision detection, and memory-mapped file access.
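
The loop can be sketched as follows. The rand 0.9-style `rand::rng()` / `random_range()` calls follow the diagram; the shape of `read()` and its entry handle are assumptions based on the description above:

```rust
use rand::Rng;
use std::time::Instant;

let mut rng = rand::rng();
let start = Instant::now();

for _ in 0..NUM_RANDOM_CHECKS {
    let i: u64 = rng.random_range(0..NUM_ENTRIES as u64);
    let key = format!("bench-key-{i}");

    let handle = storage.read(key.as_bytes()).expect("key should exist");
    let stored = u64::from_le_bytes((&*handle)[..8].try_into().unwrap());
    assert_eq!(stored, i); // payload must match the index encoded at write time
}

println!("{} reads/s", NUM_RANDOM_CHECKS as f64 / start.elapsed().as_secs_f64());
```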

Sources: benches/storage_benchmark.rs:124-149

Batch Read Benchmark: benchmark_batch_reads()

The batch read benchmark measures vectorized lookup performance by reading 1,024 keys at a time via batch_read().

graph TB
    subgraph "Batch Read Flow"
        ACCUMULATE["Accumulate 1,024 keys in Vec"]
        CONVERT["Convert to Vec<&[u8]>"]
        BATCH_CALL["storage.batch_read(&key_refs)"]
        RESULTS["Process Vec<Option<EntryHandle>>"]
        VERIFY["Verify each entry's payload"]
    end

    subgraph "Batch Read Implementation"
        PARALLEL["Optional: Rayon parallel iterator"]
        MULTI_LOOKUP["Multiple hash lookups"]
        COLLECT["Collect into result Vec"]
    end

    subgraph "Verification Logic"
        PARSE_KEY["Extract index from key suffix"]
        PARSE_VALUE["u64::from_le_bytes(handle)"]
        ASSERT["assert_eq!(stored, idx)"]
    end

    ACCUMULATE --> CONVERT
    CONVERT --> BATCH_CALL
    BATCH_CALL --> RESULTS
    RESULTS --> VERIFY

    BATCH_CALL --> PARALLEL
    PARALLEL --> MULTI_LOOKUP
    MULTI_LOOKUP --> COLLECT
    COLLECT --> RESULTS

    VERIFY --> PARSE_KEY
    VERIFY --> PARSE_VALUE
    PARSE_KEY --> ASSERT
    PARSE_VALUE --> ASSERT

The benchmark accumulates keys in batches of 1,024 and invokes batch_read(), which, when the parallel feature is enabled, performs the lookups concurrently via Rayon. The verification phase includes a fast numeric suffix parser that extracts the index from the key string without heap allocation.
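
A sketch of one batch and its verification step is shown below. The allocation-free suffix parser mirrors the description above, while `batch_read(&[&[u8]]) -> Vec<Option<EntryHandle>>` is an assumption based on this page rather than a verbatim signature:

```rust
// Extract the numeric suffix from keys like b"bench-key-123" without allocating.
fn index_from_key(key: &[u8]) -> u64 {
    key.iter()
        .skip_while(|b| !b.is_ascii_digit())
        .fold(0u64, |acc, b| acc * 10 + (*b - b'0') as u64)
}

let keys: Vec<Vec<u8>> = (0..READ_BATCH_SIZE as u64)
    .map(|i| format!("bench-key-{i}").into_bytes())
    .collect();
let key_refs: Vec<&[u8]> = keys.iter().map(|k| k.as_slice()).collect();

for (key, maybe_entry) in key_refs.iter().zip(storage.batch_read(&key_refs)) {
    let entry = maybe_entry.expect("entry should exist");
    let stored = u64::from_le_bytes((&*entry)[..8].try_into().unwrap());
    assert_eq!(stored, index_from_key(key)); // payload must match the key suffix
}
```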

Sources: benches/storage_benchmark.rs:155-181 benches/storage_benchmark.rs:183-202


Output Formatting

The benchmark produces human-readable output with formatted throughput numbers using the fmt_rate() utility function.

graph LR
    subgraph "Rate Formatting Logic"
        INPUT["Input: f64 rate"]
        SPLIT["Split into whole and fractional parts"]
        ROUND["Round fractional to 3 decimals"]
        CARRY["Handle 1000 rounding carry"]
        SEPARATE["Comma-separate thousands"]
        FORMAT["Output: 4,741,483.464"]
    end

    INPUT --> SPLIT
    SPLIT --> ROUND
    ROUND --> CARRY
    CARRY --> SEPARATE
    SEPARATE --> FORMAT

The fmt_rate() function formats rates with comma-separated thousands and exactly three decimal places. It uses the thousands crate's separate_with_commas() method and handles the edge case where rounding produces 1000 in the fractional part.
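
A dependency-free sketch of the same steps (split, round, carry, group); the real implementation delegates the grouping to the thousands crate, so this is an approximation of the described behaviour rather than the actual function:

```rust
/// Format a rate with three fixed decimal places, comma-grouped integer digits,
/// and a carry when the fractional part rounds up to 1000.
fn fmt_rate(rate: f64) -> String {
    let mut whole = rate.trunc() as u64;
    let mut frac = (rate.fract() * 1000.0).round() as u64;
    if frac == 1000 {
        whole += 1; // rounding carried into the integer part
        frac = 0;
    }

    // Group the integer digits in threes, separated by commas.
    let digits = whole.to_string();
    let mut grouped = String::with_capacity(digits.len() + digits.len() / 3);
    for (i, ch) in digits.chars().enumerate() {
        if i > 0 && (digits.len() - i) % 3 == 0 {
            grouped.push(',');
        }
        grouped.push(ch);
    }
    format!("{grouped}.{frac:03}")
}

fn main() {
    assert_eq!(fmt_rate(4_741_483.4641), "4,741,483.464");
}
```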

Example output:

Wrote 1,000,000 entries of 8 bytes in 0.234s (4,273,504.273 writes/s)
Sequentially read 1,000,000 entries in 0.089s (11,235,955.056 reads/s)
Randomly read 1,000,000 entries in 0.532s (1,879,699.248 reads/s)
Batch-read verified 1,000,000 entries in 0.156s (6,410,256.410 reads/s)

Sources: benches/storage_benchmark.rs:204-233


Contention Benchmark

The contention benchmark (contention_benchmark.rs) measures performance degradation under concurrent write load. While its source is not reproduced here, it is referenced in Cargo.toml:61-63 and is designed to complement the concurrency tests in tests/concurrency_tests.rs:1-230.

Expected Contention Scenarios

Based on the concurrency test patterns, the contention benchmark likely measures:

| Scenario | Description | Measured Metric |
|---|---|---|
| Concurrent Writes | Multiple threads writing different keys simultaneously | Throughput degradation under RwLock contention |
| Write Serialization | Effect of RwLock serializing write operations | Comparison vs. theoretical single-threaded maximum |
| Index Contention | DashMap update performance under concurrent load | Whether lock-free read performance is maintained |
| Streaming Writes | Concurrent write_stream() calls with slow readers | I/O bottleneck vs. lock contention |

Concurrency Test Patterns

The tests/concurrency_tests.rs:1-230 file demonstrates three concurrency patterns that inform contention benchmarking:

Sources: tests/concurrency_tests.rs:111-161 tests/concurrency_tests.rs:163-229 tests/concurrency_tests.rs:14-109

graph TB
    subgraph "Concurrent Write Test"
        THREADS_WRITE["16 threads × 10 writes each"]
        RWLOCK_WRITE["RwLock serializes writes"]
        ATOMIC_UPDATE["AtomicU64 tail_offset updated"]
        DASHMAP_UPDATE["DashMap index updated"]
    end

    subgraph "Interleaved Read/Write Test"
        THREAD_A["Thread A: write → notify → read"]
        THREAD_B["Thread B: wait → read → write → notify"]
        SYNC["Tokio Notify synchronization"]
    end

    subgraph "Streaming Write Test"
        SLOW_READER["SlowReader with artificial delay"]
        CONCURRENT_STREAMS["2 concurrent write_stream() calls"]
        IO_BOUND["Tests I/O vs. lock contention"]
    end

    THREADS_WRITE --> RWLOCK_WRITE
    RWLOCK_WRITE --> ATOMIC_UPDATE
    ATOMIC_UPDATE --> DASHMAP_UPDATE

    THREAD_A --> SYNC
    THREAD_B --> SYNC

    SLOW_READER --> CONCURRENT_STREAMS
    CONCURRENT_STREAMS --> IO_BOUND
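
The first pattern can be approximated with a plain std::thread fan-out. A minimal sketch, assuming the store is Sync behind an Arc and exposes a single-entry write() method (names are illustrative, not taken from the crate):

```rust
use std::sync::Arc;
use std::thread;

// 16 writer threads, 10 appends each, sharing one store handle.
let storage = Arc::new(storage); // an opened, thread-safe data store

let writers: Vec<_> = (0..16)
    .map(|t| {
        let storage = Arc::clone(&storage);
        thread::spawn(move || {
            for i in 0..10u64 {
                let key = format!("thread-{t}-key-{i}");
                // Each append is serialized by the engine's internal RwLock, while
                // the DashMap index and AtomicU64 tail offset are updated per write.
                storage
                    .write(key.as_bytes(), &i.to_le_bytes())
                    .expect("write failed");
            }
        })
    })
    .collect();

for handle in writers {
    handle.join().expect("writer thread panicked");
}
```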

Running Benchmarks

Command-Line Usage

Execute benchmarks using Cargo’s benchmark runner:
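
For example, assuming the bench targets are named after their files in benches/:

cargo bench --bench storage_benchmark
cargo bench --bench contention_benchmark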

Benchmark Execution Flow

The benchmarks use harness = false in Cargo.toml:59-63, meaning they execute as standalone binaries rather than under Criterion's default test harness. This allows custom output formatting and fine-grained control over benchmark structure.

Sources: Cargo.toml:57-63 benches/storage_benchmark.rs:32-46


Metrics and Analysis

Performance Indicators

The benchmark suite measures the following key performance indicators:

| Metric | Calculation | Typical Range | Optimization Focus |
|---|---|---|---|
| Write Throughput | NUM_ENTRIES / elapsed_time | 2-10M writes/sec | SIMD copy, batch sizing |
| Sequential Read Throughput | NUM_ENTRIES / elapsed_time | 5-20M reads/sec | Memory mapping, iterator overhead |
| Random Read Throughput | NUM_RANDOM_CHECKS / elapsed_time | 1-5M reads/sec | Hash index lookup, cache efficiency |
| Batch Read Throughput | NUM_BATCH_CHECKS / elapsed_time | 3-12M reads/sec | Parallel lookup, vectorization |

Factors Affecting Performance

Sources: benches/storage_benchmark.rs:16-26 Cargo.toml:49-55


Integration with Development Workflow

Performance Regression Detection

While Criterion.rs is included as a dependency (Cargo.toml:39), the current benchmark implementations use custom timing via std::time::Instant rather than Criterion's statistical framework. This provides:

  • Immediate feedback: results are printed directly to stdout during execution
  • Reproducibility: fixed workload sizes and patterns allow consistent comparison
  • Simplicity: no statistical overhead for quick performance checks

Benchmark Data for Tuning

The benchmark results inform optimization decisions:

| Optimization | Benchmark Validation | Expected Impact |
|---|---|---|
| SIMD copy implementation | Write throughput increase | 2-4x improvement on AVX2 systems |
| 64-byte alignment change | All operations improve | 10-30% from cache-line alignment |
| parallel feature | Batch read throughput | 2-4x on multi-core systems |
| DashMap vs. RwLock | Random read throughput | Eliminates read-lock contention |

Sources: benches/storage_benchmark.rs:1-234 Cargo.toml:36-55
