This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Benchmarking
Relevant source files
- Cargo.toml
- benches/storage_benchmark.rs
- src/main.rs
- src/utils/format_bytes.rs
- tests/concurrency_tests.rs
This document describes the performance benchmark suite for the SIMD R Drive storage engine. The benchmarks measure write throughput, read throughput (sequential, random, and batch), and concurrent access patterns under contention. For general performance optimization features, see Performance Optimizations. For information about SIMD acceleration techniques, see SIMD Acceleration.
Benchmark Suite Overview
The SIMD R Drive project includes two primary benchmark suites that measure different aspects of storage engine performance:
| Benchmark | File | Purpose | Key Metrics |
|---|---|---|---|
| Storage Benchmark | benches/storage_benchmark.rs | Single-process throughput testing | Writes/sec, Reads/sec for sequential, random, and batch operations |
| Contention Benchmark | benches/contention_benchmark.rs | Multi-threaded concurrent access | Performance degradation under write contention |
Both benchmarks are configured in Cargo.toml:57-63. Criterion.rs is included as a dependency for statistical analysis, although the current implementations report results through custom timing (see Performance Regression Detection below), which still enables detection of performance regressions across code changes.
Sources: Cargo.toml:36-63 benches/storage_benchmark.rs:1-234
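A plausible shape for this configuration, reconstructed from the references above (the exact entries live at Cargo.toml:57-63; the version string is taken from the diagram below):

```toml
# Reconstructed sketch of the benchmark configuration in Cargo.toml.
[dev-dependencies]
criterion = "0.6.0"

[[bench]]
name = "storage_benchmark"
harness = false

[[bench]]
name = "contention_benchmark"
harness = false
```

The harness = false lines are what let each benchmark run as a standalone binary with its own main function and output formatting.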
graph TB
subgraph "Benchmark Configuration"
CARGO["Cargo.toml"]
CRITERION["criterion = 0.6.0"]
HARNESS["harness = false"]
end
subgraph "Storage Benchmark"
STORAGE_BIN["benches/storage_benchmark.rs"]
WRITE_BENCH["benchmark_append_entries()"]
SEQ_BENCH["benchmark_sequential_reads()"]
RAND_BENCH["benchmark_random_reads()"]
BATCH_BENCH["benchmark_batch_reads()"]
end
subgraph "Contention Benchmark"
CONTENTION_BIN["benches/contention_benchmark.rs"]
CONCURRENT_TESTS["Multi-threaded write tests"]
end
subgraph "Core Operations Measured"
BATCH_WRITE["DataStoreWriter::batch_write()"]
READ["DataStoreReader::read()"]
BATCH_READ["DataStoreReader::batch_read()"]
ITER["into_iter()"]
end
CARGO --> CRITERION
CARGO --> STORAGE_BIN
CARGO --> CONTENTION_BIN
CARGO --> HARNESS
STORAGE_BIN --> WRITE_BENCH
STORAGE_BIN --> SEQ_BENCH
STORAGE_BIN --> RAND_BENCH
STORAGE_BIN --> BATCH_BENCH
WRITE_BENCH --> BATCH_WRITE
SEQ_BENCH --> ITER
RAND_BENCH --> READ
BATCH_BENCH --> BATCH_READ
CONTENTION_BIN --> CONCURRENT_TESTS
CONCURRENT_TESTS --> BATCH_WRITE
Storage Benchmark Architecture
The storage benchmark (storage_benchmark.rs) is a standalone binary that measures single-process throughput across four operation types. It writes 1,000,000 entries and then exercises different read access patterns.
Benchmark Configuration Constants
The benchmark behavior is controlled by tunable constants defined at the top of the file:
| Constant | Value | Purpose |
|---|---|---|
| ENTRY_SIZE | 8 bytes | Size of each value payload (stores a u64 in little-endian) |
| WRITE_BATCH_SIZE | 1,024 entries | Number of entries per batch_write call |
| READ_BATCH_SIZE | 1,024 entries | Number of entries per batch_read call |
| NUM_ENTRIES | 1,000,000 | Total entries written during the setup phase |
| NUM_RANDOM_CHECKS | 1,000,000 | Number of random single-key lookups |
| NUM_BATCH_CHECKS | 1,000,000 | Total entries verified via batch reads |
Sources: benches/storage_benchmark.rs:16-26
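The constants in the table can be sketched as follows (a reconstruction from the table values, not the exact source):

```rust
// Tunable benchmark constants, reconstructed from the table above.
const ENTRY_SIZE: usize = 8;
const WRITE_BATCH_SIZE: usize = 1_024;
const READ_BATCH_SIZE: usize = 1_024;
const NUM_ENTRIES: usize = 1_000_000;
const NUM_RANDOM_CHECKS: usize = 1_000_000;
const NUM_BATCH_CHECKS: usize = 1_000_000;

fn main() {
    // Each value payload holds exactly one u64.
    assert_eq!(ENTRY_SIZE, std::mem::size_of::<u64>());
    assert_eq!(WRITE_BATCH_SIZE, READ_BATCH_SIZE);
    // 1,000,000 entries flush as 976 full batches plus one partial batch.
    assert_eq!(NUM_ENTRIES / WRITE_BATCH_SIZE, 976);
    assert_eq!(NUM_RANDOM_CHECKS, NUM_BATCH_CHECKS);
}
```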
Write Benchmark: benchmark_append_entries()
The write benchmark measures append-only write throughput using batched operations.
graph LR
subgraph "Write Benchmark Flow"
START["Start: Instant::now()"]
LOOP["Loop: 0..NUM_ENTRIES"]
BATCH["Accumulate in batch Vec"]
FLUSH["flush_batch() every 1,024 entries"]
BATCH_WRITE["storage.batch_write(&refs)"]
CALC["Calculate writes/sec"]
end
START --> LOOP
LOOP --> BATCH
BATCH --> FLUSH
FLUSH --> BATCH_WRITE
LOOP --> CALC
subgraph "Data Generation"
KEY["Key: bench-key-{i}"]
VALUE["Value: i.to_le_bytes() in 8-byte buffer"]
end
LOOP --> KEY
LOOP --> VALUE
KEY --> BATCH
VALUE --> BATCH
The benchmark creates fixed-width 8-byte payloads containing the loop index as a little-endian u64. Entries are accumulated in a batch and flushed every 1,024 entries via batch_write(). The elapsed time is used to calculate writes per second.
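The flow above can be sketched in standalone form; the real storage.batch_write(&refs) call is replaced here by a counter so the batching pattern runs without the crate:

```rust
// Sketch of the batched write loop: keys are "bench-key-{i}", values are the
// loop index as a fixed 8-byte little-endian u64, flushed every 1,024 entries.
const WRITE_BATCH_SIZE: usize = 1_024;

fn batched_write(num_entries: u64) -> (usize, usize) {
    let mut batch: Vec<(Vec<u8>, [u8; 8])> = Vec::with_capacity(WRITE_BATCH_SIZE);
    let (mut written, mut flushes) = (0usize, 0usize);
    for i in 0..num_entries {
        let key = format!("bench-key-{i}").into_bytes();
        let value = i.to_le_bytes(); // fixed 8-byte little-endian payload
        batch.push((key, value));
        if batch.len() == WRITE_BATCH_SIZE {
            written += batch.len(); // stand-in for storage.batch_write(&refs)
            flushes += 1;
            batch.clear();
        }
    }
    if !batch.is_empty() {
        written += batch.len(); // final partial flush
        flushes += 1;
    }
    (written, flushes)
}

fn main() {
    // 10,000 entries = 9 full batches of 1,024 plus one partial batch of 784.
    assert_eq!(batched_write(10_000), (10_000, 10));
}
```

In the real benchmark the elapsed time around this loop, captured with Instant::now(), yields the writes-per-second figure.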
Sources: benches/storage_benchmark.rs:52-83 benches/storage_benchmark.rs:85-92
Sequential Read Benchmark: benchmark_sequential_reads()
The sequential read benchmark measures zero-copy iteration performance by walking the entire storage file from newest to oldest entry.
graph TB
subgraph "Sequential Read Pattern"
ITER_START["storage.into_iter()"]
ENTRY["For each EntryHandle"]
DEREF["Dereference: &*entry"]
PARSE["u64::from_le_bytes()"]
VERIFY["assert_eq!(stored, expected)"]
end
subgraph "Memory Access"
MMAP["Memory-mapped file access"]
ZERO_COPY["Zero-copy EntryHandle"]
BACKWARD_CHAIN["Follow prev_offset backward"]
end
ITER_START --> ENTRY
ENTRY --> DEREF
DEREF --> PARSE
PARSE --> VERIFY
ITER_START --> BACKWARD_CHAIN
BACKWARD_CHAIN --> MMAP
MMAP --> ZERO_COPY
ZERO_COPY --> DEREF
This benchmark uses storage.into_iter() which implements the IntoIterator trait, traversing the backward-linked chain via prev_offset fields. Each entry is accessed via zero-copy memory mapping and validated by parsing the stored u64 value.
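A minimal stand-in for this pass, with iter().rev() playing the role of into_iter()'s newest-to-oldest traversal of the prev_offset chain:

```rust
// Sketch of the sequential verification pass: entries are visited
// newest-to-oldest and each 8-byte payload is parsed as a little-endian u64.
fn sequential_verify(entries: &[[u8; 8]]) -> usize {
    let mut expected = entries.len() as u64;
    for payload in entries.iter().rev() {
        expected -= 1;
        // Mirrors the &*entry dereference plus u64::from_le_bytes() parse.
        assert_eq!(u64::from_le_bytes(*payload), expected);
    }
    entries.len()
}

fn main() {
    let entries: Vec<[u8; 8]> = (0..1_000u64).map(|i| i.to_le_bytes()).collect();
    assert_eq!(sequential_verify(&entries), 1_000);
}
```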
Sources: benches/storage_benchmark.rs:98-118
Random Read Benchmark: benchmark_random_reads()
The random read benchmark measures hash index lookup performance by performing 1,000,000 random single-key reads.
graph LR
subgraph "Random Read Flow"
RNG["rng.random_range(0..NUM_ENTRIES)"]
KEY_GEN["Generate key: bench-key-{i}"]
LOOKUP["storage.read(key.as_bytes())"]
UNWRAP["unwrap() -> EntryHandle"]
VALIDATE["Parse u64 and assert_eq!"]
end
subgraph "Index Lookup Path"
XXH3["XXH3 hash of key"]
DASHMAP["DashMap index lookup"]
TAG_CHECK["16-bit tag collision check"]
OFFSET["Extract 48-bit offset"]
MMAP_ACCESS["Memory-mapped access at offset"]
end
RNG --> KEY_GEN
KEY_GEN --> LOOKUP
LOOKUP --> UNWRAP
UNWRAP --> VALIDATE
LOOKUP --> XXH3
XXH3 --> DASHMAP
DASHMAP --> TAG_CHECK
TAG_CHECK --> OFFSET
OFFSET --> MMAP_ACCESS
MMAP_ACCESS --> UNWRAP
Each iteration generates a random index, constructs the corresponding key, and performs a single read() operation. This exercises the O(1) hash index lookup path including XXH3 hashing, DashMap access, tag-based collision detection, and memory-mapped file access.
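The loop can be sketched with a HashMap standing in for the XXH3/DashMap index and a small LCG replacing rng.random_range(), so no external crates are needed:

```rust
// Sketch of the random-read loop: pick a random index, rebuild the key,
// look it up, and validate the stored little-endian u64.
use std::collections::HashMap;

fn lcg(state: &mut u64) -> u64 {
    // Tiny deterministic PRNG (stand-in for rand's random_range).
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

fn random_read_check(num_entries: u64, num_checks: u64) -> u64 {
    // Setup mirrors the write phase: key -> 8-byte little-endian payload.
    let index: HashMap<String, [u8; 8]> = (0..num_entries)
        .map(|i| (format!("bench-key-{i}"), i.to_le_bytes()))
        .collect();
    let mut state = 42u64;
    let mut validated = 0;
    for _ in 0..num_checks {
        let i = lcg(&mut state) % num_entries;
        let key = format!("bench-key-{i}");
        let stored = u64::from_le_bytes(*index.get(&key).unwrap());
        assert_eq!(stored, i);
        validated += 1;
    }
    validated
}

fn main() {
    assert_eq!(random_read_check(10_000, 1_000), 1_000);
}
```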
Sources: benches/storage_benchmark.rs:124-149
Batch Read Benchmark: benchmark_batch_reads()
The batch read benchmark measures vectorized lookup performance by reading 1,024 keys at a time via batch_read().
graph TB
subgraph "Batch Read Flow"
ACCUMULATE["Accumulate 1,024 keys in Vec"]
CONVERT["Convert to Vec<&[u8]>"]
BATCH_CALL["storage.batch_read(&key_refs)"]
RESULTS["Process Vec<Option<EntryHandle>>"]
VERIFY["Verify each entry's payload"]
end
subgraph "Batch Read Implementation"
PARALLEL["Optional: Rayon parallel iterator"]
MULTI_LOOKUP["Multiple hash lookups"]
COLLECT["Collect into result Vec"]
end
subgraph "Verification Logic"
PARSE_KEY["Extract index from key suffix"]
PARSE_VALUE["u64::from_le_bytes(handle)"]
ASSERT["assert_eq!(stored, idx)"]
end
ACCUMULATE --> CONVERT
CONVERT --> BATCH_CALL
BATCH_CALL --> RESULTS
RESULTS --> VERIFY
BATCH_CALL --> PARALLEL
PARALLEL --> MULTI_LOOKUP
MULTI_LOOKUP --> COLLECT
COLLECT --> RESULTS
VERIFY --> PARSE_KEY
VERIFY --> PARSE_VALUE
PARSE_KEY --> ASSERT
PARSE_VALUE --> ASSERT
The benchmark accumulates keys in batches of 1,024 and invokes batch_read() which can optionally use the parallel feature to perform lookups concurrently via Rayon. The verification phase includes a fast numeric suffix parser that extracts the index from the key string without heap allocation.
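One piece of this that lends itself to a compact sketch is the allocation-free suffix parser: trailing ASCII digits of the key are folded into a u64 directly from the byte slice (a reconstruction of the idea, not the exact source):

```rust
// Sketch of a no-allocation numeric-suffix parser for keys like
// "bench-key-42": walk to the last non-digit, then fold the digits.
fn parse_key_suffix(key: &[u8]) -> Option<u64> {
    let start = key
        .iter()
        .rposition(|b| !b.is_ascii_digit())
        .map_or(0, |p| p + 1);
    if start == key.len() {
        return None; // key has no trailing digits
    }
    key[start..].iter().try_fold(0u64, |acc, &b| {
        acc.checked_mul(10)?.checked_add(u64::from(b - b'0'))
    })
}

fn main() {
    assert_eq!(parse_key_suffix(b"bench-key-42"), Some(42));
    assert_eq!(parse_key_suffix(b"bench-key-001024"), Some(1_024));
    assert_eq!(parse_key_suffix(b"no-digits-"), None);
}
```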
Sources: benches/storage_benchmark.rs:155-181 benches/storage_benchmark.rs:183-202
graph LR
subgraph "Rate Formatting Logic"
INPUT["Input: f64 rate"]
SPLIT["Split into whole and fractional parts"]
ROUND["Round fractional to 3 decimals"]
CARRY["Handle 1000 rounding carry"]
SEPARATE["Comma-separate thousands"]
FORMAT["Output: 4,741,483.464"]
end
INPUT --> SPLIT
SPLIT --> ROUND
ROUND --> CARRY
CARRY --> SEPARATE
SEPARATE --> FORMAT
Output Formatting
The benchmark produces human-readable output with formatted throughput numbers using the fmt_rate() utility function.
The fmt_rate() function formats rates with comma-separated thousands and exactly three decimal places. It uses the thousands crate’s separate_with_commas() method and handles edge cases where rounding produces 1000 in the fractional part.
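A minimal re-implementation of this logic, without the thousands crate (the comma loop stands in for separate_with_commas(); the carry branch handles the rounding edge case described above):

```rust
// Sketch of fmt_rate-style formatting: comma-separated thousands plus exactly
// three decimals, carrying into the whole part when the fraction rounds to 1000.
fn fmt_rate(rate: f64) -> String {
    let mut whole = rate.trunc() as u64;
    let mut frac = (rate.fract() * 1000.0).round() as u64;
    if frac == 1000 {
        // Rounding carried into the whole part, e.g. 999.9996 -> 1,000.000.
        whole += 1;
        frac = 0;
    }
    // Comma-separate the whole part (stand-in for separate_with_commas()).
    let digits = whole.to_string();
    let mut out = String::new();
    for (i, c) in digits.chars().enumerate() {
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(c);
    }
    format!("{out}.{frac:03}")
}

fn main() {
    assert_eq!(fmt_rate(4_741_483.464), "4,741,483.464");
    assert_eq!(fmt_rate(999.9996), "1,000.000");
}
```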
Example output:
Wrote 1,000,000 entries of 8 bytes in 0.234s (4,273,504.273 writes/s)
Sequentially read 1,000,000 entries in 0.089s (11,235,955.056 reads/s)
Randomly read 1,000,000 entries in 0.532s (1,879,699.248 reads/s)
Batch-read verified 1,000,000 entries in 0.156s (6,410,256.410 reads/s)
Sources: benches/storage_benchmark.rs:204-233
Contention Benchmark
The contention benchmark (contention_benchmark.rs) measures performance degradation under concurrent write load. While its source file is not included in the provided files, it is referenced in Cargo.toml:61-63 and is designed to complement the concurrency tests shown in tests/concurrency_tests.rs:1-230.
Expected Contention Scenarios
Based on the concurrency test patterns, the contention benchmark likely measures:
| Scenario | Description | Measured Metric |
|---|---|---|
| Concurrent Writes | Multiple threads writing different keys simultaneously | Throughput degradation under RwLock contention |
| Write Serialization | Effect of RwLock serializing write operations | Comparison vs. theoretical maximum (single-threaded) |
| Index Contention | DashMap update performance under concurrent load | Lock-free read performance maintained |
| Streaming Writes | Concurrent write_stream() calls with slow readers | I/O bottleneck vs. lock contention |
Concurrency Test Patterns
The tests/concurrency_tests.rs:1-230 file demonstrates three concurrency patterns that inform contention benchmarking:
Sources: tests/concurrency_tests.rs:111-161 tests/concurrency_tests.rs:163-229 tests/concurrency_tests.rs:14-109
graph TB
subgraph "Concurrent Write Test"
THREADS_WRITE["16 threads × 10 writes each"]
RWLOCK_WRITE["RwLock serializes writes"]
ATOMIC_UPDATE["AtomicU64 tail_offset updated"]
DASHMAP_UPDATE["DashMap index updated"]
end
subgraph "Interleaved Read/Write Test"
THREAD_A["Thread A: write → notify → read"]
THREAD_B["Thread B: wait → read → write → notify"]
SYNC["Tokio Notify synchronization"]
end
subgraph "Streaming Write Test"
SLOW_READER["SlowReader with artificial delay"]
CONCURRENT_STREAMS["2 concurrent write_stream() calls"]
IO_BOUND["Tests I/O vs. lock contention"]
end
THREADS_WRITE --> RWLOCK_WRITE
RWLOCK_WRITE --> ATOMIC_UPDATE
ATOMIC_UPDATE --> DASHMAP_UPDATE
THREAD_A --> SYNC
THREAD_B --> SYNC
SLOW_READER --> CONCURRENT_STREAMS
CONCURRENT_STREAMS --> IO_BOUND
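The first pattern (16 threads × 10 writes each) can be sketched with a Mutex-guarded Vec standing in for the RwLock-protected DataStore, so writes serialize on the lock just as the test describes:

```rust
// Sketch of the concurrent-write pattern: N threads each perform M writes;
// the shared lock serializes them, and the total is verified afterward.
use std::sync::{Arc, Mutex};
use std::thread;

fn concurrent_writes(threads: usize, writes_per_thread: usize) -> usize {
    let store = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                for i in 0..writes_per_thread {
                    let key = format!("thread-{t}-key-{i}");
                    store.lock().unwrap().push(key); // serialized by the lock
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = store.lock().unwrap().len();
    total
}

fn main() {
    assert_eq!(concurrent_writes(16, 10), 160);
}
```

A contention benchmark built on this pattern would time the same workload at increasing thread counts and compare against the single-threaded baseline.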
Running Benchmarks
Command-Line Usage
Execute benchmarks using Cargo’s benchmark runner:
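The invocations below follow Cargo's standard bench runner; the --bench names match the [[bench]] target names referenced in Cargo.toml:

```shell
# Run both benchmark suites
cargo bench

# Run a single suite by its [[bench]] target name
cargo bench --bench storage_benchmark
cargo bench --bench contention_benchmark
```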
Benchmark Execution Flow
The benchmarks use harness = false in Cargo.toml:59-63, meaning they execute as standalone binaries rather than through Criterion's default test harness. This allows custom output formatting and fine-grained control over benchmark structure.
Sources: Cargo.toml:57-63 benches/storage_benchmark.rs:32-46
Metrics and Analysis
Performance Indicators
The benchmark suite measures the following key performance indicators:
| Metric | Calculation | Typical Range | Optimization Focus |
|---|---|---|---|
| Write Throughput | NUM_ENTRIES / elapsed_time | 2-10M writes/sec | SIMD copy, batch sizing |
| Sequential Read Throughput | NUM_ENTRIES / elapsed_time | 5-20M reads/sec | Memory mapping, iterator overhead |
| Random Read Throughput | NUM_RANDOM_CHECKS / elapsed_time | 1-5M reads/sec | Hash index lookup, cache efficiency |
| Batch Read Throughput | NUM_BATCH_CHECKS / elapsed_time | 3-12M reads/sec | Parallel lookup, vectorization |
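Every throughput column in the table is computed the same way: an entry count divided by elapsed seconds, as measured with std::time::Instant in the benchmark:

```rust
// Throughput calculation shared by all four metrics: entries / elapsed time.
fn throughput(entries: u64, elapsed_secs: f64) -> f64 {
    entries as f64 / elapsed_secs
}

fn main() {
    // Matches the earlier example output: 1,000,000 writes in 0.234s.
    let rate = throughput(1_000_000, 0.234);
    assert!((rate - 4_273_504.273).abs() < 0.5);
}
```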
Factors Affecting Performance
Sources: benches/storage_benchmark.rs:16-26 Cargo.toml:49-55
Integration with Development Workflow
Performance Regression Detection
While Criterion.rs is included as a dependency (Cargo.toml:39), the current benchmark implementations use custom timing via std::time::Instant rather than Criterion's statistical framework. This provides:
- Immediate feedback: results are printed directly to stdout during execution
- Reproducibility: fixed workload sizes and patterns allow consistent comparison
- Simplicity: no statistical overhead for quick performance checks
Benchmark Data for Tuning
The benchmark results inform optimization decisions:
| Optimization | Benchmark Validation | Expected Impact |
|---|---|---|
| SIMD copy implementation | Write throughput increase | 2-4x improvement on AVX2 systems |
| 64-byte alignment change | All operations improve | 10-30% from cache-line alignment |
| parallel feature | Batch read throughput | 2-4x on multi-core systems |
| DashMap vs RwLock | Random read throughput | Eliminates read lock contention |
Sources: benches/storage_benchmark.rs:1-234 Cargo.toml:36-55