
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Key Indexing and Hashing



Purpose and Scope

This page documents the key indexing system and hashing mechanisms used in SIMD R Drive’s storage engine. It covers the KeyIndexer data structure, the XXH3 hashing algorithm, tag-based collision detection, and hardware acceleration features.

For information about how the index is accessed in concurrent operations, see Concurrency and Thread Safety. For details on how metadata is stored alongside payloads, see Entry Structure and Metadata.


Overview

The SIMD R Drive storage engine maintains an in-memory index that maps key hashes to file offsets, enabling O(1) lookup performance for stored entries. This index is critical for avoiding full file scans when retrieving data.

The indexing system consists of three main components:

  1. KeyIndexer: A concurrent hash map that stores packed values containing both a collision-detection tag and a file offset
  2. XXH3_64 hashing: A fast, hardware-accelerated hashing algorithm that generates 64-bit hashes from arbitrary keys
  3. Tag-based verification: A secondary collision-detection mechanism that validates lookups to prevent hash collision errors

Sources: src/storage_engine/data_store.rs:1-33 README.md:158-168


KeyIndexer Structure

The KeyIndexer struct is defined in src/storage_engine/key_indexer.rs:56-59 and manages the in-memory hash index. It wraps a HashMap<u64, u64, Xxh3BuildHasher> where keys are XXH3 key hashes and values are packed 64-bit integers containing both a collision-detection tag and file offset.
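The structure described above can be sketched as a simplified model (a plain HashMap with the default hasher stands in for the crate's Xxh3BuildHasher, and all other DataStore fields are elided):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Simplified model of the documented structures; not the crate's exact code.
struct KeyIndexer {
    // key: XXH3_64 hash of the user key; value: packed (tag, offset)
    index: HashMap<u64, u64>,
}

struct DataStore {
    key_indexer: Arc<RwLock<KeyIndexer>>,
    // ... other fields (mmap handle, tail offset, etc.) elided
}

fn main() {
    let store = DataStore {
        key_indexer: Arc::new(RwLock::new(KeyIndexer { index: HashMap::new() })),
    };
    // Shared read access; writers would use .write() for exclusive access.
    assert!(store.key_indexer.read().unwrap().index.is_empty());
}
```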

graph TB
    subgraph DataStore["DataStore struct"]
KeyIndexerField["key_indexer: Arc&lt;RwLock&lt;KeyIndexer&gt;&gt;"]
end
    
    subgraph KeyIndexer["KeyIndexer struct"]
IndexField["index: HashMap&lt;u64, u64, Xxh3BuildHasher&gt;"]
MapKey["Key: u64\n(key_hash from compute_hash)"]
MapValue["Value: u64\n(packed tag / offset)"]
end
    
    subgraph Constants["Key Constants"]
TagBits["TAG_BITS = 16"]
OffsetMask["OFFSET_MASK = (1 << 48) - 1"]
end
    
    subgraph Methods["Public Methods"]
TagFromHash["tag_from_hash(key_hash) -> u16"]
TagFromKey["tag_from_key(key) -> u16"]
Pack["pack(tag, offset) -> u64"]
Unpack["unpack(packed) -> (u16, u64)"]
Build["build(mmap, tail_offset) -> Self"]
Insert["insert(key_hash, offset) -> Result"]
GetPacked["get_packed(&key_hash) -> Option<&u64>"]
GetOffset["get_offset(&key_hash) -> Option<u64>"]
Remove["remove(&key_hash) -> Option<u64>"]
end
    
 
   KeyIndexerField --> IndexField
    IndexField --> MapKey
    IndexField --> MapValue
    MapValue --> TagBits
    MapValue --> OffsetMask
    
    Pack -.uses.-> TagBits
    Pack -.uses.-> OffsetMask
    Unpack -.uses.-> TagBits
    Unpack -.uses.-> OffsetMask

Packed Value Format

The KeyIndexer stores a compact 64-bit packed value for each hash. This value is constructed by the pack function (src/storage_engine/key_indexer.rs:79-85) and decoded by unpack (src/storage_engine/key_indexer.rs:88-93):

| Bits | Field | Description | Constant Used |
|------|-------|-------------|---------------|
| 63-48 | Tag (16-bit) | Collision detection tag from upper hash bits | TAG_BITS = 16 |
| 47-0 | Offset (48-bit) | Absolute file offset to entry metadata | OFFSET_MASK = 0xFFFFFFFFFFFF |

The packing formula is: packed = (tag << (64 - TAG_BITS)) | offset

The unpacking extracts: tag = (packed >> (64 - TAG_BITS)) as u16 and offset = packed & OFFSET_MASK

Maximum Addressable File Size: The 48-bit offset field supports files up to 256 TiB (2^48 bytes). Attempting to use larger offsets panics in debug builds due to debug_assert! checks (src/storage_engine/key_indexer.rs:80-83).
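The packing and unpacking formulas above can be sketched in a few lines (constant names follow the documentation; this is an illustration, not the crate's code):

```rust
const TAG_BITS: u32 = 16;
const OFFSET_MASK: u64 = (1u64 << 48) - 1;

/// Pack a 16-bit tag and a 48-bit offset into one u64 (tag in bits 63-48).
fn pack(tag: u16, offset: u64) -> u64 {
    debug_assert!(offset <= OFFSET_MASK, "offset exceeds 48 bits");
    ((tag as u64) << (64 - TAG_BITS)) | offset
}

/// Recover (tag, offset) from a packed value.
fn unpack(packed: u64) -> (u16, u64) {
    ((packed >> (64 - TAG_BITS)) as u16, packed & OFFSET_MASK)
}

fn main() {
    let packed = pack(0xBEEF, 0x0123_4567);
    assert_eq!(packed, 0xBEEF_0000_0123_4567);
    assert_eq!(unpack(packed), (0xBEEF, 0x0123_4567));
}
```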

Sources: src/storage_engine/key_indexer.rs:9-15 src/storage_engine/key_indexer.rs:56-59 src/storage_engine/key_indexer.rs:79-93 src/storage_engine/data_store.rs:31


Hashing Algorithm: XXH3_64

SIMD R Drive uses the XXH3_64 hashing algorithm from the xxhash-rust crate (src/storage_engine/digest.rs). XXH3 is optimized for speed and provides automatic hardware acceleration through SIMD instructions.

graph TB
    subgraph DigestModule["storage_engine::digest module"]
ComputeHash["compute_hash(key: &[u8]) -> u64"]
ComputeHashBatch["compute_hash_batch(keys: &[&[u8]]) -> Vec&lt;u64&gt;"]
ComputeChecksum["compute_checksum(payload: &[u8]) -> [u8; 4]"]
Xxh3BuildHasher["Xxh3BuildHasher struct"]
end
    
    subgraph Callers["Usage in DataStore"]
WriteMethod["write(key, payload)"]
BatchWrite["batch_write(entries)"]
KeyIndexerHashMap["HashMap&lt;u64, u64, Xxh3BuildHasher&gt;"]
end
    
 
   WriteMethod -->|calls| ComputeHash
    BatchWrite -->|calls| ComputeHashBatch
    WriteMethod -->|calls| ComputeChecksum
    KeyIndexerHashMap -->|uses hasher| Xxh3BuildHasher

    ComputeHash -->|produces| KeyHash["key_hash: u64"]
    ComputeHashBatch -->|produces| KeyHashes["Vec&lt;u64&gt;"]

Hash Function API

The digest module exports the following functions used throughout the codebase:

| Function | Signature | Implementation | Use Case |
|----------|-----------|----------------|----------|
| compute_hash | fn(key: &[u8]) -> u64 | Wraps xxhash_rust::xxh3::xxh3_64 | Single key hashing |
| compute_hash_batch | fn(keys: &[&[u8]]) -> Vec<u64> | Parallel iterator over keys | Batch write operations |
| compute_checksum | fn(payload: &[u8]) -> [u8; 4] | CRC32C from crc32fast | Payload integrity checks |
| Xxh3BuildHasher | struct implementing BuildHasher | Custom hasher for HashMap | KeyIndexer HashMap hasher |

The compute_hash_batch function leverages Rayon for parallel hashing when processing multiple keys simultaneously (src/storage_engine/data_store.rs:839-842).

graph LR
    KeyInput["Input: &[u8]"]
subgraph XXH3Crate["xxhash-rust crate"]
CPUDetect["Runtime CPU\nFeature Detection"]
subgraph x86Implementation["x86_64 Implementation"]
SSE2Path["SSE2 Path\n(always available)"]
AVX2Path["AVX2 Path\n(if cpuid detects)"]
end
        
        subgraph ARMImplementation["aarch64 Implementation"]
NEONPath["NEON Path\n(default on ARM)"]
end
        
        subgraph FallbackImplementation["Fallback"]
ScalarPath["Scalar Operations"]
end
    end
    
    Output["Output: u64 hash"]
    KeyInput --> CPUDetect
    CPUDetect --> SSE2Path
    CPUDetect --> AVX2Path
    CPUDetect --> NEONPath
    CPUDetect --> ScalarPath

    SSE2Path --> Output
    AVX2Path --> Output
    NEONPath --> Output
    ScalarPath --> Output

Hardware Acceleration

The XXH3_64 implementation automatically detects and utilizes CPU-specific SIMD instructions at runtime:

| Platform | Default SIMD | Optional Features | Detection Method |
|----------|--------------|-------------------|------------------|
| x86_64 | SSE2 (baseline) | AVX2 | Runtime cpuid instruction |
| aarch64 | NEON (always) | None | Compile-time default |
| Other | Scalar fallback | None | Compile-time detection |

Hardware acceleration is transparent to application code: the compute_hash function signature remains the same regardless of which SIMD path is taken (README.md:160-165).

Sources: src/storage_engine/digest.rs src/storage_engine/data_store.rs:2-4 src/storage_engine/key_indexer.rs:2 README.md:158-168


Tag-Based Collision Detection

While XXH3_64 produces high-quality 64-bit hashes, the system implements an additional collision detection layer using 16-bit tags. The tag is derived from the upper 16 bits of the key hash (src/storage_engine/key_indexer.rs:64-66).

Tag Computation Methods

Two methods generate tags for collision detection:

| Method | Signature | Source | Usage Context |
|--------|-----------|--------|---------------|
| tag_from_hash | fn(key_hash: u64) -> u16 | src/storage_engine/key_indexer.rs:64-66 | When hash is known |
| tag_from_key | fn(key: &[u8]) -> u16 | src/storage_engine/key_indexer.rs:69-72 | Direct from key bytes |

The tag_from_hash function extracts the tag: (key_hash >> (64 - TAG_BITS)) as u16

The tag_from_key function computes: tag_from_hash(compute_hash(key))

graph TB
    subgraph WriteFlow["Write Operation Flow"]
Key["key: &[u8]"]
ComputeHash["compute_hash(key)"]
KeyHash["key_hash: u64"]
TagFromHash["KeyIndexer::tag_from_hash(key_hash)"]
Tag["tag: u16"]
WriteData["Write payload + metadata to file"]
Offset["metadata_offset: u64"]
Pack["KeyIndexer::pack(tag, offset)"]
PackedValue["packed: u64"]
Insert["key_indexer.insert(key_hash, offset)"]
end
    
 
   Key --> ComputeHash
    ComputeHash --> KeyHash
    KeyHash --> TagFromHash
    TagFromHash --> Tag
    WriteData --> Offset
    Tag --> Pack
    Offset --> Pack
    Pack --> PackedValue
    KeyHash --> Insert
    PackedValue --> Insert

Write-Time Tag Storage

During write operations, the tag is packed with the offset before insertion into the index.

The insert method in KeyIndexer (src/storage_engine/key_indexer.rs:135-160) performs collision detection at write time by verifying that the new tag matches any existing tag for the same key hash.

Read-Time Tag Verification

The read_entry_with_context method (src/storage_engine/data_store.rs:501-565) implements tag verification during reads; the verification logic lives at src/storage_engine/data_store.rs:513-521.
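A minimal sketch of the documented read-time check (hypothetical helper names; DefaultHasher stands in for XXH3_64, which the real code uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const TAG_BITS: u32 = 16;
const OFFSET_MASK: u64 = (1u64 << 48) - 1;

// Stand-in for the XXH3_64-based compute_hash.
fn compute_hash(key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

fn tag_from_hash(key_hash: u64) -> u16 {
    (key_hash >> (64 - TAG_BITS)) as u16
}

fn unpack(packed: u64) -> (u16, u64) {
    (tag_from_hash(packed), packed & OFFSET_MASK)
}

// Hypothetical lookup mirroring the documented flow: recompute the tag from
// the caller's key and compare it with the stored tag before trusting the offset.
fn verified_offset(index: &HashMap<u64, u64>, key: &[u8]) -> Option<u64> {
    let key_hash = compute_hash(key);
    let packed = *index.get(&key_hash)?;
    let (stored_tag, offset) = unpack(packed);
    if stored_tag != tag_from_hash(key_hash) {
        // Tag mismatch: treat as a suspected collision and report "not found".
        return None;
    }
    Some(offset)
}

fn main() {
    let mut index = HashMap::new();
    let h = compute_hash(b"alpha");
    let packed = ((tag_from_hash(h) as u64) << (64 - TAG_BITS)) | 42;
    index.insert(h, packed);
    assert_eq!(verified_offset(&index, b"alpha"), Some(42));
    assert_eq!(verified_offset(&index, b"missing"), None);
}
```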

Collision Probability Analysis

The dual-layer verification provides strong collision resistance:

| Layer | Bits | Collision Probability | Description |
|-------|------|-----------------------|-------------|
| XXH3_64 Hash | 64 | ~2^-64 | Primary hash collision |
| Tag Verification | 16 | ~2^-16 | Secondary tag collision given hash collision |
| Combined | 80 | ~2^-80 | Both hash and tag must collide |

With 2^16 = 65,536 possible tag values, the tag check provides sufficient discrimination for practical workloads. The KeyIndexer documentation (src/storage_engine/key_indexer.rs:20-56) notes that the 64-bit hash space can distinguish over 4 billion keys before reaching ~50% probability of any collision (the birthday bound, √2^64 = 2^32).

Sources: src/storage_engine/key_indexer.rs:9-72 src/storage_engine/key_indexer.rs:135-160 src/storage_engine/data_store.rs:501-565

Write-Time Collision Rejection

The KeyIndexer::insert method (src/storage_engine/key_indexer.rs:135-160) enforces collision detection at write time.

If a collision is detected (same hash, different tag), the write operation fails with an error (src/storage_engine/data_store.rs:245-251). This prevents index corruption and ensures data integrity.

Sources: src/storage_engine/key_indexer.rs:135-160 src/storage_engine/data_store.rs:238-252


Index Building and Maintenance

Index Construction on Open

When DataStore::open (src/storage_engine/data_store.rs:84-117) is called, the KeyIndexer is constructed by the static build method (src/storage_engine/key_indexer.rs:98-124), which scans backward through the validated storage file.

The backward scan ensures only the most recent version of each key is indexed. Keys seen earlier in the scan (which represent newer entries) are added to the seen HashSet so that older versions are skipped (src/storage_engine/key_indexer.rs:108-111).
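The newest-wins backward scan can be illustrated with a simplified model (hypothetical: entries are given as (key_hash, offset) pairs ordered newest-first, whereas the real build walks the mmap):

```rust
use std::collections::{HashMap, HashSet};

/// Index only the newest version of each key: the first occurrence seen
/// while scanning newest-to-oldest wins; older duplicates are skipped.
fn build(entries_newest_first: &[(u64, u64)]) -> HashMap<u64, u64> {
    let mut index = HashMap::new();
    let mut seen = HashSet::new();
    for &(key_hash, offset) in entries_newest_first {
        if seen.insert(key_hash) {
            // First sighting == newest version of this key.
            index.insert(key_hash, offset);
        }
    }
    index
}

fn main() {
    // Key 0xAA was written twice; offset 300 is the newer version.
    let entries = [(0xAA, 300), (0xBB, 200), (0xAA, 100)];
    let index = build(&entries);
    assert_eq!(index.get(&0xAA), Some(&300));
    assert_eq!(index.get(&0xBB), Some(&200));
}
```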

Sources: src/storage_engine/data_store.rs:84-117 src/storage_engine/key_indexer.rs:98-124

Index Updates During Writes

After each write operation, the reindex method (src/storage_engine/data_store.rs:224-259) updates the in-memory index with the new key mappings.

Critical: The file must be flushed before calling reindex (src/storage_engine/data_store.rs:814) to ensure newly written data is visible in the new memory-mapped view. The flush guarantees that the OS has persisted the data to disk before the mmap is recreated.

The key_indexer.insert call may return an error if a hash collision is detected at write time (src/storage_engine/data_store.rs:246-250). In this case, the entire batch operation is aborted to prevent an inconsistent index state.

Sources: src/storage_engine/data_store.rs:224-259 src/storage_engine/data_store.rs:818-824


Concurrent Access and Locking

The KeyIndexer is protected by an Arc<RwLock<KeyIndexer>> wrapper (src/storage_engine/data_store.rs:31), enabling multiple concurrent readers while ensuring exclusive access for writers.

graph TB
    subgraph DataStoreField["DataStore struct field"]
KeyIndexerArc["key_indexer: Arc&lt;RwLock&lt;KeyIndexer&gt;&gt;"]
end
    
    subgraph ReadOperations["Read Operations (Shared Lock)"]
ReadOp["read(key)"]
BatchReadOp["batch_read(keys)"]
IterEntries["iter_entries()"]
ParIterEntries["par_iter_entries() (parallel feature)"]
end
    
    subgraph WriteOperations["Write Operations (Exclusive Lock)"]
WriteOp["write(key, payload)"]
BatchWriteOp["batch_write(entries)"]
DeleteOp["delete(key)"]
ReindexOp["reindex() (internal)"]
end
    
    subgraph LockAcquisition["Lock Acquisition"]
ReadLock["key_indexer.read().unwrap()"]
WriteLock["key_indexer.write().map_err(...)"]
end
    
 
   ReadOp --> ReadLock
    BatchReadOp --> ReadLock
    IterEntries --> ReadLock
    ParIterEntries --> ReadLock

    WriteOp --> WriteLock
    BatchWriteOp --> WriteLock
    DeleteOp --> WriteLock
    ReindexOp --> WriteLock

    ReadLock -.-> KeyIndexerArc
    WriteLock -.-> KeyIndexerArc

Lock Granularity

| Operation | Lock on key_indexer | Lock on file | Atomicity |
|-----------|---------------------|--------------|-----------|
| read(key) | Read (shared) | None | Lock-free read |
| batch_read(keys) | Read (shared) | None | Lock-free reads in batch |
| write(key, data) | Write (exclusive) | Write (exclusive) | Full write atomicity |
| batch_write(...) | Write (exclusive) | Write (exclusive) | Batch atomicity |
| delete(key) | Write (exclusive) | Write (exclusive) | Tombstone write atomicity |

The reindex method acquires both the mmap mutex (src/storage_engine/data_store.rs:232) and the key_indexer write lock (src/storage_engine/data_store.rs:233-236) to atomically update both structures after a write operation.

Parallel Iteration: When the parallel feature is enabled, par_iter_entries (src/storage_engine/data_store.rs:296-361) clones all packed values under a read lock, then releases the lock before parallel processing. This allows concurrent reads during parallel iteration.
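The clone-then-release pattern can be sketched like this (simplified: a plain HashMap<u64, u64> stands in for the KeyIndexer):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Copy all packed values under a short-lived read lock, then release the
/// lock so subsequent (possibly parallel) processing doesn't block other
/// readers or writers.
fn snapshot_packed(index: &Arc<RwLock<HashMap<u64, u64>>>) -> Vec<u64> {
    let guard = index.read().unwrap();
    guard.values().copied().collect()
    // guard dropped here; the returned Vec is processed lock-free
}

fn main() {
    let index = Arc::new(RwLock::new(HashMap::from([(1u64, 10u64), (2, 20)])));
    let mut packed = snapshot_packed(&index);
    packed.sort_unstable();
    assert_eq!(packed, vec![10, 20]);
}
```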

Sources: src/storage_engine/data_store.rs:31 src/storage_engine/data_store.rs:224-259 src/storage_engine/data_store.rs:296-361


Performance Characteristics

Lookup Performance

The KeyIndexer provides O(1) average-case lookup performance using the HashMap (src/storage_engine/key_indexer.rs:58). The get_packed method (src/storage_engine/key_indexer.rs:163-166) performs a single hash table lookup.

Empirical performance figures from the README (README.md:166-167):

  • 1 million random seeks (8-byte entries): typically < 1 second
  • Hash computation overhead: negligible due to SIMD acceleration
  • Tag verification overhead: minimal (one bit shift + one comparison)

Memory Overhead

Each index entry in the HashMap<u64, u64, Xxh3BuildHasher> consumes:

| Component | Size (bytes) | Description |
|-----------|--------------|-------------|
| Key (u64) | 8 | XXH3 hash of the key |
| Value (u64) | 8 | Packed tag and offset |
| HashMap overhead | ~16-24 | Bucket pointers and metadata |
| Total per entry | 32-40 | Approximate overhead per unique key |

For a dataset with 1 million unique keys , the KeyIndexer occupies approximately 32-40 MB of RAM. This is a small fraction of typical system memory, enabling efficient indexing even for large datasets.
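The per-entry arithmetic above can be checked directly (the ~16-24 byte HashMap overhead is the documentation's estimate, not a measured value):

```rust
use std::mem::size_of;

fn main() {
    // 8-byte hash key + 8-byte packed value per entry
    let per_entry_data = size_of::<u64>() * 2;
    assert_eq!(per_entry_data, 16);

    let n: usize = 1_000_000;
    let low = n * (per_entry_data + 16);  // +16 bytes map overhead
    let high = n * (per_entry_data + 24); // +24 bytes map overhead
    assert_eq!(low, 32_000_000);  // ~32 MB
    assert_eq!(high, 40_000_000); // ~40 MB
}
```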

Batch Operation Performance

The compute_hash_batch function (src/storage_engine/data_store.rs:839-842) leverages Rayon for parallel hashing.

This parallel hashing provides near-linear speedup with CPU core count for large batches, as each key hash is computed independently.
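A stdlib-only sketch of the idea (std::thread::scope and DefaultHasher stand in for Rayon and XXH3_64, which the real code uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the XXH3_64-based compute_hash.
fn compute_hash(key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

// Hash each key independently across a few threads; results keep input order
// because chunks are joined in sequence.
fn compute_hash_batch(keys: &[&[u8]]) -> Vec<u64> {
    let chunk = ((keys.len() + 3) / 4).max(1);
    std::thread::scope(|s| {
        let handles: Vec<_> = keys
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&k| compute_hash(k)).collect::<Vec<u64>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let keys: Vec<&[u8]> = vec![&b"a"[..], &b"bb"[..], &b"ccc"[..], &b"dddd"[..]];
    let parallel = compute_hash_batch(&keys);
    let sequential: Vec<u64> = keys.iter().map(|&k| compute_hash(k)).collect();
    assert_eq!(parallel, sequential);
}
```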

Hardware Acceleration Impact

SIMD acceleration in XXH3_64 provides measurable performance improvements for hash-intensive workloads:

| Platform | SIMD Instructions | Relative Performance | Speedup vs Scalar |
|----------|-------------------|----------------------|-------------------|
| x86_64 | SSE2 | ~2-3x faster | Baseline |
| x86_64 | AVX2 | ~3-4x faster | 1.5x over SSE2 |
| aarch64 | NEON | ~2-3x faster | Baseline |
| Fallback | Scalar | 1x (baseline) | N/A |

Performance gains are most significant for:

  • batch_write operations with many keys
  • compute_hash_batch calls processing large key sets
  • Workloads with small payload sizes where hashing dominates

Sources: src/storage_engine/key_indexer.rs:56-59 src/storage_engine/data_store.rs:838-843 README.md:160-167


Error Handling and Collision Management

Collision Probability Analysis

The dual-layer verification system (64-bit hash + 16-bit tag) provides strong collision resistance as documented in src/storage_engine/key_indexer.rs:17-56:

| Verification Layer | Bits | Probability | Description |
|--------------------|------|-------------|-------------|
| XXH3_64 Hash Collision | 64 | 2^-64 | Two different keys produce same hash |
| Tag Collision (given hash collision) | 16 | 2^-16 | Tags match despite different keys |
| Combined Collision | 80 | 2^-80 | Both hash and tag must collide simultaneously |

The 16-bit tag provides 65,536 distinct values. The 4-billion-key figure comes from the birthday bound on the 64-bit hash: √2^64 = 2^32 ≈ 4.3 billion keys can be stored before reaching ~50% probability of any hash collision (src/storage_engine/key_indexer.rs:48-49).

Write-Time Collision Handling

The KeyIndexer::insert method (src/storage_engine/key_indexer.rs:135-160) enforces strict collision detection during writes: if a tag mismatch occurs, the insert returns Err.

The reindex method propagates this error and aborts the entire batch operation (src/storage_engine/data_store.rs:246-250).
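The documented fail-fast behavior can be sketched as follows (a simplified model with hypothetical error strings, not the crate's actual insert):

```rust
use std::collections::HashMap;

const TAG_BITS: u32 = 16;
const OFFSET_MASK: u64 = (1u64 << 48) - 1;

struct KeyIndexer {
    index: HashMap<u64, u64>,
}

impl KeyIndexer {
    /// Sketch of the documented write-time check: the incoming tag must match
    /// any tag already stored for the same key hash, otherwise the write fails.
    fn insert(&mut self, key_hash: u64, offset: u64) -> Result<(), String> {
        let tag = (key_hash >> (64 - TAG_BITS)) as u16;
        if let Some(&existing) = self.index.get(&key_hash) {
            let existing_tag = (existing >> (64 - TAG_BITS)) as u16;
            if existing_tag != tag {
                return Err(format!("tag mismatch for hash {key_hash:#x}: suspected collision"));
            }
        }
        let packed = ((tag as u64) << (64 - TAG_BITS)) | (offset & OFFSET_MASK);
        self.index.insert(key_hash, packed);
        Ok(())
    }
}

fn main() {
    let mut idx = KeyIndexer { index: HashMap::new() };
    assert!(idx.insert(0xABCD_0000_0000_1111, 64).is_ok());
    // Simulate a conflicting stored tag to exercise the error path.
    idx.index.insert(0xABCD_0000_0000_2222, 0x1234_0000_0000_0040);
    assert!(idx.insert(0xABCD_0000_0000_2222, 128).is_err());
}
```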

This fail-fast approach ensures:

  • No partial writes that could corrupt the index
  • Deterministic error handling (writes either fully succeed or fully fail)
  • Index consistency is maintained across all operations

Read-Time Collision Handling

The read_entry_with_context method (src/storage_engine/data_store.rs:501-565) detects collisions during reads when the original key is provided for verification (src/storage_engine/data_store.rs:513-521).

When a read-time collision is detected:

  1. A warning is logged to help diagnose the issue
  2. None is returned to the caller (key not found)
  3. The index remains unchanged (reads do not modify state)

Read operations without the original key (e.g., when using pre-hashed keys) cannot perform tag verification and may return incorrect results if a hash collision exists. This is a tradeoff for performance in batch operations.

Sources: src/storage_engine/key_indexer.rs:17-56 src/storage_engine/key_indexer.rs:135-160 src/storage_engine/data_store.rs:238-252 src/storage_engine/data_store.rs:501-565


Summary

The key indexing and hashing system in SIMD R Drive provides:

  1. Fast lookups: O(1) hash-based access to entries
  2. Hardware acceleration: Automatic SIMD optimization on SSE2, AVX2, and NEON platforms
  3. Collision resistance: Dual-layer verification with 64-bit hashes and 16-bit tags
  4. Thread safety: Concurrent reads with exclusive writes via RwLock
  5. Low memory overhead: 16 bytes of index data per unique key (~32-40 bytes including HashMap overhead)

This design enables efficient storage operations even for datasets with millions of entries, while maintaining data integrity through robust collision detection.

Sources: src/storage_engine/data_store.rs:1-1183 README.md:158-168
