Key Indexing and Hashing
Purpose and Scope
This page documents the key indexing system and hashing mechanisms used in SIMD R Drive's storage engine. It covers the KeyIndexer data structure, the XXH3 hashing algorithm, tag-based collision detection, and hardware acceleration features.
For information about how the index is accessed in concurrent operations, see Concurrency and Thread Safety. For details on how metadata is stored alongside payloads, see Entry Structure and Metadata.
Overview
The SIMD R Drive storage engine maintains an in-memory index that maps key hashes to file offsets, enabling O(1) lookup performance for stored entries. This index is critical for avoiding full file scans when retrieving data.
The indexing system consists of three main components:
- KeyIndexer: A concurrent hash map that stores packed values containing both a collision-detection tag and a file offset
- XXH3_64 hashing: A fast, hardware-accelerated hashing algorithm that generates 64-bit hashes from arbitrary keys
- Tag-based verification: A secondary collision-detection mechanism that validates lookups to prevent hash collision errors
Sources: src/storage_engine/data_store.rs:1-33 README.md:158-168
KeyIndexer Structure
The KeyIndexer is the core data structure managing the key-to-offset mapping. It is wrapped in thread-safe containers to support concurrent access.
graph TB
subgraph "DataStore"
KeyIndexerField["key_indexer\nArc<RwLock<KeyIndexer>>"]
end
subgraph "KeyIndexer Internals"
HashMap["HashMap<u64, u64>\nwith Xxh3BuildHasher"]
PackedValue["Packed u64 Value\n(16-bit tag / 48-bit offset)"]
end
subgraph "Operations"
Insert["insert(key_hash, offset)\n→ Result<()>"]
Lookup["get_packed(&key_hash)\n→ Option<u64>"]
Unpack["unpack(packed)\n→ (tag, offset)"]
TagFromKey["tag_from_key(key)\n→ u16"]
Build["build(mmap, tail_offset)\n→ KeyIndexer"]
end
KeyIndexerField --> HashMap
HashMap --> PackedValue
Insert -.-> HashMap
Lookup -.-> HashMap
Unpack -.-> PackedValue
TagFromKey -.-> PackedValue
Build --> KeyIndexerField
Packed Value Format
The KeyIndexer stores a compact 64-bit packed value for each hash. This value encodes two pieces of information:
| Bits | Field | Description |
|---|---|---|
| 63-48 | Tag (16-bit) | Collision detection tag derived from key bytes |
| 47-0 | Offset (48-bit) | Absolute file offset to entry metadata |
This packing allows the index to store both the offset and a verification tag without requiring additional memory allocations.
Sources: src/storage_engine/data_store.rs:31 src/storage_engine/data_store.rs:509-511
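The packing itself is plain shifts and masks. A minimal sketch of the layout in the table above (the actual KeyIndexer helpers may differ in naming and error handling):

```rust
const OFFSET_BITS: u32 = 48;
const OFFSET_MASK: u64 = (1u64 << OFFSET_BITS) - 1; // low 48 bits hold the offset

/// Pack a 16-bit tag (bits 63-48) and a 48-bit offset (bits 47-0) into one u64.
fn pack(tag: u16, offset: u64) -> u64 {
    debug_assert!(offset <= OFFSET_MASK, "offset exceeds 48 bits");
    ((tag as u64) << OFFSET_BITS) | (offset & OFFSET_MASK)
}

/// Split a packed u64 back into its (tag, offset) components.
fn unpack(packed: u64) -> (u16, u64) {
    ((packed >> OFFSET_BITS) as u16, packed & OFFSET_MASK)
}
```

A 48-bit offset addresses files up to 2^48 bytes (256 TiB), which is why the tag fits in the remaining 16 bits at no extra memory cost.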
Hashing Algorithm: XXH3_64
SIMD R Drive uses the XXH3_64 hashing algorithm from the xxhash-rust crate. XXH3 is optimized for speed and provides hardware acceleration on supported platforms.
Hardware Acceleration
The XXH3_64 implementation automatically detects and utilizes CPU-specific SIMD instructions:
graph LR
KeyBytes["Key Bytes\n&[u8]"]
subgraph "compute_hash(key)"
XXH3["xxhash-rust\nXXH3_64"]
end
subgraph "Hardware Detection"
X86Check["x86_64 CPU?"]
ARMCheck["aarch64 CPU?"]
end
subgraph "SIMD Paths"
SSE2["SSE2 Instructions\n(baseline x86_64)"]
AVX2["AVX2 Instructions\n(if available)"]
NEON["NEON Instructions\n(aarch64 default)"]
end
KeyBytes --> XXH3
XXH3 --> X86Check
XXH3 --> ARMCheck
X86Check -->|Yes| SSE2
X86Check -->|Yes + feature| AVX2
ARMCheck -->|Yes| NEON
SSE2 --> Hash64["u64 Hash"]
AVX2 --> Hash64
NEON --> Hash64
| Platform | Baseline Instructions | Optional Extensions | Performance Boost |
|---|---|---|---|
| x86_64 | SSE2 (always enabled) | AVX2 | Significant |
| aarch64 | NEON (default) | None required | Built-in |
| Fallback | Scalar operations | None | Baseline |
Sources: README.md:160-165 Cargo.lock:1814-1820 src/storage_engine/data_store.rs:2-4
Hash Computation Functions
The digest module provides batch and single-key hashing functions:
| Function | Signature | Use Case |
|---|---|---|
| compute_hash | fn(key: &[u8]) -> u64 | Single key hashing |
| compute_hash_batch | fn(keys: &[&[u8]]) -> Vec<u64> | Parallel batch hashing |
| compute_checksum | fn(payload: &[u8]) -> [u8; 4] | CRC32C payload validation |
Sources: src/storage_engine/data_store.rs:2-4 src/storage_engine/data_store.rs:754 src/storage_engine/data_store.rs:839-842
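As a rough sketch, the two hash functions can be built directly on the xxhash-rust crate; the crc32c crate and the output byte order in compute_checksum are assumptions for illustration, and the real batch path may hash keys in parallel:

```rust
use xxhash_rust::xxh3::xxh3_64;

/// Single-key hashing (mirrors compute_hash).
fn compute_hash(key: &[u8]) -> u64 {
    xxh3_64(key)
}

/// Batch hashing (mirrors compute_hash_batch); the real implementation
/// may process keys in parallel rather than sequentially.
fn compute_hash_batch(keys: &[&[u8]]) -> Vec<u64> {
    keys.iter().map(|k| xxh3_64(k)).collect()
}

/// CRC32C payload checksum (mirrors compute_checksum); the crc32c crate
/// and the little-endian byte order here are assumptions.
fn compute_checksum(payload: &[u8]) -> [u8; 4] {
    crc32c::crc32c(payload).to_le_bytes()
}
```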
Tag-Based Collision Detection
While XXH3_64 produces high-quality 64-bit hashes, the system implements an additional layer of collision detection using 16-bit tags derived from the original key bytes.
sequenceDiagram
participant Client
participant DataStore
participant KeyIndexer
participant Mmap
Note over Client,Mmap: Write Operation
Client->>DataStore: write(key, payload)
DataStore->>DataStore: key_hash = compute_hash(key)
DataStore->>KeyIndexer: tag = tag_from_key(key)
DataStore->>Mmap: write payload + metadata
DataStore->>KeyIndexer: insert(key_hash, pack(tag, offset))
Note over Client,Mmap: Read Operation with Tag Verification
Client->>DataStore: read(key)
DataStore->>DataStore: key_hash = compute_hash(key)
DataStore->>KeyIndexer: packed = get_packed(key_hash)
KeyIndexer-->>DataStore: Some(packed_value)
DataStore->>DataStore: (stored_tag, offset) = unpack(packed)
DataStore->>DataStore: expected_tag = tag_from_key(key)
alt Tag Mismatch
DataStore->>DataStore: warn!("Tag mismatch")
DataStore-->>Client: None (collision detected)
else Tag Match
DataStore->>Mmap: read entry at offset
Mmap-->>DataStore: EntryHandle
DataStore-->>Client: Some(EntryHandle)
end
How Tags Work
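The tag is a 16-bit fingerprint condensed from the key bytes, computed separately from the primary 64-bit XXH3 hash, so two keys that collide on the full hash are very unlikely to also share a tag. The exact derivation inside tag_from_key is not reproduced here; the following is an illustrative sketch only:

```rust
use xxhash_rust::xxh3::xxh3_64_with_seed;

/// Illustrative 16-bit tag derivation: NOT the actual tag_from_key.
/// A seeded second hash is XOR-folded down to 16 bits so every key
/// byte influences the tag independently of the primary hash.
fn tag_from_key(key: &[u8]) -> u16 {
    let h = xxh3_64_with_seed(key, 0x5EED);
    ((h ^ (h >> 16) ^ (h >> 32) ^ (h >> 48)) & 0xFFFF) as u16
}
```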
Tag Verification Process
When reading an entry, the system performs the following verification:
- Compute the 64-bit hash of the requested key
- Look up the hash in the KeyIndexer to get the packed value
- Unpack the stored tag and offset from the packed value
- Compute the expected tag from the original key bytes
- Compare the stored tag with the expected tag
- If the tags match, proceed with reading; if not, return None (collision detected)
This two-level verification (hash + tag) dramatically reduces the probability of false positives from hash collisions.
Sources: src/storage_engine/data_store.rs:502-522 src/storage_engine/data_store.rs:513-521
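Expressed as code, the read-side check is roughly the following, reusing the pack/unpack and tag helpers sketched earlier (the actual DataStore internals differ):

```rust
/// Sketch of the read-path lookup with tag verification.
fn lookup_offset(indexer: &KeyIndexer, key: &[u8]) -> Option<u64> {
    let key_hash = compute_hash(key);            // step 1: hash the key
    let packed = indexer.get_packed(&key_hash)?; // step 2: index miss -> None
    let (stored_tag, offset) = unpack(packed);   // step 3: split tag/offset
    if stored_tag != tag_from_key(key) {         // steps 4-5: compare tags
        // A different key hashed to the same 64-bit value: treat as a miss.
        return None;
    }
    Some(offset)                                 // step 6: safe to read here
}
```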
Collision Detection in Batch Operations
For batch read operations, the system can perform tag verification when the original keys are provided, as sketched below.
Sources: src/storage_engine/data_store.rs:1105-1158
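A rough sketch of that per-key check, reusing the helpers from earlier sections (the real batch_read returns EntryHandles rather than raw offsets):

```rust
/// Sketch of batch lookup with tag verification when keys are supplied.
fn batch_lookup_offsets(indexer: &KeyIndexer, keys: &[&[u8]]) -> Vec<Option<u64>> {
    let hashes = compute_hash_batch(keys); // one batched hashing pass
    keys.iter()
        .zip(hashes)
        .map(|(key, key_hash)| {
            let packed = indexer.get_packed(&key_hash)?;
            let (stored_tag, offset) = unpack(packed);
            // Keep the offset only if the tag confirms the key matches.
            (stored_tag == tag_from_key(key)).then_some(offset)
        })
        .collect()
}
```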
Index Building and Maintenance
Index Construction on Open
When a DataStore is opened, the KeyIndexer is constructed by scanning the validated storage file forward from offset 0 to the tail offset:
graph TB
OpenFile["DataStore::open(path)"]
RecoverChain["recover_valid_chain()"]
BuildIndex["KeyIndexer::build(mmap, final_len)"]
subgraph "Index Build Process"
ScanForward["Scan from 0 to tail_offset"]
ReadMetadata["Read EntryMetadata"]
ExtractHash["Extract key_hash"]
ComputeTag["Compute tag from prev entries"]
PackValue["pack(tag, metadata_offset)"]
InsertMap["Insert into HashMap"]
end
OpenFile --> RecoverChain
RecoverChain --> BuildIndex
BuildIndex --> ScanForward
ScanForward --> ReadMetadata
ReadMetadata --> ExtractHash
ExtractHash --> ComputeTag
ComputeTag --> PackValue
PackValue --> InsertMap
InsertMap --> ScanForward
InsertMap --> Initialized["KeyIndexer Ready"]
The index is built only once during the open() operation. Subsequent writes update the index incrementally.
Sources: src/storage_engine/data_store.rs:84-116 src/storage_engine/data_store.rs:106-108
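In outline, the build pass decodes one metadata record at a time and inserts one packed value per key, with later entries overwriting earlier ones. The skeleton below is illustrative only: the EntryMeta type, its fields, the stubbed decoder, and where the tag comes from during the scan are all assumptions standing in for the real EntryMetadata parsing:

```rust
use std::collections::HashMap;

/// Hypothetical view of one entry's metadata; the real EntryMetadata
/// layout is covered in Entry Structure and Metadata.
struct EntryMeta {
    key_hash: u64,
    tag: u16,
    next_offset: u64, // where the following entry's metadata begins
}

/// Hypothetical decoder stub standing in for the real metadata parser.
fn read_entry_meta(_mmap: &[u8], _pos: u64) -> EntryMeta {
    unimplemented!("parse EntryMetadata at the given offset")
}

/// Skeletal forward scan: later entries overwrite earlier ones, so the
/// finished map points at the newest version of every key.
fn build_index(mmap: &[u8], tail_offset: u64) -> HashMap<u64, u64> {
    let mut map = HashMap::new();
    let mut pos = 0u64;
    while pos < tail_offset {
        let meta = read_entry_meta(mmap, pos);
        map.insert(meta.key_hash, pack(meta.tag, pos));
        pos = meta.next_offset;
    }
    map
}
```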
Index Updates During Writes
After each write operation, the reindex() method updates the in-memory index with new key mappings:
graph LR
WriteOp["Write Operation\n(single or batch)"]
FlushFile["Flush to disk"]
Reindex["reindex()"]
subgraph "Reindex Steps"
RemapFile["Create new mmap"]
LockIndex["Acquire RwLock write"]
UpdateEntries["Update key_hash → offset"]
HandleDeletes["Remove deleted keys"]
UpdateTail["Store tail_offset (Atomic)"]
end
WriteOp --> FlushFile
FlushFile --> Reindex
Reindex --> RemapFile
RemapFile --> LockIndex
LockIndex --> UpdateEntries
UpdateEntries --> HandleDeletes
HandleDeletes --> UpdateTail
Critical: The file must be flushed before remapping to ensure newly written data is visible in the new memory-mapped view.
Sources: src/storage_engine/data_store.rs:224-259 src/storage_engine/data_store.rs:814-824
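As a sketch of the in-memory update step (structure and names are assumed from the diagram; the flush and remap have already happened by the time this runs, per the Critical note above):

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the post-write index update.
fn apply_reindex(
    index: &RwLock<HashMap<u64, u64>>,
    tail_offset: &AtomicU64,
    new_entries: &[(u64, u64)], // (key_hash, packed tag + offset)
    deleted_hashes: &[u64],
    new_tail: u64,
) {
    let mut map = index.write().expect("index lock poisoned"); // exclusive access
    for &(key_hash, packed) in new_entries {
        map.insert(key_hash, packed); // newest offset wins
    }
    for hash in deleted_hashes {
        map.remove(hash); // drop tombstoned keys from the index
    }
    tail_offset.store(new_tail, Ordering::Release); // publish the new tail
}
```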
Concurrent Access and Locking
The KeyIndexer is protected by an Arc<RwLock<KeyIndexer>> wrapper, enabling multiple concurrent readers while ensuring exclusive access for writers.
| Operation | Lock Type | Concurrency |
|---|---|---|
| read() | Read lock | Multiple threads can read in parallel |
| batch_read() | Read lock | Multiple threads can read in parallel |
| write() | Write lock | Single writer, blocks all readers |
| batch_write() | Write lock | Single writer, blocks all readers |
| delete() | Write lock | Single writer, blocks all readers |
Sources: src/storage_engine/data_store.rs:31 src/storage_engine/data_store.rs:1042-1049 src/storage_engine/data_store.rs:852-856
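The reader-side pattern, as a minimal sketch reusing the KeyIndexer API shown earlier:

```rust
use std::sync::{Arc, RwLock};

/// Sketch of the reader-side locking pattern: any number of threads may
/// hold the read lock at once; write paths take the write lock instead.
fn read_packed(index: &Arc<RwLock<KeyIndexer>>, key_hash: u64) -> Option<u64> {
    let guard = index.read().expect("index lock poisoned");
    guard.get_packed(&key_hash) // lock is released when guard drops
}
```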
Performance Characteristics
Lookup Performance
The indexing system provides O(1) average-case lookup performance. Benchmarks demonstrate:
- 1 million random seeks (8-byte entries): typically < 1 second
- Hash computation overhead: negligible due to hardware acceleration
- Tag verification overhead: minimal (single comparison)
Sources: README.md:166-167
Memory Overhead
Each index entry consumes:
- 8 bytes for the key hash (map key)
- 8 bytes for the packed value (map value)
- Total: 16 bytes of entry data per unique key
For a dataset with 1 million unique keys, the index occupies approximately 16 MB of RAM (slightly more once HashMap bucket and load-factor overhead is counted).
Hardware Acceleration Impact
The use of SIMD instructions in XXH3_64 significantly improves hash computation speed:
graph LR
subgraph "Scalar Baseline"
ScalarOps["Byte-by-byte\nprocessing"]
ScalarTime["~100% time"]
end
subgraph "SSE2 (x86_64)"
SSE2Ops["16-byte chunks\nparallel"]
SSE2Time["~40-60% time"]
end
subgraph "AVX2 (x86_64)"
AVX2Ops["32-byte chunks\nparallel"]
AVX2Time["~30-50% time"]
end
subgraph "NEON (aarch64)"
NEONOps["16-byte chunks\nparallel"]
NEONTime["~40-60% time"]
end
ScalarOps --> ScalarTime
SSE2Ops --> SSE2Time
AVX2Ops --> AVX2Time
NEONOps --> NEONTime
Performance gains vary by workload but are most significant for batch operations that process many keys.
Sources: README.md:160-165 src/storage_engine/data_store.rs:839-842
Error Handling and Collision Management
Hash Collision Detection
The dual-layer verification (64-bit hash + 16-bit tag) provides strong collision resistance:
- Probability that two distinct keys share the same 64-bit XXH3_64 hash: ~1 in 2^64
- Probability that they also share the same 16-bit tag: ~1 in 2^16
- Combined probability: ~1 in 2^80, since the tag is derived independently of the hash, so the two probabilities multiply
Write-Time Collision Handling
If a collision is detected during a write operation (same hash but different tag), the batch write operation fails entirely, as sketched below.
This prevents inconsistent index states and ensures data integrity.
Sources: src/storage_engine/data_store.rs:245-251
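A sketch of that guard, reusing the earlier helpers (the concrete error type and message are assumptions):

```rust
use std::io;

/// Sketch of the write-path collision guard.
fn check_for_collision(indexer: &KeyIndexer, key: &[u8], key_hash: u64) -> io::Result<()> {
    if let Some(packed) = indexer.get_packed(&key_hash) {
        let (stored_tag, _) = unpack(packed);
        if stored_tag != tag_from_key(key) {
            // Same 64-bit hash, different key bytes: fail the whole batch.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "key hash collision detected",
            ));
        }
    }
    Ok(())
}
```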
Read-Time Collision Handling
If a tag mismatch is detected during a read, the system logs a warning and returns None rather than returning another key's entry.
Sources: src/storage_engine/data_store.rs:513-521
Summary
The key indexing and hashing system in SIMD R Drive provides:
- Fast lookups: O(1) hash-based access to entries
- Hardware acceleration: Automatic SIMD optimization on SSE2, AVX2, and NEON platforms
- Collision resistance: Dual-layer verification with 64-bit hashes and 16-bit tags
- Thread safety: Concurrent reads with exclusive writes via RwLock
- Low memory overhead: 16 bytes per unique key
This design enables efficient storage operations even for datasets with millions of entries, while maintaining data integrity through robust collision detection.
Sources: src/storage_engine/data_store.rs:1-1183 README.md:158-168