
ZeroDB v2#2

Open
qdequele wants to merge 40 commits into main from v2

@qdequele

Owner

No description provided.

- Error enum with LMDB-compatible error types
- EnvFlags, DatabaseFlags, PutFlags matching LMDB constants
- Type aliases for PageNo and common types
- Page header, meta page, leaf/branch page structures
- Overflow page support for large values
- MemoryMap abstraction with read-only and read-write modes
- DataFile with batch writes and fdatasync support
- PageAllocator with freelist management
- Sorted freelist for sequential allocation (cache locality)
- PagePool for reusing page buffers
- DirtyPages tracking for write transactions
- Node operations for leaf and branch pages
- Binary search with LMDB-compatible key comparison
- Cursor with stateful traversal (first, last, next, prev)
- Page builder for constructing new pages
- Insert operations with page splitting
- Env with mmap management and transaction handling
- RoTxn/RwTxn with MVCC semantics
- Optimized commit path with batch I/O and single fdatasync
- WRITEMAP mode for direct mmap writes
- Database handle with cursor-based reads and writes
- Export all public modules and types
- Add memmap2, bitflags, page_size dependencies
- Add dev dependencies for benchmarks (criterion, heed, rocksdb)
- Basic environment and transaction tests
- B+tree operations tests
- Cursor traversal tests
- Sequential write benchmarks with fsync
- Random read benchmarks
- Iteration benchmarks
- B+tree search microbenchmarks
- Transaction overhead benchmarks
- LMDB architecture reference document
- Action plan for ZeroDB development
- Heed compatibility wrapper (work in progress)
Comprehensive list of 24 performance optimizations:
- 7 implemented (single fsync, fdatasync, batch I/O, etc.)
- 17 pending (cursor caching, prefetch, SIMD, io_uring, etc.)

Includes benchmarking checklist and platform-specific notes.
Optimizations implemented:
- Cursor page cache with LRU eviction (up to 16 pages)
- search_cached() method for cached tree traversal
- madvise() support for Sequential/Random/WillNeed/DontNeed hints
- Page prefetching via MADV_WILLNEED
- macOS F_FULLFSYNC for guaranteed durability (flushes through to stable storage, not just to the drive's write cache)
- libc dependency for Unix syscalls
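
A minimal sketch of the cursor page cache with LRU eviction described above. The `PageCache` type, its method names, and the `u64` page number are assumptions for illustration, not the code in this PR; only the 16-page limit comes from the commit message.

```rust
/// Page number type; `u64` is an assumption for this sketch.
type PageNo = u64;

/// Maximum cached pages, matching the "up to 16 pages" limit above.
const CACHE_CAPACITY: usize = 16;

/// Tiny LRU cache: the most recently used entry sits at the back of the Vec.
pub struct PageCache {
    entries: Vec<(PageNo, Vec<u8>)>,
}

impl PageCache {
    pub fn new() -> Self {
        PageCache { entries: Vec::with_capacity(CACHE_CAPACITY) }
    }

    /// Return the cached page and mark it as most recently used.
    pub fn get(&mut self, pgno: PageNo) -> Option<&[u8]> {
        let pos = self.entries.iter().position(|(p, _)| *p == pgno)?;
        let entry = self.entries.remove(pos);
        self.entries.push(entry);
        self.entries.last().map(|(_, data)| data.as_slice())
    }

    /// Insert a page, evicting the least recently used one when full.
    pub fn insert(&mut self, pgno: PageNo, data: Vec<u8>) {
        if let Some(pos) = self.entries.iter().position(|(p, _)| *p == pgno) {
            // Replacing an existing entry never needs an eviction.
            self.entries.remove(pos);
        } else if self.entries.len() == CACHE_CAPACITY {
            // The front of the Vec is the least recently used entry.
            self.entries.remove(0);
        }
        self.entries.push((pgno, data));
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

A linear scan is reasonable at this size: 16 entries fit in a few cache lines, so a hash map would likely cost more than it saves.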
Ensure implemented and pending counts match the documented optimizations.
Add #[cold] and #[inline] hints to optimize branch prediction:
- #[cold] on error constructors in error.rs
- #[inline(always)] on frequently called methods (num_keys, is_leaf, key, value)
- #[inline] on search and parse methods in page_ops, node, and header
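
For illustration, the hints could be applied roughly as follows. The `Error` variants, `PageHeader` layout, and `check_leaf` helper are hypothetical stand-ins, and the leaf flag value is assumed to mirror LMDB's `P_LEAF`.

```rust
#[derive(Debug, PartialEq)]
pub enum Error {
    PageFull,
    Corrupted(&'static str),
}

impl Error {
    /// Error constructors are rarely executed, so `#[cold]` tells the
    /// optimizer to keep them off the hot path.
    #[cold]
    pub fn corrupted(msg: &'static str) -> Self {
        Error::Corrupted(msg)
    }
}

/// Hypothetical, simplified page header.
pub struct PageHeader {
    pub flags: u16,
    pub num_keys: u16,
}

/// Leaf flag bit; the value is an assumption for this sketch.
const LEAF_FLAG: u16 = 0x02;

impl PageHeader {
    /// Tiny accessors called on every lookup are forced inline.
    #[inline(always)]
    pub fn is_leaf(&self) -> bool {
        self.flags & LEAF_FLAG != 0
    }

    #[inline(always)]
    pub fn num_keys(&self) -> usize {
        self.num_keys as usize
    }
}

/// The success branch stays hot; the error branch calls a cold function.
#[inline]
pub fn check_leaf(header: &PageHeader) -> Result<usize, Error> {
    if header.is_leaf() {
        Ok(header.num_keys())
    } else {
        Err(Error::corrupted("expected leaf page"))
    }
}
```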
Implement hardware prefetch instructions for improved cache utilization:
- prefetch_read<T>() using x86_64 _mm_prefetch and aarch64 inline asm
- prefetch_range() for prefetching cache-line sized chunks
- Integrated into CursorOps::next() to prefetch next node during iteration

Supports x86_64, x86, and aarch64 architectures with no-op fallback.
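
A rough sketch of what `prefetch_read` and `prefetch_range` might look like. Only the x86_64 path is shown; everything else falls back to a no-op here, whereas the commit also covers x86 and aarch64. The 64-byte cache line and the returned hint count are assumptions added to make the sketch easy to check.

```rust
/// Assumed cache-line size; 64 bytes is typical on x86_64.
const CACHE_LINE: usize = 64;

/// Hint the CPU to pull the cache line containing `ptr` into cache.
/// Prefetching never faults, so any pointer value is acceptable.
#[inline(always)]
pub fn prefetch_read<T>(ptr: *const T) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        // No-op fallback on other architectures in this sketch.
        let _ = ptr;
    }
}

/// Issue one prefetch per cache-line sized chunk of `data`.
/// Returns the number of hints issued (for illustration only).
pub fn prefetch_range(data: &[u8]) -> usize {
    let mut hints = 0;
    for chunk in data.chunks(CACHE_LINE) {
        prefetch_read(chunk.as_ptr());
        hints += 1;
    }
    hints
}
```

With these assumptions, prefetching the first 2KB of a page issues 32 hints.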
Add infrastructure for deferred freelist loading:
- freelist_loaded flag to track loading state
- is_freelist_loaded() to check if freelist is loaded
- mark_freelist_loaded() for marking as loaded
- needs_freelist_load() for checking if loading is needed

This allows faster environment open by deferring freelist
loading until pages are actually needed.
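
A minimal sketch of how the deferred-loading flag could be used. The three flag methods are named in the commit; the `alloc_with_loader` method and the closure-based loader are hypothetical additions for this example.

```rust
/// Simplified allocator holding only the freelist state.
pub struct PageAllocator {
    freelist: Vec<u64>,
    freelist_loaded: bool,
}

impl PageAllocator {
    /// Opening is cheap: the freelist is not read yet.
    pub fn new() -> Self {
        PageAllocator { freelist: Vec::new(), freelist_loaded: false }
    }

    pub fn is_freelist_loaded(&self) -> bool {
        self.freelist_loaded
    }

    pub fn needs_freelist_load(&self) -> bool {
        !self.freelist_loaded
    }

    pub fn mark_freelist_loaded(&mut self) {
        self.freelist_loaded = true;
    }

    /// Hypothetical helper: load the freelist on the first allocation only,
    /// then hand out the lowest free page number.
    pub fn alloc_with_loader(&mut self, load: impl FnOnce() -> Vec<u64>) -> Option<u64> {
        if self.needs_freelist_load() {
            self.freelist = load();
            // Sort descending so `pop()` yields the lowest page first.
            self.freelist.sort_unstable_by(|a, b| b.cmp(a));
            self.mark_freelist_loaded();
        }
        self.freelist.pop()
    }
}
```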
Update OPTIMIZATIONS.md to reflect newly implemented features:
- Branch prediction hints (now implemented)
- CPU prefetch for sequential scans (now implemented)
- Lazy freelist loading (now implemented)

Total: 13 implemented, 11 pending optimizations.
Implement hardware-accelerated key comparison for keys >= 16 bytes:
- x86_64: SSE2 using _mm_loadu_si128 and _mm_cmpeq_epi8
- aarch64: NEON using vld1q_u8 and vceqq_u8
- Fallback to standard comparison for short keys or unsupported archs

Compares 16 bytes at a time, finding first differing byte position
using movemask/trailing_ones pattern.

Includes comprehensive tests for short keys, long keys, and edge cases.
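
The SSE2 path can be sketched as below. This is an illustrative version under stated assumptions, not the PR's code: the function names are hypothetical, the NEON path is omitted, and non-x86_64 targets simply use the standard slice comparison.

```rust
use std::cmp::Ordering;

/// Lexicographic key comparison with an SSE2 fast path for keys >= 16 bytes.
pub fn compare_keys(a: &[u8], b: &[u8]) -> Ordering {
    #[cfg(target_arch = "x86_64")]
    {
        if a.len() >= 16 && b.len() >= 16 {
            return compare_sse2(a, b);
        }
    }
    // Short keys and other architectures use the standard comparison.
    a.cmp(b)
}

#[cfg(target_arch = "x86_64")]
fn compare_sse2(a: &[u8], b: &[u8]) -> Ordering {
    use std::arch::x86_64::{__m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8};

    let common = a.len().min(b.len());
    let mut i = 0;
    while i + 16 <= common {
        // SAFETY: i + 16 <= common, so both unaligned 16-byte loads are in
        // bounds. SSE2 is part of the x86_64 baseline.
        let mask = unsafe {
            let va = _mm_loadu_si128(a.as_ptr().add(i) as *const __m128i);
            let vb = _mm_loadu_si128(b.as_ptr().add(i) as *const __m128i);
            // One bit per byte: 1 where the bytes are equal.
            _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) as u32
        };
        if mask != 0xFFFF {
            // The run of trailing ones ends at the first differing byte.
            let off = i + mask.trailing_ones() as usize;
            return a[off].cmp(&b[off]);
        }
        i += 16;
    }
    // Compare the remaining tail (fewer than 16 common bytes) and lengths.
    a[i..].cmp(&b[i..])
}
```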
Move SIMD key comparison from pending to implemented.
Total: 14 implemented, 10 pending optimizations.
Implement bump-pointer arena allocator to reduce heap allocation overhead:
- Allocates from pre-allocated chunks (default 64KB)
- 8-byte alignment for all allocations
- Grows automatically with new chunks when needed
- reset() allows memory reuse without deallocation
- alloc_zeroed() for zero-initialized memory
- alloc_with<T>() for typed allocations with proper alignment

Includes comprehensive tests for basic allocation, growth, and reset.
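
To illustrate the bump-pointer idea, here is a safe, simplified sketch that hands out `Slot` handles instead of raw pointers. The `Arena`, `Slot`, and method names are assumptions for this example. Offsets are aligned to 8 bytes within a chunk; a real implementation would also need the chunk base itself to be suitably aligned.

```rust
const DEFAULT_CHUNK: usize = 64 * 1024;
const ALIGN: usize = 8;

/// Handle to an allocation: which chunk, where in it, and how long.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Slot {
    pub chunk: usize,
    pub offset: usize,
    pub len: usize,
}

pub struct Arena {
    chunks: Vec<Vec<u8>>,
    current: usize,
    used: usize,
    chunk_size: usize,
}

impl Arena {
    pub fn new() -> Self {
        Self::with_chunk_size(DEFAULT_CHUNK)
    }

    pub fn with_chunk_size(chunk_size: usize) -> Self {
        Arena { chunks: vec![vec![0; chunk_size]], current: 0, used: 0, chunk_size }
    }

    /// Bump-allocate `len` bytes at the next 8-byte aligned offset.
    pub fn alloc(&mut self, len: usize) -> Slot {
        let mut start = (self.used + ALIGN - 1) & !(ALIGN - 1);
        if start + len > self.chunks[self.current].len() {
            // Current chunk is exhausted: move on, growing if necessary.
            self.current += 1;
            let need = len.max(self.chunk_size);
            if self.current == self.chunks.len() {
                self.chunks.push(vec![0; need]);
            } else if self.chunks[self.current].len() < need {
                self.chunks[self.current] = vec![0; need];
            }
            start = 0;
        }
        self.used = start + len;
        Slot { chunk: self.current, offset: start, len }
    }

    pub fn bytes_mut(&mut self, s: Slot) -> &mut [u8] {
        &mut self.chunks[s.chunk][s.offset..s.offset + s.len]
    }

    /// Rewind the bump pointer; chunks are kept for reuse.
    pub fn reset(&mut self) {
        self.current = 0;
        self.used = 0;
    }

    pub fn chunk_count(&self) -> usize {
        self.chunks.len()
    }
}
```

After `reset()`, earlier `Slot` handles refer to memory that will be overwritten, so callers must drop them first.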
- Arena allocator moved from pending to implemented (Memory #9)
- Renumbered pending items 16-24
- Updated summary table: 15 implemented, 9 pending

🤖 Generated with [Claude Code](https://claude.com/claude-code)
The Error type is only used in tests, so mark the import with
cfg(test) to avoid the unused import warning.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Added prefetch_range calls when positioning on leaf pages:
- first(): prefetch leaf for forward iteration
- last(): prefetch leaf for backward iteration
- descend_left(): prefetch when moving to next leaf
- descend_right(): prefetch when moving to previous leaf

This warms the CPU cache with the first 2KB of leaf page data
to improve iteration performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Added is_empty flag to commit_txn to detect transactions with no changes:
- No dirty pages
- No new pages allocated
- No freed pages

This infrastructure is reserved for a future optimization, once proper
WRITEMAP mode with async flush is implemented. Currently, all
transactions are still persisted to maintain durability guarantees.

Also added has_freed_pages() helper to PageAllocator.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Updated current performance status:
- Sequential Writes: ~455ms vs LMDB ~404ms (89%)
- Point Lookup: ~150ns vs LMDB ~130ns (87%)
- Read Transaction: ~14ns vs LMDB ~37ns (264% - faster than LMDB!)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Use get_page_buffer() for the meta page serialization buffer instead
of allocating a new Vec each commit. The buffer is returned to the
pool after the write completes.

This eliminates one 4KB allocation per transaction commit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Removed the buf.fill(0) call when returning buffers to the pool.

Pages are always fully overwritten before being written to disk:
- Meta pages: MetaPage::write_to() writes the entire page
- Data pages: copied in full via copy_from_slice()

This eliminates a 4KB memset per page returned to the pool,
providing ~3.5% write performance improvement.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Updated current performance status:
- Sequential Writes: ~436ms vs LMDB ~372ms (85%)
- Improvement from meta buffer pooling and skip zeroing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Add Heed-compatible typed database API with support for multiple named
databases within a single environment. Each named database gets a unique
DBI and is tracked in an in-memory registry that manages root page updates.
Add comprehensive Heed-compatible API for drop-in replacement in Meilisearch:

TLS Mode Support:
- WithTls/WithoutTls markers for transaction Send-ability
- read_txn_with_tls() and read_txn_without_tls() methods
- static_read_txn() for 'static lifetime transactions that own the Env
- Generic Env<T: TlsUsage> and RoTxn<'e, T> types
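
The marker-type pattern behind `WithTls`/`WithoutTls` can be sketched as follows. This is a simplified, hypothetical shape: it drops the `'e` lifetime and the `Env` type, and uses a raw-pointer `PhantomData` to make the transaction `!Send` by default.

```rust
use std::marker::PhantomData;

/// Marker trait for the two TLS modes.
pub trait TlsUsage {}

/// Reader slot is tied to the opening thread: the transaction must not move.
pub enum WithTls {}

/// Reader slot is tied to the transaction: it may move between threads.
pub enum WithoutTls {}

impl TlsUsage for WithTls {}
impl TlsUsage for WithoutTls {}

/// Simplified read transaction carrying only the TLS marker.
pub struct RoTxn<T: TlsUsage> {
    _tls: PhantomData<T>,
    // A raw pointer is !Send, so RoTxn is !Send unless opted in below.
    _not_send: PhantomData<*const ()>,
}

impl<T: TlsUsage> RoTxn<T> {
    pub fn new() -> Self {
        RoTxn { _tls: PhantomData, _not_send: PhantomData }
    }
}

// SAFETY (for this sketch): the struct holds no thread-bound state at all.
// In a real implementation this impl is only sound if the no-TLS reader
// really has no thread affinity.
unsafe impl Send for RoTxn<WithoutTls> {}

/// Compile-time check helper: only accepts `Send` values.
pub fn require_send<T: Send>(_value: &T) {}
```

With this shape, `require_send(&RoTxn::<WithoutTls>::new())` compiles, while the same call with `WithTls` is rejected by the compiler.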

New Types (Meilisearch required):
- DecodeIgnore for skipping value decoding
- U16<O>, U32BE<O>, U64BE<O>, U128<O>, I128<O> endian-aware integers
- BEU16, BEU32, BEU64, BEU128, BEI128 big-endian type aliases
- SerdeJson<T> and SerdeBincode<T> (serde feature)
- MdbError type alias for Error compatibility

Re-exports:
- byteorder crate types (BigEndian, LittleEndian, NativeEndian, etc.)
- Add U8 type codec for 8-bit unsigned integers
- Add BoxedError type alias for codec operations
- Export U8 and BoxedError in public API

Note: RoTxn::commit() already exists in the codebase.
Add ignored tests to verify large value handling:
- test_2gib_value: Attempts to insert a 2GiB value
- test_max_value_size: Finds the maximum working value size

Results show max value size is ~8-16KB due to missing
overflow page support. Values exceeding page capacity
return PageFull error.

Run with: cargo test test_2gib_value -- --ignored
Add LMDB-compatible overflow page support allowing values larger than
~2KB (for 4KB pages) to be stored in overflow pages.

Changes:
- Add node_max() calculation matching LMDB's me_nodemax formula
- Add overflow_pages() function using LMDB's OVPAGES formula
- Implement write_overflow_pages() to allocate and write overflow data
- Implement read_overflow_pages() to read values from overflow pages
- Update put/get/first/last to detect and handle overflow nodes
- Preserve overflow nodes when copying entries during page splits

This enables storing values up to 2GiB+ (tested and verified).
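
As a worked example, the two formulas can be written out as below. The constants (16-byte page header, minimum of 2 keys per page, 2-byte index slots) reflect my reading of LMDB's `me_nodemax` and `OVPAGES` definitions on 64-bit targets and should be treated as assumptions; the function bodies in this PR may differ.

```rust
/// Assumed page header size in bytes (LMDB's PAGEHDRSZ on 64-bit targets).
const PAGE_HEADER_SIZE: usize = 16;
/// Assumed minimum number of keys per page (LMDB's MDB_MINKEYS).
const MIN_KEYS: usize = 2;
/// Assumed size of one slot in the page's node pointer array.
const INDEX_SIZE: usize = 2;

/// Largest node that still fits inline; larger values go to overflow pages.
pub fn node_max(page_size: usize) -> usize {
    (((page_size - PAGE_HEADER_SIZE) / MIN_KEYS) & !1) - INDEX_SIZE
}

/// Number of contiguous overflow pages needed for a value of `size` bytes.
/// Only the first overflow page carries a header.
pub fn overflow_pages(size: usize, page_size: usize) -> usize {
    (PAGE_HEADER_SIZE - 1 + size) / page_size + 1
}
```

Under these assumptions, `node_max(4096)` is 2038, which matches the "~2KB for 4KB pages" threshold, and a value of 4081 bytes is the smallest that needs two overflow pages.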

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Fix unit_cmp error in types.rs test
- Remove unused imports across codebase
- Add #[allow(dead_code)] for code planned for future use
- Add #[allow(clippy::type_complexity)] for complex return types
- Add #[allow(clippy::large_enum_variant)] for transaction enums
- Fix collapsible_if patterns
- Apply cargo fmt formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Added 'heed/' to .gitignore to exclude the Heed subproject from version control.
- Deleted ACTION_PLAN.md and OPTIMIZATIONS.md as they are no longer needed.
- Removed the Heed subproject reference from the repository.

This cleanup helps streamline the project structure and maintain focus on the core implementation.
- Create comprehensive README.md with usage examples, API docs, and benchmarks
- Add CONTRIBUTING.md with development guidelines and code style
- Fix lib.rs doctests to be runnable (add tempdir, max_dbs, proper syntax)
- Add iteration and safety documentation to lib.rs

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Add fast path for in-place insertion when page has enough space
  - Directly manipulates page buffer without rebuilding
  - Shifts only node pointers (2 bytes), not node data
  - Avoids O(n) key/value copying to intermediate Vec

- Optimize rebuild path for splits
  - Calculate split requirement upfront
  - Single-pass page building instead of try-then-rebuild
  - Direct iteration over source page nodes

- Add free_space() method to LeafPage

- Add profiling examples (profile_writes, profile_reads, bench_insert)

- Enable debug symbols in release profile for profiling

Performance improvement:
- insert_into_leaf CPU time: -89% (50.6% -> 5.5%)
- insert_into_tree CPU time: -48% (69.9% -> 36.3%)
- Bottleneck shifted from CPU to I/O (fsync)
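
The in-place fast path can be illustrated with a toy page layout. The layout here is an assumption made for the sketch (a 4-byte header, 2-byte node pointers growing upward, node data growing downward from the end of the page) and is simpler than the real page format.

```rust
/// Toy header: bytes 0..2 = key count, bytes 2..4 = start of node data.
const HDR: usize = 4;

/// Toy leaf page; the page size must fit in a u16.
pub struct LeafPage {
    buf: Vec<u8>,
}

impl LeafPage {
    pub fn new(size: usize) -> Self {
        let mut page = LeafPage { buf: vec![0u8; size] };
        page.put_u16(2, size); // node data region starts empty, at page end
        page
    }

    fn get_u16(&self, at: usize) -> usize {
        u16::from_le_bytes([self.buf[at], self.buf[at + 1]]) as usize
    }

    fn put_u16(&mut self, at: usize, v: usize) {
        self.buf[at..at + 2].copy_from_slice(&(v as u16).to_le_bytes());
    }

    pub fn num_keys(&self) -> usize {
        self.get_u16(0)
    }

    /// Bytes between the end of the pointer array and the node data.
    pub fn free_space(&self) -> usize {
        self.get_u16(2) - (HDR + 2 * self.num_keys())
    }

    pub fn key(&self, i: usize) -> &[u8] {
        let off = self.get_u16(HDR + 2 * i);
        let klen = self.get_u16(off);
        &self.buf[off + 4..off + 4 + klen]
    }

    /// Insert at sorted position `idx` without rebuilding the page.
    /// Returns false when the caller must take the split/rebuild path.
    pub fn insert_at(&mut self, idx: usize, key: &[u8], val: &[u8]) -> bool {
        let n = self.num_keys();
        let node_len = 4 + key.len() + val.len();
        if idx > n || self.free_space() < node_len + 2 {
            return false;
        }
        // Write the node just below the existing node data.
        let off = self.get_u16(2) - node_len;
        self.put_u16(off, key.len());
        self.put_u16(off + 2, val.len());
        self.buf[off + 4..off + 4 + key.len()].copy_from_slice(key);
        self.buf[off + 4 + key.len()..off + node_len].copy_from_slice(val);
        // Shift only the 2-byte pointers after `idx`; node data stays put.
        let start = HDR + 2 * idx;
        self.buf.copy_within(start..HDR + 2 * n, start + 2);
        self.put_u16(start, off);
        self.put_u16(0, n + 1);
        self.put_u16(2, off);
        true
    }
}
```

Existing nodes never move, so an insert costs one node write plus a `memmove` of at most `2 * num_keys` bytes.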

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Add thread-local page buffer pool to avoid repeated allocations
- PageBuilder::new_leaf and new_branch now reuse buffers from pool
- Buffers are zeroed before reuse for safety
- Pool limited to 8 buffers per thread to avoid memory bloat
- Export return_page_buffer/return_page_buffers for callers to recycle

This reduces allocation pressure during page splits and rebuilds,
though the impact is minimal since insertion is now I/O bound.
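
A minimal sketch of a thread-local page buffer pool along these lines. `return_page_buffer` is named in the commit; `get_page_buffer` is borrowed from an earlier commit message, while `pooled_count` and the fixed 4KB `PAGE_SIZE` are assumptions for this example.

```rust
use std::cell::RefCell;

/// Assumed page size for this sketch.
const PAGE_SIZE: usize = 4096;
/// Cap from the commit message: at most 8 buffers per thread.
const MAX_POOLED: usize = 8;

thread_local! {
    /// Per-thread stack of reusable page buffers; no locking required.
    static PAGE_POOL: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

/// Take a zeroed page buffer, reusing a pooled allocation when available.
pub fn get_page_buffer() -> Vec<u8> {
    match PAGE_POOL.with(|pool| pool.borrow_mut().pop()) {
        Some(mut buf) => {
            // Zero before reuse so stale page contents never leak.
            buf.fill(0);
            buf
        }
        None => vec![0u8; PAGE_SIZE],
    }
}

/// Give a buffer back; it is dropped if the pool is already full.
pub fn return_page_buffer(buf: Vec<u8>) {
    PAGE_POOL.with(|pool| {
        let mut pool = pool.borrow_mut();
        if pool.len() < MAX_POOLED && buf.len() == PAGE_SIZE {
            pool.push(buf);
        }
    });
}

/// Number of buffers currently pooled on this thread (illustrative helper).
pub fn pooled_count() -> usize {
    PAGE_POOL.with(|pool| pool.borrow().len())
}
```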

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Ignore *.svg flamegraph files
- Ignore *.trace profiling traces
- Ignore perf.data files

🤖 Generated with [Claude Code](https://claude.com/claude-code)