A high-performance C++23 application for generating binary deltas between files using content-defined chunking with Rabin fingerprints and BLAKE-512 hashing.
- Content-Defined Chunking: Dynamic chunk boundaries based on file content, not fixed positions
- Shift-Resistant: Handles insertions and deletions efficiently without affecting surrounding chunks
- Dual Hashing: Combines fast Rabin fingerprints with cryptographically strong BLAKE-512
- Adaptive Chunking: Intelligent chunk size adaptation for optimal performance across file sizes
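The rolling-window idea behind the features above can be illustrated with a short sketch. Note the assumptions: this is a simplified polynomial (Rabin-Karp style) rolling hash computed modulo 2^64, not a true Rabin fingerprint over GF(2), and the `RollingHash` class, its `BASE` multiplier, and its method names are hypothetical and not taken from the project.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the project's code): polynomial rolling hash over a
// fixed-size window, computed modulo 2^64 through unsigned wraparound.
class RollingHash {
public:
    // `window` must be at least 1.
    explicit RollingHash(std::size_t window) : window_(window) {
        // top_ = BASE^(window - 1): the weight of the oldest byte in the window.
        for (std::size_t i = 1; i < window_; ++i) top_ *= BASE;
    }

    // Hash the `window` bytes starting at data[pos]; the caller guarantees
    // that pos + window does not exceed data.size().
    std::uint64_t init(const std::vector<std::uint8_t>& data, std::size_t pos = 0) {
        hash_ = 0;
        for (std::size_t i = 0; i < window_; ++i) hash_ = hash_ * BASE + data[pos + i];
        return hash_;
    }

    // Slide the window by one byte in O(1): drop `outgoing`, append `incoming`.
    std::uint64_t roll(std::uint8_t outgoing, std::uint8_t incoming) {
        hash_ = (hash_ - outgoing * top_) * BASE + incoming;
        return hash_;
    }

private:
    static constexpr std::uint64_t BASE = 257;  // arbitrary odd multiplier (assumption)
    std::size_t window_;
    std::uint64_t top_ = 1;
    std::uint64_t hash_ = 0;
};
```

Rolling from one window to the next yields the same value as hashing the new window from scratch, which is what makes it cheap to test every byte position as a potential chunk boundary.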
# Clone the repository
git clone https://github.com/asmie/roll.git
cd roll
# Build the project
mkdir build && cd build
cmake .. && make -j4
# Generate a delta between two files
./rolling_hash oldfile.txt newfile.txt delta.bin

- Compiler: GCC 11+, Clang 13+, or MSVC 2019+ with C++23 support
- Build System: CMake 3.16 or newer
- Memory: ~100MB for compilation
- Optional: Google Test (automatically fetched if not found)
# Standard build
mkdir build
cd build
cmake ..
make -j$(nproc)

# Debug build
mkdir build-debug
cd build-debug
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j$(nproc)

# Release build
mkdir build-release
cd build-release
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)

# Windows build (Visual Studio 2019)
mkdir build
cd build
cmake -G "Visual Studio 16 2019" ..
cmake --build . --config Release

Generate a delta file between two versions:
./rolling_hash original.txt modified.txt changes.delta

The delta file uses a binary format containing:
- Chunk signatures (64-bit Rabin fingerprints)
- BLAKE-512 hashes (64 bytes per chunk)
- Chunk metadata (size, offset)
- Binary diff data for modified chunks
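As a rough illustration of the per-chunk fields listed above, the sketch below groups them into one record. The exact on-disk layout is not described here, so the `ChunkRecord` name, the field order, and the widths chosen for `offset` and `size` are assumptions; only the 64-bit fingerprint and the 64-byte hash come from the list.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical in-memory view of one chunk entry (not the project's actual format).
struct ChunkRecord {
    std::uint64_t fingerprint;                 // 64-bit Rabin fingerprint (weak signature)
    std::array<std::uint8_t, 64> strong_hash;  // 64-byte BLAKE-512 digest
    std::uint64_t offset;                      // chunk start within the file (assumed width)
    std::uint32_t size;                        // chunk length in bytes (assumed width)
};

// Compile-time checks for the two sizes stated in the format description.
static_assert(sizeof(ChunkRecord::fingerprint) == 8, "fingerprint is 64 bits");
static_assert(sizeof(ChunkRecord::strong_hash) == 64, "strong hash is 64 bytes");
```

A real serializer would write these fields one by one with a fixed byte order rather than dumping the struct, since padding and endianness differ between platforms.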
A delta viewer utility is provided to inspect delta files:
# Build the delta viewer with the rest of the C++23 project
cmake --build . --target delta_viewer
# View delta file contents
./delta_viewer changes.delta

Example output:
Delta File Viewer - Analyzing: changes.delta
========================================
Chunk #1:
Type: MODIFIED
Signature: 0x7cf82b36
Chunk Size: 1012 bytes
Hash (first 8 bytes): f7 7d 84 02 04 ee 91 fd ...
Modifications:
Modify byte at position 172 to 0x4e ('N')
Modify byte at position 173 to 0x45 ('E')
... and 158 more changes
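The "Modify byte at position ..." lines in this output suggest that a modified chunk is stored as a list of (position, new value) pairs. A minimal sketch of how such a list could be produced is shown here; `ByteChange` and `diff_bytes` are hypothetical names, and the sketch only compares the range both chunks have in common, which is an assumption about how length differences are handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: "set the byte at `position` to `value`".
struct ByteChange {
    std::size_t position;
    std::uint8_t value;
};

// Collect every position where the new chunk differs from the old one.
// Bytes beyond the shorter chunk are ignored in this sketch.
std::vector<ByteChange> diff_bytes(const std::vector<std::uint8_t>& old_chunk,
                                   const std::vector<std::uint8_t>& new_chunk) {
    std::vector<ByteChange> changes;
    const std::size_t common = std::min(old_chunk.size(), new_chunk.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (old_chunk[i] != new_chunk[i]) {
            changes.push_back({i, new_chunk[i]});
        }
    }
    return changes;
}
```

Applying the changes in order to a copy of the old chunk reproduces the new chunk whenever both have the same length.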
- Chunking Phase
  - Files are divided into variable-sized chunks using a rolling hash window
  - Chunk boundaries occur when the hash matches specific criteria
  - Adaptive sizing: 512 bytes (min) to 16 KB (max)
- Signature Generation
  - Each chunk gets a Rabin fingerprint (fast, weak hash)
  - Each chunk gets a BLAKE-512 hash (slow, strong hash)
  - Creates a signature list for each file
- Delta Computation
  - Compares chunk lists between original and new files
  - Identifies: unchanged, added, removed, and modified chunks
  - For modified chunks, generates byte-level diffs
- Output Generation
  - Writes binary delta file with all changes
  - Optimized format for minimal size and fast application
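The comparison step can be sketched as a two-level lookup: the weak fingerprint narrows the candidates, and the strong hash confirms a match. This is only an illustration under stated assumptions: `Signature`, `DeltaSummary`, and `classify` are hypothetical names, and the sketch covers unchanged, added, and removed chunks only, because the rule that pairs an old chunk with its modified counterpart is not described here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-chunk signature: weak fingerprint plus 64-byte strong hash.
struct Signature {
    std::uint64_t weak;
    std::array<std::uint8_t, 64> strong;
};

struct DeltaSummary {
    std::vector<std::size_t> unchanged;  // indices into new_chunks
    std::vector<std::size_t> added;      // indices into new_chunks
    std::vector<std::size_t> removed;    // indices into old_chunks
};

DeltaSummary classify(const std::vector<Signature>& old_chunks,
                      const std::vector<Signature>& new_chunks) {
    // Index old chunks by weak fingerprint so that most lookups never
    // need to compare the 64-byte strong hashes.
    std::unordered_multimap<std::uint64_t, std::size_t> by_weak;
    for (std::size_t i = 0; i < old_chunks.size(); ++i) {
        by_weak.emplace(old_chunks[i].weak, i);
    }

    std::vector<bool> old_seen(old_chunks.size(), false);
    DeltaSummary out;
    for (std::size_t n = 0; n < new_chunks.size(); ++n) {
        bool found = false;
        auto [first, last] = by_weak.equal_range(new_chunks[n].weak);
        for (auto it = first; it != last; ++it) {
            // A weak match is only a hint; the strong hash decides.
            if (old_chunks[it->second].strong == new_chunks[n].strong) {
                old_seen[it->second] = true;
                found = true;
            }
        }
        (found ? out.unchanged : out.added).push_back(n);
    }
    for (std::size_t i = 0; i < old_chunks.size(); ++i) {
        if (!old_seen[i]) out.removed.push_back(i);
    }
    return out;
}
```

Because matching is by content rather than by position, a chunk that merely moved within the file still counts as unchanged in this sketch.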
The application uses content-defined chunking with adaptive boundaries:
// Adaptive boundary detection based on chunk size
if (chunk.size() >= MAX_CHUNK_SIZE) {
    // Force boundary at maximum size
    boundary_found = true;
} else if (chunk.size() >= MIN_CHUNK_SIZE) {
    // Use adaptive mask based on current size
    uint32_t mask = chunk.size() < 2048 ? 0x1FF :   // 1/512 probability
                    chunk.size() < 4096 ? 0x7FF :   // 1/2048 probability
                                          0x1FFF;   // 1/8192 probability
    boundary_found = ((((last << 8) | b) & mask) == 0);
}

This approach ensures:
- Small files generate appropriate chunks
- Large files maintain efficiency
- Chunk boundaries remain content-dependent
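To see the boundary rule above in a complete loop, here is a self-contained sketch that returns the chunk sizes for an in-memory buffer. It rests on assumptions: `chunk_sizes` is a hypothetical function, `last` is taken to be the previous input byte, and the 512-byte and 16 KB limits are the ones quoted in this document.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t MIN_CHUNK_SIZE = 512;        // minimum from this document
constexpr std::size_t MAX_CHUNK_SIZE = 16 * 1024;  // maximum from this document

// Hypothetical driver around the adaptive boundary test.
std::vector<std::size_t> chunk_sizes(const std::vector<std::uint8_t>& data) {
    std::vector<std::size_t> sizes;
    std::size_t current = 0;  // bytes in the chunk being built
    std::uint32_t last = 0;   // previous byte (assumption about `last`)
    for (std::uint8_t b : data) {
        ++current;
        bool boundary_found = false;
        if (current >= MAX_CHUNK_SIZE) {
            boundary_found = true;  // force a boundary at the maximum size
        } else if (current >= MIN_CHUNK_SIZE) {
            // Larger chunks use a wider mask, so a boundary becomes less likely.
            const std::uint32_t mask = current < 2048 ? 0x1FFu
                                     : current < 4096 ? 0x7FFu
                                                      : 0x1FFFu;
            boundary_found = ((((last << 8) | b) & mask) == 0);
        }
        last = b;
        if (boundary_found) {
            sizes.push_back(current);
            current = 0;
        }
    }
    if (current > 0) sizes.push_back(current);  // trailing partial chunk
    return sizes;
}
```

With this sketch, a buffer of 2048 zero bytes splits into four 512-byte chunks, because the masked value is zero as soon as the minimum size is reached. A buffer of 0xFF bytes never satisfies the mask and is cut only at the 16 KB limit.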
| File Size | Chunk Count | Delta Generation | Memory Usage |
|---|---|---|---|
| 1 KB | 1-2 | < 1ms | ~1 MB |
| 100 KB | 10-20 | ~5ms | ~2 MB |
| 10 MB | 500-1000 | ~100ms | ~15 MB |
| 1 GB | 50K-100K | ~10s | ~500 MB |
- Minimized Memory Allocation: Smart pointer usage
- Efficient I/O: Buffered reading with configurable chunk sizes
- Parallel-Ready: Thread-safe design for future parallelization
- Cache-Friendly: Sequential memory access patterns
# Build and run all tests
make rolling_hash_unit
./rolling_hash_unit
# Run tests with detailed output
./rolling_hash_unit --gtest_verbose

- Unit Tests: 20+ test cases covering all major components
- Memory Tests: Valgrind verified (no leaks)
- Integration Tests: Full pipeline testing with various file types
valgrind --leak-check=full ./rolling_hash file1.txt file2.txt delta.out

We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Use descriptive variable names
- Follow existing formatting (tabs for indentation)
- Add comments for complex logic
- Include unit tests for new features
- Update documentation as needed
- All tests must pass
- No memory leaks (Valgrind clean)
- No compiler warnings
- Code coverage for new features
Found a bug? Please report it on the GitHub issue tracker with:
- Description of the issue
- Steps to reproduce
- Expected vs actual behavior
- System information (OS, compiler version)
- Sample files if applicable
This project is licensed under the MIT License - see the LICENSE file for details.
- Piotr Olszewski - Initial work - asmie
- BLAKE hash implementation from the public domain reference
- Inspired by rsync's rolling checksum algorithm
- Content-defined chunking concepts from LBFS