CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Development Commands

Building and Testing

```sh
# Format code (MANDATORY before commits)
cargo fmt --all

# Run clippy linter with strict settings
cargo clippy --all-features -- -D warnings

# Run all Rust tests
cargo test --release

# Run comprehensive test script (includes Python tests)
./scripts/test.sh

# Build Python package with maturin
maturin develop --features python

# Run Python tests
pytest tests/ -v

# Run benchmarks
cargo bench

# Check for unused dependencies
cargo udeps --all-targets

# Publish dry run
cargo publish --dry-run
```

Single Test Execution

```sh
# Run a specific Rust test
cargo test test_name --release

# Run a specific Python test
pytest tests/test_file.py::test_name -v

# Run tests with output
cargo test -- --nocapture
```

Architecture Overview

Core Encryption Process

The self_encryption crate implements convergent encryption with obfuscation through a three-stage process:

  1. Content Chunking: Files are split into chunks (up to 1MB each)
  2. Per-Chunk Processing:
    • Compression (Brotli with configurable quality)
    • Encryption (AES-256-CBC)
    • XOR obfuscation
  3. Key Derivation: Each chunk's encryption keys are derived from a circular dependency pattern:
    • Chunks 0 and 1 have special handling due to circular dependencies
    • For chunk N (where N ≥ 2): uses hashes from chunks N, (N+1) % total, (N+2) % total
    • Creates interdependency where modifying any chunk affects multiple others
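The wrap-around index arithmetic for chunks near the end of the file can be sketched as follows (the helper name is illustrative, not the crate's API):

```rust
/// Hypothetical helper illustrating the modular index arithmetic described
/// above: for chunk `n` (n >= 2), keys are derived from the source hashes of
/// chunks n, (n + 1) % total, and (n + 2) % total.
fn key_source_indices(n: usize, total: usize) -> (usize, usize, usize) {
    assert!(total >= 3, "self-encryption requires at least 3 chunks");
    (n, (n + 1) % total, (n + 2) % total)
}

fn main() {
    // With 5 chunks, the last chunk wraps around to chunks 0 and 1,
    // which is what creates the circular dependency.
    assert_eq!(key_source_indices(4, 5), (4, 0, 1));
    assert_eq!(key_source_indices(2, 5), (2, 3, 4));
}
```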

Key Components

  • src/lib.rs: Main library interface, exports public API including encrypt, decrypt_full_set
  • src/encrypt.rs: Core encryption logic, handles chunk processing and key generation
  • src/decrypt.rs: Decryption logic, reverses the encryption process
  • src/data_map.rs: DataMap structure that stores chunk metadata (src/dst hashes, sizes, indices)
  • src/stream.rs: Streaming encryption/decryption for memory-efficient large file handling
  • src/chunk.rs: Chunk data structures (EncryptedChunk, ChunkInfo) and validation
  • src/aes.rs: AES encryption implementation using CBC mode
  • src/utils.rs: Utility functions for key derivation, hash extraction, chunk size calculation
  • src/python.rs: PyO3 bindings for Python interface
  • src/error.rs: Error types and handling

Storage Backend Design

The library uses a trait-based design for flexible storage backends:

  • Store functions: Fn(XorName, Bytes) -> Result<()>
  • Retrieve functions: Fn(XorName) -> Result<Bytes>
  • Supports memory, disk, or custom storage implementations
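A minimal in-memory backend matching that closure shape might look like this (`XorName` and `Bytes` are stand-in type aliases here, not the crate's real types from `xor_name` and `bytes`):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Placeholder types standing in for the crate's XorName and Bytes.
type XorName = [u8; 32];
type Bytes = Vec<u8>;

/// Build a store/retrieve closure pair backed by a shared in-memory map,
/// matching the Fn(XorName, Bytes) -> Result<()> / Fn(XorName) -> Result<Bytes>
/// shape described above.
fn memory_backend() -> (
    impl Fn(XorName, Bytes) -> Result<(), String>,
    impl Fn(XorName) -> Result<Bytes, String>,
) {
    let map = Arc::new(Mutex::new(HashMap::new()));
    let store_map = Arc::clone(&map);
    let store = move |name: XorName, data: Bytes| {
        store_map.lock().unwrap().insert(name, data);
        Ok(())
    };
    let retrieve = move |name: XorName| {
        map.lock()
            .unwrap()
            .get(&name)
            .cloned()
            .ok_or_else(|| "chunk not found".to_string())
    };
    (store, retrieve)
}

fn main() {
    let (store, retrieve) = memory_backend();
    let name = [0u8; 32];
    store(name, b"chunk bytes".to_vec()).unwrap();
    assert_eq!(retrieve(name).unwrap(), b"chunk bytes".to_vec());
}
```

A disk-backed or network-backed implementation plugs in the same way: only the closure bodies change, not the encryption code calling them.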

DataMap Hierarchy

For large files, DataMaps can be shrunk hierarchically:

  • Serialize large DataMap → Encrypt as data → Create new smaller DataMap
  • Process repeats until manageable size reached
  • child field tracks hierarchy level
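The shrink loop can be sketched conceptually like this; `DataMap`, `shrink_once`, and the 10x reduction factor below are illustrative stand-ins, not the crate's actual API or numbers:

```rust
// Conceptual model of hierarchical DataMap shrinking.
#[derive(Debug)]
struct DataMap {
    serialized_len: usize, // size of the serialized map, in bytes
    child: Option<u32>,    // hierarchy level, as in the real `child` field
}

/// Stand-in for "serialize the DataMap, self-encrypt those bytes, and get a
/// new, smaller DataMap back" — here we only model the size reduction.
fn shrink_once(map: &DataMap) -> DataMap {
    DataMap {
        serialized_len: map.serialized_len / 10, // assumed ~10x reduction
        child: Some(map.child.map_or(1, |c| c + 1)),
    }
}

/// Repeat until the serialized map fits under `limit` bytes.
fn shrink_to_fit(mut map: DataMap, limit: usize) -> DataMap {
    while map.serialized_len > limit {
        map = shrink_once(&map);
    }
    map
}

fn main() {
    let big = DataMap { serialized_len: 5_000_000, child: None };
    let small = shrink_to_fit(big, 100_000);
    assert!(small.serialized_len <= 100_000);
    assert_eq!(small.child, Some(2)); // two shrink rounds were needed
}
```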

Critical Constraints

  • Minimum file size: 3072 bytes (3 * MIN_CHUNK_SIZE) for self-encryption
  • Chunk size: Maximum 1MB per chunk
  • Key security: The returned secret key from encryption requires secure handling
  • Hash verification: All chunks are self-validating through SHA3-256 hashes
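The size constraints above imply some simple arithmetic, sketched here with assumed constant names (`MIN_CHUNK_SIZE = 1024` is inferred from 3 * MIN_CHUNK_SIZE = 3072; the chunk-count formula is a naive illustration, not the crate's exact algorithm):

```rust
const MIN_CHUNK_SIZE: usize = 1024; // inferred: 3 * 1024 = 3072-byte minimum
const MAX_CHUNK_SIZE: usize = 1024 * 1024; // 1 MB ceiling per chunk

/// Can this payload be self-encrypted at all? It needs at least 3 chunks.
fn can_self_encrypt(len: usize) -> bool {
    len >= 3 * MIN_CHUNK_SIZE
}

/// Naive chunk count under the 1 MB cap: at least 3 chunks, then as many
/// 1 MB chunks as the data requires.
fn chunk_count(len: usize) -> usize {
    let by_size = (len + MAX_CHUNK_SIZE - 1) / MAX_CHUNK_SIZE;
    by_size.max(3)
}

fn main() {
    assert!(!can_self_encrypt(3071)); // one byte short of the minimum
    assert!(can_self_encrypt(3072));
    assert_eq!(chunk_count(4096), 3); // small files still get 3 chunks
    assert_eq!(chunk_count(10 * 1024 * 1024), 10); // 10 MB → ten 1 MB chunks
}
```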

Python Bindings

The Python interface is built with PyO3 and maturin:

  • CLI tool: self-encryption command
  • Module: self_encryption Python package
  • Supports both in-memory and streaming operations

CI/CD Workflow

  • PR checks: Format, clippy, tests, coverage, unused deps
  • Warnings as errors: RUSTFLAGS="-D warnings" enforced in CI
  • Code coverage: Uses cargo-llvm-cov and reports to coveralls.io
  • 32-bit testing: Includes i686 target testing
  • Python package: Automated publishing via GitHub Actions

Performance Considerations

  • Parallel chunk processing via rayon in standard implementation
  • Streaming APIs for memory efficiency with large files
  • Benchmarks in benches/lib.rs for tracking performance
  • Optimized compression settings in Brotli
  • Chunk size optimization based on file size

StreamSelfEncryptor Implementation Notes

The streaming implementation differs from the standard implementation in several important ways:

Design Differences

  1. Memory Usage:

    • Standard: Loads entire file into memory, processes all chunks at once
    • Streaming: Processes one chunk at a time, O(1) memory usage
  2. API Pattern:

    • Standard: Functional approach with encrypt(bytes) -> (DataMap, Vec<EncryptedChunk>)
    • Streaming: Stateful object with next_encryption() returning chunks incrementally
  3. Chunk Processing:

    • Standard: Special handling for chunks 0 and 1 (deferred processing due to circular dependencies)
    • Streaming: Processes all chunks uniformly (potential issue)
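The stateful streaming pattern contrasted above can be mocked like this; `MockStreamEncryptor` and its fields are illustrative only (the "encryption" is a plain copy), showing the incremental `next_encryption()` driver loop rather than the crate's real implementation:

```rust
struct EncryptedChunk {
    index: usize,
    content: Vec<u8>,
}

// Mock of the stateful streaming object: chunks come back one at a time
// from next_encryption() instead of all at once from encrypt(bytes).
struct MockStreamEncryptor {
    chunks: Vec<Vec<u8>>, // pending plaintext chunks
    next: usize,
}

impl MockStreamEncryptor {
    /// Return the next chunk, or None when the input is exhausted.
    fn next_encryption(&mut self) -> Option<EncryptedChunk> {
        let content = self.chunks.get(self.next)?.clone(); // "encrypt" = copy here
        let chunk = EncryptedChunk { index: self.next, content };
        self.next += 1;
        Some(chunk)
    }
}

fn main() {
    let mut enc = MockStreamEncryptor {
        chunks: vec![b"aa".to_vec(), b"bb".to_vec(), b"cc".to_vec()],
        next: 0,
    };
    // Driver loop: each iteration holds only one chunk in memory, which is
    // where the streaming implementation's O(1) memory usage comes from.
    let mut count = 0;
    while let Some(chunk) = enc.next_encryption() {
        assert_eq!(chunk.index, count);
        count += 1;
    }
    assert_eq!(count, 3);
}
```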

Known Issues with StreamSelfEncryptor

  1. First Two Chunks: Does not implement the special handling for chunks 0 and 1 that the standard implementation uses. This could lead to incorrect encryption in edge cases.

  2. Error Handling: Less robust error handling than the standard implementation, particularly around chunk validation.

  3. File System Dependency: StreamSelfDecryptor uses temporary files extensively, which adds complexity and potential failure points.

When to Use Each Implementation

  • Standard Implementation: Use for files that fit comfortably in memory (< 1GB)
  • Streaming Implementation: Use for large files where memory usage is a concern
  • Note: Both implementations produce compatible output when working correctly

Potential Improvements Needed

  1. Unify Chunk Processing: Align StreamSelfEncryptor's chunk processing with standard implementation, especially for chunks 0 and 1
  2. Error Handling: Improve error handling in streaming implementation to match standard implementation's robustness
  3. Reduce File System Operations: Consider memory-mapping or buffering strategies for StreamSelfDecryptor
  4. Progress Callbacks: Add progress reporting capabilities to streaming implementation
  5. Test Coverage: Ensure streaming implementation has comprehensive tests for edge cases
  6. API Consistency: Consider refactoring to provide more consistent APIs between implementations