Dev/jodavis/initial commits for new ml pipeline#239
Merged
jodavis merged 9 commits intoJun 28, 2026
Merged
Conversation
…be used for reference only.
…core/ - sample.py: Sample(ABC), SampleWithPath(Sample,ABC), TextSample, AudioSample, SampleSpectrogram, SampleTokens dataclasses with all spec fields. - manifest.py: Manifest[S] generic with O(1) by_id/by_content_hash lookups; ManifestStore.read() dispatches on sample_type; write() emits version-1 JSON. TextSample JSON omits path/parent_content_hash/applied_values; Spectrogram and Tokens JSON omits applied_values and includes parent_id per spec. - pyproject.toml: minimal pytest config so plain `pytest` from ml/ works. - Ambiguity resolutions: mutable dataclasses (no frozen=True); seed not a default field (callers pass seed=0 explicitly, matching spec's "PhraseVariator sets seed=0"). https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
1. Manifest.__init__: raise ValueError on duplicate sample ids (UUIDs must
be unique); keep first occurrence for duplicate content_hash (documented
contract — deterministic stages can produce identical hashes).
2. ManifestStore.read(): use data.get("sample_type") and raise ValueError
explicitly when the key is absent, matching the defensive pattern used
for "version".
3. ManifestStore.write(): validate that all samples share the same type
before writing; a mixed-type Manifest would produce a JSON file whose
sample_type header mismatches later entries, causing read() to KeyError.
Adds four new tests covering each guard.
https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
Tests use pytest's tmp_path fixture; tempfile was never called. https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
TextSample now derives its id from content_hash via __post_init__, so callers no longer pass id= explicitly. This means two TextSamples with the same content always share the same id, enforcing the deduplication invariant at the type level. Update ManifestStore._deserialise to omit id= when constructing TextSample, and update tests to match: duplicate-id and duplicate-content-hash tests now use AudioSample (where id and content_hash are independent) to exercise those code paths. https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
* ADR-222: Implement seed-based randomisation engine (PassFilter, VariationGenerator) Adds ml/pipeline/core/randomization.py with: - PassFilter ABC (density, sample_domain) - MinMaxFilter: uniform rejection over [min, max] - NormalFilter: Gaussian density normalised to peak 1.0; rejects std_dev <= 0 - VariationGenerator: should_vary, generate (float rejection-sampling), generate_int (bitmask rejection), choose (direct index-0 hash) All values derived from sha256 with deterministic key patterns; no random module. generate_int special-cases range==0 to return min_val immediately. 46 unit tests covering determinism, exact values, boundary conditions, ValueError after 1000 attempts, probability convergence, and stability across MinMaxFilter range widening/narrowing using concrete seed values. * Address code review: add input validation and strengthen independence tests * VariationGenerator.generate() uses a power-of-2 modulo algorithm to improve variable value stability as parameters change * Add Python pytest to CI and validate-tests scripts * Add error handling for cd ml in validate-tests scripts --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Use dev-team plugin from feature branch * ADR-223: Implement ModifierStage abstract base class Adds ModifierStage[T_in, T_out] with the three-case transform() algorithm (skip / regen-with-stored-seed / new sample), _compute_content_hash() static helper, GC of orphaned output files, and manifest read/write integration. Also adds randomization.py (ADR-222 prerequisite) containing PassFilter, MinMaxFilter, NormalFilter, and VariationGenerator — required for modifier_stage.py to compile and for the regen path to construct generators from stored seeds. Key decisions: - ValueError from ManifestStore.read() propagates (hard error); no fallback to "no previous manifest" for unreadable/unsupported manifests. - output_dir creation is caller's responsibility; transform() does not call mkdir (consistent with DVC/entry-point owning directory setup per spec). - GC guards with output_dir.exists() so an empty first run does not raise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ADR-223: address review feedback Bound T_in TypeVar to Sample so mypy can verify content_hash access at all four call sites in transform(); removes four type: ignore[arg-type] suppressions. --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Test Results401 tests ±0 401 ✅ ±0 2m 14s ⏱️ +9s Results for commit 27c59fe. ± Comparison against base commit 8f1b8b7. This pull request removes 3 and adds 2 tests. Note that renamed tests count towards both. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Commit changes up to ADR-223 to the new branch. No changes are needed for these tasks.