Skip to content

Dev/jodavis/initial commits for new ml pipeline#239

Merged
jodavis merged 9 commits into
feature/adr-191-new-ml-pipelinefrom
dev/jodavis/initial-commits-for-new-ml-pipeline
Jun 28, 2026
Merged

Dev/jodavis/initial commits for new ml pipeline#239
jodavis merged 9 commits into
feature/adr-191-new-ml-pipelinefrom
dev/jodavis/initial-commits-for-new-ml-pipeline

Conversation

@jodavis

@jodavis jodavis commented Jun 28, 2026

Copy link
Copy Markdown
Owner

Commit changes up to ADR-223 to the new branch. No changes are needed for these tasks.

ElwoodMoves and others added 9 commits June 28, 2026 07:04
…core/

- sample.py: Sample(ABC), SampleWithPath(Sample,ABC), TextSample, AudioSample,
  SampleSpectrogram, SampleTokens dataclasses with all spec fields.
- manifest.py: Manifest[S] generic with O(1) by_id/by_content_hash lookups;
  ManifestStore.read() dispatches on sample_type; write() emits version-1 JSON.
  TextSample JSON omits path/parent_content_hash/applied_values; Spectrogram and
  Tokens JSON omits applied_values and includes parent_id per spec.
- pyproject.toml: minimal pytest config so plain `pytest` from ml/ works.
- Ambiguity resolutions: mutable dataclasses (no frozen=True); seed not a default
  field (callers pass seed=0 explicitly, matching spec's "PhraseVariator sets seed=0").

https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
1. Manifest.__init__: raise ValueError on duplicate sample ids (UUIDs must
   be unique); keep first occurrence for duplicate content_hash (documented
   contract — deterministic stages can produce identical hashes).

2. ManifestStore.read(): use data.get("sample_type") and raise ValueError
   explicitly when the key is absent, matching the defensive pattern used
   for "version".

3. ManifestStore.write(): validate that all samples share the same type
   before writing; a mixed-type Manifest would produce a JSON file whose
   sample_type header mismatches later entries, causing read() to KeyError.

Adds four new tests covering each guard.

https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
TextSample now derives its id from content_hash via __post_init__,
so callers no longer pass id= explicitly. This means two TextSamples
with the same content always share the same id, enforcing the
deduplication invariant at the type level.

Update ManifestStore._deserialise to omit id= when constructing
TextSample, and update tests to match: duplicate-id and
duplicate-content-hash tests now use AudioSample (where id and
content_hash are independent) to exercise those code paths.

https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
* ADR-222: Implement seed-based randomisation engine (PassFilter, VariationGenerator)

Adds ml/pipeline/core/randomization.py with:
- PassFilter ABC (density, sample_domain)
- MinMaxFilter: uniform rejection over [min, max]
- NormalFilter: Gaussian density normalised to peak 1.0; rejects std_dev <= 0
- VariationGenerator: should_vary, generate (float rejection-sampling),
  generate_int (bitmask rejection), choose (direct index-0 hash)

All values derived from sha256 with deterministic key patterns; no random module.
generate_int special-cases range==0 to return min_val immediately.

46 unit tests covering determinism, exact values, boundary conditions,
ValueError after 1000 attempts, probability convergence, and stability
across MinMaxFilter range widening/narrowing using concrete seed values.

* Address code review: add input validation and strengthen independence tests

* VariationGenerator.generate() uses a power-of-2 modulo algorithm to improve variable value stability as parameters change

* Add Python pytest to CI and validate-tests scripts

* Add error handling for cd ml in validate-tests scripts

---------

Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Use dev-team plugin from feature branch

* ADR-223: Implement ModifierStage abstract base class

Adds ModifierStage[T_in, T_out] with the three-case transform() algorithm
(skip / regen-with-stored-seed / new sample), _compute_content_hash() static
helper, GC of orphaned output files, and manifest read/write integration.

Also adds randomization.py (ADR-222 prerequisite) containing PassFilter,
MinMaxFilter, NormalFilter, and VariationGenerator — required for
modifier_stage.py to compile and for the regen path to construct generators
from stored seeds.

Key decisions:
- ValueError from ManifestStore.read() propagates (hard error); no fallback
  to "no previous manifest" for unreadable/unsupported manifests.
- output_dir creation is caller's responsibility; transform() does not call
  mkdir (consistent with DVC/entry-point owning directory setup per spec).
- GC guards with output_dir.exists() so an empty first run does not raise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ADR-223: address review feedback

Bound T_in TypeVar to Sample so mypy can verify content_hash access at all
four call sites in transform(); removes four type: ignore[arg-type] suppressions.

---------

Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@jodavis jodavis enabled auto-merge (rebase) June 28, 2026 14:08
@github-actions

Copy link
Copy Markdown

Test Results

401 tests  ±0   401 ✅ ±0   2m 14s ⏱️ +9s
  5 suites ±0     0 💤 ±0 
  5 files   ±0     0 ❌ ±0 

Results for commit 27c59fe. ± Comparison against base commit 8f1b8b7.

This pull request removes 3 and adds 2 tests. Note that renamed tests count towards both.
,False)
AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesKeyNameAsync (Hello
AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesValueAsync (Invalid
AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesKeyNameAsync (Hello
,False)
AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesValueAsync (Invalid
,False)

@jodavis jodavis merged commit c6638cc into feature/adr-191-new-ml-pipeline Jun 28, 2026
5 checks passed
@jodavis jodavis deleted the dev/jodavis/initial-commits-for-new-ml-pipeline branch June 28, 2026 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants