Working PR for ADR-191: Refactor DVC pipeline into proper OOP design patterns#202
Draft
jodavis wants to merge 27 commits into
Draft
Working PR for ADR-191: Refactor DVC pipeline into proper OOP design patterns#202jodavis wants to merge 27 commits into
jodavis wants to merge 27 commits into
Conversation
Test Results401 tests ±0 400 ✅ - 1 2m 2s ⏱️ -3s For more details on these failures, see this check. Results for commit 9ab9062. ± Comparison against base commit 8f1b8b7. This pull request removes 3 and adds 2 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
Owner
Author
|
@copilot rebase this PR branch on main so it gets the latest fixes. Make sure scripts/validate passes with no warnings, errors, or test failures. |
Contributor
|
@jodavis I'm unable to start working on this because of repository rules that prevent me from pushing to the branch:
See the documentation for more details. |
4 tasks
…be used for reference only.
…core/ - sample.py: Sample(ABC), SampleWithPath(Sample,ABC), TextSample, AudioSample, SampleSpectrogram, SampleTokens dataclasses with all spec fields. - manifest.py: Manifest[S] generic with O(1) by_id/by_content_hash lookups; ManifestStore.read() dispatches on sample_type; write() emits version-1 JSON. TextSample JSON omits path/parent_content_hash/applied_values; Spectrogram and Tokens JSON omits applied_values and includes parent_id per spec. - pyproject.toml: minimal pytest config so plain `pytest` from ml/ works. - Ambiguity resolutions: mutable dataclasses (no frozen=True); seed not a default field (callers pass seed=0 explicitly, matching spec's "PhraseVariator sets seed=0"). https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
1. Manifest.__init__: raise ValueError on duplicate sample ids (UUIDs must
be unique); keep first occurrence for duplicate content_hash (documented
contract — deterministic stages can produce identical hashes).
2. ManifestStore.read(): use data.get("sample_type") and raise ValueError
explicitly when the key is absent, matching the defensive pattern used
for "version".
3. ManifestStore.write(): validate that all samples share the same type
before writing; a mixed-type Manifest would produce a JSON file whose
sample_type header mismatches later entries, causing read() to KeyError.
Adds four new tests covering each guard.
https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
Tests use pytest's tmp_path fixture; tempfile was never called. https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
TextSample now derives its id from content_hash via __post_init__, so callers no longer pass id= explicitly. This means two TextSamples with the same content always share the same id, enforcing the deduplication invariant at the type level. Update ManifestStore._deserialise to omit id= when constructing TextSample, and update tests to match: duplicate-id and duplicate-content-hash tests now use AudioSample (where id and content_hash are independent) to exercise those code paths. https://claude.ai/code/session_01DLVaMxWuhm75L3YaMdVTFT
b835493 to
cb8b515
Compare
* ADR-222: Implement seed-based randomisation engine (PassFilter, VariationGenerator) Adds ml/pipeline/core/randomization.py with: - PassFilter ABC (density, sample_domain) - MinMaxFilter: uniform rejection over [min, max] - NormalFilter: Gaussian density normalised to peak 1.0; rejects std_dev <= 0 - VariationGenerator: should_vary, generate (float rejection-sampling), generate_int (bitmask rejection), choose (direct index-0 hash) All values derived from sha256 with deterministic key patterns; no random module. generate_int special-cases range==0 to return min_val immediately. 46 unit tests covering determinism, exact values, boundary conditions, ValueError after 1000 attempts, probability convergence, and stability across MinMaxFilter range widening/narrowing using concrete seed values. * Address code review: add input validation and strengthen independence tests * VariationGenerator.generate() uses a power-of-2 modulo algorithm to improve variable value stability as parameters change * Add Python pytest to CI and validate-tests scripts * Add error handling for cd ml in validate-tests scripts --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Use dev-team plugin from feature branch * ADR-223: Implement ModifierStage abstract base class Adds ModifierStage[T_in, T_out] with the three-case transform() algorithm (skip / regen-with-stored-seed / new sample), _compute_content_hash() static helper, GC of orphaned output files, and manifest read/write integration. Also adds randomization.py (ADR-222 prerequisite) containing PassFilter, MinMaxFilter, NormalFilter, and VariationGenerator — required for modifier_stage.py to compile and for the regen path to construct generators from stored seeds. Key decisions: - ValueError from ManifestStore.read() propagates (hard error); no fallback to "no previous manifest" for unreadable/unsupported manifests. - output_dir creation is caller's responsibility; transform() does not call mkdir (consistent with DVC/entry-point owning directory setup per spec). - GC guards with output_dir.exists() so an empty first run does not raise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ADR-223: address review feedback Bound T_in TypeVar to Sample so mypy can verify content_hash access at all four call sites in transform(); removes four type: ignore[arg-type] suppressions. --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…entry-points) (#217) * ADR-224: uncommitted changes at validation * #217: Add ml/params.yaml with pipeline and stage parameters as DVC variables Moves variations_per_phrase, subsample_rate, and the five probability floats out of source code and into a central params.yaml so they can be tracked and overridden as DVC parameters. * #217: Refactor PhraseVariator to use VariationGenerator for stable reproducible randomisation Replace injectable rng: random.Random with vgen_factory: Callable[[int], VariationGenerator] and five probability float constructor params. Remove module-level _*_CHANCE constants. In generate(), use random.Random(42) to draw a phrase_seed per base phrase, then random.Random(phrase_seed) to draw a variant_seed per variant, and pass each seed to vgen_factory to get a VariationGenerator for that variant's _create_variation() call. In _create_variation(), replace all self._rng.* calls with vgen.should_vary(), vgen.choose(), and vgen.generate_int() using the stable variable name strings specified in ADR-224. Update tests to use the new constructor signature and a factory lambda. * #217: Load stage params from params.yaml; remove CLI args for DVC variables Remove --variations-per-phrase and --subsample-rate from argparse. Load all five probability floats, variations_per_phrase, and subsample_rate from ml/params.yaml (resolved relative to __file__). Construct PhraseVariator with the factory lambda and probability params from the YAML. * Minor changes during code review * Addition of task to start wiring the DVC pipeline, added as a result of code review. * Small changes before pushing * Address PR #217 review comments: params path convention, typed params object, PhraseVariator params object, move TestConventions - conventions.params_path(project_root) centralises the params.yaml path - PipelineParams/GeneratePhraseParams dataclasses deserialise params.yaml into a typed object - PhraseVariator.__init__ now takes GeneratePhraseParams instead of five individual floats - TestConventions moved from test_vocab_computer.py to its own test_conventions.py file Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix CI ModuleNotFoundError: add pyyaml to ml/requirements.txt and install from it in CI pyyaml was used in ml/pipeline/stages/params.py and ml/test/pipeline/stages/test_params.py but was not listed as a dependency. The CI workflow only ran `pip install pytest`, so pyyaml was never installed. Added ml/requirements.txt with pytest and pyyaml>=6.0, and updated the python-tests job to install from that file instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
--- updated-dependencies: - dependency-name: Microsoft.Extensions.Hosting dependency-version: 10.0.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
--- updated-dependencies: - dependency-name: Azure.Monitor.OpenTelemetry.Exporter dependency-version: 1.8.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
--- updated-dependencies: - dependency-name: Microsoft.NET.Test.Sdk dependency-version: 18.6.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
* ADR-278: uncommitted changes at validation * Remove unneeded DVC files from the root of the repo * DVC init * Fix behavior of VocabComputer to read "content" instead of "label" * Lock file after first `dvc repro` run * Add a link to the Epic in the spec * Address PR review: move download script, add --output-dir arg, prefix stage names - Move ml/download_phoneme_dictionary.py to ml/pipeline/download_phoneme_dictionary.py so it lives alongside other pipeline scripts - Add --output-dir argument to the download script instead of hardcoding the path - Prefix all dvc.yaml stage names with intent_NN_ to match the script naming convention - Update dvc.lock to reflect renamed stages and updated cmd Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* ADR-225: Implement text-to-speech sample generation using edge-tts * ADR-225: Add soundfile and edge-tts to ml/requirements.txt * Revert unrelated extraKnownMarketplaces change from ADR-225 The ADR-225 implementation commit accidentally cleared extraKnownMarketplaces; restore the jodavis-agent-plugins marketplace entry that was present before. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ADR-225: Add _doc_speech.md for ml/pipeline/speech package CONTRIBUTING.md requires a documentation file for new subsystems. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ADR-225: Remove multi-line docstring from EdgeTtsProvider CLAUDE.md: never write multi-paragraph docstrings. The retry strategy is self-evident from the constants and loop structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ADR-225: Remove 10-line module docstring from speech_03_generate_samples.py CLAUDE.md: never write multi-paragraph docstrings. The usage block duplicates what argparse --help already surfaces. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ADR-225: Rename speech_03_generate_samples to speech_01_generate_samples Speech and intent pipelines have separate number spaces. The speech pipeline starts at 01, so this stage is speech_01, not speech_03. --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* ADR-226: Implement audio I/O protocols and DelayAugmentor stage
- ml/pipeline/io/: new AudioReader/AudioWriter Protocol classes with
LibrosaAudioReader (mono conversion via librosa.load) and
SoundfileAudioWriter (PCM_16 WAV write)
- ml/pipeline/speech/delay_stage.py: DelayAugmentor(ModifierStage)
with independent prefix/suffix should_vary checks; both keys always
in applied_values (0.0 when not applied) for hash stability;
_derive_id format: "{id}_pre{ms}_suf{ms}"
- ml/pipeline/stages/speech_04_add_delays.py: thin DVC entry-point
- ml/pipeline/stages/params.py: AddDelaysParams dataclass + load()
- ml/params.yaml: add stages.add_delays section (6 float keys)
- ml/requirements.txt: add librosa>=0.10
- Tests: 28 delay stage tests, 9 audio I/O tests, 5 new params tests
* Fix a script error in validate-build.sh
* Fix build and test validation scripts for agentic workflows
* ADR-226: Fix blocking I/O in LibrosaAudioReader and SoundfileAudioWriter
Both async methods called synchronous disk I/O directly (librosa.load and
sf.write), blocking the event loop for the full duration of each file
operation. Offload each to a thread pool executor via run_in_executor so
the event loop remains free during I/O, matching the async contract.
* ADR-226: Move asyncio import to module top level in test_audio_io.py
import asyncio was placed inside each individual test method. Move it to
the top-level imports per standard Python convention.
* ADR-226: Add _doc_io.md for ml/pipeline/io subsystem
New subsystems require a _doc_*.md per CONTRIBUTING.md. Documents the
AudioReader/AudioWriter Protocol contracts, default implementations
(LibrosaAudioReader, SoundfileAudioWriter), algorithm choices, thread
pool offload rationale, and deferred import rationale.
* ADR-226: Update _doc_speech.md to reflect DelayAugmentor is implemented
delay_stage.py / DelayAugmentor was listed as a planned (not yet
implemented) stage. It is now implemented; add it to the Stages table
and remove it from the planned list.
* ADR-226: Replace tuple[np.ndarray, int] return type with AudioData named type
AudioReader.read() now returns AudioData(samples, sample_rate) instead of a raw
tuple. This improves call-site readability and makes intent explicit at the
protocol boundary. LibrosaAudioReader updated to construct AudioData; tests
updated to access result.samples and result.sample_rate instead of unpacking.
* ADR-226: Pass AddDelaysParams to DelayAugmentor; reduce MinMaxFilter precision; omit zero-delay ID parts
- DelayAugmentor constructor now accepts a single AddDelaysParams instead of six
individual float arguments. The entry-point passes params.add_delays directly.
- MinMaxFilter precision reduced from 6 to 1 (100ms resolution is sufficient for
audio delay augmentation; sub-millisecond granularity is unnecessary).
- _derive_id now omits _pre and _suf segments when the respective delay is 0.0,
producing cleaner IDs (e.g. "TV_ON_Jenny_r77" instead of "TV_ON_Jenny_r77_pre0_suf0").
- Tests updated: _make_stage accepts AddDelaysParams; TestDeriveId updated for new
format; TestGenerateOutput/TestAppliedValues use precision=1-valid delay values (0.1s).
* ADR-226: Rename speech_04_add_delays to speech_02; update params.yaml defaults for precision=1
Renamed speech_04_add_delays.py to speech_02_add_delays.py because only speech_01
exists; the delay stage is the second speech pipeline stage, not the fourth.
Updated params.yaml default delay range from 0.02-0.15s to 0.1-0.5s so that the
defaults are valid at MinMaxFilter precision=1 (100ms resolution). The previous
defaults (20ms, 150ms) would round to 0.0 and 0.1 at precision=1, making the
effective range degenerate.
* ADR-226: Update _doc_io.md for AudioData return type
Document AudioData dataclass in the Protocols table and update the AudioReader
contract description to reflect the new return type.
* ADR-226: Use AudioData in SoundfileAudioWriter.write() and callers
SoundfileAudioWriter.write() was still taking data/sample_rate as
separate arguments rather than AudioData, inconsistent with the
AudioWriter Protocol definition (line 44) that was updated in the
previous fix iteration.
Updated SoundfileAudioWriter.write() to accept audio: AudioData and
unpack audio.samples / audio.sample_rate internally. Updated
delay_stage._generate_output() to pass AudioData to the writer.
Updated _RecordingAudioWriter stub and all test call sites in
test_audio_io.py and test_delay_stage.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ADR-226: Update _doc_io.md AudioWriter contract to AudioData
The AudioWriter contract description still showed the old
data/sample_rate signature. Updated to reflect the current
audio: AudioData parameter.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ages (#231) * ADR-227: Add BackgroundNoiseAugmentor and MicrophoneNoiseAugmentor - background_noise_stage.py: BackgroundNoiseAugmentor with NoiseProvider protocol; noise_file always chosen for hash stability; noise_start_s bounds derived from file durations at runtime; all three keys always present in applied_values - mic_noise_stage.py: MicrophoneNoiseAugmentor with Gaussian noise seeded from output_seed - speech_05_add_background_noise.py, speech_06_add_mic_noise.py: entry-point scripts - params.py: AddBackgroundNoiseParams, AddMicNoiseParams; PipelineParams.load() extended - params.yaml: add_background_noise and add_mic_noise sections - _doc_speech.md: updated stages table and key design decisions * ADR-227: Fix executor.submit coroutine form; store _noise_dir to remove redundant list_files() calls Issue 1: Changed executor.submit(asyncio.run, _read()) to executor.submit(lambda: asyncio.run(_read())) so the coroutine is created inside the worker thread. The old form passed a pre-created coroutine object across threads, which is fragile (the coroutine is bound to the originating thread's context). The lambda form creates a fresh coroutine in the worker thread per asyncio.run() semantics. Issues 2 & 3: Added noise_dir: Path constructor parameter to BackgroundNoiseAugmentor and stored it as _noise_dir. Both _get_applied_values (line 106) and _generate_output (line 151) previously called self._noise_provider.list_files()[0].parent to recover the noise directory — re-globbing the filesystem on each call and raising an opaque IndexError on an empty directory. _noise_dir eliminates both calls. * ADR-227: Remove unused imports from test_background_noise_stage.py and test_mic_noise_stage.py Issues 4 & 5: 'from typing import Any' and 'import pytest' were imported but never referenced in either test file. Also updates test_background_noise_stage.py to pass the new noise_dir parameter required by the BackgroundNoiseAugmentor constructor. * ADR-227: Pass noise_dir to BackgroundNoiseAugmentor; add validation for empty noise directory Issue 6: _DirectoryNoiseProvider.list_files() now raises ValueError with the directory path if no WAV files are found. Previously it returned [] which caused VariationGenerator.choose() to raise ValueError("options must be non-empty") — an opaque message that didn't identify the misconfigured path. The new guard surfaces the root cause immediately. Also passes args.noise_dir as the new noise_dir constructor parameter added to BackgroundNoiseAugmentor. * ADR-227: Remove noise_start_s variation and _read_duration_s Per user decision (PRRT_kwDOQYcoLM6KbINz, PRRT_kwDOQYcoLM6KbIOC, PRRT_kwDOQYcoLM6KjMkD): noise_start_s is no longer varied — it is always 0.0. This eliminates the hacky synchronous _read_duration_s helper that called asyncio.run inside a ThreadPoolExecutor to avoid blocking the event loop. noise_start_s is still stored in applied_values for hash stability. Update _doc_speech.md to remove the now-obsolete bounds-derivation design note. * ADR-227: Avoid output_samples copy when amplitude is 0 in MicrophoneNoiseAugmentor Per Copilot review (PRRT_kwDOQYcoLM6KbINm) and user decision (PRRT_kwDOQYcoLM6KjhqB): only add Gaussian noise if amplitude > 0. Use audio.samples directly on the no-noise path instead of creating an unnecessary copy. New array is only allocated when mixing occurs. * ADR-227: Redesign NoiseProvider; return input when noise not applied - NoiseProvider.list_files() now returns list[tuple[str, AudioData]] so that implementations can load and resample files once at construction time, avoiding per-sample I/O and ensuring sample rate consistency. - _DirectoryNoiseProvider gains a sample_rate: int constructor parameter; it loads all WAV files eagerly via librosa (resampling to sample_rate), satisfying the sample-rate equality contract. - BackgroundNoiseAugmentor removes noise_dir from its constructor (no longer needed; noise audio is retrieved directly from the provider). - _generate_output in both BackgroundNoiseAugmentor and MicrophoneNoiseAugmentor returns input_sample unchanged when noise_volume / amplitude == 0.0 — no file is written and no I/O occurs, eliminating unnecessary file copies. - MicrophoneNoiseAugmentor._derive_id returns input_sample.id when amplitude == 0.0 (no _mic### suffix added for unmodified samples). - A sample_rate: int field is added to PipelineParams (pipeline-level param) with a sample_rate: 16000 default in params.yaml. - Added ValueError guard in _generate_output when noise and audio sample rates differ, with a diagnostic message explaining the fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…try-points (#233) * ADR-228: Add SpectrogramStage, TokenStage, SetManifestSplitter and DVC entry-points - SpectrogramStage: deterministic stage producing (n_mels, time_steps) .npy files via librosa - TokenStage: deterministic stage producing padded phoneme token .json files - SetManifestSplitter: splits augmented manifest into train/val/test by percentage - DVC entry-points: speech_07_compute_tokens, speech_08_compute_spectrograms, speech_09_create_set_manifests - PipelineParams extended with ComputeSpectrogramsParams, ComputeTokensParams, CreateSetManifestsParams - params.yaml updated with compute_spectrograms, compute_tokens, create_set_manifests sections * ADR-228: Offload TokenStage JSON write to run_in_executor for consistency with SpectrogramStage * ADR-228: Update _doc_speech.md to cover featurisation stages and SetManifestSplitter * ADR-228: Replace round() with int() truncation in SetManifestSplitter for predictable split boundaries Python's round() uses bankers rounding, which can produce unintuitive results at .5 boundaries. Integer truncation for train/val with remainder assigned to test is simpler and avoids unexpected off-by-one behaviour on small manifests. * ADR-228: Enforce float32 dtype on SpectrogramStage output librosa.power_to_db may return float64; adding .astype(np.float32) ensures the saved .npy always matches the documented contract. Added test_output_npy_dtype_is_float32 to verify. * ADR-228: Include config params in _get_applied_values for SpectrogramStage and TokenStage Reviewer (jodavis) confirmed that per-file content_hash is used to determine whether individual outputs are up-to-date between pipeline runs, not just the DVC stage-level dependency. Returning {} from _get_applied_values means that changing n_mels/time_steps/input_token_length/phoneme_list does not invalidate previously-cached outputs on a per-file basis. SpectrogramStage now returns {"n_mels": ..., "time_steps": ...}. TokenStage now returns {"input_token_length": ..., "phoneme_list": [...]}. Replaced test_get_applied_values_returns_empty_dict with three tests each that assert the new config keys are present and reflect constructed values. Added test_changing_n_mels_triggers_regen, test_changing_time_steps_triggers_regen, test_changing_input_token_length_triggers_regen, and test_changing_phoneme_list_triggers_regen to verify end-to-end regen behaviour. --------- Co-authored-by: Joe Davis <ElwoodMoves@hotmail.com>
librosa -> resampy -> numba, and numba <0.59.0 does not support Python 3.12.3 (Ubuntu 24.04 system Python). numba 0.59.0 added Python 3.12 support. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
numpy==2.5.0 conflicts with numba (latest numba supports numpy<2.5). Librosa brings in resampy which requires numba>=0.59.0, but numba 0.59.0 requires numpy<1.27, and no numba version supports numpy 2.5.0. Use numpy>=1.22 to let pip find a compatible version that satisfies both tensorflow 2.21.0 and numba/librosa. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Working PR for ADR-191: Refactor DVC pipeline into proper OOP design patterns