ADR-228: implement SpectrogramStage, TokenStage, SetManifestSplitter by jodavis · Pull Request #232 · jodavis/AdaptiveRemote

jodavis · 2026-06-19T04:11:18Z

No description provided.

Add two deterministic featurisation stages and a set splitter: - SpectrogramStage: writes {id}.npy of shape (n_mels, time_steps); truncates long spectrograms and zero-pads short ones - TokenStage: writes {id}.json with phoneme strings and token indices padded to input_token_length; padding value is 0 (index 0) - SetManifestSplitter: shuffles with seed=42 and writes train/val/test manifests Also adds three DVC entry-point scripts (speech_07, 08, 09), extends PipelineParams with ComputeSpectrogramsParams, ComputeTokensParams, CreateSetManifestsParams, adds matching sections to params.yaml, and updates _doc_speech.md.

Copilot

Pull request overview

This PR adds three new speech-pipeline components to the ML pipeline—TokenStage, SpectrogramStage, and SetManifestSplitter—plus their DVC entrypoints and parameter plumbing, so augmented audio manifests can be converted into model-ready token/spectrogram artifacts and split into train/val/test sets.

Changes:

Extend PipelineParams / params.yaml to include compute_tokens, compute_spectrograms, and create_set_manifests stage parameters.
Add TokenStage (phoneme tokens JSON) and SpectrogramStage (mel spectrogram NPY) plus corresponding CLI stage scripts.
Add SetManifestSplitter and CLI stage script to produce train/val/test manifest files.

Confidence: 95/100

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
ml/pipeline/speech/token_stage.py	New deterministic stage that writes per-sample phoneme/token JSON.
ml/pipeline/speech/spectrogram_stage.py	New deterministic stage that writes per-sample mel spectrogram NPY.
ml/pipeline/speech/set_splitter.py	New utility to shuffle/split an augmented manifest into train/val/test manifests.
ml/pipeline/stages/speech_07_compute_tokens.py	CLI entrypoint to run TokenStage using pipeline params + vocab files.
ml/pipeline/stages/speech_08_compute_spectrograms.py	CLI entrypoint to run SpectrogramStage using pipeline params.
ml/pipeline/stages/speech_09_create_set_manifests.py	CLI entrypoint to split augmented manifest into dataset splits.
ml/pipeline/stages/params.py	Adds dataclasses + loading for the three new stage parameter blocks.
ml/params.yaml	Adds YAML configuration for the three new stage blocks.
ml/pipeline/speech/_doc_speech.md	Updates docs to list the now-implemented stages.
ml/test/pipeline/speech/test_token_stage.py	New unit tests covering TokenStage outputs and padding.
ml/test/pipeline/speech/test_spectrogram_stage.py	New unit tests covering SpectrogramStage output shape/padding and determinism.
ml/test/pipeline/speech/test_set_splitter.py	New unit tests covering SetManifestSplitter outputs and determinism.
ml/test/pipeline/stages/test_params.py	Extends params loading tests for the new stage blocks.

+        # Collect phonemes for each word in the transcript
+        phonemes: list[str] = []
+        for word in input_sample.transcript.split():
+            word_phonemes = self._vocab.words_to_phonemes.get(word, [])
+            phonemes.extend(word_phonemes)
+
+        # Compute token indices (1-based lookup into phoneme_list)
+        tokens = [self._vocab.phoneme_list.index(p) for p in phonemes]
+
+        # Pad to input_token_length
+        pad_len = max(0, self._input_token_length - len(phonemes))
+        padded_phonemes = phonemes + [""] * pad_len
+        padded_tokens = tokens + [0] * pad_len
+
+        # Truncate if longer than input_token_length
+        padded_phonemes = padded_phonemes[: self._input_token_length]
+        padded_tokens = padded_tokens[: self._input_token_length]


+"""TokenStage: compute phoneme tokens from AudioSample transcripts.
+
+Deterministic stage (seed=0). Writes {id}.json with phonemes and tokens
+padded to input_token_length. Token index 0 is used for padding.
+"""


+        loop = asyncio.get_running_loop()
+        mel = await loop.run_in_executor(
+            None,
+            lambda: librosa.feature.melspectrogram(
+                y=audio.samples,
+                sr=self._sample_rate,
+                n_mels=self._n_mels,
+            ),
+        )


+    def test_tokens_padding_value_is_zero(self, tmp_path: Path) -> None:
+        """Padding uses index 0."""
+        vocab = _make_vocab(
+            phoneme_list=["AH", "N", "T", "V"],
+            words_to_phonemes={"TV_ON": ["T", "V"]},
+        )
+        stage = _make_stage(tmp_path, vocab=vocab, input_token_length=10)
+        sample = _make_audio_sample(transcript="TV_ON")
+        asyncio.run(stage.transform(Manifest([sample]), tmp_path / "manifest.json"))
+        data = json.loads((tmp_path / f"{sample.id}.json").read_text())
+        # Only 2 phonemes for "TV_ON", remaining 8 should be 0
+        assert data["tokens"][2:] == [0] * 8


github-actions · 2026-06-19T04:16:19Z

Test Results

401 tests ±0 401 ✅ ±0 1m 59s ⏱️ -18s
5 suites ±0 0 💤 ±0
5 files ±0 0 ❌ ±0

Results for commit f4054bc. ± Comparison against base commit acbc760.

jodavis · 2026-06-19T04:25:30Z

@copilot Address the code review comments in this PR

…rate, unused imports

Copilot · 2026-06-19T04:29:57Z

@copilot Address the code review comments in this PR

All review comments addressed in commits ADR-228: address PR review comments and Move pytest import to module level:

token_stage.py: padding now uses vocab.ctc_blank_idx (== len(phoneme_list)), which cannot collide with any real phoneme index; raises KeyError immediately when a transcript word is absent from words_to_phonemes; builds a phoneme_to_idx dict once instead of repeated list.index() calls; docstrings updated.
spectrogram_stage.py: passes audio.sample_rate (the audio's native rate as returned by AudioReader) to librosa instead of the configured self._sample_rate.
set_splitter.py: removed unused Any import.
test_token_stage.py: renamed and updated padding test to assert vocab.ctc_blank_idx; added test_unknown_transcript_word_raises_key_error; moved import pytest to module level.
test_set_splitter.py: removed unused json import.

jodavis requested a review from Copilot June 19, 2026 04:11

Copilot started reviewing on behalf of jodavis June 19, 2026 04:11 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Copilot started work on behalf of jodavis June 19, 2026 04:25 View session

Copilot AI added 2 commits June 19, 2026 04:28

ADR-228: address PR review comments - padding, unknown words, sample_…

deeea1e

…rate, unused imports

Move pytest import to module level in test_token_stage.py

38c474e

Copilot finished work on behalf of jodavis June 19, 2026 04:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ADR-228: implement SpectrogramStage, TokenStage, SetManifestSplitter#232

ADR-228: implement SpectrogramStage, TokenStage, SetManifestSplitter#232
jodavis wants to merge 3 commits into
feature/ADR-191-oop-ml-pipelinefrom
claude/determined-ramanujan-3qaj1w

jodavis commented Jun 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 19, 2026

Uh oh!

jodavis commented Jun 19, 2026

Uh oh!

Copilot AI commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jodavis commented Jun 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 19, 2026

Test Results

Uh oh!

jodavis commented Jun 19, 2026

Uh oh!

Copilot AI commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants