Skip to content

ADR-228: implement SpectrogramStage, TokenStage, SetManifestSplitter#232

Open
jodavis wants to merge 3 commits into
feature/ADR-191-oop-ml-pipelinefrom
claude/determined-ramanujan-3qaj1w
Open

ADR-228: implement SpectrogramStage, TokenStage, SetManifestSplitter#232
jodavis wants to merge 3 commits into
feature/ADR-191-oop-ml-pipelinefrom
claude/determined-ramanujan-3qaj1w

Conversation

@jodavis

@jodavis jodavis commented Jun 19, 2026

Copy link
Copy Markdown
Owner

No description provided.

Add two deterministic featurisation stages and a set splitter:
- SpectrogramStage: writes {id}.npy of shape (n_mels, time_steps); truncates
  long spectrograms and zero-pads short ones
- TokenStage: writes {id}.json with phoneme strings and token indices padded to
  input_token_length; padding value is 0 (index 0)
- SetManifestSplitter: shuffles with seed=42 and writes train/val/test manifests

Also adds three DVC entry-point scripts (speech_07, 08, 09), extends PipelineParams
with ComputeSpectrogramsParams, ComputeTokensParams, CreateSetManifestsParams,
adds matching sections to params.yaml, and updates _doc_speech.md.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds three new speech-pipeline components to the ML pipeline—TokenStage, SpectrogramStage, and SetManifestSplitter—plus their DVC entrypoints and parameter plumbing, so augmented audio manifests can be converted into model-ready token/spectrogram artifacts and split into train/val/test sets.

Changes:

  • Extend PipelineParams / params.yaml to include compute_tokens, compute_spectrograms, and create_set_manifests stage parameters.
  • Add TokenStage (phoneme tokens JSON) and SpectrogramStage (mel spectrogram NPY) plus corresponding CLI stage scripts.
  • Add SetManifestSplitter and CLI stage script to produce train/val/test manifest files.

Confidence: 95/100

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
ml/pipeline/speech/token_stage.py New deterministic stage that writes per-sample phoneme/token JSON.
ml/pipeline/speech/spectrogram_stage.py New deterministic stage that writes per-sample mel spectrogram NPY.
ml/pipeline/speech/set_splitter.py New utility to shuffle/split an augmented manifest into train/val/test manifests.
ml/pipeline/stages/speech_07_compute_tokens.py CLI entrypoint to run TokenStage using pipeline params + vocab files.
ml/pipeline/stages/speech_08_compute_spectrograms.py CLI entrypoint to run SpectrogramStage using pipeline params.
ml/pipeline/stages/speech_09_create_set_manifests.py CLI entrypoint to split augmented manifest into dataset splits.
ml/pipeline/stages/params.py Adds dataclasses + loading for the three new stage parameter blocks.
ml/params.yaml Adds YAML configuration for the three new stage blocks.
ml/pipeline/speech/_doc_speech.md Updates docs to list the now-implemented stages.
ml/test/pipeline/speech/test_token_stage.py New unit tests covering TokenStage outputs and padding.
ml/test/pipeline/speech/test_spectrogram_stage.py New unit tests covering SpectrogramStage output shape/padding and determinism.
ml/test/pipeline/speech/test_set_splitter.py New unit tests covering SetManifestSplitter outputs and determinism.
ml/test/pipeline/stages/test_params.py Extends params loading tests for the new stage blocks.

Comment thread ml/pipeline/speech/token_stage.py Outdated
Comment on lines +60 to +76
# Collect phonemes for each word in the transcript
phonemes: list[str] = []
for word in input_sample.transcript.split():
word_phonemes = self._vocab.words_to_phonemes.get(word, [])
phonemes.extend(word_phonemes)

# Compute token indices (1-based lookup into phoneme_list)
tokens = [self._vocab.phoneme_list.index(p) for p in phonemes]

# Pad to input_token_length
pad_len = max(0, self._input_token_length - len(phonemes))
padded_phonemes = phonemes + [""] * pad_len
padded_tokens = tokens + [0] * pad_len

# Truncate if longer than input_token_length
padded_phonemes = padded_phonemes[: self._input_token_length]
padded_tokens = padded_tokens[: self._input_token_length]
Comment on lines +1 to +5
"""TokenStage: compute phoneme tokens from AudioSample transcripts.

Deterministic stage (seed=0). Writes {id}.json with phonemes and tokens
padded to input_token_length. Token index 0 is used for padding.
"""
Comment on lines +70 to +78
loop = asyncio.get_running_loop()
mel = await loop.run_in_executor(
None,
lambda: librosa.feature.melspectrogram(
y=audio.samples,
sr=self._sample_rate,
n_mels=self._n_mels,
),
)
Comment on lines +156 to +167
def test_tokens_padding_value_is_zero(self, tmp_path: Path) -> None:
"""Padding uses index 0."""
vocab = _make_vocab(
phoneme_list=["AH", "N", "T", "V"],
words_to_phonemes={"TV_ON": ["T", "V"]},
)
stage = _make_stage(tmp_path, vocab=vocab, input_token_length=10)
sample = _make_audio_sample(transcript="TV_ON")
asyncio.run(stage.transform(Manifest([sample]), tmp_path / "manifest.json"))
data = json.loads((tmp_path / f"{sample.id}.json").read_text())
# Only 2 phonemes for "TV_ON", remaining 8 should be 0
assert data["tokens"][2:] == [0] * 8
Comment thread ml/pipeline/speech/set_splitter.py
Comment thread ml/test/pipeline/speech/test_set_splitter.py
@github-actions

Copy link
Copy Markdown

Test Results

401 tests  ±0   401 ✅ ±0   1m 59s ⏱️ -18s
  5 suites ±0     0 💤 ±0 
  5 files   ±0     0 ❌ ±0 

Results for commit f4054bc. ± Comparison against base commit acbc760.

@jodavis

jodavis commented Jun 19, 2026

Copy link
Copy Markdown
Owner Author

@copilot Address the code review comments in this PR

Copilot AI commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@copilot Address the code review comments in this PR

All review comments addressed in commits ADR-228: address PR review comments and Move pytest import to module level:

  • token_stage.py: padding now uses vocab.ctc_blank_idx (== len(phoneme_list)), which cannot collide with any real phoneme index; raises KeyError immediately when a transcript word is absent from words_to_phonemes; builds a phoneme_to_idx dict once instead of repeated list.index() calls; docstrings updated.
  • spectrogram_stage.py: passes audio.sample_rate (the audio's native rate as returned by AudioReader) to librosa instead of the configured self._sample_rate.
  • set_splitter.py: removed unused Any import.
  • test_token_stage.py: renamed and updated padding test to assert vocab.ctc_blank_idx; added test_unknown_transcript_word_raises_key_error; moved import pytest to module level.
  • test_set_splitter.py: removed unused json import.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants