ADR-228: implement SpectrogramStage, TokenStage, SetManifestSplitter#232
Open
jodavis wants to merge 3 commits into
Open
ADR-228: implement SpectrogramStage, TokenStage, SetManifestSplitter#232jodavis wants to merge 3 commits into
jodavis wants to merge 3 commits into
Conversation
Add two deterministic featurisation stages and a set splitter:
- SpectrogramStage: writes {id}.npy of shape (n_mels, time_steps); truncates
long spectrograms and zero-pads short ones
- TokenStage: writes {id}.json with phoneme strings and token indices padded to
input_token_length; padding value is 0 (index 0)
- SetManifestSplitter: shuffles with seed=42 and writes train/val/test manifests
Also adds three DVC entry-point scripts (speech_07, 08, 09), extends PipelineParams
with ComputeSpectrogramsParams, ComputeTokensParams, CreateSetManifestsParams,
adds matching sections to params.yaml, and updates _doc_speech.md.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds three new speech-pipeline components to the ML pipeline—TokenStage, SpectrogramStage, and SetManifestSplitter—plus their DVC entrypoints and parameter plumbing, so augmented audio manifests can be converted into model-ready token/spectrogram artifacts and split into train/val/test sets.
Changes:
- Extend
PipelineParams/params.yamlto includecompute_tokens,compute_spectrograms, andcreate_set_manifestsstage parameters. - Add
TokenStage(phoneme tokens JSON) andSpectrogramStage(mel spectrogram NPY) plus corresponding CLI stage scripts. - Add
SetManifestSplitterand CLI stage script to produce train/val/test manifest files.
Confidence: 95/100
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| ml/pipeline/speech/token_stage.py | New deterministic stage that writes per-sample phoneme/token JSON. |
| ml/pipeline/speech/spectrogram_stage.py | New deterministic stage that writes per-sample mel spectrogram NPY. |
| ml/pipeline/speech/set_splitter.py | New utility to shuffle/split an augmented manifest into train/val/test manifests. |
| ml/pipeline/stages/speech_07_compute_tokens.py | CLI entrypoint to run TokenStage using pipeline params + vocab files. |
| ml/pipeline/stages/speech_08_compute_spectrograms.py | CLI entrypoint to run SpectrogramStage using pipeline params. |
| ml/pipeline/stages/speech_09_create_set_manifests.py | CLI entrypoint to split augmented manifest into dataset splits. |
| ml/pipeline/stages/params.py | Adds dataclasses + loading for the three new stage parameter blocks. |
| ml/params.yaml | Adds YAML configuration for the three new stage blocks. |
| ml/pipeline/speech/_doc_speech.md | Updates docs to list the now-implemented stages. |
| ml/test/pipeline/speech/test_token_stage.py | New unit tests covering TokenStage outputs and padding. |
| ml/test/pipeline/speech/test_spectrogram_stage.py | New unit tests covering SpectrogramStage output shape/padding and determinism. |
| ml/test/pipeline/speech/test_set_splitter.py | New unit tests covering SetManifestSplitter outputs and determinism. |
| ml/test/pipeline/stages/test_params.py | Extends params loading tests for the new stage blocks. |
Comment on lines
+60
to
+76
| # Collect phonemes for each word in the transcript | ||
| phonemes: list[str] = [] | ||
| for word in input_sample.transcript.split(): | ||
| word_phonemes = self._vocab.words_to_phonemes.get(word, []) | ||
| phonemes.extend(word_phonemes) | ||
|
|
||
| # Compute token indices (1-based lookup into phoneme_list) | ||
| tokens = [self._vocab.phoneme_list.index(p) for p in phonemes] | ||
|
|
||
| # Pad to input_token_length | ||
| pad_len = max(0, self._input_token_length - len(phonemes)) | ||
| padded_phonemes = phonemes + [""] * pad_len | ||
| padded_tokens = tokens + [0] * pad_len | ||
|
|
||
| # Truncate if longer than input_token_length | ||
| padded_phonemes = padded_phonemes[: self._input_token_length] | ||
| padded_tokens = padded_tokens[: self._input_token_length] |
Comment on lines
+1
to
+5
| """TokenStage: compute phoneme tokens from AudioSample transcripts. | ||
|
|
||
| Deterministic stage (seed=0). Writes {id}.json with phonemes and tokens | ||
| padded to input_token_length. Token index 0 is used for padding. | ||
| """ |
Comment on lines
+70
to
+78
| loop = asyncio.get_running_loop() | ||
| mel = await loop.run_in_executor( | ||
| None, | ||
| lambda: librosa.feature.melspectrogram( | ||
| y=audio.samples, | ||
| sr=self._sample_rate, | ||
| n_mels=self._n_mels, | ||
| ), | ||
| ) |
Comment on lines
+156
to
+167
| def test_tokens_padding_value_is_zero(self, tmp_path: Path) -> None: | ||
| """Padding uses index 0.""" | ||
| vocab = _make_vocab( | ||
| phoneme_list=["AH", "N", "T", "V"], | ||
| words_to_phonemes={"TV_ON": ["T", "V"]}, | ||
| ) | ||
| stage = _make_stage(tmp_path, vocab=vocab, input_token_length=10) | ||
| sample = _make_audio_sample(transcript="TV_ON") | ||
| asyncio.run(stage.transform(Manifest([sample]), tmp_path / "manifest.json")) | ||
| data = json.loads((tmp_path / f"{sample.id}.json").read_text()) | ||
| # Only 2 phonemes for "TV_ON", remaining 8 should be 0 | ||
| assert data["tokens"][2:] == [0] * 8 |
Owner
Author
|
@copilot Address the code review comments in this PR |
Contributor
All review comments addressed in commits
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.