fix(transcript): word spacing, empty boxes & non-monotonic timestamps #6
Merged
The transcript rendered run-together ("morePDFlost...") with stray empty
boxes, and clicking a word seeked to the wrong place (timestamps were even
non-monotonic, e.g. 0:49 -> 0:40).
Root cause was in the Whisper post-processing + rendering:
- worker stored raw tokens (Whisper emits each with a leading space) and
joined segment text with "", so the editor's per-word <button> spans ran
together; whitespace-only chunks became empty clickable boxes
- word timestamps from chunked/stride long-form audio jump backwards at chunk
boundaries, scrambling click-to-seek
Fixes:
- extract groupWordsIntoSegments into a pure, tested module
(transcription/group-words.ts): trim each token, drop empties, and clamp
each word to start >= previous word's end (monotonic), space-join segments
- transcript panel: render a space separator between words, trim display text
(robust for legacy docs), and skip whitespace-only tokens
- add group-words.test.ts (9 cases: trim, empties, null ts, monotonic clamp,
gap preservation, segmentation)
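The normalization steps listed above (trim, drop empties, monotonic clamp) can be sketched as follows. This is a minimal illustration, not the code in `transcription/group-words.ts`: the `RawWord` and `CleanWord` shapes, the `normalizeWords` name, and the fallback used for null timestamps are all assumptions.

```typescript
// Hypothetical shapes; the real types in transcription/group-words.ts may differ.
interface RawWord {
  text: string;         // raw Whisper token, typically with a leading space
  start: number | null; // seconds; may be null
  end: number | null;
}

interface CleanWord {
  text: string;
  start: number;
  end: number;
}

// Trim each token, drop whitespace-only tokens, and clamp timestamps so that
// every word starts at or after the previous word's end.
function normalizeWords(raw: RawWord[]): CleanWord[] {
  const out: CleanWord[] = [];
  let prevEnd = 0;
  for (const w of raw) {
    const text = w.text.trim();
    if (text === "") continue; // would otherwise render as an empty box

    // Assumption: a null timestamp falls back to the previous word's end.
    const start = Math.max(w.start ?? prevEnd, prevEnd);
    const end = Math.max(w.end ?? start, start);

    out.push({ text, start, end });
    prevEnd = end;
  }
  return out;
}

// Usage: the second chunk's timestamps jump backwards (49 -> 40).
const cleaned = normalizeWords([
  { text: " more", start: 48, end: 49 },
  { text: " ", start: 49, end: 49 },
  { text: " PDF", start: 40, end: 41 },
  { text: " lost", start: 47, end: 48 },
]);
console.log(cleaned.map((w) => w.text).join(" ")); // "more PDF lost"
```

Because the clamp only raises a timestamp that falls before the previous word's end, a genuine pause between words (a start later than the previous end) passes through unchanged.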
Spacing + empty boxes are fixed for existing transcripts on reload (render
layer); improved timestamps apply to newly generated transcripts.
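A render-layer fix of this kind can be sketched as a pure helper that prepares display tokens from stored words, which is why it also covers legacy documents whose tokens were saved raw. The names `StoredWord`, `DisplayToken`, `toDisplayTokens`, and `previewText` are hypothetical, and the actual panel component is not shown in this PR text.

```typescript
// Hypothetical stored shape; legacy docs may hold raw tokens like " more" or " ".
interface StoredWord {
  text: string;
  start: number;
  end: number;
}

interface DisplayToken {
  label: string;     // trimmed text to show inside the word button
  wordIndex: number; // index into the stored array, kept for click-to-seek
}

// Trim display text and skip whitespace-only tokens without mutating stored data.
function toDisplayTokens(words: StoredWord[]): DisplayToken[] {
  const tokens: DisplayToken[] = [];
  words.forEach((w, i) => {
    const label = w.text.trim();
    if (label.length > 0) tokens.push({ label, wordIndex: i });
  });
  return tokens;
}

// The panel would place a space separator between adjacent word buttons;
// joining with " " produces the equivalent plain text.
function previewText(words: StoredWord[]): string {
  return toDisplayTokens(words)
    .map((t) => t.label)
    .join(" ");
}
```

Keeping `wordIndex` lets a click handler look up the original word's `start` time even after whitespace-only entries are skipped.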
Fixes the two transcript-editor issues from the screenshot.

1. Run-together words + empty boxes (`morePDFlost...`)

Whisper emits each token with a leading space, and the worker stored the tokens raw and joined segment text with `""`, while the editor renders each word as an adjacent inline `<button>` with no separator. Whitespace-only chunks became empty clickable boxes.

2. Inaccurate click-to-seek / non-monotonic timestamps (`0:49 → 0:40 → 0:47`)

Word timestamps from `return_timestamps: "word"` over chunked/stride long-form audio jump backwards at chunk boundaries.

Fixes

- Extract `groupWordsIntoSegments` into a pure, tested module, `transcription/group-words.ts`: trims each token, drops empties, clamps each word to `start >= prev.end` (monotonic), space-joins segment text.
- Add `group-words.test.ts`: 9 cases (trim, empties, null timestamps, monotonic clamp, gap preservation, sentence/gap/max-word segmentation).

Verification

`tsc --noEmit` clean; 38 tests pass (29 existing + 9 new).

Scope note

Spacing + empty boxes are fixed for existing transcripts on reload (render layer). The improved monotonic timestamps apply to newly generated transcripts (existing ones have their times baked in; re-generate to pick them up). Absolute word-timing precision is still bounded by the model: Tiny is roughest, while Small/Medium give tighter word boundaries.
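
The test list mentions sentence, gap, and max-word segmentation. One way such grouping could work is sketched below. The `segmentWords` name, the option names, and the default thresholds are assumptions for illustration only; the PR text does not state the actual values or signature.

```typescript
interface Word {
  text: string;
  start: number;
  end: number;
}

interface Segment {
  text: string;
  start: number;
  end: number;
  words: Word[];
}

// Hypothetical options; the real thresholds are not given in the PR.
interface SegmentOptions {
  maxGapSeconds: number;
  maxWords: number;
}

// Close a segment after sentence-ending punctuation, before a long pause,
// or once the segment reaches the maximum word count.
function segmentWords(
  words: Word[],
  opts: SegmentOptions = { maxGapSeconds: 1.0, maxWords: 30 },
): Segment[] {
  const segments: Segment[] = [];
  let current: Word[] = [];

  const flush = (): void => {
    if (current.length === 0) return;
    segments.push({
      text: current.map((w) => w.text).join(" "), // space-join, never ""
      start: current[0].start,
      end: current[current.length - 1].end,
      words: current,
    });
    current = [];
  };

  words.forEach((w, i) => {
    current.push(w);
    const next = words[i + 1];
    const endsSentence = /[.!?]$/.test(w.text);
    const longGap = next !== undefined && next.start - w.end > opts.maxGapSeconds;
    const full = current.length >= opts.maxWords;
    if (endsSentence || longGap || full) flush();
  });
  flush(); // emit any trailing words
  return segments;
}
```

The sketch expects words that are already trimmed and monotonic, so segment `start` and `end` can be read directly from the first and last word.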