Explore silence detection in speech-to-text #379

Open
jonatanklosko opened this issue Jul 4, 2024 · 4 comments

@jonatanklosko
Member

jonatanklosko commented Jul 4, 2024

Whisper may hallucinate text when an audio chunk is silence or noise (see #377 (comment)). The openai-whisper implementation has no_speech_threshold and logprob_threshold options that may be related. From a quick search there are a few discussions around Whisper hallucination, so it may be worth experimenting to see if there's something we can incorporate into the current algorithm.
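
For context, a minimal sketch of how those two thresholds interact in openai-whisper (the threshold values below are openai-whisper's documented defaults; the module and function names are illustrative, not anything from Bumblebee). Roughly, a chunk is discarded as silence only when the model is confident there is no speech and the decoded text itself has a low average log-probability:

```elixir
defmodule SilenceHeuristicSketch do
  # openai-whisper defaults: no_speech_threshold 0.6, logprob_threshold -1.0
  @no_speech_threshold 0.6
  @logprob_threshold -1.0

  @doc """
  Decides whether a transcribed chunk should be discarded as silence.

  `no_speech_prob` is the probability the model assigned to the <|nospeech|>
  token, and `avg_logprob` is the average log-probability of the tokens
  generated for the chunk.
  """
  def skip_chunk?(no_speech_prob, avg_logprob) do
    no_speech_prob > @no_speech_threshold and avg_logprob < @logprob_threshold
  end
end
```

Note that both conditions have to hold: a high <|nospeech|> probability alone is not enough if the decoded text is otherwise confident.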

@noozo

noozo commented Sep 10, 2024

Any progress on this? Compared to Python, the transcripts produced by Bumblebee are pretty bad: lots of repeated sentences, missing text, etc. We are on the verge of giving up and moving to a SaaS for this, unfortunately :(

@josevalim
Contributor

PRs are definitely welcome.

tubedude added a commit to tubedude/bumblebee that referenced this issue Oct 4, 2024
@tubedude

tubedude commented Oct 4, 2024

Jonatan, Valim,
I was looking at this issue and thought of implementing a "silence_processor" as part of the logits_processors.

So I'm thinking of changing these two files:

  • Bumblebee.Audio.SpeechToTextWhisper: change generate_opts/2 to generate_opts/3, passing the model_info and adding the silence_processor to the list of logits_processors.
  • Bumblebee.Text.Generation.LogitsProcessing: add the actual numerical definition of the silence processor.

I still need to review and test all the logic, but do you think this would be the right place to implement this processor?
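
Purely as an illustration of the idea (this is not Bumblebee's actual logits-processor API, and the reply below explains why the approach doesn't fit): such a processor could, given a no-speech score computed elsewhere for the whole chunk, suppress everything except the end-of-sequence token so a silent chunk yields no text. All names and the threshold here are assumptions.

```elixir
defmodule SilenceProcessorSketch do
  # Hypothetical sketch only: assumes the processor receives the logits for the
  # current step plus a per-chunk no-speech probability computed beforehand.
  def silence_processor(logits, no_speech_prob, eos_token_id, threshold \\ 0.6) do
    if no_speech_prob > threshold do
      # Force end-of-sequence by pushing every other logit to negative infinity
      neg_inf = Nx.broadcast(Nx.Constants.neg_infinity(Nx.type(logits)), Nx.shape(logits))
      Nx.indexed_put(neg_inf, Nx.tensor([[eos_token_id]]), Nx.tensor([0.0], type: Nx.type(logits)))
    else
      logits
    end
  end
end
```

The hard part is getting a per-chunk no-speech score into a per-token processor in the first place, which is what the reply below is about.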

tubedude added a commit to tubedude/bumblebee that referenced this issue Oct 6, 2024
tubedude added a commit to tubedude/bumblebee that referenced this issue Oct 6, 2024
@jonatanklosko
Member Author

@tubedude unfortunately it doesn't fit into the usual logits processing approach. We generate the transcription token by token, and logits processing applies a transformation to the logits at each iteration. My understanding is that the <|nospeech|> token is a (somewhat hacky) voice activity detection for the whole input chunk. What openai-whisper does is track the <|nospeech|> probability from only the last iteration (the last token prediction) and then use it, combined with the average logprob, to determine whether the whole chunk should be skipped.

While looking around, I noticed that huggingface/transformers made significant changes to long-form transcription within the last year. They added support for sequential transcription of long inputs, similar to openai-whisper, for improved transcription quality. The implementation involves several techniques, including the nospeech detection. They do use a logits processor as part of this, however not to alter the logits, but rather to accumulate information in the object state and extract it later, when deciding whether a chunk is silence (the authors themselves consider it hacky, but that's what they did to match the openai implementation, ref). This hack doesn't really fit into our functional implementation, and regardless, it is only applicable within the new long-form implementation. The two main PRs with the new changes are huggingface/transformers#27492 and huggingface/transformers#27658.
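
To make that concrete, here is roughly what the "accumulate in the processor, read it back later" pattern looks like when rephrased functionally (illustrative names only, not an existing Bumblebee interface). The processor would have to return updated state next to the logits, and the generation loop would have to thread that state through and read it back at the end of the chunk, which is exactly the part that doesn't fit the current contract:

```elixir
defmodule NospeechTrackingSketch do
  # Illustrative only: a "tracking" processor that leaves the logits untouched and
  # just records the probability of the <|nospeech|> token so the caller can inspect
  # it after generation. Returning {logits, state} is the functional stand-in for the
  # mutable object state that huggingface/transformers uses.
  def track_nospeech(logits, state, nospeech_token_id) do
    # naive softmax over the vocabulary, fine for a sketch
    exps = Nx.exp(Nx.subtract(logits, Nx.reduce_max(logits)))
    probs = Nx.divide(exps, Nx.sum(exps))
    nospeech_prob = Nx.to_number(probs[nospeech_token_id])
    {logits, Map.put(state, :nospeech_prob, nospeech_prob)}
  end
end
```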

So taking a step back, huggingface/transformers now has two separate approaches for long-form transcription: (a) "sequential" long-input generation (which does the nospeech detection, among other techniques), and (b) chunked generation with output merging. Our current implementation does (b). Maintaining both, especially with streaming, is most likely too much. Implementing (a) is a lot of work, and I think there are challenges related to serving and streaming, because the input slice points are not known upfront (the offsets are adjusted on each iteration).

All that said, I think it may be worth looking at those PRs and the paper mentioned in them, and considering a different implementation for long-form transcription. Given the complexity, I can't really point to anything directly actionable, and it's not something we can prioritize at the moment.
