Explore silence detection in speech-to-text #379

Open
jonatanklosko opened this issue Jul 4, 2024 · 4 comments

@jonatanklosko
Member

jonatanklosko commented Jul 4, 2024

Whisper may hallucinate text when an audio chunk is silence or noise (see #377 (comment)). The openai-whisper implementation has no_speech_threshold and logprob_threshold options that may be related. From a quick search there are a few discussions around Whisper hallucination, so it may be worth experimenting to see if there's something we can incorporate into the current algorithm.
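
For context, a minimal sketch of how those two thresholds interact in openai-whisper (the threshold values below are openai-whisper's documented defaults; the module and function names are illustrative, not anything from Bumblebee). Roughly, a chunk is discarded as silence only when the model is confident there is no speech and the decoded text itself has a low average log-probability:

```elixir
defmodule SilenceHeuristicSketch do
  # openai-whisper defaults: no_speech_threshold 0.6, logprob_threshold -1.0
  @no_speech_threshold 0.6
  @logprob_threshold -1.0

  @doc """
  Decides whether a transcribed chunk should be discarded as silence.

  `no_speech_prob` is the probability the model assigned to the <|nospeech|>
  token, and `avg_logprob` is the average log-probability of the tokens
  generated for the chunk.
  """
  def skip_chunk?(no_speech_prob, avg_logprob) do
    no_speech_prob > @no_speech_threshold and avg_logprob < @logprob_threshold
  end
end
```

Note that both conditions have to hold: a high <|nospeech|> probability alone is not enough if the decoded text is otherwise confident.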

@noozo

noozo commented Sep 10, 2024

Any progress on this? Compared to Python, the transcripts produced by Bumblebee are pretty bad: lots of repeated sentences, missing text, etc. We are on the verge of giving up and moving to a SaaS for this, unfortunately :(

@josevalim
Contributor

PRs are definitely welcome.

tubedude added a commit to tubedude/bumblebee that referenced this issue Oct 4, 2024
@tubedude

tubedude commented Oct 4, 2024

Jonatan, Valim,
I was looking at this issue and thought of implementing a "silence_processor" as part of the logits_processors.

So I'm thinking of changing these two files:

  • Bumblebee.Audio.SpeechToTextWhisper: change generate_opts/2 to generate_opts/3, passing the model_info and adding the silence_processor to the list of logits_processors.
  • Bumblebee.Text.Generation.LogitsProcessing: add the actual numerical definition of the silence processor.

I still need to review and test all the logic, but do you think this would be the right place to implement this processor?
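
Purely as an illustration of the idea (this is not Bumblebee's actual logits-processor API, and the reply below explains why the approach doesn't fit): such a processor could, given a no-speech score computed elsewhere for the whole chunk, suppress everything except the end-of-sequence token so a silent chunk yields no text. All names and the threshold here are assumptions.

```elixir
defmodule SilenceProcessorSketch do
  # Hypothetical sketch only: assumes the processor receives the logits for the
  # current step plus a per-chunk no-speech probability computed beforehand.
  def silence_processor(logits, no_speech_prob, eos_token_id, threshold \\ 0.6) do
    if no_speech_prob > threshold do
      # Force end-of-sequence by pushing every other logit to negative infinity
      neg_inf = Nx.broadcast(Nx.Constants.neg_infinity(Nx.type(logits)), Nx.shape(logits))
      Nx.indexed_put(neg_inf, Nx.tensor([[eos_token_id]]), Nx.tensor([0.0], type: Nx.type(logits)))
    else
      logits
    end
  end
end
```

The hard part is getting a per-chunk no-speech score into a per-token processor in the first place, which is what the reply below is about.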

tubedude added a commit to tubedude/bumblebee that referenced this issue Oct 6, 2024
tubedude added a commit to tubedude/bumblebee that referenced this issue Oct 6, 2024
@jonatanklosko
Member Author

@tubedude unfortunately it doesn't fit into the usual logits processing approach. We generate the transcription token by token, and logits processing applies a transformation to the logits at each iteration. My understanding is that the <|nospeech|> token is a (somewhat hacky) voice activity detection for the whole input chunk. What openai-whisper does is track the <|nospeech|> probability from only the last iteration (the last token prediction) and then use it, combined with the average logprob, to determine whether the whole chunk should be skipped.

While looking around, I noticed that huggingface/transformers made significant changes to long-form transcription within the last year. They added support for sequential transcription of long inputs, similar to openai-whisper, for improved transcription quality. The implementation involves several techniques, including the nospeech detection. They do use a logits processor as part of this, however not to alter the logits, but rather to accumulate information in the object state and extract it later, when deciding whether a chunk is silence (the authors themselves consider it hacky, but that's what they did to match the openai implementation, ref). This hack doesn't really fit into our functional implementation, and regardless, it is only applicable within the new long-form implementation. The two main PRs with the new changes are huggingface/transformers#27492 and huggingface/transformers#27658.
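
To make that concrete, here is roughly what the "accumulate in the processor, read it back later" pattern looks like when rephrased functionally (illustrative names only, not an existing Bumblebee interface). The processor would have to return updated state next to the logits, and the generation loop would have to thread that state through and read it back at the end of the chunk, which is exactly the part that doesn't fit the current contract:

```elixir
defmodule NospeechTrackingSketch do
  # Illustrative only: a "tracking" processor that leaves the logits untouched and
  # just records the probability of the <|nospeech|> token so the caller can inspect
  # it after generation. Returning {logits, state} is the functional stand-in for the
  # mutable object state that huggingface/transformers uses.
  def track_nospeech(logits, state, nospeech_token_id) do
    # naive softmax over the vocabulary, fine for a sketch
    exps = Nx.exp(Nx.subtract(logits, Nx.reduce_max(logits)))
    probs = Nx.divide(exps, Nx.sum(exps))
    nospeech_prob = Nx.to_number(probs[nospeech_token_id])
    {logits, Map.put(state, :nospeech_prob, nospeech_prob)}
  end
end
```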

So taking a step back, huggingface/transformers now has two separate approaches for long-form transcription: (a) "sequential" long-input generation (which does the nospeech detection, among other techniques), and (b) chunked generation with output merging. Our current implementation does (b). Maintaining both, especially with streaming, is most likely too much. Implementing (a) is a lot of work, and I think there are challenges related to serving and streaming, because the input slice points are not known upfront (the offsets are adjusted on each iteration).

All that said, I think it may be worth looking at those PRs and the paper mentioned in them, and considering a different implementation for long-form transcription. Given the complexity, I can't really point to anything directly actionable, and it's not something we can prioritize at the moment.
