Is there any paper or document about the theoretical details of the model? #188

qinyuenlp · 2022-05-11T16:19:40Z

Hi,
Thanks for sharing the code, it really helps me a lot. Is there any paper or document that can help me learn the theoretical details of the model?
When I use a sliding-window method to feed audio data into your model, different window sizes produce different VAD results. I want to know exactly how self._h and self._c in OnnxWrapper change.

Answered by snakers4

snakers4 · 2022-05-11T16:23:42Z

Maintainer

Hi,

> Is there any paper or document that can help me learn the theoretical details of the model?

There is no paper, but there is a short article - https://thegradient.pub/one-voice-detector-to-rule-them-all/

> When I use a sliding-window method to feed audio data into your model, different window sizes produce different VAD results. I want to know exactly how self._h and self._c in OnnxWrapper change.

Slightly different results for different window sizes are expected.
Please read the docstring in utils_vad.py carefully.

silero-vad/utils_vad.py

Lines 119 to 171 in ea7af70

    
    def get_speech_timestamps(audio: torch.Tensor,
                              model,
                              threshold: float = 0.5,
                              sampling_rate: int = 16000,
                              min_speech_duration_ms: int = 250,
                              min_silence_duration_ms: int = 100,
                              window_size_samples: int = 1536,
                              speech_pad_ms: int = 30,
                              return_seconds: bool = False,
                              visualize_probs: bool = False):
        """
        This method is used for splitting long audios into speech chunks using silero VAD

        Parameters
        ----------
        audio: torch.Tensor, one dimensional
            One dimensional float torch.Tensor, other types are casted to torch if possible

        model: preloaded .jit silero VAD model

        threshold: float (default - 0.5)
            Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
            It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.

        sampling_rate: int (default - 16000)
            Currently silero VAD models support 8000 and 16000 sample rates

        min_speech_duration_ms: int (default - 250 milliseconds)
            Final speech chunks shorter min_speech_duration_ms are thrown out

        min_silence_duration_ms: int (default - 100 milliseconds)
            In the end of each speech chunk wait for min_silence_duration_ms before separating it

        window_size_samples: int (default - 1536 samples)
            Audio chunks of window_size_samples size are fed to the silero VAD model.
            WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate.
            Values other than these may affect model perfomance!!

        speech_pad_ms: int (default - 30 milliseconds)
            Final speech chunks are padded by speech_pad_ms each side

        return_seconds: bool (default - False)
            whether return timestamps in seconds (default - samples)

        visualize_probs: bool (default - False)
            whether draw prob hist or not

        Returns
        ----------
        speeches: list of dicts
            list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
        """

i.e.:

     window_size_samples: int (default - 1536 samples) 
         Audio chunks of window_size_samples size are fed to the silero VAD model. 
         WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate. 
         Values other than these may affect model perfomance!!
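
To make the warning above concrete, here is a minimal sketch of a pre-check that could be run before choosing a window size. The helper name `check_window_size` and the `SUPPORTED_WINDOWS` table are hypothetical and not part of silero-vad; the values are simply copied from the docstring quoted above.

```python
# Hypothetical helper, NOT part of silero-vad.
# The window sizes below are copied from the docstring quoted above.
SUPPORTED_WINDOWS = {
    16000: (512, 1024, 1536),
    8000: (256, 512, 768),
}


def check_window_size(window_size_samples: int, sampling_rate: int = 16000) -> bool:
    """Return True if the docstring lists this window size for the given sample rate."""
    if sampling_rate not in SUPPORTED_WINDOWS:
        raise ValueError(f"unsupported sampling rate: {sampling_rate}")
    return window_size_samples in SUPPORTED_WINDOWS[sampling_rate]
```

For example, `check_window_size(768, 8000)` is true, while `check_window_size(768, 16000)` is false, since the docstring lists 768 only for the 8000 sample rate.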

4 replies

snakers4 May 11, 2022
Maintainer

@qinyuenlp

qinyuenlp May 11, 2022
Author

I mean, at lines 206 and 207 of utils_vad.py, the way you feed chunks into the model (assuming window_size_samples=768) is

for current_start_sample in range(0, audio_length_samples, 768):
    chunk = audio[current_start_sample: current_start_sample + 768]

But for some reason, I tried feeding them as

for current_start_sample in range(0, audio_length_samples, 384):
    chunk = audio[current_start_sample: current_start_sample + 768]

Both loops feed chunks of the same size to the model, but when I feed the "same" chunk, e.g. audio[3840:4608], using different sliding-window steps, I get different results.

snakers4 May 11, 2022
Maintainer

I see, you are using overlapping windows to increase quality / resolution?

The values should be similar, but they may differ. Overlapping is not really required, because the model has a built-in sequential inductive bias.

We used to use "overlapping windows" when our temporal resolution was bad, i.e. 100-200ms.

But now this is not required. However, the model may behave unpredictably, because we did not train this model to handle overlapping windows.

Can you please plot the probability charts for these two cases?
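
To illustrate why the "same" chunk can produce different outputs, here is a minimal sketch using a toy stateful model. `ToyStatefulModel` and `run` are hypothetical stand-ins and are not Silero VAD; the only assumption carried over is that the model keeps internal state between calls (like self._h and self._c), so each output depends on the chunks fed before it.

```python
class ToyStatefulModel:
    """Toy stand-in for a recurrent model. This is NOT Silero VAD."""

    def __init__(self, decay: float = 0.5):
        self.decay = decay
        self.state = 0.0  # plays the role of the hidden state (like _h / _c)

    def __call__(self, chunk):
        # Mean energy of the current chunk.
        energy = sum(x * x for x in chunk) / len(chunk)
        # The new state mixes the previous state with the current chunk,
        # so the output depends on everything fed so far.
        self.state = self.decay * self.state + (1.0 - self.decay) * energy
        return self.state


def run(audio, window, hop):
    """Feed fixed-size windows with the given hop; return outputs keyed by start sample."""
    model = ToyStatefulModel()
    outputs = {}
    for start in range(0, len(audio) - window + 1, hop):
        outputs[start] = model(audio[start:start + window])
    return outputs


# Deterministic synthetic signal: a ramp, so chunk energy grows over time.
n = 768 * 8
audio = [i / n for i in range(n)]

no_overlap = run(audio, window=768, hop=768)  # like range(0, N, 768)
overlap = run(audio, window=768, hop=384)     # like range(0, N, 384)

# Both runs feed the identical chunk audio[3840:4608], but with a different history.
print(no_overlap[3840], overlap[3840])
```

In this toy setup the two printed values differ even though the input chunk is identical, because the state accumulated before sample 3840 is different in the two runs.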

qinyuenlp May 11, 2022
Author

Sure, I'll plot that tomorrow. I have no wav data on my own PC right now.
Thanks again.
