
E1 TTS

Basic Information
  • Title: "Title"
  • Authors:
    • 01 Zhijun Liu
    • 02 Shuai Wang
    • 03 Pengcheng Zhu
    • 04 Mengxiao Bi
    • 05 Haizhou Li
  • Links:
  • Files:
    • ArXiv
    • [Publication] #TODO

Abstract

This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation. The training of E1 TTS is straightforward; it does not require explicit monotonic alignment between the text and audio pairs. The inference of E1 TTS is efficient, requiring only one neural network evaluation for each utterance. Despite its sampling efficiency, E1 TTS achieves naturalness and speaker similarity comparable to various strong baseline models. Audio samples are available at https://e1tts.github.io/.

1. Introduction

Non-autoregressive (NAR) text-to-speech (TTS) models \cite{TTSSurvey} generate speech from text in parallel, synthesizing all speech units simultaneously. This enables faster inference compared to autoregressive (AR) models, which generate speech one unit at a time. Most NAR TTS models incorporate duration predictors in their architecture and rely on alignment supervision~\cite{DeepVoice,FastPitch,FastSpeech}. Monotonic alignments between input text and corresponding speech provide information about the number of speech units associated with each text unit, guiding the model during training. During inference, learned duration predictors estimate speech timing for each text unit.
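
To make the duration-based approach concrete, below is a minimal sketch of a FastSpeech-style length regulator, the component that turns per-text-unit durations into frame-level inputs. This is an illustrative PyTorch sketch, not the exact layer from any cited paper; shapes and names are assumptions.

```python
import torch

def length_regulate(text_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand text-unit hidden states to frame level using durations.

    text_hidden: (T_text, D) one hidden vector per text unit.
    durations:   (T_text,) integer speech-frame count per text unit, taken
                 from alignment supervision during training or from a
                 duration predictor during inference.
    returns:     (sum(durations), D) frame-level hidden states.
    """
    # Repeat hidden state i exactly durations[i] times along the time axis.
    return torch.repeat_interleave(text_hidden, durations, dim=0)

# Example: 3 text units mapped to 2, 4, and 1 frames -> 7 frames total.
h = torch.randn(3, 256)
d = torch.tensor([2, 4, 1])
assert length_regulate(h, d).shape == (7, 256)
```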

Several pioneering studies~\cite{VARA-TTS,Flow-TTS} have proposed implicit-duration non-autoregressive (ID-NAR) TTS models that eliminate the need for alignment supervision or explicit duration prediction. These models learn to align text and speech units in an end-to-end fashion using attention mechanisms, implicitly generating text-to-speech alignment.
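
A minimal single-head sketch of the cross-attention read such models rely on: each speech-side position softly attends over the text-encoder outputs, and the attention weights play the role of a soft text-to-speech alignment learned end-to-end, with no duration labels. Assumes PyTorch; all names are illustrative.

```python
import torch

def cross_attention_align(speech_q: torch.Tensor,
                          text_k: torch.Tensor,
                          text_v: torch.Tensor):
    """speech_q: (T_speech, D) queries from the speech branch.
    text_k, text_v: (T_text, D) keys/values from the text encoder.
    Returns attended values (T_speech, D) and the soft alignment
    matrix (T_speech, T_text) implied by the attention weights."""
    scores = speech_q @ text_k.T / text_k.shape[-1] ** 0.5  # scaled dot product
    align = torch.softmax(scores, dim=-1)  # each row sums to 1: a soft alignment
    return align @ text_v, align
```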

Recently, several diffusion-based~\cite{ScoreSDE} ID-NAR TTS models~\cite{E3TTS,SimpleTTS2,SimpleSpeech,E2TTS,Seed-TTS,DiTToTTS,Mapache} have been proposed, demonstrating state-of-the-art naturalness and speaker similarity in zero-shot text-to-speech~\cite{YourTTS}. However, these models still require an iterative sampling procedure taking dozens of network evaluations to reach high synthesis quality. Diffusion distillation techniques~\cite{BlogDistillation} can be employed to reduce the number of network evaluations in sampling from diffusion models. Most distillation techniques are based on approximating the ODE sampling trajectories of the teacher model. For example, ProDiff~\cite{ProDiff} applied Progressive Distillation~\cite{ProgressiveDistillation}, CoMoSpeech~\cite{CoMoSpeech} and FlashSpeech~\cite{FlashSpeech} applied Consistency Distillation~\cite{ConsistencyModel}, and VoiceFlow~\cite{VoiceFlow} and ReFlow-TTS~\cite{ReFlowTTS} applied Rectified Flow~\cite{RectifiedFlow}. Recently, a different family of distillation methods was discovered~\cite{DiffInstruct,DMD2}, which directly approximates and minimizes various divergences between the generator's sample distribution and the data distribution. Compared to ODE trajectory-based methods, the student model can match or even outperform the diffusion teacher model~\cite{DMD2}, as the distilled one-step generator does not suffer from error accumulation in diffusion sampling.
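
The sketch below illustrates the core generator update of a distribution-matching objective in the spirit of DiffInstruct/DMD, not the paper's exact implementation: student samples are diffused, two score (or denoiser) networks are evaluated at the noisy point, and their difference estimates the gradient of the divergence with respect to the generated samples. `score_real`, `score_fake`, `alpha`, and `sigma` are assumed helpers.

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(x_gen, score_real, score_fake, alpha, sigma):
    """x_gen: outputs of the one-step student generator, shape (B, ...).
    score_real: frozen teacher score model of the data distribution.
    score_fake: auxiliary score model trained online on x_gen samples.
    alpha, sigma: noise-schedule coefficients as functions of t."""
    t = torch.rand((), device=x_gen.device)       # shared random timestep
    eps = torch.randn_like(x_gen)
    x_t = alpha(t) * x_gen + sigma(t) * eps       # diffuse the student samples

    with torch.no_grad():
        # Estimated gradient of KL(generator || data) at the noisy point.
        grad = score_fake(x_t, t) - score_real(x_t, t)

    # Surrogate loss whose gradient w.r.t. x_gen is proportional to `grad`.
    return 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach())
```

In this family of methods, `score_fake` is updated in alternation with the generator using an ordinary denoising loss on the generator's own samples; it is this online divergence estimate, rather than imitation of the teacher's ODE trajectories, that distinguishes distribution matching from the trajectory-based distillation methods listed above.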

In this work, we distill a diffusion-based ID-NAR TTS model into a one-step generator using the recently proposed distribution matching distillation method~\cite{DiffInstruct,DMD2}. The distilled model is more robust than its teacher and achieves performance comparable to several strong AR and NAR baseline systems.

2. Related Works

3. Methodology

4. Experiments

5. Results

6. Conclusions