F5-TTS

Basic Information
  • Title: "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
  • Authors:
    • 01 Yushen Chen - Shanghai Jiao Tong University
    • 02 Zhikang Niu - Shanghai Jiao Tong University
    • 03 Ziyang Ma - Shanghai Jiao Tong University
    • 04 Keqi Deng - University of Cambridge
    • 05 Chunhui Wang - Geely Automobile Research Institute
    • 06 Jian Zhao - Geely Automobile Research Institute
    • 07 Kai Yu - Shanghai Jiao Tong University
    • 08 Xie Chen - Shanghai Jiao Tong University - chenxie95@sjtu.edu.cn
  • Links:
  • Files:
    • ArXiv
    • [Publication] #TODO

Abstract


This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at this https URL. We release all code and checkpoints to promote community development (Github).



1. Introduction

Recent research in Text-to-Speech (TTS) has seen great advancement (Tacotron2 [1]; Transformer-TTS [2]; FastSpeech 2 [3]; Glow-TTS [4]; VITS [5]; Grad-TTS [6]; VioLA [7]; NaturalSpeech [8]). With a few seconds of audio prompt, current TTS models are able to synthesize speech for any given text and mimic the speaker of the audio prompt (VALL-E [9]; VALL-E X [10]). The synthesized speech can achieve such high fidelity and naturalness that it is almost indistinguishable from human speech (NaturalSpeech 2 [11]; NaturalSpeech 3 [12]; VALL-E 2 [13]; Voicebox [14]).

While autoregressive (AR) based TTS models exhibit an intuitive way of consecutively predicting the next token(s) and have achieved promising zero-shot TTS capability, the inherent limitations of AR modeling require extra effort to address issues such as inference latency and exposure bias (ELLA-V [15]; \cite{vallt, valler, ralle, voicecraft}). Moreover, the quality of the speech tokenizer is essential for AR models to achieve high-fidelity synthesis \cite{soundstream, encodec, audiodec, hificodec, speechtokenizer, dmel, ndvq}. Thus, there have recently been studies exploring direct modeling in continuous space (\cite{ardittts, ar-wovq}; MELLE [29]) to enhance synthesized speech quality.

Although AR models demonstrate impressive zero-shot performance, as they perform implicit duration modeling and can leverage diverse sampling strategies, non-autoregressive (NAR) models benefit from fast inference through parallel processing and effectively balance synthesis quality and latency. Notably, diffusion models \cite{ddpm, score} contribute most to the success of current NAR speech models (NaturalSpeech 2 [11]; NaturalSpeech 3 [12]). In particular, Flow Matching with Optimal Transport path (FM-OT) \cite{cfm-ot} is widely used in recent research, not only in text-to-speech (Voicebox [14]; \cite{voiceflow, matchatts}; DiTTo-TTS [35]; E2 TTS [36]) but also in image generation \cite{sd3} and music generation \cite{fluxmusic}.

Unlike for AR-based models, alignment modeling between input text and synthesized speech is crucial and challenging for NAR-based models. NaturalSpeech 3 [12] and Voicebox [14] use frame-wise phoneme alignment, while Matcha-TTS \cite{matchatts} adopts monotonic alignment search and relies on a phoneme-level duration model; recent works find that introducing such rigid and inflexible alignment between text and speech hinders the model from generating results with higher naturalness (E2 TTS [36]; Seed-TTS [39]).

E3 TTS [40] abandons phoneme-level duration and applies cross-attention on the input sequence but yields limited audio quality. DiTTo-TTS [35] uses Diffusion Transformer (DiT) [41] with cross-attention conditioned on encoded text from a pretrained language model. To further enhance alignment, it uses the pretrained language model to finetune the neural audio codec, infusing semantic information into the generated representations. In contrast, E2 TTS [36], based on Voicebox [14], adopts a simpler way, which removes the phoneme and duration predictor and directly uses characters padded with filler tokens to the length of mel spectrograms as input. This simple scheme also achieves very natural and realistic synthesized results. However, we found that robustness issues exist in E2 TTS for the text and speech alignment. Seed-TTS [39] employs a similar strategy and achieves excellent results, though not elaborated in model details. In these ways of not explicitly modeling phoneme-level duration, models learn to assign the length of each word or phoneme according to the given total sequence length, resulting in improved prosody and rhythm.

In this paper, we propose F5-TTS, a Fairytaler that Fakes Fluent and Faithful speech with Flow matching. Maintaining the simplicity of a pipeline without phoneme alignment, duration predictor, text encoder, or semantically infused codec model, F5-TTS leverages the Diffusion Transformer with ConvNeXt V2~\cite{convnextv2} to better tackle text-speech alignment during in-context learning. We stress the deep entanglement of semantic and acoustic features in the E2 TTS model design, which causes inherent problems and poses alignment failures that cannot simply be solved with re-ranking. With in-depth ablation studies, our proposed F5-TTS demonstrates stronger robustness, generating speech more faithful to the text prompt while maintaining comparable speaker similarity. Additionally, we introduce an inference-time sampling strategy for flow steps that substantially improves the naturalness, intelligibility, and speaker similarity of generation. This approach can be seamlessly integrated into existing flow matching based models without retraining.

2. Preliminaries

Flow Matching

The Flow Matching (FM) objective is to match a probability path $p_t$ from a simple distribution $p_0$, \textit{e.g.}, the standard normal distribution $p(x) = \mathcal{N}(x\,|\,0, I)$, to $p_1$ approximating the data distribution $q$. In short, the FM loss regresses the vector field $u_t$ with a neural network $v_t$ as

$$ \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, p_t(x)} \left\| v_t(x) - u_t(x) \right\|^2, $$

where $\theta$ parameterizes the neural network, $t\sim\mathcal{U}[0,1]$ and $x\sim p_t(x)$. The model $v_t$ is trained over the entire flow step and data range, ensuring it learns to handle the entire transformation process from the initial distribution to the target distribution.

As we have no prior knowledge of how to approximate $p_t$ and $u_t$, a conditional probability path $p_t(x|x_1) = \mathcal{N}(x\ |\ \mu_t(x_1),\sigma_t(x_1)^2I)$ is considered in actual training, and the Conditional Flow Matching (CFM) loss is proved to have identical gradients \textit{w.r.t.} $\theta$~\cite{cfm-ot}. Here $x_1$ is the random variable corresponding to the training data, and $\mu_t$ and $\sigma_t$ are the time-dependent mean and scalar standard deviation of the Gaussian distribution.

Remember that the goal is to construct the target distribution (data samples) from an initial simple distribution, \textit{e.g.}, Gaussian noise. With the conditional form, the flow map $\psi_t(x)=\sigma_t(x_1)x+\mu_t(x_1)$ with $\mu_0(x_1)=0$, $\sigma_0(x_1)=1$, $\mu_1(x_1)=x_1$ and $\sigma_1(x_1)=0$ is made to have all conditional probability paths converge to $p_0$ and $p_1$ at the start and end respectively. The flow thus provides a vector field $d\psi_t(x_0)/dt = u_t(\psi_t(x_0)|x_1)$. Reparameterizing $p_t(x|x_1)$ with $x_0$, we have

$$ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t(\psi_t(x_0)) - \frac{d}{dt}\psi_t(x_0) \right\|^2. $$

Further leveraging the Optimal Transport form $\psi_t(x)=(1-t)x+tx_1$, we have the OT-CFM loss,

$$ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t((1-t)x_0 + t x_1) - (x_1 - x_0) \right\|^2. $$
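As a minimal PyTorch sketch of a single OT-CFM training step (not the authors' code; `model` is a placeholder network predicting the vector field from the noisy input and flow step):

```python
import torch

def ot_cfm_loss(model, x1):
    """One OT-CFM step: regress v_t((1-t)x0 + t*x1) onto the target (x1 - x0)."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                    # noise sample from p_0
    t = torch.rand(b, device=x1.device)          # flow step t ~ U[0, 1]
    t_ = t.view(b, *([1] * (x1.dim() - 1)))      # broadcast t over feature dims
    xt = (1 - t_) * x0 + t_ * x1                 # point on the OT path psi_t(x0)
    target = x1 - x0                             # d/dt psi_t(x0) for the OT path
    pred = model(xt, t)                          # v_t(psi_t(x0); theta)
    return torch.mean((pred - target) ** 2)
```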

Viewed more generally~\cite{snrtheory}, if the loss is formulated in terms of log signal-to-noise ratio (log-SNR) $\lambda$ instead of flow step $t$, and parameterized to predict $x_0$ ($\epsilon$, as commonly stated in diffusion models) instead of $x_1-x_0$, the CFM loss is equivalent to the $v$-prediction~\cite{vpredict} loss with a cosine schedule.

For inference, given sampled noise $x_0$ from the initial distribution $p_0$, a flow step $t\in[0,1]$, and the condition for the generation task, an ordinary differential equation (ODE) solver~\cite{torchdiffeq} is used to evaluate $\psi_1(x_0)$, the integration of $d\psi_t(x_0)/dt$ with $\psi_0(x_0)=x_0$. The number of function evaluations (NFE) is the number of times the neural network is run, as we may provide multiple flow step values from 0 to 1 as input to approximate the integration. A higher NFE produces more accurate results but certainly takes more computation time.
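To make the NFE-versus-accuracy trade-off concrete, below is a hedged sketch of a fixed-step Euler solve (one of the simplest ODE solvers; the paper's implementation relies on torchdiffeq, whose API differs):

```python
import torch

@torch.no_grad()
def euler_sample(model, x0, nfe=32):
    """Integrate d(psi_t)/dt = v_t(psi_t) from t=0 to t=1 with `nfe` Euler steps."""
    x = x0
    ts = torch.linspace(0, 1, nfe + 1, device=x0.device)   # ordered flow steps
    for i in range(nfe):
        t_batch = torch.full((x.shape[0],), ts[i].item(), device=x.device)
        v = model(x, t_batch)                               # one function evaluation
        x = x + (ts[i + 1] - ts[i]) * v                     # Euler update
    return x                                                # approximates psi_1(x0)
```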

Classifier-Free Guidance

Classifier Guidance (CG), proposed by \cite{cg}, functions by adding the gradient of an additional classifier, but such an explicit way to condition the generation process has several problems. Extra training of the classifier is required, and the generation result is directly affected by the quality of the classifier. Adversarial attacks might also occur since the guidance is introduced by updating the gradient; deceptive images with details imperceptible to human eyes may thus be generated which are not truly conditional.

Classifier-Free Guidance (CFG)~\cite{cfg} proposes to replace the explicit classifier with an implicit one, without directly computing the explicit classifier and its gradient. The gradient of a classifier can be expressed as a combination of a conditional generation probability and an unconditional generation probability. By dropping the condition at a certain rate during training, and linearly extrapolating the inference outputs with and without condition $c$, the final guided result is obtained. We can balance between fidelity and diversity of the generated samples with

$$ v_{t,CFG} = v_{t}(\psi_t(x_0),c) + \alpha (v_{t}(\psi_t(x_0),c)-v_{t}(\psi_t(x_0))) $$

in the CFM case, where $\alpha$ is the CFG strength.\footnote{Note that inference time is doubled with CFG: the model $v_t$ executes the forward pass twice, once with the condition and once without.}
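A minimal sketch of this guided vector field at inference (assuming a hypothetical `model` that accepts an optional condition; with guidance the network indeed runs twice per flow step):

```python
import torch

@torch.no_grad()
def cfg_vector_field(model, xt, t, cond, alpha=2.0):
    """Classifier-free guided field: v_cond + alpha * (v_cond - v_uncond)."""
    v_cond = model(xt, t, cond)      # forward pass with condition c
    v_uncond = model(xt, t, None)    # forward pass with condition dropped
    return v_cond + alpha * (v_cond - v_uncond)
```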

3. Methodology

This work aims to build a high-level text-to-speech synthesis system. Following Voicebox [14], we trained our model on the text-guided speech-infilling task. Based on recent research (DiTTo-TTS [35]; E2 TTS [36]; E1 TTS [48]), it is promising to train without a phoneme-level duration predictor, and deprecating explicit phoneme-level alignment can achieve higher naturalness in zero-shot generation. We adopt a similar pipeline to E2 TTS [36] and propose our advanced architecture F5-TTS, addressing the slow convergence (timbre is learned well at an early stage but alignment is hard to learn) and robustness issues (failures on hard-case generation) of E2 TTS. We also propose a Sway Sampling strategy for flow steps at inference, which significantly improves our model's performance in faithfulness to the reference text and speaker similarity.

3.1. Pipeline

Training

The infilling task is to predict a segment of speech given its surrounding audio and the full text (covering both the surrounding transcription and the part to generate). For simplicity, we reuse the symbol $x$ to denote an audio sample and $y$ the corresponding transcript for a data pair $(x, y)$. As shown in Fig.\ref{fig:overview} (left), the acoustic input for training is the extracted mel spectrogram features $x_1\in \mathbb{R}^{F\times N}$ of the audio sample $x$, where $F$ is the mel dimension and $N$ is the sequence length. In the scope of CFM, we pass the model the noisy speech $(1-t)x_0+tx_1$ and the masked speech $(1-m)\odot x_1$, where $x_0$ denotes sampled Gaussian noise, $t$ is the sampled flow step, and $m\in\{0,1\}^{F\times N}$ represents a binary temporal mask.

Following E2 TTS, we directly use alphabets and symbols for English, and we opt for full pinyin to facilitate Chinese zero-shot generation. By breaking the raw text into such a character sequence and padding it with filler tokens to the same length as the mel frames, we form an extended sequence $z$ with $c_i$ denoting the $i$-th character: $$ z = (c_1, c_2, \ldots, c_M, \underbrace{\langle F \rangle, \ldots, \langle F \rangle}_{(N-M)\text{ times}}). $$ The model is trained to reconstruct $m\odot x_1$ from $(1-m)\odot x_1$ and $z$, which amounts to learning the target distribution $p_1$ in the form of $P(m\odot x_1 \mid (1-m)\odot x_1, z)$, approximating the real data distribution $q$.
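A sketch of how these training inputs could be assembled (a simplified illustration, not the released code: the filler-token index, contiguous masking, and sampling details are assumptions):

```python
import torch

FILLER = 0  # hypothetical index of the filler token <F>

def build_training_inputs(mel, char_ids, mask_ratio=(0.7, 1.0)):
    """mel: (F, N) spectrogram x1; char_ids: M character indices, M <= N."""
    F_dim, N = mel.shape
    M = len(char_ids)
    # Extended text sequence z: characters padded with filler tokens to length N.
    z = torch.full((N,), FILLER, dtype=torch.long)
    z[:M] = torch.tensor(char_ids)
    # Binary temporal mask m covering a random 70-100% span of mel frames.
    span = int(N * torch.empty(1).uniform_(*mask_ratio).item())
    start = torch.randint(0, N - span + 1, (1,)).item()
    m = torch.zeros(N)
    m[start:start + span] = 1.0
    # Noisy speech (1-t)x0 + t*x1 and masked speech (1-m) * x1.
    x0 = torch.randn_like(mel)
    t = torch.rand(())
    noisy_speech = (1 - t) * x0 + t * mel
    masked_speech = (1 - m) * mel        # mask broadcasts over the mel dimension
    return z, noisy_speech, masked_speech, m, t
```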

Inference

To generate speech with the desired content, we have the audio prompt's mel spectrogram features $x_{ref}$, its transcription $y_{ref}$, and a text prompt $y_{gen}$. The audio prompt provides speaker characteristics, while the text prompt guides the content of the generated speech.

The sequence length $N$, or duration, has now become a pivotal factor: the model must be informed of the desired length for sample generation. One could train a separate model to predict and deliver the duration based on $x_{ref}$, $y_{ref}$ and $y_{gen}$. Here we simply estimate the duration based on the ratio of the number of characters in $y_{gen}$ and $y_{ref}$. We assume that the total number of characters is no greater than the number of mel frames, so padding with filler tokens is done as during training.
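A minimal sketch of this ratio-based duration estimate (the character counting here is naive; the released code may treat punctuation and languages differently):

```python
def estimate_total_frames(ref_mel_len, ref_text, gen_text):
    """Total length N: prompt frames plus generated frames scaled by the text-length ratio."""
    ref_chars = max(len(ref_text), 1)
    gen_frames = int(ref_mel_len * len(gen_text) / ref_chars)
    return ref_mel_len + gen_frames

# e.g. a 250-frame prompt; the generated part gets frames proportional to its text length
print(estimate_total_frames(250, "hello there", "hello there, nice to meet you"))
```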

To sample from the learned distribution, the converted mel features $x_{ref}$, along with the concatenated and extended character sequence $z_{ref\cdot gen}$, serve as the condition in Eq.\ref{eq:cfg}. We have $$ v_t(\psi_t(x_0),c) = v_t((1-t)x_0+tx_1 \mid x_{ref}, z_{ref\cdot gen}). $$ As shown in Fig.\ref{fig:overview} (right), we start from a sampled noise $x_0$, and what we want is the other end of the flow, $x_1$. Thus we use the ODE solver to gradually integrate from $\psi_0(x_0)=x_0$ to $\psi_1(x_0)=x_1$, given $d\psi_t(x_0)/dt=v_t(\psi_t(x_0),x_{ref}, z_{ref\cdot gen})$. During inference, the flow steps are provided in an ordered way, \textit{e.g.}, a certain number of steps uniformly sampled from 0 to 1 according to the NFE setting.

After obtaining the generated mel spectrogram with the model $v_t$ and the ODE solver, we discard the part corresponding to $x_{ref}$, then leverage a vocoder to convert the mel spectrogram back to a speech signal.

3.2. F5-TTS

E2 TTS directly concatenates the padded character sequence with the input speech, thus deeply entangling semantic and acoustic features that carry a large gap in effective information length, which is the underlying cause of hard training and poses several problems in the zero-shot scenario (Sec.\ref{sec:modelarchitecture}). To alleviate the problems of slow convergence and low robustness, we propose F5-TTS, which accelerates training and inference and shows strong robustness in generation. We also introduce an inference-time Sway Sampling, which allows faster inference (using fewer NFE) while maintaining performance. This flow-step sampling can be directly applied to other CFM models.

Model

As shown in Fig.\ref{fig:overview}, we use latent Diffusion Transformer (DiT) [41] as backbone. To be specific, we use DiT blocks with zero-initialized adaptive Layer Norm (adaLN-zero). To enhance the model's alignment ability, we also leverage ConvNeXt V2 blocks~\cite{convnextv2}. Its predecessor ConvNeXt V1~\cite{convnext} is used in many works and shows a strong temporal modeling capability in speech domain tasks~\cite{vocos,convnexttts}.

As described in Sec.\ref{sec:pipeline}, the model input consists of the character sequence, noisy speech, and masked speech. Before concatenation in the feature dimension, the character sequence first goes through the ConvNeXt blocks. Experiments have shown that this individual modeling space allows the text input to better prepare itself before later in-context learning. Unlike the phoneme-level force alignment done in Voicebox, a rigid boundary for text is not explicitly introduced; the semantic and acoustic features are jointly learned with the entire model. And unlike E2 TTS, which feeds the model inputs with a significant difference in effective information length, our design mitigates this gap.

The flow step $t$ for CFM is provided as the condition of adaLN-zero rather than being appended to the concatenated input sequence as in Voicebox. We found that an additional mean-pooled token of the text sequence for the adaLN condition is not essential for the TTS task, perhaps because TTS requires more rigorously guided results and the mean-pooled text token is too coarse.

We adopt some position embedding settings from Voicebox. The flow step is embedded with a sinusoidal position embedding. The concatenated input sequence is added with a convolutional position embedding. We apply rotary position embedding (RoPE)~\cite{rope} for self-attention rather than a symmetric bi-directional ALiBi bias~\cite{alibi}. For the extended character sequence $\hat{y}$, we also add an absolute sinusoidal position embedding before feeding it into the ConvNeXt blocks.

Compared with Voicebox and E2 TTS, we abandon the U-Net~\cite{unet} style skip connection structure and switch to DiT with adaLN-zero. Without a phoneme-level duration predictor and explicit alignment process, and without the extra text encoder and semantically infused neural codec model of DiTTo-TTS, we give the text input a little freedom (an individual modeling space) to let it prepare itself before concatenation and in-context learning with the speech input.
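The adaLN-zero modulation mentioned above can be sketched roughly as follows: a generic DiT-style block in PyTorch whose attention and FFN branches are scaled, shifted, and gated by projections of the flow-step embedding, with the projection zero-initialized. This is an illustrative block only (no RoPE, no ConvNeXt text branch), not the released architecture:

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Simplified DiT block with adaLN-zero conditioning on the flow-step embedding."""
    def __init__(self, dim=1024, heads=16, ffn_mult=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim))
        # Project t-embedding to 6 modulation vectors; zero-init makes the block
        # start as (near) identity, the "zero" in adaLN-zero.
        self.mod = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.mod.weight)
        nn.init.zeros_(self.mod.bias)

    def forward(self, x, t_emb):
        # x: (batch, seq, dim) concatenated features; t_emb: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.mod(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.ffn(h)
```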

Sampling

As stated in Sec.\ref{sec:fm}, CFM can be viewed as v-prediction with a cosine schedule. For image synthesis, \cite{sd3} propose to further schedule the flow step with a single-peak logit-normal~\cite{lognorm} sampling, in order to give more weight to intermediate flow steps by sampling them more frequently. We speculate that such sampling distributes the model's learning difficulty more evenly over different flow steps $t\in[0,1]$.

In contrast, we train our model with the traditional uniformly sampled flow step $t\sim \mathcal{U}[0,1]$ but apply a non-uniform sampling during inference. Specifically, we define the Sway Sampling function as $$ f_{sway}(u;s) = u + s\cdot\left(\cos\left(\frac{\pi}{2}u\right) - 1 + u\right), $$ which is monotonic for coefficient $s\in[-1,\frac{2}{\pi-2}]$. We first sample $u\sim \mathcal{U}[0,1]$, then apply this function to obtain the sway sampled flow step $t$. With $s<0$, the sampling sways to the left; with $s>0$, it sways to the right; and the $s=0$ case reduces to uniform sampling. Fig.\ref{fig:swaysampling} shows the probability density function of Sway Sampling on flow step $t$.
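A minimal sketch of the Sway Sampling function and of drawing an ordered, swayed set of flow steps for the ODE solver ($s=-1$ matches the default reported for the base models in the results section):

```python
import math
import torch

def sway(u, s=-1.0):
    """f_sway(u; s) = u + s * (cos(pi/2 * u) - 1 + u); monotonic for s in [-1, 2/(pi-2)]."""
    return u + s * (torch.cos(math.pi / 2 * u) - 1 + u)

def inference_flow_steps(nfe=32, s=-1.0):
    """Ordered flow steps for the solver, swayed toward t -> 0 when s < 0."""
    u = torch.linspace(0, 1, nfe + 1)   # uniform grid in u
    return sway(u, s)                   # non-uniform grid in t, still spanning [0, 1]

print(inference_flow_steps(nfe=8, s=-1.0))  # denser steps near t = 0
```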

Conceptually, CFM models focus more on sketching the contours of speech in the early stage ($t\to0$) from pure noise, and later focus more on the embellishment of fine-grained details. Therefore, the alignment between speech and text is determined by the first few generation steps. With a scale parameter $s<0$, we have the model infer more with smaller $t$, thus providing the ODE solver with more startup information so it can integrate more precisely in the initial steps.

4. Experiments

Datasets

We utilize the in-the-wild multilingual speech dataset Emilia~\cite{emilia} to train our base models. After simply filtering out transcription failures and misclassified-language speech, we retain approximately 95K hours of English and Chinese data. We also trained small models for the ablation study and architecture search on the WenetSpeech4TTS~\cite{wenet4tts} Premium subset, a 945-hour Mandarin corpus. Base model configurations are introduced below, and small model configurations are in Appendix \ref{appx:smallmodels}. Three test sets are adopted for evaluation: LibriSpeech-PC \textit{test-clean}~\cite{librispeechpc}, Seed-TTS [39] \textit{test-en} with 1088 samples from Common Voice~\cite{commonvoice}, and Seed-TTS \textit{test-zh} with 2020 samples from DiDiSpeech~\cite{didispeech}\footnote{\url{https://github.com/BytedanceSpeech/seed-tts-eval}}. Most previous English-only models are evaluated on different subsets of LibriSpeech \textit{test-clean} while the prompt lists used are not released, which makes fair comparison difficult. Thus we build and release a 4-to-10-second LibriSpeech-PC subset with 1127 samples to facilitate community comparisons.

Training

Our base models are trained for 1.2M updates with a batch size of 307,200 audio frames (0.91 hours), for over one week on 8 NVIDIA A100 80GB GPUs. The AdamW optimizer~\cite{adamw} is used with a peak learning rate of 7.5e-5, linearly warmed up for 20K updates and linearly decayed over the rest of training. The maximum gradient norm is clipped to 1. The F5-TTS base model has 22 layers, 16 attention heads, and 1024/2048 embedding/feed-forward network (FFN) dimensions for DiT; and 4 layers with 512/1024 embedding/FFN dimensions for ConvNeXt V2; 335.8M parameters in total. The reproduced E2 TTS, a 333.2M-parameter flat U-Net equipped Transformer, has 24 layers, 16 attention heads, and 1024/4096 embedding/FFN dimensions. Both models use RoPE as mentioned in Sec.\ref{sec:f5tts}, a dropout rate of 0.1 for attention and FFN, and the same convolutional position embedding as in Voicebox [14].

We directly use alphabets and symbols for English, and use jieba\footnote{\url{https://github.com/fxsjy/jieba}} and pypinyin\footnote{\url{https://github.com/mozillazg/python-pinyin}} to convert raw Chinese characters to full pinyin. The character embedding vocabulary size is 2546, counting the special filler token and all characters of other languages that exist in the Emilia dataset, as there are many code-switched sentences. For audio samples, we use 100-dimensional log mel-filterbank features with a 24 kHz sampling rate and hop length 256. A random 70% to 100% of mel frames is masked for the infilling task training. For CFG (Sec.\ref{sec:cfg}) training, the masked speech input alone is first dropped with a rate of 0.3, then the masked speech together with the text input is dropped with a rate of 0.2. We assume that this two-stage control of CFG training may help the model learn text alignment better.
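A hedged sketch of the Chinese text preprocessing with jieba and pypinyin (the exact tokenization rules and vocabulary handling in the released code may differ):

```python
import jieba
from pypinyin import lazy_pinyin, Style

def to_char_sequence(text):
    """English stays as raw characters; Chinese words become full pinyin with tone numbers."""
    tokens = []
    for word in jieba.cut(text):
        if all(ord(ch) < 128 for ch in word):            # ASCII: keep alphabets/symbols as-is
            tokens.extend(list(word))
        else:                                             # Chinese: full pinyin, e.g. "ni3"
            tokens.extend(lazy_pinyin(word, style=Style.TONE3))
    return tokens

print(to_char_sequence("你好, F5-TTS"))  # mixed, code-switched input
```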

Inference

The inference process is mainly elaborated in Sec.\ref{sec:pipeline}. We use the Exponential Moving Averaged (EMA)~\cite{ema} weights for inference, and the Euler ODE solver for F5-TTS (the midpoint method for E2 TTS, as described in E2 TTS [36]). We use the pretrained vocoder Vocos~\cite{vocos} to convert generated log mel spectrograms to audio signals.

Baselines

We compare our models with leading TTS systems, mainly including:

Metrics

We measure the performance under the \textit{cross-sentence} task: the model is given a reference text, a short speech prompt, and its transcription, and is asked to synthesize speech reading the reference text while mimicking the speaker of the speech prompt. Specifically, we report Word Error Rate (WER) and speaker similarity between generated and original target speeches (SIM-o) for objective evaluation. For WER, we employ Whisper-large-v3~\cite{whisper} to transcribe English and Paraformer-zh~\cite{funasr} for Chinese, following Seed-TTS [39]. For SIM-o, we use a WavLM-large-based~\cite{wavlm} speaker verification model to extract speaker embeddings for calculating the cosine similarity of synthesized and ground truth speeches. We use Comparative Mean Opinion Scores (CMOS) and Similarity Mean Opinion Scores (SMOS) for subjective evaluation. For CMOS, human evaluators are given randomly ordered synthesized speech and ground truth and decide by how much the naturalness of the better one surpasses the counterpart, \textit{w.r.t.} the prompt speech. For SMOS, human evaluators score the similarity between the synthesized speech and the prompt.
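For SIM-o, the score reduces to a cosine similarity between speaker embeddings; a sketch with placeholder embeddings (the actual embeddings come from a WavLM-large-based speaker verification model, not shown here):

```python
import torch
import torch.nn.functional as F

def sim_o(emb_generated, emb_ground_truth):
    """Cosine similarity between speaker embeddings of synthesized and target speech."""
    return F.cosine_similarity(emb_generated, emb_ground_truth, dim=-1).item()

print(sim_o(torch.randn(256), torch.randn(256)))  # random placeholders, real values in [-1, 1]
```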

5. Results

Tab.\ref{tab:librispeech-test} and \ref{tab:seedtts-test} show the main results of objective and subjective evaluations. We report the average score of three random seed generation results with our model and open-sourced baselines. We use by default a CFG strength of 2 and a Sway Sampling coefficient of $-1$ for our F5-TTS.

For English zero-shot evaluation, previous works are hard to compare directly as they use different subsets of LibriSpeech \textit{test-clean}~\cite{librispeech}. Although most of them claim to filter out 4-to-10-second utterances as the generation target, the corresponding prompt audios used are not released. Therefore, we build a 4-to-10-second test set based on LibriSpeech-PC~\cite{librispeechpc}, which is an extension of LibriSpeech with additional punctuation marks and casing. To facilitate future comparison, we release this 2-hour test set with 1,127 samples, sourced from 39 speakers (LibriSpeech-PC is missing one speaker).

F5-TTS achieves a WER of 2.42 on LibriSpeech-PC \textit{test-clean} with 32 NFE and Sway Sampling, demonstrating its robustness in zero-shot generation. Inferring with 16 NFE, F5-TTS reaches an RTF of 0.15 while still supporting high-quality generation with a WER of 2.53. It is clear that the Sway Sampling strategy greatly improves performance. The reproduced E2 TTS shows excellent speaker similarity (SIM) but a much worse WER in the zero-shot scenario, indicating an inherent deficiency in alignment robustness.

From the evaluation results on the Seed-TTS test sets, F5-TTS behaves similarly, with a WER close to ground truth and comparable SIM scores. It produces smooth and fluent speech in zero-shot generation with a CMOS of 0.31 (0.21) and SMOS of 3.89 (3.83) on Seed-TTS \textit{test-en} (\textit{test-zh}), and surpasses some baseline models trained at larger scales. It is worth mentioning that the best-performing Seed-TTS is trained with an orders-of-magnitude larger model size and dataset (several million hours) than ours. As stated in Sec.\ref{sec:pipeline}, we simply estimate duration based on the ratio of the audio prompt's transcript length and the text prompt length. If provided the ground truth duration, F5-TTS with 32 NFE and Sway Sampling achieves a WER of 1.74 on \textit{test-en} and 1.53 on \textit{test-zh} while maintaining the same SIM, indicating a high upper bound. A robustness test on ELLA-V [15] hard sentences is further included in Appendix \ref{appx:ellavhardtest}.

5.1. Ablation of Model Architecture

To clarify F5-TTS's efficiency and highlight the limitations of E2 TTS, we conduct in-depth ablation studies. We trained small models for 800K updates (each on 8 NVIDIA RTX 3090 GPUs for one week), all scaled to around 155M parameters, on the WenetSpeech4TTS Premium 945-hour Mandarin dataset with half the batch size and the same optimizer and scheduler as the base models. See Appendix \ref{appx:smallmodels} for details of the small model configurations.

We first experiment with a pure adaLN DiT (F5-TTS\textit{$-$Conv2Text}), which fails to learn alignment given simply padded character sequences. Based on the concept of refining the input text representation to better align with the speech modality, while keeping the simplicity of the system design, we propose to add a jointly learned structure to the input context. Specifically, we leverage ConvNeXt's capability of capturing local connections, multi-scale features, and spatial invariance for the input text, which is our F5-TTS. We ablate adding the same branch for the input speech, denoted F5-TTS\textit{$+$Conv2Audio}. We further conduct experiments to figure out whether the long skip connection and the pre-refinement of input text are beneficial to the counterpart backbone, \textit{i.e.}, F5-TTS and E2 TTS, named F5-TTS\textit{$+$LongSkip} and E2 TTS\textit{$+$Conv2Text} respectively. We also tried the Multi-Modal DiT (MMDiT)~\cite{sd3}, a double-stream joint-attention structure, for the TTS task; it learned fast and collapsed fast, resulting in severely repeated utterances with wild timbre and prosody. We assume that the pure MMDiT structure is far too flexible for a rigorous task such as TTS, which needs more faithful generation following the prompt guidance.

Fig.\ref{fig:modelarchitecture} shows the overall trend of the small models' WER and SIM scores evaluated on Seed-TTS \textit{test-zh}. Trained with only 945 hours of data, F5-TTS (32 NFE \textit{w/o} SS) achieves a WER of 4.17 and a SIM of 0.54 at 800K updates, while E2 TTS reaches 9.63 and 0.53. F5-TTS\textit{$+$Conv2Audio} trades much alignment robustness (+1.61 WER) for a slightly higher speaker similarity (+0.01 SIM), which is not ideal for scaling up. We found that the long skip connection structure cannot simply be fitted into DiT to improve speaker similarity, and the ConvNeXt for input text refinement cannot be directly applied to the flat U-Net Transformer to improve WER either; both show significant performance degradation. To further analyze the unsatisfactory results of E2 TTS, we studied its consistent failures (unable to be solved with re-ranking) on about 7% of the test set (WER $>$ 50%) throughout the training process. We found that E2 TTS typically struggles with around 140 samples, which we speculate have a large distribution gap with the training set, while F5-TTS easily tackles them.

We investigate the models' behaviors under different input conditions to further illustrate the advantages of F5-TTS and disclose the possible reasons for E2 TTS's deficiency. As shown in Tab.\ref{tab:ablationinput} in Appendix~\ref{appx:ablationinput}, providing the ground truth duration yields larger WER gains for F5-TTS than for E2 TTS, showing its robustness in alignment. By dropping the audio prompt and synthesizing speech solely with the text prompt, E2 TTS is free of failures. This phenomenon implies a deep entanglement of semantic and acoustic features within E2 TTS's model design. From the GFLOPs statistics in Tab.\ref{tab:smallmodels}, F5-TTS carries out faster training and inference than E2 TTS.

The aforementioned limitations of E2 TTS greatly hinder real-world application, as the failed generations cannot be fixed with re-ranking. For E2 TTS, supervised fine-tuning when facing out-of-domain data, or a tremendous pretraining scale, is mandatory, which is inconvenient for industrial deployment. In contrast, our F5-TTS better handles zero-shot generation, showing stronger robustness.

5.2. Ablation of Sway Sampling

It is clear from Fig.\ref{fig:swaysampling} that Sway Sampling with a negative $s$ improves the generation results, and with a more negative $s$, models achieve lower WER and higher SIM scores. We additionally include comparative results on base models with and without Sway Sampling in Appendix \ref{appx:swaysampling}.

As stated at the end of Sec.\ref{sec:f5tts}, Sway Sampling with $s<0$ puts more flow steps toward early-stage inference ($t\to0$), thus having the CFM model capture more startup information to better sketch the contours of the target speech. To be more concrete, we conduct a "leak and override" experiment. We first replace the Gaussian noise input $x_0$ at inference time with a ground-truth-information-leaked input $(1-t')x_0+t'x_{ref}'$, where $t'=0.1$ and $x_{ref}'$ is a duplicate of the audio prompt's mel features. Then, we provide a text prompt different from the duplicated audio transcript and let the model continue the subsequent inference (skipping the flow steps before $t'$). The model succeeds in overriding the leaked utterance and produces speech following the text prompt if Sway Sampling is used, and fails without it. With uniformly sampled flow steps, the model produces speech dominated by the leaked information, speaking the duplicated audio prompt's content. Similarly, a leaked timbre can be overridden with another speaker's utterance as the audio prompt, leveraging Sway Sampling.
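A sketch of how the leaked starting point for this probe could be constructed (names are illustrative, not from the released code):

```python
import torch

def leak_and_override_start(x0, x_ref_mel, t_leak=0.1):
    """Build the leaked input (1 - t')x0 + t' * x_ref'; integration then starts at t = t'."""
    # x_ref_mel duplicates the audio prompt's mel features and has the same shape as x0
    x_start = (1 - t_leak) * x0 + t_leak * x_ref_mel
    return x_start, t_leak  # run the ODE solver from t_leak to 1 with this start
```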

This result is strong evidence that the early flow steps are crucial for faithfully sketching the silhouette of the target speech based on the given prompts, while the later steps focus more on the already-formed intermediate noisy output; our sway-to-left sampling ($s<0$) finds this profitable niche and takes advantage of it. We emphasize that our inference-time Sway Sampling can be easily applied to existing CFM-based models without retraining. In future work, we will combine it with training-time noise schedulers and distillation techniques to further boost efficiency.

6. Conclusions

This work introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). With a tidy pipeline, literally text in and speech out, F5-TTS achieves state-of-the-art zero-shot ability compared to existing works trained on industry-scale data. We adopt ConvNeXt for text modeling and propose a test-time Sway Sampling strategy to further improve the robustness of speech generation and inference efficiency. Our design allows faster training and inference, achieving a test-time RTF of 0.15, which is competitive with other heavily optimized TTS models of similar performance. We open-source our code and models to enhance transparency and facilitate reproducible research in this area.