open-mmlab · RMSnow · Jul 9, 2024 · Jun 22, 2024
diff --git a/README.md b/README.md
@@ -27,13 +27,9 @@
 
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks.
 
-Here is the Amphion v0.1 demo, whose voice, audio effects, and singing voice are generated by our models. Just enjoy it!
-
-[amphion-v0.1-en](https://github.com/open-mmlab/Amphion/assets/24860155/7fcdcea5-3d95-4b31-bd93-4b4da734ef9b
-)
-
 ## 🚀 News
-- **2024/6/17**: Amphion has a new release for its VALL-E models, it uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable codes compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
+- **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and the **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
+- **2024/06/17**: Amphion has a new release for its **VALL-E** model! It uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable code compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
 - **2024/03/12**: Amphion now support **NaturalSpeech3 FACodec** and release pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)
 - **2024/02/22**: The first Amphion visualization tool, **SingVisio**, release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
 - **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
@@ -79,7 +75,8 @@ Amphion provides a comprehensive objective evaluation of the generated audio. Th
 
 ### Datasets
 
-Amphion unifies the data preprocess of the open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).
+- Amphion unifies data preprocessing for open-source datasets, including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The list of supported datasets is maintained [here](egs/datasets/README.md) and is still being updated.
+- Amphion (exclusively) supports the [**Emilia**](preprocessors/Emilia/README.md) dataset and its preprocessing pipeline **Emilia-Pipe** for in-the-wild speech data!
 
 ### Visualization
 

diff --git a/preprocessors/Emilia/README.md b/preprocessors/Emilia/README.md
@@ -0,0 +1,123 @@
+## Emilia
+
+This is the official repository for the **Emilia** dataset and the **Emilia-Pipe** source code.
+
+Emilia is a comprehensive, multilingual dataset featuring over 101k hours of speech in six languages: English (En), Chinese (Zh), German (De), French (Fr), Japanese (Ja), and Korean (Ko). The dataset includes diverse speech samples with various speaking styles.
+
+Emilia-Pipe is the first open-source preprocessing pipeline designed to transform raw, in-the-wild speech data into high-quality training data with annotations for speech generation. This pipeline can process one hour of raw audio into model-ready data in just a few minutes, requiring only the URLs of the audio or video sources. 
+
+By downloading the raw audio files from our provided list of URLs and processing them with Emilia-Pipe, users can obtain the Emilia dataset. Additionally, users can easily use Emilia-Pipe to preprocess their own raw speech data for custom needs. By open-sourcing the Emilia-Pipe code, we aim to enable the speech community to collaborate on large-scale speech generation research.
+
+This README explains how to install and use Emilia-Pipe.
+
+## Pipeline Overview
+
+The Emilia-Pipe includes the following major steps:
+
+0. Standardization: Audio normalization
+1. Source Separation: Long audio -> Long audio without BGM
+2. Speaker Diarization: Get medium-length single-speaker speech data
+3. Fine-grained Segmentation by VAD: Get 3-30s single-speaker speech segments
+4. ASR: Get transcriptions of the speech segments
+5. Filtering: Obtain the final processed dataset
+
+## Setup Steps
+
+### 0. Prepare Environment
+
+1. Install Python and CUDA.
+2. Run the following commands to install the required packages:
+
+```bash
+conda create -y -n AudioPipeline python=3.9 
+conda activate AudioPipeline
+
+bash env.sh
+```
+
+3. Download the model files:
+   - BGM separator: [UVR-MDX-NET-Inst_HQ_3](https://github.com/TRvlvr/model_repo/releases/tag/all_public_uvr_models)
+   - VAD: [Silero](https://github.com/snakers4/silero-vad)
+   - Speaker diarization: [pyannote](https://github.com/pyannote/pyannote-audio)
+   - ASR: [whisperx-medium](https://github.com/m-bain/whisperX)
+   - AutoMOS: [DNSMOS P.835](https://github.com/microsoft/DNS-Challenge)
+
+### 1. Config File
+
+```json
+{
+    "language": {
+        "multilingual": true,
+        "supported": [
+            "zh",
+            "en",
+            "fr",
+            "ja",
+            "ko",
+            "de"
+        ]
+    },
+    "entrypoint": {
+        // TODO: Fill in the input_folder_path. 
+        "input_folder_path": "examples", // #1: Data input
+        "SAMPLE_RATE": 24000
+    },
+    "separate": {
+        "step1": {
+            // TODO: Fill in the source separation model's path. 
+            "model_path": "/path/to/model/separate_model/UVR-MDX-NET-Inst_HQ_3.onnx", // #2: Model path
+            "denoise": true,
+            "margin": 44100,
+            "chunks": 15,
+            "n_fft": 6144,
+            "dim_t": 8,
+            "dim_f": 3072
+        }
+    },
+    "mos_model": {
+        // TODO: Fill in the DNSMOS prediction model's path. 
+        "primary_model_path": "/path/to/model/mos_model/DNSMOS/sig_bak_ovr.onnx" // #3: Model path
+    },
+    // TODO: Fill in your Hugging Face access token for pyannote.
+    "huggingface_token": "<HUGGINGFACE_ACCESS_TOKEN>" // #4: Huggingface access token for pyannote
+}
+```
+
+- #1: Data to be processed
+- #2 - #3: Model path configuration
+- #4: Huggingface access token
+
+
+### 2. Running Script
+
+1. Set `input_folder_path` in `config.json` to the folder that contains the downloaded audio files.
+2. Run the following command to process the audio files:
+
+```bash
+conda activate AudioPipeline
+export CUDA_VISIBLE_DEVICES=0  # Select which GPU runs the pipeline
+
+python main.py
+```
+
+3. The processed audio is saved into `input_folder_path_processed`, that is, the input folder path with `_processed` appended.
+
+
+### 3. Check the Results
+
+The processed audio files (24 kHz sample rate by default) and their annotations are saved in `input_folder_path_processed`. For each source file, the folder contains:
+
+1. **MP3 file**: `<original_name>_<idx>.mp3`
+2. **JSON file**: `<original_name>.json`
+
+```json
+[
+    {
+        "text": "So, don't worry about that. But, like for instance, like yesterday was very hard for me to say, you know what, I should go to bed.", // Transcription
+        "start": 67.18, // Start timestamp
+        "end": 74.41, // End timestamp
+        "language": "en", // Language
+        "dnsmos": 3.44 // DNSMOS score
+    }
+]
+```
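Note on the output format documented in `preprocessors/Emilia/README.md` above: the following is a minimal Python sketch of how the per-file JSON and the `<original_name>_<idx>.mp3` clips might be consumed downstream. It is not part of Emilia-Pipe. The function names, the DNSMOS threshold of 3.0, and the assumption that `<idx>` is an integer are all hypothetical; only the field names (`text`, `start`, `end`, `language`, `dnsmos`) and the 3-30 s segment range come from the README.

```python
import json
from pathlib import Path

MIN_SEC, MAX_SEC = 3.0, 30.0  # segment length range named in the pipeline overview
MIN_DNSMOS = 3.0              # hypothetical threshold, not taken from the README


def load_segments(json_path):
    """Load `<original_name>.json` and add a `duration` field in seconds."""
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)
    return [{**e, "duration": round(e["end"] - e["start"], 2)} for e in entries]


def find_clips(json_path):
    """List `<original_name>_<idx>.mp3` files next to the JSON, ordered by idx.

    Assumption: <idx> is an integer; zero padding, if any, is ignored.
    """
    json_path = Path(json_path)
    prefix = json_path.stem + "_"
    clips = []
    for p in json_path.parent.glob("*.mp3"):
        suffix = p.stem[len(prefix):]
        if p.stem.startswith(prefix) and suffix.isdigit():
            clips.append((int(suffix), p))
    return [p for _, p in sorted(clips)]


def is_usable(segment, min_dnsmos=MIN_DNSMOS):
    """Keep segments inside the length range whose DNSMOS meets the threshold."""
    in_range = MIN_SEC <= segment["duration"] <= MAX_SEC
    return in_range and segment["dnsmos"] >= min_dnsmos
```

Calling `load_segments` on a result file and filtering with `is_usable` gives a list of annotated segments; `find_clips` returns the matching audio paths sorted numerically, so `talk_2.mp3` comes before `talk_10.mp3`. Whether the position of an entry in the JSON list matches the clip index is an assumption that should be checked against real output.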
diff --git a/preprocessors/Emilia/config.json b/preprocessors/Emilia/config.json
@@ -0,0 +1,35 @@
+{
+    "language": {
+        "multilingual": true,
+        "supported": [
+            "zh",
+            "en",
+            "fr",
+            "ja",
+            "ko",
+            "de"
+        ]
+    },
+    "entrypoint": {
+        // TODO: Fill in the input_folder_path. 
+        "input_folder_path": "examples",
+        "SAMPLE_RATE": 24000
+    },
+    "separate": {
+        "step1": {
+            // TODO: Fill in the source separation model's path. 
+            "model_path": "/path/to/model/separate_model/UVR-MDX-NET-Inst_HQ_3.onnx",
+            "denoise": true,
+            "margin": 44100,
+            "chunks": 15,
+            "n_fft": 6144,
+            "dim_t": 8,
+            "dim_f": 3072
+        }
+    },
+    "mos_model": {
+        // TODO: Fill in the DNSMOS prediction model's path. 
+        "primary_model_path": "/path/to/model/mos_model/DNSMOS/sig_bak_ovr.onnx"
+    },
+    "huggingface_token": "<HUGGINGFACE_ACCESS_TOKEN>"
+}
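Note on `preprocessors/Emilia/config.json` above: the file contains `//` comments, which Python's standard `json` module rejects. How `main.py` parses the file is not shown in this diff; a JSON5-capable parser would be one option. The sketch below is a standard-library alternative, offered only as an illustration: it strips `//` comments that sit outside string literals and then reports which template values have not been filled in yet. The function names are hypothetical, and the placeholder checks assume the template values shown in this diff.

```python
import json


def strip_line_comments(text):
    """Remove `//` comments that are outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_config(path):
    """Parse a JSON file that may contain `//` line comments."""
    with open(path, encoding="utf-8") as f:
        return json.loads(strip_line_comments(f.read()))


def unfilled_placeholders(cfg):
    """Return the dotted keys that still hold the template values."""
    problems = []
    if cfg["separate"]["step1"]["model_path"].startswith("/path/to/"):
        problems.append("separate.step1.model_path")
    if cfg["mos_model"]["primary_model_path"].startswith("/path/to/"):
        problems.append("mos_model.primary_model_path")
    if cfg["huggingface_token"].startswith("<"):
        problems.append("huggingface_token")
    return problems
```

Tracking string state matters here because a naive search for `//` would also cut URLs such as `https://...` inside values. Trailing commas and `/* */` comments are not handled.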
diff --git a/preprocessors/Emilia/env.sh b/preprocessors/Emilia/env.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+# Copyright (c) 2024 Amphion.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+conda install ffmpeg -y
+conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
+pip install -r requirements.txt
+pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
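Note on `preprocessors/Emilia/env.sh` above: the script installs PyTorch with `pytorch-cuda=12.1` and a CUDA 12 build of `onnxruntime-gpu`. A quick sanity check after running it could look like the sketch below. This is an illustrative helper, not part of the PR; the function name `check_gpu_stack` and the report keys are hypothetical. It only reports what the installed packages expose and leaves a value as `None` when a package is missing or fails to import.

```python
import importlib.util


def check_gpu_stack():
    """Report what the installed PyTorch and ONNX Runtime builds expose."""
    report = {"torch_cuda": None, "torch_cuda_version": None, "onnx_cuda": None}

    if importlib.util.find_spec("torch") is not None:
        try:
            import torch
            # True when PyTorch can reach a CUDA device
            report["torch_cuda"] = torch.cuda.is_available()
            # CUDA version PyTorch was built against (None for CPU builds)
            report["torch_cuda_version"] = torch.version.cuda
        except Exception:
            pass  # broken install: leave the values as None

    if importlib.util.find_spec("onnxruntime") is not None:
        try:
            import onnxruntime
            # True when this ONNX Runtime build includes the CUDA provider
            providers = onnxruntime.get_available_providers()
            report["onnx_cuda"] = "CUDAExecutionProvider" in providers
        except Exception:
            pass

    return report


if __name__ == "__main__":
    for name, value in check_gpu_stack().items():
        print(f"{name}: {value}")
```

Run it inside the `AudioPipeline` conda environment. A `False` or `None` for `onnx_cuda` would suggest that the CPU-only `onnxruntime` package is installed instead of the GPU build.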