Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset. It contains 43,253 short audio clips of a single speaker reading 14 books. The metadata format is similar to that of LJ Speech, so the dataset is compatible with modern speech synthesis systems.

The texts are from Aozora Bunko, which is in the public domain. The audio clips are from the LibriVox project, which is also in the public domain. Readings are estimated from the kanji-kana mixture text by MeCab and UniDic Lite, then romanized in a format similar to the one used by Julius.

The audio clips were split and transcripts were aligned automatically by Kokoro-Align.

Sample data

Listen in your browser or download 100 randomly sampled clips.

File Format

Metadata is provided in metadata.csv. This file consists of one record per line, delimited by the pipe character (0x7c). The fields are:

ID: this is the name of the corresponding .wav file
Transcription: Kanji-kana mixture text spoken by the reader (UTF-8)
Reading: Romanized text spoken by the reader (UTF-8)
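
The record layout above can be read with Python's standard csv module. The following is a minimal sketch, not part of the dataset's tooling: the Record type and the parse_metadata and load_metadata names are made up for illustration, and it assumes that fields contain no quoting and no escaped pipe characters.

```python
import csv
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """One metadata.csv line (hypothetical container, not from the repo)."""
    clip_id: str
    transcription: str
    reading: str


def parse_metadata(lines: Iterable[str]) -> List[Record]:
    """Parse pipe-delimited metadata lines into Record tuples."""
    # QUOTE_NONE treats quote characters as ordinary text (an assumption
    # about the file; the README does not describe any quoting rules).
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 fields, got {len(row)}: {row!r}")
        records.append(Record(*row))
    return records


def load_metadata(path: str = "metadata.csv") -> List[Record]:
    """Read a UTF-8 metadata file from disk."""
    with open(path, encoding="utf-8", newline="") as f:
        return parse_metadata(f)
```

Calling load_metadata() returns a list whose entries expose clip_id, transcription and reading as attributes.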

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
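
To confirm that an extracted clip matches the format stated above, the header can be inspected with Python's standard wave module. This is a sketch under that assumption only; the inspect_wav and matches_expected_format names are illustrative and do not come from the repository.

```python
import wave

EXPECTED_CHANNELS = 1        # single channel
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit PCM
EXPECTED_SAMPLE_RATE = 22050 # Hz


def inspect_wav(path):
    """Return basic header information and the duration of a WAV file."""
    with wave.open(str(path), "rb") as w:
        channels = w.getnchannels()
        sample_width = w.getsampwidth()
        sample_rate = w.getframerate()
        frames = w.getnframes()
    return {
        "channels": channels,
        "sample_width": sample_width,
        "sample_rate": sample_rate,
        "duration_secs": frames / sample_rate,
    }


def matches_expected_format(info):
    """True if the header matches mono, 16-bit, 22050 Hz."""
    return (
        info["channels"] == EXPECTED_CHANNELS
        and info["sample_width"] == EXPECTED_SAMPLE_WIDTH
        and info["sample_rate"] == EXPECTED_SAMPLE_RATE
    )
```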

Statistics

The dataset is provided in four sizes: xlarge, large, small and tiny. large, small and tiny do not share any clips with each other. xlarge contains all available clips, including those in large, small and tiny.

X Large:
Total clips: 44788
Min duration: 3.007 secs
Max duration: 14.861 secs
Mean duration: 4.718 secs
Total duration: 58:41:39

Large:
Total clips: 23461
Min duration: 3.007 secs
Max duration: 14.861 secs
Mean duration: 4.742 secs
Total duration: 30:54:16

Small:
Total clips: 9199
Min duration: 3.007 secs
Max duration: 9.961 secs
Mean duration: 4.687 secs
Total duration: 11:58:31

Tiny:
Total clips: 308
Min duration: 3.030 secs
Max duration: 8.092 secs
Mean duration: 4.695 secs
Total duration: 00:24:05
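
Figures of the kind listed above can be derived from a list of per-clip durations in seconds. The sketch below shows one way to do the arithmetic; the summarize and format_hms names are illustrative, and truncating the total to whole seconds is an assumption, since the README does not say how the totals were rounded.

```python
def format_hms(seconds):
    """Format a duration in seconds as HH:MM:SS."""
    total = int(seconds)  # assumption: fractional seconds are truncated
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def summarize(durations):
    """Compute clip count, min, max, mean and total duration."""
    durations = list(durations)
    if not durations:
        raise ValueError("no durations given")
    total = sum(durations)
    return {
        "clips": len(durations),
        "min": min(durations),
        "max": max(durations),
        "mean": total / len(durations),
        "total": format_hms(total),
    }
```

The per-clip durations themselves would have to be measured from the extracted audio files, since metadata.csv as described above does not carry a duration field.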

How to get the data

Because the dataset is large, audio files are not included in this repository; only the metadata is included.

To make .wav files of the dataset, run

$ bash download.sh

to download the metadata from the project page. Then run

$ pip3 install torchaudio
$ python3 extract.py --size tiny

If you have not already downloaded the source audio, this prints an example shell script that downloads the MP3 audio files from archive.org and extracts them.

After doing so, run the command again

$ python3 extract.py --size tiny

to get the files for tiny under the ./output directory.

To get a different size of the dataset, pass its name to the --size option.

Use the --format option to specify the audio clip format.
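
After extraction it can be useful to check that every ID in the metadata has a matching .wav file. The following is a rough sketch, not part of extract.py: the find_missing_clips and wav_name names are made up, the directory passed as audio_dir is an assumption (the exact layout under ./output is not described here), and the helper accepts IDs with or without the .wav suffix because the README does not say which form is stored. It only applies when clips are extracted as WAV.

```python
from pathlib import Path


def wav_name(clip_id):
    """Map a metadata ID to a .wav file name (suffix handling is an assumption)."""
    return clip_id if clip_id.endswith(".wav") else clip_id + ".wav"


def find_missing_clips(clip_ids, audio_dir):
    """Return the IDs whose .wav file is not present in audio_dir."""
    audio_dir = Path(audio_dir)
    return [
        clip_id
        for clip_id in clip_ids
        if not (audio_dir / wav_name(clip_id)).is_file()
    ]
```

An empty result means every listed clip was found in the given directory.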

Pretrained Tacotron model

Audio Samples
Pretrained model

A pretrained Tacotron model trained on the Kokoro Speech Dataset is available, along with audio samples. The model was trained for 21K steps on small. According to the above repo, "Speech started to become intelligible around 20K steps" with the LJ Speech Dataset. The audio samples read the first few sentences of Gon Gitsune, which is not included in small.

Books

The dataset contains recordings of the following books, read by ekzemplaro:

明暗 (Meian) 16:39:29 Online text
こころ (Kokoro) 08:46:41 Online text
田舎教師 (Inaka Kyoshi) 08:13:26 Online text
野分 (Nowaki) 04:40:49 Online text
草枕 (Kusamakura) 04:27:35 Online text
坊っちゃん (Botchan) 04:26:27 Online text
雁 (Gan) 03:41:31 Online text
生まれいずる悩み (Umareizuru Nayami) 02:43:12 Online text
硝子戸の中 (Garasudono uchi) 02:39:53 Online text
永日小品 (Eijitsu Syohin) 02:33:54 Online text
蒲団 (Futon) 02:28:58 Online text
高野聖 (Kouyahijiri) 02:06:23 Online text
ごん狐 (Gon gitsune) 00:15:42 Online text
コーカサスの禿鷹 (Caucasus no Hagetaka) 00:13:04 Online text

Similar project

This project was also inspired by CSS10, which contains audio clips of various languages from LibriVox.

Changelog

v1.3 Keep word separators in transcripts with '_'
v1.2 New metadata generated with a new align model
v1.1.1 Added FLAC, MP3, OGG support
v1.1 Added more books
v1.0 Initial release

Credits

All texts are from Aozora Bunko. Recordings by ekzemplaro from LibriVox. Alignment and annotation by Katsuya Iida.

License

This dataset is in the public domain in the USA (and most likely other countries as well). There are no restrictions on its use. For more information, please see: librivox.org/pages/public-domain.
