diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d4a8cf00..2b3a9737 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -11,30 +11,25 @@ You can contribute not only with code but with bug reports, comments, questions, If you like to contribute code, squash a bug but if you don't know where to start, here are some pointers. -- [Development Road Map](https://github.com/coqui-ai/TTS/issues/378) - - You can pick something out of our road map. We keep the progess of the project in this simple issue thread. It has new model proposals or developmental updates etc. - - [Github Issues Tracker](https://github.com/idiap/coqui-ai-TTS/issues) This is a place to find feature requests, bugs. - Issues with the ```good first issue``` tag are good place for beginners to take on. - -- ✨**PR**✨ [pages](https://github.com/idiap/coqui-ai-TTS/pulls) with the ```🚀new version``` tag. - - We list all the target improvements for the next version. You can pick one of them and start contributing. + Issues with the ```good first issue``` tag are a good place for beginners to + take on. Issues tagged with `help wanted` are suited for more experienced + outside contributors. - Also feel free to suggest new features, ideas and models. We're always open for new things. -## Call for sharing language models +## Call for sharing pretrained models If possible, please consider sharing your pre-trained models in any language (if the licences allow for you to do so). We will include them in our model catalogue for public use and give the proper attribution, whether it be your name, company, website or any other source specified. This model can be shared in two ways: 1. Share the model files with us and we serve them with the next 🐸 TTS release. 2. Upload your models on GDrive and share the link. -Models are served under `.models.json` file and any model is available under TTS CLI or Server end points. +Models are served under the `.models.json` file and any model is available under the TTS +CLI and Python API endpoints. Either way you choose, please make sure you send the models [here](https://github.com/coqui-ai/TTS/discussions/930). @@ -135,7 +130,8 @@ curl -LsSf https://astral.sh/uv/install.sh | sh 13. Let's discuss until it is perfect. 💪 - We might ask you for certain changes that would appear in the ✨**PR**✨'s page under 🐸TTS[https://github.com/idiap/coqui-ai-TTS/pulls]. + We might ask you for certain changes that would appear in the + [Github ✨**PR**✨'s page](https://github.com/idiap/coqui-ai-TTS/pulls). 14. Once things look perfect, We merge it to the ```dev``` branch and make it ready for the next version. @@ -143,9 +139,9 @@ curl -LsSf https://astral.sh/uv/install.sh | sh If you prefer working within a Docker container as your development environment, you can do the following: -1. Fork 🐸TTS[https://github.com/idiap/coqui-ai-TTS] by clicking the fork button at the top right corner of the project page. +1. Fork the 🐸TTS [Github repository](https://github.com/idiap/coqui-ai-TTS) by clicking the fork button at the top right corner of the page. -2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```. +2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```. ```bash git clone git@github.com:/coqui-ai-TTS.git diff --git a/Makefile b/Makefile index 1d6867f5..6964773f 100644 --- a/Makefile +++ b/Makefile @@ -59,9 +59,6 @@ lint: ## run linters.
system-deps: ## install linux system deps sudo apt-get install -y libsndfile1-dev -build-docs: ## build the docs - cd docs && make clean && make build - install: ## install 🐸 TTS uv sync --all-extras @@ -70,4 +67,4 @@ install_dev: ## install 🐸 TTS for development. uv run pre-commit install docs: ## build the docs - $(MAKE) -C docs clean && $(MAKE) -C docs html + uv run --group docs $(MAKE) -C docs clean && uv run --group docs $(MAKE) -C docs html diff --git a/README.md b/README.md index 7dddf3a3..9ccf8657 100644 --- a/README.md +++ b/README.md @@ -1,39 +1,34 @@ - -## 🐸Coqui TTS News -- 📣 Fork of the [original, unmaintained repository](https://github.com/coqui-ai/TTS). New PyPI package: [coqui-tts](https://pypi.org/project/coqui-tts) -- 📣 [OpenVoice](https://github.com/myshell-ai/OpenVoice) models now available for voice conversion. -- 📣 Prebuilt wheels are now also published for Mac and Windows (in addition to Linux as before) for easier installation across platforms. -- 📣 ⓍTTSv2 is here with 17 languages and better performance across the board. ⓍTTS can stream with <200ms latency. -- 📣 ⓍTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech). -- 📣 [🐶Bark](https://github.com/suno-ai/bark) is now available for inference with unconstrained voice cloning. [Docs](https://coqui-tts.readthedocs.io/en/latest/models/bark.html) -- 📣 You can use [Fairseq models in ~1100 languages](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS. - -## +# -**🐸TTS is a library for advanced Text-to-Speech generation.** +**🐸 Coqui TTS is a library for advanced Text-to-Speech generation.** 🚀 Pretrained models in +1100 languages. 🛠️ Tools for training new models and fine-tuning existing models in any language. 📚 Utilities for dataset analysis and curation. 
-______________________________________________________________________ [![Discord](https://img.shields.io/discord/1037326658807533628?color=%239B59B6&label=chat%20on%20discord)](https://discord.gg/5eXr5seRrv) +[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/coqui-tts)](https://pypi.org/project/coqui-tts/) [![License]()](https://opensource.org/licenses/MPL-2.0) -[![PyPI version](https://badge.fury.io/py/coqui-tts.svg)](https://badge.fury.io/py/coqui-tts) +[![PyPI version](https://badge.fury.io/py/coqui-tts.svg)](https://pypi.org/project/coqui-tts/) [![Downloads](https://pepy.tech/badge/coqui-tts)](https://pepy.tech/project/coqui-tts) [![DOI](https://zenodo.org/badge/265612440.svg)](https://zenodo.org/badge/latestdoi/265612440) - -![GithubActions](https://github.com/idiap/coqui-ai-TTS/actions/workflows/tests.yml/badge.svg) -![GithubActions](https://github.com/idiap/coqui-ai-TTS/actions/workflows/docker.yaml/badge.svg) -![GithubActions](https://github.com/idiap/coqui-ai-TTS/actions/workflows/style_check.yml/badge.svg) +[![GithubActions](https://github.com/idiap/coqui-ai-TTS/actions/workflows/tests.yml/badge.svg)](https://github.com/idiap/coqui-ai-TTS/actions/workflows/tests.yml) +[![GithubActions](https://github.com/idiap/coqui-ai-TTS/actions/workflows/docker.yaml/badge.svg)](https://github.com/idiap/coqui-ai-TTS/actions/workflows/docker.yaml) +[![GithubActions](https://github.com/idiap/coqui-ai-TTS/actions/workflows/style_check.yml/badge.svg)](https://github.com/idiap/coqui-ai-TTS/actions/workflows/style_check.yml) [![Docs]()](https://coqui-tts.readthedocs.io/en/latest/) -______________________________________________________________________ +## 📣 News +- **Fork of the [original, unmaintained repository](https://github.com/coqui-ai/TTS). New PyPI package: [coqui-tts](https://pypi.org/project/coqui-tts)** +- 0.25.0: [OpenVoice](https://github.com/myshell-ai/OpenVoice) models now available for voice conversion. +- 0.24.2: Prebuilt wheels are now also published for Mac and Windows (in addition to Linux as before) for easier installation across platforms. +- 0.20.0: XTTSv2 is here with 17 languages and better performance across the board. XTTS can stream with <200ms latency. +- 0.19.0: XTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech). +- 0.14.1: You can use [Fairseq models in ~1100 languages](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS. ## 💬 Where to ask questions Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly so that more people can benefit from it. @@ -63,71 +58,67 @@ repository are also still a useful source of information. | 🚀 **Released Models** | [Standard models](https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json) and [Fairseq models in ~1100 languages](https://github.com/idiap/coqui-ai-TTS#example-text-to-speech-using-fairseq-models-in-1100-languages-)| ## Features -- High-performance Deep Learning models for Text2Speech tasks. See lists of models below. -- Fast and efficient model training. -- Detailed training logs on the terminal and Tensorboard. -- Support for Multi-speaker TTS. -- Efficient, flexible, lightweight but feature complete `Trainer API`. +- High-performance text-to-speech and voice conversion models, see list below. +- Fast and efficient model training with detailed training logs on the terminal and Tensorboard. +- Support for multi-speaker and multilingual TTS. 
- Released and ready-to-use models. -- Tools to curate Text2Speech datasets under```dataset_analysis```. -- Utilities to use and test your models. +- Tools to curate TTS datasets under ```dataset_analysis/```. +- Command line and Python APIs to use and test your models. - Modular (but not too much) code base enabling easy implementation of new ideas. ## Model Implementations ### Spectrogram models -- Tacotron: [paper](https://arxiv.org/abs/1703.10135) -- Tacotron2: [paper](https://arxiv.org/abs/1712.05884) -- Glow-TTS: [paper](https://arxiv.org/abs/2005.11129) -- Speedy-Speech: [paper](https://arxiv.org/abs/2008.03802) -- Align-TTS: [paper](https://arxiv.org/abs/2003.01950) -- FastPitch: [paper](https://arxiv.org/pdf/2006.06873.pdf) -- FastSpeech: [paper](https://arxiv.org/abs/1905.09263) -- FastSpeech2: [paper](https://arxiv.org/abs/2006.04558) -- SC-GlowTTS: [paper](https://arxiv.org/abs/2104.05557) -- Capacitron: [paper](https://arxiv.org/abs/1906.03402) -- OverFlow: [paper](https://arxiv.org/abs/2211.06892) -- Neural HMM TTS: [paper](https://arxiv.org/abs/2108.13320) -- Delightful TTS: [paper](https://arxiv.org/abs/2110.12612) +- [Tacotron](https://arxiv.org/abs/1703.10135), [Tacotron2](https://arxiv.org/abs/1712.05884) +- [Glow-TTS](https://arxiv.org/abs/2005.11129), [SC-GlowTTS](https://arxiv.org/abs/2104.05557) +- [Speedy-Speech](https://arxiv.org/abs/2008.03802) +- [Align-TTS](https://arxiv.org/abs/2003.01950) +- [FastPitch](https://arxiv.org/pdf/2006.06873.pdf) +- [FastSpeech](https://arxiv.org/abs/1905.09263), [FastSpeech2](https://arxiv.org/abs/2006.04558) +- [Capacitron](https://arxiv.org/abs/1906.03402) +- [OverFlow](https://arxiv.org/abs/2211.06892) +- [Neural HMM TTS](https://arxiv.org/abs/2108.13320) +- [Delightful TTS](https://arxiv.org/abs/2110.12612) ### End-to-End Models -- ⓍTTS: [blog](https://coqui.ai/blog/tts/open_xtts) -- VITS: [paper](https://arxiv.org/pdf/2106.06103) -- 🐸 YourTTS: [paper](https://arxiv.org/abs/2112.02418) -- 🐢 Tortoise: [orig. repo](https://github.com/neonbjb/tortoise-tts) -- 🐶 Bark: [orig. 
repo](https://github.com/suno-ai/bark) - -### Attention Methods -- Guided Attention: [paper](https://arxiv.org/abs/1710.08969) -- Forward Backward Decoding: [paper](https://arxiv.org/abs/1907.09006) -- Graves Attention: [paper](https://arxiv.org/abs/1910.10288) -- Double Decoder Consistency: [blog](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) -- Dynamic Convolutional Attention: [paper](https://arxiv.org/pdf/1910.10288.pdf) -- Alignment Network: [paper](https://arxiv.org/abs/2108.10447) - -### Speaker Encoder -- GE2E: [paper](https://arxiv.org/abs/1710.10467) -- Angular Loss: [paper](https://arxiv.org/pdf/2003.11982.pdf) +- [XTTS](https://arxiv.org/abs/2406.04904) +- [VITS](https://arxiv.org/pdf/2106.06103) +- 🐸[YourTTS](https://arxiv.org/abs/2112.02418) +- 🐢[Tortoise](https://github.com/neonbjb/tortoise-tts) +- 🐶[Bark](https://github.com/suno-ai/bark) ### Vocoders -- MelGAN: [paper](https://arxiv.org/abs/1910.06711) -- MultiBandMelGAN: [paper](https://arxiv.org/abs/2005.05106) -- ParallelWaveGAN: [paper](https://arxiv.org/abs/1910.11480) -- GAN-TTS discriminators: [paper](https://arxiv.org/abs/1909.11646) -- WaveRNN: [origin](https://github.com/fatchord/WaveRNN/) -- WaveGrad: [paper](https://arxiv.org/abs/2009.00713) -- HiFiGAN: [paper](https://arxiv.org/abs/2010.05646) -- UnivNet: [paper](https://arxiv.org/abs/2106.07889) +- [MelGAN](https://arxiv.org/abs/1910.06711) +- [MultiBandMelGAN](https://arxiv.org/abs/2005.05106) +- [ParallelWaveGAN](https://arxiv.org/abs/1910.11480) +- [GAN-TTS discriminators](https://arxiv.org/abs/1909.11646) +- [WaveRNN](https://github.com/fatchord/WaveRNN/) +- [WaveGrad](https://arxiv.org/abs/2009.00713) +- [HiFiGAN](https://arxiv.org/abs/2010.05646) +- [UnivNet](https://arxiv.org/abs/2106.07889) ### Voice Conversion -- FreeVC: [paper](https://arxiv.org/abs/2210.15418) -- OpenVoice: [technical report](https://arxiv.org/abs/2312.01479) +- [FreeVC](https://arxiv.org/abs/2210.15418) +- [OpenVoice](https://arxiv.org/abs/2312.01479) + +### Others +- Attention methods: [Guided Attention](https://arxiv.org/abs/1710.08969), + [Forward Backward Decoding](https://arxiv.org/abs/1907.09006), + [Graves Attention](https://arxiv.org/abs/1910.10288), + [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/), + [Dynamic Convolutional Attention](https://arxiv.org/pdf/1910.10288.pdf), + [Alignment Network](https://arxiv.org/abs/2108.10447) +- Speaker encoders: [GE2E](https://arxiv.org/abs/1710.10467), + [Angular Loss](https://arxiv.org/pdf/2003.11982.pdf) You can also help us implement more models. + ## Installation -🐸TTS is tested on Ubuntu 22.04 with **python >= 3.9, < 3.13.**. -If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the released 🐸TTS models, installing from PyPI is the easiest option. +🐸TTS is tested on Ubuntu 24.04 with **python >= 3.9, < 3.13**, but should also +work on Mac and Windows. + +If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the pretrained 🐸TTS models, installing from PyPI is the easiest option. ```bash pip install coqui-tts @@ -165,21 +156,18 @@ pip install -e .[server,ja] ### Platforms -If you are on Ubuntu (Debian), you can also run following commands for installation. +If you are on Ubuntu (Debian), you can also run the following commands for installation. 
```bash -make system-deps # intended to be used on Ubuntu (Debian). Let us know if you have a different OS. +make system-deps make install ``` -If you are on Windows, 👑@GuyPaddock wrote installation instructions -[here](https://stackoverflow.com/questions/66726331/how-can-i-run-mozilla-tts-coqui-tts-training-with-cuda-on-a-windows-system) -(note that these are out of date, e.g. you need to have at least Python 3.9). - + ## Docker Image -You can also try TTS without install with the docker image. -Simply run the following command and you will be able to run TTS without installing it. +You can also try out Coqui TTS without installation with the docker image. +Simply run the following command and you will be able to run TTS: ```bash docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu @@ -193,10 +181,10 @@ More details about the docker images (like GPU support) can be found ## Synthesizing speech by 🐸TTS - + ### 🐍 Python API -#### Running a multi-speaker and multi-lingual model +#### Multi-speaker and multi-lingual model ```python import torch @@ -208,47 +196,60 @@ device = "cuda" if torch.cuda.is_available() else "cpu" # List available 🐸TTS models print(TTS().list_models()) -# Init TTS +# Initialize TTS tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device) +# List speakers +print(tts.speakers) + # Run TTS -# ❗ Since this model is multi-lingual voice cloning model, we must set the target speaker_wav and language -# Text to speech list of amplitude values as output -wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en") -# Text to speech to a file -tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") +# ❗ XTTS supports both, but many models allow only one of the `speaker` and +# `speaker_wav` arguments + +# TTS with list of amplitude values as output, clone the voice from `speaker_wav` +wav = tts.tts( + text="Hello world!", + speaker_wav="my/cloning/audio.wav", + language="en" +) + +# TTS to a file, use a preset speaker +tts.tts_to_file( + text="Hello world!", + speaker="Craig Gutsy", + language="en", + file_path="output.wav" +) ``` -#### Running a single speaker model +#### Single speaker model ```python -# Init TTS with the target model name -tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False).to(device) +# Initialize TTS with the target model name +tts = TTS("tts_models/de/thorsten/tacotron2-DDC").to(device) # Run TTS tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH) - -# Example voice cloning with YourTTS in English, French and Portuguese -tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to(device) -tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") -tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr-fr", file_path="output.wav") -tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt-br", file_path="output.wav") ``` -#### Example voice conversion +#### Voice conversion (VC) Converting the voice in `source_wav` to the voice of `target_wav` ```python -tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda") -tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav") +tts = 
TTS("voice_conversion_models/multilingual/vctk/freevc24").to("cuda") +tts.voice_conversion_to_file( + source_wav="my/source.wav", + target_wav="my/target.wav", + file_path="output.wav" +) ``` Other available voice conversion models: - `voice_conversion_models/multilingual/multi-dataset/openvoice_v1` - `voice_conversion_models/multilingual/multi-dataset/openvoice_v2` -#### Example voice cloning together with the default voice conversion model. +#### Voice cloning by combining single speaker TTS model with the default VC model This way, you can clone voices by using any model in 🐸TTS. The FreeVC model is used for voice conversion after synthesizing speech. @@ -263,7 +264,7 @@ tts.tts_with_vc_to_file( ) ``` -#### Example text to speech using **Fairseq models in ~1100 languages** 🤯. +#### TTS using Fairseq models in ~1100 languages 🤯 For Fairseq models, use the following name format: `tts_models//fairseq/vits`. You can find the language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms). @@ -277,147 +278,126 @@ api.tts_to_file( ) ``` -### Command-line `tts` +### Command-line interface `tts` -Synthesize speech on command line. +Synthesize speech on the command line. You can either use your trained model or choose a model from the provided list. -If you don't specify any models, then it uses LJSpeech based English model. - -#### Single Speaker Models - - List provided models: - ``` - $ tts --list_models + ```sh + tts --list_models ``` -- Get model info (for both tts_models and vocoder_models): - - - Query by type/name: - The model_info_by_name uses the name as it from the --list_models. - ``` - $ tts --model_info_by_name "///" - ``` - For example: - ``` - $ tts --model_info_by_name tts_models/tr/common-voice/glow-tts - $ tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2 - ``` - - Query by type/idx: - The model_query_idx uses the corresponding idx from --list_models. - - ``` - $ tts --model_info_by_idx "/" - ``` - - For example: - - ``` - $ tts --model_info_by_idx tts_models/3 - ``` - - - Query info for model info by full name: - ``` - $ tts --model_info_by_name "///" - ``` - -- Run TTS with default models: - +- Get model information. Use the names obtained from `--list_models`. 
+ ```sh + tts --model_info_by_name "///" ``` - $ tts --text "Text for TTS" --out_path output/path/speech.wav + For example: + ```sh + tts --model_info_by_name tts_models/tr/common-voice/glow-tts + tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2 + ``` + +#### Single speaker models + +- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`): + + ```sh + tts --text "Text for TTS" --out_path output/path/speech.wav ``` - Run TTS and pipe out the generated TTS wav file data: - ``` - $ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay + ```sh + tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay ``` - Run a TTS model with its default vocoder model: - ``` - $ tts --text "Text for TTS" --model_name "///" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \ + --model_name "///" \ + --out_path output/path/speech.wav ``` For example: - ``` - $ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \ + --model_name "tts_models/en/ljspeech/glow-tts" \ + --out_path output/path/speech.wav ``` -- Run with specific TTS and vocoder models from the list: +- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model. - ``` - $ tts --text "Text for TTS" --model_name "///" --vocoder_name "///" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \ + --model_name "///" \ + --vocoder_name "///" \ + --out_path output/path/speech.wav ``` For example: - ``` - $ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/univnet" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \ + --model_name "tts_models/en/ljspeech/glow-tts" \ + --vocoder_name "vocoder_models/en/ljspeech/univnet" \ + --out_path output/path/speech.wav ``` -- Run your own TTS model (Using Griffin-Lim Vocoder): +- Run your own TTS model (using Griffin-Lim Vocoder): - ``` - $ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \ + --model_path path/to/model.pth \ + --config_path path/to/config.json \ + --out_path output/path/speech.wav ``` - Run your own TTS and Vocoder models: - ``` - $ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav - --vocoder_path path/to/vocoder.pth --vocoder_config_path path/to/vocoder_config.json + ```sh + tts --text "Text for TTS" \ + --model_path path/to/model.pth \ + --config_path path/to/config.json \ + --out_path output/path/speech.wav \ + --vocoder_path path/to/vocoder.pth \ + --vocoder_config_path path/to/vocoder_config.json ``` -#### Multi-speaker Models +#### Multi-speaker models -- List the available speakers and choose a among them: +- List the available speakers and choose a `` among them: - ``` - $ tts --model_name "//" --list_speaker_idxs + ```sh + tts --model_name "//" --list_speaker_idxs ``` - Run the multi-speaker TTS model with the target speaker ID: - ``` - $ tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "//" --speaker_idx + ```sh + tts --text "Text for TTS." 
--out_path output/path/speech.wav \ + --model_name "//" --speaker_idx ``` - Run your own multi-speaker TTS model: - ``` - $ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx + ```sh + tts --text "Text for TTS" --out_path output/path/speech.wav \ + --model_path path/to/model.pth --config_path path/to/config.json \ + --speakers_file_path path/to/speaker.json --speaker_idx ``` -### Voice Conversion Models +#### Voice conversion models -``` -$ tts --out_path output/path/speech.wav --model_name "//" --source_wav --target_wav +```sh +tts --out_path output/path/speech.wav --model_name "//" \ + --source_wav --target_wav ``` - -## Directory Structure -``` -|- notebooks/ (Jupyter Notebooks for model evaluation, parameter selection and data analysis.) -|- utils/ (common utilities.) -|- TTS - |- bin/ (folder for all the executables.) - |- train*.py (train your target model.) - |- ... - |- tts/ (text to speech models) - |- layers/ (model layer definitions) - |- models/ (model definitions) - |- utils/ (model specific utilities.) - |- speaker_encoder/ (Speaker Encoder models.) - |- (same) - |- vocoder/ (Vocoder models.) - |- (same) - |- vc/ (Voice conversion models.) - |- (same) -``` diff --git a/TTS/bin/synthesize.py b/TTS/bin/synthesize.py index 885f6d6f..5fce93b7 100755 --- a/TTS/bin/synthesize.py +++ b/TTS/bin/synthesize.py @@ -14,123 +14,122 @@ from TTS.utils.generic_utils import ConsoleFormatter, setup_logger logger = logging.getLogger(__name__) description = """ -Synthesize speech on command line. +Synthesize speech on the command line. You can either use your trained model or choose a model from the provided list. -If you don't specify any models, then it uses LJSpeech based English model. +- List provided models: + + ```sh + tts --list_models + ``` + +- Get model information. Use the names obtained from `--list_models`. + ```sh + tts --model_info_by_name "///" + ``` + For example: + ```sh + tts --model_info_by_name tts_models/tr/common-voice/glow-tts + tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2 + ``` #### Single Speaker Models -- List provided models: +- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`): - ``` - $ tts --list_models - ``` - -- Get model info (for both tts_models and vocoder_models): - - - Query by type/name: - The model_info_by_name uses the name as it from the --list_models. - ``` - $ tts --model_info_by_name "///" - ``` - For example: - ``` - $ tts --model_info_by_name tts_models/tr/common-voice/glow-tts - $ tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2 - ``` - - Query by type/idx: - The model_query_idx uses the corresponding idx from --list_models. 
- - ``` - $ tts --model_info_by_idx "/" - ``` - - For example: - - ``` - $ tts --model_info_by_idx tts_models/3 - ``` - - - Query info for model info by full name: - ``` - $ tts --model_info_by_name "///" - ``` - -- Run TTS with default models: - - ``` - $ tts --text "Text for TTS" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" --out_path output/path/speech.wav ``` - Run TTS and pipe out the generated TTS wav file data: - ``` - $ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay + ```sh + tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay ``` - Run a TTS model with its default vocoder model: - ``` - $ tts --text "Text for TTS" --model_name "///" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \\ + --model_name "///" \\ + --out_path output/path/speech.wav ``` For example: - ``` - $ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \\ + --model_name "tts_models/en/ljspeech/glow-tts" \\ + --out_path output/path/speech.wav ``` -- Run with specific TTS and vocoder models from the list: +- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model. - ``` - $ tts --text "Text for TTS" --model_name "///" --vocoder_name "///" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \\ + --model_name "///" \\ + --vocoder_name "///" \\ + --out_path output/path/speech.wav ``` For example: - ``` - $ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/univnet" --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \\ + --model_name "tts_models/en/ljspeech/glow-tts" \\ + --vocoder_name "vocoder_models/en/ljspeech/univnet" \\ + --out_path output/path/speech.wav ``` -- Run your own TTS model (Using Griffin-Lim Vocoder): +- Run your own TTS model (using Griffin-Lim Vocoder): - ``` - $ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav + ```sh + tts --text "Text for TTS" \\ + --model_path path/to/model.pth \\ + --config_path path/to/config.json \\ + --out_path output/path/speech.wav ``` - Run your own TTS and Vocoder models: - ``` - $ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav - --vocoder_path path/to/vocoder.pth --vocoder_config_path path/to/vocoder_config.json + ```sh + tts --text "Text for TTS" \\ + --model_path path/to/model.pth \\ + --config_path path/to/config.json \\ + --out_path output/path/speech.wav \\ + --vocoder_path path/to/vocoder.pth \\ + --vocoder_config_path path/to/vocoder_config.json ``` #### Multi-speaker Models -- List the available speakers and choose a among them: +- List the available speakers and choose a `` among them: - ``` - $ tts --model_name "//" --list_speaker_idxs + ```sh + tts --model_name "//" --list_speaker_idxs ``` - Run the multi-speaker TTS model with the target speaker ID: - ``` - $ tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "//" --speaker_idx + ```sh + tts --text "Text for TTS." 
--out_path output/path/speech.wav \\ + --model_name "//" --speaker_idx ``` - Run your own multi-speaker TTS model: - ``` - $ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx + ```sh + tts --text "Text for TTS" --out_path output/path/speech.wav \\ + --model_path path/to/model.pth --config_path path/to/config.json \\ + --speakers_file_path path/to/speaker.json --speaker_idx ``` -### Voice Conversion Models +#### Voice Conversion Models -``` -$ tts --out_path output/path/speech.wav --model_name "//" --source_wav --target_wav +```sh +tts --out_path output/path/speech.wav --model_name "//" \\ + --source_wav --target_wav ``` """ diff --git a/TTS/model.py b/TTS/model.py index c3707c85..779b1775 100644 --- a/TTS/model.py +++ b/TTS/model.py @@ -12,7 +12,7 @@ from trainer import TrainerModel class BaseTrainerModel(TrainerModel): """BaseTrainerModel model expanding TrainerModel with required functions by 🐸TTS. - Every new 🐸TTS model must inherit it. + Every new Coqui model must inherit it. """ @staticmethod diff --git a/TTS/tts/models/bark.py b/TTS/tts/models/bark.py index ced8f60e..c52c541b 100644 --- a/TTS/tts/models/bark.py +++ b/TTS/tts/models/bark.py @@ -206,12 +206,14 @@ class Bark(BaseTTS): speaker_wav (str): Path to the speaker audio file for cloning a new voice. It is cloned and saved in `voice_dirs` with the name `speaker_id`. Defaults to None. voice_dirs (List[str]): List of paths that host reference audio files for speakers. Defaults to None. - **kwargs: Model specific inference settings used by `generate_audio()` and `TTS.tts.layers.bark.inference_funcs.generate_text_semantic(). + **kwargs: Model specific inference settings used by `generate_audio()` and + `TTS.tts.layers.bark.inference_funcs.generate_text_semantic()`. Returns: - A dictionary of the output values with `wav` as output waveform, `deterministic_seed` as seed used at inference, - `text_input` as text token IDs after tokenizer, `voice_samples` as samples used for cloning, `conditioning_latents` - as latents used at inference. + A dictionary of the output values with `wav` as output waveform, + `deterministic_seed` as seed used at inference, `text_input` as text token IDs + after tokenizer, `voice_samples` as samples used for cloning, + `conditioning_latents` as latents used at inference. """ speaker_id = "random" if speaker_id is None else speaker_id diff --git a/TTS/tts/models/base_tts.py b/TTS/tts/models/base_tts.py index ccb023ce..33a75598 100644 --- a/TTS/tts/models/base_tts.py +++ b/TTS/tts/models/base_tts.py @@ -80,15 +80,17 @@ class BaseTTS(BaseTrainerModel): raise ValueError("config must be either a *Config or *Args") def init_multispeaker(self, config: Coqpit, data: List = None): - """Initialize a speaker embedding layer if needen and define expected embedding channel size for defining - `in_channels` size of the connected layers. + """Set up for multi-speaker TTS. + + Initialize a speaker embedding layer if needed and define expected embedding + channel size for defining `in_channels` size of the connected layers. This implementation yields 3 possible outcomes: - 1. If `config.use_speaker_embedding` and `config.use_d_vector_file are False, do nothing. + 1. If `config.use_speaker_embedding` and `config.use_d_vector_file` are False, do nothing. 2. If `config.use_d_vector_file` is True, set expected embedding channel size to `config.d_vector_dim` or 512. 3. 
If `config.use_speaker_embedding`, initialize a speaker embedding layer with channel size of - `config.d_vector_dim` or 512. + `config.d_vector_dim` or 512. You can override this function for new models. diff --git a/TTS/tts/models/overflow.py b/TTS/tts/models/overflow.py index ac09e406..1c146b2e 100644 --- a/TTS/tts/models/overflow.py +++ b/TTS/tts/models/overflow.py @@ -33,32 +33,33 @@ class Overflow(BaseTTS): Paper abstract:: Neural HMMs are a type of neural transducer recently proposed for - sequence-to-sequence modelling in text-to-speech. They combine the best features - of classic statistical speech synthesis and modern neural TTS, requiring less - data and fewer training updates, and are less prone to gibberish output caused - by neural attention failures. In this paper, we combine neural HMM TTS with - normalising flows for describing the highly non-Gaussian distribution of speech - acoustics. The result is a powerful, fully probabilistic model of durations and - acoustics that can be trained using exact maximum likelihood. Compared to - dominant flow-based acoustic models, our approach integrates autoregression for - improved modelling of long-range dependences such as utterance-level prosody. - Experiments show that a system based on our proposal gives more accurate - pronunciations and better subjective speech quality than comparable methods, - whilst retaining the original advantages of neural HMMs. Audio examples and code - are available at https://shivammehta25.github.io/OverFlow/. + sequence-to-sequence modelling in text-to-speech. They combine the best features + of classic statistical speech synthesis and modern neural TTS, requiring less + data and fewer training updates, and are less prone to gibberish output caused + by neural attention failures. In this paper, we combine neural HMM TTS with + normalising flows for describing the highly non-Gaussian distribution of speech + acoustics. The result is a powerful, fully probabilistic model of durations and + acoustics that can be trained using exact maximum likelihood. Compared to + dominant flow-based acoustic models, our approach integrates autoregression for + improved modelling of long-range dependences such as utterance-level prosody. + Experiments show that a system based on our proposal gives more accurate + pronunciations and better subjective speech quality than comparable methods, + whilst retaining the original advantages of neural HMMs. Audio examples and code + are available at https://shivammehta25.github.io/OverFlow/. Note: - - Neural HMMs uses flat start initialization i.e it computes the means and std and transition probabilities - of the dataset and uses them to initialize the model. This benefits the model and helps with faster learning - If you change the dataset or want to regenerate the parameters change the `force_generate_statistics` and - `mel_statistics_parameter_path` accordingly. + - Neural HMMs uses flat start initialization i.e it computes the means + and std and transition probabilities of the dataset and uses them to initialize + the model. This benefits the model and helps with faster learning If you change + the dataset or want to regenerate the parameters change the + `force_generate_statistics` and `mel_statistics_parameter_path` accordingly. - To enable multi-GPU training, set the `use_grad_checkpointing=False` in config. - This will significantly increase the memory usage. 
This is because to compute - the actual data likelihood (not an approximation using MAS/Viterbi) we must use - all the states at the previous time step during the forward pass to decide the - probability distribution at the current step i.e the difference between the forward - algorithm and viterbi approximation. + This will significantly increase the memory usage. This is because to compute + the actual data likelihood (not an approximation using MAS/Viterbi) we must use + all the states at the previous time step during the forward pass to decide the + probability distribution at the current step i.e the difference between the forward + algorithm and viterbi approximation. Check :class:`TTS.tts.configs.overflow.OverFlowConfig` for class arguments. """ diff --git a/TTS/tts/models/tortoise.py b/TTS/tts/models/tortoise.py index 01629b5d..738e9dd9 100644 --- a/TTS/tts/models/tortoise.py +++ b/TTS/tts/models/tortoise.py @@ -423,7 +423,9 @@ class Tortoise(BaseTTS): Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent). These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic properties. - :param voice_samples: List of arbitrary reference clips, which should be *pairs* of torch tensors containing arbitrary kHz waveform data. + + :param voice_samples: List of arbitrary reference clips, which should be *pairs* + of torch tensors containing arbitrary kHz waveform data. :param latent_averaging_mode: 0/1/2 for following modes: 0 - latents will be generated as in original tortoise, using ~4.27s from each voice sample, averaging latent across all samples 1 - latents will be generated using (almost) entire voice samples, averaged across all the ~4.27s chunks @@ -671,7 +673,7 @@ class Tortoise(BaseTTS): As cond_free_k increases, the output becomes dominated by the conditioning-free signal. diffusion_temperature: (float) Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared. - hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive transformer. + hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation here: https://huggingface.co/docs/transformers/internal/generation_utils diff --git a/TTS/tts/models/xtts.py b/TTS/tts/models/xtts.py index f05863ae..395208cc 100644 --- a/TTS/tts/models/xtts.py +++ b/TTS/tts/models/xtts.py @@ -178,7 +178,7 @@ class XttsArgs(Coqpit): class Xtts(BaseTTS): - """ⓍTTS model implementation. + """XTTS model implementation. ❗ Currently it only supports inference. @@ -460,7 +460,7 @@ class Xtts(BaseTTS): gpt_cond_chunk_len: (int) Chunk length used for cloning. It must be <= `gpt_cond_len`. If gpt_cond_len == gpt_cond_chunk_len, no chunking. Defaults to 6 seconds. - hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive + hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer. Extra keyword args fed to this function get forwarded directly to that API. 
Documentation here: https://huggingface.co/docs/transformers/internal/generation_utils diff --git a/docs/source/conf.py b/docs/source/conf.py index e7d36c1f..e878d0e8 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -52,6 +52,7 @@ extensions = [ "sphinx_inline_tabs", ] +suppress_warnings = ["autosectionlabel.*"] # Add any paths that contain templates here, relative to this directory. templates_path = ["_templates"] @@ -67,6 +68,8 @@ myst_enable_extensions = [ "linkify", ] +myst_heading_anchors = 4 + # 'sphinxcontrib.katex', # 'sphinx.ext.autosectionlabel', diff --git a/docs/source/configuration.md b/docs/source/configuration.md index ada61e16..220c96c3 100644 --- a/docs/source/configuration.md +++ b/docs/source/configuration.md @@ -1,6 +1,6 @@ # Configuration -We use 👩‍✈️[Coqpit] for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is how a simple configuration looks like with Coqpit. +We use 👩‍✈️[Coqpit](https://github.com/idiap/coqui-ai-coqpit) for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is how a simple configuration looks like with Coqpit. ```python from dataclasses import asdict, dataclass, field @@ -36,7 +36,7 @@ class SimpleConfig(Coqpit): check_argument("val_c", c, restricted=True) ``` -In TTS, each model must have a configuration class that exposes all the values necessary for its lifetime. +In Coqui, each model must have a configuration class that exposes all the values necessary for its lifetime. It defines model architecture, hyper-parameters, training, and inference settings. For our models, we merge all the fields in a single configuration class for ease. It may not look like a wise practice but enables easier bookkeeping and reproducible experiments. diff --git a/docs/source/formatting_your_dataset.md b/docs/source/datasets/formatting_your_dataset.md similarity index 95% rename from docs/source/formatting_your_dataset.md rename to docs/source/datasets/formatting_your_dataset.md index 23c497d0..e9226333 100644 --- a/docs/source/formatting_your_dataset.md +++ b/docs/source/datasets/formatting_your_dataset.md @@ -1,7 +1,9 @@ (formatting_your_dataset)= -# Formatting Your Dataset +# Formatting your dataset -For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips and each clip needs transcription. +For training a TTS model, you need a dataset with speech recordings and +transcriptions. The speech must be divided into audio clips and each clip needs +a transcription. If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity. It is an open-source and free audio editing software. @@ -49,7 +51,7 @@ The format above is taken from widely-used the [LJSpeech](https://keithito.com/L Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is extremely important for especially non-phonemic languages like English. -For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset). +For more info about dataset qualities and properties check [this page](what_makes_a_good_dataset.md). 
## Using Your Dataset in 🐸TTS diff --git a/docs/source/datasets/index.md b/docs/source/datasets/index.md new file mode 100644 index 00000000..6b040fc4 --- /dev/null +++ b/docs/source/datasets/index.md @@ -0,0 +1,12 @@ +# Datasets + +For training a TTS model, you need a dataset with speech recordings and +transcriptions. See the following pages for more information on: + +```{toctree} +:maxdepth: 1 + +formatting_your_dataset +what_makes_a_good_dataset +tts_datasets +``` diff --git a/docs/source/tts_datasets.md b/docs/source/datasets/tts_datasets.md similarity index 90% rename from docs/source/tts_datasets.md rename to docs/source/datasets/tts_datasets.md index 11da1b76..df8d2f2a 100644 --- a/docs/source/tts_datasets.md +++ b/docs/source/datasets/tts_datasets.md @@ -1,6 +1,6 @@ -# TTS Datasets +# Public TTS datasets -Some of the known public datasets that we successfully applied 🐸TTS: +Some of the known public datasets that were successfully used for 🐸TTS: - [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/) - [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/) diff --git a/docs/source/what_makes_a_good_dataset.md b/docs/source/datasets/what_makes_a_good_dataset.md similarity index 100% rename from docs/source/what_makes_a_good_dataset.md rename to docs/source/datasets/what_makes_a_good_dataset.md diff --git a/docs/source/docker_images.md b/docs/source/docker_images.md index 58d96120..042f9f8e 100644 --- a/docs/source/docker_images.md +++ b/docs/source/docker_images.md @@ -1,20 +1,20 @@ (docker_images)= -## Docker images +# Docker images We provide docker images to be able to test TTS without having to setup your own environment. -### Using premade images +## Using premade images You can use premade images built automatically from the latest TTS version. -#### CPU version +### CPU version ```bash docker pull ghcr.io/coqui-ai/tts-cpu ``` -#### GPU version +### GPU version ```bash docker pull ghcr.io/coqui-ai/tts ``` -### Building your own image +## Building your own image ```bash docker build -t tts . ``` diff --git a/docs/source/implementing_a_new_language_frontend.md b/docs/source/extension/implementing_a_new_language_frontend.md similarity index 88% rename from docs/source/implementing_a_new_language_frontend.md rename to docs/source/extension/implementing_a_new_language_frontend.md index 2041352d..0b3ef59b 100644 --- a/docs/source/implementing_a_new_language_frontend.md +++ b/docs/source/extension/implementing_a_new_language_frontend.md @@ -1,6 +1,6 @@ -# Implementing a New Language Frontend +# Implementing new language front ends -- Language frontends are located under `TTS.tts.utils.text` +- Language front ends are located under `TTS.tts.utils.text` - Each special language has a separate folder. - Each folder contains all the utilities for processing the text input. - `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities diff --git a/docs/source/implementing_a_new_model.md b/docs/source/extension/implementing_a_new_model.md similarity index 97% rename from docs/source/implementing_a_new_model.md rename to docs/source/extension/implementing_a_new_model.md index 1bf7a882..25217897 100644 --- a/docs/source/implementing_a_new_model.md +++ b/docs/source/extension/implementing_a_new_model.md @@ -1,4 +1,4 @@ -# Implementing a Model +# Implementing new models 1. Implement layers. 
@@ -36,7 +36,8 @@ There is also the `callback` interface by which you can manipulate both the model and the `Trainer` states. Callbacks give you an infinite flexibility to add custom behaviours for your model and training routines. - For more details, see {ref}`BaseTTS ` and :obj:`TTS.utils.callbacks`. + For more details, see [BaseTTS](../main_classes/model_api.md#base-tts-model) + and `TTS.utils.callbacks`. 6. Optionally, define `MyModelArgs`. @@ -62,7 +63,7 @@ We love you more when you document your code. ❤️ -# Template 🐸TTS Model implementation +## Template 🐸TTS Model implementation You can start implementing your model by copying the following base class. diff --git a/docs/source/extension/index.md b/docs/source/extension/index.md new file mode 100644 index 00000000..39c36b63 --- /dev/null +++ b/docs/source/extension/index.md @@ -0,0 +1,14 @@ +# Adding models or languages + +You can extend Coqui by implementing new model architectures or adding front +ends for new languages. See the pages below for more details. The [project +structure](../project_structure.md) and [contribution +guidelines](../contributing.md) may also be helpful. Please open a pull request +with your changes to share back the improvements with the community. + +```{toctree} +:maxdepth: 1 + +implementing_a_new_model +implementing_a_new_language_frontend +``` diff --git a/docs/source/faq.md b/docs/source/faq.md index 1090aaa3..1dd5c184 100644 --- a/docs/source/faq.md +++ b/docs/source/faq.md @@ -1,4 +1,4 @@ -# Humble FAQ +# FAQ We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper. ## Errors with a pre-trained model. How can I resolve this? @@ -7,7 +7,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is - If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny. ## What are the requirements of a good 🐸TTS dataset? -* {ref}`See this page ` +- [See this page](datasets/what_makes_a_good_dataset.md) ## How should I choose the right model? - First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2. @@ -18,7 +18,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is ## How can I train my own `tts` model? 0. Check your dataset with notebooks in [dataset_analysis](https://github.com/idiap/coqui-ai-TTS/tree/main/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/idiap/coqui-ai-TTS/blob/main/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in a better audio synthesis. -1. Write your own dataset `formatter` in `datasets/formatters.py` or format your dataset as one of the supported datasets, like LJSpeech. +1. Write your own dataset `formatter` in `datasets/formatters.py` or [format](datasets/formatting_your_dataset) your dataset as one of the supported datasets, like LJSpeech. A `formatter` parses the metadata file and converts a list of training samples. 2. If you have a dataset with a different alphabet than English, you need to set your own character list in the ```config.json```. @@ -61,7 +61,8 @@ We tried to collect common issues and questions we receive about 🐸TTS. 
It is - SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json``` - MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json``` -**Note:** You can also train your model using pure 🐍 python. Check ```{eval-rst} :ref: 'tutorial_for_nervous_beginners'```. +**Note:** You can also train your model using pure 🐍 python. Check the +[tutorial](tutorial_for_nervous_beginners.md). ## How can I train in a different language? - Check steps 2, 3, 4, 5 above. @@ -104,7 +105,7 @@ The best approach is to pick a set of promising models and run a Mean-Opinion-Sc - Check the 4th step under "How can I check model performance?" ## How can I test a trained model? -- The best way is to use `tts` or `tts-server` commands. For details check {ref}`here `. +- The best way is to use `tts` or `tts-server` commands. For details check [here](inference.md). - If you need to code your own ```TTS.utils.synthesizer.Synthesizer``` class. ## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps" - Stopnet does not work. diff --git a/docs/source/index.md b/docs/source/index.md index 79993eec..3a030b4f 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -1,62 +1,63 @@ +--- +hide-toc: true +--- ```{include} ../../README.md :relative-images: +:end-before: ``` ----- -# Documentation Content -```{eval-rst} -.. toctree:: - :maxdepth: 2 - :caption: Get started - - tutorial_for_nervous_beginners - installation - faq - contributing - -.. toctree:: - :maxdepth: 2 - :caption: Using 🐸TTS - - inference - docker_images - implementing_a_new_model - implementing_a_new_language_frontend - training_a_model - finetuning - configuration - formatting_your_dataset - what_makes_a_good_dataset - tts_datasets - marytts - -.. toctree:: - :maxdepth: 2 - :caption: Main Classes - - main_classes/trainer_api - main_classes/audio_processor - main_classes/model_api - main_classes/dataset - main_classes/gan - main_classes/speaker_manager - -.. toctree:: - :maxdepth: 2 - :caption: `tts` Models - - models/glow_tts.md - models/vits.md - models/forward_tts.md - models/tacotron1-2.md - models/overflow.md - models/tortoise.md - models/bark.md - models/xtts.md - -.. toctree:: - :maxdepth: 2 - :caption: `vocoder` Models +```{toctree} +:maxdepth: 1 +:caption: Get started +:hidden: +tutorial_for_nervous_beginners +installation +docker_images +faq +project_structure +contributing +``` + +```{toctree} +:maxdepth: 1 +:caption: Using Coqui +:hidden: + +inference +training/index +extension/index +datasets/index +``` + + +```{toctree} +:maxdepth: 1 +:caption: Main Classes +:hidden: + +configuration +main_classes/trainer_api +main_classes/audio_processor +main_classes/model_api +main_classes/dataset +main_classes/gan +main_classes/speaker_manager +``` + + +```{toctree} +:maxdepth: 1 +:caption: TTS Models +:hidden: + +models/glow_tts.md +models/vits.md +models/forward_tts.md +models/tacotron1-2.md +models/overflow.md +models/tortoise.md +models/bark.md +models/xtts.md ``` diff --git a/docs/source/inference.md b/docs/source/inference.md index 4cb8f45a..cb7d01fc 100644 --- a/docs/source/inference.md +++ b/docs/source/inference.md @@ -1,194 +1,21 @@ (synthesizing_speech)= -# Synthesizing Speech +# Synthesizing speech -First, you need to install TTS. We recommend using PyPi. You need to call the command below: +## Overview -```bash -$ pip install coqui-tts +Coqui TTS provides three main methods for inference: + +1. 🐍Python API +2. 
TTS command line interface (CLI) +3. [Local demo server](server.md) + +```{include} ../../README.md +:start-after: ``` -After the installation, 2 terminal commands are available. -1. TTS Command Line Interface (CLI). - `tts` -2. Local Demo Server. - `tts-server` -3. In 🐍Python. - `from TTS.api import TTS` - -## On the Commandline - `tts` -![cli.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/tts_cli.gif) - -After the installation, 🐸TTS provides a CLI interface for synthesizing speech using pre-trained models. You can either use your own model or the release models under 🐸TTS. - -Listing released 🐸TTS models. - -```bash -tts --list_models -``` - -Run a TTS model, from the release models list, with its default vocoder. (Simply copy and paste the full model names from the list as arguments for the command below.) - -```bash -tts --text "Text for TTS" \ - --model_name "///" \ - --out_path folder/to/save/output.wav -``` - -Run a tts and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model. - -```bash -tts --text "Text for TTS" \ - --model_name "tts_models///" \ - --vocoder_name "vocoder_models///" \ - --out_path folder/to/save/output.wav -``` - -Run your own TTS model (Using Griffin-Lim Vocoder) - -```bash -tts --text "Text for TTS" \ - --model_path path/to/model.pth \ - --config_path path/to/config.json \ - --out_path folder/to/save/output.wav -``` - -Run your own TTS and Vocoder models - -```bash -tts --text "Text for TTS" \ - --config_path path/to/config.json \ - --model_path path/to/model.pth \ - --out_path folder/to/save/output.wav \ - --vocoder_path path/to/vocoder.pth \ - --vocoder_config_path path/to/vocoder_config.json -``` - -Run a multi-speaker TTS model from the released models list. - -```bash -tts --model_name "tts_models///" --list_speaker_idxs # list the possible speaker IDs. -tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "tts_models///" --speaker_idx "" -``` - -Run a released voice conversion model - -```bash -tts --model_name "voice_conversion///" - --source_wav "my/source/speaker/audio.wav" - --target_wav "my/target/speaker/audio.wav" - --out_path folder/to/save/output.wav -``` - -**Note:** You can use ```./TTS/bin/synthesize.py``` if you prefer running ```tts``` from the TTS project folder. - -## On the Demo Server - `tts-server` - - -![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif) - -You can boot up a demo 🐸TTS server to run an inference with your models (make -sure to install the additional dependencies with `pip install coqui-tts[server]`). -Note that the server is not optimized for performance but gives you an easy way -to interact with the models. - -The demo server provides pretty much the same interface as the CLI command. - -```bash -tts-server -h # see the help -tts-server --list_models # list the available models. -``` - -Run a TTS model, from the release models list, with its default vocoder. -If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize -speech. - -```bash -tts-server --model_name "///" -``` - -Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model. 
- -```bash -tts-server --model_name "///" \ - --vocoder_name "///" -``` - -## Python 🐸TTS API - -You can run a multi-speaker and multi-lingual model in Python as - -```python -import torch -from TTS.api import TTS - -# Get device -device = "cuda" if torch.cuda.is_available() else "cpu" - -# List available 🐸TTS models -print(TTS().list_models()) - -# Init TTS -tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device) - -# Run TTS -# ❗ Since this model is multi-lingual voice cloning model, we must set the target speaker_wav and language -# Text to speech list of amplitude values as output -wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en") -# Text to speech to a file -tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") -``` - -#### Here is an example for a single speaker model. - -```python -# Init TTS with the target model name -tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False) -# Run TTS -tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH) -``` - -#### Example voice cloning with YourTTS in English, French and Portuguese: - -```python -tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to("cuda") -tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") -tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr", file_path="output.wav") -tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt", file_path="output.wav") -``` - -#### Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav` - -```python -tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda") -tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav") -``` - -#### Example voice cloning by a single speaker TTS model combining with the voice conversion model. - -This way, you can clone voices by using any model in 🐸TTS. - -```python -tts = TTS("tts_models/de/thorsten/tacotron2-DDC") -tts.tts_with_vc_to_file( - "Wie sage ich auf Italienisch, dass ich dich liebe?", - speaker_wav="target/speaker.wav", - file_path="ouptut.wav" -) -``` - -#### Example text to speech using **Fairseq models in ~1100 languages** 🤯. -For these models use the following name format: `tts_models//fairseq/vits`. - -You can find the list of language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms). - -```python -from TTS.api import TTS -api = TTS(model_name="tts_models/eng/fairseq/vits").to("cuda") -api.tts_to_file("This is a test.", file_path="output.wav") - -# TTS with on the fly voice conversion -api = TTS("tts_models/deu/fairseq/vits") -api.tts_with_vc_to_file( - "Wie sage ich auf Italienisch, dass ich dich liebe?", - speaker_wav="target/speaker.wav", - file_path="ouptut.wav" -) +```{toctree} +:hidden: +server +marytts ``` diff --git a/docs/source/installation.md b/docs/source/installation.md index 405c4366..1315395a 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -1,40 +1,6 @@ # Installation -🐸TTS supports python >=3.9 <3.13.0 and was tested on Ubuntu 22.04. 
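Because the new `installation.md` defers the actual install commands to a README include, a quick post-install smoke test can be useful. The sketch below assumes only what this diff already shows elsewhere (`TTS.api.TTS` and its `list_models()` method); it is a convenience check, not part of the official instructions.

```python
# Post-install smoke test: if this prints the catalogue of pretrained
# models, the package and its dependencies are importable and working.
from TTS.api import TTS

print(TTS().list_models())
```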
- -## Using `pip` - -`pip` is recommended if you want to use 🐸TTS only for inference. - -You can install from PyPI as follows: - -```bash -pip install coqui-tts # from PyPI +```{include} ../../README.md +:start-after: +:end-before: ``` - -Or install from Github: - -```bash -pip install git+https://github.com/idiap/coqui-ai-TTS # from Github -``` - -## Installing From Source - -This is recommended for development and more control over 🐸TTS. - -```bash -git clone https://github.com/idiap/coqui-ai-TTS -cd coqui-ai-TTS -make system-deps # only on Linux systems. - -# Install package and optional extras -make install - -# Same as above + dev dependencies and pre-commit -make install_dev -``` - -## On Windows -If you are on Windows, 👑@GuyPaddock wrote installation instructions -[here](https://stackoverflow.com/questions/66726331/) (note that these are out -of date, e.g. you need to have at least Python 3.9) diff --git a/docs/source/main_classes/model_api.md b/docs/source/main_classes/model_api.md index 71b3d416..bb7e9d1a 100644 --- a/docs/source/main_classes/model_api.md +++ b/docs/source/main_classes/model_api.md @@ -1,22 +1,22 @@ # Model API Model API provides you a set of functions that easily make your model compatible with the `Trainer`, -`Synthesizer` and `ModelZoo`. +`Synthesizer` and the Coqui Python API. -## Base TTS Model +## Base Trainer Model ```{eval-rst} .. autoclass:: TTS.model.BaseTrainerModel :members: ``` -## Base tts Model +## Base TTS Model ```{eval-rst} .. autoclass:: TTS.tts.models.base_tts.BaseTTS :members: ``` -## Base vocoder Model +## Base Vocoder Model ```{eval-rst} .. autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder diff --git a/docs/source/main_classes/trainer_api.md b/docs/source/main_classes/trainer_api.md index 335294aa..bdb6048e 100644 --- a/docs/source/main_classes/trainer_api.md +++ b/docs/source/main_classes/trainer_api.md @@ -1,3 +1,3 @@ # Trainer API -We made the trainer a separate project on https://github.com/eginhard/coqui-trainer +We made the trainer a separate project: https://github.com/idiap/coqui-ai-Trainer diff --git a/docs/source/marytts.md b/docs/source/marytts.md index 9091ca33..11cf4a2b 100644 --- a/docs/source/marytts.md +++ b/docs/source/marytts.md @@ -1,4 +1,4 @@ -# Mary-TTS API Support for Coqui-TTS +# Mary-TTS API support for Coqui TTS ## What is Mary-TTS? diff --git a/docs/source/models/xtts.md b/docs/source/models/xtts.md index 7c0f1c4a..96f5bb7c 100644 --- a/docs/source/models/xtts.md +++ b/docs/source/models/xtts.md @@ -1,25 +1,25 @@ -# ⓍTTS -ⓍTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise, -ⓍTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy. +# XTTS +XTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise, +XTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy. There is no need for an excessive amount of training data that spans countless hours. -### Features +## Features - Voice cloning. - Cross-language voice cloning. - Multi-lingual speech generation. - 24khz sampling rate. -- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-inference)) +- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-manually)) - Fine-tuning support. 
(See [Training](#training)) -### Updates with v2 +## Updates with v2 - Improved voice cloning. - Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime. - Across the board quality improvements. -### Code +## Code Current implementation only supports inference and GPT encoder training. -### Languages +## Languages XTTS-v2 supports 17 languages: - Arabic (ar) @@ -40,15 +40,15 @@ XTTS-v2 supports 17 languages: - Spanish (es) - Turkish (tr) -### License +## License This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml). -### Contact +## Contact Come and join in our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Github](https://github.com/idiap/coqui-ai-TTS/discussions). -### Inference +## Inference -#### 🐸TTS Command line +### 🐸TTS Command line You can check all supported languages with the following command: @@ -64,7 +64,7 @@ You can check all Coqui available speakers with the following command: --list_speaker_idx ``` -##### Coqui speakers +#### Coqui speakers You can do inference using one of the available speakers using the following command: ```console @@ -75,10 +75,10 @@ You can do inference using one of the available speakers using the following com --use_cuda ``` -##### Clone a voice +#### Clone a voice You can clone a speaker voice using a single or multiple references: -###### Single reference +##### Single reference ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ @@ -88,7 +88,7 @@ You can clone a speaker voice using a single or multiple references: --use_cuda ``` -###### Multiple references +##### Multiple references ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --text "Bugün okula gitmek istemiyorum." \ @@ -106,12 +106,12 @@ or for all wav files in a directory you can use: --use_cuda ``` -#### 🐸TTS API +### 🐸TTS API -##### Clone a voice +#### Clone a voice You can clone a speaker voice using a single or multiple references: -###### Single reference +##### Single reference Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio. You can optionally disable sentence splitting for better coherence but more VRAM and possibly hitting models context length limit. @@ -129,7 +129,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t ) ``` -###### Multiple references +##### Multiple references You can pass multiple audio files to the `speaker_wav` argument for better voice cloning. @@ -154,7 +154,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t language="en") ``` -##### Coqui speakers +#### Coqui speakers You can do inference using one of the available speakers using the following code: @@ -172,11 +172,11 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t ``` -#### 🐸TTS Model API +### 🐸TTS Model API To use the model API, you need to download the model files and pass config and model file paths manually. -#### Manual Inference +### Manual Inference If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first. @@ -184,7 +184,7 @@ If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjo pip install deepspeed==0.10.3 ``` -##### inference parameters +#### Inference parameters - `text`: The text to be synthesized. 
- `language`: The language of the text to be synthesized. @@ -199,7 +199,7 @@ pip install deepspeed==0.10.3 - `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might loose important context between sentences. Defaults to True. -##### Inference +#### Inference ```python @@ -231,7 +231,7 @@ torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000) ``` -##### Streaming manually +#### Streaming manually Here the goal is to stream the audio as it is being generated. This is useful for real-time applications. Streaming inference is typically slower than regular inference, but it allows to get a first chunk of audio faster. @@ -275,9 +275,9 @@ torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000) ``` -### Training +## Training -#### Easy training +### Easy training To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio demo that implements the whole fine-tuning pipeline. The gradio demo enables the user to easily do the following steps: - Preprocessing of the uploaded audio or audio files in 🐸 TTS coqui formatter @@ -286,7 +286,7 @@ To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio The user can run this gradio demo locally or remotely using a Colab Notebook. -##### Run demo on Colab +#### Run demo on Colab To make the `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available we did a Google Colab Notebook. The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing). @@ -302,7 +302,7 @@ If you are not able to acess the video you need to follow the steps: 5. Soon the training is done you can go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can do the inference on the model by clicking on the button "Step 4 - Inference". -##### Run demo locally +#### Run demo locally To run the demo locally you need to do the following steps: 1. Install 🐸 TTS following the instructions available [here](https://coqui-tts.readthedocs.io/en/latest/installation.html). @@ -319,7 +319,7 @@ If you are not able to access the video, here is what you need to do: 4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. 5. Now you can run inference with the model by clicking on the button "Step 4 - Inference". -#### Advanced training +### Advanced training A recipe for `XTTS_v2` GPT encoder training using `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py @@ -393,6 +393,6 @@ torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000) ## XTTS Model ```{eval-rst} -.. autoclass:: TTS.tts.models.xtts.XTTS +.. 
autoclass:: TTS.tts.models.xtts.Xtts :members: ``` diff --git a/docs/source/project_structure.md b/docs/source/project_structure.md new file mode 100644 index 00000000..af3e472a --- /dev/null +++ b/docs/source/project_structure.md @@ -0,0 +1,30 @@ +# Project structure + +## Directory structure + +A non-comprehensive overview of the Coqui source code: + +| Directory | Contents | +| - | - | +| **Core** | | +| **[`TTS/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS)** | Main source code | +| **[`- .models.json`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/.models.json)** | Pretrained model list | +| **[`- api.py`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/api.py)** | Python API | +| **[`- bin/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/bin)** | Executables and CLI | +| **[`- tts/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts)** | Text-to-speech models | +| **[`- configs/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/configs)** | Model configurations | +| **[`- layers/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/layers)** | Model layer definitions | +| **[`- models/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/models)** | Model definitions | +| **[`- vc/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/vc)** | Voice conversion models | +| `- (same)` | | +| **[`- vocoder/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/vocoder)** | Vocoder models | +| `- (same)` | | +| **[`- encoder/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/encoder)** | Speaker encoder models | +| `- (same)` | | +| **Recipes/notebooks** | | +| **[`notebooks/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/notebooks)** | Jupyter Notebooks for model evaluation, parameter selection and data analysis | +| **[`recipes/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes)** | Training recipes | +| **Others** | | +| **[`pyproject.toml`](https://github.com/idiap/coqui-ai-TTS/tree/dev/pyproject.toml)** | Project metadata, configuration and dependencies | +| **[`docs/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/docs)** | Documentation | +| **[`tests/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/tests)** | Unit and integration tests | diff --git a/docs/source/server.md b/docs/source/server.md new file mode 100644 index 00000000..3fa211d0 --- /dev/null +++ b/docs/source/server.md @@ -0,0 +1,30 @@ +# Demo server + +![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif) + +You can boot up a demo 🐸TTS server to run an inference with your models (make +sure to install the additional dependencies with `pip install coqui-tts[server]`). +Note that the server is not optimized for performance and does not support all +Coqui models yet. + +The demo server provides pretty much the same interface as the CLI command. + +```bash +tts-server -h # see the help +tts-server --list_models # list the available models. +``` + +Run a TTS model, from the release models list, with its default vocoder. +If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize +speech. + +```bash +tts-server --model_name "///" +``` + +Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model. 
+ +```bash +tts-server --model_name "///" \ + --vocoder_name "///" +``` diff --git a/docs/source/finetuning.md b/docs/source/training/finetuning.md similarity index 93% rename from docs/source/finetuning.md rename to docs/source/training/finetuning.md index 548e385e..1fe54fbc 100644 --- a/docs/source/finetuning.md +++ b/docs/source/training/finetuning.md @@ -1,4 +1,4 @@ -# Fine-tuning a 🐸 TTS model +# Fine-tuning a model ## Fine-tuning @@ -21,8 +21,9 @@ them and fine-tune it for your own dataset. This will help you in two main ways: Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own speech dataset and achieve reasonable results with only a couple of hours of data. - However, note that, fine-tuning does not ensure great results. The model performance still depends on the - {ref}`dataset quality ` and the hyper-parameters you choose for fine-tuning. Therefore, + However, note that, fine-tuning does not ensure great results. The model + performance still depends on the [dataset quality](../datasets/what_makes_a_good_dataset.md) + and the hyper-parameters you choose for fine-tuning. Therefore, it still takes a bit of tinkering. @@ -31,7 +32,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways: 1. Setup your dataset. You need to format your target dataset in a certain way so that 🐸TTS data loader will be able to load it for the - training. Please see {ref}`this page ` for more information about formatting. + training. Please see [this page](../datasets/formatting_your_dataset.md) for more information about formatting. 2. Choose the model you want to fine-tune. @@ -47,7 +48,8 @@ them and fine-tune it for your own dataset. This will help you in two main ways: You should choose the model based on your requirements. Some models are fast and some are better in speech quality. One lazy way to test a model is running the model on the hardware you want to use and see how it works. For - simple testing, you can use the `tts` command on the terminal. For more info see {ref}`here `. + simple testing, you can use the `tts` command on the terminal. For more info + see [here](../inference.md). 3. Download the model. diff --git a/docs/source/training/index.md b/docs/source/training/index.md new file mode 100644 index 00000000..bb76a705 --- /dev/null +++ b/docs/source/training/index.md @@ -0,0 +1,10 @@ +# Training and fine-tuning + +The following pages show you how to train and fine-tune Coqui models: + +```{toctree} +:maxdepth: 1 + +training_a_model +finetuning +``` diff --git a/docs/source/training_a_model.md b/docs/source/training/training_a_model.md similarity index 92% rename from docs/source/training_a_model.md rename to docs/source/training/training_a_model.md index 989a5704..22505ccb 100644 --- a/docs/source/training_a_model.md +++ b/docs/source/training/training_a_model.md @@ -1,4 +1,4 @@ -# Training a Model +# Training a model 1. Decide the model you want to use. @@ -11,11 +11,10 @@ 3. Check the recipes. - Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good start point for - `Nervous Beginners`. + Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good start point. A recipe for `GlowTTS` using `LJSpeech` dataset looks like below. Let's be creative and call this `train_glowtts.py`. 
- ```{literalinclude} ../../recipes/ljspeech/glow_tts/train_glowtts.py + ```{literalinclude} ../../../recipes/ljspeech/glow_tts/train_glowtts.py ``` You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig` @@ -113,7 +112,7 @@ Note that different models have different metrics, visuals and outputs. - You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions + You should also check the [FAQ page](../faq.md) for common problems and solutions that occur in a training. 7. Use your best model for inference. @@ -132,7 +131,7 @@ In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models. -# Multi-speaker Training +## Multi-speaker Training Training a multi-speaker model is mostly the same as training a single-speaker model. You need to specify a couple of configuration parameters, initiate a `SpeakerManager` instance and pass it to the model. @@ -142,5 +141,5 @@ d-vectors. For using d-vectors, you first need to compute the d-vectors using th The same Glow-TTS model above can be trained on a multi-speaker VCTK dataset with the script below. -```{literalinclude} ../../recipes/vctk/glow_tts/train_glow_tts.py +```{literalinclude} ../../../recipes/vctk/glow_tts/train_glow_tts.py ``` diff --git a/docs/source/tutorial_for_nervous_beginners.md b/docs/source/tutorial_for_nervous_beginners.md index b417c4c4..a8a64410 100644 --- a/docs/source/tutorial_for_nervous_beginners.md +++ b/docs/source/tutorial_for_nervous_beginners.md @@ -1,24 +1,37 @@ -# Tutorial For Nervous Beginners +# Tutorial for nervous beginners -## Installation +First [install](installation.md) Coqui TTS. -User friendly installation. Recommended only for synthesizing voice. +## Synthesizing Speech + +You can run `tts` and synthesize speech directly on the terminal. ```bash -$ pip install coqui-tts +$ tts -h # see the help +$ tts --list_models # list the available models. ``` -Developer friendly installation. +![cli.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/tts_cli.gif) + + +You can call `tts-server` to start a local demo server that you can open on +your favorite web browser and 🗣️ (make sure to install the additional +dependencies with `pip install coqui-tts[server]`). ```bash -$ git clone https://github.com/idiap/coqui-ai-TTS -$ cd coqui-ai-TTS -$ pip install -e . +$ tts-server -h # see the help +$ tts-server --list_models # list the available models. ``` +![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif) + +See [this page](inference.md) for more details on synthesizing speech with the +CLI, server or Python API. ## Training a `tts` Model -A breakdown of a simple script that trains a GlowTTS model on the LJspeech dataset. See the comments for more details. +A breakdown of a simple script that trains a GlowTTS model on the LJspeech +dataset. For a more in-depth guide to training and fine-tuning also see [this +page](training/index.md). ### Pure Python Way @@ -99,25 +112,3 @@ We still support running training from CLI like in the old days. The same traini ``` ❗️ Note that you can also use ```train_vocoder.py``` as the ```tts``` models above. - -## Synthesizing Speech - -You can run `tts` and synthesize speech directly on the terminal. - -```bash -$ tts -h # see the help -$ tts --list_models # list the available models. 
-``` - -![cli.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/tts_cli.gif) - - -You can call `tts-server` to start a local demo server that you can open on -your favorite web browser and 🗣️ (make sure to install the additional -dependencies with `pip install coqui-tts[server]`). - -```bash -$ tts-server -h # see the help -$ tts-server --list_models # list the available models. -``` -![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif) diff --git a/pyproject.toml b/pyproject.toml index bf0a1d88..16d990c1 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -143,12 +143,12 @@ dev = [ ] # Dependencies for building the documentation docs = [ - "furo>=2023.5.20", - "myst-parser==2.0.0", - "sphinx==7.2.5", + "furo>=2024.8.6", + "myst-parser==3.0.1", + "sphinx==7.4.7", "sphinx_inline_tabs>=2023.4.21", - "sphinx_copybutton>=0.1", - "linkify-it-py>=2.0.0", + "sphinx_copybutton>=0.5.2", + "linkify-it-py>=2.0.3", ] [project.urls] diff --git a/scripts/sync_readme.py b/scripts/sync_readme.py index 58428681..97256bca 100644 --- a/scripts/sync_readme.py +++ b/scripts/sync_readme.py @@ -22,8 +22,12 @@ def sync_readme(): new_content = replace_between_markers(orig_content, "tts-readme", description.strip()) if args.check: if orig_content != new_content: - print("README.md is out of sync; please edit TTS/bin/TTS_README.md and run scripts/sync_readme.py") + print( + "README.md is out of sync; please reconcile README.md and TTS/bin/synthesize.py and run scripts/sync_readme.py" + ) exit(42) + print("All good, files in sync") + exit(0) readme_path.write_text(new_content) print("Updated README.md")
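For contributors touching the README, the updated `scripts/sync_readme.py` exits with code 42 when the file is stale and now confirms success explicitly. A usage sketch, assuming the script is run from the repository root and that `--check` is the argparse flag behind `args.check`:

```bash
# Verify that README.md matches the CLI help text; exits 42 if out of sync.
python scripts/sync_readme.py --check

# Rewrite README.md from the CLI help text.
python scripts/sync_readme.py
```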