mirror of https://github.com/coqui-ai/TTS.git
commit cd52907351

@@ -11,30 +11,25 @@ You can contribute not only with code but with bug reports, comments, questions,

If you like to contribute code, squash a bug but if you don't know where to start, here are some pointers.

- [Github Issues Tracker](https://github.com/idiap/coqui-ai-TTS/issues)

This is a place to find feature requests and bugs.

Issues with the ```good first issue``` tag are a good place for beginners to take on. Issues tagged with `help wanted` are suited for more experienced outside contributors.

- Also feel free to suggest new features, ideas and models. We're always open to new things.

## Call for sharing pretrained models

If possible, please consider sharing your pre-trained models in any language (if the licences allow you to do so). We will include them in our model catalogue for public use and give the proper attribution, whether it be your name, company, website or any other source specified.

Your model can be shared in two ways:
1. Share the model files with us and we serve them with the next 🐸 TTS release.
2. Upload your models on GDrive and share the link.

Models are served under the `.models.json` file and any model is available under the TTS CLI and Python API endpoints.
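
If you want to see which models are currently registered there, the Python API used throughout the README can list them; a minimal sketch:

```python
# Minimal sketch: print the models registered in .models.json via the Python API.
from TTS.api import TTS

print(TTS().list_models())
```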

Either way you choose, please make sure you send the models [here](https://github.com/coqui-ai/TTS/discussions/930).

@@ -135,7 +130,8 @@ curl -LsSf https://astral.sh/uv/install.sh | sh

13. Let's discuss until it is perfect. 💪

We might ask you for certain changes that would appear in the [Github ✨**PR**✨'s page](https://github.com/idiap/coqui-ai-TTS/pulls).

14. Once things look perfect, we merge it to the ```dev``` branch and make it ready for the next version.

@@ -143,9 +139,9 @@ curl -LsSf https://astral.sh/uv/install.sh | sh

If you prefer working within a Docker container as your development environment, you can do the following:

1. Fork the 🐸TTS [Github repository](https://github.com/idiap/coqui-ai-TTS) by clicking the fork button at the top right corner of the page.

2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```.

```bash
git clone git@github.com:<your Github name>/coqui-ai-TTS.git
cd coqui-ai-TTS
# add the main repo as a new remote named "upstream"
git remote add upstream https://github.com/idiap/coqui-ai-TTS.git
```

Makefile

@@ -59,9 +59,6 @@ lint: ## run linters.

system-deps: ## install linux system deps
	sudo apt-get install -y libsndfile1-dev

install: ## install 🐸 TTS
	uv sync --all-extras

@@ -70,4 +67,4 @@ install_dev: ## install 🐸 TTS for development.
	uv run pre-commit install

docs: ## build the docs
	uv run --group docs $(MAKE) -C docs clean && uv run --group docs $(MAKE) -C docs html

README.md

@@ -1,39 +1,34 @@

# <img src="https://raw.githubusercontent.com/idiap/coqui-ai-TTS/main/images/coqui-log-green-TTS.png" height="56"/>

**🐸 Coqui TTS is a library for advanced Text-to-Speech generation.**

🚀 Pretrained models in +1100 languages.

🛠️ Tools for training new models and fine-tuning existing models in any language.

📚 Utilities for dataset analysis and curation.

[Discord](https://discord.gg/5eXr5seRrv)
[Python versions](https://pypi.org/project/coqui-tts/)
[License: MPL 2.0](https://opensource.org/licenses/MPL-2.0)
[PyPI version](https://pypi.org/project/coqui-tts/)
[Downloads](https://pepy.tech/project/coqui-tts)
[DOI](https://zenodo.org/badge/latestdoi/265612440)
[Tests](https://github.com/idiap/coqui-ai-TTS/actions/workflows/tests.yml)
[Docker build](https://github.com/idiap/coqui-ai-TTS/actions/workflows/docker.yaml)
[Style check](https://github.com/idiap/coqui-ai-TTS/actions/workflows/style_check.yml)
[Docs](https://coqui-tts.readthedocs.io/en/latest/)

</div>

## 📣 News
- **Fork of the [original, unmaintained repository](https://github.com/coqui-ai/TTS). New PyPI package: [coqui-tts](https://pypi.org/project/coqui-tts)**
- 0.25.0: [OpenVoice](https://github.com/myshell-ai/OpenVoice) models now available for voice conversion.
- 0.24.2: Prebuilt wheels are now also published for Mac and Windows (in addition to Linux as before) for easier installation across platforms.
- 0.20.0: XTTSv2 is here with 17 languages and better performance across the board. XTTS can stream with <200ms latency.
- 0.19.0: XTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech).
- 0.14.1: You can use [Fairseq models in ~1100 languages](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS.

## 💬 Where to ask questions
Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly so that more people can benefit from it.

@@ -63,71 +58,67 @@ repository are also still a useful source of information.

| 🚀 **Released Models** | [Standard models](https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json) and [Fairseq models in ~1100 languages](https://github.com/idiap/coqui-ai-TTS#example-text-to-speech-using-fairseq-models-in-1100-languages-)|

## Features
- High-performance text-to-speech and voice conversion models, see list below.
- Fast and efficient model training with detailed training logs on the terminal and Tensorboard.
- Support for multi-speaker and multilingual TTS.
- Released and ready-to-use models.
- Tools to curate TTS datasets under ```dataset_analysis/```.
- Command line and Python APIs to use and test your models.
- Modular (but not too much) code base enabling easy implementation of new ideas.

## Model Implementations
### Spectrogram models
- [Tacotron](https://arxiv.org/abs/1703.10135), [Tacotron2](https://arxiv.org/abs/1712.05884)
- [Glow-TTS](https://arxiv.org/abs/2005.11129), [SC-GlowTTS](https://arxiv.org/abs/2104.05557)
- [Speedy-Speech](https://arxiv.org/abs/2008.03802)
- [Align-TTS](https://arxiv.org/abs/2003.01950)
- [FastPitch](https://arxiv.org/pdf/2006.06873.pdf)
- [FastSpeech](https://arxiv.org/abs/1905.09263), [FastSpeech2](https://arxiv.org/abs/2006.04558)
- [Capacitron](https://arxiv.org/abs/1906.03402)
- [OverFlow](https://arxiv.org/abs/2211.06892)
- [Neural HMM TTS](https://arxiv.org/abs/2108.13320)
- [Delightful TTS](https://arxiv.org/abs/2110.12612)

### End-to-End Models
- [XTTS](https://arxiv.org/abs/2406.04904)
- [VITS](https://arxiv.org/pdf/2106.06103)
- 🐸[YourTTS](https://arxiv.org/abs/2112.02418)
- 🐢[Tortoise](https://github.com/neonbjb/tortoise-tts)
- 🐶[Bark](https://github.com/suno-ai/bark)

### Vocoders
- [MelGAN](https://arxiv.org/abs/1910.06711)
- [MultiBandMelGAN](https://arxiv.org/abs/2005.05106)
- [ParallelWaveGAN](https://arxiv.org/abs/1910.11480)
- [GAN-TTS discriminators](https://arxiv.org/abs/1909.11646)
- [WaveRNN](https://github.com/fatchord/WaveRNN/)
- [WaveGrad](https://arxiv.org/abs/2009.00713)
- [HiFiGAN](https://arxiv.org/abs/2010.05646)
- [UnivNet](https://arxiv.org/abs/2106.07889)

### Voice Conversion
- [FreeVC](https://arxiv.org/abs/2210.15418)
- [OpenVoice](https://arxiv.org/abs/2312.01479)

### Others
- Attention methods: [Guided Attention](https://arxiv.org/abs/1710.08969),
  [Forward Backward Decoding](https://arxiv.org/abs/1907.09006),
  [Graves Attention](https://arxiv.org/abs/1910.10288),
  [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/),
  [Dynamic Convolutional Attention](https://arxiv.org/pdf/1910.10288.pdf),
  [Alignment Network](https://arxiv.org/abs/2108.10447)
- Speaker encoders: [GE2E](https://arxiv.org/abs/1710.10467),
  [Angular Loss](https://arxiv.org/pdf/2003.11982.pdf)

You can also help us implement more models.

<!-- start installation -->
## Installation
🐸TTS is tested on Ubuntu 24.04 with **python >= 3.9, < 3.13**, but should also work on Mac and Windows.

If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the pretrained 🐸TTS models, installing from PyPI is the easiest option.

```bash
pip install coqui-tts
```

@@ -165,21 +156,18 @@ pip install -e .[server,ja]

### Platforms

If you are on Ubuntu (Debian), you can also run the following commands for installation.

```bash
make system-deps
make install
```

<!-- end installation -->

## Docker Image
You can also try out Coqui TTS without installation with the docker image.
Simply run the following command and you will be able to run TTS:

```bash
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
```

@@ -193,10 +181,10 @@ More details about the docker images (like GPU support) can be found

## Synthesizing speech by 🐸TTS

<!-- start inference -->
### 🐍 Python API

#### Multi-speaker and multi-lingual model

```python
import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# List available 🐸TTS models
print(TTS().list_models())

# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# List speakers
print(tts.speakers)

# Run TTS
# ❗ XTTS supports both, but many models allow only one of the `speaker` and
# `speaker_wav` arguments

# TTS with list of amplitude values as output, clone the voice from `speaker_wav`
wav = tts.tts(
    text="Hello world!",
    speaker_wav="my/cloning/audio.wav",
    language="en"
)

# TTS to a file, use a preset speaker
tts.tts_to_file(
    text="Hello world!",
    speaker="Craig Gutsy",
    language="en",
    file_path="output.wav"
)
```

#### Single speaker model

```python
# Initialize TTS with the target model name
tts = TTS("tts_models/de/thorsten/tacotron2-DDC").to(device)

# Run TTS
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
```

#### Voice conversion (VC)

Converting the voice in `source_wav` to the voice of `target_wav`

```python
tts = TTS("voice_conversion_models/multilingual/vctk/freevc24").to("cuda")
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    file_path="output.wav"
)
```

Other available voice conversion models:
- `voice_conversion_models/multilingual/multi-dataset/openvoice_v1`
- `voice_conversion_models/multilingual/multi-dataset/openvoice_v2`
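
These can be used with the same call as above; for example, a minimal sketch with the OpenVoice v2 model (the audio paths are placeholders):

```python
# Sketch: voice conversion with the OpenVoice v2 model, same API as above.
tts = TTS("voice_conversion_models/multilingual/multi-dataset/openvoice_v2").to(device)
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    file_path="output.wav"
)
```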

#### Voice cloning by combining single speaker TTS model with the default VC model

This way, you can clone voices by using any model in 🐸TTS. The FreeVC model is used for voice conversion after synthesizing speech.
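
A minimal sketch of such a call using `tts_with_vc_to_file` (the text, reference audio path and model name are placeholders; the argument names follow the examples above):

```python
# Sketch: synthesize with a single-speaker model, then convert the result to the
# voice in `speaker_wav` via the default FreeVC model.
tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_with_vc_to_file(
    "Ich bin eine Testnachricht.",
    speaker_wav="target/speaker.wav",
    file_path="output.wav"
)
```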

#### TTS using Fairseq models in ~1100 languages 🤯
For Fairseq models, use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
You can find the language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html)
and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).
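
A minimal sketch following that name format (English, ISO code `eng`; the text and output path are placeholders):

```python
# Sketch: synthesize with a Fairseq-based VITS model, using the
# `tts_models/<lang-iso_code>/fairseq/vits` name format described above.
api = TTS("tts_models/eng/fairseq/vits")
api.tts_to_file(
    "Hello from a Fairseq model.",
    file_path="output.wav"
)
```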

### Command-line interface `tts`

<!-- begin-tts-readme -->

Synthesize speech on the command line.

You can either use your trained model or choose a model from the provided list.

- List provided models:

```sh
tts --list_models
```

- Get model information. Use the names obtained from `--list_models`.

```sh
tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```

For example:

```sh
tts --model_info_by_name tts_models/tr/common-voice/glow-tts
tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```

#### Single speaker models

- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`):

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav
```

- Run TTS and pipe out the generated TTS wav file data:

```sh
tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```

- Run a TTS model with its default vocoder model:

```sh
tts --text "Text for TTS" \
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \
    --model_name "tts_models/en/ljspeech/glow-tts" \
    --out_path output/path/speech.wav
```

- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model.

```sh
tts --text "Text for TTS" \
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \
    --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" \
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \
    --model_name "tts_models/en/ljspeech/glow-tts" \
    --vocoder_name "vocoder_models/en/ljspeech/univnet" \
    --out_path output/path/speech.wav
```

- Run your own TTS model (using Griffin-Lim Vocoder):

```sh
tts --text "Text for TTS" \
    --model_path path/to/model.pth \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav
```

- Run your own TTS and Vocoder models:

```sh
tts --text "Text for TTS" \
    --model_path path/to/model.pth \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav \
    --vocoder_path path/to/vocoder.pth \
    --vocoder_config_path path/to/vocoder_config.json
```

#### Multi-speaker models

- List the available speakers and choose a `<speaker_id>` among them:

```sh
tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```

- Run the multi-speaker TTS model with the target speaker ID:

```sh
tts --text "Text for TTS." --out_path output/path/speech.wav \
    --model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```

- Run your own multi-speaker TTS model:

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav \
    --model_path path/to/model.pth --config_path path/to/config.json \
    --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```

#### Voice conversion models

```sh
tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" \
    --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```

<!-- end-tts-readme -->

@@ -14,123 +14,122 @@ from TTS.utils.generic_utils import ConsoleFormatter, setup_logger

logger = logging.getLogger(__name__)

description = """
Synthesize speech on the command line.

You can either use your trained model or choose a model from the provided list.

- List provided models:

```sh
tts --list_models
```

- Get model information. Use the names obtained from `--list_models`.

```sh
tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```

For example:

```sh
tts --model_info_by_name tts_models/tr/common-voice/glow-tts
tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```

#### Single Speaker Models

- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`):

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav
```

- Run TTS and pipe out the generated TTS wav file data:

```sh
tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```

- Run a TTS model with its default vocoder model:

```sh
tts --text "Text for TTS" \\
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \\
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \\
    --model_name "tts_models/en/ljspeech/glow-tts" \\
    --out_path output/path/speech.wav
```

- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model.

```sh
tts --text "Text for TTS" \\
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \\
    --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" \\
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \\
    --model_name "tts_models/en/ljspeech/glow-tts" \\
    --vocoder_name "vocoder_models/en/ljspeech/univnet" \\
    --out_path output/path/speech.wav
```

- Run your own TTS model (using Griffin-Lim Vocoder):

```sh
tts --text "Text for TTS" \\
    --model_path path/to/model.pth \\
    --config_path path/to/config.json \\
    --out_path output/path/speech.wav
```

- Run your own TTS and Vocoder models:

```sh
tts --text "Text for TTS" \\
    --model_path path/to/model.pth \\
    --config_path path/to/config.json \\
    --out_path output/path/speech.wav \\
    --vocoder_path path/to/vocoder.pth \\
    --vocoder_config_path path/to/vocoder_config.json
```

#### Multi-speaker Models

- List the available speakers and choose a `<speaker_id>` among them:

```sh
tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```

- Run the multi-speaker TTS model with the target speaker ID:

```sh
tts --text "Text for TTS." --out_path output/path/speech.wav \\
    --model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```

- Run your own multi-speaker TTS model:

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav \\
    --model_path path/to/model.pth --config_path path/to/config.json \\
    --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```

#### Voice Conversion Models

```sh
tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" \\
    --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```
"""

@@ -12,7 +12,7 @@ from trainer import TrainerModel

class BaseTrainerModel(TrainerModel):
    """BaseTrainerModel model expanding TrainerModel with required functions by 🐸TTS.

    Every new Coqui model must inherit it.
    """

    @staticmethod

@@ -206,12 +206,14 @@ class Bark(BaseTTS):

            speaker_wav (str): Path to the speaker audio file for cloning a new voice. It is cloned and saved in
                `voice_dirs` with the name `speaker_id`. Defaults to None.
            voice_dirs (List[str]): List of paths that host reference audio files for speakers. Defaults to None.
            **kwargs: Model specific inference settings used by `generate_audio()` and
                `TTS.tts.layers.bark.inference_funcs.generate_text_semantic()`.

        Returns:
            A dictionary of the output values with `wav` as output waveform,
            `deterministic_seed` as seed used at inference, `text_input` as text token IDs
            after tokenizer, `voice_samples` as samples used for cloning, and
            `conditioning_latents` as latents used at inference.
        """
        speaker_id = "random" if speaker_id is None else speaker_id

@@ -80,12 +80,14 @@ class BaseTTS(BaseTrainerModel):

            raise ValueError("config must be either a *Config or *Args")

    def init_multispeaker(self, config: Coqpit, data: List = None):
        """Set up for multi-speaker TTS.

        Initialize a speaker embedding layer if needed and define the expected embedding
        channel size for setting the `in_channels` size of the connected layers.

        This implementation yields 3 possible outcomes:

        1. If `config.use_speaker_embedding` and `config.use_d_vector_file` are False, do nothing.
        2. If `config.use_d_vector_file` is True, set expected embedding channel size to `config.d_vector_dim` or 512.
        3. If `config.use_speaker_embedding`, initialize a speaker embedding layer with channel size of
           `config.d_vector_dim` or 512.

@@ -48,10 +48,11 @@ class Overflow(BaseTTS):

    are available at https://shivammehta25.github.io/OverFlow/.

    Note:
        - Neural HMMs use flat start initialization, i.e. they compute the means,
          std and transition probabilities of the dataset and use them to initialize
          the model. This benefits the model and helps with faster learning. If you change
          the dataset or want to regenerate the parameters, change
          `force_generate_statistics` and `mel_statistics_parameter_path` accordingly.

        - To enable multi-GPU training, set the `use_grad_checkpointing=False` in config.
          This will significantly increase the memory usage. This is because to compute

@@ -423,7 +423,9 @@ class Tortoise(BaseTTS):

        Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
        These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
        properties.

        :param voice_samples: List of arbitrary reference clips, which should be *pairs*
            of torch tensors containing arbitrary kHz waveform data.
        :param latent_averaging_mode: 0/1/2 for following modes:
            0 - latents will be generated as in original tortoise, using ~4.27s from each voice sample, averaging latent across all samples
            1 - latents will be generated using (almost) entire voice samples, averaged across all the ~4.27s chunks

@@ -671,7 +673,7 @@ class Tortoise(BaseTTS):

            As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
            diffusion_temperature: (float) Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                are the "mean" prediction of the diffusion network and will sound bland and smeared.
            hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer.
                Extra keyword args fed to this function get forwarded directly to that API. Documentation
                here: https://huggingface.co/docs/transformers/internal/generation_utils

@@ -178,7 +178,7 @@ class XttsArgs(Coqpit):

class Xtts(BaseTTS):
    """XTTS model implementation.

    ❗ Currently it only supports inference.

@@ -460,7 +460,7 @@ class Xtts(BaseTTS):

            gpt_cond_chunk_len: (int) Chunk length used for cloning. It must be <= `gpt_cond_len`.
                If gpt_cond_len == gpt_cond_chunk_len, no chunking. Defaults to 6 seconds.

            hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive
                transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation
                here: https://huggingface.co/docs/transformers/internal/generation_utils

@@ -52,6 +52,7 @@ extensions = [

    "sphinx_inline_tabs",
]

suppress_warnings = ["autosectionlabel.*"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

@@ -67,6 +68,8 @@ myst_enable_extensions = [

    "linkify",
]

myst_heading_anchors = 4

# 'sphinxcontrib.katex',
# 'sphinx.ext.autosectionlabel',

@@ -1,6 +1,6 @@

# Configuration

We use 👩✈️[Coqpit](https://github.com/idiap/coqui-ai-coqpit) for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is what a simple configuration looks like with Coqpit.

```python
from dataclasses import asdict, dataclass, field

...

class SimpleConfig(Coqpit):
    ...
    check_argument("val_c", c, restricted=True)
```

In Coqui, each model must have a configuration class that exposes all the values necessary for its lifetime.

It defines model architecture, hyper-parameters, training, and inference settings. For our models, we merge all the fields in a single configuration class for ease. It may not look like a wise practice but enables easier bookkeeping and reproducible experiments.
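
As an illustration, a minimal Coqpit-based config might look like this (the class and field names below are made up for the sketch, not taken from 🐸TTS):

```python
from dataclasses import asdict, dataclass, field

from coqpit import Coqpit


@dataclass
class MyModelConfig(Coqpit):
    # illustrative fields only
    model: str = "my_model"
    hidden_channels: int = 256
    learning_rate: float = 1e-3
    datasets: list = field(default_factory=list)


config = MyModelConfig(hidden_channels=512)
print(asdict(config))  # Coqpit configs behave like plain dataclasses and serialize easily
```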

@@ -1,7 +1,9 @@

(formatting_your_dataset)=
# Formatting your dataset

For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips and each clip needs a transcription.
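
Transcriptions are typically collected in a pipe-separated metadata file; the sketch below assumes a simple `<audio_id>|<transcription>` layout (the IDs and sentences are made up):

```python
# Sketch: parse an LJSpeech-style metadata file with "<audio_id>|<transcription>" lines.
metadata = """\
audio_0001|This is the transcription of the first clip.
audio_0002|And this is the transcription of the second clip.
"""

for line in metadata.strip().splitlines():
    audio_id, text = line.split("|", maxsplit=1)
    print(audio_id, "->", text)
```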
If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity. It is an open-source and free audio editing software.
|
If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity. It is an open-source and free audio editing software.
|
||||||
|
|
||||||
|
@ -49,7 +51,7 @@ The format above is taken from widely-used the [LJSpeech](https://keithito.com/L
|
||||||
|
|
||||||
Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is extremely important for especially non-phonemic languages like English.
|
Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is extremely important for especially non-phonemic languages like English.
|
||||||
|
|
||||||
For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset).
|
For more info about dataset qualities and properties check [this page](what_makes_a_good_dataset.md).
|
||||||
|
|
||||||
## Using Your Dataset in 🐸TTS
|
## Using Your Dataset in 🐸TTS
|
||||||
|
|
|
@ -0,0 +1,12 @@
|
||||||
|
# Datasets
|
||||||
|
|
||||||
|
For training a TTS model, you need a dataset with speech recordings and
|
||||||
|
transcriptions. See the following pages for more information on:
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
formatting_your_dataset
|
||||||
|
what_makes_a_good_dataset
|
||||||
|
tts_datasets
|
||||||
|
```
|
|
@ -1,6 +1,6 @@
|
||||||
# TTS Datasets
|
# Public TTS datasets
|
||||||
|
|
||||||
Some of the known public datasets that we successfully applied 🐸TTS:
|
Some of the known public datasets that were successfully used for 🐸TTS:
|
||||||
|
|
||||||
- [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
|
- [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
|
||||||
- [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
|
- [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
|
|
@ -1,20 +1,20 @@
|
||||||
(docker_images)=
|
(docker_images)=
|
||||||
## Docker images
|
# Docker images
|
||||||
We provide docker images to be able to test TTS without having to setup your own environment.
|
We provide docker images to be able to test TTS without having to setup your own environment.
|
||||||
|
|
||||||
### Using premade images
|
## Using premade images
|
||||||
You can use premade images built automatically from the latest TTS version.
|
You can use premade images built automatically from the latest TTS version.
|
||||||
|
|
||||||
#### CPU version
|
### CPU version
|
||||||
```bash
|
```bash
|
||||||
docker pull ghcr.io/coqui-ai/tts-cpu
|
docker pull ghcr.io/coqui-ai/tts-cpu
|
||||||
```
|
```
|
||||||
#### GPU version
|
### GPU version
|
||||||
```bash
|
```bash
|
||||||
docker pull ghcr.io/coqui-ai/tts
|
docker pull ghcr.io/coqui-ai/tts
|
||||||
```
|
```
|
||||||
|
|
||||||
### Building your own image
|
## Building your own image
|
||||||
```bash
|
```bash
|
||||||
docker build -t tts .
|
docker build -t tts .
|
||||||
```
|
```
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Implementing a New Language Frontend
|
# Implementing new language front ends
|
||||||
|
|
||||||
- Language front ends are located under `TTS.tts.utils.text`
|
- Language front ends are located under `TTS.tts.utils.text`
|
||||||
- Each special language has a separate folder.
|
- Each special language has a separate folder.
|
|
@ -1,4 +1,4 @@
|
||||||
# Implementing a Model
|
# Implementing new models
|
||||||
|
|
||||||
1. Implement layers.
|
1. Implement layers.
|
||||||
|
|
||||||
|
@ -36,7 +36,8 @@
|
||||||
There is also the `callback` interface, which lets you manipulate both the model and the `Trainer` states. Callbacks give you
|
There is also the `callback` interface, which lets you manipulate both the model and the `Trainer` states. Callbacks give you
|
||||||
virtually unlimited flexibility to add custom behaviours to your model and training routines.
|
virtually unlimited flexibility to add custom behaviours to your model and training routines.
|
||||||
|
|
||||||
For more details, see {ref}`BaseTTS <Base tts Model>` and :obj:`TTS.utils.callbacks`.
|
For more details, see [BaseTTS](../main_classes/model_api.md#base-tts-model)
|
||||||
|
and `TTS.utils.callbacks`.
|
||||||
|
|
||||||
6. Optionally, define `MyModelArgs`.
|
6. Optionally, define `MyModelArgs`.
|
||||||
|
|
||||||
|
@ -62,7 +63,7 @@
|
||||||
We love you more when you document your code. ❤️
|
We love you more when you document your code. ❤️
|
||||||
|
|
||||||
|
|
||||||
# Template 🐸TTS Model implementation
|
## Template 🐸TTS Model implementation
|
||||||
|
|
||||||
You can start implementing your model by copying the following base class.
|
You can start implementing your model by copying the following base class.
|
||||||
|
|
|
@ -0,0 +1,14 @@
|
||||||
|
# Adding models or languages
|
||||||
|
|
||||||
|
You can extend Coqui by implementing new model architectures or adding front
|
||||||
|
ends for new languages. See the pages below for more details. The [project
|
||||||
|
structure](../project_structure.md) and [contribution
|
||||||
|
guidelines](../contributing.md) may also be helpful. Please open a pull request
|
||||||
|
with your changes to share back the improvements with the community.
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
implementing_a_new_model
|
||||||
|
implementing_a_new_language_frontend
|
||||||
|
```
|
|
@ -1,4 +1,4 @@
|
||||||
# Humble FAQ
|
# FAQ
|
||||||
We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper.
|
We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper.
|
||||||
|
|
||||||
## Errors with a pre-trained model. How can I resolve this?
|
## Errors with a pre-trained model. How can I resolve this?
|
||||||
|
@ -7,7 +7,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
|
||||||
- If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny.
|
- If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny.
|
||||||
|
|
||||||
## What are the requirements of a good 🐸TTS dataset?
|
## What are the requirements of a good 🐸TTS dataset?
|
||||||
* {ref}`See this page <what_makes_a_good_dataset>`
|
- [See this page](datasets/what_makes_a_good_dataset.md)
|
||||||
|
|
||||||
## How should I choose the right model?
|
## How should I choose the right model?
|
||||||
- First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2.
|
- First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2.
|
||||||
|
@ -18,7 +18,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
|
||||||
## How can I train my own `tts` model?
|
## How can I train my own `tts` model?
|
||||||
0. Check your dataset with the notebooks in the [dataset_analysis](https://github.com/idiap/coqui-ai-TTS/tree/main/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/idiap/coqui-ai-TTS/blob/main/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in better audio synthesis.
|
0. Check your dataset with the notebooks in the [dataset_analysis](https://github.com/idiap/coqui-ai-TTS/tree/main/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/idiap/coqui-ai-TTS/blob/main/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in better audio synthesis.
|
||||||
|
|
||||||
1. Write your own dataset `formatter` in `datasets/formatters.py` or format your dataset as one of the supported datasets, like LJSpeech.
|
1. Write your own dataset `formatter` in `datasets/formatters.py` or [format](datasets/formatting_your_dataset) your dataset as one of the supported datasets, like LJSpeech.
|
||||||
A `formatter` parses the metadata file and converts it into a list of training samples (see the sketch at the end of this section).
|
A `formatter` parses the metadata file and converts it into a list of training samples (see the sketch at the end of this section).
|
||||||
|
|
||||||
2. If you have a dataset in an alphabet other than English, you need to set your own character list in the ```config.json```.
|
2. If you have a dataset in an alphabet other than English, you need to set your own character list in the ```config.json```.
|
||||||
|
@ -61,7 +61,8 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
|
||||||
- SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json```
|
- SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json```
|
||||||
- MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json```
|
- MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json```
|
||||||
|
|
||||||
**Note:** You can also train your model using pure 🐍 python. Check ```{eval-rst} :ref: 'tutorial_for_nervous_beginners'```.
|
**Note:** You can also train your model using pure 🐍 python. Check the
|
||||||
|
[tutorial](tutorial_for_nervous_beginners.md).
|
||||||
|
|
||||||
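As referenced in step 1 above, here is a rough sketch of a custom dataset `formatter` (the function name, metadata layout, and sample-dict keys are assumptions modeled on the built-in LJSpeech-style formatters; adapt them to your data and 🐸TTS version):

```python
import os


def my_dataset_formatter(root_path, meta_file, **kwargs):
    """Hypothetical formatter: parse metadata lines of the form
    `<wav_id>|<transcription>` into the sample dicts 🐸TTS expects."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            wav_id, text = line.strip().split("|", maxsplit=1)
            items.append(
                {
                    "text": text,
                    "audio_file": os.path.join(root_path, "wavs", f"{wav_id}.wav"),
                    "speaker_name": "my_speaker",
                    "root_path": root_path,
                }
            )
    return items
```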
## How can I train in a different language?
|
## How can I train in a different language?
|
||||||
- Check steps 2, 3, 4, 5 above.
|
- Check steps 2, 3, 4, 5 above.
|
||||||
|
@ -104,7 +105,7 @@ The best approach is to pick a set of promising models and run a Mean-Opinion-Sc
|
||||||
- Check the 4th step under "How can I check model performance?"
|
- Check the 4th step under "How can I check model performance?"
|
||||||
|
|
||||||
## How can I test a trained model?
|
## How can I test a trained model?
|
||||||
- The best way is to use `tts` or `tts-server` commands. For details check {ref}`here <synthesizing_speech>`.
|
- The best way is to use `tts` or `tts-server` commands. For details check [here](inference.md).
|
||||||
- If you need more control, you can use the ```TTS.utils.synthesizer.Synthesizer``` class in your own code (see the sketch below).
|
- If you need more control, you can use the ```TTS.utils.synthesizer.Synthesizer``` class in your own code (see the sketch below).
|
||||||
|
|
||||||
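A minimal sketch of using the `Synthesizer` class directly (the paths are placeholders and the keyword names are assumptions; check the class signature in your installed version):

```python
from TTS.utils.synthesizer import Synthesizer

# Hypothetical paths to a trained model and its config.
synthesizer = Synthesizer(
    tts_checkpoint="path/to/model.pth",
    tts_config_path="path/to/config.json",
    use_cuda=False,
)
wav = synthesizer.tts("This is a test of my trained model.")
synthesizer.save_wav(wav, "test_output.wav")
```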
## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps" - Stopnet does not work.
|
## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps" - Stopnet does not work.
|
||||||
|
|
|
@ -1,50 +1,56 @@
|
||||||
|
---
|
||||||
|
hide-toc: true
|
||||||
|
---
|
||||||
|
|
||||||
```{include} ../../README.md
|
```{include} ../../README.md
|
||||||
:relative-images:
|
:relative-images:
|
||||||
|
:end-before: <!-- start installation -->
|
||||||
```
|
```
|
||||||
----
|
|
||||||
|
|
||||||
# Documentation Content
|
```{toctree}
|
||||||
```{eval-rst}
|
:maxdepth: 1
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
|
||||||
:caption: Get started
|
:caption: Get started
|
||||||
|
:hidden:
|
||||||
|
|
||||||
tutorial_for_nervous_beginners
|
tutorial_for_nervous_beginners
|
||||||
installation
|
installation
|
||||||
|
docker_images
|
||||||
faq
|
faq
|
||||||
|
project_structure
|
||||||
contributing
|
contributing
|
||||||
|
```
|
||||||
|
|
||||||
.. toctree::
|
```{toctree}
|
||||||
:maxdepth: 2
|
:maxdepth: 1
|
||||||
:caption: Using 🐸TTS
|
:caption: Using Coqui
|
||||||
|
:hidden:
|
||||||
|
|
||||||
inference
|
inference
|
||||||
docker_images
|
training/index
|
||||||
implementing_a_new_model
|
extension/index
|
||||||
implementing_a_new_language_frontend
|
datasets/index
|
||||||
training_a_model
|
```
|
||||||
finetuning
|
|
||||||
configuration
|
|
||||||
formatting_your_dataset
|
|
||||||
what_makes_a_good_dataset
|
|
||||||
tts_datasets
|
|
||||||
marytts
|
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
:caption: Main Classes
|
:caption: Main Classes
|
||||||
|
:hidden:
|
||||||
|
|
||||||
|
configuration
|
||||||
main_classes/trainer_api
|
main_classes/trainer_api
|
||||||
main_classes/audio_processor
|
main_classes/audio_processor
|
||||||
main_classes/model_api
|
main_classes/model_api
|
||||||
main_classes/dataset
|
main_classes/dataset
|
||||||
main_classes/gan
|
main_classes/gan
|
||||||
main_classes/speaker_manager
|
main_classes/speaker_manager
|
||||||
|
```
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
```{toctree}
|
||||||
:caption: `tts` Models
|
:maxdepth: 1
|
||||||
|
:caption: TTS Models
|
||||||
|
:hidden:
|
||||||
|
|
||||||
models/glow_tts.md
|
models/glow_tts.md
|
||||||
models/vits.md
|
models/vits.md
|
||||||
|
@ -54,9 +60,4 @@
|
||||||
models/tortoise.md
|
models/tortoise.md
|
||||||
models/bark.md
|
models/bark.md
|
||||||
models/xtts.md
|
models/xtts.md
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
|
||||||
:caption: `vocoder` Models
|
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
|
@ -1,194 +1,21 @@
|
||||||
(synthesizing_speech)=
|
(synthesizing_speech)=
|
||||||
# Synthesizing Speech
|
# Synthesizing speech
|
||||||
|
|
||||||
First, you need to install TTS. We recommend using PyPi. You need to call the command below:
|
## Overview
|
||||||
|
|
||||||
```bash
|
Coqui TTS provides three main methods for inference:
|
||||||
$ pip install coqui-tts
|
|
||||||
|
1. 🐍Python API
|
||||||
|
2. TTS command line interface (CLI)
|
||||||
|
3. [Local demo server](server.md)
|
||||||
|
|
||||||
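For a quick orientation, a minimal Python API call looks roughly like this (the model name is just an example; the included README below covers the CLI and more complete usage):

```python
from TTS.api import TTS

# Load a released single-speaker model and synthesize to a file.
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello from Coqui TTS!", file_path="output.wav")
```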
|
```{include} ../../README.md
|
||||||
|
:start-after: <!-- start inference -->
|
||||||
```
|
```
|
||||||
|
|
||||||
After the installation, 2 terminal commands are available.
|
|
||||||
|
|
||||||
1. TTS Command Line Interface (CLI). - `tts`
|
```{toctree}
|
||||||
2. Local Demo Server. - `tts-server`
|
:hidden:
|
||||||
3. In 🐍Python. - `from TTS.api import TTS`
|
server
|
||||||
|
marytts
|
||||||
## On the Commandline - `tts`
|
|
||||||

|
|
||||||
|
|
||||||
After the installation, 🐸TTS provides a CLI interface for synthesizing speech using pre-trained models. You can either use your own model or the release models under 🐸TTS.
|
|
||||||
|
|
||||||
Listing released 🐸TTS models.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --list_models
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a TTS model, from the release models list, with its default vocoder. (Simply copy and paste the full model names from the list as arguments for the command below.)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--model_name "<type>/<language>/<dataset>/<model_name>" \
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a tts and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--model_name "tts_models/<language>/<dataset>/<model_name>" \
|
|
||||||
--vocoder_name "vocoder_models/<language>/<dataset>/<model_name>" \
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
Run your own TTS model (Using Griffin-Lim Vocoder)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--model_path path/to/model.pth \
|
|
||||||
--config_path path/to/config.json \
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
Run your own TTS and Vocoder models
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--config_path path/to/config.json \
|
|
||||||
--model_path path/to/model.pth \
|
|
||||||
--out_path folder/to/save/output.wav \
|
|
||||||
--vocoder_path path/to/vocoder.pth \
|
|
||||||
--vocoder_config_path path/to/vocoder_config.json
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a multi-speaker TTS model from the released models list.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --model_name "tts_models/<language>/<dataset>/<model_name>" --list_speaker_idxs # list the possible speaker IDs.
|
|
||||||
tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "tts_models/<language>/<dataset>/<model_name>" --speaker_idx "<speaker_id>"
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a released voice conversion model
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --model_name "voice_conversion/<language>/<dataset>/<model_name>"
|
|
||||||
--source_wav "my/source/speaker/audio.wav"
|
|
||||||
--target_wav "my/target/speaker/audio.wav"
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
**Note:** You can use ```./TTS/bin/synthesize.py``` if you prefer running ```tts``` from the TTS project folder.
|
|
||||||
|
|
||||||
## On the Demo Server - `tts-server`
|
|
||||||
|
|
||||||
<!-- <img src="https://raw.githubusercontent.com/idiap/coqui-ai-TTS/main/images/demo_server.gif" height="56"/> -->
|
|
||||||

|
|
||||||
|
|
||||||
You can boot up a demo 🐸TTS server to run an inference with your models (make
|
|
||||||
sure to install the additional dependencies with `pip install coqui-tts[server]`).
|
|
||||||
Note that the server is not optimized for performance but gives you an easy way
|
|
||||||
to interact with the models.
|
|
||||||
|
|
||||||
The demo server provides pretty much the same interface as the CLI command.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts-server -h # see the help
|
|
||||||
tts-server --list_models # list the available models.
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a TTS model, from the release models list, with its default vocoder.
|
|
||||||
If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize
|
|
||||||
speech.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts-server --model_name "<type>/<language>/<dataset>/<model_name>"
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts-server --model_name "<type>/<language>/<dataset>/<model_name>" \
|
|
||||||
--vocoder_name "<type>/<language>/<dataset>/<model_name>"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Python 🐸TTS API
|
|
||||||
|
|
||||||
You can run a multi-speaker and multi-lingual model in Python as
|
|
||||||
|
|
||||||
```python
|
|
||||||
import torch
|
|
||||||
from TTS.api import TTS
|
|
||||||
|
|
||||||
# Get device
|
|
||||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
|
||||||
|
|
||||||
# List available 🐸TTS models
|
|
||||||
print(TTS().list_models())
|
|
||||||
|
|
||||||
# Init TTS
|
|
||||||
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
|
|
||||||
|
|
||||||
# Run TTS
|
|
||||||
# ❗ Since this model is multi-lingual voice cloning model, we must set the target speaker_wav and language
|
|
||||||
# Text to speech list of amplitude values as output
|
|
||||||
wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en")
|
|
||||||
# Text to speech to a file
|
|
||||||
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Here is an example for a single speaker model.
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Init TTS with the target model name
|
|
||||||
tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False)
|
|
||||||
# Run TTS
|
|
||||||
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example voice cloning with YourTTS in English, French and Portuguese:
|
|
||||||
|
|
||||||
```python
|
|
||||||
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to("cuda")
|
|
||||||
tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
|
|
||||||
tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr", file_path="output.wav")
|
|
||||||
tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt", file_path="output.wav")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav`
|
|
||||||
|
|
||||||
```python
|
|
||||||
tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda")
|
|
||||||
tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example voice cloning by a single speaker TTS model combining with the voice conversion model.
|
|
||||||
|
|
||||||
This way, you can clone voices by using any model in 🐸TTS.
|
|
||||||
|
|
||||||
```python
|
|
||||||
tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
|
|
||||||
tts.tts_with_vc_to_file(
|
|
||||||
"Wie sage ich auf Italienisch, dass ich dich liebe?",
|
|
||||||
speaker_wav="target/speaker.wav",
|
|
||||||
file_path="ouptut.wav"
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example text to speech using **Fairseq models in ~1100 languages** 🤯.
|
|
||||||
For these models use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
|
|
||||||
|
|
||||||
You can find the list of language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).
|
|
||||||
|
|
||||||
```python
|
|
||||||
from TTS.api import TTS
|
|
||||||
api = TTS(model_name="tts_models/eng/fairseq/vits").to("cuda")
|
|
||||||
api.tts_to_file("This is a test.", file_path="output.wav")
|
|
||||||
|
|
||||||
# TTS with on the fly voice conversion
|
|
||||||
api = TTS("tts_models/deu/fairseq/vits")
|
|
||||||
api.tts_with_vc_to_file(
|
|
||||||
"Wie sage ich auf Italienisch, dass ich dich liebe?",
|
|
||||||
speaker_wav="target/speaker.wav",
|
|
||||||
file_path="ouptut.wav"
|
|
||||||
)
|
|
||||||
```
|
```
|
||||||
|
|
|
@ -1,40 +1,6 @@
|
||||||
# Installation
|
# Installation
|
||||||
|
|
||||||
🐸TTS supports python >=3.9 <3.13.0 and was tested on Ubuntu 22.04.
|
```{include} ../../README.md
|
||||||
|
:start-after: <!-- start installation -->
|
||||||
## Using `pip`
|
:end-before: <!-- end installation -->
|
||||||
|
|
||||||
`pip` is recommended if you want to use 🐸TTS only for inference.
|
|
||||||
|
|
||||||
You can install from PyPI as follows:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install coqui-tts # from PyPI
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Or install from Github:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install git+https://github.com/idiap/coqui-ai-TTS # from Github
|
|
||||||
```
|
|
||||||
|
|
||||||
## Installing From Source
|
|
||||||
|
|
||||||
This is recommended for development and more control over 🐸TTS.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone https://github.com/idiap/coqui-ai-TTS
|
|
||||||
cd coqui-ai-TTS
|
|
||||||
make system-deps # only on Linux systems.
|
|
||||||
|
|
||||||
# Install package and optional extras
|
|
||||||
make install
|
|
||||||
|
|
||||||
# Same as above + dev dependencies and pre-commit
|
|
||||||
make install_dev
|
|
||||||
```
|
|
||||||
|
|
||||||
## On Windows
|
|
||||||
If you are on Windows, 👑@GuyPaddock wrote installation instructions
|
|
||||||
[here](https://stackoverflow.com/questions/66726331/) (note that these are out
|
|
||||||
of date, e.g. you need to have at least Python 3.9)
|
|
||||||
|
|
|
@ -1,22 +1,22 @@
|
||||||
# Model API
|
# Model API
|
||||||
The Model API provides a set of functions that easily make your model compatible with the `Trainer`,
|
The Model API provides a set of functions that easily make your model compatible with the `Trainer`,
|
||||||
`Synthesizer` and `ModelZoo`.
|
`Synthesizer` and the Coqui Python API.
|
||||||
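As a rough sketch, a model plugged into these APIs usually subclasses `BaseTTS` and fills in the training and inference hooks below (the method set is based on the existing 🐸TTS models; exact signatures vary, so treat this as an outline rather than the definitive interface):

```python
from TTS.tts.models.base_tts import BaseTTS


class MyTTSModel(BaseTTS):
    """Hypothetical model outline showing the hooks the Trainer and Synthesizer rely on."""

    def forward(self, *args, **kwargs):
        """Training-time forward pass; returns a dict of model outputs."""
        raise NotImplementedError

    def inference(self, *args, **kwargs):
        """Inference pass used by the Synthesizer; returns the generated outputs."""
        raise NotImplementedError

    def train_step(self, batch, criterion):
        """Compute outputs and losses for a single training batch."""
        raise NotImplementedError

    def eval_step(self, batch, criterion):
        """Same as train_step, used during evaluation."""
        raise NotImplementedError

    def load_checkpoint(self, config, checkpoint_path, eval=False):
        """Restore model weights for fine-tuning or inference."""
        raise NotImplementedError
```

The autoclass sections below document the actual base classes these hooks come from.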
|
|
||||||
## Base TTS Model
|
## Base Trainer Model
|
||||||
|
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.model.BaseTrainerModel
|
.. autoclass:: TTS.model.BaseTrainerModel
|
||||||
:members:
|
:members:
|
||||||
```
|
```
|
||||||
|
|
||||||
## Base tts Model
|
## Base TTS Model
|
||||||
|
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.tts.models.base_tts.BaseTTS
|
.. autoclass:: TTS.tts.models.base_tts.BaseTTS
|
||||||
:members:
|
:members:
|
||||||
```
|
```
|
||||||
|
|
||||||
## Base vocoder Model
|
## Base Vocoder Model
|
||||||
|
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder
|
.. autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder
|
||||||
|
|
|
@ -1,3 +1,3 @@
|
||||||
# Trainer API
|
# Trainer API
|
||||||
|
|
||||||
We made the trainer a separate project on https://github.com/eginhard/coqui-trainer
|
We made the trainer a separate project: https://github.com/idiap/coqui-ai-Trainer
|
||||||
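For reference, the training scripts in the recipes typically drive that trainer roughly like this (assuming a `config`, a `model`, and the train/eval sample lists have already been prepared, e.g. as in the GlowTTS/LJSpeech recipe; treat the argument list as illustrative):

```python
from trainer import Trainer, TrainerArgs

# `config`, `model`, `train_samples` and `eval_samples` are assumed to be
# built beforehand by the recipe script.
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path="output/",
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```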
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Mary-TTS API Support for Coqui-TTS
|
# Mary-TTS API support for Coqui TTS
|
||||||
|
|
||||||
## What is Mary-TTS?
|
## What is Mary-TTS?
|
||||||
|
|
||||||
|
|
|
@ -1,25 +1,25 @@
|
||||||
# ⓍTTS
|
# XTTS
|
||||||
ⓍTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise,
|
XTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise,
|
||||||
ⓍTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy.
|
XTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy.
|
||||||
There is no need for an excessive amount of training data that spans countless hours.
|
There is no need for an excessive amount of training data that spans countless hours.
|
||||||
|
|
||||||
### Features
|
## Features
|
||||||
- Voice cloning.
|
- Voice cloning.
|
||||||
- Cross-language voice cloning.
|
- Cross-language voice cloning.
|
||||||
- Multi-lingual speech generation.
|
- Multi-lingual speech generation.
|
||||||
- 24 kHz sampling rate.
|
- 24 kHz sampling rate.
|
||||||
- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-inference))
|
- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-manually))
|
||||||
- Fine-tuning support. (See [Training](#training))
|
- Fine-tuning support. (See [Training](#training))
|
||||||
|
|
||||||
### Updates with v2
|
## Updates with v2
|
||||||
- Improved voice cloning.
|
- Improved voice cloning.
|
||||||
- Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime.
|
- Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime.
|
||||||
- Across the board quality improvements.
|
- Across the board quality improvements.
|
||||||
|
|
||||||
### Code
|
## Code
|
||||||
The current implementation only supports inference and GPT encoder training.
|
The current implementation only supports inference and GPT encoder training.
|
||||||
|
|
||||||
### Languages
|
## Languages
|
||||||
XTTS-v2 supports 17 languages:
|
XTTS-v2 supports 17 languages:
|
||||||
|
|
||||||
- Arabic (ar)
|
- Arabic (ar)
|
||||||
|
@ -40,15 +40,15 @@ XTTS-v2 supports 17 languages:
|
||||||
- Spanish (es)
|
- Spanish (es)
|
||||||
- Turkish (tr)
|
- Turkish (tr)
|
||||||
|
|
||||||
### License
|
## License
|
||||||
This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml).
|
This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml).
|
||||||
|
|
||||||
### Contact
|
## Contact
|
||||||
Come and join in our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Github](https://github.com/idiap/coqui-ai-TTS/discussions).
|
Come and join in our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Github](https://github.com/idiap/coqui-ai-TTS/discussions).
|
||||||
|
|
||||||
### Inference
|
## Inference
|
||||||
|
|
||||||
#### 🐸TTS Command line
|
### 🐸TTS Command line
|
||||||
|
|
||||||
You can check all supported languages with the following command:
|
You can check all supported languages with the following command:
|
||||||
|
|
||||||
|
@ -64,7 +64,7 @@ You can check all Coqui available speakers with the following command:
|
||||||
--list_speaker_idx
|
--list_speaker_idx
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Coqui speakers
|
#### Coqui speakers
|
||||||
You can do inference using one of the available speakers using the following command:
|
You can do inference using one of the available speakers using the following command:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
|
@ -75,10 +75,10 @@ You can do inference using one of the available speakers using the following com
|
||||||
--use_cuda
|
--use_cuda
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Clone a voice
|
#### Clone a voice
|
||||||
You can clone a speaker voice using a single or multiple references:
|
You can clone a speaker voice using a single or multiple references:
|
||||||
|
|
||||||
###### Single reference
|
##### Single reference
|
||||||
|
|
||||||
```console
|
```console
|
||||||
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
||||||
|
@ -88,7 +88,7 @@ You can clone a speaker voice using a single or multiple references:
|
||||||
--use_cuda
|
--use_cuda
|
||||||
```
|
```
|
||||||
|
|
||||||
###### Multiple references
|
##### Multiple references
|
||||||
```console
|
```console
|
||||||
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
||||||
--text "Bugün okula gitmek istemiyorum." \
|
--text "Bugün okula gitmek istemiyorum." \
|
||||||
|
@ -106,12 +106,12 @@ or for all wav files in a directory you can use:
|
||||||
--use_cuda
|
--use_cuda
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 🐸TTS API
|
### 🐸TTS API
|
||||||
|
|
||||||
##### Clone a voice
|
#### Clone a voice
|
||||||
You can clone a speaker voice using a single or multiple references:
|
You can clone a speaker voice using a single or multiple references:
|
||||||
|
|
||||||
###### Single reference
|
##### Single reference
|
||||||
|
|
||||||
Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio.
|
Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio.
|
||||||
You can optionally disable sentence splitting for better coherence, at the cost of higher VRAM usage and possibly exceeding the model's context length limit.
|
You can optionally disable sentence splitting for better coherence, at the cost of higher VRAM usage and possibly exceeding the model's context length limit.
|
||||||
|
@ -129,7 +129,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
###### Multiple references
|
##### Multiple references
|
||||||
|
|
||||||
You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
|
You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
|
||||||
|
|
||||||
|
@ -154,7 +154,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
|
||||||
language="en")
|
language="en")
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Coqui speakers
|
#### Coqui speakers
|
||||||
|
|
||||||
You can do inference using one of the available speakers using the following code:
|
You can do inference using one of the available speakers using the following code:
|
||||||
|
|
||||||
|
@ -172,11 +172,11 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
#### 🐸TTS Model API
|
### 🐸TTS Model API
|
||||||
|
|
||||||
To use the model API, you need to download the model files and pass config and model file paths manually.
|
To use the model API, you need to download the model files and pass config and model file paths manually.
|
||||||
|
|
||||||
#### Manual Inference
|
### Manual Inference
|
||||||
|
|
||||||
If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
|
If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
|
||||||
|
|
||||||
|
@ -184,7 +184,7 @@ If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjo
|
||||||
pip install deepspeed==0.10.3
|
pip install deepspeed==0.10.3
|
||||||
```
|
```
|
||||||
|
|
||||||
##### inference parameters
|
#### Inference parameters
|
||||||
|
|
||||||
- `text`: The text to be synthesized.
|
- `text`: The text to be synthesized.
|
||||||
- `language`: The language of the text to be synthesized.
|
- `language`: The language of the text to be synthesized.
|
||||||
|
@ -199,7 +199,7 @@ pip install deepspeed==0.10.3
|
||||||
- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows arbitrarily long inputs but might lose important context between sentences. Defaults to True.
|
- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows arbitrarily long inputs but might lose important context between sentences. Defaults to True.
|
||||||
|
|
||||||
|
|
||||||
##### Inference
|
#### Inference
|
||||||
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
|
@ -231,7 +231,7 @@ torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
##### Streaming manually
|
#### Streaming manually
|
||||||
|
|
||||||
Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
|
Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
|
||||||
Streaming inference is typically slower than regular inference, but it allows to get a first chunk of audio faster.
|
Streaming inference is typically slower than regular inference, but it allows to get a first chunk of audio faster.
|
||||||
|
@ -275,9 +275,9 @@ torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### Training
|
## Training
|
||||||
|
|
||||||
#### Easy training
|
### Easy training
|
||||||
To make `XTTS_v2` GPT encoder training easier for beginners, we provide a Gradio demo that implements the whole fine-tuning pipeline. The demo lets the user easily do the following steps:
|
To make `XTTS_v2` GPT encoder training easier for beginners, we provide a Gradio demo that implements the whole fine-tuning pipeline. The demo lets the user easily do the following steps:
|
||||||
|
|
||||||
- Preprocessing of the uploaded audio files with the 🐸TTS Coqui formatter
|
- Preprocessing of the uploaded audio files with the 🐸TTS Coqui formatter
|
||||||
|
@ -286,7 +286,7 @@ To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio
|
||||||
|
|
||||||
The user can run this gradio demo locally or remotely using a Colab Notebook.
|
The user can run this gradio demo locally or remotely using a Colab Notebook.
|
||||||
|
|
||||||
##### Run demo on Colab
|
#### Run demo on Colab
|
||||||
To make `XTTS_v2` fine-tuning more accessible for users who do not have good GPUs available, we provide a Google Colab Notebook.
|
To make `XTTS_v2` fine-tuning more accessible for users who do not have good GPUs available, we provide a Google Colab Notebook.
|
||||||
|
|
||||||
The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
|
The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
|
||||||
|
@ -302,7 +302,7 @@ If you are not able to access the video you need to follow the steps:
|
||||||
5. As soon as the training is done, you can go to the third tab (3 - Inference), click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking on the button "Step 4 - Inference".
|
5. As soon as the training is done, you can go to the third tab (3 - Inference), click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking on the button "Step 4 - Inference".
|
||||||
|
|
||||||
|
|
||||||
##### Run demo locally
|
#### Run demo locally
|
||||||
|
|
||||||
To run the demo locally you need to do the following steps:
|
To run the demo locally you need to do the following steps:
|
||||||
1. Install 🐸 TTS following the instructions available [here](https://coqui-tts.readthedocs.io/en/latest/installation.html).
|
1. Install 🐸 TTS following the instructions available [here](https://coqui-tts.readthedocs.io/en/latest/installation.html).
|
||||||
|
@ -319,7 +319,7 @@ If you are not able to access the video, here is what you need to do:
|
||||||
4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
|
4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
|
||||||
5. Now you can run inference with the model by clicking on the button "Step 4 - Inference".
|
5. Now you can run inference with the model by clicking on the button "Step 4 - Inference".
|
||||||
|
|
||||||
#### Advanced training
|
### Advanced training
|
||||||
|
|
||||||
A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py
|
A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py
|
||||||
|
|
||||||
|
@ -393,6 +393,6 @@ torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
|
||||||
|
|
||||||
## XTTS Model
|
## XTTS Model
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.tts.models.xtts.XTTS
|
.. autoclass:: TTS.tts.models.xtts.Xtts
|
||||||
:members:
|
:members:
|
||||||
```
|
```
|
||||||
|
|
|
@ -0,0 +1,30 @@
|
||||||
|
# Project structure
|
||||||
|
|
||||||
|
## Directory structure
|
||||||
|
|
||||||
|
A non-comprehensive overview of the Coqui source code:
|
||||||
|
|
||||||
|
| Directory | Contents |
|
||||||
|
| - | - |
|
||||||
|
| **Core** | |
|
||||||
|
| **[`TTS/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS)** | Main source code |
|
||||||
|
| **[`- .models.json`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/.models.json)** | Pretrained model list |
|
||||||
|
| **[`- api.py`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/api.py)** | Python API |
|
||||||
|
| **[`- bin/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/bin)** | Executables and CLI |
|
||||||
|
| **[`- tts/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts)** | Text-to-speech models |
|
||||||
|
| **[`- configs/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/configs)** | Model configurations |
|
||||||
|
| **[`- layers/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/layers)** | Model layer definitions |
|
||||||
|
| **[`- models/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/models)** | Model definitions |
|
||||||
|
| **[`- vc/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/vc)** | Voice conversion models |
|
||||||
|
| `- (same)` | |
|
||||||
|
| **[`- vocoder/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/vocoder)** | Vocoder models |
|
||||||
|
| `- (same)` | |
|
||||||
|
| **[`- encoder/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/encoder)** | Speaker encoder models |
|
||||||
|
| `- (same)` | |
|
||||||
|
| **Recipes/notebooks** | |
|
||||||
|
| **[`notebooks/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/notebooks)** | Jupyter Notebooks for model evaluation, parameter selection and data analysis |
|
||||||
|
| **[`recipes/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes)** | Training recipes |
|
||||||
|
| **Others** | |
|
||||||
|
| **[`pyproject.toml`](https://github.com/idiap/coqui-ai-TTS/tree/dev/pyproject.toml)** | Project metadata, configuration and dependencies |
|
||||||
|
| **[`docs/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/docs)** | Documentation |
|
||||||
|
| **[`tests/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/tests)** | Unit and integration tests |
|
|
@ -0,0 +1,30 @@
|
||||||
|
# Demo server
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
You can boot up a demo 🐸TTS server to run an inference with your models (make
|
||||||
|
sure to install the additional dependencies with `pip install coqui-tts[server]`).
|
||||||
|
Note that the server is not optimized for performance and does not support all
|
||||||
|
Coqui models yet.
|
||||||
|
|
||||||
|
The demo server provides pretty much the same interface as the CLI command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tts-server -h # see the help
|
||||||
|
tts-server --list_models # list the available models.
|
||||||
|
```
|
||||||
|
|
||||||
|
Run a TTS model, from the release models list, with its default vocoder.
|
||||||
|
If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize
|
||||||
|
speech.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tts-server --model_name "<type>/<language>/<dataset>/<model_name>"
|
||||||
|
```
|
||||||
|
|
||||||
|
Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tts-server --model_name "<type>/<language>/<dataset>/<model_name>" \
|
||||||
|
--vocoder_name "<type>/<language>/<dataset>/<model_name>"
|
||||||
|
```
|
|
@ -1,4 +1,4 @@
|
||||||
# Fine-tuning a 🐸 TTS model
|
# Fine-tuning a model
|
||||||
|
|
||||||
## Fine-tuning
|
## Fine-tuning
|
||||||
|
|
||||||
|
@ -21,8 +21,9 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
|
||||||
Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
|
Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
|
||||||
speech dataset and achieve reasonable results with only a couple of hours of data.
|
speech dataset and achieve reasonable results with only a couple of hours of data.
|
||||||
|
|
||||||
However, note that fine-tuning does not ensure great results. The model performance still depends on the
|
However, note that fine-tuning does not ensure great results. The model
|
||||||
{ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore,
|
performance still depends on the [dataset quality](../datasets/what_makes_a_good_dataset.md)
|
||||||
|
and the hyper-parameters you choose for fine-tuning. Therefore,
|
||||||
it still takes a bit of tinkering.
|
it still takes a bit of tinkering.
|
||||||
|
|
||||||
|
|
||||||
|
@ -31,7 +32,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
|
||||||
1. Setup your dataset.
|
1. Setup your dataset.
|
||||||
|
|
||||||
You need to format your target dataset in a certain way so that the 🐸TTS data loader will be able to load it for the
|
You need to format your target dataset in a certain way so that the 🐸TTS data loader will be able to load it for the
|
||||||
training. Please see {ref}`this page <formatting_your_dataset>` for more information about formatting.
|
training. Please see [this page](../datasets/formatting_your_dataset.md) for more information about formatting.
|
||||||
|
|
||||||
2. Choose the model you want to fine-tune.
|
2. Choose the model you want to fine-tune.
|
||||||
|
|
||||||
|
@ -47,7 +48,8 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
|
||||||
|
|
||||||
You should choose the model based on your requirements. Some models are fast and some offer better speech quality.
|
You should choose the model based on your requirements. Some models are fast and some offer better speech quality.
|
||||||
One lazy way to test a model is to run it on the hardware you want to use and see how it performs. For
|
One lazy way to test a model is to run it on the hardware you want to use and see how it performs. For
|
||||||
simple testing, you can use the `tts` command on the terminal. For more info see {ref}`here <synthesizing_speech>`.
|
simple testing, you can use the `tts` command on the terminal. For more info
|
||||||
|
see [here](../inference.md).
|
||||||
|
|
||||||
3. Download the model.
|
3. Download the model.
|
||||||
|
|
|
@ -0,0 +1,10 @@
|
||||||
|
# Training and fine-tuning
|
||||||
|
|
||||||
|
The following pages show you how to train and fine-tune Coqui models:
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
training_a_model
|
||||||
|
finetuning
|
||||||
|
```
|
|
@ -1,4 +1,4 @@
|
||||||
# Training a Model
|
# Training a model
|
||||||
|
|
||||||
1. Decide the model you want to use.
|
1. Decide the model you want to use.
|
||||||
|
|
||||||
|
@ -11,11 +11,10 @@
|
||||||
|
|
||||||
3. Check the recipes.
|
3. Check the recipes.
|
||||||
|
|
||||||
Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good start point for
|
Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good starting point.
|
||||||
`Nervous Beginners`.
|
|
||||||
A recipe for `GlowTTS` using the `LJSpeech` dataset is shown below. Let's be creative and call this `train_glowtts.py`.
|
A recipe for `GlowTTS` using the `LJSpeech` dataset is shown below. Let's be creative and call this `train_glowtts.py`.
|
||||||
|
|
||||||
```{literalinclude} ../../recipes/ljspeech/glow_tts/train_glowtts.py
|
```{literalinclude} ../../../recipes/ljspeech/glow_tts/train_glowtts.py
|
||||||
```
|
```
|
||||||
|
|
||||||
You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig`
|
You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig`
|
||||||
|
@ -113,7 +112,7 @@
|
||||||
|
|
||||||
Note that different models have different metrics, visuals and outputs.
|
Note that different models have different metrics, visuals and outputs.
|
||||||
|
|
||||||
You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions
|
You should also check the [FAQ page](../faq.md) for common problems and solutions
|
||||||
that occur in a training.
|
that occur in a training.
|
||||||
|
|
||||||
7. Use your best model for inference.
|
7. Use your best model for inference.
|
||||||
|
@ -132,7 +131,7 @@
|
||||||
In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.
|
In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.
|
||||||
|
|
||||||
|
|
||||||
# Multi-speaker Training
|
## Multi-speaker Training
|
||||||
|
|
||||||
Training a multi-speaker model is mostly the same as training a single-speaker model.
|
Training a multi-speaker model is mostly the same as training a single-speaker model.
|
||||||
You need to specify a couple of configuration parameters, instantiate a `SpeakerManager` and pass it to the model (see the sketch below).
|
You need to specify a couple of configuration parameters, instantiate a `SpeakerManager` and pass it to the model (see the sketch below).
|
||||||
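A rough sketch of that wiring, based on the multi-speaker recipes (treat method and argument names as illustrative; they may differ slightly between versions):

```python
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.speakers import SpeakerManager

# `config`, `ap`, `tokenizer`, `train_samples` and `eval_samples` are assumed
# to be prepared as in the single-speaker example above.
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.num_speakers = speaker_manager.num_speakers

model = GlowTTS(config, ap, tokenizer, speaker_manager=speaker_manager)
```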
|
@ -142,5 +141,5 @@ d-vectors. For using d-vectors, you first need to compute the d-vectors using th
|
||||||
|
|
||||||
The same Glow-TTS model above can be trained on a multi-speaker VCTK dataset with the script below.
|
The same Glow-TTS model above can be trained on a multi-speaker VCTK dataset with the script below.
|
||||||
|
|
||||||
```{literalinclude} ../../recipes/vctk/glow_tts/train_glow_tts.py
|
```{literalinclude} ../../../recipes/vctk/glow_tts/train_glow_tts.py
|
||||||
```
|
```
|
|
@ -1,24 +1,37 @@
|
||||||
# Tutorial For Nervous Beginners
|
# Tutorial for nervous beginners
|
||||||
|
|
||||||
## Installation
|
First [install](installation.md) Coqui TTS.
|
||||||
|
|
||||||
User friendly installation. Recommended only for synthesizing voice.
|
## Synthesizing Speech
|
||||||
|
|
||||||
|
You can run `tts` and synthesize speech directly on the terminal.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ pip install coqui-tts
|
$ tts -h # see the help
|
||||||
|
$ tts --list_models # list the available models.
|
||||||
```
|
```
|
||||||
|
|
||||||
Developer friendly installation.
|

|
||||||
|
|
||||||
|
|
||||||
|
You can call `tts-server` to start a local demo server that you can open on
|
||||||
|
your favorite web browser and 🗣️ (make sure to install the additional
|
||||||
|
dependencies with `pip install coqui-tts[server]`).
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ git clone https://github.com/idiap/coqui-ai-TTS
|
$ tts-server -h # see the help
|
||||||
$ cd coqui-ai-TTS
|
$ tts-server --list_models # list the available models.
|
||||||
$ pip install -e .
|
|
||||||
```
|
```
|
||||||
|

|
||||||
|
|
||||||
|
See [this page](inference.md) for more details on synthesizing speech with the
|
||||||
|
CLI, server or Python API.
|
||||||
|
|
||||||
## Training a `tts` Model
|
## Training a `tts` Model
|
||||||
|
|
||||||
A breakdown of a simple script that trains a GlowTTS model on the LJSpeech dataset. See the comments for more details.
|
A breakdown of a simple script that trains a GlowTTS model on the LJSpeech
|
||||||
|
dataset. For a more in-depth guide to training and fine-tuning also see [this
|
||||||
|
page](training/index.md).
|
||||||
|
|
||||||
### Pure Python Way
|
### Pure Python Way
|
||||||
|
|
||||||
|
@ -99,25 +112,3 @@ We still support running training from CLI like in the old days. The same traini
|
||||||
```
|
```
|
||||||
|
|
||||||
❗️ Note that you can also use ```train_vocoder.py``` in the same way as the ```tts``` models above.
|
❗️ Note that you can also use ```train_vocoder.py``` in the same way as the ```tts``` models above.
|
||||||
|
|
||||||
## Synthesizing Speech
|
|
||||||
|
|
||||||
You can run `tts` and synthesize speech directly on the terminal.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ tts -h # see the help
|
|
||||||
$ tts --list_models # list the available models.
|
|
||||||
```
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
|
|
||||||
You can call `tts-server` to start a local demo server that you can open on
|
|
||||||
your favorite web browser and 🗣️ (make sure to install the additional
|
|
||||||
dependencies with `pip install coqui-tts[server]`).
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ tts-server -h # see the help
|
|
||||||
$ tts-server --list_models # list the available models.
|
|
||||||
```
|
|
||||||

|
|
||||||
|
|
|
@ -143,12 +143,12 @@ dev = [
|
||||||
]
|
]
|
||||||
# Dependencies for building the documentation
|
# Dependencies for building the documentation
|
||||||
docs = [
|
docs = [
|
||||||
"furo>=2023.5.20",
|
"furo>=2024.8.6",
|
||||||
"myst-parser==2.0.0",
|
"myst-parser==3.0.1",
|
||||||
"sphinx==7.2.5",
|
"sphinx==7.4.7",
|
||||||
"sphinx_inline_tabs>=2023.4.21",
|
"sphinx_inline_tabs>=2023.4.21",
|
||||||
"sphinx_copybutton>=0.1",
|
"sphinx_copybutton>=0.5.2",
|
||||||
"linkify-it-py>=2.0.0",
|
"linkify-it-py>=2.0.3",
|
||||||
]
|
]
|
||||||
|
|
||||||
[project.urls]
|
[project.urls]
|
||||||
|
|
|
@ -22,8 +22,12 @@ def sync_readme():
|
||||||
new_content = replace_between_markers(orig_content, "tts-readme", description.strip())
|
new_content = replace_between_markers(orig_content, "tts-readme", description.strip())
|
||||||
if args.check:
|
if args.check:
|
||||||
if orig_content != new_content:
|
if orig_content != new_content:
|
||||||
print("README.md is out of sync; please edit TTS/bin/TTS_README.md and run scripts/sync_readme.py")
|
print(
|
||||||
|
"README.md is out of sync; please reconcile README.md and TTS/bin/synthesize.py and run scripts/sync_readme.py"
|
||||||
|
)
|
||||||
exit(42)
|
exit(42)
|
||||||
|
print("All good, files in sync")
|
||||||
|
exit(0)
|
||||||
readme_path.write_text(new_content)
|
readme_path.write_text(new_content)
|
||||||
print("Updated README.md")
|
print("Updated README.md")
|
||||||
|
|
||||||
|
|