Update docs

This commit is contained in:
Eren Gölge 2021-09-30 14:34:53 +00:00
parent f904dd4828
commit 6d3b2d3cdd
3 changed files with 310 additions and 84 deletions

View File

@ -10,7 +10,7 @@
We keep tests under the `tests` folder. You can add `tts` layers tests under the `tts_tests` folder.
Basic tests check the input-output tensor shapes and output values for a given input. Consider testing extreme cases that are more likely to cause problems, like `zero` tensors.
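For illustration, a minimal shape test in the spirit described above; `MyModel`, its module path and the tensor sizes are hypothetical placeholders rather than an existing 🐸TTS model:
```python
import torch

from TTS.tts.configs.my_model_config import MyModelConfig  # hypothetical config class
from TTS.tts.models.my_model import MyModel  # hypothetical model class


def test_forward_shapes():
    config = MyModelConfig()
    model = MyModel(config)
    # 3D input shaped `batch x time x channels`, as required by the model API
    dummy_input = torch.rand(8, 50, 80)
    outputs = model.forward(dummy_input, aux_input={})
    # the main output must be returned under the `model_outputs` key
    assert outputs["model_outputs"].shape[0] == 8
    # extreme case: an all-zero input should not produce NaN outputs
    zero_outputs = model.forward(torch.zeros(8, 50, 80), aux_input={})
    assert not torch.isnan(zero_outputs["model_outputs"]).any()
```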
3. Implement loss function.
3. Implement a loss function.
We keep loss functions under `TTS/tts/layers/losses.py`. You can also mix-and-match implemented loss functions as you like.
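As a sketch of what such a loss might look like, here is a composite loss mixing two standard terms; `MyModelLoss`, its inputs and the weights are invented for illustration and are not part of `losses.py`:
```python
from torch import nn


class MyModelLoss(nn.Module):
    """Hypothetical composite loss mixing a spectrogram term and a duration term."""

    def __init__(self, spec_loss_alpha: float = 1.0, dur_loss_alpha: float = 1.0):
        super().__init__()
        self.spec_loss_alpha = spec_loss_alpha
        self.dur_loss_alpha = dur_loss_alpha
        self.spec_loss = nn.L1Loss()
        self.dur_loss = nn.MSELoss()

    def forward(self, model_outputs, spec_target, dur_pred, dur_target) -> dict:
        # return a dict so that each term can be logged separately by the Trainer
        loss_spec = self.spec_loss(model_outputs, spec_target)
        loss_dur = self.dur_loss(dur_pred, dur_target)
        loss = self.spec_loss_alpha * loss_spec + self.dur_loss_alpha * loss_dur
        return {"loss": loss, "loss_spec": loss_spec, "loss_dur": loss_dur}
```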
@ -29,19 +29,20 @@
A model interacts with the `Trainer API` for training and the `Synthesizer API` for inference and testing.
A 🐸TTS model must return a dictionary by the `forward()` and `inference()` functions. This dictionary must also include the `model_outputs` key that is considered as the main model output by the `Trainer` and `Synthesizer`.
A 🐸TTS model must return a dictionary by the `forward()` and `inference()` functions. This dictionary must include the `model_outputs` key that is considered the main model output by the `Trainer` and `Synthesizer`.
You can place your `tts` model implementation under `TTS/tts/models/new_model.py`, then inherit from `BaseTTS` and implement its interface.
There is also the `callback` interface by which you can manipulate both the model and the `Trainer` states. Callbacks give you
the infinite flexibility to add custom behaviours for your model and training routines.
infinite flexibility to add custom behaviours for your model and training routines.
For more details, see {ref}`BaseTTS <Base TTS Model>` and :obj:`TTS.utils.callbacks`.
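For example, a model can hook into the training loop simply by defining callback methods on itself. The hook names and `trainer` attributes below are assumptions for illustration; see :obj:`TTS.utils.callbacks` for the authoritative set:
```python
from TTS.tts.models.base_tts import BaseTTS


class MyModel(BaseTTS):
    ...

    def on_epoch_start(self, trainer):
        # e.g. enable an auxiliary loss only after a warm-up period (illustrative only)
        self.use_aux_loss = trainer.epochs_done > 10

    def on_train_step_start(self, trainer):
        # e.g. anneal a model-specific temperature with the global step (illustrative only)
        self.temperature = max(0.5, 1.0 - trainer.total_steps_done * 1e-5)
```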
6. Optionally, define `MyModelArgs`.
`MyModelArgs` is a 👨Coqpit class that sets all the class arguments of the `MyModel`. It should be enough to pass
a `MyModelArgs` instance to initiate the `MyModel`.
`MyModelArgs` is a 👨Coqpit class that sets all the class arguments of the `MyModel`. `MyModelArgs` must have
all the fields necessary to instantiate the `MyModel`. However, for training, you need to pass `MyModelConfig` to
the model.
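Since Coqpit classes are dataclasses, `MyModelArgs` is typically a plain container of fields; the fields below are invented purely for illustration:
```python
from dataclasses import dataclass

from coqpit import Coqpit


@dataclass
class MyModelArgs(Coqpit):
    """All arguments needed to instantiate `MyModel` (illustrative fields only)."""

    num_chars: int = 100
    hidden_channels: int = 256
    encoder_layers: int = 6
    use_speaker_embedding: bool = False
    speaker_embedding_dim: int = 512
```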
7. Test `MyModel`.
@ -59,3 +60,149 @@
9. Write Docstrings.
We love you more when you document your code. ❤️
# Template 🐸TTS Model implementation
You can start implementing your model by copying the following base class.
```python
from typing import Dict, List, Tuple, Union

import torch
from coqpit import Coqpit
from torch import nn

from TTS.tts.models.base_tts import BaseTTS
class MyModel(BaseTTS):
"""
Notes on input/output tensor shapes:
Any input or output tensor of the model must be shaped as
- 3D tensors `batch x time x channels`
- 2D tensors `batch x channels`
- 1D tensors `batch x 1`
"""
def __init__(self, config: Coqpit):
super().__init__()
self._set_model_args(config)
def _set_model_args(self, config: Coqpit):
"""Set model arguments from the config. Override this."""
pass
def forward(self, input: torch.Tensor, *args, aux_input={}, **kwargs) -> Dict:
"""Forward pass for the model mainly used in training.
You can be flexible here and use a different number of arguments and argument names since it is intended to be
used by `train_step()` without exposing it out of the model.
Args:
input (torch.Tensor): Input tensor.
aux_input (Dict): Auxiliary model inputs like embeddings, durations or any other sorts of inputs.
Returns:
Dict: Model outputs. Main model output must be named as "model_outputs".
"""
outputs_dict = {"model_outputs": None}
...
return outputs_dict
def inference(self, input: torch.Tensor, aux_input={}) -> Dict:
"""Forward pass for inference.
We don't use `*args` or `**kwargs` since they are problematic with the TorchScript API.
Args:
input (torch.Tensor): [description]
aux_input (Dict): Auxiliary inputs like speaker embeddings, durations etc.
Returns:
Dict: [description]
"""
outputs_dict = {"model_outputs": None}
...
return outputs_dict
def train_step(self, batch: Dict, criterion: nn.Module) -> Tuple[Dict, Dict]:
"""Perform a single training step. Run the model forward pass and compute losses.
Args:
batch (Dict): Input tensors.
criterion (nn.Module): Loss layer designed for the model.
Returns:
Tuple[Dict, Dict]: Model outputs and computed losses.
"""
outputs_dict = {}
loss_dict = {} # this returns from the criterion
...
return outputs_dict, loss_dict
def train_log(self, batch: Dict, outputs: Dict, logger: "Logger", assets:Dict, steps:int) -> None:
"""Create visualizations and waveform examples for training.
For example, here you can plot spectrograms and generate sample waveforms from these spectrograms to
be projected onto Tensorboard.
Args:
batch (Dict): Model inputs used at the previous training step.
outputs (Dict): Model outputs generated at the previous training step.
logger (Logger): Logger instance used for the training run.
assets (Dict): Training assets, e.g. the `AudioProcessor` used for feature extraction.
steps (int): Current training step.
"""
pass
def eval_step(self, batch: Dict, criterion: nn.Module) -> Tuple[Dict, Dict]:
"""Perform a single evaluation step. Run the model forward pass and compute losses. In most cases, you can
call `train_step()` with no changes.
Args:
batch (Dict): Input tensors.
criterion (nn.Module): Loss layer designed for the model.
Returns:
Tuple[Dict, Dict]: Model outputs and computed losses.
"""
outputs_dict = {}
loss_dict = {} # this returns from the criterion
...
return outputs_dict, loss_dict
def eval_log(self, batch: Dict, outputs: Dict, logger: "Logger", assets:Dict, steps:int) -> None:
"""The same as `train_log()`"""
pass
def load_checkpoint(self, config: Coqpit, checkpoint_path: str, eval: bool = False) -> None:
"""Load a checkpoint and get ready for training or inference.
Args:
config (Coqpit): Model configuration.
checkpoint_path (str): Path to the model checkpoint file.
eval (bool, optional): If true, init model for inference else for training. Defaults to False.
"""
...
def get_optimizer(self) -> Union["Optimizer", List["Optimizer"]]:
"""Setup an return optimizer or optimizers."""
pass
def get_lr(self) -> Union[float, List[float]]:
"""Return learning rate(s).
Returns:
Union[float, List[float]]: Model's initial learning rates.
"""
pass
def get_scheduler(self, optimizer: torch.optim.Optimizer):
pass
def get_criterion(self):
pass
def format_batch(self):
pass
```

View File

@ -5,27 +5,30 @@
Each model has a different set of pros and cons that define the run-time efficiency and the voice quality. It is up to you to decide which model serves your needs. Other than referring to the papers, one easy way is to test the 🐸TTS
community models and see how fast and good each of them is. Or you can start a discussion on our communication channels.
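One quick way to audition the community models is the `tts` command line tool; the model name below is just an example:
```bash
# list the released 🐸TTS models
$ tts --list_models

# synthesize a sentence with one of them and listen to the result
$ tts --model_name "tts_models/en/ljspeech/glow-tts" --text "Hello world!" --out_path hello.wav
```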
2. Understand the configuration class, its fields and values of your model.
2. Understand your model's configuration, its fields and values.
For instance, if you want to train a `Tacotron` model then see the `TacotronConfig` class and make sure you understand it.
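A quick way to see every field and its default is to instantiate the config and dump it, assuming `TacotronConfig` is exposed from `TTS.tts.configs` like the other configs on this page:
```python
from TTS.tts.configs import TacotronConfig

config = TacotronConfig()
# Coqpit configs behave like dataclasses; print all fields with their default values
print(config.to_dict())
```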
3. Go to the recipes and check the recipe of your target model.
Recipes do not promise perfect models but they provide a good start point for `Nervous Beginners`. A recipe script training
a `GlowTTS` model on `LJSpeech` dataset looks like below. Let's be creative and call this script `train_glowtts.py`.
Recipes do not promise perfect models but they provide a good starting point for `Nervous Beginners`. A recipe script for
`GlowTTS` using the `LJSpeech` dataset looks like below. Let's be creative and call this `train_glowtts.py`.
```python
# train_glowtts.py
import os
from TTS.tts.configs import GlowTTSConfig
from TTS.tts.configs import BaseDatasetConfig
from TTS.trainer import init_training, Trainer, TrainingArgs
from TTS.trainer import Trainer, TrainingArgs
from TTS.tts.configs import BaseDatasetConfig, GlowTTSConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.utils.audio import AudioProcessor
output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/"))
dataset_config = BaseDatasetConfig(
name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/")
)
config = GlowTTSConfig(
batch_size=32,
eval_batch_size=16,
@ -34,33 +37,50 @@
run_eval=True,
test_delay_epochs=-1,
epochs=1000,
text_cleaner="english_cleaners",
use_phonemes=False,
text_cleaner="phoneme_cleaners",
use_phonemes=True,
phoneme_language="en-us",
phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
print_step=25,
print_eval=True,
mixed_precision=False,
print_eval=False,
mixed_precision=True,
output_path=output_path,
datasets=[dataset_config]
datasets=[dataset_config],
)
# init audio processor
ap = AudioProcessor(**config.audio.to_dict())
# load training samples
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
# init model
model = GlowTTS(config)
# init the trainer and 🚀
trainer = Trainer(
TrainingArgs(),
config,
output_path,
model=model,
train_samples=train_samples,
eval_samples=eval_samples,
training_assets={"audio_processor": ap},
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
trainer.fit()
```
You need to change fields of the `BaseDatasetConfig` to match your own dataset and then update `GlowTTSConfig`
You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig`
fields as you need, as sketched below.
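For instance, pointing the recipe above at a hypothetical custom dataset could look like this; the formatter name, paths and values are placeholders:
```python
dataset_config = BaseDatasetConfig(
    name="ljspeech",  # the formatter that knows how to parse your metadata file
    meta_file_train="metadata.csv",
    path="/data/my_dataset/",  # placeholder path to your own dataset
)

config = GlowTTSConfig(
    batch_size=16,  # example value for a smaller GPU
    run_eval=True,
    epochs=1000,
    datasets=[dataset_config],
    output_path=output_path,
)
```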
4. Run the training.
You need to run the training script.
```bash
$ CUDA_VISIBLE_DEVICES="0" python train_glowtts.py
```
Notice that you set the GPU you want to use on your system by setting `CUDA_VISIBLE_DEVICES` environment variable.
Notice that we select the GPU used for training with the `CUDA_VISIBLE_DEVICES` environment variable.
To see the available GPUs on your system, you can use the `nvidia-smi` command on the terminal.
If you would like to run multi-GPU training with the DDP back-end,
@ -71,7 +91,7 @@
The example above runs a multi-gpu training using GPUs `0, 1, 2`.
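A sketch of that launch, mirroring the `distribute.py` invocation shown later in this document and assuming the script is named `train_glowtts.py`:
```bash
$ CUDA_VISIBLE_DEVICES="0,1,2" python TTS/bin/distribute.py --script train_glowtts.py
```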
The beginning of a training run looks like below.
The beginning of a training log looks like this:
```console
> Experiment folder: /your/output_path/-Juni-23-2021_02+52-78899209
@ -140,11 +160,11 @@
$ tensorboard --logdir=<path to your training directory>
```
6. Check the logs and the Tensorboard and monitor the training.
6. Monitor the training process.
On the terminal and Tensorboard, you can monitor the losses and their changes over time. Also Tensorboard provides certain figures and sample outputs.
On the terminal and Tensorboard, you can monitor the progress of your model. Tensorboard also provides certain figures and sample outputs.
Note that different models have different metrics, visuals and outputs to be displayed.
Note that different models have different metrics, visuals and outputs.
You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions
that occur during training.

View File

@ -23,28 +23,30 @@ each line.
### Pure Python Way
1. Define `train.py`.

```python
import os

# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs import GlowTTSConfig

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs import BaseDatasetConfig

# Trainer: Where the ✨️ happens.
# TrainingArgs: Defines the set of arguments of the Trainer.
from TTS.trainer import Trainer, TrainingArgs

# load_tts_samples: loads the dataset metadata; GlowTTS: the model; AudioProcessor: feature extraction and audio I/O.
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.utils.audio import AudioProcessor

# we use the same path as this script as our training folder.
output_path = os.path.dirname(os.path.abspath(__file__))

# set LJSpeech as our target dataset and define its path so that the Trainer knows what data formatter it needs.
dataset_config = BaseDatasetConfig(name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/"))

# Configure the model. Every config class inherits the BaseTTSConfig to have all the fields defined for the Trainer.
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
@ -61,25 +63,64 @@ config = GlowTTSConfig(
    mixed_precision=False,
    output_path=output_path,
    datasets=[dataset_config]
)

# initialize the audio processor used for feature extraction and audio I/O.
# It is mainly used by the dataloader and the training loggers.
ap = AudioProcessor(**config.audio.to_dict())

# load a list of training samples
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# initialize the model
# Models only take the config object as input.
model = GlowTTS(config)

# Initiate the Trainer.
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training,
# distributed training etc.
trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},
)

# And kick it 🚀
trainer.fit()
```
2. Run the script.
```bash
CUDA_VISIBLE_DEVICES=0 python train.py
```
- Continue a previous run.
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --continue_path path/to/previous/run/folder/
```
- Fine-tune a model.
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --restore_path path/to/model/checkpoint.pth.tar
```
- Run multi-gpu training.
```bash
CUDA_VISIBLE_DEVICES=0,1,2 python TTS/bin/distribute.py --script train.py
```
### CLI Way
We still support running training from CLI like in the old days. The same training can be started as follows.
We still support running training from CLI like in the old days. The same training run can also be started as follows.
1. Define your `config.json`
@ -111,45 +152,63 @@ We still support running training from CLI like in the old days. The same traini
$ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py --config_path config.json
```
## Training a `vocoder` Model
```python
import os
from TTS.trainer import Trainer, TrainingArgs
from TTS.utils.audio import AudioProcessor
from TTS.vocoder.configs import HifiganConfig
from TTS.trainer import init_training, Trainer, TrainingArgs
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.models.gan import GAN
output_path = os.path.dirname(os.path.abspath(__file__))
config = HifiganConfig(
batch_size=32,
eval_batch_size=16,
num_loader_workers=4,
num_eval_loader_workers=4,
run_eval=True,
test_delay_epochs=-1,
test_delay_epochs=5,
epochs=1000,
seq_len=8192,
pad_short=2000,
use_noise_augment=True,
eval_split_size=10,
print_step=25,
print_eval=True,
print_eval=False,
mixed_precision=False,
lr_gen=1e-4,
lr_disc=1e-4,
# `vocoder` models only need a data path; they recursively read all the `.wav` files underneath.
data_path=os.path.join(output_path, "../LJSpeech-1.1/wavs/"),
output_path=output_path,
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
# init audio processor
ap = AudioProcessor(**config.audio.to_dict())
# load training samples
eval_samples, train_samples = load_wav_data(config.data_path, config.eval_split_size)
# init model
model = GAN(config)
# init the trainer and 🚀
trainer = Trainer(
TrainingArgs(),
config,
output_path,
model=model,
train_samples=train_samples,
eval_samples=eval_samples,
training_assets={"audio_processor": ap},
)
trainer.fit()
```
❗️ Note that you can also start the training run from CLI as the `tts` model above.
❗️ Note that you can also use ```train_vocoder.py``` from the CLI, as with the ```tts``` models above.
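Such a CLI run would presumably mirror the `tts` command shown earlier, along these lines:
```bash
$ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_vocoder.py --config_path config.json
```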
## Synthesizing Speech