Update docs

This commit is contained in:
Eren Gölge 2021-09-30 14:34:53 +00:00
parent f904dd4828
commit 6d3b2d3cdd
3 changed files with 310 additions and 84 deletions

View File

@ -10,7 +10,7 @@
We keep tests under the `tests` folder. You can add `tts` layers tests under the `tts_tests` folder.
Basic tests check the input-output tensor shapes and output values for a given input. Consider testing extreme cases that are more likely to cause problems, like `zero` tensors.
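For illustration, a minimal shape test in the spirit described above; `MyModel`, its module path and the tensor sizes are hypothetical placeholders rather than an existing 🐸TTS model:
```python
import torch

from TTS.tts.configs.my_model_config import MyModelConfig  # hypothetical config class
from TTS.tts.models.my_model import MyModel  # hypothetical model class


def test_forward_shapes():
    config = MyModelConfig()
    model = MyModel(config)
    # 3D input shaped `batch x time x channels`, as required by the model API
    dummy_input = torch.rand(8, 50, 80)
    outputs = model.forward(dummy_input, aux_input={})
    # the main output must be returned under the `model_outputs` key
    assert outputs["model_outputs"].shape[0] == 8
    # extreme case: an all-zero input should not produce NaN outputs
    zero_outputs = model.forward(torch.zeros(8, 50, 80), aux_input={})
    assert not torch.isnan(zero_outputs["model_outputs"]).any()
```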
3. Implement loss function.
3. Implement a loss function.
We keep loss functions under `TTS/tts/layers/losses.py`. You can also mix-and-match implemented loss functions as you like.
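As a sketch of what such a loss might look like, here is a composite loss mixing two standard terms; `MyModelLoss`, its inputs and the weights are invented for illustration and are not part of `losses.py`:
```python
from torch import nn


class MyModelLoss(nn.Module):
    """Hypothetical composite loss mixing a spectrogram term and a duration term."""

    def __init__(self, spec_loss_alpha: float = 1.0, dur_loss_alpha: float = 1.0):
        super().__init__()
        self.spec_loss_alpha = spec_loss_alpha
        self.dur_loss_alpha = dur_loss_alpha
        self.spec_loss = nn.L1Loss()
        self.dur_loss = nn.MSELoss()

    def forward(self, model_outputs, spec_target, dur_pred, dur_target) -> dict:
        # return a dict so that each term can be logged separately by the Trainer
        loss_spec = self.spec_loss(model_outputs, spec_target)
        loss_dur = self.dur_loss(dur_pred, dur_target)
        loss = self.spec_loss_alpha * loss_spec + self.dur_loss_alpha * loss_dur
        return {"loss": loss, "loss_spec": loss_spec, "loss_dur": loss_dur}
```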
@ -29,19 +29,20 @@
A model interacts with the `Trainer API` for training and the `Synthesizer API` for inference and testing.
A 🐸TTS model must return a dictionary by the `forward()` and `inference()` functions. This dictionary must also include the `model_outputs` key that is considered as the main model output by the `Trainer` and `Synthesizer`.
A 🐸TTS model must return a dictionary by the `forward()` and `inference()` functions. This dictionary must include the `model_outputs` key that is considered the main model output by the `Trainer` and `Synthesizer`.
You can place your `tts` model implementation under `TTS/tts/models/new_model.py`, then inherit from `BaseTTS` and implement its interface.
There is also the `callback` interface by which you can manipulate both the model and the `Trainer` states. Callbacks give you
the infinite flexibility to add custom behaviours for your model and training routines.
infinite flexibility to add custom behaviours for your model and training routines.
For more details, see {ref}`BaseTTS <Base TTS Model>` and :obj:`TTS.utils.callbacks`.
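For example, a model can hook into the training loop simply by defining callback methods on itself. The hook names and `trainer` attributes below are assumptions for illustration; see :obj:`TTS.utils.callbacks` for the authoritative set:
```python
from TTS.tts.models.base_tts import BaseTTS


class MyModel(BaseTTS):
    ...

    def on_epoch_start(self, trainer):
        # e.g. enable an auxiliary loss only after a warm-up period (illustrative only)
        self.use_aux_loss = trainer.epochs_done > 10

    def on_train_step_start(self, trainer):
        # e.g. anneal a model-specific temperature with the global step (illustrative only)
        self.temperature = max(0.5, 1.0 - trainer.total_steps_done * 1e-5)
```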
6. Optionally, define `MyModelArgs`.
`MyModelArgs` is a 👨Coqpit class that sets all the class arguments of the `MyModel`. It should be enough to pass
a `MyModelArgs` instance to initiate the `MyModel`.
`MyModelArgs` is a 👨Coqpit class that sets all the class arguments of the `MyModel`. `MyModelArgs` must have
all the fields necessary to instantiate the `MyModel`. However, for training, you need to pass `MyModelConfig` to
the model.
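Since Coqpit classes are dataclasses, `MyModelArgs` is typically a plain container of fields; the fields below are invented purely for illustration:
```python
from dataclasses import dataclass

from coqpit import Coqpit


@dataclass
class MyModelArgs(Coqpit):
    """All arguments needed to instantiate `MyModel` (illustrative fields only)."""

    num_chars: int = 100
    hidden_channels: int = 256
    encoder_layers: int = 6
    use_speaker_embedding: bool = False
    speaker_embedding_dim: int = 512
```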
7. Test `MyModel`.
@ -59,3 +60,149 @@
9. Write Docstrings.
We love you more when you document your code. ❤️
# Template 🐸TTS Model implementation
You can start implementing your model by copying the following base class.
```python
from typing import Dict, List, Tuple, Union

import torch
from coqpit import Coqpit
from torch import nn

from TTS.tts.models.base_tts import BaseTTS
class MyModel(BaseTTS):
"""
Notes on input/output tensor shapes:
Any input or output tensor of the model must be shaped as
- 3D tensors `batch x time x channels`
- 2D tensors `batch x channels`
- 1D tensors `batch x 1`
"""
def __init__(self, config: Coqpit):
super().__init__()
self._set_model_args(config)
def _set_model_args(self, config: Coqpit):
"""Set model arguments from the config. Override this."""
pass
def forward(self, input: torch.Tensor, *args, aux_input={}, **kwargs) -> Dict:
"""Forward pass for the model mainly used in training.
You can be flexible here and use a different number of arguments and argument names since it is intended to be
used by `train_step()` without exposing it out of the model.
Args:
input (torch.Tensor): Input tensor.
aux_input (Dict): Auxiliary model inputs like embeddings, durations or any other sorts of inputs.
Returns:
Dict: Model outputs. Main model output must be named as "model_outputs".
"""
outputs_dict = {"model_outputs": None}
...
return outputs_dict
def inference(self, input: torch.Tensor, aux_input={}) -> Dict:
"""Forward pass for inference.
We don't use `*args` or `**kwargs` since they are problematic with the TorchScript API.
Args:
input (torch.Tensor): [description]
aux_input (Dict): Auxiliary inputs like speaker embeddings, durations etc.
Returns:
Dict: [description]
"""
outputs_dict = {"model_outputs": None}
...
return outputs_dict
def train_step(self, batch: Dict, criterion: nn.Module) -> Tuple[Dict, Dict]:
"""Perform a single training step. Run the model forward pass and compute losses.
Args:
batch (Dict): Input tensors.
criterion (nn.Module): Loss layer designed for the model.
Returns:
Tuple[Dict, Dict]: Model outputs and computed losses.
"""
outputs_dict = {}
loss_dict = {} # this returns from the criterion
...
return outputs_dict, loss_dict
def train_log(self, batch: Dict, outputs: Dict, logger: "Logger", assets:Dict, steps:int) -> None:
"""Create visualizations and waveform examples for training.
For example, here you can plot spectrograms and generate sample waveforms from these spectrograms to
be projected onto Tensorboard.
Args:
batch (Dict): Model inputs used at the previous training step.
outputs (Dict): Model outputs generated at the previous training step.
logger (Logger): Logger instance used for the training run.
assets (Dict): Training assets, e.g. the `AudioProcessor` used for feature extraction.
steps (int): Current training step.
"""
pass
def eval_step(self, batch: Dict, criterion: nn.Module) -> Tuple[Dict, Dict]:
"""Perform a single evaluation step. Run the model forward pass and compute losses. In most cases, you can
call `train_step()` with no changes.
Args:
batch (Dict): Input tensors.
criterion (nn.Module): Loss layer designed for the model.
Returns:
Tuple[Dict, Dict]: Model outputs and computed losses.
"""
outputs_dict = {}
loss_dict = {} # this returns from the criterion
...
return outputs_dict, loss_dict
def eval_log(self, batch: Dict, outputs: Dict, logger: "Logger", assets:Dict, steps:int) -> None:
"""The same as `train_log()`"""
pass
def load_checkpoint(self, config: Coqpit, checkpoint_path: str, eval: bool = False) -> None:
"""Load a checkpoint and get ready for training or inference.
Args:
config (Coqpit): Model configuration.
checkpoint_path (str): Path to the model checkpoint file.
eval (bool, optional): If true, init model for inference else for training. Defaults to False.
"""
...
def get_optimizer(self) -> Union["Optimizer", List["Optimizer"]]:
"""Setup an return optimizer or optimizers."""
pass
def get_lr(self) -> Union[float, List[float]]:
"""Return learning rate(s).
Returns:
Union[float, List[float]]: Model's initial learning rates.
"""
pass
def get_scheduler(self, optimizer: torch.optim.Optimizer):
pass
def get_criterion(self):
pass
def format_batch(self):
pass
```

View File

@ -5,27 +5,30 @@
Each model has a different set of pros and cons that define the run-time efficiency and the voice quality. It is up to you to decide which model serves your needs. Other than referring to the papers, one easy way is to test the 🐸TTS
community models and see how fast and good each of them is. Or you can start a discussion on our communication channels.
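One quick way to audition the community models is the `tts` command line tool; the model name below is just an example:
```bash
# list the released 🐸TTS models
$ tts --list_models

# synthesize a sentence with one of them and listen to the result
$ tts --model_name "tts_models/en/ljspeech/glow-tts" --text "Hello world!" --out_path hello.wav
```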
2. Understand the configuration class, its fields and values of your model.
2. Understand your model's configuration, its fields and values.
For instance, if you want to train a `Tacotron` model then see the `TacotronConfig` class and make sure you understand it.
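A quick way to see every field and its default is to instantiate the config and dump it, assuming `TacotronConfig` is exposed from `TTS.tts.configs` like the other configs on this page:
```python
from TTS.tts.configs import TacotronConfig

config = TacotronConfig()
# Coqpit configs behave like dataclasses; print all fields with their default values
print(config.to_dict())
```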
3. Go to the recipes and check the recipe of your target model.
Recipes do not promise perfect models but they provide a good start point for `Nervous Beginners`. A recipe script training
a `GlowTTS` model on `LJSpeech` dataset looks like below. Let's be creative and call this script `train_glowtts.py`.
Recipes do not promise perfect models but they provide a good starting point for `Nervous Beginners`. A recipe script for
`GlowTTS` using the `LJSpeech` dataset looks like below. Let's be creative and call this `train_glowtts.py`.
```python
# train_glowtts.py
import os
from TTS.tts.configs import GlowTTSConfig
from TTS.tts.configs import BaseDatasetConfig
from TTS.trainer import init_training, Trainer, TrainingArgs
from TTS.trainer import Trainer, TrainingArgs
from TTS.tts.configs import BaseDatasetConfig, GlowTTSConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.utils.audio import AudioProcessor
output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/"))
dataset_config = BaseDatasetConfig(
name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/")
)
config = GlowTTSConfig(
batch_size=32,
eval_batch_size=16,
@ -34,33 +37,50 @@
run_eval=True,
test_delay_epochs=-1,
epochs=1000,
text_cleaner="english_cleaners",
use_phonemes=False,
text_cleaner="phoneme_cleaners",
use_phonemes=True,
phoneme_language="en-us",
phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
print_step=25,
print_eval=True,
mixed_precision=False,
print_eval=False,
mixed_precision=True,
output_path=output_path,
datasets=[dataset_config]
datasets=[dataset_config],
)
# init audio processor
ap = AudioProcessor(**config.audio.to_dict())
# load training samples
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
# init model
model = GlowTTS(config)
# init the trainer and 🚀
trainer = Trainer(
TrainingArgs(),
config,
output_path,
model=model,
train_samples=train_samples,
eval_samples=eval_samples,
training_assets={"audio_processor": ap},
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
trainer.fit()
```
You need to change fields of the `BaseDatasetConfig` to match your own dataset and then update `GlowTTSConfig`
You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig`
fields as you need, as sketched below.
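For instance, pointing the recipe above at a hypothetical custom dataset could look like this; the formatter name, paths and values are placeholders:
```python
dataset_config = BaseDatasetConfig(
    name="ljspeech",  # the formatter that knows how to parse your metadata file
    meta_file_train="metadata.csv",
    path="/data/my_dataset/",  # placeholder path to your own dataset
)

config = GlowTTSConfig(
    batch_size=16,  # example value for a smaller GPU
    run_eval=True,
    epochs=1000,
    datasets=[dataset_config],
    output_path=output_path,
)
```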
4. Run the training.
You need to run the training script.
```bash
$ CUDA_VISIBLE_DEVICES="0" python train_glowtts.py
```
Notice that you set the GPU you want to use on your system by setting `CUDA_VISIBLE_DEVICES` environment variable.
Notice that we select the GPU used for training with the `CUDA_VISIBLE_DEVICES` environment variable.
To see the available GPUs on your system, you can use the `nvidia-smi` command on the terminal.
If you would like to run multi-GPU training with the DDP back-end,
@ -71,7 +91,7 @@
The example above runs a multi-gpu training using GPUs `0, 1, 2`.
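A sketch of that launch, mirroring the `distribute.py` invocation shown later in this document and assuming the script is named `train_glowtts.py`:
```bash
$ CUDA_VISIBLE_DEVICES="0,1,2" python TTS/bin/distribute.py --script train_glowtts.py
```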
The beginning of a training run looks like below.
The beginning of a training log looks like this:
```console
> Experiment folder: /your/output_path/-Juni-23-2021_02+52-78899209
@ -140,11 +160,11 @@
$ tensorboard --logdir=<path to your training directory>
```
6. Check the logs and the Tensorboard and monitor the training.
6. Monitor the training process.
On the terminal and Tensorboard, you can monitor the losses and their changes over time. Also Tensorboard provides certain figures and sample outputs.
On the terminal and Tensorboard, you can monitor the progress of your model. Tensorboard also provides certain figures and sample outputs.
Note that different models have different metrics, visuals and outputs to be displayed.
Note that different models have different metrics, visuals and outputs.
You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions
that occur during training.

View File

@ -23,28 +23,30 @@ each line.
### Pure Python Way
1. Define `train.py`.

```python
import os

# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs import GlowTTSConfig

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs import BaseDatasetConfig

# Trainer: Where the ✨️ happens.
# TrainingArgs: Defines the set of arguments of the Trainer.
from TTS.trainer import Trainer, TrainingArgs

# load_tts_samples: loads the dataset metadata; GlowTTS: the model; AudioProcessor: feature extraction and audio I/O.
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.utils.audio import AudioProcessor

# we use the same path as this script as our training folder.
output_path = os.path.dirname(os.path.abspath(__file__))

# set LJSpeech as our target dataset and define its path so that the Trainer knows what data formatter it needs.
dataset_config = BaseDatasetConfig(name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/"))

# Configure the model. Every config class inherits the BaseTTSConfig to have all the fields defined for the Trainer.
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
@ -61,25 +63,64 @@ config = GlowTTSConfig(
    mixed_precision=False,
    output_path=output_path,
    datasets=[dataset_config]
)

# initialize the audio processor used for feature extraction and audio I/O.
# It is mainly used by the dataloader and the training loggers.
ap = AudioProcessor(**config.audio.to_dict())

# load a list of training samples
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# initialize the model
# Models only take the config object as input.
model = GlowTTS(config)

# Initiate the Trainer.
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training,
# distributed training etc.
trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},
)

# And kick it 🚀
trainer.fit()
```
2. Run the script.
```bash
CUDA_VISIBLE_DEVICES=0 python train.py
```
- Continue a previous run.
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --continue_path path/to/previous/run/folder/
```
- Fine-tune a model.
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --restore_path path/to/model/checkpoint.pth.tar
```
- Run multi-gpu training.
```bash
CUDA_VISIBLE_DEVICES=0,1,2 python TTS/bin/distribute.py --script train.py
```
### CLI Way
We still support running training from CLI like in the old days. The same training can be started as follows.
We still support running training from CLI like in the old days. The same training run can also be started as follows.
1. Define your `config.json`
@ -111,45 +152,63 @@ We still support running training from CLI like in the old days. The same traini
$ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py --config_path config.json
```
## Training a `vocoder` Model
```python
import os
from TTS.trainer import Trainer, TrainingArgs
from TTS.utils.audio import AudioProcessor
from TTS.vocoder.configs import HifiganConfig
from TTS.trainer import init_training, Trainer, TrainingArgs
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.models.gan import GAN
output_path = os.path.dirname(os.path.abspath(__file__))
config = HifiganConfig(
batch_size=32,
eval_batch_size=16,
num_loader_workers=4,
num_eval_loader_workers=4,
run_eval=True,
test_delay_epochs=-1,
test_delay_epochs=5,
epochs=1000,
seq_len=8192,
pad_short=2000,
use_noise_augment=True,
eval_split_size=10,
print_step=25,
print_eval=True,
print_eval=False,
mixed_precision=False,
lr_gen=1e-4,
lr_disc=1e-4,
# `vocoder` models only need a data path; they recursively read all the `.wav` files underneath.
data_path=os.path.join(output_path, "../LJSpeech-1.1/wavs/"),
output_path=output_path,
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
# init audio processor
ap = AudioProcessor(**config.audio.to_dict())
# load training samples
eval_samples, train_samples = load_wav_data(config.data_path, config.eval_split_size)
# init model
model = GAN(config)
# init the trainer and 🚀
trainer = Trainer(
TrainingArgs(),
config,
output_path,
model=model,
train_samples=train_samples,
eval_samples=eval_samples,
training_assets={"audio_processor": ap},
)
trainer.fit()
```
❗️ Note that you can also start the training run from CLI as the `tts` model above.
❗️ Note that you can also use ```train_vocoder.py``` from the CLI, as with the ```tts``` models above.
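Such a CLI run would presumably mirror the `tts` command shown earlier, along these lines:
```bash
$ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_vocoder.py --config_path config.json
```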
## Synthesizing Speech