mirror of https://github.com/coqui-ai/TTS.git
Merge branch 'dev'
This commit is contained in:
commit f6c96b0ac2

.compute (6 lines changed)
@@ -1,14 +1,14 @@

#!/bin/bash
yes | apt-get install sox
yes | apt-get install ffmpeg
yes | apt-get install espeak
yes | apt-get install tmux
yes | apt-get install zsh
sh -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
pip3 install https://download.pytorch.org/whl/cu100/torch-1.3.0%2Bcu100-cp36-cp36m-linux_x86_64.whl
sudo sh install.sh
pip install pytorch==1.3.0+cu100
python3 setup.py develop
# pip install pytorch==1.7.0+cu100
# python3 setup.py develop
# python3 distribute.py --config_path config.json --data_path /data/ro/shared/data/keithito/LJSpeech-1.1/
# cp -R ${USER_DIR}/Mozilla_22050 ../tmp/
# python3 distribute.py --config_path config_tacotron_gst.json --data_path ../tmp/Mozilla_22050/
@@ -17,5 +17,6 @@ fi

if [[ "$TEST_SUITE" == "testscripts" ]]; then
    # test model training scripts
    ./tests/test_tts_train.sh
    ./tests/test_vocoder_train.sh
    ./tests/test_vocoder_gan_train.sh
    ./tests/test_vocoder_wavernn_train.sh
fi

README.md (28 lines changed)
@@ -26,7 +26,7 @@ TTS paper collection: https://github.com/erogol/TTS-papers

## TTS Performance
<p align="center"><img src="https://discourse-prod-uploads-81679984178418.s3.dualstack.us-west-2.amazonaws.com/optimized/3X/6/4/6428f980e9ec751c248e591460895f7881aec0c6_2_1035x591.png" width="800" /></p>

"Mozilla*" and "Judy*" are our models.
[Details...](https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results)

## Provided Models and Methods
@@ -47,7 +47,10 @@ Speaker Encoder:

Vocoders:
- MelGAN: [paper](https://arxiv.org/abs/1910.06711)
- MultiBandMelGAN: [paper](https://arxiv.org/abs/2005.05106)
- ParallelWaveGAN: [paper](https://arxiv.org/abs/1910.11480)
- GAN-TTS discriminators: [paper](https://arxiv.org/abs/1909.11646)
- WaveRNN: [origin](https://github.com/fatchord/WaveRNN/)
- WaveGrad: [paper](https://arxiv.org/abs/2009.00713)

You can also help us implement more models. Some TTS related work can be found [here](https://github.com/erogol/TTS-papers).
@@ -70,8 +73,8 @@ You can also help us implement more models. Some TTS related work can be found [

## Main Requirements and Installation
Highly recommended to use [miniconda](https://conda.io/miniconda.html) for easier installation.
* python>=3.6
* pytorch>=1.4.1
* tensorflow>=2.2
* pytorch>=1.5.0
* tensorflow>=2.3
* librosa
* tensorboard
* tensorboardX
@@ -149,23 +152,25 @@ head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
```

To train a new model, you need to define your own ```config.json``` file (check the example) and call with the command below. You also set the model architecture in ```config.json```.
To train a new model, you need to define your own ```config.json``` to define model details, training configuration and more (check the examples). Then call the corresponding train script.

```python TTS/bin/train_tts.py --config_path TTS/tts/configs/config.json```
For instance, in order to train a tacotron or tacotron2 model on the LJSpeech dataset, follow these steps.

```python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json```

To fine-tune a model, use ```--restore_path```.

```python TTS/bin/train_tts.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar```
```python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar```

To continue an old training run, use ```--continue_path```.

```python TTS/bin/train_tts.py --continue_path /path/to/your/run_folder/```
```python TTS/bin/train_tacotron.py --continue_path /path/to/your/run_folder/```

For multi-GPU training use ```distribute.py```. It enables process based multi-GPU training where each process uses a single GPU.
For multi-GPU training, call ```distribute.py```. It runs any provided train script in a multi-GPU setting.

```CUDA_VISIBLE_DEVICES="0,1,4" TTS/bin/distribute.py --config_path TTS/tts/configs/config.json```
```CUDA_VISIBLE_DEVICES="0,1,4" python TTS/bin/distribute.py --script train_tacotron.py --config_path TTS/tts/configs/config.json```

Each run creates a new output folder and ```config.json``` is copied under this folder.
Each run creates a new output folder accommodating the used ```config.json```, model checkpoints and tensorboard logs.

In case of any error or interrupted execution, if there is no checkpoint yet under the output folder, the whole folder is going to be removed.
@@ -199,7 +204,7 @@ If you like to use TTS to try a new idea and like to share your experiments with

- [x] Train TTS with r=1 successfully.
- [x] Enable process based distributed training. Similar to (https://github.com/fastai/imagenet-fast/).
- [x] Adapting Neural Vocoder. TTS works with WaveRNN and ParallelWaveGAN (https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN)
- [ ] Multi-speaker embedding.
- [x] Multi-speaker embedding.
- [x] Model optimization (model export, model pruning etc.)

<!--## References

@@ -218,3 +223,4 @@ If you like to use TTS to try a new idea and like to share your experiments with
- https://github.com/r9y9/tacotron_pytorch (Initial Tacotron architecture)
- https://github.com/kan-bayashi/ParallelWaveGAN (vocoder library)
- https://github.com/jaywalnut310/glow-tts (Original Glow-TTS implementation)
- https://github.com/fatchord/WaveRNN/ (Original WaveRNN implementation)

@@ -0,0 +1,130 @@

import argparse
import glob
import os

import numpy as np
from tqdm import tqdm

import torch
from TTS.speaker_encoder.model import SpeakerEncoder
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.tts.utils.speakers import save_speaker_mapping
from TTS.tts.datasets.preprocess import load_meta_data

parser = argparse.ArgumentParser(
    description='Compute embedding vectors for each wav file in a dataset. If "target_dataset" is defined, it generates "speakers.json" necessary for training a multi-speaker model.')
parser.add_argument(
    'model_path',
    type=str,
    help='Path to model outputs (checkpoint, tensorboard etc.).')
parser.add_argument(
    'config_path',
    type=str,
    help='Path to config file for training.',
)
parser.add_argument(
    'data_path',
    type=str,
    help='Data path for wav files - directory or CSV file')
parser.add_argument(
    'output_path',
    type=str,
    help='path for training outputs.')
parser.add_argument(
    '--target_dataset',
    type=str,
    default='',
    help='Target dataset to pick a processor from TTS.tts.dataset.preprocess. Necessary to create a speakers.json file.'
)
parser.add_argument(
    '--use_cuda', type=bool, help='flag to set cuda.', default=False
)
parser.add_argument(
    '--separator', type=str, help='Separator used in file if CSV is passed for data_path', default='|'
)
args = parser.parse_args()


c = load_config(args.config_path)
ap = AudioProcessor(**c['audio'])

data_path = args.data_path
split_ext = os.path.splitext(data_path)
sep = args.separator

if args.target_dataset != '':
    # if target dataset is defined
    dataset_config = [
        {
            "name": args.target_dataset,
            "path": args.data_path,
            "meta_file_train": None,
            "meta_file_val": None
        },
    ]
    wav_files, _ = load_meta_data(dataset_config, eval_split=False)
    output_files = [wav_file[1].replace(data_path, args.output_path).replace(
        '.wav', '.npy') for wav_file in wav_files]
else:
    # if target dataset is not defined
    if len(split_ext) > 0 and split_ext[1].lower() == '.csv':
        # Parse CSV
        print(f'CSV file: {data_path}')
        with open(data_path) as f:
            wav_path = os.path.join(os.path.dirname(data_path), 'wavs')
            wav_files = []
            print(f'Separator is: {sep}')
            for line in f:
                components = line.split(sep)
                if len(components) != 2:
                    print("Invalid line")
                    continue
                wav_file = os.path.join(wav_path, components[0] + '.wav')
                # print(f'wav_file: {wav_file}')
                if os.path.exists(wav_file):
                    wav_files.append(wav_file)
        print(f'Count of wavs imported: {len(wav_files)}')
    else:
        # Parse all wav files in data_path
        wav_files = glob.glob(data_path + '/**/*.wav', recursive=True)

    output_files = [wav_file.replace(data_path, args.output_path).replace(
        '.wav', '.npy') for wav_file in wav_files]

for output_file in output_files:
    os.makedirs(os.path.dirname(output_file), exist_ok=True)

# define Encoder model
model = SpeakerEncoder(**c.model)
model.load_state_dict(torch.load(args.model_path)['model'])
model.eval()
if args.use_cuda:
    model.cuda()

# compute speaker embeddings
speaker_mapping = {}
for idx, wav_file in enumerate(tqdm(wav_files)):
    if isinstance(wav_file, list):
        speaker_name = wav_file[2]
        wav_file = wav_file[1]

    mel_spec = ap.melspectrogram(ap.load_wav(wav_file, sr=ap.sample_rate)).T
    mel_spec = torch.FloatTensor(mel_spec[None, :, :])
    if args.use_cuda:
        mel_spec = mel_spec.cuda()
    embedd = model.compute_embedding(mel_spec)
    embedd = embedd.detach().cpu().numpy()
    np.save(output_files[idx], embedd)

    if args.target_dataset != '':
        # create speaker_mapping if target dataset is defined
        wav_file_name = os.path.basename(wav_file)
        speaker_mapping[wav_file_name] = {}
        speaker_mapping[wav_file_name]['name'] = speaker_name
        speaker_mapping[wav_file_name]['embedding'] = embedd.flatten().tolist()

if args.target_dataset != '':
    # save speaker_mapping if target dataset is defined
    mapping_file_path = os.path.join(args.output_path, 'speakers.json')
    save_speaker_mapping(args.output_path, speaker_mapping)
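
As a usage note for the new script above: with `--target_dataset` set, it writes a `speakers.json` whose entries each hold a speaker name and a flattened embedding per wav file. A minimal sketch of consuming that file follows; the output path is illustrative, not taken from the diff.

```python
import json
import numpy as np

# Illustrative path; the script above writes speakers.json under output_path.
with open('outputs/speakers.json') as f:
    speaker_mapping = json.load(f)

# Each entry mirrors the fields filled in the loop above: 'name' and 'embedding'.
first = next(iter(speaker_mapping.values()))
embedding = np.asarray(first['embedding'], dtype=np.float32)
print(first['name'], embedding.shape)
```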

@@ -2,6 +2,7 @@
# -*- coding: utf-8 -*-

import os
import glob
import argparse

import numpy as np

@@ -11,6 +12,7 @@ from TTS.tts.datasets.preprocess import load_meta_data
from TTS.utils.io import load_config
from TTS.utils.audio import AudioProcessor


def main():
    """Run preprocessing process."""
    parser = argparse.ArgumentParser(

@@ -30,7 +32,10 @@ def main():
    ap = AudioProcessor(**CONFIG.audio)

    # load the meta data of target dataset
    dataset_items = load_meta_data(CONFIG.datasets)[0]  # take only train data
    if 'data_path' in CONFIG.keys():
        dataset_items = glob.glob(os.path.join(CONFIG.data_path, '**', '*.wav'), recursive=True)
    else:
        dataset_items = load_meta_data(CONFIG.datasets)[0]  # take only train data
    print(f" > There are {len(dataset_items)} files.")

    mel_sum = 0

@@ -40,7 +45,7 @@ def main():
    N = 0
    for item in tqdm(dataset_items):
        # compute features
        wav = ap.load_wav(item[1])
        wav = ap.load_wav(item if isinstance(item, str) else item[1])
        linear = ap.spectrogram(wav)
        mel = ap.melspectrogram(wav)

@@ -56,7 +61,7 @@ def main():
    linear_mean = linear_sum / N
    linear_scale = np.sqrt(linear_square_sum / N - linear_mean ** 2)

    output_file_path = os.path.join(args.out_path, "scale_stats.npy")
    output_file_path = args.out_path
    stats = {}
    stats['mel_mean'] = mel_mean
    stats['mel_std'] = mel_scale

@@ -78,7 +83,7 @@ def main():
    del CONFIG.audio['clip_norm']
    stats['audio_config'] = CONFIG.audio
    np.save(output_file_path, stats, allow_pickle=True)
    print(f' > scale_stats.npy is saved to {output_file_path}')
    print(f' > stats saved to {output_file_path}')


if __name__ == "__main__":
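
A side note on the statistics above: they are one-pass estimates, mean = sum / N and std = sqrt(square_sum / N - mean**2), and the dict is saved with `allow_pickle=True`, so it must be loaded back the same way. A small self-contained sketch with toy data and an illustrative file name:

```python
import numpy as np

# Toy stand-in for concatenated mel frames (frames x mel bins).
frames = np.random.rand(1000, 80)
mel_sum = frames.sum(axis=0)
mel_square_sum = (frames ** 2).sum(axis=0)
N = frames.shape[0]
mel_mean = mel_sum / N
mel_scale = np.sqrt(mel_square_sum / N - mel_mean ** 2)  # per-bin std

# The script stores a plain dict, so loading needs allow_pickle and .item():
np.save('scale_stats.npy', {'mel_mean': mel_mean, 'mel_std': mel_scale}, allow_pickle=True)
stats = np.load('scale_stats.npy', allow_pickle=True).item()
print(stats['mel_mean'].shape)
```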

@@ -10,7 +10,7 @@ import time

import torch

from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.generic_utils import setup_model, is_tacotron
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.utils.audio import AudioProcessor

@@ -125,7 +125,8 @@ if __name__ == "__main__":
    model.eval()
    if args.use_cuda:
        model.cuda()
    model.decoder.set_r(cp['r'])
    if is_tacotron(C):
        model.decoder.set_r(cp['r'])

    # load vocoder model
    if args.vocoder_path != "":

@@ -153,7 +154,10 @@ if __name__ == "__main__":
        args.speaker_fileid = None

    if args.gst_style is None:
        gst_style = C.gst['gst_style_input']
        if is_tacotron(C):
            gst_style = C.gst['gst_style_input']
        else:
            gst_style = None
    else:
        # check if gst_style string is a dict, if is dict convert else use string
        try:
@@ -35,7 +35,7 @@ print(" > Using CUDA: ", use_cuda)
print(" > Number of GPUs: ", num_gpus)


def setup_loader(ap, is_val=False, verbose=False):
def setup_loader(ap: AudioProcessor, is_val: bool = False, verbose: bool = False):
    if is_val:
        loader = None
    else:

@@ -212,6 +212,7 @@ if __name__ == '__main__':
    parser.add_argument(
        '--config_path',
        type=str,
        required=True,
        help='Path to config file for training.',
    )
    parser.add_argument('--debug',

@@ -9,17 +9,16 @@ import time
import traceback

import torch
from random import randrange
from torch.utils.data import DataLoader

from TTS.tts.datasets.preprocess import load_meta_data
from TTS.tts.datasets.TTSDataset import MyDataset
from TTS.tts.layers.losses import GlowTTSLoss
from TTS.tts.utils.distribute import (DistributedSampler, init_distributed,
                                      reduce_tensor)
from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.generic_utils import setup_model, check_config_tts
from TTS.tts.utils.io import save_best_model, save_checkpoint
from TTS.tts.utils.measures import alignment_diagonal_score
from TTS.tts.utils.speakers import (get_speakers, load_speaker_mapping,
                                    save_speaker_mapping)
from TTS.tts.utils.speakers import parse_speakers, load_speaker_mapping
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram

@@ -34,10 +33,15 @@ from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.training import (NoamLR, check_update,
                                setup_torch_training_env)

# DISTRIBUTED
from torch.nn.parallel import DistributedDataParallel as DDP_th
from torch.utils.data.distributed import DistributedSampler
from TTS.utils.distribute import init_distributed, reduce_tensor


use_cuda, num_gpus = setup_torch_training_env(True, False)

def setup_loader(ap, r, is_val=False, verbose=False):


def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
    if is_val and not c.run_eval:
        loader = None
    else:

@@ -48,6 +52,7 @@ def setup_loader(ap, r, is_val=False, verbose=False):
        meta_data=meta_data_eval if is_val else meta_data_train,
        ap=ap,
        tp=c.characters if 'characters' in c.keys() else None,
        add_blank=c['add_blank'] if 'add_blank' in c.keys() else False,
        batch_group_size=0 if is_val else c.batch_group_size *
        c.batch_size,
        min_seq_len=c.min_seq_len,

@@ -56,7 +61,8 @@ def setup_loader(ap, r, is_val=False, verbose=False):
        use_phonemes=c.use_phonemes,
        phoneme_language=c.phoneme_language,
        enable_eos_bos=c.enable_eos_bos_chars,
        verbose=verbose)
        verbose=verbose,
        speaker_mapping=speaker_mapping if c.use_speaker_embedding and c.use_external_speaker_embedding_file else None)
    sampler = DistributedSampler(dataset) if num_gpus > 1 else None
    loader = DataLoader(
        dataset,

@@ -86,10 +92,13 @@ def format_data(data):
    avg_spec_length = torch.mean(mel_lengths.float())

    if c.use_speaker_embedding:
        speaker_ids = [
            speaker_mapping[speaker_name] for speaker_name in speaker_names
        ]
        speaker_ids = torch.LongTensor(speaker_ids)
        if c.use_external_speaker_embedding_file:
            speaker_ids = data[8]
        else:
            speaker_ids = [
                speaker_mapping[speaker_name] for speaker_name in speaker_names
            ]
            speaker_ids = torch.LongTensor(speaker_ids)
    else:
        speaker_ids = None

@@ -107,7 +116,7 @@ def format_data(data):
        avg_text_length, avg_spec_length, attn_mask
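
For context, the branch above distinguishes two speaker-conditioning modes: precomputed external embeddings arrive directly in the batch, while the plain multi-speaker path turns names into integer ids. A small illustrative sketch of the id path, with toy names not taken from the diff:

```python
import torch

# Map speaker names to integer ids, as format_data does when no external
# embedding file is used; the mapping itself comes from speakers.json.
speaker_mapping = {'p225': 0, 'p226': 1}
speaker_names = ['p226', 'p225', 'p226']
speaker_ids = torch.LongTensor([speaker_mapping[n] for n in speaker_names])
print(speaker_ids)  # tensor([1, 0, 1])
```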

def data_depended_init(model, ap):
def data_depended_init(model, ap, speaker_mapping=None):
    """Data depended initialization for activation normalization."""
    if hasattr(model, 'module'):
        for f in model.module.decoder.flows:

@@ -118,19 +127,19 @@ def data_depended_init(model, ap):
        if getattr(f, "set_ddi", False):
            f.set_ddi(True)

    data_loader = setup_loader(ap, 1, is_val=False)
    data_loader = setup_loader(ap, 1, is_val=False, speaker_mapping=speaker_mapping)
    model.train()
    print(" > Data depended initialization ... ")
    with torch.no_grad():
        for _, data in enumerate(data_loader):

            # format data
            text_input, text_lengths, mel_input, mel_lengths, _,\
            text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
                _, _, attn_mask = format_data(data)

            # forward pass model
            _ = model.forward(
                text_input, text_lengths, mel_input, mel_lengths, attn_mask)
                text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)
            break

    if hasattr(model, 'module'):

@@ -145,9 +154,9 @@ def data_depended_init(model, ap):


def train(model, criterion, optimizer, scheduler,
          ap, global_step, epoch, amp):
          ap, global_step, epoch, speaker_mapping=None):
    data_loader = setup_loader(ap, 1, is_val=False,
                               verbose=(epoch == 0))
                               verbose=(epoch == 0), speaker_mapping=speaker_mapping)
    model.train()
    epoch_time = 0
    keep_avg = KeepAverage()

@@ -158,43 +167,49 @@ def train(model, criterion, optimizer, scheduler,
    batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
    end_time = time.time()
    c_logger.print_train_start()
    scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
    for num_iter, data in enumerate(data_loader):
        start_time = time.time()

        # format data
        text_input, text_lengths, mel_input, mel_lengths, _,\
        text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
            avg_text_length, avg_spec_length, attn_mask = format_data(data)

        loader_time = time.time() - end_time

        global_step += 1
        optimizer.zero_grad()

        # forward pass model
        with torch.cuda.amp.autocast(enabled=c.mixed_precision):
            z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
                text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)

            # compute loss
            loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,
                                  o_dur_log, o_total_dur, text_lengths)

        # backward pass with loss scaling
        if c.mixed_precision:
            scaler.scale(loss_dict['loss']).backward()
            scaler.unscale_(optimizer)
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                       c.grad_clip)
            scaler.step(optimizer)
            scaler.update()
        else:
            loss_dict['loss'].backward()
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                       c.grad_clip)
            optimizer.step()

        grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
        optimizer.step()

        # setup lr
        if c.noam_schedule:
            scheduler.step()
        optimizer.zero_grad()

        # forward pass model
        z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
            text_input, text_lengths, mel_input, mel_lengths, attn_mask)

        # compute loss
        loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,
                              o_dur_log, o_total_dur, text_lengths)

        # backward pass
        if amp is not None:
            with amp.scale_loss(loss_dict['loss'], optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            loss_dict['loss'].backward()

        if amp:
            amp_opt_params = amp.master_params(optimizer)
        else:
            amp_opt_params = None
        grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True, amp_opt_params=amp_opt_params)
        optimizer.step()

        # current_lr
        current_lr = optimizer.param_groups[0]['lr']
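
The hunk above swaps apex `amp` for PyTorch's native AMP. For reference, here is a minimal, self-contained sketch of that `GradScaler`/`autocast` pattern with a toy model; all names and values are illustrative, not from the repo:

```python
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
mixed_precision = device == 'cuda'
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=mixed_precision)

for _ in range(3):
    x = torch.randn(8, 10, device=device)
    y = torch.randn(8, 1, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=mixed_precision):
        loss = nn.functional.mse_loss(model(x), y)  # forward in fp16 where safe
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.unscale_(optimizer)      # unscale so clipping sees true grad norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
    scaler.step(optimizer)          # skips the step if grads overflowed
    scaler.update()
```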

@@ -257,12 +272,12 @@ def train(model, criterion, optimizer, scheduler,
        if c.checkpoint:
            # save model
            save_checkpoint(model, optimizer, global_step, epoch, 1, OUT_PATH,
                            model_loss=loss_dict['loss'],
                            amp_state_dict=amp.state_dict() if amp else None)
                            model_loss=loss_dict['loss'])

            # Diagnostic visualizations
            # direct pass on model for spec predictions
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1])
            target_speaker = None if speaker_ids is None else speaker_ids[:1]
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1], g=target_speaker)
            spec_pred = spec_pred.permute(0, 2, 1)
            gt_spec = mel_input.permute(0, 2, 1)
            const_spec = spec_pred[0].data.cpu().numpy()

@@ -298,8 +313,8 @@ def train(model, criterion, optimizer, scheduler,


@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
    data_loader = setup_loader(ap, 1, is_val=True)
def evaluate(model, criterion, ap, global_step, epoch, speaker_mapping):
    data_loader = setup_loader(ap, 1, is_val=True, speaker_mapping=speaker_mapping)
    model.eval()
    epoch_time = 0
    keep_avg = KeepAverage()

@@ -309,12 +324,12 @@ def evaluate(model, criterion, ap, global_step, epoch):
        start_time = time.time()

        # format data
        text_input, text_lengths, mel_input, mel_lengths, _,\
        text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
            _, _, attn_mask = format_data(data)

        # forward pass model
        z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
            text_input, text_lengths, mel_input, mel_lengths, attn_mask)
            text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)

        # compute loss
        loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,

@@ -355,10 +370,11 @@ def evaluate(model, criterion, ap, global_step, epoch):
    if args.rank == 0:
        # Diagnostic visualizations
        # direct pass on model for spec predictions
        target_speaker = None if speaker_ids is None else speaker_ids[:1]
        if hasattr(model, 'module'):
            spec_pred, *_ = model.module.inference(text_input[:1], text_lengths[:1])
            spec_pred, *_ = model.module.inference(text_input[:1], text_lengths[:1], g=target_speaker)
        else:
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1])
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1], g=target_speaker)
        spec_pred = spec_pred.permute(0, 2, 1)
        gt_spec = mel_input.permute(0, 2, 1)

@@ -398,7 +414,17 @@ def evaluate(model, criterion, ap, global_step, epoch):
        test_audios = {}
        test_figures = {}
        print(" | > Synthesizing test sentences")
        speaker_id = 0 if c.use_speaker_embedding else None
        if c.use_speaker_embedding:
            if c.use_external_speaker_embedding_file:
                speaker_embedding = speaker_mapping[list(speaker_mapping.keys())[randrange(len(speaker_mapping)-1)]]['embedding']
                speaker_id = None
            else:
                speaker_id = 0
                speaker_embedding = None
        else:
            speaker_id = None
            speaker_embedding = None

        style_wav = c.get("style_wav_for_test")
        for idx, test_sentence in enumerate(test_sentences):
            try:
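
An aside on the random pick above: `randrange(len(speaker_mapping)-1)` as written can never select the last entry, since `randrange(n)` samples 0..n-1; `randrange(len(speaker_mapping))` would cover all of them. A toy illustration (mapping contents are hypothetical):

```python
from random import randrange

# Toy stand-in for the speakers.json mapping used above.
speaker_mapping = {
    'a.wav': {'name': 'spk1', 'embedding': [0.1, 0.2]},
    'b.wav': {'name': 'spk2', 'embedding': [0.3, 0.4]},
    'c.wav': {'name': 'spk3', 'embedding': [0.5, 0.6]},
}
keys = list(speaker_mapping.keys())
# randrange(len(keys)) samples every entry; len(keys) - 1 would skip 'c.wav'.
embedding = speaker_mapping[keys[randrange(len(keys))]]['embedding']
print(embedding)
```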

@@ -409,6 +435,7 @@ def evaluate(model, criterion, ap, global_step, epoch):
                    use_cuda,
                    ap,
                    speaker_id=speaker_id,
                    speaker_embedding=speaker_embedding,
                    style_wav=style_wav,
                    truncated=False,
                    enable_eos_bos_chars=c.enable_eos_bos_chars,  #pylint: disable=unused-argument

@@ -459,38 +486,13 @@ def main(args):  # pylint: disable=redefined-outer-name
        meta_data_eval = meta_data_eval[:int(len(meta_data_eval) * c.eval_portion)]

    # parse speakers
    if c.use_speaker_embedding:
        speakers = get_speakers(meta_data_train)
        if args.restore_path:
            prev_out_path = os.path.dirname(args.restore_path)
            speaker_mapping = load_speaker_mapping(prev_out_path)
            assert all([speaker in speaker_mapping
                        for speaker in speakers]), "As of now, you cannot " \
                                                   "introduce new speakers to " \
                                                   "a previously trained model."
        else:
            speaker_mapping = {name: i for i, name in enumerate(speakers)}
        save_speaker_mapping(OUT_PATH, speaker_mapping)
        num_speakers = len(speaker_mapping)
        print("Training with {} speakers: {}".format(num_speakers,
                                                     ", ".join(speakers)))
    else:
        num_speakers = 0
    num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers(c, args, meta_data_train, OUT_PATH)

    # setup model
    model = setup_model(num_chars, num_speakers, c)
    model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim=speaker_embedding_dim)
    optimizer = RAdam(model.parameters(), lr=c.lr, weight_decay=0, betas=(0.9, 0.98), eps=1e-9)
    criterion = GlowTTSLoss()

    if c.apex_amp_level:
        # pylint: disable=import-outside-toplevel
        from apex import amp
        from apex.parallel import DistributedDataParallel as DDP
        model.cuda()
        model, optimizer = amp.initialize(model, optimizer, opt_level=c.apex_amp_level)
    else:
        amp = None

    if args.restore_path:
        checkpoint = torch.load(args.restore_path, map_location='cpu')
        try:
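
The refactor above funnels all speaker-set handling into one `parse_speakers` call. Judging only from the call sites in this diff, a rough sketch of the contract it appears to satisfy; this is an assumption about its shape, not the repo's implementation (which lives in `TTS.tts.utils.speakers`):

```python
# Hedged sketch: single-mode version of a parse_speakers-like helper.
def parse_speakers_sketch(use_speaker_embedding, speakers):
    if not use_speaker_embedding:
        return 0, None, None  # num_speakers, embedding_dim, mapping
    mapping = {name: i for i, name in enumerate(sorted(speakers))}
    return len(mapping), None, mapping

num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers_sketch(
    True, ['spk1', 'spk2'])
print(num_speakers, speaker_mapping)  # 2 {'spk1': 0, 'spk2': 1}
```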

@@ -507,9 +509,6 @@ def main(args):  # pylint: disable=redefined-outer-name
            model.load_state_dict(model_dict)
            del model_dict

        if amp and 'amp' in checkpoint:
            amp.load_state_dict(checkpoint['amp'])

        for group in optimizer.param_groups:
            group['initial_lr'] = c.lr
        print(" > Model restored from step %d" % checkpoint['step'],

@@ -524,7 +523,7 @@ def main(args):  # pylint: disable=redefined-outer-name

    # DISTRIBUTED
    if num_gpus > 1:
        model = DDP(model)
        model = DDP_th(model, device_ids=[args.rank])

    if c.noam_schedule:
        scheduler = NoamLR(optimizer,

@@ -540,19 +539,19 @@ def main(args):  # pylint: disable=redefined-outer-name
    best_loss = float('inf')

    global_step = args.restore_step
    model = data_depended_init(model, ap)
    model = data_depended_init(model, ap, speaker_mapping)
    for epoch in range(0, c.epochs):
        c_logger.print_epoch_start(epoch, c.epochs)
        train_avg_loss_dict, global_step = train(model, criterion, optimizer,
                                                 scheduler, ap, global_step,
                                                 epoch, amp)
        eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch)
                                                 epoch, speaker_mapping)
        eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch, speaker_mapping=speaker_mapping)
        c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
        target_loss = train_avg_loss_dict['avg_loss']
        if c.run_eval:
            target_loss = eval_avg_loss_dict['avg_loss']
        best_loss = save_best_model(target_loss, best_loss, model, optimizer, global_step, epoch, c.r,
                                    OUT_PATH, amp_state_dict=amp.state_dict() if amp else None)
                                    OUT_PATH)


if __name__ == '__main__':

@@ -602,10 +601,11 @@ if __name__ == '__main__':
    # setup output paths and read configs
    c = load_config(args.config_path)
    # check_config(c)
    check_config_tts(c)
    _ = os.path.dirname(os.path.realpath(__file__))

    if c.apex_amp_level:
        print(" > apex AMP level: ", c.apex_amp_level)
    if c.mixed_precision:
        print(" > Mixed precision enabled.")

    OUT_PATH = args.continue_path
    if args.continue_path == '':

@@ -7,28 +7,25 @@ import os
import sys
import time
import traceback
from random import randrange

import numpy as np
import torch

from random import randrange
from torch.utils.data import DataLoader
from TTS.tts.datasets.preprocess import load_meta_data
from TTS.tts.datasets.TTSDataset import MyDataset
from TTS.tts.layers.losses import TacotronLoss
from TTS.tts.utils.distribute import (DistributedSampler,
                                      apply_gradient_allreduce,
                                      init_distributed, reduce_tensor)
from TTS.tts.utils.generic_utils import setup_model, check_config_tts
from TTS.tts.utils.generic_utils import check_config_tts, setup_model
from TTS.tts.utils.io import save_best_model, save_checkpoint
from TTS.tts.utils.measures import alignment_diagonal_score
from TTS.tts.utils.speakers import (get_speakers, load_speaker_mapping,
                                    save_speaker_mapping)
from TTS.tts.utils.speakers import load_speaker_mapping, parse_speakers
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram
from TTS.utils.audio import AudioProcessor
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.distribute import (DistributedSampler, apply_gradient_allreduce,
                                  init_distributed, reduce_tensor)
from TTS.utils.generic_utils import (KeepAverage, count_parameters,
                                     create_experiment_folder, get_git_branch,
                                     remove_experiment_folder, set_init_dict)

@@ -41,6 +38,7 @@ from TTS.utils.training import (NoamLR, adam_weight_decay, check_update,

use_cuda, num_gpus = setup_torch_training_env(True, False)


def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
    if is_val and not c.run_eval:
        loader = None

@@ -52,6 +50,7 @@ def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
        meta_data=meta_data_eval if is_val else meta_data_train,
        ap=ap,
        tp=c.characters if 'characters' in c.keys() else None,
        add_blank=c['add_blank'] if 'add_blank' in c.keys() else False,
        batch_group_size=0 if is_val else c.batch_group_size *
        c.batch_size,
        min_seq_len=c.min_seq_len,

@@ -87,8 +86,8 @@ def format_data(data, speaker_mapping=None):
    mel_input = data[4]
    mel_lengths = data[5]
    stop_targets = data[6]
    avg_text_length = torch.mean(text_lengths.float())
    avg_spec_length = torch.mean(mel_lengths.float())
    max_text_length = torch.max(text_lengths.float())
    max_spec_length = torch.max(mel_lengths.float())

    if c.use_speaker_embedding:
        if c.use_external_speaker_embedding_file:

@@ -124,11 +123,11 @@ def format_data(data, speaker_mapping=None):
    if speaker_embeddings is not None:
        speaker_embeddings = speaker_embeddings.cuda(non_blocking=True)

    return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length
    return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, max_text_length, max_spec_length

def train(model, criterion, optimizer, optimizer_st, scheduler,
          ap, global_step, epoch, amp, speaker_mapping=None):
          ap, global_step, epoch, scaler, scaler_st, speaker_mapping=None):
    data_loader = setup_loader(ap, model.decoder.r, is_val=False,
                               verbose=(epoch == 0), speaker_mapping=speaker_mapping)
    model.train()

@@ -145,7 +144,7 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
        start_time = time.time()

        # format data
        text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length = format_data(data, speaker_mapping)
        text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, max_text_length, max_spec_length = format_data(data, speaker_mapping)
        loader_time = time.time() - end_time

        global_step += 1

@@ -153,65 +152,79 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
        # setup lr
        if c.noam_schedule:
            scheduler.step()

        optimizer.zero_grad()
        if optimizer_st:
            optimizer_st.zero_grad()

        # forward pass model
        if c.bidirectional_decoder or c.double_decoder_consistency:
            decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
                text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
        else:
            decoder_output, postnet_output, alignments, stop_tokens = model(
                text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
            decoder_backward_output = None
            alignments_backward = None
        with torch.cuda.amp.autocast(enabled=c.mixed_precision):
            # forward pass model
            if c.bidirectional_decoder or c.double_decoder_consistency:
                decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
                    text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
            else:
                decoder_output, postnet_output, alignments, stop_tokens = model(
                    text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
                decoder_backward_output = None
                alignments_backward = None

        # set the [alignment] lengths wrt reduction factor for guided attention
        if mel_lengths.max() % model.decoder.r != 0:
            alignment_lengths = (mel_lengths + (model.decoder.r - (mel_lengths.max() % model.decoder.r))) // model.decoder.r
        else:
            alignment_lengths = mel_lengths // model.decoder.r
            # set the [alignment] lengths wrt reduction factor for guided attention
            if mel_lengths.max() % model.decoder.r != 0:
                alignment_lengths = (mel_lengths + (model.decoder.r - (mel_lengths.max() % model.decoder.r))) // model.decoder.r
            else:
                alignment_lengths = mel_lengths // model.decoder.r

        # compute loss
        loss_dict = criterion(postnet_output, decoder_output, mel_input,
                              linear_input, stop_tokens, stop_targets,
                              mel_lengths, decoder_backward_output,
                              alignments, alignment_lengths, alignments_backward,
                              text_lengths)
            # compute loss
            loss_dict = criterion(postnet_output, decoder_output, mel_input,
                                  linear_input, stop_tokens, stop_targets,
                                  mel_lengths, decoder_backward_output,
                                  alignments, alignment_lengths, alignments_backward,
                                  text_lengths)

        # backward pass
        if amp is not None:
            with amp.scale_loss(loss_dict['loss'], optimizer) as scaled_loss:
                scaled_loss.backward()
        # check nan loss
        if torch.isnan(loss_dict['loss']).any():
            raise RuntimeError(f'Detected NaN loss at step {global_step}.')

        # optimizer step
        if c.mixed_precision:
            # model optimizer step in mixed precision mode
            scaler.scale(loss_dict['loss']).backward()
            scaler.unscale_(optimizer)
            optimizer, current_lr = adam_weight_decay(optimizer)
            grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
            scaler.step(optimizer)
            scaler.update()

            # stopnet optimizer step
            if c.separate_stopnet:
                scaler_st.scale(loss_dict['stopnet_loss']).backward()
                scaler_st.unscale_(optimizer_st)
                optimizer_st, _ = adam_weight_decay(optimizer_st)
                grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0)
                scaler_st.step(optimizer_st)
                scaler_st.update()
            else:
                grad_norm_st = 0
        else:
            # main model optimizer step
            loss_dict['loss'].backward()
            optimizer, current_lr = adam_weight_decay(optimizer)
            grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
            optimizer.step()

        optimizer, current_lr = adam_weight_decay(optimizer)
        if amp:
            amp_opt_params = amp.master_params(optimizer)
        else:
            amp_opt_params = None
        grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True, amp_opt_params=amp_opt_params)
        optimizer.step()
            # stopnet optimizer step
            if c.separate_stopnet:
                loss_dict['stopnet_loss'].backward()
                optimizer_st, _ = adam_weight_decay(optimizer_st)
                grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0)
                optimizer_st.step()
            else:
                grad_norm_st = 0

        # compute alignment error (the lower the better )
        align_error = 1 - alignment_diagonal_score(alignments)
        loss_dict['align_error'] = align_error

        # backpass and check the grad norm for stop loss
        if c.separate_stopnet:
            loss_dict['stopnet_loss'].backward()
            optimizer_st, _ = adam_weight_decay(optimizer_st)
            if amp:
                amp_opt_params = amp.master_params(optimizer)
            else:
                amp_opt_params = None
            grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0, amp_opt_params=amp_opt_params)
            optimizer_st.step()
        else:
            grad_norm_st = 0

        step_time = time.time() - start_time
        epoch_time += step_time
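
A quick worked note on the reduction-factor rounding above: the batch is effectively padded so its longest mel spectrogram becomes a multiple of `r`, and then every length is divided by `r`. A compact check with toy values; the `(… ) % r` trick collapses the if/else into one expression:

```python
import torch

r = 5
mel_lengths = torch.tensor([12, 15, 7])   # toy frame counts
pad = (r - mel_lengths.max() % r) % r     # frames added to the batch
alignment_lengths = (mel_lengths + pad) // r
print(pad.item(), alignment_lengths)      # 0 tensor([2, 3, 1])
```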

@@ -242,8 +255,8 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
        # print training progress
        if global_step % c.print_step == 0:
            log_dict = {
                "avg_spec_length": [avg_spec_length, 1],  # value, precision
                "avg_text_length": [avg_text_length, 1],
                "max_spec_length": [max_spec_length, 1],  # value, precision
                "max_text_length": [max_text_length, 1],
                "step_time": [step_time, 4],
                "loader_time": [loader_time, 2],
                "current_lr": current_lr,

@@ -270,7 +283,7 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
            save_checkpoint(model, optimizer, global_step, epoch, model.decoder.r, OUT_PATH,
                            optimizer_st=optimizer_st,
                            model_loss=loss_dict['postnet_loss'],
                            amp_state_dict=amp.state_dict() if amp else None)
                            scaler=scaler.state_dict() if c.mixed_precision else None)

            # Diagnostic visualizations
            const_spec = postnet_output[0].data.cpu().numpy()

@@ -502,45 +515,14 @@ def main(args):  # pylint: disable=redefined-outer-name
        meta_data_eval = meta_data_eval[:int(len(meta_data_eval) * c.eval_portion)]

    # parse speakers
    if c.use_speaker_embedding:
        speakers = get_speakers(meta_data_train)
        if args.restore_path:
            if c.use_external_speaker_embedding_file:  # if restore checkpoint and use External Embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                if not speaker_mapping:
                    print("WARNING: speakers.json was not found in restore_path, trying to use CONFIG.external_speaker_embedding_file")
                    speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
                    if not speaker_mapping:
                        raise RuntimeError("You must copy the file speakers.json to restore_path, or set a valid file in CONFIG.external_speaker_embedding_file")
                speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
            elif not c.use_external_speaker_embedding_file:  # if restore checkpoint and don't use External Embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                speaker_embedding_dim = None
                assert all([speaker in speaker_mapping
                            for speaker in speakers]), "As of now, you cannot " \
                                                       "introduce new speakers to " \
                                                       "a previously trained model."
        elif c.use_external_speaker_embedding_file and c.external_speaker_embedding_file:  # if start new train using External Embedding file
            speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
            speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
        elif c.use_external_speaker_embedding_file and not c.external_speaker_embedding_file:  # if start new train using External Embedding file and don't pass external embedding file
            raise "use_external_speaker_embedding_file is True, so you need pass a external speaker embedding file, run GE2E-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb or AngularPrototypical-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb notebook in notebooks/ folder"
        else:  # if start new train and don't use External Embedding file
            speaker_mapping = {name: i for i, name in enumerate(speakers)}
            speaker_embedding_dim = None
            save_speaker_mapping(OUT_PATH, speaker_mapping)
        num_speakers = len(speaker_mapping)
        print("Training with {} speakers: {}".format(num_speakers,
                                                     ", ".join(speakers)))
    else:
        num_speakers = 0
        speaker_embedding_dim = None
        speaker_mapping = None
    num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers(c, args, meta_data_train, OUT_PATH)

    model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim)

    # scalers for mixed precision training
    scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
    scaler_st = torch.cuda.amp.GradScaler() if c.mixed_precision and c.separate_stopnet else None

    params = set_weight_decay(model, c.wd)
    optimizer = RAdam(params, lr=c.lr, weight_decay=0)
    if c.stopnet and c.separate_stopnet:

@@ -550,26 +532,22 @@ def main(args):  # pylint: disable=redefined-outer-name
    else:
        optimizer_st = None

    if c.apex_amp_level == "O1":
        # pylint: disable=import-outside-toplevel
        from apex import amp
        model.cuda()
        model, optimizer = amp.initialize(model, optimizer, opt_level=c.apex_amp_level)
    else:
        amp = None

    # setup criterion
    criterion = TacotronLoss(c, stopnet_pos_weight=10.0, ga_sigma=0.4)

    if args.restore_path:
        checkpoint = torch.load(args.restore_path, map_location='cpu')
        try:
            # TODO: fix optimizer init, model.cuda() needs to be called before
            print(" > Restoring Model.")
            model.load_state_dict(checkpoint['model'])
            # optimizer restore
            # optimizer.load_state_dict(checkpoint['optimizer'])
            print(" > Restoring Optimizer.")
            optimizer.load_state_dict(checkpoint['optimizer'])
            if "scaler" in checkpoint and c.mixed_precision:
                print(" > Restoring AMP Scaler...")
                scaler.load_state_dict(checkpoint["scaler"])
            if c.reinit_layers:
                raise RuntimeError
            model.load_state_dict(checkpoint['model'])
        except KeyError:
            print(" > Partial model initialization.")
            model_dict = model.state_dict()

@@ -579,9 +557,6 @@ def main(args):  # pylint: disable=redefined-outer-name
            model.load_state_dict(model_dict)
            del model_dict

        if amp and 'amp' in checkpoint:
            amp.load_state_dict(checkpoint['amp'])

        for group in optimizer.param_groups:
            group['lr'] = c.lr
        print(" > Model restored from step %d" % checkpoint['step'],
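
For reference, the scaler restore above pairs with saving `scaler.state_dict()` at checkpoint time, so the loss-scale history survives a resume. A minimal round-trip sketch; the file name is illustrative:

```python
import torch

scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

# Save the scaler alongside the model, as the checkpoint code above does.
torch.save({'scaler': scaler.state_dict()}, 'checkpoint.pth.tar')

# ... and restore it on resume so loss scaling picks up where it left off.
checkpoint = torch.load('checkpoint.pth.tar', map_location='cpu')
if 'scaler' in checkpoint:
    scaler.load_state_dict(checkpoint['scaler'])
```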

@@ -624,14 +599,14 @@ def main(args):  # pylint: disable=redefined-outer-name
        print("\n > Number of output frames:", model.decoder.r)
        train_avg_loss_dict, global_step = train(model, criterion, optimizer,
                                                 optimizer_st, scheduler, ap,
                                                 global_step, epoch, amp, speaker_mapping)
                                                 global_step, epoch, scaler, scaler_st, speaker_mapping)
        eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch, speaker_mapping)
        c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
        target_loss = train_avg_loss_dict['avg_postnet_loss']
        if c.run_eval:
            target_loss = eval_avg_loss_dict['avg_postnet_loss']
        best_loss = save_best_model(target_loss, best_loss, model, optimizer, global_step, epoch, c.r,
                                    OUT_PATH, amp_state_dict=amp.state_dict() if amp else None)
                                    OUT_PATH, scaler=scaler.state_dict() if c.mixed_precision else None)


if __name__ == '__main__':

@@ -683,8 +658,8 @@ if __name__ == '__main__':
    check_config_tts(c)
    _ = os.path.dirname(os.path.realpath(__file__))

    if c.apex_amp_level == 'O1':
        print(" > apex AMP level: ", c.apex_amp_level)
    if c.mixed_precision:
        print(" > Mixed precision mode is ON")

    OUT_PATH = args.continue_path
    if args.continue_path == '':
|
|||
from TTS.utils.training import setup_torch_training_env
|
||||
from TTS.vocoder.datasets.gan_dataset import GANDataset
|
||||
from TTS.vocoder.datasets.preprocess import load_wav_data, load_wav_feat_data
|
||||
# from distribute import (DistributedSampler, apply_gradient_allreduce,
|
||||
# init_distributed, reduce_tensor)
|
||||
from TTS.vocoder.layers.losses import DiscriminatorLoss, GeneratorLoss
|
||||
from TTS.vocoder.utils.generic_utils import (plot_results, setup_discriminator,
|
||||
setup_generator)
|
||||
from TTS.vocoder.utils.io import save_best_model, save_checkpoint
|
||||
|
||||
# DISTRIBUTED
|
||||
from torch.nn.parallel import DistributedDataParallel as DDP_th
|
||||
from torch.utils.data.distributed import DistributedSampler
|
||||
from TTS.utils.distribute import init_distributed
|
||||
|
||||
use_cuda, num_gpus = setup_torch_training_env(True, True)
|
||||
|
||||
|
||||
|
@ -45,12 +48,12 @@ def setup_loader(ap, is_val=False, verbose=False):
|
|||
use_cache=c.use_cache,
|
||||
verbose=verbose)
|
||||
dataset.shuffle_mapping()
|
||||
# sampler = DistributedSampler(dataset) if num_gpus > 1 else None
|
||||
sampler = DistributedSampler(dataset, shuffle=True) if num_gpus > 1 else None
|
||||
loader = DataLoader(dataset,
|
||||
batch_size=1 if is_val else c.batch_size,
|
||||
shuffle=True,
|
||||
shuffle=False if num_gpus > 1 else True,
|
||||
drop_last=False,
|
||||
sampler=None,
|
||||
sampler=sampler,
|
||||
num_workers=c.num_val_loader_workers
|
||||
if is_val else c.num_loader_workers,
|
||||
pin_memory=False)
|
||||
|
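
The loader change above follows the standard rule that a `DistributedSampler` and `shuffle=True` on the `DataLoader` are mutually exclusive: each process gets its own shard and shuffling is delegated to the sampler. A minimal sketch of that pattern with a toy dataset; `world_size` is a stand-in for the number of processes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))
world_size = 1  # > 1 only under torch.distributed; illustrative here

# With a DistributedSampler each process sees its own shard, and shuffling
# must be done by the sampler instead of the DataLoader.
sampler = DistributedSampler(dataset, shuffle=True) if world_size > 1 else None
loader = DataLoader(dataset,
                    batch_size=8,
                    shuffle=sampler is None,
                    sampler=sampler)
```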

@@ -243,41 +246,42 @@ def train(model_G, criterion_G, optimizer_G, model_D, criterion_D, optimizer_D,
            c_logger.print_train_step(batch_n_iter, num_iter, global_step,
                                      log_dict, loss_dict, keep_avg.avg_values)

        # plot step stats
        if global_step % 10 == 0:
            iter_stats = {
                "lr_G": current_lr_G,
                "lr_D": current_lr_D,
                "step_time": step_time
            }
            iter_stats.update(loss_dict)
            tb_logger.tb_train_iter_stats(global_step, iter_stats)
        if args.rank == 0:
            # plot step stats
            if global_step % 10 == 0:
                iter_stats = {
                    "lr_G": current_lr_G,
                    "lr_D": current_lr_D,
                    "step_time": step_time
                }
                iter_stats.update(loss_dict)
                tb_logger.tb_train_iter_stats(global_step, iter_stats)

        # save checkpoint
        if global_step % c.save_step == 0:
            if c.checkpoint:
                # save model
                save_checkpoint(model_G,
                                optimizer_G,
                                scheduler_G,
                                model_D,
                                optimizer_D,
                                scheduler_D,
                                global_step,
                                epoch,
                                OUT_PATH,
                                model_losses=loss_dict)
            # save checkpoint
            if global_step % c.save_step == 0:
                if c.checkpoint:
                    # save model
                    save_checkpoint(model_G,
                                    optimizer_G,
                                    scheduler_G,
                                    model_D,
                                    optimizer_D,
                                    scheduler_D,
                                    global_step,
                                    epoch,
                                    OUT_PATH,
                                    model_losses=loss_dict)

            # compute spectrograms
            figures = plot_results(y_hat_vis, y_G, ap, global_step,
                                   'train')
            tb_logger.tb_train_figures(global_step, figures)
                # compute spectrograms
                figures = plot_results(y_hat_vis, y_G, ap, global_step,
                                       'train')
                tb_logger.tb_train_figures(global_step, figures)

            # Sample audio
            sample_voice = y_hat_vis[0].squeeze(0).detach().cpu().numpy()
            tb_logger.tb_train_audios(global_step,
                                      {'train/audio': sample_voice},
                                      c.audio["sample_rate"])
                # Sample audio
                sample_voice = y_hat_vis[0].squeeze(0).detach().cpu().numpy()
                tb_logger.tb_train_audios(global_step,
                                          {'train/audio': sample_voice},
                                          c.audio["sample_rate"])
        end_time = time.time()

        # print epoch stats

@@ -286,7 +290,8 @@ def train(model_G, criterion_G, optimizer_G, model_D, criterion_D, optimizer_D,
    # Plot Training Epoch Stats
    epoch_stats = {"epoch_time": epoch_time}
    epoch_stats.update(keep_avg.avg_values)
    tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
    if args.rank == 0:
        tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
    # TODO: plot model stats
    # if c.tb_model_param_stats:
    #     tb_logger.tb_model_weights(model, global_step)

@@ -326,7 +331,6 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
            y_hat = model_G.pqmf_synthesis(y_hat)
            y_G_sub = model_G.pqmf_analysis(y_G)

        scores_fake, feats_fake, feats_real = None, None, None
        if global_step > c.steps_to_start_discriminator:

@@ -403,7 +407,6 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
            else:
                loss_dict[key] = value.item()

        step_time = time.time() - start_time
        epoch_time += step_time

@@ -419,20 +422,21 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
        if c.print_eval:
            c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)

    # compute spectrograms
    figures = plot_results(y_hat, y_G, ap, global_step, 'eval')
    tb_logger.tb_eval_figures(global_step, figures)
    if args.rank == 0:
        # compute spectrograms
        figures = plot_results(y_hat, y_G, ap, global_step, 'eval')
        tb_logger.tb_eval_figures(global_step, figures)

    # Sample audio
    sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
    tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
                             c.audio["sample_rate"])
        # Sample audio
        sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
        tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
                                 c.audio["sample_rate"])

    # synthesize a full voice
    tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)

    # synthesize a full voice
    data_loader.return_segments = False

    tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)

    return keep_avg.avg_values

@@ -443,7 +447,8 @@ def main(args):  # pylint: disable=redefined-outer-name
    print(f" > Loading wavs from: {c.data_path}")
    if c.feature_path is not None:
        print(f" > Loading features from: {c.feature_path}")
        eval_data, train_data = load_wav_feat_data(c.data_path, c.feature_path, c.eval_split_size)
        eval_data, train_data = load_wav_feat_data(
            c.data_path, c.feature_path, c.eval_split_size)
    else:
        eval_data, train_data = load_wav_data(c.data_path, c.eval_split_size)

@@ -451,9 +456,9 @@ def main(args):  # pylint: disable=redefined-outer-name
    ap = AudioProcessor(**c.audio)

    # DISTRIBUTED
    # if num_gpus > 1:
    #     init_distributed(args.rank, num_gpus, args.group_id,
    #                      c.distributed["backend"], c.distributed["url"])
    if num_gpus > 1:
        init_distributed(args.rank, num_gpus, args.group_id,
                         c.distributed["backend"], c.distributed["url"])

    # setup models
    model_gen = setup_generator(c)

@@ -470,10 +475,12 @@ def main(args):  # pylint: disable=redefined-outer-name
    scheduler_disc = None
    if 'lr_scheduler_gen' in c:
        scheduler_gen = getattr(torch.optim.lr_scheduler, c.lr_scheduler_gen)
        scheduler_gen = scheduler_gen(optimizer_gen, **c.lr_scheduler_gen_params)
        scheduler_gen = scheduler_gen(
            optimizer_gen, **c.lr_scheduler_gen_params)
    if 'lr_scheduler_disc' in c:
        scheduler_disc = getattr(torch.optim.lr_scheduler, c.lr_scheduler_disc)
        scheduler_disc = scheduler_disc(optimizer_disc, **c.lr_scheduler_disc_params)
        scheduler_disc = scheduler_disc(
            optimizer_disc, **c.lr_scheduler_disc_params)

    # setup criterion
    criterion_gen = GeneratorLoss(c)

@@ -531,8 +538,9 @@ def main(args):  # pylint: disable=redefined-outer-name
        criterion_disc.cuda()

    # DISTRIBUTED
    # if num_gpus > 1:
    #     model = apply_gradient_allreduce(model)
    if num_gpus > 1:
        model_gen = DDP_th(model_gen, device_ids=[args.rank])
        model_disc = DDP_th(model_disc, device_ids=[args.rank])

    num_params = count_parameters(model_gen)
    print(" > Generator has {} parameters".format(num_params), flush=True)

@@ -572,8 +580,7 @@ if __name__ == '__main__':
    parser.add_argument(
        '--continue_path',
        type=str,
        help=
        'Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
        help='Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
        default='',
        required='--config_path' not in sys.argv)
    parser.add_argument(

@@ -0,0 +1,511 @@

import argparse
import glob
import os
import sys
import time
import traceback
import numpy as np

import torch
# DISTRIBUTED
from torch.nn.parallel import DistributedDataParallel as DDP_th
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from TTS.utils.audio import AudioProcessor
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.distribute import init_distributed
from TTS.utils.generic_utils import (KeepAverage, count_parameters,
                                     create_experiment_folder, get_git_branch,
                                     remove_experiment_folder, set_init_dict)
from TTS.utils.io import copy_config_file, load_config
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.training import setup_torch_training_env
from TTS.vocoder.datasets.preprocess import load_wav_data, load_wav_feat_data
from TTS.vocoder.datasets.wavegrad_dataset import WaveGradDataset
from TTS.vocoder.utils.generic_utils import plot_results, setup_generator
from TTS.vocoder.utils.io import save_best_model, save_checkpoint

use_cuda, num_gpus = setup_torch_training_env(True, True)


def setup_loader(ap, is_val=False, verbose=False):
    if is_val and not c.run_eval:
        loader = None
    else:
        dataset = WaveGradDataset(ap=ap,
                                  items=eval_data if is_val else train_data,
                                  seq_len=c.seq_len,
                                  hop_len=ap.hop_length,
                                  pad_short=c.pad_short,
                                  conv_pad=c.conv_pad,
                                  is_training=not is_val,
                                  return_segments=True,
                                  use_noise_augment=False,
                                  use_cache=c.use_cache,
                                  verbose=verbose)
        sampler = DistributedSampler(dataset) if num_gpus > 1 else None
        loader = DataLoader(dataset,
                            batch_size=c.batch_size,
                            shuffle=num_gpus <= 1,
                            drop_last=False,
                            sampler=sampler,
                            num_workers=c.num_val_loader_workers
                            if is_val else c.num_loader_workers,
                            pin_memory=False)

    return loader


def format_data(data):
    # return a whole audio segment
    m, x = data
    x = x.unsqueeze(1)
    if use_cuda:
        m = m.cuda(non_blocking=True)
        x = x.cuda(non_blocking=True)
    return m, x


def format_test_data(data):
    # return a whole audio segment
    m, x = data
    m = m[None, ...]
    x = x[None, None, ...]
    if use_cuda:
        m = m.cuda(non_blocking=True)
        x = x.cuda(non_blocking=True)
    return m, x


def train(model, criterion, optimizer,
          scheduler, scaler, ap, global_step, epoch):
    data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
    model.train()
    epoch_time = 0
    keep_avg = KeepAverage()
    if use_cuda:
        batch_n_iter = int(
            len(data_loader.dataset) / (c.batch_size * num_gpus))
    else:
        batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
    end_time = time.time()
    c_logger.print_train_start()
    # setup noise schedule
    noise_schedule = c['train_noise_schedule']
    betas = np.linspace(noise_schedule['min_val'], noise_schedule['max_val'], noise_schedule['num_steps'])
    if hasattr(model, 'module'):
        model.module.compute_noise_level(betas)
    else:
        model.compute_noise_level(betas)
for num_iter, data in enumerate(data_loader):
|
||||
start_time = time.time()
|
||||
|
||||
# format data
|
||||
m, x = format_data(data)
|
||||
loader_time = time.time() - end_time
|
||||
|
||||
global_step += 1
|
||||
|
||||
with torch.cuda.amp.autocast(enabled=c.mixed_precision):
|
||||
# compute noisy input
|
||||
if hasattr(model, 'module'):
|
||||
noise, x_noisy, noise_scale = model.module.compute_y_n(x)
|
||||
else:
|
||||
noise, x_noisy, noise_scale = model.compute_y_n(x)
|
||||
|
||||
# forward pass
|
||||
noise_hat = model(x_noisy, m, noise_scale)
|
||||
|
||||
# compute losses
|
||||
loss = criterion(noise, noise_hat)
|
||||
loss_wavegrad_dict = {'wavegrad_loss':loss}
|
||||
|
||||
# check nan loss
|
||||
if torch.isnan(loss).any():
|
||||
raise RuntimeError(f'Detected NaN loss at step {global_step}.')
|
||||
|
||||
optimizer.zero_grad()
|
||||
|
||||
# backward pass with loss scaling
|
||||
if c.mixed_precision:
|
||||
scaler.scale(loss).backward()
|
||||
scaler.unscale_(optimizer)
|
||||
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
|
||||
c.clip_grad)
|
||||
scaler.step(optimizer)
|
||||
scaler.update()
|
||||
else:
|
||||
loss.backward()
|
||||
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
|
||||
c.clip_grad)
|
||||
optimizer.step()
|
||||
|
||||
# schedule update
|
||||
if scheduler is not None:
|
||||
scheduler.step()
|
||||
|
||||
# disconnect loss values
|
||||
loss_dict = dict()
|
||||
for key, value in loss_wavegrad_dict.items():
|
||||
if isinstance(value, int):
|
||||
loss_dict[key] = value
|
||||
else:
|
||||
loss_dict[key] = value.item()
|
||||
|
||||
# epoch/step timing
|
||||
step_time = time.time() - start_time
|
||||
epoch_time += step_time
|
||||
|
||||
# get current learning rates
|
||||
current_lr = list(optimizer.param_groups)[0]['lr']
|
||||
|
||||
# update avg stats
|
||||
update_train_values = dict()
|
||||
for key, value in loss_dict.items():
|
||||
update_train_values['avg_' + key] = value
|
||||
update_train_values['avg_loader_time'] = loader_time
|
||||
update_train_values['avg_step_time'] = step_time
|
||||
keep_avg.update_values(update_train_values)
|
||||
|
||||
# print training stats
|
||||
if global_step % c.print_step == 0:
|
||||
log_dict = {
|
||||
'step_time': [step_time, 2],
|
||||
'loader_time': [loader_time, 4],
|
||||
"current_lr": current_lr,
|
||||
"grad_norm": grad_norm.item()
|
||||
}
|
||||
c_logger.print_train_step(batch_n_iter, num_iter, global_step,
|
||||
log_dict, loss_dict, keep_avg.avg_values)
|
||||
|
||||
if args.rank == 0:
|
||||
# plot step stats
|
||||
if global_step % 10 == 0:
|
||||
iter_stats = {
|
||||
"lr": current_lr,
|
||||
"grad_norm": grad_norm.item(),
|
||||
"step_time": step_time
|
||||
}
|
||||
iter_stats.update(loss_dict)
|
||||
tb_logger.tb_train_iter_stats(global_step, iter_stats)
|
||||
|
||||
# save checkpoint
|
||||
if global_step % c.save_step == 0:
|
||||
if c.checkpoint:
|
||||
# save model
|
||||
save_checkpoint(model,
|
||||
optimizer,
|
||||
scheduler,
|
||||
None,
|
||||
None,
|
||||
None,
|
||||
global_step,
|
||||
epoch,
|
||||
OUT_PATH,
|
||||
model_losses=loss_dict,
|
||||
scaler=scaler.state_dict() if c.mixed_precision else None)
|
||||
|
||||
end_time = time.time()
|
||||
|
||||
# print epoch stats
|
||||
c_logger.print_train_epoch_end(global_step, epoch, epoch_time, keep_avg)
|
||||
|
||||
# Plot Training Epoch Stats
|
||||
epoch_stats = {"epoch_time": epoch_time}
|
||||
epoch_stats.update(keep_avg.avg_values)
|
||||
if args.rank == 0:
|
||||
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
|
||||
# TODO: plot model stats
|
||||
if c.tb_model_param_stats and args.rank == 0:
|
||||
tb_logger.tb_model_weights(model, global_step)
|
||||
return keep_avg.avg_values, global_step
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def evaluate(model, criterion, ap, global_step, epoch):
|
||||
data_loader = setup_loader(ap, is_val=True, verbose=(epoch == 0))
|
||||
model.eval()
|
||||
epoch_time = 0
|
||||
keep_avg = KeepAverage()
|
||||
end_time = time.time()
|
||||
c_logger.print_eval_start()
|
||||
for num_iter, data in enumerate(data_loader):
|
||||
start_time = time.time()
|
||||
|
||||
# format data
|
||||
m, x = format_data(data)
|
||||
loader_time = time.time() - end_time
|
||||
|
||||
global_step += 1
|
||||
|
||||
# compute noisy input
|
||||
if hasattr(model, 'module'):
|
||||
noise, x_noisy, noise_scale = model.module.compute_y_n(x)
|
||||
else:
|
||||
noise, x_noisy, noise_scale = model.compute_y_n(x)
|
||||
|
||||
|
||||
# forward pass
|
||||
noise_hat = model(x_noisy, m, noise_scale)
|
||||
|
||||
# compute losses
|
||||
loss = criterion(noise, noise_hat)
|
||||
loss_wavegrad_dict = {'wavegrad_loss':loss}
|
||||
|
||||
|
||||
loss_dict = dict()
|
||||
for key, value in loss_wavegrad_dict.items():
|
||||
if isinstance(value, (int, float)):
|
||||
loss_dict[key] = value
|
||||
else:
|
||||
loss_dict[key] = value.item()
|
||||
|
||||
step_time = time.time() - start_time
|
||||
epoch_time += step_time
|
||||
|
||||
# update avg stats
|
||||
update_eval_values = dict()
|
||||
for key, value in loss_dict.items():
|
||||
update_eval_values['avg_' + key] = value
|
||||
update_eval_values['avg_loader_time'] = loader_time
|
||||
update_eval_values['avg_step_time'] = step_time
|
||||
keep_avg.update_values(update_eval_values)
|
||||
|
||||
# print eval stats
|
||||
if c.print_eval:
|
||||
c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)
|
||||
|
||||
if args.rank == 0:
|
||||
data_loader.dataset.return_segments = False
|
||||
samples = data_loader.dataset.load_test_samples(1)
|
||||
m, x = format_test_data(samples[0])
|
||||
|
||||
# setup noise schedule and inference
|
||||
noise_schedule = c['test_noise_schedule']
|
||||
betas = np.linspace(noise_schedule['min_val'], noise_schedule['max_val'], noise_schedule['num_steps'])
|
||||
if hasattr(model, 'module'):
|
||||
model.module.compute_noise_level(betas)
|
||||
# compute voice
|
||||
x_pred = model.module.inference(m)
|
||||
else:
|
||||
model.compute_noise_level(betas)
|
||||
# compute voice
|
||||
x_pred = model.inference(m)
|
||||
|
||||
# compute spectrograms
|
||||
figures = plot_results(x_pred, x, ap, global_step, 'eval')
|
||||
tb_logger.tb_eval_figures(global_step, figures)
|
||||
|
||||
# Sample audio
|
||||
sample_voice = x_pred[0].squeeze(0).detach().cpu().numpy()
|
||||
tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
|
||||
c.audio["sample_rate"])
|
||||
|
||||
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
|
||||
data_loader.dataset.return_segments = True
|
||||
|
||||
return keep_avg.avg_values
|
||||
|
||||
|
||||
def main(args): # pylint: disable=redefined-outer-name
|
||||
# pylint: disable=global-variable-undefined
|
||||
global train_data, eval_data
|
||||
print(f" > Loading wavs from: {c.data_path}")
|
||||
if c.feature_path is not None:
|
||||
print(f" > Loading features from: {c.feature_path}")
|
||||
eval_data, train_data = load_wav_feat_data(c.data_path, c.feature_path, c.eval_split_size)
|
||||
else:
|
||||
eval_data, train_data = load_wav_data(c.data_path, c.eval_split_size)
|
||||
|
||||
# setup audio processor
|
||||
ap = AudioProcessor(**c.audio)
|
||||
|
||||
# DISTRUBUTED
|
||||
if num_gpus > 1:
|
||||
init_distributed(args.rank, num_gpus, args.group_id,
|
||||
c.distributed["backend"], c.distributed["url"])
|
||||
|
||||
# setup models
|
||||
model = setup_generator(c)
|
||||
|
||||
# scaler for mixed_precision
|
||||
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
|
||||
|
||||
# setup optimizers
|
||||
optimizer = Adam(model.parameters(), lr=c.lr, weight_decay=0)
|
||||
|
||||
# schedulers
|
||||
scheduler = None
|
||||
if 'lr_scheduler' in c:
|
||||
scheduler = getattr(torch.optim.lr_scheduler, c.lr_scheduler)
|
||||
scheduler = scheduler(optimizer, **c.lr_scheduler_params)
|
||||
|
||||
# setup criterion
|
||||
criterion = torch.nn.L1Loss().cuda()
|
||||
|
||||
if args.restore_path:
|
||||
checkpoint = torch.load(args.restore_path, map_location='cpu')
|
||||
try:
|
||||
print(" > Restoring Model...")
|
||||
model.load_state_dict(checkpoint['model'])
|
||||
print(" > Restoring Optimizer...")
|
||||
optimizer.load_state_dict(checkpoint['optimizer'])
|
||||
if 'scheduler' in checkpoint:
|
||||
print(" > Restoring LR Scheduler...")
|
||||
scheduler.load_state_dict(checkpoint['scheduler'])
|
||||
# NOTE: Not sure if necessary
|
||||
scheduler.optimizer = optimizer
|
||||
if "scaler" in checkpoint and c.mixed_precision:
|
||||
print(" > Restoring AMP Scaler...")
|
||||
scaler.load_state_dict(checkpoint["scaler"])
|
||||
except RuntimeError:
|
||||
# retore only matching layers.
|
||||
print(" > Partial model initialization...")
|
||||
model_dict = model.state_dict()
|
||||
model_dict = set_init_dict(model_dict, checkpoint['model'], c)
|
||||
model.load_state_dict(model_dict)
|
||||
del model_dict
|
||||
|
||||
# reset lr if not countinuining training.
|
||||
for group in optimizer.param_groups:
|
||||
group['lr'] = c.lr
|
||||
|
||||
print(" > Model restored from step %d" % checkpoint['step'],
|
||||
flush=True)
|
||||
args.restore_step = checkpoint['step']
|
||||
else:
|
||||
args.restore_step = 0
|
||||
|
||||
if use_cuda:
|
||||
model.cuda()
|
||||
criterion.cuda()
|
||||
|
||||
# DISTRUBUTED
|
||||
if num_gpus > 1:
|
||||
model = DDP_th(model, device_ids=[args.rank])
|
||||
|
||||
num_params = count_parameters(model)
|
||||
print(" > WaveGrad has {} parameters".format(num_params), flush=True)
|
||||
|
||||
if 'best_loss' not in locals():
|
||||
best_loss = float('inf')
|
||||
|
||||
global_step = args.restore_step
|
||||
for epoch in range(0, c.epochs):
|
||||
c_logger.print_epoch_start(epoch, c.epochs)
|
||||
_, global_step = train(model, criterion, optimizer,
|
||||
scheduler, scaler, ap, global_step,
|
||||
epoch)
|
||||
eval_avg_loss_dict = evaluate(model, criterion, ap,
|
||||
global_step, epoch)
|
||||
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
|
||||
target_loss = eval_avg_loss_dict[c.target_loss]
|
||||
best_loss = save_best_model(target_loss,
|
||||
best_loss,
|
||||
model,
|
||||
optimizer,
|
||||
scheduler,
|
||||
None,
|
||||
None,
|
||||
None,
|
||||
global_step,
|
||||
epoch,
|
||||
OUT_PATH,
|
||||
model_losses=eval_avg_loss_dict,
|
||||
scaler=scaler.state_dict() if c.mixed_precision else None)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
'--continue_path',
|
||||
type=str,
|
||||
help=
|
||||
'Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
|
||||
default='',
|
||||
required='--config_path' not in sys.argv)
|
||||
parser.add_argument(
|
||||
'--restore_path',
|
||||
type=str,
|
||||
help='Model file to be restored. Use to finetune a model.',
|
||||
default='')
|
||||
parser.add_argument('--config_path',
|
||||
type=str,
|
||||
help='Path to config file for training.',
|
||||
required='--continue_path' not in sys.argv)
|
||||
parser.add_argument('--debug',
|
||||
type=bool,
|
||||
default=False,
|
||||
help='Do not verify commit integrity to run training.')
|
||||
|
||||
# DISTRUBUTED
|
||||
parser.add_argument(
|
||||
'--rank',
|
||||
type=int,
|
||||
default=0,
|
||||
help='DISTRIBUTED: process rank for distributed training.')
|
||||
parser.add_argument('--group_id',
|
||||
type=str,
|
||||
default="",
|
||||
help='DISTRIBUTED: process group id.')
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.continue_path != '':
|
||||
args.output_path = args.continue_path
|
||||
args.config_path = os.path.join(args.continue_path, 'config.json')
|
||||
list_of_files = glob.glob(
|
||||
args.continue_path +
|
||||
"/*.pth.tar") # * means all if need specific format then *.csv
|
||||
latest_model_file = max(list_of_files, key=os.path.getctime)
|
||||
args.restore_path = latest_model_file
|
||||
print(f" > Training continues for {args.restore_path}")
|
||||
|
||||
# setup output paths and read configs
|
||||
c = load_config(args.config_path)
|
||||
# check_config(c)
|
||||
_ = os.path.dirname(os.path.realpath(__file__))
|
||||
|
||||
# DISTRIBUTED
|
||||
if c.mixed_precision:
|
||||
print(" > Mixed precision is enabled")
|
||||
|
||||
OUT_PATH = args.continue_path
|
||||
if args.continue_path == '':
|
||||
OUT_PATH = create_experiment_folder(c.output_path, c.run_name,
|
||||
args.debug)
|
||||
|
||||
AUDIO_PATH = os.path.join(OUT_PATH, 'test_audios')
|
||||
|
||||
c_logger = ConsoleLogger()
|
||||
|
||||
if args.rank == 0:
|
||||
os.makedirs(AUDIO_PATH, exist_ok=True)
|
||||
new_fields = {}
|
||||
if args.restore_path:
|
||||
new_fields["restore_path"] = args.restore_path
|
||||
new_fields["github_branch"] = get_git_branch()
|
||||
copy_config_file(args.config_path,
|
||||
os.path.join(OUT_PATH, 'config.json'), new_fields)
|
||||
os.chmod(AUDIO_PATH, 0o775)
|
||||
os.chmod(OUT_PATH, 0o775)
|
||||
|
||||
LOG_DIR = OUT_PATH
|
||||
tb_logger = TensorboardLogger(LOG_DIR, model_name='VOCODER')
|
||||
|
||||
# write model desc to tensorboard
|
||||
tb_logger.tb_add_text('model-description', c['run_description'], 0)
|
||||
|
||||
try:
|
||||
main(args)
|
||||
except KeyboardInterrupt:
|
||||
remove_experiment_folder(OUT_PATH)
|
||||
try:
|
||||
sys.exit(0)
|
||||
except SystemExit:
|
||||
os._exit(0) # pylint: disable=protected-access
|
||||
except Exception: # pylint: disable=broad-except
|
||||
remove_experiment_folder(OUT_PATH)
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
|
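The WaveGrad trainer above optimizes the standard diffusion ε-prediction objective: corrupt a clean waveform with Gaussian noise at a randomly drawn noise level, let the model predict the injected noise from the noisy audio and the mel conditioning, and take an L1 loss against the true noise. The sketch below illustrates that step with a simplified, discrete noise-level sampler standing in for the model's `compute_y_n` (the real method also interpolates noise levels continuously between schedule steps); it is a conceptual sketch, not the repo's implementation.

```python
import numpy as np
import torch

# simplified linear beta schedule, like the one the script builds with np.linspace
betas = np.linspace(1e-6, 1e-2, 1000)
alpha_hat = torch.tensor(np.cumprod(1 - betas), dtype=torch.float32)

def compute_y_n(x):
    """Corrupt clean audio x at a randomly sampled noise level (sketch).

    Returns (target noise, noisy audio, per-item noise scale), mirroring
    what the script expects from model.compute_y_n(x).
    """
    b = x.shape[0]
    t = torch.randint(0, len(alpha_hat), (b,))          # random diffusion step
    noise_scale = alpha_hat[t].sqrt().view(b, 1, 1)     # sqrt of cumulative alpha
    noise = torch.randn_like(x)
    x_noisy = noise_scale * x + (1 - noise_scale ** 2).sqrt() * noise
    return noise, x_noisy, noise_scale.view(b, 1)

x = torch.randn(4, 1, 6400)                             # [batch, 1, samples]
noise, x_noisy, noise_scale = compute_y_n(x)
noise_hat = noise + 0.1 * torch.randn_like(noise)       # stand-in for model(x_noisy, m, noise_scale)
loss = torch.nn.L1Loss()(noise, noise_hat)              # same criterion as the script
print(loss.item())
```

The config's `train_noise_schedule` controls the `min_val`, `max_val`, and `num_steps` of the `np.linspace` call above.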
@@ -0,0 +1,539 @@
import argparse
import os
import sys
import traceback
import time
import glob
import random

import torch
from torch.utils.data import DataLoader

# from torch.utils.data.distributed import DistributedSampler

from TTS.tts.utils.visual import plot_spectrogram
from TTS.utils.audio import AudioProcessor
from TTS.utils.radam import RAdam
from TTS.utils.io import copy_config_file, load_config
from TTS.utils.training import setup_torch_training_env
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.generic_utils import (
    KeepAverage,
    count_parameters,
    create_experiment_folder,
    get_git_branch,
    remove_experiment_folder,
    set_init_dict,
)
from TTS.vocoder.datasets.wavernn_dataset import WaveRNNDataset
from TTS.vocoder.datasets.preprocess import (
    load_wav_data,
    load_wav_feat_data
)
from TTS.vocoder.utils.distribution import discretized_mix_logistic_loss, gaussian_loss
from TTS.vocoder.utils.generic_utils import setup_wavernn
from TTS.vocoder.utils.io import save_best_model, save_checkpoint


use_cuda, num_gpus = setup_torch_training_env(True, True)


def setup_loader(ap, is_val=False, verbose=False):
    if is_val and not c.run_eval:
        loader = None
    else:
        dataset = WaveRNNDataset(ap=ap,
                                 items=eval_data if is_val else train_data,
                                 seq_len=c.seq_len,
                                 hop_len=ap.hop_length,
                                 pad=c.padding,
                                 mode=c.mode,
                                 mulaw=c.mulaw,
                                 is_training=not is_val,
                                 verbose=verbose,
                                 )
        # sampler = DistributedSampler(dataset) if num_gpus > 1 else None
        loader = DataLoader(dataset,
                            shuffle=True,
                            collate_fn=dataset.collate,
                            batch_size=c.batch_size,
                            num_workers=c.num_val_loader_workers
                            if is_val
                            else c.num_loader_workers,
                            pin_memory=True,
                            )
    return loader


def format_data(data):
    # setup input data
    x_input = data[0]
    mels = data[1]
    y_coarse = data[2]

    # dispatch data to GPU
    if use_cuda:
        x_input = x_input.cuda(non_blocking=True)
        mels = mels.cuda(non_blocking=True)
        y_coarse = y_coarse.cuda(non_blocking=True)

    return x_input, mels, y_coarse


def train(model, optimizer, criterion, scheduler, scaler, ap, global_step, epoch):
    # create train loader
    data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
    model.train()
    epoch_time = 0
    keep_avg = KeepAverage()
    if use_cuda:
        batch_n_iter = int(len(data_loader.dataset) /
                           (c.batch_size * num_gpus))
    else:
        batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
    end_time = time.time()
    c_logger.print_train_start()
    # train loop
    for num_iter, data in enumerate(data_loader):
        start_time = time.time()
        x_input, mels, y_coarse = format_data(data)
        loader_time = time.time() - end_time
        global_step += 1

        optimizer.zero_grad()

        if c.mixed_precision:
            # mixed precision training
            with torch.cuda.amp.autocast():
                y_hat = model(x_input, mels)
                if isinstance(model.mode, int):
                    y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
                else:
                    y_coarse = y_coarse.float()
                y_coarse = y_coarse.unsqueeze(-1)
                # compute losses
                loss = criterion(y_hat, y_coarse)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            if c.grad_clip > 0:
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(), c.grad_clip)
            scaler.step(optimizer)
            scaler.update()
        else:
            # full precision training
            y_hat = model(x_input, mels)
            if isinstance(model.mode, int):
                y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
            else:
                y_coarse = y_coarse.float()
            y_coarse = y_coarse.unsqueeze(-1)
            # compute losses
            loss = criterion(y_hat, y_coarse)
            if torch.isnan(loss).any():
                raise RuntimeError(" [!] NaN loss. Exiting ...")
            loss.backward()
            if c.grad_clip > 0:
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(), c.grad_clip)
            optimizer.step()

        if scheduler is not None:
            scheduler.step()

        # get the current learning rate
        cur_lr = list(optimizer.param_groups)[0]["lr"]

        step_time = time.time() - start_time
        epoch_time += step_time

        update_train_values = dict()
        loss_dict = dict()
        loss_dict["model_loss"] = loss.item()
        for key, value in loss_dict.items():
            update_train_values["avg_" + key] = value
        update_train_values["avg_loader_time"] = loader_time
        update_train_values["avg_step_time"] = step_time
        keep_avg.update_values(update_train_values)

        # print training stats
        if global_step % c.print_step == 0:
            log_dict = {"step_time": [step_time, 2],
                        "loader_time": [loader_time, 4],
                        "current_lr": cur_lr,
                        }
            c_logger.print_train_step(batch_n_iter,
                                      num_iter,
                                      global_step,
                                      log_dict,
                                      loss_dict,
                                      keep_avg.avg_values,
                                      )

        # plot step stats
        if global_step % 10 == 0:
            iter_stats = {"lr": cur_lr, "step_time": step_time}
            iter_stats.update(loss_dict)
            tb_logger.tb_train_iter_stats(global_step, iter_stats)

        # save checkpoint
        if global_step % c.save_step == 0:
            if c.checkpoint:
                # save model
                save_checkpoint(model,
                                optimizer,
                                scheduler,
                                None,
                                None,
                                None,
                                global_step,
                                epoch,
                                OUT_PATH,
                                model_losses=loss_dict,
                                scaler=scaler.state_dict() if c.mixed_precision else None
                                )

            # synthesize a full voice
            rand_idx = random.randrange(0, len(train_data))
            wav_path = train_data[rand_idx] if not isinstance(
                train_data[rand_idx], (tuple, list)) else train_data[rand_idx][0]
            wav = ap.load_wav(wav_path)
            ground_mel = ap.melspectrogram(wav)
            sample_wav = model.generate(ground_mel,
                                        c.batched,
                                        c.target_samples,
                                        c.overlap_samples,
                                        use_cuda
                                        )
            predict_mel = ap.melspectrogram(sample_wav)

            # compute spectrograms
            figures = {"train/ground_truth": plot_spectrogram(ground_mel.T),
                       "train/prediction": plot_spectrogram(predict_mel.T)
                       }
            tb_logger.tb_train_figures(global_step, figures)

            # Sample audio
            tb_logger.tb_train_audios(
                global_step, {
                    "train/audio": sample_wav}, c.audio["sample_rate"]
            )
        end_time = time.time()

    # print epoch stats
    c_logger.print_train_epoch_end(global_step, epoch, epoch_time, keep_avg)

    # Plot Training Epoch Stats
    epoch_stats = {"epoch_time": epoch_time}
    epoch_stats.update(keep_avg.avg_values)
    tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
    # TODO: plot model stats
    # if c.tb_model_param_stats:
    #     tb_logger.tb_model_weights(model, global_step)
    return keep_avg.avg_values, global_step


@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
    # create eval loader
    data_loader = setup_loader(ap, is_val=True, verbose=(epoch == 0))
    model.eval()
    epoch_time = 0
    keep_avg = KeepAverage()
    end_time = time.time()
    c_logger.print_eval_start()
    with torch.no_grad():
        for num_iter, data in enumerate(data_loader):
            start_time = time.time()
            # format data
            x_input, mels, y_coarse = format_data(data)
            loader_time = time.time() - end_time
            global_step += 1

            y_hat = model(x_input, mels)
            if isinstance(model.mode, int):
                y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
            else:
                y_coarse = y_coarse.float()
            y_coarse = y_coarse.unsqueeze(-1)
            loss = criterion(y_hat, y_coarse)
            # Compute avg loss
            # if num_gpus > 1:
            #     loss = reduce_tensor(loss.data, num_gpus)
            loss_dict = dict()
            loss_dict["model_loss"] = loss.item()

            step_time = time.time() - start_time
            epoch_time += step_time

            # update avg stats
            update_eval_values = dict()
            for key, value in loss_dict.items():
                update_eval_values["avg_" + key] = value
            update_eval_values["avg_loader_time"] = loader_time
            update_eval_values["avg_step_time"] = step_time
            keep_avg.update_values(update_eval_values)

            # print eval stats
            if c.print_eval:
                c_logger.print_eval_step(
                    num_iter, loss_dict, keep_avg.avg_values)

    if epoch % c.test_every_epochs == 0 and epoch != 0:
        # synthesize a full voice
        rand_idx = random.randrange(0, len(eval_data))
        wav_path = eval_data[rand_idx] if not isinstance(
            eval_data[rand_idx], (tuple, list)) else eval_data[rand_idx][0]
        wav = ap.load_wav(wav_path)
        ground_mel = ap.melspectrogram(wav)
        sample_wav = model.generate(ground_mel,
                                    c.batched,
                                    c.target_samples,
                                    c.overlap_samples,
                                    use_cuda
                                    )
        predict_mel = ap.melspectrogram(sample_wav)

        # Sample audio
        tb_logger.tb_eval_audios(
            global_step, {
                "eval/audio": sample_wav}, c.audio["sample_rate"]
        )

        # compute spectrograms
        figures = {"eval/ground_truth": plot_spectrogram(ground_mel.T),
                   "eval/prediction": plot_spectrogram(predict_mel.T)
                   }
        tb_logger.tb_eval_figures(global_step, figures)

    tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
    return keep_avg.avg_values


# FIXME: move args definition/parsing inside of main?
def main(args):  # pylint: disable=redefined-outer-name
    # pylint: disable=global-variable-undefined
    global train_data, eval_data

    # setup audio processor
    ap = AudioProcessor(**c.audio)

    # print(f" > Loading wavs from: {c.data_path}")
    # if c.feature_path is not None:
    #     print(f" > Loading features from: {c.feature_path}")
    #     eval_data, train_data = load_wav_feat_data(
    #         c.data_path, c.feature_path, c.eval_split_size
    #     )
    # else:
    #     mel_feat_path = os.path.join(OUT_PATH, "mel")
    #     feat_data = find_feat_files(mel_feat_path)
    #     if feat_data:
    #         print(f" > Loading features from: {mel_feat_path}")
    #         eval_data, train_data = load_wav_feat_data(
    #             c.data_path, mel_feat_path, c.eval_split_size
    #         )
    #     else:
    #         print(" > No feature data found. Preprocessing...")
    #         # preprocessing feature data from given wav files
    #         preprocess_wav_files(OUT_PATH, CONFIG, ap)
    #         eval_data, train_data = load_wav_feat_data(
    #             c.data_path, mel_feat_path, c.eval_split_size
    #         )

    print(f" > Loading wavs from: {c.data_path}")
    if c.feature_path is not None:
        print(f" > Loading features from: {c.feature_path}")
        eval_data, train_data = load_wav_feat_data(
            c.data_path, c.feature_path, c.eval_split_size)
    else:
        eval_data, train_data = load_wav_data(
            c.data_path, c.eval_split_size)
    # setup model
    model_wavernn = setup_wavernn(c)

    # setup amp scaler
    scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None

    # define train functions
    if c.mode == "mold":
        criterion = discretized_mix_logistic_loss
    elif c.mode == "gauss":
        criterion = gaussian_loss
    elif isinstance(c.mode, int):
        criterion = torch.nn.CrossEntropyLoss()

    if use_cuda:
        model_wavernn.cuda()
        if isinstance(c.mode, int):
            criterion.cuda()

    optimizer = RAdam(model_wavernn.parameters(), lr=c.lr, weight_decay=0)

    scheduler = None
    if "lr_scheduler" in c:
        scheduler = getattr(torch.optim.lr_scheduler, c.lr_scheduler)
        scheduler = scheduler(optimizer, **c.lr_scheduler_params)
    # slow start for the first 5 epochs
    # lr_lambda = lambda epoch: min(epoch / c.warmup_steps, 1)
    # scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # restore any checkpoint
    if args.restore_path:
        checkpoint = torch.load(args.restore_path, map_location="cpu")
        try:
            print(" > Restoring Model...")
            model_wavernn.load_state_dict(checkpoint["model"])
            print(" > Restoring Optimizer...")
            optimizer.load_state_dict(checkpoint["optimizer"])
            if "scheduler" in checkpoint:
                print(" > Restoring Generator LR Scheduler...")
                scheduler.load_state_dict(checkpoint["scheduler"])
                scheduler.optimizer = optimizer
            if "scaler" in checkpoint and c.mixed_precision:
                print(" > Restoring AMP Scaler...")
                scaler.load_state_dict(checkpoint["scaler"])
        except RuntimeError:
            # restore only matching layers.
            print(" > Partial model initialization...")
            model_dict = model_wavernn.state_dict()
            model_dict = set_init_dict(model_dict, checkpoint["model"], c)
            model_wavernn.load_state_dict(model_dict)

        print(" > Model restored from step %d" %
              checkpoint["step"], flush=True)
        args.restore_step = checkpoint["step"]
    else:
        args.restore_step = 0

    # DISTRIBUTED
    # if num_gpus > 1:
    #     model = apply_gradient_allreduce(model)

    num_parameters = count_parameters(model_wavernn)
    print(" > Model has {} parameters".format(num_parameters), flush=True)

    if "best_loss" not in locals():
        best_loss = float("inf")

    global_step = args.restore_step
    for epoch in range(0, c.epochs):
        c_logger.print_epoch_start(epoch, c.epochs)
        _, global_step = train(model_wavernn, optimizer,
                               criterion, scheduler, scaler, ap, global_step, epoch)
        eval_avg_loss_dict = evaluate(
            model_wavernn, criterion, ap, global_step, epoch)
        c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
        target_loss = eval_avg_loss_dict["avg_model_loss"]
        best_loss = save_best_model(
            target_loss,
            best_loss,
            model_wavernn,
            optimizer,
            scheduler,
            None,
            None,
            None,
            global_step,
            epoch,
            OUT_PATH,
            model_losses=eval_avg_loss_dict,
            scaler=scaler.state_dict() if c.mixed_precision else None
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--continue_path",
        type=str,
        help='Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
        default="",
        required="--config_path" not in sys.argv,
    )
    parser.add_argument(
        "--restore_path",
        type=str,
        help="Model file to be restored. Use to finetune a model.",
        default="",
    )
    parser.add_argument(
        "--config_path",
        type=str,
        help="Path to config file for training.",
        required="--continue_path" not in sys.argv,
    )
    parser.add_argument(
        "--debug",
        type=bool,
        default=False,
        help="Do not verify commit integrity to run training.",
    )

    # DISTRIBUTED
    parser.add_argument(
        "--rank",
        type=int,
        default=0,
        help="DISTRIBUTED: process rank for distributed training.",
    )
    parser.add_argument(
        "--group_id", type=str, default="", help="DISTRIBUTED: process group id."
    )
    args = parser.parse_args()

    if args.continue_path != "":
        args.output_path = args.continue_path
        args.config_path = os.path.join(args.continue_path, "config.json")
        list_of_files = glob.glob(
            args.continue_path + "/*.pth.tar"
        )  # * means all if need specific format then *.csv
        latest_model_file = max(list_of_files, key=os.path.getctime)
        args.restore_path = latest_model_file
        print(f" > Training continues for {args.restore_path}")

    # setup output paths and read configs
    c = load_config(args.config_path)
    # check_config(c)
    _ = os.path.dirname(os.path.realpath(__file__))

    OUT_PATH = args.continue_path
    if args.continue_path == "":
        OUT_PATH = create_experiment_folder(
            c.output_path, c.run_name, args.debug
        )

    AUDIO_PATH = os.path.join(OUT_PATH, "test_audios")

    c_logger = ConsoleLogger()

    if args.rank == 0:
        os.makedirs(AUDIO_PATH, exist_ok=True)
        new_fields = {}
        if args.restore_path:
            new_fields["restore_path"] = args.restore_path
        new_fields["github_branch"] = get_git_branch()
        copy_config_file(
            args.config_path, os.path.join(OUT_PATH, "config.json"), new_fields
        )
        os.chmod(AUDIO_PATH, 0o775)
        os.chmod(OUT_PATH, 0o775)

    LOG_DIR = OUT_PATH
    tb_logger = TensorboardLogger(LOG_DIR, model_name="VOCODER")

    # write model desc to tensorboard
    tb_logger.tb_add_text("model-description", c["run_description"], 0)

    try:
        main(args)
    except KeyboardInterrupt:
        remove_experiment_folder(OUT_PATH)
        try:
            sys.exit(0)
        except SystemExit:
            os._exit(0)  # pylint: disable=protected-access
    except Exception:  # pylint: disable=broad-except
        remove_experiment_folder(OUT_PATH)
        traceback.print_exc()
        sys.exit(1)
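The WaveRNN dataset above takes a `mulaw` flag: when the model runs in a discrete-output mode (`c.mode` set to a bit depth), waveforms are commonly μ-law companded before quantization so the class grid spends more resolution on low-amplitude samples. A self-contained sketch of the textbook μ-law transform (these helpers are illustrative, not the repo's exact implementation):

```python
import numpy as np

def mulaw_encode(wav, qc=256):
    """Textbook mu-law companding of wav in [-1, 1] to qc discrete levels."""
    mu = qc - 1
    signal = np.sign(wav) * np.log1p(mu * np.abs(wav)) / np.log1p(mu)
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(idx, qc=256):
    """Invert mulaw_encode back to floats in [-1, 1]."""
    mu = qc - 1
    signal = 2 * idx.astype(np.float32) / mu - 1
    return np.sign(signal) / mu * ((1 + mu) ** np.abs(signal) - 1)

wav = np.sin(np.linspace(0, 8 * np.pi, 1000)).astype(np.float32)
idx = mulaw_encode(wav, qc=2 ** 9)   # e.g. c.mode == 9 -> 512 classes
rec = mulaw_decode(idx, qc=2 ** 9)
print(np.abs(wav - rec).max())       # small reconstruction error
```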
@@ -0,0 +1,91 @@
"""Search a good noise schedule for WaveGrad for a given number of inference iterations."""
import argparse
from itertools import product as cartesian_product

import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.datasets.wavegrad_dataset import WaveGradDataset
from TTS.vocoder.utils.generic_utils import setup_generator

parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str, help='Path to model checkpoint.')
parser.add_argument('--config_path', type=str, help='Path to model config file.')
parser.add_argument('--data_path', type=str, help='Path to data directory.')
parser.add_argument('--output_path', type=str, help='Path for the output file, including file name and extension.')
parser.add_argument('--num_iter', type=int, help='Number of model inference iterations to optimize the noise schedule for.')
parser.add_argument('--use_cuda', type=bool, help='Enable/disable CUDA.')
parser.add_argument('--num_samples', type=int, default=1, help='Number of data samples used for inference.')
parser.add_argument('--search_depth', type=int, default=3, help='Search granularity. Increasing this increases the run-time exponentially.')

# load config
args = parser.parse_args()
config = load_config(args.config_path)

# setup audio processor
ap = AudioProcessor(**config.audio)

# load dataset
_, train_data = load_wav_data(args.data_path, 0)
train_data = train_data[:args.num_samples]
dataset = WaveGradDataset(ap=ap,
                          items=train_data,
                          seq_len=-1,
                          hop_len=ap.hop_length,
                          pad_short=config.pad_short,
                          conv_pad=config.conv_pad,
                          is_training=True,
                          return_segments=False,
                          use_noise_augment=False,
                          use_cache=False,
                          verbose=True)
loader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=dataset.collate_full_clips,
    drop_last=False,
    num_workers=config.num_loader_workers,
    pin_memory=False)

# setup the model
model = setup_generator(config)
if args.use_cuda:
    model.cuda()

# setup optimization parameters
base_values = sorted(10 * np.random.uniform(size=args.search_depth))
print(base_values)
exponents = 10 ** np.linspace(-6, -1, num=args.num_iter)
best_error = float('inf')
best_schedule = None
total_search_iter = len(base_values)**args.num_iter
for base in tqdm(cartesian_product(base_values, repeat=args.num_iter), total=total_search_iter):
    beta = exponents * base
    model.compute_noise_level(beta)
    for data in loader:
        mel, audio = data
        y_hat = model.inference(mel.cuda() if args.use_cuda else mel)

        if args.use_cuda:
            y_hat = y_hat.cpu()
        y_hat = y_hat.numpy()

        mel_hat = []
        for i in range(y_hat.shape[0]):
            m = ap.melspectrogram(y_hat[i, 0])[:, :-1]
            mel_hat.append(torch.from_numpy(m))

        mel_hat = torch.stack(mel_hat)
        mse = torch.sum((mel - mel_hat) ** 2).mean()
        if mse.item() < best_error:
            best_error = mse.item()
            best_schedule = {'beta': beta}
            print(f" > Found a better schedule. - MSE: {mse.item()}")
            np.save(args.output_path, best_schedule)
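Note the cost model of this search: it enumerates `search_depth ** num_iter` candidate schedules and runs a full WaveGrad inference on `num_samples` clips for each one, so run-time grows exponentially with `--num_iter`, as the `--search_depth` help string warns. A quick sanity check:

```python
# schedule-search cost: search_depth ** num_iter candidate schedules,
# each evaluated with num_samples full vocoder inferences
search_depth, num_iter, num_samples = 3, 6, 1
print(search_depth ** num_iter)  # 729 candidate schedules
```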
@@ -61,6 +61,7 @@ class SpeakerEncoder(nn.Module):
         d = torch.nn.functional.normalize(d, p=2, dim=1)
         return d

+    @torch.no_grad()
     def inference(self, x):
         d = self.layers.forward(x)
         if self.use_lstm_with_projection:
@@ -65,14 +65,19 @@
     "eval_batch_size":16,
     "r": 7,                 // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
     "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], // set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceed.
-    "apex_amp_level": null,     // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.
+    "mixed_precision": true,    // enable/disable mixed precision training with PyTorch's native AMP (FP16/FP32).

     // LOSS SETTINGS
     "loss_masking": true,       // enable / disable loss masking against the sequence padding.
-    "decoder_loss_alpha": 0.5,  // decoder loss weight. If > 0, it is enabled
-    "postnet_loss_alpha": 0.25, // postnet loss weight. If > 0, it is enabled
+    "decoder_loss_alpha": 0.5,       // original decoder loss weight. If > 0, it is enabled
+    "postnet_loss_alpha": 0.25,      // original postnet loss weight. If > 0, it is enabled
+    "postnet_diff_spec_alpha": 0.25, // differential spectral loss weight on the postnet output. If > 0, it is enabled
+    "decoder_diff_spec_alpha": 0.25, // differential spectral loss weight on the decoder output. If > 0, it is enabled
+    "decoder_ssim_alpha": 0.5,       // decoder SSIM loss weight. If > 0, it is enabled
+    "postnet_ssim_alpha": 0.25,      // postnet SSIM loss weight. If > 0, it is enabled
     "ga_alpha": 5.0,            // weight for guided attention loss. If > 0, guided attention is enabled.
-    "diff_spec_alpha": 0.25,    // differential spectral loss weight. If > 0, it is enabled
     "stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.


     // VALIDATION
     "run_eval": true,
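For the `gradual_training` schedule above, each entry is `[first_step, r, batch_size]`, and the trainer uses the last entry whose `first_step` has already been reached. A minimal sketch of that lookup (the helper name is illustrative; the repo has its own equivalent):

```python
def gradual_training_schedule(global_step, schedule):
    """Pick (r, batch_size) for the current step from a
    [[first_step, r, batch_size], ...] schedule (illustrative helper)."""
    r, batch_size = schedule[0][1:]
    for first_step, new_r, new_bs in schedule:
        if global_step >= first_step:
            r, batch_size = new_r, new_bs
    return r, batch_size

schedule = [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]
print(gradual_training_schedule(0, schedule))      # (7, 64)
print(gradual_training_schedule(60000, schedule))  # (3, 32)
```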
@@ -51,10 +51,13 @@
     // "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
     // },

+    "add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.
+
     // DISTRIBUTED TRAINING
-    "apex_amp_level": null, // APEX amp optimization level. "O1" is currently supported.
     "distributed":{
         "backend": "nccl",
-        "url": "tcp:\/\/localhost:54321"
+        "url": "tcp:\/\/localhost:54323"
     },

     "reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
@@ -51,6 +51,8 @@
     // "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
     // },

+    "add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.
+
     // DISTRIBUTED TRAINING
     "distributed":{
         "backend": "nccl",
@@ -17,6 +17,7 @@ class MyDataset(Dataset):
                  ap,
                  meta_data,
                  tp=None,
+                 add_blank=False,
                  batch_group_size=0,
                  min_seq_len=0,
                  max_seq_len=float("inf"),
@@ -55,6 +56,7 @@ class MyDataset(Dataset):
         self.max_seq_len = max_seq_len
         self.ap = ap
         self.tp = tp
+        self.add_blank = add_blank
         self.use_phonemes = use_phonemes
         self.phoneme_cache_path = phoneme_cache_path
         self.phoneme_language = phoneme_language
@@ -88,7 +90,7 @@ class MyDataset(Dataset):
         phonemes = phoneme_to_sequence(text, [self.cleaners],
                                        language=self.phoneme_language,
                                        enable_eos_bos=False,
-                                       tp=self.tp)
+                                       tp=self.tp, add_blank=self.add_blank)
         phonemes = np.asarray(phonemes, dtype=np.int32)
         np.save(cache_path, phonemes)
         return phonemes
@@ -127,7 +129,7 @@ class MyDataset(Dataset):
             text = self._load_or_generate_phoneme_sequence(wav_file, text)
         else:
             text = np.asarray(text_to_sequence(text, [self.cleaners],
-                                               tp=self.tp),
+                                               tp=self.tp, add_blank=self.add_blank),
                               dtype=np.int32)

         assert text.size > 0, self.items[idx][1]
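The `add_blank` option threads through this dataset change: when enabled, a blank token is interleaved with the regular symbols, and `setup_model` (later in this commit) grows `num_chars` by one to make room for it. A minimal sketch of the interleaving in the style of the reference GlowTTS code, which also pads both ends (the helper name and blank id are illustrative):

```python
def intersperse(sequence, token):
    """Insert `token` between and around every item of `sequence`,
    e.g. [5, 9, 2] -> [t, 5, t, 9, t, 2, t] (illustrative helper)."""
    result = [token] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result

# the blank uses the extra symbol id appended after the regular alphabet
print(intersperse([5, 9, 2], token=133))
```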
@@ -9,9 +9,9 @@ from tqdm import tqdm
 from TTS.tts.utils.generic_utils import split_dataset


-def load_meta_data(datasets):
+def load_meta_data(datasets, eval_split=True):
     meta_data_train_all = []
-    meta_data_eval_all = []
+    meta_data_eval_all = [] if eval_split else None
     for dataset in datasets:
         name = dataset['name']
         root_path = dataset['path']
@@ -20,12 +20,13 @@ def load_meta_data(datasets):
         preprocessor = get_preprocessor_by_name(name)
         meta_data_train = preprocessor(root_path, meta_file_train)
         print(f" | > Found {len(meta_data_train)} files in {Path(root_path).resolve()}")
-        if meta_file_val is None:
-            meta_data_eval, meta_data_train = split_dataset(meta_data_train)
-        else:
-            meta_data_eval = preprocessor(root_path, meta_file_val)
+        if eval_split:
+            if meta_file_val is None:
+                meta_data_eval, meta_data_train = split_dataset(meta_data_train)
+            else:
+                meta_data_eval = preprocessor(root_path, meta_file_val)
+            meta_data_eval_all += meta_data_eval
         meta_data_train_all += meta_data_train
-        meta_data_eval_all += meta_data_eval
     return meta_data_train_all, meta_data_eval_all
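With the new `eval_split` flag, callers that need every item as training data (e.g. feature-extraction scripts) can skip the eval split entirely; the eval side then comes back as `None`. A usage sketch with an assumed dataset definition (the dict keys mirror the `datasets` entries in `config.json`; the LJSpeech path is a placeholder):

```python
from TTS.tts.datasets.preprocess import load_meta_data

datasets = [{
    "name": "ljspeech",
    "path": "/data/LJSpeech-1.1/",
    "meta_file_train": "metadata.csv",
    "meta_file_val": None,
}]

train_items, eval_items = load_meta_data(datasets)                  # split as before
all_items, none_items = load_meta_data(datasets, eval_split=False)  # eval side is None
assert none_items is None
```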
@@ -227,7 +228,6 @@ def brspeech(root_path, meta_file):
             if line.startswith("wav_filename"):
                 continue
             cols = line.split('|')
-            #print(cols)
             wav_file = os.path.join(root_path, cols[0])
             text = cols[2]
             speaker_name = cols[3]
@@ -303,17 +303,17 @@ def _voxcel_x(root_path, meta_file, voxcel_idx):

     elif not cache_to.exists():
         cnt = 0
-        meta_data = ""
+        meta_data = []
         wav_files = voxceleb_path.rglob("**/*.wav")
         for path in tqdm(wav_files, desc=f"Building VoxCeleb {voxcel_idx} Meta file ... this needs to be done only once.",
                          total=expected_count):
             speaker_id = str(Path(path).parent.parent.stem)
             assert speaker_id.startswith('id')
             text = None  # VoxCeleb does not provide transcriptions, and they are not needed for training the SE
-            meta_data += f"{text}|{path}|voxcel{voxcel_idx}_{speaker_id}\n"
+            meta_data.append(f"{text}|{path}|voxcel{voxcel_idx}_{speaker_id}\n")
             cnt += 1
         with open(str(cache_to), 'w') as f:
-            f.write(meta_data)
+            f.write("".join(meta_data))
         if cnt < expected_count:
             raise ValueError(f"Found too few instances for Voxceleb. Should be around {expected_count}, is: {cnt}")
@@ -2,10 +2,14 @@ import math
 import numpy as np
 import torch
 from torch import nn
+from inspect import signature
 from torch.nn import functional
 from TTS.tts.utils.generic_utils import sequence_mask
+from TTS.tts.utils.ssim import ssim


+# pylint: disable=abstract-method
+# relates https://github.com/pytorch/pytorch/issues/42305
 class L1LossMasked(nn.Module):
     def __init__(self, seq_len_norm):
         super().__init__()
@@ -22,6 +26,10 @@ class L1LossMasked(nn.Module):
                 class for each corresponding step.
             length: A Variable containing a LongTensor of size (batch,)
                 which contains the length of each data in a batch.
+        Shapes:
+            x: B x T X D
+            target: B x T x D
+            length: B
         Returns:
             loss: An average loss value in range [0, 1] masked by the length.
         """
@@ -60,6 +68,10 @@ class MSELossMasked(nn.Module):
                 class for each corresponding step.
             length: A Variable containing a LongTensor of size (batch,)
                 which contains the length of each data in a batch.
+        Shapes:
+            x: B x T X D
+            target: B x T x D
+            length: B
         Returns:
             loss: An average loss value in range [0, 1] masked by the length.
         """
@@ -84,6 +96,33 @@ class MSELossMasked(nn.Module):
         return loss


+class SSIMLoss(torch.nn.Module):
+    """SSIM loss as explained here https://en.wikipedia.org/wiki/Structural_similarity"""
+    def __init__(self):
+        super().__init__()
+        self.loss_func = ssim
+
+    def forward(self, y_hat, y, length=None):
+        """
+        Args:
+            y_hat (tensor): model prediction values.
+            y (tensor): target values.
+            length (tensor): length of each sample in a batch.
+        Shapes:
+            y_hat: B x T X D
+            y: B x T x D
+            length: B
+        Returns:
+            loss: An average loss value in range [0, 1] masked by the length.
+        """
+        if length is not None:
+            m = sequence_mask(sequence_length=length,
+                              max_len=y.size(1)).unsqueeze(2).float().to(
+                                  y_hat.device)
+            y_hat, y = y_hat * m, y * m
+        return 1 - self.loss_func(y_hat.unsqueeze(1), y.unsqueeze(1))
+
+
 class AttentionEntropyLoss(nn.Module):
     # pylint: disable=R0201
     def forward(self, align):
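A quick usage sketch for the new `SSIMLoss`; the import path assumes the repo's `TTS/tts/layers/losses.py` layout. With a `length` tensor, padded frames are zeroed on both inputs before the SSIM comparison:

```python
import torch
from TTS.tts.layers.losses import SSIMLoss

criterion = SSIMLoss()
y_hat = torch.rand(2, 100, 80)       # [batch, frames, mel bins]
y = torch.rand(2, 100, 80)
lengths = torch.tensor([100, 73])    # second sample is padding past frame 73
loss = criterion(y_hat, y, lengths)  # masked: padded frames are zeroed first
print(loss.item())                   # close to 1 for unrelated random inputs
```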
@@ -115,19 +154,29 @@ class BCELossMasked(nn.Module):
                 class for each corresponding step.
             length: A Variable containing a LongTensor of size (batch,)
                 which contains the length of each data in a batch.
+        Shapes:
+            x: B x T
+            target: B x T
+            length: B
         Returns:
             loss: An average loss value in range [0, 1] masked by the length.
         """
-        # mask: (batch, max_len, 1)
         target.requires_grad = False
-        mask = sequence_mask(sequence_length=length,
-                             max_len=target.size(1)).float()
+        if length is not None:
+            mask = sequence_mask(sequence_length=length,
+                                 max_len=target.size(1)).float()
+            x = x * mask
+            target = target * mask
+            num_items = mask.sum()
+        else:
+            num_items = torch.numel(x)
         loss = functional.binary_cross_entropy_with_logits(
-            x * mask,
-            target * mask,
+            x,
+            target,
             pos_weight=self.pos_weight,
             reduction='sum')
-        loss = loss / mask.sum()
+        loss = loss / num_items
         return loss
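The reworked `BCELossMasked` normalizes by the number of unmasked items and now tolerates `length=None`. The masked branch can be reproduced with plain `torch` (`pos_weight` omitted for brevity):

```python
import torch
from torch.nn import functional

x = torch.randn(2, 10)                      # stopnet logits: [batch, frames]
target = (torch.rand(2, 10) > 0.5).float()
lengths = torch.tensor([10, 6])

# same steps as the masked branch above
mask = (torch.arange(10)[None, :] < lengths[:, None]).float()
x, target = x * mask, target * mask
num_items = mask.sum()
loss = functional.binary_cross_entropy_with_logits(
    x, target, reduction='sum') / num_items
print(loss.item())
```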
@@ -139,9 +188,19 @@ class DifferentailSpectralLoss(nn.Module):
         super().__init__()
         self.loss_func = loss_func

-    def forward(self, x, target, length):
+    def forward(self, x, target, length=None):
+        """
+        Shapes:
+            x: B x T
+            target: B x T
+            length: B
+        Returns:
+            loss: An average loss value in range [0, 1] masked by the length.
+        """
         x_diff = x[:, 1:] - x[:, :-1]
         target_diff = target[:, 1:] - target[:, :-1]
-        return self.loss_func(x_diff, target_diff, length-1)
+        if length is None:
+            return self.loss_func(x_diff, target_diff)
+        return self.loss_func(x_diff, target_diff, length-1)
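The differential spectral loss compares frame-to-frame differences rather than absolute frame values, which penalizes over-smooth transitions in the predicted spectrogram. A toy check with an L1 base loss:

```python
import torch

x = torch.tensor([[0.0, 1.0, 2.0, 3.0]])       # prediction: [batch, frames]
target = torch.tensor([[0.0, 2.0, 4.0, 6.0]])  # steeper target

x_diff = x[:, 1:] - x[:, :-1]                  # [[1, 1, 1]]
target_diff = target[:, 1:] - target[:, :-1]   # [[2, 2, 2]]
loss = torch.nn.functional.l1_loss(x_diff, target_diff)
print(loss.item())  # 1.0: the slopes differ by 1 everywhere
```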
@@ -169,7 +228,7 @@ class GuidedAttentionLoss(torch.nn.Module):

     @staticmethod
     def _make_ga_mask(ilen, olen, sigma):
-        grid_x, grid_y = torch.meshgrid(torch.arange(olen, device=olen.device), torch.arange(ilen, device=ilen.device))
+        grid_x, grid_y = torch.meshgrid(torch.arange(olen).to(olen), torch.arange(ilen).to(ilen))
         grid_x, grid_y = grid_x.float(), grid_y.float()
         return 1.0 - torch.exp(-(grid_y / ilen - grid_x / olen)**2 /
                                (2 * (sigma**2)))
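In formula form, the mask built by `_make_ga_mask` is the guided-attention weight of Tachibana et al. (2017): for input index $n$ of $N$ and output index $t$ of $T$,

```latex
W_{n,t} = 1 - \exp\!\left(-\frac{\left(n/N - t/T\right)^{2}}{2\sigma^{2}}\right)
```

so attention mass far from the diagonal is penalized, with `sigma` controlling the width of the tolerated band.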
@@ -182,13 +241,17 @@ class GuidedAttentionLoss(torch.nn.Module):


 class TacotronLoss(torch.nn.Module):
+    """Collection of Tacotron set-up based on provided config."""
     def __init__(self, c, stopnet_pos_weight=10, ga_sigma=0.4):
         super(TacotronLoss, self).__init__()
         self.stopnet_pos_weight = stopnet_pos_weight
         self.ga_alpha = c.ga_alpha
-        self.diff_spec_alpha = c.diff_spec_alpha
+        self.decoder_diff_spec_alpha = c.decoder_diff_spec_alpha
+        self.postnet_diff_spec_alpha = c.postnet_diff_spec_alpha
         self.decoder_alpha = c.decoder_loss_alpha
         self.postnet_alpha = c.postnet_loss_alpha
+        self.decoder_ssim_alpha = c.decoder_ssim_alpha
+        self.postnet_ssim_alpha = c.postnet_ssim_alpha
         self.config = c

         # postnet and decoder loss
@@ -199,12 +262,15 @@ class TacotronLoss(torch.nn.Module):
         else:
             self.criterion = nn.L1Loss() if c.model in ["Tacotron"
                                                         ] else nn.MSELoss()
-        # differential spectral loss
-        if c.diff_spec_alpha > 0:
-            self.criterion_diff_spec = DifferentailSpectralLoss(loss_func=self.criterion)
         # guided attention loss
         if c.ga_alpha > 0:
             self.criterion_ga = GuidedAttentionLoss(sigma=ga_sigma)
+        # differential spectral loss
+        if c.postnet_diff_spec_alpha > 0 or c.decoder_diff_spec_alpha > 0:
+            self.criterion_diff_spec = DifferentailSpectralLoss(loss_func=self.criterion)
+        # ssim loss
+        if c.postnet_ssim_alpha > 0 or c.decoder_ssim_alpha > 0:
+            self.criterion_ssim = SSIMLoss()
         # stopnet loss
         # pylint: disable=not-callable
         self.criterion_st = BCELossMasked(
@@ -215,6 +281,9 @@ class TacotronLoss(torch.nn.Module):
                 alignments, alignment_lens, alignments_backwards, input_lens):

         return_dict = {}
+        # remove lengths if no masking is applied
+        if not self.config.loss_masking:
+            output_lens = None
         # decoder and postnet losses
         if self.config.loss_masking:
             if self.decoder_alpha > 0:
@@ -262,8 +331,11 @@ class TacotronLoss(torch.nn.Module):

         # double decoder consistency loss (if enabled)
         if self.config.double_decoder_consistency:
-            decoder_b_loss = self.criterion(decoder_b_output, mel_input,
-                                            output_lens)
+            if self.config.loss_masking:
+                decoder_b_loss = self.criterion(decoder_b_output, mel_input,
+                                                output_lens)
+            else:
+                decoder_b_loss = self.criterion(decoder_b_output, mel_input)
             # decoder_c_loss = torch.nn.functional.l1_loss(decoder_b_output, decoder_output)
             attention_c_loss = torch.nn.functional.l1_loss(alignments, alignments_backwards)
             loss += self.decoder_alpha * (decoder_b_loss + attention_c_loss)
@@ -274,14 +346,38 @@ class TacotronLoss(torch.nn.Module):
         if self.config.ga_alpha > 0:
             ga_loss = self.criterion_ga(alignments, input_lens, alignment_lens)
             loss += ga_loss * self.ga_alpha
-            return_dict['ga_loss'] = ga_loss * self.ga_alpha
+            return_dict['ga_loss'] = ga_loss
+
+        # decoder differential spectral loss
+        if self.config.decoder_diff_spec_alpha > 0:
+            decoder_diff_spec_loss = self.criterion_diff_spec(decoder_output, mel_input, output_lens)
+            loss += decoder_diff_spec_loss * self.decoder_diff_spec_alpha
+            return_dict['decoder_diff_spec_loss'] = decoder_diff_spec_loss
+
+        # postnet differential spectral loss
+        if self.config.postnet_diff_spec_alpha > 0:
+            postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
+            loss += postnet_diff_spec_loss * self.postnet_diff_spec_alpha
+            return_dict['postnet_diff_spec_loss'] = postnet_diff_spec_loss
+
+        # decoder ssim loss
+        if self.config.decoder_ssim_alpha > 0:
+            decoder_ssim_loss = self.criterion_ssim(decoder_output, mel_input, output_lens)
+            loss += decoder_ssim_loss * self.decoder_ssim_alpha
+            return_dict['decoder_ssim_loss'] = decoder_ssim_loss
+
+        # postnet ssim loss
+        if self.config.postnet_ssim_alpha > 0:
+            postnet_ssim_loss = self.criterion_ssim(postnet_output, mel_input, output_lens)
+            loss += postnet_ssim_loss * self.postnet_ssim_alpha
+            return_dict['postnet_ssim_loss'] = postnet_ssim_loss

-        # differential spectral loss
-        if self.config.diff_spec_alpha > 0:
-            diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
-            loss += diff_spec_loss * self.diff_spec_alpha
-            return_dict['diff_spec_loss'] = diff_spec_loss
         return_dict['loss'] = loss
+
+        # check if any loss is NaN
+        for key, loss in return_dict.items():
+            if torch.isnan(loss):
+                raise RuntimeError(f" [!] NaN loss with {key}.")
         return return_dict
@@ -306,4 +402,9 @@ class GlowTTSLoss(torch.nn.Module):
         return_dict['loss'] = log_mle + loss_dur
         return_dict['log_mle'] = log_mle
         return_dict['loss_dur'] = loss_dur
+
+        # check if any loss is NaN
+        for key, loss in return_dict.items():
+            if torch.isnan(loss):
+                raise RuntimeError(f" [!] NaN loss with {key}.")
         return return_dict
@@ -102,7 +102,7 @@ class Encoder(nn.Module):
             o = layer(o)
         o = o.transpose(1, 2)
         o = nn.utils.rnn.pack_padded_sequence(o,
-                                              input_lengths,
+                                              input_lengths.cpu(),
                                               batch_first=True)
         self.lstm.flatten_parameters()
         o, _ = self.lstm(o)
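Context for the `.cpu()` change: newer PyTorch releases (around 1.7) require the `lengths` argument of `pack_padded_sequence` to be a 1D CPU int64 tensor and raise if it arrives on the GPU, which it otherwise would here since the whole batch is moved to CUDA. A minimal reproduction of the safe call:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

o = torch.randn(2, 5, 8)              # [batch, time, features]
input_lengths = torch.tensor([5, 3])  # may live on the GPU in the trainer

# newer PyTorch raises "'lengths' argument should be a 1D CPU int64 tensor"
# for CUDA length tensors, hence the .cpu() in the hunk above
packed = pack_padded_sequence(o, input_lengths.cpu(), batch_first=True)
```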
@ -37,7 +37,8 @@ class GlowTts(nn.Module):
                 hidden_channels_enc=None,
                 hidden_channels_dec=None,
                 use_encoder_prenet=False,
                 encoder_type="transformer"):
                 encoder_type="transformer",
                 external_speaker_embedding_dim=None):

        super().__init__()
        self.num_chars = num_chars

@ -67,6 +68,14 @@ class GlowTts(nn.Module):
        self.use_encoder_prenet = use_encoder_prenet
        self.noise_scale = 0.66
        self.length_scale = 1.
        self.external_speaker_embedding_dim = external_speaker_embedding_dim

        # multi-speaker: if c_in_channels is 0, default to 512
        if num_speakers > 1:
            if self.c_in_channels == 0 and not self.external_speaker_embedding_dim:
                self.c_in_channels = 512
            elif self.external_speaker_embedding_dim:
                self.c_in_channels = self.external_speaker_embedding_dim

        self.encoder = Encoder(num_chars,
                               out_channels=out_channels,

@ -80,7 +89,7 @@ class GlowTts(nn.Module):
                               dropout_p=dropout_p,
                               mean_only=mean_only,
                               use_prenet=use_encoder_prenet,
                               c_in_channels=c_in_channels)
                               c_in_channels=self.c_in_channels)

        self.decoder = Decoder(out_channels,
                               hidden_channels_dec or hidden_channels,

@ -92,10 +101,10 @@ class GlowTts(nn.Module):
                               num_splits=num_splits,
                               num_sqz=num_sqz,
                               sigmoid_scale=sigmoid_scale,
                               c_in_channels=c_in_channels)
                               c_in_channels=self.c_in_channels)

        if num_speakers > 1:
            self.emb_g = nn.Embedding(num_speakers, c_in_channels)
        if num_speakers > 1 and not external_speaker_embedding_dim:
            self.emb_g = nn.Embedding(num_speakers, self.c_in_channels)
            nn.init.uniform_(self.emb_g.weight, -0.1, 0.1)

    @staticmethod

@ -122,7 +131,11 @@ class GlowTts(nn.Module):
        y_max_length = y.size(2)
        # norm speaker embeddings
        if g is not None:
            g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]
            if self.external_speaker_embedding_dim:
                g = F.normalize(g).unsqueeze(-1)
            else:
                g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]

        # embedding pass
        o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x,
                                                              x_lengths,

@ -157,8 +170,13 @@ class GlowTts(nn.Module):

    @torch.no_grad()
    def inference(self, x, x_lengths, g=None):

        if g is not None:
            g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]
            if self.external_speaker_embedding_dim:
                g = F.normalize(g).unsqueeze(-1)
            else:
                g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]

        # embedding pass
        o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x,
                                                              x_lengths,
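Both `forward` and `inference` now dispatch on `external_speaker_embedding_dim`: with an external embedding file, `g` already is a d-vector and is only L2-normalized; otherwise it is a speaker id looked up in the learned `emb_g` table. The dispatch in isolation, as a sketch reusing the names from the hunks above (the helper itself is illustrative, not part of the commit):

```python
import torch
import torch.nn.functional as F

def speaker_cond(g, emb_g=None, external_dim=None):
    """Return normalized speaker conditioning of shape [B, H, 1]. Sketch only."""
    if external_dim:                              # g: [B, H] precomputed d-vector
        return F.normalize(g).unsqueeze(-1)
    return F.normalize(emb_g(g)).unsqueeze(-1)    # g: [B] integer speaker ids

emb_g = torch.nn.Embedding(10, 256)
print(speaker_cond(torch.tensor([3]), emb_g=emb_g).shape)        # [1, 256, 1]
print(speaker_cond(torch.rand(1, 256), external_dim=256).shape)  # [1, 256, 1]
```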
@ -28,7 +28,6 @@ def split_dataset(items):
        return items_eval, items
    return items[:eval_split_size], items[eval_split_size:]


# from https://gist.github.com/jihunchoi/f1434a77df9db1bb337417854b398df1
def sequence_mask(sequence_length, max_len=None):
    if max_len is None:

@ -50,7 +49,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
    MyModel = importlib.import_module('TTS.tts.models.' + c.model.lower())
    MyModel = getattr(MyModel, to_camel(c.model))
    if c.model.lower() in "tacotron":
        model = MyModel(num_chars=num_chars,
        model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
                        num_speakers=num_speakers,
                        r=c.r,
                        postnet_output_dim=int(c.audio['fft_size'] / 2 + 1),

@ -77,7 +76,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
                        ddc_r=c.ddc_r,
                        speaker_embedding_dim=speaker_embedding_dim)
    elif c.model.lower() == "tacotron2":
        model = MyModel(num_chars=num_chars,
        model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
                        num_speakers=num_speakers,
                        r=c.r,
                        postnet_output_dim=c.audio['num_mels'],

@ -103,7 +102,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
                        ddc_r=c.ddc_r,
                        speaker_embedding_dim=speaker_embedding_dim)
    elif c.model.lower() == "glow_tts":
        model = MyModel(num_chars=num_chars,
        model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
                        hidden_channels=192,
                        filter_channels=768,
                        filter_channels_dp=256,

@ -126,13 +125,15 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
                        mean_only=True,
                        hidden_channels_enc=192,
                        hidden_channels_dec=192,
                        use_encoder_prenet=True)
                        use_encoder_prenet=True,
                        external_speaker_embedding_dim=speaker_embedding_dim)
    return model


def is_tacotron(c):
    return 'glow_tts' not in c['model']


def check_config_tts(c):
    check_argument('model', c, enum_list=['tacotron', 'tacotron2'], restricted=True, val_type=str)
    check_argument('model', c, enum_list=['tacotron', 'tacotron2', 'glow_tts'], restricted=True, val_type=str)
    check_argument('run_name', c, restricted=True, val_type=str)
    check_argument('run_description', c, val_type=str)

@ -176,10 +177,20 @@ def check_config_tts(c):
    check_argument('eval_batch_size', c, restricted=True, val_type=int, min_val=1)
    check_argument('r', c, restricted=True, val_type=int, min_val=1)
    check_argument('gradual_training', c, restricted=False, val_type=list)
    check_argument('loss_masking', c, restricted=True, val_type=bool)
    check_argument('apex_amp_level', c, restricted=False, val_type=str)
    # check_argument('grad_accum', c, restricted=True, val_type=int, min_val=1, max_val=100)

    # loss parameters
    check_argument('loss_masking', c, restricted=True, val_type=bool)
    if c['model'].lower() in ['tacotron', 'tacotron2']:
        check_argument('decoder_loss_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('postnet_loss_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('postnet_diff_spec_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('decoder_diff_spec_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('decoder_ssim_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('postnet_ssim_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('ga_alpha', c, restricted=True, val_type=float, min_val=0)

    # validation parameters
    check_argument('run_eval', c, restricted=True, val_type=bool)
    check_argument('test_delay_epochs', c, restricted=True, val_type=int, min_val=0)

@ -195,27 +206,30 @@ def check_config_tts(c):
    check_argument('seq_len_norm', c, restricted=True, val_type=bool)

    # tacotron prenet
    check_argument('memory_size', c, restricted=True, val_type=int, min_val=-1)
    check_argument('prenet_type', c, restricted=True, val_type=str, enum_list=['original', 'bn'])
    check_argument('prenet_dropout', c, restricted=True, val_type=bool)
    check_argument('memory_size', c, restricted=is_tacotron(c), val_type=int, min_val=-1)
    check_argument('prenet_type', c, restricted=is_tacotron(c), val_type=str, enum_list=['original', 'bn'])
    check_argument('prenet_dropout', c, restricted=is_tacotron(c), val_type=bool)

    # attention
    check_argument('attention_type', c, restricted=True, val_type=str, enum_list=['graves', 'original'])
    check_argument('attention_heads', c, restricted=True, val_type=int)
    check_argument('attention_norm', c, restricted=True, val_type=str, enum_list=['sigmoid', 'softmax'])
    check_argument('windowing', c, restricted=True, val_type=bool)
    check_argument('use_forward_attn', c, restricted=True, val_type=bool)
    check_argument('forward_attn_mask', c, restricted=True, val_type=bool)
    check_argument('transition_agent', c, restricted=True, val_type=bool)
    check_argument('location_attn', c, restricted=True, val_type=bool)
    check_argument('bidirectional_decoder', c, restricted=True, val_type=bool)
    check_argument('double_decoder_consistency', c, restricted=True, val_type=bool)
    check_argument('attention_type', c, restricted=is_tacotron(c), val_type=str, enum_list=['graves', 'original'])
    check_argument('attention_heads', c, restricted=is_tacotron(c), val_type=int)
    check_argument('attention_norm', c, restricted=is_tacotron(c), val_type=str, enum_list=['sigmoid', 'softmax'])
    check_argument('windowing', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('use_forward_attn', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('forward_attn_mask', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('transition_agent', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('location_attn', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('bidirectional_decoder', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('double_decoder_consistency', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('ddc_r', c, restricted='double_decoder_consistency' in c.keys(), min_val=1, max_val=7, val_type=int)

    # stopnet
    check_argument('stopnet', c, restricted=True, val_type=bool)
    check_argument('separate_stopnet', c, restricted=True, val_type=bool)
    check_argument('stopnet', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('separate_stopnet', c, restricted=is_tacotron(c), val_type=bool)

    # GlowTTS parameters
    check_argument('encoder_type', c, restricted=not is_tacotron(c), val_type=str)

    # tensorboard
    check_argument('print_step', c, restricted=True, val_type=int, min_val=1)

@ -240,15 +254,16 @@ def check_config_tts(c):

    # multi-speaker and gst
    check_argument('use_speaker_embedding', c, restricted=True, val_type=bool)
    check_argument('use_external_speaker_embedding_file', c, restricted=True, val_type=bool)
    check_argument('external_speaker_embedding_file', c, restricted=True, val_type=str)
    check_argument('use_gst', c, restricted=True, val_type=bool)
    check_argument('gst', c, restricted=True, val_type=dict)
    check_argument('gst_style_input', c['gst'], restricted=True, val_type=[str, dict])
    check_argument('gst_embedding_dim', c['gst'], restricted=True, val_type=int, min_val=0, max_val=1000)
    check_argument('gst_use_speaker_embedding', c['gst'], restricted=True, val_type=bool)
    check_argument('gst_num_heads', c['gst'], restricted=True, val_type=int, min_val=2, max_val=10)
    check_argument('gst_style_tokens', c['gst'], restricted=True, val_type=int, min_val=1, max_val=1000)
    check_argument('use_external_speaker_embedding_file', c, restricted=c['use_speaker_embedding'], val_type=bool)
    check_argument('external_speaker_embedding_file', c, restricted=c['use_external_speaker_embedding_file'], val_type=str)
    check_argument('use_gst', c, restricted=is_tacotron(c), val_type=bool)
    if c['model'].lower() in ['tacotron', 'tacotron2'] and c['use_gst']:
        check_argument('gst', c, restricted=is_tacotron(c), val_type=dict)
        check_argument('gst_style_input', c['gst'], restricted=is_tacotron(c), val_type=[str, dict])
        check_argument('gst_embedding_dim', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=0, max_val=1000)
        check_argument('gst_use_speaker_embedding', c['gst'], restricted=is_tacotron(c), val_type=bool)
        check_argument('gst_num_heads', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=2, max_val=10)
        check_argument('gst_style_tokens', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=1, max_val=1000)

    # datasets - checking only the first entry
    check_argument('datasets', c, restricted=True, val_type=list)
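The `num_chars + getattr(c, "add_blank", False)` trick in `setup_model` leans on `bool` being an `int` subclass in Python: when `add_blank` is enabled, the character embedding grows by exactly one slot, reserved for the blank token interspersed by the text frontend (see the text-utils hunks further below). For example:

```python
num_chars = 129
add_blank = True               # bool is an int subclass in Python
print(num_chars + add_blank)   # 130 -> one extra embedding row for the blank id
print(num_chars + False)       # 129 -> unchanged when disabled
```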
@ -6,6 +6,7 @@ import pickle as pickle_tts
from TTS.utils.io import RenamingUnpickler


def load_checkpoint(model, checkpoint_path, amp=None, use_cuda=False):
    try:
        state = torch.load(checkpoint_path, map_location=torch.device('cpu'))

@ -25,9 +26,12 @@ def load_checkpoint(model, checkpoint_path, amp=None, use_cuda=False):


def save_model(model, optimizer, current_step, epoch, r, output_path, amp_state_dict=None, **kwargs):
    new_state_dict = model.state_dict()
    if hasattr(model, 'module'):
        model_state = model.module.state_dict()
    else:
        model_state = model.state_dict()
    state = {
        'model': new_state_dict,
        'model': model_state,
        'optimizer': optimizer.state_dict() if optimizer is not None else None,
        'step': current_step,
        'epoch': epoch,
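Unwrapping `model.module` matters because `torch.nn.DataParallel` prefixes every parameter key with `module.`; saving the inner module's state dict keeps checkpoints loadable without the wrapper. A small demonstration:

```python
import torch.nn as nn

net = nn.Linear(4, 2)
wrapped = nn.DataParallel(net)

print(next(iter(net.state_dict())))      # 'weight'
print(next(iter(wrapped.state_dict())))  # 'module.weight'
# hence: save wrapped.module.state_dict() so keys match a bare model
```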
@ -30,3 +30,44 @@ def get_speakers(items):
    """Returns a sorted, unique list of speakers in a given dataset."""
    speakers = {e[2] for e in items}
    return sorted(speakers)


def parse_speakers(c, args, meta_data_train, OUT_PATH):
    """Return the number of speakers, the speaker embedding dimension and the speaker mapping."""
    if c.use_speaker_embedding:
        speakers = get_speakers(meta_data_train)
        if args.restore_path:
            if c.use_external_speaker_embedding_file:  # restoring a checkpoint while using an external embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                if not speaker_mapping:
                    print("WARNING: speakers.json was not found in restore_path, trying to use CONFIG.external_speaker_embedding_file")
                    speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
                    if not speaker_mapping:
                        raise RuntimeError("You must copy the file speakers.json to restore_path, or set a valid file in CONFIG.external_speaker_embedding_file")
                speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
            elif not c.use_external_speaker_embedding_file:  # restoring a checkpoint without an external embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                speaker_embedding_dim = None
                assert all(speaker in speaker_mapping
                           for speaker in speakers), "As of now, you cannot " \
                                                     "introduce new speakers to " \
                                                     "a previously trained model."
        elif c.use_external_speaker_embedding_file and c.external_speaker_embedding_file:  # starting a new run with an external embedding file
            speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
            speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
        elif c.use_external_speaker_embedding_file and not c.external_speaker_embedding_file:  # external embedding requested but no file given
            raise ValueError("use_external_speaker_embedding_file is True, so you need to pass an external speaker embedding file; run the GE2E-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb or AngularPrototypical-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb notebook in the notebooks/ folder.")
        else:  # starting a new run without an external embedding file
            speaker_mapping = {name: i for i, name in enumerate(speakers)}
            speaker_embedding_dim = None
        save_speaker_mapping(OUT_PATH, speaker_mapping)
        num_speakers = len(speaker_mapping)
        print("Training with {} speakers: {}".format(len(speakers),
                                                     ", ".join(speakers)))
    else:
        num_speakers = 0
        speaker_embedding_dim = None
        speaker_mapping = None

    return num_speakers, speaker_embedding_dim, speaker_mapping
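The indexing `speaker_mapping[...]['embedding']` implies a `speakers.json` of dict entries each carrying one embedding vector. An illustrative, hypothetical shape (values invented here), showing how `speaker_embedding_dim` is recovered:

```python
# hypothetical example entry; real files hold full-length d-vectors
speaker_mapping = {
    "p225_001.wav": {"name": "p225", "embedding": [0.01, -0.03, 0.05]},
}
first = next(iter(speaker_mapping.values()))
speaker_embedding_dim = len(first["embedding"])  # -> 3 in this toy example
```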
@ -0,0 +1,75 @@
# taken from https://github.com/Po-Hsun-Su/pytorch-ssim

from math import exp

import torch
import torch.nn.functional as F
from torch.autograd import Variable


def gaussian(window_size, sigma):
    gauss = torch.Tensor([exp(-(x - window_size//2)**2/float(2*sigma**2)) for x in range(window_size)])
    return gauss/gauss.sum()


def create_window(window_size, channel):
    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)
    _2D_window = _1D_window.mm(_1D_window.t()).float().unsqueeze(0).unsqueeze(0)
    window = Variable(_2D_window.expand(channel, 1, window_size, window_size).contiguous())
    return window


def _ssim(img1, img2, window, window_size, channel, size_average=True):
    mu1 = F.conv2d(img1, window, padding=window_size//2, groups=channel)
    mu2 = F.conv2d(img2, window, padding=window_size//2, groups=channel)

    mu1_sq = mu1.pow(2)
    mu2_sq = mu2.pow(2)
    mu1_mu2 = mu1*mu2

    sigma1_sq = F.conv2d(img1*img1, window, padding=window_size//2, groups=channel) - mu1_sq
    sigma2_sq = F.conv2d(img2*img2, window, padding=window_size//2, groups=channel) - mu2_sq
    sigma12 = F.conv2d(img1*img2, window, padding=window_size//2, groups=channel) - mu1_mu2

    C1 = 0.01**2
    C2 = 0.03**2

    ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2))/((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2))

    if size_average:
        return ssim_map.mean()
    return ssim_map.mean(1).mean(1).mean(1)


class SSIM(torch.nn.Module):
    def __init__(self, window_size=11, size_average=True):
        super().__init__()
        self.window_size = window_size
        self.size_average = size_average
        self.channel = 1
        self.window = create_window(window_size, self.channel)

    def forward(self, img1, img2):
        (_, channel, _, _) = img1.size()

        if channel == self.channel and self.window.data.type() == img1.data.type():
            window = self.window
        else:
            window = create_window(self.window_size, channel)

            if img1.is_cuda:
                window = window.cuda(img1.get_device())
            window = window.type_as(img1)

            self.window = window
            self.channel = channel

        return _ssim(img1, img2, window, self.window_size, channel, self.size_average)


def ssim(img1, img2, window_size=11, size_average=True):
    (_, channel, _, _) = img1.size()
    window = create_window(window_size, channel)

    if img1.is_cuda:
        window = window.cuda(img1.get_device())
    window = window.type_as(img1)

    return _ssim(img1, img2, window, window_size, channel, size_average)
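As wired into `TacotronLoss` earlier in this diff, spectrograms are treated as single-channel images for SSIM. A quick usage sketch of the module above (the import path is an assumption; adjust it to wherever the file lands in the package):

```python
import torch
# from TTS.tts.utils.ssim import SSIM  # assumed location of the file above

pred = torch.rand(4, 1, 80, 120)      # [batch, channel, mel_bins, frames]
target = torch.rand(4, 1, 80, 120)

criterion = SSIM(window_size=11)
similarity = criterion(pred, target)  # approaches 1.0 for identical inputs
loss = 1.0 - similarity               # turn similarity into a minimizable loss
```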
@ -14,10 +14,13 @@ def text_to_seqvec(text, CONFIG):
        seq = np.asarray(
            phoneme_to_sequence(text, text_cleaner, CONFIG.phoneme_language,
                                CONFIG.enable_eos_bos_chars,
                                tp=CONFIG.characters if 'characters' in CONFIG.keys() else None),
                                tp=CONFIG.characters if 'characters' in CONFIG.keys() else None,
                                add_blank=CONFIG['add_blank'] if 'add_blank' in CONFIG.keys() else False),
            dtype=np.int32)
    else:
        seq = np.asarray(text_to_sequence(text, text_cleaner, tp=CONFIG.characters if 'characters' in CONFIG.keys() else None), dtype=np.int32)
        seq = np.asarray(
            text_to_sequence(text, text_cleaner, tp=CONFIG.characters if 'characters' in CONFIG.keys() else None,
                             add_blank=CONFIG['add_blank'] if 'add_blank' in CONFIG.keys() else False), dtype=np.int32)
    return seq


@ -59,7 +62,7 @@ def run_model_torch(model, inputs, CONFIG, truncated, speaker_id=None, style_mel
            inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
    elif 'glow' in CONFIG.model.lower():
        inputs_lengths = torch.tensor(inputs.shape[1:2]).to(inputs.device)  # pylint: disable=not-callable
        postnet_output, _, _, _, alignments, _, _ = model.inference(inputs, inputs_lengths)
        postnet_output, _, _, _, alignments, _, _ = model.inference(inputs, inputs_lengths, g=speaker_id if speaker_id is not None else speaker_embeddings)
        postnet_output = postnet_output.permute(0, 2, 1)
        # these only belong to tacotron models.
        decoder_output = None

@ -207,7 +210,7 @@ def synthesis(model,
    """
    # GST processing
    style_mel = None
    if CONFIG.use_gst and style_wav is not None:
    if 'use_gst' in CONFIG.keys() and CONFIG.use_gst and style_wav is not None:
        if isinstance(style_wav, dict):
            style_mel = style_wav
        else:
@ -16,6 +16,8 @@ _id_to_symbol = {i: s for i, s in enumerate(symbols)}
_phonemes_to_id = {s: i for i, s in enumerate(phonemes)}
_id_to_phonemes = {i: s for i, s in enumerate(phonemes)}

_symbols = symbols
_phonemes = phonemes
# Regular expression matching text enclosed in curly braces:
_CURLY_RE = re.compile(r'(.*?)\{(.+?)\}(.*)')


@ -57,6 +59,10 @@ def text2phone(text, language):

    return ph


def intersperse(sequence, token):
    result = [token] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result


def pad_with_eos_bos(phoneme_sequence, tp=None):
    # pylint: disable=global-statement

@ -69,10 +75,9 @@ def pad_with_eos_bos(phoneme_sequence, tp=None):

    return [_phonemes_to_id[_bos]] + list(phoneme_sequence) + [_phonemes_to_id[_eos]]


def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=None):
def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=None, add_blank=False):
    # pylint: disable=global-statement
    global _phonemes_to_id
    global _phonemes_to_id, _phonemes
    if tp:
        _, _phonemes = make_symbols(**tp)
        _phonemes_to_id = {s: i for i, s in enumerate(_phonemes)}

@ -88,13 +93,17 @@ def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=
    # Append EOS char
    if enable_eos_bos:
        sequence = pad_with_eos_bos(sequence, tp=tp)
    if add_blank:
        sequence = intersperse(sequence, len(_phonemes))  # add a blank token (new), whose id number is len(_phonemes)
    return sequence


def sequence_to_phoneme(sequence, tp=None):
def sequence_to_phoneme(sequence, tp=None, add_blank=False):
    # pylint: disable=global-statement
    '''Converts a sequence of IDs back to a string'''
    global _id_to_phonemes
    global _id_to_phonemes, _phonemes
    if add_blank:
        sequence = list(filter(lambda x: x != len(_phonemes), sequence))
    result = ''
    if tp:
        _, _phonemes = make_symbols(**tp)

@ -107,7 +116,7 @@ def sequence_to_phoneme(sequence, tp=None):
    return result.replace('}{', ' ')


def text_to_sequence(text, cleaner_names, tp=None):
def text_to_sequence(text, cleaner_names, tp=None, add_blank=False):
    '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.

    The text can optionally have ARPAbet sequences enclosed in curly braces embedded

@ -121,7 +130,7 @@ def text_to_sequence(text, cleaner_names, tp=None):
        List of integers corresponding to the symbols in the text
    '''
    # pylint: disable=global-statement
    global _symbol_to_id
    global _symbol_to_id, _symbols
    if tp:
        _symbols, _ = make_symbols(**tp)
        _symbol_to_id = {s: i for i, s in enumerate(_symbols)}

@ -137,13 +146,19 @@ def text_to_sequence(text, cleaner_names, tp=None):
            _clean_text(m.group(1), cleaner_names))
        sequence += _arpabet_to_sequence(m.group(2))
        text = m.group(3)

    if add_blank:
        sequence = intersperse(sequence, len(_symbols))  # add a blank token (new), whose id number is len(_symbols)
    return sequence


def sequence_to_text(sequence, tp=None):
def sequence_to_text(sequence, tp=None, add_blank=False):
    '''Converts a sequence of IDs back to a string'''
    # pylint: disable=global-statement
    global _id_to_symbol
    global _id_to_symbol, _symbols
    if add_blank:
        sequence = list(filter(lambda x: x != len(_symbols), sequence))

    if tp:
        _symbols, _ = make_symbols(**tp)
        _id_to_symbol = {i: s for i, s in enumerate(_symbols)}
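`intersperse` places the blank id before, between, and after every symbol, so a length-n sequence becomes length 2n + 1; using `len(_symbols)` as the token guarantees an id just outside the normal vocabulary (which is why `setup_model` grows the embedding by one). For example:

```python
def intersperse(sequence, token):
    result = [token] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result

print(intersperse([7, 8, 9], 130))  # [130, 7, 130, 8, 130, 9, 130]
```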
@ -1,6 +1,8 @@
import torch
import librosa
import matplotlib
import numpy as np
import torch

matplotlib.use('Agg')
import matplotlib.pyplot as plt
from TTS.tts.utils.text import phoneme_to_sequence, sequence_to_phoneme

@ -43,6 +45,8 @@ def plot_spectrogram(spectrogram,
        spectrogram_ = spectrogram.detach().cpu().numpy().squeeze().T
    else:
        spectrogram_ = spectrogram.T
    spectrogram_ = spectrogram_.astype(
        np.float32) if spectrogram_.dtype == np.float16 else spectrogram_
    if ap is not None:
        spectrogram_ = ap._denormalize(spectrogram_)  # pylint: disable=protected-access
    fig = plt.figure(figsize=fig_size)
@ -174,7 +174,7 @@ class AudioProcessor(object):
        for key in stats_config.keys():
            if key in skip_parameters:
                continue
            if key != 'sample_rate':
            if key not in ['sample_rate', 'trim_db']:
                assert stats_config[key] == self.__dict__[key],\
                    f" [!] Audio param {key} does not match the value used for computing mean-var stats. {stats_config[key]} vs {self.__dict__[key]}"
        return mel_mean, mel_std, linear_mean, linear_std, stats_config
@ -1,9 +1,11 @@
import os
import glob
import shutil
import datetime
import glob
import os
import shutil
import subprocess

import torch


def get_git_branch():
    try:
@ -1,5 +1,7 @@
import os
import re
import json
import yaml
import pickle as pickle_tts


@ -17,19 +19,27 @@ class AttrDict(dict):
        self.__dict__ = self


def load_config(config_path):
def load_config(config_path: str) -> AttrDict:
    """Load config files and discard comments

    Args:
        config_path (str): path to config file.
    """
    config = AttrDict()

    with open(config_path, "r") as f:
        input_str = f.read()
    # handle comments
    input_str = re.sub(r'\\\n', '', input_str)
    input_str = re.sub(r'//.*\n', '\n', input_str)
    data = json.loads(input_str)

    ext = os.path.splitext(config_path)[1]
    if ext in (".yml", ".yaml"):
        with open(config_path, "r") as f:
            data = yaml.safe_load(f)
    else:
        # fallback to json
        with open(config_path, "r") as f:
            input_str = f.read()
        # handle comments
        input_str = re.sub(r'\\\n', '', input_str)
        input_str = re.sub(r'//.*\n', '\n', input_str)
        data = json.loads(input_str)

    config.update(data)
    return config
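After this change a single loader covers both formats, dispatching on the file extension: YAML goes through `yaml.safe_load`, everything else through the existing comment-stripping JSON path. A usage sketch (file paths hypothetical):

```python
from TTS.utils.io import load_config

c = load_config("config.json")   # JSON with //-style comments, as in this repo
print(c.run_name)                # AttrDict allows attribute access

c = load_config("config.yaml")   # YAML goes through the same entry point
```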
@ -92,8 +92,8 @@
    // DATASET
    "data_path": "/home/erogol/Data/MozillaMerged22050/wavs/",
    "feature_path": null,
    "seq_len": 16384,
    "pad_short": 2000,
    "seq_len": 6144,
    "pad_short": 500,
    "conv_pad": 0,
    "use_noise_augment": false,
    "use_cache": true,

@ -102,6 +102,16 @@

    // TRAINING
    "batch_size": 64, // Batch size for training.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 50
    },

    // VALIDATION
    "run_eval": true,
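The schedule blocks describe a linear grid of noise variances (betas) for the diffusion-style vocoder: a fine 1000-step grid for training and a coarse 50-step grid for faster test-time sampling. A sketch of how such a grid is typically materialized (function name and exact use are assumptions, not taken from this commit):

```python
import numpy as np

def make_noise_schedule(min_val, max_val, num_steps):
    """Linear beta grid plus cumulative alpha products, as commonly used
    by diffusion vocoders. A sketch; the trainer's exact code may differ."""
    beta = np.linspace(min_val, max_val, num_steps)
    alpha_hat = np.cumprod(1.0 - beta)
    return beta, alpha_hat

beta, alpha_hat = make_noise_schedule(1e-6, 1e-2, 1000)  # train schedule
```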
@ -0,0 +1,138 @@
{
    "run_name": "fullband-melgan",
    "run_description": "fullband melgan mean-var scaling",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": "/home/erogol/Data/libritts/LibriTTS/scale_stats.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54324"
    },

    // MODEL PARAMETERS
    "use_pqmf": false,

    // LOSS PARAMETERS
    "use_stft_loss": true,
    "use_subband_stft_loss": false,
    "use_mse_gan_loss": true,
    "use_hinge_gan_loss": false,
    "use_feat_match_loss": false, // use only with melgan discriminators

    // loss weights
    "stft_loss_weight": 0.5,
    "subband_stft_loss_weight": 0.5,
    "mse_G_loss_weight": 2.5,
    "hinge_G_loss_weight": 2.5,
    "feat_match_loss_weight": 25,

    // multiscale stft loss parameters
    "stft_loss_params": {
        "n_ffts": [1024, 2048, 512],
        "hop_lengths": [120, 240, 50],
        "win_lengths": [600, 1200, 240]
    },

    "target_loss": "avg_G_loss", // loss value to pick the best model to save after each epoch

    // DISCRIMINATOR
    "discriminator_model": "melgan_multiscale_discriminator",
    "discriminator_model_params":{
        "base_channels": 16,
        "max_channels": 512,
        "downsample_factors": [4, 4, 4]
    },
    "steps_to_start_discriminator": 200000, // steps required to start GAN training.

    // GENERATOR
    "generator_model": "fullband_melgan_generator",
    "generator_model_params": {
        "upsample_factors": [8, 8, 4],
        "num_res_blocks": 4
    },

    // DATASET
    "data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/",
    "feature_path": null,
    "seq_len": 16384,
    "pad_short": 2000,
    "conv_pad": 0,
    "use_noise_augment": false,
    "use_cache": true,

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 48, // Batch size for training.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 10, // Until the model is stable, testing only wastes computation time.
    "test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "epochs": 10000, // total number of epochs to train.
    "wd": 0.0, // Weight decay weight.
    "gen_clip_grad": -1, // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "disc_clip_grad": -1, // Discriminator gradient clipping threshold.
    "lr_scheduler_gen": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_gen_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_scheduler_disc": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_disc_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_gen": 0.000015625, // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_disc": 0.000015625,

    // TENSORBOARD and LOGGING
    "print_step": 25, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 25000, // Number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 10,

    // PATHS
    "output_path": "/home/erogol/Models/"
}
@ -0,0 +1,116 @@
{
    "run_name": "wavegrad-libritts",
    "run_description": "wavegrad libritts",

    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": "/home/erogol/Data/libritts/LibriTTS/scale_stats_wavegrad.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "mixed_precision": true, // enable torch mixed precision training (true, false)
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54322"
    },

    "target_loss": "avg_wavegrad_loss", // loss value to pick the best model to save after each epoch

    // MODEL PARAMETERS
    "generator_model": "wavegrad",
    "model_params":{
        "use_weight_norm": true,
        "y_conv_channels": 32,
        "x_conv_channels": 768,
        "ublock_out_channels": [512, 512, 256, 128, 128],
        "dblock_out_channels": [128, 128, 256, 512],
        "upsample_factors": [4, 4, 4, 2, 2],
        "upsample_dilations": [
            [1, 2, 1, 2],
            [1, 2, 1, 2],
            [1, 2, 4, 8],
            [1, 2, 4, 8],
            [1, 2, 4, 8]]
    },

    // DATASET
    "data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/", // root data path. It finds all wav files recursively from there.
    "feature_path": null, // if you use precomputed features
    "seq_len": 6144, // 24 * hop_length
    "pad_short": 0, // additional padding for short wavs
    "conv_pad": 0, // additional padding against convolutions applied to spectrograms
    "use_noise_augment": false, // add noise to the audio signal for augmentation
    "use_cache": false, // use in memory cache to keep the computed features. This might cause OOM.

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 96, // Batch size for training.

    // NOISE SCHEDULE PARAMS - Only effective at training time.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 50
    },

    // VALIDATION
    "run_eval": true, // enable/disable evaluation run

    // OPTIMIZER
    "epochs": 10000, // total number of epochs to train.
    "clip_grad": 1.0, // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.

    // TENSORBOARD and LOGGING
    "print_step": 50, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 5000, // Number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 256,

    // PATHS
    "output_path": "/home/erogol/Models/LJSpeech/"
}
@ -0,0 +1,98 @@
{
    "run_name": "wavernn_librittts",
    "run_description": "wavernn libritts training from LJSpeech model",

    // AUDIO PARAMETERS
    "audio": {
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.
        // Audio processing parameters
        "sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
        // Silence trimming
        "do_trim_silence": false, // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.
        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 40.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 20.0, // scaler value applied after log transform of spectrogram.
        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // Generating / Synthesizing
    "batched": true,
    "target_samples": 11000, // target number of samples to be generated in each batch entry
    "overlap_samples": 550, // number of samples for crossfading between batches
    // DISTRIBUTED TRAINING
    // "distributed":{
    //     "backend": "nccl",
    //     "url": "tcp:\/\/localhost:54321"
    // },

    // MODEL MODE
    "mode": "mold", // mold [string], gauss [string], bits [int]
    "mulaw": true, // apply mulaw if mode is bits

    // MODEL PARAMETERS
    "wavernn_model_params": {
        "rnn_dims": 512,
        "fc_dims": 512,
        "compute_dims": 128,
        "res_out_dims": 128,
        "num_res_blocks": 10,
        "use_aux_net": true,
        "use_upsample_net": true,
        "upsample_factors": [4, 8, 8] // this needs to correctly factorise hop_length
    },

    // DATASET
    //"use_gta": true, // use computed gta features from the tts model
    "data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/", // path containing training wav files
    "feature_path": null, // path containing computed features from wav files; if null, compute them
    "seq_len": 1280, // has to be divisible by hop_length
    "padding": 2, // pad the input for resnet to see wider input length

    // TRAINING
    "batch_size": 256, // Batch size for training.
    "epochs": 10000, // total number of epochs to train.
    "mixed_precision": true, // enable / disable mixed precision training

    // VALIDATION
    "run_eval": true,
    "test_every_epochs": 10, // Test after set number of epochs (Test every 10 epochs for example)

    // OPTIMIZER
    "grad_clip": 4, // apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [200000, 400000, 600000]
    },
    "lr": 1e-4, // initial learning rate

    // TENSORBOARD and LOGGING
    "print_step": 25, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 25000, // Number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 50, // number of samples for testing

    // PATHS
    "output_path": "/home/erogol/Models/LJSpeech/"
}
@ -1,17 +1,38 @@
import glob
import os
from pathlib import Path
from tqdm import tqdm

import numpy as np


def preprocess_wav_files(out_path, config, ap):
    os.makedirs(os.path.join(out_path, "quant"), exist_ok=True)
    os.makedirs(os.path.join(out_path, "mel"), exist_ok=True)
    wav_files = find_wav_files(config.data_path)
    for path in tqdm(wav_files):
        wav_name = Path(path).stem
        quant_path = os.path.join(out_path, "quant", wav_name + ".npy")
        mel_path = os.path.join(out_path, "mel", wav_name + ".npy")
        y = ap.load_wav(path)
        mel = ap.melspectrogram(y)
        np.save(mel_path, mel)
        if isinstance(config.mode, int):
            quant = (
                ap.mulaw_encode(y, qc=config.mode)
                if config.mulaw
                else ap.quantize(y, bits=config.mode)
            )
            np.save(quant_path, quant)


def find_wav_files(data_path):
    wav_paths = glob.glob(os.path.join(data_path, '**', '*.wav'), recursive=True)
    wav_paths = glob.glob(os.path.join(data_path, "**", "*.wav"), recursive=True)
    return wav_paths


def find_feat_files(data_path):
    feat_paths = glob.glob(os.path.join(data_path, '**', '*.npy'), recursive=True)
    feat_paths = glob.glob(os.path.join(data_path, "**", "*.npy"), recursive=True)
    return feat_paths


@ -23,8 +44,12 @@ def load_wav_data(data_path, eval_split_size):


def load_wav_feat_data(data_path, feat_path, eval_split_size):
    wav_paths = sorted(find_wav_files(data_path))
    feat_paths = sorted(find_feat_files(feat_path))
    wav_paths = find_wav_files(data_path)
    feat_paths = find_feat_files(feat_path)

    wav_paths.sort(key=lambda x: Path(x).stem)
    feat_paths.sort(key=lambda x: Path(x).stem)

    assert len(wav_paths) == len(feat_paths)
    for wav, feat in zip(wav_paths, feat_paths):
        wav_name = Path(wav).stem
@ -0,0 +1,131 @@
import os
import glob
import torch
import random
import numpy as np
from torch.utils.data import Dataset
from multiprocessing import Manager


class WaveGradDataset(Dataset):
    """
    WaveGrad Dataset searches for all the wav files under root path,
    converts them to acoustic features on the fly, and returns
    random segments of (audio, feature) pairs.
    """
    def __init__(self,
                 ap,
                 items,
                 seq_len,
                 hop_len,
                 pad_short,
                 conv_pad=2,
                 is_training=True,
                 return_segments=True,
                 use_noise_augment=False,
                 use_cache=False,
                 verbose=False):

        self.ap = ap
        self.item_list = items
        self.seq_len = seq_len if return_segments else None
        self.hop_len = hop_len
        self.pad_short = pad_short
        self.conv_pad = conv_pad
        self.is_training = is_training
        self.return_segments = return_segments
        self.use_cache = use_cache
        self.use_noise_augment = use_noise_augment
        self.verbose = verbose

        if return_segments:
            assert seq_len % hop_len == 0, " [!] seq_len has to be a multiple of hop_len."
        self.feat_frame_len = seq_len // hop_len + (2 * conv_pad)

        # cache acoustic features
        if use_cache:
            self.create_feature_cache()

    def create_feature_cache(self):
        self.manager = Manager()
        self.cache = self.manager.list()
        self.cache += [None for _ in range(len(self.item_list))]

    @staticmethod
    def find_wav_files(path):
        return glob.glob(os.path.join(path, '**', '*.wav'), recursive=True)

    def __len__(self):
        return len(self.item_list)

    def __getitem__(self, idx):
        item = self.load_item(idx)
        return item

    def load_test_samples(self, num_samples):
        samples = []
        return_segments = self.return_segments
        self.return_segments = False
        for idx in range(num_samples):
            mel, audio = self.load_item(idx)
            samples.append([mel, audio])
        self.return_segments = return_segments
        return samples

    def load_item(self, idx):
        """ load (audio, feat) couple """
        # compute features from wav
        wavpath = self.item_list[idx]

        if self.use_cache and self.cache[idx] is not None:
            audio = self.cache[idx]
        else:
            audio = self.ap.load_wav(wavpath)

            if self.return_segments:
                # correct audio length wrt segment length
                if audio.shape[-1] < self.seq_len + self.pad_short:
                    audio = np.pad(audio, (0, self.seq_len + self.pad_short - len(audio)),
                                   mode='constant', constant_values=0.0)
                assert audio.shape[-1] >= self.seq_len + self.pad_short, f"{audio.shape[-1]} vs {self.seq_len + self.pad_short}"

            # correct the audio length wrt hop length
            p = (audio.shape[-1] // self.hop_len + 1) * self.hop_len - audio.shape[-1]
            audio = np.pad(audio, (0, p), mode='constant', constant_values=0.0)

            if self.use_cache:
                self.cache[idx] = audio

        if self.return_segments:
            max_start = len(audio) - self.seq_len
            start = random.randint(0, max_start)
            end = start + self.seq_len
            audio = audio[start:end]

        if self.use_noise_augment and self.is_training and self.return_segments:
            # audio is a numpy array at this point
            audio = audio + (1 / 32768) * np.random.randn(*audio.shape).astype(audio.dtype)

        mel = self.ap.melspectrogram(audio)
        mel = mel[..., :-1]  # ignore the padding

        audio = torch.from_numpy(audio).float()
        mel = torch.from_numpy(mel).float().squeeze(0)
        return (mel, audio)

    @staticmethod
    def collate_full_clips(batch):
        """This is used in tune_wavegrad.py.
        It pads sequences to the max length."""
        max_mel_length = max([b[0].shape[1] for b in batch]) if len(batch) > 1 else batch[0][0].shape[1]
        max_audio_length = max([b[1].shape[0] for b in batch]) if len(batch) > 1 else batch[0][1].shape[0]

        mels = torch.zeros([len(batch), batch[0][0].shape[0], max_mel_length])
        audios = torch.zeros([len(batch), max_audio_length])

        for idx, b in enumerate(batch):
            mel = b[0]
            audio = b[1]
            mels[idx, :, :mel.shape[1]] = mel
            audios[idx, :audio.shape[0]] = audio

        return mels, audios
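A hedged usage sketch for the dataset above; `ap` (an `AudioProcessor`) and `wav_paths` are assumed to exist, and the numbers mirror the wavegrad config earlier in this diff. Since segments are fixed-size tensors, the default collate works:

```python
from torch.utils.data import DataLoader

dataset = WaveGradDataset(ap=ap,             # assumed AudioProcessor instance
                          items=wav_paths,   # assumed list of wav file paths
                          seq_len=6144,      # "seq_len" in the config
                          hop_len=256,       # "hop_length" in the config
                          pad_short=0,
                          conv_pad=0)
loader = DataLoader(dataset, batch_size=96, num_workers=4, shuffle=True)
mel, audio = next(iter(loader))  # mel: [B, num_mels, frames], audio: [B, 6144]
```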
@ -0,0 +1,118 @@
import torch
import numpy as np
from torch.utils.data import Dataset


class WaveRNNDataset(Dataset):
    """
    WaveRNN Dataset searches for all the wav files under root path
    and converts them to acoustic features on the fly.
    """

    def __init__(self,
                 ap,
                 items,
                 seq_len,
                 hop_len,
                 pad,
                 mode,
                 mulaw,
                 is_training=True,
                 verbose=False,
                 ):

        self.ap = ap
        self.compute_feat = not isinstance(items[0], (tuple, list))
        self.item_list = items
        self.seq_len = seq_len
        self.hop_len = hop_len
        self.mel_len = seq_len // hop_len
        self.pad = pad
        self.mode = mode
        self.mulaw = mulaw
        self.is_training = is_training
        self.verbose = verbose

        assert self.seq_len % self.hop_len == 0

    def __len__(self):
        return len(self.item_list)

    def __getitem__(self, index):
        item = self.load_item(index)
        return item

    def load_item(self, index):
        """
        load (audio, feat) couple if feature_path is set,
        else compute it on the fly
        """
        if self.compute_feat:

            wavpath = self.item_list[index]
            audio = self.ap.load_wav(wavpath)
            min_audio_len = 2 * self.seq_len + (2 * self.pad * self.hop_len)
            if audio.shape[0] < min_audio_len:
                print(" [!] Instance is too short! : {}".format(wavpath))
                audio = np.pad(audio, [0, min_audio_len - audio.shape[0] + self.hop_len])
            mel = self.ap.melspectrogram(audio)

            if self.mode in ["gauss", "mold"]:
                x_input = audio
            elif isinstance(self.mode, int):
                x_input = (self.ap.mulaw_encode(audio, qc=self.mode)
                           if self.mulaw else self.ap.quantize(audio, bits=self.mode))
            else:
                raise RuntimeError("Unknown dataset mode - {}".format(self.mode))

        else:

            wavpath, feat_path = self.item_list[index]
            mel = np.load(feat_path.replace("/quant/", "/mel/"))

            if mel.shape[-1] < self.mel_len + 2 * self.pad:
                print(" [!] Instance is too short! : {}".format(wavpath))
                self.item_list[index] = self.item_list[index + 1]
                wavpath, feat_path = self.item_list[index]
                mel = np.load(feat_path.replace("/quant/", "/mel/"))
            if self.mode in ["gauss", "mold"]:
                x_input = self.ap.load_wav(wavpath)
            elif isinstance(self.mode, int):
                x_input = np.load(feat_path.replace("/mel/", "/quant/"))
            else:
                raise RuntimeError("Unknown dataset mode - {}".format(self.mode))

        return mel, x_input, wavpath

    def collate(self, batch):
        mel_win = self.seq_len // self.hop_len + 2 * self.pad
        max_offsets = [x[0].shape[-1] -
                       (mel_win + 2 * self.pad) for x in batch]

        mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
        sig_offsets = [(offset + self.pad) *
                       self.hop_len for offset in mel_offsets]

        mels = [
            x[0][:, mel_offsets[i]: mel_offsets[i] + mel_win]
            for i, x in enumerate(batch)
        ]

        coarse = [
            x[1][sig_offsets[i]: sig_offsets[i] + self.seq_len + 1]
            for i, x in enumerate(batch)
        ]

        mels = np.stack(mels).astype(np.float32)
        if self.mode in ["gauss", "mold"]:
            coarse = np.stack(coarse).astype(np.float32)
            coarse = torch.FloatTensor(coarse)
            x_input = coarse[:, : self.seq_len]
        elif isinstance(self.mode, int):
            coarse = np.stack(coarse).astype(np.int64)
            coarse = torch.LongTensor(coarse)
            x_input = (2 * coarse[:, : self.seq_len].float() /
                       (2 ** self.mode - 1.0) - 1.0)
        y_coarse = coarse[:, 1:]
        mels = torch.FloatTensor(mels)
        return x_input, mels, y_coarse
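In `collate`, each coarse slice is cut `seq_len + 1` samples long so that input and target are the same series shifted by one step, the standard teacher-forcing setup for an autoregressive vocoder:

```python
import torch

seq_len = 4
coarse = torch.arange(seq_len + 1).unsqueeze(0)  # one item: [[0, 1, 2, 3, 4]]

x_input = coarse[:, :seq_len]   # [[0, 1, 2, 3]] -> what the RNN conditions on
y_coarse = coarse[:, 1:]        # [[1, 2, 3, 4]] -> next-sample targets
```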
@ -0,0 +1,175 @@
|
|||
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm


class Conv1d(nn.Conv1d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        nn.init.orthogonal_(self.weight)
        nn.init.zeros_(self.bias)


class PositionalEncoding(nn.Module):
    """Positional encoding with noise level conditioning"""
    def __init__(self, n_channels, max_len=10000):
        super().__init__()
        self.n_channels = n_channels
        self.max_len = max_len
        self.C = 5000
        self.pe = torch.zeros(0, 0)

    def forward(self, x, noise_level):
        if x.shape[2] > self.pe.shape[1]:
            self.init_pe_matrix(x.shape[1], x.shape[2], x)
        return x + noise_level[..., None, None] + self.pe[:, :x.size(2)].repeat(x.shape[0], 1, 1) / self.C

    def init_pe_matrix(self, n_channels, max_len, x):
        pe = torch.zeros(max_len, n_channels)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.pow(10000, torch.arange(0, n_channels, 2).float() / n_channels)

        pe[:, 0::2] = torch.sin(position / div_term)
        pe[:, 1::2] = torch.cos(position / div_term)
        self.pe = pe.transpose(0, 1).to(x)


class FiLM(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.encoding = PositionalEncoding(input_size)
        self.input_conv = nn.Conv1d(input_size, input_size, 3, padding=1)
        self.output_conv = nn.Conv1d(input_size, output_size * 2, 3, padding=1)

        nn.init.xavier_uniform_(self.input_conv.weight)
        nn.init.xavier_uniform_(self.output_conv.weight)
        nn.init.zeros_(self.input_conv.bias)
        nn.init.zeros_(self.output_conv.bias)

    def forward(self, x, noise_scale):
        o = self.input_conv(x)
        o = F.leaky_relu(o, 0.2)
        o = self.encoding(o, noise_scale)
        shift, scale = torch.chunk(self.output_conv(o), 2, dim=1)
        return shift, scale

    def remove_weight_norm(self):
        nn.utils.remove_weight_norm(self.input_conv)
        nn.utils.remove_weight_norm(self.output_conv)

    def apply_weight_norm(self):
        self.input_conv = weight_norm(self.input_conv)
        self.output_conv = weight_norm(self.output_conv)


# renamed from the original `shif_and_scale` typo; call sites updated to match
@torch.jit.script
def shift_and_scale(x, scale, shift):
    o = shift + scale * x
    return o


class UBlock(nn.Module):
    def __init__(self, input_size, hidden_size, factor, dilation):
        super().__init__()
        assert isinstance(dilation, (list, tuple))
        assert len(dilation) == 4

        self.factor = factor
        self.res_block = Conv1d(input_size, hidden_size, 1)
        self.main_block = nn.ModuleList([
            Conv1d(input_size, hidden_size, 3,
                   dilation=dilation[0], padding=dilation[0]),
            Conv1d(hidden_size, hidden_size, 3,
                   dilation=dilation[1], padding=dilation[1])
        ])
        self.out_block = nn.ModuleList([
            Conv1d(hidden_size, hidden_size, 3,
                   dilation=dilation[2], padding=dilation[2]),
            Conv1d(hidden_size, hidden_size, 3,
                   dilation=dilation[3], padding=dilation[3])
        ])

    def forward(self, x, shift, scale):
        x_inter = F.interpolate(x, size=x.shape[-1] * self.factor)
        res = self.res_block(x_inter)
        o = F.leaky_relu(x_inter, 0.2)
        o = F.interpolate(o, size=x.shape[-1] * self.factor)
        o = self.main_block[0](o)
        o = shift_and_scale(o, scale, shift)
        o = F.leaky_relu(o, 0.2)
        o = self.main_block[1](o)
        res2 = res + o
        o = shift_and_scale(res2, scale, shift)
        o = F.leaky_relu(o, 0.2)
        o = self.out_block[0](o)
        o = shift_and_scale(o, scale, shift)
        o = F.leaky_relu(o, 0.2)
        o = self.out_block[1](o)
        o = o + res2
        return o

    def remove_weight_norm(self):
        nn.utils.remove_weight_norm(self.res_block)
        for _, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                nn.utils.remove_weight_norm(layer)
        for _, layer in enumerate(self.out_block):
            if len(layer.state_dict()) != 0:
                nn.utils.remove_weight_norm(layer)

    def apply_weight_norm(self):
        self.res_block = weight_norm(self.res_block)
        for idx, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                self.main_block[idx] = weight_norm(layer)
        for idx, layer in enumerate(self.out_block):
            if len(layer.state_dict()) != 0:
                self.out_block[idx] = weight_norm(layer)


class DBlock(nn.Module):
    def __init__(self, input_size, hidden_size, factor):
        super().__init__()
        self.factor = factor
        self.res_block = Conv1d(input_size, hidden_size, 1)
        self.main_block = nn.ModuleList([
            Conv1d(input_size, hidden_size, 3, dilation=1, padding=1),
            Conv1d(hidden_size, hidden_size, 3, dilation=2, padding=2),
            Conv1d(hidden_size, hidden_size, 3, dilation=4, padding=4),
        ])

    def forward(self, x):
        size = x.shape[-1] // self.factor
        res = self.res_block(x)
        res = F.interpolate(res, size=size)
        o = F.interpolate(x, size=size)
        for layer in self.main_block:
            o = F.leaky_relu(o, 0.2)
            o = layer(o)
        return o + res

    def remove_weight_norm(self):
        nn.utils.remove_weight_norm(self.res_block)
        for _, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                nn.utils.remove_weight_norm(layer)

    def apply_weight_norm(self):
        self.res_block = weight_norm(self.res_block)
        for idx, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                self.main_block[idx] = weight_norm(layer)
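A shape sketch for the blocks above (assumed sizes, not from the diff): a `DBlock` halves the time axis of the waveform branch, `FiLM` maps that activation plus a noise level to a `(shift, scale)` pair, and a `UBlock` upsamples the spectrogram branch under that conditioning.

```python
import torch

d = DBlock(input_size=32, hidden_size=128, factor=2)
f = FiLM(input_size=128, output_size=512)
u = UBlock(input_size=768, hidden_size=512, factor=2, dilation=[1, 2, 1, 2])

y = torch.randn(1, 32, 64)           # waveform (downsampling) branch
o = d(y)                             # -> (1, 128, 32): time divided by `factor`
shift, scale = f(o, torch.rand(1))   # each -> (1, 512, 32)

x = torch.randn(1, 768, 16)          # spectrogram (upsampling) branch
out = u(x, shift, scale)             # -> (1, 512, 32): time multiplied by `factor`
print(o.shape, shift.shape, out.shape)
```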
@ -0,0 +1,177 @@
import numpy as np
import torch
from torch import nn
from torch.nn.utils import weight_norm

from ..layers.wavegrad import DBlock, FiLM, UBlock, Conv1d


class Wavegrad(nn.Module):
    # pylint: disable=dangerous-default-value
    def __init__(self,
                 in_channels=80,
                 out_channels=1,
                 use_weight_norm=False,
                 y_conv_channels=32,
                 x_conv_channels=768,
                 dblock_out_channels=[128, 128, 256, 512],
                 ublock_out_channels=[512, 512, 256, 128, 128],
                 upsample_factors=[5, 5, 3, 2, 2],
                 upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8],
                                     [1, 2, 4, 8], [1, 2, 4, 8]]):
        super().__init__()

        self.use_weight_norm = use_weight_norm
        self.hop_len = np.prod(upsample_factors)
        self.num_steps = None
        self.beta = None
        self.alpha = None
        self.alpha_hat = None
        self.noise_level = None
        self.c1 = None
        self.c2 = None
        self.sigma = None

        # dblocks
        self.y_conv = Conv1d(1, y_conv_channels, 5, padding=2)
        self.dblocks = nn.ModuleList([])
        ic = y_conv_channels
        for oc, df in zip(dblock_out_channels, reversed(upsample_factors)):
            self.dblocks.append(DBlock(ic, oc, df))
            ic = oc

        # film
        self.film = nn.ModuleList([])
        ic = y_conv_channels
        for oc in reversed(ublock_out_channels):
            self.film.append(FiLM(ic, oc))
            ic = oc

        # ublocks
        self.ublocks = nn.ModuleList([])
        ic = x_conv_channels
        for oc, uf, ud in zip(ublock_out_channels, upsample_factors, upsample_dilations):
            self.ublocks.append(UBlock(ic, oc, uf, ud))
            ic = oc

        self.x_conv = Conv1d(in_channels, x_conv_channels, 3, padding=1)
        self.out_conv = Conv1d(oc, out_channels, 3, padding=1)

        if use_weight_norm:
            self.apply_weight_norm()

    def forward(self, x, spectrogram, noise_scale):
        shift_and_scale = []

        x = self.y_conv(x)
        shift_and_scale.append(self.film[0](x, noise_scale))

        for film, layer in zip(self.film[1:], self.dblocks):
            x = layer(x)
            shift_and_scale.append(film(x, noise_scale))

        x = self.x_conv(spectrogram)
        for layer, (film_shift, film_scale) in zip(self.ublocks,
                                                   reversed(shift_and_scale)):
            x = layer(x, film_shift, film_scale)
        x = self.out_conv(x)
        return x

    def load_noise_schedule(self, path):
        beta = np.load(path, allow_pickle=True).item()['beta']
        self.compute_noise_level(beta)

    @torch.no_grad()
    def inference(self, x, y_n=None):
        """ x: B x D X T """
        if y_n is None:
            y_n = torch.randn(x.shape[0], 1, self.hop_len * x.shape[-1], dtype=torch.float32).to(x)
        else:
            y_n = torch.FloatTensor(y_n).unsqueeze(0).unsqueeze(0).to(x)
        sqrt_alpha_hat = self.noise_level.to(x)
        for n in range(len(self.alpha) - 1, -1, -1):
            y_n = self.c1[n] * (y_n -
                                self.c2[n] * self.forward(y_n, x, sqrt_alpha_hat[n].repeat(x.shape[0])))
            if n > 0:
                z = torch.randn_like(y_n)
                y_n += self.sigma[n - 1] * z
            y_n.clamp_(-1.0, 1.0)
        return y_n

    def compute_y_n(self, y_0):
        """Compute noisy audio based on the noise schedule"""
        self.noise_level = self.noise_level.to(y_0)
        if len(y_0.shape) == 3:
            y_0 = y_0.squeeze(1)
        s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])
        l_a, l_b = self.noise_level[s - 1], self.noise_level[s]
        noise_scale = l_a + torch.rand(y_0.shape[0]).to(y_0) * (l_b - l_a)
        noise_scale = noise_scale.unsqueeze(1)
        noise = torch.randn_like(y_0)
        noisy_audio = noise_scale * y_0 + (1.0 - noise_scale**2)**0.5 * noise
        return noise.unsqueeze(1), noisy_audio.unsqueeze(1), noise_scale[:, 0]

    def compute_noise_level(self, beta):
        """Compute noise schedule parameters"""
        self.num_steps = len(beta)
        alpha = 1 - beta
        alpha_hat = np.cumprod(alpha)
        # prepend 1.0 so that noise_level[s] stays in range for s in [1, num_steps];
        # the original code overrode this with `alpha_hat ** 0.5`, which made
        # compute_y_n index out of bounds at s == num_steps
        noise_level = np.concatenate([[1.0], alpha_hat ** 0.5], axis=0)

        # pylint: disable=not-callable
        self.beta = torch.tensor(beta.astype(np.float32))
        self.alpha = torch.tensor(alpha.astype(np.float32))
        self.alpha_hat = torch.tensor(alpha_hat.astype(np.float32))
        self.noise_level = torch.tensor(noise_level.astype(np.float32))

        self.c1 = 1 / self.alpha**0.5
        self.c2 = (1 - self.alpha) / (1 - self.alpha_hat)**0.5
        self.sigma = ((1.0 - self.alpha_hat[:-1]) / (1.0 - self.alpha_hat[1:]) * self.beta[1:])**0.5

    def remove_weight_norm(self):
        for _, layer in enumerate(self.dblocks):
            if len(layer.state_dict()) != 0:
                try:
                    nn.utils.remove_weight_norm(layer)
                except ValueError:
                    layer.remove_weight_norm()

        for _, layer in enumerate(self.film):
            if len(layer.state_dict()) != 0:
                try:
                    nn.utils.remove_weight_norm(layer)
                except ValueError:
                    layer.remove_weight_norm()

        for _, layer in enumerate(self.ublocks):
            if len(layer.state_dict()) != 0:
                try:
                    nn.utils.remove_weight_norm(layer)
                except ValueError:
                    layer.remove_weight_norm()

        nn.utils.remove_weight_norm(self.x_conv)
        nn.utils.remove_weight_norm(self.out_conv)
        nn.utils.remove_weight_norm(self.y_conv)

    def apply_weight_norm(self):
        for _, layer in enumerate(self.dblocks):
            if len(layer.state_dict()) != 0:
                layer.apply_weight_norm()

        for _, layer in enumerate(self.film):
            if len(layer.state_dict()) != 0:
                layer.apply_weight_norm()

        for _, layer in enumerate(self.ublocks):
            if len(layer.state_dict()) != 0:
                layer.apply_weight_norm()

        self.x_conv = weight_norm(self.x_conv)
        self.out_conv = weight_norm(self.out_conv)
        self.y_conv = weight_norm(self.y_conv)
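A minimal end-to-end sketch for the model above, assuming a short linear beta schedule (the shipped test config uses the same 1e-6 to 1e-2 range, with more steps):

```python
import numpy as np
import torch

model = Wavegrad()                  # defaults: 80 mels, hop_len = 5*5*3*2*2 = 300
beta = np.linspace(1e-6, 1e-2, 50)  # short linear schedule for illustration
model.compute_noise_level(beta)

mel = torch.randn(1, 80, 20)        # B x D x T conditioning spectrogram
with torch.no_grad():
    audio = model.inference(mel)    # iterative denoising -> (1, 1, 300 * 20)
print(audio.shape)
```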
@ -0,0 +1,501 @@
import sys
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# fix this
from TTS.utils.audio import AudioProcessor as ap
from TTS.vocoder.utils.distribution import (
    sample_from_gaussian,
    sample_from_discretized_mix_logistic,
)


def stream(string, variables):
    sys.stdout.write(f"\r{string}" % variables)


# pylint: disable=abstract-method
# relates https://github.com/pytorch/pytorch/issues/42305
class ResBlock(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.conv1 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
        self.conv2 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
        self.batch_norm1 = nn.BatchNorm1d(dims)
        self.batch_norm2 = nn.BatchNorm1d(dims)

    def forward(self, x):
        residual = x
        x = self.conv1(x)
        x = self.batch_norm1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.batch_norm2(x)
        return x + residual


class MelResNet(nn.Module):
    def __init__(self, num_res_blocks, in_dims, compute_dims, res_out_dims, pad):
        super().__init__()
        k_size = pad * 2 + 1
        self.conv_in = nn.Conv1d(
            in_dims, compute_dims, kernel_size=k_size, bias=False)
        self.batch_norm = nn.BatchNorm1d(compute_dims)
        self.layers = nn.ModuleList()
        for _ in range(num_res_blocks):
            self.layers.append(ResBlock(compute_dims))
        self.conv_out = nn.Conv1d(compute_dims, res_out_dims, kernel_size=1)

    def forward(self, x):
        x = self.conv_in(x)
        x = self.batch_norm(x)
        x = F.relu(x)
        for f in self.layers:
            x = f(x)
        x = self.conv_out(x)
        return x


class Stretch2d(nn.Module):
    def __init__(self, x_scale, y_scale):
        super().__init__()
        self.x_scale = x_scale
        self.y_scale = y_scale

    def forward(self, x):
        b, c, h, w = x.size()
        x = x.unsqueeze(-1).unsqueeze(3)
        x = x.repeat(1, 1, 1, self.y_scale, 1, self.x_scale)
        return x.view(b, c, h * self.y_scale, w * self.x_scale)


class UpsampleNetwork(nn.Module):
    def __init__(
        self,
        feat_dims,
        upsample_scales,
        compute_dims,
        num_res_blocks,
        res_out_dims,
        pad,
        use_aux_net,
    ):
        super().__init__()
        self.total_scale = np.cumprod(upsample_scales)[-1]
        self.indent = pad * self.total_scale
        self.use_aux_net = use_aux_net
        if use_aux_net:
            self.resnet = MelResNet(
                num_res_blocks, feat_dims, compute_dims, res_out_dims, pad
            )
            self.resnet_stretch = Stretch2d(self.total_scale, 1)
        self.up_layers = nn.ModuleList()
        for scale in upsample_scales:
            k_size = (1, scale * 2 + 1)
            padding = (0, scale)
            stretch = Stretch2d(scale, 1)
            conv = nn.Conv2d(1, 1, kernel_size=k_size,
                             padding=padding, bias=False)
            conv.weight.data.fill_(1.0 / k_size[1])
            self.up_layers.append(stretch)
            self.up_layers.append(conv)

    def forward(self, m):
        if self.use_aux_net:
            aux = self.resnet(m).unsqueeze(1)
            aux = self.resnet_stretch(aux)
            aux = aux.squeeze(1)
            aux = aux.transpose(1, 2)
        else:
            aux = None
        m = m.unsqueeze(1)
        for f in self.up_layers:
            m = f(m)
        m = m.squeeze(1)[:, :, self.indent: -self.indent]
        return m.transpose(1, 2), aux


class Upsample(nn.Module):
    def __init__(
        self, scale, pad, num_res_blocks, feat_dims, compute_dims, res_out_dims, use_aux_net
    ):
        super().__init__()
        self.scale = scale
        self.pad = pad
        self.indent = pad * scale
        self.use_aux_net = use_aux_net
        self.resnet = MelResNet(num_res_blocks, feat_dims,
                                compute_dims, res_out_dims, pad)

    def forward(self, m):
        if self.use_aux_net:
            aux = self.resnet(m)
            aux = torch.nn.functional.interpolate(
                aux, scale_factor=self.scale, mode="linear", align_corners=True
            )
            aux = aux.transpose(1, 2)
        else:
            aux = None
        m = torch.nn.functional.interpolate(
            m, scale_factor=self.scale, mode="linear", align_corners=True
        )
        m = m[:, :, self.indent: -self.indent]
        m = m * 0.045  # empirically found

        return m.transpose(1, 2), aux


class WaveRNN(nn.Module):
    def __init__(self,
                 rnn_dims,
                 fc_dims,
                 mode,
                 mulaw,
                 pad,
                 use_aux_net,
                 use_upsample_net,
                 upsample_factors,
                 feat_dims,
                 compute_dims,
                 res_out_dims,
                 num_res_blocks,
                 hop_length,
                 sample_rate,
                 ):
        super().__init__()
        self.mode = mode
        self.mulaw = mulaw
        self.pad = pad
        self.use_upsample_net = use_upsample_net
        self.use_aux_net = use_aux_net
        if isinstance(self.mode, int):
            self.n_classes = 2 ** self.mode
        elif self.mode == "mold":
            self.n_classes = 3 * 10
        elif self.mode == "gauss":
            self.n_classes = 2
        else:
            raise RuntimeError("Unknown model mode value - ", self.mode)

        self.rnn_dims = rnn_dims
        self.aux_dims = res_out_dims // 4
        self.hop_length = hop_length
        self.sample_rate = sample_rate

        if self.use_upsample_net:
            assert (
                np.cumprod(upsample_factors)[-1] == self.hop_length
            ), " [!] upsample scales need to multiply out to hop_length"
            self.upsample = UpsampleNetwork(
                feat_dims,
                upsample_factors,
                compute_dims,
                num_res_blocks,
                res_out_dims,
                pad,
                use_aux_net,
            )
        else:
            self.upsample = Upsample(
                hop_length,
                pad,
                num_res_blocks,
                feat_dims,
                compute_dims,
                res_out_dims,
                use_aux_net,
            )
        if self.use_aux_net:
            self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
            self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
            self.rnn2 = nn.GRU(rnn_dims + self.aux_dims,
                               rnn_dims, batch_first=True)
            self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
            self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
            self.fc3 = nn.Linear(fc_dims, self.n_classes)
        else:
            self.I = nn.Linear(feat_dims + 1, rnn_dims)
            self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
            self.rnn2 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
            self.fc1 = nn.Linear(rnn_dims, fc_dims)
            self.fc2 = nn.Linear(fc_dims, fc_dims)
            self.fc3 = nn.Linear(fc_dims, self.n_classes)

    def forward(self, x, mels):
        bsize = x.size(0)
        h1 = torch.zeros(1, bsize, self.rnn_dims).to(x.device)
        h2 = torch.zeros(1, bsize, self.rnn_dims).to(x.device)
        mels, aux = self.upsample(mels)

        if self.use_aux_net:
            aux_idx = [self.aux_dims * i for i in range(5)]
            a1 = aux[:, :, aux_idx[0]: aux_idx[1]]
            a2 = aux[:, :, aux_idx[1]: aux_idx[2]]
            a3 = aux[:, :, aux_idx[2]: aux_idx[3]]
            a4 = aux[:, :, aux_idx[3]: aux_idx[4]]

        x = (
            torch.cat([x.unsqueeze(-1), mels, a1], dim=2)
            if self.use_aux_net
            else torch.cat([x.unsqueeze(-1), mels], dim=2)
        )
        x = self.I(x)
        res = x
        self.rnn1.flatten_parameters()
        x, _ = self.rnn1(x, h1)

        x = x + res
        res = x
        x = torch.cat([x, a2], dim=2) if self.use_aux_net else x
        self.rnn2.flatten_parameters()
        x, _ = self.rnn2(x, h2)

        x = x + res
        x = torch.cat([x, a3], dim=2) if self.use_aux_net else x
        x = F.relu(self.fc1(x))

        x = torch.cat([x, a4], dim=2) if self.use_aux_net else x
        x = F.relu(self.fc2(x))
        return self.fc3(x)

    def inference(self, mels, batched, target, overlap):

        self.eval()
        device = mels.device
        output = []
        start = time.time()
        rnn1 = self.get_gru_cell(self.rnn1)
        rnn2 = self.get_gru_cell(self.rnn2)

        with torch.no_grad():
            if isinstance(mels, np.ndarray):
                mels = torch.FloatTensor(mels).to(device)

            if mels.ndim == 2:
                mels = mels.unsqueeze(0)
            wave_len = (mels.size(-1) - 1) * self.hop_length

            mels = self.pad_tensor(mels.transpose(
                1, 2), pad=self.pad, side="both")
            mels, aux = self.upsample(mels.transpose(1, 2))

            if batched:
                mels = self.fold_with_overlap(mels, target, overlap)
                if aux is not None:
                    aux = self.fold_with_overlap(aux, target, overlap)

            b_size, seq_len, _ = mels.size()

            h1 = torch.zeros(b_size, self.rnn_dims).to(device)
            h2 = torch.zeros(b_size, self.rnn_dims).to(device)
            x = torch.zeros(b_size, 1).to(device)

            if self.use_aux_net:
                d = self.aux_dims
                aux_split = [aux[:, :, d * i: d * (i + 1)] for i in range(4)]

            for i in range(seq_len):

                m_t = mels[:, i, :]

                if self.use_aux_net:
                    a1_t, a2_t, a3_t, a4_t = (a[:, i, :] for a in aux_split)

                x = (
                    torch.cat([x, m_t, a1_t], dim=1)
                    if self.use_aux_net
                    else torch.cat([x, m_t], dim=1)
                )
                x = self.I(x)
                h1 = rnn1(x, h1)

                x = x + h1
                inp = torch.cat([x, a2_t], dim=1) if self.use_aux_net else x
                h2 = rnn2(inp, h2)

                x = x + h2
                x = torch.cat([x, a3_t], dim=1) if self.use_aux_net else x
                x = F.relu(self.fc1(x))

                x = torch.cat([x, a4_t], dim=1) if self.use_aux_net else x
                x = F.relu(self.fc2(x))

                logits = self.fc3(x)

                if self.mode == "mold":
                    sample = sample_from_discretized_mix_logistic(
                        logits.unsqueeze(0).transpose(1, 2)
                    )
                    output.append(sample.view(-1))
                    x = sample.transpose(0, 1).to(device)
                elif self.mode == "gauss":
                    sample = sample_from_gaussian(
                        logits.unsqueeze(0).transpose(1, 2))
                    output.append(sample.view(-1))
                    x = sample.transpose(0, 1).to(device)
                elif isinstance(self.mode, int):
                    posterior = F.softmax(logits, dim=1)
                    distrib = torch.distributions.Categorical(posterior)

                    sample = 2 * distrib.sample().float() / (self.n_classes - 1.0) - 1.0
                    output.append(sample)
                    x = sample.unsqueeze(-1)
                else:
                    raise RuntimeError(
                        "Unknown model mode value - ", self.mode)

                if i % 100 == 0:
                    self.gen_display(i, seq_len, b_size, start)

            output = torch.stack(output).transpose(0, 1)
            output = output.cpu().numpy()
            output = output.astype(np.float64)

            if batched:
                output = self.xfade_and_unfold(output, target, overlap)
            else:
                output = output[0]

            if self.mulaw and isinstance(self.mode, int):
                output = ap.mulaw_decode(output, self.mode)

            # Fade-out at the end to avoid the signal cutting out suddenly
            fade_out = np.linspace(1, 0, 20 * self.hop_length)
            output = output[:wave_len]

            if wave_len > len(fade_out):
                output[-20 * self.hop_length:] *= fade_out

        self.train()
        return output

    def gen_display(self, i, seq_len, b_size, start):
        gen_rate = (i + 1) / (time.time() - start) * b_size / 1000
        realtime_ratio = gen_rate * 1000 / self.sample_rate
        stream(
            "%i/%i -- batch_size: %i -- gen_rate: %.1f kHz -- x_realtime: %.1f ",
            (i * b_size, seq_len * b_size, b_size, gen_rate, realtime_ratio),
        )

    def fold_with_overlap(self, x, target, overlap):
        """Fold the tensor with overlap for quick batched inference.
        Overlap will be used for crossfading in xfade_and_unfold()
        Args:
            x (tensor) : Upsampled conditioning features.
                         shape=(1, timesteps, features)
            target (int) : Target timesteps for each index of batch
            overlap (int) : Timesteps for both xfade and rnn warmup
        Return:
            (tensor) : shape=(num_folds, target + 2 * overlap, features)
        Details:
            x = [[h1, h2, ... hn]]
            Where each h is a vector of conditioning features
            Eg: target=2, overlap=1 with x.size(1)=10
            folded = [[h1, h2, h3, h4],
                      [h4, h5, h6, h7],
                      [h7, h8, h9, h10]]
        """

        _, total_len, features = x.size()

        # Calculate variables needed
        num_folds = (total_len - overlap) // (target + overlap)
        extended_len = num_folds * (overlap + target) + overlap
        remaining = total_len - extended_len

        # Pad if some time steps poke out
        if remaining != 0:
            num_folds += 1
            padding = target + 2 * overlap - remaining
            x = self.pad_tensor(x, padding, side="after")

        folded = torch.zeros(num_folds, target + 2 *
                             overlap, features).to(x.device)

        # Get the values for the folded tensor
        for i in range(num_folds):
            start = i * (target + overlap)
            end = start + target + 2 * overlap
            folded[i] = x[:, start:end, :]

        return folded

    @staticmethod
    def get_gru_cell(gru):
        gru_cell = nn.GRUCell(gru.input_size, gru.hidden_size)
        gru_cell.weight_hh.data = gru.weight_hh_l0.data
        gru_cell.weight_ih.data = gru.weight_ih_l0.data
        gru_cell.bias_hh.data = gru.bias_hh_l0.data
        gru_cell.bias_ih.data = gru.bias_ih_l0.data
        return gru_cell

    @staticmethod
    def pad_tensor(x, pad, side="both"):
        # NB - this is just a quick method needed right now,
        # i.e., it won't generalise to other shapes/dims
        b, t, c = x.size()
        total = t + 2 * pad if side == "both" else t + pad
        padded = torch.zeros(b, total, c).to(x.device)
        if side in ("before", "both"):
            padded[:, pad: pad + t, :] = x
        elif side == "after":
            padded[:, :t, :] = x
        return padded

    @staticmethod
    def xfade_and_unfold(y, target, overlap):
        """Applies a crossfade and unfolds into a 1d array.
        Args:
            y (ndarray) : Batched sequences of audio samples
                          shape=(num_folds, target + 2 * overlap)
                          dtype=np.float64
            target (int) : Target timesteps for each index of batch
            overlap (int) : Timesteps for both xfade and rnn warmup
        Return:
            (ndarray) : audio samples in a 1d array
                        shape=(total_len)
                        dtype=np.float64
        Details:
            y = [[seq1],
                 [seq2],
                 [seq3]]
            Apply a gain envelope at both ends of the sequences
            y = [[seq1_in, seq1_target, seq1_out],
                 [seq2_in, seq2_target, seq2_out],
                 [seq3_in, seq3_target, seq3_out]]
            Stagger and add up the groups of samples:
            [seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
        """

        num_folds, length = y.shape
        target = length - 2 * overlap
        total_len = num_folds * (target + overlap) + overlap

        # Need some silence for the rnn warmup
        silence_len = overlap // 2
        fade_len = overlap - silence_len
        silence = np.zeros((silence_len), dtype=np.float64)

        # Equal power crossfade
        t = np.linspace(-1, 1, fade_len, dtype=np.float64)
        fade_in = np.sqrt(0.5 * (1 + t))
        fade_out = np.sqrt(0.5 * (1 - t))

        # Concat the silence to the fades
        fade_in = np.concatenate([silence, fade_in])
        fade_out = np.concatenate([fade_out, silence])

        # Apply the gain to the overlap samples
        y[:, :overlap] *= fade_in
        y[:, -overlap:] *= fade_out

        unfolded = np.zeros((total_len), dtype=np.float64)

        # Loop to add up all the samples
        for i in range(num_folds):
            start = i * (target + overlap)
            end = start + target + 2 * overlap
            unfolded[start:end] += y[i]

        return unfolded
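The fold bookkeeping above is easy to check by hand with the numbers from the docstrings: 10 frames with target=2 and overlap=1 fold into 3 rows of 4 frames, with no padding needed.

```python
total_len, target, overlap = 10, 2, 1

num_folds = (total_len - overlap) // (target + overlap)      # (10 - 1) // 3 = 3
extended_len = num_folds * (overlap + target) + overlap      # 3 * 3 + 1 = 10
remaining = total_len - extended_len                         # 0 -> no padding

starts = [i * (target + overlap) for i in range(num_folds)]  # [0, 3, 6]
rows = [(s, s + target + 2 * overlap) for s in starts]       # [(0, 4), (3, 7), (6, 10)]
print(num_folds, rows)  # matches the [[h1..h4], [h4..h7], [h7..h10]] example
```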
@ -0,0 +1,168 @@
import math

import numpy as np
import torch
import torch.nn.functional as F
from torch.distributions.normal import Normal


def gaussian_loss(y_hat, y, log_std_min=-7.0):
    assert y_hat.dim() == 3
    assert y_hat.size(2) == 2
    mean = y_hat[:, :, :1]
    log_std = torch.clamp(y_hat[:, :, 1:], min=log_std_min)
    # TODO: replace with pytorch dist
    log_probs = -0.5 * (
        -math.log(2.0 * math.pi)
        - 2.0 * log_std
        - torch.pow(y - mean, 2) * torch.exp((-2.0 * log_std))
    )
    return log_probs.squeeze().mean()


def sample_from_gaussian(y_hat, log_std_min=-7.0, scale_factor=1.0):
    assert y_hat.size(2) == 2
    mean = y_hat[:, :, :1]
    log_std = torch.clamp(y_hat[:, :, 1:], min=log_std_min)
    dist = Normal(
        mean,
        torch.exp(log_std),
    )
    sample = dist.sample()
    sample = torch.clamp(torch.clamp(
        sample, min=-scale_factor), max=scale_factor)
    del dist
    return sample


def log_sum_exp(x):
    """ numerically stable log_sum_exp implementation that prevents overflow """
    # TF ordering
    axis = len(x.size()) - 1
    m, _ = torch.max(x, dim=axis)
    m2, _ = torch.max(x, dim=axis, keepdim=True)
    return m + torch.log(torch.sum(torch.exp(x - m2), dim=axis))


# It is adapted from https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py
def discretized_mix_logistic_loss(
    y_hat, y, num_classes=65536, log_scale_min=None, reduce=True
):
    if log_scale_min is None:
        log_scale_min = float(np.log(1e-14))
    y_hat = y_hat.permute(0, 2, 1)
    assert y_hat.dim() == 3
    assert y_hat.size(1) % 3 == 0
    nr_mix = y_hat.size(1) // 3

    # (B x T x C)
    y_hat = y_hat.transpose(1, 2)

    # unpack parameters. (B, T, num_mixtures) x 3
    logit_probs = y_hat[:, :, :nr_mix]
    means = y_hat[:, :, nr_mix: 2 * nr_mix]
    log_scales = torch.clamp(
        y_hat[:, :, 2 * nr_mix: 3 * nr_mix], min=log_scale_min)

    # B x T x 1 -> B x T x num_mixtures
    y = y.expand_as(means)

    centered_y = y - means
    inv_stdv = torch.exp(-log_scales)
    plus_in = inv_stdv * (centered_y + 1.0 / (num_classes - 1))
    cdf_plus = torch.sigmoid(plus_in)
    min_in = inv_stdv * (centered_y - 1.0 / (num_classes - 1))
    cdf_min = torch.sigmoid(min_in)

    # log probability for edge case of 0 (before scaling)
    # equivalent: torch.log(F.sigmoid(plus_in))
    log_cdf_plus = plus_in - F.softplus(plus_in)

    # log probability for edge case of 255 (before scaling)
    # equivalent: (1 - F.sigmoid(min_in)).log()
    log_one_minus_cdf_min = -F.softplus(min_in)

    # probability for all other cases
    cdf_delta = cdf_plus - cdf_min

    mid_in = inv_stdv * centered_y
    # log probability in the center of the bin, to be used in extreme cases
    # (not actually used in our code)
    log_pdf_mid = mid_in - log_scales - 2.0 * F.softplus(mid_in)

    # tf equivalent
    # log_probs = tf.where(x < -0.999, log_cdf_plus,
    #                      tf.where(x > 0.999, log_one_minus_cdf_min,
    #                               tf.where(cdf_delta > 1e-5,
    #                                        tf.log(tf.maximum(cdf_delta, 1e-12)),
    #                                        log_pdf_mid - np.log(127.5))))

    # TODO: cdf_delta <= 1e-5 actually can happen. How can we choose the value
    # for the num_classes=65536 case? 1e-7? not sure..
    inner_inner_cond = (cdf_delta > 1e-5).float()

    inner_inner_out = inner_inner_cond * torch.log(
        torch.clamp(cdf_delta, min=1e-12)
    ) + (1.0 - inner_inner_cond) * (log_pdf_mid - np.log((num_classes - 1) / 2))
    inner_cond = (y > 0.999).float()
    inner_out = (
        inner_cond * log_one_minus_cdf_min +
        (1.0 - inner_cond) * inner_inner_out
    )
    cond = (y < -0.999).float()
    log_probs = cond * log_cdf_plus + (1.0 - cond) * inner_out

    log_probs = log_probs + F.log_softmax(logit_probs, -1)

    if reduce:
        return -torch.mean(log_sum_exp(log_probs))
    return -log_sum_exp(log_probs).unsqueeze(-1)


def sample_from_discretized_mix_logistic(y, log_scale_min=None):
    """
    Sample from a discretized mixture of logistic distributions.
    Args:
        y (Tensor): B x C x T
        log_scale_min (float): Log scale minimum value
    Returns:
        Tensor: sample in range of [-1, 1].
    """
    if log_scale_min is None:
        log_scale_min = float(np.log(1e-14))
    assert y.size(1) % 3 == 0
    nr_mix = y.size(1) // 3

    # B x T x C
    y = y.transpose(1, 2)
    logit_probs = y[:, :, :nr_mix]

    # sample mixture indicator from softmax
    temp = logit_probs.data.new(logit_probs.size()).uniform_(1e-5, 1.0 - 1e-5)
    temp = logit_probs.data - torch.log(-torch.log(temp))
    _, argmax = temp.max(dim=-1)

    # (B, T) -> (B, T, nr_mix)
    one_hot = to_one_hot(argmax, nr_mix)
    # select logistic parameters
    means = torch.sum(y[:, :, nr_mix: 2 * nr_mix] * one_hot, dim=-1)
    log_scales = torch.clamp(
        torch.sum(y[:, :, 2 * nr_mix: 3 * nr_mix] * one_hot, dim=-1), min=log_scale_min
    )
    # sample from logistic & clip to interval
    # we don't actually round to the nearest 8bit value when sampling
    u = means.data.new(means.size()).uniform_(1e-5, 1.0 - 1e-5)
    x = means + torch.exp(log_scales) * (torch.log(u) - torch.log(1.0 - u))

    x = torch.clamp(torch.clamp(x, min=-1.0), max=1.0)

    return x


def to_one_hot(tensor, n, fill_with=1.0):
    # we perform one-hot encoding with respect to the last axis
    one_hot = torch.FloatTensor(tensor.size() + (n,)).zero_()
    if tensor.is_cuda:
        one_hot = one_hot.cuda()
    one_hot.scatter_(len(tensor.size()), tensor.unsqueeze(-1), fill_with)
    return one_hot
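A quick sanity check (not part of the diff): the hand-rolled `log_sum_exp` above should agree with `torch.logsumexp` over the last axis.

```python
import torch

x = torch.randn(4, 7, 30)
# both reduce the last dimension: (4, 7, 30) -> (4, 7)
assert torch.allclose(log_sum_exp(x), torch.logsumexp(x, dim=-1), atol=1e-6)
```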
@ -42,12 +42,35 @@ def to_camel(text):
    return re.sub(r'(?!^)_([a-zA-Z])', lambda m: m.group(1).upper(), text)


def setup_wavernn(c):
    print(" > Model: WaveRNN")
    MyModel = importlib.import_module("TTS.vocoder.models.wavernn")
    MyModel = getattr(MyModel, "WaveRNN")
    model = MyModel(
        rnn_dims=c.wavernn_model_params['rnn_dims'],
        fc_dims=c.wavernn_model_params['fc_dims'],
        mode=c.mode,
        mulaw=c.mulaw,
        pad=c.padding,
        use_aux_net=c.wavernn_model_params['use_aux_net'],
        use_upsample_net=c.wavernn_model_params['use_upsample_net'],
        upsample_factors=c.wavernn_model_params['upsample_factors'],
        feat_dims=c.audio['num_mels'],
        compute_dims=c.wavernn_model_params['compute_dims'],
        res_out_dims=c.wavernn_model_params['res_out_dims'],
        num_res_blocks=c.wavernn_model_params['num_res_blocks'],
        hop_length=c.audio["hop_length"],
        sample_rate=c.audio["sample_rate"],
    )
    return model


def setup_generator(c):
    print(" > Generator Model: {}".format(c.generator_model))
    MyModel = importlib.import_module('TTS.vocoder.models.' +
                                      c.generator_model.lower())
    MyModel = getattr(MyModel, to_camel(c.generator_model))
    if c.generator_model in 'melgan_generator':
    if c.generator_model.lower() in 'melgan_generator':
        model = MyModel(
            in_channels=c.audio['num_mels'],
            out_channels=1,
@ -58,7 +81,7 @@ def setup_generator(c):
            num_res_blocks=c.generator_model_params['num_res_blocks'])
    if c.generator_model in 'melgan_fb_generator':
        pass
    if c.generator_model in 'multiband_melgan_generator':
    if c.generator_model.lower() in 'multiband_melgan_generator':
        model = MyModel(
            in_channels=c.audio['num_mels'],
            out_channels=4,
@ -67,7 +90,7 @@ def setup_generator(c):
            upsample_factors=c.generator_model_params['upsample_factors'],
            res_kernel=3,
            num_res_blocks=c.generator_model_params['num_res_blocks'])
    if c.generator_model in 'fullband_melgan_generator':
    if c.generator_model.lower() in 'fullband_melgan_generator':
        model = MyModel(
            in_channels=c.audio['num_mels'],
            out_channels=1,
@ -76,7 +99,7 @@ def setup_generator(c):
            upsample_factors=c.generator_model_params['upsample_factors'],
            res_kernel=3,
            num_res_blocks=c.generator_model_params['num_res_blocks'])
    if c.generator_model in 'parallel_wavegan_generator':
    if c.generator_model.lower() in 'parallel_wavegan_generator':
        model = MyModel(
            in_channels=1,
            out_channels=1,
@ -91,6 +114,17 @@ def setup_generator(c):
            bias=True,
            use_weight_norm=True,
            upsample_factors=c.generator_model_params['upsample_factors'])
    if c.generator_model.lower() in 'wavegrad':
        model = MyModel(
            in_channels=c['audio']['num_mels'],
            out_channels=1,
            use_weight_norm=c['model_params']['use_weight_norm'],
            x_conv_channels=c['model_params']['x_conv_channels'],
            y_conv_channels=c['model_params']['y_conv_channels'],
            dblock_out_channels=c['model_params']['dblock_out_channels'],
            ublock_out_channels=c['model_params']['ublock_out_channels'],
            upsample_factors=c['model_params']['upsample_factors'],
            upsample_dilations=c['model_params']['upsample_dilations'])
    return model
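`setup_wavernn` only touches the fields below, so a `SimpleNamespace` can stand in for the parsed config in a smoke test; every value here is an illustrative assumption, not a project default.

```python
from types import SimpleNamespace

c = SimpleNamespace(
    mode=10, mulaw=True, padding=2,
    audio={"num_mels": 80, "hop_length": 256, "sample_rate": 22050},
    wavernn_model_params={
        "rnn_dims": 512, "fc_dims": 512, "compute_dims": 128,
        "res_out_dims": 128, "num_res_blocks": 10,
        "use_aux_net": True, "use_upsample_net": True,
        "upsample_factors": [4, 8, 8],  # product must equal hop_length
    },
)
model = setup_wavernn(c)  # prints " > Model: WaveRNN" and returns the module
```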
@ -20,7 +20,10 @@ def load_checkpoint(model, checkpoint_path, use_cuda=False):
def save_model(model, optimizer, scheduler, model_disc, optimizer_disc,
               scheduler_disc, current_step, epoch, output_path, **kwargs):
    model_state = model.state_dict()
    if hasattr(model, 'module'):
        model_state = model.module.state_dict()
    else:
        model_state = model.state_dict()
    model_disc_state = model_disc.state_dict()\
        if model_disc is not None else None
    optimizer_state = optimizer.state_dict()\
@ -13,7 +13,11 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 2,
=======
"execution_count": null,
>>>>>>> dev
"metadata": {},
"outputs": [],
"source": [
@ -25,8 +29,13 @@
"import umap\n",
"\n",
"from TTS.speaker_encoder.model import SpeakerEncoder\n",
<<<<<<< HEAD
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.utils.io import load_config\n",
=======
"from TTS.tts.utils.audio import AudioProcessor\n",
"from TTS.tts.utils.generic_utils import load_config\n",
>>>>>>> dev
"\n",
"from bokeh.io import output_notebook, show\n",
"from bokeh.plotting import figure\n",
@ -48,6 +57,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 3,
"metadata": {},
"outputs": [
@ -367,6 +377,11 @@
"output_type": "display_data"
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"output_notebook()"
]
@ -380,12 +395,20 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#MODEL_RUN_PATH = \"libritts_360-half-October-31-2019_04+54PM-19d2f5f/\"\n",
"MODEL_RUN_PATH = \"libritts_360-half-September-28-2019_10+46AM-8565c50/\"\n",
=======
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_RUN_PATH = \"/media/erogol/data_ssd/Models/libri_tts/speaker_encoder/libritts_360-half-October-31-2019_04+54PM-19d2f5f/\"\n",
>>>>>>> dev
"MODEL_PATH = MODEL_RUN_PATH + \"best_model.pth.tar\"\n",
"CONFIG_PATH = MODEL_RUN_PATH + \"config.json\"\n",
"\n",
@ -395,11 +418,16 @@
"\n",
"# My multi speaker locations\n",
"EMBED_PATH = \"/home/erogol/Data/Libri-TTS/train-clean-360-embed_128/\"\n",
<<<<<<< HEAD
"AUDIO_PATH = \"datasets/LibriTTS/test-clean/\""
=======
"AUDIO_PATH = \"/home/erogol/Data/Libri-TTS/train-clean-360/\""
>>>>>>> dev
]
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 5,
"metadata": {},
"outputs": [
@ -413,12 +441,18 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"!ls -1 $MODEL_RUN_PATH"
]
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 6,
"metadata": {},
"outputs": [
@ -454,6 +488,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"CONFIG = load_config(CONFIG_PATH)\n",
"ap = AudioProcessor(**CONFIG['audio'])"
@ -468,6 +507,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 7,
"metadata": {},
"outputs": [
@ -479,6 +519,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"embed_files = glob.glob(EMBED_PATH+\"/**/*.npy\", recursive=True)\n",
"print(f'Embeddings found: {len(embed_files)}')"
@ -493,6 +538,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 8,
"metadata": {},
"outputs": [
@ -508,6 +554,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"embed_files[0]"
]
@ -523,6 +574,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 9,
"metadata": {},
"outputs": [
@ -534,6 +586,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"speaker_paths = list(set([os.path.dirname(os.path.dirname(embed_file)) for embed_file in embed_files]))\n",
"speaker_to_utter = {}\n",
@ -557,6 +614,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 11,
"metadata": {},
"outputs": [
@ -575,6 +633,13 @@
],
"source": [
"ttsembeds = []\n",
=======
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeds = []\n",
>>>>>>> dev
"labels = []\n",
"locations = []\n",
"\n",
@ -598,7 +663,11 @@
" embed = np.load(embed_path)\n",
" embeds.append(embed)\n",
" labels.append(str(speaker_num))\n",
<<<<<<< HEAD
" #locations.append(embed_path.replace(EMBED_PATH, '').replace('.npy','.wav'))\n",
=======
" locations.append(embed_path.replace(EMBED_PATH, '').replace('.npy','.wav'))\n",
>>>>>>> dev
"embeds = np.concatenate(embeds)"
]
},
@ -611,6 +680,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 12,
"metadata": {},
"outputs": [
@ -626,6 +696,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"model = umap.UMAP()\n",
"projection = model.fit_transform(embeds)"
@ -729,7 +804,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
<<<<<<< HEAD
"version": "3.8.5"
=======
"version": "3.7.4"
>>>>>>> dev
}
},
"nbformat": 4,
File diff suppressed because one or more lines are too long

@ -23,3 +23,4 @@ pylint==2.5.3
gdown
umap-learn
cython
pyyaml
@ -1,3 +1,4 @@
set -e
TF_CPP_MIN_LOG_LEVEL=3

# tests
@ -6,7 +7,10 @@ nosetests tests -x &&\
# runtime tests
./tests/test_server_package.sh && \
./tests/test_tts_train.sh && \
./tests/test_vocoder_train.sh && \
./tests/test_glow-tts_train.sh && \
./tests/test_vocoder_gan_train.sh && \
./tests/test_vocoder_wavernn_train.sh && \
./tests/test_vocoder_wavegrad_train.sh && \

# linter check
cardboardlinter --refspec master
2
setup.py
2
setup.py
@ -33,7 +33,7 @@ args, unknown_args = parser.parse_known_args()
# Remove our arguments from argv so that setuptools doesn't see them
sys.argv = [sys.argv[0]] + unknown_args

version = '0.0.5'
version = '0.0.6'

# Adapted from https://github.com/pytorch/pytorch
cwd = os.path.dirname(os.path.abspath(__file__))
@ -0,0 +1,134 @@
{
    "model": "glow_tts",
    "run_name": "glow-tts-gatedconv",
    "run_description": "glow-tts model training with gated conv.",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050, // DATASET-RELATED: wav sample rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Griffin-Lim
        "power": 1.1, // value to sharpen wav signals after the GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence from audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 1.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // VOCABULARY PARAMETERS
    // if custom character set is not defined,
    // default set in symbols.py is used
    // "characters":{
    //     "pad": "_",
    //     "eos": "~",
    //     "bos": "^",
    //     "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? ",
    //     "punctuations":"!'(),-.:;? ",
    //     "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
    // },

    "add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.

    // DISTRIBUTED TRAINING
    "mixed_precision": false,
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54323"
    },

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // MODEL PARAMETERS
    "use_mas": false, // use Monotonic Alignment Search if true. Otherwise use pre-computed attention alignments.

    // TRAINING
    "batch_size": 2, // Batch size for training. Lower values than 32 might cause hard-to-learn attention. It is overwritten by 'gradual_training'.
    "eval_batch_size":1,
    "r": 1, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
    "loss_masking": true, // enable / disable loss masking against the sequence padding.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 0, //Until attention is aligned, testing only wastes computation time.
    "test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "noam_schedule": true, // use noam warmup and lr schedule.
    "grad_clip": 5.0, // upper limit for gradients for clipping.
    "epochs": 1, // total number of epochs to train.
    "lr": 1e-3, // Initial learning rate. If Noam decay is active, maximum learning rate.
    "wd": 0.000001, // Weight decay weight.
    "warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
    "seq_len_norm": false, // Normalize each sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has a skewed distribution of sequence lengths.

    "encoder_type": "gatedconv",

    // TENSORBOARD and LOGGING
    "print_step": 25, // Number of steps to log training on console.
    "tb_plot_step": 100, // Number of steps to plot TB training figures.
    "print_eval": false, // If True, it prints intermediate loss values in evaluation.
    "save_step": 5000, // Number of training steps expected to save training stats and checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
    "apex_amp_level": null,

    // DATA LOADING
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "batch_group_size": 0, //Number of batches to shuffle after bucketing.
    "min_seq_len": 3, // DATASET-RELATED: minimum text length to use in training
    "max_seq_len": 500, // DATASET-RELATED: maximum text length
    "compute_f0": false, // compute f0 values in data-loader

    // PATHS
    "output_path": "tests/train_outputs/",

    // PHONEMES
    "phoneme_cache_path": "tests/outputs/phoneme_cache/", // phoneme computation is slow, therefore, it caches results in the given folder.
    "use_phonemes": true, // use phonemes instead of raw characters. It is suggested for better pronunciation.
    "phoneme_language": "en-us", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages

    // MULTI-SPEAKER and GST
    "use_external_speaker_embedding_file": false,
    "external_speaker_embedding_file": null,
    "use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.

    // DATASETS
    "datasets": // List of datasets. They are all merged and they get different speaker_ids.
        [
            {
                "name": "ljspeech",
                "path": "tests/data/ljspeech/",
                "meta_file_train": "metadata.csv",
                "meta_file_val": "metadata.csv"
            }
        ]
}
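Since the shipped configs carry `//` comments, they are read through the repo's `load_config`, which these comment-laden files rely on to strip them before parsing; the path below is a hypothetical placeholder.

```python
from TTS.utils.io import load_config

CONFIG_PATH = "path/to/config_glow_tts.json"  # hypothetical local path
c = load_config(CONFIG_PATH)
assert c["model"] == "glow_tts"
assert c["audio"]["num_mels"] == 80
```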
@ -67,13 +67,24 @@
"gradual_training": [[0, 7, 4]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceeed.
|
||||
"loss_masking": true, // enable / disable loss masking against the sequence padding.
|
||||
"ga_alpha": 10.0, // weight for guided attention loss. If > 0, guided attention is enabled.
|
||||
"apex_amp_level": null,
|
||||
"mixed_precision": false,
|
||||
|
||||
// VALIDATION
|
||||
"run_eval": true,
|
||||
"test_delay_epochs": 0, //Until attention is aligned, testing only wastes computation time.
|
||||
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
|
||||
|
||||
// LOSS SETTINGS
|
||||
"loss_masking": true, // enable / disable loss masking against the sequence padding.
|
||||
"decoder_loss_alpha": 0.5, // original decoder loss weight. If > 0, it is enabled
|
||||
"postnet_loss_alpha": 0.25, // original postnet loss weight. If > 0, it is enabled
|
||||
"postnet_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
|
||||
"decoder_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
|
||||
"decoder_ssim_alpha": 0.5, // decoder ssim loss weight. If > 0, it is enabled
|
||||
"postnet_ssim_alpha": 0.25, // postnet ssim loss weight. If > 0, it is enabled
|
||||
"ga_alpha": 5.0, // weight for guided attention loss. If > 0, guided attention is enabled.
|
||||
"stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.
|
||||
|
||||
// OPTIMIZER
|
||||
"noam_schedule": false, // use noam warmup and lr schedule.
|
||||
"grad_clip": 1.0, // upper limit for gradients for clipping.
|
||||
|
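For intuition, a sketch of how these weights plausibly combine, based only on the parameter names above and not on the repo's actual TacotronLoss implementation: each component is scaled by its alpha and summed, so an alpha of 0 disables that term.

```python
# assumed composition; the real loss code may differ in detail
alphas = {
    "decoder": 0.5, "postnet": 0.25,
    "decoder_diff_spec": 0.25, "postnet_diff_spec": 0.25,
    "decoder_ssim": 0.5, "postnet_ssim": 0.25,
    "ga": 5.0,
}

def total_loss(parts):
    """parts: dict mapping component name -> loss value."""
    return sum(alphas[k] * parts[k] for k in alphas)
```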
@ -0,0 +1,114 @@
{
    "run_name": "wavegrad-ljspeech",
    "run_description": "wavegrad ljspeech",

    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050, // DATASET-RELATED: wav sample rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence from audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "mixed_precision": false,
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54322"
    },

    "target_loss": "avg_wavegrad_loss", // loss value to pick the best model to save after each epoch

    // MODEL PARAMETERS
    "generator_model": "wavegrad",
    "model_params":{
        "y_conv_channels":32,
        "x_conv_channels":768,
        "ublock_out_channels": [512, 512, 256, 128, 128],
        "dblock_out_channels": [128, 128, 256, 512],
        "upsample_factors": [4, 4, 4, 2, 2],
        "upsample_dilations": [
            [1, 2, 1, 2],
            [1, 2, 1, 2],
            [1, 2, 4, 8],
            [1, 2, 4, 8],
            [1, 2, 4, 8]],
        "use_weight_norm": true
    },

    // DATASET
    "data_path": "tests/data/ljspeech/wavs/", // root data path. It finds all wav files recursively from there.
    "feature_path": null, // if you use precomputed features
    "seq_len": 6144, // 24 * hop_length
    "pad_short": 0, // additional padding for short wavs
    "conv_pad": 0, // additional padding against convolutions applied to spectrograms
    "use_noise_augment": false, // add noise to the audio signal for augmentation
    "use_cache": true, // use an in-memory cache to keep the computed features. This might cause OOM.

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 1, // Batch size for training.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 2
    },

    // VALIDATION
    "run_eval": true, // enable/disable evaluation run

    // OPTIMIZER
    "epochs": 1, // total number of epochs to train.
    "clip_grad": 1.0, // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.

    // TENSORBOARD and LOGGING
    "print_step": 250, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 10000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 4,

    // PATHS
    "output_path": "tests/train_outputs/"
}
|
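`train_noise_schedule` above spans the betas of the WaveGrad diffusion process, matching the `np.linspace(1e-6, 1e-2, 1000)` call in the training test further below. A sketch of how such a schedule turns into per-step noise levels, using the standard diffusion formulation (an assumption here, not a quote of the repository's implementation):

```python
import numpy as np

# Beta schedule described by train_noise_schedule:
# min_val=1e-6, max_val=1e-2, num_steps=1000.
betas = np.linspace(1e-6, 1e-2, 1000)

# Assumed standard diffusion bookkeeping: the noise level at step t is
# the square root of the cumulative product of (1 - beta).
noise_levels = np.sqrt(np.cumprod(1.0 - betas))

print(noise_levels[0], noise_levels[-1])  # near 1.0 at t=0, smaller at t=T
```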
@ -0,0 +1,107 @@
{
    "run_name": "wavernn_test",
    "run_description": "wavernn_test training",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024,       // stft window length in samples.
        "hop_length": 256,        // stft window hop length in samples.
        "frame_length_ms": null,  // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null,   // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,     // DATASET-RELATED: wav sample rate. If different from the original data, it is resampled.
        "preemphasis": 0.0,       // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0,        // reference level db; theoretically 20 db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,  // enable trimming of silence in audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,            // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,           // size of the mel spec frame.
        "mel_fmin": 0.0,          // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for your dataset!!
        "mel_fmax": 8000.0,       // maximum freq level for mel-spec. Tune for your dataset!!
        "spec_gain": 20.0,        // scalar value applied after the log transform of the spectrogram.

        // Normalization parameters
        "signal_norm": true,      // normalize spec values. Mean-var normalization if 'stats_path' is defined, otherwise range normalization defined by the other params.
        "min_level_db": -100,     // lower bound for normalization
        "symmetric_norm": true,   // move normalization to range [-1, 1]
        "max_norm": 4.0,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,        // clip normalized values into the range.
        "stats_path": null        // DO NOT USE WITH MULTI_SPEAKER MODEL. Scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and the other normalization params are ignored.
    },

    // Generating / Synthesizing
    "batched": true,
    "target_samples": 11000,      // target number of samples to be generated in each batch entry
    "overlap_samples": 550,       // number of samples for crossfading between batches

    // DISTRIBUTED TRAINING
    // "distributed":{
    //     "backend": "nccl",
    //     "url": "tcp:\/\/localhost:54321"
    // },

    // MODEL PARAMETERS
    "use_aux_net": true,
    "use_upsample_net": true,
    "upsample_factors": [4, 8, 8],  // the product of these factors must equal hop_length
    "seq_len": 1280,                // must be divisible by hop_length
    "mode": "mold",                 // mold [string], gauss [string], bits [int]
    "mulaw": false,                 // apply mulaw if mode is bits
    "padding": 2,                   // pad the input so the resnet sees a wider input length

    // DATASET
    //"use_gta": true,              // use computed gta features from the tts model
    "data_path": "tests/data/ljspeech/wavs/",  // path containing training wav files
    "feature_path": null,           // path containing features computed from the wav files. If null, they are computed.

    // MODEL PARAMETERS
    "wavernn_model_params": {
        "rnn_dims": 512,
        "fc_dims": 512,
        "compute_dims": 128,
        "res_out_dims": 128,
        "num_res_blocks": 10,
        "use_aux_net": true,
        "use_upsample_net": true,
        "upsample_factors": [4, 8, 8]  // the product of these factors must equal hop_length
    },
    "mixed_precision": false,

    // TRAINING
    "batch_size": 4,                // batch size for training.
    "epochs": 1,                    // total number of epochs to train.

    // VALIDATION
    "run_eval": true,
    "test_every_epochs": 10,        // number of epochs between test runs (e.g., test every 10 epochs)

    // OPTIMIZER
    "grad_clip": 4,                 // apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR",  // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [200000, 400000, 600000]
    },
    "lr": 1e-4,                     // initial learning rate

    // TENSORBOARD and LOGGING
    "print_step": 25,               // number of steps between logging training status on the console.
    "print_eval": false,            // if true, it prints loss values for each step in the eval run.
    "save_step": 25000,             // number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true,             // if true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false,  // if true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "eval_split_size": 10,          // number of samples for testing

    // PATHS
    "output_path": "tests/train_outputs/"
}
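The comments on `upsample_factors` and `seq_len` above are hard constraints: the upsample network must stretch mel frames back to audio rate, so its factors must multiply out to `hop_length`, and training sequences are cut in whole hops. A quick sanity check of the values in this config (a sketch, not repository code):

```python
from math import prod

hop_length = 256
upsample_factors = [4, 8, 8]
seq_len = 1280

# the upsample factors must reproduce hop_length exactly
assert prod(upsample_factors) == hop_length
# training sequences must span a whole number of hops
assert seq_len % hop_length == 0  # 1280 / 256 = 5 hops
```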
@ -62,7 +62,7 @@ class GE2ELossTests(unittest.TestCase):
        assert output.item() >= 0.0
        # check speaker loss with orthogonal d-vectors
        dummy_input = T.empty(3, 64)
        dummy_input = T.nn.init.orthogonal(dummy_input)
        dummy_input = T.nn.init.orthogonal_(dummy_input)
        dummy_input = T.cat(
            [
                dummy_input[0].repeat(5, 1, 1).transpose(0, 1),

@ -91,7 +91,7 @@ class AngleProtoLossTests(unittest.TestCase):

        # check speaker loss with orthogonal d-vectors
        dummy_input = T.empty(3, 64)
        dummy_input = T.nn.init.orthogonal(dummy_input)
        dummy_input = T.nn.init.orthogonal_(dummy_input)
        dummy_input = T.cat(
            [
                dummy_input[0].repeat(5, 1, 1).transpose(0, 1),
@ -0,0 +1,13 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_glow_tts.py --config_path $BASEDIR/inputs/test_glow_tts.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_glow_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/
@ -2,7 +2,7 @@ import unittest
import torch as T

from TTS.tts.layers.tacotron import Prenet, CBHG, Decoder, Encoder
from TTS.tts.layers.losses import L1LossMasked
from TTS.tts.layers.losses import L1LossMasked, SSIMLoss
from TTS.tts.utils.generic_utils import sequence_mask

# pylint: disable=unused-variable
@ -149,3 +149,72 @@ class L1LossMaskedTests(unittest.TestCase):
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert output.item() == 0, "0 vs {}".format(output.item())


class SSIMLossTests(unittest.TestCase):
    def test_in_out(self):  # pylint: disable=no-self-use
        # test input == target
        layer = SSIMLoss()
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.ones(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert output.item() == 0.0

        # test input != target
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert abs(output.item() - 1.0) < 1e-4, "1.0 vs {}".format(output.item())

        # test whether padded values of the input make any difference
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert abs(output.item() - 1.0) < 1e-4, "1.0 vs {}".format(output.item())

        dummy_input = T.rand(4, 8, 128).float()
        dummy_target = dummy_input.detach()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert output.item() == 0, "0 vs {}".format(output.item())

        # seq_len_norm = True
        # test input == target
        layer = L1LossMasked(seq_len_norm=True)
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.ones(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert output.item() == 0.0

        # test input != target
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert output.item() == 1.0, "1.0 vs {}".format(output.item())

        # test whether padded values of the input make any difference
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert abs(output.item() - 1.0) < 1e-5, "1.0 vs {}".format(output.item())

        dummy_input = T.rand(4, 8, 128).float()
        dummy_target = dummy_input.detach()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert output.item() == 0, "0 vs {}".format(output.item())
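The SSIM tests above pin down the loss contract: identical input and target score 0, maximally different tensors score about 1, and values in padded positions must not affect the result. A minimal usage sketch under those same assumptions, reusing the call signature the tests demonstrate:

```python
import torch as T
from TTS.tts.layers.losses import SSIMLoss

layer = SSIMLoss()
pred = T.rand(4, 8, 128)          # (batch, frames, mels)
target = pred.clone()
lengths = (T.ones(4) * 8).long()  # valid frames per batch item

# identical tensors give an SSIM distance of 0, as the tests assert
print(layer(pred, target, lengths).item())
```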
@ -6,12 +6,12 @@ if [[ ! -f tests/outputs/checkpoint_10.pth.tar ]]; then
    exit 1
fi

rm -f dist/*.whl
python setup.py --quiet bdist_wheel --checkpoint tests/outputs/checkpoint_10.pth.tar --model_config tests/outputs/dummy_model_config.json

python -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install --quiet --upgrade pip setuptools wheel

rm -f dist/*.whl
python setup.py --quiet bdist_wheel --checkpoint tests/outputs/checkpoint_10.pth.tar --model_config tests/outputs/dummy_model_config.json
pip install --quiet dist/TTS*.whl

# this is related to https://github.com/librosa/librosa/issues/1160
@ -294,6 +294,7 @@ class SCGSTMultiSpeakeTacotronTrainTest(unittest.TestCase):
        mel_spec = torch.rand(8, 30, c.audio['num_mels']).to(device)
        linear_spec = torch.rand(8, 30, c.audio['fft_size']).to(device)
        mel_lengths = torch.randint(20, 30, (8, )).long().to(device)
        mel_lengths[-1] = mel_spec.size(1)
        stop_targets = torch.zeros(8, 30, 1).float().to(device)
        speaker_embeddings = torch.rand(8, 55).to(device)
@ -0,0 +1,14 @@
#!/usr/bin/env bash

set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --config_path $BASEDIR/inputs/test_train_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/
@ -11,6 +11,7 @@ from TTS.utils.io import load_config
conf = load_config(os.path.join(get_tests_input_path(), 'test_config.json'))


def test_phoneme_to_sequence():

    text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
    text_cleaner = ["phoneme_cleaners"]
    lang = "en-us"

@ -20,7 +21,7 @@ def test_phoneme_to_sequence():
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters)
    gt = "ɹiːsənt ɹɪsɜːtʃ æt hɑːɹvɚd hɐz ʃoʊn mɛdᵻteɪɾɪŋ fɔːɹ æz lɪɾəl æz eɪt wiːks kæn æktʃuːəli ɪnkɹiːs, ðə ɡɹeɪ mæɾɚɹ ɪnðə pɑːɹts ʌvðə bɹeɪn ɹɪspɑːnsəbəl fɔːɹ ɪmoʊʃənəl ɹɛɡjuːleɪʃən ænd lɜːnɪŋ!"
    assert text_hat == text_hat_with_params == gt

    # multiple punctuations
    text = "Be a voice, not an! echo?"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
@ -87,6 +88,84 @@ def test_phoneme_to_sequence():
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt


def test_phoneme_to_sequence_with_blank_token():

    text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
    text_cleaner = ["phoneme_cleaners"]
    lang = "en-us"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "ɹiːsənt ɹɪsɜːtʃ æt hɑːɹvɚd hɐz ʃoʊn mɛdᵻteɪɾɪŋ fɔːɹ æz lɪɾəl æz eɪt wiːks kæn æktʃuːəli ɪnkɹiːs, ðə ɡɹeɪ mæɾɚɹ ɪnðə pɑːɹts ʌvðə bɹeɪn ɹɪspɑːnsəbəl fɔːɹ ɪmoʊʃənəl ɹɛɡjuːleɪʃən ænd lɜːnɪŋ!"
    assert text_hat == text_hat_with_params == gt

    # multiple punctuations
    text = "Be a voice, not an! echo?"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ?"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # not ending with punctuation
    text = "Be a voice, not an! echo"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # original
    text = "Be a voice, not an echo!"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn ɛkoʊ!"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # extra space after the sentence
    text = "Be a voice, not an! echo. "
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ."
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # extra space after the sentence, with EOS/BOS characters enabled
    text = "Be a voice, not an! echo. "
    sequence = phoneme_to_sequence(text, text_cleaner, lang, True)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "^biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ.~"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # padding char
    text = "_Be a _voice, not an! echo_"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt


def test_text2phone():
    text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
    gt = "ɹ|iː|s|ə|n|t| |ɹ|ɪ|s|ɜː|tʃ| |æ|t| |h|ɑːɹ|v|ɚ|d| |h|ɐ|z| |ʃ|oʊ|n| |m|ɛ|d|ᵻ|t|eɪ|ɾ|ɪ|ŋ| |f|ɔː|ɹ| |æ|z| |l|ɪ|ɾ|əl| |æ|z| |eɪ|t| |w|iː|k|s| |k|æ|n| |æ|k|tʃ|uː|əl|i| |ɪ|n|k|ɹ|iː|s|,| |ð|ə| |ɡ|ɹ|eɪ| |m|æ|ɾ|ɚ|ɹ| |ɪ|n|ð|ə| |p|ɑːɹ|t|s| |ʌ|v|ð|ə| |b|ɹ|eɪ|n| |ɹ|ɪ|s|p|ɑː|n|s|ə|b|əl| |f|ɔː|ɹ| |ɪ|m|oʊ|ʃ|ə|n|əl| |ɹ|ɛ|ɡ|j|uː|l|eɪ|ʃ|ə|n| |æ|n|d| |l|ɜː|n|ɪ|ŋ|!"
@ -1,13 +1,13 @@
#!/usr/bin/env bash

set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --config_path $BASEDIR/inputs/test_train_config.json
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tacotron.py --config_path $BASEDIR/inputs/test_train_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tacotron.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/
@ -1,15 +1,15 @@
#!/usr/bin/env bash

set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder.py --config_path $BASEDIR/inputs/test_vocoder_multiband_melgan_config.json
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_gan.py --config_path $BASEDIR/inputs/test_vocoder_multiband_melgan_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_gan.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER
@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir -p $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavegrad.py --config_path $BASEDIR/inputs/test_vocoder_wavegrad.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavegrad.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER
@ -0,0 +1,31 @@
import random

import numpy as np
import torch

from TTS.vocoder.models.wavernn import WaveRNN


def test_wavernn():
    model = WaveRNN(
        rnn_dims=512,
        fc_dims=512,
        mode=10,
        mulaw=False,
        pad=2,
        use_aux_net=True,
        use_upsample_net=True,
        upsample_factors=[4, 8, 8],
        feat_dims=80,
        compute_dims=128,
        res_out_dims=128,
        num_res_blocks=10,
        hop_length=256,
        sample_rate=22050,
    )
    dummy_x = torch.rand((2, 1280))
    dummy_m = torch.rand((2, 80, 9))
    y_size = random.randrange(20, 60)
    dummy_y = torch.rand((80, y_size))
    output = model(dummy_x, dummy_m)
    assert np.all(output.shape == (2, 1280, 4 * 256)), output.shape
    output = model.inference(dummy_y, True, 5500, 550)
    assert np.all(output.shape == (256 * (y_size - 1),))
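The shape assertions above encode the model's arithmetic: 1280 input samples cover 1280 / 256 = 5 mel frames, plus `pad=2` context frames on each side, which gives the (2, 80, 9) conditioning tensor; with `mode=10` the forward pass emits 2^10 = 1024 = 4 * 256 class logits per sample; and inference yields `hop_length * (n_frames - 1)` samples. A restatement of that arithmetic (a sketch, independent of the repository):

```python
mode = 10        # bit depth -> 2 ** mode output classes per sample
hop_length = 256
pad = 2

assert 1280 // hop_length + 2 * pad == 9  # mel frames needed for 1280 samples
assert 2 ** mode == 4 * 256               # 1024 logits, the forward output's last dim
print(hop_length * (30 - 1))              # 7424 samples generated for 30 mel frames
```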
@ -0,0 +1,92 @@
import os
import shutil

import numpy as np
from tests import get_tests_path, get_tests_input_path, get_tests_output_path
from torch.utils.data import DataLoader

from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.vocoder.datasets.wavernn_dataset import WaveRNNDataset
from TTS.vocoder.datasets.preprocess import load_wav_feat_data, preprocess_wav_files

file_path = os.path.dirname(os.path.realpath(__file__))
OUTPATH = os.path.join(get_tests_output_path(), "loader_tests/")
os.makedirs(OUTPATH, exist_ok=True)

C = load_config(os.path.join(get_tests_input_path(),
                             "test_vocoder_wavernn_config.json"))

test_data_path = os.path.join(get_tests_path(), "data/ljspeech/")
test_mel_feat_path = os.path.join(test_data_path, "mel")
test_quant_feat_path = os.path.join(test_data_path, "quant")
ok_ljspeech = os.path.exists(test_data_path)


def wavernn_dataset_case(batch_size, seq_len, hop_len, pad, mode, mulaw, num_workers):
    """ run the dataloader with the given parameters and check the conditions """
    ap = AudioProcessor(**C.audio)

    C.batch_size = batch_size
    C.mode = mode
    C.seq_len = seq_len
    C.data_path = test_data_path

    preprocess_wav_files(test_data_path, C, ap)
    _, train_items = load_wav_feat_data(
        test_data_path, test_mel_feat_path, 5)

    dataset = WaveRNNDataset(ap=ap,
                             items=train_items,
                             seq_len=seq_len,
                             hop_len=hop_len,
                             pad=pad,
                             mode=mode,
                             mulaw=mulaw
                             )
    # sampler = DistributedSampler(dataset) if num_gpus > 1 else None
    loader = DataLoader(dataset,
                        shuffle=True,
                        collate_fn=dataset.collate,
                        batch_size=batch_size,
                        num_workers=num_workers,
                        pin_memory=True,
                        )

    max_iter = 10
    count_iter = 0

    try:
        for data in loader:
            x_input, mels, _ = data
            expected_feat_shape = (ap.num_mels,
                                   (x_input.shape[-1] // hop_len) + (pad * 2))
            assert np.all(
                mels.shape[1:] == expected_feat_shape), f" [!] {mels.shape} vs {expected_feat_shape}"

            assert (mels.shape[2] - pad * 2) * hop_len == x_input.shape[1]
            count_iter += 1
            if count_iter == max_iter:
                break
    # except AssertionError:
    #     shutil.rmtree(test_mel_feat_path)
    #     shutil.rmtree(test_quant_feat_path)
    finally:
        shutil.rmtree(test_mel_feat_path)
        shutil.rmtree(test_quant_feat_path)


def test_parametrized_wavernn_dataset():
    ''' test dataloader with different parameters '''
    params = [
        [16, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, 10, True, 0],
        [16, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, "mold", False, 4],
        [1, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, 9, False, 0],
        [1, C.audio['hop_length'], C.audio['hop_length'], 2, 10, True, 0],
        [1, C.audio['hop_length'], C.audio['hop_length'], 2, "mold", False, 0],
        [1, C.audio['hop_length'] * 5, C.audio['hop_length'], 4, 10, False, 2],
        [1, C.audio['hop_length'] * 5, C.audio['hop_length'], 2, "mold", False, 0],
    ]
    for param in params:
        print(param)
        wavernn_dataset_case(*param)
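The two assertions in the loader loop fix the dataset geometry: each batch item carries `seq_len` audio samples plus the mel frames that cover them and `pad` extra frames of context on each side. Restated (a sketch, not repository code):

```python
hop_len, pad = 256, 2
seq_len = hop_len * 10                     # audio samples per batch item
mel_frames = seq_len // hop_len + 2 * pad  # frames needed to cover them

# mirrors: (mels.shape[2] - pad * 2) * hop_len == x_input.shape[1]
assert (mel_frames - 2 * pad) * hop_len == seq_len
```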
@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir -p $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavernn.py --config_path $BASEDIR/inputs/test_vocoder_wavernn_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavernn.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER
@ -0,0 +1,92 @@
import torch

from TTS.vocoder.layers.wavegrad import PositionalEncoding, FiLM, UBlock, DBlock
from TTS.vocoder.models.wavegrad import Wavegrad


def test_positional_encoding():
    layer = PositionalEncoding(50)
    inp = torch.rand(32, 50, 100)
    nl = torch.rand(32)
    o = layer(inp, nl)

    assert o.shape[0] == 32
    assert o.shape[1] == 50
    assert o.shape[2] == 100
    assert isinstance(o, torch.FloatTensor)


def test_film():
    layer = FiLM(50, 76)
    inp = torch.rand(32, 50, 100)
    nl = torch.rand(32)
    shift, scale = layer(inp, nl)

    assert shift.shape[0] == 32
    assert shift.shape[1] == 76
    assert shift.shape[2] == 100
    assert isinstance(shift, torch.FloatTensor)

    assert scale.shape[0] == 32
    assert scale.shape[1] == 76
    assert scale.shape[2] == 100
    assert isinstance(scale, torch.FloatTensor)

    layer.apply_weight_norm()
    layer.remove_weight_norm()


def test_ublock():
    inp1 = torch.rand(32, 50, 100)
    inp2 = torch.rand(32, 50, 50)
    nl = torch.rand(32)

    layer_film = FiLM(50, 100)
    layer = UBlock(50, 100, 2, [1, 2, 4, 8])

    scale, shift = layer_film(inp1, nl)
    o = layer(inp2, shift, scale)

    assert o.shape[0] == 32
    assert o.shape[1] == 100
    assert o.shape[2] == 100
    assert isinstance(o, torch.FloatTensor)

    layer.apply_weight_norm()
    layer.remove_weight_norm()


def test_dblock():
    inp = torch.rand(32, 50, 130)
    layer = DBlock(50, 100, 2)
    o = layer(inp)

    assert o.shape[0] == 32
    assert o.shape[1] == 100
    assert o.shape[2] == 65
    assert isinstance(o, torch.FloatTensor)

    layer.apply_weight_norm()
    layer.remove_weight_norm()


def test_wavegrad_forward():
    x = torch.rand(32, 1, 20 * 300)
    c = torch.rand(32, 80, 20)
    noise_scale = torch.rand(32)

    model = Wavegrad(in_channels=80,
                     out_channels=1,
                     upsample_factors=[5, 5, 3, 2, 2],
                     upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
                                         [1, 2, 4, 8], [1, 2, 4, 8],
                                         [1, 2, 4, 8]])
    o = model.forward(x, c, noise_scale)

    assert o.shape[0] == 32
    assert o.shape[1] == 1
    assert o.shape[2] == 20 * 300
    assert isinstance(o, torch.FloatTensor)

    model.apply_weight_norm()
    model.remove_weight_norm()
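The shapes in these tests follow from the block definitions: a `UBlock` with factor 2 doubles the time axis (50 to 100), a `DBlock` with factor 2 halves it (130 to 65), and the full model's factors 5 * 5 * 3 * 2 * 2 = 300 act as the effective hop length, turning 20 mel frames into 20 * 300 samples. A quick check of that factor arithmetic (a sketch):

```python
from math import prod

upsample_factors = [5, 5, 3, 2, 2]
mel_frames = 20

assert prod(upsample_factors) == 300                 # effective hop length
assert mel_frames * prod(upsample_factors) == 6000   # output samples in the forward test
```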
@ -0,0 +1,62 @@
import unittest

import numpy as np
import torch
from torch import optim
from TTS.vocoder.models.wavegrad import Wavegrad

#pylint: disable=unused-variable

torch.manual_seed(1)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class WavegradTrainTest(unittest.TestCase):
    def test_train_step(self):  # pylint: disable=no-self-use
        """Test if all layers are updated in a basic training cycle"""
        input_dummy = torch.rand(8, 1, 20 * 300).to(device)
        mel_spec = torch.rand(8, 80, 20).to(device)

        criterion = torch.nn.L1Loss().to(device)
        model = Wavegrad(in_channels=80,
                         out_channels=1,
                         upsample_factors=[5, 5, 3, 2, 2],
                         upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
                                             [1, 2, 4, 8], [1, 2, 4, 8],
                                             [1, 2, 4, 8]])

        model_ref = Wavegrad(in_channels=80,
                             out_channels=1,
                             upsample_factors=[5, 5, 3, 2, 2],
                             upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
                                                 [1, 2, 4, 8], [1, 2, 4, 8],
                                                 [1, 2, 4, 8]])
        model.train()
        model.to(device)
        betas = np.linspace(1e-6, 1e-2, 1000)
        model.compute_noise_level(betas)
        model_ref.load_state_dict(model.state_dict())
        model_ref.to(device)
        count = 0
        for param, param_ref in zip(model.parameters(),
                                    model_ref.parameters()):
            assert (param - param_ref).sum() == 0, param
            count += 1
        optimizer = optim.Adam(model.parameters(), lr=0.001)
        for i in range(5):
            y_hat = model.forward(input_dummy, mel_spec, torch.rand(8).to(device))
            optimizer.zero_grad()
            loss = criterion(y_hat, input_dummy)
            loss.backward()
            optimizer.step()
        # check parameter changes
        count = 0
        for param, param_ref in zip(model.parameters(),
                                    model_ref.parameters()):
            # ignore the pre-highway layer since it works conditionally
            # if count not in [145, 59]:
            assert (param != param_ref).any(
            ), "param {} with shape {} not updated!! \n{}\n{}".format(
                count, param.shape, param, param_ref)
            count += 1