Merge branch 'dev'

This commit is contained in:
erogol 2020-11-25 15:29:06 +01:00
commit f6c96b0ac2
71 changed files with 5000 additions and 418 deletions

View File

@ -1,14 +1,14 @@
#!/bin/bash
yes | apt-get install sox
yes | apt-get install ffmpeg
yes | apt-get install espeak
yes | apt-get install tmux
yes | apt-get install zsh
sh -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
pip3 install https://download.pytorch.org/whl/cu100/torch-1.3.0%2Bcu100-cp36-cp36m-linux_x86_64.whl
sudo sh install.sh
pip install torch==1.3.0+cu100
python3 setup.py develop
# pip install pytorch==1.7.0+cu100
# python3 setup.py develop
# python3 distribute.py --config_path config.json --data_path /data/ro/shared/data/keithito/LJSpeech-1.1/
# cp -R ${USER_DIR}/Mozilla_22050 ../tmp/
# python3 distribute.py --config_path config_tacotron_gst.json --data_path ../tmp/Mozilla_22050/

View File

@ -17,5 +17,6 @@ fi
if [[ "$TEST_SUITE" == "testscripts" ]]; then
# test model training scripts
./tests/test_tts_train.sh
./tests/test_vocoder_train.sh
./tests/test_vocoder_gan_train.sh
./tests/test_vocoder_wavernn_train.sh
fi

View File

@ -26,7 +26,7 @@ TTS paper collection: https://github.com/erogol/TTS-papers
## TTS Performance
<p align="center"><img src="https://discourse-prod-uploads-81679984178418.s3.dualstack.us-west-2.amazonaws.com/optimized/3X/6/4/6428f980e9ec751c248e591460895f7881aec0c6_2_1035x591.png" width="800" /></p>
"Mozilla*" and "Judy*" are our models.
"Mozilla*" and "Judy*" are our models.
[Details...](https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results)
## Provided Models and Methods
@ -47,7 +47,10 @@ Speaker Encoder:
Vocoders:
- MelGAN: [paper](https://arxiv.org/abs/1910.06711)
- MultiBandMelGAN: [paper](https://arxiv.org/abs/2005.05106)
- ParallelWaveGAN: [paper](https://arxiv.org/abs/1910.11480)
- GAN-TTS discriminators: [paper](https://arxiv.org/abs/1909.11646)
- WaveRNN: [origin](https://github.com/fatchord/WaveRNN/)
- WaveGrad: [paper](https://arxiv.org/abs/2009.00713)
You can also help us implement more models. Some TTS related work can be found [here](https://github.com/erogol/TTS-papers).
@ -70,8 +73,8 @@ You can also help us implement more models. Some TTS related work can be found [
## Main Requirements and Installation
It is highly recommended to use [miniconda](https://conda.io/miniconda.html) for easier installation.
* python>=3.6
* pytorch>=1.4.1
* tensorflow>=2.2
* pytorch>=1.5.0
* tensorflow>=2.3
* librosa
* tensorboard
* tensorboardX
@ -149,23 +152,25 @@ head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
```
To train a new model, you need to define your own ```config.json``` file (check the example) and call with the command below. You also set the model architecture in ```config.json```.
To train a new model, you need to write your own ```config.json``` covering model details, training configuration and more (check the examples). Then call the corresponding train script.
```python TTS/bin/train_tts.py --config_path TTS/tts/configs/config.json```
For instance, in order to train a Tacotron or Tacotron2 model on the LJSpeech dataset, run the command below.
```python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json```
To fine-tune a model, use ```--restore_path```.
```python TTS/bin/train_tts.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar```
```python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar```
To continue an old training run, use ```--continue_path```.
```python TTS/bin/train_tts.py --continue_path /path/to/your/run_folder/```
```python TTS/bin/train_tacotron.py --continue_path /path/to/your/run_folder/```
For multi-GPU training use ```distribute.py```. It enables process-based multi-GPU training where each process uses a single GPU.
For multi-GPU training, call ```distribute.py```. It runs any provided training script in a multi-GPU setting.
```CUDA_VISIBLE_DEVICES="0,1,4" TTS/bin/distribute.py --config_path TTS/tts/configs/config.json```
```CUDA_VISIBLE_DEVICES="0,1,4" python TTS/bin/distribute.py --script train_tacotron.py --config_path TTS/tts/configs/config.json```
Each run creates a new output folder and ```config.json``` is copied under this folder.
Each run creates a new output folder containing the used ```config.json```, model checkpoints and Tensorboard logs.
In case of any error or interrupted execution, if there is no checkpoint yet under the output folder, the whole folder is going to be removed.
@ -199,7 +204,7 @@ If you like to use TTS to try a new idea and like to share your experiments with
- [x] Train TTS with r=1 successfully.
- [x] Enable process based distributed training. Similar to (https://github.com/fastai/imagenet-fast/).
- [x] Adapting Neural Vocoder. TTS works with WaveRNN and ParallelWaveGAN (https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN)
- [ ] Multi-speaker embedding.
- [x] Multi-speaker embedding.
- [x] Model optimization (model export, model pruning etc.)
<!--## References
@ -218,3 +223,4 @@ If you like to use TTS to try a new idea and like to share your experiments with
- https://github.com/r9y9/tacotron_pytorch (Initial Tacotron architecture)
- https://github.com/kan-bayashi/ParallelWaveGAN (vocoder library)
- https://github.com/jaywalnut310/glow-tts (Original Glow-TTS implementation)
- https://github.com/fatchord/WaveRNN/ (Original WaveRNN implementation)

View File

@ -0,0 +1,130 @@
import argparse
import glob
import os
import numpy as np
from tqdm import tqdm
import torch
from TTS.speaker_encoder.model import SpeakerEncoder
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.tts.utils.speakers import save_speaker_mapping
from TTS.tts.datasets.preprocess import load_meta_data
parser = argparse.ArgumentParser(
description='Compute embedding vectors for each wav file in a dataset. If "target_dataset" is defined, it generates "speakers.json" necessary for training a multi-speaker model.')
parser.add_argument(
'model_path',
type=str,
help='Path to model outputs (checkpoint, tensorboard etc.).')
parser.add_argument(
'config_path',
type=str,
help='Path to config file for training.',
)
parser.add_argument(
'data_path',
type=str,
help='Data path for wav files - directory or CSV file')
parser.add_argument(
'output_path',
type=str,
help='path for training outputs.')
parser.add_argument(
'--target_dataset',
type=str,
default='',
help='Target dataset to pick a processor from TTS.tts.datasets.preprocess. Necessary to create a speakers.json file.'
)
parser.add_argument(
'--use_cuda', type=bool, help='flag to set cuda.', default=False
)
parser.add_argument(
'--separator', type=str, help='Separator used in file if CSV is passed for data_path', default='|'
)
args = parser.parse_args()
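# Example invocation (the script and file names below are illustrative assumptions, not part of this commit):
#   python compute_embeddings.py speaker_encoder.pth.tar config.json /data/LJSpeech-1.1 embeddings/ \
#       --target_dataset ljspeech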
c = load_config(args.config_path)
ap = AudioProcessor(**c['audio'])
data_path = args.data_path
split_ext = os.path.splitext(data_path)
sep = args.separator
if args.target_dataset != '':
# if target dataset is defined
dataset_config = [
{
"name": args.target_dataset,
"path": args.data_path,
"meta_file_train": None,
"meta_file_val": None
},
]
wav_files, _ = load_meta_data(dataset_config, eval_split=False)
output_files = [wav_file[1].replace(data_path, args.output_path).replace(
'.wav', '.npy') for wav_file in wav_files]
else:
# if target dataset is not defined
if len(split_ext) > 0 and split_ext[1].lower() == '.csv':
# Parse CSV
print(f'CSV file: {data_path}')
with open(data_path) as f:
wav_path = os.path.join(os.path.dirname(data_path), 'wavs')
wav_files = []
print(f'Separator is: {sep}')
for line in f:
components = line.split(sep)
if len(components) != 2:
print("Invalid line")
continue
wav_file = os.path.join(wav_path, components[0] + '.wav')
#print(f'wav_file: {wav_file}')
if os.path.exists(wav_file):
wav_files.append(wav_file)
print(f'Count of wavs imported: {len(wav_files)}')
else:
# Parse all wav files in data_path
wav_files = glob.glob(data_path + '/**/*.wav', recursive=True)
output_files = [wav_file.replace(data_path, args.output_path).replace(
'.wav', '.npy') for wav_file in wav_files]
for output_file in output_files:
os.makedirs(os.path.dirname(output_file), exist_ok=True)
# define Encoder model
model = SpeakerEncoder(**c.model)
model.load_state_dict(torch.load(args.model_path)['model'])
model.eval()
if args.use_cuda:
model.cuda()
# compute speaker embeddings
speaker_mapping = {}
for idx, wav_file in enumerate(tqdm(wav_files)):
if isinstance(wav_file, list):
speaker_name = wav_file[2]
wav_file = wav_file[1]
mel_spec = ap.melspectrogram(ap.load_wav(wav_file, sr=ap.sample_rate)).T
mel_spec = torch.FloatTensor(mel_spec[None, :, :])
if args.use_cuda:
mel_spec = mel_spec.cuda()
embedd = model.compute_embedding(mel_spec)
embedd = embedd.detach().cpu().numpy()
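# save the per-utterance embedding as a .npy file mirroring the dataset folder structure under output_path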
np.save(output_files[idx], embedd)
if args.target_dataset != '':
# create speaker_mapping if target dataset is defined
wav_file_name = os.path.basename(wav_file)
speaker_mapping[wav_file_name] = {}
speaker_mapping[wav_file_name]['name'] = speaker_name
speaker_mapping[wav_file_name]['embedding'] = embedd.flatten().tolist()
if args.target_dataset != '':
# save speaker_mapping if target dataset is defined
mapping_file_path = os.path.join(args.output_path, 'speakers.json')
save_speaker_mapping(args.output_path, speaker_mapping)

View File

@ -2,6 +2,7 @@
# -*- coding: utf-8 -*-
import os
import glob
import argparse
import numpy as np
@ -11,6 +12,7 @@ from TTS.tts.datasets.preprocess import load_meta_data
from TTS.utils.io import load_config
from TTS.utils.audio import AudioProcessor
def main():
"""Run preprocessing process."""
parser = argparse.ArgumentParser(
@ -30,7 +32,10 @@ def main():
ap = AudioProcessor(**CONFIG.audio)
# load the meta data of target dataset
dataset_items = load_meta_data(CONFIG.datasets)[0] # take only train data
if 'data_path' in CONFIG.keys():
dataset_items = glob.glob(os.path.join(CONFIG.data_path, '**', '*.wav'), recursive=True)
else:
dataset_items = load_meta_data(CONFIG.datasets)[0] # take only train data
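# dataset_items is either a flat list of wav paths (when 'data_path' is set in the config)
# or a list of metadata entries whose second field is the wav path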
print(f" > There are {len(dataset_items)} files.")
mel_sum = 0
@ -40,7 +45,7 @@ def main():
N = 0
for item in tqdm(dataset_items):
# compute features
wav = ap.load_wav(item[1])
wav = ap.load_wav(item if isinstance(item, str) else item[1])
linear = ap.spectrogram(wav)
mel = ap.melspectrogram(wav)
@ -56,7 +61,7 @@ def main():
linear_mean = linear_sum / N
linear_scale = np.sqrt(linear_square_sum / N - linear_mean ** 2)
output_file_path = os.path.join(args.out_path, "scale_stats.npy")
output_file_path = args.out_path
stats = {}
stats['mel_mean'] = mel_mean
stats['mel_std'] = mel_scale
@ -78,7 +83,7 @@ def main():
del CONFIG.audio['clip_norm']
stats['audio_config'] = CONFIG.audio
np.save(output_file_path, stats, allow_pickle=True)
print(f' > scale_stats.npy is saved to {output_file_path}')
print(f' > stats saved to {output_file_path}')
if __name__ == "__main__":

View File

@ -10,7 +10,7 @@ import time
import torch
from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.generic_utils import setup_model, is_tacotron
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.utils.audio import AudioProcessor
@ -125,7 +125,8 @@ if __name__ == "__main__":
model.eval()
if args.use_cuda:
model.cuda()
model.decoder.set_r(cp['r'])
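# only Tacotron-style models carry a decoder reduction factor 'r' that needs to be restored from the checkpoint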
if is_tacotron(C):
model.decoder.set_r(cp['r'])
# load vocoder model
if args.vocoder_path != "":
@ -153,7 +154,10 @@ if __name__ == "__main__":
args.speaker_fileid = None
if args.gst_style is None:
gst_style = C.gst['gst_style_input']
if is_tacotron(C):
gst_style = C.gst['gst_style_input']
else:
gst_style = None
else:
# check if gst_style string is a dict, if is dict convert else use string
try:

View File

@ -35,7 +35,7 @@ print(" > Using CUDA: ", use_cuda)
print(" > Number of GPUs: ", num_gpus)
def setup_loader(ap, is_val=False, verbose=False):
def setup_loader(ap: AudioProcessor, is_val: bool=False, verbose: bool=False):
if is_val:
loader = None
else:
@ -212,6 +212,7 @@ if __name__ == '__main__':
parser.add_argument(
'--config_path',
type=str,
required=True,
help='Path to config file for training.',
)
parser.add_argument('--debug',

View File

@ -9,17 +9,16 @@ import time
import traceback
import torch
from random import randrange
from torch.utils.data import DataLoader
from TTS.tts.datasets.preprocess import load_meta_data
from TTS.tts.datasets.TTSDataset import MyDataset
from TTS.tts.layers.losses import GlowTTSLoss
from TTS.tts.utils.distribute import (DistributedSampler, init_distributed,
reduce_tensor)
from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.generic_utils import setup_model, check_config_tts
from TTS.tts.utils.io import save_best_model, save_checkpoint
from TTS.tts.utils.measures import alignment_diagonal_score
from TTS.tts.utils.speakers import (get_speakers, load_speaker_mapping,
save_speaker_mapping)
from TTS.tts.utils.speakers import parse_speakers, load_speaker_mapping
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram
@ -34,10 +33,15 @@ from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.training import (NoamLR, check_update,
setup_torch_training_env)
# DISTRIBUTED
from torch.nn.parallel import DistributedDataParallel as DDP_th
from torch.utils.data.distributed import DistributedSampler
from TTS.utils.distribute import init_distributed, reduce_tensor
use_cuda, num_gpus = setup_torch_training_env(True, False)
def setup_loader(ap, r, is_val=False, verbose=False):
def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
if is_val and not c.run_eval:
loader = None
else:
@ -48,6 +52,7 @@ def setup_loader(ap, r, is_val=False, verbose=False):
meta_data=meta_data_eval if is_val else meta_data_train,
ap=ap,
tp=c.characters if 'characters' in c.keys() else None,
add_blank=c['add_blank'] if 'add_blank' in c.keys() else False,
batch_group_size=0 if is_val else c.batch_group_size *
c.batch_size,
min_seq_len=c.min_seq_len,
@ -56,7 +61,8 @@ def setup_loader(ap, r, is_val=False, verbose=False):
use_phonemes=c.use_phonemes,
phoneme_language=c.phoneme_language,
enable_eos_bos=c.enable_eos_bos_chars,
verbose=verbose)
verbose=verbose,
speaker_mapping=speaker_mapping if c.use_speaker_embedding and c.use_external_speaker_embedding_file else None)
sampler = DistributedSampler(dataset) if num_gpus > 1 else None
loader = DataLoader(
dataset,
@ -86,10 +92,13 @@ def format_data(data):
avg_spec_length = torch.mean(mel_lengths.float())
if c.use_speaker_embedding:
speaker_ids = [
speaker_mapping[speaker_name] for speaker_name in speaker_names
]
speaker_ids = torch.LongTensor(speaker_ids)
if c.use_external_speaker_embedding_file:
speaker_ids = data[8]
else:
speaker_ids = [
speaker_mapping[speaker_name] for speaker_name in speaker_names
]
speaker_ids = torch.LongTensor(speaker_ids)
else:
speaker_ids = None
@ -107,7 +116,7 @@ def format_data(data):
avg_text_length, avg_spec_length, attn_mask
def data_depended_init(model, ap):
def data_depended_init(model, ap, speaker_mapping=None):
"""Data depended initialization for activation normalization."""
if hasattr(model, 'module'):
for f in model.module.decoder.flows:
@ -118,19 +127,19 @@ def data_depended_init(model, ap):
if getattr(f, "set_ddi", False):
f.set_ddi(True)
data_loader = setup_loader(ap, 1, is_val=False)
data_loader = setup_loader(ap, 1, is_val=False, speaker_mapping=speaker_mapping)
model.train()
print(" > Data depended initialization ... ")
with torch.no_grad():
for _, data in enumerate(data_loader):
# format data
text_input, text_lengths, mel_input, mel_lengths, _,\
text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
_, _, attn_mask = format_data(data)
# forward pass model
_ = model.forward(
text_input, text_lengths, mel_input, mel_lengths, attn_mask)
text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)
break
if hasattr(model, 'module'):
@ -145,9 +154,9 @@ def data_depended_init(model, ap):
def train(model, criterion, optimizer, scheduler,
ap, global_step, epoch, amp):
ap, global_step, epoch, speaker_mapping=None):
data_loader = setup_loader(ap, 1, is_val=False,
verbose=(epoch == 0))
verbose=(epoch == 0), speaker_mapping=speaker_mapping)
model.train()
epoch_time = 0
keep_avg = KeepAverage()
@ -158,43 +167,49 @@ def train(model, criterion, optimizer, scheduler,
batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
end_time = time.time()
c_logger.print_train_start()
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
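# GradScaler performs dynamic loss scaling so fp16 gradients do not underflow; it stays None when mixed precision is off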
for num_iter, data in enumerate(data_loader):
start_time = time.time()
# format data
text_input, text_lengths, mel_input, mel_lengths, _,\
text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
avg_text_length, avg_spec_length, attn_mask = format_data(data)
loader_time = time.time() - end_time
global_step += 1
optimizer.zero_grad()
# forward pass model
with torch.cuda.amp.autocast(enabled=c.mixed_precision):
z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)
# compute loss
loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,
o_dur_log, o_total_dur, text_lengths)
# backward pass with loss scaling
if c.mixed_precision:
scaler.scale(loss_dict['loss']).backward()
scaler.unscale_(optimizer)
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
c.grad_clip)
scaler.step(optimizer)
scaler.update()
else:
loss_dict['loss'].backward()
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
c.grad_clip)
optimizer.step()
grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
optimizer.step()
# setup lr
if c.noam_schedule:
scheduler.step()
optimizer.zero_grad()
# forward pass model
z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
text_input, text_lengths, mel_input, mel_lengths, attn_mask)
# compute loss
loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,
o_dur_log, o_total_dur, text_lengths)
# backward pass
if amp is not None:
with amp.scale_loss(loss_dict['loss'], optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss_dict['loss'].backward()
if amp:
amp_opt_params = amp.master_params(optimizer)
else:
amp_opt_params = None
grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True, amp_opt_params=amp_opt_params)
optimizer.step()
# current_lr
current_lr = optimizer.param_groups[0]['lr']
@ -257,12 +272,12 @@ def train(model, criterion, optimizer, scheduler,
if c.checkpoint:
# save model
save_checkpoint(model, optimizer, global_step, epoch, 1, OUT_PATH,
model_loss=loss_dict['loss'],
amp_state_dict=amp.state_dict() if amp else None)
model_loss=loss_dict['loss'])
# Diagnostic visualizations
# direct pass on model for spec predictions
spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1])
target_speaker = None if speaker_ids is None else speaker_ids[:1]
spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1], g=target_speaker)
spec_pred = spec_pred.permute(0, 2, 1)
gt_spec = mel_input.permute(0, 2, 1)
const_spec = spec_pred[0].data.cpu().numpy()
@ -298,8 +313,8 @@ def train(model, criterion, optimizer, scheduler,
@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
data_loader = setup_loader(ap, 1, is_val=True)
def evaluate(model, criterion, ap, global_step, epoch, speaker_mapping):
data_loader = setup_loader(ap, 1, is_val=True, speaker_mapping=speaker_mapping)
model.eval()
epoch_time = 0
keep_avg = KeepAverage()
@ -309,12 +324,12 @@ def evaluate(model, criterion, ap, global_step, epoch):
start_time = time.time()
# format data
text_input, text_lengths, mel_input, mel_lengths, _,\
text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
_, _, attn_mask = format_data(data)
# forward pass model
z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
text_input, text_lengths, mel_input, mel_lengths, attn_mask)
text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)
# compute loss
loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,
@ -355,10 +370,11 @@ def evaluate(model, criterion, ap, global_step, epoch):
if args.rank == 0:
# Diagnostic visualizations
# direct pass on model for spec predictions
target_speaker = None if speaker_ids is None else speaker_ids[:1]
if hasattr(model, 'module'):
spec_pred, *_ = model.module.inference(text_input[:1], text_lengths[:1])
spec_pred, *_ = model.module.inference(text_input[:1], text_lengths[:1], g=target_speaker)
else:
spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1])
spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1], g=target_speaker)
spec_pred = spec_pred.permute(0, 2, 1)
gt_spec = mel_input.permute(0, 2, 1)
@ -398,7 +414,17 @@ def evaluate(model, criterion, ap, global_step, epoch):
test_audios = {}
test_figures = {}
print(" | > Synthesizing test sentences")
speaker_id = 0 if c.use_speaker_embedding else None
if c.use_speaker_embedding:
if c.use_external_speaker_embedding_file:
speaker_embedding = speaker_mapping[list(speaker_mapping.keys())[randrange(len(speaker_mapping)-1)]]['embedding']
speaker_id = None
else:
speaker_id = 0
speaker_embedding = None
else:
speaker_id = None
speaker_embedding = None
style_wav = c.get("style_wav_for_test")
for idx, test_sentence in enumerate(test_sentences):
try:
@ -409,6 +435,7 @@ def evaluate(model, criterion, ap, global_step, epoch):
use_cuda,
ap,
speaker_id=speaker_id,
speaker_embedding=speaker_embedding,
style_wav=style_wav,
truncated=False,
enable_eos_bos_chars=c.enable_eos_bos_chars, #pylint: disable=unused-argument
@ -459,38 +486,13 @@ def main(args): # pylint: disable=redefined-outer-name
meta_data_eval = meta_data_eval[:int(len(meta_data_eval) * c.eval_portion)]
# parse speakers
if c.use_speaker_embedding:
speakers = get_speakers(meta_data_train)
if args.restore_path:
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
assert all([speaker in speaker_mapping
for speaker in speakers]), "As of now you, you cannot " \
"introduce new speakers to " \
"a previously trained model."
else:
speaker_mapping = {name: i for i, name in enumerate(speakers)}
save_speaker_mapping(OUT_PATH, speaker_mapping)
num_speakers = len(speaker_mapping)
print("Training with {} speakers: {}".format(num_speakers,
", ".join(speakers)))
else:
num_speakers = 0
num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers(c, args, meta_data_train, OUT_PATH)
# setup model
model = setup_model(num_chars, num_speakers, c)
model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim=speaker_embedding_dim)
optimizer = RAdam(model.parameters(), lr=c.lr, weight_decay=0, betas=(0.9, 0.98), eps=1e-9)
criterion = GlowTTSLoss()
if c.apex_amp_level:
# pylint: disable=import-outside-toplevel
from apex import amp
from apex.parallel import DistributedDataParallel as DDP
model.cuda()
model, optimizer = amp.initialize(model, optimizer, opt_level=c.apex_amp_level)
else:
amp = None
if args.restore_path:
checkpoint = torch.load(args.restore_path, map_location='cpu')
try:
@ -507,9 +509,6 @@ def main(args): # pylint: disable=redefined-outer-name
model.load_state_dict(model_dict)
del model_dict
if amp and 'amp' in checkpoint:
amp.load_state_dict(checkpoint['amp'])
for group in optimizer.param_groups:
group['initial_lr'] = c.lr
print(" > Model restored from step %d" % checkpoint['step'],
@ -524,7 +523,7 @@ def main(args): # pylint: disable=redefined-outer-name
# DISTRIBUTED
if num_gpus > 1:
model = DDP(model)
model = DDP_th(model, device_ids=[args.rank])
if c.noam_schedule:
scheduler = NoamLR(optimizer,
@ -540,19 +539,19 @@ def main(args): # pylint: disable=redefined-outer-name
best_loss = float('inf')
global_step = args.restore_step
model = data_depended_init(model, ap)
model = data_depended_init(model, ap, speaker_mapping)
for epoch in range(0, c.epochs):
c_logger.print_epoch_start(epoch, c.epochs)
train_avg_loss_dict, global_step = train(model, criterion, optimizer,
scheduler, ap, global_step,
epoch, amp)
eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch)
epoch, speaker_mapping)
eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch, speaker_mapping=speaker_mapping)
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
target_loss = train_avg_loss_dict['avg_loss']
if c.run_eval:
target_loss = eval_avg_loss_dict['avg_loss']
best_loss = save_best_model(target_loss, best_loss, model, optimizer, global_step, epoch, c.r,
OUT_PATH, amp_state_dict=amp.state_dict() if amp else None)
OUT_PATH)
if __name__ == '__main__':
@ -602,10 +601,11 @@ if __name__ == '__main__':
# setup output paths and read configs
c = load_config(args.config_path)
# check_config(c)
check_config_tts(c)
_ = os.path.dirname(os.path.realpath(__file__))
if c.apex_amp_level:
print(" > apex AMP level: ", c.apex_amp_level)
if c.mixed_precision:
print(" > Mixed precision enabled.")
OUT_PATH = args.continue_path
if args.continue_path == '':

View File

@ -7,28 +7,25 @@ import os
import sys
import time
import traceback
from random import randrange
import numpy as np
import torch
from random import randrange
from torch.utils.data import DataLoader
from TTS.tts.datasets.preprocess import load_meta_data
from TTS.tts.datasets.TTSDataset import MyDataset
from TTS.tts.layers.losses import TacotronLoss
from TTS.tts.utils.distribute import (DistributedSampler,
apply_gradient_allreduce,
init_distributed, reduce_tensor)
from TTS.tts.utils.generic_utils import setup_model, check_config_tts
from TTS.tts.utils.generic_utils import check_config_tts, setup_model
from TTS.tts.utils.io import save_best_model, save_checkpoint
from TTS.tts.utils.measures import alignment_diagonal_score
from TTS.tts.utils.speakers import (get_speakers, load_speaker_mapping,
save_speaker_mapping)
from TTS.tts.utils.speakers import load_speaker_mapping, parse_speakers
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram
from TTS.utils.audio import AudioProcessor
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.distribute import (DistributedSampler, apply_gradient_allreduce,
init_distributed, reduce_tensor)
from TTS.utils.generic_utils import (KeepAverage, count_parameters,
create_experiment_folder, get_git_branch,
remove_experiment_folder, set_init_dict)
@ -41,6 +38,7 @@ from TTS.utils.training import (NoamLR, adam_weight_decay, check_update,
use_cuda, num_gpus = setup_torch_training_env(True, False)
def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
if is_val and not c.run_eval:
loader = None
@ -52,6 +50,7 @@ def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
meta_data=meta_data_eval if is_val else meta_data_train,
ap=ap,
tp=c.characters if 'characters' in c.keys() else None,
add_blank=c['add_blank'] if 'add_blank' in c.keys() else False,
batch_group_size=0 if is_val else c.batch_group_size *
c.batch_size,
min_seq_len=c.min_seq_len,
@ -87,8 +86,8 @@ def format_data(data, speaker_mapping=None):
mel_input = data[4]
mel_lengths = data[5]
stop_targets = data[6]
avg_text_length = torch.mean(text_lengths.float())
avg_spec_length = torch.mean(mel_lengths.float())
max_text_length = torch.max(text_lengths.float())
max_spec_length = torch.max(mel_lengths.float())
if c.use_speaker_embedding:
if c.use_external_speaker_embedding_file:
@ -124,11 +123,11 @@ def format_data(data, speaker_mapping=None):
if speaker_embeddings is not None:
speaker_embeddings = speaker_embeddings.cuda(non_blocking=True)
return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length
return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, max_text_length, max_spec_length
def train(model, criterion, optimizer, optimizer_st, scheduler,
ap, global_step, epoch, amp, speaker_mapping=None):
ap, global_step, epoch, scaler, scaler_st, speaker_mapping=None):
data_loader = setup_loader(ap, model.decoder.r, is_val=False,
verbose=(epoch == 0), speaker_mapping=speaker_mapping)
model.train()
@ -145,7 +144,7 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
start_time = time.time()
# format data
text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length = format_data(data, speaker_mapping)
text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, max_text_length, max_spec_length = format_data(data, speaker_mapping)
loader_time = time.time() - end_time
global_step += 1
@ -153,65 +152,79 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
# setup lr
if c.noam_schedule:
scheduler.step()
optimizer.zero_grad()
if optimizer_st:
optimizer_st.zero_grad()
# forward pass model
if c.bidirectional_decoder or c.double_decoder_consistency:
decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
else:
decoder_output, postnet_output, alignments, stop_tokens = model(
text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
decoder_backward_output = None
alignments_backward = None
with torch.cuda.amp.autocast(enabled=c.mixed_precision):
# forward pass model
if c.bidirectional_decoder or c.double_decoder_consistency:
decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
else:
decoder_output, postnet_output, alignments, stop_tokens = model(
text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
decoder_backward_output = None
alignments_backward = None
# set the [alignment] lengths wrt reduction factor for guided attention
if mel_lengths.max() % model.decoder.r != 0:
alignment_lengths = (mel_lengths + (model.decoder.r - (mel_lengths.max() % model.decoder.r))) // model.decoder.r
else:
alignment_lengths = mel_lengths // model.decoder.r
# set the [alignment] lengths wrt reduction factor for guided attention
if mel_lengths.max() % model.decoder.r != 0:
alignment_lengths = (mel_lengths + (model.decoder.r - (mel_lengths.max() % model.decoder.r))) // model.decoder.r
else:
alignment_lengths = mel_lengths // model.decoder.r
# compute loss
loss_dict = criterion(postnet_output, decoder_output, mel_input,
linear_input, stop_tokens, stop_targets,
mel_lengths, decoder_backward_output,
alignments, alignment_lengths, alignments_backward,
text_lengths)
# compute loss
loss_dict = criterion(postnet_output, decoder_output, mel_input,
linear_input, stop_tokens, stop_targets,
mel_lengths, decoder_backward_output,
alignments, alignment_lengths, alignments_backward,
text_lengths)
# backward pass
if amp is not None:
with amp.scale_loss(loss_dict['loss'], optimizer) as scaled_loss:
scaled_loss.backward()
# check nan loss
if torch.isnan(loss_dict['loss']).any():
raise RuntimeError(f'Detected NaN loss at step {global_step}.')
# optimizer step
if c.mixed_precision:
# model optimizer step in mixed precision mode
scaler.scale(loss_dict['loss']).backward()
scaler.unscale_(optimizer)
optimizer, current_lr = adam_weight_decay(optimizer)
grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
scaler.step(optimizer)
scaler.update()
# stopnet optimizer step
if c.separate_stopnet:
scaler_st.scale(loss_dict['stopnet_loss']).backward()
scaler_st.unscale_(optimizer_st)
optimizer_st, _ = adam_weight_decay(optimizer_st)
grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0)
scaler_st.step(optimizer_st)
scaler_st.update()
else:
grad_norm_st = 0
else:
# main model optimizer step
loss_dict['loss'].backward()
optimizer, current_lr = adam_weight_decay(optimizer)
grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
optimizer.step()
optimizer, current_lr = adam_weight_decay(optimizer)
if amp:
amp_opt_params = amp.master_params(optimizer)
else:
amp_opt_params = None
grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True, amp_opt_params=amp_opt_params)
optimizer.step()
# stopnet optimizer step
if c.separate_stopnet:
loss_dict['stopnet_loss'].backward()
optimizer_st, _ = adam_weight_decay(optimizer_st)
grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0)
optimizer_st.step()
else:
grad_norm_st = 0
# compute alignment error (the lower the better )
align_error = 1 - alignment_diagonal_score(alignments)
loss_dict['align_error'] = align_error
# backpass and check the grad norm for stop loss
if c.separate_stopnet:
loss_dict['stopnet_loss'].backward()
optimizer_st, _ = adam_weight_decay(optimizer_st)
if amp:
amp_opt_params = amp.master_params(optimizer)
else:
amp_opt_params = None
grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0, amp_opt_params=amp_opt_params)
optimizer_st.step()
else:
grad_norm_st = 0
step_time = time.time() - start_time
epoch_time += step_time
@ -242,8 +255,8 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
# print training progress
if global_step % c.print_step == 0:
log_dict = {
"avg_spec_length": [avg_spec_length, 1], # value, precision
"avg_text_length": [avg_text_length, 1],
"max_spec_length": [max_spec_length, 1], # value, precision
"max_text_length": [max_text_length, 1],
"step_time": [step_time, 4],
"loader_time": [loader_time, 2],
"current_lr": current_lr,
@ -270,7 +283,7 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
save_checkpoint(model, optimizer, global_step, epoch, model.decoder.r, OUT_PATH,
optimizer_st=optimizer_st,
model_loss=loss_dict['postnet_loss'],
amp_state_dict=amp.state_dict() if amp else None)
scaler=scaler.state_dict() if c.mixed_precision else None)
# Diagnostic visualizations
const_spec = postnet_output[0].data.cpu().numpy()
@ -502,45 +515,14 @@ def main(args): # pylint: disable=redefined-outer-name
meta_data_eval = meta_data_eval[:int(len(meta_data_eval) * c.eval_portion)]
# parse speakers
if c.use_speaker_embedding:
speakers = get_speakers(meta_data_train)
if args.restore_path:
if c.use_external_speaker_embedding_file: # if restore checkpoint and use External Embedding file
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
if not speaker_mapping:
print("WARNING: speakers.json was not found in restore_path, trying to use CONFIG.external_speaker_embedding_file")
speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
if not speaker_mapping:
raise RuntimeError("You must copy the file speakers.json to restore_path, or set a valid file in CONFIG.external_speaker_embedding_file")
speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
elif not c.use_external_speaker_embedding_file: # if restore checkpoint and don't use External Embedding file
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
speaker_embedding_dim = None
assert all([speaker in speaker_mapping
for speaker in speakers]), "As of now you, you cannot " \
"introduce new speakers to " \
"a previously trained model."
elif c.use_external_speaker_embedding_file and c.external_speaker_embedding_file: # if start new train using External Embedding file
speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
elif c.use_external_speaker_embedding_file and not c.external_speaker_embedding_file: # if start new train using External Embedding file and don't pass external embedding file
raise "use_external_speaker_embedding_file is True, so you need pass a external speaker embedding file, run GE2E-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb or AngularPrototypical-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb notebook in notebooks/ folder"
else: # if start new train and don't use External Embedding file
speaker_mapping = {name: i for i, name in enumerate(speakers)}
speaker_embedding_dim = None
save_speaker_mapping(OUT_PATH, speaker_mapping)
num_speakers = len(speaker_mapping)
print("Training with {} speakers: {}".format(num_speakers,
", ".join(speakers)))
else:
num_speakers = 0
speaker_embedding_dim = None
speaker_mapping = None
num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers(c, args, meta_data_train, OUT_PATH)
model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim)
# scalers for mixed precision training
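# (the stopnet optimizer gets its own scaler so its loss scaling is tracked independently of the main model)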
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
scaler_st = torch.cuda.amp.GradScaler() if c.mixed_precision and c.separate_stopnet else None
params = set_weight_decay(model, c.wd)
optimizer = RAdam(params, lr=c.lr, weight_decay=0)
if c.stopnet and c.separate_stopnet:
@ -550,26 +532,22 @@ def main(args): # pylint: disable=redefined-outer-name
else:
optimizer_st = None
if c.apex_amp_level == "O1":
# pylint: disable=import-outside-toplevel
from apex import amp
model.cuda()
model, optimizer = amp.initialize(model, optimizer, opt_level=c.apex_amp_level)
else:
amp = None
# setup criterion
criterion = TacotronLoss(c, stopnet_pos_weight=10.0, ga_sigma=0.4)
if args.restore_path:
checkpoint = torch.load(args.restore_path, map_location='cpu')
try:
# TODO: fix optimizer init, model.cuda() needs to be called before
print(" > Restoring Model.")
model.load_state_dict(checkpoint['model'])
# optimizer restore
# optimizer.load_state_dict(checkpoint['optimizer'])
print(" > Restoring Optimizer.")
optimizer.load_state_dict(checkpoint['optimizer'])
if "scaler" in checkpoint and c.mixed_precision:
print(" > Restoring AMP Scaler...")
scaler.load_state_dict(checkpoint["scaler"])
if c.reinit_layers:
raise RuntimeError
model.load_state_dict(checkpoint['model'])
except KeyError:
print(" > Partial model initialization.")
model_dict = model.state_dict()
@ -579,9 +557,6 @@ def main(args): # pylint: disable=redefined-outer-name
model.load_state_dict(model_dict)
del model_dict
if amp and 'amp' in checkpoint:
amp.load_state_dict(checkpoint['amp'])
for group in optimizer.param_groups:
group['lr'] = c.lr
print(" > Model restored from step %d" % checkpoint['step'],
@ -624,14 +599,14 @@ def main(args): # pylint: disable=redefined-outer-name
print("\n > Number of output frames:", model.decoder.r)
train_avg_loss_dict, global_step = train(model, criterion, optimizer,
optimizer_st, scheduler, ap,
global_step, epoch, amp, speaker_mapping)
global_step, epoch, scaler, scaler_st, speaker_mapping)
eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch, speaker_mapping)
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
target_loss = train_avg_loss_dict['avg_postnet_loss']
if c.run_eval:
target_loss = eval_avg_loss_dict['avg_postnet_loss']
best_loss = save_best_model(target_loss, best_loss, model, optimizer, global_step, epoch, c.r,
OUT_PATH, amp_state_dict=amp.state_dict() if amp else None)
OUT_PATH, scaler=scaler.state_dict() if c.mixed_precision else None)
if __name__ == '__main__':
@ -683,8 +658,8 @@ if __name__ == '__main__':
check_config_tts(c)
_ = os.path.dirname(os.path.realpath(__file__))
if c.apex_amp_level == 'O1':
print(" > apex AMP level: ", c.apex_amp_level)
if c.mixed_precision:
print(" > Mixed precision mode is ON")
OUT_PATH = args.continue_path
if args.continue_path == '':

View File

@ -19,13 +19,16 @@ from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.training import setup_torch_training_env
from TTS.vocoder.datasets.gan_dataset import GANDataset
from TTS.vocoder.datasets.preprocess import load_wav_data, load_wav_feat_data
# from distribute import (DistributedSampler, apply_gradient_allreduce,
# init_distributed, reduce_tensor)
from TTS.vocoder.layers.losses import DiscriminatorLoss, GeneratorLoss
from TTS.vocoder.utils.generic_utils import (plot_results, setup_discriminator,
setup_generator)
from TTS.vocoder.utils.io import save_best_model, save_checkpoint
# DISTRIBUTED
from torch.nn.parallel import DistributedDataParallel as DDP_th
from torch.utils.data.distributed import DistributedSampler
from TTS.utils.distribute import init_distributed
use_cuda, num_gpus = setup_torch_training_env(True, True)
@ -45,12 +48,12 @@ def setup_loader(ap, is_val=False, verbose=False):
use_cache=c.use_cache,
verbose=verbose)
dataset.shuffle_mapping()
# sampler = DistributedSampler(dataset) if num_gpus > 1 else None
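# with DDP each process reads a distinct shard; shuffling is delegated to the sampler,
# which is why the DataLoader's own shuffle flag is disabled below when num_gpus > 1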
sampler = DistributedSampler(dataset, shuffle=True) if num_gpus > 1 else None
loader = DataLoader(dataset,
batch_size=1 if is_val else c.batch_size,
shuffle=True,
shuffle=False if num_gpus > 1 else True,
drop_last=False,
sampler=None,
sampler=sampler,
num_workers=c.num_val_loader_workers
if is_val else c.num_loader_workers,
pin_memory=False)
@ -243,41 +246,42 @@ def train(model_G, criterion_G, optimizer_G, model_D, criterion_D, optimizer_D,
c_logger.print_train_step(batch_n_iter, num_iter, global_step,
log_dict, loss_dict, keep_avg.avg_values)
# plot step stats
if global_step % 10 == 0:
iter_stats = {
"lr_G": current_lr_G,
"lr_D": current_lr_D,
"step_time": step_time
}
iter_stats.update(loss_dict)
tb_logger.tb_train_iter_stats(global_step, iter_stats)
if args.rank == 0:
# plot step stats
if global_step % 10 == 0:
iter_stats = {
"lr_G": current_lr_G,
"lr_D": current_lr_D,
"step_time": step_time
}
iter_stats.update(loss_dict)
tb_logger.tb_train_iter_stats(global_step, iter_stats)
# save checkpoint
if global_step % c.save_step == 0:
if c.checkpoint:
# save model
save_checkpoint(model_G,
optimizer_G,
scheduler_G,
model_D,
optimizer_D,
scheduler_D,
global_step,
epoch,
OUT_PATH,
model_losses=loss_dict)
# save checkpoint
if global_step % c.save_step == 0:
if c.checkpoint:
# save model
save_checkpoint(model_G,
optimizer_G,
scheduler_G,
model_D,
optimizer_D,
scheduler_D,
global_step,
epoch,
OUT_PATH,
model_losses=loss_dict)
# compute spectrograms
figures = plot_results(y_hat_vis, y_G, ap, global_step,
'train')
tb_logger.tb_train_figures(global_step, figures)
# compute spectrograms
figures = plot_results(y_hat_vis, y_G, ap, global_step,
'train')
tb_logger.tb_train_figures(global_step, figures)
# Sample audio
sample_voice = y_hat_vis[0].squeeze(0).detach().cpu().numpy()
tb_logger.tb_train_audios(global_step,
{'train/audio': sample_voice},
c.audio["sample_rate"])
# Sample audio
sample_voice = y_hat_vis[0].squeeze(0).detach().cpu().numpy()
tb_logger.tb_train_audios(global_step,
{'train/audio': sample_voice},
c.audio["sample_rate"])
end_time = time.time()
# print epoch stats
@ -286,7 +290,8 @@ def train(model_G, criterion_G, optimizer_G, model_D, criterion_D, optimizer_D,
# Plot Training Epoch Stats
epoch_stats = {"epoch_time": epoch_time}
epoch_stats.update(keep_avg.avg_values)
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
if args.rank == 0:
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
# TODO: plot model stats
# if c.tb_model_param_stats:
# tb_logger.tb_model_weights(model, global_step)
@ -326,7 +331,6 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
y_hat = model_G.pqmf_synthesis(y_hat)
y_G_sub = model_G.pqmf_analysis(y_G)
scores_fake, feats_fake, feats_real = None, None, None
if global_step > c.steps_to_start_discriminator:
@ -403,7 +407,6 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
else:
loss_dict[key] = value.item()
step_time = time.time() - start_time
epoch_time += step_time
@ -419,20 +422,21 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
if c.print_eval:
c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)
# compute spectrograms
figures = plot_results(y_hat, y_G, ap, global_step, 'eval')
tb_logger.tb_eval_figures(global_step, figures)
if args.rank == 0:
# compute spectrograms
figures = plot_results(y_hat, y_G, ap, global_step, 'eval')
tb_logger.tb_eval_figures(global_step, figures)
# Sample audio
sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
c.audio["sample_rate"])
# Sample audio
sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
c.audio["sample_rate"])
# synthesize a full voice
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
# synthesize a full voice
data_loader.return_segments = False
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
return keep_avg.avg_values
@ -443,7 +447,8 @@ def main(args): # pylint: disable=redefined-outer-name
print(f" > Loading wavs from: {c.data_path}")
if c.feature_path is not None:
print(f" > Loading features from: {c.feature_path}")
eval_data, train_data = load_wav_feat_data(c.data_path, c.feature_path, c.eval_split_size)
eval_data, train_data = load_wav_feat_data(
c.data_path, c.feature_path, c.eval_split_size)
else:
eval_data, train_data = load_wav_data(c.data_path, c.eval_split_size)
@ -451,9 +456,9 @@ def main(args): # pylint: disable=redefined-outer-name
ap = AudioProcessor(**c.audio)
# DISTRIBUTED
# if num_gpus > 1:
# init_distributed(args.rank, num_gpus, args.group_id,
# c.distributed["backend"], c.distributed["url"])
if num_gpus > 1:
init_distributed(args.rank, num_gpus, args.group_id,
c.distributed["backend"], c.distributed["url"])
# setup models
model_gen = setup_generator(c)
@ -470,10 +475,12 @@ def main(args): # pylint: disable=redefined-outer-name
scheduler_disc = None
if 'lr_scheduler_gen' in c:
scheduler_gen = getattr(torch.optim.lr_scheduler, c.lr_scheduler_gen)
scheduler_gen = scheduler_gen(optimizer_gen, **c.lr_scheduler_gen_params)
scheduler_gen = scheduler_gen(
optimizer_gen, **c.lr_scheduler_gen_params)
if 'lr_scheduler_disc' in c:
scheduler_disc = getattr(torch.optim.lr_scheduler, c.lr_scheduler_disc)
scheduler_disc = scheduler_disc(optimizer_disc, **c.lr_scheduler_disc_params)
scheduler_disc = scheduler_disc(
optimizer_disc, **c.lr_scheduler_disc_params)
# setup criterion
criterion_gen = GeneratorLoss(c)
@ -531,8 +538,9 @@ def main(args): # pylint: disable=redefined-outer-name
criterion_disc.cuda()
# DISTRIBUTED
# if num_gpus > 1:
# model = apply_gradient_allreduce(model)
if num_gpus > 1:
model_gen = DDP_th(model_gen, device_ids=[args.rank])
model_disc = DDP_th(model_disc, device_ids=[args.rank])
num_params = count_parameters(model_gen)
print(" > Generator has {} parameters".format(num_params), flush=True)
@ -572,8 +580,7 @@ if __name__ == '__main__':
parser.add_argument(
'--continue_path',
type=str,
help=
'Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
help='Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
default='',
required='--config_path' not in sys.argv)
parser.add_argument(

View File

@ -0,0 +1,511 @@
import argparse
import glob
import os
import sys
import time
import traceback
import numpy as np
import torch
# DISTRIBUTED
from torch.nn.parallel import DistributedDataParallel as DDP_th
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from TTS.utils.audio import AudioProcessor
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.distribute import init_distributed
from TTS.utils.generic_utils import (KeepAverage, count_parameters,
create_experiment_folder, get_git_branch,
remove_experiment_folder, set_init_dict)
from TTS.utils.io import copy_config_file, load_config
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.training import setup_torch_training_env
from TTS.vocoder.datasets.preprocess import load_wav_data, load_wav_feat_data
from TTS.vocoder.datasets.wavegrad_dataset import WaveGradDataset
from TTS.vocoder.utils.generic_utils import plot_results, setup_generator
from TTS.vocoder.utils.io import save_best_model, save_checkpoint
use_cuda, num_gpus = setup_torch_training_env(True, True)
def setup_loader(ap, is_val=False, verbose=False):
if is_val and not c.run_eval:
loader = None
else:
dataset = WaveGradDataset(ap=ap,
items=eval_data if is_val else train_data,
seq_len=c.seq_len,
hop_len=ap.hop_length,
pad_short=c.pad_short,
conv_pad=c.conv_pad,
is_training=not is_val,
return_segments=True,
use_noise_augment=False,
use_cache=c.use_cache,
verbose=verbose)
sampler = DistributedSampler(dataset) if num_gpus > 1 else None
loader = DataLoader(dataset,
batch_size=c.batch_size,
shuffle=num_gpus <= 1,
drop_last=False,
sampler=sampler,
num_workers=c.num_val_loader_workers
if is_val else c.num_loader_workers,
pin_memory=False)
return loader
def format_data(data):
# return a whole audio segment
m, x = data
x = x.unsqueeze(1)
if use_cuda:
m = m.cuda(non_blocking=True)
x = x.cuda(non_blocking=True)
return m, x
def format_test_data(data):
# return a whole audio segment
m, x = data
m = m[None, ...]
x = x[None, None, ...]
if use_cuda:
m = m.cuda(non_blocking=True)
x = x.cuda(non_blocking=True)
return m, x
def train(model, criterion, optimizer,
scheduler, scaler, ap, global_step, epoch):
data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
model.train()
epoch_time = 0
keep_avg = KeepAverage()
if use_cuda:
batch_n_iter = int(
len(data_loader.dataset) / (c.batch_size * num_gpus))
else:
batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
end_time = time.time()
c_logger.print_train_start()
# setup noise schedule
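# betas are the per-step noise variances of the diffusion process, spaced linearly
# between min_val and max_val as given by c['train_noise_schedule']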
noise_schedule = c['train_noise_schedule']
betas = np.linspace(noise_schedule['min_val'], noise_schedule['max_val'], noise_schedule['num_steps'])
if hasattr(model, 'module'):
model.module.compute_noise_level(betas)
else:
model.compute_noise_level(betas)
for num_iter, data in enumerate(data_loader):
start_time = time.time()
# format data
m, x = format_data(data)
loader_time = time.time() - end_time
global_step += 1
with torch.cuda.amp.autocast(enabled=c.mixed_precision):
# compute noisy input
if hasattr(model, 'module'):
noise, x_noisy, noise_scale = model.module.compute_y_n(x)
else:
noise, x_noisy, noise_scale = model.compute_y_n(x)
# forward pass
noise_hat = model(x_noisy, m, noise_scale)
# compute losses
loss = criterion(noise, noise_hat)
loss_wavegrad_dict = {'wavegrad_loss':loss}
# check nan loss
if torch.isnan(loss).any():
raise RuntimeError(f'Detected NaN loss at step {global_step}.')
optimizer.zero_grad()
# backward pass with loss scaling
if c.mixed_precision:
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
c.clip_grad)
scaler.step(optimizer)
scaler.update()
else:
loss.backward()
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
c.clip_grad)
optimizer.step()
# schedule update
if scheduler is not None:
scheduler.step()
# disconnect loss values
loss_dict = dict()
for key, value in loss_wavegrad_dict.items():
if isinstance(value, int):
loss_dict[key] = value
else:
loss_dict[key] = value.item()
# epoch/step timing
step_time = time.time() - start_time
epoch_time += step_time
# get current learning rates
current_lr = list(optimizer.param_groups)[0]['lr']
# update avg stats
update_train_values = dict()
for key, value in loss_dict.items():
update_train_values['avg_' + key] = value
update_train_values['avg_loader_time'] = loader_time
update_train_values['avg_step_time'] = step_time
keep_avg.update_values(update_train_values)
# print training stats
if global_step % c.print_step == 0:
log_dict = {
'step_time': [step_time, 2],
'loader_time': [loader_time, 4],
"current_lr": current_lr,
"grad_norm": grad_norm.item()
}
c_logger.print_train_step(batch_n_iter, num_iter, global_step,
log_dict, loss_dict, keep_avg.avg_values)
if args.rank == 0:
# plot step stats
if global_step % 10 == 0:
iter_stats = {
"lr": current_lr,
"grad_norm": grad_norm.item(),
"step_time": step_time
}
iter_stats.update(loss_dict)
tb_logger.tb_train_iter_stats(global_step, iter_stats)
# save checkpoint
if global_step % c.save_step == 0:
if c.checkpoint:
# save model
save_checkpoint(model,
optimizer,
scheduler,
None,
None,
None,
global_step,
epoch,
OUT_PATH,
model_losses=loss_dict,
scaler=scaler.state_dict() if c.mixed_precision else None)
end_time = time.time()
# print epoch stats
c_logger.print_train_epoch_end(global_step, epoch, epoch_time, keep_avg)
# Plot Training Epoch Stats
epoch_stats = {"epoch_time": epoch_time}
epoch_stats.update(keep_avg.avg_values)
if args.rank == 0:
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
# TODO: plot model stats
if c.tb_model_param_stats and args.rank == 0:
tb_logger.tb_model_weights(model, global_step)
return keep_avg.avg_values, global_step
@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
data_loader = setup_loader(ap, is_val=True, verbose=(epoch == 0))
model.eval()
epoch_time = 0
keep_avg = KeepAverage()
end_time = time.time()
c_logger.print_eval_start()
for num_iter, data in enumerate(data_loader):
start_time = time.time()
# format data
m, x = format_data(data)
loader_time = time.time() - end_time
global_step += 1
# compute noisy input
if hasattr(model, 'module'):
noise, x_noisy, noise_scale = model.module.compute_y_n(x)
else:
noise, x_noisy, noise_scale = model.compute_y_n(x)
# forward pass
noise_hat = model(x_noisy, m, noise_scale)
# compute losses
loss = criterion(noise, noise_hat)
loss_wavegrad_dict = {'wavegrad_loss':loss}
loss_dict = dict()
for key, value in loss_wavegrad_dict.items():
if isinstance(value, (int, float)):
loss_dict[key] = value
else:
loss_dict[key] = value.item()
step_time = time.time() - start_time
epoch_time += step_time
# update avg stats
update_eval_values = dict()
for key, value in loss_dict.items():
update_eval_values['avg_' + key] = value
update_eval_values['avg_loader_time'] = loader_time
update_eval_values['avg_step_time'] = step_time
keep_avg.update_values(update_eval_values)
# print eval stats
if c.print_eval:
c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)
if args.rank == 0:
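# switch the dataset to full-utterance mode so a complete test clip can be synthesized and logged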
data_loader.dataset.return_segments = False
samples = data_loader.dataset.load_test_samples(1)
m, x = format_test_data(samples[0])
# setup noise schedule and inference
noise_schedule = c['test_noise_schedule']
betas = np.linspace(noise_schedule['min_val'], noise_schedule['max_val'], noise_schedule['num_steps'])
if hasattr(model, 'module'):
model.module.compute_noise_level(betas)
# compute voice
x_pred = model.module.inference(m)
else:
model.compute_noise_level(betas)
# compute voice
x_pred = model.inference(m)
# compute spectrograms
figures = plot_results(x_pred, x, ap, global_step, 'eval')
tb_logger.tb_eval_figures(global_step, figures)
# Sample audio
sample_voice = x_pred[0].squeeze(0).detach().cpu().numpy()
tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
c.audio["sample_rate"])
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
data_loader.dataset.return_segments = True
return keep_avg.avg_values
def main(args): # pylint: disable=redefined-outer-name
# pylint: disable=global-variable-undefined
global train_data, eval_data
print(f" > Loading wavs from: {c.data_path}")
if c.feature_path is not None:
print(f" > Loading features from: {c.feature_path}")
eval_data, train_data = load_wav_feat_data(c.data_path, c.feature_path, c.eval_split_size)
else:
eval_data, train_data = load_wav_data(c.data_path, c.eval_split_size)
# setup audio processor
ap = AudioProcessor(**c.audio)
# DISTRIBUTED
if num_gpus > 1:
init_distributed(args.rank, num_gpus, args.group_id,
c.distributed["backend"], c.distributed["url"])
# setup models
model = setup_generator(c)
# scaler for mixed_precision
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
# setup optimizers
optimizer = Adam(model.parameters(), lr=c.lr, weight_decay=0)
# schedulers
scheduler = None
if 'lr_scheduler' in c:
scheduler = getattr(torch.optim.lr_scheduler, c.lr_scheduler)
scheduler = scheduler(optimizer, **c.lr_scheduler_params)
# setup criterion
criterion = torch.nn.L1Loss().cuda()
if args.restore_path:
checkpoint = torch.load(args.restore_path, map_location='cpu')
try:
print(" > Restoring Model...")
model.load_state_dict(checkpoint['model'])
print(" > Restoring Optimizer...")
optimizer.load_state_dict(checkpoint['optimizer'])
if 'scheduler' in checkpoint:
print(" > Restoring LR Scheduler...")
scheduler.load_state_dict(checkpoint['scheduler'])
# NOTE: Not sure if necessary
scheduler.optimizer = optimizer
if "scaler" in checkpoint and c.mixed_precision:
print(" > Restoring AMP Scaler...")
scaler.load_state_dict(checkpoint["scaler"])
except RuntimeError:
# restore only matching layers.
print(" > Partial model initialization...")
model_dict = model.state_dict()
model_dict = set_init_dict(model_dict, checkpoint['model'], c)
model.load_state_dict(model_dict)
del model_dict
# reset lr if not continuing a previous training run.
for group in optimizer.param_groups:
group['lr'] = c.lr
print(" > Model restored from step %d" % checkpoint['step'],
flush=True)
args.restore_step = checkpoint['step']
else:
args.restore_step = 0
if use_cuda:
model.cuda()
criterion.cuda()
# DISTRIBUTED
if num_gpus > 1:
model = DDP_th(model, device_ids=[args.rank])
num_params = count_parameters(model)
print(" > WaveGrad has {} parameters".format(num_params), flush=True)
if 'best_loss' not in locals():
best_loss = float('inf')
global_step = args.restore_step
for epoch in range(0, c.epochs):
c_logger.print_epoch_start(epoch, c.epochs)
_, global_step = train(model, criterion, optimizer,
scheduler, scaler, ap, global_step,
epoch)
eval_avg_loss_dict = evaluate(model, criterion, ap,
global_step, epoch)
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
target_loss = eval_avg_loss_dict[c.target_loss]
best_loss = save_best_model(target_loss,
best_loss,
model,
optimizer,
scheduler,
None,
None,
None,
global_step,
epoch,
OUT_PATH,
model_losses=eval_avg_loss_dict,
scaler=scaler.state_dict() if c.mixed_precision else None)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--continue_path',
type=str,
help=
'Training output folder to continue training. If set, "config_path" is ignored.',
default='',
required='--config_path' not in sys.argv)
parser.add_argument(
'--restore_path',
type=str,
help='Model file to be restored. Use to fine-tune a model.',
default='')
parser.add_argument('--config_path',
type=str,
help='Path to config file for training.',
required='--continue_path' not in sys.argv)
parser.add_argument('--debug',
type=bool,
default=False,
help='Skip commit integrity verification before training.')
# DISTRIBUTED
parser.add_argument(
'--rank',
type=int,
default=0,
help='DISTRIBUTED: process rank for distributed training.')
parser.add_argument('--group_id',
type=str,
default="",
help='DISTRIBUTED: process group id.')
args = parser.parse_args()
if args.continue_path != '':
args.output_path = args.continue_path
args.config_path = os.path.join(args.continue_path, 'config.json')
list_of_files = glob.glob(
args.continue_path +
"/*.pth.tar") # * means all if need specific format then *.csv
latest_model_file = max(list_of_files, key=os.path.getctime)
args.restore_path = latest_model_file
print(f" > Training continues for {args.restore_path}")
# setup output paths and read configs
c = load_config(args.config_path)
# check_config(c)
_ = os.path.dirname(os.path.realpath(__file__))
# DISTRIBUTED
if c.mixed_precision:
print(" > Mixed precision is enabled")
OUT_PATH = args.continue_path
if args.continue_path == '':
OUT_PATH = create_experiment_folder(c.output_path, c.run_name,
args.debug)
AUDIO_PATH = os.path.join(OUT_PATH, 'test_audios')
c_logger = ConsoleLogger()
if args.rank == 0:
os.makedirs(AUDIO_PATH, exist_ok=True)
new_fields = {}
if args.restore_path:
new_fields["restore_path"] = args.restore_path
new_fields["github_branch"] = get_git_branch()
copy_config_file(args.config_path,
os.path.join(OUT_PATH, 'config.json'), new_fields)
os.chmod(AUDIO_PATH, 0o775)
os.chmod(OUT_PATH, 0o775)
LOG_DIR = OUT_PATH
tb_logger = TensorboardLogger(LOG_DIR, model_name='VOCODER')
# write model desc to tensorboard
tb_logger.tb_add_text('model-description', c['run_description'], 0)
try:
main(args)
except KeyboardInterrupt:
remove_experiment_folder(OUT_PATH)
try:
sys.exit(0)
except SystemExit:
os._exit(0) # pylint: disable=protected-access
except Exception: # pylint: disable=broad-except
remove_experiment_folder(OUT_PATH)
traceback.print_exc()
sys.exit(1)
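
The evaluation loop above builds a linear beta schedule from `test_noise_schedule` and hands it to `compute_noise_level()` before calling `inference()`. The sketch below illustrates the usual DDPM-style bookkeeping such a call is assumed to perform; names and formulas are illustrative, not the model class's actual implementation.

```python
# Hedged sketch of the diffusion bookkeeping assumed behind model.compute_noise_level();
# the names and exact formulation here are illustrative, not the repository's code.
import numpy as np

def compute_noise_level_sketch(betas):
    alphas = 1.0 - betas                    # per-step signal retention
    alphas_cumprod = np.cumprod(alphas)     # cumulative signal retention (alpha_bar)
    noise_levels = np.sqrt(alphas_cumprod)  # scale applied to the clean signal at each step
    return alphas, alphas_cumprod, noise_levels

# e.g. c['test_noise_schedule'] = {"min_val": 1e-6, "max_val": 1e-2, "num_steps": 50}
betas = np.linspace(1e-6, 1e-2, 50)
_, _, noise_levels = compute_noise_level_sketch(betas)
print(noise_levels[:3], noise_levels[-1])
```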

View File

@ -0,0 +1,539 @@
import argparse
import os
import sys
import traceback
import time
import glob
import random
import torch
from torch.utils.data import DataLoader
# from torch.utils.data.distributed import DistributedSampler
from TTS.tts.utils.visual import plot_spectrogram
from TTS.utils.audio import AudioProcessor
from TTS.utils.radam import RAdam
from TTS.utils.io import copy_config_file, load_config
from TTS.utils.training import setup_torch_training_env
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.generic_utils import (
KeepAverage,
count_parameters,
create_experiment_folder,
get_git_branch,
remove_experiment_folder,
set_init_dict,
)
from TTS.vocoder.datasets.wavernn_dataset import WaveRNNDataset
from TTS.vocoder.datasets.preprocess import (
load_wav_data,
load_wav_feat_data
)
from TTS.vocoder.utils.distribution import discretized_mix_logistic_loss, gaussian_loss
from TTS.vocoder.utils.generic_utils import setup_wavernn
from TTS.vocoder.utils.io import save_best_model, save_checkpoint
use_cuda, num_gpus = setup_torch_training_env(True, True)
def setup_loader(ap, is_val=False, verbose=False):
if is_val and not c.run_eval:
loader = None
else:
dataset = WaveRNNDataset(ap=ap,
items=eval_data if is_val else train_data,
seq_len=c.seq_len,
hop_len=ap.hop_length,
pad=c.padding,
mode=c.mode,
mulaw=c.mulaw,
is_training=not is_val,
verbose=verbose,
)
# sampler = DistributedSampler(dataset) if num_gpus > 1 else None
loader = DataLoader(dataset,
shuffle=True,
collate_fn=dataset.collate,
batch_size=c.batch_size,
num_workers=c.num_val_loader_workers
if is_val
else c.num_loader_workers,
pin_memory=True,
)
return loader
def format_data(data):
# setup input data
x_input = data[0]
mels = data[1]
y_coarse = data[2]
# dispatch data to GPU
if use_cuda:
x_input = x_input.cuda(non_blocking=True)
mels = mels.cuda(non_blocking=True)
y_coarse = y_coarse.cuda(non_blocking=True)
return x_input, mels, y_coarse
def train(model, optimizer, criterion, scheduler, scaler, ap, global_step, epoch):
# create train loader
data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
model.train()
epoch_time = 0
keep_avg = KeepAverage()
if use_cuda:
batch_n_iter = int(len(data_loader.dataset) /
(c.batch_size * num_gpus))
else:
batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
end_time = time.time()
c_logger.print_train_start()
# train loop
for num_iter, data in enumerate(data_loader):
start_time = time.time()
x_input, mels, y_coarse = format_data(data)
loader_time = time.time() - end_time
global_step += 1
optimizer.zero_grad()
if c.mixed_precision:
# mixed precision training
with torch.cuda.amp.autocast():
y_hat = model(x_input, mels)
if isinstance(model.mode, int):
y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
else:
y_coarse = y_coarse.float()
y_coarse = y_coarse.unsqueeze(-1)
# compute losses
loss = criterion(y_hat, y_coarse)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
if c.grad_clip > 0:
torch.nn.utils.clip_grad_norm_(
model.parameters(), c.grad_clip)
scaler.step(optimizer)
scaler.update()
else:
# full precision training
y_hat = model(x_input, mels)
if isinstance(model.mode, int):
y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
else:
y_coarse = y_coarse.float()
y_coarse = y_coarse.unsqueeze(-1)
# compute losses
loss = criterion(y_hat, y_coarse)
if torch.isnan(loss):
raise RuntimeError(" [!] NaN loss detected. Exiting ...")
loss.backward()
if c.grad_clip > 0:
torch.nn.utils.clip_grad_norm_(
model.parameters(), c.grad_clip)
optimizer.step()
if scheduler is not None:
scheduler.step()
# get the current learning rate
cur_lr = list(optimizer.param_groups)[0]["lr"]
step_time = time.time() - start_time
epoch_time += step_time
update_train_values = dict()
loss_dict = dict()
loss_dict["model_loss"] = loss.item()
for key, value in loss_dict.items():
update_train_values["avg_" + key] = value
update_train_values["avg_loader_time"] = loader_time
update_train_values["avg_step_time"] = step_time
keep_avg.update_values(update_train_values)
# print training stats
if global_step % c.print_step == 0:
log_dict = {"step_time": [step_time, 2],
"loader_time": [loader_time, 4],
"current_lr": cur_lr,
}
c_logger.print_train_step(batch_n_iter,
num_iter,
global_step,
log_dict,
loss_dict,
keep_avg.avg_values,
)
# plot step stats
if global_step % 10 == 0:
iter_stats = {"lr": cur_lr, "step_time": step_time}
iter_stats.update(loss_dict)
tb_logger.tb_train_iter_stats(global_step, iter_stats)
# save checkpoint
if global_step % c.save_step == 0:
if c.checkpoint:
# save model
save_checkpoint(model,
optimizer,
scheduler,
None,
None,
None,
global_step,
epoch,
OUT_PATH,
model_losses=loss_dict,
scaler=scaler.state_dict() if c.mixed_precision else None
)
# synthesize a full voice
rand_idx = random.randrange(0, len(train_data))
wav_path = train_data[rand_idx] if not isinstance(
train_data[rand_idx], (tuple, list)) else train_data[rand_idx][0]
wav = ap.load_wav(wav_path)
ground_mel = ap.melspectrogram(wav)
sample_wav = model.generate(ground_mel,
c.batched,
c.target_samples,
c.overlap_samples,
use_cuda
)
predict_mel = ap.melspectrogram(sample_wav)
# compute spectrograms
figures = {"train/ground_truth": plot_spectrogram(ground_mel.T),
"train/prediction": plot_spectrogram(predict_mel.T)
}
tb_logger.tb_train_figures(global_step, figures)
# Sample audio
tb_logger.tb_train_audios(
global_step, {
"train/audio": sample_wav}, c.audio["sample_rate"]
)
end_time = time.time()
# print epoch stats
c_logger.print_train_epoch_end(global_step, epoch, epoch_time, keep_avg)
# Plot Training Epoch Stats
epoch_stats = {"epoch_time": epoch_time}
epoch_stats.update(keep_avg.avg_values)
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
# TODO: plot model stats
# if c.tb_model_param_stats:
# tb_logger.tb_model_weights(model, global_step)
return keep_avg.avg_values, global_step
@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
# create train loader
data_loader = setup_loader(ap, is_val=True, verbose=(epoch == 0))
model.eval()
epoch_time = 0
keep_avg = KeepAverage()
end_time = time.time()
c_logger.print_eval_start()
with torch.no_grad():
for num_iter, data in enumerate(data_loader):
start_time = time.time()
# format data
x_input, mels, y_coarse = format_data(data)
loader_time = time.time() - end_time
global_step += 1
y_hat = model(x_input, mels)
if isinstance(model.mode, int):
y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
else:
y_coarse = y_coarse.float()
y_coarse = y_coarse.unsqueeze(-1)
loss = criterion(y_hat, y_coarse)
# Compute avg loss
# if num_gpus > 1:
# loss = reduce_tensor(loss.data, num_gpus)
loss_dict = dict()
loss_dict["model_loss"] = loss.item()
step_time = time.time() - start_time
epoch_time += step_time
# update avg stats
update_eval_values = dict()
for key, value in loss_dict.items():
update_eval_values["avg_" + key] = value
update_eval_values["avg_loader_time"] = loader_time
update_eval_values["avg_step_time"] = step_time
keep_avg.update_values(update_eval_values)
# print eval stats
if c.print_eval:
c_logger.print_eval_step(
num_iter, loss_dict, keep_avg.avg_values)
if epoch % c.test_every_epochs == 0 and epoch != 0:
# synthesize a full voice
rand_idx = random.randrange(0, len(eval_data))
wav_path = eval_data[rand_idx] if not isinstance(
eval_data[rand_idx], (tuple, list)) else eval_data[rand_idx][0]
wav = ap.load_wav(wav_path)
ground_mel = ap.melspectrogram(wav)
sample_wav = model.generate(ground_mel,
c.batched,
c.target_samples,
c.overlap_samples,
use_cuda
)
predict_mel = ap.melspectrogram(sample_wav)
# Sample audio
tb_logger.tb_eval_audios(
global_step, {
"eval/audio": sample_wav}, c.audio["sample_rate"]
)
# compute spectrograms
figures = {"eval/ground_truth": plot_spectrogram(ground_mel.T),
"eval/prediction": plot_spectrogram(predict_mel.T)
}
tb_logger.tb_eval_figures(global_step, figures)
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
return keep_avg.avg_values
# FIXME: move args definition/parsing inside of main?
def main(args): # pylint: disable=redefined-outer-name
# pylint: disable=global-variable-undefined
global train_data, eval_data
# setup audio processor
ap = AudioProcessor(**c.audio)
# print(f" > Loading wavs from: {c.data_path}")
# if c.feature_path is not None:
# print(f" > Loading features from: {c.feature_path}")
# eval_data, train_data = load_wav_feat_data(
# c.data_path, c.feature_path, c.eval_split_size
# )
# else:
# mel_feat_path = os.path.join(OUT_PATH, "mel")
# feat_data = find_feat_files(mel_feat_path)
# if feat_data:
# print(f" > Loading features from: {mel_feat_path}")
# eval_data, train_data = load_wav_feat_data(
# c.data_path, mel_feat_path, c.eval_split_size
# )
# else:
# print(" > No feature data found. Preprocessing...")
# # preprocessing feature data from given wav files
# preprocess_wav_files(OUT_PATH, CONFIG, ap)
# eval_data, train_data = load_wav_feat_data(
# c.data_path, mel_feat_path, c.eval_split_size
# )
print(f" > Loading wavs from: {c.data_path}")
if c.feature_path is not None:
print(f" > Loading features from: {c.feature_path}")
eval_data, train_data = load_wav_feat_data(
c.data_path, c.feature_path, c.eval_split_size)
else:
eval_data, train_data = load_wav_data(
c.data_path, c.eval_split_size)
# setup model
model_wavernn = setup_wavernn(c)
# setup amp scaler
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
# define train functions
if c.mode == "mold":
criterion = discretized_mix_logistic_loss
elif c.mode == "gauss":
criterion = gaussian_loss
elif isinstance(c.mode, int):
criterion = torch.nn.CrossEntropyLoss()
if use_cuda:
model_wavernn.cuda()
if isinstance(c.mode, int):
criterion.cuda()
optimizer = RAdam(model_wavernn.parameters(), lr=c.lr, weight_decay=0)
scheduler = None
if "lr_scheduler" in c:
scheduler = getattr(torch.optim.lr_scheduler, c.lr_scheduler)
scheduler = scheduler(optimizer, **c.lr_scheduler_params)
# slow start for the first 5 epochs
# lr_lambda = lambda epoch: min(epoch / c.warmup_steps, 1)
# scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# restore any checkpoint
if args.restore_path:
checkpoint = torch.load(args.restore_path, map_location="cpu")
try:
print(" > Restoring Model...")
model_wavernn.load_state_dict(checkpoint["model"])
print(" > Restoring Optimizer...")
optimizer.load_state_dict(checkpoint["optimizer"])
if "scheduler" in checkpoint:
print(" > Restoring Generator LR Scheduler...")
scheduler.load_state_dict(checkpoint["scheduler"])
scheduler.optimizer = optimizer
if "scaler" in checkpoint and c.mixed_precision:
print(" > Restoring AMP Scaler...")
scaler.load_state_dict(checkpoint["scaler"])
except RuntimeError:
# restore only matching layers.
print(" > Partial model initialization...")
model_dict = model_wavernn.state_dict()
model_dict = set_init_dict(model_dict, checkpoint["model"], c)
model_wavernn.load_state_dict(model_dict)
print(" > Model restored from step %d" %
checkpoint["step"], flush=True)
args.restore_step = checkpoint["step"]
else:
args.restore_step = 0
# DISTRIBUTED
# if num_gpus > 1:
# model = apply_gradient_allreduce(model)
num_parameters = count_parameters(model_wavernn)
print(" > Model has {} parameters".format(num_parameters), flush=True)
if "best_loss" not in locals():
best_loss = float("inf")
global_step = args.restore_step
for epoch in range(0, c.epochs):
c_logger.print_epoch_start(epoch, c.epochs)
_, global_step = train(model_wavernn, optimizer,
criterion, scheduler, scaler, ap, global_step, epoch)
eval_avg_loss_dict = evaluate(
model_wavernn, criterion, ap, global_step, epoch)
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
target_loss = eval_avg_loss_dict["avg_model_loss"]
best_loss = save_best_model(
target_loss,
best_loss,
model_wavernn,
optimizer,
scheduler,
None,
None,
None,
global_step,
epoch,
OUT_PATH,
model_losses=eval_avg_loss_dict,
scaler=scaler.state_dict() if c.mixed_precision else None
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--continue_path",
type=str,
help='Training output folder to continue training. If set, "config_path" is ignored.',
default="",
required="--config_path" not in sys.argv,
)
parser.add_argument(
"--restore_path",
type=str,
help="Model file to be restored. Use to finetune a model.",
default="",
)
parser.add_argument(
"--config_path",
type=str,
help="Path to config file for training.",
required="--continue_path" not in sys.argv,
)
parser.add_argument(
"--debug",
type=bool,
default=False,
help="Do not verify commit integrity to run training.",
)
# DISTRIBUTED
parser.add_argument(
"--rank",
type=int,
default=0,
help="DISTRIBUTED: process rank for distributed training.",
)
parser.add_argument(
"--group_id", type=str, default="", help="DISTRIBUTED: process group id."
)
args = parser.parse_args()
if args.continue_path != "":
args.output_path = args.continue_path
args.config_path = os.path.join(args.continue_path, "config.json")
list_of_files = glob.glob(
args.continue_path + "/*.pth.tar"
) # match all checkpoint files in the folder
latest_model_file = max(list_of_files, key=os.path.getctime)
args.restore_path = latest_model_file
print(f" > Training continues for {args.restore_path}")
# setup output paths and read configs
c = load_config(args.config_path)
# check_config(c)
_ = os.path.dirname(os.path.realpath(__file__))
OUT_PATH = args.continue_path
if args.continue_path == "":
OUT_PATH = create_experiment_folder(
c.output_path, c.run_name, args.debug
)
AUDIO_PATH = os.path.join(OUT_PATH, "test_audios")
c_logger = ConsoleLogger()
if args.rank == 0:
os.makedirs(AUDIO_PATH, exist_ok=True)
new_fields = {}
if args.restore_path:
new_fields["restore_path"] = args.restore_path
new_fields["github_branch"] = get_git_branch()
copy_config_file(
args.config_path, os.path.join(OUT_PATH, "config.json"), new_fields
)
os.chmod(AUDIO_PATH, 0o775)
os.chmod(OUT_PATH, 0o775)
LOG_DIR = OUT_PATH
tb_logger = TensorboardLogger(LOG_DIR, model_name="VOCODER")
# write model desc to tensorboard
tb_logger.tb_add_text("model-description", c["run_description"], 0)
try:
main(args)
except KeyboardInterrupt:
remove_experiment_folder(OUT_PATH)
try:
sys.exit(0)
except SystemExit:
os._exit(0) # pylint: disable=protected-access
except Exception: # pylint: disable=broad-except
remove_experiment_folder(OUT_PATH)
traceback.print_exc()
sys.exit(1)
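
When `c.mode` is an integer bit depth, the criterion chosen in `main()` is `CrossEntropyLoss`, which expects the class dimension right after the batch dimension; hence the `transpose(1, 2).unsqueeze(-1)` in the training and evaluation loops above. A self-contained shape sketch (tensor shapes are assumptions for illustration, not the loader's exact output):

```python
# Shape sketch for the integer-bit WaveRNN mode; shapes are assumptions chosen to
# satisfy CrossEntropyLoss, not the exact tensors produced by WaveRNNDataset.
import torch

B, T, bits = 2, 100, 10
n_classes = 2 ** bits                                    # e.g. mode=10 -> 1024 classes
y_hat = torch.randn(B, T, n_classes)                     # logits per time step
y_coarse = torch.randint(0, n_classes, (B, T))           # quantized target samples

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(y_hat.transpose(1, 2).unsqueeze(-1),    # [B, n_classes, T, 1]
                 y_coarse.unsqueeze(-1))                  # [B, T, 1]
print(loss.item())
```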

TTS/bin/tune_wavegrad.py Normal file
View File

@ -0,0 +1,91 @@
"""Search a good noise schedule for WaveGrad for a given number of inferece iterations"""
import argparse
from itertools import product as cartesian_product
import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.datasets.wavegrad_dataset import WaveGradDataset
from TTS.vocoder.utils.generic_utils import setup_generator
parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str, help='Path to model checkpoint.')
parser.add_argument('--config_path', type=str, help='Path to model config file.')
parser.add_argument('--data_path', type=str, help='Path to data directory.')
parser.add_argument('--output_path', type=str, help='Path for the output file, including file name and extension.')
parser.add_argument('--num_iter', type=int, help='Number of model inference iterations to optimize the noise schedule for.')
parser.add_argument('--use_cuda', type=bool, help='Enable/disable CUDA.')
parser.add_argument('--num_samples', type=int, default=1, help='Number of data samples used for inference.')
parser.add_argument('--search_depth', type=int, default=3, help='Search granularity. Increasing this increases the run-time exponentially.')
# load config
args = parser.parse_args()
config = load_config(args.config_path)
# setup audio processor
ap = AudioProcessor(**config.audio)
# load dataset
_, train_data = load_wav_data(args.data_path, 0)
train_data = train_data[:args.num_samples]
dataset = WaveGradDataset(ap=ap,
items=train_data,
seq_len=-1,
hop_len=ap.hop_length,
pad_short=config.pad_short,
conv_pad=config.conv_pad,
is_training=True,
return_segments=False,
use_noise_augment=False,
use_cache=False,
verbose=True)
loader = DataLoader(
dataset,
batch_size=1,
shuffle=False,
collate_fn=dataset.collate_full_clips,
drop_last=False,
num_workers=config.num_loader_workers,
pin_memory=False)
# setup the model
model = setup_generator(config)
if args.use_cuda:
model.cuda()
# setup optimization parameters
base_values = sorted(10 * np.random.uniform(size=args.search_depth))
print(base_values)
exponents = 10 ** np.linspace(-6, -1, num=args.num_iter)
best_error = float('inf')
best_schedule = None
total_search_iter = len(base_values)**args.num_iter
for base in tqdm(cartesian_product(base_values, repeat=args.num_iter), total=total_search_iter):
beta = exponents * base
model.compute_noise_level(beta)
for data in loader:
mel, audio = data
y_hat = model.inference(mel.cuda() if args.use_cuda else mel)
if args.use_cuda:
y_hat = y_hat.cpu()
y_hat = y_hat.numpy()
mel_hat = []
for i in range(y_hat.shape[0]):
m = ap.melspectrogram(y_hat[i, 0])[:, :-1]
mel_hat.append(torch.from_numpy(m))
mel_hat = torch.stack(mel_hat)
mse = torch.sum((mel - mel_hat) ** 2).mean()
if mse.item() < best_error:
best_error = mse.item()
best_schedule = {'beta': beta}
print(f" > Found a better schedule. - MSE: {mse.item()}")
np.save(args.output_path, best_schedule)
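
For a sense of scale: with the default `--search_depth 3` and, say, `--num_iter 6`, the grid above holds 3**6 = 729 candidate schedules, each obtained by scaling the fixed exponent ramp with one tuple of base values. A small illustration (random base values, for demonstration only):

```python
# Illustration of the grid size and of one candidate beta schedule,
# built the same way as in the search loop above (base values are random here).
import numpy as np
from itertools import product as cartesian_product

search_depth, num_iter = 3, 6
base_values = sorted(10 * np.random.uniform(size=search_depth))
exponents = 10 ** np.linspace(-6, -1, num=num_iter)

total_search_iter = len(base_values) ** num_iter        # 3**6 = 729 schedules to evaluate
base = next(iter(cartesian_product(base_values, repeat=num_iter)))
candidate_beta = exponents * np.array(base)             # one schedule of length num_iter
print(total_search_iter, candidate_beta)
```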

View File

@ -61,6 +61,7 @@ class SpeakerEncoder(nn.Module):
d = torch.nn.functional.normalize(d, p=2, dim=1)
return d
@torch.no_grad()
def inference(self, x):
d = self.layers.forward(x)
if self.use_lstm_with_projection:

View File

@ -65,14 +65,19 @@
"eval_batch_size":16,
"r": 7, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceeed.
"apex_amp_level": null, // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.
"mixed_precision": true, // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.
// LOSS SETTINGS
"loss_masking": true, // enable / disable loss masking against the sequence padding.
"decoder_loss_alpha": 0.5, // decoder loss weight. If > 0, it is enabled
"postnet_loss_alpha": 0.25, // postnet loss weight. If > 0, it is enabled
"decoder_loss_alpha": 0.5, // original decoder loss weight. If > 0, it is enabled
"postnet_loss_alpha": 0.25, // original postnet loss weight. If > 0, it is enabled
"postnet_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
"decoder_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
"decoder_ssim_alpha": 0.5, // decoder ssim loss weight. If > 0, it is enabled
"postnet_ssim_alpha": 0.25, // postnet ssim loss weight. If > 0, it is enabled
"ga_alpha": 5.0, // weight for guided attention loss. If > 0, guided attention is enabled.
"diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
"stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.
// VALIDATION
"run_eval": true,

View File

@ -51,10 +51,13 @@
// "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
// },
"add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.
// DISTRIBUTED TRAINING
"apex_amp_level": null, // APEX amp optimization level. "O1" is currently supported.
"distributed":{
"backend": "nccl",
"url": "tcp:\/\/localhost:54321"
"url": "tcp:\/\/localhost:54323"
},
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

View File

@ -51,6 +51,8 @@
// "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
// },
"add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.
// DISTRIBUTED TRAINING
"distributed":{
"backend": "nccl",

View File

@ -17,6 +17,7 @@ class MyDataset(Dataset):
ap,
meta_data,
tp=None,
add_blank=False,
batch_group_size=0,
min_seq_len=0,
max_seq_len=float("inf"),
@ -55,6 +56,7 @@ class MyDataset(Dataset):
self.max_seq_len = max_seq_len
self.ap = ap
self.tp = tp
self.add_blank = add_blank
self.use_phonemes = use_phonemes
self.phoneme_cache_path = phoneme_cache_path
self.phoneme_language = phoneme_language
@ -88,7 +90,7 @@ class MyDataset(Dataset):
phonemes = phoneme_to_sequence(text, [self.cleaners],
language=self.phoneme_language,
enable_eos_bos=False,
tp=self.tp)
tp=self.tp, add_blank=self.add_blank)
phonemes = np.asarray(phonemes, dtype=np.int32)
np.save(cache_path, phonemes)
return phonemes
@ -127,7 +129,7 @@ class MyDataset(Dataset):
text = self._load_or_generate_phoneme_sequence(wav_file, text)
else:
text = np.asarray(text_to_sequence(text, [self.cleaners],
tp=self.tp),
tp=self.tp, add_blank=self.add_blank),
dtype=np.int32)
assert text.size > 0, self.items[idx][1]

View File

@ -9,9 +9,9 @@ from tqdm import tqdm
from TTS.tts.utils.generic_utils import split_dataset
def load_meta_data(datasets):
def load_meta_data(datasets, eval_split=True):
meta_data_train_all = []
meta_data_eval_all = []
meta_data_eval_all = [] if eval_split else None
for dataset in datasets:
name = dataset['name']
root_path = dataset['path']
@ -20,12 +20,13 @@ def load_meta_data(datasets):
preprocessor = get_preprocessor_by_name(name)
meta_data_train = preprocessor(root_path, meta_file_train)
print(f" | > Found {len(meta_data_train)} files in {Path(root_path).resolve()}")
if meta_file_val is None:
meta_data_eval, meta_data_train = split_dataset(meta_data_train)
else:
meta_data_eval = preprocessor(root_path, meta_file_val)
if eval_split:
if meta_file_val is None:
meta_data_eval, meta_data_train = split_dataset(meta_data_train)
else:
meta_data_eval = preprocessor(root_path, meta_file_val)
meta_data_eval_all += meta_data_eval
meta_data_train_all += meta_data_train
meta_data_eval_all += meta_data_eval
return meta_data_train_all, meta_data_eval_all
@ -227,7 +228,6 @@ def brspeech(root_path, meta_file):
if line.startswith("wav_filename"):
continue
cols = line.split('|')
#print(cols)
wav_file = os.path.join(root_path, cols[0])
text = cols[2]
speaker_name = cols[3]
@ -303,17 +303,17 @@ def _voxcel_x(root_path, meta_file, voxcel_idx):
elif not cache_to.exists():
cnt = 0
meta_data = ""
meta_data = []
wav_files = voxceleb_path.rglob("**/*.wav")
for path in tqdm(wav_files, desc=f"Building VoxCeleb {voxcel_idx} Meta file ... this needs to be done only once.",
total=expected_count):
speaker_id = str(Path(path).parent.parent.stem)
assert speaker_id.startswith('id')
text = None # VoxCel does not provide transcriptions, and they are not needed for training the SE
meta_data += f"{text}|{path}|voxcel{voxcel_idx}_{speaker_id}\n"
meta_data.append(f"{text}|{path}|voxcel{voxcel_idx}_{speaker_id}\n")
cnt += 1
with open(str(cache_to), 'w') as f:
f.write(meta_data)
f.write("".join(meta_data))
if cnt < expected_count:
raise ValueError(f"Found too few instances for Voxceleb. Should be around {expected_count}, is: {cnt}")

View File

@ -2,10 +2,14 @@ import math
import numpy as np
import torch
from torch import nn
from inspect import signature
from torch.nn import functional
from TTS.tts.utils.generic_utils import sequence_mask
from TTS.tts.utils.ssim import ssim
# pylint: disable=abstract-method
# relates to https://github.com/pytorch/pytorch/issues/42305
class L1LossMasked(nn.Module):
def __init__(self, seq_len_norm):
super().__init__()
@ -22,6 +26,10 @@ class L1LossMasked(nn.Module):
class for each corresponding step.
length: A Variable containing a LongTensor of size (batch,)
which contains the length of each data in a batch.
Shapes:
x: B x T X D
target: B x T x D
length: B
Returns:
loss: An average loss value in range [0, 1] masked by the length.
"""
@ -60,6 +68,10 @@ class MSELossMasked(nn.Module):
class for each corresponding step.
length: A Variable containing a LongTensor of size (batch,)
which contains the length of each data in a batch.
Shapes:
x: B x T X D
target: B x T x D
length: B
Returns:
loss: An average loss value in range [0, 1] masked by the length.
"""
@ -84,6 +96,33 @@ class MSELossMasked(nn.Module):
return loss
class SSIMLoss(torch.nn.Module):
"""SSIM loss as explained here https://en.wikipedia.org/wiki/Structural_similarity"""
def __init__(self):
super().__init__()
self.loss_func = ssim
def forward(self, y_hat, y, length=None):
"""
Args:
y_hat (tensor): model prediction values.
y (tensor): target values.
length (tensor): length of each sample in a batch.
Shapes:
y_hat: B x T X D
y: B x T x D
length: B
Returns:
loss: An average loss value in range [0, 1] masked by the length.
"""
if length is not None:
m = sequence_mask(sequence_length=length,
max_len=y.size(1)).unsqueeze(2).float().to(
y_hat.device)
y_hat, y = y_hat * m, y * m
return 1 - self.loss_func(y_hat.unsqueeze(1), y.unsqueeze(1))
class AttentionEntropyLoss(nn.Module):
# pylint: disable=R0201
def forward(self, align):
@ -115,19 +154,29 @@ class BCELossMasked(nn.Module):
class for each corresponding step.
length: A Variable containing a LongTensor of size (batch,)
which contains the length of each data in a batch.
Shapes:
x: B x T
target: B x T
length: B
Returns:
loss: An average loss value in range [0, 1] masked by the length.
"""
# mask: (batch, max_len, 1)
target.requires_grad = False
mask = sequence_mask(sequence_length=length,
max_len=target.size(1)).float()
if length is not None:
mask = sequence_mask(sequence_length=length,
max_len=target.size(1)).float()
x = x * mask
target = target * mask
num_items = mask.sum()
else:
num_items = torch.numel(x)
loss = functional.binary_cross_entropy_with_logits(
x * mask,
target * mask,
x,
target,
pos_weight=self.pos_weight,
reduction='sum')
loss = loss / mask.sum()
loss = loss / num_items
return loss
@ -139,9 +188,19 @@ class DifferentailSpectralLoss(nn.Module):
super().__init__()
self.loss_func = loss_func
def forward(self, x, target, length):
def forward(self, x, target, length=None):
"""
Shapes:
x: B x T
target: B x T
length: B
Returns:
loss: An average loss value in range [0, 1] masked by the length.
"""
x_diff = x[:, 1:] - x[:, :-1]
target_diff = target[:, 1:] - target[:, :-1]
if length is None:
return self.loss_func(x_diff, target_diff)
return self.loss_func(x_diff, target_diff, length-1)
@ -169,7 +228,7 @@ class GuidedAttentionLoss(torch.nn.Module):
@staticmethod
def _make_ga_mask(ilen, olen, sigma):
grid_x, grid_y = torch.meshgrid(torch.arange(olen, device=olen.device), torch.arange(ilen, device=ilen.device))
grid_x, grid_y = torch.meshgrid(torch.arange(olen).to(olen), torch.arange(ilen).to(ilen))
grid_x, grid_y = grid_x.float(), grid_y.float()
return 1.0 - torch.exp(-(grid_y / ilen - grid_x / olen)**2 /
(2 * (sigma**2)))
@ -182,13 +241,17 @@ class GuidedAttentionLoss(torch.nn.Module):
class TacotronLoss(torch.nn.Module):
"""Collection of Tacotron set-up based on provided config."""
def __init__(self, c, stopnet_pos_weight=10, ga_sigma=0.4):
super(TacotronLoss, self).__init__()
self.stopnet_pos_weight = stopnet_pos_weight
self.ga_alpha = c.ga_alpha
self.diff_spec_alpha = c.diff_spec_alpha
self.decoder_diff_spec_alpha = c.decoder_diff_spec_alpha
self.postnet_diff_spec_alpha = c.postnet_diff_spec_alpha
self.decoder_alpha = c.decoder_loss_alpha
self.postnet_alpha = c.postnet_loss_alpha
self.decoder_ssim_alpha = c.decoder_ssim_alpha
self.postnet_ssim_alpha = c.postnet_ssim_alpha
self.config = c
# postnet and decoder loss
@ -199,12 +262,15 @@ class TacotronLoss(torch.nn.Module):
else:
self.criterion = nn.L1Loss() if c.model in ["Tacotron"
] else nn.MSELoss()
# differential spectral loss
if c.diff_spec_alpha > 0:
self.criterion_diff_spec = DifferentailSpectralLoss(loss_func=self.criterion)
# guided attention loss
if c.ga_alpha > 0:
self.criterion_ga = GuidedAttentionLoss(sigma=ga_sigma)
# differential spectral loss
if c.postnet_diff_spec_alpha > 0 or c.decoder_diff_spec_alpha > 0:
self.criterion_diff_spec = DifferentailSpectralLoss(loss_func=self.criterion)
# ssim loss
if c.postnet_ssim_alpha > 0 or c.decoder_ssim_alpha > 0:
self.criterion_ssim = SSIMLoss()
# stopnet loss
# pylint: disable=not-callable
self.criterion_st = BCELossMasked(
@ -215,6 +281,9 @@ class TacotronLoss(torch.nn.Module):
alignments, alignment_lens, alignments_backwards, input_lens):
return_dict = {}
# remove lengths if no masking is applied
if not self.config.loss_masking:
output_lens = None
# decoder and postnet losses
if self.config.loss_masking:
if self.decoder_alpha > 0:
@ -262,8 +331,11 @@ class TacotronLoss(torch.nn.Module):
# double decoder consistency loss (if enabled)
if self.config.double_decoder_consistency:
decoder_b_loss = self.criterion(decoder_b_output, mel_input,
output_lens)
if self.config.loss_masking:
decoder_b_loss = self.criterion(decoder_b_output, mel_input,
output_lens)
else:
decoder_b_loss = self.criterion(decoder_b_output, mel_input)
# decoder_c_loss = torch.nn.functional.l1_loss(decoder_b_output, decoder_output)
attention_c_loss = torch.nn.functional.l1_loss(alignments, alignments_backwards)
loss += self.decoder_alpha * (decoder_b_loss + attention_c_loss)
@ -274,14 +346,38 @@ class TacotronLoss(torch.nn.Module):
if self.config.ga_alpha > 0:
ga_loss = self.criterion_ga(alignments, input_lens, alignment_lens)
loss += ga_loss * self.ga_alpha
return_dict['ga_loss'] = ga_loss * self.ga_alpha
return_dict['ga_loss'] = ga_loss
# decoder differential spectral loss
if self.config.decoder_diff_spec_alpha > 0:
decoder_diff_spec_loss = self.criterion_diff_spec(decoder_output, mel_input, output_lens)
loss += decoder_diff_spec_loss * self.decoder_diff_spec_alpha
return_dict['decoder_diff_spec_loss'] = decoder_diff_spec_loss
# postnet differential spectral loss
if self.config.postnet_diff_spec_alpha > 0:
postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
loss += postnet_diff_spec_loss * self.postnet_diff_spec_alpha
return_dict['postnet_diff_spec_loss'] = postnet_diff_spec_loss
# decoder ssim loss
if self.config.decoder_ssim_alpha > 0:
decoder_ssim_loss = self.criterion_ssim(decoder_output, mel_input, output_lens)
loss += decoder_ssim_loss * self.decoder_ssim_alpha
return_dict['decoder_ssim_loss'] = decoder_ssim_loss
# postnet ssim loss
if self.config.postnet_ssim_alpha > 0:
postnet_ssim_loss = self.criterion_ssim(postnet_output, mel_input, output_lens)
loss += postnet_ssim_loss * self.postnet_ssim_alpha
return_dict['postnet_ssim_loss'] = postnet_ssim_loss
# differential spectral loss
if self.config.diff_spec_alpha > 0:
diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
loss += diff_spec_loss * self.diff_spec_alpha
return_dict['diff_spec_loss'] = diff_spec_loss
return_dict['loss'] = loss
# check if any loss is NaN
for key, loss in return_dict.items():
if torch.isnan(loss):
raise RuntimeError(f" [!] NaN loss with {key}.")
return return_dict
@ -306,4 +402,9 @@ class GlowTTSLoss(torch.nn.Module):
return_dict['loss'] = log_mle + loss_dur
return_dict['log_mle'] = log_mle
return_dict['loss_dur'] = loss_dur
# check if any loss is NaN
for key, loss in return_dict.items():
if torch.isnan(loss):
raise RuntimeError(f" [!] NaN loss with {key}.")
return return_dict
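
`DifferentailSpectralLoss` applies the base criterion to first-order frame differences, penalizing over-smoothed predictions. A minimal unmasked illustration with dummy tensors:

```python
# Minimal illustration of the differential spectral loss (unmasked case, dummy data).
import torch
import torch.nn.functional as F

x = torch.rand(2, 50, 80)          # B x T x D predicted mel frames
target = torch.rand(2, 50, 80)     # B x T x D ground-truth mel frames

x_diff = x[:, 1:] - x[:, :-1]                # frame-to-frame differences
target_diff = target[:, 1:] - target[:, :-1]
loss = F.l1_loss(x_diff, target_diff)        # base criterion would be L1 or MSE here
print(loss.item())
```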

View File

@ -102,7 +102,7 @@ class Encoder(nn.Module):
o = layer(o)
o = o.transpose(1, 2)
o = nn.utils.rnn.pack_padded_sequence(o,
input_lengths,
input_lengths.cpu(),
batch_first=True)
self.lstm.flatten_parameters()
o, _ = self.lstm(o)

View File

@ -37,7 +37,8 @@ class GlowTts(nn.Module):
hidden_channels_enc=None,
hidden_channels_dec=None,
use_encoder_prenet=False,
encoder_type="transformer"):
encoder_type="transformer",
external_speaker_embedding_dim=None):
super().__init__()
self.num_chars = num_chars
@ -67,6 +68,14 @@ class GlowTts(nn.Module):
self.use_encoder_prenet = use_encoder_prenet
self.noise_scale = 0.66
self.length_scale = 1.
self.external_speaker_embedding_dim = external_speaker_embedding_dim
# if multi-speaker and c_in_channels is 0, set it to 512
if num_speakers > 1:
if self.c_in_channels == 0 and not self.external_speaker_embedding_dim:
self.c_in_channels = 512
elif self.external_speaker_embedding_dim:
self.c_in_channels = self.external_speaker_embedding_dim
self.encoder = Encoder(num_chars,
out_channels=out_channels,
@ -80,7 +89,7 @@ class GlowTts(nn.Module):
dropout_p=dropout_p,
mean_only=mean_only,
use_prenet=use_encoder_prenet,
c_in_channels=c_in_channels)
c_in_channels=self.c_in_channels)
self.decoder = Decoder(out_channels,
hidden_channels_dec or hidden_channels,
@ -92,10 +101,10 @@ class GlowTts(nn.Module):
num_splits=num_splits,
num_sqz=num_sqz,
sigmoid_scale=sigmoid_scale,
c_in_channels=c_in_channels)
c_in_channels=self.c_in_channels)
if num_speakers > 1:
self.emb_g = nn.Embedding(num_speakers, c_in_channels)
if num_speakers > 1 and not external_speaker_embedding_dim:
self.emb_g = nn.Embedding(num_speakers, self.c_in_channels)
nn.init.uniform_(self.emb_g.weight, -0.1, 0.1)
@staticmethod
@ -122,7 +131,11 @@ class GlowTts(nn.Module):
y_max_length = y.size(2)
# norm speaker embeddings
if g is not None:
g = F.normalize(self.emb_g(g)).unsqueeze(-1) # [b, h]
if self.external_speaker_embedding_dim:
g = F.normalize(g).unsqueeze(-1)
else:
g = F.normalize(self.emb_g(g)).unsqueeze(-1)# [b, h]
# embedding pass
o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x,
x_lengths,
@ -157,8 +170,13 @@ class GlowTts(nn.Module):
@torch.no_grad()
def inference(self, x, x_lengths, g=None):
if g is not None:
g = F.normalize(self.emb_g(g)).unsqueeze(-1) # [b, h]
if self.external_speaker_embedding_dim:
g = F.normalize(g).unsqueeze(-1)
else:
g = F.normalize(self.emb_g(g)).unsqueeze(-1) # [b, h]
# embedding pass
o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x,
x_lengths,

View File

@ -28,7 +28,6 @@ def split_dataset(items):
return items_eval, items
return items[:eval_split_size], items[eval_split_size:]
# from https://gist.github.com/jihunchoi/f1434a77df9db1bb337417854b398df1
def sequence_mask(sequence_length, max_len=None):
if max_len is None:
@ -50,7 +49,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
MyModel = importlib.import_module('TTS.tts.models.' + c.model.lower())
MyModel = getattr(MyModel, to_camel(c.model))
if c.model.lower() in "tacotron":
model = MyModel(num_chars=num_chars,
model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
num_speakers=num_speakers,
r=c.r,
postnet_output_dim=int(c.audio['fft_size'] / 2 + 1),
@ -77,7 +76,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
ddc_r=c.ddc_r,
speaker_embedding_dim=speaker_embedding_dim)
elif c.model.lower() == "tacotron2":
model = MyModel(num_chars=num_chars,
model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
num_speakers=num_speakers,
r=c.r,
postnet_output_dim=c.audio['num_mels'],
@ -103,7 +102,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
ddc_r=c.ddc_r,
speaker_embedding_dim=speaker_embedding_dim)
elif c.model.lower() == "glow_tts":
model = MyModel(num_chars=num_chars,
model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
hidden_channels=192,
filter_channels=768,
filter_channels_dp=256,
@ -126,13 +125,15 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
mean_only=True,
hidden_channels_enc=192,
hidden_channels_dec=192,
use_encoder_prenet=True)
use_encoder_prenet=True,
external_speaker_embedding_dim=speaker_embedding_dim)
return model
def is_tacotron(c):
return False if 'glow_tts' in c['model'] else True
def check_config_tts(c):
check_argument('model', c, enum_list=['tacotron', 'tacotron2'], restricted=True, val_type=str)
check_argument('model', c, enum_list=['tacotron', 'tacotron2', 'glow_tts'], restricted=True, val_type=str)
check_argument('run_name', c, restricted=True, val_type=str)
check_argument('run_description', c, val_type=str)
@ -176,10 +177,20 @@ def check_config_tts(c):
check_argument('eval_batch_size', c, restricted=True, val_type=int, min_val=1)
check_argument('r', c, restricted=True, val_type=int, min_val=1)
check_argument('gradual_training', c, restricted=False, val_type=list)
check_argument('loss_masking', c, restricted=True, val_type=bool)
check_argument('apex_amp_level', c, restricted=False, val_type=str)
# check_argument('grad_accum', c, restricted=True, val_type=int, min_val=1, max_val=100)
# loss parameters
check_argument('loss_masking', c, restricted=True, val_type=bool)
if c['model'].lower() in ['tacotron', 'tacotron2']:
check_argument('decoder_loss_alpha', c, restricted=True, val_type=float, min_val=0)
check_argument('postnet_loss_alpha', c, restricted=True, val_type=float, min_val=0)
check_argument('postnet_diff_spec_alpha', c, restricted=True, val_type=float, min_val=0)
check_argument('decoder_diff_spec_alpha', c, restricted=True, val_type=float, min_val=0)
check_argument('decoder_ssim_alpha', c, restricted=True, val_type=float, min_val=0)
check_argument('postnet_ssim_alpha', c, restricted=True, val_type=float, min_val=0)
check_argument('ga_alpha', c, restricted=True, val_type=float, min_val=0)
# validation parameters
check_argument('run_eval', c, restricted=True, val_type=bool)
check_argument('test_delay_epochs', c, restricted=True, val_type=int, min_val=0)
@ -195,27 +206,30 @@ def check_config_tts(c):
check_argument('seq_len_norm', c, restricted=True, val_type=bool)
# tacotron prenet
check_argument('memory_size', c, restricted=True, val_type=int, min_val=-1)
check_argument('prenet_type', c, restricted=True, val_type=str, enum_list=['original', 'bn'])
check_argument('prenet_dropout', c, restricted=True, val_type=bool)
check_argument('memory_size', c, restricted=is_tacotron(c), val_type=int, min_val=-1)
check_argument('prenet_type', c, restricted=is_tacotron(c), val_type=str, enum_list=['original', 'bn'])
check_argument('prenet_dropout', c, restricted=is_tacotron(c), val_type=bool)
# attention
check_argument('attention_type', c, restricted=True, val_type=str, enum_list=['graves', 'original'])
check_argument('attention_heads', c, restricted=True, val_type=int)
check_argument('attention_norm', c, restricted=True, val_type=str, enum_list=['sigmoid', 'softmax'])
check_argument('windowing', c, restricted=True, val_type=bool)
check_argument('use_forward_attn', c, restricted=True, val_type=bool)
check_argument('forward_attn_mask', c, restricted=True, val_type=bool)
check_argument('transition_agent', c, restricted=True, val_type=bool)
check_argument('transition_agent', c, restricted=True, val_type=bool)
check_argument('location_attn', c, restricted=True, val_type=bool)
check_argument('bidirectional_decoder', c, restricted=True, val_type=bool)
check_argument('double_decoder_consistency', c, restricted=True, val_type=bool)
check_argument('attention_type', c, restricted=is_tacotron(c), val_type=str, enum_list=['graves', 'original'])
check_argument('attention_heads', c, restricted=is_tacotron(c), val_type=int)
check_argument('attention_norm', c, restricted=is_tacotron(c), val_type=str, enum_list=['sigmoid', 'softmax'])
check_argument('windowing', c, restricted=is_tacotron(c), val_type=bool)
check_argument('use_forward_attn', c, restricted=is_tacotron(c), val_type=bool)
check_argument('forward_attn_mask', c, restricted=is_tacotron(c), val_type=bool)
check_argument('transition_agent', c, restricted=is_tacotron(c), val_type=bool)
check_argument('transition_agent', c, restricted=is_tacotron(c), val_type=bool)
check_argument('location_attn', c, restricted=is_tacotron(c), val_type=bool)
check_argument('bidirectional_decoder', c, restricted=is_tacotron(c), val_type=bool)
check_argument('double_decoder_consistency', c, restricted=is_tacotron(c), val_type=bool)
check_argument('ddc_r', c, restricted='double_decoder_consistency' in c.keys(), min_val=1, max_val=7, val_type=int)
# stopnet
check_argument('stopnet', c, restricted=True, val_type=bool)
check_argument('separate_stopnet', c, restricted=True, val_type=bool)
check_argument('stopnet', c, restricted=is_tacotron(c), val_type=bool)
check_argument('separate_stopnet', c, restricted=is_tacotron(c), val_type=bool)
# GlowTTS parameters
check_argument('encoder_type', c, restricted=not is_tacotron(c), val_type=str)
# tensorboard
check_argument('print_step', c, restricted=True, val_type=int, min_val=1)
@ -240,15 +254,16 @@ def check_config_tts(c):
# multi-speaker and gst
check_argument('use_speaker_embedding', c, restricted=True, val_type=bool)
check_argument('use_external_speaker_embedding_file', c, restricted=True, val_type=bool)
check_argument('external_speaker_embedding_file', c, restricted=True, val_type=str)
check_argument('use_gst', c, restricted=True, val_type=bool)
check_argument('gst', c, restricted=True, val_type=dict)
check_argument('gst_style_input', c['gst'], restricted=True, val_type=[str, dict])
check_argument('gst_embedding_dim', c['gst'], restricted=True, val_type=int, min_val=0, max_val=1000)
check_argument('gst_use_speaker_embedding', c['gst'], restricted=True, val_type=bool)
check_argument('gst_num_heads', c['gst'], restricted=True, val_type=int, min_val=2, max_val=10)
check_argument('gst_style_tokens', c['gst'], restricted=True, val_type=int, min_val=1, max_val=1000)
check_argument('use_external_speaker_embedding_file', c, restricted=c['use_speaker_embedding'], val_type=bool)
check_argument('external_speaker_embedding_file', c, restricted=c['use_external_speaker_embedding_file'], val_type=str)
check_argument('use_gst', c, restricted=is_tacotron(c), val_type=bool)
if c['model'].lower() in ['tacotron', 'tacotron2'] and c['use_gst']:
check_argument('gst', c, restricted=is_tacotron(c), val_type=dict)
check_argument('gst_style_input', c['gst'], restricted=is_tacotron(c), val_type=[str, dict])
check_argument('gst_embedding_dim', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=0, max_val=1000)
check_argument('gst_use_speaker_embedding', c['gst'], restricted=is_tacotron(c), val_type=bool)
check_argument('gst_num_heads', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=2, max_val=10)
check_argument('gst_style_tokens', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=1, max_val=1000)
# datasets - checking only the first entry
check_argument('datasets', c, restricted=True, val_type=list)

View File

@ -6,6 +6,7 @@ import pickle as pickle_tts
from TTS.utils.io import RenamingUnpickler
def load_checkpoint(model, checkpoint_path, amp=None, use_cuda=False):
try:
state = torch.load(checkpoint_path, map_location=torch.device('cpu'))
@ -25,9 +26,12 @@ def load_checkpoint(model, checkpoint_path, amp=None, use_cuda=False):
def save_model(model, optimizer, current_step, epoch, r, output_path, amp_state_dict=None, **kwargs):
new_state_dict = model.state_dict()
if hasattr(model, 'module'):
model_state = model.module.state_dict()
else:
model_state = model.state_dict()
state = {
'model': new_state_dict,
'model': model_state,
'optimizer': optimizer.state_dict() if optimizer is not None else None,
'step': current_step,
'epoch': epoch,

View File

@ -30,3 +30,44 @@ def get_speakers(items):
"""Returns a sorted, unique list of speakers in a given dataset."""
speakers = {e[2] for e in items}
return sorted(speakers)
def parse_speakers(c, args, meta_data_train, OUT_PATH):
""" Returns number of speakers, speaker embedding shape and speaker mapping"""
if c.use_speaker_embedding:
speakers = get_speakers(meta_data_train)
if args.restore_path:
if c.use_external_speaker_embedding_file: # if restore checkpoint and use External Embedding file
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
if not speaker_mapping:
print("WARNING: speakers.json was not found in restore_path, trying to use CONFIG.external_speaker_embedding_file")
speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
if not speaker_mapping:
raise RuntimeError("You must copy the file speakers.json to restore_path, or set a valid file in CONFIG.external_speaker_embedding_file")
speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
elif not c.use_external_speaker_embedding_file: # if restore checkpoint and don't use External Embedding file
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
speaker_embedding_dim = None
assert all([speaker in speaker_mapping
for speaker in speakers]), "As of now, you cannot " \
"introduce new speakers to " \
"a previously trained model."
elif c.use_external_speaker_embedding_file and c.external_speaker_embedding_file: # if starting a new training run with an external embedding file
speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
elif c.use_external_speaker_embedding_file and not c.external_speaker_embedding_file: # if starting a new training run with an external embedding file, but none was given
raise ValueError("use_external_speaker_embedding_file is True, so you need to pass an external speaker embedding file. Run the GE2E-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb or AngularPrototypical-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb notebook in the notebooks/ folder to generate it.")
else: # if starting a new training run without an external embedding file
speaker_mapping = {name: i for i, name in enumerate(speakers)}
speaker_embedding_dim = None
save_speaker_mapping(OUT_PATH, speaker_mapping)
num_speakers = len(speaker_mapping)
print("Training with {} speakers: {}".format(len(speakers),
", ".join(speakers)))
else:
num_speakers = 0
speaker_embedding_dim = None
speaker_mapping = None
return num_speakers, speaker_embedding_dim, speaker_mapping
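
For reference, the two `speaker_mapping` layouts handled above look roughly like this (keys and numbers are made up; only the `'embedding'` field is actually read here):

```python
# Illustrative speaker_mapping layouts; keys and values are made up.
# Plain multi-speaker training: speaker name -> integer id
speaker_mapping_plain = {"speaker_a": 0, "speaker_b": 1}

# External speaker embedding file: entry -> dict holding an 'embedding' vector
speaker_mapping_ext = {"clip_0001": {"embedding": [0.01] * 256}}
speaker_embedding_dim = len(speaker_mapping_ext[list(speaker_mapping_ext.keys())[0]]["embedding"])  # 256
```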

TTS/tts/utils/ssim.py Normal file
View File

@ -0,0 +1,75 @@
# taken from https://github.com/Po-Hsun-Su/pytorch-ssim
from math import exp
import torch
import torch.nn.functional as F
from torch.autograd import Variable
def gaussian(window_size, sigma):
gauss = torch.Tensor([exp(-(x - window_size//2)**2/float(2*sigma**2)) for x in range(window_size)])
return gauss/gauss.sum()
def create_window(window_size, channel):
_1D_window = gaussian(window_size, 1.5).unsqueeze(1)
_2D_window = _1D_window.mm(_1D_window.t()).float().unsqueeze(0).unsqueeze(0)
window = Variable(_2D_window.expand(channel, 1, window_size, window_size).contiguous())
return window
def _ssim(img1, img2, window, window_size, channel, size_average = True):
mu1 = F.conv2d(img1, window, padding = window_size//2, groups = channel)
mu2 = F.conv2d(img2, window, padding = window_size//2, groups = channel)
mu1_sq = mu1.pow(2)
mu2_sq = mu2.pow(2)
mu1_mu2 = mu1*mu2
sigma1_sq = F.conv2d(img1*img1, window, padding = window_size//2, groups = channel) - mu1_sq
sigma2_sq = F.conv2d(img2*img2, window, padding = window_size//2, groups = channel) - mu2_sq
sigma12 = F.conv2d(img1*img2, window, padding = window_size//2, groups = channel) - mu1_mu2
C1 = 0.01**2
C2 = 0.03**2
ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2))/((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2))
if size_average:
return ssim_map.mean()
return ssim_map.mean(1).mean(1).mean(1)
class SSIM(torch.nn.Module):
def __init__(self, window_size = 11, size_average = True):
super().__init__()
self.window_size = window_size
self.size_average = size_average
self.channel = 1
self.window = create_window(window_size, self.channel)
def forward(self, img1, img2):
(_, channel, _, _) = img1.size()
if channel == self.channel and self.window.data.type() == img1.data.type():
window = self.window
else:
window = create_window(self.window_size, channel)
if img1.is_cuda:
window = window.cuda(img1.get_device())
window = window.type_as(img1)
self.window = window
self.channel = channel
return _ssim(img1, img2, window, self.window_size, channel, self.size_average)
def ssim(img1, img2, window_size = 11, size_average = True):
(_, channel, _, _) = img1.size()
window = create_window(window_size, channel)
if img1.is_cuda:
window = window.cuda(img1.get_device())
window = window.type_as(img1)
return _ssim(img1, img2, window, window_size, channel, size_average)
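
`ssim()` treats its inputs as batches of single-channel images, which is why `SSIMLoss` in `losses.py` unsqueezes a channel dimension before calling it. A minimal usage sketch:

```python
# Minimal usage sketch: SSIM between two mel spectrograms viewed as 1-channel images.
import torch
from TTS.tts.utils.ssim import ssim

mel_a = torch.rand(4, 1, 80, 200)   # [batch, channel, n_mels, frames]
mel_b = torch.rand(4, 1, 80, 200)
score = ssim(mel_a, mel_b)          # scalar; closer to 1 means more similar
print(score.item())
```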

View File

@ -14,10 +14,13 @@ def text_to_seqvec(text, CONFIG):
seq = np.asarray(
phoneme_to_sequence(text, text_cleaner, CONFIG.phoneme_language,
CONFIG.enable_eos_bos_chars,
tp=CONFIG.characters if 'characters' in CONFIG.keys() else None),
tp=CONFIG.characters if 'characters' in CONFIG.keys() else None,
add_blank=CONFIG['add_blank'] if 'add_blank' in CONFIG.keys() else False),
dtype=np.int32)
else:
seq = np.asarray(text_to_sequence(text, text_cleaner, tp=CONFIG.characters if 'characters' in CONFIG.keys() else None), dtype=np.int32)
seq = np.asarray(
text_to_sequence(text, text_cleaner, tp=CONFIG.characters if 'characters' in CONFIG.keys() else None,
add_blank=CONFIG['add_blank'] if 'add_blank' in CONFIG.keys() else False), dtype=np.int32)
return seq
@ -59,7 +62,7 @@ def run_model_torch(model, inputs, CONFIG, truncated, speaker_id=None, style_mel
inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
elif 'glow' in CONFIG.model.lower():
inputs_lengths = torch.tensor(inputs.shape[1:2]).to(inputs.device) # pylint: disable=not-callable
postnet_output, _, _, _, alignments, _, _ = model.inference(inputs, inputs_lengths)
postnet_output, _, _, _, alignments, _, _ = model.inference(inputs, inputs_lengths, g=speaker_id if speaker_id else speaker_embeddings)
postnet_output = postnet_output.permute(0, 2, 1)
# these only belong to tacotron models.
decoder_output = None
@ -207,7 +210,7 @@ def synthesis(model,
"""
# GST processing
style_mel = None
if CONFIG.use_gst and style_wav is not None:
if 'use_gst' in CONFIG.keys() and CONFIG.use_gst and style_wav is not None:
if isinstance(style_wav, dict):
style_mel = style_wav
else:

View File

@ -16,6 +16,8 @@ _id_to_symbol = {i: s for i, s in enumerate(symbols)}
_phonemes_to_id = {s: i for i, s in enumerate(phonemes)}
_id_to_phonemes = {i: s for i, s in enumerate(phonemes)}
_symbols = symbols
_phonemes = phonemes
# Regular expression matching text enclosed in curly braces:
_CURLY_RE = re.compile(r'(.*?)\{(.+?)\}(.*)')
@ -57,6 +59,10 @@ def text2phone(text, language):
return ph
def intersperse(sequence, token):
result = [token] * (len(sequence) * 2 + 1)
result[1::2] = sequence
return result
def pad_with_eos_bos(phoneme_sequence, tp=None):
# pylint: disable=global-statement
@ -69,10 +75,9 @@ def pad_with_eos_bos(phoneme_sequence, tp=None):
return [_phonemes_to_id[_bos]] + list(phoneme_sequence) + [_phonemes_to_id[_eos]]
def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=None):
def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=None, add_blank=False):
# pylint: disable=global-statement
global _phonemes_to_id
global _phonemes_to_id, _phonemes
if tp:
_, _phonemes = make_symbols(**tp)
_phonemes_to_id = {s: i for i, s in enumerate(_phonemes)}
@ -88,13 +93,17 @@ def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=
# Append EOS char
if enable_eos_bos:
sequence = pad_with_eos_bos(sequence, tp=tp)
if add_blank:
sequence = intersperse(sequence, len(_phonemes)) # add a blank token (new), whose id number is len(_phonemes)
return sequence
def sequence_to_phoneme(sequence, tp=None):
def sequence_to_phoneme(sequence, tp=None, add_blank=False):
# pylint: disable=global-statement
'''Converts a sequence of IDs back to a string'''
global _id_to_phonemes
global _id_to_phonemes, _phonemes
if add_blank:
sequence = list(filter(lambda x: x != len(_phonemes), sequence))
result = ''
if tp:
_, _phonemes = make_symbols(**tp)
@ -107,7 +116,7 @@ def sequence_to_phoneme(sequence, tp=None):
return result.replace('}{', ' ')
def text_to_sequence(text, cleaner_names, tp=None):
def text_to_sequence(text, cleaner_names, tp=None, add_blank=False):
'''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
The text can optionally have ARPAbet sequences enclosed in curly braces embedded
@ -121,7 +130,7 @@ def text_to_sequence(text, cleaner_names, tp=None):
List of integers corresponding to the symbols in the text
'''
# pylint: disable=global-statement
global _symbol_to_id
global _symbol_to_id, _symbols
if tp:
_symbols, _ = make_symbols(**tp)
_symbol_to_id = {s: i for i, s in enumerate(_symbols)}
@ -137,13 +146,19 @@ def text_to_sequence(text, cleaner_names, tp=None):
_clean_text(m.group(1), cleaner_names))
sequence += _arpabet_to_sequence(m.group(2))
text = m.group(3)
if add_blank:
sequence = intersperse(sequence, len(_symbols)) # add a blank token (new), whose id number is len(_symbols)
return sequence
def sequence_to_text(sequence, tp=None):
def sequence_to_text(sequence, tp=None, add_blank=False):
'''Converts a sequence of IDs back to a string'''
# pylint: disable=global-statement
global _id_to_symbol
global _id_to_symbol, _symbols
if add_blank:
sequence = list(filter(lambda x: x != len(_symbols), sequence))
if tp:
_symbols, _ = make_symbols(**tp)
_id_to_symbol = {i: s for i, s in enumerate(_symbols)}
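A quick round-trip sketch of the add_blank logic above: it interleaves a blank token between every pair of ids (and at both ends), and the reverse functions filter it back out before decoding. The snippet assumes intersperse is importable from this module and uses a made-up three-symbol table:

# assuming intersperse is importable from the module above
from TTS.tts.utils.text import intersperse

symbols = ['a', 'b', 'c']            # hypothetical symbol table
blank_id = len(symbols)              # the blank id sits one past the last symbol id
seq = [0, 2, 1]
with_blank = intersperse(seq, blank_id)
assert with_blank == [3, 0, 3, 2, 3, 1, 3]
# the *_to_text / *_to_phoneme functions drop the blanks again before decoding
assert [i for i in with_blank if i != blank_id] == seq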

View File

@ -1,6 +1,8 @@
import torch
import librosa
import matplotlib
import numpy as np
import torch
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from TTS.tts.utils.text import phoneme_to_sequence, sequence_to_phoneme
@ -43,6 +45,8 @@ def plot_spectrogram(spectrogram,
spectrogram_ = spectrogram.detach().cpu().numpy().squeeze().T
else:
spectrogram_ = spectrogram.T
spectrogram_ = spectrogram_.astype(
np.float32) if spectrogram_.dtype == np.float16 else spectrogram_
if ap is not None:
spectrogram_ = ap._denormalize(spectrogram_) # pylint: disable=protected-access
fig = plt.figure(figsize=fig_size)

View File

@ -174,7 +174,7 @@ class AudioProcessor(object):
for key in stats_config.keys():
if key in skip_parameters:
continue
if key != 'sample_rate':
if key not in ['sample_rate', 'trim_db']:
assert stats_config[key] == self.__dict__[key],\
f" [!] Audio param {key} does not match the value used for computing mean-var stats. {stats_config[key]} vs {self.__dict__[key]}"
return mel_mean, mel_std, linear_mean, linear_std, stats_config

View File

@ -1,9 +1,11 @@
import os
import glob
import shutil
import datetime
import glob
import os
import shutil
import subprocess
import torch
def get_git_branch():
try:

View File

@ -1,5 +1,7 @@
import os
import re
import json
import yaml
import pickle as pickle_tts
@ -17,19 +19,27 @@ class AttrDict(dict):
self.__dict__ = self
def load_config(config_path):
def load_config(config_path: str) -> AttrDict:
"""Load config files and discard comments
Args:
config_path (str): path to config file.
"""
config = AttrDict()
with open(config_path, "r") as f:
input_str = f.read()
# handle comments
input_str = re.sub(r'\\\n', '', input_str)
input_str = re.sub(r'//.*\n', '\n', input_str)
data = json.loads(input_str)
ext = os.path.splitext(config_path)[1]
if ext in (".yml", ".yaml"):
with open(config_path, "r") as f:
data = yaml.safe_load(f)
else:
# fallback to json
with open(config_path, "r") as f:
input_str = f.read()
# handle comments
input_str = re.sub(r'\\\n', '', input_str)
input_str = re.sub(r'//.*\n', '\n', input_str)
data = json.loads(input_str)
config.update(data)
return config
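A short usage sketch of the extended loader; the module path and the temporary file names are assumptions, but it exercises both branches (YAML via yaml.safe_load, JSON with //-comments via the regex fallback):

import os
import tempfile
# assuming this module is importable as TTS.utils.io; adjust the path if it differs
from TTS.utils.io import load_config

json_cfg = '{\n  // comments are stripped before parsing\n  "batch_size": 32\n}\n'
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(json_cfg)
    json_path = f.name
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write("batch_size: 32\n")
    yaml_path = f.name

assert load_config(json_path)["batch_size"] == load_config(yaml_path)["batch_size"] == 32
os.remove(json_path)
os.remove(yaml_path)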

View File

@ -92,8 +92,8 @@
// DATASET
"data_path": "/home/erogol/Data/MozillaMerged22050/wavs/",
"feature_path": null,
"seq_len": 16384,
"pad_short": 2000,
"seq_len": 6144,
"pad_short": 500,
"conv_pad": 0,
"use_noise_augment": false,
"use_cache": true,
@ -102,6 +102,16 @@
// TRAINING
"batch_size": 64, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
"train_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 1000
},
"test_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 50
},
// VALIDATION
"run_eval": true,

View File

@ -0,0 +1,138 @@
{
"run_name": "fullband-melgan",
"run_description": "fullband melgan mean-var scaling",
// AUDIO PARAMETERS
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": "/home/erogol/Data/libritts/LibriTTS/scale_stats.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
"distributed":{
"backend": "nccl",
"url": "tcp:\/\/localhost:54324"
},
// MODEL PARAMETERS
"use_pqmf": false,
// LOSS PARAMETERS
"use_stft_loss": true,
"use_subband_stft_loss": false,
"use_mse_gan_loss": true,
"use_hinge_gan_loss": false,
"use_feat_match_loss": false, // use only with melgan discriminators
// loss weights
"stft_loss_weight": 0.5,
"subband_stft_loss_weight": 0.5,
"mse_G_loss_weight": 2.5,
"hinge_G_loss_weight": 2.5,
"feat_match_loss_weight": 25,
// multiscale stft loss parameters
"stft_loss_params": {
"n_ffts": [1024, 2048, 512],
"hop_lengths": [120, 240, 50],
"win_lengths": [600, 1200, 240]
},
"target_loss": "avg_G_loss", // loss value to pick the best model to save after each epoch
// DISCRIMINATOR
"discriminator_model": "melgan_multiscale_discriminator",
"discriminator_model_params":{
"base_channels": 16,
"max_channels":512,
"downsample_factors":[4, 4, 4]
},
"steps_to_start_discriminator": 200000, // steps required to start GAN trainining.1
// GENERATOR
"generator_model": "fullband_melgan_generator",
"generator_model_params": {
"upsample_factors":[8, 8, 4],
"num_res_blocks": 4
},
// DATASET
"data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/",
"feature_path": null,
"seq_len": 16384,
"pad_short": 2000,
"conv_pad": 0,
"use_noise_augment": false,
"use_cache": true,
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 48, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
// VALIDATION
"run_eval": true,
"test_delay_epochs": 10, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
"epochs": 10000, // total number of epochs to train.
"wd": 0.0, // Weight decay weight.
"gen_clip_grad": -1, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"disc_clip_grad": -1, // Discriminator gradient clipping threshold.
"lr_scheduler_gen": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_gen_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_scheduler_disc": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_disc_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_gen": 0.000015625, // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_disc": 0.000015625,
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 25000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 10,
// PATHS
"output_path": "/home/erogol/Models/"
}

View File

@ -0,0 +1,116 @@
{
"run_name": "wavegrad-libritts",
"run_description": "wavegrad libritts",
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": "/home/erogol/Data/libritts/LibriTTS/scale_stats_wavegrad.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
"mixed_precision": true, // enable torch mixed precision training (true, false)
"distributed":{
"backend": "nccl",
"url": "tcp:\/\/localhost:54322"
},
"target_loss": "avg_wavegrad_loss", // loss value to pick the best model to save after each epoch
// MODEL PARAMETERS
"generator_model": "wavegrad",
"model_params":{
"use_weight_norm": true,
"y_conv_channels":32,
"x_conv_channels":768,
"ublock_out_channels": [512, 512, 256, 128, 128],
"dblock_out_channels": [128, 128, 256, 512],
"upsample_factors": [4, 4, 4, 2, 2],
"upsample_dilations": [
[1, 2, 1, 2],
[1, 2, 1, 2],
[1, 2, 4, 8],
[1, 2, 4, 8],
[1, 2, 4, 8]]
},
// DATASET
"data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/", // root data path. It finds all wav files recursively from there.
"feature_path": null, // if you use precomputed features
"seq_len": 6144, // 24 * hop_length
"pad_short": 0, // additional padding for short wavs
"conv_pad": 0, // additional padding against convolutions applied to spectrograms
"use_noise_augment": false, // add noise to the audio signal for augmentation
"use_cache": false, // use in memory cache to keep the computed features. This might cause OOM.
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 96, // Batch size for training.
// NOISE SCHEDULE PARAMS - Only effective at training time.
"train_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 1000
},
"test_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 50
},
// VALIDATION
"run_eval": true, // enable/disable evaluation run
// OPTIMIZER
"epochs": 10000, // total number of epochs to train.
"clip_grad": 1.0, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.
// TENSORBOARD and LOGGING
"print_step": 50, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 5000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 256,
// PATHS
"output_path": "/home/erogol/Models/LJSpeech/"
}
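The train_noise_schedule / test_noise_schedule blocks above only store endpoints and a step count. One plausible reading, assuming the training script expands them into an evenly spaced beta schedule (check the wavegrad training code for the exact construction):

import numpy as np

train_noise_schedule = {"min_val": 1e-6, "max_val": 1e-2, "num_steps": 1000}
test_noise_schedule = {"min_val": 1e-6, "max_val": 1e-2, "num_steps": 50}

def make_beta(schedule):
    # Evenly spaced beta values between the two endpoints (an assumption;
    # the training script may build the schedule differently).
    return np.linspace(schedule["min_val"], schedule["max_val"], schedule["num_steps"])

train_beta = make_beta(train_noise_schedule)
test_beta = make_beta(test_noise_schedule)
assert train_beta.shape == (1000,) and test_beta.shape == (50,)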

View File

@ -0,0 +1,98 @@
{
"run_name": "wavernn_librittts",
"run_description": "wavernn libritts training from LJSpeech model",
// AUDIO PARAMETERS
"audio": {
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": false, // enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 40.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 20.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// Generating / Synthesizing
"batched": true,
"target_samples": 11000, // target number of samples to be generated in each batch entry
"overlap_samples": 550, // number of samples for crossfading between batches
// DISTRIBUTED TRAINING
// "distributed":{
// "backend": "nccl",
// "url": "tcp:\/\/localhost:54321"
// },
// MODEL MODE
"mode": "mold", // mold [string], gauss [string], bits [int]
"mulaw": true, // apply mulaw if mode is bits
// MODEL PARAMETERS
"wavernn_model_params": {
"rnn_dims": 512,
"fc_dims": 512,
"compute_dims": 128,
"res_out_dims": 128,
"num_res_blocks": 10,
"use_aux_net": true,
"use_upsample_net": true,
"upsample_factors": [4, 8, 8] // this needs to correctly factorise hop_length
},
// DATASET
//"use_gta": true, // use computed gta features from the tts model
"data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/", // path containing training wav files
"feature_path": null, // path containing computed features from wav files if null compute them
"seq_len": 1280, // has to be devideable by hop_length
"padding": 2, // pad the input for resnet to see wider input length
// TRAINING
"batch_size": 256, // Batch size for training.
"epochs": 10000, // total number of epochs to train.
"mixed_precision": true, // enable/ disable mixed precision training
// VALIDATION
"run_eval": true,
"test_every_epochs": 10, // Test after set number of epochs (Test every 10 epochs for example)
// OPTIMIZER
"grad_clip": 4, // apply gradient clipping if > 0
"lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_params": {
"gamma": 0.5,
"milestones": [200000, 400000, 600000]
},
"lr": 1e-4, // initial learning rate
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 25000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 50, // number of samples for testing
// PATHS
"output_path": "/home/erogol/Models/LJSpeech/"
}

View File

@ -1,17 +1,38 @@
import glob
import os
from pathlib import Path
from tqdm import tqdm
import numpy as np
def preprocess_wav_files(out_path, config, ap):
os.makedirs(os.path.join(out_path, "quant"), exist_ok=True)
os.makedirs(os.path.join(out_path, "mel"), exist_ok=True)
wav_files = find_wav_files(config.data_path)
for path in tqdm(wav_files):
wav_name = Path(path).stem
quant_path = os.path.join(out_path, "quant", wav_name + ".npy")
mel_path = os.path.join(out_path, "mel", wav_name + ".npy")
y = ap.load_wav(path)
mel = ap.melspectrogram(y)
np.save(mel_path, mel)
if isinstance(config.mode, int):
quant = (
ap.mulaw_encode(y, qc=config.mode)
if config.mulaw
else ap.quantize(y, bits=config.mode)
)
np.save(quant_path, quant)
def find_wav_files(data_path):
wav_paths = glob.glob(os.path.join(data_path, '**', '*.wav'), recursive=True)
wav_paths = glob.glob(os.path.join(data_path, "**", "*.wav"), recursive=True)
return wav_paths
def find_feat_files(data_path):
feat_paths = glob.glob(os.path.join(data_path, '**', '*.npy'), recursive=True)
feat_paths = glob.glob(os.path.join(data_path, "**", "*.npy"), recursive=True)
return feat_paths
@ -23,8 +44,12 @@ def load_wav_data(data_path, eval_split_size):
def load_wav_feat_data(data_path, feat_path, eval_split_size):
wav_paths = sorted(find_wav_files(data_path))
feat_paths = sorted(find_feat_files(feat_path))
wav_paths = find_wav_files(data_path)
feat_paths = find_feat_files(feat_path)
wav_paths.sort(key=lambda x: Path(x).stem)
feat_paths.sort(key=lambda x: Path(x).stem)
assert len(wav_paths) == len(feat_paths)
for wav, feat in zip(wav_paths, feat_paths):
wav_name = Path(wav).stem
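Sorting both lists by file stem rather than by full path keeps wav/feature pairs aligned even when the two trees use different directory layouts; a self-contained sketch with hypothetical paths:

from pathlib import Path

wav_paths = ["/data/wavs/b/clip2.wav", "/data/wavs/a/clip1.wav"]   # hypothetical layout
feat_paths = ["/cache/mel/clip1.npy", "/cache/mel/clip2.npy"]
wav_paths.sort(key=lambda x: Path(x).stem)
feat_paths.sort(key=lambda x: Path(x).stem)
assert [Path(w).stem for w in wav_paths] == [Path(f).stem for f in feat_paths]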

View File

@ -0,0 +1,131 @@
import os
import glob
import torch
import random
import numpy as np
from torch.utils.data import Dataset
from multiprocessing import Manager
class WaveGradDataset(Dataset):
"""
WaveGrad Dataset searches for all the wav files under the root path,
converts them to acoustic features on the fly and returns
random segments of (audio, feature) pairs.
"""
def __init__(self,
ap,
items,
seq_len,
hop_len,
pad_short,
conv_pad=2,
is_training=True,
return_segments=True,
use_noise_augment=False,
use_cache=False,
verbose=False):
self.ap = ap
self.item_list = items
self.seq_len = seq_len if return_segments else None
self.hop_len = hop_len
self.pad_short = pad_short
self.conv_pad = conv_pad
self.is_training = is_training
self.return_segments = return_segments
self.use_cache = use_cache
self.use_noise_augment = use_noise_augment
self.verbose = verbose
if return_segments:
assert seq_len % hop_len == 0, " [!] seq_len has to be a multiple of hop_len."
self.feat_frame_len = seq_len // hop_len + (2 * conv_pad)
# cache acoustic features
if use_cache:
self.create_feature_cache()
def create_feature_cache(self):
self.manager = Manager()
self.cache = self.manager.list()
self.cache += [None for _ in range(len(self.item_list))]
@staticmethod
def find_wav_files(path):
return glob.glob(os.path.join(path, '**', '*.wav'), recursive=True)
def __len__(self):
return len(self.item_list)
def __getitem__(self, idx):
item = self.load_item(idx)
return item
def load_test_samples(self, num_samples):
samples = []
return_segments = self.return_segments
self.return_segments = False
for idx in range(num_samples):
mel, audio = self.load_item(idx)
samples.append([mel, audio])
self.return_segments = return_segments
return samples
def load_item(self, idx):
""" load (audio, feat) couple """
# compute features from wav
wavpath = self.item_list[idx]
if self.use_cache and self.cache[idx] is not None:
audio = self.cache[idx]
else:
audio = self.ap.load_wav(wavpath)
if self.return_segments:
# correct audio length wrt segment length
if audio.shape[-1] < self.seq_len + self.pad_short:
audio = np.pad(audio, (0, self.seq_len + self.pad_short - len(audio)), \
mode='constant', constant_values=0.0)
assert audio.shape[-1] >= self.seq_len + self.pad_short, f"{audio.shape[-1]} vs {self.seq_len + self.pad_short}"
# correct the audio length wrt hop length
p = (audio.shape[-1] // self.hop_len + 1) * self.hop_len - audio.shape[-1]
audio = np.pad(audio, (0, p), mode='constant', constant_values=0.0)
if self.use_cache:
self.cache[idx] = audio
if self.return_segments:
max_start = len(audio) - self.seq_len
start = random.randint(0, max_start)
end = start + self.seq_len
audio = audio[start:end]
if self.use_noise_augment and self.is_training and self.return_segments:
audio = audio + (1 / 32768) * torch.randn_like(audio)
mel = self.ap.melspectrogram(audio)
mel = mel[..., :-1] # ignore the padding
audio = torch.from_numpy(audio).float()
mel = torch.from_numpy(mel).float().squeeze(0)
return (mel, audio)
@staticmethod
def collate_full_clips(batch):
"""This is used in tune_wavegrad.py.
It pads sequences to the max length."""
max_mel_length = max([b[0].shape[1] for b in batch]) if len(batch) > 1 else batch[0][0].shape[1]
max_audio_length = max([b[1].shape[0] for b in batch]) if len(batch) > 1 else batch[0][1].shape[0]
mels = torch.zeros([len(batch), batch[0][0].shape[0], max_mel_length])
audios = torch.zeros([len(batch), max_audio_length])
for idx, b in enumerate(batch):
mel = b[0]
audio = b[1]
mels[idx, :, :mel.shape[1]] = mel
audios[idx, :audio.shape[0]] = audio
return mels, audios
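collate_full_clips zero-pads a batch of variable-length (mel, audio) pairs to the longest clip. A standalone shape check with dummy tensors, assuming the import path below and a hop length of 256:

import torch
# assuming the class above is importable, e.g.
# from TTS.vocoder.datasets.wavegrad_dataset import WaveGradDataset

batch = [
    (torch.randn(80, 50), torch.randn(50 * 256)),  # (mel, audio) pair, hypothetical hop of 256
    (torch.randn(80, 30), torch.randn(30 * 256)),
]
mels, audios = WaveGradDataset.collate_full_clips(batch)
assert mels.shape == (2, 80, 50) and audios.shape == (2, 50 * 256)
# the shorter clip is zero-padded on the right
assert torch.all(mels[1, :, 30:] == 0) and torch.all(audios[1, 30 * 256:] == 0)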

View File

@ -0,0 +1,118 @@
import torch
import numpy as np
from torch.utils.data import Dataset
class WaveRNNDataset(Dataset):
"""
WaveRNN Dataset searches for all the wav files under the root path
and converts them to acoustic features on the fly.
"""
def __init__(self,
ap,
items,
seq_len,
hop_len,
pad,
mode,
mulaw,
is_training=True,
verbose=False,
):
self.ap = ap
self.compute_feat = not isinstance(items[0], (tuple, list))
self.item_list = items
self.seq_len = seq_len
self.hop_len = hop_len
self.mel_len = seq_len // hop_len
self.pad = pad
self.mode = mode
self.mulaw = mulaw
self.is_training = is_training
self.verbose = verbose
assert self.seq_len % self.hop_len == 0
def __len__(self):
return len(self.item_list)
def __getitem__(self, index):
item = self.load_item(index)
return item
def load_item(self, index):
"""
load (audio, feat) couple if feature_path is set
else compute it on the fly
"""
if self.compute_feat:
wavpath = self.item_list[index]
audio = self.ap.load_wav(wavpath)
min_audio_len = 2 * self.seq_len + (2 * self.pad * self.hop_len)
if audio.shape[0] < min_audio_len:
print(" [!] Instance is too short! : {}".format(wavpath))
audio = np.pad(audio, [0, min_audio_len - audio.shape[0] + self.hop_len])
mel = self.ap.melspectrogram(audio)
if self.mode in ["gauss", "mold"]:
x_input = audio
elif isinstance(self.mode, int):
x_input = (self.ap.mulaw_encode(audio, qc=self.mode)
if self.mulaw else self.ap.quantize(audio, bits=self.mode))
else:
raise RuntimeError("Unknown dataset mode - ", self.mode)
else:
wavpath, feat_path = self.item_list[index]
mel = np.load(feat_path.replace("/quant/", "/mel/"))
if mel.shape[-1] < self.mel_len + 2 * self.pad:
print(" [!] Instance is too short! : {}".format(wavpath))
self.item_list[index] = self.item_list[index + 1]
wavpath, feat_path = self.item_list[index]
mel = np.load(feat_path.replace("/quant/", "/mel/"))
if self.mode in ["gauss", "mold"]:
x_input = self.ap.load_wav(wavpath)
elif isinstance(self.mode, int):
x_input = np.load(feat_path.replace("/mel/", "/quant/"))
else:
raise RuntimeError("Unknown dataset mode - ", self.mode)
return mel, x_input, wavpath
def collate(self, batch):
mel_win = self.seq_len // self.hop_len + 2 * self.pad
max_offsets = [x[0].shape[-1] -
(mel_win + 2 * self.pad) for x in batch]
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
sig_offsets = [(offset + self.pad) *
self.hop_len for offset in mel_offsets]
mels = [
x[0][:, mel_offsets[i]: mel_offsets[i] + mel_win]
for i, x in enumerate(batch)
]
coarse = [
x[1][sig_offsets[i]: sig_offsets[i] + self.seq_len + 1]
for i, x in enumerate(batch)
]
mels = np.stack(mels).astype(np.float32)
if self.mode in ["gauss", "mold"]:
coarse = np.stack(coarse).astype(np.float32)
coarse = torch.FloatTensor(coarse)
x_input = coarse[:, : self.seq_len]
elif isinstance(self.mode, int):
coarse = np.stack(coarse).astype(np.int64)
coarse = torch.LongTensor(coarse)
x_input = (2 * coarse[:, : self.seq_len].float() /
(2 ** self.mode - 1.0) - 1.0)
y_coarse = coarse[:, 1:]
mels = torch.FloatTensor(mels)
return x_input, mels, y_coarse

View File

@ -0,0 +1,175 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm
class Conv1d(nn.Conv1d):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
nn.init.orthogonal_(self.weight)
nn.init.zeros_(self.bias)
class PositionalEncoding(nn.Module):
"""Positional encoding with noise level conditioning"""
def __init__(self, n_channels, max_len=10000):
super().__init__()
self.n_channels = n_channels
self.max_len = max_len
self.C = 5000
self.pe = torch.zeros(0, 0)
def forward(self, x, noise_level):
if x.shape[2] > self.pe.shape[1]:
self.init_pe_matrix(x.shape[1], x.shape[2], x)
return x + noise_level[..., None, None] + self.pe[:, :x.size(2)].repeat(x.shape[0], 1, 1) / self.C
def init_pe_matrix(self, n_channels, max_len, x):
pe = torch.zeros(max_len, n_channels)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.pow(10000, torch.arange(0, n_channels, 2).float() / n_channels)
pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)
self.pe = pe.transpose(0, 1).to(x)
class FiLM(nn.Module):
def __init__(self, input_size, output_size):
super().__init__()
self.encoding = PositionalEncoding(input_size)
self.input_conv = nn.Conv1d(input_size, input_size, 3, padding=1)
self.output_conv = nn.Conv1d(input_size, output_size * 2, 3, padding=1)
nn.init.xavier_uniform_(self.input_conv.weight)
nn.init.xavier_uniform_(self.output_conv.weight)
nn.init.zeros_(self.input_conv.bias)
nn.init.zeros_(self.output_conv.bias)
def forward(self, x, noise_scale):
o = self.input_conv(x)
o = F.leaky_relu(o, 0.2)
o = self.encoding(o, noise_scale)
shift, scale = torch.chunk(self.output_conv(o), 2, dim=1)
return shift, scale
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.input_conv)
nn.utils.remove_weight_norm(self.output_conv)
def apply_weight_norm(self):
self.input_conv = weight_norm(self.input_conv)
self.output_conv = weight_norm(self.output_conv)
@torch.jit.script
def shif_and_scale(x, scale, shift):
o = shift + scale * x
return o
class UBlock(nn.Module):
def __init__(self, input_size, hidden_size, factor, dilation):
super().__init__()
assert isinstance(dilation, (list, tuple))
assert len(dilation) == 4
self.factor = factor
self.res_block = Conv1d(input_size, hidden_size, 1)
self.main_block = nn.ModuleList([
Conv1d(input_size,
hidden_size,
3,
dilation=dilation[0],
padding=dilation[0]),
Conv1d(hidden_size,
hidden_size,
3,
dilation=dilation[1],
padding=dilation[1])
])
self.out_block = nn.ModuleList([
Conv1d(hidden_size,
hidden_size,
3,
dilation=dilation[2],
padding=dilation[2]),
Conv1d(hidden_size,
hidden_size,
3,
dilation=dilation[3],
padding=dilation[3])
])
def forward(self, x, shift, scale):
x_inter = F.interpolate(x, size=x.shape[-1] * self.factor)
res = self.res_block(x_inter)
o = F.leaky_relu(x_inter, 0.2)
o = F.interpolate(o, size=x.shape[-1] * self.factor)
o = self.main_block[0](o)
o = shif_and_scale(o, scale, shift)
o = F.leaky_relu(o, 0.2)
o = self.main_block[1](o)
res2 = res + o
o = shif_and_scale(res2, scale, shift)
o = F.leaky_relu(o, 0.2)
o = self.out_block[0](o)
o = shif_and_scale(o, scale, shift)
o = F.leaky_relu(o, 0.2)
o = self.out_block[1](o)
o = o + res2
return o
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.res_block)
for _, layer in enumerate(self.main_block):
if len(layer.state_dict()) != 0:
nn.utils.remove_weight_norm(layer)
for _, layer in enumerate(self.out_block):
if len(layer.state_dict()) != 0:
nn.utils.remove_weight_norm(layer)
def apply_weight_norm(self):
self.res_block = weight_norm(self.res_block)
for idx, layer in enumerate(self.main_block):
if len(layer.state_dict()) != 0:
self.main_block[idx] = weight_norm(layer)
for idx, layer in enumerate(self.out_block):
if len(layer.state_dict()) != 0:
self.out_block[idx] = weight_norm(layer)
class DBlock(nn.Module):
def __init__(self, input_size, hidden_size, factor):
super().__init__()
self.factor = factor
self.res_block = Conv1d(input_size, hidden_size, 1)
self.main_block = nn.ModuleList([
Conv1d(input_size, hidden_size, 3, dilation=1, padding=1),
Conv1d(hidden_size, hidden_size, 3, dilation=2, padding=2),
Conv1d(hidden_size, hidden_size, 3, dilation=4, padding=4),
])
def forward(self, x):
size = x.shape[-1] // self.factor
res = self.res_block(x)
res = F.interpolate(res, size=size)
o = F.interpolate(x, size=size)
for layer in self.main_block:
o = F.leaky_relu(o, 0.2)
o = layer(o)
return o + res
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.res_block)
for _, layer in enumerate(self.main_block):
if len(layer.state_dict()) != 0:
nn.utils.remove_weight_norm(layer)
def apply_weight_norm(self):
self.res_block = weight_norm(self.res_block)
for idx, layer in enumerate(self.main_block):
if len(layer.state_dict()) != 0:
self.main_block[idx] = weight_norm(layer)

View File

@ -0,0 +1,177 @@
import numpy as np
import torch
from torch import nn
from torch.nn.utils import weight_norm
from ..layers.wavegrad import DBlock, FiLM, UBlock, Conv1d
class Wavegrad(nn.Module):
# pylint: disable=dangerous-default-value
def __init__(self,
in_channels=80,
out_channels=1,
use_weight_norm=False,
y_conv_channels=32,
x_conv_channels=768,
dblock_out_channels=[128, 128, 256, 512],
ublock_out_channels=[512, 512, 256, 128, 128],
upsample_factors=[5, 5, 3, 2, 2],
upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8],
[1, 2, 4, 8], [1, 2, 4, 8]]):
super().__init__()
self.use_weight_norm = use_weight_norm
self.hop_len = np.prod(upsample_factors)
self.noise_level = None
self.num_steps = None
self.beta = None
self.alpha = None
self.alpha_hat = None
self.noise_level = None
self.c1 = None
self.c2 = None
self.sigma = None
# dblocks
self.y_conv = Conv1d(1, y_conv_channels, 5, padding=2)
self.dblocks = nn.ModuleList([])
ic = y_conv_channels
for oc, df in zip(dblock_out_channels, reversed(upsample_factors)):
self.dblocks.append(DBlock(ic, oc, df))
ic = oc
# film
self.film = nn.ModuleList([])
ic = y_conv_channels
for oc in reversed(ublock_out_channels):
self.film.append(FiLM(ic, oc))
ic = oc
# ublocks
self.ublocks = nn.ModuleList([])
ic = x_conv_channels
for oc, uf, ud in zip(ublock_out_channels, upsample_factors, upsample_dilations):
self.ublocks.append(UBlock(ic, oc, uf, ud))
ic = oc
self.x_conv = Conv1d(in_channels, x_conv_channels, 3, padding=1)
self.out_conv = Conv1d(oc, out_channels, 3, padding=1)
if use_weight_norm:
self.apply_weight_norm()
def forward(self, x, spectrogram, noise_scale):
shift_and_scale = []
x = self.y_conv(x)
shift_and_scale.append(self.film[0](x, noise_scale))
for film, layer in zip(self.film[1:], self.dblocks):
x = layer(x)
shift_and_scale.append(film(x, noise_scale))
x = self.x_conv(spectrogram)
for layer, (film_shift, film_scale) in zip(self.ublocks,
reversed(shift_and_scale)):
x = layer(x, film_shift, film_scale)
x = self.out_conv(x)
return x
def load_noise_schedule(self, path):
beta = np.load(path, allow_pickle=True).item()['beta']
self.compute_noise_level(beta)
@torch.no_grad()
def inference(self, x, y_n=None):
""" x: B x D X T """
if y_n is None:
y_n = torch.randn(x.shape[0], 1, self.hop_len * x.shape[-1], dtype=torch.float32).to(x)
else:
y_n = torch.FloatTensor(y_n).unsqueeze(0).unsqueeze(0).to(x)
sqrt_alpha_hat = self.noise_level.to(x)
for n in range(len(self.alpha) - 1, -1, -1):
y_n = self.c1[n] * (y_n -
self.c2[n] * self.forward(y_n, x, sqrt_alpha_hat[n].repeat(x.shape[0])))
if n > 0:
z = torch.randn_like(y_n)
y_n += self.sigma[n - 1] * z
y_n.clamp_(-1.0, 1.0)
return y_n
def compute_y_n(self, y_0):
"""Compute noisy audio based on noise schedule"""
self.noise_level = self.noise_level.to(y_0)
if len(y_0.shape) == 3:
y_0 = y_0.squeeze(1)
s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])
l_a, l_b = self.noise_level[s-1], self.noise_level[s]
noise_scale = l_a + torch.rand(y_0.shape[0]).to(y_0) * (l_b - l_a)
noise_scale = noise_scale.unsqueeze(1)
noise = torch.randn_like(y_0)
noisy_audio = noise_scale * y_0 + (1.0 - noise_scale**2)**0.5 * noise
return noise.unsqueeze(1), noisy_audio.unsqueeze(1), noise_scale[:, 0]
def compute_noise_level(self, beta):
"""Compute noise schedule parameters"""
self.num_steps = len(beta)
alpha = 1 - beta
alpha_hat = np.cumprod(alpha)
noise_level = np.concatenate([[1.0], alpha_hat ** 0.5], axis=0)  # length num_steps + 1 so indexing with s in [1, num_steps] stays in range
# pylint: disable=not-callable
self.beta = torch.tensor(beta.astype(np.float32))
self.alpha = torch.tensor(alpha.astype(np.float32))
self.alpha_hat = torch.tensor(alpha_hat.astype(np.float32))
self.noise_level = torch.tensor(noise_level.astype(np.float32))
self.c1 = 1 / self.alpha**0.5
self.c2 = (1 - self.alpha) / (1 - self.alpha_hat)**0.5
self.sigma = ((1.0 - self.alpha_hat[:-1]) / (1.0 - self.alpha_hat[1:]) * self.beta[1:])**0.5
def remove_weight_norm(self):
for _, layer in enumerate(self.dblocks):
if len(layer.state_dict()) != 0:
try:
nn.utils.remove_weight_norm(layer)
except ValueError:
layer.remove_weight_norm()
for _, layer in enumerate(self.film):
if len(layer.state_dict()) != 0:
try:
nn.utils.remove_weight_norm(layer)
except ValueError:
layer.remove_weight_norm()
for _, layer in enumerate(self.ublocks):
if len(layer.state_dict()) != 0:
try:
nn.utils.remove_weight_norm(layer)
except ValueError:
layer.remove_weight_norm()
nn.utils.remove_weight_norm(self.x_conv)
nn.utils.remove_weight_norm(self.out_conv)
nn.utils.remove_weight_norm(self.y_conv)
def apply_weight_norm(self):
for _, layer in enumerate(self.dblocks):
if len(layer.state_dict()) != 0:
layer.apply_weight_norm()
for _, layer in enumerate(self.film):
if len(layer.state_dict()) != 0:
layer.apply_weight_norm()
for _, layer in enumerate(self.ublocks):
if len(layer.state_dict()) != 0:
layer.apply_weight_norm()
self.x_conv = weight_norm(self.x_conv)
self.out_conv = weight_norm(self.out_conv)
self.y_conv = weight_norm(self.y_conv)

View File

@ -0,0 +1,501 @@
import sys
import torch
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
import time
# fix this
from TTS.utils.audio import AudioProcessor as ap
from TTS.vocoder.utils.distribution import (
sample_from_gaussian,
sample_from_discretized_mix_logistic,
)
def stream(string, variables):
sys.stdout.write(f"\r{string}" % variables)
# pylint: disable=abstract-method
# relates https://github.com/pytorch/pytorch/issues/42305
class ResBlock(nn.Module):
def __init__(self, dims):
super().__init__()
self.conv1 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
self.conv2 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
self.batch_norm1 = nn.BatchNorm1d(dims)
self.batch_norm2 = nn.BatchNorm1d(dims)
def forward(self, x):
residual = x
x = self.conv1(x)
x = self.batch_norm1(x)
x = F.relu(x)
x = self.conv2(x)
x = self.batch_norm2(x)
return x + residual
class MelResNet(nn.Module):
def __init__(self, num_res_blocks, in_dims, compute_dims, res_out_dims, pad):
super().__init__()
k_size = pad * 2 + 1
self.conv_in = nn.Conv1d(
in_dims, compute_dims, kernel_size=k_size, bias=False)
self.batch_norm = nn.BatchNorm1d(compute_dims)
self.layers = nn.ModuleList()
for _ in range(num_res_blocks):
self.layers.append(ResBlock(compute_dims))
self.conv_out = nn.Conv1d(compute_dims, res_out_dims, kernel_size=1)
def forward(self, x):
x = self.conv_in(x)
x = self.batch_norm(x)
x = F.relu(x)
for f in self.layers:
x = f(x)
x = self.conv_out(x)
return x
class Stretch2d(nn.Module):
def __init__(self, x_scale, y_scale):
super().__init__()
self.x_scale = x_scale
self.y_scale = y_scale
def forward(self, x):
b, c, h, w = x.size()
x = x.unsqueeze(-1).unsqueeze(3)
x = x.repeat(1, 1, 1, self.y_scale, 1, self.x_scale)
return x.view(b, c, h * self.y_scale, w * self.x_scale)
class UpsampleNetwork(nn.Module):
def __init__(
self,
feat_dims,
upsample_scales,
compute_dims,
num_res_blocks,
res_out_dims,
pad,
use_aux_net,
):
super().__init__()
self.total_scale = np.cumproduct(upsample_scales)[-1]
self.indent = pad * self.total_scale
self.use_aux_net = use_aux_net
if use_aux_net:
self.resnet = MelResNet(
num_res_blocks, feat_dims, compute_dims, res_out_dims, pad
)
self.resnet_stretch = Stretch2d(self.total_scale, 1)
self.up_layers = nn.ModuleList()
for scale in upsample_scales:
k_size = (1, scale * 2 + 1)
padding = (0, scale)
stretch = Stretch2d(scale, 1)
conv = nn.Conv2d(1, 1, kernel_size=k_size,
padding=padding, bias=False)
conv.weight.data.fill_(1.0 / k_size[1])
self.up_layers.append(stretch)
self.up_layers.append(conv)
def forward(self, m):
if self.use_aux_net:
aux = self.resnet(m).unsqueeze(1)
aux = self.resnet_stretch(aux)
aux = aux.squeeze(1)
aux = aux.transpose(1, 2)
else:
aux = None
m = m.unsqueeze(1)
for f in self.up_layers:
m = f(m)
m = m.squeeze(1)[:, :, self.indent: -self.indent]
return m.transpose(1, 2), aux
class Upsample(nn.Module):
def __init__(
self, scale, pad, num_res_blocks, feat_dims, compute_dims, res_out_dims, use_aux_net
):
super().__init__()
self.scale = scale
self.pad = pad
self.indent = pad * scale
self.use_aux_net = use_aux_net
self.resnet = MelResNet(num_res_blocks, feat_dims,
compute_dims, res_out_dims, pad)
def forward(self, m):
if self.use_aux_net:
aux = self.resnet(m)
aux = torch.nn.functional.interpolate(
aux, scale_factor=self.scale, mode="linear", align_corners=True
)
aux = aux.transpose(1, 2)
else:
aux = None
m = torch.nn.functional.interpolate(
m, scale_factor=self.scale, mode="linear", align_corners=True
)
m = m[:, :, self.indent: -self.indent]
m = m * 0.045 # empirically found
return m.transpose(1, 2), aux
class WaveRNN(nn.Module):
def __init__(self,
rnn_dims,
fc_dims,
mode,
mulaw,
pad,
use_aux_net,
use_upsample_net,
upsample_factors,
feat_dims,
compute_dims,
res_out_dims,
num_res_blocks,
hop_length,
sample_rate,
):
super().__init__()
self.mode = mode
self.mulaw = mulaw
self.pad = pad
self.use_upsample_net = use_upsample_net
self.use_aux_net = use_aux_net
if isinstance(self.mode, int):
self.n_classes = 2 ** self.mode
elif self.mode == "mold":
self.n_classes = 3 * 10
elif self.mode == "gauss":
self.n_classes = 2
else:
raise RuntimeError("Unknown model mode value - ", self.mode)
self.rnn_dims = rnn_dims
self.aux_dims = res_out_dims // 4
self.hop_length = hop_length
self.sample_rate = sample_rate
if self.use_upsample_net:
assert (
np.cumproduct(upsample_factors)[-1] == self.hop_length
), " [!] upsample scales needs to be equal to hop_length"
self.upsample = UpsampleNetwork(
feat_dims,
upsample_factors,
compute_dims,
num_res_blocks,
res_out_dims,
pad,
use_aux_net,
)
else:
self.upsample = Upsample(
hop_length,
pad,
num_res_blocks,
feat_dims,
compute_dims,
res_out_dims,
use_aux_net,
)
if self.use_aux_net:
self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
self.rnn2 = nn.GRU(rnn_dims + self.aux_dims,
rnn_dims, batch_first=True)
self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
self.fc3 = nn.Linear(fc_dims, self.n_classes)
else:
self.I = nn.Linear(feat_dims + 1, rnn_dims)
self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
self.rnn2 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
self.fc1 = nn.Linear(rnn_dims, fc_dims)
self.fc2 = nn.Linear(fc_dims, fc_dims)
self.fc3 = nn.Linear(fc_dims, self.n_classes)
def forward(self, x, mels):
bsize = x.size(0)
h1 = torch.zeros(1, bsize, self.rnn_dims).to(x.device)
h2 = torch.zeros(1, bsize, self.rnn_dims).to(x.device)
mels, aux = self.upsample(mels)
if self.use_aux_net:
aux_idx = [self.aux_dims * i for i in range(5)]
a1 = aux[:, :, aux_idx[0]: aux_idx[1]]
a2 = aux[:, :, aux_idx[1]: aux_idx[2]]
a3 = aux[:, :, aux_idx[2]: aux_idx[3]]
a4 = aux[:, :, aux_idx[3]: aux_idx[4]]
x = (
torch.cat([x.unsqueeze(-1), mels, a1], dim=2)
if self.use_aux_net
else torch.cat([x.unsqueeze(-1), mels], dim=2)
)
x = self.I(x)
res = x
self.rnn1.flatten_parameters()
x, _ = self.rnn1(x, h1)
x = x + res
res = x
x = torch.cat([x, a2], dim=2) if self.use_aux_net else x
self.rnn2.flatten_parameters()
x, _ = self.rnn2(x, h2)
x = x + res
x = torch.cat([x, a3], dim=2) if self.use_aux_net else x
x = F.relu(self.fc1(x))
x = torch.cat([x, a4], dim=2) if self.use_aux_net else x
x = F.relu(self.fc2(x))
return self.fc3(x)
def inference(self, mels, batched, target, overlap):
self.eval()
device = mels.device
output = []
start = time.time()
rnn1 = self.get_gru_cell(self.rnn1)
rnn2 = self.get_gru_cell(self.rnn2)
with torch.no_grad():
if isinstance(mels, np.ndarray):
mels = torch.FloatTensor(mels).to(device)
if mels.ndim == 2:
mels = mels.unsqueeze(0)
wave_len = (mels.size(-1) - 1) * self.hop_length
mels = self.pad_tensor(mels.transpose(
1, 2), pad=self.pad, side="both")
mels, aux = self.upsample(mels.transpose(1, 2))
if batched:
mels = self.fold_with_overlap(mels, target, overlap)
if aux is not None:
aux = self.fold_with_overlap(aux, target, overlap)
b_size, seq_len, _ = mels.size()
h1 = torch.zeros(b_size, self.rnn_dims).to(device)
h2 = torch.zeros(b_size, self.rnn_dims).to(device)
x = torch.zeros(b_size, 1).to(device)
if self.use_aux_net:
d = self.aux_dims
aux_split = [aux[:, :, d * i: d * (i + 1)] for i in range(4)]
for i in range(seq_len):
m_t = mels[:, i, :]
if self.use_aux_net:
a1_t, a2_t, a3_t, a4_t = (a[:, i, :] for a in aux_split)
x = (
torch.cat([x, m_t, a1_t], dim=1)
if self.use_aux_net
else torch.cat([x, m_t], dim=1)
)
x = self.I(x)
h1 = rnn1(x, h1)
x = x + h1
inp = torch.cat([x, a2_t], dim=1) if self.use_aux_net else x
h2 = rnn2(inp, h2)
x = x + h2
x = torch.cat([x, a3_t], dim=1) if self.use_aux_net else x
x = F.relu(self.fc1(x))
x = torch.cat([x, a4_t], dim=1) if self.use_aux_net else x
x = F.relu(self.fc2(x))
logits = self.fc3(x)
if self.mode == "mold":
sample = sample_from_discretized_mix_logistic(
logits.unsqueeze(0).transpose(1, 2)
)
output.append(sample.view(-1))
x = sample.transpose(0, 1).to(device)
elif self.mode == "gauss":
sample = sample_from_gaussian(
logits.unsqueeze(0).transpose(1, 2))
output.append(sample.view(-1))
x = sample.transpose(0, 1).to(device)
elif isinstance(self.mode, int):
posterior = F.softmax(logits, dim=1)
distrib = torch.distributions.Categorical(posterior)
sample = 2 * distrib.sample().float() / (self.n_classes - 1.0) - 1.0
output.append(sample)
x = sample.unsqueeze(-1)
else:
raise RuntimeError(
"Unknown model mode value - ", self.mode)
if i % 100 == 0:
self.gen_display(i, seq_len, b_size, start)
output = torch.stack(output).transpose(0, 1)
output = output.cpu().numpy()
output = output.astype(np.float64)
if batched:
output = self.xfade_and_unfold(output, target, overlap)
else:
output = output[0]
if self.mulaw and isinstance(self.mode, int):
output = ap.mulaw_decode(output, self.mode)
# Fade-out at the end to avoid signal cutting out suddenly
fade_out = np.linspace(1, 0, 20 * self.hop_length)
output = output[:wave_len]
if wave_len > len(fade_out):
output[-20 * self.hop_length:] *= fade_out
self.train()
return output
def gen_display(self, i, seq_len, b_size, start):
gen_rate = (i + 1) / (time.time() - start) * b_size / 1000
realtime_ratio = gen_rate * 1000 / self.sample_rate
stream(
"%i/%i -- batch_size: %i -- gen_rate: %.1f kHz -- x_realtime: %.1f ",
(i * b_size, seq_len * b_size, b_size, gen_rate, realtime_ratio),
)
def fold_with_overlap(self, x, target, overlap):
"""Fold the tensor with overlap for quick batched inference.
Overlap will be used for crossfading in xfade_and_unfold()
Args:
x (tensor) : Upsampled conditioning features.
shape=(1, timesteps, features)
target (int) : Target timesteps for each index of batch
overlap (int) : Timesteps for both xfade and rnn warmup
Return:
(tensor) : shape=(num_folds, target + 2 * overlap, features)
Details:
x = [[h1, h2, ... hn]]
Where each h is a vector of conditioning features
Eg: target=2, overlap=1 with x.size(1)=10
folded = [[h1, h2, h3, h4],
[h4, h5, h6, h7],
[h7, h8, h9, h10]]
"""
_, total_len, features = x.size()
# Calculate variables needed
num_folds = (total_len - overlap) // (target + overlap)
extended_len = num_folds * (overlap + target) + overlap
remaining = total_len - extended_len
# Pad if some time steps poking out
if remaining != 0:
num_folds += 1
padding = target + 2 * overlap - remaining
x = self.pad_tensor(x, padding, side="after")
folded = torch.zeros(num_folds, target + 2 *
overlap, features).to(x.device)
# Get the values for the folded tensor
for i in range(num_folds):
start = i * (target + overlap)
end = start + target + 2 * overlap
folded[i] = x[:, start:end, :]
return folded
@staticmethod
def get_gru_cell(gru):
gru_cell = nn.GRUCell(gru.input_size, gru.hidden_size)
gru_cell.weight_hh.data = gru.weight_hh_l0.data
gru_cell.weight_ih.data = gru.weight_ih_l0.data
gru_cell.bias_hh.data = gru.bias_hh_l0.data
gru_cell.bias_ih.data = gru.bias_ih_l0.data
return gru_cell
@staticmethod
def pad_tensor(x, pad, side="both"):
# NB - this is just a quick method I need right now
# i.e., it won't generalise to other shapes/dims
b, t, c = x.size()
total = t + 2 * pad if side == "both" else t + pad
padded = torch.zeros(b, total, c).to(x.device)
if side in ("before", "both"):
padded[:, pad: pad + t, :] = x
elif side == "after":
padded[:, :t, :] = x
return padded
@staticmethod
def xfade_and_unfold(y, target, overlap):
"""Applies a crossfade and unfolds into a 1d array.
Args:
y (ndarray) : Batched sequences of audio samples
shape=(num_folds, target + 2 * overlap)
dtype=np.float64
overlap (int) : Timesteps for both xfade and rnn warmup
Return:
(ndarray) : audio samples in a 1d array
shape=(total_len)
dtype=np.float64
Details:
y = [[seq1],
[seq2],
[seq3]]
Apply a gain envelope at both ends of the sequences
y = [[seq1_in, seq1_target, seq1_out],
[seq2_in, seq2_target, seq2_out],
[seq3_in, seq3_target, seq3_out]]
Stagger and add up the groups of samples:
[seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
"""
num_folds, length = y.shape
target = length - 2 * overlap
total_len = num_folds * (target + overlap) + overlap
# Need some silence for the rnn warmup
silence_len = overlap // 2
fade_len = overlap - silence_len
silence = np.zeros((silence_len), dtype=np.float64)
# Equal power crossfade
t = np.linspace(-1, 1, fade_len, dtype=np.float64)
fade_in = np.sqrt(0.5 * (1 + t))
fade_out = np.sqrt(0.5 * (1 - t))
# Concat the silence to the fades
fade_in = np.concatenate([silence, fade_in])
fade_out = np.concatenate([fade_out, silence])
# Apply the gain to the overlap samples
y[:, :overlap] *= fade_in
y[:, -overlap:] *= fade_out
unfolded = np.zeros((total_len), dtype=np.float64)
# Loop to add up all the samples
for i in range(num_folds):
start = i * (target + overlap)
end = start + target + 2 * overlap
unfolded[start:end] += y[i]
return unfolded

View File

@ -0,0 +1,168 @@
import numpy as np
import math
import torch
from torch.distributions.normal import Normal
import torch.nn.functional as F
def gaussian_loss(y_hat, y, log_std_min=-7.0):
assert y_hat.dim() == 3
assert y_hat.size(2) == 2
mean = y_hat[:, :, :1]
log_std = torch.clamp(y_hat[:, :, 1:], min=log_std_min)
# TODO: replace with pytorch dist
log_probs = -0.5 * (
-math.log(2.0 * math.pi)
- 2.0 * log_std
- torch.pow(y - mean, 2) * torch.exp((-2.0 * log_std))
)
return log_probs.squeeze().mean()
def sample_from_gaussian(y_hat, log_std_min=-7.0, scale_factor=1.0):
assert y_hat.size(2) == 2
mean = y_hat[:, :, :1]
log_std = torch.clamp(y_hat[:, :, 1:], min=log_std_min)
dist = Normal(
mean,
torch.exp(log_std),
)
sample = dist.sample()
sample = torch.clamp(torch.clamp(
sample, min=-scale_factor), max=scale_factor)
del dist
return sample
def log_sum_exp(x):
""" numerically stable log_sum_exp implementation that prevents overflow """
# TF ordering
axis = len(x.size()) - 1
m, _ = torch.max(x, dim=axis)
m2, _ = torch.max(x, dim=axis, keepdim=True)
return m + torch.log(torch.sum(torch.exp(x - m2), dim=axis))
# It is adapted from https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py
def discretized_mix_logistic_loss(
y_hat, y, num_classes=65536, log_scale_min=None, reduce=True
):
if log_scale_min is None:
log_scale_min = float(np.log(1e-14))
y_hat = y_hat.permute(0, 2, 1)
assert y_hat.dim() == 3
assert y_hat.size(1) % 3 == 0
nr_mix = y_hat.size(1) // 3
# (B x T x C)
y_hat = y_hat.transpose(1, 2)
# unpack parameters. (B, T, num_mixtures) x 3
logit_probs = y_hat[:, :, :nr_mix]
means = y_hat[:, :, nr_mix: 2 * nr_mix]
log_scales = torch.clamp(
y_hat[:, :, 2 * nr_mix: 3 * nr_mix], min=log_scale_min)
# B x T x 1 -> B x T x num_mixtures
y = y.expand_as(means)
centered_y = y - means
inv_stdv = torch.exp(-log_scales)
plus_in = inv_stdv * (centered_y + 1.0 / (num_classes - 1))
cdf_plus = torch.sigmoid(plus_in)
min_in = inv_stdv * (centered_y - 1.0 / (num_classes - 1))
cdf_min = torch.sigmoid(min_in)
# log probability for edge case of 0 (before scaling)
# equivalent: torch.log(F.sigmoid(plus_in))
log_cdf_plus = plus_in - F.softplus(plus_in)
# log probability for edge case of 255 (before scaling)
# equivalent: (1 - F.sigmoid(min_in)).log()
log_one_minus_cdf_min = -F.softplus(min_in)
# probability for all other cases
cdf_delta = cdf_plus - cdf_min
mid_in = inv_stdv * centered_y
# log probability in the center of the bin, to be used in extreme cases
# (not actually used in our code)
log_pdf_mid = mid_in - log_scales - 2.0 * F.softplus(mid_in)
# tf equivalent
# log_probs = tf.where(x < -0.999, log_cdf_plus,
# tf.where(x > 0.999, log_one_minus_cdf_min,
# tf.where(cdf_delta > 1e-5,
# tf.log(tf.maximum(cdf_delta, 1e-12)),
# log_pdf_mid - np.log(127.5))))
# TODO: cdf_delta <= 1e-5 actually can happen. How can we choose the value
# for num_classes=65536 case? 1e-7? not sure..
inner_inner_cond = (cdf_delta > 1e-5).float()
inner_inner_out = inner_inner_cond * torch.log(
torch.clamp(cdf_delta, min=1e-12)
) + (1.0 - inner_inner_cond) * (log_pdf_mid - np.log((num_classes - 1) / 2))
inner_cond = (y > 0.999).float()
inner_out = (
inner_cond * log_one_minus_cdf_min +
(1.0 - inner_cond) * inner_inner_out
)
cond = (y < -0.999).float()
log_probs = cond * log_cdf_plus + (1.0 - cond) * inner_out
log_probs = log_probs + F.log_softmax(logit_probs, -1)
if reduce:
return -torch.mean(log_sum_exp(log_probs))
return -log_sum_exp(log_probs).unsqueeze(-1)
def sample_from_discretized_mix_logistic(y, log_scale_min=None):
"""
Sample from discretized mixture of logistic distributions
Args:
y (Tensor): B x C x T
log_scale_min (float): Log scale minimum value
Returns:
Tensor: sample in range of [-1, 1].
"""
if log_scale_min is None:
log_scale_min = float(np.log(1e-14))
assert y.size(1) % 3 == 0
nr_mix = y.size(1) // 3
# B x T x C
y = y.transpose(1, 2)
logit_probs = y[:, :, :nr_mix]
# sample mixture indicator from softmax
temp = logit_probs.data.new(logit_probs.size()).uniform_(1e-5, 1.0 - 1e-5)
temp = logit_probs.data - torch.log(-torch.log(temp))
_, argmax = temp.max(dim=-1)
# (B, T) -> (B, T, nr_mix)
one_hot = to_one_hot(argmax, nr_mix)
# select logistic parameters
means = torch.sum(y[:, :, nr_mix: 2 * nr_mix] * one_hot, dim=-1)
log_scales = torch.clamp(
torch.sum(y[:, :, 2 * nr_mix: 3 * nr_mix] * one_hot, dim=-1), min=log_scale_min
)
# sample from logistic & clip to interval
# we don't actually round to the nearest 8bit value when sampling
u = means.data.new(means.size()).uniform_(1e-5, 1.0 - 1e-5)
x = means + torch.exp(log_scales) * (torch.log(u) - torch.log(1.0 - u))
x = torch.clamp(torch.clamp(x, min=-1.0), max=1.0)
return x
def to_one_hot(tensor, n, fill_with=1.0):
# we perform one-hot encoding with respect to the last axis
one_hot = torch.FloatTensor(tensor.size() + (n,)).zero_()
if tensor.is_cuda:
one_hot = one_hot.cuda()
one_hot.scatter_(len(tensor.size()), tensor.unsqueeze(-1), fill_with)
return one_hot

View File

@ -42,12 +42,35 @@ def to_camel(text):
return re.sub(r'(?!^)_([a-zA-Z])', lambda m: m.group(1).upper(), text)
def setup_wavernn(c):
print(" > Model: WaveRNN")
MyModel = importlib.import_module("TTS.vocoder.models.wavernn")
MyModel = getattr(MyModel, "WaveRNN")
model = MyModel(
rnn_dims=c.wavernn_model_params['rnn_dims'],
fc_dims=c.wavernn_model_params['fc_dims'],
mode=c.mode,
mulaw=c.mulaw,
pad=c.padding,
use_aux_net=c.wavernn_model_params['use_aux_net'],
use_upsample_net=c.wavernn_model_params['use_upsample_net'],
upsample_factors=c.wavernn_model_params['upsample_factors'],
feat_dims=c.audio['num_mels'],
compute_dims=c.wavernn_model_params['compute_dims'],
res_out_dims=c.wavernn_model_params['res_out_dims'],
num_res_blocks=c.wavernn_model_params['num_res_blocks'],
hop_length=c.audio["hop_length"],
sample_rate=c.audio["sample_rate"],
)
return model
def setup_generator(c):
print(" > Generator Model: {}".format(c.generator_model))
MyModel = importlib.import_module('TTS.vocoder.models.' +
c.generator_model.lower())
MyModel = getattr(MyModel, to_camel(c.generator_model))
if c.generator_model in 'melgan_generator':
if c.generator_model.lower() in 'melgan_generator':
model = MyModel(
in_channels=c.audio['num_mels'],
out_channels=1,
@ -58,7 +81,7 @@ def setup_generator(c):
num_res_blocks=c.generator_model_params['num_res_blocks'])
if c.generator_model in 'melgan_fb_generator':
pass
if c.generator_model in 'multiband_melgan_generator':
if c.generator_model.lower() in 'multiband_melgan_generator':
model = MyModel(
in_channels=c.audio['num_mels'],
out_channels=4,
@ -67,7 +90,7 @@ def setup_generator(c):
upsample_factors=c.generator_model_params['upsample_factors'],
res_kernel=3,
num_res_blocks=c.generator_model_params['num_res_blocks'])
if c.generator_model in 'fullband_melgan_generator':
if c.generator_model.lower() in 'fullband_melgan_generator':
model = MyModel(
in_channels=c.audio['num_mels'],
out_channels=1,
@ -76,7 +99,7 @@ def setup_generator(c):
upsample_factors=c.generator_model_params['upsample_factors'],
res_kernel=3,
num_res_blocks=c.generator_model_params['num_res_blocks'])
if c.generator_model in 'parallel_wavegan_generator':
if c.generator_model.lower() in 'parallel_wavegan_generator':
model = MyModel(
in_channels=1,
out_channels=1,
@ -91,6 +114,17 @@ def setup_generator(c):
bias=True,
use_weight_norm=True,
upsample_factors=c.generator_model_params['upsample_factors'])
if c.generator_model.lower() in 'wavegrad':
model = MyModel(
in_channels=c['audio']['num_mels'],
out_channels=1,
use_weight_norm=c['model_params']['use_weight_norm'],
x_conv_channels=c['model_params']['x_conv_channels'],
y_conv_channels=c['model_params']['y_conv_channels'],
dblock_out_channels=c['model_params']['dblock_out_channels'],
ublock_out_channels=c['model_params']['ublock_out_channels'],
upsample_factors=c['model_params']['upsample_factors'],
upsample_dilations=c['model_params']['upsample_dilations'])
return model
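
A short sketch of how these factory functions are typically driven; the module path for setup_generator is assumed to be TTS.vocoder.utils.generic_utils, and the config file is the wavegrad test config added further down in this diff.

from TTS.utils.io import load_config
from TTS.vocoder.utils.generic_utils import setup_generator  # module path assumed

c = load_config("tests/inputs/test_vocoder_wavegrad.json")    # added further down in this diff
model = setup_generator(c)                                    # dispatches on c.generator_model (here "wavegrad")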

View File

@ -20,7 +20,10 @@ def load_checkpoint(model, checkpoint_path, use_cuda=False):
def save_model(model, optimizer, scheduler, model_disc, optimizer_disc,
scheduler_disc, current_step, epoch, output_path, **kwargs):
model_state = model.state_dict()
if hasattr(model, 'module'):
model_state = model.module.state_dict()
else:
model_state = model.state_dict()
model_disc_state = model_disc.state_dict()\
if model_disc is not None else None
optimizer_state = optimizer.state_dict()\
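
The hasattr(model, 'module') branch unwraps DataParallel / DistributedDataParallel wrappers before saving; a minimal sketch of the key-prefix mismatch it avoids:

import torch

net = torch.nn.Linear(4, 4)
wrapped = torch.nn.DataParallel(net)
# the wrapper prefixes every key with 'module.', which would break later non-wrapped loads
assert all(k.startswith("module.") for k in wrapped.state_dict())
assert list(wrapped.module.state_dict()) == list(net.state_dict())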

View File

@ -13,7 +13,11 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 2,
=======
"execution_count": null,
>>>>>>> dev
"metadata": {},
"outputs": [],
"source": [
@ -25,8 +29,13 @@
"import umap\n",
"\n",
"from TTS.speaker_encoder.model import SpeakerEncoder\n",
<<<<<<< HEAD
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.utils.io import load_config\n",
=======
"from TTS.tts.utils.audio import AudioProcessor\n",
"from TTS.tts.utils.generic_utils import load_config\n",
>>>>>>> dev
"\n",
"from bokeh.io import output_notebook, show\n",
"from bokeh.plotting import figure\n",
@ -48,6 +57,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 3,
"metadata": {},
"outputs": [
@ -367,6 +377,11 @@
"output_type": "display_data"
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"output_notebook()"
]
@ -380,12 +395,20 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#MODEL_RUN_PATH = \"libritts_360-half-October-31-2019_04+54PM-19d2f5f/\"\n",
"MODEL_RUN_PATH = \"libritts_360-half-September-28-2019_10+46AM-8565c50/\"\n",
=======
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_RUN_PATH = \"/media/erogol/data_ssd/Models/libri_tts/speaker_encoder/libritts_360-half-October-31-2019_04+54PM-19d2f5f/\"\n",
>>>>>>> dev
"MODEL_PATH = MODEL_RUN_PATH + \"best_model.pth.tar\"\n",
"CONFIG_PATH = MODEL_RUN_PATH + \"config.json\"\n",
"\n",
@ -395,11 +418,16 @@
"\n",
"# My multi speaker locations\n",
"EMBED_PATH = \"/home/erogol/Data/Libri-TTS/train-clean-360-embed_128/\"\n",
<<<<<<< HEAD
"AUDIO_PATH = \"datasets/LibriTTS/test-clean/\""
=======
"AUDIO_PATH = \"/home/erogol/Data/Libri-TTS/train-clean-360/\""
>>>>>>> dev
]
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 5,
"metadata": {},
"outputs": [
@ -413,12 +441,18 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"!ls -1 $MODEL_RUN_PATH"
]
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 6,
"metadata": {},
"outputs": [
@ -454,6 +488,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"CONFIG = load_config(CONFIG_PATH)\n",
"ap = AudioProcessor(**CONFIG['audio'])"
@ -468,6 +507,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 7,
"metadata": {},
"outputs": [
@ -479,6 +519,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"embed_files = glob.glob(EMBED_PATH+\"/**/*.npy\", recursive=True)\n",
"print(f'Embeddings found: {len(embed_files)}')"
@ -493,6 +538,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 8,
"metadata": {},
"outputs": [
@ -508,6 +554,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"embed_files[0]"
]
@ -523,6 +574,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 9,
"metadata": {},
"outputs": [
@ -534,6 +586,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"speaker_paths = list(set([os.path.dirname(os.path.dirname(embed_file)) for embed_file in embed_files]))\n",
"speaker_to_utter = {}\n",
@ -557,6 +614,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 11,
"metadata": {},
"outputs": [
@ -575,6 +633,13 @@
],
"source": [
"ttsembeds = []\n",
=======
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeds = []\n",
>>>>>>> dev
"labels = []\n",
"locations = []\n",
"\n",
@ -598,7 +663,11 @@
" embed = np.load(embed_path)\n",
" embeds.append(embed)\n",
" labels.append(str(speaker_num))\n",
<<<<<<< HEAD
" #locations.append(embed_path.replace(EMBED_PATH, '').replace('.npy','.wav'))\n",
=======
" locations.append(embed_path.replace(EMBED_PATH, '').replace('.npy','.wav'))\n",
>>>>>>> dev
"embeds = np.concatenate(embeds)"
]
},
@ -611,6 +680,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 12,
"metadata": {},
"outputs": [
@ -626,6 +696,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"model = umap.UMAP()\n",
"projection = model.fit_transform(embeds)"
@ -729,7 +804,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
<<<<<<< HEAD
"version": "3.8.5"
=======
"version": "3.7.4"
>>>>>>> dev
}
},
"nbformat": 4,

File diff suppressed because one or more lines are too long

View File

@ -23,3 +23,4 @@ pylint==2.5.3
gdown
umap-learn
cython
pyyaml

View File

@ -1,3 +1,4 @@
set -e
TF_CPP_MIN_LOG_LEVEL=3
# tests
@ -6,7 +7,10 @@ nosetests tests -x &&\
# runtime tests
./tests/test_server_package.sh && \
./tests/test_tts_train.sh && \
./tests/test_vocoder_train.sh && \
./tests/test_glow-tts_train.sh && \
./tests/test_vocoder_gan_train.sh && \
./tests/test_vocoder_wavernn_train.sh && \
./tests/test_vocoder_wavegrad_train.sh && \
# linter check
cardboardlinter --refspec master

View File

@ -33,7 +33,7 @@ args, unknown_args = parser.parse_known_args()
# Remove our arguments from argv so that setuptools doesn't see them
sys.argv = [sys.argv[0]] + unknown_args
version = '0.0.5'
version = '0.0.6'
# Adapted from https://github.com/pytorch/pytorch
cwd = os.path.dirname(os.path.abspath(__file__))

View File

@ -0,0 +1,134 @@
{
"model": "glow_tts",
"run_name": "glow-tts-gatedconv",
"run_description": "glow-tts model training with gated conv.",
// AUDIO PARAMETERS
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Griffin-Lim
"power": 1.1, // value to sharpen wav signals after GL algorithm.
"griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 1.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// VOCABULARY PARAMETERS
// if custom character set is not defined,
// default set in symbols.py is used
// "characters":{
// "pad": "_",
// "eos": "~",
// "bos": "^",
// "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? ",
// "punctuations":"!'(),-.:;? ",
// "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
// },
"add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.
// DISTRIBUTED TRAINING
"mixed_precision": false,
"distributed":{
"backend": "nccl",
"url": "tcp:\/\/localhost:54323"
},
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// MODEL PARAMETERS
"use_mas": false, // use Monotonic Alignment Search if true. Otherwise use pre-computed attention alignments.
// TRAINING
"batch_size": 2, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
"eval_batch_size":1,
"r": 1, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
"loss_masking": true, // enable / disable loss masking against the sequence padding.
// VALIDATION
"run_eval": true,
"test_delay_epochs": 0, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
"noam_schedule": true, // use noam warmup and lr schedule.
"grad_clip": 5.0, // upper limit for gradients for clipping.
"epochs": 1, // total number of epochs to train.
"lr": 1e-3, // Initial learning rate. If Noam decay is active, maximum learning rate.
"wd": 0.000001, // Weight decay weight.
"warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
"seq_len_norm": false, // Normalize eash sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has skewed distribution of sequence lengths.
"encoder_type": "gatedconv",
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log training on console.
"tb_plot_step": 100, // Number of steps to plot TB training figures.
"print_eval": false, // If True, it prints intermediate loss values in evalulation.
"save_step": 5000, // Number of training steps expected to save traninpg stats and checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
"apex_amp_level": null,
// DATA LOADING
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"batch_group_size": 0, //Number of batches to shuffle after bucketing.
"min_seq_len": 3, // DATASET-RELATED: minimum text length to use in training
"max_seq_len": 500, // DATASET-RELATED: maximum text length
"compute_f0": false, // compute f0 values in data-loader
// PATHS
"output_path": "tests/train_outputs/",
// PHONEMES
"phoneme_cache_path": "tests/outputs/phoneme_cache/", // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": true, // use phonemes instead of raw characters. It is suggested for better pronounciation.
"phoneme_language": "en-us", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
// MULTI-SPEAKER and GST
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": null,
"use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.
// DATASETS
"datasets": // List of datasets. They all merged and they get different speaker_ids.
[
{
"name": "ljspeech",
"path": "tests/data/ljspeech/",
"meta_file_train": "metadata.csv",
"meta_file_val": "metadata.csv"
}
]
}
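
The add_blank option above interleaves an extra blank token with the input tokens; a rough sketch of the idea, where intersperse is a hypothetical helper rather than the repository's implementation:

def intersperse(sequence, blank_id=0):
    # [7, 8, 9] -> [0, 7, 0, 8, 0, 9, 0]
    result = [blank_id] * (2 * len(sequence) + 1)
    result[1::2] = sequence
    return result

print(intersperse([7, 8, 9]))   # [0, 7, 0, 8, 0, 9, 0]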

View File

@ -67,13 +67,24 @@
"gradual_training": [[0, 7, 4]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceeed.
"loss_masking": true, // enable / disable loss masking against the sequence padding.
"ga_alpha": 10.0, // weight for guided attention loss. If > 0, guided attention is enabled.
"apex_amp_level": null,
"mixed_precision": false,
// VALIDATION
"run_eval": true,
"test_delay_epochs": 0, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// LOSS SETTINGS
"loss_masking": true, // enable / disable loss masking against the sequence padding.
"decoder_loss_alpha": 0.5, // original decoder loss weight. If > 0, it is enabled
"postnet_loss_alpha": 0.25, // original postnet loss weight. If > 0, it is enabled
"postnet_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
"decoder_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
"decoder_ssim_alpha": 0.5, // decoder ssim loss weight. If > 0, it is enabled
"postnet_ssim_alpha": 0.25, // postnet ssim loss weight. If > 0, it is enabled
"ga_alpha": 5.0, // weight for guided attention loss. If > 0, guided attention is enabled.
"stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.
// OPTIMIZER
"noam_schedule": false, // use noam warmup and lr schedule.
"grad_clip": 1.0, // upper limit for gradients for clipping.

View File

@ -0,0 +1,114 @@
{
"run_name": "wavegrad-ljspeech",
"run_description": "wavegrad ljspeech",
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
"mixed_precision": false,
"distributed":{
"backend": "nccl",
"url": "tcp:\/\/localhost:54322"
},
"target_loss": "avg_wavegrad_loss", // loss value to pick the best model to save after each epoch
// MODEL PARAMETERS
"generator_model": "wavegrad",
"model_params":{
"y_conv_channels":32,
"x_conv_channels":768,
"ublock_out_channels": [512, 512, 256, 128, 128],
"dblock_out_channels": [128, 128, 256, 512],
"upsample_factors": [4, 4, 4, 2, 2],
"upsample_dilations": [
[1, 2, 1, 2],
[1, 2, 1, 2],
[1, 2, 4, 8],
[1, 2, 4, 8],
[1, 2, 4, 8]],
"use_weight_norm": true
},
// DATASET
"data_path": "tests/data/ljspeech/wavs/", // root data path. It finds all wav files recursively from there.
"feature_path": null, // if you use precomputed features
"seq_len": 6144, // 24 * hop_length
"pad_short": 0, // additional padding for short wavs
"conv_pad": 0, // additional padding against convolutions applied to spectrograms
"use_noise_augment": false, // add noise to the audio signal for augmentation
"use_cache": true, // use in memory cache to keep the computed features. This might cause OOM.
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 1, // Batch size for training.
"train_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 1000
},
"test_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 2
},
// VALIDATION
"run_eval": true, // enable/disable evaluation run
// OPTIMIZER
"epochs": 1, // total number of epochs to train.
"clip_grad": 1.0, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.
// TENSORBOARD and LOGGING
"print_step": 250, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 10000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 4,
// PATHS
"output_path": "tests/train_outputs/"
}
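
The train_noise_schedule block above plausibly maps onto a linear beta schedule, mirroring the np.linspace call used by the Wavegrad train test at the end of this diff:

import numpy as np

schedule = {"min_val": 1e-6, "max_val": 1e-2, "num_steps": 1000}   # values from train_noise_schedule above
betas = np.linspace(schedule["min_val"], schedule["max_val"], schedule["num_steps"])
# model.compute_noise_level(betas)   # same call pattern as in the Wavegrad train test below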

View File

@ -0,0 +1,107 @@
{
"run_name": "wavernn_test",
"run_description": "wavernn_test training",
// AUDIO PARAMETERS
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 20.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// Generating / Synthesizing
"batched": true,
"target_samples": 11000, // target number of samples to be generated in each batch entry
"overlap_samples": 550, // number of samples for crossfading between batches
// DISTRIBUTED TRAINING
// "distributed":{
// "backend": "nccl",
// "url": "tcp:\/\/localhost:54321"
// },
// MODEL PARAMETERS
"use_aux_net": true,
"use_upsample_net": true,
"upsample_factors": [4, 8, 8], // this needs to correctly factorise hop_length
"seq_len": 1280, // has to be devideable by hop_length
"mode": "mold", // mold [string], gauss [string], bits [int]
"mulaw": false, // apply mulaw if mode is bits
"padding": 2, // pad the input for resnet to see wider input length
// DATASET
//"use_gta": true, // use computed gta features from the tts model
"data_path": "tests/data/ljspeech/wavs/", // path containing training wav files
"feature_path": null, // path containing computed features from wav files if null compute them
// MODEL PARAMETERS
"wavernn_model_params": {
"rnn_dims": 512,
"fc_dims": 512,
"compute_dims": 128,
"res_out_dims": 128,
"num_res_blocks": 10,
"use_aux_net": true,
"use_upsample_net": true,
"upsample_factors": [4, 8, 8] // this needs to correctly factorise hop_length
},
"mixed_precision": false,
// TRAINING
"batch_size": 4, // Batch size for training. Lower values than 32 might cause hard to learn attention.
"epochs": 1, // total number of epochs to train.
// VALIDATION
"run_eval": true,
"test_every_epochs": 10, // Test after set number of epochs (Test every 20 epochs for example)
// OPTIMIZER
"grad_clip": 4, // apply gradient clipping if > 0
"lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_params": {
"gamma": 0.5,
"milestones": [200000, 400000, 600000]
},
"lr": 1e-4, // initial learning rate
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 25000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 10, // number of samples for testing
// PATHS
"output_path": "tests/train_outputs/"
}
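
The comments above impose two constraints: upsample_factors must factorise hop_length, and seq_len must be divisible by hop_length. A quick sanity check with this config's values:

import numpy as np

hop_length = 256
upsample_factors = [4, 8, 8]
seq_len = 1280
assert int(np.prod(upsample_factors)) == hop_length   # 4 * 8 * 8 == 256
assert seq_len % hop_length == 0                       # 1280 == 5 * 256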

View File

@ -62,7 +62,7 @@ class GE2ELossTests(unittest.TestCase):
assert output.item() >= 0.0
# check speaker loss with orthogonal d-vectors
dummy_input = T.empty(3, 64)
dummy_input = T.nn.init.orthogonal(dummy_input)
dummy_input = T.nn.init.orthogonal_(dummy_input)
dummy_input = T.cat(
[
dummy_input[0].repeat(5, 1, 1).transpose(0, 1),
@ -91,7 +91,7 @@ class AngleProtoLossTests(unittest.TestCase):
# check speaker loss with orthogonal d-vectors
dummy_input = T.empty(3, 64)
dummy_input = T.nn.init.orthogonal(dummy_input)
dummy_input = T.nn.init.orthogonal_(dummy_input)
dummy_input = T.cat(
[
dummy_input[0].repeat(5, 1, 1).transpose(0, 1),

tests/test_glow-tts_train.sh Executable file
View File

@ -0,0 +1,13 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_glow_tts.py --config_path $BASEDIR/inputs/test_glow_tts.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_glow_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/

View File

@ -2,7 +2,7 @@ import unittest
import torch as T
from TTS.tts.layers.tacotron import Prenet, CBHG, Decoder, Encoder
from TTS.tts.layers.losses import L1LossMasked
from TTS.tts.layers.losses import L1LossMasked, SSIMLoss
from TTS.tts.utils.generic_utils import sequence_mask
# pylint: disable=unused-variable
@ -149,3 +149,72 @@ class L1LossMaskedTests(unittest.TestCase):
(sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
output = layer(dummy_input + mask, dummy_target, dummy_length)
assert output.item() == 0, "0 vs {}".format(output.item())
class SSIMLossTests(unittest.TestCase):
def test_in_out(self): #pylint: disable=no-self-use
# test input == target
layer = SSIMLoss()
dummy_input = T.ones(4, 8, 128).float()
dummy_target = T.ones(4, 8, 128).float()
dummy_length = (T.ones(4) * 8).long()
output = layer(dummy_input, dummy_target, dummy_length)
assert output.item() == 0.0
# test input != target
dummy_input = T.ones(4, 8, 128).float()
dummy_target = T.zeros(4, 8, 128).float()
dummy_length = (T.ones(4) * 8).long()
output = layer(dummy_input, dummy_target, dummy_length)
assert abs(output.item() - 1.0) < 1e-4, "1.0 vs {}".format(output.item())
# test if padded values of input makes any difference
dummy_input = T.ones(4, 8, 128).float()
dummy_target = T.zeros(4, 8, 128).float()
dummy_length = (T.arange(5, 9)).long()
mask = (
(sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
output = layer(dummy_input + mask, dummy_target, dummy_length)
assert abs(output.item() - 1.0) < 1e-4, "1.0 vs {}".format(output.item())
dummy_input = T.rand(4, 8, 128).float()
dummy_target = dummy_input.detach()
dummy_length = (T.arange(5, 9)).long()
mask = (
(sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
output = layer(dummy_input + mask, dummy_target, dummy_length)
assert output.item() == 0, "0 vs {}".format(output.item())
# seq_len_norm = True
# test input == target
layer = L1LossMasked(seq_len_norm=True)
dummy_input = T.ones(4, 8, 128).float()
dummy_target = T.ones(4, 8, 128).float()
dummy_length = (T.ones(4) * 8).long()
output = layer(dummy_input, dummy_target, dummy_length)
assert output.item() == 0.0
# test input != target
dummy_input = T.ones(4, 8, 128).float()
dummy_target = T.zeros(4, 8, 128).float()
dummy_length = (T.ones(4) * 8).long()
output = layer(dummy_input, dummy_target, dummy_length)
assert output.item() == 1.0, "1.0 vs {}".format(output.item())
# test if padded values of input makes any difference
dummy_input = T.ones(4, 8, 128).float()
dummy_target = T.zeros(4, 8, 128).float()
dummy_length = (T.arange(5, 9)).long()
mask = (
(sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
output = layer(dummy_input + mask, dummy_target, dummy_length)
assert abs(output.item() - 1.0) < 1e-5, "1.0 vs {}".format(output.item())
dummy_input = T.rand(4, 8, 128).float()
dummy_target = dummy_input.detach()
dummy_length = (T.arange(5, 9)).long()
mask = (
(sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
output = layer(dummy_input + mask, dummy_target, dummy_length)
assert output.item() == 0, "0 vs {}".format(output.item())
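
These tests zero out padded positions through sequence_mask; a hedged re-implementation of the helper's usual semantics (True inside each sequence's valid length) makes the mask arithmetic above easier to read:

import torch as T

def sequence_mask_sketch(lengths, max_len=None):
    # e.g. lengths = [2, 4] -> [[1, 1, 0, 0], [1, 1, 1, 1]]
    max_len = int(lengths.max()) if max_len is None else max_len
    ids = T.arange(max_len, device=lengths.device)
    return ids.unsqueeze(0) < lengths.unsqueeze(1)

print(sequence_mask_sketch(T.tensor([2, 4])).long())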

View File

@ -6,12 +6,12 @@ if [[ ! -f tests/outputs/checkpoint_10.pth.tar ]]; then
exit 1
fi
rm -f dist/*.whl
python setup.py --quiet bdist_wheel --checkpoint tests/outputs/checkpoint_10.pth.tar --model_config tests/outputs/dummy_model_config.json
python -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install --quiet --upgrade pip setuptools wheel
rm -f dist/*.whl
python setup.py --quiet bdist_wheel --checkpoint tests/outputs/checkpoint_10.pth.tar --model_config tests/outputs/dummy_model_config.json
pip install --quiet dist/TTS*.whl
# this is related to https://github.com/librosa/librosa/issues/1160

View File

@ -294,6 +294,7 @@ class SCGSTMultiSpeakeTacotronTrainTest(unittest.TestCase):
mel_spec = torch.rand(8, 30, c.audio['num_mels']).to(device)
linear_spec = torch.rand(8, 30, c.audio['fft_size']).to(device)
mel_lengths = torch.randint(20, 30, (8, )).long().to(device)
mel_lengths[-1] = mel_spec.size(1)
stop_targets = torch.zeros(8, 30, 1).float().to(device)
speaker_embeddings = torch.rand(8, 55).to(device)

tests/test_tacotron_train.sh Executable file
View File

@ -0,0 +1,14 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --config_path $BASEDIR/inputs/test_train_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/

View File

@ -11,6 +11,7 @@ from TTS.utils.io import load_config
conf = load_config(os.path.join(get_tests_input_path(), 'test_config.json'))
def test_phoneme_to_sequence():
text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
text_cleaner = ["phoneme_cleaners"]
lang = "en-us"
@ -20,7 +21,7 @@ def test_phoneme_to_sequence():
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters)
gt = "ɹiːsənt ɹɪːtʃ æt hɑːɹvɚd hɐz ʃoʊn mɛdᵻteɪɾɪŋ fɔːɹ æz lɪɾəl æz eɪt wiːks kæn æktʃuːəli ɪnkɹiːs, ðə ɡɹeɪ mæɾɚɹ ɪnðə pɑːɹts ʌvðə bɹeɪn ɹɪspɑːnsəbəl fɔːɹ ɪmoʊʃənəl ɹɛɡjuːleɪʃən ænd lɜːnɪŋ!"
assert text_hat == text_hat_with_params == gt
# multiple punctuations
text = "Be a voice, not an! echo?"
sequence = phoneme_to_sequence(text, text_cleaner, lang)
@ -87,6 +88,84 @@ def test_phoneme_to_sequence():
print(len(sequence))
assert text_hat == text_hat_with_params == gt
def test_phoneme_to_sequence_with_blank_token():
text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
text_cleaner = ["phoneme_cleaners"]
lang = "en-us"
sequence = phoneme_to_sequence(text, text_cleaner, lang)
text_hat = sequence_to_phoneme(sequence)
_ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
gt = "ɹiːsənt ɹɪːtʃ æt hɑːɹvɚd hɐz ʃoʊn mɛdᵻteɪɾɪŋ fɔːɹ æz lɪɾəl æz eɪt wiːks kæn æktʃuːəli ɪnkɹiːs, ðə ɡɹeɪ mæɾɚɹ ɪnðə pɑːɹts ʌvðə bɹeɪn ɹɪspɑːnsəbəl fɔːɹ ɪmoʊʃənəl ɹɛɡjuːleɪʃən ænd lɜːnɪŋ!"
assert text_hat == text_hat_with_params == gt
# multiple punctuations
text = "Be a voice, not an! echo?"
sequence = phoneme_to_sequence(text, text_cleaner, lang)
text_hat = sequence_to_phoneme(sequence)
_ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ?"
print(text_hat)
print(len(sequence))
assert text_hat == text_hat_with_params == gt
# not ending with punctuation
text = "Be a voice, not an! echo"
sequence = phoneme_to_sequence(text, text_cleaner, lang)
text_hat = sequence_to_phoneme(sequence)
_ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ"
print(text_hat)
print(len(sequence))
assert text_hat == text_hat_with_params == gt
# original
text = "Be a voice, not an echo!"
sequence = phoneme_to_sequence(text, text_cleaner, lang)
text_hat = sequence_to_phoneme(sequence)
_ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
gt = "biː ɐ vɔɪs, nɑːt ɐn ɛkoʊ!"
print(text_hat)
print(len(sequence))
assert text_hat == text_hat_with_params == gt
# extra space after the sentence
text = "Be a voice, not an! echo. "
sequence = phoneme_to_sequence(text, text_cleaner, lang)
text_hat = sequence_to_phoneme(sequence)
_ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ."
print(text_hat)
print(len(sequence))
assert text_hat == text_hat_with_params == gt
# extra space after the sentence
text = "Be a voice, not an! echo. "
sequence = phoneme_to_sequence(text, text_cleaner, lang, True)
text_hat = sequence_to_phoneme(sequence)
_ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
gt = "^biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ.~"
print(text_hat)
print(len(sequence))
assert text_hat == text_hat_with_params == gt
# padding char
text = "_Be a _voice, not an! echo_"
sequence = phoneme_to_sequence(text, text_cleaner, lang)
text_hat = sequence_to_phoneme(sequence)
_ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ"
print(text_hat)
print(len(sequence))
assert text_hat == text_hat_with_params == gt
def test_text2phone():
text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
gt = "ɹ|iː|s|ə|n|t| |ɹ|ɪ|s|ɜː|tʃ| |æ|t| |h|ɑːɹ|v|ɚ|d| |h|ɐ|z| |ʃ|oʊ|n| |m|ɛ|d|ᵻ|t|eɪ|ɾ|ɪ|ŋ| |f|ɔː|ɹ| |æ|z| |l|ɪ|ɾ|əl| |æ|z| |eɪ|t| |w|iː|k|s| |k|æ|n| |æ|k|tʃ|uː|əl|i| |ɪ|n|k|ɹ|iː|s|,| |ð|ə| |ɡ|ɹ|eɪ| |m|æ|ɾ|ɚ|ɹ| |ɪ|n|ð|ə| |p|ɑːɹ|t|s| |ʌ|v|ð|ə| |b|ɹ|eɪ|n| |ɹ|ɪ|s|p|ɑː|n|s|ə|b|əl| |f|ɔː|ɹ| |ɪ|m|oʊ|ʃ|ə|n|əl| |ɹ|ɛ|ɡ|j|uː|l|eɪ|ʃ|ə|n| |æ|n|d| |l|ɜː|n|ɪ|ŋ|!"

View File

@ -1,13 +1,13 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --config_path $BASEDIR/inputs/test_train_config.json
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tacotron.py --config_path $BASEDIR/inputs/test_train_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tacotron.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/

View File

@ -1,15 +1,15 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder.py --config_path $BASEDIR/inputs/test_vocoder_multiband_melgan_config.json
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_gan.py --config_path $BASEDIR/inputs/test_vocoder_multiband_melgan_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_gan.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER

View File

@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir -p $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavegrad.py --config_path $BASEDIR/inputs/test_vocoder_wavegrad.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavegrad.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER

View File

@ -0,0 +1,31 @@
import numpy as np
import torch
import random
from TTS.vocoder.models.wavernn import WaveRNN
def test_wavernn():
model = WaveRNN(
rnn_dims=512,
fc_dims=512,
mode=10,
mulaw=False,
pad=2,
use_aux_net=True,
use_upsample_net=True,
upsample_factors=[4, 8, 8],
feat_dims=80,
compute_dims=128,
res_out_dims=128,
num_res_blocks=10,
hop_length=256,
sample_rate=22050,
)
dummy_x = torch.rand((2, 1280))
dummy_m = torch.rand((2, 80, 9))
y_size = random.randrange(20, 60)
dummy_y = torch.rand((80, y_size))
output = model(dummy_x, dummy_m)
assert np.all(output.shape == (2, 1280, 4 * 256)), output.shape
output = model.inference(dummy_y, True, 5500, 550)
assert np.all(output.shape == (256 * (y_size - 1),))
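
The asserted forward shape follows from mode=10: a 10-bit output head predicts 2**10 classes per sample, which is the 4 * 256 in the test.

# arithmetic behind the asserted forward shape (2, 1280, 4 * 256)
mode_bits = 10                 # mode=10 selects a 10-bit (bits-mode) output head
n_classes = 2 ** mode_bits
assert n_classes == 4 * 256 == 1024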

View File

@ -0,0 +1,92 @@
import os
import shutil
import numpy as np
from tests import get_tests_path, get_tests_input_path, get_tests_output_path
from torch.utils.data import DataLoader
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.vocoder.datasets.wavernn_dataset import WaveRNNDataset
from TTS.vocoder.datasets.preprocess import load_wav_feat_data, preprocess_wav_files
file_path = os.path.dirname(os.path.realpath(__file__))
OUTPATH = os.path.join(get_tests_output_path(), "loader_tests/")
os.makedirs(OUTPATH, exist_ok=True)
C = load_config(os.path.join(get_tests_input_path(),
"test_vocoder_wavernn_config.json"))
test_data_path = os.path.join(get_tests_path(), "data/ljspeech/")
test_mel_feat_path = os.path.join(test_data_path, "mel")
test_quant_feat_path = os.path.join(test_data_path, "quant")
ok_ljspeech = os.path.exists(test_data_path)
def wavernn_dataset_case(batch_size, seq_len, hop_len, pad, mode, mulaw, num_workers):
""" run dataloader with given parameters and check conditions """
ap = AudioProcessor(**C.audio)
C.batch_size = batch_size
C.mode = mode
C.seq_len = seq_len
C.data_path = test_data_path
preprocess_wav_files(test_data_path, C, ap)
_, train_items = load_wav_feat_data(
test_data_path, test_mel_feat_path, 5)
dataset = WaveRNNDataset(ap=ap,
items=train_items,
seq_len=seq_len,
hop_len=hop_len,
pad=pad,
mode=mode,
mulaw=mulaw
)
# sampler = DistributedSampler(dataset) if num_gpus > 1 else None
loader = DataLoader(dataset,
shuffle=True,
collate_fn=dataset.collate,
batch_size=batch_size,
num_workers=num_workers,
pin_memory=True,
)
max_iter = 10
count_iter = 0
try:
for data in loader:
x_input, mels, _ = data
expected_feat_shape = (ap.num_mels,
(x_input.shape[-1] // hop_len) + (pad * 2))
assert np.all(
mels.shape[1:] == expected_feat_shape), f" [!] {mels.shape} vs {expected_feat_shape}"
assert (mels.shape[2] - pad * 2) * hop_len == x_input.shape[1]
count_iter += 1
if count_iter == max_iter:
break
# except AssertionError:
# shutil.rmtree(test_mel_feat_path)
# shutil.rmtree(test_quant_feat_path)
finally:
shutil.rmtree(test_mel_feat_path)
shutil.rmtree(test_quant_feat_path)
def test_parametrized_wavernn_dataset():
''' test dataloader with different parameters '''
params = [
[16, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, 10, True, 0],
[16, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, "mold", False, 4],
[1, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, 9, False, 0],
[1, C.audio['hop_length'], C.audio['hop_length'], 2, 10, True, 0],
[1, C.audio['hop_length'], C.audio['hop_length'], 2, "mold", False, 0],
[1, C.audio['hop_length'] * 5, C.audio['hop_length'], 4, 10, False, 2],
[1, C.audio['hop_length'] * 5, C.audio['hop_length'], 2, "mold", False, 0],
]
for param in params:
print(param)
wavernn_dataset_case(*param)
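
The shape relation asserted inside the loader loop can be checked by hand with the first parameter set (seq_len = hop_length * 10, pad = 2):

hop_len = 256                                # C.audio['hop_length'] in this test config
pad = 2
x_len = hop_len * 10                         # seq_len of the first parameter set
n_frames = x_len // hop_len + 2 * pad        # 14 mel frames expected per item
assert (n_frames - 2 * pad) * hop_len == x_len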

View File

@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir -p $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavernn.py --config_path $BASEDIR/inputs/test_vocoder_wavernn_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavernn.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER

View File

@ -0,0 +1,92 @@
import torch
from TTS.vocoder.layers.wavegrad import PositionalEncoding, FiLM, UBlock, DBlock
from TTS.vocoder.models.wavegrad import Wavegrad
def test_positional_encoding():
layer = PositionalEncoding(50)
inp = torch.rand(32, 50, 100)
nl = torch.rand(32)
o = layer(inp, nl)
assert o.shape[0] == 32
assert o.shape[1] == 50
assert o.shape[2] == 100
assert isinstance(o, torch.FloatTensor)
def test_film():
layer = FiLM(50, 76)
inp = torch.rand(32, 50, 100)
nl = torch.rand(32)
shift, scale = layer(inp, nl)
assert shift.shape[0] == 32
assert shift.shape[1] == 76
assert shift.shape[2] == 100
assert isinstance(shift, torch.FloatTensor)
assert scale.shape[0] == 32
assert scale.shape[1] == 76
assert scale.shape[2] == 100
assert isinstance(scale, torch.FloatTensor)
layer.apply_weight_norm()
layer.remove_weight_norm()
def test_ublock():
inp1 = torch.rand(32, 50, 100)
inp2 = torch.rand(32, 50, 50)
nl = torch.rand(32)
layer_film = FiLM(50, 100)
layer = UBlock(50, 100, 2, [1, 2, 4, 8])
scale, shift = layer_film(inp1, nl)
o = layer(inp2, shift, scale)
assert o.shape[0] == 32
assert o.shape[1] == 100
assert o.shape[2] == 100
assert isinstance(o, torch.FloatTensor)
layer.apply_weight_norm()
layer.remove_weight_norm()
def test_dblock():
inp = torch.rand(32, 50, 130)
layer = DBlock(50, 100, 2)
o = layer(inp)
assert o.shape[0] == 32
assert o.shape[1] == 100
assert o.shape[2] == 65
assert isinstance(o, torch.FloatTensor)
layer.apply_weight_norm()
layer.remove_weight_norm()
def test_wavegrad_forward():
x = torch.rand(32, 1, 20 * 300)
c = torch.rand(32, 80, 20)
noise_scale = torch.rand(32)
model = Wavegrad(in_channels=80,
out_channels=1,
upsample_factors=[5, 5, 3, 2, 2],
upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
[1, 2, 4, 8], [1, 2, 4, 8],
[1, 2, 4, 8]])
o = model.forward(x, c, noise_scale)
assert o.shape[0] == 32
assert o.shape[1] == 1
assert o.shape[2] == 20 * 300
assert isinstance(o, torch.FloatTensor)
model.apply_weight_norm()
model.remove_weight_norm()
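
test_wavegrad_forward ties the tensor sizes to the upsampling factors: their product is the number of audio samples generated per conditioning frame.

upsample_factors = [5, 5, 3, 2, 2]
frames, samples = 20, 20 * 300               # c and x sizes used in the test
product = 1
for factor in upsample_factors:
    product *= factor
assert product == samples // frames          # 5 * 5 * 3 * 2 * 2 == 300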

View File

@ -0,0 +1,62 @@
import unittest
import numpy as np
import torch
from torch import optim
from TTS.vocoder.models.wavegrad import Wavegrad
#pylint: disable=unused-variable
torch.manual_seed(1)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
class WavegradTrainTest(unittest.TestCase):
def test_train_step(self): # pylint: disable=no-self-use
"""Test if all layers are updated in a basic training cycle"""
input_dummy = torch.rand(8, 1, 20 * 300).to(device)
mel_spec = torch.rand(8, 80, 20).to(device)
criterion = torch.nn.L1Loss().to(device)
model = Wavegrad(in_channels=80,
out_channels=1,
upsample_factors=[5, 5, 3, 2, 2],
upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
[1, 2, 4, 8], [1, 2, 4, 8],
[1, 2, 4, 8]])
model_ref = Wavegrad(in_channels=80,
out_channels=1,
upsample_factors=[5, 5, 3, 2, 2],
upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
[1, 2, 4, 8], [1, 2, 4, 8],
[1, 2, 4, 8]])
model.train()
model.to(device)
betas = np.linspace(1e-6, 1e-2, 1000)
model.compute_noise_level(betas)
model_ref.load_state_dict(model.state_dict())
model_ref.to(device)
count = 0
for param, param_ref in zip(model.parameters(),
model_ref.parameters()):
assert (param - param_ref).sum() == 0, param
count += 1
optimizer = optim.Adam(model.parameters(), lr=0.001)
for i in range(5):
y_hat = model.forward(input_dummy, mel_spec, torch.rand(8).to(device))
optimizer.zero_grad()
loss = criterion(y_hat, input_dummy)
loss.backward()
optimizer.step()
# check parameter changes
count = 0
for param, param_ref in zip(model.parameters(),
model_ref.parameters()):
# ignore the pre-highway layer since it is conditional
# if count not in [145, 59]:
assert (param != param_ref).any(
), "param {} with shape {} not updated!! \n{}\n{}".format(
count, param.shape, param, param_ref)
count += 1