mirror of https://github.com/coqui-ai/TTS.git
Merge branch 'dev'
This commit is contained in:
commit f6c96b0ac2

.compute (6 lines changed)
@@ -1,14 +1,14 @@

#!/bin/bash
yes | apt-get install sox
yes | apt-get install ffmpeg
yes | apt-get install espeak
yes | apt-get install tmux
yes | apt-get install zsh
sh -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
pip3 install https://download.pytorch.org/whl/cu100/torch-1.3.0%2Bcu100-cp36-cp36m-linux_x86_64.whl
sudo sh install.sh
pip install pytorch==1.3.0+cu100
python3 setup.py develop
# pip install pytorch==1.7.0+cu100
# python3 setup.py develop
# python3 distribute.py --config_path config.json --data_path /data/ro/shared/data/keithito/LJSpeech-1.1/
# cp -R ${USER_DIR}/Mozilla_22050 ../tmp/
# python3 distribute.py --config_path config_tacotron_gst.json --data_path ../tmp/Mozilla_22050/
@@ -17,5 +17,6 @@ fi

if [[ "$TEST_SUITE" == "testscripts" ]]; then
    # test model training scripts
    ./tests/test_tts_train.sh
    ./tests/test_vocoder_train.sh
    ./tests/test_vocoder_gan_train.sh
    ./tests/test_vocoder_wavernn_train.sh
fi

README.md (28 lines changed)
@@ -26,7 +26,7 @@ TTS paper collection: https://github.com/erogol/TTS-papers

## TTS Performance
<p align="center"><img src="https://discourse-prod-uploads-81679984178418.s3.dualstack.us-west-2.amazonaws.com/optimized/3X/6/4/6428f980e9ec751c248e591460895f7881aec0c6_2_1035x591.png" width="800" /></p>

"Mozilla*" and "Judy*" are our models.
[Details...](https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results)

## Provided Models and Methods
@@ -47,7 +47,10 @@ Speaker Encoder:

Vocoders:
- MelGAN: [paper](https://arxiv.org/abs/1910.06711)
- MultiBandMelGAN: [paper](https://arxiv.org/abs/2005.05106)
- ParallelWaveGAN: [paper](https://arxiv.org/abs/1910.11480)
- GAN-TTS discriminators: [paper](https://arxiv.org/abs/1909.11646)
- WaveRNN: [origin](https://github.com/fatchord/WaveRNN/)
- WaveGrad: [paper](https://arxiv.org/abs/2009.00713)

You can also help us implement more models. Some TTS related work can be found [here](https://github.com/erogol/TTS-papers).
@@ -70,8 +73,8 @@ You can also help us implement more models. Some TTS related work can be found [

## Main Requirements and Installation
Highly recommended to use [miniconda](https://conda.io/miniconda.html) for easier installation.
* python>=3.6
* pytorch>=1.4.1
* tensorflow>=2.2
* pytorch>=1.5.0
* tensorflow>=2.3
* librosa
* tensorboard
* tensorboardX
@@ -149,23 +152,25 @@ head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
```

To train a new model, you need to define your own ```config.json``` file (check the example) and call with the command below. You also set the model architecture in ```config.json```.
To train a new model, you need to define your own ```config.json``` to define model details, training configuration and more (check the examples). Then call the corresponding train script.

```python TTS/bin/train_tts.py --config_path TTS/tts/configs/config.json```
For instance, in order to train a tacotron or tacotron2 model on the LJSpeech dataset, follow these steps.

```python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json```

To fine-tune a model, use ```--restore_path```.

```python TTS/bin/train_tts.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar```
```python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar```

To continue an old training run, use ```--continue_path```.

```python TTS/bin/train_tts.py --continue_path /path/to/your/run_folder/```
```python TTS/bin/train_tacotron.py --continue_path /path/to/your/run_folder/```

For multi-GPU training use ```distribute.py```. It enables process based multi-GPU training where each process uses a single GPU.
For multi-GPU training, call ```distribute.py```. It runs any provided train script in a multi-GPU setting.

```CUDA_VISIBLE_DEVICES="0,1,4" TTS/bin/distribute.py --config_path TTS/tts/configs/config.json```
```CUDA_VISIBLE_DEVICES="0,1,4" python TTS/bin/distribute.py --script train_tacotron.py --config_path TTS/tts/configs/config.json```

Each run creates a new output folder and ```config.json``` is copied under this folder.
Each run creates a new output folder accommodating the used ```config.json```, model checkpoints and tensorboard logs.

In case of any error or interrupted execution, if there is no checkpoint yet under the output folder, the whole folder is going to be removed.
@@ -199,7 +204,7 @@ If you like to use TTS to try a new idea and like to share your experiments with

- [x] Train TTS with r=1 successfully.
- [x] Enable process based distributed training. Similar to (https://github.com/fastai/imagenet-fast/).
- [x] Adapting Neural Vocoder. TTS works with WaveRNN and ParallelWaveGAN (https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN)
- [ ] Multi-speaker embedding.
- [x] Multi-speaker embedding.
- [x] Model optimization (model export, model pruning etc.)

<!--## References

@@ -218,3 +223,4 @@ If you like to use TTS to try a new idea and like to share your experiments with
- https://github.com/r9y9/tacotron_pytorch (Initial Tacotron architecture)
- https://github.com/kan-bayashi/ParallelWaveGAN (vocoder library)
- https://github.com/jaywalnut310/glow-tts (Original Glow-TTS implementation)
- https://github.com/fatchord/WaveRNN/ (Original WaveRNN implementation)

@@ -0,0 +1,130 @@

import argparse
import glob
import os

import numpy as np
from tqdm import tqdm

import torch
from TTS.speaker_encoder.model import SpeakerEncoder
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.tts.utils.speakers import save_speaker_mapping
from TTS.tts.datasets.preprocess import load_meta_data

parser = argparse.ArgumentParser(
    description='Compute embedding vectors for each wav file in a dataset. If "target_dataset" is defined, it generates "speakers.json" necessary for training a multi-speaker model.')
parser.add_argument(
    'model_path',
    type=str,
    help='Path to model outputs (checkpoint, tensorboard etc.).')
parser.add_argument(
    'config_path',
    type=str,
    help='Path to config file for training.',
)
parser.add_argument(
    'data_path',
    type=str,
    help='Data path for wav files - directory or CSV file')
parser.add_argument(
    'output_path',
    type=str,
    help='path for training outputs.')
parser.add_argument(
    '--target_dataset',
    type=str,
    default='',
    help='Target dataset to pick a processor from TTS.tts.dataset.preprocess. Necessary to create a speakers.json file.'
)
parser.add_argument(
    '--use_cuda', type=bool, help='flag to set cuda.', default=False
)
parser.add_argument(
    '--separator', type=str, help='Separator used in file if CSV is passed for data_path', default='|'
)
args = parser.parse_args()


c = load_config(args.config_path)
ap = AudioProcessor(**c['audio'])

data_path = args.data_path
split_ext = os.path.splitext(data_path)
sep = args.separator

if args.target_dataset != '':
    # if target dataset is defined
    dataset_config = [
        {
            "name": args.target_dataset,
            "path": args.data_path,
            "meta_file_train": None,
            "meta_file_val": None
        },
    ]
    wav_files, _ = load_meta_data(dataset_config, eval_split=False)
    output_files = [wav_file[1].replace(data_path, args.output_path).replace(
        '.wav', '.npy') for wav_file in wav_files]
else:
    # if target dataset is not defined
    if len(split_ext) > 0 and split_ext[1].lower() == '.csv':
        # Parse CSV
        print(f'CSV file: {data_path}')
        with open(data_path) as f:
            wav_path = os.path.join(os.path.dirname(data_path), 'wavs')
            wav_files = []
            print(f'Separator is: {sep}')
            for line in f:
                components = line.split(sep)
                if len(components) != 2:
                    print("Invalid line")
                    continue
                wav_file = os.path.join(wav_path, components[0] + '.wav')
                # print(f'wav_file: {wav_file}')
                if os.path.exists(wav_file):
                    wav_files.append(wav_file)
        print(f'Count of wavs imported: {len(wav_files)}')
    else:
        # Parse all wav files in data_path
        wav_files = glob.glob(data_path + '/**/*.wav', recursive=True)

    output_files = [wav_file.replace(data_path, args.output_path).replace(
        '.wav', '.npy') for wav_file in wav_files]

for output_file in output_files:
    os.makedirs(os.path.dirname(output_file), exist_ok=True)

# define Encoder model
model = SpeakerEncoder(**c.model)
model.load_state_dict(torch.load(args.model_path)['model'])
model.eval()
if args.use_cuda:
    model.cuda()

# compute speaker embeddings
speaker_mapping = {}
for idx, wav_file in enumerate(tqdm(wav_files)):
    if isinstance(wav_file, list):
        speaker_name = wav_file[2]
        wav_file = wav_file[1]

    mel_spec = ap.melspectrogram(ap.load_wav(wav_file, sr=ap.sample_rate)).T
    mel_spec = torch.FloatTensor(mel_spec[None, :, :])
    if args.use_cuda:
        mel_spec = mel_spec.cuda()
    embedd = model.compute_embedding(mel_spec)
    embedd = embedd.detach().cpu().numpy()
    np.save(output_files[idx], embedd)

    if args.target_dataset != '':
        # create speaker_mapping if target dataset is defined
        wav_file_name = os.path.basename(wav_file)
        speaker_mapping[wav_file_name] = {}
        speaker_mapping[wav_file_name]['name'] = speaker_name
        speaker_mapping[wav_file_name]['embedding'] = embedd.flatten().tolist()

if args.target_dataset != '':
    # save speaker_mapping if target dataset is defined
    mapping_file_path = os.path.join(args.output_path, 'speakers.json')
    save_speaker_mapping(args.output_path, speaker_mapping)
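
As a usage note for the new script above: with `--target_dataset` set, it writes a `speakers.json` whose entries each hold a speaker name and a flattened embedding per wav file. A minimal sketch of consuming that file follows; the output path is illustrative, not taken from the diff.

```python
import json
import numpy as np

# Illustrative path; the script above writes speakers.json under output_path.
with open('outputs/speakers.json') as f:
    speaker_mapping = json.load(f)

# Each entry mirrors the fields filled in the loop above: 'name' and 'embedding'.
first = next(iter(speaker_mapping.values()))
embedding = np.asarray(first['embedding'], dtype=np.float32)
print(first['name'], embedding.shape)
```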

@@ -2,6 +2,7 @@
# -*- coding: utf-8 -*-

import os
import glob
import argparse

import numpy as np

@@ -11,6 +12,7 @@ from TTS.tts.datasets.preprocess import load_meta_data
from TTS.utils.io import load_config
from TTS.utils.audio import AudioProcessor


def main():
    """Run preprocessing process."""
    parser = argparse.ArgumentParser(

@@ -30,7 +32,10 @@ def main():
    ap = AudioProcessor(**CONFIG.audio)

    # load the meta data of target dataset
    dataset_items = load_meta_data(CONFIG.datasets)[0]  # take only train data
    if 'data_path' in CONFIG.keys():
        dataset_items = glob.glob(os.path.join(CONFIG.data_path, '**', '*.wav'), recursive=True)
    else:
        dataset_items = load_meta_data(CONFIG.datasets)[0]  # take only train data
    print(f" > There are {len(dataset_items)} files.")

    mel_sum = 0

@@ -40,7 +45,7 @@ def main():
    N = 0
    for item in tqdm(dataset_items):
        # compute features
        wav = ap.load_wav(item[1])
        wav = ap.load_wav(item if isinstance(item, str) else item[1])
        linear = ap.spectrogram(wav)
        mel = ap.melspectrogram(wav)

@@ -56,7 +61,7 @@ def main():
    linear_mean = linear_sum / N
    linear_scale = np.sqrt(linear_square_sum / N - linear_mean ** 2)

    output_file_path = os.path.join(args.out_path, "scale_stats.npy")
    output_file_path = args.out_path
    stats = {}
    stats['mel_mean'] = mel_mean
    stats['mel_std'] = mel_scale

@@ -78,7 +83,7 @@ def main():
    del CONFIG.audio['clip_norm']
    stats['audio_config'] = CONFIG.audio
    np.save(output_file_path, stats, allow_pickle=True)
    print(f' > scale_stats.npy is saved to {output_file_path}')
    print(f' > stats saved to {output_file_path}')


if __name__ == "__main__":
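
A side note on the statistics above: they are one-pass estimates, mean = sum / N and std = sqrt(square_sum / N - mean**2), and the dict is saved with `allow_pickle=True`, so it must be loaded back the same way. A small self-contained sketch with toy data and an illustrative file name:

```python
import numpy as np

# Toy stand-in for concatenated mel frames (frames x mel bins).
frames = np.random.rand(1000, 80)
mel_sum = frames.sum(axis=0)
mel_square_sum = (frames ** 2).sum(axis=0)
N = frames.shape[0]
mel_mean = mel_sum / N
mel_scale = np.sqrt(mel_square_sum / N - mel_mean ** 2)  # per-bin std

# The script stores a plain dict, so loading needs allow_pickle and .item():
np.save('scale_stats.npy', {'mel_mean': mel_mean, 'mel_std': mel_scale}, allow_pickle=True)
stats = np.load('scale_stats.npy', allow_pickle=True).item()
print(stats['mel_mean'].shape)
```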

@@ -10,7 +10,7 @@ import time

import torch

from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.generic_utils import setup_model, is_tacotron
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.utils.audio import AudioProcessor

@@ -125,7 +125,8 @@ if __name__ == "__main__":
    model.eval()
    if args.use_cuda:
        model.cuda()
    model.decoder.set_r(cp['r'])
    if is_tacotron(C):
        model.decoder.set_r(cp['r'])

    # load vocoder model
    if args.vocoder_path != "":

@@ -153,7 +154,10 @@ if __name__ == "__main__":
        args.speaker_fileid = None

    if args.gst_style is None:
        gst_style = C.gst['gst_style_input']
        if is_tacotron(C):
            gst_style = C.gst['gst_style_input']
        else:
            gst_style = None
    else:
        # check if gst_style string is a dict, if is dict convert else use string
        try:
@@ -35,7 +35,7 @@ print(" > Using CUDA: ", use_cuda)
print(" > Number of GPUs: ", num_gpus)


def setup_loader(ap, is_val=False, verbose=False):
def setup_loader(ap: AudioProcessor, is_val: bool = False, verbose: bool = False):
    if is_val:
        loader = None
    else:

@@ -212,6 +212,7 @@ if __name__ == '__main__':
    parser.add_argument(
        '--config_path',
        type=str,
        required=True,
        help='Path to config file for training.',
    )
    parser.add_argument('--debug',

@@ -9,17 +9,16 @@ import time
import traceback

import torch
from random import randrange
from torch.utils.data import DataLoader

from TTS.tts.datasets.preprocess import load_meta_data
from TTS.tts.datasets.TTSDataset import MyDataset
from TTS.tts.layers.losses import GlowTTSLoss
from TTS.tts.utils.distribute import (DistributedSampler, init_distributed,
                                      reduce_tensor)
from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.generic_utils import setup_model, check_config_tts
from TTS.tts.utils.io import save_best_model, save_checkpoint
from TTS.tts.utils.measures import alignment_diagonal_score
from TTS.tts.utils.speakers import (get_speakers, load_speaker_mapping,
                                    save_speaker_mapping)
from TTS.tts.utils.speakers import parse_speakers, load_speaker_mapping
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram

@@ -34,10 +33,15 @@ from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.training import (NoamLR, check_update,
                                setup_torch_training_env)

# DISTRIBUTED
from torch.nn.parallel import DistributedDataParallel as DDP_th
from torch.utils.data.distributed import DistributedSampler
from TTS.utils.distribute import init_distributed, reduce_tensor


use_cuda, num_gpus = setup_torch_training_env(True, False)

def setup_loader(ap, r, is_val=False, verbose=False):


def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
    if is_val and not c.run_eval:
        loader = None
    else:

@@ -48,6 +52,7 @@ def setup_loader(ap, r, is_val=False, verbose=False):
        meta_data=meta_data_eval if is_val else meta_data_train,
        ap=ap,
        tp=c.characters if 'characters' in c.keys() else None,
        add_blank=c['add_blank'] if 'add_blank' in c.keys() else False,
        batch_group_size=0 if is_val else c.batch_group_size *
        c.batch_size,
        min_seq_len=c.min_seq_len,

@@ -56,7 +61,8 @@ def setup_loader(ap, r, is_val=False, verbose=False):
        use_phonemes=c.use_phonemes,
        phoneme_language=c.phoneme_language,
        enable_eos_bos=c.enable_eos_bos_chars,
        verbose=verbose)
        verbose=verbose,
        speaker_mapping=speaker_mapping if c.use_speaker_embedding and c.use_external_speaker_embedding_file else None)
    sampler = DistributedSampler(dataset) if num_gpus > 1 else None
    loader = DataLoader(
        dataset,

@@ -86,10 +92,13 @@ def format_data(data):
    avg_spec_length = torch.mean(mel_lengths.float())

    if c.use_speaker_embedding:
        speaker_ids = [
            speaker_mapping[speaker_name] for speaker_name in speaker_names
        ]
        speaker_ids = torch.LongTensor(speaker_ids)
        if c.use_external_speaker_embedding_file:
            speaker_ids = data[8]
        else:
            speaker_ids = [
                speaker_mapping[speaker_name] for speaker_name in speaker_names
            ]
            speaker_ids = torch.LongTensor(speaker_ids)
    else:
        speaker_ids = None

@@ -107,7 +116,7 @@ def format_data(data):
        avg_text_length, avg_spec_length, attn_mask
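
For context, the branch above distinguishes two speaker-conditioning modes: precomputed external embeddings arrive directly in the batch, while the plain multi-speaker path turns names into integer ids. A small illustrative sketch of the id path, with toy names not taken from the diff:

```python
import torch

# Map speaker names to integer ids, as format_data does when no external
# embedding file is used; the mapping itself comes from speakers.json.
speaker_mapping = {'p225': 0, 'p226': 1}
speaker_names = ['p226', 'p225', 'p226']
speaker_ids = torch.LongTensor([speaker_mapping[n] for n in speaker_names])
print(speaker_ids)  # tensor([1, 0, 1])
```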

def data_depended_init(model, ap):
def data_depended_init(model, ap, speaker_mapping=None):
    """Data depended initialization for activation normalization."""
    if hasattr(model, 'module'):
        for f in model.module.decoder.flows:

@@ -118,19 +127,19 @@ def data_depended_init(model, ap):
        if getattr(f, "set_ddi", False):
            f.set_ddi(True)

    data_loader = setup_loader(ap, 1, is_val=False)
    data_loader = setup_loader(ap, 1, is_val=False, speaker_mapping=speaker_mapping)
    model.train()
    print(" > Data depended initialization ... ")
    with torch.no_grad():
        for _, data in enumerate(data_loader):

            # format data
            text_input, text_lengths, mel_input, mel_lengths, _,\
            text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
                _, _, attn_mask = format_data(data)

            # forward pass model
            _ = model.forward(
                text_input, text_lengths, mel_input, mel_lengths, attn_mask)
                text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)
            break

    if hasattr(model, 'module'):

@@ -145,9 +154,9 @@ def data_depended_init(model, ap):


def train(model, criterion, optimizer, scheduler,
          ap, global_step, epoch, amp):
          ap, global_step, epoch, speaker_mapping=None):
    data_loader = setup_loader(ap, 1, is_val=False,
                               verbose=(epoch == 0))
                               verbose=(epoch == 0), speaker_mapping=speaker_mapping)
    model.train()
    epoch_time = 0
    keep_avg = KeepAverage()

@@ -158,43 +167,49 @@ def train(model, criterion, optimizer, scheduler,
    batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
    end_time = time.time()
    c_logger.print_train_start()
    scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
    for num_iter, data in enumerate(data_loader):
        start_time = time.time()

        # format data
        text_input, text_lengths, mel_input, mel_lengths, _,\
        text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
            avg_text_length, avg_spec_length, attn_mask = format_data(data)

        loader_time = time.time() - end_time

        global_step += 1
        optimizer.zero_grad()

        # forward pass model
        with torch.cuda.amp.autocast(enabled=c.mixed_precision):
            z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
                text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)

            # compute loss
            loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,
                                  o_dur_log, o_total_dur, text_lengths)

        # backward pass with loss scaling
        if c.mixed_precision:
            scaler.scale(loss_dict['loss']).backward()
            scaler.unscale_(optimizer)
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                       c.grad_clip)
            scaler.step(optimizer)
            scaler.update()
        else:
            loss_dict['loss'].backward()
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                       c.grad_clip)
            optimizer.step()

        grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
        optimizer.step()

        # setup lr
        if c.noam_schedule:
            scheduler.step()
        optimizer.zero_grad()

        # forward pass model
        z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
            text_input, text_lengths, mel_input, mel_lengths, attn_mask)

        # compute loss
        loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,
                              o_dur_log, o_total_dur, text_lengths)

        # backward pass
        if amp is not None:
            with amp.scale_loss(loss_dict['loss'], optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            loss_dict['loss'].backward()

        if amp:
            amp_opt_params = amp.master_params(optimizer)
        else:
            amp_opt_params = None
        grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True, amp_opt_params=amp_opt_params)
        optimizer.step()

        # current_lr
        current_lr = optimizer.param_groups[0]['lr']
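
The hunk above swaps apex `amp` for PyTorch's native AMP. For reference, here is a minimal, self-contained sketch of that `GradScaler`/`autocast` pattern with a toy model; all names and values are illustrative, not from the repo:

```python
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
mixed_precision = device == 'cuda'
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=mixed_precision)

for _ in range(3):
    x = torch.randn(8, 10, device=device)
    y = torch.randn(8, 1, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=mixed_precision):
        loss = nn.functional.mse_loss(model(x), y)  # forward in fp16 where safe
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.unscale_(optimizer)      # unscale so clipping sees true grad norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
    scaler.step(optimizer)          # skips the step if grads overflowed
    scaler.update()
```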

@@ -257,12 +272,12 @@ def train(model, criterion, optimizer, scheduler,
        if c.checkpoint:
            # save model
            save_checkpoint(model, optimizer, global_step, epoch, 1, OUT_PATH,
                            model_loss=loss_dict['loss'],
                            amp_state_dict=amp.state_dict() if amp else None)
                            model_loss=loss_dict['loss'])

            # Diagnostic visualizations
            # direct pass on model for spec predictions
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1])
            target_speaker = None if speaker_ids is None else speaker_ids[:1]
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1], g=target_speaker)
            spec_pred = spec_pred.permute(0, 2, 1)
            gt_spec = mel_input.permute(0, 2, 1)
            const_spec = spec_pred[0].data.cpu().numpy()

@@ -298,8 +313,8 @@ def train(model, criterion, optimizer, scheduler,


@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
    data_loader = setup_loader(ap, 1, is_val=True)
def evaluate(model, criterion, ap, global_step, epoch, speaker_mapping):
    data_loader = setup_loader(ap, 1, is_val=True, speaker_mapping=speaker_mapping)
    model.eval()
    epoch_time = 0
    keep_avg = KeepAverage()

@@ -309,12 +324,12 @@ def evaluate(model, criterion, ap, global_step, epoch):
        start_time = time.time()

        # format data
        text_input, text_lengths, mel_input, mel_lengths, _,\
        text_input, text_lengths, mel_input, mel_lengths, speaker_ids,\
            _, _, attn_mask = format_data(data)

        # forward pass model
        z, logdet, y_mean, y_log_scale, alignments, o_dur_log, o_total_dur = model.forward(
            text_input, text_lengths, mel_input, mel_lengths, attn_mask)
            text_input, text_lengths, mel_input, mel_lengths, attn_mask, g=speaker_ids)

        # compute loss
        loss_dict = criterion(z, y_mean, y_log_scale, logdet, mel_lengths,

@@ -355,10 +370,11 @@ def evaluate(model, criterion, ap, global_step, epoch):
    if args.rank == 0:
        # Diagnostic visualizations
        # direct pass on model for spec predictions
        target_speaker = None if speaker_ids is None else speaker_ids[:1]
        if hasattr(model, 'module'):
            spec_pred, *_ = model.module.inference(text_input[:1], text_lengths[:1])
            spec_pred, *_ = model.module.inference(text_input[:1], text_lengths[:1], g=target_speaker)
        else:
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1])
            spec_pred, *_ = model.inference(text_input[:1], text_lengths[:1], g=target_speaker)
        spec_pred = spec_pred.permute(0, 2, 1)
        gt_spec = mel_input.permute(0, 2, 1)

@@ -398,7 +414,17 @@ def evaluate(model, criterion, ap, global_step, epoch):
        test_audios = {}
        test_figures = {}
        print(" | > Synthesizing test sentences")
        speaker_id = 0 if c.use_speaker_embedding else None
        if c.use_speaker_embedding:
            if c.use_external_speaker_embedding_file:
                speaker_embedding = speaker_mapping[list(speaker_mapping.keys())[randrange(len(speaker_mapping)-1)]]['embedding']
                speaker_id = None
            else:
                speaker_id = 0
                speaker_embedding = None
        else:
            speaker_id = None
            speaker_embedding = None

        style_wav = c.get("style_wav_for_test")
        for idx, test_sentence in enumerate(test_sentences):
            try:
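
An aside on the random pick above: `randrange(len(speaker_mapping)-1)` as written can never select the last entry, since `randrange(n)` samples 0..n-1; `randrange(len(speaker_mapping))` would cover all of them. A toy illustration (mapping contents are hypothetical):

```python
from random import randrange

# Toy stand-in for the speakers.json mapping used above.
speaker_mapping = {
    'a.wav': {'name': 'spk1', 'embedding': [0.1, 0.2]},
    'b.wav': {'name': 'spk2', 'embedding': [0.3, 0.4]},
    'c.wav': {'name': 'spk3', 'embedding': [0.5, 0.6]},
}
keys = list(speaker_mapping.keys())
# randrange(len(keys)) samples every entry; len(keys) - 1 would skip 'c.wav'.
embedding = speaker_mapping[keys[randrange(len(keys))]]['embedding']
print(embedding)
```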

@@ -409,6 +435,7 @@ def evaluate(model, criterion, ap, global_step, epoch):
                    use_cuda,
                    ap,
                    speaker_id=speaker_id,
                    speaker_embedding=speaker_embedding,
                    style_wav=style_wav,
                    truncated=False,
                    enable_eos_bos_chars=c.enable_eos_bos_chars,  #pylint: disable=unused-argument

@@ -459,38 +486,13 @@ def main(args):  # pylint: disable=redefined-outer-name
        meta_data_eval = meta_data_eval[:int(len(meta_data_eval) * c.eval_portion)]

    # parse speakers
    if c.use_speaker_embedding:
        speakers = get_speakers(meta_data_train)
        if args.restore_path:
            prev_out_path = os.path.dirname(args.restore_path)
            speaker_mapping = load_speaker_mapping(prev_out_path)
            assert all([speaker in speaker_mapping
                        for speaker in speakers]), "As of now, you cannot " \
                                                   "introduce new speakers to " \
                                                   "a previously trained model."
        else:
            speaker_mapping = {name: i for i, name in enumerate(speakers)}
        save_speaker_mapping(OUT_PATH, speaker_mapping)
        num_speakers = len(speaker_mapping)
        print("Training with {} speakers: {}".format(num_speakers,
                                                     ", ".join(speakers)))
    else:
        num_speakers = 0
    num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers(c, args, meta_data_train, OUT_PATH)

    # setup model
    model = setup_model(num_chars, num_speakers, c)
    model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim=speaker_embedding_dim)
    optimizer = RAdam(model.parameters(), lr=c.lr, weight_decay=0, betas=(0.9, 0.98), eps=1e-9)
    criterion = GlowTTSLoss()

    if c.apex_amp_level:
        # pylint: disable=import-outside-toplevel
        from apex import amp
        from apex.parallel import DistributedDataParallel as DDP
        model.cuda()
        model, optimizer = amp.initialize(model, optimizer, opt_level=c.apex_amp_level)
    else:
        amp = None

    if args.restore_path:
        checkpoint = torch.load(args.restore_path, map_location='cpu')
        try:
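
The refactor above funnels all speaker-set handling into one `parse_speakers` call. Judging only from the call sites in this diff, a rough sketch of the contract it appears to satisfy; this is an assumption about its shape, not the repo's implementation (which lives in `TTS.tts.utils.speakers`):

```python
# Hedged sketch: single-mode version of a parse_speakers-like helper.
def parse_speakers_sketch(use_speaker_embedding, speakers):
    if not use_speaker_embedding:
        return 0, None, None  # num_speakers, embedding_dim, mapping
    mapping = {name: i for i, name in enumerate(sorted(speakers))}
    return len(mapping), None, mapping

num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers_sketch(
    True, ['spk1', 'spk2'])
print(num_speakers, speaker_mapping)  # 2 {'spk1': 0, 'spk2': 1}
```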

@@ -507,9 +509,6 @@ def main(args):  # pylint: disable=redefined-outer-name
            model.load_state_dict(model_dict)
            del model_dict

        if amp and 'amp' in checkpoint:
            amp.load_state_dict(checkpoint['amp'])

        for group in optimizer.param_groups:
            group['initial_lr'] = c.lr
        print(" > Model restored from step %d" % checkpoint['step'],

@@ -524,7 +523,7 @@ def main(args):  # pylint: disable=redefined-outer-name

    # DISTRIBUTED
    if num_gpus > 1:
        model = DDP(model)
        model = DDP_th(model, device_ids=[args.rank])

    if c.noam_schedule:
        scheduler = NoamLR(optimizer,

@@ -540,19 +539,19 @@ def main(args):  # pylint: disable=redefined-outer-name
    best_loss = float('inf')

    global_step = args.restore_step
    model = data_depended_init(model, ap)
    model = data_depended_init(model, ap, speaker_mapping)
    for epoch in range(0, c.epochs):
        c_logger.print_epoch_start(epoch, c.epochs)
        train_avg_loss_dict, global_step = train(model, criterion, optimizer,
                                                 scheduler, ap, global_step,
                                                 epoch, amp)
        eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch)
                                                 epoch, speaker_mapping)
        eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch, speaker_mapping=speaker_mapping)
        c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
        target_loss = train_avg_loss_dict['avg_loss']
        if c.run_eval:
            target_loss = eval_avg_loss_dict['avg_loss']
        best_loss = save_best_model(target_loss, best_loss, model, optimizer, global_step, epoch, c.r,
                                    OUT_PATH, amp_state_dict=amp.state_dict() if amp else None)
                                    OUT_PATH)


if __name__ == '__main__':

@@ -602,10 +601,11 @@ if __name__ == '__main__':
    # setup output paths and read configs
    c = load_config(args.config_path)
    # check_config(c)
    check_config_tts(c)
    _ = os.path.dirname(os.path.realpath(__file__))

    if c.apex_amp_level:
        print(" > apex AMP level: ", c.apex_amp_level)
    if c.mixed_precision:
        print(" > Mixed precision enabled.")

    OUT_PATH = args.continue_path
    if args.continue_path == '':

@@ -7,28 +7,25 @@ import os
import sys
import time
import traceback
from random import randrange

import numpy as np
import torch

from random import randrange
from torch.utils.data import DataLoader
from TTS.tts.datasets.preprocess import load_meta_data
from TTS.tts.datasets.TTSDataset import MyDataset
from TTS.tts.layers.losses import TacotronLoss
from TTS.tts.utils.distribute import (DistributedSampler,
                                      apply_gradient_allreduce,
                                      init_distributed, reduce_tensor)
from TTS.tts.utils.generic_utils import setup_model, check_config_tts
from TTS.tts.utils.generic_utils import check_config_tts, setup_model
from TTS.tts.utils.io import save_best_model, save_checkpoint
from TTS.tts.utils.measures import alignment_diagonal_score
from TTS.tts.utils.speakers import (get_speakers, load_speaker_mapping,
                                    save_speaker_mapping)
from TTS.tts.utils.speakers import load_speaker_mapping, parse_speakers
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram
from TTS.utils.audio import AudioProcessor
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.distribute import (DistributedSampler, apply_gradient_allreduce,
                                  init_distributed, reduce_tensor)
from TTS.utils.generic_utils import (KeepAverage, count_parameters,
                                     create_experiment_folder, get_git_branch,
                                     remove_experiment_folder, set_init_dict)

@@ -41,6 +38,7 @@ from TTS.utils.training import (NoamLR, adam_weight_decay, check_update,

use_cuda, num_gpus = setup_torch_training_env(True, False)


def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
    if is_val and not c.run_eval:
        loader = None

@@ -52,6 +50,7 @@ def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
        meta_data=meta_data_eval if is_val else meta_data_train,
        ap=ap,
        tp=c.characters if 'characters' in c.keys() else None,
        add_blank=c['add_blank'] if 'add_blank' in c.keys() else False,
        batch_group_size=0 if is_val else c.batch_group_size *
        c.batch_size,
        min_seq_len=c.min_seq_len,

@@ -87,8 +86,8 @@ def format_data(data, speaker_mapping=None):
    mel_input = data[4]
    mel_lengths = data[5]
    stop_targets = data[6]
    avg_text_length = torch.mean(text_lengths.float())
    avg_spec_length = torch.mean(mel_lengths.float())
    max_text_length = torch.max(text_lengths.float())
    max_spec_length = torch.max(mel_lengths.float())

    if c.use_speaker_embedding:
        if c.use_external_speaker_embedding_file:

@@ -124,11 +123,11 @@ def format_data(data, speaker_mapping=None):
    if speaker_embeddings is not None:
        speaker_embeddings = speaker_embeddings.cuda(non_blocking=True)

    return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length
    return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, max_text_length, max_spec_length

def train(model, criterion, optimizer, optimizer_st, scheduler,
          ap, global_step, epoch, amp, speaker_mapping=None):
          ap, global_step, epoch, scaler, scaler_st, speaker_mapping=None):
    data_loader = setup_loader(ap, model.decoder.r, is_val=False,
                               verbose=(epoch == 0), speaker_mapping=speaker_mapping)
    model.train()

@@ -145,7 +144,7 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
        start_time = time.time()

        # format data
        text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length = format_data(data, speaker_mapping)
        text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, max_text_length, max_spec_length = format_data(data, speaker_mapping)
        loader_time = time.time() - end_time

        global_step += 1

@@ -153,65 +152,79 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
        # setup lr
        if c.noam_schedule:
            scheduler.step()

        optimizer.zero_grad()
        if optimizer_st:
            optimizer_st.zero_grad()

        # forward pass model
        if c.bidirectional_decoder or c.double_decoder_consistency:
            decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
                text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
        else:
            decoder_output, postnet_output, alignments, stop_tokens = model(
                text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
            decoder_backward_output = None
            alignments_backward = None
        with torch.cuda.amp.autocast(enabled=c.mixed_precision):
            # forward pass model
            if c.bidirectional_decoder or c.double_decoder_consistency:
                decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
                    text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
            else:
                decoder_output, postnet_output, alignments, stop_tokens = model(
                    text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
                decoder_backward_output = None
                alignments_backward = None

        # set the [alignment] lengths wrt reduction factor for guided attention
        if mel_lengths.max() % model.decoder.r != 0:
            alignment_lengths = (mel_lengths + (model.decoder.r - (mel_lengths.max() % model.decoder.r))) // model.decoder.r
        else:
            alignment_lengths = mel_lengths // model.decoder.r
            # set the [alignment] lengths wrt reduction factor for guided attention
            if mel_lengths.max() % model.decoder.r != 0:
                alignment_lengths = (mel_lengths + (model.decoder.r - (mel_lengths.max() % model.decoder.r))) // model.decoder.r
            else:
                alignment_lengths = mel_lengths // model.decoder.r

        # compute loss
        loss_dict = criterion(postnet_output, decoder_output, mel_input,
                              linear_input, stop_tokens, stop_targets,
                              mel_lengths, decoder_backward_output,
                              alignments, alignment_lengths, alignments_backward,
                              text_lengths)
            # compute loss
            loss_dict = criterion(postnet_output, decoder_output, mel_input,
                                  linear_input, stop_tokens, stop_targets,
                                  mel_lengths, decoder_backward_output,
                                  alignments, alignment_lengths, alignments_backward,
                                  text_lengths)

        # backward pass
        if amp is not None:
            with amp.scale_loss(loss_dict['loss'], optimizer) as scaled_loss:
                scaled_loss.backward()
        # check nan loss
        if torch.isnan(loss_dict['loss']).any():
            raise RuntimeError(f'Detected NaN loss at step {global_step}.')

        # optimizer step
        if c.mixed_precision:
            # model optimizer step in mixed precision mode
            scaler.scale(loss_dict['loss']).backward()
            scaler.unscale_(optimizer)
            optimizer, current_lr = adam_weight_decay(optimizer)
            grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
            scaler.step(optimizer)
            scaler.update()

            # stopnet optimizer step
            if c.separate_stopnet:
                scaler_st.scale(loss_dict['stopnet_loss']).backward()
                scaler_st.unscale_(optimizer_st)
                optimizer_st, _ = adam_weight_decay(optimizer_st)
                grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0)
                scaler_st.step(optimizer_st)
                scaler_st.update()
            else:
                grad_norm_st = 0
        else:
            # main model optimizer step
            loss_dict['loss'].backward()
            optimizer, current_lr = adam_weight_decay(optimizer)
            grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
            optimizer.step()

        optimizer, current_lr = adam_weight_decay(optimizer)
        if amp:
            amp_opt_params = amp.master_params(optimizer)
        else:
            amp_opt_params = None
        grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True, amp_opt_params=amp_opt_params)
        optimizer.step()
            # stopnet optimizer step
            if c.separate_stopnet:
                loss_dict['stopnet_loss'].backward()
                optimizer_st, _ = adam_weight_decay(optimizer_st)
                grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0)
                optimizer_st.step()
            else:
                grad_norm_st = 0

        # compute alignment error (the lower the better )
        align_error = 1 - alignment_diagonal_score(alignments)
        loss_dict['align_error'] = align_error

        # backpass and check the grad norm for stop loss
        if c.separate_stopnet:
            loss_dict['stopnet_loss'].backward()
            optimizer_st, _ = adam_weight_decay(optimizer_st)
            if amp:
                amp_opt_params = amp.master_params(optimizer)
            else:
                amp_opt_params = None
            grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0, amp_opt_params=amp_opt_params)
            optimizer_st.step()
        else:
            grad_norm_st = 0

        step_time = time.time() - start_time
        epoch_time += step_time
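
A quick worked note on the reduction-factor rounding above: the batch is effectively padded so its longest mel spectrogram becomes a multiple of `r`, and then every length is divided by `r`. A compact check with toy values; the `(… ) % r` trick collapses the if/else into one expression:

```python
import torch

r = 5
mel_lengths = torch.tensor([12, 15, 7])   # toy frame counts
pad = (r - mel_lengths.max() % r) % r     # frames added to the batch
alignment_lengths = (mel_lengths + pad) // r
print(pad.item(), alignment_lengths)      # 0 tensor([2, 3, 1])
```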

@@ -242,8 +255,8 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
        # print training progress
        if global_step % c.print_step == 0:
            log_dict = {
                "avg_spec_length": [avg_spec_length, 1],  # value, precision
                "avg_text_length": [avg_text_length, 1],
                "max_spec_length": [max_spec_length, 1],  # value, precision
                "max_text_length": [max_text_length, 1],
                "step_time": [step_time, 4],
                "loader_time": [loader_time, 2],
                "current_lr": current_lr,

@@ -270,7 +283,7 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
            save_checkpoint(model, optimizer, global_step, epoch, model.decoder.r, OUT_PATH,
                            optimizer_st=optimizer_st,
                            model_loss=loss_dict['postnet_loss'],
                            amp_state_dict=amp.state_dict() if amp else None)
                            scaler=scaler.state_dict() if c.mixed_precision else None)

            # Diagnostic visualizations
            const_spec = postnet_output[0].data.cpu().numpy()

@@ -502,45 +515,14 @@ def main(args):  # pylint: disable=redefined-outer-name
        meta_data_eval = meta_data_eval[:int(len(meta_data_eval) * c.eval_portion)]

    # parse speakers
    if c.use_speaker_embedding:
        speakers = get_speakers(meta_data_train)
        if args.restore_path:
            if c.use_external_speaker_embedding_file:  # if restore checkpoint and use External Embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                if not speaker_mapping:
                    print("WARNING: speakers.json was not found in restore_path, trying to use CONFIG.external_speaker_embedding_file")
                    speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
                    if not speaker_mapping:
                        raise RuntimeError("You must copy the file speakers.json to restore_path, or set a valid file in CONFIG.external_speaker_embedding_file")
                speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
            elif not c.use_external_speaker_embedding_file:  # if restore checkpoint and don't use External Embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                speaker_embedding_dim = None
                assert all([speaker in speaker_mapping
                            for speaker in speakers]), "As of now, you cannot " \
                                                       "introduce new speakers to " \
                                                       "a previously trained model."
        elif c.use_external_speaker_embedding_file and c.external_speaker_embedding_file:  # if start new train using External Embedding file
            speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
            speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
        elif c.use_external_speaker_embedding_file and not c.external_speaker_embedding_file:  # if start new train using External Embedding file and don't pass external embedding file
            raise "use_external_speaker_embedding_file is True, so you need pass a external speaker embedding file, run GE2E-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb or AngularPrototypical-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb notebook in notebooks/ folder"
        else:  # if start new train and don't use External Embedding file
            speaker_mapping = {name: i for i, name in enumerate(speakers)}
            speaker_embedding_dim = None
            save_speaker_mapping(OUT_PATH, speaker_mapping)
        num_speakers = len(speaker_mapping)
        print("Training with {} speakers: {}".format(num_speakers,
                                                     ", ".join(speakers)))
    else:
        num_speakers = 0
        speaker_embedding_dim = None
        speaker_mapping = None
    num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers(c, args, meta_data_train, OUT_PATH)

    model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim)

    # scalers for mixed precision training
    scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
    scaler_st = torch.cuda.amp.GradScaler() if c.mixed_precision and c.separate_stopnet else None

    params = set_weight_decay(model, c.wd)
    optimizer = RAdam(params, lr=c.lr, weight_decay=0)
    if c.stopnet and c.separate_stopnet:

@@ -550,26 +532,22 @@ def main(args):  # pylint: disable=redefined-outer-name
    else:
        optimizer_st = None

    if c.apex_amp_level == "O1":
        # pylint: disable=import-outside-toplevel
        from apex import amp
        model.cuda()
        model, optimizer = amp.initialize(model, optimizer, opt_level=c.apex_amp_level)
    else:
        amp = None

    # setup criterion
    criterion = TacotronLoss(c, stopnet_pos_weight=10.0, ga_sigma=0.4)

    if args.restore_path:
        checkpoint = torch.load(args.restore_path, map_location='cpu')
        try:
            # TODO: fix optimizer init, model.cuda() needs to be called before
            print(" > Restoring Model.")
            model.load_state_dict(checkpoint['model'])
            # optimizer restore
            # optimizer.load_state_dict(checkpoint['optimizer'])
            print(" > Restoring Optimizer.")
            optimizer.load_state_dict(checkpoint['optimizer'])
            if "scaler" in checkpoint and c.mixed_precision:
                print(" > Restoring AMP Scaler...")
                scaler.load_state_dict(checkpoint["scaler"])
            if c.reinit_layers:
                raise RuntimeError
            model.load_state_dict(checkpoint['model'])
        except KeyError:
            print(" > Partial model initialization.")
            model_dict = model.state_dict()

@@ -579,9 +557,6 @@ def main(args):  # pylint: disable=redefined-outer-name
            model.load_state_dict(model_dict)
            del model_dict

        if amp and 'amp' in checkpoint:
            amp.load_state_dict(checkpoint['amp'])

        for group in optimizer.param_groups:
            group['lr'] = c.lr
        print(" > Model restored from step %d" % checkpoint['step'],
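
For reference, the scaler restore above pairs with saving `scaler.state_dict()` at checkpoint time, so the loss-scale history survives a resume. A minimal round-trip sketch; the file name is illustrative:

```python
import torch

scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

# Save the scaler alongside the model, as the checkpoint code above does.
torch.save({'scaler': scaler.state_dict()}, 'checkpoint.pth.tar')

# ... and restore it on resume so loss scaling picks up where it left off.
checkpoint = torch.load('checkpoint.pth.tar', map_location='cpu')
if 'scaler' in checkpoint:
    scaler.load_state_dict(checkpoint['scaler'])
```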

@@ -624,14 +599,14 @@ def main(args):  # pylint: disable=redefined-outer-name
        print("\n > Number of output frames:", model.decoder.r)
        train_avg_loss_dict, global_step = train(model, criterion, optimizer,
                                                 optimizer_st, scheduler, ap,
                                                 global_step, epoch, amp, speaker_mapping)
                                                 global_step, epoch, scaler, scaler_st, speaker_mapping)
        eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch, speaker_mapping)
        c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
        target_loss = train_avg_loss_dict['avg_postnet_loss']
        if c.run_eval:
            target_loss = eval_avg_loss_dict['avg_postnet_loss']
        best_loss = save_best_model(target_loss, best_loss, model, optimizer, global_step, epoch, c.r,
                                    OUT_PATH, amp_state_dict=amp.state_dict() if amp else None)
                                    OUT_PATH, scaler=scaler.state_dict() if c.mixed_precision else None)


if __name__ == '__main__':

@@ -683,8 +658,8 @@ if __name__ == '__main__':
    check_config_tts(c)
    _ = os.path.dirname(os.path.realpath(__file__))

    if c.apex_amp_level == 'O1':
        print(" > apex AMP level: ", c.apex_amp_level)
    if c.mixed_precision:
        print(" > Mixed precision mode is ON")

    OUT_PATH = args.continue_path
    if args.continue_path == '':
|
|||
from TTS.utils.training import setup_torch_training_env
|
||||
from TTS.vocoder.datasets.gan_dataset import GANDataset
|
||||
from TTS.vocoder.datasets.preprocess import load_wav_data, load_wav_feat_data
|
||||
# from distribute import (DistributedSampler, apply_gradient_allreduce,
|
||||
# init_distributed, reduce_tensor)
|
||||
from TTS.vocoder.layers.losses import DiscriminatorLoss, GeneratorLoss
|
||||
from TTS.vocoder.utils.generic_utils import (plot_results, setup_discriminator,
|
||||
setup_generator)
|
||||
from TTS.vocoder.utils.io import save_best_model, save_checkpoint
|
||||
|
||||
# DISTRIBUTED
|
||||
from torch.nn.parallel import DistributedDataParallel as DDP_th
|
||||
from torch.utils.data.distributed import DistributedSampler
|
||||
from TTS.utils.distribute import init_distributed
|
||||
|
||||
use_cuda, num_gpus = setup_torch_training_env(True, True)
|
||||
|
||||
|
||||
|
@ -45,12 +48,12 @@ def setup_loader(ap, is_val=False, verbose=False):
|
|||
use_cache=c.use_cache,
|
||||
verbose=verbose)
|
||||
dataset.shuffle_mapping()
|
||||
# sampler = DistributedSampler(dataset) if num_gpus > 1 else None
|
||||
sampler = DistributedSampler(dataset, shuffle=True) if num_gpus > 1 else None
|
||||
loader = DataLoader(dataset,
|
||||
batch_size=1 if is_val else c.batch_size,
|
||||
shuffle=True,
|
||||
shuffle=False if num_gpus > 1 else True,
|
||||
drop_last=False,
|
||||
sampler=None,
|
||||
sampler=sampler,
|
||||
num_workers=c.num_val_loader_workers
|
||||
if is_val else c.num_loader_workers,
|
||||
pin_memory=False)
|
||||
|
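
The loader change above follows the standard rule that a `DistributedSampler` and `shuffle=True` on the `DataLoader` are mutually exclusive: each process gets its own shard and shuffling is delegated to the sampler. A minimal sketch of that pattern with a toy dataset; `world_size` is a stand-in for the number of processes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))
world_size = 1  # > 1 only under torch.distributed; illustrative here

# With a DistributedSampler each process sees its own shard, and shuffling
# must be done by the sampler instead of the DataLoader.
sampler = DistributedSampler(dataset, shuffle=True) if world_size > 1 else None
loader = DataLoader(dataset,
                    batch_size=8,
                    shuffle=sampler is None,
                    sampler=sampler)
```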

@@ -243,41 +246,42 @@ def train(model_G, criterion_G, optimizer_G, model_D, criterion_D, optimizer_D,
            c_logger.print_train_step(batch_n_iter, num_iter, global_step,
                                      log_dict, loss_dict, keep_avg.avg_values)

        # plot step stats
        if global_step % 10 == 0:
            iter_stats = {
                "lr_G": current_lr_G,
                "lr_D": current_lr_D,
                "step_time": step_time
            }
            iter_stats.update(loss_dict)
            tb_logger.tb_train_iter_stats(global_step, iter_stats)
        if args.rank == 0:
            # plot step stats
            if global_step % 10 == 0:
                iter_stats = {
                    "lr_G": current_lr_G,
                    "lr_D": current_lr_D,
                    "step_time": step_time
                }
                iter_stats.update(loss_dict)
                tb_logger.tb_train_iter_stats(global_step, iter_stats)

        # save checkpoint
        if global_step % c.save_step == 0:
            if c.checkpoint:
                # save model
                save_checkpoint(model_G,
                                optimizer_G,
                                scheduler_G,
                                model_D,
                                optimizer_D,
                                scheduler_D,
                                global_step,
                                epoch,
                                OUT_PATH,
                                model_losses=loss_dict)
            # save checkpoint
            if global_step % c.save_step == 0:
                if c.checkpoint:
                    # save model
                    save_checkpoint(model_G,
                                    optimizer_G,
                                    scheduler_G,
                                    model_D,
                                    optimizer_D,
                                    scheduler_D,
                                    global_step,
                                    epoch,
                                    OUT_PATH,
                                    model_losses=loss_dict)

            # compute spectrograms
            figures = plot_results(y_hat_vis, y_G, ap, global_step,
                                   'train')
            tb_logger.tb_train_figures(global_step, figures)
                # compute spectrograms
                figures = plot_results(y_hat_vis, y_G, ap, global_step,
                                       'train')
                tb_logger.tb_train_figures(global_step, figures)

            # Sample audio
            sample_voice = y_hat_vis[0].squeeze(0).detach().cpu().numpy()
            tb_logger.tb_train_audios(global_step,
                                      {'train/audio': sample_voice},
                                      c.audio["sample_rate"])
                # Sample audio
                sample_voice = y_hat_vis[0].squeeze(0).detach().cpu().numpy()
                tb_logger.tb_train_audios(global_step,
                                          {'train/audio': sample_voice},
                                          c.audio["sample_rate"])
        end_time = time.time()

        # print epoch stats

@@ -286,7 +290,8 @@ def train(model_G, criterion_G, optimizer_G, model_D, criterion_D, optimizer_D,
    # Plot Training Epoch Stats
    epoch_stats = {"epoch_time": epoch_time}
    epoch_stats.update(keep_avg.avg_values)
    tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
    if args.rank == 0:
        tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
    # TODO: plot model stats
    # if c.tb_model_param_stats:
    #     tb_logger.tb_model_weights(model, global_step)

@@ -326,7 +331,6 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
            y_hat = model_G.pqmf_synthesis(y_hat)
            y_G_sub = model_G.pqmf_analysis(y_G)

        scores_fake, feats_fake, feats_real = None, None, None
        if global_step > c.steps_to_start_discriminator:

@@ -403,7 +407,6 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
            else:
                loss_dict[key] = value.item()

        step_time = time.time() - start_time
        epoch_time += step_time

@@ -419,20 +422,21 @@ def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch)
        if c.print_eval:
            c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)

    # compute spectrograms
    figures = plot_results(y_hat, y_G, ap, global_step, 'eval')
    tb_logger.tb_eval_figures(global_step, figures)
    if args.rank == 0:
        # compute spectrograms
        figures = plot_results(y_hat, y_G, ap, global_step, 'eval')
        tb_logger.tb_eval_figures(global_step, figures)

    # Sample audio
    sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
    tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
                             c.audio["sample_rate"])
        # Sample audio
        sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
        tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
                                 c.audio["sample_rate"])

    # synthesize a full voice
    tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)

    # synthesize a full voice
    data_loader.return_segments = False

    tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)

    return keep_avg.avg_values

@@ -443,7 +447,8 @@ def main(args):  # pylint: disable=redefined-outer-name
    print(f" > Loading wavs from: {c.data_path}")
    if c.feature_path is not None:
        print(f" > Loading features from: {c.feature_path}")
        eval_data, train_data = load_wav_feat_data(c.data_path, c.feature_path, c.eval_split_size)
        eval_data, train_data = load_wav_feat_data(
            c.data_path, c.feature_path, c.eval_split_size)
    else:
        eval_data, train_data = load_wav_data(c.data_path, c.eval_split_size)

@@ -451,9 +456,9 @@ def main(args):  # pylint: disable=redefined-outer-name
    ap = AudioProcessor(**c.audio)

    # DISTRIBUTED
    # if num_gpus > 1:
    #     init_distributed(args.rank, num_gpus, args.group_id,
    #                      c.distributed["backend"], c.distributed["url"])
    if num_gpus > 1:
        init_distributed(args.rank, num_gpus, args.group_id,
                         c.distributed["backend"], c.distributed["url"])

    # setup models
    model_gen = setup_generator(c)

@@ -470,10 +475,12 @@ def main(args):  # pylint: disable=redefined-outer-name
    scheduler_disc = None
    if 'lr_scheduler_gen' in c:
        scheduler_gen = getattr(torch.optim.lr_scheduler, c.lr_scheduler_gen)
        scheduler_gen = scheduler_gen(optimizer_gen, **c.lr_scheduler_gen_params)
        scheduler_gen = scheduler_gen(
            optimizer_gen, **c.lr_scheduler_gen_params)
    if 'lr_scheduler_disc' in c:
        scheduler_disc = getattr(torch.optim.lr_scheduler, c.lr_scheduler_disc)
        scheduler_disc = scheduler_disc(optimizer_disc, **c.lr_scheduler_disc_params)
        scheduler_disc = scheduler_disc(
            optimizer_disc, **c.lr_scheduler_disc_params)

    # setup criterion
    criterion_gen = GeneratorLoss(c)

@@ -531,8 +538,9 @@ def main(args):  # pylint: disable=redefined-outer-name
        criterion_disc.cuda()

    # DISTRIBUTED
    # if num_gpus > 1:
    #     model = apply_gradient_allreduce(model)
    if num_gpus > 1:
        model_gen = DDP_th(model_gen, device_ids=[args.rank])
        model_disc = DDP_th(model_disc, device_ids=[args.rank])

    num_params = count_parameters(model_gen)
    print(" > Generator has {} parameters".format(num_params), flush=True)

@@ -572,8 +580,7 @@ if __name__ == '__main__':
    parser.add_argument(
        '--continue_path',
        type=str,
        help=
        'Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
        help='Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
        default='',
        required='--config_path' not in sys.argv)
    parser.add_argument(

@@ -0,0 +1,511 @@

import argparse
import glob
import os
import sys
import time
import traceback
import numpy as np

import torch
# DISTRIBUTED
from torch.nn.parallel import DistributedDataParallel as DDP_th
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from TTS.utils.audio import AudioProcessor
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.distribute import init_distributed
from TTS.utils.generic_utils import (KeepAverage, count_parameters,
                                     create_experiment_folder, get_git_branch,
                                     remove_experiment_folder, set_init_dict)
from TTS.utils.io import copy_config_file, load_config
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.training import setup_torch_training_env
from TTS.vocoder.datasets.preprocess import load_wav_data, load_wav_feat_data
from TTS.vocoder.datasets.wavegrad_dataset import WaveGradDataset
from TTS.vocoder.utils.generic_utils import plot_results, setup_generator
from TTS.vocoder.utils.io import save_best_model, save_checkpoint

use_cuda, num_gpus = setup_torch_training_env(True, True)


def setup_loader(ap, is_val=False, verbose=False):
    if is_val and not c.run_eval:
        loader = None
    else:
        dataset = WaveGradDataset(ap=ap,
                                  items=eval_data if is_val else train_data,
                                  seq_len=c.seq_len,
                                  hop_len=ap.hop_length,
                                  pad_short=c.pad_short,
                                  conv_pad=c.conv_pad,
                                  is_training=not is_val,
                                  return_segments=True,
                                  use_noise_augment=False,
                                  use_cache=c.use_cache,
                                  verbose=verbose)
        sampler = DistributedSampler(dataset) if num_gpus > 1 else None
        loader = DataLoader(dataset,
                            batch_size=c.batch_size,
                            shuffle=num_gpus <= 1,
                            drop_last=False,
                            sampler=sampler,
                            num_workers=c.num_val_loader_workers
                            if is_val else c.num_loader_workers,
                            pin_memory=False)

    return loader


def format_data(data):
    # return a whole audio segment
    m, x = data
    x = x.unsqueeze(1)
    if use_cuda:
        m = m.cuda(non_blocking=True)
        x = x.cuda(non_blocking=True)
    return m, x


def format_test_data(data):
    # return a whole audio segment
    m, x = data
    m = m[None, ...]
    x = x[None, None, ...]
    if use_cuda:
        m = m.cuda(non_blocking=True)
        x = x.cuda(non_blocking=True)
    return m, x


def train(model, criterion, optimizer,
          scheduler, scaler, ap, global_step, epoch):
    data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
    model.train()
    epoch_time = 0
    keep_avg = KeepAverage()
    if use_cuda:
        batch_n_iter = int(
            len(data_loader.dataset) / (c.batch_size * num_gpus))
    else:
        batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
    end_time = time.time()
    c_logger.print_train_start()
    # setup noise schedule
    noise_schedule = c['train_noise_schedule']
    betas = np.linspace(noise_schedule['min_val'], noise_schedule['max_val'], noise_schedule['num_steps'])
    if hasattr(model, 'module'):
        model.module.compute_noise_level(betas)
    else:
        model.compute_noise_level(betas)
for num_iter, data in enumerate(data_loader):
|
||||
start_time = time.time()
|
||||
|
||||
# format data
|
||||
m, x = format_data(data)
|
||||
loader_time = time.time() - end_time
|
||||
|
||||
global_step += 1
|
||||
|
||||
with torch.cuda.amp.autocast(enabled=c.mixed_precision):
|
||||
# compute noisy input
|
||||
if hasattr(model, 'module'):
|
||||
noise, x_noisy, noise_scale = model.module.compute_y_n(x)
|
||||
else:
|
||||
noise, x_noisy, noise_scale = model.compute_y_n(x)
|
||||
|
||||
# forward pass
|
||||
noise_hat = model(x_noisy, m, noise_scale)
|
||||
|
||||
# compute losses
|
||||
loss = criterion(noise, noise_hat)
|
||||
loss_wavegrad_dict = {'wavegrad_loss':loss}
|
||||
|
||||
# check nan loss
|
||||
if torch.isnan(loss).any():
|
||||
raise RuntimeError(f'Detected NaN loss at step {global_step}.')
|
||||
|
||||
optimizer.zero_grad()
|
||||
|
||||
# backward pass with loss scaling
|
||||
if c.mixed_precision:
|
||||
scaler.scale(loss).backward()
|
||||
scaler.unscale_(optimizer)
|
||||
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
|
||||
c.clip_grad)
|
||||
scaler.step(optimizer)
|
||||
scaler.update()
|
||||
else:
|
||||
loss.backward()
|
||||
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
|
||||
c.clip_grad)
|
||||
optimizer.step()
|
||||
|
||||
# schedule update
|
||||
if scheduler is not None:
|
||||
scheduler.step()
|
||||
|
||||
# disconnect loss values
|
||||
loss_dict = dict()
|
||||
for key, value in loss_wavegrad_dict.items():
|
||||
if isinstance(value, int):
|
||||
loss_dict[key] = value
|
||||
else:
|
||||
loss_dict[key] = value.item()
|
||||
|
||||
# epoch/step timing
|
||||
step_time = time.time() - start_time
|
||||
epoch_time += step_time
|
||||
|
||||
# get current learning rates
|
||||
current_lr = list(optimizer.param_groups)[0]['lr']
|
||||
|
||||
# update avg stats
|
||||
update_train_values = dict()
|
||||
for key, value in loss_dict.items():
|
||||
update_train_values['avg_' + key] = value
|
||||
update_train_values['avg_loader_time'] = loader_time
|
||||
update_train_values['avg_step_time'] = step_time
|
||||
keep_avg.update_values(update_train_values)
|
||||
|
||||
# print training stats
|
||||
if global_step % c.print_step == 0:
|
||||
log_dict = {
|
||||
'step_time': [step_time, 2],
|
||||
'loader_time': [loader_time, 4],
|
||||
"current_lr": current_lr,
|
||||
"grad_norm": grad_norm.item()
|
||||
}
|
||||
c_logger.print_train_step(batch_n_iter, num_iter, global_step,
|
||||
log_dict, loss_dict, keep_avg.avg_values)
|
||||
|
||||
if args.rank == 0:
|
||||
# plot step stats
|
||||
if global_step % 10 == 0:
|
||||
iter_stats = {
|
||||
"lr": current_lr,
|
||||
"grad_norm": grad_norm.item(),
|
||||
"step_time": step_time
|
||||
}
|
||||
iter_stats.update(loss_dict)
|
||||
tb_logger.tb_train_iter_stats(global_step, iter_stats)
|
||||
|
||||
# save checkpoint
|
||||
if global_step % c.save_step == 0:
|
||||
if c.checkpoint:
|
||||
# save model
|
||||
save_checkpoint(model,
|
||||
optimizer,
|
||||
scheduler,
|
||||
None,
|
||||
None,
|
||||
None,
|
||||
global_step,
|
||||
epoch,
|
||||
OUT_PATH,
|
||||
model_losses=loss_dict,
|
||||
scaler=scaler.state_dict() if c.mixed_precision else None)
|
||||
|
||||
end_time = time.time()
|
||||
|
||||
# print epoch stats
|
||||
c_logger.print_train_epoch_end(global_step, epoch, epoch_time, keep_avg)
|
||||
|
||||
# Plot Training Epoch Stats
|
||||
epoch_stats = {"epoch_time": epoch_time}
|
||||
epoch_stats.update(keep_avg.avg_values)
|
||||
if args.rank == 0:
|
||||
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
|
||||
# TODO: plot model stats
|
||||
if c.tb_model_param_stats and args.rank == 0:
|
||||
tb_logger.tb_model_weights(model, global_step)
|
||||
return keep_avg.avg_values, global_step
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def evaluate(model, criterion, ap, global_step, epoch):
|
||||
data_loader = setup_loader(ap, is_val=True, verbose=(epoch == 0))
|
||||
model.eval()
|
||||
epoch_time = 0
|
||||
keep_avg = KeepAverage()
|
||||
end_time = time.time()
|
||||
c_logger.print_eval_start()
|
||||
for num_iter, data in enumerate(data_loader):
|
||||
start_time = time.time()
|
||||
|
||||
# format data
|
||||
m, x = format_data(data)
|
||||
loader_time = time.time() - end_time
|
||||
|
||||
global_step += 1
|
||||
|
||||
# compute noisy input
|
||||
if hasattr(model, 'module'):
|
||||
noise, x_noisy, noise_scale = model.module.compute_y_n(x)
|
||||
else:
|
||||
noise, x_noisy, noise_scale = model.compute_y_n(x)
|
||||
|
||||
|
||||
# forward pass
|
||||
noise_hat = model(x_noisy, m, noise_scale)
|
||||
|
||||
# compute losses
|
||||
loss = criterion(noise, noise_hat)
|
||||
loss_wavegrad_dict = {'wavegrad_loss':loss}
|
||||
|
||||
|
||||
loss_dict = dict()
|
||||
for key, value in loss_wavegrad_dict.items():
|
||||
if isinstance(value, (int, float)):
|
||||
loss_dict[key] = value
|
||||
else:
|
||||
loss_dict[key] = value.item()
|
||||
|
||||
step_time = time.time() - start_time
|
||||
epoch_time += step_time
|
||||
|
||||
# update avg stats
|
||||
update_eval_values = dict()
|
||||
for key, value in loss_dict.items():
|
||||
update_eval_values['avg_' + key] = value
|
||||
update_eval_values['avg_loader_time'] = loader_time
|
||||
update_eval_values['avg_step_time'] = step_time
|
||||
keep_avg.update_values(update_eval_values)
|
||||
|
||||
# print eval stats
|
||||
if c.print_eval:
|
||||
c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)
|
||||
|
||||
if args.rank == 0:
|
||||
data_loader.dataset.return_segments = False
|
||||
samples = data_loader.dataset.load_test_samples(1)
|
||||
m, x = format_test_data(samples[0])
|
||||
|
||||
# setup noise schedule and inference
|
||||
noise_schedule = c['test_noise_schedule']
|
||||
betas = np.linspace(noise_schedule['min_val'], noise_schedule['max_val'], noise_schedule['num_steps'])
|
||||
if hasattr(model, 'module'):
|
||||
model.module.compute_noise_level(betas)
|
||||
# compute voice
|
||||
x_pred = model.module.inference(m)
|
||||
else:
|
||||
model.compute_noise_level(betas)
|
||||
# compute voice
|
||||
x_pred = model.inference(m)
|
||||
|
||||
# compute spectrograms
|
||||
figures = plot_results(x_pred, x, ap, global_step, 'eval')
|
||||
tb_logger.tb_eval_figures(global_step, figures)
|
||||
|
||||
# Sample audio
|
||||
sample_voice = x_pred[0].squeeze(0).detach().cpu().numpy()
|
||||
tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
|
||||
c.audio["sample_rate"])
|
||||
|
||||
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
|
||||
data_loader.dataset.return_segments = True
|
||||
|
||||
return keep_avg.avg_values
|
||||
|
||||
|
||||
def main(args): # pylint: disable=redefined-outer-name
|
||||
# pylint: disable=global-variable-undefined
|
||||
global train_data, eval_data
|
||||
print(f" > Loading wavs from: {c.data_path}")
|
||||
if c.feature_path is not None:
|
||||
print(f" > Loading features from: {c.feature_path}")
|
||||
eval_data, train_data = load_wav_feat_data(c.data_path, c.feature_path, c.eval_split_size)
|
||||
else:
|
||||
eval_data, train_data = load_wav_data(c.data_path, c.eval_split_size)
|
||||
|
||||
# setup audio processor
|
||||
ap = AudioProcessor(**c.audio)
|
||||
|
||||
# DISTRUBUTED
|
||||
if num_gpus > 1:
|
||||
init_distributed(args.rank, num_gpus, args.group_id,
|
||||
c.distributed["backend"], c.distributed["url"])
|
||||
|
||||
# setup models
|
||||
model = setup_generator(c)
|
||||
|
||||
# scaler for mixed_precision
|
||||
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
|
||||
|
||||
# setup optimizers
|
||||
optimizer = Adam(model.parameters(), lr=c.lr, weight_decay=0)
|
||||
|
||||
# schedulers
|
||||
scheduler = None
|
||||
if 'lr_scheduler' in c:
|
||||
scheduler = getattr(torch.optim.lr_scheduler, c.lr_scheduler)
|
||||
scheduler = scheduler(optimizer, **c.lr_scheduler_params)
|
||||
|
||||
# setup criterion
|
||||
criterion = torch.nn.L1Loss().cuda()
|
||||
|
||||
if args.restore_path:
|
||||
checkpoint = torch.load(args.restore_path, map_location='cpu')
|
||||
try:
|
||||
print(" > Restoring Model...")
|
||||
model.load_state_dict(checkpoint['model'])
|
||||
print(" > Restoring Optimizer...")
|
||||
optimizer.load_state_dict(checkpoint['optimizer'])
|
||||
if 'scheduler' in checkpoint:
|
||||
print(" > Restoring LR Scheduler...")
|
||||
scheduler.load_state_dict(checkpoint['scheduler'])
|
||||
# NOTE: Not sure if necessary
|
||||
scheduler.optimizer = optimizer
|
||||
if "scaler" in checkpoint and c.mixed_precision:
|
||||
print(" > Restoring AMP Scaler...")
|
||||
scaler.load_state_dict(checkpoint["scaler"])
|
||||
except RuntimeError:
|
||||
# retore only matching layers.
|
||||
print(" > Partial model initialization...")
|
||||
model_dict = model.state_dict()
|
||||
model_dict = set_init_dict(model_dict, checkpoint['model'], c)
|
||||
model.load_state_dict(model_dict)
|
||||
del model_dict
|
||||
|
||||
# reset lr if not countinuining training.
|
||||
for group in optimizer.param_groups:
|
||||
group['lr'] = c.lr
|
||||
|
||||
print(" > Model restored from step %d" % checkpoint['step'],
|
||||
flush=True)
|
||||
args.restore_step = checkpoint['step']
|
||||
else:
|
||||
args.restore_step = 0
|
||||
|
||||
if use_cuda:
|
||||
model.cuda()
|
||||
criterion.cuda()
|
||||
|
||||
# DISTRUBUTED
|
||||
if num_gpus > 1:
|
||||
model = DDP_th(model, device_ids=[args.rank])
|
||||
|
||||
num_params = count_parameters(model)
|
||||
print(" > WaveGrad has {} parameters".format(num_params), flush=True)
|
||||
|
||||
if 'best_loss' not in locals():
|
||||
best_loss = float('inf')
|
||||
|
||||
global_step = args.restore_step
|
||||
for epoch in range(0, c.epochs):
|
||||
c_logger.print_epoch_start(epoch, c.epochs)
|
||||
_, global_step = train(model, criterion, optimizer,
|
||||
scheduler, scaler, ap, global_step,
|
||||
epoch)
|
||||
eval_avg_loss_dict = evaluate(model, criterion, ap,
|
||||
global_step, epoch)
|
||||
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
|
||||
target_loss = eval_avg_loss_dict[c.target_loss]
|
||||
best_loss = save_best_model(target_loss,
|
||||
best_loss,
|
||||
model,
|
||||
optimizer,
|
||||
scheduler,
|
||||
None,
|
||||
None,
|
||||
None,
|
||||
global_step,
|
||||
epoch,
|
||||
OUT_PATH,
|
||||
model_losses=eval_avg_loss_dict,
|
||||
scaler=scaler.state_dict() if c.mixed_precision else None)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
'--continue_path',
|
||||
type=str,
|
||||
help=
|
||||
'Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
|
||||
default='',
|
||||
required='--config_path' not in sys.argv)
|
||||
parser.add_argument(
|
||||
'--restore_path',
|
||||
type=str,
|
||||
help='Model file to be restored. Use to finetune a model.',
|
||||
default='')
|
||||
parser.add_argument('--config_path',
|
||||
type=str,
|
||||
help='Path to config file for training.',
|
||||
required='--continue_path' not in sys.argv)
|
||||
parser.add_argument('--debug',
|
||||
type=bool,
|
||||
default=False,
|
||||
help='Do not verify commit integrity to run training.')
|
||||
|
||||
# DISTRUBUTED
|
||||
parser.add_argument(
|
||||
'--rank',
|
||||
type=int,
|
||||
default=0,
|
||||
help='DISTRIBUTED: process rank for distributed training.')
|
||||
parser.add_argument('--group_id',
|
||||
type=str,
|
||||
default="",
|
||||
help='DISTRIBUTED: process group id.')
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.continue_path != '':
|
||||
args.output_path = args.continue_path
|
||||
args.config_path = os.path.join(args.continue_path, 'config.json')
|
||||
list_of_files = glob.glob(
|
||||
args.continue_path +
|
||||
"/*.pth.tar") # * means all if need specific format then *.csv
|
||||
latest_model_file = max(list_of_files, key=os.path.getctime)
|
||||
args.restore_path = latest_model_file
|
||||
print(f" > Training continues for {args.restore_path}")
|
||||
|
||||
# setup output paths and read configs
|
||||
c = load_config(args.config_path)
|
||||
# check_config(c)
|
||||
_ = os.path.dirname(os.path.realpath(__file__))
|
||||
|
||||
# DISTRIBUTED
|
||||
if c.mixed_precision:
|
||||
print(" > Mixed precision is enabled")
|
||||
|
||||
OUT_PATH = args.continue_path
|
||||
if args.continue_path == '':
|
||||
OUT_PATH = create_experiment_folder(c.output_path, c.run_name,
|
||||
args.debug)
|
||||
|
||||
AUDIO_PATH = os.path.join(OUT_PATH, 'test_audios')
|
||||
|
||||
c_logger = ConsoleLogger()
|
||||
|
||||
if args.rank == 0:
|
||||
os.makedirs(AUDIO_PATH, exist_ok=True)
|
||||
new_fields = {}
|
||||
if args.restore_path:
|
||||
new_fields["restore_path"] = args.restore_path
|
||||
new_fields["github_branch"] = get_git_branch()
|
||||
copy_config_file(args.config_path,
|
||||
os.path.join(OUT_PATH, 'config.json'), new_fields)
|
||||
os.chmod(AUDIO_PATH, 0o775)
|
||||
os.chmod(OUT_PATH, 0o775)
|
||||
|
||||
LOG_DIR = OUT_PATH
|
||||
tb_logger = TensorboardLogger(LOG_DIR, model_name='VOCODER')
|
||||
|
||||
# write model desc to tensorboard
|
||||
tb_logger.tb_add_text('model-description', c['run_description'], 0)
|
||||
|
||||
try:
|
||||
main(args)
|
||||
except KeyboardInterrupt:
|
||||
remove_experiment_folder(OUT_PATH)
|
||||
try:
|
||||
sys.exit(0)
|
||||
except SystemExit:
|
||||
os._exit(0) # pylint: disable=protected-access
|
||||
except Exception: # pylint: disable=broad-except
|
||||
remove_experiment_folder(OUT_PATH)
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
|
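The WaveGrad trainer above optimizes the standard diffusion ε-prediction objective: corrupt a clean waveform with Gaussian noise at a randomly drawn noise level, let the model predict the injected noise from the noisy audio and the mel conditioning, and take an L1 loss against the true noise. The sketch below illustrates that step with a simplified, discrete noise-level sampler standing in for the model's `compute_y_n` (the real method also interpolates noise levels continuously between schedule steps); it is a conceptual sketch, not the repo's implementation.

```python
import numpy as np
import torch

# simplified linear beta schedule, like the one the script builds with np.linspace
betas = np.linspace(1e-6, 1e-2, 1000)
alpha_hat = torch.tensor(np.cumprod(1 - betas), dtype=torch.float32)

def compute_y_n(x):
    """Corrupt clean audio x at a randomly sampled noise level (sketch).

    Returns (target noise, noisy audio, per-item noise scale), mirroring
    what the script expects from model.compute_y_n(x).
    """
    b = x.shape[0]
    t = torch.randint(0, len(alpha_hat), (b,))          # random diffusion step
    noise_scale = alpha_hat[t].sqrt().view(b, 1, 1)     # sqrt of cumulative alpha
    noise = torch.randn_like(x)
    x_noisy = noise_scale * x + (1 - noise_scale ** 2).sqrt() * noise
    return noise, x_noisy, noise_scale.view(b, 1)

x = torch.randn(4, 1, 6400)                             # [batch, 1, samples]
noise, x_noisy, noise_scale = compute_y_n(x)
noise_hat = noise + 0.1 * torch.randn_like(noise)       # stand-in for model(x_noisy, m, noise_scale)
loss = torch.nn.L1Loss()(noise, noise_hat)              # same criterion as the script
print(loss.item())
```

The config's `train_noise_schedule` controls the `min_val`, `max_val`, and `num_steps` of the `np.linspace` call above.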
@@ -0,0 +1,539 @@
import argparse
import os
import sys
import traceback
import time
import glob
import random

import torch
from torch.utils.data import DataLoader

# from torch.utils.data.distributed import DistributedSampler

from TTS.tts.utils.visual import plot_spectrogram
from TTS.utils.audio import AudioProcessor
from TTS.utils.radam import RAdam
from TTS.utils.io import copy_config_file, load_config
from TTS.utils.training import setup_torch_training_env
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.generic_utils import (
    KeepAverage,
    count_parameters,
    create_experiment_folder,
    get_git_branch,
    remove_experiment_folder,
    set_init_dict,
)
from TTS.vocoder.datasets.wavernn_dataset import WaveRNNDataset
from TTS.vocoder.datasets.preprocess import (
    load_wav_data,
    load_wav_feat_data
)
from TTS.vocoder.utils.distribution import discretized_mix_logistic_loss, gaussian_loss
from TTS.vocoder.utils.generic_utils import setup_wavernn
from TTS.vocoder.utils.io import save_best_model, save_checkpoint


use_cuda, num_gpus = setup_torch_training_env(True, True)


def setup_loader(ap, is_val=False, verbose=False):
    if is_val and not c.run_eval:
        loader = None
    else:
        dataset = WaveRNNDataset(ap=ap,
                                 items=eval_data if is_val else train_data,
                                 seq_len=c.seq_len,
                                 hop_len=ap.hop_length,
                                 pad=c.padding,
                                 mode=c.mode,
                                 mulaw=c.mulaw,
                                 is_training=not is_val,
                                 verbose=verbose,
                                 )
        # sampler = DistributedSampler(dataset) if num_gpus > 1 else None
        loader = DataLoader(dataset,
                            shuffle=True,
                            collate_fn=dataset.collate,
                            batch_size=c.batch_size,
                            num_workers=c.num_val_loader_workers
                            if is_val
                            else c.num_loader_workers,
                            pin_memory=True,
                            )
    return loader


def format_data(data):
    # setup input data
    x_input = data[0]
    mels = data[1]
    y_coarse = data[2]

    # dispatch data to GPU
    if use_cuda:
        x_input = x_input.cuda(non_blocking=True)
        mels = mels.cuda(non_blocking=True)
        y_coarse = y_coarse.cuda(non_blocking=True)

    return x_input, mels, y_coarse


def train(model, optimizer, criterion, scheduler, scaler, ap, global_step, epoch):
    # create train loader
    data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
    model.train()
    epoch_time = 0
    keep_avg = KeepAverage()
    if use_cuda:
        batch_n_iter = int(len(data_loader.dataset) /
                           (c.batch_size * num_gpus))
    else:
        batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
    end_time = time.time()
    c_logger.print_train_start()
    # train loop
    for num_iter, data in enumerate(data_loader):
        start_time = time.time()
        x_input, mels, y_coarse = format_data(data)
        loader_time = time.time() - end_time
        global_step += 1

        optimizer.zero_grad()

        if c.mixed_precision:
            # mixed precision training
            with torch.cuda.amp.autocast():
                y_hat = model(x_input, mels)
                if isinstance(model.mode, int):
                    y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
                else:
                    y_coarse = y_coarse.float()
                y_coarse = y_coarse.unsqueeze(-1)
                # compute losses
                loss = criterion(y_hat, y_coarse)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            if c.grad_clip > 0:
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(), c.grad_clip)
            scaler.step(optimizer)
            scaler.update()
        else:
            # full precision training
            y_hat = model(x_input, mels)
            if isinstance(model.mode, int):
                y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
            else:
                y_coarse = y_coarse.float()
            y_coarse = y_coarse.unsqueeze(-1)
            # compute losses
            loss = criterion(y_hat, y_coarse)
            if torch.isnan(loss).any():
                raise RuntimeError(" [!] NaN loss. Exiting ...")
            loss.backward()
            if c.grad_clip > 0:
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(), c.grad_clip)
            optimizer.step()

        if scheduler is not None:
            scheduler.step()

        # get the current learning rate
        cur_lr = list(optimizer.param_groups)[0]["lr"]

        step_time = time.time() - start_time
        epoch_time += step_time

        update_train_values = dict()
        loss_dict = dict()
        loss_dict["model_loss"] = loss.item()
        for key, value in loss_dict.items():
            update_train_values["avg_" + key] = value
        update_train_values["avg_loader_time"] = loader_time
        update_train_values["avg_step_time"] = step_time
        keep_avg.update_values(update_train_values)

        # print training stats
        if global_step % c.print_step == 0:
            log_dict = {"step_time": [step_time, 2],
                        "loader_time": [loader_time, 4],
                        "current_lr": cur_lr,
                        }
            c_logger.print_train_step(batch_n_iter,
                                      num_iter,
                                      global_step,
                                      log_dict,
                                      loss_dict,
                                      keep_avg.avg_values,
                                      )

        # plot step stats
        if global_step % 10 == 0:
            iter_stats = {"lr": cur_lr, "step_time": step_time}
            iter_stats.update(loss_dict)
            tb_logger.tb_train_iter_stats(global_step, iter_stats)

        # save checkpoint
        if global_step % c.save_step == 0:
            if c.checkpoint:
                # save model
                save_checkpoint(model,
                                optimizer,
                                scheduler,
                                None,
                                None,
                                None,
                                global_step,
                                epoch,
                                OUT_PATH,
                                model_losses=loss_dict,
                                scaler=scaler.state_dict() if c.mixed_precision else None
                                )

            # synthesize a full voice
            rand_idx = random.randrange(0, len(train_data))
            wav_path = train_data[rand_idx] if not isinstance(
                train_data[rand_idx], (tuple, list)) else train_data[rand_idx][0]
            wav = ap.load_wav(wav_path)
            ground_mel = ap.melspectrogram(wav)
            sample_wav = model.generate(ground_mel,
                                        c.batched,
                                        c.target_samples,
                                        c.overlap_samples,
                                        use_cuda
                                        )
            predict_mel = ap.melspectrogram(sample_wav)

            # compute spectrograms
            figures = {"train/ground_truth": plot_spectrogram(ground_mel.T),
                       "train/prediction": plot_spectrogram(predict_mel.T)
                       }
            tb_logger.tb_train_figures(global_step, figures)

            # Sample audio
            tb_logger.tb_train_audios(
                global_step, {
                    "train/audio": sample_wav}, c.audio["sample_rate"]
            )
        end_time = time.time()

    # print epoch stats
    c_logger.print_train_epoch_end(global_step, epoch, epoch_time, keep_avg)

    # Plot Training Epoch Stats
    epoch_stats = {"epoch_time": epoch_time}
    epoch_stats.update(keep_avg.avg_values)
    tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
    # TODO: plot model stats
    # if c.tb_model_param_stats:
    #     tb_logger.tb_model_weights(model, global_step)
    return keep_avg.avg_values, global_step


@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
    # create eval loader
    data_loader = setup_loader(ap, is_val=True, verbose=(epoch == 0))
    model.eval()
    epoch_time = 0
    keep_avg = KeepAverage()
    end_time = time.time()
    c_logger.print_eval_start()
    with torch.no_grad():
        for num_iter, data in enumerate(data_loader):
            start_time = time.time()
            # format data
            x_input, mels, y_coarse = format_data(data)
            loader_time = time.time() - end_time
            global_step += 1

            y_hat = model(x_input, mels)
            if isinstance(model.mode, int):
                y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
            else:
                y_coarse = y_coarse.float()
            y_coarse = y_coarse.unsqueeze(-1)
            loss = criterion(y_hat, y_coarse)
            # Compute avg loss
            # if num_gpus > 1:
            #     loss = reduce_tensor(loss.data, num_gpus)
            loss_dict = dict()
            loss_dict["model_loss"] = loss.item()

            step_time = time.time() - start_time
            epoch_time += step_time

            # update avg stats
            update_eval_values = dict()
            for key, value in loss_dict.items():
                update_eval_values["avg_" + key] = value
            update_eval_values["avg_loader_time"] = loader_time
            update_eval_values["avg_step_time"] = step_time
            keep_avg.update_values(update_eval_values)

            # print eval stats
            if c.print_eval:
                c_logger.print_eval_step(
                    num_iter, loss_dict, keep_avg.avg_values)

    if epoch % c.test_every_epochs == 0 and epoch != 0:
        # synthesize a full voice
        rand_idx = random.randrange(0, len(eval_data))
        wav_path = eval_data[rand_idx] if not isinstance(
            eval_data[rand_idx], (tuple, list)) else eval_data[rand_idx][0]
        wav = ap.load_wav(wav_path)
        ground_mel = ap.melspectrogram(wav)
        sample_wav = model.generate(ground_mel,
                                    c.batched,
                                    c.target_samples,
                                    c.overlap_samples,
                                    use_cuda
                                    )
        predict_mel = ap.melspectrogram(sample_wav)

        # Sample audio
        tb_logger.tb_eval_audios(
            global_step, {
                "eval/audio": sample_wav}, c.audio["sample_rate"]
        )

        # compute spectrograms
        figures = {"eval/ground_truth": plot_spectrogram(ground_mel.T),
                   "eval/prediction": plot_spectrogram(predict_mel.T)
                   }
        tb_logger.tb_eval_figures(global_step, figures)

    tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
    return keep_avg.avg_values


# FIXME: move args definition/parsing inside of main?
def main(args):  # pylint: disable=redefined-outer-name
    # pylint: disable=global-variable-undefined
    global train_data, eval_data

    # setup audio processor
    ap = AudioProcessor(**c.audio)

    # print(f" > Loading wavs from: {c.data_path}")
    # if c.feature_path is not None:
    #     print(f" > Loading features from: {c.feature_path}")
    #     eval_data, train_data = load_wav_feat_data(
    #         c.data_path, c.feature_path, c.eval_split_size
    #     )
    # else:
    #     mel_feat_path = os.path.join(OUT_PATH, "mel")
    #     feat_data = find_feat_files(mel_feat_path)
    #     if feat_data:
    #         print(f" > Loading features from: {mel_feat_path}")
    #         eval_data, train_data = load_wav_feat_data(
    #             c.data_path, mel_feat_path, c.eval_split_size
    #         )
    #     else:
    #         print(" > No feature data found. Preprocessing...")
    #         # preprocessing feature data from given wav files
    #         preprocess_wav_files(OUT_PATH, CONFIG, ap)
    #         eval_data, train_data = load_wav_feat_data(
    #             c.data_path, mel_feat_path, c.eval_split_size
    #         )

    print(f" > Loading wavs from: {c.data_path}")
    if c.feature_path is not None:
        print(f" > Loading features from: {c.feature_path}")
        eval_data, train_data = load_wav_feat_data(
            c.data_path, c.feature_path, c.eval_split_size)
    else:
        eval_data, train_data = load_wav_data(
            c.data_path, c.eval_split_size)
    # setup model
    model_wavernn = setup_wavernn(c)

    # setup amp scaler
    scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None

    # define train functions
    if c.mode == "mold":
        criterion = discretized_mix_logistic_loss
    elif c.mode == "gauss":
        criterion = gaussian_loss
    elif isinstance(c.mode, int):
        criterion = torch.nn.CrossEntropyLoss()

    if use_cuda:
        model_wavernn.cuda()
        if isinstance(c.mode, int):
            criterion.cuda()

    optimizer = RAdam(model_wavernn.parameters(), lr=c.lr, weight_decay=0)

    scheduler = None
    if "lr_scheduler" in c:
        scheduler = getattr(torch.optim.lr_scheduler, c.lr_scheduler)
        scheduler = scheduler(optimizer, **c.lr_scheduler_params)
    # slow start for the first 5 epochs
    # lr_lambda = lambda epoch: min(epoch / c.warmup_steps, 1)
    # scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # restore any checkpoint
    if args.restore_path:
        checkpoint = torch.load(args.restore_path, map_location="cpu")
        try:
            print(" > Restoring Model...")
            model_wavernn.load_state_dict(checkpoint["model"])
            print(" > Restoring Optimizer...")
            optimizer.load_state_dict(checkpoint["optimizer"])
            if "scheduler" in checkpoint:
                print(" > Restoring Generator LR Scheduler...")
                scheduler.load_state_dict(checkpoint["scheduler"])
                scheduler.optimizer = optimizer
            if "scaler" in checkpoint and c.mixed_precision:
                print(" > Restoring AMP Scaler...")
                scaler.load_state_dict(checkpoint["scaler"])
        except RuntimeError:
            # restore only matching layers.
            print(" > Partial model initialization...")
            model_dict = model_wavernn.state_dict()
            model_dict = set_init_dict(model_dict, checkpoint["model"], c)
            model_wavernn.load_state_dict(model_dict)

        print(" > Model restored from step %d" %
              checkpoint["step"], flush=True)
        args.restore_step = checkpoint["step"]
    else:
        args.restore_step = 0

    # DISTRIBUTED
    # if num_gpus > 1:
    #     model = apply_gradient_allreduce(model)

    num_parameters = count_parameters(model_wavernn)
    print(" > Model has {} parameters".format(num_parameters), flush=True)

    if "best_loss" not in locals():
        best_loss = float("inf")

    global_step = args.restore_step
    for epoch in range(0, c.epochs):
        c_logger.print_epoch_start(epoch, c.epochs)
        _, global_step = train(model_wavernn, optimizer,
                               criterion, scheduler, scaler, ap, global_step, epoch)
        eval_avg_loss_dict = evaluate(
            model_wavernn, criterion, ap, global_step, epoch)
        c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
        target_loss = eval_avg_loss_dict["avg_model_loss"]
        best_loss = save_best_model(
            target_loss,
            best_loss,
            model_wavernn,
            optimizer,
            scheduler,
            None,
            None,
            None,
            global_step,
            epoch,
            OUT_PATH,
            model_losses=eval_avg_loss_dict,
            scaler=scaler.state_dict() if c.mixed_precision else None
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--continue_path",
        type=str,
        help='Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
        default="",
        required="--config_path" not in sys.argv,
    )
    parser.add_argument(
        "--restore_path",
        type=str,
        help="Model file to be restored. Use to finetune a model.",
        default="",
    )
    parser.add_argument(
        "--config_path",
        type=str,
        help="Path to config file for training.",
        required="--continue_path" not in sys.argv,
    )
    parser.add_argument(
        "--debug",
        type=bool,
        default=False,
        help="Do not verify commit integrity to run training.",
    )

    # DISTRIBUTED
    parser.add_argument(
        "--rank",
        type=int,
        default=0,
        help="DISTRIBUTED: process rank for distributed training.",
    )
    parser.add_argument(
        "--group_id", type=str, default="", help="DISTRIBUTED: process group id."
    )
    args = parser.parse_args()

    if args.continue_path != "":
        args.output_path = args.continue_path
        args.config_path = os.path.join(args.continue_path, "config.json")
        list_of_files = glob.glob(
            args.continue_path + "/*.pth.tar"
        )  # * means all if need specific format then *.csv
        latest_model_file = max(list_of_files, key=os.path.getctime)
        args.restore_path = latest_model_file
        print(f" > Training continues for {args.restore_path}")

    # setup output paths and read configs
    c = load_config(args.config_path)
    # check_config(c)
    _ = os.path.dirname(os.path.realpath(__file__))

    OUT_PATH = args.continue_path
    if args.continue_path == "":
        OUT_PATH = create_experiment_folder(
            c.output_path, c.run_name, args.debug
        )

    AUDIO_PATH = os.path.join(OUT_PATH, "test_audios")

    c_logger = ConsoleLogger()

    if args.rank == 0:
        os.makedirs(AUDIO_PATH, exist_ok=True)
        new_fields = {}
        if args.restore_path:
            new_fields["restore_path"] = args.restore_path
        new_fields["github_branch"] = get_git_branch()
        copy_config_file(
            args.config_path, os.path.join(OUT_PATH, "config.json"), new_fields
        )
        os.chmod(AUDIO_PATH, 0o775)
        os.chmod(OUT_PATH, 0o775)

    LOG_DIR = OUT_PATH
    tb_logger = TensorboardLogger(LOG_DIR, model_name="VOCODER")

    # write model desc to tensorboard
    tb_logger.tb_add_text("model-description", c["run_description"], 0)

    try:
        main(args)
    except KeyboardInterrupt:
        remove_experiment_folder(OUT_PATH)
        try:
            sys.exit(0)
        except SystemExit:
            os._exit(0)  # pylint: disable=protected-access
    except Exception:  # pylint: disable=broad-except
        remove_experiment_folder(OUT_PATH)
        traceback.print_exc()
        sys.exit(1)
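The WaveRNN dataset above takes a `mulaw` flag: when the model runs in a discrete-output mode (`c.mode` set to a bit depth), waveforms are commonly μ-law companded before quantization so the class grid spends more resolution on low-amplitude samples. A self-contained sketch of the textbook μ-law transform (these helpers are illustrative, not the repo's exact implementation):

```python
import numpy as np

def mulaw_encode(wav, qc=256):
    """Textbook mu-law companding of wav in [-1, 1] to qc discrete levels."""
    mu = qc - 1
    signal = np.sign(wav) * np.log1p(mu * np.abs(wav)) / np.log1p(mu)
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(idx, qc=256):
    """Invert mulaw_encode back to floats in [-1, 1]."""
    mu = qc - 1
    signal = 2 * idx.astype(np.float32) / mu - 1
    return np.sign(signal) / mu * ((1 + mu) ** np.abs(signal) - 1)

wav = np.sin(np.linspace(0, 8 * np.pi, 1000)).astype(np.float32)
idx = mulaw_encode(wav, qc=2 ** 9)   # e.g. c.mode == 9 -> 512 classes
rec = mulaw_decode(idx, qc=2 ** 9)
print(np.abs(wav - rec).max())       # small reconstruction error
```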
@@ -0,0 +1,91 @@
"""Search a good noise schedule for WaveGrad for a given number of inference iterations."""
import argparse
from itertools import product as cartesian_product

import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.datasets.wavegrad_dataset import WaveGradDataset
from TTS.vocoder.utils.generic_utils import setup_generator

parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str, help='Path to model checkpoint.')
parser.add_argument('--config_path', type=str, help='Path to model config file.')
parser.add_argument('--data_path', type=str, help='Path to data directory.')
parser.add_argument('--output_path', type=str, help='Path for the output file, including file name and extension.')
parser.add_argument('--num_iter', type=int, help='Number of model inference iterations to optimize the noise schedule for.')
parser.add_argument('--use_cuda', type=bool, help='Enable/disable CUDA.')
parser.add_argument('--num_samples', type=int, default=1, help='Number of data samples used for inference.')
parser.add_argument('--search_depth', type=int, default=3, help='Search granularity. Increasing this increases the run-time exponentially.')

# load config
args = parser.parse_args()
config = load_config(args.config_path)

# setup audio processor
ap = AudioProcessor(**config.audio)

# load dataset
_, train_data = load_wav_data(args.data_path, 0)
train_data = train_data[:args.num_samples]
dataset = WaveGradDataset(ap=ap,
                          items=train_data,
                          seq_len=-1,
                          hop_len=ap.hop_length,
                          pad_short=config.pad_short,
                          conv_pad=config.conv_pad,
                          is_training=True,
                          return_segments=False,
                          use_noise_augment=False,
                          use_cache=False,
                          verbose=True)
loader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=dataset.collate_full_clips,
    drop_last=False,
    num_workers=config.num_loader_workers,
    pin_memory=False)

# setup the model
model = setup_generator(config)
if args.use_cuda:
    model.cuda()

# setup optimization parameters
base_values = sorted(10 * np.random.uniform(size=args.search_depth))
print(base_values)
exponents = 10 ** np.linspace(-6, -1, num=args.num_iter)
best_error = float('inf')
best_schedule = None
total_search_iter = len(base_values)**args.num_iter
for base in tqdm(cartesian_product(base_values, repeat=args.num_iter), total=total_search_iter):
    beta = exponents * base
    model.compute_noise_level(beta)
    for data in loader:
        mel, audio = data
        y_hat = model.inference(mel.cuda() if args.use_cuda else mel)

        if args.use_cuda:
            y_hat = y_hat.cpu()
        y_hat = y_hat.numpy()

        mel_hat = []
        for i in range(y_hat.shape[0]):
            m = ap.melspectrogram(y_hat[i, 0])[:, :-1]
            mel_hat.append(torch.from_numpy(m))

        mel_hat = torch.stack(mel_hat)
        mse = torch.sum((mel - mel_hat) ** 2).mean()
        if mse.item() < best_error:
            best_error = mse.item()
            best_schedule = {'beta': beta}
            print(f" > Found a better schedule. - MSE: {mse.item()}")
            np.save(args.output_path, best_schedule)
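Note the cost model of this search: it enumerates `search_depth ** num_iter` candidate schedules and runs a full WaveGrad inference on `num_samples` clips for each one, so run-time grows exponentially with `--num_iter`, as the `--search_depth` help string warns. A quick sanity check:

```python
# schedule-search cost: search_depth ** num_iter candidate schedules,
# each evaluated with num_samples full vocoder inferences
search_depth, num_iter, num_samples = 3, 6, 1
print(search_depth ** num_iter)  # 729 candidate schedules
```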
@@ -61,6 +61,7 @@ class SpeakerEncoder(nn.Module):
         d = torch.nn.functional.normalize(d, p=2, dim=1)
         return d

+    @torch.no_grad()
     def inference(self, x):
         d = self.layers.forward(x)
         if self.use_lstm_with_projection:
@@ -65,14 +65,19 @@
     "eval_batch_size":16,
     "r": 7,                 // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
     "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], // set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceed.
-    "apex_amp_level": null,     // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.
+    "mixed_precision": true,    // enable/disable mixed precision training with PyTorch's native AMP (FP16/FP32).

     // LOSS SETTINGS
     "loss_masking": true,       // enable / disable loss masking against the sequence padding.
-    "decoder_loss_alpha": 0.5,  // decoder loss weight. If > 0, it is enabled
-    "postnet_loss_alpha": 0.25, // postnet loss weight. If > 0, it is enabled
+    "decoder_loss_alpha": 0.5,       // original decoder loss weight. If > 0, it is enabled
+    "postnet_loss_alpha": 0.25,      // original postnet loss weight. If > 0, it is enabled
+    "postnet_diff_spec_alpha": 0.25, // differential spectral loss weight on the postnet output. If > 0, it is enabled
+    "decoder_diff_spec_alpha": 0.25, // differential spectral loss weight on the decoder output. If > 0, it is enabled
+    "decoder_ssim_alpha": 0.5,       // decoder SSIM loss weight. If > 0, it is enabled
+    "postnet_ssim_alpha": 0.25,      // postnet SSIM loss weight. If > 0, it is enabled
     "ga_alpha": 5.0,            // weight for guided attention loss. If > 0, guided attention is enabled.
-    "diff_spec_alpha": 0.25,    // differential spectral loss weight. If > 0, it is enabled
     "stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.


     // VALIDATION
     "run_eval": true,
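For the `gradual_training` schedule above, each entry is `[first_step, r, batch_size]`, and the trainer uses the last entry whose `first_step` has already been reached. A minimal sketch of that lookup (the helper name is illustrative; the repo has its own equivalent):

```python
def gradual_training_schedule(global_step, schedule):
    """Pick (r, batch_size) for the current step from a
    [[first_step, r, batch_size], ...] schedule (illustrative helper)."""
    r, batch_size = schedule[0][1:]
    for first_step, new_r, new_bs in schedule:
        if global_step >= first_step:
            r, batch_size = new_r, new_bs
    return r, batch_size

schedule = [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]
print(gradual_training_schedule(0, schedule))      # (7, 64)
print(gradual_training_schedule(60000, schedule))  # (3, 32)
```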
@@ -51,10 +51,13 @@
     // "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
     // },

+    "add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.
+
     // DISTRIBUTED TRAINING
-    "apex_amp_level": null, // APEX amp optimization level. "O1" is currently supported.
     "distributed":{
         "backend": "nccl",
-        "url": "tcp:\/\/localhost:54321"
+        "url": "tcp:\/\/localhost:54323"
     },

     "reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
@@ -51,6 +51,8 @@
     // "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
     // },

+    "add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.
+
     // DISTRIBUTED TRAINING
     "distributed":{
         "backend": "nccl",
@@ -17,6 +17,7 @@ class MyDataset(Dataset):
                  ap,
                  meta_data,
                  tp=None,
+                 add_blank=False,
                  batch_group_size=0,
                  min_seq_len=0,
                  max_seq_len=float("inf"),
@@ -55,6 +56,7 @@ class MyDataset(Dataset):
         self.max_seq_len = max_seq_len
         self.ap = ap
         self.tp = tp
+        self.add_blank = add_blank
         self.use_phonemes = use_phonemes
         self.phoneme_cache_path = phoneme_cache_path
         self.phoneme_language = phoneme_language
@@ -88,7 +90,7 @@ class MyDataset(Dataset):
         phonemes = phoneme_to_sequence(text, [self.cleaners],
                                        language=self.phoneme_language,
                                        enable_eos_bos=False,
-                                       tp=self.tp)
+                                       tp=self.tp, add_blank=self.add_blank)
         phonemes = np.asarray(phonemes, dtype=np.int32)
         np.save(cache_path, phonemes)
         return phonemes
@@ -127,7 +129,7 @@ class MyDataset(Dataset):
             text = self._load_or_generate_phoneme_sequence(wav_file, text)
         else:
             text = np.asarray(text_to_sequence(text, [self.cleaners],
-                                               tp=self.tp),
+                                               tp=self.tp, add_blank=self.add_blank),
                               dtype=np.int32)

         assert text.size > 0, self.items[idx][1]
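The `add_blank` option threads through this dataset change: when enabled, a blank token is interleaved with the regular symbols, and `setup_model` (later in this commit) grows `num_chars` by one to make room for it. A minimal sketch of the interleaving in the style of the reference GlowTTS code, which also pads both ends (the helper name and blank id are illustrative):

```python
def intersperse(sequence, token):
    """Insert `token` between and around every item of `sequence`,
    e.g. [5, 9, 2] -> [t, 5, t, 9, t, 2, t] (illustrative helper)."""
    result = [token] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result

# the blank uses the extra symbol id appended after the regular alphabet
print(intersperse([5, 9, 2], token=133))
```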
@@ -9,9 +9,9 @@ from tqdm import tqdm
 from TTS.tts.utils.generic_utils import split_dataset


-def load_meta_data(datasets):
+def load_meta_data(datasets, eval_split=True):
     meta_data_train_all = []
-    meta_data_eval_all = []
+    meta_data_eval_all = [] if eval_split else None
     for dataset in datasets:
         name = dataset['name']
         root_path = dataset['path']
@@ -20,12 +20,13 @@ def load_meta_data(datasets):
         preprocessor = get_preprocessor_by_name(name)
         meta_data_train = preprocessor(root_path, meta_file_train)
         print(f" | > Found {len(meta_data_train)} files in {Path(root_path).resolve()}")
-        if meta_file_val is None:
-            meta_data_eval, meta_data_train = split_dataset(meta_data_train)
-        else:
-            meta_data_eval = preprocessor(root_path, meta_file_val)
+        if eval_split:
+            if meta_file_val is None:
+                meta_data_eval, meta_data_train = split_dataset(meta_data_train)
+            else:
+                meta_data_eval = preprocessor(root_path, meta_file_val)
+            meta_data_eval_all += meta_data_eval
         meta_data_train_all += meta_data_train
-        meta_data_eval_all += meta_data_eval
     return meta_data_train_all, meta_data_eval_all
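With the new `eval_split` flag, callers that need every item as training data (e.g. feature-extraction scripts) can skip the eval split entirely; the eval side then comes back as `None`. A usage sketch with an assumed dataset definition (the dict keys mirror the `datasets` entries in `config.json`; the LJSpeech path is a placeholder):

```python
from TTS.tts.datasets.preprocess import load_meta_data

datasets = [{
    "name": "ljspeech",
    "path": "/data/LJSpeech-1.1/",
    "meta_file_train": "metadata.csv",
    "meta_file_val": None,
}]

train_items, eval_items = load_meta_data(datasets)                  # split as before
all_items, none_items = load_meta_data(datasets, eval_split=False)  # eval side is None
assert none_items is None
```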
@@ -227,7 +228,6 @@ def brspeech(root_path, meta_file):
             if line.startswith("wav_filename"):
                 continue
             cols = line.split('|')
-            #print(cols)
             wav_file = os.path.join(root_path, cols[0])
             text = cols[2]
             speaker_name = cols[3]
@@ -303,17 +303,17 @@ def _voxcel_x(root_path, meta_file, voxcel_idx):

     elif not cache_to.exists():
         cnt = 0
-        meta_data = ""
+        meta_data = []
         wav_files = voxceleb_path.rglob("**/*.wav")
         for path in tqdm(wav_files, desc=f"Building VoxCeleb {voxcel_idx} Meta file ... this needs to be done only once.",
                          total=expected_count):
             speaker_id = str(Path(path).parent.parent.stem)
             assert speaker_id.startswith('id')
             text = None  # VoxCeleb does not provide transcriptions, and they are not needed for training the SE
-            meta_data += f"{text}|{path}|voxcel{voxcel_idx}_{speaker_id}\n"
+            meta_data.append(f"{text}|{path}|voxcel{voxcel_idx}_{speaker_id}\n")
             cnt += 1
         with open(str(cache_to), 'w') as f:
-            f.write(meta_data)
+            f.write("".join(meta_data))
         if cnt < expected_count:
             raise ValueError(f"Found too few instances for Voxceleb. Should be around {expected_count}, is: {cnt}")
@@ -2,10 +2,14 @@ import math
 import numpy as np
 import torch
 from torch import nn
+from inspect import signature
 from torch.nn import functional
 from TTS.tts.utils.generic_utils import sequence_mask
+from TTS.tts.utils.ssim import ssim


+# pylint: disable=abstract-method
+# relates https://github.com/pytorch/pytorch/issues/42305
 class L1LossMasked(nn.Module):
     def __init__(self, seq_len_norm):
         super().__init__()
@@ -22,6 +26,10 @@ class L1LossMasked(nn.Module):
                 class for each corresponding step.
             length: A Variable containing a LongTensor of size (batch,)
                 which contains the length of each data in a batch.
+        Shapes:
+            x: B x T X D
+            target: B x T x D
+            length: B
         Returns:
             loss: An average loss value in range [0, 1] masked by the length.
         """
@@ -60,6 +68,10 @@ class MSELossMasked(nn.Module):
                 class for each corresponding step.
             length: A Variable containing a LongTensor of size (batch,)
                 which contains the length of each data in a batch.
+        Shapes:
+            x: B x T X D
+            target: B x T x D
+            length: B
         Returns:
             loss: An average loss value in range [0, 1] masked by the length.
         """
@@ -84,6 +96,33 @@ class MSELossMasked(nn.Module):
         return loss


+class SSIMLoss(torch.nn.Module):
+    """SSIM loss as explained here https://en.wikipedia.org/wiki/Structural_similarity"""
+    def __init__(self):
+        super().__init__()
+        self.loss_func = ssim
+
+    def forward(self, y_hat, y, length=None):
+        """
+        Args:
+            y_hat (tensor): model prediction values.
+            y (tensor): target values.
+            length (tensor): length of each sample in a batch.
+        Shapes:
+            y_hat: B x T X D
+            y: B x T x D
+            length: B
+        Returns:
+            loss: An average loss value in range [0, 1] masked by the length.
+        """
+        if length is not None:
+            m = sequence_mask(sequence_length=length,
+                              max_len=y.size(1)).unsqueeze(2).float().to(
+                                  y_hat.device)
+            y_hat, y = y_hat * m, y * m
+        return 1 - self.loss_func(y_hat.unsqueeze(1), y.unsqueeze(1))
+
+
 class AttentionEntropyLoss(nn.Module):
     # pylint: disable=R0201
     def forward(self, align):
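A quick usage sketch for the new `SSIMLoss`; the import path assumes the repo's `TTS/tts/layers/losses.py` layout. With a `length` tensor, padded frames are zeroed on both inputs before the SSIM comparison:

```python
import torch
from TTS.tts.layers.losses import SSIMLoss

criterion = SSIMLoss()
y_hat = torch.rand(2, 100, 80)       # [batch, frames, mel bins]
y = torch.rand(2, 100, 80)
lengths = torch.tensor([100, 73])    # second sample is padding past frame 73
loss = criterion(y_hat, y, lengths)  # masked: padded frames are zeroed first
print(loss.item())                   # close to 1 for unrelated random inputs
```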
@@ -115,19 +154,29 @@ class BCELossMasked(nn.Module):
                 class for each corresponding step.
             length: A Variable containing a LongTensor of size (batch,)
                 which contains the length of each data in a batch.
+        Shapes:
+            x: B x T
+            target: B x T
+            length: B
         Returns:
             loss: An average loss value in range [0, 1] masked by the length.
         """
-        # mask: (batch, max_len, 1)
         target.requires_grad = False
-        mask = sequence_mask(sequence_length=length,
-                             max_len=target.size(1)).float()
+        if length is not None:
+            mask = sequence_mask(sequence_length=length,
+                                 max_len=target.size(1)).float()
+            x = x * mask
+            target = target * mask
+            num_items = mask.sum()
+        else:
+            num_items = torch.numel(x)
         loss = functional.binary_cross_entropy_with_logits(
-            x * mask,
-            target * mask,
+            x,
+            target,
             pos_weight=self.pos_weight,
             reduction='sum')
-        loss = loss / mask.sum()
+        loss = loss / num_items
         return loss
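The reworked `BCELossMasked` normalizes by the number of unmasked items and now tolerates `length=None`. The masked branch can be reproduced with plain `torch` (`pos_weight` omitted for brevity):

```python
import torch
from torch.nn import functional

x = torch.randn(2, 10)                      # stopnet logits: [batch, frames]
target = (torch.rand(2, 10) > 0.5).float()
lengths = torch.tensor([10, 6])

# same steps as the masked branch above
mask = (torch.arange(10)[None, :] < lengths[:, None]).float()
x, target = x * mask, target * mask
num_items = mask.sum()
loss = functional.binary_cross_entropy_with_logits(
    x, target, reduction='sum') / num_items
print(loss.item())
```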
@@ -139,9 +188,19 @@ class DifferentailSpectralLoss(nn.Module):
         super().__init__()
         self.loss_func = loss_func

-    def forward(self, x, target, length):
+    def forward(self, x, target, length=None):
+        """
+        Shapes:
+            x: B x T
+            target: B x T
+            length: B
+        Returns:
+            loss: An average loss value in range [0, 1] masked by the length.
+        """
         x_diff = x[:, 1:] - x[:, :-1]
         target_diff = target[:, 1:] - target[:, :-1]
-        return self.loss_func(x_diff, target_diff, length-1)
+        if length is None:
+            return self.loss_func(x_diff, target_diff)
+        return self.loss_func(x_diff, target_diff, length-1)
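The differential spectral loss compares frame-to-frame differences rather than absolute frame values, which penalizes over-smooth transitions in the predicted spectrogram. A toy check with an L1 base loss:

```python
import torch

x = torch.tensor([[0.0, 1.0, 2.0, 3.0]])       # prediction: [batch, frames]
target = torch.tensor([[0.0, 2.0, 4.0, 6.0]])  # steeper target

x_diff = x[:, 1:] - x[:, :-1]                  # [[1, 1, 1]]
target_diff = target[:, 1:] - target[:, :-1]   # [[2, 2, 2]]
loss = torch.nn.functional.l1_loss(x_diff, target_diff)
print(loss.item())  # 1.0: the slopes differ by 1 everywhere
```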
@@ -169,7 +228,7 @@ class GuidedAttentionLoss(torch.nn.Module):

     @staticmethod
     def _make_ga_mask(ilen, olen, sigma):
-        grid_x, grid_y = torch.meshgrid(torch.arange(olen, device=olen.device), torch.arange(ilen, device=ilen.device))
+        grid_x, grid_y = torch.meshgrid(torch.arange(olen).to(olen), torch.arange(ilen).to(ilen))
         grid_x, grid_y = grid_x.float(), grid_y.float()
         return 1.0 - torch.exp(-(grid_y / ilen - grid_x / olen)**2 /
                                (2 * (sigma**2)))
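In formula form, the mask built by `_make_ga_mask` is the guided-attention weight of Tachibana et al. (2017): for input index $n$ of $N$ and output index $t$ of $T$,

```latex
W_{n,t} = 1 - \exp\!\left(-\frac{\left(n/N - t/T\right)^{2}}{2\sigma^{2}}\right)
```

so attention mass far from the diagonal is penalized, with `sigma` controlling the width of the tolerated band.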
@@ -182,13 +241,17 @@ class GuidedAttentionLoss(torch.nn.Module):


 class TacotronLoss(torch.nn.Module):
+    """Collection of Tacotron set-up based on provided config."""
     def __init__(self, c, stopnet_pos_weight=10, ga_sigma=0.4):
         super(TacotronLoss, self).__init__()
         self.stopnet_pos_weight = stopnet_pos_weight
         self.ga_alpha = c.ga_alpha
-        self.diff_spec_alpha = c.diff_spec_alpha
+        self.decoder_diff_spec_alpha = c.decoder_diff_spec_alpha
+        self.postnet_diff_spec_alpha = c.postnet_diff_spec_alpha
         self.decoder_alpha = c.decoder_loss_alpha
         self.postnet_alpha = c.postnet_loss_alpha
+        self.decoder_ssim_alpha = c.decoder_ssim_alpha
+        self.postnet_ssim_alpha = c.postnet_ssim_alpha
         self.config = c

         # postnet and decoder loss
@@ -199,12 +262,15 @@ class TacotronLoss(torch.nn.Module):
         else:
             self.criterion = nn.L1Loss() if c.model in ["Tacotron"
                                                         ] else nn.MSELoss()
-        # differential spectral loss
-        if c.diff_spec_alpha > 0:
-            self.criterion_diff_spec = DifferentailSpectralLoss(loss_func=self.criterion)
         # guided attention loss
         if c.ga_alpha > 0:
             self.criterion_ga = GuidedAttentionLoss(sigma=ga_sigma)
+        # differential spectral loss
+        if c.postnet_diff_spec_alpha > 0 or c.decoder_diff_spec_alpha > 0:
+            self.criterion_diff_spec = DifferentailSpectralLoss(loss_func=self.criterion)
+        # ssim loss
+        if c.postnet_ssim_alpha > 0 or c.decoder_ssim_alpha > 0:
+            self.criterion_ssim = SSIMLoss()
         # stopnet loss
         # pylint: disable=not-callable
         self.criterion_st = BCELossMasked(
@@ -215,6 +281,9 @@ class TacotronLoss(torch.nn.Module):
                 alignments, alignment_lens, alignments_backwards, input_lens):

         return_dict = {}
+        # remove lengths if no masking is applied
+        if not self.config.loss_masking:
+            output_lens = None
         # decoder and postnet losses
         if self.config.loss_masking:
             if self.decoder_alpha > 0:
@@ -262,8 +331,11 @@ class TacotronLoss(torch.nn.Module):

         # double decoder consistency loss (if enabled)
         if self.config.double_decoder_consistency:
-            decoder_b_loss = self.criterion(decoder_b_output, mel_input,
-                                            output_lens)
+            if self.config.loss_masking:
+                decoder_b_loss = self.criterion(decoder_b_output, mel_input,
+                                                output_lens)
+            else:
+                decoder_b_loss = self.criterion(decoder_b_output, mel_input)
             # decoder_c_loss = torch.nn.functional.l1_loss(decoder_b_output, decoder_output)
             attention_c_loss = torch.nn.functional.l1_loss(alignments, alignments_backwards)
             loss += self.decoder_alpha * (decoder_b_loss + attention_c_loss)
@@ -274,14 +346,38 @@ class TacotronLoss(torch.nn.Module):
         if self.config.ga_alpha > 0:
             ga_loss = self.criterion_ga(alignments, input_lens, alignment_lens)
             loss += ga_loss * self.ga_alpha
-            return_dict['ga_loss'] = ga_loss * self.ga_alpha
+            return_dict['ga_loss'] = ga_loss
+
+        # decoder differential spectral loss
+        if self.config.decoder_diff_spec_alpha > 0:
+            decoder_diff_spec_loss = self.criterion_diff_spec(decoder_output, mel_input, output_lens)
+            loss += decoder_diff_spec_loss * self.decoder_diff_spec_alpha
+            return_dict['decoder_diff_spec_loss'] = decoder_diff_spec_loss
+
+        # postnet differential spectral loss
+        if self.config.postnet_diff_spec_alpha > 0:
+            postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
+            loss += postnet_diff_spec_loss * self.postnet_diff_spec_alpha
+            return_dict['postnet_diff_spec_loss'] = postnet_diff_spec_loss
+
+        # decoder ssim loss
+        if self.config.decoder_ssim_alpha > 0:
+            decoder_ssim_loss = self.criterion_ssim(decoder_output, mel_input, output_lens)
+            loss += decoder_ssim_loss * self.decoder_ssim_alpha
+            return_dict['decoder_ssim_loss'] = decoder_ssim_loss
+
+        # postnet ssim loss
+        if self.config.postnet_ssim_alpha > 0:
+            postnet_ssim_loss = self.criterion_ssim(postnet_output, mel_input, output_lens)
+            loss += postnet_ssim_loss * self.postnet_ssim_alpha
+            return_dict['postnet_ssim_loss'] = postnet_ssim_loss

-        # differential spectral loss
-        if self.config.diff_spec_alpha > 0:
-            diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
-            loss += diff_spec_loss * self.diff_spec_alpha
-            return_dict['diff_spec_loss'] = diff_spec_loss
         return_dict['loss'] = loss
+
+        # check if any loss is NaN
+        for key, loss in return_dict.items():
+            if torch.isnan(loss):
+                raise RuntimeError(f" [!] NaN loss with {key}.")
         return return_dict
@@ -306,4 +402,9 @@ class GlowTTSLoss(torch.nn.Module):
         return_dict['loss'] = log_mle + loss_dur
         return_dict['log_mle'] = log_mle
         return_dict['loss_dur'] = loss_dur
+
+        # check if any loss is NaN
+        for key, loss in return_dict.items():
+            if torch.isnan(loss):
+                raise RuntimeError(f" [!] NaN loss with {key}.")
         return return_dict
@@ -102,7 +102,7 @@ class Encoder(nn.Module):
             o = layer(o)
         o = o.transpose(1, 2)
         o = nn.utils.rnn.pack_padded_sequence(o,
-                                              input_lengths,
+                                              input_lengths.cpu(),
                                               batch_first=True)
         self.lstm.flatten_parameters()
         o, _ = self.lstm(o)
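Context for the `.cpu()` change: newer PyTorch releases (around 1.7) require the `lengths` argument of `pack_padded_sequence` to be a 1D CPU int64 tensor and raise if it arrives on the GPU, which it otherwise would here since the whole batch is moved to CUDA. A minimal reproduction of the safe call:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

o = torch.randn(2, 5, 8)              # [batch, time, features]
input_lengths = torch.tensor([5, 3])  # may live on the GPU in the trainer

# newer PyTorch raises "'lengths' argument should be a 1D CPU int64 tensor"
# for CUDA length tensors, hence the .cpu() in the hunk above
packed = pack_padded_sequence(o, input_lengths.cpu(), batch_first=True)
```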
@ -37,7 +37,8 @@ class GlowTts(nn.Module):
                 hidden_channels_enc=None,
                 hidden_channels_dec=None,
                 use_encoder_prenet=False,
                 encoder_type="transformer"):
                 encoder_type="transformer",
                 external_speaker_embedding_dim=None):

        super().__init__()
        self.num_chars = num_chars

@ -67,6 +68,14 @@ class GlowTts(nn.Module):
        self.use_encoder_prenet = use_encoder_prenet
        self.noise_scale = 0.66
        self.length_scale = 1.
        self.external_speaker_embedding_dim = external_speaker_embedding_dim

        # multi-speaker: if c_in_channels is 0, default to 512
        if num_speakers > 1:
            if self.c_in_channels == 0 and not self.external_speaker_embedding_dim:
                self.c_in_channels = 512
            elif self.external_speaker_embedding_dim:
                self.c_in_channels = self.external_speaker_embedding_dim

        self.encoder = Encoder(num_chars,
                               out_channels=out_channels,

@ -80,7 +89,7 @@ class GlowTts(nn.Module):
                               dropout_p=dropout_p,
                               mean_only=mean_only,
                               use_prenet=use_encoder_prenet,
                               c_in_channels=c_in_channels)
                               c_in_channels=self.c_in_channels)

        self.decoder = Decoder(out_channels,
                               hidden_channels_dec or hidden_channels,

@ -92,10 +101,10 @@ class GlowTts(nn.Module):
                               num_splits=num_splits,
                               num_sqz=num_sqz,
                               sigmoid_scale=sigmoid_scale,
                               c_in_channels=c_in_channels)
                               c_in_channels=self.c_in_channels)

        if num_speakers > 1:
            self.emb_g = nn.Embedding(num_speakers, c_in_channels)
        if num_speakers > 1 and not external_speaker_embedding_dim:
            self.emb_g = nn.Embedding(num_speakers, self.c_in_channels)
            nn.init.uniform_(self.emb_g.weight, -0.1, 0.1)

    @staticmethod

@ -122,7 +131,11 @@ class GlowTts(nn.Module):
        y_max_length = y.size(2)
        # norm speaker embeddings
        if g is not None:
            g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]
            if self.external_speaker_embedding_dim:
                g = F.normalize(g).unsqueeze(-1)
            else:
                g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]

        # embedding pass
        o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x,
                                                              x_lengths,

@ -157,8 +170,13 @@ class GlowTts(nn.Module):

    @torch.no_grad()
    def inference(self, x, x_lengths, g=None):

        if g is not None:
            g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]
            if self.external_speaker_embedding_dim:
                g = F.normalize(g).unsqueeze(-1)
            else:
                g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h]

        # embedding pass
        o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x,
                                                              x_lengths,
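Both `forward` and `inference` now dispatch on `external_speaker_embedding_dim`: with an external embedding file, `g` already is a d-vector and is only L2-normalized; otherwise it is a speaker id looked up in the learned `emb_g` table. The dispatch in isolation, as a sketch reusing the names from the hunks above (the helper itself is illustrative, not part of the commit):

```python
import torch
import torch.nn.functional as F

def speaker_cond(g, emb_g=None, external_dim=None):
    """Return normalized speaker conditioning of shape [B, H, 1]. Sketch only."""
    if external_dim:                              # g: [B, H] precomputed d-vector
        return F.normalize(g).unsqueeze(-1)
    return F.normalize(emb_g(g)).unsqueeze(-1)    # g: [B] integer speaker ids

emb_g = torch.nn.Embedding(10, 256)
print(speaker_cond(torch.tensor([3]), emb_g=emb_g).shape)        # [1, 256, 1]
print(speaker_cond(torch.rand(1, 256), external_dim=256).shape)  # [1, 256, 1]
```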
@ -28,7 +28,6 @@ def split_dataset(items):
        return items_eval, items
    return items[:eval_split_size], items[eval_split_size:]


# from https://gist.github.com/jihunchoi/f1434a77df9db1bb337417854b398df1
def sequence_mask(sequence_length, max_len=None):
    if max_len is None:

@ -50,7 +49,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
    MyModel = importlib.import_module('TTS.tts.models.' + c.model.lower())
    MyModel = getattr(MyModel, to_camel(c.model))
    if c.model.lower() in "tacotron":
        model = MyModel(num_chars=num_chars,
        model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
                        num_speakers=num_speakers,
                        r=c.r,
                        postnet_output_dim=int(c.audio['fft_size'] / 2 + 1),

@ -77,7 +76,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
                        ddc_r=c.ddc_r,
                        speaker_embedding_dim=speaker_embedding_dim)
    elif c.model.lower() == "tacotron2":
        model = MyModel(num_chars=num_chars,
        model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
                        num_speakers=num_speakers,
                        r=c.r,
                        postnet_output_dim=c.audio['num_mels'],

@ -103,7 +102,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
                        ddc_r=c.ddc_r,
                        speaker_embedding_dim=speaker_embedding_dim)
    elif c.model.lower() == "glow_tts":
        model = MyModel(num_chars=num_chars,
        model = MyModel(num_chars=num_chars + getattr(c, "add_blank", False),
                        hidden_channels=192,
                        filter_channels=768,
                        filter_channels_dp=256,

@ -126,13 +125,15 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
                        mean_only=True,
                        hidden_channels_enc=192,
                        hidden_channels_dec=192,
                        use_encoder_prenet=True)
                        use_encoder_prenet=True,
                        external_speaker_embedding_dim=speaker_embedding_dim)
    return model


def is_tacotron(c):
    return 'glow_tts' not in c['model']


def check_config_tts(c):
    check_argument('model', c, enum_list=['tacotron', 'tacotron2'], restricted=True, val_type=str)
    check_argument('model', c, enum_list=['tacotron', 'tacotron2', 'glow_tts'], restricted=True, val_type=str)
    check_argument('run_name', c, restricted=True, val_type=str)
    check_argument('run_description', c, val_type=str)

@ -176,10 +177,20 @@ def check_config_tts(c):
    check_argument('eval_batch_size', c, restricted=True, val_type=int, min_val=1)
    check_argument('r', c, restricted=True, val_type=int, min_val=1)
    check_argument('gradual_training', c, restricted=False, val_type=list)
    check_argument('loss_masking', c, restricted=True, val_type=bool)
    check_argument('apex_amp_level', c, restricted=False, val_type=str)
    # check_argument('grad_accum', c, restricted=True, val_type=int, min_val=1, max_val=100)

    # loss parameters
    check_argument('loss_masking', c, restricted=True, val_type=bool)
    if c['model'].lower() in ['tacotron', 'tacotron2']:
        check_argument('decoder_loss_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('postnet_loss_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('postnet_diff_spec_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('decoder_diff_spec_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('decoder_ssim_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('postnet_ssim_alpha', c, restricted=True, val_type=float, min_val=0)
        check_argument('ga_alpha', c, restricted=True, val_type=float, min_val=0)

    # validation parameters
    check_argument('run_eval', c, restricted=True, val_type=bool)
    check_argument('test_delay_epochs', c, restricted=True, val_type=int, min_val=0)

@ -195,27 +206,30 @@ def check_config_tts(c):
    check_argument('seq_len_norm', c, restricted=True, val_type=bool)

    # tacotron prenet
    check_argument('memory_size', c, restricted=True, val_type=int, min_val=-1)
    check_argument('prenet_type', c, restricted=True, val_type=str, enum_list=['original', 'bn'])
    check_argument('prenet_dropout', c, restricted=True, val_type=bool)
    check_argument('memory_size', c, restricted=is_tacotron(c), val_type=int, min_val=-1)
    check_argument('prenet_type', c, restricted=is_tacotron(c), val_type=str, enum_list=['original', 'bn'])
    check_argument('prenet_dropout', c, restricted=is_tacotron(c), val_type=bool)

    # attention
    check_argument('attention_type', c, restricted=True, val_type=str, enum_list=['graves', 'original'])
    check_argument('attention_heads', c, restricted=True, val_type=int)
    check_argument('attention_norm', c, restricted=True, val_type=str, enum_list=['sigmoid', 'softmax'])
    check_argument('windowing', c, restricted=True, val_type=bool)
    check_argument('use_forward_attn', c, restricted=True, val_type=bool)
    check_argument('forward_attn_mask', c, restricted=True, val_type=bool)
    check_argument('transition_agent', c, restricted=True, val_type=bool)
    check_argument('location_attn', c, restricted=True, val_type=bool)
    check_argument('bidirectional_decoder', c, restricted=True, val_type=bool)
    check_argument('double_decoder_consistency', c, restricted=True, val_type=bool)
    check_argument('attention_type', c, restricted=is_tacotron(c), val_type=str, enum_list=['graves', 'original'])
    check_argument('attention_heads', c, restricted=is_tacotron(c), val_type=int)
    check_argument('attention_norm', c, restricted=is_tacotron(c), val_type=str, enum_list=['sigmoid', 'softmax'])
    check_argument('windowing', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('use_forward_attn', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('forward_attn_mask', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('transition_agent', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('location_attn', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('bidirectional_decoder', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('double_decoder_consistency', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('ddc_r', c, restricted='double_decoder_consistency' in c.keys(), min_val=1, max_val=7, val_type=int)

    # stopnet
    check_argument('stopnet', c, restricted=True, val_type=bool)
    check_argument('separate_stopnet', c, restricted=True, val_type=bool)
    check_argument('stopnet', c, restricted=is_tacotron(c), val_type=bool)
    check_argument('separate_stopnet', c, restricted=is_tacotron(c), val_type=bool)

    # GlowTTS parameters
    check_argument('encoder_type', c, restricted=not is_tacotron(c), val_type=str)

    # tensorboard
    check_argument('print_step', c, restricted=True, val_type=int, min_val=1)

@ -240,15 +254,16 @@ def check_config_tts(c):

    # multi-speaker and gst
    check_argument('use_speaker_embedding', c, restricted=True, val_type=bool)
    check_argument('use_external_speaker_embedding_file', c, restricted=True, val_type=bool)
    check_argument('external_speaker_embedding_file', c, restricted=True, val_type=str)
    check_argument('use_gst', c, restricted=True, val_type=bool)
    check_argument('gst', c, restricted=True, val_type=dict)
    check_argument('gst_style_input', c['gst'], restricted=True, val_type=[str, dict])
    check_argument('gst_embedding_dim', c['gst'], restricted=True, val_type=int, min_val=0, max_val=1000)
    check_argument('gst_use_speaker_embedding', c['gst'], restricted=True, val_type=bool)
    check_argument('gst_num_heads', c['gst'], restricted=True, val_type=int, min_val=2, max_val=10)
    check_argument('gst_style_tokens', c['gst'], restricted=True, val_type=int, min_val=1, max_val=1000)
    check_argument('use_external_speaker_embedding_file', c, restricted=c['use_speaker_embedding'], val_type=bool)
    check_argument('external_speaker_embedding_file', c, restricted=c['use_external_speaker_embedding_file'], val_type=str)
    check_argument('use_gst', c, restricted=is_tacotron(c), val_type=bool)
    if c['model'].lower() in ['tacotron', 'tacotron2'] and c['use_gst']:
        check_argument('gst', c, restricted=is_tacotron(c), val_type=dict)
        check_argument('gst_style_input', c['gst'], restricted=is_tacotron(c), val_type=[str, dict])
        check_argument('gst_embedding_dim', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=0, max_val=1000)
        check_argument('gst_use_speaker_embedding', c['gst'], restricted=is_tacotron(c), val_type=bool)
        check_argument('gst_num_heads', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=2, max_val=10)
        check_argument('gst_style_tokens', c['gst'], restricted=is_tacotron(c), val_type=int, min_val=1, max_val=1000)

    # datasets - checking only the first entry
    check_argument('datasets', c, restricted=True, val_type=list)
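The `num_chars + getattr(c, "add_blank", False)` trick in `setup_model` leans on `bool` being an `int` subclass in Python: when `add_blank` is enabled, the character embedding grows by exactly one slot, reserved for the blank token interspersed by the text frontend (see the text-utils hunks further below). For example:

```python
num_chars = 129
add_blank = True               # bool is an int subclass in Python
print(num_chars + add_blank)   # 130 -> one extra embedding row for the blank id
print(num_chars + False)       # 129 -> unchanged when disabled
```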
@ -6,6 +6,7 @@ import pickle as pickle_tts
from TTS.utils.io import RenamingUnpickler


def load_checkpoint(model, checkpoint_path, amp=None, use_cuda=False):
    try:
        state = torch.load(checkpoint_path, map_location=torch.device('cpu'))

@ -25,9 +26,12 @@ def load_checkpoint(model, checkpoint_path, amp=None, use_cuda=False):


def save_model(model, optimizer, current_step, epoch, r, output_path, amp_state_dict=None, **kwargs):
    new_state_dict = model.state_dict()
    if hasattr(model, 'module'):
        model_state = model.module.state_dict()
    else:
        model_state = model.state_dict()
    state = {
        'model': new_state_dict,
        'model': model_state,
        'optimizer': optimizer.state_dict() if optimizer is not None else None,
        'step': current_step,
        'epoch': epoch,
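Unwrapping `model.module` matters because `torch.nn.DataParallel` prefixes every parameter key with `module.`; saving the inner module's state dict keeps checkpoints loadable without the wrapper. A small demonstration:

```python
import torch.nn as nn

net = nn.Linear(4, 2)
wrapped = nn.DataParallel(net)

print(next(iter(net.state_dict())))      # 'weight'
print(next(iter(wrapped.state_dict())))  # 'module.weight'
# hence: save wrapped.module.state_dict() so keys match a bare model
```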
@ -30,3 +30,44 @@ def get_speakers(items):
    """Returns a sorted, unique list of speakers in a given dataset."""
    speakers = {e[2] for e in items}
    return sorted(speakers)


def parse_speakers(c, args, meta_data_train, OUT_PATH):
    """Return the number of speakers, the speaker embedding dimension and the speaker mapping."""
    if c.use_speaker_embedding:
        speakers = get_speakers(meta_data_train)
        if args.restore_path:
            if c.use_external_speaker_embedding_file:  # restoring a checkpoint while using an external embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                if not speaker_mapping:
                    print("WARNING: speakers.json was not found in restore_path, trying to use CONFIG.external_speaker_embedding_file")
                    speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
                    if not speaker_mapping:
                        raise RuntimeError("You must copy the file speakers.json to restore_path, or set a valid file in CONFIG.external_speaker_embedding_file")
                speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
            elif not c.use_external_speaker_embedding_file:  # restoring a checkpoint without an external embedding file
                prev_out_path = os.path.dirname(args.restore_path)
                speaker_mapping = load_speaker_mapping(prev_out_path)
                speaker_embedding_dim = None
                assert all(speaker in speaker_mapping
                           for speaker in speakers), "As of now, you cannot " \
                                                     "introduce new speakers to " \
                                                     "a previously trained model."
        elif c.use_external_speaker_embedding_file and c.external_speaker_embedding_file:  # starting a new run with an external embedding file
            speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
            speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
        elif c.use_external_speaker_embedding_file and not c.external_speaker_embedding_file:  # external embedding requested but no file given
            raise ValueError("use_external_speaker_embedding_file is True, so you need to pass an external speaker embedding file; run the GE2E-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb or AngularPrototypical-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb notebook in the notebooks/ folder.")
        else:  # starting a new run without an external embedding file
            speaker_mapping = {name: i for i, name in enumerate(speakers)}
            speaker_embedding_dim = None
        save_speaker_mapping(OUT_PATH, speaker_mapping)
        num_speakers = len(speaker_mapping)
        print("Training with {} speakers: {}".format(len(speakers),
                                                     ", ".join(speakers)))
    else:
        num_speakers = 0
        speaker_embedding_dim = None
        speaker_mapping = None

    return num_speakers, speaker_embedding_dim, speaker_mapping
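The indexing `speaker_mapping[...]['embedding']` implies a `speakers.json` of dict entries each carrying one embedding vector. An illustrative, hypothetical shape (values invented here), showing how `speaker_embedding_dim` is recovered:

```python
# hypothetical example entry; real files hold full-length d-vectors
speaker_mapping = {
    "p225_001.wav": {"name": "p225", "embedding": [0.01, -0.03, 0.05]},
}
first = next(iter(speaker_mapping.values()))
speaker_embedding_dim = len(first["embedding"])  # -> 3 in this toy example
```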
@ -0,0 +1,75 @@
# taken from https://github.com/Po-Hsun-Su/pytorch-ssim

from math import exp

import torch
import torch.nn.functional as F
from torch.autograd import Variable


def gaussian(window_size, sigma):
    gauss = torch.Tensor([exp(-(x - window_size//2)**2/float(2*sigma**2)) for x in range(window_size)])
    return gauss/gauss.sum()


def create_window(window_size, channel):
    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)
    _2D_window = _1D_window.mm(_1D_window.t()).float().unsqueeze(0).unsqueeze(0)
    window = Variable(_2D_window.expand(channel, 1, window_size, window_size).contiguous())
    return window


def _ssim(img1, img2, window, window_size, channel, size_average=True):
    mu1 = F.conv2d(img1, window, padding=window_size//2, groups=channel)
    mu2 = F.conv2d(img2, window, padding=window_size//2, groups=channel)

    mu1_sq = mu1.pow(2)
    mu2_sq = mu2.pow(2)
    mu1_mu2 = mu1*mu2

    sigma1_sq = F.conv2d(img1*img1, window, padding=window_size//2, groups=channel) - mu1_sq
    sigma2_sq = F.conv2d(img2*img2, window, padding=window_size//2, groups=channel) - mu2_sq
    sigma12 = F.conv2d(img1*img2, window, padding=window_size//2, groups=channel) - mu1_mu2

    C1 = 0.01**2
    C2 = 0.03**2

    ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2))/((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2))

    if size_average:
        return ssim_map.mean()
    return ssim_map.mean(1).mean(1).mean(1)


class SSIM(torch.nn.Module):
    def __init__(self, window_size=11, size_average=True):
        super().__init__()
        self.window_size = window_size
        self.size_average = size_average
        self.channel = 1
        self.window = create_window(window_size, self.channel)

    def forward(self, img1, img2):
        (_, channel, _, _) = img1.size()

        if channel == self.channel and self.window.data.type() == img1.data.type():
            window = self.window
        else:
            window = create_window(self.window_size, channel)

            if img1.is_cuda:
                window = window.cuda(img1.get_device())
            window = window.type_as(img1)

            self.window = window
            self.channel = channel

        return _ssim(img1, img2, window, self.window_size, channel, self.size_average)


def ssim(img1, img2, window_size=11, size_average=True):
    (_, channel, _, _) = img1.size()
    window = create_window(window_size, channel)

    if img1.is_cuda:
        window = window.cuda(img1.get_device())
    window = window.type_as(img1)

    return _ssim(img1, img2, window, window_size, channel, size_average)
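As wired into `TacotronLoss` earlier in this diff, spectrograms are treated as single-channel images for SSIM. A quick usage sketch of the module above (the import path is an assumption; adjust it to wherever the file lands in the package):

```python
import torch
# from TTS.tts.utils.ssim import SSIM  # assumed location of the file above

pred = torch.rand(4, 1, 80, 120)      # [batch, channel, mel_bins, frames]
target = torch.rand(4, 1, 80, 120)

criterion = SSIM(window_size=11)
similarity = criterion(pred, target)  # approaches 1.0 for identical inputs
loss = 1.0 - similarity               # turn similarity into a minimizable loss
```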
@ -14,10 +14,13 @@ def text_to_seqvec(text, CONFIG):
        seq = np.asarray(
            phoneme_to_sequence(text, text_cleaner, CONFIG.phoneme_language,
                                CONFIG.enable_eos_bos_chars,
                                tp=CONFIG.characters if 'characters' in CONFIG.keys() else None),
                                tp=CONFIG.characters if 'characters' in CONFIG.keys() else None,
                                add_blank=CONFIG['add_blank'] if 'add_blank' in CONFIG.keys() else False),
            dtype=np.int32)
    else:
        seq = np.asarray(text_to_sequence(text, text_cleaner, tp=CONFIG.characters if 'characters' in CONFIG.keys() else None), dtype=np.int32)
        seq = np.asarray(
            text_to_sequence(text, text_cleaner, tp=CONFIG.characters if 'characters' in CONFIG.keys() else None,
                             add_blank=CONFIG['add_blank'] if 'add_blank' in CONFIG.keys() else False), dtype=np.int32)
    return seq


@ -59,7 +62,7 @@ def run_model_torch(model, inputs, CONFIG, truncated, speaker_id=None, style_mel
            inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
    elif 'glow' in CONFIG.model.lower():
        inputs_lengths = torch.tensor(inputs.shape[1:2]).to(inputs.device)  # pylint: disable=not-callable
        postnet_output, _, _, _, alignments, _, _ = model.inference(inputs, inputs_lengths)
        postnet_output, _, _, _, alignments, _, _ = model.inference(inputs, inputs_lengths, g=speaker_id if speaker_id is not None else speaker_embeddings)
        postnet_output = postnet_output.permute(0, 2, 1)
        # these only belong to tacotron models.
        decoder_output = None

@ -207,7 +210,7 @@ def synthesis(model,
    """
    # GST processing
    style_mel = None
    if CONFIG.use_gst and style_wav is not None:
    if 'use_gst' in CONFIG.keys() and CONFIG.use_gst and style_wav is not None:
        if isinstance(style_wav, dict):
            style_mel = style_wav
        else:
@ -16,6 +16,8 @@ _id_to_symbol = {i: s for i, s in enumerate(symbols)}
_phonemes_to_id = {s: i for i, s in enumerate(phonemes)}
_id_to_phonemes = {i: s for i, s in enumerate(phonemes)}

_symbols = symbols
_phonemes = phonemes
# Regular expression matching text enclosed in curly braces:
_CURLY_RE = re.compile(r'(.*?)\{(.+?)\}(.*)')


@ -57,6 +59,10 @@ def text2phone(text, language):

    return ph


def intersperse(sequence, token):
    result = [token] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result


def pad_with_eos_bos(phoneme_sequence, tp=None):
    # pylint: disable=global-statement

@ -69,10 +75,9 @@ def pad_with_eos_bos(phoneme_sequence, tp=None):

    return [_phonemes_to_id[_bos]] + list(phoneme_sequence) + [_phonemes_to_id[_eos]]


def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=None):
def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=None, add_blank=False):
    # pylint: disable=global-statement
    global _phonemes_to_id
    global _phonemes_to_id, _phonemes
    if tp:
        _, _phonemes = make_symbols(**tp)
        _phonemes_to_id = {s: i for i, s in enumerate(_phonemes)}

@ -88,13 +93,17 @@ def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=
    # Append EOS char
    if enable_eos_bos:
        sequence = pad_with_eos_bos(sequence, tp=tp)
    if add_blank:
        sequence = intersperse(sequence, len(_phonemes))  # add a blank token (new), whose id number is len(_phonemes)
    return sequence


def sequence_to_phoneme(sequence, tp=None):
def sequence_to_phoneme(sequence, tp=None, add_blank=False):
    # pylint: disable=global-statement
    '''Converts a sequence of IDs back to a string'''
    global _id_to_phonemes
    global _id_to_phonemes, _phonemes
    if add_blank:
        sequence = list(filter(lambda x: x != len(_phonemes), sequence))
    result = ''
    if tp:
        _, _phonemes = make_symbols(**tp)

@ -107,7 +116,7 @@ def sequence_to_phoneme(sequence, tp=None):
    return result.replace('}{', ' ')


def text_to_sequence(text, cleaner_names, tp=None):
def text_to_sequence(text, cleaner_names, tp=None, add_blank=False):
    '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.

    The text can optionally have ARPAbet sequences enclosed in curly braces embedded

@ -121,7 +130,7 @@ def text_to_sequence(text, cleaner_names, tp=None):
        List of integers corresponding to the symbols in the text
    '''
    # pylint: disable=global-statement
    global _symbol_to_id
    global _symbol_to_id, _symbols
    if tp:
        _symbols, _ = make_symbols(**tp)
        _symbol_to_id = {s: i for i, s in enumerate(_symbols)}

@ -137,13 +146,19 @@ def text_to_sequence(text, cleaner_names, tp=None):
            _clean_text(m.group(1), cleaner_names))
        sequence += _arpabet_to_sequence(m.group(2))
        text = m.group(3)

    if add_blank:
        sequence = intersperse(sequence, len(_symbols))  # add a blank token (new), whose id number is len(_symbols)
    return sequence


def sequence_to_text(sequence, tp=None):
def sequence_to_text(sequence, tp=None, add_blank=False):
    '''Converts a sequence of IDs back to a string'''
    # pylint: disable=global-statement
    global _id_to_symbol
    global _id_to_symbol, _symbols
    if add_blank:
        sequence = list(filter(lambda x: x != len(_symbols), sequence))

    if tp:
        _symbols, _ = make_symbols(**tp)
        _id_to_symbol = {i: s for i, s in enumerate(_symbols)}
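`intersperse` places the blank id before, between, and after every symbol, so a length-n sequence becomes length 2n + 1; using `len(_symbols)` as the token guarantees an id just outside the normal vocabulary (which is why `setup_model` grows the embedding by one). For example:

```python
def intersperse(sequence, token):
    result = [token] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result

print(intersperse([7, 8, 9], 130))  # [130, 7, 130, 8, 130, 9, 130]
```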
@ -1,6 +1,8 @@
import torch
import librosa
import matplotlib
import numpy as np
import torch

matplotlib.use('Agg')
import matplotlib.pyplot as plt
from TTS.tts.utils.text import phoneme_to_sequence, sequence_to_phoneme

@ -43,6 +45,8 @@ def plot_spectrogram(spectrogram,
        spectrogram_ = spectrogram.detach().cpu().numpy().squeeze().T
    else:
        spectrogram_ = spectrogram.T
    spectrogram_ = spectrogram_.astype(
        np.float32) if spectrogram_.dtype == np.float16 else spectrogram_
    if ap is not None:
        spectrogram_ = ap._denormalize(spectrogram_)  # pylint: disable=protected-access
    fig = plt.figure(figsize=fig_size)
@ -174,7 +174,7 @@ class AudioProcessor(object):
        for key in stats_config.keys():
            if key in skip_parameters:
                continue
            if key != 'sample_rate':
            if key not in ['sample_rate', 'trim_db']:
                assert stats_config[key] == self.__dict__[key],\
                    f" [!] Audio param {key} does not match the value used for computing mean-var stats. {stats_config[key]} vs {self.__dict__[key]}"
        return mel_mean, mel_std, linear_mean, linear_std, stats_config
@ -1,9 +1,11 @@
import os
import glob
import shutil
import datetime
import glob
import os
import shutil
import subprocess

import torch


def get_git_branch():
    try:
@ -1,5 +1,7 @@
import os
import re
import json
import yaml
import pickle as pickle_tts


@ -17,19 +19,27 @@ class AttrDict(dict):
        self.__dict__ = self


def load_config(config_path):
def load_config(config_path: str) -> AttrDict:
    """Load config files and discard comments

    Args:
        config_path (str): path to config file.
    """
    config = AttrDict()

    with open(config_path, "r") as f:
        input_str = f.read()
    # handle comments
    input_str = re.sub(r'\\\n', '', input_str)
    input_str = re.sub(r'//.*\n', '\n', input_str)
    data = json.loads(input_str)

    ext = os.path.splitext(config_path)[1]
    if ext in (".yml", ".yaml"):
        with open(config_path, "r") as f:
            data = yaml.safe_load(f)
    else:
        # fallback to json
        with open(config_path, "r") as f:
            input_str = f.read()
        # handle comments
        input_str = re.sub(r'\\\n', '', input_str)
        input_str = re.sub(r'//.*\n', '\n', input_str)
        data = json.loads(input_str)

    config.update(data)
    return config
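After this change a single loader covers both formats, dispatching on the file extension: YAML goes through `yaml.safe_load`, everything else through the existing comment-stripping JSON path. A usage sketch (file paths hypothetical):

```python
from TTS.utils.io import load_config

c = load_config("config.json")   # JSON with //-style comments, as in this repo
print(c.run_name)                # AttrDict allows attribute access

c = load_config("config.yaml")   # YAML goes through the same entry point
```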
@ -92,8 +92,8 @@
    // DATASET
    "data_path": "/home/erogol/Data/MozillaMerged22050/wavs/",
    "feature_path": null,
    "seq_len": 16384,
    "pad_short": 2000,
    "seq_len": 6144,
    "pad_short": 500,
    "conv_pad": 0,
    "use_noise_augment": false,
    "use_cache": true,

@ -102,6 +102,16 @@

    // TRAINING
    "batch_size": 64, // Batch size for training.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 50
    },

    // VALIDATION
    "run_eval": true,
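The schedule blocks describe a linear grid of noise variances (betas) for the diffusion-style vocoder: a fine 1000-step grid for training and a coarse 50-step grid for faster test-time sampling. A sketch of how such a grid is typically materialized (function name and exact use are assumptions, not taken from this commit):

```python
import numpy as np

def make_noise_schedule(min_val, max_val, num_steps):
    """Linear beta grid plus cumulative alpha products, as commonly used
    by diffusion vocoders. A sketch; the trainer's exact code may differ."""
    beta = np.linspace(min_val, max_val, num_steps)
    alpha_hat = np.cumprod(1.0 - beta)
    return beta, alpha_hat

beta, alpha_hat = make_noise_schedule(1e-6, 1e-2, 1000)  # train schedule
```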
@ -0,0 +1,138 @@
{
    "run_name": "fullband-melgan",
    "run_description": "fullband melgan mean-var scaling",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": "/home/erogol/Data/libritts/LibriTTS/scale_stats.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54324"
    },

    // MODEL PARAMETERS
    "use_pqmf": false,

    // LOSS PARAMETERS
    "use_stft_loss": true,
    "use_subband_stft_loss": false,
    "use_mse_gan_loss": true,
    "use_hinge_gan_loss": false,
    "use_feat_match_loss": false, // use only with melgan discriminators

    // loss weights
    "stft_loss_weight": 0.5,
    "subband_stft_loss_weight": 0.5,
    "mse_G_loss_weight": 2.5,
    "hinge_G_loss_weight": 2.5,
    "feat_match_loss_weight": 25,

    // multiscale stft loss parameters
    "stft_loss_params": {
        "n_ffts": [1024, 2048, 512],
        "hop_lengths": [120, 240, 50],
        "win_lengths": [600, 1200, 240]
    },

    "target_loss": "avg_G_loss", // loss value to pick the best model to save after each epoch

    // DISCRIMINATOR
    "discriminator_model": "melgan_multiscale_discriminator",
    "discriminator_model_params":{
        "base_channels": 16,
        "max_channels": 512,
        "downsample_factors": [4, 4, 4]
    },
    "steps_to_start_discriminator": 200000, // steps required to start GAN training.

    // GENERATOR
    "generator_model": "fullband_melgan_generator",
    "generator_model_params": {
        "upsample_factors": [8, 8, 4],
        "num_res_blocks": 4
    },

    // DATASET
    "data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/",
    "feature_path": null,
    "seq_len": 16384,
    "pad_short": 2000,
    "conv_pad": 0,
    "use_noise_augment": false,
    "use_cache": true,

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 48, // Batch size for training.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 10, // Until the model is stable, testing only wastes computation time.
    "test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "epochs": 10000, // total number of epochs to train.
    "wd": 0.0, // Weight decay weight.
    "gen_clip_grad": -1, // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "disc_clip_grad": -1, // Discriminator gradient clipping threshold.
    "lr_scheduler_gen": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_gen_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_scheduler_disc": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_disc_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_gen": 0.000015625, // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_disc": 0.000015625,

    // TENSORBOARD and LOGGING
    "print_step": 25, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 25000, // Number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 10,

    // PATHS
    "output_path": "/home/erogol/Models/"
}
@ -0,0 +1,116 @@
{
    "run_name": "wavegrad-libritts",
    "run_description": "wavegrad libritts",

    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": "/home/erogol/Data/libritts/LibriTTS/scale_stats_wavegrad.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "mixed_precision": true, // enable torch mixed precision training (true, false)
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54322"
    },

    "target_loss": "avg_wavegrad_loss", // loss value to pick the best model to save after each epoch

    // MODEL PARAMETERS
    "generator_model": "wavegrad",
    "model_params":{
        "use_weight_norm": true,
        "y_conv_channels": 32,
        "x_conv_channels": 768,
        "ublock_out_channels": [512, 512, 256, 128, 128],
        "dblock_out_channels": [128, 128, 256, 512],
        "upsample_factors": [4, 4, 4, 2, 2],
        "upsample_dilations": [
            [1, 2, 1, 2],
            [1, 2, 1, 2],
            [1, 2, 4, 8],
            [1, 2, 4, 8],
            [1, 2, 4, 8]]
    },

    // DATASET
    "data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/", // root data path. It finds all wav files recursively from there.
    "feature_path": null, // if you use precomputed features
    "seq_len": 6144, // 24 * hop_length
    "pad_short": 0, // additional padding for short wavs
    "conv_pad": 0, // additional padding against convolutions applied to spectrograms
    "use_noise_augment": false, // add noise to the audio signal for augmentation
    "use_cache": false, // use in memory cache to keep the computed features. This might cause OOM.

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 96, // Batch size for training.

    // NOISE SCHEDULE PARAMS - Only effective at training time.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 50
    },

    // VALIDATION
    "run_eval": true, // enable/disable evaluation run

    // OPTIMIZER
    "epochs": 10000, // total number of epochs to train.
    "clip_grad": 1.0, // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.

    // TENSORBOARD and LOGGING
    "print_step": 50, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 5000, // Number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 256,

    // PATHS
    "output_path": "/home/erogol/Models/LJSpeech/"
}
@ -0,0 +1,98 @@
{
    "run_name": "wavernn_librittts",
    "run_description": "wavernn libritts training from LJSpeech model",

    // AUDIO PARAMETERS
    "audio": {
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.
        // Audio processing parameters
        "sample_rate": 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
        // Silence trimming
        "do_trim_silence": false, // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.
        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 40.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 20.0, // scaler value applied after log transform of spectrogram.
        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // Generating / Synthesizing
    "batched": true,
    "target_samples": 11000, // target number of samples to be generated in each batch entry
    "overlap_samples": 550, // number of samples for crossfading between batches
    // DISTRIBUTED TRAINING
    // "distributed":{
    //     "backend": "nccl",
    //     "url": "tcp:\/\/localhost:54321"
    // },

    // MODEL MODE
    "mode": "mold", // mold [string], gauss [string], bits [int]
    "mulaw": true, // apply mulaw if mode is bits

    // MODEL PARAMETERS
    "wavernn_model_params": {
        "rnn_dims": 512,
        "fc_dims": 512,
        "compute_dims": 128,
        "res_out_dims": 128,
        "num_res_blocks": 10,
        "use_aux_net": true,
        "use_upsample_net": true,
        "upsample_factors": [4, 8, 8] // this needs to correctly factorise hop_length
    },

    // DATASET
    //"use_gta": true, // use computed gta features from the tts model
    "data_path": "/home/erogol/Data/libritts/LibriTTS/train-clean-360/", // path containing training wav files
    "feature_path": null, // path containing computed features from wav files; if null, compute them
    "seq_len": 1280, // has to be divisible by hop_length
    "padding": 2, // pad the input for resnet to see wider input length

    // TRAINING
    "batch_size": 256, // Batch size for training.
    "epochs": 10000, // total number of epochs to train.
    "mixed_precision": true, // enable / disable mixed precision training

    // VALIDATION
    "run_eval": true,
    "test_every_epochs": 10, // Test after set number of epochs (Test every 10 epochs for example)

    // OPTIMIZER
    "grad_clip": 4, // apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [200000, 400000, 600000]
    },
    "lr": 1e-4, // initial learning rate

    // TENSORBOARD and LOGGING
    "print_step": 25, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 25000, // Number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 50, // number of samples for testing

    // PATHS
    "output_path": "/home/erogol/Models/LJSpeech/"
}
@ -1,17 +1,38 @@
import glob
import os
from pathlib import Path
from tqdm import tqdm

import numpy as np


def preprocess_wav_files(out_path, config, ap):
    os.makedirs(os.path.join(out_path, "quant"), exist_ok=True)
    os.makedirs(os.path.join(out_path, "mel"), exist_ok=True)
    wav_files = find_wav_files(config.data_path)
    for path in tqdm(wav_files):
        wav_name = Path(path).stem
        quant_path = os.path.join(out_path, "quant", wav_name + ".npy")
        mel_path = os.path.join(out_path, "mel", wav_name + ".npy")
        y = ap.load_wav(path)
        mel = ap.melspectrogram(y)
        np.save(mel_path, mel)
        if isinstance(config.mode, int):
            quant = (
                ap.mulaw_encode(y, qc=config.mode)
                if config.mulaw
                else ap.quantize(y, bits=config.mode)
            )
            np.save(quant_path, quant)


def find_wav_files(data_path):
    wav_paths = glob.glob(os.path.join(data_path, '**', '*.wav'), recursive=True)
    wav_paths = glob.glob(os.path.join(data_path, "**", "*.wav"), recursive=True)
    return wav_paths


def find_feat_files(data_path):
    feat_paths = glob.glob(os.path.join(data_path, '**', '*.npy'), recursive=True)
    feat_paths = glob.glob(os.path.join(data_path, "**", "*.npy"), recursive=True)
    return feat_paths


@ -23,8 +44,12 @@ def load_wav_data(data_path, eval_split_size):


def load_wav_feat_data(data_path, feat_path, eval_split_size):
    wav_paths = sorted(find_wav_files(data_path))
    feat_paths = sorted(find_feat_files(feat_path))
    wav_paths = find_wav_files(data_path)
    feat_paths = find_feat_files(feat_path)

    wav_paths.sort(key=lambda x: Path(x).stem)
    feat_paths.sort(key=lambda x: Path(x).stem)

    assert len(wav_paths) == len(feat_paths)
    for wav, feat in zip(wav_paths, feat_paths):
        wav_name = Path(wav).stem
@ -0,0 +1,131 @@
import os
import glob
import torch
import random
import numpy as np
from torch.utils.data import Dataset
from multiprocessing import Manager


class WaveGradDataset(Dataset):
    """
    WaveGrad Dataset searches for all the wav files under root path,
    converts them to acoustic features on the fly, and returns
    random segments of (audio, feature) pairs.
    """
    def __init__(self,
                 ap,
                 items,
                 seq_len,
                 hop_len,
                 pad_short,
                 conv_pad=2,
                 is_training=True,
                 return_segments=True,
                 use_noise_augment=False,
                 use_cache=False,
                 verbose=False):

        self.ap = ap
        self.item_list = items
        self.seq_len = seq_len if return_segments else None
        self.hop_len = hop_len
        self.pad_short = pad_short
        self.conv_pad = conv_pad
        self.is_training = is_training
        self.return_segments = return_segments
        self.use_cache = use_cache
        self.use_noise_augment = use_noise_augment
        self.verbose = verbose

        if return_segments:
            assert seq_len % hop_len == 0, " [!] seq_len has to be a multiple of hop_len."
        self.feat_frame_len = seq_len // hop_len + (2 * conv_pad)

        # cache acoustic features
        if use_cache:
            self.create_feature_cache()

    def create_feature_cache(self):
        self.manager = Manager()
        self.cache = self.manager.list()
        self.cache += [None for _ in range(len(self.item_list))]

    @staticmethod
    def find_wav_files(path):
        return glob.glob(os.path.join(path, '**', '*.wav'), recursive=True)

    def __len__(self):
        return len(self.item_list)

    def __getitem__(self, idx):
        item = self.load_item(idx)
        return item

    def load_test_samples(self, num_samples):
        samples = []
        return_segments = self.return_segments
        self.return_segments = False
        for idx in range(num_samples):
            mel, audio = self.load_item(idx)
            samples.append([mel, audio])
        self.return_segments = return_segments
        return samples

    def load_item(self, idx):
        """ load (audio, feat) couple """
        # compute features from wav
        wavpath = self.item_list[idx]

        if self.use_cache and self.cache[idx] is not None:
            audio = self.cache[idx]
        else:
            audio = self.ap.load_wav(wavpath)

            if self.return_segments:
                # correct audio length wrt segment length
                if audio.shape[-1] < self.seq_len + self.pad_short:
                    audio = np.pad(audio, (0, self.seq_len + self.pad_short - len(audio)),
                                   mode='constant', constant_values=0.0)
                assert audio.shape[-1] >= self.seq_len + self.pad_short, f"{audio.shape[-1]} vs {self.seq_len + self.pad_short}"

            # correct the audio length wrt hop length
            p = (audio.shape[-1] // self.hop_len + 1) * self.hop_len - audio.shape[-1]
            audio = np.pad(audio, (0, p), mode='constant', constant_values=0.0)

            if self.use_cache:
                self.cache[idx] = audio

        if self.return_segments:
            max_start = len(audio) - self.seq_len
            start = random.randint(0, max_start)
            end = start + self.seq_len
            audio = audio[start:end]

        if self.use_noise_augment and self.is_training and self.return_segments:
            # audio is a numpy array at this point
            audio = audio + (1 / 32768) * np.random.randn(*audio.shape).astype(audio.dtype)

        mel = self.ap.melspectrogram(audio)
        mel = mel[..., :-1]  # ignore the padding

        audio = torch.from_numpy(audio).float()
        mel = torch.from_numpy(mel).float().squeeze(0)
        return (mel, audio)

    @staticmethod
    def collate_full_clips(batch):
        """This is used in tune_wavegrad.py.
        It pads sequences to the max length."""
        max_mel_length = max([b[0].shape[1] for b in batch]) if len(batch) > 1 else batch[0][0].shape[1]
        max_audio_length = max([b[1].shape[0] for b in batch]) if len(batch) > 1 else batch[0][1].shape[0]

        mels = torch.zeros([len(batch), batch[0][0].shape[0], max_mel_length])
        audios = torch.zeros([len(batch), max_audio_length])

        for idx, b in enumerate(batch):
            mel = b[0]
            audio = b[1]
            mels[idx, :, :mel.shape[1]] = mel
            audios[idx, :audio.shape[0]] = audio

        return mels, audios
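A hedged usage sketch for the dataset above; `ap` (an `AudioProcessor`) and `wav_paths` are assumed to exist, and the numbers mirror the wavegrad config earlier in this diff. Since segments are fixed-size tensors, the default collate works:

```python
from torch.utils.data import DataLoader

dataset = WaveGradDataset(ap=ap,             # assumed AudioProcessor instance
                          items=wav_paths,   # assumed list of wav file paths
                          seq_len=6144,      # "seq_len" in the config
                          hop_len=256,       # "hop_length" in the config
                          pad_short=0,
                          conv_pad=0)
loader = DataLoader(dataset, batch_size=96, num_workers=4, shuffle=True)
mel, audio = next(iter(loader))  # mel: [B, num_mels, frames], audio: [B, 6144]
```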
@ -0,0 +1,118 @@
import torch
import numpy as np
from torch.utils.data import Dataset


class WaveRNNDataset(Dataset):
    """
    WaveRNN Dataset searches for all the wav files under root path
    and converts them to acoustic features on the fly.
    """

    def __init__(self,
                 ap,
                 items,
                 seq_len,
                 hop_len,
                 pad,
                 mode,
                 mulaw,
                 is_training=True,
                 verbose=False,
                 ):

        self.ap = ap
        self.compute_feat = not isinstance(items[0], (tuple, list))
        self.item_list = items
        self.seq_len = seq_len
        self.hop_len = hop_len
        self.mel_len = seq_len // hop_len
        self.pad = pad
        self.mode = mode
        self.mulaw = mulaw
        self.is_training = is_training
        self.verbose = verbose

        assert self.seq_len % self.hop_len == 0

    def __len__(self):
        return len(self.item_list)

    def __getitem__(self, index):
        item = self.load_item(index)
        return item

    def load_item(self, index):
        """
        load (audio, feat) couple if feature_path is set,
        else compute it on the fly
        """
        if self.compute_feat:

            wavpath = self.item_list[index]
            audio = self.ap.load_wav(wavpath)
            min_audio_len = 2 * self.seq_len + (2 * self.pad * self.hop_len)
            if audio.shape[0] < min_audio_len:
                print(" [!] Instance is too short! : {}".format(wavpath))
                audio = np.pad(audio, [0, min_audio_len - audio.shape[0] + self.hop_len])
            mel = self.ap.melspectrogram(audio)

            if self.mode in ["gauss", "mold"]:
                x_input = audio
            elif isinstance(self.mode, int):
                x_input = (self.ap.mulaw_encode(audio, qc=self.mode)
                           if self.mulaw else self.ap.quantize(audio, bits=self.mode))
            else:
                raise RuntimeError("Unknown dataset mode - {}".format(self.mode))

        else:

            wavpath, feat_path = self.item_list[index]
            mel = np.load(feat_path.replace("/quant/", "/mel/"))

            if mel.shape[-1] < self.mel_len + 2 * self.pad:
                print(" [!] Instance is too short! : {}".format(wavpath))
                self.item_list[index] = self.item_list[index + 1]
                wavpath, feat_path = self.item_list[index]
                mel = np.load(feat_path.replace("/quant/", "/mel/"))
            if self.mode in ["gauss", "mold"]:
                x_input = self.ap.load_wav(wavpath)
            elif isinstance(self.mode, int):
                x_input = np.load(feat_path.replace("/mel/", "/quant/"))
            else:
                raise RuntimeError("Unknown dataset mode - {}".format(self.mode))

        return mel, x_input, wavpath

    def collate(self, batch):
        mel_win = self.seq_len // self.hop_len + 2 * self.pad
        max_offsets = [x[0].shape[-1] -
                       (mel_win + 2 * self.pad) for x in batch]

        mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
        sig_offsets = [(offset + self.pad) *
                       self.hop_len for offset in mel_offsets]

        mels = [
            x[0][:, mel_offsets[i]: mel_offsets[i] + mel_win]
            for i, x in enumerate(batch)
        ]

        coarse = [
            x[1][sig_offsets[i]: sig_offsets[i] + self.seq_len + 1]
            for i, x in enumerate(batch)
        ]

        mels = np.stack(mels).astype(np.float32)
        if self.mode in ["gauss", "mold"]:
            coarse = np.stack(coarse).astype(np.float32)
            coarse = torch.FloatTensor(coarse)
            x_input = coarse[:, : self.seq_len]
        elif isinstance(self.mode, int):
            coarse = np.stack(coarse).astype(np.int64)
            coarse = torch.LongTensor(coarse)
            x_input = (2 * coarse[:, : self.seq_len].float() /
                       (2 ** self.mode - 1.0) - 1.0)
        y_coarse = coarse[:, 1:]
        mels = torch.FloatTensor(mels)
        return x_input, mels, y_coarse
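In `collate`, each coarse slice is cut `seq_len + 1` samples long so that input and target are the same series shifted by one step, the standard teacher-forcing setup for an autoregressive vocoder:

```python
import torch

seq_len = 4
coarse = torch.arange(seq_len + 1).unsqueeze(0)  # one item: [[0, 1, 2, 3, 4]]

x_input = coarse[:, :seq_len]   # [[0, 1, 2, 3]] -> what the RNN conditions on
y_coarse = coarse[:, 1:]        # [[1, 2, 3, 4]] -> next-sample targets
```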
@ -0,0 +1,175 @@
|
|||
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm


class Conv1d(nn.Conv1d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        nn.init.orthogonal_(self.weight)
        nn.init.zeros_(self.bias)


class PositionalEncoding(nn.Module):
    """Positional encoding with noise level conditioning"""
    def __init__(self, n_channels, max_len=10000):
        super().__init__()
        self.n_channels = n_channels
        self.max_len = max_len
        self.C = 5000
        self.pe = torch.zeros(0, 0)

    def forward(self, x, noise_level):
        if x.shape[2] > self.pe.shape[1]:
            self.init_pe_matrix(x.shape[1], x.shape[2], x)
        return x + noise_level[..., None, None] + self.pe[:, :x.size(2)].repeat(x.shape[0], 1, 1) / self.C

    def init_pe_matrix(self, n_channels, max_len, x):
        pe = torch.zeros(max_len, n_channels)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.pow(10000, torch.arange(0, n_channels, 2).float() / n_channels)

        pe[:, 0::2] = torch.sin(position / div_term)
        pe[:, 1::2] = torch.cos(position / div_term)
        self.pe = pe.transpose(0, 1).to(x)


class FiLM(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.encoding = PositionalEncoding(input_size)
        self.input_conv = nn.Conv1d(input_size, input_size, 3, padding=1)
        self.output_conv = nn.Conv1d(input_size, output_size * 2, 3, padding=1)

        nn.init.xavier_uniform_(self.input_conv.weight)
        nn.init.xavier_uniform_(self.output_conv.weight)
        nn.init.zeros_(self.input_conv.bias)
        nn.init.zeros_(self.output_conv.bias)

    def forward(self, x, noise_scale):
        o = self.input_conv(x)
        o = F.leaky_relu(o, 0.2)
        o = self.encoding(o, noise_scale)
        shift, scale = torch.chunk(self.output_conv(o), 2, dim=1)
        return shift, scale

    def remove_weight_norm(self):
        nn.utils.remove_weight_norm(self.input_conv)
        nn.utils.remove_weight_norm(self.output_conv)

    def apply_weight_norm(self):
        self.input_conv = weight_norm(self.input_conv)
        self.output_conv = weight_norm(self.output_conv)


# renamed from the original `shif_and_scale` typo; call sites updated to match
@torch.jit.script
def shift_and_scale(x, scale, shift):
    o = shift + scale * x
    return o


class UBlock(nn.Module):
    def __init__(self, input_size, hidden_size, factor, dilation):
        super().__init__()
        assert isinstance(dilation, (list, tuple))
        assert len(dilation) == 4

        self.factor = factor
        self.res_block = Conv1d(input_size, hidden_size, 1)
        self.main_block = nn.ModuleList([
            Conv1d(input_size, hidden_size, 3,
                   dilation=dilation[0], padding=dilation[0]),
            Conv1d(hidden_size, hidden_size, 3,
                   dilation=dilation[1], padding=dilation[1])
        ])
        self.out_block = nn.ModuleList([
            Conv1d(hidden_size, hidden_size, 3,
                   dilation=dilation[2], padding=dilation[2]),
            Conv1d(hidden_size, hidden_size, 3,
                   dilation=dilation[3], padding=dilation[3])
        ])

    def forward(self, x, shift, scale):
        x_inter = F.interpolate(x, size=x.shape[-1] * self.factor)
        res = self.res_block(x_inter)
        o = F.leaky_relu(x_inter, 0.2)
        o = F.interpolate(o, size=x.shape[-1] * self.factor)
        o = self.main_block[0](o)
        o = shift_and_scale(o, scale, shift)
        o = F.leaky_relu(o, 0.2)
        o = self.main_block[1](o)
        res2 = res + o
        o = shift_and_scale(res2, scale, shift)
        o = F.leaky_relu(o, 0.2)
        o = self.out_block[0](o)
        o = shift_and_scale(o, scale, shift)
        o = F.leaky_relu(o, 0.2)
        o = self.out_block[1](o)
        o = o + res2
        return o

    def remove_weight_norm(self):
        nn.utils.remove_weight_norm(self.res_block)
        for _, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                nn.utils.remove_weight_norm(layer)
        for _, layer in enumerate(self.out_block):
            if len(layer.state_dict()) != 0:
                nn.utils.remove_weight_norm(layer)

    def apply_weight_norm(self):
        self.res_block = weight_norm(self.res_block)
        for idx, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                self.main_block[idx] = weight_norm(layer)
        for idx, layer in enumerate(self.out_block):
            if len(layer.state_dict()) != 0:
                self.out_block[idx] = weight_norm(layer)


class DBlock(nn.Module):
    def __init__(self, input_size, hidden_size, factor):
        super().__init__()
        self.factor = factor
        self.res_block = Conv1d(input_size, hidden_size, 1)
        self.main_block = nn.ModuleList([
            Conv1d(input_size, hidden_size, 3, dilation=1, padding=1),
            Conv1d(hidden_size, hidden_size, 3, dilation=2, padding=2),
            Conv1d(hidden_size, hidden_size, 3, dilation=4, padding=4),
        ])

    def forward(self, x):
        size = x.shape[-1] // self.factor
        res = self.res_block(x)
        res = F.interpolate(res, size=size)
        o = F.interpolate(x, size=size)
        for layer in self.main_block:
            o = F.leaky_relu(o, 0.2)
            o = layer(o)
        return o + res

    def remove_weight_norm(self):
        nn.utils.remove_weight_norm(self.res_block)
        for _, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                nn.utils.remove_weight_norm(layer)

    def apply_weight_norm(self):
        self.res_block = weight_norm(self.res_block)
        for idx, layer in enumerate(self.main_block):
            if len(layer.state_dict()) != 0:
                self.main_block[idx] = weight_norm(layer)
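A shape sketch for the blocks above (assumed sizes, not from the diff): a `DBlock` halves the time axis of the waveform branch, `FiLM` maps that activation plus a noise level to a `(shift, scale)` pair, and a `UBlock` upsamples the spectrogram branch under that conditioning.

```python
import torch

d = DBlock(input_size=32, hidden_size=128, factor=2)
f = FiLM(input_size=128, output_size=512)
u = UBlock(input_size=768, hidden_size=512, factor=2, dilation=[1, 2, 1, 2])

y = torch.randn(1, 32, 64)           # waveform (downsampling) branch
o = d(y)                             # -> (1, 128, 32): time divided by `factor`
shift, scale = f(o, torch.rand(1))   # each -> (1, 512, 32)

x = torch.randn(1, 768, 16)          # spectrogram (upsampling) branch
out = u(x, shift, scale)             # -> (1, 512, 32): time multiplied by `factor`
print(o.shape, shift.shape, out.shape)
```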
@ -0,0 +1,177 @@
import numpy as np
import torch
from torch import nn
from torch.nn.utils import weight_norm

from ..layers.wavegrad import DBlock, FiLM, UBlock, Conv1d


class Wavegrad(nn.Module):
    # pylint: disable=dangerous-default-value
    def __init__(self,
                 in_channels=80,
                 out_channels=1,
                 use_weight_norm=False,
                 y_conv_channels=32,
                 x_conv_channels=768,
                 dblock_out_channels=[128, 128, 256, 512],
                 ublock_out_channels=[512, 512, 256, 128, 128],
                 upsample_factors=[5, 5, 3, 2, 2],
                 upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8],
                                     [1, 2, 4, 8], [1, 2, 4, 8]]):
        super().__init__()

        self.use_weight_norm = use_weight_norm
        self.hop_len = np.prod(upsample_factors)
        self.num_steps = None
        self.beta = None
        self.alpha = None
        self.alpha_hat = None
        self.noise_level = None
        self.c1 = None
        self.c2 = None
        self.sigma = None

        # dblocks
        self.y_conv = Conv1d(1, y_conv_channels, 5, padding=2)
        self.dblocks = nn.ModuleList([])
        ic = y_conv_channels
        for oc, df in zip(dblock_out_channels, reversed(upsample_factors)):
            self.dblocks.append(DBlock(ic, oc, df))
            ic = oc

        # film
        self.film = nn.ModuleList([])
        ic = y_conv_channels
        for oc in reversed(ublock_out_channels):
            self.film.append(FiLM(ic, oc))
            ic = oc

        # ublocks
        self.ublocks = nn.ModuleList([])
        ic = x_conv_channels
        for oc, uf, ud in zip(ublock_out_channels, upsample_factors, upsample_dilations):
            self.ublocks.append(UBlock(ic, oc, uf, ud))
            ic = oc

        self.x_conv = Conv1d(in_channels, x_conv_channels, 3, padding=1)
        self.out_conv = Conv1d(oc, out_channels, 3, padding=1)

        if use_weight_norm:
            self.apply_weight_norm()

    def forward(self, x, spectrogram, noise_scale):
        shift_and_scale = []

        x = self.y_conv(x)
        shift_and_scale.append(self.film[0](x, noise_scale))

        for film, layer in zip(self.film[1:], self.dblocks):
            x = layer(x)
            shift_and_scale.append(film(x, noise_scale))

        x = self.x_conv(spectrogram)
        for layer, (film_shift, film_scale) in zip(self.ublocks,
                                                   reversed(shift_and_scale)):
            x = layer(x, film_shift, film_scale)
        x = self.out_conv(x)
        return x

    def load_noise_schedule(self, path):
        beta = np.load(path, allow_pickle=True).item()['beta']
        self.compute_noise_level(beta)

    @torch.no_grad()
    def inference(self, x, y_n=None):
        """ x: B x D X T """
        if y_n is None:
            y_n = torch.randn(x.shape[0], 1, self.hop_len * x.shape[-1], dtype=torch.float32).to(x)
        else:
            y_n = torch.FloatTensor(y_n).unsqueeze(0).unsqueeze(0).to(x)
        sqrt_alpha_hat = self.noise_level.to(x)
        for n in range(len(self.alpha) - 1, -1, -1):
            y_n = self.c1[n] * (y_n -
                                self.c2[n] * self.forward(y_n, x, sqrt_alpha_hat[n].repeat(x.shape[0])))
            if n > 0:
                z = torch.randn_like(y_n)
                y_n += self.sigma[n - 1] * z
            y_n.clamp_(-1.0, 1.0)
        return y_n

    def compute_y_n(self, y_0):
        """Compute noisy audio based on the noise schedule"""
        self.noise_level = self.noise_level.to(y_0)
        if len(y_0.shape) == 3:
            y_0 = y_0.squeeze(1)
        s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])
        l_a, l_b = self.noise_level[s - 1], self.noise_level[s]
        noise_scale = l_a + torch.rand(y_0.shape[0]).to(y_0) * (l_b - l_a)
        noise_scale = noise_scale.unsqueeze(1)
        noise = torch.randn_like(y_0)
        noisy_audio = noise_scale * y_0 + (1.0 - noise_scale**2)**0.5 * noise
        return noise.unsqueeze(1), noisy_audio.unsqueeze(1), noise_scale[:, 0]

    def compute_noise_level(self, beta):
        """Compute noise schedule parameters"""
        self.num_steps = len(beta)
        alpha = 1 - beta
        alpha_hat = np.cumprod(alpha)
        # prepend 1.0 so that noise_level[s] stays in range for s in [1, num_steps];
        # the original code overrode this with `alpha_hat ** 0.5`, which made
        # compute_y_n index out of bounds at s == num_steps
        noise_level = np.concatenate([[1.0], alpha_hat ** 0.5], axis=0)

        # pylint: disable=not-callable
        self.beta = torch.tensor(beta.astype(np.float32))
        self.alpha = torch.tensor(alpha.astype(np.float32))
        self.alpha_hat = torch.tensor(alpha_hat.astype(np.float32))
        self.noise_level = torch.tensor(noise_level.astype(np.float32))

        self.c1 = 1 / self.alpha**0.5
        self.c2 = (1 - self.alpha) / (1 - self.alpha_hat)**0.5
        self.sigma = ((1.0 - self.alpha_hat[:-1]) / (1.0 - self.alpha_hat[1:]) * self.beta[1:])**0.5

    def remove_weight_norm(self):
        for _, layer in enumerate(self.dblocks):
            if len(layer.state_dict()) != 0:
                try:
                    nn.utils.remove_weight_norm(layer)
                except ValueError:
                    layer.remove_weight_norm()

        for _, layer in enumerate(self.film):
            if len(layer.state_dict()) != 0:
                try:
                    nn.utils.remove_weight_norm(layer)
                except ValueError:
                    layer.remove_weight_norm()

        for _, layer in enumerate(self.ublocks):
            if len(layer.state_dict()) != 0:
                try:
                    nn.utils.remove_weight_norm(layer)
                except ValueError:
                    layer.remove_weight_norm()

        nn.utils.remove_weight_norm(self.x_conv)
        nn.utils.remove_weight_norm(self.out_conv)
        nn.utils.remove_weight_norm(self.y_conv)

    def apply_weight_norm(self):
        for _, layer in enumerate(self.dblocks):
            if len(layer.state_dict()) != 0:
                layer.apply_weight_norm()

        for _, layer in enumerate(self.film):
            if len(layer.state_dict()) != 0:
                layer.apply_weight_norm()

        for _, layer in enumerate(self.ublocks):
            if len(layer.state_dict()) != 0:
                layer.apply_weight_norm()

        self.x_conv = weight_norm(self.x_conv)
        self.out_conv = weight_norm(self.out_conv)
        self.y_conv = weight_norm(self.y_conv)
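A minimal end-to-end sketch for the model above, assuming a short linear beta schedule (the shipped test config uses the same 1e-6 to 1e-2 range, with more steps):

```python
import numpy as np
import torch

model = Wavegrad()                  # defaults: 80 mels, hop_len = 5*5*3*2*2 = 300
beta = np.linspace(1e-6, 1e-2, 50)  # short linear schedule for illustration
model.compute_noise_level(beta)

mel = torch.randn(1, 80, 20)        # B x D x T conditioning spectrogram
with torch.no_grad():
    audio = model.inference(mel)    # iterative denoising -> (1, 1, 300 * 20)
print(audio.shape)
```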
@ -0,0 +1,501 @@
import sys
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# fix this
from TTS.utils.audio import AudioProcessor as ap
from TTS.vocoder.utils.distribution import (
    sample_from_gaussian,
    sample_from_discretized_mix_logistic,
)


def stream(string, variables):
    sys.stdout.write(f"\r{string}" % variables)


# pylint: disable=abstract-method
# relates https://github.com/pytorch/pytorch/issues/42305
class ResBlock(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.conv1 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
        self.conv2 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
        self.batch_norm1 = nn.BatchNorm1d(dims)
        self.batch_norm2 = nn.BatchNorm1d(dims)

    def forward(self, x):
        residual = x
        x = self.conv1(x)
        x = self.batch_norm1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.batch_norm2(x)
        return x + residual


class MelResNet(nn.Module):
    def __init__(self, num_res_blocks, in_dims, compute_dims, res_out_dims, pad):
        super().__init__()
        k_size = pad * 2 + 1
        self.conv_in = nn.Conv1d(
            in_dims, compute_dims, kernel_size=k_size, bias=False)
        self.batch_norm = nn.BatchNorm1d(compute_dims)
        self.layers = nn.ModuleList()
        for _ in range(num_res_blocks):
            self.layers.append(ResBlock(compute_dims))
        self.conv_out = nn.Conv1d(compute_dims, res_out_dims, kernel_size=1)

    def forward(self, x):
        x = self.conv_in(x)
        x = self.batch_norm(x)
        x = F.relu(x)
        for f in self.layers:
            x = f(x)
        x = self.conv_out(x)
        return x


class Stretch2d(nn.Module):
    def __init__(self, x_scale, y_scale):
        super().__init__()
        self.x_scale = x_scale
        self.y_scale = y_scale

    def forward(self, x):
        b, c, h, w = x.size()
        x = x.unsqueeze(-1).unsqueeze(3)
        x = x.repeat(1, 1, 1, self.y_scale, 1, self.x_scale)
        return x.view(b, c, h * self.y_scale, w * self.x_scale)


class UpsampleNetwork(nn.Module):
    def __init__(
        self,
        feat_dims,
        upsample_scales,
        compute_dims,
        num_res_blocks,
        res_out_dims,
        pad,
        use_aux_net,
    ):
        super().__init__()
        self.total_scale = np.cumprod(upsample_scales)[-1]
        self.indent = pad * self.total_scale
        self.use_aux_net = use_aux_net
        if use_aux_net:
            self.resnet = MelResNet(
                num_res_blocks, feat_dims, compute_dims, res_out_dims, pad
            )
            self.resnet_stretch = Stretch2d(self.total_scale, 1)
        self.up_layers = nn.ModuleList()
        for scale in upsample_scales:
            k_size = (1, scale * 2 + 1)
            padding = (0, scale)
            stretch = Stretch2d(scale, 1)
            conv = nn.Conv2d(1, 1, kernel_size=k_size,
                             padding=padding, bias=False)
            conv.weight.data.fill_(1.0 / k_size[1])
            self.up_layers.append(stretch)
            self.up_layers.append(conv)

    def forward(self, m):
        if self.use_aux_net:
            aux = self.resnet(m).unsqueeze(1)
            aux = self.resnet_stretch(aux)
            aux = aux.squeeze(1)
            aux = aux.transpose(1, 2)
        else:
            aux = None
        m = m.unsqueeze(1)
        for f in self.up_layers:
            m = f(m)
        m = m.squeeze(1)[:, :, self.indent: -self.indent]
        return m.transpose(1, 2), aux


class Upsample(nn.Module):
    def __init__(
        self, scale, pad, num_res_blocks, feat_dims, compute_dims, res_out_dims, use_aux_net
    ):
        super().__init__()
        self.scale = scale
        self.pad = pad
        self.indent = pad * scale
        self.use_aux_net = use_aux_net
        self.resnet = MelResNet(num_res_blocks, feat_dims,
                                compute_dims, res_out_dims, pad)

    def forward(self, m):
        if self.use_aux_net:
            aux = self.resnet(m)
            aux = torch.nn.functional.interpolate(
                aux, scale_factor=self.scale, mode="linear", align_corners=True
            )
            aux = aux.transpose(1, 2)
        else:
            aux = None
        m = torch.nn.functional.interpolate(
            m, scale_factor=self.scale, mode="linear", align_corners=True
        )
        m = m[:, :, self.indent: -self.indent]
        m = m * 0.045  # empirically found

        return m.transpose(1, 2), aux


class WaveRNN(nn.Module):
    def __init__(self,
                 rnn_dims,
                 fc_dims,
                 mode,
                 mulaw,
                 pad,
                 use_aux_net,
                 use_upsample_net,
                 upsample_factors,
                 feat_dims,
                 compute_dims,
                 res_out_dims,
                 num_res_blocks,
                 hop_length,
                 sample_rate,
                 ):
        super().__init__()
        self.mode = mode
        self.mulaw = mulaw
        self.pad = pad
        self.use_upsample_net = use_upsample_net
        self.use_aux_net = use_aux_net
        if isinstance(self.mode, int):
            self.n_classes = 2 ** self.mode
        elif self.mode == "mold":
            self.n_classes = 3 * 10
        elif self.mode == "gauss":
            self.n_classes = 2
        else:
            raise RuntimeError("Unknown model mode value - ", self.mode)

        self.rnn_dims = rnn_dims
        self.aux_dims = res_out_dims // 4
        self.hop_length = hop_length
        self.sample_rate = sample_rate

        if self.use_upsample_net:
            assert (
                np.cumprod(upsample_factors)[-1] == self.hop_length
            ), " [!] upsample scales need to multiply out to hop_length"
            self.upsample = UpsampleNetwork(
                feat_dims,
                upsample_factors,
                compute_dims,
                num_res_blocks,
                res_out_dims,
                pad,
                use_aux_net,
            )
        else:
            self.upsample = Upsample(
                hop_length,
                pad,
                num_res_blocks,
                feat_dims,
                compute_dims,
                res_out_dims,
                use_aux_net,
            )
        if self.use_aux_net:
            self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
            self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
            self.rnn2 = nn.GRU(rnn_dims + self.aux_dims,
                               rnn_dims, batch_first=True)
            self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
            self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
            self.fc3 = nn.Linear(fc_dims, self.n_classes)
        else:
            self.I = nn.Linear(feat_dims + 1, rnn_dims)
            self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
            self.rnn2 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
            self.fc1 = nn.Linear(rnn_dims, fc_dims)
            self.fc2 = nn.Linear(fc_dims, fc_dims)
            self.fc3 = nn.Linear(fc_dims, self.n_classes)

    def forward(self, x, mels):
        bsize = x.size(0)
        h1 = torch.zeros(1, bsize, self.rnn_dims).to(x.device)
        h2 = torch.zeros(1, bsize, self.rnn_dims).to(x.device)
        mels, aux = self.upsample(mels)

        if self.use_aux_net:
            aux_idx = [self.aux_dims * i for i in range(5)]
            a1 = aux[:, :, aux_idx[0]: aux_idx[1]]
            a2 = aux[:, :, aux_idx[1]: aux_idx[2]]
            a3 = aux[:, :, aux_idx[2]: aux_idx[3]]
            a4 = aux[:, :, aux_idx[3]: aux_idx[4]]

        x = (
            torch.cat([x.unsqueeze(-1), mels, a1], dim=2)
            if self.use_aux_net
            else torch.cat([x.unsqueeze(-1), mels], dim=2)
        )
        x = self.I(x)
        res = x
        self.rnn1.flatten_parameters()
        x, _ = self.rnn1(x, h1)

        x = x + res
        res = x
        x = torch.cat([x, a2], dim=2) if self.use_aux_net else x
        self.rnn2.flatten_parameters()
        x, _ = self.rnn2(x, h2)

        x = x + res
        x = torch.cat([x, a3], dim=2) if self.use_aux_net else x
        x = F.relu(self.fc1(x))

        x = torch.cat([x, a4], dim=2) if self.use_aux_net else x
        x = F.relu(self.fc2(x))
        return self.fc3(x)

    def inference(self, mels, batched, target, overlap):

        self.eval()
        device = mels.device
        output = []
        start = time.time()
        rnn1 = self.get_gru_cell(self.rnn1)
        rnn2 = self.get_gru_cell(self.rnn2)

        with torch.no_grad():
            if isinstance(mels, np.ndarray):
                mels = torch.FloatTensor(mels).to(device)

            if mels.ndim == 2:
                mels = mels.unsqueeze(0)
            wave_len = (mels.size(-1) - 1) * self.hop_length

            mels = self.pad_tensor(mels.transpose(
                1, 2), pad=self.pad, side="both")
            mels, aux = self.upsample(mels.transpose(1, 2))

            if batched:
                mels = self.fold_with_overlap(mels, target, overlap)
                if aux is not None:
                    aux = self.fold_with_overlap(aux, target, overlap)

            b_size, seq_len, _ = mels.size()

            h1 = torch.zeros(b_size, self.rnn_dims).to(device)
            h2 = torch.zeros(b_size, self.rnn_dims).to(device)
            x = torch.zeros(b_size, 1).to(device)

            if self.use_aux_net:
                d = self.aux_dims
                aux_split = [aux[:, :, d * i: d * (i + 1)] for i in range(4)]

            for i in range(seq_len):

                m_t = mels[:, i, :]

                if self.use_aux_net:
                    a1_t, a2_t, a3_t, a4_t = (a[:, i, :] for a in aux_split)

                x = (
                    torch.cat([x, m_t, a1_t], dim=1)
                    if self.use_aux_net
                    else torch.cat([x, m_t], dim=1)
                )
                x = self.I(x)
                h1 = rnn1(x, h1)

                x = x + h1
                inp = torch.cat([x, a2_t], dim=1) if self.use_aux_net else x
                h2 = rnn2(inp, h2)

                x = x + h2
                x = torch.cat([x, a3_t], dim=1) if self.use_aux_net else x
                x = F.relu(self.fc1(x))

                x = torch.cat([x, a4_t], dim=1) if self.use_aux_net else x
                x = F.relu(self.fc2(x))

                logits = self.fc3(x)

                if self.mode == "mold":
                    sample = sample_from_discretized_mix_logistic(
                        logits.unsqueeze(0).transpose(1, 2)
                    )
                    output.append(sample.view(-1))
                    x = sample.transpose(0, 1).to(device)
                elif self.mode == "gauss":
                    sample = sample_from_gaussian(
                        logits.unsqueeze(0).transpose(1, 2))
                    output.append(sample.view(-1))
                    x = sample.transpose(0, 1).to(device)
                elif isinstance(self.mode, int):
                    posterior = F.softmax(logits, dim=1)
                    distrib = torch.distributions.Categorical(posterior)

                    sample = 2 * distrib.sample().float() / (self.n_classes - 1.0) - 1.0
                    output.append(sample)
                    x = sample.unsqueeze(-1)
                else:
                    raise RuntimeError(
                        "Unknown model mode value - ", self.mode)

                if i % 100 == 0:
                    self.gen_display(i, seq_len, b_size, start)

            output = torch.stack(output).transpose(0, 1)
            output = output.cpu().numpy()
            output = output.astype(np.float64)

            if batched:
                output = self.xfade_and_unfold(output, target, overlap)
            else:
                output = output[0]

            if self.mulaw and isinstance(self.mode, int):
                output = ap.mulaw_decode(output, self.mode)

            # Fade-out at the end to avoid the signal cutting out suddenly
            fade_out = np.linspace(1, 0, 20 * self.hop_length)
            output = output[:wave_len]

            if wave_len > len(fade_out):
                output[-20 * self.hop_length:] *= fade_out

        self.train()
        return output

    def gen_display(self, i, seq_len, b_size, start):
        gen_rate = (i + 1) / (time.time() - start) * b_size / 1000
        realtime_ratio = gen_rate * 1000 / self.sample_rate
        stream(
            "%i/%i -- batch_size: %i -- gen_rate: %.1f kHz -- x_realtime: %.1f ",
            (i * b_size, seq_len * b_size, b_size, gen_rate, realtime_ratio),
        )

    def fold_with_overlap(self, x, target, overlap):
        """Fold the tensor with overlap for quick batched inference.
        Overlap will be used for crossfading in xfade_and_unfold()
        Args:
            x (tensor) : Upsampled conditioning features.
                         shape=(1, timesteps, features)
            target (int) : Target timesteps for each index of batch
            overlap (int) : Timesteps for both xfade and rnn warmup
        Return:
            (tensor) : shape=(num_folds, target + 2 * overlap, features)
        Details:
            x = [[h1, h2, ... hn]]
            Where each h is a vector of conditioning features
            Eg: target=2, overlap=1 with x.size(1)=10
            folded = [[h1, h2, h3, h4],
                      [h4, h5, h6, h7],
                      [h7, h8, h9, h10]]
        """

        _, total_len, features = x.size()

        # Calculate variables needed
        num_folds = (total_len - overlap) // (target + overlap)
        extended_len = num_folds * (overlap + target) + overlap
        remaining = total_len - extended_len

        # Pad if some time steps poke out
        if remaining != 0:
            num_folds += 1
            padding = target + 2 * overlap - remaining
            x = self.pad_tensor(x, padding, side="after")

        folded = torch.zeros(num_folds, target + 2 *
                             overlap, features).to(x.device)

        # Get the values for the folded tensor
        for i in range(num_folds):
            start = i * (target + overlap)
            end = start + target + 2 * overlap
            folded[i] = x[:, start:end, :]

        return folded

    @staticmethod
    def get_gru_cell(gru):
        gru_cell = nn.GRUCell(gru.input_size, gru.hidden_size)
        gru_cell.weight_hh.data = gru.weight_hh_l0.data
        gru_cell.weight_ih.data = gru.weight_ih_l0.data
        gru_cell.bias_hh.data = gru.bias_hh_l0.data
        gru_cell.bias_ih.data = gru.bias_ih_l0.data
        return gru_cell

    @staticmethod
    def pad_tensor(x, pad, side="both"):
        # NB - this is just a quick method needed right now,
        # i.e., it won't generalise to other shapes/dims
        b, t, c = x.size()
        total = t + 2 * pad if side == "both" else t + pad
        padded = torch.zeros(b, total, c).to(x.device)
        if side in ("before", "both"):
            padded[:, pad: pad + t, :] = x
        elif side == "after":
            padded[:, :t, :] = x
        return padded

    @staticmethod
    def xfade_and_unfold(y, target, overlap):
        """Applies a crossfade and unfolds into a 1d array.
        Args:
            y (ndarray) : Batched sequences of audio samples
                          shape=(num_folds, target + 2 * overlap)
                          dtype=np.float64
            target (int) : Target timesteps for each index of batch
            overlap (int) : Timesteps for both xfade and rnn warmup
        Return:
            (ndarray) : audio samples in a 1d array
                        shape=(total_len)
                        dtype=np.float64
        Details:
            y = [[seq1],
                 [seq2],
                 [seq3]]
            Apply a gain envelope at both ends of the sequences
            y = [[seq1_in, seq1_target, seq1_out],
                 [seq2_in, seq2_target, seq2_out],
                 [seq3_in, seq3_target, seq3_out]]
            Stagger and add up the groups of samples:
            [seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
        """

        num_folds, length = y.shape
        target = length - 2 * overlap
        total_len = num_folds * (target + overlap) + overlap

        # Need some silence for the rnn warmup
        silence_len = overlap // 2
        fade_len = overlap - silence_len
        silence = np.zeros((silence_len), dtype=np.float64)

        # Equal power crossfade
        t = np.linspace(-1, 1, fade_len, dtype=np.float64)
        fade_in = np.sqrt(0.5 * (1 + t))
        fade_out = np.sqrt(0.5 * (1 - t))

        # Concat the silence to the fades
        fade_in = np.concatenate([silence, fade_in])
        fade_out = np.concatenate([fade_out, silence])

        # Apply the gain to the overlap samples
        y[:, :overlap] *= fade_in
        y[:, -overlap:] *= fade_out

        unfolded = np.zeros((total_len), dtype=np.float64)

        # Loop to add up all the samples
        for i in range(num_folds):
            start = i * (target + overlap)
            end = start + target + 2 * overlap
            unfolded[start:end] += y[i]

        return unfolded
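The fold bookkeeping above is easy to check by hand with the numbers from the docstrings: 10 frames with target=2 and overlap=1 fold into 3 rows of 4 frames, with no padding needed.

```python
total_len, target, overlap = 10, 2, 1

num_folds = (total_len - overlap) // (target + overlap)      # (10 - 1) // 3 = 3
extended_len = num_folds * (overlap + target) + overlap      # 3 * 3 + 1 = 10
remaining = total_len - extended_len                         # 0 -> no padding

starts = [i * (target + overlap) for i in range(num_folds)]  # [0, 3, 6]
rows = [(s, s + target + 2 * overlap) for s in starts]       # [(0, 4), (3, 7), (6, 10)]
print(num_folds, rows)  # matches the [[h1..h4], [h4..h7], [h7..h10]] example
```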
@ -0,0 +1,168 @@
import math

import numpy as np
import torch
import torch.nn.functional as F
from torch.distributions.normal import Normal


def gaussian_loss(y_hat, y, log_std_min=-7.0):
    assert y_hat.dim() == 3
    assert y_hat.size(2) == 2
    mean = y_hat[:, :, :1]
    log_std = torch.clamp(y_hat[:, :, 1:], min=log_std_min)
    # TODO: replace with pytorch dist
    log_probs = -0.5 * (
        -math.log(2.0 * math.pi)
        - 2.0 * log_std
        - torch.pow(y - mean, 2) * torch.exp((-2.0 * log_std))
    )
    return log_probs.squeeze().mean()


def sample_from_gaussian(y_hat, log_std_min=-7.0, scale_factor=1.0):
    assert y_hat.size(2) == 2
    mean = y_hat[:, :, :1]
    log_std = torch.clamp(y_hat[:, :, 1:], min=log_std_min)
    dist = Normal(
        mean,
        torch.exp(log_std),
    )
    sample = dist.sample()
    sample = torch.clamp(torch.clamp(
        sample, min=-scale_factor), max=scale_factor)
    del dist
    return sample


def log_sum_exp(x):
    """ numerically stable log_sum_exp implementation that prevents overflow """
    # TF ordering
    axis = len(x.size()) - 1
    m, _ = torch.max(x, dim=axis)
    m2, _ = torch.max(x, dim=axis, keepdim=True)
    return m + torch.log(torch.sum(torch.exp(x - m2), dim=axis))


# It is adapted from https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py
def discretized_mix_logistic_loss(
    y_hat, y, num_classes=65536, log_scale_min=None, reduce=True
):
    if log_scale_min is None:
        log_scale_min = float(np.log(1e-14))
    y_hat = y_hat.permute(0, 2, 1)
    assert y_hat.dim() == 3
    assert y_hat.size(1) % 3 == 0
    nr_mix = y_hat.size(1) // 3

    # (B x T x C)
    y_hat = y_hat.transpose(1, 2)

    # unpack parameters. (B, T, num_mixtures) x 3
    logit_probs = y_hat[:, :, :nr_mix]
    means = y_hat[:, :, nr_mix: 2 * nr_mix]
    log_scales = torch.clamp(
        y_hat[:, :, 2 * nr_mix: 3 * nr_mix], min=log_scale_min)

    # B x T x 1 -> B x T x num_mixtures
    y = y.expand_as(means)

    centered_y = y - means
    inv_stdv = torch.exp(-log_scales)
    plus_in = inv_stdv * (centered_y + 1.0 / (num_classes - 1))
    cdf_plus = torch.sigmoid(plus_in)
    min_in = inv_stdv * (centered_y - 1.0 / (num_classes - 1))
    cdf_min = torch.sigmoid(min_in)

    # log probability for edge case of 0 (before scaling)
    # equivalent: torch.log(F.sigmoid(plus_in))
    log_cdf_plus = plus_in - F.softplus(plus_in)

    # log probability for edge case of 255 (before scaling)
    # equivalent: (1 - F.sigmoid(min_in)).log()
    log_one_minus_cdf_min = -F.softplus(min_in)

    # probability for all other cases
    cdf_delta = cdf_plus - cdf_min

    mid_in = inv_stdv * centered_y
    # log probability in the center of the bin, to be used in extreme cases
    # (not actually used in our code)
    log_pdf_mid = mid_in - log_scales - 2.0 * F.softplus(mid_in)

    # tf equivalent
    # log_probs = tf.where(x < -0.999, log_cdf_plus,
    #                      tf.where(x > 0.999, log_one_minus_cdf_min,
    #                               tf.where(cdf_delta > 1e-5,
    #                                        tf.log(tf.maximum(cdf_delta, 1e-12)),
    #                                        log_pdf_mid - np.log(127.5))))

    # TODO: cdf_delta <= 1e-5 actually can happen. How can we choose the value
    # for the num_classes=65536 case? 1e-7? not sure..
    inner_inner_cond = (cdf_delta > 1e-5).float()

    inner_inner_out = inner_inner_cond * torch.log(
        torch.clamp(cdf_delta, min=1e-12)
    ) + (1.0 - inner_inner_cond) * (log_pdf_mid - np.log((num_classes - 1) / 2))
    inner_cond = (y > 0.999).float()
    inner_out = (
        inner_cond * log_one_minus_cdf_min +
        (1.0 - inner_cond) * inner_inner_out
    )
    cond = (y < -0.999).float()
    log_probs = cond * log_cdf_plus + (1.0 - cond) * inner_out

    log_probs = log_probs + F.log_softmax(logit_probs, -1)

    if reduce:
        return -torch.mean(log_sum_exp(log_probs))
    return -log_sum_exp(log_probs).unsqueeze(-1)


def sample_from_discretized_mix_logistic(y, log_scale_min=None):
    """
    Sample from a discretized mixture of logistic distributions.
    Args:
        y (Tensor): B x C x T
        log_scale_min (float): Log scale minimum value
    Returns:
        Tensor: sample in range of [-1, 1].
    """
    if log_scale_min is None:
        log_scale_min = float(np.log(1e-14))
    assert y.size(1) % 3 == 0
    nr_mix = y.size(1) // 3

    # B x T x C
    y = y.transpose(1, 2)
    logit_probs = y[:, :, :nr_mix]

    # sample mixture indicator from softmax
    temp = logit_probs.data.new(logit_probs.size()).uniform_(1e-5, 1.0 - 1e-5)
    temp = logit_probs.data - torch.log(-torch.log(temp))
    _, argmax = temp.max(dim=-1)

    # (B, T) -> (B, T, nr_mix)
    one_hot = to_one_hot(argmax, nr_mix)
    # select logistic parameters
    means = torch.sum(y[:, :, nr_mix: 2 * nr_mix] * one_hot, dim=-1)
    log_scales = torch.clamp(
        torch.sum(y[:, :, 2 * nr_mix: 3 * nr_mix] * one_hot, dim=-1), min=log_scale_min
    )
    # sample from logistic & clip to interval
    # we don't actually round to the nearest 8bit value when sampling
    u = means.data.new(means.size()).uniform_(1e-5, 1.0 - 1e-5)
    x = means + torch.exp(log_scales) * (torch.log(u) - torch.log(1.0 - u))

    x = torch.clamp(torch.clamp(x, min=-1.0), max=1.0)

    return x


def to_one_hot(tensor, n, fill_with=1.0):
    # we perform one-hot encoding with respect to the last axis
    one_hot = torch.FloatTensor(tensor.size() + (n,)).zero_()
    if tensor.is_cuda:
        one_hot = one_hot.cuda()
    one_hot.scatter_(len(tensor.size()), tensor.unsqueeze(-1), fill_with)
    return one_hot
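A quick sanity check (not part of the diff): the hand-rolled `log_sum_exp` above should agree with `torch.logsumexp` over the last axis.

```python
import torch

x = torch.randn(4, 7, 30)
# both reduce the last dimension: (4, 7, 30) -> (4, 7)
assert torch.allclose(log_sum_exp(x), torch.logsumexp(x, dim=-1), atol=1e-6)
```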
@ -42,12 +42,35 @@ def to_camel(text):
    return re.sub(r'(?!^)_([a-zA-Z])', lambda m: m.group(1).upper(), text)


def setup_wavernn(c):
    print(" > Model: WaveRNN")
    MyModel = importlib.import_module("TTS.vocoder.models.wavernn")
    MyModel = getattr(MyModel, "WaveRNN")
    model = MyModel(
        rnn_dims=c.wavernn_model_params['rnn_dims'],
        fc_dims=c.wavernn_model_params['fc_dims'],
        mode=c.mode,
        mulaw=c.mulaw,
        pad=c.padding,
        use_aux_net=c.wavernn_model_params['use_aux_net'],
        use_upsample_net=c.wavernn_model_params['use_upsample_net'],
        upsample_factors=c.wavernn_model_params['upsample_factors'],
        feat_dims=c.audio['num_mels'],
        compute_dims=c.wavernn_model_params['compute_dims'],
        res_out_dims=c.wavernn_model_params['res_out_dims'],
        num_res_blocks=c.wavernn_model_params['num_res_blocks'],
        hop_length=c.audio["hop_length"],
        sample_rate=c.audio["sample_rate"],
    )
    return model


def setup_generator(c):
    print(" > Generator Model: {}".format(c.generator_model))
    MyModel = importlib.import_module('TTS.vocoder.models.' +
                                      c.generator_model.lower())
    MyModel = getattr(MyModel, to_camel(c.generator_model))
    if c.generator_model in 'melgan_generator':
    if c.generator_model.lower() in 'melgan_generator':
        model = MyModel(
            in_channels=c.audio['num_mels'],
            out_channels=1,
@ -58,7 +81,7 @@ def setup_generator(c):
            num_res_blocks=c.generator_model_params['num_res_blocks'])
    if c.generator_model in 'melgan_fb_generator':
        pass
    if c.generator_model in 'multiband_melgan_generator':
    if c.generator_model.lower() in 'multiband_melgan_generator':
        model = MyModel(
            in_channels=c.audio['num_mels'],
            out_channels=4,
@ -67,7 +90,7 @@ def setup_generator(c):
            upsample_factors=c.generator_model_params['upsample_factors'],
            res_kernel=3,
            num_res_blocks=c.generator_model_params['num_res_blocks'])
    if c.generator_model in 'fullband_melgan_generator':
    if c.generator_model.lower() in 'fullband_melgan_generator':
        model = MyModel(
            in_channels=c.audio['num_mels'],
            out_channels=1,
@ -76,7 +99,7 @@ def setup_generator(c):
            upsample_factors=c.generator_model_params['upsample_factors'],
            res_kernel=3,
            num_res_blocks=c.generator_model_params['num_res_blocks'])
    if c.generator_model in 'parallel_wavegan_generator':
    if c.generator_model.lower() in 'parallel_wavegan_generator':
        model = MyModel(
            in_channels=1,
            out_channels=1,
@ -91,6 +114,17 @@ def setup_generator(c):
            bias=True,
            use_weight_norm=True,
            upsample_factors=c.generator_model_params['upsample_factors'])
    if c.generator_model.lower() in 'wavegrad':
        model = MyModel(
            in_channels=c['audio']['num_mels'],
            out_channels=1,
            use_weight_norm=c['model_params']['use_weight_norm'],
            x_conv_channels=c['model_params']['x_conv_channels'],
            y_conv_channels=c['model_params']['y_conv_channels'],
            dblock_out_channels=c['model_params']['dblock_out_channels'],
            ublock_out_channels=c['model_params']['ublock_out_channels'],
            upsample_factors=c['model_params']['upsample_factors'],
            upsample_dilations=c['model_params']['upsample_dilations'])
    return model
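`setup_wavernn` only touches the fields below, so a `SimpleNamespace` can stand in for the parsed config in a smoke test; every value here is an illustrative assumption, not a project default.

```python
from types import SimpleNamespace

c = SimpleNamespace(
    mode=10, mulaw=True, padding=2,
    audio={"num_mels": 80, "hop_length": 256, "sample_rate": 22050},
    wavernn_model_params={
        "rnn_dims": 512, "fc_dims": 512, "compute_dims": 128,
        "res_out_dims": 128, "num_res_blocks": 10,
        "use_aux_net": True, "use_upsample_net": True,
        "upsample_factors": [4, 8, 8],  # product must equal hop_length
    },
)
model = setup_wavernn(c)  # prints " > Model: WaveRNN" and returns the module
```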
@ -20,7 +20,10 @@ def load_checkpoint(model, checkpoint_path, use_cuda=False):
def save_model(model, optimizer, scheduler, model_disc, optimizer_disc,
               scheduler_disc, current_step, epoch, output_path, **kwargs):
    model_state = model.state_dict()
    if hasattr(model, 'module'):
        model_state = model.module.state_dict()
    else:
        model_state = model.state_dict()
    model_disc_state = model_disc.state_dict()\
        if model_disc is not None else None
    optimizer_state = optimizer.state_dict()\
@ -13,7 +13,11 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 2,
=======
"execution_count": null,
>>>>>>> dev
"metadata": {},
"outputs": [],
"source": [
@ -25,8 +29,13 @@
"import umap\n",
"\n",
"from TTS.speaker_encoder.model import SpeakerEncoder\n",
<<<<<<< HEAD
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.utils.io import load_config\n",
=======
"from TTS.tts.utils.audio import AudioProcessor\n",
"from TTS.tts.utils.generic_utils import load_config\n",
>>>>>>> dev
"\n",
"from bokeh.io import output_notebook, show\n",
"from bokeh.plotting import figure\n",
@ -48,6 +57,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 3,
"metadata": {},
"outputs": [
@ -367,6 +377,11 @@
"output_type": "display_data"
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"output_notebook()"
]
@ -380,12 +395,20 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#MODEL_RUN_PATH = \"libritts_360-half-October-31-2019_04+54PM-19d2f5f/\"\n",
"MODEL_RUN_PATH = \"libritts_360-half-September-28-2019_10+46AM-8565c50/\"\n",
=======
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_RUN_PATH = \"/media/erogol/data_ssd/Models/libri_tts/speaker_encoder/libritts_360-half-October-31-2019_04+54PM-19d2f5f/\"\n",
>>>>>>> dev
"MODEL_PATH = MODEL_RUN_PATH + \"best_model.pth.tar\"\n",
"CONFIG_PATH = MODEL_RUN_PATH + \"config.json\"\n",
"\n",
@ -395,11 +418,16 @@
"\n",
"# My multi speaker locations\n",
"EMBED_PATH = \"/home/erogol/Data/Libri-TTS/train-clean-360-embed_128/\"\n",
<<<<<<< HEAD
"AUDIO_PATH = \"datasets/LibriTTS/test-clean/\""
=======
"AUDIO_PATH = \"/home/erogol/Data/Libri-TTS/train-clean-360/\""
>>>>>>> dev
]
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 5,
"metadata": {},
"outputs": [
@ -413,12 +441,18 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"!ls -1 $MODEL_RUN_PATH"
]
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 6,
"metadata": {},
"outputs": [
@ -454,6 +488,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"CONFIG = load_config(CONFIG_PATH)\n",
"ap = AudioProcessor(**CONFIG['audio'])"
@ -468,6 +507,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 7,
"metadata": {},
"outputs": [
@ -479,6 +519,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"embed_files = glob.glob(EMBED_PATH+\"/**/*.npy\", recursive=True)\n",
"print(f'Embeddings found: {len(embed_files)}')"
@ -493,6 +538,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 8,
"metadata": {},
"outputs": [
@ -508,6 +554,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"embed_files[0]"
]
@ -523,6 +574,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 9,
"metadata": {},
"outputs": [
@ -534,6 +586,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"speaker_paths = list(set([os.path.dirname(os.path.dirname(embed_file)) for embed_file in embed_files]))\n",
"speaker_to_utter = {}\n",
@ -557,6 +614,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 11,
"metadata": {},
"outputs": [
@ -575,6 +633,13 @@
],
"source": [
"ttsembeds = []\n",
=======
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeds = []\n",
>>>>>>> dev
"labels = []\n",
"locations = []\n",
"\n",
@ -598,7 +663,11 @@
" embed = np.load(embed_path)\n",
" embeds.append(embed)\n",
" labels.append(str(speaker_num))\n",
<<<<<<< HEAD
" #locations.append(embed_path.replace(EMBED_PATH, '').replace('.npy','.wav'))\n",
=======
" locations.append(embed_path.replace(EMBED_PATH, '').replace('.npy','.wav'))\n",
>>>>>>> dev
"embeds = np.concatenate(embeds)"
]
},
@ -611,6 +680,7 @@
},
{
"cell_type": "code",
<<<<<<< HEAD
"execution_count": 12,
"metadata": {},
"outputs": [
@ -626,6 +696,11 @@
]
}
],
=======
"execution_count": null,
"metadata": {},
"outputs": [],
>>>>>>> dev
"source": [
"model = umap.UMAP()\n",
"projection = model.fit_transform(embeds)"
@ -729,7 +804,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
<<<<<<< HEAD
"version": "3.8.5"
=======
"version": "3.7.4"
>>>>>>> dev
}
},
"nbformat": 4,
File diff suppressed because one or more lines are too long

@ -23,3 +23,4 @@ pylint==2.5.3
gdown
umap-learn
cython
pyyaml
@ -1,3 +1,4 @@
set -e
TF_CPP_MIN_LOG_LEVEL=3

# tests
@ -6,7 +7,10 @@ nosetests tests -x &&\
# runtime tests
./tests/test_server_package.sh && \
./tests/test_tts_train.sh && \
./tests/test_vocoder_train.sh && \
./tests/test_glow-tts_train.sh && \
./tests/test_vocoder_gan_train.sh && \
./tests/test_vocoder_wavernn_train.sh && \
./tests/test_vocoder_wavegrad_train.sh && \

# linter check
cardboardlinter --refspec master
2
setup.py
2
setup.py
@ -33,7 +33,7 @@ args, unknown_args = parser.parse_known_args()
# Remove our arguments from argv so that setuptools doesn't see them
sys.argv = [sys.argv[0]] + unknown_args

version = '0.0.5'
version = '0.0.6'

# Adapted from https://github.com/pytorch/pytorch
cwd = os.path.dirname(os.path.abspath(__file__))
@ -0,0 +1,134 @@
{
    "model": "glow_tts",
    "run_name": "glow-tts-gatedconv",
    "run_description": "glow-tts model training with gated conv.",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050, // DATASET-RELATED: wav sample rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Griffin-Lim
        "power": 1.1, // value to sharpen wav signals after the GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence from audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 1.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // VOCABULARY PARAMETERS
    // if custom character set is not defined,
    // default set in symbols.py is used
    // "characters":{
    //     "pad": "_",
    //     "eos": "~",
    //     "bos": "^",
    //     "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? ",
    //     "punctuations":"!'(),-.:;? ",
    //     "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
    // },

    "add_blank": false, // if true add a new token after each token of the sentence. This increases the size of the input sequence, but has considerably improved the prosody of the GlowTTS model.

    // DISTRIBUTED TRAINING
    "mixed_precision": false,
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54323"
    },

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // MODEL PARAMETERS
    "use_mas": false, // use Monotonic Alignment Search if true. Otherwise use pre-computed attention alignments.

    // TRAINING
    "batch_size": 2, // Batch size for training. Lower values than 32 might cause hard-to-learn attention. It is overwritten by 'gradual_training'.
    "eval_batch_size":1,
    "r": 1, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
    "loss_masking": true, // enable / disable loss masking against the sequence padding.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 0, //Until attention is aligned, testing only wastes computation time.
    "test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "noam_schedule": true, // use noam warmup and lr schedule.
    "grad_clip": 5.0, // upper limit for gradients for clipping.
    "epochs": 1, // total number of epochs to train.
    "lr": 1e-3, // Initial learning rate. If Noam decay is active, maximum learning rate.
    "wd": 0.000001, // Weight decay weight.
    "warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
    "seq_len_norm": false, // Normalize each sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has a skewed distribution of sequence lengths.

    "encoder_type": "gatedconv",

    // TENSORBOARD and LOGGING
    "print_step": 25, // Number of steps to log training on console.
    "tb_plot_step": 100, // Number of steps to plot TB training figures.
    "print_eval": false, // If True, it prints intermediate loss values in evaluation.
    "save_step": 5000, // Number of training steps expected to save training stats and checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
    "apex_amp_level": null,

    // DATA LOADING
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "batch_group_size": 0, //Number of batches to shuffle after bucketing.
    "min_seq_len": 3, // DATASET-RELATED: minimum text length to use in training
    "max_seq_len": 500, // DATASET-RELATED: maximum text length
    "compute_f0": false, // compute f0 values in data-loader

    // PATHS
    "output_path": "tests/train_outputs/",

    // PHONEMES
    "phoneme_cache_path": "tests/outputs/phoneme_cache/", // phoneme computation is slow, therefore, it caches results in the given folder.
    "use_phonemes": true, // use phonemes instead of raw characters. It is suggested for better pronunciation.
    "phoneme_language": "en-us", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages

    // MULTI-SPEAKER and GST
    "use_external_speaker_embedding_file": false,
    "external_speaker_embedding_file": null,
    "use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.

    // DATASETS
    "datasets": // List of datasets. They are all merged and they get different speaker_ids.
        [
            {
                "name": "ljspeech",
                "path": "tests/data/ljspeech/",
                "meta_file_train": "metadata.csv",
                "meta_file_val": "metadata.csv"
            }
        ]
}
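Since the shipped configs carry `//` comments, they are read through the repo's `load_config`, which these comment-laden files rely on to strip them before parsing; the path below is a hypothetical placeholder.

```python
from TTS.utils.io import load_config

CONFIG_PATH = "path/to/config_glow_tts.json"  # hypothetical local path
c = load_config(CONFIG_PATH)
assert c["model"] == "glow_tts"
assert c["audio"]["num_mels"] == 80
```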
@ -67,13 +67,24 @@
"gradual_training": [[0, 7, 4]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceeed.
|
||||
"loss_masking": true, // enable / disable loss masking against the sequence padding.
|
||||
"ga_alpha": 10.0, // weight for guided attention loss. If > 0, guided attention is enabled.
|
||||
"apex_amp_level": null,
|
||||
"mixed_precision": false,
|
||||
|
||||
// VALIDATION
|
||||
"run_eval": true,
|
||||
"test_delay_epochs": 0, //Until attention is aligned, testing only wastes computation time.
|
||||
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
|
||||
|
||||
// LOSS SETTINGS
|
||||
"loss_masking": true, // enable / disable loss masking against the sequence padding.
|
||||
"decoder_loss_alpha": 0.5, // original decoder loss weight. If > 0, it is enabled
|
||||
"postnet_loss_alpha": 0.25, // original postnet loss weight. If > 0, it is enabled
|
||||
"postnet_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
|
||||
"decoder_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
|
||||
"decoder_ssim_alpha": 0.5, // decoder ssim loss weight. If > 0, it is enabled
|
||||
"postnet_ssim_alpha": 0.25, // postnet ssim loss weight. If > 0, it is enabled
|
||||
"ga_alpha": 5.0, // weight for guided attention loss. If > 0, guided attention is enabled.
|
||||
"stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.
|
||||
|
||||
// OPTIMIZER
|
||||
"noam_schedule": false, // use noam warmup and lr schedule.
|
||||
"grad_clip": 1.0, // upper limit for gradients for clipping.
|
||||
|
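For intuition, a sketch of how these weights plausibly combine, based only on the parameter names above and not on the repo's actual TacotronLoss implementation: each component is scaled by its alpha and summed, so an alpha of 0 disables that term.

```python
# assumed composition; the real loss code may differ in detail
alphas = {
    "decoder": 0.5, "postnet": 0.25,
    "decoder_diff_spec": 0.25, "postnet_diff_spec": 0.25,
    "decoder_ssim": 0.5, "postnet_ssim": 0.25,
    "ga": 5.0,
}

def total_loss(parts):
    """parts: dict mapping component name -> loss value."""
    return sum(alphas[k] * parts[k] for k in alphas)
```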
@ -0,0 +1,114 @@
{
    "run_name": "wavegrad-ljspeech",
    "run_description": "wavegrad ljspeech",

    "audio":{
        "fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024, // stft window length in samples.
        "hop_length": 256, // stft window hop length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null, // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050, // DATASET-RELATED: wav sample rate. If different than the original data, it is resampled.
        "preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence from audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
        "trim_db": 60, // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80, // size of the mel spec frame.
        "mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0, // scaler value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100, // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "mixed_precision": false,
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54322"
    },

    "target_loss": "avg_wavegrad_loss", // loss value to pick the best model to save after each epoch

    // MODEL PARAMETERS
    "generator_model": "wavegrad",
    "model_params":{
        "y_conv_channels":32,
        "x_conv_channels":768,
        "ublock_out_channels": [512, 512, 256, 128, 128],
        "dblock_out_channels": [128, 128, 256, 512],
        "upsample_factors": [4, 4, 4, 2, 2],
        "upsample_dilations": [
            [1, 2, 1, 2],
            [1, 2, 1, 2],
            [1, 2, 4, 8],
            [1, 2, 4, 8],
            [1, 2, 4, 8]],
        "use_weight_norm": true
    },

    // DATASET
    "data_path": "tests/data/ljspeech/wavs/", // root data path. It finds all wav files recursively from there.
    "feature_path": null, // if you use precomputed features
    "seq_len": 6144, // 24 * hop_length
    "pad_short": 0, // additional padding for short wavs
    "conv_pad": 0, // additional padding against convolutions applied to spectrograms
    "use_noise_augment": false, // add noise to the audio signal for augmentation
    "use_cache": true, // use an in-memory cache to keep the computed features. This might cause OOM.

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 1, // Batch size for training.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 2
    },

    // VALIDATION
    "run_eval": true, // enable/disable evaluation run

    // OPTIMIZER
    "epochs": 1, // total number of epochs to train.
    "clip_grad": 1.0, // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.

    // TENSORBOARD and LOGGING
    "print_step": 250, // Number of steps to log training on console.
    "print_eval": false, // If True, it prints loss values for each step in eval run.
    "save_step": 10000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "eval_split_size": 4,

    // PATHS
    "output_path": "tests/train_outputs/"
}
|
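`train_noise_schedule` above spans the betas of the WaveGrad diffusion process, matching the `np.linspace(1e-6, 1e-2, 1000)` call in the training test further below. A sketch of how such a schedule turns into per-step noise levels, using the standard diffusion formulation (an assumption here, not a quote of the repository's implementation):

```python
import numpy as np

# Beta schedule described by train_noise_schedule:
# min_val=1e-6, max_val=1e-2, num_steps=1000.
betas = np.linspace(1e-6, 1e-2, 1000)

# Assumed standard diffusion bookkeeping: the noise level at step t is
# the square root of the cumulative product of (1 - beta).
noise_levels = np.sqrt(np.cumprod(1.0 - betas))

print(noise_levels[0], noise_levels[-1])  # near 1.0 at t=0, smaller at t=T
```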
@ -0,0 +1,107 @@
{
    "run_name": "wavernn_test",
    "run_description": "wavernn_test training",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024,       // stft window length in samples.
        "hop_length": 256,        // stft window hop length in samples.
        "frame_length_ms": null,  // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null,   // stft window hop length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,     // DATASET-RELATED: wav sample rate. If different from the original data, it is resampled.
        "preemphasis": 0.0,       // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0,        // reference level db; theoretically 20 db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,  // enable trimming of silence in audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,            // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,           // size of the mel spec frame.
        "mel_fmin": 0.0,          // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for your dataset!!
        "mel_fmax": 8000.0,       // maximum freq level for mel-spec. Tune for your dataset!!
        "spec_gain": 20.0,        // scalar value applied after the log transform of the spectrogram.

        // Normalization parameters
        "signal_norm": true,      // normalize spec values. Mean-var normalization if 'stats_path' is defined, otherwise range normalization defined by the other params.
        "min_level_db": -100,     // lower bound for normalization
        "symmetric_norm": true,   // move normalization to range [-1, 1]
        "max_norm": 4.0,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,        // clip normalized values into the range.
        "stats_path": null        // DO NOT USE WITH MULTI_SPEAKER MODEL. Scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and the other normalization params are ignored.
    },

    // Generating / Synthesizing
    "batched": true,
    "target_samples": 11000,      // target number of samples to be generated in each batch entry
    "overlap_samples": 550,       // number of samples for crossfading between batches

    // DISTRIBUTED TRAINING
    // "distributed":{
    //     "backend": "nccl",
    //     "url": "tcp:\/\/localhost:54321"
    // },

    // MODEL PARAMETERS
    "use_aux_net": true,
    "use_upsample_net": true,
    "upsample_factors": [4, 8, 8],  // the product of these factors must equal hop_length
    "seq_len": 1280,                // must be divisible by hop_length
    "mode": "mold",                 // mold [string], gauss [string], bits [int]
    "mulaw": false,                 // apply mulaw if mode is bits
    "padding": 2,                   // pad the input so the resnet sees a wider input length

    // DATASET
    //"use_gta": true,              // use computed gta features from the tts model
    "data_path": "tests/data/ljspeech/wavs/",  // path containing training wav files
    "feature_path": null,           // path containing features computed from the wav files. If null, they are computed.

    // MODEL PARAMETERS
    "wavernn_model_params": {
        "rnn_dims": 512,
        "fc_dims": 512,
        "compute_dims": 128,
        "res_out_dims": 128,
        "num_res_blocks": 10,
        "use_aux_net": true,
        "use_upsample_net": true,
        "upsample_factors": [4, 8, 8]  // the product of these factors must equal hop_length
    },
    "mixed_precision": false,

    // TRAINING
    "batch_size": 4,                // batch size for training.
    "epochs": 1,                    // total number of epochs to train.

    // VALIDATION
    "run_eval": true,
    "test_every_epochs": 10,        // number of epochs between test runs (e.g., test every 10 epochs)

    // OPTIMIZER
    "grad_clip": 4,                 // apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR",  // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [200000, 400000, 600000]
    },
    "lr": 1e-4,                     // initial learning rate

    // TENSORBOARD and LOGGING
    "print_step": 25,               // number of steps between logging training status on the console.
    "print_eval": false,            // if true, it prints loss values for each step in the eval run.
    "save_step": 25000,             // number of training steps between plotting training stats on TB and saving model checkpoints.
    "checkpoint": true,             // if true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false,  // if true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "eval_split_size": 10,          // number of samples for testing

    // PATHS
    "output_path": "tests/train_outputs/"
}
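The comments on `upsample_factors` and `seq_len` above are hard constraints: the upsample network must stretch mel frames back to audio rate, so its factors must multiply out to `hop_length`, and training sequences are cut in whole hops. A quick sanity check of the values in this config (a sketch, not repository code):

```python
from math import prod

hop_length = 256
upsample_factors = [4, 8, 8]
seq_len = 1280

# the upsample factors must reproduce hop_length exactly
assert prod(upsample_factors) == hop_length
# training sequences must span a whole number of hops
assert seq_len % hop_length == 0  # 1280 / 256 = 5 hops
```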
@ -62,7 +62,7 @@ class GE2ELossTests(unittest.TestCase):
        assert output.item() >= 0.0
        # check speaker loss with orthogonal d-vectors
        dummy_input = T.empty(3, 64)
        dummy_input = T.nn.init.orthogonal(dummy_input)
        dummy_input = T.nn.init.orthogonal_(dummy_input)
        dummy_input = T.cat(
            [
                dummy_input[0].repeat(5, 1, 1).transpose(0, 1),

@ -91,7 +91,7 @@ class AngleProtoLossTests(unittest.TestCase):

        # check speaker loss with orthogonal d-vectors
        dummy_input = T.empty(3, 64)
        dummy_input = T.nn.init.orthogonal(dummy_input)
        dummy_input = T.nn.init.orthogonal_(dummy_input)
        dummy_input = T.cat(
            [
                dummy_input[0].repeat(5, 1, 1).transpose(0, 1),
@ -0,0 +1,13 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_glow_tts.py --config_path $BASEDIR/inputs/test_glow_tts.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_glow_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/
@ -2,7 +2,7 @@ import unittest
import torch as T

from TTS.tts.layers.tacotron import Prenet, CBHG, Decoder, Encoder
from TTS.tts.layers.losses import L1LossMasked
from TTS.tts.layers.losses import L1LossMasked, SSIMLoss
from TTS.tts.utils.generic_utils import sequence_mask

# pylint: disable=unused-variable
@ -149,3 +149,72 @@ class L1LossMaskedTests(unittest.TestCase):
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert output.item() == 0, "0 vs {}".format(output.item())


class SSIMLossTests(unittest.TestCase):
    def test_in_out(self):  # pylint: disable=no-self-use
        # test input == target
        layer = SSIMLoss()
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.ones(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert output.item() == 0.0

        # test input != target
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert abs(output.item() - 1.0) < 1e-4, "1.0 vs {}".format(output.item())

        # test whether padded values of the input make any difference
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert abs(output.item() - 1.0) < 1e-4, "1.0 vs {}".format(output.item())

        dummy_input = T.rand(4, 8, 128).float()
        dummy_target = dummy_input.detach()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert output.item() == 0, "0 vs {}".format(output.item())

        # seq_len_norm = True
        # test input == target
        layer = L1LossMasked(seq_len_norm=True)
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.ones(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert output.item() == 0.0

        # test input != target
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.ones(4) * 8).long()
        output = layer(dummy_input, dummy_target, dummy_length)
        assert output.item() == 1.0, "1.0 vs {}".format(output.item())

        # test whether padded values of the input make any difference
        dummy_input = T.ones(4, 8, 128).float()
        dummy_target = T.zeros(4, 8, 128).float()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert abs(output.item() - 1.0) < 1e-5, "1.0 vs {}".format(output.item())

        dummy_input = T.rand(4, 8, 128).float()
        dummy_target = dummy_input.detach()
        dummy_length = (T.arange(5, 9)).long()
        mask = (
            (sequence_mask(dummy_length).float() - 1.0) * 100.0).unsqueeze(2)
        output = layer(dummy_input + mask, dummy_target, dummy_length)
        assert output.item() == 0, "0 vs {}".format(output.item())
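The SSIM tests above pin down the loss contract: identical input and target score 0, maximally different tensors score about 1, and values in padded positions must not affect the result. A minimal usage sketch under those same assumptions, reusing the call signature the tests demonstrate:

```python
import torch as T
from TTS.tts.layers.losses import SSIMLoss

layer = SSIMLoss()
pred = T.rand(4, 8, 128)          # (batch, frames, mels)
target = pred.clone()
lengths = (T.ones(4) * 8).long()  # valid frames per batch item

# identical tensors give an SSIM distance of 0, as the tests assert
print(layer(pred, target, lengths).item())
```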
@ -6,12 +6,12 @@ if [[ ! -f tests/outputs/checkpoint_10.pth.tar ]]; then
    exit 1
fi

rm -f dist/*.whl
python setup.py --quiet bdist_wheel --checkpoint tests/outputs/checkpoint_10.pth.tar --model_config tests/outputs/dummy_model_config.json

python -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install --quiet --upgrade pip setuptools wheel

rm -f dist/*.whl
python setup.py --quiet bdist_wheel --checkpoint tests/outputs/checkpoint_10.pth.tar --model_config tests/outputs/dummy_model_config.json
pip install --quiet dist/TTS*.whl

# this is related to https://github.com/librosa/librosa/issues/1160
@ -294,6 +294,7 @@ class SCGSTMultiSpeakeTacotronTrainTest(unittest.TestCase):
        mel_spec = torch.rand(8, 30, c.audio['num_mels']).to(device)
        linear_spec = torch.rand(8, 30, c.audio['fft_size']).to(device)
        mel_lengths = torch.randint(20, 30, (8, )).long().to(device)
        mel_lengths[-1] = mel_spec.size(1)
        stop_targets = torch.zeros(8, 30, 1).float().to(device)
        speaker_embeddings = torch.rand(8, 55).to(device)
@ -0,0 +1,14 @@
#!/usr/bin/env bash

set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --config_path $BASEDIR/inputs/test_train_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/
@ -11,6 +11,7 @@ from TTS.utils.io import load_config
conf = load_config(os.path.join(get_tests_input_path(), 'test_config.json'))


def test_phoneme_to_sequence():

    text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
    text_cleaner = ["phoneme_cleaners"]
    lang = "en-us"

@ -20,7 +21,7 @@ def test_phoneme_to_sequence():
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters)
    gt = "ɹiːsənt ɹɪsɜːtʃ æt hɑːɹvɚd hɐz ʃoʊn mɛdᵻteɪɾɪŋ fɔːɹ æz lɪɾəl æz eɪt wiːks kæn æktʃuːəli ɪnkɹiːs, ðə ɡɹeɪ mæɾɚɹ ɪnðə pɑːɹts ʌvðə bɹeɪn ɹɪspɑːnsəbəl fɔːɹ ɪmoʊʃənəl ɹɛɡjuːleɪʃən ænd lɜːnɪŋ!"
    assert text_hat == text_hat_with_params == gt

    # multiple punctuations
    text = "Be a voice, not an! echo?"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
@ -87,6 +88,84 @@ def test_phoneme_to_sequence():
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt


def test_phoneme_to_sequence_with_blank_token():

    text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
    text_cleaner = ["phoneme_cleaners"]
    lang = "en-us"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "ɹiːsənt ɹɪsɜːtʃ æt hɑːɹvɚd hɐz ʃoʊn mɛdᵻteɪɾɪŋ fɔːɹ æz lɪɾəl æz eɪt wiːks kæn æktʃuːəli ɪnkɹiːs, ðə ɡɹeɪ mæɾɚɹ ɪnðə pɑːɹts ʌvðə bɹeɪn ɹɪspɑːnsəbəl fɔːɹ ɪmoʊʃənəl ɹɛɡjuːleɪʃən ænd lɜːnɪŋ!"
    assert text_hat == text_hat_with_params == gt

    # multiple punctuations
    text = "Be a voice, not an! echo?"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ?"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # not ending with punctuation
    text = "Be a voice, not an! echo"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # original
    text = "Be a voice, not an echo!"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn ɛkoʊ!"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # extra space after the sentence
    text = "Be a voice, not an! echo. "
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ."
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # extra space after the sentence, with EOS/BOS characters enabled
    text = "Be a voice, not an! echo. "
    sequence = phoneme_to_sequence(text, text_cleaner, lang, True)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "^biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ.~"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt

    # padding char
    text = "_Be a _voice, not an! echo_"
    sequence = phoneme_to_sequence(text, text_cleaner, lang)
    text_hat = sequence_to_phoneme(sequence)
    _ = phoneme_to_sequence(text, text_cleaner, lang, tp=conf.characters, add_blank=True)
    text_hat_with_params = sequence_to_phoneme(sequence, tp=conf.characters, add_blank=True)
    gt = "biː ɐ vɔɪs, nɑːt ɐn! ɛkoʊ"
    print(text_hat)
    print(len(sequence))
    assert text_hat == text_hat_with_params == gt


def test_text2phone():
    text = "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase, the grey matter in the parts of the brain responsible for emotional regulation and learning!"
    gt = "ɹ|iː|s|ə|n|t| |ɹ|ɪ|s|ɜː|tʃ| |æ|t| |h|ɑːɹ|v|ɚ|d| |h|ɐ|z| |ʃ|oʊ|n| |m|ɛ|d|ᵻ|t|eɪ|ɾ|ɪ|ŋ| |f|ɔː|ɹ| |æ|z| |l|ɪ|ɾ|əl| |æ|z| |eɪ|t| |w|iː|k|s| |k|æ|n| |æ|k|tʃ|uː|əl|i| |ɪ|n|k|ɹ|iː|s|,| |ð|ə| |ɡ|ɹ|eɪ| |m|æ|ɾ|ɚ|ɹ| |ɪ|n|ð|ə| |p|ɑːɹ|t|s| |ʌ|v|ð|ə| |b|ɹ|eɪ|n| |ɹ|ɪ|s|p|ɑː|n|s|ə|b|əl| |f|ɔː|ɹ| |ɪ|m|oʊ|ʃ|ə|n|əl| |ɹ|ɛ|ɡ|j|uː|l|eɪ|ʃ|ə|n| |æ|n|d| |l|ɜː|n|ɪ|ŋ|!"
@ -1,13 +1,13 @@
#!/usr/bin/env bash

set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --config_path $BASEDIR/inputs/test_train_config.json
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tacotron.py --config_path $BASEDIR/inputs/test_train_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tts.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_tacotron.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/
@ -1,15 +1,15 @@
#!/usr/bin/env bash

set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder.py --config_path $BASEDIR/inputs/test_vocoder_multiband_melgan_config.json
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_gan.py --config_path $BASEDIR/inputs/test_vocoder_multiband_melgan_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_gan.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER
@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir -p $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavegrad.py --config_path $BASEDIR/inputs/test_vocoder_wavegrad.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavegrad.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER
@ -0,0 +1,31 @@
import random

import numpy as np
import torch

from TTS.vocoder.models.wavernn import WaveRNN


def test_wavernn():
    model = WaveRNN(
        rnn_dims=512,
        fc_dims=512,
        mode=10,
        mulaw=False,
        pad=2,
        use_aux_net=True,
        use_upsample_net=True,
        upsample_factors=[4, 8, 8],
        feat_dims=80,
        compute_dims=128,
        res_out_dims=128,
        num_res_blocks=10,
        hop_length=256,
        sample_rate=22050,
    )
    dummy_x = torch.rand((2, 1280))
    dummy_m = torch.rand((2, 80, 9))
    y_size = random.randrange(20, 60)
    dummy_y = torch.rand((80, y_size))
    output = model(dummy_x, dummy_m)
    assert np.all(output.shape == (2, 1280, 4 * 256)), output.shape
    output = model.inference(dummy_y, True, 5500, 550)
    assert np.all(output.shape == (256 * (y_size - 1),))
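The shape assertions above encode the model's arithmetic: 1280 input samples cover 1280 / 256 = 5 mel frames, plus `pad=2` context frames on each side, which gives the (2, 80, 9) conditioning tensor; with `mode=10` the forward pass emits 2^10 = 1024 = 4 * 256 class logits per sample; and inference yields `hop_length * (n_frames - 1)` samples. A restatement of that arithmetic (a sketch, independent of the repository):

```python
mode = 10        # bit depth -> 2 ** mode output classes per sample
hop_length = 256
pad = 2

assert 1280 // hop_length + 2 * pad == 9  # mel frames needed for 1280 samples
assert 2 ** mode == 4 * 256               # 1024 logits, the forward output's last dim
print(hop_length * (30 - 1))              # 7424 samples generated for 30 mel frames
```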
@ -0,0 +1,92 @@
import os
import shutil

import numpy as np
from tests import get_tests_path, get_tests_input_path, get_tests_output_path
from torch.utils.data import DataLoader

from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.vocoder.datasets.wavernn_dataset import WaveRNNDataset
from TTS.vocoder.datasets.preprocess import load_wav_feat_data, preprocess_wav_files

file_path = os.path.dirname(os.path.realpath(__file__))
OUTPATH = os.path.join(get_tests_output_path(), "loader_tests/")
os.makedirs(OUTPATH, exist_ok=True)

C = load_config(os.path.join(get_tests_input_path(),
                             "test_vocoder_wavernn_config.json"))

test_data_path = os.path.join(get_tests_path(), "data/ljspeech/")
test_mel_feat_path = os.path.join(test_data_path, "mel")
test_quant_feat_path = os.path.join(test_data_path, "quant")
ok_ljspeech = os.path.exists(test_data_path)


def wavernn_dataset_case(batch_size, seq_len, hop_len, pad, mode, mulaw, num_workers):
    """ run the dataloader with the given parameters and check the conditions """
    ap = AudioProcessor(**C.audio)

    C.batch_size = batch_size
    C.mode = mode
    C.seq_len = seq_len
    C.data_path = test_data_path

    preprocess_wav_files(test_data_path, C, ap)
    _, train_items = load_wav_feat_data(
        test_data_path, test_mel_feat_path, 5)

    dataset = WaveRNNDataset(ap=ap,
                             items=train_items,
                             seq_len=seq_len,
                             hop_len=hop_len,
                             pad=pad,
                             mode=mode,
                             mulaw=mulaw
                             )
    # sampler = DistributedSampler(dataset) if num_gpus > 1 else None
    loader = DataLoader(dataset,
                        shuffle=True,
                        collate_fn=dataset.collate,
                        batch_size=batch_size,
                        num_workers=num_workers,
                        pin_memory=True,
                        )

    max_iter = 10
    count_iter = 0

    try:
        for data in loader:
            x_input, mels, _ = data
            expected_feat_shape = (ap.num_mels,
                                   (x_input.shape[-1] // hop_len) + (pad * 2))
            assert np.all(
                mels.shape[1:] == expected_feat_shape), f" [!] {mels.shape} vs {expected_feat_shape}"

            assert (mels.shape[2] - pad * 2) * hop_len == x_input.shape[1]
            count_iter += 1
            if count_iter == max_iter:
                break
    # except AssertionError:
    #     shutil.rmtree(test_mel_feat_path)
    #     shutil.rmtree(test_quant_feat_path)
    finally:
        shutil.rmtree(test_mel_feat_path)
        shutil.rmtree(test_quant_feat_path)


def test_parametrized_wavernn_dataset():
    ''' test dataloader with different parameters '''
    params = [
        [16, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, 10, True, 0],
        [16, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, "mold", False, 4],
        [1, C.audio['hop_length'] * 10, C.audio['hop_length'], 2, 9, False, 0],
        [1, C.audio['hop_length'], C.audio['hop_length'], 2, 10, True, 0],
        [1, C.audio['hop_length'], C.audio['hop_length'], 2, "mold", False, 0],
        [1, C.audio['hop_length'] * 5, C.audio['hop_length'], 4, 10, False, 2],
        [1, C.audio['hop_length'] * 5, C.audio['hop_length'], 2, "mold", False, 0],
    ]
    for param in params:
        print(param)
        wavernn_dataset_case(*param)
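The two assertions in the loader loop fix the dataset geometry: each batch item carries `seq_len` audio samples plus the mel frames that cover them and `pad` extra frames of context on each side. Restated (a sketch, not repository code):

```python
hop_len, pad = 256, 2
seq_len = hop_len * 10                     # audio samples per batch item
mel_frames = seq_len // hop_len + 2 * pad  # frames needed to cover them

# mirrors: (mels.shape[2] - pad * 2) * hop_len == x_input.shape[1]
assert (mel_frames - 2 * pad) * hop_len == seq_len
```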
@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -xe
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
# create run dir
mkdir -p $BASEDIR/train_outputs
# run training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavernn.py --config_path $BASEDIR/inputs/test_vocoder_wavernn_config.json
# find the training folder
LATEST_FOLDER=$(ls $BASEDIR/train_outputs/| sort | tail -1)
echo $LATEST_FOLDER
# continue the previous training
CUDA_VISIBLE_DEVICES="" python TTS/bin/train_vocoder_wavernn.py --continue_path $BASEDIR/train_outputs/$LATEST_FOLDER
# remove all the outputs
rm -rf $BASEDIR/train_outputs/$LATEST_FOLDER
@ -0,0 +1,92 @@
import torch

from TTS.vocoder.layers.wavegrad import PositionalEncoding, FiLM, UBlock, DBlock
from TTS.vocoder.models.wavegrad import Wavegrad


def test_positional_encoding():
    layer = PositionalEncoding(50)
    inp = torch.rand(32, 50, 100)
    nl = torch.rand(32)
    o = layer(inp, nl)

    assert o.shape[0] == 32
    assert o.shape[1] == 50
    assert o.shape[2] == 100
    assert isinstance(o, torch.FloatTensor)


def test_film():
    layer = FiLM(50, 76)
    inp = torch.rand(32, 50, 100)
    nl = torch.rand(32)
    shift, scale = layer(inp, nl)

    assert shift.shape[0] == 32
    assert shift.shape[1] == 76
    assert shift.shape[2] == 100
    assert isinstance(shift, torch.FloatTensor)

    assert scale.shape[0] == 32
    assert scale.shape[1] == 76
    assert scale.shape[2] == 100
    assert isinstance(scale, torch.FloatTensor)

    layer.apply_weight_norm()
    layer.remove_weight_norm()


def test_ublock():
    inp1 = torch.rand(32, 50, 100)
    inp2 = torch.rand(32, 50, 50)
    nl = torch.rand(32)

    layer_film = FiLM(50, 100)
    layer = UBlock(50, 100, 2, [1, 2, 4, 8])

    scale, shift = layer_film(inp1, nl)
    o = layer(inp2, shift, scale)

    assert o.shape[0] == 32
    assert o.shape[1] == 100
    assert o.shape[2] == 100
    assert isinstance(o, torch.FloatTensor)

    layer.apply_weight_norm()
    layer.remove_weight_norm()


def test_dblock():
    inp = torch.rand(32, 50, 130)
    layer = DBlock(50, 100, 2)
    o = layer(inp)

    assert o.shape[0] == 32
    assert o.shape[1] == 100
    assert o.shape[2] == 65
    assert isinstance(o, torch.FloatTensor)

    layer.apply_weight_norm()
    layer.remove_weight_norm()


def test_wavegrad_forward():
    x = torch.rand(32, 1, 20 * 300)
    c = torch.rand(32, 80, 20)
    noise_scale = torch.rand(32)

    model = Wavegrad(in_channels=80,
                     out_channels=1,
                     upsample_factors=[5, 5, 3, 2, 2],
                     upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
                                         [1, 2, 4, 8], [1, 2, 4, 8],
                                         [1, 2, 4, 8]])
    o = model.forward(x, c, noise_scale)

    assert o.shape[0] == 32
    assert o.shape[1] == 1
    assert o.shape[2] == 20 * 300
    assert isinstance(o, torch.FloatTensor)

    model.apply_weight_norm()
    model.remove_weight_norm()
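The shapes in these tests follow from the block definitions: a `UBlock` with factor 2 doubles the time axis (50 to 100), a `DBlock` with factor 2 halves it (130 to 65), and the full model's factors 5 * 5 * 3 * 2 * 2 = 300 act as the effective hop length, turning 20 mel frames into 20 * 300 samples. A quick check of that factor arithmetic (a sketch):

```python
from math import prod

upsample_factors = [5, 5, 3, 2, 2]
mel_frames = 20

assert prod(upsample_factors) == 300                 # effective hop length
assert mel_frames * prod(upsample_factors) == 6000   # output samples in the forward test
```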
@ -0,0 +1,62 @@
import unittest

import numpy as np
import torch
from torch import optim
from TTS.vocoder.models.wavegrad import Wavegrad

#pylint: disable=unused-variable

torch.manual_seed(1)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class WavegradTrainTest(unittest.TestCase):
    def test_train_step(self):  # pylint: disable=no-self-use
        """Test if all layers are updated in a basic training cycle"""
        input_dummy = torch.rand(8, 1, 20 * 300).to(device)
        mel_spec = torch.rand(8, 80, 20).to(device)

        criterion = torch.nn.L1Loss().to(device)
        model = Wavegrad(in_channels=80,
                         out_channels=1,
                         upsample_factors=[5, 5, 3, 2, 2],
                         upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
                                             [1, 2, 4, 8], [1, 2, 4, 8],
                                             [1, 2, 4, 8]])

        model_ref = Wavegrad(in_channels=80,
                             out_channels=1,
                             upsample_factors=[5, 5, 3, 2, 2],
                             upsample_dilations=[[1, 2, 1, 2], [1, 2, 1, 2],
                                                 [1, 2, 4, 8], [1, 2, 4, 8],
                                                 [1, 2, 4, 8]])
        model.train()
        model.to(device)
        betas = np.linspace(1e-6, 1e-2, 1000)
        model.compute_noise_level(betas)
        model_ref.load_state_dict(model.state_dict())
        model_ref.to(device)
        count = 0
        for param, param_ref in zip(model.parameters(),
                                    model_ref.parameters()):
            assert (param - param_ref).sum() == 0, param
            count += 1
        optimizer = optim.Adam(model.parameters(), lr=0.001)
        for i in range(5):
            y_hat = model.forward(input_dummy, mel_spec, torch.rand(8).to(device))
            optimizer.zero_grad()
            loss = criterion(y_hat, input_dummy)
            loss.backward()
            optimizer.step()
        # check parameter changes
        count = 0
        for param, param_ref in zip(model.parameters(),
                                    model_ref.parameters()):
            # ignore the pre-highway layer since it works conditionally
            # if count not in [145, 59]:
            assert (param != param_ref).any(
            ), "param {} with shape {} not updated!! \n{}\n{}".format(
                count, param.shape, param, param_ref)
            count += 1