Merge branch 'dev'

This commit is contained in:
erogol 2020-09-07 11:43:26 +02:00
commit 3131308baa
190 changed files with 42377 additions and 2832 deletions

View File

@ -1,2 +1,5 @@
linters:
- pylint:
# pylintrc: pylintrc
filefilter: ['- test_*.py', '+ *.py', '- *.npy']
# exclude:

18
.github/PR_TEMPLATE.md vendored Normal file
View File

@ -0,0 +1,18 @@
---
name: 'Contribution Guideline '
about: Refer to Contribution Guideline
title: ''
labels: ''
assignees: ''
---
### Contribution Guideline
Please send your PRs to the `dev` branch unless they are directly related to a specific branch.
Before making a Pull Request, check your changes for basic mistakes and style problems by using a linter.
We have cardboardlinter set up in this repository, so, for example, if you have made some changes and would like to run the linter on just the changed code, you can use the following command:
```bash
pip install pylint cardboardlint
cardboardlinter --refspec master
```

1
.gitignore vendored
View File

@ -128,3 +128,4 @@ tests/outputs/*
TODO.txt
.vscode/*
data/*
notebooks/data/*

View File

@ -157,7 +157,8 @@ disable=missing-docstring,
xreadlines-attribute,
deprecated-sys-function,
exception-escape,
comprehension-escape
comprehension-escape,
duplicate-code
# Enable the message, report, category or checker with the given id(s). You can
# either give multiple identifier separated by comma (,) or put this option

View File

@ -3,6 +3,12 @@ language: python
git:
quiet: true
before_install:
- sudo apt-get update
- sudo apt-get -y install espeak
- python -m pip install --upgrade pip
- pip install six==1.12.0
matrix:
include:
- name: "Lint check"
@ -11,7 +17,15 @@ matrix:
env: TEST_SUITE="lint"
- name: "Unit tests"
python: "3.6"
install: pip install --quiet -r requirements_tests.txt
install:
- python setup.py egg_info
- pip install -e .
env: TEST_SUITE="unittest"
- name: "Unit tests"
python: "3.6"
install:
- python setup.py egg_info
- pip install -e .
env: TEST_SUITE="testscripts"
script: ./.travis/script

View File

@ -10,10 +10,12 @@ if [[ ( "$TRAVIS_PULL_REQUEST" != "false" ) && ( "$TEST_SUITE" == "lint" ) ]]; t
fi
if [[ "$TEST_SUITE" == "unittest" ]]; then
# Run tests on all pushes
pushd tts_namespace
python -m unittest
popd
# Test server package
nosetests tests --nocapture
./tests/test_server_package.sh
fi
if [[ "$TEST_SUITE" == "testscripts" ]]; then
# test model training scripts
./tests/test_tts_train.sh
./tests/test_vocoder_train.sh
fi

140
README.md
View File

@ -1,38 +1,73 @@
<p align="center"><img src="https://user-images.githubusercontent.com/1402048/52643646-c2102980-2edd-11e9-8c37-b72f3c89a640.png" data-canonical-src="![TTS banner](https://user-images.githubusercontent.com/1402048/52643646-c2102980-2edd-11e9-8c37-b72f3c89a640.png =250x250)
" width="320" height="95" /></p>
<img src="https://travis-ci.org/mozilla/TTS.svg?branch=dev"/>
<br/>
This project is a part of [Mozilla Common Voice](https://voice.mozilla.org/en). TTS aims at a deep learning based Text2Speech engine, low in cost and high in quality. To begin with, you can hear a sample generated voice from [here](https://soundcloud.com/user-565970875/commonvoice-loc-sens-attn).
<p align='center'>
<img src="https://travis-ci.org/mozilla/TTS.svg?branch=dev"/>
<a href='https://discourse.mozilla.org/c/tts'><img src="https://img.shields.io/badge/discourse-online-green.svg"/></a>
<a href='https://opensource.org/licenses/MPL-2.0'> <img src="https://img.shields.io/badge/License-MPL%202.0-brightgreen.svg"/></a>
</p>
TTS includes two different model implementations which are based on [Tacotron](https://arxiv.org/abs/1703.10135) and [Tacotron2](https://arxiv.org/abs/1712.05884). Tacotron is smaller, efficient and easier to train but Tacotron2 provides better results, especially when it is combined with a Neural vocoder. Therefore, choose depending on your project requirements.
<br/>
If you are new, you can also find [here](http://www.erogol.com/text-speech-deep-learning-architectures/) a brief post about TTS architectures and their comparisons.
This project is a part of [Mozilla Common Voice](https://voice.mozilla.org/en).
Mozilla TTS aims at a deep learning based Text2Speech engine, low in cost and high in quality.
You can check some of the synthesized voice samples [here](https://erogol.github.io/ddc-samples/).
If you are new, you can also find [here](http://www.erogol.com/text-speech-deep-learning-architectures/) a brief post about some of the TTS architectures and [here](https://github.com/erogol/TTS-papers) a list of up-to-date research papers.
[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/0)](https://sourcerer.io/fame/erogol/erogol/TTS/links/0)[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/1)](https://sourcerer.io/fame/erogol/erogol/TTS/links/1)[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/2)](https://sourcerer.io/fame/erogol/erogol/TTS/links/2)[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/3)](https://sourcerer.io/fame/erogol/erogol/TTS/links/3)[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/4)](https://sourcerer.io/fame/erogol/erogol/TTS/links/4)[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/5)](https://sourcerer.io/fame/erogol/erogol/TTS/links/5)[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/6)](https://sourcerer.io/fame/erogol/erogol/TTS/links/6)[![](https://sourcerer.io/fame/erogol/erogol/TTS/images/7)](https://sourcerer.io/fame/erogol/erogol/TTS/links/7)
## TTS Performance
<p align="center"><img src="https://camo.githubusercontent.com/9fa79f977015e55eb9ec7aa32045555f60d093d3/68747470733a2f2f646973636f757273652d706161732d70726f64756374696f6e2d636f6e74656e742e73332e6475616c737461636b2e75732d656173742d312e616d617a6f6e6177732e636f6d2f6f7074696d697a65642f33582f362f342f363432386639383065396563373531633234386535393134363038393566373838316165633063365f325f363930783339342e706e67"/></p>
[Details...](https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results)
## Provided Models and Methods
Text-to-Spectrogram:
- Tacotron: [paper](https://arxiv.org/abs/1703.10135)
- Tacotron2: [paper](https://arxiv.org/abs/1712.05884)
Attention Methods:
- Guided Attention: [paper](https://arxiv.org/abs/1710.08969)
- Forward Backward Decoding: [paper](https://arxiv.org/abs/1907.09006)
- Graves Attention: [paper](https://arxiv.org/abs/1907.09006)
- Double Decoder Consistency: [blog](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/)
Speaker Encoder:
- GE2E: [paper](https://arxiv.org/abs/1710.10467)
Vocoders:
- MelGAN: [paper](https://arxiv.org/abs/1910.06711)
- MultiBandMelGAN: [paper](https://arxiv.org/abs/2005.05106)
- GAN-TTS discriminators: [paper](https://arxiv.org/abs/1909.11646)
You can also help us implement more models. Some TTS related work can be found [here](https://github.com/erogol/TTS-papers).
## Features
- High performance Text2Speech models on Torch and Tensorflow 2.0.
- High performance Speaker Encoder to compute speaker embeddings efficiently.
- Integration with various Neural Vocoders (PWGAN, MelGAN, WaveRNN)
- Released trained models.
- Efficient training codes for PyTorch. (soon for Tensorflow 2.0)
- Codes to convert Torch models to Tensorflow 2.0.
- Detailed training analysis on console and Tensorboard.
- High performance Deep Learning models for Text2Speech tasks.
- Text2Spec models (Tacotron, Tacotron2).
- Speaker Encoder to compute speaker embeddings efficiently.
- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN)
- Fast and efficient model training.
- Detailed training logs on console and Tensorboard.
- Support for multi-speaker TTS.
- Efficient Multi-GPUs training.
- Ability to convert PyTorch models to Tensorflow 2.0 and TFLite for inference.
- Released models in PyTorch, Tensorflow and TFLite.
- Tools to curate Text2Speech datasets under ```dataset_analysis```.
- Demo server for model testing.
- Notebooks for extensive model benchmarking.
- Modular (but not too much) code base enabling easy testing for new ideas.
## Requirements and Installation
## Main Requirements and Installation
Highly recommended to use [miniconda](https://conda.io/miniconda.html) for easier installation.
* python>=3.6
* pytorch>=0.4.1
* pytorch>=1.4.1
* tensorflow>=2.2
* librosa
* tensorboard
* tensorboardX
@ -47,18 +82,34 @@ Or you can use ```requirements.txt``` to install the requirements only.
```pip install -r requirements.txt```
### Directory Structure
```
|- notebooks/ (Jupyter Notebooks for model evaluation, parameter selection and data analysis.)
|- utils/ (common utilities.)
|- TTS
|- bin/ (folder for all the executables.)
|- train*.py (train your target model.)
|- distribute.py (train your TTS model using Multiple GPUs.)
|- compute_statistics.py (compute dataset statistics for normalization.)
|- convert*.py (convert target torch model to TF.)
|- tts/ (text to speech models)
|- layers/ (model layer definitions)
|- models/ (model definitions)
|- tf/ (Tensorflow 2 utilities and model implementations)
|- utils/ (model specific utilities.)
|- speaker_encoder/ (Speaker Encoder models.)
|- (same)
|- vocoder/ (Vocoder models.)
|- (same)
```
### Docker
A barebone `Dockerfile` exists at the root of the project, which should let you quickly set up the environment. By default, it will start the server and let you query it. Make sure to use `nvidia-docker` to use your GPUs. Make sure you follow the instructions in the [`server README`](server/README.md) before you build your image so that the server can find the model within the image.
A docker image is created by [@synesthesiam](https://github.com/synesthesiam) and shared in a separate [repository](https://github.com/synesthesiam/docker-mozillatts) with the latest LJSpeech models.
```
docker build -t mozilla-tts .
nvidia-docker run -it --rm -p 5002:5002 mozilla-tts
```
## Checkpoints and Audio Samples
## Release Models
Please visit [our wiki.](https://github.com/mozilla/TTS/wiki/Released-Models)
## Example Model Outputs
## Sample Model Output
Below you can see the Tacotron model state after 16K iterations with batch size 32, trained on the LJSpeech dataset.
> "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase the grey matter in the parts of the brain responsible for emotional regulation and learning."
@ -67,26 +118,14 @@ Audio examples: [soundcloud](https://soundcloud.com/user-565970875/pocket-articl
<img src="images/example_model_output.png?raw=true" alt="example_output" width="400"/>
## Runtime
The most time-consuming part is the vocoder algorithm (Griffin-Lim), which runs on the CPU. Lowering its number of iterations gives faster execution with a small loss in quality. Some experimental values are listed below.
Sentence: "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."
Audio length is approximately 6 secs.
| Time (secs) | System | # GL iters | Model |
| ---- |:-------|:-----------| ---- |
|2.00|GTX1080Ti|30|Tacotron|
|3.01|GTX1080Ti|60|Tacotron|
|3.57|CPU|60|Tacotron|
|5.27|GTX1080Ti|60|Tacotron2|
|6.50|CPU|60|Tacotron2|
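To get a feel for the iteration/speed trade-off in the table above, here is a minimal, self-contained sketch that times Griffin-Lim at two iteration counts. It uses plain ```librosa``` rather than the project's own ```AudioProcessor```, and the synthetic input is only a stand-in; the corresponding knob in this repository is, as far as we can tell, the ```griffin_lim_iters``` field in the ```audio``` section of ```config.json```.
```python
# Hedged sketch: not the project's code path, just an illustration of how the
# number of Griffin-Lim iterations trades quality for speed.
import time

import librosa
import numpy as np

sr = 22050
y = np.random.randn(sr * 3).astype(np.float32)  # 3 seconds of noise as a stand-in waveform
S = np.abs(librosa.stft(y))                     # magnitude spectrogram to invert

for n_iter in (30, 60):
    start = time.time()
    _ = librosa.griffinlim(S, n_iter=n_iter)    # more iterations -> better quality, slower
    print(f"Griffin-Lim, {n_iter} iterations: {time.time() - start:.2f} s")
```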
## [Mozilla TTS Tutorials and Notebooks](https://github.com/mozilla/TTS/wiki/TTS-Notebooks-and-Tutorials)
## Datasets and Data-Loading
TTS provides a generic dataloader that is easy to use for new datasets. You need to write a preprocessor function to integrate your own dataset. Check ```datasets/preprocess.py``` to see some examples. After writing the function, you need to set the ```dataset``` field in ```config.json```. Do not forget the other data-related fields too.
TTS provides a generic dataloader that is easy to use with your custom dataset.
You just need to write a simple function to format the dataset. Check ```datasets/preprocess.py``` to see some examples.
After that, you need to set the ```dataset``` fields in ```config.json```.
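For illustration, below is a minimal sketch of such a formatter for a hypothetical ```my_dataset``` laid out as a ```metadata.csv``` (one ```wav_id|transcript``` line per clip) plus a ```wavs/``` folder. The ```[text, wav_path, speaker_name]``` item layout mirrors the LJSpeech example, but check ```datasets/preprocess.py``` for the exact format the loader expects.
```python
# Hedged sketch of a formatter for a hypothetical "my_dataset"; the names, CSV layout
# and returned item structure are assumptions modeled on the LJSpeech example.
import os


def my_dataset(root_path, meta_file):
    """Return a list of [text, wav_path, speaker_name] items for the dataloader."""
    items = []
    speaker_name = "my_speaker"
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            wav_id, text = line.strip().split("|", 1)
            wav_path = os.path.join(root_path, "wavs", wav_id + ".wav")
            items.append([text, wav_path, speaker_name])
    return items
```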
Some of the open-sourced datasets to which we have successfully applied TTS are linked below.
Some of the public datasets to which we have successfully applied TTS:
- [LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
- [Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
@ -96,9 +135,9 @@ Some of the open-sourced datasets that we successfully applied TTS, are linked b
- [Spanish](https://drive.google.com/file/d/1Sm_zyBo67XHkiFhcRSQ4YaHPYM0slO_e/view?usp=sharing) - thx! @carlfm01
## Training and Fine-tuning LJ-Speech
Here you can find a [CoLab](https://gist.github.com/erogol/97516ad65b44dbddb8cd694953187c5b) notebook for a hands-on example, training LJSpeech. Or you can manually follow the guideline below.
To start with, split ```metadata.csv``` into train and validation subsets, ```metadata_train.csv``` and ```metadata_val.csv``` respectively. Note that for text-to-speech, validation performance might be misleading, since the loss value does not directly measure the voice quality to the human ear, and it also does not measure the attention module performance. Therefore, running the model with new sentences and listening to the results is the best way to go.
```
shuf metadata.csv > metadata_shuf.csv
@ -108,15 +147,19 @@ tail -n 1100 metadata_shuf.csv > metadata_val.csv
To train a new model, you need to define your own ```config.json``` file (check the example) and call it with the command below. You also set the model architecture in ```config.json```.
```train.py --config_path config.json```
```python TTS/bin/train.py --config_path TTS/tts/configs/config.json```
To fine-tune a model, use ```--restore_path```.
```train.py --config_path config.json --restore_path /path/to/your/model.pth.tar```
```python TTS/bin/train.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar```
To continue an old training run, use ```--continue_path```.
```python TTS/bin/train.py --continue_path /path/to/your/run_folder/```
For multi-GPU training use ```distribute.py```. It enables process based multi-GPU training where each process uses a single GPU.
```CUDA_VISIBLE_DEVICES="0,1,4" distribute.py --config_path config.json```
```CUDA_VISIBLE_DEVICES="0,1,4" TTS/bin/distribute.py --config_path TTS/tts/configs/config.json```
Each run creates a new output folder and ```config.json``` is copied under this folder.
@ -124,8 +167,6 @@ In case of any error or intercepted execution, if there is no checkpoint yet und
You can also enjoy Tensorboard, if you point its ```--logdir``` argument to the experiment folder.
## [Testing and Examples](https://github.com/mozilla/TTS/wiki/Examples-using-TTS)
## Contribution guidelines
This repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details, please read the [Mozilla Community Participation Guidelines.](https://www.mozilla.org/about/governance/policies/participation/)
@ -137,10 +178,10 @@ cardboardlinter --refspec master
```
## Collaborative Experimentation Guide
If you would like to use TTS to try a new idea and share your experiments with the community, we urge you to follow the guidelines below for better collaboration.
(If you have an idea for better collaboration, let us know)
- Create a new branch.
- Open an issue pointing to your branch.
- Explain your experiment.
- Share your results as you proceed. (Tensorboard log files, audio results, visuals etc.)
- Use LJSpeech dataset (for English) if you like to compare results with the released models. (It is the most open scalable dataset for quick experimentation)
@ -155,7 +196,7 @@ If you like to use TTS to try a new idea and like to share your experiments with
- [x] Enable process based distributed training. Similar to (https://github.com/fastai/imagenet-fast/).
- [x] Adapting Neural Vocoder. TTS works with WaveRNN and ParallelWaveGAN (https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN)
- [ ] Multi-speaker embedding.
- [ ] Model optimization (model export, model pruning etc.)
- [x] Model optimization (model export, model pruning etc.)
<!--## References
- [Efficient Neural Audio Synthesis](https://arxiv.org/pdf/1802.08435.pdf)
@ -171,3 +212,4 @@ If you like to use TTS to try a new idea and like to share your experiments with
### References
- https://github.com/keithito/tacotron (Dataset pre-processing)
- https://github.com/r9y9/tacotron_pytorch (Initial Tacotron architecture)
- https://github.com/kan-bayashi/ParallelWaveGAN (vocoder library)

View File

@ -1,24 +0,0 @@
# coding: utf-8
# import torch
# from torch import nn
# class StopProjection(nn.Module):
# r""" Simple projection layer to predict the "stop token"
# Args:
# in_features (int): size of the input vector
# out_features (int or list): size of each output vector. aka number
# of predicted frames.
# """
# def __init__(self, in_features, out_features):
# super(StopProjection, self).__init__()
# self.linear = nn.Linear(in_features, out_features)
# self.dropout = nn.Dropout(0.5)
# self.sigmoid = nn.Sigmoid()
# def forward(self, inputs):
# out = self.dropout(inputs)
# out = self.linear(out)
# out = self.sigmoid(out)
# return out

View File

@ -1,179 +0,0 @@
# coding: utf-8
import torch
import copy
from torch import nn
from TTS.layers.tacotron import Encoder, Decoder, PostCBHG
from TTS.utils.generic_utils import sequence_mask
from TTS.layers.gst_layers import GST
class Tacotron(nn.Module):
def __init__(self,
num_chars,
num_speakers,
r=5,
postnet_output_dim=1025,
decoder_output_dim=80,
memory_size=5,
attn_type='original',
attn_win=False,
gst=False,
attn_norm="sigmoid",
prenet_type="original",
prenet_dropout=True,
forward_attn=False,
trans_agent=False,
forward_attn_mask=False,
location_attn=True,
attn_K=5,
separate_stopnet=True,
bidirectional_decoder=False):
super(Tacotron, self).__init__()
self.r = r
self.decoder_output_dim = decoder_output_dim
self.postnet_output_dim = postnet_output_dim
self.gst = gst
self.num_speakers = num_speakers
self.bidirectional_decoder = bidirectional_decoder
decoder_dim = 512 if num_speakers > 1 else 256
encoder_dim = 512 if num_speakers > 1 else 256
proj_speaker_dim = 80 if num_speakers > 1 else 0
# embedding layer
self.embedding = nn.Embedding(num_chars, 256, padding_idx=0)
self.embedding.weight.data.normal_(0, 0.3)
# boilerplate model
self.encoder = Encoder(encoder_dim)
self.decoder = Decoder(decoder_dim, decoder_output_dim, r, memory_size, attn_type, attn_win,
attn_norm, prenet_type, prenet_dropout,
forward_attn, trans_agent, forward_attn_mask,
location_attn, attn_K, separate_stopnet,
proj_speaker_dim)
if self.bidirectional_decoder:
self.decoder_backward = copy.deepcopy(self.decoder)
self.postnet = PostCBHG(decoder_output_dim)
self.last_linear = nn.Linear(self.postnet.cbhg.gru_features * 2,
postnet_output_dim)
# speaker embedding layers
if num_speakers > 1:
self.speaker_embedding = nn.Embedding(num_speakers, 256)
self.speaker_embedding.weight.data.normal_(0, 0.3)
self.speaker_project_mel = nn.Sequential(
nn.Linear(256, proj_speaker_dim), nn.Tanh())
self.speaker_embeddings = None
self.speaker_embeddings_projected = None
# global style token layers
if self.gst:
gst_embedding_dim = 256
self.gst_layer = GST(num_mel=80,
num_heads=4,
num_style_tokens=10,
embedding_dim=gst_embedding_dim)
def _init_states(self):
self.speaker_embeddings = None
self.speaker_embeddings_projected = None
def compute_speaker_embedding(self, speaker_ids):
if hasattr(self, "speaker_embedding") and speaker_ids is None:
raise RuntimeError(
" [!] Model has speaker embedding layer but speaker_id is not provided"
)
if hasattr(self, "speaker_embedding") and speaker_ids is not None:
self.speaker_embeddings = self._compute_speaker_embedding(
speaker_ids)
self.speaker_embeddings_projected = self.speaker_project_mel(
self.speaker_embeddings).squeeze(1)
def compute_gst(self, inputs, mel_specs):
gst_outputs = self.gst_layer(mel_specs)
inputs = self._add_speaker_embedding(inputs, gst_outputs)
return inputs
def forward(self, characters, text_lengths, mel_specs, speaker_ids=None):
"""
Shapes:
- characters: B x T_in
- text_lengths: B
- mel_specs: B x T_out x D
- speaker_ids: B x 1
"""
self._init_states()
mask = sequence_mask(text_lengths).to(characters.device)
# B x T_in x embed_dim
inputs = self.embedding(characters)
# B x speaker_embed_dim
self.compute_speaker_embedding(speaker_ids)
if self.num_speakers > 1:
# B x T_in x embed_dim + speaker_embed_dim
inputs = self._concat_speaker_embedding(inputs,
self.speaker_embeddings)
# B x T_in x encoder_dim
encoder_outputs = self.encoder(inputs)
if self.gst:
# B x gst_dim
encoder_outputs = self.compute_gst(encoder_outputs, mel_specs)
if self.num_speakers > 1:
encoder_outputs = self._concat_speaker_embedding(
encoder_outputs, self.speaker_embeddings)
# decoder_outputs: B x decoder_dim x T_out
# alignments: B x T_in x encoder_dim
# stop_tokens: B x T_in
decoder_outputs, alignments, stop_tokens = self.decoder(
encoder_outputs, mel_specs, mask,
self.speaker_embeddings_projected)
# B x T_out x decoder_dim
postnet_outputs = self.postnet(decoder_outputs)
# B x T_out x posnet_dim
postnet_outputs = self.last_linear(postnet_outputs)
# B x T_out x decoder_dim
decoder_outputs = decoder_outputs.transpose(1, 2).contiguous()
if self.bidirectional_decoder:
decoder_outputs_backward, alignments_backward = self._backward_inference(mel_specs, encoder_outputs, mask)
return decoder_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward
return decoder_outputs, postnet_outputs, alignments, stop_tokens
@torch.no_grad()
def inference(self, characters, speaker_ids=None, style_mel=None):
inputs = self.embedding(characters)
self._init_states()
self.compute_speaker_embedding(speaker_ids)
if self.num_speakers > 1:
inputs = self._concat_speaker_embedding(inputs,
self.speaker_embeddings)
encoder_outputs = self.encoder(inputs)
if self.gst and style_mel is not None:
encoder_outputs = self.compute_gst(encoder_outputs, style_mel)
if self.num_speakers > 1:
encoder_outputs = self._concat_speaker_embedding(
encoder_outputs, self.speaker_embeddings)
decoder_outputs, alignments, stop_tokens = self.decoder.inference(
encoder_outputs, self.speaker_embeddings_projected)
postnet_outputs = self.postnet(decoder_outputs)
postnet_outputs = self.last_linear(postnet_outputs)
decoder_outputs = decoder_outputs.transpose(1, 2)
return decoder_outputs, postnet_outputs, alignments, stop_tokens
def _backward_inference(self, mel_specs, encoder_outputs, mask):
decoder_outputs_b, alignments_b, _ = self.decoder_backward(
encoder_outputs, torch.flip(mel_specs, dims=(1,)), mask,
self.speaker_embeddings_projected)
decoder_outputs_b = decoder_outputs_b.transpose(1, 2).contiguous()
return decoder_outputs_b, alignments_b
def _compute_speaker_embedding(self, speaker_ids):
speaker_embeddings = self.speaker_embedding(speaker_ids)
return speaker_embeddings.unsqueeze_(1)
@staticmethod
def _add_speaker_embedding(outputs, speaker_embeddings):
speaker_embeddings_ = speaker_embeddings.expand(
outputs.size(0), outputs.size(1), -1)
outputs = outputs + speaker_embeddings_
return outputs
@staticmethod
def _concat_speaker_embedding(outputs, speaker_embeddings):
speaker_embeddings_ = speaker_embeddings.expand(
outputs.size(0), outputs.size(1), -1)
outputs = torch.cat([outputs, speaker_embeddings_], dim=-1)
return outputs

View File

@ -1,133 +0,0 @@
import copy
import torch
from math import sqrt
from torch import nn
from TTS.layers.tacotron2 import Encoder, Decoder, Postnet
from TTS.utils.generic_utils import sequence_mask
# TODO: match function arguments with tacotron
class Tacotron2(nn.Module):
def __init__(self,
num_chars,
num_speakers,
r,
postnet_output_dim=80,
decoder_output_dim=80,
attn_type='original',
attn_win=False,
attn_norm="softmax",
prenet_type="original",
prenet_dropout=True,
forward_attn=False,
trans_agent=False,
forward_attn_mask=False,
location_attn=True,
attn_K=5,
separate_stopnet=True,
bidirectional_decoder=False):
super(Tacotron2, self).__init__()
self.postnet_output_dim = postnet_output_dim
self.decoder_output_dim = decoder_output_dim
self.r = r
self.bidirectional_decoder = bidirectional_decoder
decoder_dim = 512 if num_speakers > 1 else 512
encoder_dim = 512 if num_speakers > 1 else 512
proj_speaker_dim = 80 if num_speakers > 1 else 0
# embedding layer
self.embedding = nn.Embedding(num_chars, 512, padding_idx=0)
std = sqrt(2.0 / (num_chars + 512))
val = sqrt(3.0) * std # uniform bounds for std
self.embedding.weight.data.uniform_(-val, val)
if num_speakers > 1:
self.speaker_embedding = nn.Embedding(num_speakers, 512)
self.speaker_embedding.weight.data.normal_(0, 0.3)
self.speaker_embeddings = None
self.speaker_embeddings_projected = None
self.encoder = Encoder(encoder_dim)
self.decoder = Decoder(decoder_dim, self.decoder_output_dim, r, attn_type, attn_win,
attn_norm, prenet_type, prenet_dropout,
forward_attn, trans_agent, forward_attn_mask,
location_attn, attn_K, separate_stopnet, proj_speaker_dim)
if self.bidirectional_decoder:
self.decoder_backward = copy.deepcopy(self.decoder)
self.postnet = Postnet(self.postnet_output_dim)
def _init_states(self):
self.speaker_embeddings = None
self.speaker_embeddings_projected = None
@staticmethod
def shape_outputs(mel_outputs, mel_outputs_postnet, alignments):
mel_outputs = mel_outputs.transpose(1, 2)
mel_outputs_postnet = mel_outputs_postnet.transpose(1, 2)
return mel_outputs, mel_outputs_postnet, alignments
def forward(self, text, text_lengths, mel_specs=None, speaker_ids=None):
self._init_states()
# compute mask for padding
mask = sequence_mask(text_lengths).to(text.device)
embedded_inputs = self.embedding(text).transpose(1, 2)
encoder_outputs = self.encoder(embedded_inputs, text_lengths)
encoder_outputs = self._add_speaker_embedding(encoder_outputs,
speaker_ids)
decoder_outputs, alignments, stop_tokens = self.decoder(
encoder_outputs, mel_specs, mask)
postnet_outputs = self.postnet(decoder_outputs)
postnet_outputs = decoder_outputs + postnet_outputs
decoder_outputs, postnet_outputs, alignments = self.shape_outputs(
decoder_outputs, postnet_outputs, alignments)
if self.bidirectional_decoder:
decoder_outputs_backward, alignments_backward = self._backward_inference(mel_specs, encoder_outputs, mask)
return decoder_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward
return decoder_outputs, postnet_outputs, alignments, stop_tokens
@torch.no_grad()
def inference(self, text, speaker_ids=None):
embedded_inputs = self.embedding(text).transpose(1, 2)
encoder_outputs = self.encoder.inference(embedded_inputs)
encoder_outputs = self._add_speaker_embedding(encoder_outputs,
speaker_ids)
mel_outputs, alignments, stop_tokens = self.decoder.inference(
encoder_outputs)
mel_outputs_postnet = self.postnet(mel_outputs)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
mel_outputs, mel_outputs_postnet, alignments = self.shape_outputs(
mel_outputs, mel_outputs_postnet, alignments)
return mel_outputs, mel_outputs_postnet, alignments, stop_tokens
def inference_truncated(self, text, speaker_ids=None):
"""
Preserve model states for continuous inference
"""
embedded_inputs = self.embedding(text).transpose(1, 2)
encoder_outputs = self.encoder.inference_truncated(embedded_inputs)
encoder_outputs = self._add_speaker_embedding(encoder_outputs,
speaker_ids)
mel_outputs, alignments, stop_tokens = self.decoder.inference_truncated(
encoder_outputs)
mel_outputs_postnet = self.postnet(mel_outputs)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
mel_outputs, mel_outputs_postnet, alignments = self.shape_outputs(
mel_outputs, mel_outputs_postnet, alignments)
return mel_outputs, mel_outputs_postnet, alignments, stop_tokens
def _backward_inference(self, mel_specs, encoder_outputs, mask):
decoder_outputs_b, alignments_b, _ = self.decoder_backward(
encoder_outputs, torch.flip(mel_specs, dims=(1,)), mask,
self.speaker_embeddings_projected)
decoder_outputs_b = decoder_outputs_b.transpose(1, 2)
return decoder_outputs_b, alignments_b
def _add_speaker_embedding(self, encoder_outputs, speaker_ids):
if hasattr(self, "speaker_embedding") and speaker_ids is None:
raise RuntimeError(" [!] Model has speaker embedding layer but speaker_id is not provided")
if hasattr(self, "speaker_embedding") and speaker_ids is not None:
speaker_embeddings = self.speaker_embedding(speaker_ids)
speaker_embeddings.unsqueeze_(1)
speaker_embeddings = speaker_embeddings.expand(encoder_outputs.size(0),
encoder_outputs.size(1),
-1)
encoder_outputs = encoder_outputs + speaker_embeddings
return encoder_outputs

View File

@ -7,16 +7,16 @@ import argparse
import numpy as np
from tqdm import tqdm
from TTS.datasets.preprocess import load_meta_data
from TTS.utils.io import load_config
from TTS.utils.audio import AudioProcessor
from mozilla_voice_tts.tts.datasets.preprocess import load_meta_data
from mozilla_voice_tts.utils.io import load_config
from mozilla_voice_tts.utils.audio import AudioProcessor
def main():
"""Run preprocessing process."""
parser = argparse.ArgumentParser(
description="Compute mean and variance of spectrogtram features.")
parser.add_argument("--config_path", type=str, required=True,
help="TTS config file path.")
help="TTS config file path to define audio processin parameters.")
parser.add_argument("--out_path", default=None, type=str,
help="directory to save the output file.")
args = parser.parse_args()
@ -63,6 +63,11 @@ def main():
stats['linear_mean'] = linear_mean
stats['linear_std'] = linear_scale
print(f' > Avg mel spec mean: {mel_mean.mean()}')
print(f' > Avg mel spec scale: {mel_scale.mean()}')
print(f' > Avg linear spec mean: {linear_mean.mean()}')
print(f' > Avg linear spec scale: {linear_scale.mean()}')
# set default config values for mean-var scaling
CONFIG.audio['stats_path'] = output_file_path
CONFIG.audio['signal_norm'] = True
@ -73,6 +78,7 @@ def main():
del CONFIG.audio['clip_norm']
stats['audio_config'] = CONFIG.audio
np.save(output_file_path, stats, allow_pickle=True)
print(f' > scale_stats.npy is saved to {output_file_path}')
if __name__ == "__main__":

View File

@ -0,0 +1,32 @@
# Convert Tensorflow MelGAN (vocoder) model to TF-Lite binary
import argparse
from mozilla_voice_tts.utils.io import load_config
from mozilla_voice_tts.vocoder.tf.utils.generic_utils import setup_generator
from mozilla_voice_tts.vocoder.tf.utils.io import load_checkpoint
from mozilla_voice_tts.vocoder.tf.utils.tflite import convert_melgan_to_tflite
parser = argparse.ArgumentParser()
parser.add_argument('--tf_model',
type=str,
help='Path to target tensorflow model to be converted to TF-Lite.')
parser.add_argument('--config_path',
type=str,
help='Path to config file of torch model.')
parser.add_argument('--output_path',
type=str,
help='path to tflite output binary.')
args = parser.parse_args()
# Set constants
CONFIG = load_config(args.config_path)
# load the model
model = setup_generator(CONFIG)
model.build_inference()
model = load_checkpoint(model, args.tf_model)
# create tflite model
tflite_model = convert_melgan_to_tflite(model, output_path=args.output_path)

View File

@ -0,0 +1,116 @@
import argparse
import os
import numpy as np
import tensorflow as tf
import torch
from fuzzywuzzy import fuzz
from mozilla_voice_tts.utils.io import load_config
from mozilla_voice_tts.vocoder.tf.utils.convert_torch_to_tf_utils import (
compare_torch_tf, convert_tf_name, transfer_weights_torch_to_tf)
from mozilla_voice_tts.vocoder.tf.utils.generic_utils import \
setup_generator as setup_tf_generator
from mozilla_voice_tts.vocoder.tf.utils.io import save_checkpoint
from mozilla_voice_tts.vocoder.utils.generic_utils import setup_generator
# prevent GPU use
os.environ['CUDA_VISIBLE_DEVICES'] = ''
# define args
parser = argparse.ArgumentParser()
parser.add_argument('--torch_model_path',
type=str,
help='Path to target torch model to be converted to TF.')
parser.add_argument('--config_path',
type=str,
help='Path to config file of torch model.')
parser.add_argument(
'--output_path',
type=str,
help='path to output file including file name to save TF model.')
args = parser.parse_args()
# load model config
config_path = args.config_path
c = load_config(config_path)
num_speakers = 0
# init torch model
model = setup_generator(c)
checkpoint = torch.load(args.torch_model_path,
map_location=torch.device('cpu'))
state_dict = checkpoint['model']
model.load_state_dict(state_dict)
model.remove_weight_norm()
state_dict = model.state_dict()
# init tf model
model_tf = setup_tf_generator(c)
common_sufix = '/.ATTRIBUTES/VARIABLE_VALUE'
# get tf_model graph by passing an input
# B x D x T
dummy_input = tf.random.uniform((7, 80, 64), dtype=tf.float32)
mel_pred = model_tf(dummy_input, training=False)
# get tf variables
tf_vars = model_tf.weights
# match variable names with fuzzy logic
torch_var_names = list(state_dict.keys())
tf_var_names = [we.name for we in model_tf.weights]
var_map = []
for tf_name in tf_var_names:
# skip re-mapped layer names
if tf_name in [name[0] for name in var_map]:
continue
tf_name_edited = convert_tf_name(tf_name)
ratios = [
fuzz.ratio(torch_name, tf_name_edited)
for torch_name in torch_var_names
]
max_idx = np.argmax(ratios)
matching_name = torch_var_names[max_idx]
del torch_var_names[max_idx]
var_map.append((tf_name, matching_name))
# pass weights
tf_vars = transfer_weights_torch_to_tf(tf_vars, dict(var_map), state_dict)
# Compare TF and TORCH models
# check embedding outputs
model.eval()
dummy_input_torch = torch.ones((1, 80, 10))
dummy_input_tf = tf.convert_to_tensor(dummy_input_torch.numpy())
dummy_input_tf = tf.transpose(dummy_input_tf, perm=[0, 2, 1])
dummy_input_tf = tf.expand_dims(dummy_input_tf, 2)
out_torch = model.layers[0](dummy_input_torch)
out_tf = model_tf.model_layers[0](dummy_input_tf)
out_tf_ = tf.transpose(out_tf, perm=[0, 3, 2, 1])[:, :, 0, :]
assert compare_torch_tf(out_torch, out_tf_) < 1e-5
for i in range(1, len(model.layers)):
print(f"{i} -> {model.layers[i]} vs {model_tf.model_layers[i]}")
out_torch = model.layers[i](out_torch)
out_tf = model_tf.model_layers[i](out_tf)
out_tf_ = tf.transpose(out_tf, perm=[0, 3, 2, 1])[:, :, 0, :]
diff = compare_torch_tf(out_torch, out_tf_)
assert diff < 1e-5, diff
torch.manual_seed(0)
dummy_input_torch = torch.rand((1, 80, 100))
dummy_input_tf = tf.convert_to_tensor(dummy_input_torch.numpy())
model.inference_padding = 0
model_tf.inference_padding = 0
output_torch = model.inference(dummy_input_torch)
output_tf = model_tf(dummy_input_tf, training=False)
assert compare_torch_tf(output_torch, output_tf) < 1e-5, compare_torch_tf(
output_torch, output_tf)
# save tf model
save_checkpoint(model_tf, checkpoint['step'], checkpoint['epoch'],
args.output_path)
print(' > Model conversion is successfully completed :).')

View File

@ -0,0 +1,37 @@
# Convert Tensorflow Tacotron2 model to TF-Lite binary
import argparse
from mozilla_voice_tts.utils.io import load_config
from mozilla_voice_tts.tts.utils.text.symbols import symbols, phonemes
from mozilla_voice_tts.tts.tf.utils.generic_utils import setup_model
from mozilla_voice_tts.tts.tf.utils.io import load_checkpoint
from mozilla_voice_tts.tts.tf.utils.tflite import convert_tacotron2_to_tflite
parser = argparse.ArgumentParser()
parser.add_argument('--tf_model',
type=str,
help='Path to target tensorflow model to be converted to TF-Lite.')
parser.add_argument('--config_path',
type=str,
help='Path to config file of torch model.')
parser.add_argument('--output_path',
type=str,
help='path to tflite output binary.')
args = parser.parse_args()
# Set constants
CONFIG = load_config(args.config_path)
# load the model
c = CONFIG
num_speakers = 0
num_chars = len(phonemes) if c.use_phonemes else len(symbols)
model = setup_model(num_chars, num_speakers, c, enable_tflite=True)
model.build_inference()
model = load_checkpoint(model, args.tf_model)
model.decoder.set_max_decoder_steps(1000)
# create tflite model
tflite_model = convert_tacotron2_to_tflite(model, output_path=args.output_path)

View File

@ -1,21 +1,27 @@
# %%
import sys
sys.path.append('/home/erogol/Projects')
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
# %%
import argparse
import numpy as np
import torch
import tensorflow as tf
from fuzzywuzzy import fuzz
import os
import sys
# %%
# print variable match
from pprint import pprint
import numpy as np
import tensorflow as tf
import torch
from fuzzywuzzy import fuzz
from mozilla_voice_tts.tts.tf.models.tacotron2 import Tacotron2
from mozilla_voice_tts.tts.tf.utils.convert_torch_to_tf_utils import (
compare_torch_tf, convert_tf_name, transfer_weights_torch_to_tf)
from mozilla_voice_tts.tts.tf.utils.generic_utils import save_checkpoint
from mozilla_voice_tts.tts.utils.generic_utils import setup_model
from mozilla_voice_tts.tts.utils.text.symbols import phonemes, symbols
from mozilla_voice_tts.utils.io import load_config
sys.path.append('/home/erogol/Projects')
os.environ['CUDA_VISIBLE_DEVICES'] = ''
from TTS.utils.text.symbols import phonemes, symbols
from TTS.utils.generic_utils import setup_model
from TTS.utils.io import load_config
from TTS.tf.models.tacotron2 import Tacotron2
from TTS.tf.utils.convert_torch_to_tf_utils import compare_torch_tf, tf_create_dummy_inputs, transfer_weights_torch_to_tf, convert_tf_name
from TTS.tf.utils.generic_utils import save_checkpoint
parser = argparse.ArgumentParser()
parser.add_argument('--torch_model_path',
@ -26,7 +32,7 @@ parser.add_argument('--config_path',
help='Path to config file of torch model.')
parser.add_argument('--output_path',
type=str,
help='path to save TF model weights.')
help='path to output file including file name to save TF model.')
args = parser.parse_args()
# load model config
@ -65,18 +71,18 @@ model_tf = Tacotron2(num_chars=num_chars,
# TODO: set layer names so that we can remove these manual matching
common_sufix = '/.ATTRIBUTES/VARIABLE_VALUE'
var_map = [
('tacotron2/embedding/embeddings:0', 'embedding.weight'),
('tacotron2/encoder/lstm/forward_lstm/lstm_cell_1/kernel:0',
('embedding/embeddings:0', 'embedding.weight'),
('encoder/lstm/forward_lstm/lstm_cell_1/kernel:0',
'encoder.lstm.weight_ih_l0'),
('tacotron2/encoder/lstm/forward_lstm/lstm_cell_1/recurrent_kernel:0',
('encoder/lstm/forward_lstm/lstm_cell_1/recurrent_kernel:0',
'encoder.lstm.weight_hh_l0'),
('tacotron2/encoder/lstm/backward_lstm/lstm_cell_2/kernel:0',
('encoder/lstm/backward_lstm/lstm_cell_2/kernel:0',
'encoder.lstm.weight_ih_l0_reverse'),
('tacotron2/encoder/lstm/backward_lstm/lstm_cell_2/recurrent_kernel:0',
('encoder/lstm/backward_lstm/lstm_cell_2/recurrent_kernel:0',
'encoder.lstm.weight_hh_l0_reverse'),
('tacotron2/encoder/lstm/forward_lstm/lstm_cell_1/bias:0',
('encoder/lstm/forward_lstm/lstm_cell_1/bias:0',
('encoder.lstm.bias_ih_l0', 'encoder.lstm.bias_hh_l0')),
('tacotron2/encoder/lstm/backward_lstm/lstm_cell_2/bias:0',
('encoder/lstm/backward_lstm/lstm_cell_2/bias:0',
('encoder.lstm.bias_ih_l0_reverse', 'encoder.lstm.bias_hh_l0_reverse')),
('attention/v/kernel:0', 'decoder.attention.v.linear_layer.weight'),
('decoder/linear_projection/kernel:0',
@ -86,8 +92,7 @@ var_map = [
# %%
# get tf_model graph
input_ids, input_lengths, mel_outputs, mel_lengths = tf_create_dummy_inputs()
mel_pred = model_tf(input_ids, training=False)
model_tf.build_inference()
# get tf variables
tf_vars = model_tf.weights
@ -109,9 +114,6 @@ for tf_name in tf_var_names:
del torch_var_names[max_idx]
var_map.append((tf_name, matching_name))
# %%
# print variable match
from pprint import pprint
pprint(var_map)
pprint(torch_var_names)

View File

@ -0,0 +1,65 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import sys
import pathlib
import time
import subprocess
import argparse
import torch
def main():
"""
Call train_tts.py as a new process for each GPU and pass command line arguments
"""
parser = argparse.ArgumentParser()
parser.add_argument(
'--continue_path',
type=str,
help='Training output folder to continue a previous training run. If it is used, "config_path" is ignored.',
default='',
required='--config_path' not in sys.argv)
parser.add_argument(
'--restore_path',
type=str,
help='Model file to be restored. Use to finetune a model.',
default='')
parser.add_argument(
'--config_path',
type=str,
help='Path to config file for training.',
required='--continue_path' not in sys.argv
)
args = parser.parse_args()
num_gpus = torch.cuda.device_count()
group_id = time.strftime("%Y_%m_%d-%H%M%S")
# set arguments for train.py
folder_path = pathlib.Path(__file__).parent.absolute()
command = [os.path.join(folder_path, 'train_tts.py')]
command.append('--continue_path={}'.format(args.continue_path))
command.append('--restore_path={}'.format(args.restore_path))
command.append('--config_path={}'.format(args.config_path))
command.append('--group_id=group_{}'.format(group_id))
command.append('')
# run processes
processes = []
for i in range(num_gpus):
my_env = os.environ.copy()
my_env["PYTHON_EGG_CACHE"] = "/tmp/tmp{}".format(i)
command[-1] = '--rank={}'.format(i)
stdout = None if i == 0 else open(os.devnull, 'w')
p = subprocess.Popen(['python3'] + command, stdout=stdout, env=my_env)
processes.append(p)
print(command)
for p in processes:
p.wait()
if __name__ == '__main__':
main()

View File

@ -0,0 +1,174 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
import json
# pylint: disable=redefined-outer-name, unused-argument
import os
import string
import time
import torch
from mozilla_voice_tts.tts.utils.generic_utils import setup_model
from mozilla_voice_tts.tts.utils.synthesis import synthesis
from mozilla_voice_tts.tts.utils.text.symbols import make_symbols, phonemes, symbols
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.utils.io import load_config
from mozilla_voice_tts.vocoder.utils.generic_utils import setup_generator
def tts(model, vocoder_model, text, CONFIG, use_cuda, ap, use_gl, speaker_fileid, speaker_embedding=None, gst_style=None):
t_1 = time.time()
waveform, _, _, mel_postnet_spec, _, _ = synthesis(model, text, CONFIG, use_cuda, ap, speaker_fileid, gst_style, False, CONFIG.enable_eos_bos_chars, use_gl, speaker_embedding=speaker_embedding)
if CONFIG.model == "Tacotron" and not use_gl:
mel_postnet_spec = ap.out_linear_to_mel(mel_postnet_spec.T).T
if not use_gl:
waveform = vocoder_model.inference(torch.FloatTensor(mel_postnet_spec.T).unsqueeze(0))
if use_cuda and not use_gl:
waveform = waveform.cpu()
if not use_gl:
waveform = waveform.numpy()
waveform = waveform.squeeze()
rtf = (time.time() - t_1) / (len(waveform) / ap.sample_rate)
tps = (time.time() - t_1) / len(waveform)
print(" > Run-time: {}".format(time.time() - t_1))
print(" > Real-time factor: {}".format(rtf))
print(" > Time per step: {}".format(tps))
return waveform
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('text', type=str, help='Text to generate speech.')
parser.add_argument('config_path',
type=str,
help='Path to model config file.')
parser.add_argument(
'model_path',
type=str,
help='Path to model file.',
)
parser.add_argument(
'out_path',
type=str,
help='Path to save the final wav file. The wav file will be named after the given text.',
)
parser.add_argument('--use_cuda',
type=bool,
help='Run model on CUDA.',
default=False)
parser.add_argument(
'--vocoder_path',
type=str,
help=
'Path to vocoder model file. If it is not defined, model uses GL as vocoder. Please make sure that you installed vocoder library before (WaveRNN).',
default="",
)
parser.add_argument('--vocoder_config_path',
type=str,
help='Path to vocoder model config file.',
default="")
parser.add_argument(
'--batched_vocoder',
type=bool,
help="If True, vocoder model uses faster batch processing.",
default=True)
parser.add_argument('--speakers_json',
type=str,
help="JSON file for multi-speaker model.",
default="")
parser.add_argument(
'--speaker_fileid',
type=str,
help="if CONFIG.use_external_speaker_embedding_file is true, name of speaker embedding reference file present in speakers.json, else target speaker_fileid if the model is multi-speaker.",
default=None)
parser.add_argument(
'--gst_style',
help="Wav path file for GST stylereference.",
default=None)
args = parser.parse_args()
# load the config
C = load_config(args.config_path)
C.forward_attn_mask = True
# load the audio processor
ap = AudioProcessor(**C.audio)
# if the vocabulary was passed, replace the default
if 'characters' in C.keys():
symbols, phonemes = make_symbols(**C.characters)
speaker_embedding = None
speaker_embedding_dim = None
num_speakers = 0
# load speakers
if args.speakers_json != '':
speaker_mapping = json.load(open(args.speakers_json, 'r'))
num_speakers = len(speaker_mapping)
if C.use_external_speaker_embedding_file:
if args.speaker_fileid is not None:
speaker_embedding = speaker_mapping[args.speaker_fileid]['embedding']
else: # if speaker_fileid is not specified, use the first sample in speakers.json
speaker_embedding = speaker_mapping[list(speaker_mapping.keys())[0]]['embedding']
speaker_embedding_dim = len(speaker_embedding)
# load the model
num_chars = len(phonemes) if C.use_phonemes else len(symbols)
model = setup_model(num_chars, num_speakers, C, speaker_embedding_dim)
cp = torch.load(args.model_path, map_location=torch.device('cpu'))
model.load_state_dict(cp['model'])
model.eval()
if args.use_cuda:
model.cuda()
model.decoder.set_r(cp['r'])
# load vocoder model
if args.vocoder_path != "":
VC = load_config(args.vocoder_config_path)
vocoder_model = setup_generator(VC)
vocoder_model.load_state_dict(torch.load(args.vocoder_path, map_location="cpu")["model"])
vocoder_model.remove_weight_norm()
if args.use_cuda:
vocoder_model.cuda()
vocoder_model.eval()
else:
vocoder_model = None
VC = None
# synthesize voice
use_griffin_lim = args.vocoder_path == ""
print(" > Text: {}".format(args.text))
if not C.use_external_speaker_embedding_file:
if args.speaker_fileid.isdigit():
args.speaker_fileid = int(args.speaker_fileid)
else:
args.speaker_fileid = None
else:
args.speaker_fileid = None
if args.gst_style is None:
gst_style = C.gst['gst_style_input']
else:
# check if the gst_style string is a dict; if so, convert it, otherwise use the string as-is
try:
gst_style = json.loads(args.gst_style)
if max(map(int, gst_style.keys())) >= C.gst['gst_style_tokens']:
raise RuntimeError("The highest value of the gst_style dictionary key must be less than the number of GST Tokens, \n Highest dictionary key value: {} \n Number of GST tokens: {}".format(max(map(int, gst_style.keys())), C.gst['gst_style_tokens']))
except ValueError:
gst_style = args.gst_style
wav = tts(model, vocoder_model, args.text, C, args.use_cuda, ap, use_griffin_lim, args.speaker_fileid, speaker_embedding=speaker_embedding, gst_style=gst_style)
# save the results
file_name = args.text.replace(" ", "_")
file_name = file_name.translate(
str.maketrans('', '', string.punctuation.replace('_', ''))) + '.wav'
out_path = os.path.join(args.out_path, file_name)
print(" > Saving output to {}".format(out_path))
ap.save_wav(wav, out_path)

View File

@ -1,3 +1,6 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
import os
import sys
@ -6,19 +9,22 @@ import traceback
import torch
from torch.utils.data import DataLoader
from TTS.datasets.preprocess import load_meta_data
from TTS.speaker_encoder.dataset import MyDataset
from TTS.speaker_encoder.loss import GE2ELoss
from TTS.speaker_encoder.model import SpeakerEncoder
from TTS.speaker_encoder.visual import plot_embeddings
from TTS.speaker_encoder.generic_utils import save_best_model
from TTS.utils.audio import AudioProcessor
from TTS.utils.generic_utils import (create_experiment_folder, get_git_branch,
remove_experiment_folder, set_init_dict)
from TTS.utils.io import load_config, copy_config_file
from TTS.utils.training import check_update, NoamLR
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.radam import RAdam
from mozilla_voice_tts.speaker_encoder.dataset import MyDataset
from mozilla_voice_tts.speaker_encoder.generic_utils import save_best_model
from mozilla_voice_tts.speaker_encoder.losses import GE2ELoss, AngleProtoLoss
from mozilla_voice_tts.speaker_encoder.model import SpeakerEncoder
from mozilla_voice_tts.speaker_encoder.visual import plot_embeddings
from mozilla_voice_tts.tts.datasets.preprocess import load_meta_data
from mozilla_voice_tts.utils.generic_utils import (
create_experiment_folder, get_git_branch, remove_experiment_folder,
set_init_dict)
from mozilla_voice_tts.utils.io import copy_config_file, load_config
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.utils.generic_utils import count_parameters
from mozilla_voice_tts.utils.radam import RAdam
from mozilla_voice_tts.utils.tensorboard_logger import TensorboardLogger
from mozilla_voice_tts.utils.training import NoamLR, check_update
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
@ -94,7 +100,7 @@ def train(model, criterion, optimizer, scheduler, ap, global_step):
if global_step % c.steps_plot_stats == 0:
# Plot Training Epoch Stats
train_stats = {
"GE2Eloss": avg_loss,
"loss": avg_loss,
"lr": current_lr,
"grad_norm": grad_norm,
"step_time": step_time
@ -129,12 +135,18 @@ def main(args): # pylint: disable=redefined-outer-name
global meta_data_eval
ap = AudioProcessor(**c.audio)
model = SpeakerEncoder(input_dim=40,
proj_dim=128,
lstm_dim=384,
num_lstm_layers=3)
model = SpeakerEncoder(input_dim=c.model['input_dim'],
proj_dim=c.model['proj_dim'],
lstm_dim=c.model['lstm_dim'],
num_lstm_layers=c.model['num_lstm_layers'])
optimizer = RAdam(model.parameters(), lr=c.lr)
criterion = GE2ELoss(loss_method='softmax')
if c.loss == "ge2e":
criterion = GE2ELoss(loss_method='softmax')
elif c.loss == "angleproto":
criterion = AngleProtoLoss()
else:
raise Exception("The %s not is a loss supported" % c.loss)
if args.restore_path:
checkpoint = torch.load(args.restore_path)
@ -177,8 +189,8 @@ def main(args): # pylint: disable=redefined-outer-name
meta_data_train, meta_data_eval = load_meta_data(c.datasets)
global_step = args.restore_step
train_loss, global_step = train(model, criterion, optimizer, scheduler, ap,
global_step)
_, global_step = train(model, criterion, optimizer, scheduler, ap,
global_step)
if __name__ == '__main__':
@ -236,7 +248,7 @@ if __name__ == '__main__':
new_fields)
LOG_DIR = OUT_PATH
tb_logger = TensorboardLogger(LOG_DIR)
tb_logger = TensorboardLogger(LOG_DIR, model_name='Speaker_Encoder')
try:
main(args)

View File

@ -1,7 +1,10 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
import glob
import os
import sys
import glob
import time
import traceback
@ -9,46 +12,51 @@ import numpy as np
import torch
from torch.utils.data import DataLoader
from TTS.datasets.TTSDataset import MyDataset
from distribute import (DistributedSampler, apply_gradient_allreduce,
init_distributed, reduce_tensor)
from TTS.layers.losses import TacotronLoss
from TTS.utils.audio import AudioProcessor
from TTS.utils.generic_utils import (count_parameters, create_experiment_folder, remove_experiment_folder,
get_git_branch, set_init_dict,
setup_model, KeepAverage, check_config)
from TTS.utils.io import (save_best_model, save_checkpoint,
load_config, copy_config_file)
from TTS.utils.training import (NoamLR, check_update, adam_weight_decay,
gradual_training_scheduler, set_weight_decay)
from TTS.utils.tensorboard_logger import TensorboardLogger
from TTS.utils.console_logger import ConsoleLogger
from TTS.utils.speakers import load_speaker_mapping, save_speaker_mapping, \
get_speakers
from TTS.utils.synthesis import synthesis
from TTS.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.utils.visual import plot_alignment, plot_spectrogram
from TTS.datasets.preprocess import load_meta_data
from TTS.utils.radam import RAdam
from TTS.utils.measures import alignment_diagonal_score
from mozilla_voice_tts.tts.datasets.preprocess import load_meta_data
from mozilla_voice_tts.tts.datasets.TTSDataset import MyDataset
from mozilla_voice_tts.tts.layers.losses import TacotronLoss
from mozilla_voice_tts.tts.utils.distribute import (DistributedSampler,
apply_gradient_allreduce,
init_distributed,
reduce_tensor)
from mozilla_voice_tts.tts.utils.generic_utils import check_config, setup_model
from mozilla_voice_tts.tts.utils.io import save_best_model, save_checkpoint
from mozilla_voice_tts.tts.utils.measures import alignment_diagonal_score
from mozilla_voice_tts.tts.utils.speakers import (get_speakers,
load_speaker_mapping,
save_speaker_mapping)
from mozilla_voice_tts.tts.utils.synthesis import synthesis
from mozilla_voice_tts.tts.utils.text.symbols import (make_symbols, phonemes,
symbols)
from mozilla_voice_tts.tts.utils.visual import plot_alignment, plot_spectrogram
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.utils.console_logger import ConsoleLogger
from mozilla_voice_tts.utils.generic_utils import (KeepAverage,
count_parameters,
create_experiment_folder,
get_git_branch,
remove_experiment_folder,
set_init_dict)
from mozilla_voice_tts.utils.io import copy_config_file, load_config
from mozilla_voice_tts.utils.radam import RAdam
from mozilla_voice_tts.utils.tensorboard_logger import TensorboardLogger
from mozilla_voice_tts.utils.training import (NoamLR, adam_weight_decay,
check_update,
gradual_training_scheduler,
set_weight_decay,
setup_torch_training_env)
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(54321)
use_cuda = torch.cuda.is_available()
num_gpus = torch.cuda.device_count()
print(" > Using CUDA: ", use_cuda)
print(" > Number of GPUs: ", num_gpus)
use_cuda, num_gpus = setup_torch_training_env(True, False)
def setup_loader(ap, r, is_val=False, verbose=False):
def setup_loader(ap, r, is_val=False, verbose=False, speaker_mapping=None):
if is_val and not c.run_eval:
loader = None
else:
dataset = MyDataset(
r,
c.text_cleaner,
compute_linear_spec=True if c.model.lower() == 'tacotron' else False,
compute_linear_spec=c.model.lower() == 'tacotron',
meta_data=meta_data_eval if is_val else meta_data_train,
ap=ap,
tp=c.characters if 'characters' in c.keys() else None,
@ -60,7 +68,8 @@ def setup_loader(ap, r, is_val=False, verbose=False):
use_phonemes=c.use_phonemes,
phoneme_language=c.phoneme_language,
enable_eos_bos=c.enable_eos_bos_chars,
verbose=verbose)
verbose=verbose,
speaker_mapping=speaker_mapping if c.use_speaker_embedding and c.use_external_speaker_embedding_file else None)
sampler = DistributedSampler(dataset) if num_gpus > 1 else None
loader = DataLoader(
dataset,
@ -74,9 +83,8 @@ def setup_loader(ap, r, is_val=False, verbose=False):
pin_memory=False)
return loader
def format_data(data):
if c.use_speaker_embedding:
def format_data(data, speaker_mapping=None):
if speaker_mapping is None and c.use_speaker_embedding and not c.use_external_speaker_embedding_file:
speaker_mapping = load_speaker_mapping(OUT_PATH)
# setup input data
@ -91,13 +99,20 @@ def format_data(data):
avg_spec_length = torch.mean(mel_lengths.float())
if c.use_speaker_embedding:
speaker_ids = [
speaker_mapping[speaker_name] for speaker_name in speaker_names
]
speaker_ids = torch.LongTensor(speaker_ids)
if c.use_external_speaker_embedding_file:
speaker_embeddings = data[8]
speaker_ids = None
else:
speaker_ids = [
speaker_mapping[speaker_name] for speaker_name in speaker_names
]
speaker_ids = torch.LongTensor(speaker_ids)
speaker_embeddings = None
else:
speaker_embeddings = None
speaker_ids = None
# set stop targets view, we predict a single stop token per iteration.
stop_targets = stop_targets.view(text_input.shape[0],
stop_targets.size(1) // c.r, -1)
@ -114,30 +129,19 @@ def format_data(data):
stop_targets = stop_targets.cuda(non_blocking=True)
if speaker_ids is not None:
speaker_ids = speaker_ids.cuda(non_blocking=True)
return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, avg_text_length, avg_spec_length
if speaker_embeddings is not None:
speaker_embeddings = speaker_embeddings.cuda(non_blocking=True)
return text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length
def train(model, criterion, optimizer, optimizer_st, scheduler,
ap, global_step, epoch):
ap, global_step, epoch, amp, speaker_mapping=None):
data_loader = setup_loader(ap, model.decoder.r, is_val=False,
verbose=(epoch == 0))
verbose=(epoch == 0), speaker_mapping=speaker_mapping)
model.train()
epoch_time = 0
train_values = {
'avg_postnet_loss': 0,
'avg_decoder_loss': 0,
'avg_stopnet_loss': 0,
'avg_align_error': 0,
'avg_step_time': 0,
'avg_loader_time': 0
}
if c.bidirectional_decoder:
train_values['avg_decoder_b_loss'] = 0 # decoder backward loss
train_values['avg_decoder_c_loss'] = 0 # decoder consistency loss
if c.ga_alpha > 0:
train_values['avg_ga_loss'] = 0 # guided attention loss
keep_avg = KeepAverage()
keep_avg.add_values(train_values)
if use_cuda:
batch_n_iter = int(
len(data_loader.dataset) / (c.batch_size * num_gpus))
@ -149,7 +153,7 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
start_time = time.time()
# format data
text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, avg_text_length, avg_spec_length = format_data(data)
text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, avg_text_length, avg_spec_length = format_data(data, speaker_mapping)
loader_time = time.time() - end_time
global_step += 1
@ -162,15 +166,16 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
optimizer_st.zero_grad()
# forward pass model
if c.bidirectional_decoder:
if c.bidirectional_decoder or c.double_decoder_consistency:
decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
else:
decoder_output, postnet_output, alignments, stop_tokens = model(
text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
decoder_backward_output = None
alignments_backward = None
# set the alignment lengths wrt reduction factor for guided attention
# set the [alignment] lengths wrt reduction factor for guided attention
if mel_lengths.max() % model.decoder.r != 0:
alignment_lengths = (mel_lengths + (model.decoder.r - (mel_lengths.max() % model.decoder.r))) // model.decoder.r
else:
@ -180,29 +185,37 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
loss_dict = criterion(postnet_output, decoder_output, mel_input,
linear_input, stop_tokens, stop_targets,
mel_lengths, decoder_backward_output,
alignments, alignment_lengths, text_lengths)
if c.bidirectional_decoder:
keep_avg.update_values({'avg_decoder_b_loss': loss_dict['decoder_backward_loss'].item(),
'avg_decoder_c_loss': loss_dict['decoder_c_loss'].item()})
if c.ga_alpha > 0:
keep_avg.update_values({'avg_ga_loss': loss_dict['ga_loss'].item()})
alignments, alignment_lengths, alignments_backward,
text_lengths)
# backward pass
loss_dict['loss'].backward()
if amp is not None:
with amp.scale_loss(loss_dict['loss'], optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss_dict['loss'].backward()
optimizer, current_lr = adam_weight_decay(optimizer)
grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True)
if amp:
amp_opt_params = amp.master_params(optimizer)
else:
amp_opt_params = None
grad_norm, _ = check_update(model, c.grad_clip, ignore_stopnet=True, amp_opt_params=amp_opt_params)
optimizer.step()
# compute alignment error (the lower the better)
align_error = 1 - alignment_diagonal_score(alignments)
keep_avg.update_value('avg_align_error', align_error)
loss_dict['align_error'] = align_error
# backpass and check the grad norm for stop loss
if c.separate_stopnet:
loss_dict['stopnet_loss'].backward()
optimizer_st, _ = adam_weight_decay(optimizer_st)
grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0)
if amp:
amp_opt_params = amp.master_params(optimizer)
else:
amp_opt_params = None
grad_norm_st, _ = check_update(model.decoder.stopnet, 1.0, amp_opt_params=amp_opt_params)
optimizer_st.step()
else:
grad_norm_st = 0
@ -210,23 +223,6 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
step_time = time.time() - start_time
epoch_time += step_time
# update avg stats
update_train_values = {
'avg_postnet_loss': float(loss_dict['postnet_loss'].item()),
'avg_decoder_loss': float(loss_dict['decoder_loss'].item()),
'avg_stopnet_loss': loss_dict['stopnet_loss'].item() \
if isinstance(loss_dict['stopnet_loss'], float) else float(loss_dict['stopnet_loss'].item()),
'avg_step_time': step_time,
'avg_loader_time': loader_time
}
keep_avg.update_values(update_train_values)
if global_step % c.print_step == 0:
c_logger.print_train_step(batch_n_iter, num_iter, global_step,
avg_spec_length, avg_text_length,
step_time, loader_time, current_lr,
loss_dict, keep_avg.avg_values)
# aggregate losses from processes
if num_gpus > 1:
loss_dict['postnet_loss'] = reduce_tensor(loss_dict['postnet_loss'].data, num_gpus)
@ -234,18 +230,46 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
loss_dict['loss'] = reduce_tensor(loss_dict['loss'] .data, num_gpus)
loss_dict['stopnet_loss'] = reduce_tensor(loss_dict['stopnet_loss'].data, num_gpus) if c.stopnet else loss_dict['stopnet_loss']
# detach loss values
loss_dict_new = dict()
for key, value in loss_dict.items():
if isinstance(value, (int, float)):
loss_dict_new[key] = value
else:
loss_dict_new[key] = value.item()
loss_dict = loss_dict_new
# update avg stats
update_train_values = dict()
for key, value in loss_dict.items():
update_train_values['avg_' + key] = value
update_train_values['avg_loader_time'] = loader_time
update_train_values['avg_step_time'] = step_time
keep_avg.update_values(update_train_values)
# print training progress
if global_step % c.print_step == 0:
log_dict = {
"avg_spec_length": [avg_spec_length, 1], # value, precision
"avg_text_length": [avg_text_length, 1],
"step_time": [step_time, 4],
"loader_time": [loader_time, 2],
"current_lr": current_lr,
}
c_logger.print_train_step(batch_n_iter, num_iter, global_step,
log_dict, loss_dict, keep_avg.avg_values)
if args.rank == 0:
# Plot Training Iter Stats
# reduce TB load
if global_step % 10 == 0:
if global_step % c.tb_plot_step == 0:
iter_stats = {
"loss_posnet": loss_dict['postnet_loss'].item(),
"loss_decoder": loss_dict['decoder_loss'].item(),
"lr": current_lr,
"grad_norm": grad_norm,
"grad_norm_st": grad_norm_st,
"step_time": step_time
}
iter_stats.update(loss_dict)
tb_logger.tb_train_iter_stats(global_step, iter_stats)
if global_step % c.save_step == 0:
@ -253,7 +277,8 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
# save model
save_checkpoint(model, optimizer, global_step, epoch, model.decoder.r, OUT_PATH,
optimizer_st=optimizer_st,
model_loss=loss_dict['postnet_loss'].item())
model_loss=loss_dict['postnet_loss'],
amp_state_dict=amp.state_dict() if amp else None)
# Diagnostic visualizations
const_spec = postnet_output[0].data.cpu().numpy()
@ -263,13 +288,13 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
align_img = alignments[0].data.cpu().numpy()
figures = {
"prediction": plot_spectrogram(const_spec, ap),
"ground_truth": plot_spectrogram(gt_spec, ap),
"alignment": plot_alignment(align_img),
"prediction": plot_spectrogram(const_spec, ap, output_fig=False),
"ground_truth": plot_spectrogram(gt_spec, ap, output_fig=False),
"alignment": plot_alignment(align_img, output_fig=False),
}
if c.bidirectional_decoder:
figures["alignment_backward"] = plot_alignment(alignments_backward[0].data.cpu().numpy())
if c.bidirectional_decoder or c.double_decoder_consistency:
figures["alignment_backward"] = plot_alignment(alignments_backward[0].data.cpu().numpy(), output_fig=False)
tb_logger.tb_train_figures(global_step, figures)
@ -288,16 +313,8 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
# Plot Epoch Stats
if args.rank == 0:
# Plot Training Epoch Stats
epoch_stats = {
"loss_postnet": keep_avg['avg_postnet_loss'],
"loss_decoder": keep_avg['avg_decoder_loss'],
"stopnet_loss": keep_avg['avg_stopnet_loss'],
"alignment_score": keep_avg['avg_align_error'],
"epoch_time": epoch_time
}
if c.ga_alpha > 0:
epoch_stats['guided_attention_loss'] = keep_avg['avg_ga_loss']
epoch_stats = {"epoch_time": epoch_time}
epoch_stats.update(keep_avg.avg_values)
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
if c.tb_model_param_stats:
tb_logger.tb_model_weights(model, global_step)
@ -305,41 +322,29 @@ def train(model, criterion, optimizer, optimizer_st, scheduler,
@torch.no_grad()
def evaluate(model, criterion, ap, global_step, epoch):
data_loader = setup_loader(ap, model.decoder.r, is_val=True)
def evaluate(model, criterion, ap, global_step, epoch, speaker_mapping=None):
data_loader = setup_loader(ap, model.decoder.r, is_val=True, speaker_mapping=speaker_mapping)
model.eval()
epoch_time = 0
eval_values_dict = {
'avg_postnet_loss': 0,
'avg_decoder_loss': 0,
'avg_stopnet_loss': 0,
'avg_align_error': 0
}
if c.bidirectional_decoder:
eval_values_dict['avg_decoder_b_loss'] = 0 # decoder backward loss
eval_values_dict['avg_decoder_c_loss'] = 0 # decoder consistency loss
if c.ga_alpha > 0:
eval_values_dict['avg_ga_loss'] = 0 # guided attention loss
keep_avg = KeepAverage()
keep_avg.add_values(eval_values_dict)
c_logger.print_eval_start()
if data_loader is not None:
for num_iter, data in enumerate(data_loader):
start_time = time.time()
# format data
text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, _, _ = format_data(data)
text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, _, _ = format_data(data, speaker_mapping)
assert mel_input.shape[1] % model.decoder.r == 0
# forward pass model
if c.bidirectional_decoder:
if c.bidirectional_decoder or c.double_decoder_consistency:
decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
text_input, text_lengths, mel_input, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
else:
decoder_output, postnet_output, alignments, stop_tokens = model(
text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
text_input, text_lengths, mel_input, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
decoder_backward_output = None
alignments_backward = None
# set the alignment lengths wrt reduction factor for guided attention
if mel_lengths.max() % model.decoder.r != 0:
@ -351,12 +356,8 @@ def evaluate(model, criterion, ap, global_step, epoch):
loss_dict = criterion(postnet_output, decoder_output, mel_input,
linear_input, stop_tokens, stop_targets,
mel_lengths, decoder_backward_output,
alignments, alignment_lengths, text_lengths)
if c.bidirectional_decoder:
keep_avg.update_values({'avg_decoder_b_loss': loss_dict['decoder_b_loss'].item(),
'avg_decoder_c_loss': loss_dict['decoder_c_loss'].item()})
if c.ga_alpha > 0:
keep_avg.update_values({'avg_ga_loss': loss_dict['ga_loss'].item()})
alignments, alignment_lengths, alignments_backward,
text_lengths)
# step time
step_time = time.time() - start_time
@ -364,7 +365,7 @@ def evaluate(model, criterion, ap, global_step, epoch):
# compute alignment score
align_error = 1 - alignment_diagonal_score(alignments)
keep_avg.update_value('avg_align_error', align_error)
loss_dict['align_error'] = align_error
# aggregate losses from processes
if num_gpus > 1:
@ -373,14 +374,20 @@ def evaluate(model, criterion, ap, global_step, epoch):
if c.stopnet:
loss_dict['stopnet_loss'] = reduce_tensor(loss_dict['stopnet_loss'].data, num_gpus)
keep_avg.update_values({
'avg_postnet_loss':
float(loss_dict['postnet_loss'].item()),
'avg_decoder_loss':
float(loss_dict['decoder_loss'].item()),
'avg_stopnet_loss':
float(loss_dict['stopnet_loss'].item()),
})
# detach loss values
loss_dict_new = dict()
for key, value in loss_dict.items():
if isinstance(value, (int, float)):
loss_dict_new[key] = value
else:
loss_dict_new[key] = value.item()
loss_dict = loss_dict_new
# update avg stats
update_train_values = dict()
for key, value in loss_dict.items():
update_train_values['avg_' + key] = value
keep_avg.update_values(update_train_values)
if c.print_eval:
c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)
@ -395,9 +402,9 @@ def evaluate(model, criterion, ap, global_step, epoch):
align_img = alignments[idx].data.cpu().numpy()
eval_figures = {
"prediction": plot_spectrogram(const_spec, ap),
"ground_truth": plot_spectrogram(gt_spec, ap),
"alignment": plot_alignment(align_img)
"prediction": plot_spectrogram(const_spec, ap, output_fig=False),
"ground_truth": plot_spectrogram(gt_spec, ap, output_fig=False),
"alignment": plot_alignment(align_img, output_fig=False)
}
# Sample audio
@ -409,20 +416,11 @@ def evaluate(model, criterion, ap, global_step, epoch):
c.audio["sample_rate"])
# Plot Validation Stats
epoch_stats = {
"loss_postnet": keep_avg['avg_postnet_loss'],
"loss_decoder": keep_avg['avg_decoder_loss'],
"stopnet_loss": keep_avg['avg_stopnet_loss'],
"alignment_score": keep_avg['avg_align_error'],
}
if c.bidirectional_decoder:
epoch_stats['loss_decoder_backward'] = keep_avg['avg_decoder_b_loss']
if c.bidirectional_decoder or c.double_decoder_consistency:
align_b_img = alignments_backward[idx].data.cpu().numpy()
eval_figures['alignment_backward'] = plot_alignment(align_b_img)
if c.ga_alpha > 0:
epoch_stats['guided_attention_loss'] = keep_avg['avg_ga_loss']
tb_logger.tb_eval_stats(global_step, epoch_stats)
eval_figures['alignment2'] = plot_alignment(align_b_img, output_fig=False)
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
tb_logger.tb_eval_figures(global_step, eval_figures)
if args.rank == 0 and epoch > c.test_delay_epochs:
@ -431,7 +429,8 @@ def evaluate(model, criterion, ap, global_step, epoch):
"It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
"Be a voice, not an echo.",
"I'm sorry Dave. I'm afraid I can't do that.",
"This cake is great. It's so delicious and moist."
"This cake is great. It's so delicious and moist.",
"Prior to November 22, 1963."
]
else:
with open(c.test_sentences_file, "r") as f:
@ -442,10 +441,10 @@ def evaluate(model, criterion, ap, global_step, epoch):
test_figures = {}
print(" | > Synthesizing test sentences")
speaker_id = 0 if c.use_speaker_embedding else None
style_wav = c.get("style_wav_for_test")
style_wav = c.get("gst_style_input")
for idx, test_sentence in enumerate(test_sentences):
try:
wav, alignment, decoder_output, postnet_output, stop_tokens, inputs = synthesis(
wav, alignment, decoder_output, postnet_output, stop_tokens, _ = synthesis(
model,
test_sentence,
c,
@ -465,10 +464,10 @@ def evaluate(model, criterion, ap, global_step, epoch):
ap.save_wav(wav, file_path)
test_audios['{}-audio'.format(idx)] = wav
test_figures['{}-prediction'.format(idx)] = plot_spectrogram(
postnet_output, ap)
postnet_output, ap, output_fig=False)
test_figures['{}-alignment'.format(idx)] = plot_alignment(
alignment)
except:
alignment, output_fig=False)
except: #pylint: disable=bare-except
print(" !! Error creating Test Sentence -", idx)
traceback.print_exc()
tb_logger.tb_test_audios(global_step, test_audios,
@ -495,28 +494,51 @@ def main(args): # pylint: disable=redefined-outer-name
# load data instances
meta_data_train, meta_data_eval = load_meta_data(c.datasets)
# set the portion of the data used for training
if 'train_portion' in c.keys():
meta_data_train = meta_data_train[:int(len(meta_data_train) * c.train_portion)]
if 'eval_portion' in c.keys():
meta_data_eval = meta_data_eval[:int(len(meta_data_eval) * c.eval_portion)]
# parse speakers
if c.use_speaker_embedding:
speakers = get_speakers(meta_data_train)
if args.restore_path:
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
assert all([speaker in speaker_mapping
for speaker in speakers]), "As of now, you cannot " \
"introduce new speakers to " \
"a previously trained model."
else:
if c.use_external_speaker_embedding_file: # if restore checkpoint and use External Embedding file
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
if not speaker_mapping:
print("WARNING: speakers.json was not found in restore_path, trying to use CONFIG.external_speaker_embedding_file")
speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
if not speaker_mapping:
raise RuntimeError("You must copy the file speakers.json to restore_path, or set a valid file in CONFIG.external_speaker_embedding_file")
speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
elif not c.use_external_speaker_embedding_file: # if restore checkpoint and don't use External Embedding file
prev_out_path = os.path.dirname(args.restore_path)
speaker_mapping = load_speaker_mapping(prev_out_path)
speaker_embedding_dim = None
assert all([speaker in speaker_mapping
for speaker in speakers]), "As of now, you cannot " \
"introduce new speakers to " \
"a previously trained model."
elif c.use_external_speaker_embedding_file and c.external_speaker_embedding_file: # if start new train using External Embedding file
speaker_mapping = load_speaker_mapping(c.external_speaker_embedding_file)
speaker_embedding_dim = len(speaker_mapping[list(speaker_mapping.keys())[0]]['embedding'])
elif c.use_external_speaker_embedding_file and not c.external_speaker_embedding_file: # if starting a new training with an external embedding file but no file was given
raise RuntimeError("use_external_speaker_embedding_file is True, so you need to pass an external speaker embedding file. Run the GE2E-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb or AngularPrototypical-Speaker_Encoder-ExtractSpeakerEmbeddings-by-sample.ipynb notebook in the notebooks/ folder.")
else: # if start new train and don't use External Embedding file
speaker_mapping = {name: i for i, name in enumerate(speakers)}
speaker_embedding_dim = None
save_speaker_mapping(OUT_PATH, speaker_mapping)
num_speakers = len(speaker_mapping)
print("Training with {} speakers: {}".format(num_speakers,
", ".join(speakers)))
else:
num_speakers = 0
speaker_embedding_dim = None
speaker_mapping = None
model = setup_model(num_chars, num_speakers, c)
print(" | > Num output units : {}".format(ap.num_freq), flush=True)
model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim)
params = set_weight_decay(model, c.wd)
optimizer = RAdam(params, lr=c.lr, weight_decay=0)
@ -527,6 +549,14 @@ def main(args): # pylint: disable=redefined-outer-name
else:
optimizer_st = None
if c.apex_amp_level == "O1":
# pylint: disable=import-outside-toplevel
from apex import amp
model.cuda()
model, optimizer = amp.initialize(model, optimizer, opt_level=c.apex_amp_level)
else:
amp = None
# setup criterion
criterion = TacotronLoss(c, stopnet_pos_weight=10.0, ga_sigma=0.4)
@ -539,12 +569,18 @@ def main(args): # pylint: disable=redefined-outer-name
if c.reinit_layers:
raise RuntimeError
model.load_state_dict(checkpoint['model'])
except:
except KeyError:
print(" > Partial model initialization.")
model_dict = model.state_dict()
model_dict = set_init_dict(model_dict, checkpoint, c)
model_dict = set_init_dict(model_dict, checkpoint['model'], c)
# torch.save(model_dict, os.path.join(OUT_PATH, 'state_dict.pt'))
# print("State Dict saved for debug in: ", os.path.join(OUT_PATH, 'state_dict.pt'))
model.load_state_dict(model_dict)
del model_dict
if amp and 'amp' in checkpoint:
amp.load_state_dict(checkpoint['amp'])
for group in optimizer.param_groups:
group['lr'] = c.lr
print(" > Model restored from step %d" % checkpoint['step'],
@ -585,17 +621,16 @@ def main(args): # pylint: disable=redefined-outer-name
if c.bidirectional_decoder:
model.decoder_backward.set_r(r)
print("\n > Number of output frames:", model.decoder.r)
train_avg_loss_dict, global_step = train(model, criterion, optimizer,
optimizer_st, scheduler, ap,
global_step, epoch)
eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch)
global_step, epoch, amp, speaker_mapping)
eval_avg_loss_dict = evaluate(model, criterion, ap, global_step, epoch, speaker_mapping)
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
target_loss = train_avg_loss_dict['avg_postnet_loss']
if c.run_eval:
target_loss = eval_avg_loss_dict['avg_postnet_loss']
best_loss = save_best_model(target_loss, best_loss, model, optimizer, global_step, epoch, c.r,
OUT_PATH)
OUT_PATH, amp_state_dict=amp.state_dict() if amp else None)
if __name__ == '__main__':
@ -647,6 +682,9 @@ if __name__ == '__main__':
check_config(c)
_ = os.path.dirname(os.path.realpath(__file__))
if c.apex_amp_level == 'O1':
print(" > apex AMP level: ", c.apex_amp_level)
OUT_PATH = args.continue_path
if args.continue_path == '':
OUT_PATH = create_experiment_folder(c.output_path, c.run_name, args.debug)
@ -667,7 +705,7 @@ if __name__ == '__main__':
os.chmod(OUT_PATH, 0o775)
LOG_DIR = OUT_PATH
tb_logger = TensorboardLogger(LOG_DIR)
tb_logger = TensorboardLogger(LOG_DIR, model_name='TTS')
# write model desc to tensorboard
tb_logger.tb_add_text('model-description', c['run_description'], 0)

View File

@ -0,0 +1,664 @@
import argparse
import glob
import os
import sys
import time
import traceback
from inspect import signature
import torch
from torch.utils.data import DataLoader
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.utils.console_logger import ConsoleLogger
from mozilla_voice_tts.utils.generic_utils import (KeepAverage,
count_parameters,
create_experiment_folder,
get_git_branch,
remove_experiment_folder,
set_init_dict)
from mozilla_voice_tts.utils.io import copy_config_file, load_config
from mozilla_voice_tts.utils.radam import RAdam
from mozilla_voice_tts.utils.tensorboard_logger import TensorboardLogger
from mozilla_voice_tts.utils.training import setup_torch_training_env
from mozilla_voice_tts.vocoder.datasets.gan_dataset import GANDataset
from mozilla_voice_tts.vocoder.datasets.preprocess import (load_wav_data,
load_wav_feat_data)
# from distribute import (DistributedSampler, apply_gradient_allreduce,
# init_distributed, reduce_tensor)
from mozilla_voice_tts.vocoder.layers.losses import (DiscriminatorLoss,
GeneratorLoss)
from mozilla_voice_tts.vocoder.utils.generic_utils import (plot_results,
setup_discriminator,
setup_generator)
from mozilla_voice_tts.vocoder.utils.io import save_best_model, save_checkpoint
use_cuda, num_gpus = setup_torch_training_env(True, True)
def setup_loader(ap, is_val=False, verbose=False):
if is_val and not c.run_eval:
loader = None
else:
dataset = GANDataset(ap=ap,
items=eval_data if is_val else train_data,
seq_len=c.seq_len,
hop_len=ap.hop_length,
pad_short=c.pad_short,
conv_pad=c.conv_pad,
is_training=not is_val,
return_segments=not is_val,
use_noise_augment=c.use_noise_augment,
use_cache=c.use_cache,
verbose=verbose)
dataset.shuffle_mapping()
# sampler = DistributedSampler(dataset) if num_gpus > 1 else None
loader = DataLoader(dataset,
batch_size=1 if is_val else c.batch_size,
shuffle=True,
drop_last=False,
sampler=None,
num_workers=c.num_val_loader_workers
if is_val else c.num_loader_workers,
pin_memory=False)
return loader
def format_data(data):
if isinstance(data[0], list):
# setup input data
c_G, x_G = data[0]
c_D, x_D = data[1]
# dispatch data to GPU
if use_cuda:
c_G = c_G.cuda(non_blocking=True)
x_G = x_G.cuda(non_blocking=True)
c_D = c_D.cuda(non_blocking=True)
x_D = x_D.cuda(non_blocking=True)
return c_G, x_G, c_D, x_D
# return a whole audio segment
co, x = data
if use_cuda:
co = co.cuda(non_blocking=True)
x = x.cuda(non_blocking=True)
return co, x, None, None
def train(model_G, criterion_G, optimizer_G, model_D, criterion_D, optimizer_D,
scheduler_G, scheduler_D, ap, global_step, epoch):
data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
model_G.train()
model_D.train()
epoch_time = 0
keep_avg = KeepAverage()
if use_cuda:
batch_n_iter = int(
len(data_loader.dataset) / (c.batch_size * num_gpus))
else:
batch_n_iter = int(len(data_loader.dataset) / c.batch_size)
end_time = time.time()
c_logger.print_train_start()
for num_iter, data in enumerate(data_loader):
start_time = time.time()
# format data
c_G, y_G, c_D, y_D = format_data(data)
loader_time = time.time() - end_time
global_step += 1
##############################
# GENERATOR
##############################
# generator pass
y_hat = model_G(c_G)
y_hat_sub = None
y_G_sub = None
y_hat_vis = y_hat # for visualization
# PQMF formatting
if y_hat.shape[1] > 1:
y_hat_sub = y_hat
y_hat = model_G.pqmf_synthesis(y_hat)
y_hat_vis = y_hat
y_G_sub = model_G.pqmf_analysis(y_G)
scores_fake, feats_fake, feats_real = None, None, None
if global_step > c.steps_to_start_discriminator:
# run D with or without cond. features
if len(signature(model_D.forward).parameters) == 2:
D_out_fake = model_D(y_hat, c_G)
else:
D_out_fake = model_D(y_hat)
D_out_real = None
if c.use_feat_match_loss:
with torch.no_grad():
D_out_real = model_D(y_G)
# format D outputs
if isinstance(D_out_fake, tuple):
scores_fake, feats_fake = D_out_fake
if D_out_real is None:
feats_real = None
else:
_, feats_real = D_out_real
else:
scores_fake = D_out_fake
# compute losses
loss_G_dict = criterion_G(y_hat, y_G, scores_fake, feats_fake,
feats_real, y_hat_sub, y_G_sub)
loss_G = loss_G_dict['G_loss']
# optimizer generator
optimizer_G.zero_grad()
loss_G.backward()
if c.gen_clip_grad > 0:
torch.nn.utils.clip_grad_norm_(model_G.parameters(),
c.gen_clip_grad)
optimizer_G.step()
if scheduler_G is not None:
scheduler_G.step()
loss_dict = dict()
for key, value in loss_G_dict.items():
if isinstance(value, int):
loss_dict[key] = value
else:
loss_dict[key] = value.item()
##############################
# DISCRIMINATOR
##############################
if global_step >= c.steps_to_start_discriminator:
# discriminator pass
with torch.no_grad():
y_hat = model_G(c_D)
# PQMF formatting
if y_hat.shape[1] > 1:
y_hat = model_G.pqmf_synthesis(y_hat)
# run D with or without cond. features
if len(signature(model_D.forward).parameters) == 2:
D_out_fake = model_D(y_hat.detach(), c_D)
D_out_real = model_D(y_D, c_D)
else:
D_out_fake = model_D(y_hat.detach())
D_out_real = model_D(y_D)
# format D outputs
if isinstance(D_out_fake, tuple):
scores_fake, feats_fake = D_out_fake
if D_out_real is None:
scores_real, feats_real = None, None
else:
scores_real, feats_real = D_out_real
else:
scores_fake = D_out_fake
scores_real = D_out_real
# compute losses
loss_D_dict = criterion_D(scores_fake, scores_real)
loss_D = loss_D_dict['D_loss']
# optimizer discriminator
optimizer_D.zero_grad()
loss_D.backward()
if c.disc_clip_grad > 0:
torch.nn.utils.clip_grad_norm_(model_D.parameters(),
c.disc_clip_grad)
optimizer_D.step()
if scheduler_D is not None:
scheduler_D.step()
for key, value in loss_D_dict.items():
if isinstance(value, (int, float)):
loss_dict[key] = value
else:
loss_dict[key] = value.item()
step_time = time.time() - start_time
epoch_time += step_time
# get current learning rates
current_lr_G = list(optimizer_G.param_groups)[0]['lr']
current_lr_D = list(optimizer_D.param_groups)[0]['lr']
# update avg stats
update_train_values = dict()
for key, value in loss_dict.items():
update_train_values['avg_' + key] = value
update_train_values['avg_loader_time'] = loader_time
update_train_values['avg_step_time'] = step_time
keep_avg.update_values(update_train_values)
# print training stats
if global_step % c.print_step == 0:
log_dict = {
'step_time': [step_time, 2],
'loader_time': [loader_time, 4],
"current_lr_G": current_lr_G,
"current_lr_D": current_lr_D
}
c_logger.print_train_step(batch_n_iter, num_iter, global_step,
log_dict, loss_dict, keep_avg.avg_values)
# plot step stats
if global_step % 10 == 0:
iter_stats = {
"lr_G": current_lr_G,
"lr_D": current_lr_D,
"step_time": step_time
}
iter_stats.update(loss_dict)
tb_logger.tb_train_iter_stats(global_step, iter_stats)
# save checkpoint
if global_step % c.save_step == 0:
if c.checkpoint:
# save model
save_checkpoint(model_G,
optimizer_G,
scheduler_G,
model_D,
optimizer_D,
scheduler_D,
global_step,
epoch,
OUT_PATH,
model_losses=loss_dict)
# compute spectrograms
figures = plot_results(y_hat_vis, y_G, ap, global_step,
'train')
tb_logger.tb_train_figures(global_step, figures)
# Sample audio
sample_voice = y_hat_vis[0].squeeze(0).detach().cpu().numpy()
tb_logger.tb_train_audios(global_step,
{'train/audio': sample_voice},
c.audio["sample_rate"])
end_time = time.time()
# print epoch stats
c_logger.print_train_epoch_end(global_step, epoch, epoch_time, keep_avg)
# Plot Training Epoch Stats
epoch_stats = {"epoch_time": epoch_time}
epoch_stats.update(keep_avg.avg_values)
tb_logger.tb_train_epoch_stats(global_step, epoch_stats)
# TODO: plot model stats
# if c.tb_model_param_stats:
# tb_logger.tb_model_weights(model, global_step)
return keep_avg.avg_values, global_step
@torch.no_grad()
def evaluate(model_G, criterion_G, model_D, criterion_D, ap, global_step, epoch):
data_loader = setup_loader(ap, is_val=True, verbose=(epoch == 0))
model_G.eval()
model_D.eval()
epoch_time = 0
keep_avg = KeepAverage()
end_time = time.time()
c_logger.print_eval_start()
for num_iter, data in enumerate(data_loader):
start_time = time.time()
# format data
c_G, y_G, _, _ = format_data(data)
loader_time = time.time() - end_time
global_step += 1
##############################
# GENERATOR
##############################
# generator pass
y_hat = model_G(c_G)
y_hat_sub = None
y_G_sub = None
# PQMF formatting
if y_hat.shape[1] > 1:
y_hat_sub = y_hat
y_hat = model_G.pqmf_synthesis(y_hat)
y_G_sub = model_G.pqmf_analysis(y_G)
scores_fake, feats_fake, feats_real = None, None, None
if global_step > c.steps_to_start_discriminator:
if len(signature(model_D.forward).parameters) == 2:
D_out_fake = model_D(y_hat, c_G)
else:
D_out_fake = model_D(y_hat)
D_out_real = None
if c.use_feat_match_loss:
with torch.no_grad():
D_out_real = model_D(y_G)
# format D outputs
if isinstance(D_out_fake, tuple):
scores_fake, feats_fake = D_out_fake
if D_out_real is None:
feats_real = None
else:
_, feats_real = D_out_real
else:
scores_fake = D_out_fake
feats_fake, feats_real = None, None
# compute losses
loss_G_dict = criterion_G(y_hat, y_G, scores_fake, feats_fake,
feats_real, y_hat_sub, y_G_sub)
loss_dict = dict()
for key, value in loss_G_dict.items():
if isinstance(value, (int, float)):
loss_dict[key] = value
else:
loss_dict[key] = value.item()
##############################
# DISCRIMINATOR
##############################
if global_step >= c.steps_to_start_discriminator:
# discriminator pass
with torch.no_grad():
y_hat = model_G(c_G)
# PQMF formatting
if y_hat.shape[1] > 1:
y_hat = model_G.pqmf_synthesis(y_hat)
# run D with or without cond. features
if len(signature(model_D.forward).parameters) == 2:
D_out_fake = model_D(y_hat.detach(), c_G)
D_out_real = model_D(y_G, c_G)
else:
D_out_fake = model_D(y_hat.detach())
D_out_real = model_D(y_G)
# format D outputs
if isinstance(D_out_fake, tuple):
scores_fake, feats_fake = D_out_fake
if D_out_real is None:
scores_real, feats_real = None, None
else:
scores_real, feats_real = D_out_real
else:
scores_fake = D_out_fake
scores_real = D_out_real
# compute losses
loss_D_dict = criterion_D(scores_fake, scores_real)
for key, value in loss_D_dict.items():
if isinstance(value, (int, float)):
loss_dict[key] = value
else:
loss_dict[key] = value.item()
step_time = time.time() - start_time
epoch_time += step_time
# update avg stats
update_eval_values = dict()
for key, value in loss_dict.items():
update_eval_values['avg_' + key] = value
update_eval_values['avg_loader_time'] = loader_time
update_eval_values['avg_step_time'] = step_time
keep_avg.update_values(update_eval_values)
# print eval stats
if c.print_eval:
c_logger.print_eval_step(num_iter, loss_dict, keep_avg.avg_values)
# compute spectrograms
figures = plot_results(y_hat, y_G, ap, global_step, 'eval')
tb_logger.tb_eval_figures(global_step, figures)
# Sample audio
sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
tb_logger.tb_eval_audios(global_step, {'eval/audio': sample_voice},
c.audio["sample_rate"])
# synthesize a full voice
data_loader.return_segments = False
tb_logger.tb_eval_stats(global_step, keep_avg.avg_values)
return keep_avg.avg_values
# FIXME: move args definition/parsing inside of main?
def main(args): # pylint: disable=redefined-outer-name
# pylint: disable=global-variable-undefined
global train_data, eval_data
print(f" > Loading wavs from: {c.data_path}")
if c.feature_path is not None:
print(f" > Loading features from: {c.feature_path}")
eval_data, train_data = load_wav_feat_data(c.data_path, c.feature_path, c.eval_split_size)
else:
eval_data, train_data = load_wav_data(c.data_path, c.eval_split_size)
# setup audio processor
ap = AudioProcessor(**c.audio)
# DISTRIBUTED
# if num_gpus > 1:
# init_distributed(args.rank, num_gpus, args.group_id,
# c.distributed["backend"], c.distributed["url"])
# setup models
model_gen = setup_generator(c)
model_disc = setup_discriminator(c)
# setup optimizers
optimizer_gen = RAdam(model_gen.parameters(), lr=c.lr_gen, weight_decay=0)
optimizer_disc = RAdam(model_disc.parameters(),
lr=c.lr_disc,
weight_decay=0)
# schedulers
scheduler_gen = None
scheduler_disc = None
if 'lr_scheduler_gen' in c:
scheduler_gen = getattr(torch.optim.lr_scheduler, c.lr_scheduler_gen)
scheduler_gen = scheduler_gen(optimizer_gen, **c.lr_scheduler_gen_params)
if 'lr_scheduler_disc' in c:
scheduler_disc = getattr(torch.optim.lr_scheduler, c.lr_scheduler_disc)
scheduler_disc = scheduler_disc(optimizer_disc, **c.lr_scheduler_disc_params)
# setup criterion
criterion_gen = GeneratorLoss(c)
criterion_disc = DiscriminatorLoss(c)
if args.restore_path:
checkpoint = torch.load(args.restore_path, map_location='cpu')
try:
print(" > Restoring Generator Model...")
model_gen.load_state_dict(checkpoint['model'])
print(" > Restoring Generator Optimizer...")
optimizer_gen.load_state_dict(checkpoint['optimizer'])
print(" > Restoring Discriminator Model...")
model_disc.load_state_dict(checkpoint['model_disc'])
print(" > Restoring Discriminator Optimizer...")
optimizer_disc.load_state_dict(checkpoint['optimizer_disc'])
if 'scheduler' in checkpoint:
print(" > Restoring Generator LR Scheduler...")
scheduler_gen.load_state_dict(checkpoint['scheduler'])
# NOTE: Not sure if necessary
scheduler_gen.optimizer = optimizer_gen
if 'scheduler_disc' in checkpoint:
print(" > Restoring Discriminator LR Scheduler...")
scheduler_disc.load_state_dict(checkpoint['scheduler_disc'])
scheduler_disc.optimizer = optimizer_disc
except RuntimeError:
# restore only matching layers.
print(" > Partial model initialization...")
model_dict = model_gen.state_dict()
model_dict = set_init_dict(model_dict, checkpoint['model'], c)
model_gen.load_state_dict(model_dict)
model_dict = model_disc.state_dict()
model_dict = set_init_dict(model_dict, checkpoint['model_disc'], c)
model_disc.load_state_dict(model_dict)
del model_dict
# reset lr if not continuing training.
for group in optimizer_gen.param_groups:
group['lr'] = c.lr_gen
for group in optimizer_disc.param_groups:
group['lr'] = c.lr_disc
print(" > Model restored from step %d" % checkpoint['step'],
flush=True)
args.restore_step = checkpoint['step']
else:
args.restore_step = 0
if use_cuda:
model_gen.cuda()
criterion_gen.cuda()
model_disc.cuda()
criterion_disc.cuda()
# DISTRIBUTED
# if num_gpus > 1:
# model = apply_gradient_allreduce(model)
num_params = count_parameters(model_gen)
print(" > Generator has {} parameters".format(num_params), flush=True)
num_params = count_parameters(model_disc)
print(" > Discriminator has {} parameters".format(num_params), flush=True)
if 'best_loss' not in locals():
best_loss = float('inf')
global_step = args.restore_step
for epoch in range(0, c.epochs):
c_logger.print_epoch_start(epoch, c.epochs)
_, global_step = train(model_gen, criterion_gen, optimizer_gen,
model_disc, criterion_disc, optimizer_disc,
scheduler_gen, scheduler_disc, ap, global_step,
epoch)
eval_avg_loss_dict = evaluate(model_gen, criterion_gen, model_disc, criterion_disc, ap,
global_step, epoch)
c_logger.print_epoch_end(epoch, eval_avg_loss_dict)
target_loss = eval_avg_loss_dict[c.target_loss]
best_loss = save_best_model(target_loss,
best_loss,
model_gen,
optimizer_gen,
scheduler_gen,
model_disc,
optimizer_disc,
scheduler_disc,
global_step,
epoch,
OUT_PATH,
model_losses=eval_avg_loss_dict)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--continue_path',
type=str,
help=
'Training output folder to continue training. Use it to continue a previous training run. If it is used, "config_path" is ignored.',
default='',
required='--config_path' not in sys.argv)
parser.add_argument(
'--restore_path',
type=str,
help='Model file to be restored. Use to finetune a model.',
default='')
parser.add_argument('--config_path',
type=str,
help='Path to config file for training.',
required='--continue_path' not in sys.argv)
parser.add_argument('--debug',
type=bool,
default=False,
help='Do not verify commit integrity to run training.')
# DISTRIBUTED
parser.add_argument(
'--rank',
type=int,
default=0,
help='DISTRIBUTED: process rank for distributed training.')
parser.add_argument('--group_id',
type=str,
default="",
help='DISTRIBUTED: process group id.')
args = parser.parse_args()
if args.continue_path != '':
args.output_path = args.continue_path
args.config_path = os.path.join(args.continue_path, 'config.json')
list_of_files = glob.glob(
args.continue_path +
"/*.pth.tar") # * means all if need specific format then *.csv
latest_model_file = max(list_of_files, key=os.path.getctime)
args.restore_path = latest_model_file
print(f" > Training continues for {args.restore_path}")
# setup output paths and read configs
c = load_config(args.config_path)
# check_config(c)
_ = os.path.dirname(os.path.realpath(__file__))
OUT_PATH = args.continue_path
if args.continue_path == '':
OUT_PATH = create_experiment_folder(c.output_path, c.run_name,
args.debug)
AUDIO_PATH = os.path.join(OUT_PATH, 'test_audios')
c_logger = ConsoleLogger()
if args.rank == 0:
os.makedirs(AUDIO_PATH, exist_ok=True)
new_fields = {}
if args.restore_path:
new_fields["restore_path"] = args.restore_path
new_fields["github_branch"] = get_git_branch()
copy_config_file(args.config_path,
os.path.join(OUT_PATH, 'config.json'), new_fields)
os.chmod(AUDIO_PATH, 0o775)
os.chmod(OUT_PATH, 0o775)
LOG_DIR = OUT_PATH
tb_logger = TensorboardLogger(LOG_DIR, model_name='VOCODER')
# write model desc to tensorboard
tb_logger.tb_add_text('model-description', c['run_description'], 0)
try:
main(args)
except KeyboardInterrupt:
remove_experiment_folder(OUT_PATH)
try:
sys.exit(0)
except SystemExit:
os._exit(0) # pylint: disable=protected-access
except Exception: # pylint: disable=broad-except
remove_experiment_folder(OUT_PATH)
traceback.print_exc()
sys.exit(1)

View File

@ -15,7 +15,7 @@ If you have the environment set already for TTS, then you can directly call ```s
3. source /tmp/venv/bin/activate
4. pip install -U pip setuptools wheel
5. pip install -U https://example.com/url/to/python/package.whl
6. python -m TTS.server.server
6. python -m mozilla_voice_tts.server.server
You can now open http://localhost:5002 in a browser
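If the server is running, you can also fetch synthesized audio programmatically. Below is a minimal sketch, assuming the demo endpoint is `/api/tts` with a `text` query parameter; check `server.py` for the exact route in your version.

```python
# Minimal sketch: request a synthesized wav from a locally running demo server.
# The '/api/tts' route and the 'text' query parameter are assumptions - check
# mozilla_voice_tts/server/server.py for the exact endpoint.
import urllib.parse
import urllib.request

text = "Hello from the demo server."
url = "http://localhost:5002/api/tts?text=" + urllib.parse.quote(text)
with urllib.request.urlopen(url) as response, open("tts_output.wav", "wb") as out_file:
    out_file.write(response.read())  # the endpoint returns audio/wav bytes
```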

View File

@ -3,11 +3,13 @@
"tts_file":"best_model.pth.tar", // tts checkpoint file
"tts_config":"config.json", // tts config.json file
"tts_speakers": null, // json file listing speaker ids. null if no speaker embedding.
"vocoder_config":null,
"vocoder_file": null,
"wavernn_lib_path": null, // Rootpath to wavernn project folder to be imported. If this is null, model uses GL for speech synthesis.
"wavernn_path":null, // wavernn model root path
"wavernn_file":null, // wavernn checkpoint file name
"wavernn_config": null, // wavernn config file
"is_wavernn_batched":true,
"is_wavernn_batched":true,
"port": 5002,
"use_cuda": true,
"debug": true

View File

@ -3,7 +3,7 @@ import argparse
import os
from flask import Flask, request, render_template, send_file
from TTS.server.synthesizer import Synthesizer
from mozilla_voice_tts.server.synthesizer import Synthesizer
def create_argparser():
@ -15,12 +15,11 @@ def create_argparser():
parser.add_argument('--tts_config', type=str, help='path to TTS config.json file')
parser.add_argument('--tts_speakers', type=str, help='path to JSON file containing speaker ids, if speaker ids are used in the model')
parser.add_argument('--wavernn_lib_path', type=str, default=None, help='path to WaveRNN project folder to be imported. If this is not passed, model uses Griffin-Lim for synthesis.')
parser.add_argument('--wavernn_file', type=str, default=None, help='path to WaveRNN checkpoint file.')
parser.add_argument('--wavernn_checkpoint', type=str, default=None, help='path to WaveRNN checkpoint file.')
parser.add_argument('--wavernn_config', type=str, default=None, help='path to WaveRNN config file.')
parser.add_argument('--is_wavernn_batched', type=convert_boolean, default=False, help='true to use batched WaveRNN.')
parser.add_argument('--pwgan_lib_path', type=str, default=None, help='path to ParallelWaveGAN project folder to be imported. If this is not passed, model uses Griffin-Lim for synthesis.')
parser.add_argument('--pwgan_file', type=str, default=None, help='path to ParallelWaveGAN checkpoint file.')
parser.add_argument('--pwgan_config', type=str, default=None, help='path to ParallelWaveGAN config file.')
parser.add_argument('--vocoder_config', type=str, default=None, help='path to mozilla_voice_tts.vocoder config file.')
parser.add_argument('--vocoder_checkpoint', type=str, default=None, help='path to mozilla_voice_tts.vocoder checkpoint file.')
parser.add_argument('--port', type=int, default=5002, help='port to listen on.')
parser.add_argument('--use_cuda', type=convert_boolean, default=False, help='true to use CUDA.')
parser.add_argument('--debug', type=convert_boolean, default=False, help='true to enable Flask debug mode.')
@ -35,14 +34,15 @@ embedded_tts_folder = os.path.join(embedded_models_folder, 'tts')
tts_checkpoint_file = os.path.join(embedded_tts_folder, 'checkpoint.pth.tar')
tts_config_file = os.path.join(embedded_tts_folder, 'config.json')
embedded_vocoder_folder = os.path.join(embedded_models_folder, 'vocoder')
vocoder_checkpoint_file = os.path.join(embedded_vocoder_folder, 'checkpoint.pth.tar')
vocoder_config_file = os.path.join(embedded_vocoder_folder, 'config.json')
# These models are soon to be deprecated
embedded_wavernn_folder = os.path.join(embedded_models_folder, 'wavernn')
wavernn_checkpoint_file = os.path.join(embedded_wavernn_folder, 'checkpoint.pth.tar')
wavernn_config_file = os.path.join(embedded_wavernn_folder, 'config.json')
embedded_pwgan_folder = os.path.join(embedded_models_folder, 'pwgan')
pwgan_checkpoint_file = os.path.join(embedded_pwgan_folder, 'checkpoint.pkl')
pwgan_config_file = os.path.join(embedded_pwgan_folder, 'config.yml')
args = create_argparser().parse_args()
# If these were not specified in the CLI args, use default values with embedded model files
@ -50,14 +50,16 @@ if not args.tts_checkpoint and os.path.isfile(tts_checkpoint_file):
args.tts_checkpoint = tts_checkpoint_file
if not args.tts_config and os.path.isfile(tts_config_file):
args.tts_config = tts_config_file
if not args.wavernn_file and os.path.isfile(wavernn_checkpoint_file):
args.wavernn_file = wavernn_checkpoint_file
if not args.vocoder_checkpoint and os.path.isfile(vocoder_checkpoint_file):
args.vocoder_checkpoint = vocoder_checkpoint_file
if not args.vocoder_config and os.path.isfile(vocoder_config_file):
args.vocoder_config = vocoder_config_file
if not args.wavernn_checkpoint and os.path.isfile(wavernn_checkpoint_file):
args.wavernn_checkpoint = wavernn_checkpoint_file
if not args.wavernn_config and os.path.isfile(wavernn_config_file):
args.wavernn_config = wavernn_config_file
if not args.pwgan_file and os.path.isfile(pwgan_checkpoint_file):
args.pwgan_file = pwgan_checkpoint_file
if not args.pwgan_config and os.path.isfile(pwgan_config_file):
args.pwgan_config = pwgan_config_file
synthesizer = Synthesizer(args)
@ -76,5 +78,9 @@ def tts():
return send_file(data, mimetype='audio/wav')
if __name__ == '__main__':
def main():
app.run(debug=args.debug, host='0.0.0.0', port=args.port)
if __name__ == '__main__':
main()

View File

@ -1,45 +1,44 @@
import io
import re
import sys
import time
import numpy as np
import torch
import yaml
import pysbd
from TTS.utils.audio import AudioProcessor
from TTS.utils.io import load_config
from TTS.utils.generic_utils import setup_model
from TTS.utils.speakers import load_speaker_mapping
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.utils.io import load_config
from mozilla_voice_tts.tts.utils.generic_utils import setup_model
from mozilla_voice_tts.tts.utils.speakers import load_speaker_mapping
from mozilla_voice_tts.vocoder.utils.generic_utils import setup_generator
# pylint: disable=unused-wildcard-import
# pylint: disable=wildcard-import
from TTS.utils.synthesis import *
from mozilla_voice_tts.tts.utils.synthesis import *
from TTS.utils.text import make_symbols, phonemes, symbols
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"
from mozilla_voice_tts.tts.utils.text import make_symbols, phonemes, symbols
class Synthesizer(object):
def __init__(self, config):
self.wavernn = None
self.pwgan = None
self.vocoder_model = None
self.config = config
print(config)
self.seg = self.get_segmenter("en")
self.use_cuda = self.config.use_cuda
if self.use_cuda:
assert torch.cuda.is_available(), "CUDA is not availabe on this machine."
self.load_tts(self.config.tts_checkpoint, self.config.tts_config,
self.config.use_cuda)
if self.config.vocoder_checkpoint:
self.load_vocoder(self.config.vocoder_checkpoint, self.config.vocoder_config, self.config.use_cuda)
if self.config.wavernn_lib_path:
self.load_wavernn(self.config.wavernn_lib_path, self.config.wavernn_file,
self.load_wavernn(self.config.wavernn_lib_path, self.config.wavernn_checkpoint,
self.config.wavernn_config, self.config.use_cuda)
if self.config.pwgan_file:
self.load_pwgan(self.config.pwgan_lib_path, self.config.pwgan_file,
self.config.pwgan_config, self.config.use_cuda)
@staticmethod
def get_segmenter(lang):
return pysbd.Segmenter(language=lang, clean=True)
def load_tts(self, tts_checkpoint, tts_config, use_cuda):
# pylint: disable=global-statement
@ -77,6 +76,19 @@ class Synthesizer(object):
self.tts_model.decoder.max_decoder_steps = 3000
if 'r' in cp:
self.tts_model.decoder.set_r(cp['r'])
print(f" > model reduction factor: {cp['r']}")
def load_vocoder(self, model_file, model_config, use_cuda):
self.vocoder_config = load_config(model_config)
self.vocoder_model = setup_generator(self.vocoder_config)
self.vocoder_model.load_state_dict(torch.load(model_file, map_location="cpu")["model"])
self.vocoder_model.remove_weight_norm()
self.vocoder_model.inference_padding = 0
self.vocoder_config = load_config(model_config)
if use_cuda:
self.vocoder_model.cuda()
self.vocoder_model.eval()
def load_wavernn(self, lib_path, model_file, model_config, use_cuda):
# TODO: set a function in wavernn code base for model setup and call it here.
@ -112,65 +124,16 @@ class Synthesizer(object):
self.wavernn.cuda()
self.wavernn.eval()
def load_pwgan(self, lib_path, model_file, model_config, use_cuda):
if lib_path:
# set this if ParallelWaveGAN is not installed globally
sys.path.append(lib_path)
try:
#pylint: disable=import-outside-toplevel
from parallel_wavegan.models import ParallelWaveGANGenerator
except ImportError as e:
raise RuntimeError(f"cannot import parallel-wavegan, either install it or set its directory using the --pwgan_lib_path command line argument: {e}")
print(" > Loading PWGAN model ...")
print(" | > model config: ", model_config)
print(" | > model file: ", model_file)
with open(model_config) as f:
self.pwgan_config = yaml.load(f, Loader=yaml.Loader)
self.pwgan = ParallelWaveGANGenerator(**self.pwgan_config["generator_params"])
self.pwgan.load_state_dict(torch.load(model_file, map_location="cpu")["model"]["generator"])
self.pwgan.remove_weight_norm()
if use_cuda:
self.pwgan.cuda()
self.pwgan.eval()
def save_wav(self, wav, path):
# wav *= 32767 / max(1e-8, np.max(np.abs(wav)))
wav = np.array(wav)
self.ap.save_wav(wav, path)
@staticmethod
def split_into_sentences(text):
text = " " + text + " <stop>"
text = text.replace("\n", " ")
text = re.sub(prefixes, "\\1<prd>", text)
text = re.sub(websites, "<prd>\\1", text)
if "Ph.D" in text:
text = text.replace("Ph.D.", "Ph<prd>D<prd>")
text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
text = re.sub(acronyms+" "+starters, "\\1<stop> \\2", text)
text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
text = re.sub(" "+suffixes+"[.] "+starters, " \\1<stop> \\2", text)
text = re.sub(" "+suffixes+"[.]", " \\1<prd>", text)
text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
if "" in text:
text = text.replace(".”", "”.")
if "\"" in text:
text = text.replace(".\"", "\".")
if "!" in text:
text = text.replace("!\"", "\"!")
if "?" in text:
text = text.replace("?\"", "\"?")
text = text.replace(".", ".<stop>")
text = text.replace("?", "?<stop>")
text = text.replace("!", "!<stop>")
text = text.replace("<prd>", ".")
sentences = text.split("<stop>")
sentences = sentences[:-1]
sentences = list(filter(None, [s.strip() for s in sentences])) # remove empty sentences
return sentences
def split_into_sentences(self, text):
return self.seg.segment(text)
def tts(self, text, speaker_id=None):
start_time = time.time()
wavs = []
sens = self.split_into_sentences(text)
print(sens)
@ -184,29 +147,35 @@ class Synthesizer(object):
inputs = numpy_to_torch(inputs, torch.long, cuda=self.use_cuda)
inputs = inputs.unsqueeze(0)
# synthesize voice
decoder_output, postnet_output, alignments, stop_tokens = run_model_torch(
self.tts_model, inputs, self.tts_config, False, speaker_id, None)
# convert outputs to numpy
postnet_output, decoder_output, _, _ = parse_outputs_torch(
postnet_output, decoder_output, alignments, stop_tokens)
if self.pwgan:
vocoder_input = torch.FloatTensor(postnet_output.T).unsqueeze(0)
_, postnet_output, _, _ = run_model_torch(self.tts_model, inputs, self.tts_config, False, speaker_id, None)
if self.vocoder_model:
# use native vocoder model
vocoder_input = postnet_output[0].transpose(0, 1).unsqueeze(0)
wav = self.vocoder_model.inference(vocoder_input)
if self.use_cuda:
vocoder_input.cuda()
wav = self.pwgan.inference(vocoder_input, hop_size=self.ap.hop_length)
wav = wav.cpu().numpy()
else:
wav = wav.numpy()
wav = wav.flatten()
elif self.wavernn:
# use 3rd party WaveRNN
vocoder_input = None
if self.tts_config.model == "Tacotron":
vocoder_input = torch.FloatTensor(self.ap.out_linear_to_mel(linear_spec=postnet_output.T).T).T.unsqueeze(0)
else:
vocoder_input = torch.FloatTensor(postnet_output.T).unsqueeze(0)
vocoder_input = postnet_output[0].transpose(0, 1).unsqueeze(0)
if self.use_cuda:
vocoder_input.cuda()
wav = self.wavernn.generate(vocoder_input, batched=self.config.is_wavernn_batched, target=11000, overlap=550)
else:
# use GL
if self.use_cuda:
postnet_output = postnet_output[0].cpu()
else:
postnet_output = postnet_output[0]
postnet_output = postnet_output.numpy()
wav = inv_spectrogram(postnet_output, self.ap, self.tts_config)
# trim silence
wav = trim_silence(wav, self.ap)
@ -215,4 +184,10 @@ class Synthesizer(object):
out = io.BytesIO()
self.save_wav(wavs, out)
# compute stats
process_time = time.time() - start_time
audio_time = len(wavs) / self.tts_config.audio['sample_rate']
print(f" > Processing time: {process_time}")
print(f" > Real-time factor: {process_time / audio_time}")
return out

View File

@ -8,10 +8,10 @@
<meta name="description" content="">
<meta name="author" content="">
<title>Mozillia - Text2Speech engine</title>
<title>Mozilla - Text2Speech engine</title>
<!-- Bootstrap core CSS -->
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.1/css/bootstrap.min.css"
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.1/css/bootstrap.min.css"
integrity="sha384-WskhaSGFgHYWDcbwN70/dfYBj47jz9qbsMId/iRN3ewGhXQFZCSftd1LZCfmhktB" crossorigin="anonymous" rel="stylesheet">
<!-- Custom styles for this template -->
@ -27,7 +27,7 @@
</style>
</head>
<body>
<a href="https://github.com/mozilla/TTS"><img style="position: absolute; z-index:1000; top: 0; left: 0; border: 0;" src="https://s3.amazonaws.com/github/ribbons/forkme_left_darkblue_121621.png" alt="Fork me on GitHub"></a>
@ -60,7 +60,7 @@
<h1 class="mt-5">Mozilla TTS</h1>
<ul class="list-unstyled">
</ul>
<input id="text" placeholder="Type here..." size=45 type="text" name="text">
<input id="text" placeholder="Type here..." size=45 type="text" name="text">
<button id="speak-button" name="speak">Speak</button><br/><br/>
<audio id="audio" controls autoplay hidden></audio>
<p id="message"></p>

View File

@ -1,16 +1,16 @@
### Speaker embedding (Experimental)
### Speaker Encoder
This is an implementation of https://arxiv.org/abs/1710.10467. This model can be used for voice and speaker embedding.
With the code here you can generate d-vectors for both multi-speaker and single-speaker TTS datasets, then visualise and explore them along with the associated audio files in an interactive chart.
Below is an example showing embedding results of various speakers. You can generate the same plot with the provided notebook as demonstrated in [this video](https://youtu.be/KW3oO7JVa7Q).
Below is an example showing embedding results of various speakers. You can generate the same plot with the provided notebook as demonstrated in [this video](https://youtu.be/KW3oO7JVa7Q).
![](umap.png)
Download a pretrained model from the [Released Models](https://github.com/mozilla/TTS/wiki/Released-Models) page.
To run the code, you need to follow the same flow as in TTS.
To run the code, you need to follow the same flow as in mozilla_voice_tts.
- Define 'config.json' for your needs. Note that audio parameters should match your TTS model.
- Example training call ```python speaker_encoder/train.py --config_path speaker_encoder/config.json --data_path ~/Data/Libri-TTS/train-clean-360```
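To make the d-vector extraction described above concrete, the sketch below loads a trained encoder checkpoint and computes an embedding for a single utterance. The paths, the checkpoint file name, and the `compute_embedding` call are assumptions; adapt them to your checkpoint and to the methods your `SpeakerEncoder` version exposes.

```python
# Minimal sketch (assumed paths and method names) for extracting a d-vector
# from one utterance with a trained speaker encoder checkpoint.
import torch

from mozilla_voice_tts.speaker_encoder.model import SpeakerEncoder
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.utils.io import load_config

c = load_config("speaker_encoder/config.json")                # assumed config path
ap = AudioProcessor(**c.audio)

model = SpeakerEncoder(**c.model)
state = torch.load("best_model.pth.tar", map_location="cpu")  # assumed checkpoint file
model.load_state_dict(state["model"])
model.eval()

wav = ap.load_wav("speaker_sample.wav", sr=ap.sample_rate)    # any utterance of the target speaker
mel = ap.melspectrogram(wav)                                  # (num_mels, T)
mel = torch.FloatTensor(mel.T).unsqueeze(0)                   # (1, T, num_mels)
with torch.no_grad():
    # 'compute_embedding' is assumed here; some versions expose 'inference' instead.
    d_vector = model.compute_embedding(mel)
print(d_vector.shape)                                         # -> (1, proj_dim)
```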

View File

@ -6,9 +6,9 @@ import numpy as np
from tqdm import tqdm
import torch
from TTS.speaker_encoder.model import SpeakerEncoder
from TTS.utils.audio import AudioProcessor
from TTS.utils.generic_utils import load_config
from mozilla_voice_tts.speaker_encoder.model import SpeakerEncoder
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.utils.io import load_config
parser = argparse.ArgumentParser(
description='Compute embedding vectors for each wav file in a dataset. ')

View File

@ -0,0 +1,61 @@
{
"run_name": "Model compatible to CorentinJ/Real-Time-Voice-Cloning",
"run_description": "train speaker encoder with voxceleb1, voxceleb2 and libriSpeech ",
"audio":{
// Audio processing parameters
"num_mels": 40, // size of the mel spec frame.
"fft_size": 400, // number of stft frequency levels. Size of the linear spectogram frame.
"sample_rate": 16000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"win_length": 400, // stft window length in ms.
"hop_length": 160, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
"preemphasis": 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"min_level_db": -100, // normalization range
"ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
"power": 1.5, // value to sharpen wav signals after GL algorithm.
"griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
// Normalization parameters
"signal_norm": true, // normalize the spec values in range [0, 1]
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
"do_trim_silence": false, // enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60 // threshold for timming silence. Set this according to your dataset.
},
"reinit_layers": [],
"loss": "ge2e", // "ge2e" to use Generalized End-to-End loss and "angleproto" to use Angular Prototypical loss (new SOTA)
"grad_clip": 3.0, // upper limit for gradients for clipping.
"epochs": 1000, // total number of epochs to train.
"lr": 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_decay": false, // if true, Noam learning rate decaying is applied through training.
"warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
"steps_plot_stats": 10, // number of steps to plot embeddings.
"num_speakers_in_batch": 32, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"wd": 0.000001, // Weight decay weight.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"save_step": 1000, // Number of training steps expected to save traning stats and checkpoints.
"print_step": 1, // Number of steps to log traning on console.
"output_path": "../../checkpoints/voxceleb_librispeech/speaker_encoder/", // DATASET-RELATED: output path for all training outputs.
"model": {
"input_dim": 40,
"proj_dim": 256,
"lstm_dim": 256,
"num_lstm_layers": 3,
"use_lstm_with_projection": false
},
"datasets":
[
{
"name": "vctk",
"path": "../../../datasets/VCTK-Corpus-removed-silence/",
"meta_file_train": null,
"meta_file_val": null
}
]
}

View File

@ -9,7 +9,7 @@ class MyDataset(Dataset):
num_utter_per_speaker=10, skip_speakers=False, verbose=False):
"""
Args:
ap (TTS.utils.AudioProcessor): audio processor object.
ap (mozilla_voice_tts.tts.utils.AudioProcessor): audio processor object.
meta_data (list): list of dataset instances.
seq_len (int): voice segment length in seconds.
verbose (bool): print diagnostic information.
@ -31,7 +31,7 @@ class MyDataset(Dataset):
print(f" | > Num speakers: {len(self.speakers)}")
def load_wav(self, filename):
audio = self.ap.load_wav(filename)
audio = self.ap.load_wav(filename, sr=self.ap.sample_rate)
return audio
def load_data(self, idx):

View File

@ -15,7 +15,7 @@ def save_checkpoint(model, optimizer, model_loss, out_path,
'optimizer': optimizer.state_dict() if optimizer is not None else None,
'step': current_step,
'epoch': epoch,
'GE2Eloss': model_loss,
'loss': model_loss,
'date': datetime.date.today().strftime("%B %d, %Y"),
}
torch.save(state, checkpoint_path)
@ -29,7 +29,7 @@ def save_best_model(model, optimizer, model_loss, best_loss, out_path,
'model': new_state_dict,
'optimizer': optimizer.state_dict(),
'step': current_step,
'GE2Eloss': model_loss,
'loss': model_loss,
'date': datetime.date.today().strftime("%B %d, %Y"),
}
best_loss = model_loss
@ -38,4 +38,4 @@ def save_best_model(model, optimizer, model_loss, best_loss, out_path,
print("\n > BEST MODEL ({0:.5f}) : {1:}".format(
model_loss, bestmodel_path))
torch.save(state, bestmodel_path)
return best_loss
return best_loss

View File

@ -1,7 +1,7 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
# adapted from https://github.com/cvqluu/GE2E-Loss
class GE2ELoss(nn.Module):
@ -23,6 +23,8 @@ class GE2ELoss(nn.Module):
self.b = nn.Parameter(torch.tensor(init_b))
self.loss_method = loss_method
print(' > Initialised Generalized End-to-End loss')
assert self.loss_method in ["softmax", "contrast"]
if self.loss_method == "softmax":
@ -119,3 +121,40 @@ class GE2ELoss(nn.Module):
cos_sim_matrix = self.w * cos_sim_matrix + self.b
L = self.embed_loss(dvecs, cos_sim_matrix)
return L.mean()
# adapted from https://github.com/clovaai/voxceleb_trainer/blob/master/loss/angleproto.py
class AngleProtoLoss(nn.Module):
"""
Implementation of the Angular Prototypical loss defined in https://arxiv.org/abs/2003.11982
Accepts an input of size (N, M, D)
where N is the number of speakers in the batch,
M is the number of utterances per speaker,
and D is the dimensionality of the embedding vector
Args:
- init_w (float): defines the initial value of w
- init_b (float): defines the initial value of b
"""
def __init__(self, init_w=10.0, init_b=-5.0):
super(AngleProtoLoss, self).__init__()
# pylint: disable=E1102
self.w = nn.Parameter(torch.tensor(init_w))
# pylint: disable=E1102
self.b = nn.Parameter(torch.tensor(init_b))
self.criterion = torch.nn.CrossEntropyLoss()
print(' > Initialised Angular Prototypical loss')
def forward(self, x):
"""
Calculates the AngleProto loss for an input of dimensions (num_speakers, num_utts_per_speaker, dvec_feats)
"""
out_anchor = torch.mean(x[:, 1:, :], 1)
out_positive = x[:, 0, :]
num_speakers = out_anchor.size()[0]
cos_sim_matrix = F.cosine_similarity(out_positive.unsqueeze(-1).expand(-1, -1, num_speakers), out_anchor.unsqueeze(-1).expand(-1, -1, num_speakers).transpose(0, 2))
torch.clamp(self.w, 1e-6)
cos_sim_matrix = cos_sim_matrix * self.w + self.b
label = torch.from_numpy(np.asarray(range(0, num_speakers))).to(cos_sim_matrix.device)
L = self.criterion(cos_sim_matrix, label)
return L
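Both losses consume the same batch layout, so switching between `"ge2e"` and `"angleproto"` in the config does not change the training loop. A minimal sketch with random embeddings (the `losses` module path is an assumption; only the `(N, M, D)` convention matters here):

```python
import torch

# module path assumed; the classes are the ones defined above
from mozilla_voice_tts.speaker_encoder.losses import GE2ELoss, AngleProtoLoss

N, M, D = 32, 10, 256             # speakers per batch, utterances per speaker, embedding dim
dvecs = torch.randn(N, M, D)

ge2e = GE2ELoss()                 # defaults to the softmax variant
angle_proto = AngleProtoLoss()

print(ge2e(dvecs).item())         # scalar loss
print(angle_proto(dvecs).item())  # scalar loss
```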

View File

@ -16,15 +16,33 @@ class LSTMWithProjection(nn.Module):
o, (_, _) = self.lstm(x)
return self.linear(o)
class LSTMWithoutProjection(nn.Module):
def __init__(self, input_dim, lstm_dim, proj_dim, num_lstm_layers):
super().__init__()
self.lstm = nn.LSTM(input_size=input_dim,
hidden_size=lstm_dim,
num_layers=num_lstm_layers,
batch_first=True)
self.linear = nn.Linear(lstm_dim, proj_dim, bias=True)
self.relu = nn.ReLU()
def forward(self, x):
_, (hidden, _) = self.lstm(x)
return self.relu(self.linear(hidden[-1]))
class SpeakerEncoder(nn.Module):
def __init__(self, input_dim, proj_dim=256, lstm_dim=768, num_lstm_layers=3):
def __init__(self, input_dim, proj_dim=256, lstm_dim=768, num_lstm_layers=3, use_lstm_with_projection=True):
super().__init__()
self.use_lstm_with_projection = use_lstm_with_projection
layers = []
layers.append(LSTMWithProjection(input_dim, lstm_dim, proj_dim))
for _ in range(num_lstm_layers - 1):
layers.append(LSTMWithProjection(proj_dim, lstm_dim, proj_dim))
self.layers = nn.Sequential(*layers)
# choose the LSTM variant
if use_lstm_with_projection:
layers.append(LSTMWithProjection(input_dim, lstm_dim, proj_dim))
for _ in range(num_lstm_layers - 1):
layers.append(LSTMWithProjection(proj_dim, lstm_dim, proj_dim))
self.layers = nn.Sequential(*layers)
else:
self.layers = LSTMWithoutProjection(input_dim, lstm_dim, proj_dim, num_lstm_layers)
self._init_layers()
def _init_layers(self):
@ -37,12 +55,18 @@ class SpeakerEncoder(nn.Module):
def forward(self, x):
# TODO: implement state passing for lstms
d = self.layers(x)
d = torch.nn.functional.normalize(d[:, -1], p=2, dim=1)
if self.use_lstm_with_projection:
d = torch.nn.functional.normalize(d[:, -1], p=2, dim=1)
else:
d = torch.nn.functional.normalize(d, p=2, dim=1)
return d
def inference(self, x):
d = self.layers.forward(x)
d = torch.nn.functional.normalize(d[:, -1], p=2, dim=1)
if self.use_lstm_with_projection:
d = torch.nn.functional.normalize(d[:, -1], p=2, dim=1)
else:
d = torch.nn.functional.normalize(d, p=2, dim=1)
return d
def compute_embedding(self, x, num_frames=160, overlap=0.5):
@ -85,4 +109,3 @@ class SpeakerEncoder(nn.Module):
frames[cur_iter <= num_iters, :, :]
)
return embed / num_iters
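Whichever LSTM variant is selected, the encoder exposes the same interface: an L2-normalized `(B, proj_dim)` embedding. A quick sketch with random inputs, using dimensions from the speaker encoder config above:

```python
import torch
from mozilla_voice_tts.speaker_encoder.model import SpeakerEncoder

x = torch.rand(4, 160, 40)        # (B, T_frames, num_mels)

for use_proj in (True, False):
    model = SpeakerEncoder(input_dim=40, proj_dim=256, lstm_dim=256,
                           num_lstm_layers=3,
                           use_lstm_with_projection=use_proj)
    d = model(x)
    print(use_proj, d.shape, d.norm(dim=1))   # torch.Size([4, 256]), norms ~1.0
```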

View File

(binary image changed: before 24 KiB, after 24 KiB)

View File

@ -1,24 +1,24 @@
{
"model": "Tacotron2",
"run_name": "ljspeech",
"run_description": "tacotron2",
"run_name": "ljspeech-ddc-bn",
"run_description": "tacotron2 with ddc and batch-normalization",
// AUDIO PARAMETERS
"audio":{
// stft parameters
"num_freq": 513, // number of stft frequency levels. Size of the linear spectogram frame.
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (true), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// Griffin-Lim
@ -29,6 +29,7 @@
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 20.0,
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
@ -66,6 +67,7 @@
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceeed.
"loss_masking": true, // enable / disable loss masking against the sequence padding.
"ga_alpha": 10.0, // weight for guided attention loss. If > 0, guided attention is enabled.
"apex_amp_level": null, // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.
// VALIDATION
"run_eval": true,
@ -83,26 +85,29 @@
// TACOTRON PRENET
"memory_size": -1, // ONLY TACOTRON - size of the memory queue used fro storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
"prenet_type": "original", // "original" or "bn".
"prenet_dropout": true, // enable/disable dropout at prenet.
"prenet_type": "bn", // "original" or "bn".
"prenet_dropout": false, // enable/disable dropout at prenet.
// ATTENTION
// TACOTRON ATTENTION
"attention_type": "original", // 'original' or 'graves'
"attention_heads": 4, // number of attention heads (only for 'graves')
"attention_norm": "sigmoid", // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
"attention_norm": "sigmoid", // softmax or sigmoid.
"windowing": false, // Enables attention windowing. Used only in eval mode.
"use_forward_attn": false, // if it uses forward attention. In general, it aligns faster.
"forward_attn_mask": false, // Additional masking forcing monotonicity only in eval mode.
"transition_agent": false, // enable/disable transition agent of forward attention.
"location_attn": true, // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
"bidirectional_decoder": false, // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.
"double_decoder_consistency": true, // use DDC explained here https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency-draft/
"ddc_r": 7, // reduction rate for coarse decoder.
// STOPNET
"stopnet": true, // Train stopnet predicting the end of synthesis.
"separate_stopnet": true, // Train stopnet seperately if 'stopnet==true'. It prevents stopnet loss to influence the rest of the model. It causes a better model, but it trains SLOWER.
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_step": 25, // Number of steps to log training on console.
"tb_plot_step": 100, // Number of steps to plot TB training figures.
"print_eval": false, // If True, it prints intermediate loss values in evalulation.
"save_step": 10000, // Number of training steps expected to save traninpg stats and checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
@ -118,28 +123,37 @@
"max_seq_len": 153, // DATASET-RELATED: maximum text length
// PATHS
"output_path": "/home/erogol/Models/LJSpeech/",
"output_path": "../../Mozilla-TTS/vctk-test/",
// PHONEMES
"phoneme_cache_path": "/media/erogol/data_ssd2/mozilla_us_phonemes_3", // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": false, // use phonemes instead of raw characters. It is suggested for better pronounciation.
"phoneme_cache_path": "../../Mozilla-TTS/vctk-test/", // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": true, // use phonemes instead of raw characters. It is suggested for better pronounciation.
"phoneme_language": "en-us", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
// MULTI-SPEAKER and GST
"use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.
"style_wav_for_test": null, // path to style wav file to be used in TacotronGST inference.
"use_gst": false, // TACOTRON ONLY: use global style tokens
"use_speaker_embedding": true, // use speaker embedding to enable multi-speaker learning.
"use_external_speaker_embedding_file": false, // if true, forces the model to use external embedding per sample instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs /1806.04558
"external_speaker_embedding_file": "../../speakers-vctk-en.json", // if not null and use_external_speaker_embedding_file is true, it is used to load a specific embedding file and thus uses these embeddings instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs /1806.04558
"use_gst": true, // use global style tokens
"gst": { // gst parameter if gst is enabled
"gst_style_input": null, // Condition the style input either on a
// -> wave file [path to wave] or
// -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {"0": 0.15, "1": 0.15, "5": -0.15}
// with the dictionary being len(dict) <= len(gst_style_tokens).
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10
},
// DATASETS
"datasets": // List of datasets. They all merged and they get different speaker_ids.
[
{
"name": "ljspeech",
"path": "/home/erogol/Data/LJSpeech-1.1/",
"meta_file_train": "metadata.csv",
"name": "vctk",
"path": "../../../datasets/VCTK-Corpus-removed-silence/",
"meta_file_train": ["p225", "p234", "p238", "p245", "p248", "p261", "p294", "p302", "p326", "p335", "p347"], // for vtck if list, ignore speakers id in list for train, its useful for test cloning with new speakers
"meta_file_val": null
}
]
}
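For reference, a short sketch of how this commented JSON is consumed and what `gst_style_input` may hold at inference time; the config path is a placeholder and the dictionary values are illustrative only.

```python
from mozilla_voice_tts.tts.utils.generic_utils import load_config

c = load_config("config.json")                     # placeholder path
print(c["use_gst"], c["gst"]["gst_num_heads"], c["gst"]["gst_style_tokens"])

# "gst_style_input" accepts either a reference wav path ...
style_input = "/path/to/reference_style.wav"       # hypothetical path
# ... or a dict weighting a subset of the learned style tokens,
# with len(dict) <= gst_style_tokens:
style_input = {"0": 0.15, "1": 0.15, "5": -0.15}
```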

View File

@ -5,8 +5,8 @@ import torch
import random
from torch.utils.data import Dataset
from TTS.utils.text import text_to_sequence, phoneme_to_sequence, pad_with_eos_bos
from TTS.utils.data import prepare_data, prepare_tensor, prepare_stop_target
from mozilla_voice_tts.tts.utils.text import text_to_sequence, phoneme_to_sequence, pad_with_eos_bos
from mozilla_voice_tts.tts.utils.data import prepare_data, prepare_tensor, prepare_stop_target
class MyDataset(Dataset):
@ -24,13 +24,14 @@ class MyDataset(Dataset):
phoneme_cache_path=None,
phoneme_language="en-us",
enable_eos_bos=False,
speaker_mapping=None,
verbose=False):
"""
Args:
outputs_per_step (int): number of time frames predicted per step.
text_cleaner (str): text cleaner used for the dataset.
compute_linear_spec (bool): compute linear spectrogram if True.
ap (TTS.utils.AudioProcessor): audio processor object.
ap (mozilla_voice_tts.tts.utils.AudioProcessor): audio processor object.
meta_data (list): list of dataset instances.
batch_group_size (int): (0) range of batch randomization after sorting
sequences by length.
@ -58,6 +59,7 @@ class MyDataset(Dataset):
self.phoneme_cache_path = phoneme_cache_path
self.phoneme_language = phoneme_language
self.enable_eos_bos = enable_eos_bos
self.speaker_mapping = speaker_mapping
self.verbose = verbose
if use_phonemes and not os.path.isdir(phoneme_cache_path):
os.makedirs(phoneme_cache_path, exist_ok=True)
@ -92,7 +94,7 @@ class MyDataset(Dataset):
return phonemes
def _load_or_generate_phoneme_sequence(self, wav_file, text):
file_name = os.path.basename(wav_file).split('.')[0]
file_name = os.path.splitext(os.path.basename(wav_file))[0]
cache_path = os.path.join(self.phoneme_cache_path,
file_name + '_phoneme.npy')
try:
@ -127,7 +129,8 @@ class MyDataset(Dataset):
'text': text,
'wav': wav,
'item_idx': self.items[idx][1],
'speaker_name': speaker_name
'speaker_name': speaker_name,
'wav_file_name': os.path.basename(wav_file)
}
return sample
@ -191,9 +194,15 @@ class MyDataset(Dataset):
batch[idx]['item_idx'] for idx in ids_sorted_decreasing
]
text = [batch[idx]['text'] for idx in ids_sorted_decreasing]
speaker_name = [batch[idx]['speaker_name']
for idx in ids_sorted_decreasing]
# get speaker embeddings
if self.speaker_mapping is not None:
wav_files_names = [batch[idx]['wav_file_name'] for idx in ids_sorted_decreasing]
speaker_embedding = [self.speaker_mapping[w]['embedding'] for w in wav_files_names]
else:
speaker_embedding = None
# compute features
mel = [self.ap.melspectrogram(w).astype('float32') for w in wav]
@ -224,6 +233,9 @@ class MyDataset(Dataset):
mel_lengths = torch.LongTensor(mel_lengths)
stop_targets = torch.FloatTensor(stop_targets)
if speaker_embedding is not None:
speaker_embedding = torch.FloatTensor(speaker_embedding)
# compute linear spectrogram
if self.compute_linear_spec:
linear = [self.ap.spectrogram(w).astype('float32') for w in wav]
@ -234,7 +246,7 @@ class MyDataset(Dataset):
else:
linear = None
return text, text_lenghts, speaker_name, linear, mel, mel_lengths, \
stop_targets, item_idxs
stop_targets, item_idxs, speaker_embedding
raise TypeError(("batch must contain tensors, numbers, dicts or lists;\
found {}".format(type(batch[0]))))

View File

@ -2,7 +2,7 @@ import os
from glob import glob
import re
import sys
from TTS.utils.generic_utils import split_dataset
from mozilla_voice_tts.tts.utils.generic_utils import split_dataset
def load_meta_data(datasets):
@ -93,9 +93,10 @@ def mozilla_de(root_path, meta_file):
def mailabs(root_path, meta_files=None):
"""Normalizes M-AI-Labs meta data files to TTS format"""
speaker_regex = re.compile("by_book/(male|female)/(?P<speaker_name>[^/]+)/")
speaker_regex = re.compile(
"by_book/(male|female)/(?P<speaker_name>[^/]+)/")
if meta_files is None:
csv_files = glob(root_path+"/**/metadata.csv", recursive=True)
csv_files = glob(root_path + "/**/metadata.csv", recursive=True)
else:
csv_files = meta_files
# meta_files = [f.strip() for f in meta_files.split(",")]
@ -115,12 +116,15 @@ def mailabs(root_path, meta_files=None):
if meta_files is None:
wav_file = os.path.join(folder, 'wavs', cols[0] + '.wav')
else:
wav_file = os.path.join(root_path, folder.replace("metadata.csv", ""), 'wavs', cols[0] + '.wav')
wav_file = os.path.join(root_path,
folder.replace("metadata.csv", ""),
'wavs', cols[0] + '.wav')
if os.path.isfile(wav_file):
text = cols[1].strip()
items.append([text, wav_file, speaker_name])
else:
raise RuntimeError("> File %s does not exist!"%(wav_file))
raise RuntimeError("> File %s does not exist!" %
(wav_file))
return items
@ -185,7 +189,8 @@ def libri_tts(root_path, meta_files=None):
text = cols[1]
items.append([text, wav_file, speaker_name])
for item in items:
assert os.path.exists(item[1]), f" [!] wav files don't exist - {item[1]}"
assert os.path.exists(
item[1]), f" [!] wav files don't exist - {item[1]}"
return items
@ -197,7 +202,8 @@ def custom_turkish(root_path, meta_file):
with open(txt_file, 'r', encoding='utf-8') as ttf:
for line in ttf:
cols = line.split('|')
wav_file = os.path.join(root_path, 'wavs', cols[0].strip() + '.wav')
wav_file = os.path.join(root_path, 'wavs',
cols[0].strip() + '.wav')
if not os.path.exists(wav_file):
skipped_files.append(wav_file)
continue
@ -205,3 +211,44 @@ def custom_turkish(root_path, meta_file):
items.append([text, wav_file, speaker_name])
print(f" [!] {len(skipped_files)} files skipped. They don't exist...")
return items
# ToDo: add the dataset link when the dataset is released publicly
def brspeech(root_path, meta_file):
'''BRSpeech 3.0 beta'''
txt_file = os.path.join(root_path, meta_file)
items = []
with open(txt_file, 'r') as ttf:
for line in ttf:
if line.startswith("wav_filename"):
continue
cols = line.split('|')
#print(cols)
wav_file = os.path.join(root_path, cols[0])
text = cols[2]
speaker_name = cols[3]
items.append([text, wav_file, speaker_name])
return items
def vctk(root_path, meta_files=None, wavs_path='wav48'):
"""homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz"""
test_speakers = meta_files
items = []
meta_files = glob(f"{os.path.join(root_path,'txt')}/**/*.txt",
recursive=True)
for meta_file in meta_files:
_, speaker_id, txt_file = os.path.relpath(meta_file,
root_path).split(os.sep)
file_id = txt_file.split('.')[0]
if isinstance(test_speakers,
list):  # if a list is given, ignore these speaker ids
if speaker_id in test_speakers:
continue
with open(meta_file) as file_text:
text = file_text.readlines()[0]
wav_file = os.path.join(root_path, wavs_path, speaker_id,
file_id + '.wav')
items.append([text, wav_file, speaker_id])
return items
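A usage sketch for the new `vctk` formatter; the corpus path is a placeholder and the import path is assumed to be the dataset preprocessing module that holds these formatters. Passing a speaker list as `meta_files` removes those speakers from the returned items, which is how test speakers are held out in the config above.

```python
from mozilla_voice_tts.tts.datasets.preprocess import vctk   # module path assumed

items = vctk("/data/VCTK-Corpus", meta_files=["p225", "p234"])  # placeholder path
text, wav_file, speaker_id = items[0]   # each item is [text, wav_path, speaker_id]
print(len(items), speaker_id)
```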

View File

@ -1,6 +1,5 @@
import torch
from torch import nn
from torch.autograd import Variable
from torch.nn import functional as F
@ -52,6 +51,7 @@ class LinearBN(nn.Module):
class Prenet(nn.Module):
# pylint: disable=dangerous-default-value
def __init__(self,
in_features,
prenet_type="original",
@ -244,14 +244,14 @@ class OriginalAttention(nn.Module):
self.u = (0.5 * torch.ones([B, 1])).to(inputs.device)
def init_location_attention(self, inputs):
B = inputs.shape[0]
T = inputs.shape[1]
self.attention_weights_cum = Variable(inputs.data.new(B, T).zero_())
B = inputs.size(0)
T = inputs.size(1)
self.attention_weights_cum = torch.zeros([B, T], device=inputs.device)
def init_states(self, inputs):
B = inputs.shape[0]
T = inputs.shape[1]
self.attention_weights = Variable(inputs.data.new(B, T).zero_())
B = inputs.size(0)
T = inputs.size(1)
self.attention_weights = torch.zeros([B, T], device=inputs.device)
if self.location_attention:
self.init_location_attention(inputs)
if self.forward_attn:
@ -300,8 +300,8 @@ class OriginalAttention(nn.Module):
def apply_forward_attention(self, alignment):
# forward attention
fwd_shifted_alpha = F.pad(self.alpha[:, :-1].clone().to(alignment.device),
(1, 0, 0, 0))
fwd_shifted_alpha = F.pad(
self.alpha[:, :-1].clone().to(alignment.device), (1, 0, 0, 0))
# compute transition potentials
alpha = ((1 - self.u) * self.alpha
+ self.u * fwd_shifted_alpha
@ -309,7 +309,7 @@ class OriginalAttention(nn.Module):
# force incremental alignment
if not self.training and self.forward_attn_mask:
_, n = fwd_shifted_alpha.max(1)
val, n2 = alpha.max(1)
val, _ = alpha.max(1)
for b in range(alignment.shape[0]):
alpha[b, n[b] + 3:] = 0
alpha[b, :(

View File

@ -72,7 +72,7 @@ class ReferenceEncoder(nn.Module):
# x: 3D tensor [batch_size, post_conv_width,
# num_channels*post_conv_height]
self.recurrence.flatten_parameters()
memory, out = self.recurrence(x)
_, out = self.recurrence(x)
# out: 3D tensor [seq_len==1, batch_size, encoding_size=128]
return out.squeeze(0)
@ -96,7 +96,7 @@ class StyleTokenLayer(nn.Module):
self.key_dim = embedding_dim // num_heads
self.style_tokens = nn.Parameter(
torch.FloatTensor(num_style_tokens, self.key_dim))
nn.init.orthogonal_(self.style_tokens)
nn.init.normal_(self.style_tokens, mean=0, std=0.5)
self.attention = MultiHeadAttention(
query_dim=self.query_dim,
key_dim=self.key_dim,

View File

@ -2,7 +2,7 @@ import numpy as np
import torch
from torch import nn
from torch.nn import functional
from TTS.utils.generic_utils import sequence_mask
from mozilla_voice_tts.tts.utils.generic_utils import sequence_mask
class L1LossMasked(nn.Module):
@ -150,7 +150,7 @@ class GuidedAttentionLoss(torch.nn.Module):
@staticmethod
def _make_ga_mask(ilen, olen, sigma):
grid_x, grid_y = torch.meshgrid(torch.arange(olen), torch.arange(ilen))
grid_x, grid_y = torch.meshgrid(torch.arange(olen, device=olen.device), torch.arange(ilen, device=ilen.device))
grid_x, grid_y = grid_x.float(), grid_y.float()
return 1.0 - torch.exp(-(grid_y / ilen - grid_x / olen) ** 2 / (2 * (sigma ** 2)))
@ -184,7 +184,7 @@ class TacotronLoss(torch.nn.Module):
def forward(self, postnet_output, decoder_output, mel_input, linear_input,
stopnet_output, stopnet_target, output_lens, decoder_b_output,
alignments, alignment_lens, input_lens):
alignments, alignment_lens, alignments_backwards, input_lens):
return_dict = {}
# decoder and postnet losses
@ -226,6 +226,15 @@ class TacotronLoss(torch.nn.Module):
return_dict['decoder_b_loss'] = decoder_b_loss
return_dict['decoder_c_loss'] = decoder_c_loss
# double decoder consistency loss (if enabled)
if self.config.double_decoder_consistency:
decoder_b_loss = self.criterion(decoder_b_output, mel_input, output_lens)
# decoder_c_loss = torch.nn.functional.l1_loss(decoder_b_output, decoder_output)
attention_c_loss = torch.nn.functional.l1_loss(alignments, alignments_backwards)
loss += decoder_b_loss + attention_c_loss
return_dict['decoder_coarse_loss'] = decoder_b_loss
return_dict['decoder_ddc_loss'] = attention_c_loss
# guided attention loss (if enabled)
if self.config.ga_alpha > 0:
ga_loss = self.criterion_ga(alignments, input_lens, alignment_lens)
@ -234,4 +243,3 @@ class TacotronLoss(torch.nn.Module):
return_dict['loss'] = loss
return return_dict
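The guided-attention term penalizes attention mass far from the input/output diagonal. Below is a standalone sketch of the soft diagonal mask built by `_make_ga_mask` above, using plain ints instead of tensors and assuming the usual `sigma=0.4` default:

```python
import torch

def make_ga_mask(ilen, olen, sigma=0.4):
    # ~0 near the diagonal, approaching 1 far from it; sigma controls the band width
    grid_x, grid_y = torch.meshgrid(torch.arange(olen), torch.arange(ilen))
    grid_x, grid_y = grid_x.float(), grid_y.float()
    return 1.0 - torch.exp(-(grid_y / ilen - grid_x / olen) ** 2 / (2 * sigma ** 2))

mask = make_ga_mask(ilen=6, olen=8)
print(mask.shape)   # (olen, ilen) = (8, 6)
print(mask)
```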

View File

@ -1,7 +1,7 @@
# coding: utf-8
import torch
from torch import nn
from .common_layers import Prenet, init_attn, Linear
from .common_layers import Prenet, init_attn
class BatchNormConv1d(nn.Module):
@ -18,8 +18,8 @@ class BatchNormConv1d(nn.Module):
activation: activation function set b/w Conv1d and BatchNorm
Shapes:
- input: batch x dims
- output: batch x dims
- input: (B, D)
- output: (B, D)
"""
def __init__(self,
@ -46,9 +46,9 @@ class BatchNormConv1d(nn.Module):
# self.init_layers()
def init_layers(self):
if type(self.activation) == torch.nn.ReLU:
if isinstance(self.activation, torch.nn.ReLU):
w_gain = 'relu'
elif type(self.activation) == torch.nn.Tanh:
elif isinstance(self.activation, torch.nn.Tanh):
w_gain = 'tanh'
elif self.activation is None:
w_gain = 'linear'
@ -67,12 +67,23 @@ class BatchNormConv1d(nn.Module):
class Highway(nn.Module):
r"""Highway layers as explained in https://arxiv.org/abs/1505.00387
Args:
in_features (int): size of each input sample
out_feature (int): size of each output sample
Shapes:
- input: (B, *, H_in)
- output: (B, *, H_out)
"""
# TODO: Try GLU layer
def __init__(self, in_size, out_size):
def __init__(self, in_features, out_feature):
super(Highway, self).__init__()
self.H = nn.Linear(in_size, out_size)
self.H = nn.Linear(in_features, out_feature)
self.H.bias.data.zero_()
self.T = nn.Linear(in_size, out_size)
self.T = nn.Linear(in_features, out_feature)
self.T.bias.data.fill_(-1)
self.relu = nn.ReLU()
self.sigmoid = nn.Sigmoid()
@ -103,10 +114,10 @@ class CBHG(nn.Module):
num_highways (int): number of highways layers
Shapes:
- input: B x D x T_in
- output: B x T_in x D*2
- input: (B, C, T_in)
- output: (B, T_in, C*2)
"""
#pylint: disable=dangerous-default-value
def __init__(self,
in_features,
K=16,
@ -195,6 +206,8 @@ class CBHG(nn.Module):
class EncoderCBHG(nn.Module):
r"""CBHG module with Encoder specific arguments"""
def __init__(self):
super(EncoderCBHG, self).__init__()
self.cbhg = CBHG(
@ -211,7 +224,14 @@ class EncoderCBHG(nn.Module):
class Encoder(nn.Module):
r"""Encapsulate Prenet and CBHG modules for encoder"""
r"""Stack Prenet and CBHG module for encoder
Args:
inputs (FloatTensor): embedding features
Shapes:
- inputs: (B, T, D_in)
- outputs: (B, T, 128 * 2)
"""
def __init__(self, in_features):
super(Encoder, self).__init__()
@ -219,14 +239,6 @@ class Encoder(nn.Module):
self.cbhg = EncoderCBHG()
def forward(self, inputs):
r"""
Args:
inputs (FloatTensor): embedding features
Shapes:
- inputs: batch x time x in_features
- outputs: batch x time x 128*2
"""
# B x T x prenet_dim
outputs = self.prenet(inputs)
outputs = self.cbhg(outputs.transpose(1, 2))
@ -250,35 +262,48 @@ class PostCBHG(nn.Module):
class Decoder(nn.Module):
"""Decoder module.
"""Tacotron decoder.
Args:
in_features (int): input vector (encoder output) sample size.
memory_dim (int): memory vector (prev. time-step output) sample size.
r (int): number of outputs per time step.
in_channels (int): number of input channels.
frame_channels (int): number of feature frame channels.
r (int): number of outputs per time step (reduction rate).
memory_size (int): size of the past window. if <= 0 memory_size = r
TODO: arguments
attn_type (string): type of attention used in decoder.
attn_windowing (bool): if true, define an attention window centered to maximum
attention response. It provides more robust attention alignment especially
at inference time.
attn_norm (string): attention normalization function. 'sigmoid' or 'softmax'.
prenet_type (string): 'original' or 'bn'.
prenet_dropout (float): prenet dropout rate.
forward_attn (bool): if true, use forward attention method. https://arxiv.org/abs/1807.06736
trans_agent (bool): if true, use transition agent. https://arxiv.org/abs/1807.06736
forward_attn_mask (bool): if true, mask attention values smaller than a threshold.
location_attn (bool): if true, use location sensitive attention.
attn_K (int): number of attention heads for GravesAttention.
separate_stopnet (bool): if true, detach stopnet input to prevent gradient flow.
speaker_embedding_dim (int): size of speaker embedding vector, for multi-speaker training.
"""
# Pylint gets confused by PyTorch conventions here
#pylint: disable=attribute-defined-outside-init
# pylint: disable=attribute-defined-outside-init
def __init__(self, in_features, memory_dim, r, memory_size, attn_type, attn_windowing,
def __init__(self, in_channels, frame_channels, r, memory_size, attn_type, attn_windowing,
attn_norm, prenet_type, prenet_dropout, forward_attn,
trans_agent, forward_attn_mask, location_attn, attn_K,
separate_stopnet, speaker_embedding_dim):
separate_stopnet):
super(Decoder, self).__init__()
self.r_init = r
self.r = r
self.in_features = in_features
self.in_channels = in_channels
self.max_decoder_steps = 500
self.use_memory_queue = memory_size > 0
self.memory_size = memory_size if memory_size > 0 else r
self.memory_dim = memory_dim
self.frame_channels = frame_channels
self.separate_stopnet = separate_stopnet
self.query_dim = 256
# memory -> |Prenet| -> processed_memory
prenet_dim = memory_dim * self.memory_size + speaker_embedding_dim if self.use_memory_queue else memory_dim + speaker_embedding_dim
prenet_dim = frame_channels * self.memory_size if self.use_memory_queue else frame_channels
self.prenet = Prenet(
prenet_dim,
prenet_type,
@ -286,11 +311,11 @@ class Decoder(nn.Module):
out_features=[256, 128])
# processed_inputs, processed_memory -> |Attention| -> Attention, attention, RNN_State
# attention_rnn generates queries for the attention mechanism
self.attention_rnn = nn.GRUCell(in_features + 128, self.query_dim)
self.attention_rnn = nn.GRUCell(in_channels + 128, self.query_dim)
self.attention = init_attn(attn_type=attn_type,
query_dim=self.query_dim,
embedding_dim=in_features,
embedding_dim=in_channels,
attention_dim=128,
location_attention=location_attn,
attention_location_n_filters=32,
@ -302,14 +327,14 @@ class Decoder(nn.Module):
forward_attn_mask=forward_attn_mask,
attn_K=attn_K)
# (processed_memory | attention context) -> |Linear| -> decoder_RNN_input
self.project_to_decoder_in = nn.Linear(256 + in_features, 256)
self.project_to_decoder_in = nn.Linear(256 + in_channels, 256)
# decoder_RNN_input -> |RNN| -> RNN_state
self.decoder_rnns = nn.ModuleList(
[nn.GRUCell(256, 256) for _ in range(2)])
# RNN_state -> |Linear| -> mel_spec
self.proj_to_mel = nn.Linear(256, memory_dim * self.r_init)
self.proj_to_mel = nn.Linear(256, frame_channels * self.r_init)
# learn init values instead of zero init.
self.stopnet = StopNet(256 + memory_dim * self.r_init)
self.stopnet = StopNet(256 + frame_channels * self.r_init)
def set_r(self, new_r):
self.r = new_r
@ -319,9 +344,9 @@ class Decoder(nn.Module):
Reshape the spectrograms for given 'r'
"""
# Grouping multiple frames if necessary
if memory.size(-1) == self.memory_dim:
if memory.size(-1) == self.frame_channels:
memory = memory.view(memory.shape[0], memory.size(1) // self.r, -1)
# Time first (T_decoder, B, memory_dim)
# Time first (T_decoder, B, frame_channels)
memory = memory.transpose(0, 1)
return memory
@ -330,19 +355,18 @@ class Decoder(nn.Module):
Initialization of decoder states
"""
B = inputs.size(0)
T = inputs.size(1)
# go frame as zeros matrix
if self.use_memory_queue:
self.memory_input = torch.zeros(1, device=inputs.device).repeat(B, self.memory_dim * self.memory_size)
self.memory_input = torch.zeros(1, device=inputs.device).repeat(B, self.frame_channels * self.memory_size)
else:
self.memory_input = torch.zeros(1, device=inputs.device).repeat(B, self.memory_dim)
self.memory_input = torch.zeros(1, device=inputs.device).repeat(B, self.frame_channels)
# decoder states
self.attention_rnn_hidden = torch.zeros(1, device=inputs.device).repeat(B, 256)
self.decoder_rnn_hiddens = [
torch.zeros(1, device=inputs.device).repeat(B, 256)
for idx in range(len(self.decoder_rnns))
]
self.context_vec = inputs.data.new(B, self.in_features).zero_()
self.context_vec = inputs.data.new(B, self.in_channels).zero_()
# cache attention inputs
self.processed_inputs = self.attention.preprocess_inputs(inputs)
@ -352,7 +376,7 @@ class Decoder(nn.Module):
stop_tokens = torch.stack(stop_tokens).transpose(0, 1)
outputs = torch.stack(outputs).transpose(0, 1).contiguous()
outputs = outputs.view(
outputs.size(0), -1, self.memory_dim)
outputs.size(0), -1, self.frame_channels)
outputs = outputs.transpose(1, 2)
return outputs, attentions, stop_tokens
@ -386,7 +410,7 @@ class Decoder(nn.Module):
stop_token = self.stopnet(stopnet_input.detach())
else:
stop_token = self.stopnet(stopnet_input)
output = output[:, : self.r * self.memory_dim]
output = output[:, : self.r * self.frame_channels]
return output, stop_token, self.attention.attention_weights
def _update_memory_input(self, new_memory):
@ -395,17 +419,17 @@ class Decoder(nn.Module):
# memory queue size is larger than number of frames per decoder iter
self.memory_input = torch.cat([
new_memory, self.memory_input[:, :(
self.memory_size - self.r) * self.memory_dim].clone()
self.memory_size - self.r) * self.frame_channels].clone()
], dim=-1)
else:
# memory queue size smaller than number of frames per decoder iter
self.memory_input = new_memory[:, :self.memory_size * self.memory_dim]
self.memory_input = new_memory[:, :self.memory_size * self.frame_channels]
else:
# use only the last frame prediction
# assert new_memory.shape[-1] == self.r * self.memory_dim
self.memory_input = new_memory[:, self.memory_dim * (self.r - 1):]
# assert new_memory.shape[-1] == self.r * self.frame_channels
self.memory_input = new_memory[:, self.frame_channels * (self.r - 1):]
def forward(self, inputs, memory, mask, speaker_embeddings=None):
def forward(self, inputs, memory, mask):
"""
Args:
inputs: Encoder outputs.
@ -415,8 +439,8 @@ class Decoder(nn.Module):
mask: Attention mask for sequence padding.
Shapes:
- inputs: batch x time x encoder_out_dim
- memory: batch x #mel_specs x mel_spec_dim
- inputs: (B, T, D_out_enc)
- memory: (B, T_mel, D_mel)
"""
# Run greedy decoding if memory is None
memory = self._reshape_memory(memory)
@ -430,8 +454,7 @@ class Decoder(nn.Module):
if t > 0:
new_memory = memory[t - 1]
self._update_memory_input(new_memory)
if speaker_embeddings is not None:
self.memory_input = torch.cat([self.memory_input, speaker_embeddings], dim=-1)
output, stop_token, attention = self.decode(inputs, mask)
outputs += [output]
attentions += [attention]
@ -439,15 +462,12 @@ class Decoder(nn.Module):
t += 1
return self._parse_outputs(outputs, attentions, stop_tokens)
def inference(self, inputs, speaker_embeddings=None):
def inference(self, inputs):
"""
Args:
inputs: encoder outputs.
speaker_embeddings: speaker vectors.
Shapes:
- inputs: batch x time x encoder_out_dim
- speaker_embeddings: batch x embed_dim
"""
outputs = []
attentions = []
@ -460,8 +480,6 @@ class Decoder(nn.Module):
if t > 0:
new_memory = outputs[-1]
self._update_memory_input(new_memory)
if speaker_embeddings is not None:
self.memory_input = torch.cat([self.memory_input, speaker_embeddings], dim=-1)
output, stop_token, attention = self.decode(inputs, None)
stop_token = torch.sigmoid(stop_token.data)
outputs += [output]
@ -471,14 +489,14 @@ class Decoder(nn.Module):
if t > inputs.shape[1] / 4 and (stop_token > 0.6
or attention[:, -1].item() > 0.6):
break
elif t > self.max_decoder_steps:
if t > self.max_decoder_steps:
print(" | > Decoder stopped with 'max_decoder_steps")
break
return self._parse_outputs(outputs, attentions, stop_tokens)
class StopNet(nn.Module):
r"""
r"""Stopnet signalling decoder to stop inference.
Args:
in_features (int): feature dimension of input.
"""

View File

@ -1,11 +1,24 @@
import torch
from torch.autograd import Variable
from torch import nn
from torch.nn import functional as F
from .common_layers import init_attn, Prenet, Linear
# NOTE: linter has a problem with the current TF release
#pylint: disable=no-value-for-parameter
#pylint: disable=unexpected-keyword-arg
class ConvBNBlock(nn.Module):
r"""Convolutions with Batch Normalization and non-linear activation.
Args:
in_channels (int): number of input channels.
out_channels (int): number of output channels.
kernel_size (int): convolution kernel size.
activation (str): 'relu', 'tanh', None (linear).
Shapes:
- input: (B, C_in, T)
- output: (B, C_out, T)
"""
def __init__(self, in_channels, out_channels, kernel_size, activation=None):
super(ConvBNBlock, self).__init__()
assert (kernel_size - 1) % 2 == 0
@ -32,16 +45,25 @@ class ConvBNBlock(nn.Module):
class Postnet(nn.Module):
def __init__(self, output_dim, num_convs=5):
r"""Tacotron2 Postnet
Args:
in_out_channels (int): number of input and output channels.
Shapes:
- input: (B, C_in, T)
- output: (B, C_in, T)
"""
def __init__(self, in_out_channels, num_convs=5):
super(Postnet, self).__init__()
self.convolutions = nn.ModuleList()
self.convolutions.append(
ConvBNBlock(output_dim, 512, kernel_size=5, activation='tanh'))
ConvBNBlock(in_out_channels, 512, kernel_size=5, activation='tanh'))
for _ in range(1, num_convs - 1):
self.convolutions.append(
ConvBNBlock(512, 512, kernel_size=5, activation='tanh'))
self.convolutions.append(
ConvBNBlock(512, output_dim, kernel_size=5, activation=None))
ConvBNBlock(512, in_out_channels, kernel_size=5, activation=None))
def forward(self, x):
o = x
@ -51,14 +73,23 @@ class Postnet(nn.Module):
class Encoder(nn.Module):
def __init__(self, output_input_dim=512):
r"""Tacotron2 Encoder
Args:
in_out_channels (int): number of input and output channels.
Shapes:
- input: (B, C_in, T)
- output: (B, C_in, T)
"""
def __init__(self, in_out_channels=512):
super(Encoder, self).__init__()
self.convolutions = nn.ModuleList()
for _ in range(3):
self.convolutions.append(
ConvBNBlock(output_input_dim, output_input_dim, 5, 'relu'))
self.lstm = nn.LSTM(output_input_dim,
int(output_input_dim / 2),
ConvBNBlock(in_out_channels, in_out_channels, 5, 'relu'))
self.lstm = nn.LSTM(in_out_channels,
int(in_out_channels / 2),
num_layers=1,
batch_first=True,
bias=True,
@ -90,20 +121,40 @@ class Encoder(nn.Module):
# adapted from https://github.com/NVIDIA/tacotron2/
class Decoder(nn.Module):
"""Tacotron2 decoder. We don't use Zoneout but Dropout between RNN layers.
Args:
in_channels (int): number of input channels.
frame_channels (int): number of feature frame channels.
r (int): number of outputs per time step (reduction rate).
memory_size (int): size of the past window. if <= 0 memory_size = r
attn_type (string): type of attention used in decoder.
attn_win (bool): if true, define an attention window centered to maximum
attention response. It provides more robust attention alignment especially
at inference time.
attn_norm (string): attention normalization function. 'sigmoid' or 'softmax'.
prenet_type (string): 'original' or 'bn'.
prenet_dropout (float): prenet dropout rate.
forward_attn (bool): if true, use forward attention method. https://arxiv.org/abs/1807.06736
trans_agent (bool): if true, use transition agent. https://arxiv.org/abs/1807.06736
forward_attn_mask (bool): if true, mask attention values smaller than a threshold.
location_attn (bool): if true, use location sensitive attention.
attn_K (int): number of attention heads for GravesAttention.
separate_stopnet (bool): if true, detach stopnet input to prevent gradient flow.
"""
# Pylint gets confused by PyTorch conventions here
#pylint: disable=attribute-defined-outside-init
def __init__(self, input_dim, frame_dim, r, attn_type, attn_win, attn_norm,
def __init__(self, in_channels, frame_channels, r, attn_type, attn_win, attn_norm,
prenet_type, prenet_dropout, forward_attn, trans_agent,
forward_attn_mask, location_attn, attn_K, separate_stopnet,
speaker_embedding_dim):
forward_attn_mask, location_attn, attn_K, separate_stopnet):
super(Decoder, self).__init__()
self.frame_dim = frame_dim
self.frame_channels = frame_channels
self.r_init = r
self.r = r
self.encoder_embedding_dim = input_dim
self.encoder_embedding_dim = in_channels
self.separate_stopnet = separate_stopnet
self.max_decoder_steps = 1000
self.gate_threshold = 0.5
self.stop_threshold = 0.5
# model dimensions
self.query_dim = 1024
@ -114,20 +165,20 @@ class Decoder(nn.Module):
self.p_decoder_dropout = 0.1
# memory -> |Prenet| -> processed_memory
prenet_dim = self.frame_dim
prenet_dim = self.frame_channels
self.prenet = Prenet(prenet_dim,
prenet_type,
prenet_dropout,
out_features=[self.prenet_dim, self.prenet_dim],
bias=False)
self.attention_rnn = nn.LSTMCell(self.prenet_dim + input_dim,
self.attention_rnn = nn.LSTMCell(self.prenet_dim + in_channels,
self.query_dim,
bias=True)
self.attention = init_attn(attn_type=attn_type,
query_dim=self.query_dim,
embedding_dim=input_dim,
embedding_dim=in_channels,
attention_dim=128,
location_attention=location_attn,
attention_location_n_filters=32,
@ -139,16 +190,16 @@ class Decoder(nn.Module):
forward_attn_mask=forward_attn_mask,
attn_K=attn_K)
self.decoder_rnn = nn.LSTMCell(self.query_dim + input_dim,
self.decoder_rnn = nn.LSTMCell(self.query_dim + in_channels,
self.decoder_rnn_dim,
bias=True)
self.linear_projection = Linear(self.decoder_rnn_dim + input_dim,
self.frame_dim * self.r_init)
self.linear_projection = Linear(self.decoder_rnn_dim + in_channels,
self.frame_channels * self.r_init)
self.stopnet = nn.Sequential(
nn.Dropout(0.1),
Linear(self.decoder_rnn_dim + self.frame_dim * self.r_init,
Linear(self.decoder_rnn_dim + self.frame_channels * self.r_init,
1,
bias=True,
init_gain='sigmoid'))
@ -159,8 +210,8 @@ class Decoder(nn.Module):
def get_go_frame(self, inputs):
B = inputs.size(0)
memory = torch.zeros(1, device=inputs.device).repeat(B,
self.frame_dim * self.r)
memory = torch.zeros(1, device=inputs.device).repeat(
B, self.frame_channels * self.r)
return memory
def _init_states(self, inputs, mask, keep_states=False):
@ -186,9 +237,9 @@ class Decoder(nn.Module):
Reshape the spectrograms for given 'r'
"""
# Grouping multiple frames if necessary
if memory.size(-1) == self.frame_dim:
if memory.size(-1) == self.frame_channels:
memory = memory.view(memory.shape[0], memory.size(1) // self.r, -1)
# Time first (T_decoder, B, frame_dim)
# Time first (T_decoder, B, frame_channels)
memory = memory.transpose(0, 1)
return memory
@ -196,22 +247,22 @@ class Decoder(nn.Module):
alignments = torch.stack(alignments).transpose(0, 1)
stop_tokens = torch.stack(stop_tokens).transpose(0, 1)
outputs = torch.stack(outputs).transpose(0, 1).contiguous()
outputs = outputs.view(outputs.size(0), -1, self.frame_dim)
outputs = outputs.view(outputs.size(0), -1, self.frame_channels)
outputs = outputs.transpose(1, 2)
return outputs, stop_tokens, alignments
def _update_memory(self, memory):
if len(memory.shape) == 2:
return memory[:, self.frame_dim * (self.r - 1):]
return memory[:, :, self.frame_dim * (self.r - 1):]
return memory[:, self.frame_channels * (self.r - 1):]
return memory[:, :, self.frame_channels * (self.r - 1):]
def decode(self, memory):
'''
shapes:
- memory: B x r * self.frame_dim
- memory: B x r * self.frame_channels
'''
# self.context: B x D_en
# query_input: B x D_en + (r * self.frame_dim)
# query_input: B x D_en + (r * self.frame_channels)
query_input = torch.cat((memory, self.context), -1)
# self.query and self.attention_rnn_cell_state : B x D_attn_rnn
self.query, self.attention_rnn_cell_state = self.attention_rnn(
@ -234,25 +285,36 @@ class Decoder(nn.Module):
# B x (D_decoder_rnn + D_en)
decoder_hidden_context = torch.cat((self.decoder_hidden, self.context),
dim=1)
# B x (self.r * self.frame_dim)
# B x (self.r * self.frame_channels)
decoder_output = self.linear_projection(decoder_hidden_context)
# B x (D_decoder_rnn + (self.r * self.frame_dim))
# B x (D_decoder_rnn + (self.r * self.frame_channels))
stopnet_input = torch.cat((self.decoder_hidden, decoder_output), dim=1)
if self.separate_stopnet:
stop_token = self.stopnet(stopnet_input.detach())
else:
stop_token = self.stopnet(stopnet_input)
# select outputs for the reduction rate self.r
decoder_output = decoder_output[:, :self.r * self.frame_dim]
decoder_output = decoder_output[:, :self.r * self.frame_channels]
return decoder_output, self.attention.attention_weights, stop_token
def forward(self, inputs, memories, mask, speaker_embeddings=None):
def forward(self, inputs, memories, mask):
r"""Train Decoder with teacher forcing.
Args:
inputs: Encoder outputs.
memories: Feature frames for teacher-forcing.
mask: Attention mask for sequence padding.
Shapes:
- inputs: (B, T, D_out_enc)
- memory: (B, T_mel, D_mel)
- outputs: (B, T_mel, D_mel)
- alignments: (B, T_in, T_out)
- stop_tokens: (B, T_out)
"""
memory = self.get_go_frame(inputs).unsqueeze(0)
memories = self._reshape_memory(memories)
memories = torch.cat((memory, memories), dim=0)
memories = self._update_memory(memories)
if speaker_embeddings is not None:
memories = torch.cat([memories, speaker_embeddings], dim=-1)
memories = self.prenet(memories)
self._init_states(inputs, mask=mask)
@ -270,7 +332,18 @@ class Decoder(nn.Module):
outputs, stop_tokens, alignments)
return outputs, alignments, stop_tokens
def inference(self, inputs, speaker_embeddings=None):
def inference(self, inputs):
r"""Decoder inference without teacher forcing and use
Stopnet to stop decoder.
Args:
inputs: Encoder outputs.
Shapes:
- inputs: (B, T, D_out_enc)
- outputs: (B, T_mel, D_mel)
- alignments: (B, T_in, T_out)
- stop_tokens: (B, T_out)
"""
memory = self.get_go_frame(inputs)
memory = self._update_memory(memory)
@ -280,15 +353,13 @@ class Decoder(nn.Module):
outputs, stop_tokens, alignments, t = [], [], [], 0
while True:
memory = self.prenet(memory)
if speaker_embeddings is not None:
memory = torch.cat([memory, speaker_embeddings], dim=-1)
decoder_output, alignment, stop_token = self.decode(memory)
stop_token = torch.sigmoid(stop_token.data)
outputs += [decoder_output.squeeze(1)]
stop_tokens += [stop_token]
alignments += [alignment]
if stop_token > 0.7 and t > inputs.shape[0] / 2:
if stop_token > self.stop_threshold and t > inputs.shape[0] // 2:
break
if len(outputs) == self.max_decoder_steps:
print(" | > Decoder stopped with 'max_decoder_steps")
@ -315,7 +386,6 @@ class Decoder(nn.Module):
self.attention.init_win_idx()
self.attention.init_states(inputs)
outputs, stop_tokens, alignments, t = [], [], [], 0
stop_flags = [True, False, False]
while True:
memory = self.prenet(self.memory_truncated)
decoder_output, alignment, stop_token = self.decode(memory)

View File

@ -0,0 +1,166 @@
# coding: utf-8
import torch
from torch import nn
from mozilla_voice_tts.tts.layers.gst_layers import GST
from mozilla_voice_tts.tts.layers.tacotron import Decoder, Encoder, PostCBHG
from mozilla_voice_tts.tts.models.tacotron_abstract import TacotronAbstract
class Tacotron(TacotronAbstract):
def __init__(self,
num_chars,
num_speakers,
r=5,
postnet_output_dim=1025,
decoder_output_dim=80,
attn_type='original',
attn_win=False,
attn_norm="sigmoid",
prenet_type="original",
prenet_dropout=True,
forward_attn=False,
trans_agent=False,
forward_attn_mask=False,
location_attn=True,
attn_K=5,
separate_stopnet=True,
bidirectional_decoder=False,
double_decoder_consistency=False,
ddc_r=None,
encoder_in_features=256,
decoder_in_features=256,
speaker_embedding_dim=None,
gst=False,
gst_embedding_dim=256,
gst_num_heads=4,
gst_style_tokens=10,
memory_size=5):
super(Tacotron,
self).__init__(num_chars, num_speakers, r, postnet_output_dim,
decoder_output_dim, attn_type, attn_win,
attn_norm, prenet_type, prenet_dropout,
forward_attn, trans_agent, forward_attn_mask,
location_attn, attn_K, separate_stopnet,
bidirectional_decoder, double_decoder_consistency,
ddc_r, encoder_in_features, decoder_in_features,
speaker_embedding_dim, gst, gst_embedding_dim,
gst_num_heads, gst_style_tokens)
# speaker embedding layers
if self.num_speakers > 1:
if not self.embeddings_per_sample:
speaker_embedding_dim = 256
self.speaker_embedding = nn.Embedding(self.num_speakers, speaker_embedding_dim)
self.speaker_embedding.weight.data.normal_(0, 0.3)
# speaker and gst embeddings are concatenated with the decoder input
if self.num_speakers > 1:
self.decoder_in_features += speaker_embedding_dim # add speaker embedding dim
# embedding layer
self.embedding = nn.Embedding(num_chars, 256, padding_idx=0)
self.embedding.weight.data.normal_(0, 0.3)
# base model layers
self.encoder = Encoder(self.encoder_in_features)
self.decoder = Decoder(self.decoder_in_features, decoder_output_dim, r,
memory_size, attn_type, attn_win, attn_norm,
prenet_type, prenet_dropout, forward_attn,
trans_agent, forward_attn_mask, location_attn,
attn_K, separate_stopnet)
self.postnet = PostCBHG(decoder_output_dim)
self.last_linear = nn.Linear(self.postnet.cbhg.gru_features * 2,
postnet_output_dim)
# global style token layers
if self.gst:
self.gst_layer = GST(num_mel=80,
num_heads=gst_num_heads,
num_style_tokens=gst_style_tokens,
embedding_dim=gst_embedding_dim)
# backward pass decoder
if self.bidirectional_decoder:
self._init_backward_decoder()
# setup DDC
if self.double_decoder_consistency:
self.coarse_decoder = Decoder(
self.decoder_in_features, decoder_output_dim, ddc_r, memory_size,
attn_type, attn_win, attn_norm, prenet_type, prenet_dropout,
forward_attn, trans_agent, forward_attn_mask, location_attn,
attn_K, separate_stopnet)
def forward(self, characters, text_lengths, mel_specs, mel_lengths=None, speaker_ids=None, speaker_embeddings=None):
"""
Shapes:
- characters: B x T_in
- text_lengths: B
- mel_specs: B x T_out x D
- speaker_ids: B x 1
"""
input_mask, output_mask = self.compute_masks(text_lengths, mel_lengths)
# B x T_in x embed_dim
inputs = self.embedding(characters)
# B x T_in x encoder_in_features
encoder_outputs = self.encoder(inputs)
# sequence masking
encoder_outputs = encoder_outputs * input_mask.unsqueeze(2).expand_as(encoder_outputs)
# global style token
if self.gst:
# B x gst_dim
encoder_outputs = self.compute_gst(encoder_outputs, mel_specs)
# speaker embedding
if self.num_speakers > 1:
if not self.embeddings_per_sample:
# B x 1 x speaker_embed_dim
speaker_embeddings = self.speaker_embedding(speaker_ids)[:, None]
else:
# B x 1 x speaker_embed_dim
speaker_embeddings = torch.unsqueeze(speaker_embeddings, 1)
encoder_outputs = self._concat_speaker_embedding(encoder_outputs, speaker_embeddings)
# decoder_outputs: B x decoder_in_features x T_out
# alignments: B x T_in x encoder_in_features
# stop_tokens: B x T_in
decoder_outputs, alignments, stop_tokens = self.decoder(
encoder_outputs, mel_specs, input_mask)
# sequence masking
if output_mask is not None:
decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
# B x T_out x decoder_in_features
postnet_outputs = self.postnet(decoder_outputs)
# sequence masking
if output_mask is not None:
postnet_outputs = postnet_outputs * output_mask.unsqueeze(2).expand_as(postnet_outputs)
# B x T_out x posnet_dim
postnet_outputs = self.last_linear(postnet_outputs)
# B x T_out x decoder_in_features
decoder_outputs = decoder_outputs.transpose(1, 2).contiguous()
if self.bidirectional_decoder:
decoder_outputs_backward, alignments_backward = self._backward_pass(mel_specs, encoder_outputs, input_mask)
return decoder_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward
if self.double_decoder_consistency:
decoder_outputs_backward, alignments_backward = self._coarse_decoder_pass(mel_specs, encoder_outputs, alignments, input_mask)
return decoder_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward
return decoder_outputs, postnet_outputs, alignments, stop_tokens
@torch.no_grad()
def inference(self, characters, speaker_ids=None, style_mel=None, speaker_embeddings=None):
inputs = self.embedding(characters)
encoder_outputs = self.encoder(inputs)
if self.gst:
# B x gst_dim
encoder_outputs = self.compute_gst(encoder_outputs, style_mel)
if self.num_speakers > 1:
if not self.embeddings_per_sample:
# B x 1 x speaker_embed_dim
speaker_embeddings = self.speaker_embedding(speaker_ids)[:, None]
else:
# B x 1 x speaker_embed_dim
speaker_embeddings = torch.unsqueeze(speaker_embeddings, 1)
encoder_outputs = self._concat_speaker_embedding(encoder_outputs, speaker_embeddings)
decoder_outputs, alignments, stop_tokens = self.decoder.inference(
encoder_outputs)
postnet_outputs = self.postnet(decoder_outputs)
postnet_outputs = self.last_linear(postnet_outputs)
decoder_outputs = decoder_outputs.transpose(1, 2)
return decoder_outputs, postnet_outputs, alignments, stop_tokens
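A minimal forward-pass sketch for the new Tacotron class with random weights, only to illustrate the documented tensor shapes. It assumes the rest of the `mozilla_voice_tts` package (TacotronAbstract and the layer modules) behaves as in the training script, and it picks `T_out` as a multiple of `r`.

```python
import torch
from mozilla_voice_tts.tts.models.tacotron import Tacotron

model = Tacotron(num_chars=65, num_speakers=1, r=5,
                 postnet_output_dim=1025, decoder_output_dim=80)
model.eval()

B, T_in, T_out = 2, 10, 30                                   # T_out % r == 0
characters = torch.randint(0, 65, (B, T_in))
text_lengths = torch.full((B,), T_in, dtype=torch.long)
mel_specs = torch.rand(B, T_out, 80)
mel_lengths = torch.full((B,), T_out, dtype=torch.long)

with torch.no_grad():
    decoder_out, postnet_out, alignments, stop_tokens = model(
        characters, text_lengths, mel_specs, mel_lengths)

print(decoder_out.shape)    # (B, 80, T_out) mel frames
print(postnet_out.shape)    # (B, T_out, 1025) linear spectrogram frames
print(alignments.shape)     # (B, T_out // r, T_in) attention weights
print(stop_tokens.shape)    # stop predictions per decoder step
```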

View File

@ -0,0 +1,184 @@
import torch
from torch import nn
from mozilla_voice_tts.tts.layers.gst_layers import GST
from mozilla_voice_tts.tts.layers.tacotron2 import Decoder, Encoder, Postnet
from mozilla_voice_tts.tts.models.tacotron_abstract import TacotronAbstract
# TODO: match function arguments with tacotron
class Tacotron2(TacotronAbstract):
def __init__(self,
num_chars,
num_speakers,
r,
postnet_output_dim=80,
decoder_output_dim=80,
attn_type='original',
attn_win=False,
attn_norm="softmax",
prenet_type="original",
prenet_dropout=True,
forward_attn=False,
trans_agent=False,
forward_attn_mask=False,
location_attn=True,
attn_K=5,
separate_stopnet=True,
bidirectional_decoder=False,
double_decoder_consistency=False,
ddc_r=None,
encoder_in_features=512,
decoder_in_features=512,
speaker_embedding_dim=None,
gst=False,
gst_embedding_dim=512,
gst_num_heads=4,
gst_style_tokens=10):
super(Tacotron2,
self).__init__(num_chars, num_speakers, r, postnet_output_dim,
decoder_output_dim, attn_type, attn_win,
attn_norm, prenet_type, prenet_dropout,
forward_attn, trans_agent, forward_attn_mask,
location_attn, attn_K, separate_stopnet,
bidirectional_decoder, double_decoder_consistency,
ddc_r, encoder_in_features, decoder_in_features,
speaker_embedding_dim, gst, gst_embedding_dim,
gst_num_heads, gst_style_tokens)
# speaker embedding layer
if self.num_speakers > 1:
if not self.embeddings_per_sample:
speaker_embedding_dim = 512
self.speaker_embedding = nn.Embedding(self.num_speakers, speaker_embedding_dim)
self.speaker_embedding.weight.data.normal_(0, 0.3)
# speaker and gst embeddings are concatenated with the decoder input
if self.num_speakers > 1:
self.decoder_in_features += speaker_embedding_dim # add speaker embedding dim
# embedding layer
self.embedding = nn.Embedding(num_chars, 512, padding_idx=0)
# base model layers
self.encoder = Encoder(self.encoder_in_features)
self.decoder = Decoder(self.decoder_in_features, self.decoder_output_dim, r, attn_type, attn_win,
attn_norm, prenet_type, prenet_dropout,
forward_attn, trans_agent, forward_attn_mask,
location_attn, attn_K, separate_stopnet)
self.postnet = Postnet(self.postnet_output_dim)
# global style token layers
if self.gst:
self.gst_layer = GST(num_mel=80,
num_heads=self.gst_num_heads,
num_style_tokens=self.gst_style_tokens,
embedding_dim=self.gst_embedding_dim)
# backward pass decoder
if self.bidirectional_decoder:
self._init_backward_decoder()
# setup DDC
if self.double_decoder_consistency:
self.coarse_decoder = Decoder(
self.decoder_in_features, self.decoder_output_dim, ddc_r, attn_type,
attn_win, attn_norm, prenet_type, prenet_dropout, forward_attn,
trans_agent, forward_attn_mask, location_attn, attn_K,
separate_stopnet)
@staticmethod
def shape_outputs(mel_outputs, mel_outputs_postnet, alignments):
mel_outputs = mel_outputs.transpose(1, 2)
mel_outputs_postnet = mel_outputs_postnet.transpose(1, 2)
return mel_outputs, mel_outputs_postnet, alignments
def forward(self, text, text_lengths, mel_specs=None, mel_lengths=None, speaker_ids=None, speaker_embeddings=None):
# compute mask for padding
# B x T_in_max (boolean)
input_mask, output_mask = self.compute_masks(text_lengths, mel_lengths)
# B x D_embed x T_in_max
embedded_inputs = self.embedding(text).transpose(1, 2)
# B x T_in_max x D_en
encoder_outputs = self.encoder(embedded_inputs, text_lengths)
if self.gst:
# B x gst_dim
encoder_outputs = self.compute_gst(encoder_outputs, mel_specs)
if self.num_speakers > 1:
if not self.embeddings_per_sample:
# B x 1 x speaker_embed_dim
speaker_embeddings = self.speaker_embedding(speaker_ids)[:, None]
else:
# B x 1 x speaker_embed_dim
speaker_embeddings = torch.unsqueeze(speaker_embeddings, 1)
encoder_outputs = self._concat_speaker_embedding(encoder_outputs, speaker_embeddings)
encoder_outputs = encoder_outputs * input_mask.unsqueeze(2).expand_as(encoder_outputs)
# B x mel_dim x T_out -- B x T_out//r x T_in -- B x T_out//r
decoder_outputs, alignments, stop_tokens = self.decoder(
encoder_outputs, mel_specs, input_mask)
# sequence masking
if mel_lengths is not None:
decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
# B x mel_dim x T_out
postnet_outputs = self.postnet(decoder_outputs)
postnet_outputs = decoder_outputs + postnet_outputs
# sequence masking
if output_mask is not None:
postnet_outputs = postnet_outputs * output_mask.unsqueeze(1).expand_as(postnet_outputs)
# B x T_out x mel_dim -- B x T_out x mel_dim -- B x T_out//r x T_in
decoder_outputs, postnet_outputs, alignments = self.shape_outputs(
decoder_outputs, postnet_outputs, alignments)
if self.bidirectional_decoder:
decoder_outputs_backward, alignments_backward = self._backward_pass(mel_specs, encoder_outputs, input_mask)
return decoder_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward
if self.double_decoder_consistency:
decoder_outputs_backward, alignments_backward = self._coarse_decoder_pass(mel_specs, encoder_outputs, alignments, input_mask)
return decoder_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward
return decoder_outputs, postnet_outputs, alignments, stop_tokens
@torch.no_grad()
def inference(self, text, speaker_ids=None, style_mel=None, speaker_embeddings=None):
embedded_inputs = self.embedding(text).transpose(1, 2)
encoder_outputs = self.encoder.inference(embedded_inputs)
if self.gst:
# B x gst_dim
encoder_outputs = self.compute_gst(encoder_outputs, style_mel)
if self.num_speakers > 1:
if not self.embeddings_per_sample:
speaker_embeddings = self.speaker_embedding(speaker_ids)[:, None]
encoder_outputs = self._concat_speaker_embedding(encoder_outputs, speaker_embeddings)
decoder_outputs, alignments, stop_tokens = self.decoder.inference(
encoder_outputs)
postnet_outputs = self.postnet(decoder_outputs)
postnet_outputs = decoder_outputs + postnet_outputs
decoder_outputs, postnet_outputs, alignments = self.shape_outputs(
decoder_outputs, postnet_outputs, alignments)
return decoder_outputs, postnet_outputs, alignments, stop_tokens
def inference_truncated(self, text, speaker_ids=None, style_mel=None, speaker_embeddings=None):
"""
Preserve model states for continuous inference
"""
embedded_inputs = self.embedding(text).transpose(1, 2)
encoder_outputs = self.encoder.inference_truncated(embedded_inputs)
if self.gst:
# B x gst_dim
encoder_outputs = self.compute_gst(encoder_outputs, style_mel)
if self.num_speakers > 1:
if not self.embeddings_per_sample:
speaker_embeddings = self.speaker_embedding(speaker_ids)[:, None]
encoder_outputs = self._concat_speaker_embedding(encoder_outputs, speaker_embeddings)
mel_outputs, alignments, stop_tokens = self.decoder.inference_truncated(
encoder_outputs)
mel_outputs_postnet = self.postnet(mel_outputs)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
mel_outputs, mel_outputs_postnet, alignments = self.shape_outputs(
mel_outputs, mel_outputs_postnet, alignments)
return mel_outputs, mel_outputs_postnet, alignments, stop_tokens
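
For orientation, a minimal single-speaker usage sketch against the constructor and `inference` signature above; the hyper-parameters are illustrative placeholders rather than the project defaults, and an untrained model will only produce noise:

```python
import torch
from mozilla_voice_tts.tts.models.tacotron2 import Tacotron2

model = Tacotron2(num_chars=130, num_speakers=1, r=2)   # placeholder sizes
model.eval()
dummy_text = torch.randint(0, 130, (1, 50))             # one batch of character ids
decoder_out, postnet_out, alignments, stop_tokens = model.inference(dummy_text)
print(postnet_out.shape)  # roughly (1, T_decoder, 80); T_decoder is decided by the stopnet
```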

View File

@ -0,0 +1,212 @@
import copy
from abc import ABC, abstractmethod
import torch
from torch import nn
from mozilla_voice_tts.tts.utils.generic_utils import sequence_mask
class TacotronAbstract(ABC, nn.Module):
def __init__(self,
num_chars,
num_speakers,
r,
postnet_output_dim=80,
decoder_output_dim=80,
attn_type='original',
attn_win=False,
attn_norm="softmax",
prenet_type="original",
prenet_dropout=True,
forward_attn=False,
trans_agent=False,
forward_attn_mask=False,
location_attn=True,
attn_K=5,
separate_stopnet=True,
bidirectional_decoder=False,
double_decoder_consistency=False,
ddc_r=None,
encoder_in_features=512,
decoder_in_features=512,
speaker_embedding_dim=None,
gst=False,
gst_embedding_dim=512,
gst_num_heads=4,
gst_style_tokens=10):
""" Abstract Tacotron class """
super().__init__()
self.num_chars = num_chars
self.r = r
self.decoder_output_dim = decoder_output_dim
self.postnet_output_dim = postnet_output_dim
self.gst = gst
self.gst_embedding_dim = gst_embedding_dim
self.gst_num_heads = gst_num_heads
self.gst_style_tokens = gst_style_tokens
self.num_speakers = num_speakers
self.bidirectional_decoder = bidirectional_decoder
self.double_decoder_consistency = double_decoder_consistency
self.ddc_r = ddc_r
self.attn_type = attn_type
self.attn_win = attn_win
self.attn_norm = attn_norm
self.prenet_type = prenet_type
self.prenet_dropout = prenet_dropout
self.forward_attn = forward_attn
self.trans_agent = trans_agent
self.forward_attn_mask = forward_attn_mask
self.location_attn = location_attn
self.attn_K = attn_K
self.separate_stopnet = separate_stopnet
self.encoder_in_features = encoder_in_features
self.decoder_in_features = decoder_in_features
self.speaker_embedding_dim = speaker_embedding_dim
# layers
self.embedding = None
self.encoder = None
self.decoder = None
self.postnet = None
# multispeaker
if self.speaker_embedding_dim is None:
# if speaker_embedding_dim is None, use an nn.Embedding layer with the default speaker_embedding_dim
self.embeddings_per_sample = False
else:
# if speaker_embedding_dim is not None, use an external speaker embedding per sample
self.embeddings_per_sample = True
# global style token
if self.gst:
self.decoder_in_features += gst_embedding_dim # add gst embedding dim
self.gst_layer = None
# model states
self.speaker_embeddings = None
self.speaker_embeddings_projected = None
# additional layers
self.decoder_backward = None
self.coarse_decoder = None
#############################
# INIT FUNCTIONS
#############################
def _init_states(self):
self.speaker_embeddings = None
self.speaker_embeddings_projected = None
def _init_backward_decoder(self):
self.decoder_backward = copy.deepcopy(self.decoder)
def _init_coarse_decoder(self):
self.coarse_decoder = copy.deepcopy(self.decoder)
self.coarse_decoder.r_init = self.ddc_r
self.coarse_decoder.set_r(self.ddc_r)
#############################
# CORE FUNCTIONS
#############################
@abstractmethod
def forward(self):
pass
@abstractmethod
def inference(self):
pass
#############################
# COMMON COMPUTE FUNCTIONS
#############################
def compute_masks(self, text_lengths, mel_lengths):
"""Compute masks against sequence paddings."""
# B x T_in_max (boolean)
device = text_lengths.device
input_mask = sequence_mask(text_lengths).to(device)
output_mask = None
if mel_lengths is not None:
max_len = mel_lengths.max()
r = self.decoder.r
max_len = max_len + (r - (max_len % r)) if max_len % r > 0 else max_len
output_mask = sequence_mask(mel_lengths, max_len=max_len).to(device)
return input_mask, output_mask
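
The output mask is built over a length rounded up to the next multiple of the reduction factor `r`, since the decoder emits `r` frames per step. A minimal sketch of just that rounding step, with arbitrary values:

```python
import torch

mel_lengths = torch.tensor([98, 100])
r = 7
max_len = mel_lengths.max()
# pad the max length up to the next multiple of r, exactly as compute_masks does
max_len = max_len + (r - (max_len % r)) if max_len % r > 0 else max_len
print(max_len)  # tensor(105): 100 rounded up to the next multiple of 7
```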
def _backward_pass(self, mel_specs, encoder_outputs, mask):
""" Run backwards decoder """
decoder_outputs_b, alignments_b, _ = self.decoder_backward(
encoder_outputs, torch.flip(mel_specs, dims=(1,)), mask,
self.speaker_embeddings_projected)
decoder_outputs_b = decoder_outputs_b.transpose(1, 2).contiguous()
return decoder_outputs_b, alignments_b
def _coarse_decoder_pass(self, mel_specs, encoder_outputs, alignments,
input_mask):
""" Double Decoder Consistency """
T = mel_specs.shape[1]
if T % self.coarse_decoder.r > 0:
padding_size = self.coarse_decoder.r - (T % self.coarse_decoder.r)
mel_specs = torch.nn.functional.pad(mel_specs,
(0, 0, 0, padding_size, 0, 0))
decoder_outputs_backward, alignments_backward, _ = self.coarse_decoder(
encoder_outputs.detach(), mel_specs, input_mask)
# scale_factor = self.decoder.r_init / self.decoder.r
alignments_backward = torch.nn.functional.interpolate(
alignments_backward.transpose(1, 2),
size=alignments.shape[1],
mode='nearest').transpose(1, 2)
decoder_outputs_backward = decoder_outputs_backward.transpose(1, 2)
decoder_outputs_backward = decoder_outputs_backward[:, :T, :]
return decoder_outputs_backward, alignments_backward
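
The coarse decoder uses a larger reduction factor (`ddc_r`), so its alignment has fewer decoder steps than the fine one; the nearest-neighbour interpolation above stretches it back onto the fine time axis before the consistency loss is computed. A rough shape-only sketch with dummy tensors (all sizes are made up):

```python
import torch

B, T_in, r, ddc_r, T_out = 2, 37, 2, 6, 84
alignments = torch.rand(B, T_out // r, T_in)               # fine decoder: 42 steps
alignments_backward = torch.rand(B, T_out // ddc_r, T_in)  # coarse decoder: 14 steps
alignments_backward = torch.nn.functional.interpolate(
    alignments_backward.transpose(1, 2),   # B x T_in x 14
    size=alignments.shape[1],              # stretch to the fine decoder's 42 steps
    mode='nearest').transpose(1, 2)
print(alignments_backward.shape)  # torch.Size([2, 42, 37])
```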
#############################
# EMBEDDING FUNCTIONS
#############################
def compute_speaker_embedding(self, speaker_ids):
""" Compute speaker embedding vectors """
if hasattr(self, "speaker_embedding") and speaker_ids is None:
raise RuntimeError(
" [!] Model has speaker embedding layer but speaker_id is not provided"
)
if hasattr(self, "speaker_embedding") and speaker_ids is not None:
self.speaker_embeddings = self.speaker_embedding(speaker_ids).unsqueeze(1)
if hasattr(self, "speaker_project_mel") and speaker_ids is not None:
self.speaker_embeddings_projected = self.speaker_project_mel(
self.speaker_embeddings).squeeze(1)
def compute_gst(self, inputs, style_input):
""" Compute global style token """
device = inputs.device
if isinstance(style_input, dict):
query = torch.zeros(1, 1, self.gst_embedding_dim//2).to(device)
_GST = torch.tanh(self.gst_layer.style_token_layer.style_tokens)
gst_outputs = torch.zeros(1, 1, self.gst_embedding_dim).to(device)
for k_token, v_amplifier in style_input.items():
key = _GST[int(k_token)].unsqueeze(0).expand(1, -1, -1)
gst_outputs_att = self.gst_layer.style_token_layer.attention(query, key)
gst_outputs = gst_outputs + gst_outputs_att * v_amplifier
elif style_input is None:
gst_outputs = torch.zeros(1, 1, self.gst_embedding_dim).to(device)
else:
gst_outputs = self.gst_layer(style_input) # pylint: disable=not-callable
inputs = self._concat_speaker_embedding(inputs, gst_outputs)
return inputs
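
`style_input` may be a reference mel spectrogram, `None`, or a dict mapping a style-token index to a weight; in the dict case each selected token's attention output is scaled by its weight and summed. A hedged illustration of the dict form (the token indices and weights here are arbitrary examples, not recommended values):

```python
# weight style token 0 strongly and token 3 slightly; a negative weight subtracts a style
style_dict = {"0": 0.3, "3": -0.1}
# at synthesis time this dict can be passed where a style wav would normally go, e.g.
# synthesis(model, "Hello world.", CONFIG, use_cuda, ap, style_wav=style_dict, ...)
# compute_gst then builds: gst_outputs = sum_k attention(query, token_k) * weight_k
```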
@staticmethod
def _add_speaker_embedding(outputs, speaker_embeddings):
speaker_embeddings_ = speaker_embeddings.expand(
outputs.size(0), outputs.size(1), -1)
outputs = outputs + speaker_embeddings_
return outputs
@staticmethod
def _concat_speaker_embedding(outputs, speaker_embeddings):
speaker_embeddings_ = speaker_embeddings.expand(
outputs.size(0), outputs.size(1), -1)
outputs = torch.cat([outputs, speaker_embeddings_], dim=-1)
return outputs

View File

View File

@ -3,6 +3,9 @@ from tensorflow import keras
from tensorflow.python.ops import math_ops
# from tensorflow_addons.seq2seq import BahdanauAttention
# NOTE: linter has a problem with the current TF release
#pylint: disable=no-value-for-parameter
#pylint: disable=unexpected-keyword-arg
class Linear(keras.layers.Layer):
def __init__(self, units, use_bias, **kwargs):
@ -109,12 +112,18 @@ class Attention(keras.layers.Layer):
raise ValueError("Unknown value for attention norm type")
def init_states(self, batch_size, value_length):
states = ()
states = []
if self.use_loc_attn:
attention_cum = tf.zeros([batch_size, value_length])
attention_old = tf.zeros([batch_size, value_length])
states = (attention_cum, attention_old)
return states
states = [attention_cum, attention_old]
if self.use_forward_attn:
alpha = tf.concat([
tf.ones([batch_size, 1]),
tf.zeros([batch_size, value_length])[:, :-1] + 1e-7
], 1)
states.append(alpha)
return tuple(states)
def process_values(self, values):
""" cache values for decoder iterations """
@ -125,7 +134,7 @@ class Attention(keras.layers.Layer):
def get_loc_attn(self, query, states):
""" compute location attention, query layer and
unnorm. attention weights"""
attention_cum, attention_old = states
attention_cum, attention_old = states[:2]
attn_cat = tf.stack([attention_old, attention_cum], axis=2)
processed_query = self.query_layer(tf.expand_dims(query, 1))
@ -150,6 +159,23 @@ class Attention(keras.layers.Layer):
score -= 1.e9 * math_ops.cast(padding_mask, dtype=tf.float32)
return score
def apply_forward_attention(self, alignment, alpha): #pylint: disable=no-self-use
# forward attention
fwd_shifted_alpha = tf.pad(alpha[:, :-1], ((0, 0), (1, 0)), constant_values=0.0)
# compute transition potentials
new_alpha = ((1 - 0.5) * alpha + 0.5 * fwd_shifted_alpha + 1e-8) * alignment
# renormalize attention weights
new_alpha = new_alpha / tf.reduce_sum(new_alpha, axis=1, keepdims=True)
return new_alpha
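
Forward attention mixes each position's previous weight with its left neighbour (fixed 0.5/0.5 transition weights here), multiplies by the fresh alignment and renormalizes, which nudges the attention to move monotonically forward. A small NumPy sketch of one step, with made-up numbers:

```python
import numpy as np

alpha = np.array([[0.7, 0.2, 0.1, 0.0]])       # previous forward-attention weights
alignment = np.array([[0.1, 0.6, 0.2, 0.1]])   # current normalized attention scores
shifted = np.pad(alpha[:, :-1], ((0, 0), (1, 0)))           # alpha shifted one step right
new_alpha = (0.5 * alpha + 0.5 * shifted + 1e-8) * alignment
new_alpha /= new_alpha.sum(axis=1, keepdims=True)
print(new_alpha.round(3))  # [[0.103 0.794 0.088 0.015]]: mass moves one step forward
```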
def update_states(self, old_states, scores_norm, attn_weights, new_alpha=None):
states = []
if self.use_loc_attn:
states = [old_states[0] + scores_norm, attn_weights]
if self.use_forward_attn:
states.append(new_alpha)
return tuple(states)
def call(self, query, states):
"""
shapes:
@ -165,13 +191,19 @@ class Attention(keras.layers.Layer):
# self.apply_score_masking(score, mask)
# attn_weights shape == (batch_size, max_length, 1)
attn_weights = self.norm_func(score)
# normalize attention scores
scores_norm = self.norm_func(score)
attn_weights = scores_norm
# update attention states
if self.use_loc_attn:
states = (states[0] + attn_weights, attn_weights)
else:
states = ()
# apply forward attention
new_alpha = None
if self.use_forward_attn:
new_alpha = self.apply_forward_attention(attn_weights, states[-1])
attn_weights = new_alpha
# update states tuple
# states = (cum_attn_weights, attn_weights, new_alpha)
states = self.update_states(states, scores_norm, attn_weights, new_alpha)
# context_vector shape after sum == (batch_size, hidden_size)
context_vector = tf.matmul(tf.expand_dims(attn_weights, axis=2), self.values, transpose_a=True, transpose_b=False)

View File

@ -1,11 +1,12 @@
import tensorflow as tf
from tensorflow import keras
from TTS.tf.utils.tf_utils import shape_list
from TTS.tf.layers.common_layers import Prenet, Attention
from mozilla_voice_tts.tts.tf.utils.tf_utils import shape_list
from mozilla_voice_tts.tts.tf.layers.common_layers import Prenet, Attention
# from tensorflow_addons.seq2seq import AttentionWrapper
# NOTE: linter has a problem with the current TF release
#pylint: disable=no-value-for-parameter
#pylint: disable=unexpected-keyword-arg
class ConvBNBlock(keras.layers.Layer):
def __init__(self, filters, kernel_size, activation, **kwargs):
super(ConvBNBlock, self).__init__(**kwargs)
@ -58,12 +59,16 @@ class Decoder(keras.layers.Layer):
#pylint: disable=unused-argument
def __init__(self, frame_dim, r, attn_type, use_attn_win, attn_norm, prenet_type,
prenet_dropout, use_forward_attn, use_trans_agent, use_forward_attn_mask,
use_location_attn, attn_K, separate_stopnet, speaker_emb_dim, **kwargs):
use_location_attn, attn_K, separate_stopnet, speaker_emb_dim, enable_tflite, **kwargs):
super(Decoder, self).__init__(**kwargs)
self.frame_dim = frame_dim
self.r_init = tf.constant(r, dtype=tf.int32)
self.r = tf.constant(r, dtype=tf.int32)
self.output_dim = r * self.frame_dim
self.separate_stopnet = separate_stopnet
self.enable_tflite = enable_tflite
# layer constants
self.max_decoder_steps = tf.constant(1000, dtype=tf.int32)
self.stop_thresh = tf.constant(0.5, dtype=tf.float32)
@ -80,7 +85,7 @@ class Decoder(keras.layers.Layer):
[self.prenet_dim, self.prenet_dim],
bias=False,
name='prenet')
self.attention_rnn = keras.layers.LSTMCell(self.query_dim, use_bias=True, name=f'{self.name}/attention_rnn', )
self.attention_rnn = keras.layers.LSTMCell(self.query_dim, use_bias=True, name='attention_rnn', )
self.attention_rnn_dropout = keras.layers.Dropout(0.5)
# TODO: implement other attn options
@ -94,10 +99,10 @@ class Decoder(keras.layers.Layer):
use_trans_agent=use_trans_agent,
use_forward_attn_mask=use_forward_attn_mask,
name='attention')
self.decoder_rnn = keras.layers.LSTMCell(self.decoder_rnn_dim, use_bias=True, name=f'{self.name}/decoder_rnn')
self.decoder_rnn = keras.layers.LSTMCell(self.decoder_rnn_dim, use_bias=True, name='decoder_rnn')
self.decoder_rnn_dropout = keras.layers.Dropout(0.5)
self.linear_projection = keras.layers.Dense(self.frame_dim * r, name=f'{self.name}/linear_projection/linear_layer')
self.stopnet = keras.layers.Dense(1, name=f'{self.name}/stopnet/linear_layer')
self.linear_projection = keras.layers.Dense(self.frame_dim * r, name='linear_projection/linear_layer')
self.stopnet = keras.layers.Dense(1, name='stopnet/linear_layer')
def set_max_decoder_steps(self, new_max_steps):
@ -105,6 +110,7 @@ class Decoder(keras.layers.Layer):
def set_r(self, new_r):
self.r = tf.constant(new_r, dtype=tf.int32)
self.output_dim = self.frame_dim * new_r
def build_decoder_initial_states(self, batch_size, memory_dim, memory_length):
zero_frame = tf.zeros([batch_size, self.frame_dim])
@ -183,6 +189,7 @@ class Decoder(keras.layers.Layer):
outputs = tf.TensorArray(dtype=tf.float32, size=0, clear_after_read=False, dynamic_size=True)
attentions = tf.TensorArray(dtype=tf.float32, size=0, clear_after_read=False, dynamic_size=True)
stop_tokens = tf.TensorArray(dtype=tf.float32, size=0, clear_after_read=False, dynamic_size=True)
# pre-computes
self.attention.process_values(memory)
@ -226,7 +233,70 @@ class Decoder(keras.layers.Layer):
outputs = tf.reshape(outputs, [B, -1, self.frame_dim])
return outputs, stop_tokens, attentions
def decode_inference_tflite(self, memory, states):
"""Inference with TF-Lite compatibility. It assumes
batch_size is 1"""
# init states
# dynamic_shape is not supported in TFLite
outputs = tf.TensorArray(dtype=tf.float32,
size=self.max_decoder_steps,
element_shape=tf.TensorShape(
[self.output_dim]),
clear_after_read=False,
dynamic_size=False)
# stop_flags = tf.TensorArray(dtype=tf.bool,
# size=self.max_decoder_steps,
# element_shape=tf.TensorShape(
# []),
# clear_after_read=False,
# dynamic_size=False)
attentions = ()
stop_tokens = ()
# pre-computes
self.attention.process_values(memory)
# iter vars
stop_flag = tf.constant(False, dtype=tf.bool)
step_count = tf.constant(0, dtype=tf.int32)
def _body(step, memory, states, outputs, stop_flag):
frame_next = states[0]
prenet_next = self.prenet(frame_next, training=False)
output, stop_token, states, _ = self.step(prenet_next,
states,
None,
training=False)
stop_token = tf.math.sigmoid(stop_token)
stop_flag = tf.greater(stop_token, self.stop_thresh)
stop_flag = tf.reduce_all(stop_flag)
# stop_flags = stop_flags.write(step, tf.logical_not(stop_flag))
outputs = outputs.write(step, tf.reshape(output, [-1]))
return step + 1, memory, states, outputs, stop_flag
cond = lambda step, m, s, o, stop_flag: tf.equal(stop_flag, tf.constant(False, dtype=tf.bool))
step_count, memory, states, outputs, stop_flag = \
tf.while_loop(cond,
_body,
loop_vars=(step_count, memory, states, outputs,
stop_flag),
parallel_iterations=32,
swap_memory=True,
maximum_iterations=self.max_decoder_steps)
outputs = outputs.stack()
outputs = tf.gather(outputs, tf.range(step_count)) # pylint: disable=no-value-for-parameter
outputs = tf.expand_dims(outputs, axis=[0])
outputs = tf.transpose(outputs, [1, 0, 2])
outputs = tf.reshape(outputs, [1, -1, self.frame_dim])
return outputs, stop_tokens, attentions
def call(self, memory, states, frames=None, memory_seq_length=None, training=False):
if training:
return self.decode(memory, states, frames, memory_seq_length)
if self.enable_tflite:
return self.decode_inference_tflite(memory, states)
return self.decode_inference(memory, states)

View File

@ -1,10 +1,11 @@
import tensorflow as tf
from tensorflow import keras
from TTS.tf.layers.tacotron2 import Encoder, Decoder, Postnet
from TTS.tf.utils.tf_utils import shape_list
from mozilla_voice_tts.tts.tf.layers.tacotron2 import Encoder, Decoder, Postnet
from mozilla_voice_tts.tts.tf.utils.tf_utils import shape_list
#pylint: disable=too-many-ancestors
#pylint: disable=too-many-ancestors, abstract-method
class Tacotron2(keras.models.Model):
def __init__(self,
num_chars,
@ -23,7 +24,8 @@ class Tacotron2(keras.models.Model):
forward_attn_mask=False,
location_attn=True,
separate_stopnet=True,
bidirectional_decoder=False):
bidirectional_decoder=False,
enable_tflite=False):
super(Tacotron2, self).__init__()
self.r = r
self.decoder_output_dim = decoder_output_dim
@ -31,6 +33,7 @@ class Tacotron2(keras.models.Model):
self.bidirectional_decoder = bidirectional_decoder
self.num_speakers = num_speakers
self.speaker_embed_dim = 256
self.enable_tflite = enable_tflite
self.embedding = keras.layers.Embedding(num_chars, 512, name='embedding')
self.encoder = Encoder(512, name='encoder')
@ -48,9 +51,12 @@ class Tacotron2(keras.models.Model):
use_location_attn=location_attn,
attn_K=attn_K,
separate_stopnet=separate_stopnet,
speaker_emb_dim=self.speaker_embed_dim)
speaker_emb_dim=self.speaker_embed_dim,
name='decoder',
enable_tflite=enable_tflite)
self.postnet = Postnet(postnet_output_dim, 5, name='postnet')
@tf.function(experimental_relax_shapes=True)
def call(self, characters, text_lengths=None, frames=None, training=None):
if training:
return self.training(characters, text_lengths, frames)
@ -79,3 +85,23 @@ class Tacotron2(keras.models.Model):
print(output_frames.shape)
return decoder_frames, output_frames, attentions, stop_tokens
@tf.function(
experimental_relax_shapes=True,
input_signature=[
tf.TensorSpec([1, None], dtype=tf.int32),
],)
def inference_tflite(self, characters):
B, T = shape_list(characters)
embedding_vectors = self.embedding(characters, training=False)
encoder_output = self.encoder(embedding_vectors, training=False)
decoder_states = self.decoder.build_decoder_initial_states(B, 512, T)
decoder_frames, stop_tokens, attentions = self.decoder(encoder_output, decoder_states, training=False)
postnet_frames = self.postnet(decoder_frames, training=False)
output_frames = decoder_frames + postnet_frames
print(output_frames.shape)
return decoder_frames, output_frames, attentions, stop_tokens
def build_inference(self, ):
# TODO: issue https://github.com/PyCQA/pylint/issues/3613
input_ids = tf.random.uniform(shape=[1, 4], maxval=10, dtype=tf.int32) #pylint: disable=unexpected-keyword-arg
self(input_ids)

View File

@ -1,6 +1,9 @@
import numpy as np
import tensorflow as tf
# NOTE: linter has a problem with the current TF release
#pylint: disable=no-value-for-parameter
#pylint: disable=unexpected-keyword-arg
def tf_create_dummy_inputs():
""" Create dummy inputs for TF Tacotron2 model """

View File

@ -1,4 +1,3 @@
import os
import datetime
import importlib
import pickle
@ -6,9 +5,7 @@ import numpy as np
import tensorflow as tf
def save_checkpoint(model, optimizer, current_step, epoch, r, output_folder, **kwargs):
checkpoint_path = 'tts_tf_checkpoint_{}.pkl'.format(current_step)
checkpoint_path = os.path.join(output_folder, checkpoint_path)
def save_checkpoint(model, optimizer, current_step, epoch, r, output_path, **kwargs):
state = {
'model': model.weights,
'optimizer': optimizer,
@ -18,7 +15,7 @@ def save_checkpoint(model, optimizer, current_step, epoch, r, output_folder, **k
'r': r
}
state.update(kwargs)
pickle.dump(state, open(checkpoint_path, 'wb'))
pickle.dump(state, open(output_path, 'wb'))
def load_checkpoint(model, checkpoint_path):
@ -27,7 +24,13 @@ def load_checkpoint(model, checkpoint_path):
tf_vars = model.weights
for tf_var in tf_vars:
layer_name = tf_var.name
chkp_var_value = chkp_var_dict[layer_name]
try:
chkp_var_value = chkp_var_dict[layer_name]
except KeyError:
class_name = list(chkp_var_dict.keys())[0].split("/")[0]
layer_name = f"{class_name}/{layer_name}"
chkp_var_value = chkp_var_dict[layer_name]
tf.keras.backend.set_value(tf_var, chkp_var_value)
if 'r' in checkpoint.keys():
model.decoder.set_r(checkpoint['r'])
@ -72,9 +75,9 @@ def count_parameters(model, c):
return model.count_params()
def setup_model(num_chars, num_speakers, c):
def setup_model(num_chars, num_speakers, c, enable_tflite=False):
print(" > Using model: {}".format(c.model))
MyModel = importlib.import_module('TTS.tf.models.' + c.model.lower())
MyModel = importlib.import_module('mozilla_voice_tts.tts.tf.models.' + c.model.lower())
MyModel = getattr(MyModel, c.model)
if c.model.lower() in "tacotron":
raise NotImplementedError(' [!] Tacotron model is not ready.')
@ -95,5 +98,6 @@ def setup_model(num_chars, num_speakers, c):
location_attn=c.location_attn,
attn_K=c.attention_heads,
separate_stopnet=c.separate_stopnet,
bidirectional_decoder=c.bidirectional_decoder)
bidirectional_decoder=c.bidirectional_decoder,
enable_tflite=enable_tflite)
return model

View File

@ -0,0 +1,41 @@
import pickle
import datetime
import tensorflow as tf
def save_checkpoint(model, optimizer, current_step, epoch, r, output_path, **kwargs):
state = {
'model': model.weights,
'optimizer': optimizer,
'step': current_step,
'epoch': epoch,
'date': datetime.date.today().strftime("%B %d, %Y"),
'r': r
}
state.update(kwargs)
pickle.dump(state, open(output_path, 'wb'))
def load_checkpoint(model, checkpoint_path):
checkpoint = pickle.load(open(checkpoint_path, 'rb'))
chkp_var_dict = {var.name: var.numpy() for var in checkpoint['model']}
tf_vars = model.weights
for tf_var in tf_vars:
layer_name = tf_var.name
try:
chkp_var_value = chkp_var_dict[layer_name]
except KeyError:
class_name = list(chkp_var_dict.keys())[0].split("/")[0]
layer_name = f"{class_name}/{layer_name}"
chkp_var_value = chkp_var_dict[layer_name]
tf.keras.backend.set_value(tf_var, chkp_var_value)
if 'r' in checkpoint.keys():
model.decoder.set_r(checkpoint['r'])
return model
def load_tflite_model(tflite_path):
tflite_model = tf.lite.Interpreter(model_path=tflite_path)
tflite_model.allocate_tensors()
return tflite_model

View File

@ -0,0 +1,31 @@
import tensorflow as tf
def convert_tacotron2_to_tflite(model,
output_path=None,
experimental_converter=True):
"""Convert Tensorflow Tacotron2 model to TFLite. Save a binary file if output_path is
provided, else return TFLite model."""
concrete_function = model.inference_tflite.get_concrete_function()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
[concrete_function])
converter.experimental_new_converter = experimental_converter
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS
]
tflite_model = converter.convert()
print(f'Tflite Model size is {len(tflite_model) / (1024.0 * 1024.0)} MBs.')
if output_path is not None:
# save the model binary if output_path is provided
with open(output_path, 'wb') as f:
f.write(tflite_model)
return None
return tflite_model
def load_tflite_model(tflite_path):
tflite_model = tf.lite.Interpreter(model_path=tflite_path)
tflite_model.allocate_tensors()
return tflite_model
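
Putting these helpers together, a hedged sketch of the intended export flow; `c`, `num_chars` and `num_speakers` stand in for a loaded training config and its derived values, and the file paths are placeholders:

```python
# build the TF graph in TFLite mode, restore weights, then convert
model = setup_model(num_chars, num_speakers, c, enable_tflite=True)
model.build_inference()
model = load_checkpoint(model, 'tts_tf_checkpoint.pkl')        # placeholder checkpoint
convert_tacotron2_to_tflite(model, output_path='tacotron2.tflite')

# at synthesis time, reload the binary with the helper above
tflite_model = load_tflite_model('tacotron2.tflite')
```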

View File

View File

@ -74,4 +74,3 @@ class StandardScaler():
X *= self.scale_
X += self.mean_
return X

View File

@ -1,15 +1,11 @@
# edited from https://github.com/fastai/imagenet-fast/blob/master/imagenet_nv/distributed.py
import os, sys
import math
import time
import subprocess
import argparse
import torch
import torch.distributed as dist
from torch.utils.data.sampler import Sampler
from torch.autograd import Variable
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from TTS.utils.generic_utils import create_experiment_folder
from torch.autograd import Variable
from torch.utils.data.sampler import Sampler
class DistributedSampler(Sampler):
@ -108,7 +104,7 @@ def apply_gradient_allreduce(module):
for param in list(module.parameters()):
def allreduce_hook(*_):
Variable._execution_engine.queue_callback(allreduce_params)
Variable._execution_engine.queue_callback(allreduce_params) #pylint: disable=protected-access
if param.requires_grad:
param.register_hook(allreduce_hook)
@ -118,61 +114,3 @@ def apply_gradient_allreduce(module):
module.register_forward_hook(set_needs_reduction)
return module
def main():
"""
Call train.py as a new process and pass command arguments
"""
parser = argparse.ArgumentParser()
parser.add_argument(
'--continue_path',
type=str,
help='Training output folder to continue training. Use to continue a training. If it is used, "config_path" is ignored.',
default='',
required='--config_path' not in sys.argv)
parser.add_argument(
'--restore_path',
type=str,
help='Model file to be restored. Use to finetune a model.',
default='')
parser.add_argument(
'--config_path',
type=str,
help='Path to config file for training.',
required='--continue_path' not in sys.argv
)
args = parser.parse_args()
# OUT_PATH = create_experiment_folder(CONFIG.output_path, CONFIG.run_name,
# True)
# stdout_path = os.path.join(OUT_PATH, "process_stdout/")
num_gpus = torch.cuda.device_count()
group_id = time.strftime("%Y_%m_%d-%H%M%S")
# set arguments for train.py
command = ['train.py']
command.append('--continue_path={}'.format(args.continue_path))
command.append('--restore_path={}'.format(args.restore_path))
command.append('--config_path={}'.format(args.config_path))
command.append('--group_id=group_{}'.format(group_id))
command.append('')
# run processes
processes = []
for i in range(num_gpus):
my_env = os.environ.copy()
my_env["PYTHON_EGG_CACHE"] = "/tmp/tmp{}".format(i)
command[-1] = '--rank={}'.format(i)
stdout = None if i == 0 else open(os.devnull, 'w')
p = subprocess.Popen(['python3'] + command, stdout=stdout, env=my_env)
processes.append(p)
print(command)
for p in processes:
p.wait()
if __name__ == '__main__':
main()

View File

@ -0,0 +1,229 @@
import torch
import importlib
import numpy as np
from collections import Counter
from mozilla_voice_tts.utils.generic_utils import check_argument
def split_dataset(items):
is_multi_speaker = False
speakers = [item[-1] for item in items]
is_multi_speaker = len(set(speakers)) > 1
eval_split_size = 500 if len(items) * 0.01 > 500 else int(
len(items) * 0.01)
assert eval_split_size > 0, " [!] You do not have enough samples to train. You need at least 100 samples."
np.random.seed(0)
np.random.shuffle(items)
if is_multi_speaker:
items_eval = []
# most stupid code ever -- Fix it !
while len(items_eval) < eval_split_size:
speakers = [item[-1] for item in items]
speaker_counter = Counter(speakers)
item_idx = np.random.randint(0, len(items))
if speaker_counter[items[item_idx][-1]] > 1:
items_eval.append(items[item_idx])
del items[item_idx]
return items_eval, items
return items[:eval_split_size], items[eval_split_size:]
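
The eval split is 1% of the items, capped at 500, and for multi-speaker sets an item is only moved if its speaker still has other samples left for training. A tiny check of the sizing rule:

```python
for n_items in (900, 13100, 120000):
    eval_split_size = 500 if n_items * 0.01 > 500 else int(n_items * 0.01)
    print(n_items, '->', eval_split_size)  # 900 -> 9, 13100 -> 131, 120000 -> 500
```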
# from https://gist.github.com/jihunchoi/f1434a77df9db1bb337417854b398df1
def sequence_mask(sequence_length, max_len=None):
if max_len is None:
max_len = sequence_length.data.max()
batch_size = sequence_length.size(0)
seq_range = torch.arange(0, max_len).long()
seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, max_len)
if sequence_length.is_cuda:
seq_range_expand = seq_range_expand.to(sequence_length.device)
seq_length_expand = (
sequence_length.unsqueeze(1).expand_as(seq_range_expand))
# B x T_max
return seq_range_expand < seq_length_expand
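
A quick example of the boolean padding mask this helper produces (using the `sequence_mask` defined just above):

```python
import torch

print(sequence_mask(torch.tensor([2, 4, 3])))
# tensor([[ True,  True, False, False],
#         [ True,  True,  True,  True],
#         [ True,  True,  True, False]])
```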
def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
print(" > Using model: {}".format(c.model))
MyModel = importlib.import_module('mozilla_voice_tts.tts.models.' + c.model.lower())
MyModel = getattr(MyModel, c.model)
if c.model.lower() in "tacotron":
model = MyModel(num_chars=num_chars,
num_speakers=num_speakers,
r=c.r,
postnet_output_dim=int(c.audio['fft_size'] / 2 + 1),
decoder_output_dim=c.audio['num_mels'],
gst=c.use_gst,
gst_embedding_dim=c.gst['gst_embedding_dim'],
gst_num_heads=c.gst['gst_num_heads'],
gst_style_tokens=c.gst['gst_style_tokens'],
memory_size=c.memory_size,
attn_type=c.attention_type,
attn_win=c.windowing,
attn_norm=c.attention_norm,
prenet_type=c.prenet_type,
prenet_dropout=c.prenet_dropout,
forward_attn=c.use_forward_attn,
trans_agent=c.transition_agent,
forward_attn_mask=c.forward_attn_mask,
location_attn=c.location_attn,
attn_K=c.attention_heads,
separate_stopnet=c.separate_stopnet,
bidirectional_decoder=c.bidirectional_decoder,
double_decoder_consistency=c.double_decoder_consistency,
ddc_r=c.ddc_r,
speaker_embedding_dim=speaker_embedding_dim)
elif c.model.lower() == "tacotron2":
model = MyModel(num_chars=num_chars,
num_speakers=num_speakers,
r=c.r,
postnet_output_dim=c.audio['num_mels'],
decoder_output_dim=c.audio['num_mels'],
gst=c.use_gst,
gst_embedding_dim=c.gst['gst_embedding_dim'],
gst_num_heads=c.gst['gst_num_heads'],
gst_style_tokens=c.gst['gst_style_tokens'],
attn_type=c.attention_type,
attn_win=c.windowing,
attn_norm=c.attention_norm,
prenet_type=c.prenet_type,
prenet_dropout=c.prenet_dropout,
forward_attn=c.use_forward_attn,
trans_agent=c.transition_agent,
forward_attn_mask=c.forward_attn_mask,
location_attn=c.location_attn,
attn_K=c.attention_heads,
separate_stopnet=c.separate_stopnet,
bidirectional_decoder=c.bidirectional_decoder,
double_decoder_consistency=c.double_decoder_consistency,
ddc_r=c.ddc_r,
speaker_embedding_dim=speaker_embedding_dim)
return model
def check_config(c):
check_argument('model', c, enum_list=['tacotron', 'tacotron2'], restricted=True, val_type=str)
check_argument('run_name', c, restricted=True, val_type=str)
check_argument('run_description', c, val_type=str)
# AUDIO
check_argument('audio', c, restricted=True, val_type=dict)
# audio processing parameters
check_argument('num_mels', c['audio'], restricted=True, val_type=int, min_val=10, max_val=2056)
check_argument('fft_size', c['audio'], restricted=True, val_type=int, min_val=128, max_val=4058)
check_argument('sample_rate', c['audio'], restricted=True, val_type=int, min_val=512, max_val=100000)
check_argument('frame_length_ms', c['audio'], restricted=True, val_type=float, min_val=10, max_val=1000, alternative='win_length')
check_argument('frame_shift_ms', c['audio'], restricted=True, val_type=float, min_val=1, max_val=1000, alternative='hop_length')
check_argument('preemphasis', c['audio'], restricted=True, val_type=float, min_val=0, max_val=1)
check_argument('min_level_db', c['audio'], restricted=True, val_type=int, min_val=-1000, max_val=10)
check_argument('ref_level_db', c['audio'], restricted=True, val_type=int, min_val=0, max_val=1000)
check_argument('power', c['audio'], restricted=True, val_type=float, min_val=1, max_val=5)
check_argument('griffin_lim_iters', c['audio'], restricted=True, val_type=int, min_val=10, max_val=1000)
# vocabulary parameters
check_argument('characters', c, restricted=False, val_type=dict)
check_argument('pad', c['characters'] if 'characters' in c.keys() else {}, restricted='characters' in c.keys(), val_type=str)
check_argument('eos', c['characters'] if 'characters' in c.keys() else {}, restricted='characters' in c.keys(), val_type=str)
check_argument('bos', c['characters'] if 'characters' in c.keys() else {}, restricted='characters' in c.keys(), val_type=str)
check_argument('characters', c['characters'] if 'characters' in c.keys() else {}, restricted='characters' in c.keys(), val_type=str)
check_argument('phonemes', c['characters'] if 'characters' in c.keys() else {}, restricted='characters' in c.keys(), val_type=str)
check_argument('punctuations', c['characters'] if 'characters' in c.keys() else {}, restricted='characters' in c.keys(), val_type=str)
# normalization parameters
check_argument('signal_norm', c['audio'], restricted=True, val_type=bool)
check_argument('symmetric_norm', c['audio'], restricted=True, val_type=bool)
check_argument('max_norm', c['audio'], restricted=True, val_type=float, min_val=0.1, max_val=1000)
check_argument('clip_norm', c['audio'], restricted=True, val_type=bool)
check_argument('mel_fmin', c['audio'], restricted=True, val_type=float, min_val=0.0, max_val=1000)
check_argument('mel_fmax', c['audio'], restricted=True, val_type=float, min_val=500.0)
check_argument('spec_gain', c['audio'], restricted=True, val_type=[int, float], min_val=1, max_val=100)
check_argument('do_trim_silence', c['audio'], restricted=True, val_type=bool)
check_argument('trim_db', c['audio'], restricted=True, val_type=int)
# training parameters
check_argument('batch_size', c, restricted=True, val_type=int, min_val=1)
check_argument('eval_batch_size', c, restricted=True, val_type=int, min_val=1)
check_argument('r', c, restricted=True, val_type=int, min_val=1)
check_argument('gradual_training', c, restricted=False, val_type=list)
check_argument('loss_masking', c, restricted=True, val_type=bool)
check_argument('apex_amp_level', c, restricted=False, val_type=str)
# check_argument('grad_accum', c, restricted=True, val_type=int, min_val=1, max_val=100)
# validation parameters
check_argument('run_eval', c, restricted=True, val_type=bool)
check_argument('test_delay_epochs', c, restricted=True, val_type=int, min_val=0)
check_argument('test_sentences_file', c, restricted=False, val_type=str)
# optimizer
check_argument('noam_schedule', c, restricted=False, val_type=bool)
check_argument('grad_clip', c, restricted=True, val_type=float, min_val=0.0)
check_argument('epochs', c, restricted=True, val_type=int, min_val=1)
check_argument('lr', c, restricted=True, val_type=float, min_val=0)
check_argument('wd', c, restricted=True, val_type=float, min_val=0)
check_argument('warmup_steps', c, restricted=True, val_type=int, min_val=0)
check_argument('seq_len_norm', c, restricted=True, val_type=bool)
# tacotron prenet
check_argument('memory_size', c, restricted=True, val_type=int, min_val=-1)
check_argument('prenet_type', c, restricted=True, val_type=str, enum_list=['original', 'bn'])
check_argument('prenet_dropout', c, restricted=True, val_type=bool)
# attention
check_argument('attention_type', c, restricted=True, val_type=str, enum_list=['graves', 'original'])
check_argument('attention_heads', c, restricted=True, val_type=int)
check_argument('attention_norm', c, restricted=True, val_type=str, enum_list=['sigmoid', 'softmax'])
check_argument('windowing', c, restricted=True, val_type=bool)
check_argument('use_forward_attn', c, restricted=True, val_type=bool)
check_argument('forward_attn_mask', c, restricted=True, val_type=bool)
check_argument('transition_agent', c, restricted=True, val_type=bool)
check_argument('location_attn', c, restricted=True, val_type=bool)
check_argument('bidirectional_decoder', c, restricted=True, val_type=bool)
check_argument('double_decoder_consistency', c, restricted=True, val_type=bool)
check_argument('ddc_r', c, restricted='double_decoder_consistency' in c.keys(), min_val=1, max_val=7, val_type=int)
# stopnet
check_argument('stopnet', c, restricted=True, val_type=bool)
check_argument('separate_stopnet', c, restricted=True, val_type=bool)
# tensorboard
check_argument('print_step', c, restricted=True, val_type=int, min_val=1)
check_argument('tb_plot_step', c, restricted=True, val_type=int, min_val=1)
check_argument('save_step', c, restricted=True, val_type=int, min_val=1)
check_argument('checkpoint', c, restricted=True, val_type=bool)
check_argument('tb_model_param_stats', c, restricted=True, val_type=bool)
# dataloading
# pylint: disable=import-outside-toplevel
from mozilla_voice_tts.tts.utils.text import cleaners
check_argument('text_cleaner', c, restricted=True, val_type=str, enum_list=dir(cleaners))
check_argument('enable_eos_bos_chars', c, restricted=True, val_type=bool)
check_argument('num_loader_workers', c, restricted=True, val_type=int, min_val=0)
check_argument('num_val_loader_workers', c, restricted=True, val_type=int, min_val=0)
check_argument('batch_group_size', c, restricted=True, val_type=int, min_val=0)
check_argument('min_seq_len', c, restricted=True, val_type=int, min_val=0)
check_argument('max_seq_len', c, restricted=True, val_type=int, min_val=10)
# paths
check_argument('output_path', c, restricted=True, val_type=str)
# multi-speaker and gst
check_argument('use_speaker_embedding', c, restricted=True, val_type=bool)
check_argument('use_external_speaker_embedding_file', c, restricted=True, val_type=bool)
check_argument('external_speaker_embedding_file', c, restricted=True, val_type=str)
check_argument('use_gst', c, restricted=True, val_type=bool)
check_argument('gst', c, restricted=True, val_type=dict)
check_argument('gst_style_input', c['gst'], restricted=True, val_type=[str, dict])
check_argument('gst_embedding_dim', c['gst'], restricted=True, val_type=int, min_val=0, max_val=1000)
check_argument('gst_num_heads', c['gst'], restricted=True, val_type=int, min_val=2, max_val=10)
check_argument('gst_style_tokens', c['gst'], restricted=True, val_type=int, min_val=1, max_val=1000)
# datasets - checking only the first entry
check_argument('datasets', c, restricted=True, val_type=list)
for dataset_entry in c['datasets']:
check_argument('name', dataset_entry, restricted=True, val_type=str)
check_argument('path', dataset_entry, restricted=True, val_type=str)
check_argument('meta_file_train', dataset_entry, restricted=True, val_type=[str, list])
check_argument('meta_file_val', dataset_entry, restricted=True, val_type=str)

View File

@ -1,44 +1,13 @@
import os
import json
import re
import torch
import datetime
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
def load_config(config_path):
config = AttrDict()
with open(config_path, "r", encoding = "utf-8") as f:
input_str = f.read()
input_str = re.sub(r'\\\n', '', input_str)
input_str = re.sub(r'//.*\n', '\n', input_str)
data = json.loads(input_str)
config.update(data)
return config
def copy_config_file(config_file, out_path, new_fields):
config_lines = open(config_file, "r", encoding = "utf-8").readlines()
# add extra information fields
for key, value in new_fields.items():
if isinstance(value, str):
new_line = '"{}":"{}",\n'.format(key, value)
else:
new_line = '"{}":{},\n'.format(key, value)
config_lines.insert(1, new_line)
config_out_file = open(out_path, "w")
config_out_file.writelines(config_lines)
config_out_file.close()
def load_checkpoint(model, checkpoint_path, use_cuda=False):
def load_checkpoint(model, checkpoint_path, amp=None, use_cuda=False):
state = torch.load(checkpoint_path, map_location=torch.device('cpu'))
model.load_state_dict(state['model'])
if amp and 'amp' in state:
amp.load_state_dict(state['amp'])
if use_cuda:
model.cuda()
# set model stepsize
@ -47,7 +16,7 @@ def load_checkpoint(model, checkpoint_path, use_cuda=False):
return model, state
def save_model(model, optimizer, current_step, epoch, r, output_path, **kwargs):
def save_model(model, optimizer, current_step, epoch, r, output_path, amp_state_dict=None, **kwargs):
new_state_dict = model.state_dict()
state = {
'model': new_state_dict,
@ -57,6 +26,8 @@ def save_model(model, optimizer, current_step, epoch, r, output_path, **kwargs):
'date': datetime.date.today().strftime("%B %d, %Y"),
'r': r
}
if amp_state_dict:
state['amp'] = amp_state_dict
state.update(kwargs)
torch.save(state, output_path)

View File

@ -1,6 +1,3 @@
import torch
def alignment_diagonal_score(alignments, binary=False):
"""
Compute how diagonal alignment predictions are. It is useful

View File

@ -1,8 +1,6 @@
import os
import json
from TTS.datasets.preprocess import get_preprocessor_by_name
def make_speakers_json_path(out_path):
"""Returns conventional speakers.json location."""
@ -12,12 +10,15 @@ def make_speakers_json_path(out_path):
def load_speaker_mapping(out_path):
"""Loads speaker mapping if already present."""
try:
with open(make_speakers_json_path(out_path)) as f:
if os.path.splitext(out_path)[1] == '.json':
json_file = out_path
else:
json_file = make_speakers_json_path(out_path)
with open(json_file) as f:
return json.load(f)
except FileNotFoundError:
return {}
def save_speaker_mapping(out_path, speaker_mapping):
"""Saves speaker mapping if not yet present."""
speakers_json_path = make_speakers_json_path(out_path)

View File

@ -37,23 +37,25 @@ def numpy_to_tf(np_array, dtype):
return tensor
def compute_style_mel(style_wav, ap):
style_mel = ap.melspectrogram(
ap.load_wav(style_wav)).expand_dims(0)
def compute_style_mel(style_wav, ap, cuda=False):
style_mel = torch.FloatTensor(ap.melspectrogram(
ap.load_wav(style_wav, sr=ap.sample_rate))).unsqueeze(0)
if cuda:
return style_mel.cuda()
return style_mel
def run_model_torch(model, inputs, CONFIG, truncated, speaker_id=None, style_mel=None):
def run_model_torch(model, inputs, CONFIG, truncated, speaker_id=None, style_mel=None, speaker_embeddings=None):
if CONFIG.use_gst:
decoder_output, postnet_output, alignments, stop_tokens = model.inference(
inputs, style_mel=style_mel, speaker_ids=speaker_id)
inputs, style_mel=style_mel, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
else:
if truncated:
decoder_output, postnet_output, alignments, stop_tokens = model.inference_truncated(
inputs, speaker_ids=speaker_id)
inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
else:
decoder_output, postnet_output, alignments, stop_tokens = model.inference(
inputs, speaker_ids=speaker_id)
inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
return decoder_output, postnet_output, alignments, stop_tokens
@ -70,6 +72,31 @@ def run_model_tf(model, inputs, CONFIG, truncated, speaker_id=None, style_mel=No
return decoder_output, postnet_output, alignments, stop_tokens
def run_model_tflite(model, inputs, CONFIG, truncated, speaker_id=None, style_mel=None):
if CONFIG.use_gst and style_mel is not None:
raise NotImplementedError(' [!] GST inference not implemented for TfLite')
if truncated:
raise NotImplementedError(' [!] Truncated inference not implemented for TfLite')
if speaker_id is not None:
raise NotImplementedError(' [!] Multi-Speaker not implemented for TfLite')
# get input and output details
input_details = model.get_input_details()
output_details = model.get_output_details()
# reshape input tensor for the new input shape
model.resize_tensor_input(input_details[0]['index'], inputs.shape)
model.allocate_tensors()
detail = input_details[0]
# input_shape = detail['shape']
model.set_tensor(detail['index'], inputs)
# run the model
model.invoke()
# collect outputs
decoder_output = model.get_tensor(output_details[0]['index'])
postnet_output = model.get_tensor(output_details[1]['index'])
# tflite model only returns feature frames
return decoder_output, postnet_output, None, None
def parse_outputs_torch(postnet_output, decoder_output, alignments, stop_tokens):
postnet_output = postnet_output[0].data.cpu().numpy()
decoder_output = decoder_output[0].data.cpu().numpy()
@ -86,25 +113,42 @@ def parse_outputs_tf(postnet_output, decoder_output, alignments, stop_tokens):
return postnet_output, decoder_output, alignment, stop_tokens
def parse_outputs_tflite(postnet_output, decoder_output):
postnet_output = postnet_output[0]
decoder_output = decoder_output[0]
return postnet_output, decoder_output
def trim_silence(wav, ap):
return wav[:ap.find_endpoint(wav)]
def inv_spectrogram(postnet_output, ap, CONFIG):
if CONFIG.model in ["Tacotron", "TacotronGST"]:
if CONFIG.model.lower() in ["tacotron"]:
wav = ap.inv_spectrogram(postnet_output.T)
else:
wav = ap.inv_melspectrogram(postnet_output.T)
return wav
def id_to_torch(speaker_id):
def id_to_torch(speaker_id, cuda=False):
if speaker_id is not None:
speaker_id = np.asarray(speaker_id)
speaker_id = torch.from_numpy(speaker_id).unsqueeze(0)
if cuda:
return speaker_id.cuda()
return speaker_id
def embedding_to_torch(speaker_embedding, cuda=False):
if speaker_embedding is not None:
speaker_embedding = np.asarray(speaker_embedding)
speaker_embedding = torch.from_numpy(speaker_embedding).unsqueeze(0).type(torch.FloatTensor)
if cuda:
return speaker_embedding.cuda()
return speaker_embedding
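
A small example of the conversion above (the embedding values are random placeholders, not a real speaker vector):

```python
import numpy as np

emb = embedding_to_torch(np.random.rand(256))
print(emb.shape, emb.dtype)  # torch.Size([1, 256]) torch.float32
```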
# TODO: perform GL with pytorch for batching
def apply_griffin_lim(inputs, input_lens, CONFIG, ap):
'''Apply Griffin-Lim to each sample, iterating through the first dimension.
@ -134,15 +178,16 @@ def synthesis(model,
enable_eos_bos_chars=False, #pylint: disable=unused-argument
use_griffin_lim=False,
do_trim_silence=False,
speaker_embedding=None,
backend='torch'):
"""Synthesize voice for the given text.
Args:
model (TTS.models): model to synthesize.
model (mozilla_voice_tts.tts.models): model to synthesize.
text (str): target text
CONFIG (dict): config dictionary to be loaded from config.json.
use_cuda (bool): enable cuda.
ap (TTS.utils.audio.AudioProcessor): audio processor to process
ap (mozilla_voice_tts.tts.utils.audio.AudioProcessor): audio processor to process
model outputs.
speaker_id (int): id of speaker
style_wav (str): Used for the style embedding of GST.
@ -154,32 +199,50 @@ def synthesis(model,
"""
# GST processing
style_mel = None
if CONFIG.model == "TacotronGST" and style_wav is not None:
style_mel = compute_style_mel(style_wav, ap)
if CONFIG.use_gst and style_wav is not None:
if isinstance(style_wav, dict):
style_mel = style_wav
else:
style_mel = compute_style_mel(style_wav, ap, cuda=use_cuda)
# preprocess the given text
inputs = text_to_seqvec(text, CONFIG)
# pass tensors to backend
if backend == 'torch':
speaker_id = id_to_torch(speaker_id)
style_mel = numpy_to_torch(style_mel, torch.float, cuda=use_cuda)
if speaker_id is not None:
speaker_id = id_to_torch(speaker_id, cuda=use_cuda)
if speaker_embedding is not None:
speaker_embedding = embedding_to_torch(speaker_embedding, cuda=use_cuda)
if not isinstance(style_mel, dict):
style_mel = numpy_to_torch(style_mel, torch.float, cuda=use_cuda)
inputs = numpy_to_torch(inputs, torch.long, cuda=use_cuda)
inputs = inputs.unsqueeze(0)
else:
elif backend == 'tf':
# TODO: handle speaker id for tf model
style_mel = numpy_to_tf(style_mel, tf.float32)
inputs = numpy_to_tf(inputs, tf.int32)
inputs = tf.expand_dims(inputs, 0)
elif backend == 'tflite':
style_mel = numpy_to_tf(style_mel, tf.float32)
inputs = numpy_to_tf(inputs, tf.int32)
inputs = tf.expand_dims(inputs, 0)
# synthesize voice
if backend == 'torch':
decoder_output, postnet_output, alignments, stop_tokens = run_model_torch(
model, inputs, CONFIG, truncated, speaker_id, style_mel)
model, inputs, CONFIG, truncated, speaker_id, style_mel, speaker_embeddings=speaker_embedding)
postnet_output, decoder_output, alignment, stop_tokens = parse_outputs_torch(
postnet_output, decoder_output, alignments, stop_tokens)
else:
elif backend == 'tf':
decoder_output, postnet_output, alignments, stop_tokens = run_model_tf(
model, inputs, CONFIG, truncated, speaker_id, style_mel)
postnet_output, decoder_output, alignment, stop_tokens = parse_outputs_tf(
postnet_output, decoder_output, alignments, stop_tokens)
elif backend == 'tflite':
decoder_output, postnet_output, alignment, stop_tokens = run_model_tflite(
model, inputs, CONFIG, truncated, speaker_id, style_mel)
postnet_output, decoder_output = parse_outputs_tflite(
postnet_output, decoder_output)
# convert outputs to numpy
# plot results
wav = None

View File

@ -4,10 +4,11 @@ import re
from packaging import version
import phonemizer
from phonemizer.phonemize import phonemize
from TTS.utils.text import cleaners
from TTS.utils.text.symbols import make_symbols, symbols, phonemes, _phoneme_punctuations, _bos, \
from mozilla_voice_tts.tts.utils.text import cleaners
from mozilla_voice_tts.tts.utils.text.symbols import make_symbols, symbols, phonemes, _phoneme_punctuations, _bos, \
_eos
# pylint: disable=unnecessary-comprehension
# Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
@ -44,7 +45,7 @@ def text2phone(text, language):
for punct in punctuations:
ph = ph.replace('| |\n', '|'+punct+'| |', 1)
elif version.parse(phonemizer.__version__) >= version.parse('2.1'):
ph = phonemize(text, separator=seperator, strip=False, njobs=1, backend='espeak', language=language, preserve_punctuation=True)
ph = phonemize(text, separator=seperator, strip=False, njobs=1, backend='espeak', language=language, preserve_punctuation=True, language_switch='remove-flags')
# this is a simple fix for phonemizer.
# https://github.com/bootphon/phonemizer/issues/32
if punctuations:
@ -77,7 +78,6 @@ def phoneme_to_sequence(text, cleaner_names, language, enable_eos_bos=False, tp=
_phonemes_to_id = {s: i for i, s in enumerate(_phonemes)}
sequence = []
text = text.replace(":", "")
clean_text = _clean_text(text, cleaner_names)
to_phonemes = text2phone(clean_text, language)
if to_phonemes is None:

View File

@ -67,15 +67,16 @@ def remove_aux_symbols(text):
text = re.sub(r'[\<\>\(\)\[\]\"]+', '', text)
return text
def replace_symbols(text):
def replace_symbols(text, lang='en'):
text = text.replace(';', ',')
text = text.replace('-', ' ')
text = text.replace(':', ' ')
text = text.replace('&', 'and')
if lang == 'en':
text = text.replace('&', 'and')
elif lang == 'pt':
text = text.replace('&', ' e ')
return text
def basic_cleaners(text):
'''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
text = lowercase(text)
@ -91,6 +92,13 @@ def transliteration_cleaners(text):
return text
def basic_german_cleaners(text):
'''Pipeline for German text'''
text = lowercase(text)
text = collapse_whitespace(text)
return text
# TODO: elaborate it
def basic_turkish_cleaners(text):
'''Pipeline for Turkish text'''
@ -99,7 +107,6 @@ def basic_turkish_cleaners(text):
text = collapse_whitespace(text)
return text
def english_cleaners(text):
'''Pipeline for English text, including number and abbreviation expansion.'''
text = convert_to_ascii(text)
@ -111,6 +118,14 @@ def english_cleaners(text):
text = collapse_whitespace(text)
return text
def portuguese_cleaners(text):
'''Basic pipeline for Portuguese text. There is no need to expand abbreviations and
numbers; the phonemizer already does that'''
text = lowercase(text)
text = replace_symbols(text, lang='pt')
text = remove_aux_symbols(text)
text = collapse_whitespace(text)
return text
def phoneme_cleaners(text):
'''Pipeline for phonemes mode, including number and abbreviation expansion.'''

View File

@ -0,0 +1,70 @@
""" from https://github.com/keithito/tacotron """
import inflect
import re
_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
_number_re = re.compile(r'[0-9]+')
def _remove_commas(m):
return m.group(1).replace(',', '')
def _expand_decimal_point(m):
return m.group(1).replace('.', ' point ')
def _expand_dollars(m):
match = m.group(1)
parts = match.split('.')
if len(parts) > 2:
return match + ' dollars' # Unexpected format
dollars = int(parts[0]) if parts[0] else 0
cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
if dollars and cents:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
if dollars:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
return '%s %s' % (dollars, dollar_unit)
if cents:
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s' % (cents, cent_unit)
return 'zero dollars'
def _expand_ordinal(m):
return _inflect.number_to_words(m.group(0))
def _expand_number(m):
num = int(m.group(0))
if 1000 < num < 3000:
if num == 2000:
return 'two thousand'
if 2000 < num < 2010:
return 'two thousand ' + _inflect.number_to_words(num % 100)
if num % 100 == 0:
return _inflect.number_to_words(num // 100) + ' hundred'
return _inflect.number_to_words(num,
andword='',
zero='oh',
group=2).replace(', ', ' ')
return _inflect.number_to_words(num, andword='')
def normalize_numbers(text):
text = re.sub(_comma_number_re, _remove_commas, text)
text = re.sub(_pounds_re, r'\1 pounds', text)
text = re.sub(_dollars_re, _expand_dollars, text)
text = re.sub(_decimal_number_re, _expand_decimal_point, text)
text = re.sub(_ordinal_re, _expand_ordinal, text)
text = re.sub(_number_re, _expand_number, text)
return text
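
The regexes above run in order (commas, pounds, dollars, decimals, ordinals, plain numbers), so roughly the following behaviour is expected (spot-checked, not exhaustive):

```python
print(normalize_numbers('I paid $3.50 for 2,000 apples on May 2nd.'))
# -> 'I paid three dollars, fifty cents for two thousand apples on May second.'
```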

View File

@ -3,10 +3,10 @@ import librosa
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from TTS.utils.text import phoneme_to_sequence, sequence_to_phoneme
from mozilla_voice_tts.tts.utils.text import phoneme_to_sequence, sequence_to_phoneme
def plot_alignment(alignment, info=None, fig_size=(16, 10), title=None):
def plot_alignment(alignment, info=None, fig_size=(16, 10), title=None, output_fig=False):
if isinstance(alignment, torch.Tensor):
alignment_ = alignment.detach().cpu().numpy().squeeze()
else:
@ -24,23 +24,28 @@ def plot_alignment(alignment, info=None, fig_size=(16, 10), title=None):
plt.tight_layout()
if title is not None:
plt.title(title)
if not output_fig:
plt.close()
return fig
def plot_spectrogram(linear_output, audio, fig_size=(16, 10)):
if isinstance(linear_output, torch.Tensor):
linear_output_ = linear_output.detach().cpu().numpy().squeeze()
def plot_spectrogram(spectrogram, ap=None, fig_size=(16, 10), output_fig=False):
if isinstance(spectrogram, torch.Tensor):
spectrogram_ = spectrogram.detach().cpu().numpy().squeeze().T
else:
linear_output_ = linear_output
spectrogram = audio._denormalize(linear_output_.T) # pylint: disable=protected-access
spectrogram_ = spectrogram.T
if ap is not None:
spectrogram_ = ap._denormalize(spectrogram_) # pylint: disable=protected-access
fig = plt.figure(figsize=fig_size)
plt.imshow(spectrogram, aspect="auto", origin="lower")
plt.imshow(spectrogram_, aspect="auto", origin="lower")
plt.colorbar()
plt.tight_layout()
if not output_fig:
plt.close()
return fig
def visualize(alignment, postnet_output, stop_tokens, text, hop_length, CONFIG, decoder_output=None, output_path=None, figsize=(8, 24)):
def visualize(alignment, postnet_output, stop_tokens, text, hop_length, CONFIG, decoder_output=None, output_path=None, figsize=(8, 24), output_fig=False):
if decoder_output is not None:
num_plot = 4
else:
@ -90,3 +95,6 @@ def visualize(alignment, postnet_output, stop_tokens, text, hop_length, CONFIG,
print(output_path)
fig.savefig(output_path)
plt.close()
if not output_fig:
plt.close()
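With the new signature, a caller can hand in either a torch tensor or a numpy array of shape [T, C], optionally denormalize it through an `AudioProcessor`, and decide whether the interactive figure stays open. A minimal usage sketch (shapes, paths and the module path are illustrative assumptions):

```python
# Minimal usage sketch for the updated plotting helper (module path assumed).
import numpy as np
from mozilla_voice_tts.tts.utils.visual import plot_spectrogram

mel = np.random.rand(200, 80)                           # [T, num_mels], e.g. a decoder output
fig = plot_spectrogram(mel, ap=None, output_fig=False)  # pass an AudioProcessor as `ap` to denormalize
fig.savefig("mel.png")                                  # the Figure object is still usable after plt.close()
```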

View File

View File

@ -3,8 +3,9 @@ import soundfile as sf
import numpy as np
import scipy.io.wavfile
import scipy.signal
import pyworld as pw
from TTS.utils.data import StandardScaler
from mozilla_voice_tts.tts.utils.data import StandardScaler
class AudioProcessor(object):
@ -17,7 +18,7 @@ class AudioProcessor(object):
hop_length=None,
win_length=None,
ref_level_db=None,
num_freq=None,
fft_size=1024,
power=None,
preemphasis=0.0,
signal_norm=None,
@ -25,6 +26,8 @@ class AudioProcessor(object):
max_norm=None,
mel_fmin=None,
mel_fmax=None,
spec_gain=20,
stft_pad_mode='reflect',
clip_norm=True,
griffin_lim_iters=None,
do_trim_silence=False,
@ -41,7 +44,7 @@ class AudioProcessor(object):
self.frame_shift_ms = frame_shift_ms
self.frame_length_ms = frame_length_ms
self.ref_level_db = ref_level_db
self.num_freq = num_freq
self.fft_size = fft_size
self.power = power
self.preemphasis = preemphasis
self.griffin_lim_iters = griffin_lim_iters
@ -49,6 +52,8 @@ class AudioProcessor(object):
self.symmetric_norm = symmetric_norm
self.mel_fmin = mel_fmin or 0
self.mel_fmax = mel_fmax
self.spec_gain = float(spec_gain)
self.stft_pad_mode = stft_pad_mode
self.max_norm = 1.0 if max_norm is None else float(max_norm)
self.clip_norm = clip_norm
self.do_trim_silence = do_trim_silence
@ -57,12 +62,14 @@ class AudioProcessor(object):
self.stats_path = stats_path
# setup stft parameters
if hop_length is None:
self.n_fft, self.hop_length, self.win_length = self._stft_parameters()
# compute stft parameters from given time values
self.hop_length, self.win_length = self._stft_parameters()
else:
# use stft parameters from config file
self.hop_length = hop_length
self.win_length = win_length
self.n_fft = (self.num_freq - 1) * 2
assert min_level_db != 0.0, " [!] min_level_db is 0"
assert self.win_length <= self.fft_size, " [!] win_length cannot be larger than fft_size"
members = vars(self)
for key, value in members.items():
print(" | > {}:{}".format(key, value))
@ -84,19 +91,18 @@ class AudioProcessor(object):
assert self.mel_fmax <= self.sample_rate // 2
return librosa.filters.mel(
self.sample_rate,
self.n_fft,
self.fft_size,
n_mels=self.num_mels,
fmin=self.mel_fmin,
fmax=self.mel_fmax)
def _stft_parameters(self, ):
"""Compute necessary stft parameters with given time values"""
n_fft = (self.num_freq - 1) * 2
factor = self.frame_length_ms / self.frame_shift_ms
assert (factor).is_integer(), " [!] frame_shift_ms should divide frame_length_ms"
hop_length = int(self.frame_shift_ms / 1000.0 * self.sample_rate)
win_length = int(hop_length * factor)
return n_fft, hop_length, win_length
return hop_length, win_length
### normalization ###
def _normalize(self, S):
@ -108,7 +114,7 @@ class AudioProcessor(object):
if hasattr(self, 'mel_scaler'):
if S.shape[0] == self.num_mels:
return self.mel_scaler.transform(S.T).T
elif S.shape[0] == self.n_fft / 2:
elif S.shape[0] == self.fft_size / 2:
return self.linear_scaler.transform(S.T).T
else:
raise RuntimeError(' [!] Mean-Var stats does not match the given feature dimensions.')
@ -118,7 +124,7 @@ class AudioProcessor(object):
if self.symmetric_norm:
S_norm = ((2 * self.max_norm) * S_norm) - self.max_norm
if self.clip_norm:
S_norm = np.clip(S_norm, -self.max_norm, self.max_norm)
S_norm = np.clip(S_norm, -self.max_norm, self.max_norm) # pylint: disable=invalid-unary-operand-type
return S_norm
else:
S_norm = self.max_norm * S_norm
@ -137,13 +143,13 @@ class AudioProcessor(object):
if hasattr(self, 'mel_scaler'):
if S_denorm.shape[0] == self.num_mels:
return self.mel_scaler.inverse_transform(S_denorm.T).T
elif S_denorm.shape[0] == self.n_fft / 2:
elif S_denorm.shape[0] == self.fft_size / 2:
return self.linear_scaler.inverse_transform(S_denorm.T).T
else:
raise RuntimeError(' [!] Mean-Var stats does not match the given feature dimensions.')
if self.symmetric_norm:
if self.clip_norm:
S_denorm = np.clip(S_denorm, -self.max_norm, self.max_norm)
S_denorm = np.clip(S_denorm, -self.max_norm, self.max_norm) #pylint: disable=invalid-unary-operand-type
S_denorm = ((S_denorm + self.max_norm) * -self.min_level_db / (2 * self.max_norm)) + self.min_level_db
return S_denorm + self.ref_level_db
else:
@ -182,11 +188,11 @@ class AudioProcessor(object):
### DB and AMP conversion ###
# pylint: disable=no-self-use
def _amp_to_db(self, x):
return 20 * np.log10(np.maximum(1e-5, x))
return self.spec_gain * np.log10(np.maximum(1e-5, x))
# pylint: disable=no-self-use
def _db_to_amp(self, x):
return np.power(10.0, x * 0.05)
return np.power(10.0, x / self.spec_gain)
### Preemphasis ###
def apply_preemphasis(self, x):
@ -252,10 +258,10 @@ class AudioProcessor(object):
def _stft(self, y):
return librosa.stft(
y=y,
n_fft=self.n_fft,
n_fft=self.fft_size,
hop_length=self.hop_length,
win_length=self.win_length,
pad_mode='constant'
pad_mode=self.stft_pad_mode,
)
def _istft(self, y):
@ -280,6 +286,17 @@ class AudioProcessor(object):
return 0, pad
return pad // 2, pad // 2 + pad % 2
### Compute F0 ###
def compute_f0(self, x):
f0, t = pw.dio(
x.astype(np.double),
fs=self.sample_rate,
f0_ceil=self.mel_fmax,
frame_period=1000 * self.hop_length / self.sample_rate,
)
f0 = pw.stonemask(x.astype(np.double), f0, t, self.sample_rate)
return f0
### Audio Processing ###
def find_endpoint(self, wav, threshold_db=-40, min_silence_sec=0.8):
window_length = int(self.sample_rate * min_silence_sec)
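Two practical consequences of these changes: `_amp_to_db`/`_db_to_amp` now form an exact inverse pair for any `spec_gain`, and F0 can be computed directly from a loaded waveform via pyworld. A hedged sketch (constructor arguments mirror the config keys in this diff, the wav file and module path are assumptions):

```python
# Sketch of the configurable spec_gain and the new compute_f0 (parameter values are illustrative).
import numpy as np
from mozilla_voice_tts.utils.audio import AudioProcessor  # module path assumed

ap = AudioProcessor(sample_rate=22050, num_mels=80, fft_size=1024, hop_length=256,
                    win_length=1024, min_level_db=-100, ref_level_db=20, spec_gain=20,
                    mel_fmin=0.0, mel_fmax=8000.0, signal_norm=False, do_trim_silence=False)

x = np.array([1e-3, 1e-2, 1e-1])
db = ap._amp_to_db(x)                      # spec_gain * log10(x) -> [-60., -40., -20.]
assert np.allclose(ap._db_to_amp(db), x)   # the inverse divides by the same spec_gain

wav = ap.load_wav("sample.wav")            # hypothetical wav file
f0 = ap.compute_f0(wav)                    # frame-level F0 via pyworld DIO + StoneMask
```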

View File

@ -1,5 +1,5 @@
import datetime
from TTS.utils.io import AttrDict
from mozilla_voice_tts.utils.io import AttrDict
tcolors = AttrDict({
@ -35,8 +35,7 @@ class ConsoleLogger():
def print_train_start(self):
print(f"\n{tcolors.BOLD} > TRAINING ({self.get_time()}) {tcolors.ENDC}")
def print_train_step(self, batch_steps, step, global_step, avg_spec_length,
avg_text_length, step_time, loader_time, lr,
def print_train_step(self, batch_steps, step, global_step, log_dict,
loss_dict, avg_loss_dict):
indent = " | > "
print()
@ -48,15 +47,20 @@ class ConsoleLogger():
log_text += "{}{}: {:.5f} ({:.5f})\n".format(indent, key, value, avg_loss_dict[f'avg_{key}'])
else:
log_text += "{}{}: {:.5f} \n".format(indent, key, value)
log_text += f"{indent}avg_spec_len: {avg_spec_length}\n{indent}avg_text_len: {avg_text_length}\n{indent}"\
f"step_time: {step_time:.2f}\n{indent}loader_time: {loader_time:.2f}\n{indent}lr: {lr:.5f}"
for idx, (key, value) in enumerate(log_dict.items()):
if isinstance(value, list):
log_text += f"{indent}{key}: {value[0]:.{value[1]}f}"
else:
log_text += f"{indent}{key}: {value}"
if idx < len(log_dict)-1:
log_text += "\n"
print(log_text, flush=True)
# pylint: disable=unused-argument
def print_train_epoch_end(self, global_step, epoch, epoch_time,
print_dict):
indent = " | > "
log_text = f"\n{tcolors.BOLD} --> TRAIN PERFORMACE -- EPOCH TIME: {epoch} sec -- GLOBAL_STEP: {global_step}{tcolors.ENDC}\n"
log_text = f"\n{tcolors.BOLD} --> TRAIN PERFORMACE -- EPOCH TIME: {epoch_time:.2f} sec -- GLOBAL_STEP: {global_step}{tcolors.ENDC}\n"
for key, value in print_dict.items():
log_text += "{}{}: {:.5f}\n".format(indent, key, value)
print(log_text, flush=True)
@ -82,14 +86,17 @@ class ConsoleLogger():
tcolors.BOLD, tcolors.ENDC)
for key, value in avg_loss_dict.items():
# print the avg value if given
color = tcolors.FAIL
color = ''
sign = '+'
diff = 0
if self.old_eval_loss_dict is not None:
if self.old_eval_loss_dict is not None and key in self.old_eval_loss_dict:
diff = value - self.old_eval_loss_dict[key]
if diff <= 0:
if diff < 0:
color = tcolors.OKGREEN
sign = ''
elif diff > 0:
color = tcolors.FAIL
sign = '+'
log_text += "{}{}:{} {:.5f} {}({}{:.5f})\n".format(indent, key, color, value, tcolors.ENDC, sign, diff)
self.old_eval_loss_dict = avg_loss_dict
print(log_text, flush=True)

View File

@ -0,0 +1,156 @@
import os
import glob
import shutil
import datetime
import subprocess
def get_git_branch():
try:
out = subprocess.check_output(["git", "branch"]).decode("utf8")
current = next(line for line in out.split("\n")
if line.startswith("*"))
current = current.replace("* ", "")
except subprocess.CalledProcessError:
current = "inside_docker"
return current
def get_commit_hash():
"""https://stackoverflow.com/questions/14989858/get-the-current-git-hash-in-a-python-script"""
# try:
# subprocess.check_output(['git', 'diff-index', '--quiet',
# 'HEAD']) # Verify client is clean
# except:
# raise RuntimeError(
# " !! Commit before training to get the commit hash.")
try:
commit = subprocess.check_output(
['git', 'rev-parse', '--short', 'HEAD']).decode().strip()
# Not copying .git folder into docker container
except subprocess.CalledProcessError:
commit = "0000000"
print(' > Git Hash: {}'.format(commit))
return commit
def create_experiment_folder(root_path, model_name, debug):
""" Create a folder with the current date and time """
date_str = datetime.datetime.now().strftime("%B-%d-%Y_%I+%M%p")
if debug:
commit_hash = 'debug'
else:
commit_hash = get_commit_hash()
output_folder = os.path.join(
root_path, model_name + '-' + date_str + '-' + commit_hash)
os.makedirs(output_folder, exist_ok=True)
print(" > Experiment folder: {}".format(output_folder))
return output_folder
def remove_experiment_folder(experiment_path):
"""Check folder if there is a checkpoint, otherwise remove the folder"""
checkpoint_files = glob.glob(experiment_path + "/*.pth.tar")
if not checkpoint_files:
if os.path.exists(experiment_path):
shutil.rmtree(experiment_path, ignore_errors=True)
print(" ! Run is removed from {}".format(experiment_path))
else:
print(" ! Run is kept in {}".format(experiment_path))
def count_parameters(model):
r"""Count number of trainable parameters in a network"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
def set_init_dict(model_dict, checkpoint_state, c):
# Partial initialization: if there is a mismatch with new and old layer, it is skipped.
for k, v in checkpoint_state.items():
if k not in model_dict:
print(" | > Layer missing in the model definition: {}".format(k))
# 1. filter out unnecessary keys
pretrained_dict = {
k: v
for k, v in checkpoint_state.items() if k in model_dict
}
# 2. filter out different size layers
pretrained_dict = {
k: v
for k, v in pretrained_dict.items()
if v.numel() == model_dict[k].numel()
}
# 3. skip reinit layers
if c.reinit_layers is not None:
for reinit_layer_name in c.reinit_layers:
pretrained_dict = {
k: v
for k, v in pretrained_dict.items()
if reinit_layer_name not in k
}
# 4. overwrite entries in the existing state dict
model_dict.update(pretrained_dict)
print(" | > {} / {} layers are restored.".format(len(pretrained_dict),
len(model_dict)))
return model_dict
class KeepAverage():
def __init__(self):
self.avg_values = {}
self.iters = {}
def __getitem__(self, key):
return self.avg_values[key]
def items(self):
return self.avg_values.items()
def add_value(self, name, init_val=0, init_iter=0):
self.avg_values[name] = init_val
self.iters[name] = init_iter
def update_value(self, name, value, weighted_avg=False):
if name not in self.avg_values:
# add value if not exist before
self.add_value(name, init_val=value)
else:
# else update existing value
if weighted_avg:
self.avg_values[name] = 0.99 * self.avg_values[name] + 0.01 * value
self.iters[name] += 1
else:
self.avg_values[name] = self.avg_values[name] * \
self.iters[name] + value
self.iters[name] += 1
self.avg_values[name] /= self.iters[name]
def add_values(self, name_dict):
for key, value in name_dict.items():
self.add_value(key, init_val=value)
def update_values(self, value_dict):
for key, value in value_dict.items():
self.update_value(key, value)
def check_argument(name, c, enum_list=None, max_val=None, min_val=None, restricted=False, val_type=None, alternative=None):
if alternative in c.keys() and c[alternative] is not None:
return
if restricted:
assert name in c.keys(), f' [!] {name} not defined in config.json'
if name in c.keys():
if max_val:
assert c[name] <= max_val, f' [!] {name} is larger than max value {max_val}'
if min_val:
assert c[name] >= min_val, f' [!] {name} is smaller than min value {min_val}'
if enum_list:
assert c[name].lower() in enum_list, f' [!] {name} is not a valid value'
if isinstance(val_type, list):
is_valid = False
for typ in val_type:
if isinstance(c[name], typ):
is_valid = True
assert is_valid or c[name] is None, f' [!] {name} has wrong type - {type(c[name])} vs {val_type}'
elif val_type:
assert isinstance(c[name], val_type) or c[name] is None, f' [!] {name} has wrong type - {type(c[name])} vs {val_type}'
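A short sketch of how these helpers are typically used: `KeepAverage` accumulates running means of per-step values for the console and TensorBoard logs, and `check_argument` validates a single config entry (the import path is assumed):

```python
# Usage sketch (import path assumed from this diff's layout).
from mozilla_voice_tts.utils.generic_utils import KeepAverage, check_argument

keep_avg = KeepAverage()
keep_avg.add_values({'avg_loss': 0.0, 'avg_step_time': 0.0})     # register the keys
for step_loss in (1.2, 0.9, 0.8):
    keep_avg.update_values({'avg_loss': step_loss, 'avg_step_time': 0.05})
print(keep_avg['avg_loss'])        # running mean of the logged values

c = {'batch_size': 32}             # a loaded config (AttrDict or plain dict)
check_argument('batch_size', c, restricted=True, val_type=int, min_val=1)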

View File

@ -0,0 +1,50 @@
import re
import json
from shutil import copyfile
class AttrDict(dict):
"""A custom dict which converts dict keys
to class attributes"""
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
def load_config(config_path):
"""Load config files and discard comments
Args:
config_path (str): path to config file.
"""
config = AttrDict()
with open(config_path, "r") as f:
input_str = f.read()
# handle comments
input_str = re.sub(r'\\\n', '', input_str)
input_str = re.sub(r'//.*\n', '\n', input_str)
data = json.loads(input_str)
config.update(data)
return config
def copy_config_file(config_file, out_path, new_fields):
"""Copy config.json to training folder and add
new fields.
Args:
config_file (str): path to config file.
out_path (str): output path to copy the file.
new_fields (dict): new fields to be added or edited
in the config file.
"""
config_lines = open(config_file, "r").readlines()
# add extra information fields
for key, value in new_fields.items():
if isinstance(value, str):
new_line = '"{}":"{}",\n'.format(key, value)
else:
new_line = '"{}":{},\n'.format(key, value)
config_lines.insert(1, new_line)
config_out_file = open(out_path, "w")
config_out_file.writelines(config_lines)
config_out_file.close()
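For reference, this is how the pair is typically used when a training run starts (paths and the extra field name are placeholders):

```python
# Usage sketch: load a commented JSON config and copy it into the experiment folder.
from mozilla_voice_tts.utils.io import load_config, copy_config_file

c = load_config("config.json")            # '//' comments are stripped before json.loads
print(c.batch_size, c["batch_size"])      # AttrDict allows both attribute and key access
copy_config_file("config.json", "experiment_folder/config.json",
                 new_fields={"github_branch": "dev"})  # field name is a placeholder
```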

View File

@ -2,7 +2,7 @@
import math
import torch
from torch.optim.optimizer import Optimizer, required
from torch.optim.optimizer import Optimizer
class RAdam(Optimizer):
@ -25,7 +25,7 @@ class RAdam(Optimizer):
defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, buffer=[[None, None, None] for _ in range(10)])
super(RAdam, self).__init__(params, defaults)
def __setstate__(self, state):
def __setstate__(self, state): # pylint: disable=useless-super-delegation
super(RAdam, self).__setstate__(state)
def step(self, closure=None):

View File

@ -3,7 +3,8 @@ from tensorboardX import SummaryWriter
class TensorboardLogger(object):
def __init__(self, log_dir):
def __init__(self, log_dir, model_name):
self.model_name = model_name
self.writer = SummaryWriter(log_dir)
self.train_stats = {}
self.eval_stats = {}
@ -46,35 +47,35 @@ class TensorboardLogger(object):
for key, value in audios.items():
try:
self.writer.add_audio('{}/{}'.format(scope_name, key), value, step, sample_rate=sample_rate)
except:
except RuntimeError:
traceback.print_exc()
def tb_train_iter_stats(self, step, stats):
self.dict_to_tb_scalar("TrainIterStats", stats, step)
self.dict_to_tb_scalar(f"{self.model_name}_TrainIterStats", stats, step)
def tb_train_epoch_stats(self, step, stats):
self.dict_to_tb_scalar("TrainEpochStats", stats, step)
self.dict_to_tb_scalar(f"{self.model_name}_TrainEpochStats", stats, step)
def tb_train_figures(self, step, figures):
self.dict_to_tb_figure("TrainFigures", figures, step)
self.dict_to_tb_figure(f"{self.model_name}_TrainFigures", figures, step)
def tb_train_audios(self, step, audios, sample_rate):
self.dict_to_tb_audios("TrainAudios", audios, step, sample_rate)
self.dict_to_tb_audios(f"{self.model_name}_TrainAudios", audios, step, sample_rate)
def tb_eval_stats(self, step, stats):
self.dict_to_tb_scalar("EvalStats", stats, step)
self.dict_to_tb_scalar(f"{self.model_name}_EvalStats", stats, step)
def tb_eval_figures(self, step, figures):
self.dict_to_tb_figure("EvalFigures", figures, step)
self.dict_to_tb_figure(f"{self.model_name}_EvalFigures", figures, step)
def tb_eval_audios(self, step, audios, sample_rate):
self.dict_to_tb_audios("EvalAudios", audios, step, sample_rate)
self.dict_to_tb_audios(f"{self.model_name}_EvalAudios", audios, step, sample_rate)
def tb_test_audios(self, step, audios, sample_rate):
self.dict_to_tb_audios("TestAudios", audios, step, sample_rate)
self.dict_to_tb_audios(f"{self.model_name}_TestAudios", audios, step, sample_rate)
def tb_test_figures(self, step, figures):
self.dict_to_tb_figure("TestFigures", figures, step)
self.dict_to_tb_figure(f"{self.model_name}_TestFigures", figures, step)
def tb_add_text(self, title, text, step):
self.writer.add_text(title, text, step)

View File

@ -2,13 +2,32 @@ import torch
import numpy as np
def check_update(model, grad_clip, ignore_stopnet=False):
def setup_torch_training_env(cudnn_enable, cudnn_benchmark):
torch.backends.cudnn.enabled = cudnn_enable
torch.backends.cudnn.benchmark = cudnn_benchmark
torch.manual_seed(54321)
use_cuda = torch.cuda.is_available()
num_gpus = torch.cuda.device_count()
print(" > Using CUDA: ", use_cuda)
print(" > Number of GPUs: ", num_gpus)
return use_cuda, num_gpus
def check_update(model, grad_clip, ignore_stopnet=False, amp_opt_params=None):
r'''Check model gradient against unexpected jumps and failures'''
skip_flag = False
if ignore_stopnet:
grad_norm = torch.nn.utils.clip_grad_norm_([param for name, param in model.named_parameters() if 'stopnet' not in name], grad_clip)
if not amp_opt_params:
grad_norm = torch.nn.utils.clip_grad_norm_(
[param for name, param in model.named_parameters() if 'stopnet' not in name], grad_clip)
else:
grad_norm = torch.nn.utils.clip_grad_norm_(amp_opt_params, grad_clip)
else:
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
if not amp_opt_params:
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
else:
grad_norm = torch.nn.utils.clip_grad_norm_(amp_opt_params, grad_clip)
# compatibility with different torch versions
if isinstance(grad_norm, float):
if np.isinf(grad_norm):
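A hedged sketch of how these two helpers slot into a training step; the model and optimizer are stand-ins, the import path is assumed, and `check_update` is assumed to return `(grad_norm, skip_flag)` as in the rest of this codebase:

```python
# Training-step sketch (model/optimizer are placeholders; import path assumed).
import torch
from mozilla_voice_tts.utils.training import setup_torch_training_env, check_update

use_cuda, num_gpus = setup_torch_training_env(cudnn_enable=True, cudnn_benchmark=False)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
grad_norm, skip_flag = check_update(model, grad_clip=5.0)
if not skip_flag:
    optimizer.step()
```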

View File

@ -0,0 +1,39 @@
# Mozilla TTS Vocoders (Experimental)
Here are vocoder model implementations that can be combined with the other TTS models.
Currently, the following models are implemented:
- Melgan
- MultiBand-Melgan
- ParallelWaveGAN
- GAN-TTS (Discriminator Only)
It is also easy to adapt other vocoder models, since we provide a flexible and modular (but not too modular) framework.
## Training a model
You can see an example [Colab Notebook]() (coming soon) that trains MelGAN on the LJSpeech dataset.
In order to train a new model, you need to gather all wav files into a folder and set this folder as `data_path` in `config.json`.
You need to define the other relevant parameters in your `config.json` and then start training with the following command.
```CUDA_VISIBLE_DEVICES='0' python tts/bin/train_vocoder.py --config_path path/to/config.json```
Example config files can be found under the `tts/vocoder/configs/` folder.
You can continue a previous training run with the following command.
```CUDA_VISIBLE_DEVICES='0' python tts/bin/train_vocoder.py --continue_path path/to/your/model/folder```
You can fine-tune a pre-trained model with the following command.
```CUDA_VISIBLE_DEVICES='0' python tts/bin/train_vocoder.py --restore_path path/to/your/model.pth.tar```
Restoring a model starts a new training run in a different folder; it only restores model weights from the given checkpoint file. Continuing a training run, in contrast, resumes from the same directory where the previous run left off.
You can also follow your training runs on TensorBoard, just as you do with our TTS models, e.g. by pointing it at the training output folder:
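```tensorboard --logdir path/to/your/model/folder```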
## Acknowledgement
Thanks to @kan-bayashi for his [repository](https://github.com/kan-bayashi/ParallelWaveGAN) being the starting point of our work.

View File

View File

@ -0,0 +1,151 @@
{
"run_name": "multiband-melgan-rwd",
"run_description": "multiband melgan with random window discriminator from https://arxiv.org/pdf/1909.11646.pdf",
// AUDIO PARAMETERS
"audio":{
// stft parameters
"num_freq": 513, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// Griffin-Lim
"power": 1.5, // value to sharpen wav signals after GL algorithm.
"griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
// "distributed":{
// "backend": "nccl",
// "url": "tcp:\/\/localhost:54321"
// },
// MODEL PARAMETERS
"use_pqmf": true,
// LOSS PARAMETERS
"use_stft_loss": true,
"use_subband_stft_loss": true,
"use_mse_gan_loss": true,
"use_hinge_gan_loss": false,
"use_feat_match_loss": false, // use only with melgan discriminators
// loss weights
"stft_loss_weight": 0.5,
"subband_stft_loss_weight": 0.5,
"mse_G_loss_weight": 2.5,
"hinge_G_loss_weight": 2.5,
"feat_match_loss_weight": 25,
// multiscale stft loss parameters
"stft_loss_params": {
"n_ffts": [1024, 2048, 512],
"hop_lengths": [120, 240, 50],
"win_lengths": [600, 1200, 240]
},
// subband multiscale stft loss parameters
"subband_stft_loss_params":{
"n_ffts": [384, 683, 171],
"hop_lengths": [30, 60, 10],
"win_lengths": [150, 300, 60]
},
"target_loss": "avg_G_loss", // loss value to pick the best model to save after each epoch
// DISCRIMINATOR
"discriminator_model": "random_window_discriminator",
"discriminator_model_params":{
"uncond_disc_donwsample_factors": [8, 4],
"cond_disc_downsample_factors": [[8, 4, 2, 2, 2], [8, 4, 2, 2], [8, 4, 2], [8, 4], [4, 2, 2]],
"cond_disc_out_channels": [[128, 128, 256, 256], [128, 256, 256], [128, 256], [256], [128, 256]],
"window_sizes": [512, 1024, 2048, 4096, 8192]
},
"steps_to_start_discriminator": 200000, // steps required to start GAN trainining.1
// GENERATOR
"generator_model": "multiband_melgan_generator",
"generator_model_params": {
"upsample_factors":[8, 4, 2],
"num_res_blocks": 4
},
// DATASET
"data_path": "/home/erogol/Data/LJSpeech-1.1/wavs/",
"seq_len": 16384,
"pad_short": 2000,
"conv_pad": 0,
"use_noise_augment": false,
"use_cache": true,
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 64, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
// VALIDATION
"run_eval": true,
"test_delay_epochs": 10, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
"noam_schedule": false, // use noam warmup and lr schedule.
"warmup_steps_gen": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
"warmup_steps_disc": 4000,
"epochs": 10000, // total number of epochs to train.
"wd": 0.0, // Weight decay weight.
"gen_clip_grad": -1, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"disc_clip_grad": -1, // Discriminator gradient clipping threshold.
"lr_scheduler_gen": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_gen_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_scheduler_disc": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_disc_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_gen": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_disc": 1e-4,
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 25000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 10,
// PATHS
"output_path": "/home/erogol/Models/LJSpeech/"
}

View File

@ -0,0 +1,144 @@
{
"run_name": "multiband-melgan",
"run_description": "multiband melgan mean-var scaling",
// AUDIO PARAMETERS
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": "/home/erogol/Data/LJSpeech-1.1/scale_stats.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
// "distributed":{
// "backend": "nccl",
// "url": "tcp:\/\/localhost:54321"
// },
// MODEL PARAMETERS
"use_pqmf": true,
// LOSS PARAMETERS
"use_stft_loss": true,
"use_subband_stft_loss": true,
"use_mse_gan_loss": true,
"use_hinge_gan_loss": false,
"use_feat_match_loss": false, // use only with melgan discriminators
// loss weights
"stft_loss_weight": 0.5,
"subband_stft_loss_weight": 0.5,
"mse_G_loss_weight": 2.5,
"hinge_G_loss_weight": 2.5,
"feat_match_loss_weight": 25,
// multiscale stft loss parameters
"stft_loss_params": {
"n_ffts": [1024, 2048, 512],
"hop_lengths": [120, 240, 50],
"win_lengths": [600, 1200, 240]
},
// subband multiscale stft loss parameters
"subband_stft_loss_params":{
"n_ffts": [384, 683, 171],
"hop_lengths": [30, 60, 10],
"win_lengths": [150, 300, 60]
},
"target_loss": "avg_G_loss", // loss value to pick the best model to save after each epoch
// DISCRIMINATOR
"discriminator_model": "melgan_multiscale_discriminator",
"discriminator_model_params":{
"base_channels": 16,
"max_channels":512,
"downsample_factors":[4, 4, 4]
},
"steps_to_start_discriminator": 200000, // steps required to start GAN trainining.1
// GENERATOR
"generator_model": "multiband_melgan_generator",
"generator_model_params": {
"upsample_factors":[8, 4, 2],
"num_res_blocks": 4
},
// DATASET
"data_path": "/home/erogol/Data/LJSpeech-1.1/wavs/",
"feature_path": null,
"seq_len": 16384,
"pad_short": 2000,
"conv_pad": 0,
"use_noise_augment": false,
"use_cache": true,
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 64, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
// VALIDATION
"run_eval": true,
"test_delay_epochs": 10, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
"epochs": 10000, // total number of epochs to train.
"wd": 0.0, // Weight decay weight.
"gen_clip_grad": -1, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"disc_clip_grad": -1, // Discriminator gradient clipping threshold.
"lr_scheduler_gen": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_gen_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_scheduler_disc": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_disc_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_gen": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_disc": 1e-4,
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 25000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 10,
// PATHS
"output_path": "/home/erogol/Models/LJSpeech/"
}

View File

@ -0,0 +1,144 @@
{
"run_name": "multiband-melgan",
"run_description": "multiband melgan mean-var scaling",
// AUDIO PARAMETERS
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": "/home/erogol/Data/MozillaMerged22050/scale_stats.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
// "distributed":{
// "backend": "nccl",
// "url": "tcp:\/\/localhost:54321"
// },
// MODEL PARAMETERS
"use_pqmf": true,
// LOSS PARAMETERS
"use_stft_loss": true,
"use_subband_stft_loss": true,
"use_mse_gan_loss": true,
"use_hinge_gan_loss": false,
"use_feat_match_loss": false, // use only with melgan discriminators
// loss weights
"stft_loss_weight": 0.5,
"subband_stft_loss_weight": 0.5,
"mse_G_loss_weight": 2.5,
"hinge_G_loss_weight": 2.5,
"feat_match_loss_weight": 25,
// multiscale stft loss parameters
"stft_loss_params": {
"n_ffts": [1024, 2048, 512],
"hop_lengths": [120, 240, 50],
"win_lengths": [600, 1200, 240]
},
// subband multiscale stft loss parameters
"subband_stft_loss_params":{
"n_ffts": [384, 683, 171],
"hop_lengths": [30, 60, 10],
"win_lengths": [150, 300, 60]
},
"target_loss": "avg_G_loss", // loss value to pick the best model to save after each epoch
// DISCRIMINATOR
"discriminator_model": "melgan_multiscale_discriminator",
"discriminator_model_params":{
"base_channels": 16,
"max_channels":512,
"downsample_factors":[4, 4, 4]
},
"steps_to_start_discriminator": 200000, // steps required to start GAN trainining.1
// GENERATOR
"generator_model": "multiband_melgan_generator",
"generator_model_params": {
"upsample_factors":[8, 4, 2],
"num_res_blocks": 4
},
// DATASET
"data_path": "/home/erogol/Data/MozillaMerged22050/wavs/",
"feature_path": null,
"seq_len": 16384,
"pad_short": 2000,
"conv_pad": 0,
"use_noise_augment": false,
"use_cache": true,
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 64, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
// VALIDATION
"run_eval": true,
"test_delay_epochs": 10, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
"epochs": 10000, // total number of epochs to train.
"wd": 0.0, // Weight decay weight.
"gen_clip_grad": -1, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"disc_clip_grad": -1, // Discriminator gradient clipping threshold.
"lr_scheduler_gen": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_gen_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_scheduler_disc": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_disc_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_gen": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_disc": 1e-4,
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 25000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 10,
// PATHS
"output_path": "/home/erogol/Models/Mozilla/"
}

View File

@ -0,0 +1,143 @@
{
"run_name": "pwgan",
"run_description": "parallel-wavegan training",
// AUDIO PARAMETERS
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectogram frame.
"win_length": 1024, // stft window length in ms.
"hop_length": 256, // stft window hop-lengh in ms.
"frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-lengh in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for timming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value appplied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": "/home/erogol/Data/LJSpeech-1.1/scale_stats.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
// "distributed":{
// "backend": "nccl",
// "url": "tcp:\/\/localhost:54321"
// },
// MODEL PARAMETERS
"use_pqmf": true,
// LOSS PARAMETERS
"use_stft_loss": true,
"use_subband_stft_loss": false, // USE ONLY WITH MULTIBAND MODELS
"use_mse_gan_loss": true,
"use_hinge_gan_loss": false,
"use_feat_match_loss": false, // use only with melgan discriminators
// loss weights
"stft_loss_weight": 0.5,
"subband_stft_loss_weight": 0.5,
"mse_G_loss_weight": 2.5,
"hinge_G_loss_weight": 2.5,
"feat_match_loss_weight": 25,
// multiscale stft loss parameters
"stft_loss_params": {
"n_ffts": [1024, 2048, 512],
"hop_lengths": [120, 240, 50],
"win_lengths": [600, 1200, 240]
},
// subband multiscale stft loss parameters
"subband_stft_loss_params":{
"n_ffts": [384, 683, 171],
"hop_lengths": [30, 60, 10],
"win_lengths": [150, 300, 60]
},
"target_loss": "avg_G_loss", // loss value to pick the best model to save after each epoch
// DISCRIMINATOR
"discriminator_model": "parallel_wavegan_discriminator",
"discriminator_model_params":{
"num_layers": 10
},
"steps_to_start_discriminator": 200000, // steps required to start GAN trainining.1
// GENERATOR
"generator_model": "parallel_wavegan_generator",
"generator_model_params": {
"upsample_factors":[4, 4, 4, 4],
"stacks": 3,
"num_res_blocks": 30
},
// DATASET
"data_path": "/home/erogol/Data/LJSpeech-1.1/wavs/",
"feature_path": null,
"seq_len": 25600,
"pad_short": 2000,
"conv_pad": 0,
"use_noise_augment": false,
"use_cache": true,
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 6, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
// VALIDATION
"run_eval": true,
"test_delay_epochs": 10, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
"epochs": 10000, // total number of epochs to train.
"wd": 0.0, // Weight decay weight.
"gen_clip_grad": -1, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"disc_clip_grad": -1, // Discriminator gradient clipping threshold.
"lr_scheduler_gen": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_gen_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_scheduler_disc": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_disc_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_gen": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_disc": 1e-4,
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log traning on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 25000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 10,
// PATHS
"output_path": "/home/erogol/Models/LJSpeech/"
}

View File

@ -0,0 +1,127 @@
import os
import glob
import torch
import random
import numpy as np
from torch.utils.data import Dataset
from multiprocessing import Manager
class GANDataset(Dataset):
"""
GAN Dataset searches for all the wav files under the root path,
converts them to acoustic features on the fly and returns
random segments of (audio, feature) couples.
"""
def __init__(self,
ap,
items,
seq_len,
hop_len,
pad_short,
conv_pad=2,
is_training=True,
return_segments=True,
use_noise_augment=False,
use_cache=False,
verbose=False):
self.ap = ap
self.item_list = items
self.compute_feat = not isinstance(items[0], (tuple, list))
self.seq_len = seq_len
self.hop_len = hop_len
self.pad_short = pad_short
self.conv_pad = conv_pad
self.is_training = is_training
self.return_segments = return_segments
self.use_cache = use_cache
self.use_noise_augment = use_noise_augment
self.verbose = verbose
assert seq_len % hop_len == 0, " [!] seq_len has to be a multiple of hop_len."
self.feat_frame_len = seq_len // hop_len + (2 * conv_pad)
# map G and D instances
self.G_to_D_mappings = list(range(len(self.item_list)))
self.shuffle_mapping()
# cache acoustic features
if use_cache:
self.create_feature_cache()
def create_feature_cache(self):
self.manager = Manager()
self.cache = self.manager.list()
self.cache += [None for _ in range(len(self.item_list))]
@staticmethod
def find_wav_files(path):
return glob.glob(os.path.join(path, '**', '*.wav'), recursive=True)
def __len__(self):
return len(self.item_list)
def __getitem__(self, idx):
""" Return different items for Generator and Discriminator and
cache acoustic features """
if self.return_segments:
idx2 = self.G_to_D_mappings[idx]
item1 = self.load_item(idx)
item2 = self.load_item(idx2)
return item1, item2
item1 = self.load_item(idx)
return item1
def shuffle_mapping(self):
random.shuffle(self.G_to_D_mappings)
def load_item(self, idx):
""" load (audio, feat) couple """
if self.compute_feat:
# compute features from wav
wavpath = self.item_list[idx]
# print(wavpath)
if self.use_cache and self.cache[idx] is not None:
audio, mel = self.cache[idx]
else:
audio = self.ap.load_wav(wavpath)
if len(audio) < self.seq_len + self.pad_short:
audio = np.pad(audio, (0, self.seq_len + self.pad_short - len(audio)), \
mode='constant', constant_values=0.0)
mel = self.ap.melspectrogram(audio)
else:
# load precomputed features
wavpath, feat_path = self.item_list[idx]
if self.use_cache and self.cache[idx] is not None:
audio, mel = self.cache[idx]
else:
audio = self.ap.load_wav(wavpath)
mel = np.load(feat_path)
# correct the audio length wrt padding applied in stft
audio = np.pad(audio, (0, self.hop_len), mode="edge")
audio = audio[:mel.shape[-1] * self.hop_len]
assert mel.shape[-1] * self.hop_len == audio.shape[-1], f' [!] {mel.shape[-1] * self.hop_len} vs {audio.shape[-1]}'
audio = torch.from_numpy(audio).float().unsqueeze(0)
mel = torch.from_numpy(mel).float().squeeze(0)
if self.return_segments:
max_mel_start = mel.shape[1] - self.feat_frame_len
mel_start = random.randint(0, max_mel_start)
mel_end = mel_start + self.feat_frame_len
mel = mel[:, mel_start:mel_end]
audio_start = mel_start * self.hop_len
audio = audio[:, audio_start:audio_start +
self.seq_len]
if self.use_noise_augment and self.is_training and self.return_segments:
audio = audio + (1 / 32768) * torch.randn_like(audio)
return (mel, audio)
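A minimal sketch of wiring this dataset into a `DataLoader`; paths, audio settings and batch size are placeholders taken from the example configs above, and the import paths are assumptions:

```python
# Usage sketch (paths and settings are placeholders; import paths assumed).
from torch.utils.data import DataLoader
from mozilla_voice_tts.utils.audio import AudioProcessor
from mozilla_voice_tts.vocoder.datasets.gan_dataset import GANDataset

ap = AudioProcessor(sample_rate=22050, num_mels=80, fft_size=1024, hop_length=256,
                    win_length=1024, mel_fmin=50.0, mel_fmax=7600.0, min_level_db=-100,
                    ref_level_db=0, spec_gain=1.0, signal_norm=True, do_trim_silence=False)
wav_files = GANDataset.find_wav_files("/data/LJSpeech-1.1/wavs")   # or precomputed (wav, feat) pairs
dataset = GANDataset(ap, wav_files, seq_len=16384, hop_len=256, pad_short=2000,
                     conv_pad=0, return_segments=True, use_noise_augment=False, use_cache=False)
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)
(mel_g, audio_g), (mel_d, audio_d) = next(iter(loader))   # separate items for G and D updates
```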

View File

@ -0,0 +1,37 @@
import glob
import os
from pathlib import Path
import numpy as np
def find_wav_files(data_path):
wav_paths = glob.glob(os.path.join(data_path, '**', '*.wav'), recursive=True)
return wav_paths
def find_feat_files(data_path):
feat_paths = glob.glob(os.path.join(data_path, '**', '*.npy'), recursive=True)
return feat_paths
def load_wav_data(data_path, eval_split_size):
wav_paths = find_wav_files(data_path)
np.random.seed(0)
np.random.shuffle(wav_paths)
return wav_paths[:eval_split_size], wav_paths[eval_split_size:]
def load_wav_feat_data(data_path, feat_path, eval_split_size):
wav_paths = sorted(find_wav_files(data_path))
feat_paths = sorted(find_feat_files(feat_path))
assert len(wav_paths) == len(feat_paths)
for wav, feat in zip(wav_paths, feat_paths):
wav_name = Path(wav).stem
feat_name = Path(feat).stem
assert wav_name == feat_name
items = list(zip(wav_paths, feat_paths))
np.random.seed(0)
np.random.shuffle(items)
return items[:eval_split_size], items[eval_split_size:]
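Both loaders return an `(eval_items, train_items)` pair; a small sketch (paths are placeholders, import path assumed):

```python
# Split sketch (import path assumed; paths are placeholders).
from mozilla_voice_tts.vocoder.datasets.preprocess import load_wav_data, load_wav_feat_data

eval_items, train_items = load_wav_data("/data/LJSpeech-1.1/wavs", eval_split_size=10)
# with features precomputed elsewhere (e.g. by a trained TTS model):
# eval_items, train_items = load_wav_feat_data("/data/wavs", "/data/feats", eval_split_size=10)
```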

View File

@ -0,0 +1,309 @@
import torch
from torch import nn
from torch.nn import functional as F
class TorchSTFT():
def __init__(self, n_fft, hop_length, win_length, window='hann_window'):
""" Torch based STFT operation """
self.n_fft = n_fft
self.hop_length = hop_length
self.win_length = win_length
self.window = getattr(torch, window)(win_length)
def __call__(self, x):
# B x D x T x 2
o = torch.stft(x,
self.n_fft,
self.hop_length,
self.win_length,
self.window,
center=True,
pad_mode="reflect", # compatible with audio.py
normalized=False,
onesided=True)
M = o[:, :, :, 0]
P = o[:, :, :, 1]
return torch.sqrt(torch.clamp(M ** 2 + P ** 2, min=1e-8))
#################################
# GENERATOR LOSSES
#################################
class STFTLoss(nn.Module):
""" Single scale STFT Loss """
def __init__(self, n_fft, hop_length, win_length):
super(STFTLoss, self).__init__()
self.n_fft = n_fft
self.hop_length = hop_length
self.win_length = win_length
self.stft = TorchSTFT(n_fft, hop_length, win_length)
def forward(self, y_hat, y):
y_hat_M = self.stft(y_hat)
y_M = self.stft(y)
# magnitude loss
loss_mag = F.l1_loss(torch.log(y_M), torch.log(y_hat_M))
# spectral convergence loss
loss_sc = torch.norm(y_M - y_hat_M, p="fro") / torch.norm(y_M, p="fro")
return loss_mag, loss_sc
class MultiScaleSTFTLoss(torch.nn.Module):
""" Multi scale STFT loss """
def __init__(self,
n_ffts=(1024, 2048, 512),
hop_lengths=(120, 240, 50),
win_lengths=(600, 1200, 240)):
super(MultiScaleSTFTLoss, self).__init__()
self.loss_funcs = torch.nn.ModuleList()
for n_fft, hop_length, win_length in zip(n_ffts, hop_lengths, win_lengths):
self.loss_funcs.append(STFTLoss(n_fft, hop_length, win_length))
def forward(self, y_hat, y):
N = len(self.loss_funcs)
loss_sc = 0
loss_mag = 0
for f in self.loss_funcs:
lm, lsc = f(y_hat, y)
loss_mag += lm
loss_sc += lsc
loss_sc /= N
loss_mag /= N
return loss_mag, loss_sc
class MultiScaleSubbandSTFTLoss(MultiScaleSTFTLoss):
""" Multiscale STFT loss for multi band model outputs """
# pylint: disable=no-self-use
def forward(self, y_hat, y):
y_hat = y_hat.view(-1, 1, y_hat.shape[2])
y = y.view(-1, 1, y.shape[2])
return super().forward(y_hat.squeeze(1), y.squeeze(1))
class MSEGLoss(nn.Module):
""" Mean Squared Generator Loss """
# pylint: disable=no-self-use
def forward(self, score_real):
loss_fake = F.mse_loss(score_real, score_real.new_ones(score_real.shape))
return loss_fake
class HingeGLoss(nn.Module):
""" Hinge Discriminator Loss """
# pylint: disable=no-self-use
def forward(self, score_real):
# TODO: this might be wrong
loss_fake = torch.mean(F.relu(1. - score_real))
return loss_fake
##################################
# DISCRIMINATOR LOSSES
##################################
class MSEDLoss(nn.Module):
""" Mean Squared Discriminator Loss """
def __init__(self,):
super(MSEDLoss, self).__init__()
self.loss_func = nn.MSELoss()
# pylint: disable=no-self-use
def forward(self, score_fake, score_real):
loss_real = self.loss_func(score_real, score_real.new_ones(score_real.shape))
loss_fake = self.loss_func(score_fake, score_fake.new_zeros(score_fake.shape))
loss_d = loss_real + loss_fake
return loss_d, loss_real, loss_fake
class HingeDLoss(nn.Module):
""" Hinge Discriminator Loss """
# pylint: disable=no-self-use
def forward(self, score_fake, score_real):
loss_real = torch.mean(F.relu(1. - score_real))
loss_fake = torch.mean(F.relu(1. + score_fake))
loss_d = loss_real + loss_fake
return loss_d, loss_real, loss_fake
class MelganFeatureLoss(nn.Module):
def __init__(self,):
super(MelganFeatureLoss, self).__init__()
self.loss_func = nn.L1Loss()
# pylint: disable=no-self-use
def forward(self, fake_feats, real_feats):
loss_feats = 0
for fake_feat, real_feat in zip(fake_feats, real_feats):
loss_feats += self.loss_func(fake_feat, real_feat)
loss_feats /= len(fake_feats) + len(real_feats)
return loss_feats
#####################################
# LOSS WRAPPERS
#####################################
def _apply_G_adv_loss(scores_fake, loss_func):
""" Compute G adversarial loss function
and normalize values """
adv_loss = 0
if isinstance(scores_fake, list):
for score_fake in scores_fake:
fake_loss = loss_func(score_fake)
adv_loss += fake_loss
adv_loss /= len(scores_fake)
else:
fake_loss = loss_func(scores_fake)
adv_loss = fake_loss
return adv_loss
def _apply_D_loss(scores_fake, scores_real, loss_func):
""" Compute D loss func and normalize loss values """
loss = 0
real_loss = 0
fake_loss = 0
if isinstance(scores_fake, list):
# multi-scale loss
for score_fake, score_real in zip(scores_fake, scores_real):
# accumulate per-scale losses without shadowing the running totals
total_loss, real_loss_, fake_loss_ = loss_func(score_fake=score_fake, score_real=score_real)
loss += total_loss
real_loss += real_loss_
fake_loss += fake_loss_
# normalize loss values with number of scales
loss /= len(scores_fake)
real_loss /= len(scores_real)
fake_loss /= len(scores_fake)
else:
# single scale loss
total_loss, real_loss, fake_loss = loss_func(scores_fake, scores_real)
loss = total_loss
return loss, real_loss, fake_loss
##################################
# MODEL LOSSES
##################################
class GeneratorLoss(nn.Module):
def __init__(self, C):
""" Compute Generator Loss values depending on training
configuration """
super(GeneratorLoss, self).__init__()
assert not(C.use_mse_gan_loss and C.use_hinge_gan_loss),\
" [!] Cannot use HingeGANLoss and MSEGANLoss together."
self.use_stft_loss = C.use_stft_loss
self.use_subband_stft_loss = C.use_subband_stft_loss
self.use_mse_gan_loss = C.use_mse_gan_loss
self.use_hinge_gan_loss = C.use_hinge_gan_loss
self.use_feat_match_loss = C.use_feat_match_loss
self.stft_loss_weight = C.stft_loss_weight
self.subband_stft_loss_weight = C.subband_stft_loss_weight
self.mse_gan_loss_weight = C.mse_G_loss_weight
self.hinge_gan_loss_weight = C.hinge_G_loss_weight
self.feat_match_loss_weight = C.feat_match_loss_weight
if C.use_stft_loss:
self.stft_loss = MultiScaleSTFTLoss(**C.stft_loss_params)
if C.use_subband_stft_loss:
self.subband_stft_loss = MultiScaleSubbandSTFTLoss(**C.subband_stft_loss_params)
if C.use_mse_gan_loss:
self.mse_loss = MSEGLoss()
if C.use_hinge_gan_loss:
self.hinge_loss = HingeGLoss()
if C.use_feat_match_loss:
self.feat_match_loss = MelganFeatureLoss()
def forward(self, y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None, y_hat_sub=None, y_sub=None):
gen_loss = 0
adv_loss = 0
return_dict = {}
# STFT Loss
if self.use_stft_loss:
stft_loss_mg, stft_loss_sc = self.stft_loss(y_hat.squeeze(1), y.squeeze(1))
return_dict['G_stft_loss_mg'] = stft_loss_mg
return_dict['G_stft_loss_sc'] = stft_loss_sc
gen_loss += self.stft_loss_weight * (stft_loss_mg + stft_loss_sc)
# subband STFT Loss
if self.use_subband_stft_loss:
subband_stft_loss_mg, subband_stft_loss_sc = self.subband_stft_loss(y_hat_sub, y_sub)
return_dict['G_subband_stft_loss_mg'] = subband_stft_loss_mg
return_dict['G_subband_stft_loss_sc'] = subband_stft_loss_sc
gen_loss += self.subband_stft_loss_weight * (subband_stft_loss_mg + subband_stft_loss_sc)
# multiscale MSE adversarial loss
if self.use_mse_gan_loss and scores_fake is not None:
mse_fake_loss = _apply_G_adv_loss(scores_fake, self.mse_loss)
return_dict['G_mse_fake_loss'] = mse_fake_loss
adv_loss += self.mse_gan_loss_weight * mse_fake_loss
# multiscale Hinge adversarial loss
if self.use_hinge_gan_loss and scores_fake is not None:
hinge_fake_loss = _apply_G_adv_loss(scores_fake, self.hinge_loss)
return_dict['G_hinge_fake_loss'] = hinge_fake_loss
adv_loss += self.hinge_gan_loss_weight * hinge_fake_loss
# Feature Matching Loss
if self.use_feat_match_loss and feats_fake is not None:
feat_match_loss = self.feat_match_loss(feats_fake, feats_real)
return_dict['G_feat_match_loss'] = feat_match_loss
adv_loss += self.feat_match_loss_weight * feat_match_loss
return_dict['G_loss'] = gen_loss + adv_loss
return_dict['G_gen_loss'] = gen_loss
return_dict['G_adv_loss'] = adv_loss
return return_dict
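
As a usage sketch (not part of this diff): a hypothetical config with only the multi-scale MSE adversarial loss enabled, so the STFT and feature-matching inputs can be omitted in `forward()`. It assumes the rest of this losses module (`MSEGLoss`, `_apply_G_adv_loss`) is in scope.

```python
from types import SimpleNamespace
import torch

C = SimpleNamespace(
    use_stft_loss=False, use_subband_stft_loss=False,
    use_mse_gan_loss=True, use_hinge_gan_loss=False, use_feat_match_loss=False,
    stft_loss_weight=0.0, subband_stft_loss_weight=0.0,
    mse_G_loss_weight=1.0, hinge_G_loss_weight=0.0, feat_match_loss_weight=0.0)

criterion = GeneratorLoss(C)
scores_fake = [torch.rand(4, 1, 32) for _ in range(3)]   # e.g. per-scale D outputs for G(z)
losses = criterion(scores_fake=scores_fake)
print(losses['G_loss'], losses['G_adv_loss'], losses['G_mse_fake_loss'])
```
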
class DiscriminatorLoss(nn.Module):
""" Compute Discriminator Loss values depending on training
configuration """
def __init__(self, C):
super(DiscriminatorLoss, self).__init__()
assert not(C.use_mse_gan_loss and C.use_hinge_gan_loss),\
" [!] Cannot use HingeGANLoss and MSEGANLoss together."
self.use_mse_gan_loss = C.use_mse_gan_loss
self.use_hinge_gan_loss = C.use_hinge_gan_loss
if C.use_mse_gan_loss:
self.mse_loss = MSEDLoss()
if C.use_hinge_gan_loss:
self.hinge_loss = HingeDLoss()
def forward(self, scores_fake, scores_real):
loss = 0
return_dict = {}
if self.use_mse_gan_loss:
mse_D_loss, mse_D_real_loss, mse_D_fake_loss = _apply_D_loss(
scores_fake=scores_fake,
scores_real=scores_real,
loss_func=self.mse_loss)
return_dict['D_mse_gan_loss'] = mse_D_loss
return_dict['D_mse_gan_real_loss'] = mse_D_real_loss
return_dict['D_mse_gan_fake_loss'] = mse_D_fake_loss
loss += mse_D_loss
if self.use_hinge_gan_loss:
hinge_D_loss, hinge_D_real_loss, hinge_D_fake_loss = _apply_D_loss(
scores_fake=scores_fake,
scores_real=scores_real,
loss_func=self.hinge_loss)
return_dict['D_hinge_gan_loss'] = hinge_D_loss
return_dict['D_hinge_gan_real_loss'] = hinge_D_real_loss
return_dict['D_hinge_gan_fake_loss'] = hinge_D_fake_loss
loss += hinge_D_loss
return_dict['D_loss'] = loss
return return_dict
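
A matching discriminator-side sketch (not part of this diff), assuming `MSEDLoss` from this module is in scope; the constructor only needs the two GAN-loss flags on the config.

```python
from types import SimpleNamespace
import torch

C = SimpleNamespace(use_mse_gan_loss=True, use_hinge_gan_loss=False)
criterion = DiscriminatorLoss(C)

scores_real = [torch.rand(4, 1, 32) for _ in range(3)]   # D(x) per scale
scores_fake = [torch.rand(4, 1, 32) for _ in range(3)]   # D(G(z)) per scale
losses = criterion(scores_fake, scores_real)
print(losses['D_loss'], losses['D_mse_gan_real_loss'], losses['D_mse_gan_fake_loss'])
```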

View File

@ -0,0 +1,45 @@
from torch import nn
from torch.nn.utils import weight_norm
class ResidualStack(nn.Module):
def __init__(self, channels, num_res_blocks, kernel_size):
super(ResidualStack, self).__init__()
assert (kernel_size - 1) % 2 == 0, " [!] kernel_size has to be odd."
base_padding = (kernel_size - 1) // 2
self.blocks = nn.ModuleList()
for idx in range(num_res_blocks):
layer_kernel_size = kernel_size
layer_dilation = layer_kernel_size**idx
layer_padding = base_padding * layer_dilation
self.blocks += [nn.Sequential(
nn.LeakyReLU(0.2),
nn.ReflectionPad1d(layer_padding),
weight_norm(
nn.Conv1d(channels,
channels,
kernel_size=kernel_size,
dilation=layer_dilation,
bias=True)),
nn.LeakyReLU(0.2),
weight_norm(
nn.Conv1d(channels, channels, kernel_size=1, bias=True)),
)]
self.shortcuts = nn.ModuleList([
weight_norm(nn.Conv1d(channels, channels, kernel_size=1,
bias=True)) for i in range(num_res_blocks)
])
def forward(self, x):
for block, shortcut in zip(self.blocks, self.shortcuts):
x = shortcut(x) + block(x)
return x
def remove_weight_norm(self):
for block, shortcut in zip(self.blocks, self.shortcuts):
nn.utils.remove_weight_norm(block[2])
nn.utils.remove_weight_norm(block[4])
nn.utils.remove_weight_norm(shortcut)
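
Because each layer's reflection padding matches its dilation, the stack preserves both channel count and time length. A small shape check (not part of this diff):

```python
import torch

stack = ResidualStack(channels=32, num_res_blocks=3, kernel_size=3)
x = torch.randn(2, 32, 100)   # (batch, channels, time)
y = stack(x)
assert y.shape == x.shape     # residual stack is shape-preserving
stack.remove_weight_norm()    # typically called once before inference/export
```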

View File

@ -0,0 +1,87 @@
import torch
from torch.nn import functional as F
class ResidualBlock(torch.nn.Module):
"""Residual block module in WaveNet."""
def __init__(self,
kernel_size=3,
res_channels=64,
gate_channels=128,
skip_channels=64,
aux_channels=80,
dropout=0.0,
dilation=1,
bias=True,
use_causal_conv=False):
super(ResidualBlock, self).__init__()
self.dropout = dropout
# no future time stamps available
if use_causal_conv:
padding = (kernel_size - 1) * dilation
else:
            assert (kernel_size - 1) % 2 == 0, "Only odd kernel sizes are supported."
padding = (kernel_size - 1) // 2 * dilation
self.use_causal_conv = use_causal_conv
# dilation conv
self.conv = torch.nn.Conv1d(res_channels,
gate_channels,
kernel_size,
padding=padding,
dilation=dilation,
bias=bias)
# local conditioning
if aux_channels > 0:
self.conv1x1_aux = torch.nn.Conv1d(aux_channels,
gate_channels,
1,
bias=False)
else:
self.conv1x1_aux = None
# conv output is split into two groups
gate_out_channels = gate_channels // 2
self.conv1x1_out = torch.nn.Conv1d(gate_out_channels,
res_channels,
1,
bias=bias)
self.conv1x1_skip = torch.nn.Conv1d(gate_out_channels,
skip_channels,
1,
bias=bias)
def forward(self, x, c):
"""
x: B x D_res x T
c: B x D_aux x T
"""
residual = x
x = F.dropout(x, p=self.dropout, training=self.training)
x = self.conv(x)
        # remove future time steps if using causal conv
x = x[:, :, :residual.size(-1)] if self.use_causal_conv else x
# split into two part for gated activation
splitdim = 1
xa, xb = x.split(x.size(splitdim) // 2, dim=splitdim)
# local conditioning
if c is not None:
assert self.conv1x1_aux is not None
c = self.conv1x1_aux(c)
ca, cb = c.split(c.size(splitdim) // 2, dim=splitdim)
xa, xb = xa + ca, xb + cb
x = torch.tanh(xa) * torch.sigmoid(xb)
# for skip connection
s = self.conv1x1_skip(x)
# for residual connection
x = (self.conv1x1_out(x) + residual) * (0.5**2)
return x, s
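
A shape sketch (not part of this diff) with the default channel sizes; `c` stands for local conditioning features (e.g. upsampled mels) aligned with `x` in time:

```python
import torch

block = ResidualBlock()            # kernel 3, 64 res / 128 gate / 64 skip / 80 aux channels
x = torch.randn(2, 64, 200)        # B x D_res x T
c = torch.randn(2, 80, 200)        # B x D_aux x T
res_out, skip_out = block(x, c)
assert res_out.shape == (2, 64, 200)    # residual path: res_channels x T
assert skip_out.shape == (2, 64, 200)   # skip path: skip_channels x T
```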

View File

@ -0,0 +1,56 @@
import numpy as np
import torch
import torch.nn.functional as F
from scipy import signal as sig
# adapted from
# https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/parallel_wavegan
class PQMF(torch.nn.Module):
def __init__(self, N=4, taps=62, cutoff=0.15, beta=9.0):
super(PQMF, self).__init__()
self.N = N
self.taps = taps
self.cutoff = cutoff
self.beta = beta
QMF = sig.firwin(taps + 1, cutoff, window=('kaiser', beta))
H = np.zeros((N, len(QMF)))
G = np.zeros((N, len(QMF)))
for k in range(N):
constant_factor = (2 * k + 1) * (np.pi /
(2 * N)) * (np.arange(taps + 1) -
((taps - 1) / 2))
phase = (-1)**k * np.pi / 4
H[k] = 2 * QMF * np.cos(constant_factor + phase)
G[k] = 2 * QMF * np.cos(constant_factor - phase)
H = torch.from_numpy(H[:, None, :]).float()
G = torch.from_numpy(G[None, :, :]).float()
self.register_buffer("H", H)
self.register_buffer("G", G)
updown_filter = torch.zeros((N, N, N)).float()
for k in range(N):
updown_filter[k, k, 0] = 1.0
self.register_buffer("updown_filter", updown_filter)
self.N = N
self.pad_fn = torch.nn.ConstantPad1d(taps // 2, 0.0)
def forward(self, x):
return self.analysis(x)
def analysis(self, x):
return F.conv1d(x, self.H, padding=self.taps // 2, stride=self.N)
def synthesis(self, x):
x = F.conv_transpose1d(x,
self.updown_filter * self.N,
stride=self.N)
x = F.conv1d(x, self.G, padding=self.taps // 2)
return x
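
`analysis` splits a waveform into N critically sampled sub-bands and `synthesis` recombines them; reconstruction is approximate for a pseudo-QMF bank. A shape sketch (not part of this diff), assuming the input length is divisible by N:

```python
import torch

pqmf = PQMF(N=4)
x = torch.randn(1, 1, 4096)        # (batch, 1, samples)
subbands = pqmf.analysis(x)        # -> (1, 4, 1024): N bands at 1/N rate
x_hat = pqmf.synthesis(subbands)   # -> (1, 1, 4096)
print((x - x_hat).abs().max())     # approximate reconstruction error
```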

View File

@ -0,0 +1,640 @@
0.0000000e+000
-5.5252865e-004
-5.6176926e-004
-4.9475181e-004
-4.8752280e-004
-4.8937912e-004
-5.0407143e-004
-5.2265643e-004
-5.4665656e-004
-5.6778026e-004
-5.8709305e-004
-6.1327474e-004
-6.3124935e-004
-6.5403334e-004
-6.7776908e-004
-6.9416146e-004
-7.1577365e-004
-7.2550431e-004
-7.4409419e-004
-7.4905981e-004
-7.6813719e-004
-7.7248486e-004
-7.8343323e-004
-7.7798695e-004
-7.8036647e-004
-7.8014496e-004
-7.7579773e-004
-7.6307936e-004
-7.5300014e-004
-7.3193572e-004
-7.2153920e-004
-6.9179375e-004
-6.6504151e-004
-6.3415949e-004
-5.9461189e-004
-5.5645764e-004
-5.1455722e-004
-4.6063255e-004
-4.0951215e-004
-3.5011759e-004
-2.8969812e-004
-2.0983373e-004
-1.4463809e-004
-6.1733441e-005
1.3494974e-005
1.0943831e-004
2.0430171e-004
2.9495311e-004
4.0265402e-004
5.1073885e-004
6.2393761e-004
7.4580259e-004
8.6084433e-004
9.8859883e-004
1.1250155e-003
1.2577885e-003
1.3902495e-003
1.5443220e-003
1.6868083e-003
1.8348265e-003
1.9841141e-003
2.1461584e-003
2.3017255e-003
2.4625617e-003
2.6201759e-003
2.7870464e-003
2.9469448e-003
3.1125421e-003
3.2739613e-003
3.4418874e-003
3.6008268e-003
3.7603923e-003
3.9207432e-003
4.0819753e-003
4.2264269e-003
4.3730720e-003
4.5209853e-003
4.6606461e-003
4.7932561e-003
4.9137604e-003
5.0393023e-003
5.1407354e-003
5.2461166e-003
5.3471681e-003
5.4196776e-003
5.4876040e-003
5.5475715e-003
5.5938023e-003
5.6220643e-003
5.6455197e-003
5.6389200e-003
5.6266114e-003
5.5917129e-003
5.5404364e-003
5.4753783e-003
5.3838976e-003
5.2715759e-003
5.1382275e-003
4.9839688e-003
4.8109469e-003
4.6039530e-003
4.3801862e-003
4.1251642e-003
3.8456408e-003
3.5401247e-003
3.2091886e-003
2.8446758e-003
2.4508540e-003
2.0274176e-003
1.5784683e-003
1.0902329e-003
5.8322642e-004
2.7604519e-005
-5.4642809e-004
-1.1568136e-003
-1.8039473e-003
-2.4826724e-003
-3.1933778e-003
-3.9401124e-003
-4.7222596e-003
-5.5337211e-003
-6.3792293e-003
-7.2615817e-003
-8.1798233e-003
-9.1325330e-003
-1.0115022e-002
-1.1131555e-002
-1.2185000e-002
-1.3271822e-002
-1.4390467e-002
-1.5540555e-002
-1.6732471e-002
-1.7943338e-002
-1.9187243e-002
-2.0453179e-002
-2.1746755e-002
-2.3068017e-002
-2.4416099e-002
-2.5787585e-002
-2.7185943e-002
-2.8607217e-002
-3.0050266e-002
-3.1501761e-002
-3.2975408e-002
-3.4462095e-002
-3.5969756e-002
-3.7481285e-002
-3.9005368e-002
-4.0534917e-002
-4.2064909e-002
-4.3609754e-002
-4.5148841e-002
-4.6684303e-002
-4.8216572e-002
-4.9738576e-002
-5.1255616e-002
-5.2763075e-002
-5.4245277e-002
-5.5717365e-002
-5.7161645e-002
-5.8591568e-002
-5.9983748e-002
-6.1345517e-002
-6.2685781e-002
-6.3971590e-002
-6.5224711e-002
-6.6436751e-002
-6.7607599e-002
-6.8704383e-002
-6.9763024e-002
-7.0762871e-002
-7.1700267e-002
-7.2568258e-002
-7.3362026e-002
-7.4100364e-002
-7.4745256e-002
-7.5313734e-002
-7.5800836e-002
-7.6199248e-002
-7.6499217e-002
-7.6709349e-002
-7.6817398e-002
-7.6823001e-002
-7.6720492e-002
-7.6505072e-002
-7.6174832e-002
-7.5730576e-002
-7.5157626e-002
-7.4466439e-002
-7.3640601e-002
-7.2677464e-002
-7.1582636e-002
-7.0353307e-002
-6.8966401e-002
-6.7452502e-002
-6.5769067e-002
-6.3944481e-002
-6.1960278e-002
-5.9816657e-002
-5.7515269e-002
-5.5046003e-002
-5.2409382e-002
-4.9597868e-002
-4.6630331e-002
-4.3476878e-002
-4.0145828e-002
-3.6641812e-002
-3.2958393e-002
-2.9082401e-002
-2.5030756e-002
-2.0799707e-002
-1.6370126e-002
-1.1762383e-002
-6.9636862e-003
-1.9765601e-003
3.2086897e-003
8.5711749e-003
1.4128883e-002
1.9883413e-002
2.5822729e-002
3.1953127e-002
3.8277657e-002
4.4780682e-002
5.1480418e-002
5.8370533e-002
6.5440985e-002
7.2694330e-002
8.0137293e-002
8.7754754e-002
9.5553335e-002
1.0353295e-001
1.1168269e-001
1.2000780e-001
1.2850029e-001
1.3715518e-001
1.4597665e-001
1.5496071e-001
1.6409589e-001
1.7338082e-001
1.8281725e-001
1.9239667e-001
2.0212502e-001
2.1197359e-001
2.2196527e-001
2.3206909e-001
2.4230169e-001
2.5264803e-001
2.6310533e-001
2.7366340e-001
2.8432142e-001
2.9507167e-001
3.0590986e-001
3.1682789e-001
3.2781137e-001
3.3887227e-001
3.4999141e-001
3.6115899e-001
3.7237955e-001
3.8363500e-001
3.9492118e-001
4.0623177e-001
4.1756969e-001
4.2891199e-001
4.4025538e-001
4.5159965e-001
4.6293081e-001
4.7424532e-001
4.8552531e-001
4.9677083e-001
5.0798175e-001
5.1912350e-001
5.3022409e-001
5.4125534e-001
5.5220513e-001
5.6307891e-001
5.7385241e-001
5.8454032e-001
5.9511231e-001
6.0557835e-001
6.1591099e-001
6.2612427e-001
6.3619801e-001
6.4612697e-001
6.5590163e-001
6.6551399e-001
6.7496632e-001
6.8423533e-001
6.9332824e-001
7.0223887e-001
7.1094104e-001
7.1944626e-001
7.2774489e-001
7.3582118e-001
7.4368279e-001
7.5131375e-001
7.5870808e-001
7.6586749e-001
7.7277809e-001
7.7942875e-001
7.8583531e-001
7.9197358e-001
7.9784664e-001
8.0344858e-001
8.0876950e-001
8.1381913e-001
8.1857760e-001
8.2304199e-001
8.2722753e-001
8.3110385e-001
8.3469374e-001
8.3797173e-001
8.4095414e-001
8.4362383e-001
8.4598185e-001
8.4803158e-001
8.4978052e-001
8.5119715e-001
8.5230470e-001
8.5310209e-001
8.5357206e-001
8.5373856e-001
8.5357206e-001
8.5310209e-001
8.5230470e-001
8.5119715e-001
8.4978052e-001
8.4803158e-001
8.4598185e-001
8.4362383e-001
8.4095414e-001
8.3797173e-001
8.3469374e-001
8.3110385e-001
8.2722753e-001
8.2304199e-001
8.1857760e-001
8.1381913e-001
8.0876950e-001
8.0344858e-001
7.9784664e-001
7.9197358e-001
7.8583531e-001
7.7942875e-001
7.7277809e-001
7.6586749e-001
7.5870808e-001
7.5131375e-001
7.4368279e-001
7.3582118e-001
7.2774489e-001
7.1944626e-001
7.1094104e-001
7.0223887e-001
6.9332824e-001
6.8423533e-001
6.7496632e-001
6.6551399e-001
6.5590163e-001
6.4612697e-001
6.3619801e-001
6.2612427e-001
6.1591099e-001
6.0557835e-001
5.9511231e-001
5.8454032e-001
5.7385241e-001
5.6307891e-001
5.5220513e-001
5.4125534e-001
5.3022409e-001
5.1912350e-001
5.0798175e-001
4.9677083e-001
4.8552531e-001
4.7424532e-001
4.6293081e-001
4.5159965e-001
4.4025538e-001
4.2891199e-001
4.1756969e-001
4.0623177e-001
3.9492118e-001
3.8363500e-001
3.7237955e-001
3.6115899e-001
3.4999141e-001
3.3887227e-001
3.2781137e-001
3.1682789e-001
3.0590986e-001
2.9507167e-001
2.8432142e-001
2.7366340e-001
2.6310533e-001
2.5264803e-001
2.4230169e-001
2.3206909e-001
2.2196527e-001
2.1197359e-001
2.0212502e-001
1.9239667e-001
1.8281725e-001
1.7338082e-001
1.6409589e-001
1.5496071e-001
1.4597665e-001
1.3715518e-001
1.2850029e-001
1.2000780e-001
1.1168269e-001
1.0353295e-001
9.5553335e-002
8.7754754e-002
8.0137293e-002
7.2694330e-002
6.5440985e-002
5.8370533e-002
5.1480418e-002
4.4780682e-002
3.8277657e-002
3.1953127e-002
2.5822729e-002
1.9883413e-002
1.4128883e-002
8.5711749e-003
3.2086897e-003
-1.9765601e-003
-6.9636862e-003
-1.1762383e-002
-1.6370126e-002
-2.0799707e-002
-2.5030756e-002
-2.9082401e-002
-3.2958393e-002
-3.6641812e-002
-4.0145828e-002
-4.3476878e-002
-4.6630331e-002
-4.9597868e-002
-5.2409382e-002
-5.5046003e-002
-5.7515269e-002
-5.9816657e-002
-6.1960278e-002
-6.3944481e-002
-6.5769067e-002
-6.7452502e-002
-6.8966401e-002
-7.0353307e-002
-7.1582636e-002
-7.2677464e-002
-7.3640601e-002
-7.4466439e-002
-7.5157626e-002
-7.5730576e-002
-7.6174832e-002
-7.6505072e-002
-7.6720492e-002
-7.6823001e-002
-7.6817398e-002
-7.6709349e-002
-7.6499217e-002
-7.6199248e-002
-7.5800836e-002
-7.5313734e-002
-7.4745256e-002
-7.4100364e-002
-7.3362026e-002
-7.2568258e-002
-7.1700267e-002
-7.0762871e-002
-6.9763024e-002
-6.8704383e-002
-6.7607599e-002
-6.6436751e-002
-6.5224711e-002
-6.3971590e-002
-6.2685781e-002
-6.1345517e-002
-5.9983748e-002
-5.8591568e-002
-5.7161645e-002
-5.5717365e-002
-5.4245277e-002
-5.2763075e-002
-5.1255616e-002
-4.9738576e-002
-4.8216572e-002
-4.6684303e-002
-4.5148841e-002
-4.3609754e-002
-4.2064909e-002
-4.0534917e-002
-3.9005368e-002
-3.7481285e-002
-3.5969756e-002
-3.4462095e-002
-3.2975408e-002
-3.1501761e-002
-3.0050266e-002
-2.8607217e-002
-2.7185943e-002
-2.5787585e-002
-2.4416099e-002
-2.3068017e-002
-2.1746755e-002
-2.0453179e-002
-1.9187243e-002
-1.7943338e-002
-1.6732471e-002
-1.5540555e-002
-1.4390467e-002
-1.3271822e-002
-1.2185000e-002
-1.1131555e-002
-1.0115022e-002
-9.1325330e-003
-8.1798233e-003
-7.2615817e-003
-6.3792293e-003
-5.5337211e-003
-4.7222596e-003
-3.9401124e-003
-3.1933778e-003
-2.4826724e-003
-1.8039473e-003
-1.1568136e-003
-5.4642809e-004
2.7604519e-005
5.8322642e-004
1.0902329e-003
1.5784683e-003
2.0274176e-003
2.4508540e-003
2.8446758e-003
3.2091886e-003
3.5401247e-003
3.8456408e-003
4.1251642e-003
4.3801862e-003
4.6039530e-003
4.8109469e-003
4.9839688e-003
5.1382275e-003
5.2715759e-003
5.3838976e-003
5.4753783e-003
5.5404364e-003
5.5917129e-003
5.6266114e-003
5.6389200e-003
5.6455197e-003
5.6220643e-003
5.5938023e-003
5.5475715e-003
5.4876040e-003
5.4196776e-003
5.3471681e-003
5.2461166e-003
5.1407354e-003
5.0393023e-003
4.9137604e-003
4.7932561e-003
4.6606461e-003
4.5209853e-003
4.3730720e-003
4.2264269e-003
4.0819753e-003
3.9207432e-003
3.7603923e-003
3.6008268e-003
3.4418874e-003
3.2739613e-003
3.1125421e-003
2.9469448e-003
2.7870464e-003
2.6201759e-003
2.4625617e-003
2.3017255e-003
2.1461584e-003
1.9841141e-003
1.8348265e-003
1.6868083e-003
1.5443220e-003
1.3902495e-003
1.2577885e-003
1.1250155e-003
9.8859883e-004
8.6084433e-004
7.4580259e-004
6.2393761e-004
5.1073885e-004
4.0265402e-004
2.9495311e-004
2.0430171e-004
1.0943831e-004
1.3494974e-005
-6.1733441e-005
-1.4463809e-004
-2.0983373e-004
-2.8969812e-004
-3.5011759e-004
-4.0951215e-004
-4.6063255e-004
-5.1455722e-004
-5.5645764e-004
-5.9461189e-004
-6.3415949e-004
-6.6504151e-004
-6.9179375e-004
-7.2153920e-004
-7.3193572e-004
-7.5300014e-004
-7.6307936e-004
-7.7579773e-004
-7.8014496e-004
-7.8036647e-004
-7.7798695e-004
-7.8343323e-004
-7.7248486e-004
-7.6813719e-004
-7.4905981e-004
-7.4409419e-004
-7.2550431e-004
-7.1577365e-004
-6.9416146e-004
-6.7776908e-004
-6.5403334e-004
-6.3124935e-004
-6.1327474e-004
-5.8709305e-004
-5.6778026e-004
-5.4665656e-004
-5.2265643e-004
-5.0407143e-004
-4.8937912e-004
-4.8752280e-004
-4.9475181e-004
-5.6176926e-004
-5.5252865e-004

Some files were not shown because too many files have changed in this diff.