mirror of https://github.com/coqui-ai/TTS.git
Merge remote-tracking branch 'TTS/dev' into dev
commit 63401797f6
README.md | 50

@@ -1,9 +1,14 @@
 <p align="center"><img src="https://user-images.githubusercontent.com/1402048/52643646-c2102980-2edd-11e9-8c37-b72f3c89a640.png" data-canonical-src="https://user-images.githubusercontent.com/1402048/52643646-c2102980-2edd-11e9-8c37-b72f3c89a640.png" width="320" height="95" /></p>
 
+<p align='center'>
 <img src="https://travis-ci.org/mozilla/TTS.svg?branch=dev"/>
+<a href='https://discourse.mozilla.org/c/tts'><img src="https://img.shields.io/badge/discourse-online-green.svg"/></a>
+</p>
 
-This project is a part of [Mozilla Common Voice](https://voice.mozilla.org/en). TTS aims a deep learning based Text2Speech engine, low in cost and high in quality.
+This project is a part of [Mozilla Common Voice](https://voice.mozilla.org/en).
+
+Mozilla TTS aims to provide a deep learning based Text2Speech engine, low in cost and high in quality.
 
 You can check some of synthesized voice samples from [here](https://erogol.github.io/ddc-samples/).
 
@@ -38,25 +43,26 @@ Vocoders:
 You can also help us implement more models. Some TTS related work can be found [here](https://github.com/erogol/TTS-papers).
 
 ## Features
-- High performance Deep Learning models for Text2Speech related tasks.
-- Text2Speech models (Tacotron, Tacotron2).
+- High performance Deep Learning models for Text2Speech tasks.
+- Text2Spec models (Tacotron, Tacotron2).
 - Speaker Encoder to compute speaker embeddings efficiently.
-- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS)
-- Support for multi-speaker TTS training.
-- Support for Multi-GPUs training.
-- Ability to convert Torch models to Tensorflow 2.0 for inference.
-- Released pre-trained models.
+- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN)
 - Fast and efficient model training.
 - Detailed training logs on console and Tensorboard.
+- Support for multi-speaker TTS.
+- Efficient Multi-GPUs training.
+- Ability to convert PyTorch models to Tensorflow 2.0 and TFLite for inference.
+- Released models in PyTorch, Tensorflow and TFLite.
 - Tools to curate Text2Speech datasets under```dataset_analysis```.
 - Demo server for model testing.
 - Notebooks for extensive model benchmarking.
 - Modular (but not too much) code base enabling easy testing for new ideas.
 
-## Requirements and Installation
+## Main Requirements and Installation
 Highly recommended to use [miniconda](https://conda.io/miniconda.html) for easier installation.
 * python>=3.6
-* pytorch>=0.4.1
+* pytorch>=1.4.1
+* tensorflow>=2.2
 * librosa
 * tensorboard
 * tensorboardX
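For context on the miniconda recommendation above: it amounts to creating an isolated environment before installing the listed packages, e.g. `conda create -n tts python=3.6` followed by `conda activate tts`. The environment name `tts` is an arbitrary choice, not something this commit prescribes.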
@@ -107,26 +113,14 @@ Audio examples: [soundcloud](https://soundcloud.com/user-565970875/pocket-articl
 
 <img src="images/example_model_output.png?raw=true" alt="example_output" width="400"/>
 
-## Runtime
-The most time-consuming part is the vocoder algorithm (Griffin-Lim) which runs on CPU. By setting its number of iterations lower, you might have faster execution with a small loss of quality. Some of the experimental values are below.
-
-Sentence: "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."
-
-Audio length is approximately 6 secs.
-
-| Time (secs) | System | # GL iters | Model
-| ---- |:-------|:-----------| ---- |
-|2.00|GTX1080Ti|30|Tacotron|
-|3.01|GTX1080Ti|60|Tacotron|
-|3.57|CPU|60|Tacotron|
-|5.27|GTX1080Ti|60|Tacotron2|
-|6.50|CPU|60|Tacotron2|
+## [Mozilla TTS Tutorials and Notebooks](https://github.com/mozilla/TTS/wiki/TTS-Notebooks-and-Tutorials)
 
 
 ## Datasets and Data-Loading
-TTS provides a generic dataloader easy to use for new datasets. You need to write an preprocessor function to integrate your own dataset.Check ```datasets/preprocess.py``` to see some examples. After the function, you need to set ```dataset``` field in ```config.json```. Do not forget other data related fields too.
+TTS provides a generic dataloader that is easy to use with your custom dataset.
+You just need to write a simple function to format the dataset. Check ```datasets/preprocess.py``` to see some examples.
+After that, you need to set ```dataset``` fields in ```config.json```.
 
-Some of the open-sourced datasets that we successfully applied TTS, are linked below.
+Some of the public datasets on which we successfully applied TTS:
 
 - [LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
 - [Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
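For background on the rewritten paragraph above: a formatter of the kind `datasets/preprocess.py` collects is a small function returning a list of [text, wav_path, speaker_name] items, mirroring the existing preprocessors in that module. The sketch below is illustrative only; the function name, the `<wav_id>|<text>` metadata layout, and the speaker label are assumptions, not part of this commit.

    import os

    def my_dataset(root_path, meta_file):
        """Hypothetical formatter: parse '<wav_id>|<text>' rows into loader items."""
        items = []
        speaker_name = 'my_speaker'  # assumed single-speaker corpus
        with open(os.path.join(root_path, meta_file), 'r', encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue  # skip empty metadata rows
                cols = line.strip().split('|')
                wav_file = os.path.join(root_path, 'wavs', cols[0] + '.wav')
                items.append([cols[1], wav_file, speaker_name])
        return items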
@@ -164,8 +158,6 @@ In case of any error or intercepted execution, if there is no checkpoint yet und
 
 You can also enjoy Tensorboard, if you point Tensorboard argument```--logdir``` to the experiment folder.
 
-## [Testing and Examples](https://github.com/mozilla/TTS/wiki/Examples-using-TTS)
-
 ## Contribution guidelines
 This repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details, please read the [Mozilla Community Participation Guidelines.](https://www.mozilla.org/about/governance/policies/participation/)
 
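(Concretely, the Tensorboard context line above means running `tensorboard --logdir <experiment_folder>` against the output folder created by the training run.)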

requirements.txt
@@ -15,7 +15,10 @@ tqdm
 inflect
 pysbd
 bokeh==1.4.0
+pysbd
 soundfile
 nose==1.3.7
 cardboardlint==1.3.0
 pylint==2.5.3
+fuzzywuzzy
+gdown

@@ -1,5 +1,4 @@
 import io
-import re
 import sys
 import time
 

setup.py | 3
@@ -93,11 +93,14 @@ requirements = {
         "inflect",
         "pysbd",
         "bokeh==1.4.0",
+        "pysbd",
         "soundfile",
         "phonemizer>=2.2.0",
         "nose==1.3.7",
         "cardboardlint==1.3.0",
         "pylint==2.5.3",
+        'fuzzywuzzy',
+        'gdown'
     ],
     'pip_install':[
         'tensorflow>=2.2.0',

@@ -35,5 +35,3 @@ model.decoder.set_max_decoder_steps(1000)
 
 # create tflite model
 tflite_model = convert_tacotron2_to_tflite(model, output_path=args.output_path)
-
-print(f'Tflite Model size is {len(tflite_model) / (1024.0 * 1024.0)} MBs.')

@@ -16,6 +16,7 @@ def convert_tacotron2_to_tflite(model,
         tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS
     ]
     tflite_model = converter.convert()
+    print(f'Tflite Model size is {len(tflite_model) / (1024.0 * 1024.0)} MBs.')
     if output_path is not None:
         # same model binary if outputpath is provided
         with open(output_path, 'wb') as f:

@@ -0,0 +1,33 @@
+# Convert Tensorflow MelGAN model to TF-Lite binary
+
+import argparse
+
+from TTS.utils.io import load_config
+from TTS.vocoder.tf.utils.generic_utils import setup_generator
+from TTS.vocoder.tf.utils.io import load_checkpoint
+from TTS.vocoder.tf.utils.tflite import convert_melgan_to_tflite
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument('--tf_model',
+                    type=str,
+                    help='Path to tf model checkpoint to be converted to TFLite.')
+parser.add_argument('--config_path',
+                    type=str,
+                    help='Path to config file of tf model.')
+parser.add_argument('--output_path',
+                    type=str,
+                    help='Path to tflite output binary.')
+args = parser.parse_args()
+
+# Set constants
+CONFIG = load_config(args.config_path)
+
+# load the model
+model = setup_generator(CONFIG)
+model.build_inference()
+model = load_checkpoint(model, args.tf_model)
+
+# create tflite model
+tflite_model = convert_melgan_to_tflite(model, output_path=args.output_path)
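In effect this new script is the vocoder counterpart of the Tacotron2 converter above: it loads a TF MelGAN checkpoint via `load_checkpoint` and hands it to `convert_melgan_to_tflite`. A typical invocation would be along the lines of `python convert_melgan_tflite.py --tf_model checkpoint.pkl --config_path config.json --output_path melgan.tflite`, where the script filename and the three paths are placeholders rather than names taken from this commit.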
@@ -51,7 +51,7 @@ class PQMF(tf.keras.layers.Layer):
 
     def synthesis(self, x):
         """
-        x : B x 1 x T
+        x : B x D x T
         """
         x = tf.transpose(x, perm=[0, 2, 1])
         x = tf.nn.conv1d_transpose(

@@ -109,3 +109,20 @@ class MelganGenerator(tf.keras.models.Model):
     def build_inference(self):
         x = tf.random.uniform((1, self.in_channels, 4), dtype=tf.float32)
         self(x, training=False)
+
+    @tf.function(
+        experimental_relax_shapes=True,
+        input_signature=[
+            tf.TensorSpec([1, None, None], dtype=tf.float32),
+        ],)
+    def inference_tflite(self, c):
+        c = tf.transpose(c, perm=[0, 2, 1])
+        c = tf.expand_dims(c, 2)
+        # FIXME: TF had no replicate padding as in Torch
+        # c = tf.pad(c, [[0, 0], [self.inference_padding, self.inference_padding], [0, 0], [0, 0]], "REFLECT")
+        o = c
+        for layer in self.model_layers:
+            o = layer(o)
+        # o = self.model_layers(c)
+        o = tf.transpose(o, perm=[0, 3, 2, 1])
+        return o[:, :, 0, :]

@@ -30,11 +30,6 @@ class MultibandMelganGenerator(MelganGenerator):
     def pqmf_synthesis(self, x):
         return self.pqmf_layer.synthesis(x)
 
-    # def call(self, c, training=False):
-    #     if training:
-    #         raise NotImplementedError()
-    #     return self.inference(c)
-
     def inference(self, c):
         c = tf.transpose(c, perm=[0, 2, 1])
         c = tf.expand_dims(c, 2)

@@ -46,3 +41,20 @@ class MultibandMelganGenerator(MelganGenerator):
         o = tf.transpose(o, perm=[0, 3, 2, 1])
         o = self.pqmf_layer.synthesis(o[:, :, 0, :])
         return o
+
+    @tf.function(
+        experimental_relax_shapes=True,
+        input_signature=[
+            tf.TensorSpec([1, 80, None], dtype=tf.float32),
+        ],)
+    def inference_tflite(self, c):
+        c = tf.transpose(c, perm=[0, 2, 1])
+        c = tf.expand_dims(c, 2)
+        # FIXME: TF had no replicate padding as in Torch
+        # c = tf.pad(c, [[0, 0], [self.inference_padding, self.inference_padding], [0, 0], [0, 0]], "REFLECT")
+        o = c
+        for layer in self.model_layers:
+            o = layer(o)
+        o = tf.transpose(o, perm=[0, 3, 2, 1])
+        o = self.pqmf_layer.synthesis(o[:, :, 0, :])
+        return o
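The `input_signature` in these new methods pins the TFLite entry point to a batch of one mel spectrogram with a free time axis, which is what lets a single concrete function be exported. A minimal eager smoke test, assuming `model` is an already-built `MultibandMelganGenerator` (the 100-frame length is arbitrary):

    import numpy as np
    import tensorflow as tf

    # dummy input matching TensorSpec([1, 80, None]): batch 1, 80 mel bands, 100 frames
    c = tf.convert_to_tensor(np.random.rand(1, 80, 100).astype(np.float32))
    waveform = model.inference_tflite(c)  # runs the same graph that will be exported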

TTS/vocoder/tf/utils/tflite.py

@@ -0,0 +1,31 @@
+import tensorflow as tf
+
+
+def convert_melgan_to_tflite(model,
+                             output_path=None,
+                             experimental_converter=True):
+    """Convert Tensorflow MelGAN model to TFLite. Save a binary file if output_path is
+    provided, else return TFLite model."""
+
+    concrete_function = model.inference_tflite.get_concrete_function()
+    converter = tf.lite.TFLiteConverter.from_concrete_functions(
+        [concrete_function])
+    converter.experimental_new_converter = experimental_converter
+    converter.optimizations = []
+    converter.target_spec.supported_ops = [
+        tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS
+    ]
+    tflite_model = converter.convert()
+    print(f'Tflite Model size is {len(tflite_model) / (1024.0 * 1024.0)} MBs.')
+    if output_path is not None:
+        # save model binary if output_path is provided
+        with open(output_path, 'wb') as f:
+            f.write(tflite_model)
+        return None
+    return tflite_model
+
+
+def load_tflite_model(tflite_path):
+    tflite_model = tf.lite.Interpreter(model_path=tflite_path)
+    tflite_model.allocate_tensors()
+    return tflite_model
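Since `load_tflite_model` returns a stock `tf.lite.Interpreter` (with tensors already allocated), running the converted vocoder uses the standard interpreter API. The sketch below constructs the interpreter directly because the dynamic time axis needs a `resize_tensor_input` before allocation; the `melgan.tflite` path and the random input are placeholders, not artifacts of this commit.

    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path='melgan.tflite')
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    mel = np.random.rand(1, 80, 100).astype(np.float32)  # batch 1, 80 bands, 100 frames
    interpreter.resize_tensor_input(input_details[0]['index'], mel.shape)
    interpreter.allocate_tensors()
    interpreter.set_tensor(input_details[0]['index'], mel)
    interpreter.invoke()
    waveform = interpreter.get_tensor(output_details[0]['index'])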