
Mozilla TTS: Real-Time Speech Synthesis on CPU with TFLite

These models were converted from the released PyTorch models using the TF utilities provided in Mozilla TTS.

Notebook Details

These TFLite models target TF 2.3.0rc0; for other TF versions you may need to regenerate them.

TFLite optimizations degrade TTS model quality, and for the same reason we do not apply any optimization to the vocoder model. If you want to preserve quality, consider regenerating the TFLite models accordingly.

Models optimized with TFLite can be slow on a regular desktop CPU, since TFLite is tuned for lower-end (mobile and embedded) systems.


Model Details

We use Tacotron2 and MultiBand-MelGAN models trained on the LJSpeech dataset.

Tacotron2 was trained with Double Decoder Consistency (DDC) for only 130K steps (about 3 days on a single GPU).

MultiBand-MelGAN was trained for 1.45M steps with real spectrograms.

Note that both models' performance can be improved with further training.
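Roughly speaking, DDC trains two decoders with different reduction factors and adds a loss term that pushes their attention alignments to agree. A schematic NumPy sketch of such a consistency term (illustrative only, not the paper's exact formulation; `ddc_consistency_loss` is a hypothetical helper):

```python
import numpy as np

def ddc_consistency_loss(align_coarse, align_fine):
    """Mean squared difference between the two decoders' attention maps,
    assumed already interpolated to the same (T_out, T_in) shape."""
    return float(np.mean((align_coarse - align_fine) ** 2))

# Toy 2x2 attention maps: a sharp alignment vs. a slightly blurred one.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.9, 0.1], [0.1, 0.9]])
loss = ddc_consistency_loss(a, b)
```

Minimizing a term like this encourages the easier-to-align coarse decoder to guide the fine decoder's attention.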

Download TF Models and configs

In [ ]:
!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json
In [ ]:
!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy
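The scale_stats.npy file holds dataset-level spectrogram statistics (per-bin mean and standard deviation) that Mozilla TTS uses for mean-variance normalization; the exact keys stored depend on the version. A minimal NumPy sketch of that normalization with made-up stand-in values, not the real stats file:

```python
import numpy as np

# Illustrative stand-ins for the per-mel-bin statistics in scale_stats.npy.
mel_mean = np.array([0.5, -0.2, 0.1])
mel_std = np.array([1.0, 2.0, 0.5])

def normalize(mel, mean, std):
    """Mean-variance normalize a (frames, bins) mel spectrogram."""
    return (mel - mean) / std

def denormalize(mel_norm, mean, std):
    """Invert the normalization, e.g. before vocoding raw spectrograms."""
    return mel_norm * std + mean

mel = np.array([[0.5, 1.8, 0.35],
                [1.5, -2.2, -0.15]])
mel_n = normalize(mel, mel_mean, mel_std)
```

The round trip `denormalize(normalize(x))` recovers the original spectrogram, which is why both the model and the vocoder must be paired with the same stats file.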

Setup Libraries

In [ ]:
# needed for character-to-phoneme conversion
!sudo apt-get install -y espeak
In [ ]:
!git clone https://github.com/mozilla/TTS
In [ ]:
%cd TTS
!git checkout c7296b3
!pip install -r requirements.txt
!python setup.py install
!pip install tensorflow==2.3.0rc0
%cd ..

Define TTS function

In [ ]:
def run_vocoder(mel_spec):
    vocoder_inputs = mel_spec[None, :, :]
    # get input and output details
    input_details = vocoder_model.get_input_details()
    # resize the input tensor to match this spectrogram's length
    vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)
    vocoder_model.allocate_tensors()
    vocoder_model.set_tensor(input_details[0]['index'], vocoder_inputs)
    # run the model
    vocoder_model.invoke()
    # collect the output waveform
    output_details = vocoder_model.get_output_details()
    waveform = vocoder_model.get_tensor(output_details[0]['index'])
    return waveform


def tts(model, text, CONFIG, p):
    t_1 = time.time()
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,
        backend='tflite')
    waveform = run_vocoder(mel_postnet_spec.T)
    waveform = waveform[0, 0]
    # measure elapsed time once so all metrics are consistent
    run_time = time.time() - t_1
    rtf = run_time / (len(waveform) / ap.sample_rate)
    tps = run_time / len(waveform)
    print(waveform.shape)
    print(" > Run-time: {}".format(run_time))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per sample: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform
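The real-time factor (RTF) printed by tts is the processing time divided by the duration of the generated audio; values below 1.0 mean faster-than-real-time synthesis. A minimal sketch with hypothetical numbers:

```python
def real_time_factor(processing_time_s, num_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_duration_s = num_samples / sample_rate
    return processing_time_s / audio_duration_s

# Hypothetical run: 0.5 s to synthesize 2 s of audio at 22050 Hz.
rtf = real_time_factor(0.5, 44100, 22050)
print(rtf)  # 0.25 -> four times faster than real time
```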

Load TF Models

In [ ]:
import os
import torch
import time
import IPython

from TTS.tf.utils.tflite import load_tflite_model
from TTS.tf.utils.io import load_checkpoint
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.synthesis import synthesis
In [ ]:
# runtime settings
use_cuda = False
In [ ]:
# model paths
TTS_MODEL = "tts_model.tflite"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.tflite"
VOCODER_CONFIG = "config_vocoder.json"
In [ ]:
# load configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)
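load_config parses the JSON config into an object with attribute-style access (e.g. CONFIG.audio, used below). A rough stand-in using only the standard library — load_config_sketch and AttrDict here are hypothetical; the real loader also handles extras such as commented JSON:

```python
import json

class AttrDict(dict):
    """Dictionary whose keys are also readable as attributes."""
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as e:
            raise AttributeError(key) from e

def load_config_sketch(json_text):
    """Parse config JSON into nested attribute-accessible objects."""
    # object_hook wraps every decoded JSON object in an AttrDict.
    return json.loads(json_text, object_hook=AttrDict)

cfg = load_config_sketch('{"audio": {"sample_rate": 22050}}')
print(cfg.audio.sample_rate)  # 22050
```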
In [ ]:
# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio)         
In [ ]:
# LOAD TTS MODEL
# multi speaker 
speaker_id = None
speakers = []

# load the models
model = load_tflite_model(TTS_MODEL)
vocoder_model = load_tflite_model(VOCODER_MODEL)

Run Inference

In [ ]:
sentence = "Bill got in the habit of asking himself “Is that thought true?” and if he wasn't absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)
</html>