
Mozilla TTS: Real-Time Speech Synthesis on CPU with TFLite

These models were converted from the released PyTorch models using the TF utilities provided in Mozilla TTS.

Notebook Details

These TFLite models target TF 2.3.0rc0; for other TF versions you may need to regenerate them.

TFLite optimizations degrade TTS model quality, and for the same reason we do not apply any optimization to the vocoder model. If you want to preserve quality, consider regenerating the TFLite models accordingly.

Models optimized with TFLite can be slow on a regular desktop CPU, since TFLite is tuned for lower-end (mobile and embedded) systems.


Model Details

We use Tacotron2 and MultiBand-MelGAN models trained on the LJSpeech dataset.

Tacotron2 was trained with Double Decoder Consistency (DDC) for only 130K steps (about 3 days on a single GPU).

MultiBand-MelGAN was trained for 1.45M steps with real spectrograms.

Note that both models' performance can be improved with further training.
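Roughly speaking, DDC trains two decoders with different reduction factors and adds a loss term that pushes their attention alignments to agree. A schematic NumPy sketch of such a consistency term (illustrative only, not the paper's exact formulation; `ddc_consistency_loss` is a hypothetical helper):

```python
import numpy as np

def ddc_consistency_loss(align_coarse, align_fine):
    """Mean squared difference between the two decoders' attention maps,
    assumed already interpolated to the same (T_out, T_in) shape."""
    return float(np.mean((align_coarse - align_fine) ** 2))

# Toy 2x2 attention maps: a sharp alignment vs. a slightly blurred one.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.9, 0.1], [0.1, 0.9]])
loss = ddc_consistency_loss(a, b)
```

Minimizing a term like this encourages the easier-to-align coarse decoder to guide the fine decoder's attention.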

Download TF Models and configs

In [ ]:
!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json
In [ ]:
!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy
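The scale_stats.npy file holds dataset-level spectrogram statistics (per-bin mean and standard deviation) that Mozilla TTS uses for mean-variance normalization; the exact keys stored depend on the version. A minimal NumPy sketch of that normalization with made-up stand-in values, not the real stats file:

```python
import numpy as np

# Illustrative stand-ins for the per-mel-bin statistics in scale_stats.npy.
mel_mean = np.array([0.5, -0.2, 0.1])
mel_std = np.array([1.0, 2.0, 0.5])

def normalize(mel, mean, std):
    """Mean-variance normalize a (frames, bins) mel spectrogram."""
    return (mel - mean) / std

def denormalize(mel_norm, mean, std):
    """Invert the normalization, e.g. before vocoding raw spectrograms."""
    return mel_norm * std + mean

mel = np.array([[0.5, 1.8, 0.35],
                [1.5, -2.2, -0.15]])
mel_n = normalize(mel, mel_mean, mel_std)
```

The round trip `denormalize(normalize(x))` recovers the original spectrogram, which is why both the model and the vocoder must be paired with the same stats file.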

Setup Libraries

In [ ]:
# needed for character-to-phoneme conversion
!sudo apt-get install -y espeak
In [ ]:
!git clone https://github.com/mozilla/TTS
In [ ]:
%cd TTS
!git checkout c7296b3
!pip install -r requirements.txt
!python setup.py install
!pip install tensorflow==2.3.0rc0
%cd ..

Define TTS function

In [ ]:
def run_vocoder(mel_spec):
    vocoder_inputs = mel_spec[None, :, :]
    # get input and output details
    input_details = vocoder_model.get_input_details()
    # resize the input tensor to match this spectrogram's length
    vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)
    vocoder_model.allocate_tensors()
    vocoder_model.set_tensor(input_details[0]['index'], vocoder_inputs)
    # run the model
    vocoder_model.invoke()
    # collect the output waveform
    output_details = vocoder_model.get_output_details()
    waveform = vocoder_model.get_tensor(output_details[0]['index'])
    return waveform


def tts(model, text, CONFIG, p):
    t_1 = time.time()
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,
        backend='tflite')
    waveform = run_vocoder(mel_postnet_spec.T)
    waveform = waveform[0, 0]
    # measure elapsed time once so all metrics are consistent
    run_time = time.time() - t_1
    rtf = run_time / (len(waveform) / ap.sample_rate)
    tps = run_time / len(waveform)
    print(waveform.shape)
    print(" > Run-time: {}".format(run_time))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per sample: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform
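The real-time factor (RTF) printed by tts is the processing time divided by the duration of the generated audio; values below 1.0 mean faster-than-real-time synthesis. A minimal sketch with hypothetical numbers:

```python
def real_time_factor(processing_time_s, num_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_duration_s = num_samples / sample_rate
    return processing_time_s / audio_duration_s

# Hypothetical run: 0.5 s to synthesize 2 s of audio at 22050 Hz.
rtf = real_time_factor(0.5, 44100, 22050)
print(rtf)  # 0.25 -> four times faster than real time
```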

Load TF Models

In [ ]:
import os
import torch
import time
import IPython

from TTS.tf.utils.tflite import load_tflite_model
from TTS.tf.utils.io import load_checkpoint
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.synthesis import synthesis
In [ ]:
# runtime settings
use_cuda = False
In [ ]:
# model paths
TTS_MODEL = "tts_model.tflite"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.tflite"
VOCODER_CONFIG = "config_vocoder.json"
In [ ]:
# load configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)
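load_config parses the JSON config into an object with attribute-style access (e.g. CONFIG.audio, used below). A rough stand-in using only the standard library — load_config_sketch and AttrDict here are hypothetical; the real loader also handles extras such as commented JSON:

```python
import json

class AttrDict(dict):
    """Dictionary whose keys are also readable as attributes."""
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as e:
            raise AttributeError(key) from e

def load_config_sketch(json_text):
    """Parse config JSON into nested attribute-accessible objects."""
    # object_hook wraps every decoded JSON object in an AttrDict.
    return json.loads(json_text, object_hook=AttrDict)

cfg = load_config_sketch('{"audio": {"sample_rate": 22050}}')
print(cfg.audio.sample_rate)  # 22050
```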
In [ ]:
# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio)         
In [ ]:
# LOAD TTS MODEL
# multi speaker 
speaker_id = None
speakers = []

# load the models
model = load_tflite_model(TTS_MODEL)
vocoder_model = load_tflite_model(VOCODER_MODEL)

Run Inference

In [ ]:
sentence = "Bill got in the habit of asking himself “Is that thought true?” and if he wasn't absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)
</html>