Update generic_utils.py (#3561)

Handles cases where `git branch` produces no output or invalid output. Currently, it crashes with `StopIteration`.
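
A minimal sketch of the kind of guard described above, assuming the branch name is read from the line of `git branch` output that starts with `*`. The function name `current_branch` and the `"unknown"` fallback are hypothetical and are not taken from `generic_utils.py`; the point is only that `next()` with a default avoids the `StopIteration`.

```python
def current_branch(git_branch_output: str, default: str = "unknown") -> str:
    """Return the branch marked with '*' in `git branch` output, or `default`.

    Hypothetical helper for illustration; not the code from the patch.
    """
    # Lazily pick out lines that mark the checked-out branch.
    starred = (line for line in git_branch_output.splitlines() if line.startswith("*"))
    # next() with a default returns None instead of raising StopIteration
    # when the output is empty or contains no starred line.
    line = next(starred, None)
    if line is None:
        return default
    # Drop the leading '*' and surrounding whitespace; fall back if nothing is left.
    return line[1:].strip() or default
```

Empty or malformed output then yields the fallback value rather than an exception, so callers can log the branch name without a `try`/`except`.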
Bug fix in MP3 and FLAC compute length on TTSDataset (#3092)
2024-02-10 11:20:58 -03:00 · 2023-12-27 13:23:43 -03:00 · 2023-12-14 18:00:30 +01:00 · 2023-12-14 14:26:31 +01:00 · 2023-12-13 08:54:57 +01:00 · 2023-12-13 08:53:43 +01:00
110 changed files with 1342 additions and 1061 deletions
--- a/.github/workflows/api_tests.yml
+++ b/.github/workflows/api_tests.yml
@ -1,53 +0,0 @@
-name: api_tests
-
-on:
-  push:
-    branches:
-      - main
-jobs:
-  check_skip:
-    runs-on: ubuntu-latest
-    if: "! contains(github.event.head_commit.message, '[ci skip]')"
-    steps:
-      - run: echo "${{ github.event.head_commit.message }}"
-
-  test:
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python-version: [3.9, "3.10", "3.11"]
-        experimental: [false]
-    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
-        with:
-          python-version: ${{ matrix.python-version }}
-          architecture: x64
-          cache: 'pip'
-          cache-dependency-path: 'requirements*'
-      - name: check OS
-        run: cat /etc/os-release
-      - name: set ENV
-        run: |
-          export TRAINER_TELEMETRY=0
-      - name: Install dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y --no-install-recommends git make gcc
-          sudo apt-get install espeak-ng
-          make system-deps
-      - name: Install/upgrade Python setup deps
-        run: python3 -m pip install --upgrade pip setuptools wheel
-      - name: Replace scarf urls
-        run: |
-          sed -i 's/https:\/\/coqui.gateway.scarf.sh\//https:\/\/github.com\/coqui-ai\/TTS\/releases\/download\//g' TTS/.models.json
-      - name: Install TTS
-        run: |
-          python3 -m pip install .[all]
-          python3 setup.py egg_info
-      - name: Unit tests
-        run: make api_tests
-        env:
-          COQUI_STUDIO_TOKEN: ${{ secrets.COQUI_STUDIO_TOKEN }}
--- a/.github/workflows/zoo_tests_tortoise.yml
+++ b/.github/workflows/zoo_tests_tortoise.yml
@ -1,52 +0,0 @@
-name: zoo-tests-tortoise
-
-on:
-  push:
-    branches:
-      - main
-  pull_request:
-    types: [opened, synchronize, reopened]
-jobs:
-  check_skip:
-    runs-on: ubuntu-latest
-    if: "! contains(github.event.head_commit.message, '[ci skip]')"
-    steps:
-      - run: echo "${{ github.event.head_commit.message }}"
-
-  test:
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python-version: [3.9, "3.10", "3.11"]
-        experimental: [false]
-    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
-        with:
-          python-version: ${{ matrix.python-version }}
-          architecture: x64
-          cache: 'pip'
-          cache-dependency-path: 'requirements*'
-      - name: check OS
-        run: cat /etc/os-release
-      - name: set ENV
-        run: export TRAINER_TELEMETRY=0
-      - name: Install dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y git make gcc
-          sudo apt-get install espeak espeak-ng
-          make system-deps
-      - name: Install/upgrade Python setup deps
-        run: python3 -m pip install --upgrade pip setuptools wheel
-      - name: Replace scarf urls
-        run: |
-          sed -i 's/https:\/\/coqui.gateway.scarf.sh\//https:\/\/github.com\/coqui-ai\/TTS\/releases\/download\//g' TTS/.models.json
-      - name: Install TTS
-        run: |
-          python3 -m pip install .[all]
-          python3 setup.py egg_info
-      - name: Unit tests
-        run: nose2 -F -v -B --with-coverage --coverage TTS tests.zoo_tests.test_models.test_tortoise
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -48,7 +48,7 @@ The following steps are tested on an Ubuntu system.

 1. Fork 🐸TTS[https://github.com/coqui-ai/TTS] by clicking the fork button at the top right corner of the project page.

-2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```.
+2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```.

    ```bash
    $ git clone git@github.com:<your Github name>/TTS.git
--- a/Makefile
+++ b/Makefile
@ -35,9 +35,6 @@ test_zoo:	## run zoo tests.
 inference_tests: ## run inference tests.
 	nose2 -F -v -B --with-coverage --coverage TTS tests.inference_tests

-api_tests: ## run api tests.
-	nose2 -F -v -B --with-coverage --coverage TTS tests.api_tests
-
 data_tests: ## run data tests.
 	nose2 -F -v -B --with-coverage --coverage TTS tests.data_tests

--- a/README.md
+++ b/README.md
@ -7,11 +7,6 @@
 - 📣 [🐶Bark](https://github.com/suno-ai/bark) is now available for inference with unconstrained voice cloning. [Docs](https://tts.readthedocs.io/en/dev/models/bark.html)
 - 📣 You can use [~1100 Fairseq models](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS.
 - 📣 🐸TTS now supports 🐢Tortoise with faster inference. [Docs](https://tts.readthedocs.io/en/dev/models/tortoise.html)
- 📣 **Coqui Studio API** is landed on 🐸TTS. - [Example](https://github.com/coqui-ai/TTS/blob/dev/README.md#-python-api)
- 📣 [**Coqui Studio API**](https://docs.coqui.ai/docs) is live.
- 📣 Voice generation with prompts - **Prompt to Voice** - is live on [**Coqui Studio**](https://app.coqui.ai/auth/signin)!! - [Blog Post](https://coqui.ai/blog/tts/prompt-to-voice)
- 📣 Voice generation with fusion - **Voice fusion** - is live on [**Coqui Studio**](https://app.coqui.ai/auth/signin).
- 📣 Voice cloning is live on [**Coqui Studio**](https://app.coqui.ai/auth/signin).

 <div align="center">
 <img src="https://static.scarf.sh/a.png?x-pxid=cf317fe7-2188-4721-bc01-124bb5d5dbb2" />
@ -72,7 +67,7 @@ Please use our dedicated channels for questions and discussion. Help is much mor
 | Type                            | Links                               |
 | ------------------------------- | --------------------------------------- |
 | 💼 **Documentation**              | [ReadTheDocs](https://tts.readthedocs.io/en/latest/)
-| 💾 **Installation**               | [TTS/README.md](https://github.com/coqui-ai/TTS/tree/dev#install-tts)|
+| 💾 **Installation**               | [TTS/README.md](https://github.com/coqui-ai/TTS/tree/dev#installation)|
 | 👩‍💻 **Contributing**               | [CONTRIBUTING.md](https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md)|
 | 📌 **Road Map**                   | [Main Development Plans](https://github.com/coqui-ai/TTS/issues/378)
 | 🚀 **Released Models**            | [TTS Releases](https://github.com/coqui-ai/TTS/releases) and [Experimental Models](https://github.com/coqui-ai/TTS/wiki/Experimental-Released-Models)|
@ -253,29 +248,6 @@ tts.tts_with_vc_to_file(
 )
 ```

-#### Example using [🐸Coqui Studio](https://coqui.ai) voices.
-You access all of your cloned voices and built-in speakers in [🐸Coqui Studio](https://coqui.ai).
-To do this, you'll need an API token, which you can obtain from the [account page](https://coqui.ai/account).
-After obtaining the API token, you'll need to configure the COQUI_STUDIO_TOKEN environment variable.
-
-Once you have a valid API token in place, the studio speakers will be displayed as distinct models within the list.
-These models will follow the naming convention `coqui_studio/en/<studio_speaker_name>/coqui_studio`
-
-```python
-# XTTS model
-models = TTS(cs_api_model="XTTS").list_models()
-# Init TTS with the target studio speaker
-tts = TTS(model_name="coqui_studio/en/Torcull Diarmuid/coqui_studio", progress_bar=False)
-# Run TTS
-tts.tts_to_file(text="This is a test.", language="en", file_path=OUTPUT_PATH)
-
-# V1 model
-models = TTS(cs_api_model="V1").list_models()
-# Run TTS with emotion and speed control
-# Emotion control only works with V1 model
-tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH, emotion="Happy", speed=1.5)
-```
-
 #### Example text to speech using **Fairseq models in ~1100 languages** 🤯.
 For Fairseq models, use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
 You can find the language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html)
@ -351,12 +323,6 @@ If you don't specify any models, then it uses LJSpeech based English model.
  $ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
  ```

- Run TTS and define speed factor to use for 🐸Coqui Studio models, between 0.0 and 2.0:
-
-  ```
-  $ tts --text "Text for TTS" --model_name "coqui_studio/<language>/<dataset>/<model_name>" --speed 1.2 --out_path output/path/speech.wav
-  ```
-
 - Run a TTS model with its default vocoder model:

  ```
--- a/TTS/.models.json
+++ b/TTS/.models.json
@ -3,12 +3,13 @@
        "multilingual": {
            "multi-dataset": {
                "xtts_v2": {
-                    "description": "XTTS-v2.0.2 by Coqui with 16 languages.",
+                    "description": "XTTS-v2.0.3 by Coqui with 17 languages.",
                    "hf_url": [
                        "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth",
                        "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/config.json",
                        "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json",
-                        "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/hash.md5"
+                        "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/hash.md5",
+                        "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/speakers_xtts.pth"
                    ],
                    "model_hash": "10f92b55c512af7a8d39d650547a15a7",
                    "default_vocoder": null,
@ -45,7 +46,7 @@
                    "hf_url": [
                        "https://coqui.gateway.scarf.sh/hf/bark/coarse_2.pt",
                        "https://coqui.gateway.scarf.sh/hf/bark/fine_2.pt",
-                        "https://app.coqui.ai/tts_model/text_2.pt",
+                        "https://coqui.gateway.scarf.sh/hf/text_2.pt",
                        "https://coqui.gateway.scarf.sh/hf/bark/config.json",
                        "https://coqui.gateway.scarf.sh/hf/bark/hubert.pt",
                        "https://coqui.gateway.scarf.sh/hf/bark/tokenizer.pth"
@ -270,7 +271,7 @@
                "tortoise-v2": {
                    "description": "Tortoise tts model https://github.com/neonbjb/tortoise-tts",
                    "github_rls_url": [
-                        "https://app.coqui.ai/tts_model/autoregressive.pth",
+                        "https://coqui.gateway.scarf.sh/v0.14.1_models/autoregressive.pth",
                        "https://coqui.gateway.scarf.sh/v0.14.1_models/clvp2.pth",
                        "https://coqui.gateway.scarf.sh/v0.14.1_models/cvvp.pth",
                        "https://coqui.gateway.scarf.sh/v0.14.1_models/diffusion_decoder.pth",
--- a/TTS/VERSION
+++ b/TTS/VERSION
@ -1 +1 @@
-0.21.0
+0.22.0
--- a/TTS/api.py
+++ b/TTS/api.py
@ -6,12 +6,12 @@ from typing import Union
 import numpy as np
 from torch import nn

-from TTS.cs_api import CS_API
 from TTS.utils.audio.numpy_transforms import save_wav
 from TTS.utils.manage import ModelManager
 from TTS.utils.synthesizer import Synthesizer
 from TTS.config import load_config

+
 class TTS(nn.Module):
    """TODO: Add voice conversion and Capacitron support."""

@ -23,7 +23,6 @@ class TTS(nn.Module):
        vocoder_path: str = None,
        vocoder_config_path: str = None,
        progress_bar: bool = True,
-        cs_api_model: str = "XTTS",
        gpu=False,
    ):
        """🐸TTS python interface that allows to load and use the released models.
@ -59,9 +58,6 @@ class TTS(nn.Module):
            vocoder_path (str, optional): Path to the vocoder checkpoint. Defaults to None.
            vocoder_config_path (str, optional): Path to the vocoder config. Defaults to None.
            progress_bar (bool, optional): Whether to pring a progress bar while downloading a model. Defaults to True.
-            cs_api_model (str, optional): Name of the model to use for the Coqui Studio API. Available models are
-                "XTTS", "V1". You can also use `TTS.cs_api.CS_API" for more control.
-                Defaults to "XTTS".
            gpu (bool, optional): Enable/disable GPU. Some models might be too slow on CPU. Defaults to False.
        """
        super().__init__()
@ -69,17 +65,17 @@ class TTS(nn.Module):
        self.config = load_config(config_path) if config_path else None
        self.synthesizer = None
        self.voice_converter = None
-        self.csapi = None
-        self.cs_api_model = cs_api_model
        self.model_name = ""
        if gpu:
            warnings.warn("`gpu` will be deprecated. Please use `tts.to(device)` instead.")

-        if model_name is not None:
-            if "tts_models" in model_name or "coqui_studio" in model_name:
+        if model_name is not None and len(model_name) > 0:
+            if "tts_models" in model_name:
                self.load_tts_model_by_name(model_name, gpu)
            elif "voice_conversion_models" in model_name:
                self.load_vc_model_by_name(model_name, gpu)
+            else:
+                self.load_model_by_name(model_name, gpu)

        if model_path:
            self.load_tts_model_by_path(
@ -96,17 +92,15 @@ class TTS(nn.Module):
            return self.synthesizer.tts_model.speaker_manager.num_speakers > 1
        return False

-    @property
-    def is_coqui_studio(self):
-        if self.model_name is None:
-            return False
-        return "coqui_studio" in self.model_name
-
    @property
    def is_multi_lingual(self):
        # Not sure what sets this to None, but applied a fix to prevent crashing.
-        if (isinstance(self.model_name, str) and "xtts" in self.model_name or
-                self.config and ("xtts" in self.config.model or len(self.config.languages) > 1)):
+        if (
+            isinstance(self.model_name, str)
+            and "xtts" in self.model_name
+            or self.config
+            and ("xtts" in self.config.model or len(self.config.languages) > 1)
+        ):
            return True
        if hasattr(self.synthesizer.tts_model, "language_manager") and self.synthesizer.tts_model.language_manager:
            return self.synthesizer.tts_model.language_manager.num_languages > 1
@ -129,14 +123,7 @@ class TTS(nn.Module):
        return Path(__file__).parent / ".models.json"

    def list_models(self):
-        try:
-            csapi = CS_API(model=self.cs_api_model)
-            models = csapi.list_speakers_as_tts_models()
-        except ValueError as e:
-            print(e)
-            models = []
-        manager = ModelManager(models_file=TTS.get_models_file_path(), progress_bar=False, verbose=False)
-        return manager.list_tts_models() + models
+        return ModelManager(models_file=TTS.get_models_file_path(), progress_bar=False, verbose=False)

    def download_model_by_name(self, model_name: str):
        model_path, config_path, model_item = self.manager.download_model(model_name)
@ -149,6 +136,15 @@ class TTS(nn.Module):
        vocoder_path, vocoder_config_path, _ = self.manager.download_model(model_item["default_vocoder"])
        return model_path, config_path, vocoder_path, vocoder_config_path, None

+    def load_model_by_name(self, model_name: str, gpu: bool = False):
+        """Load one of the 🐸TTS models by name.
+
+        Args:
+            model_name (str): Model name to load. You can list models by ```tts.models```.
+            gpu (bool, optional): Enable/disable GPU. Some models might be too slow on CPU. Defaults to False.
+        """
+        self.load_tts_model_by_name(model_name, gpu)
+
    def load_vc_model_by_name(self, model_name: str, gpu: bool = False):
        """Load one of the voice conversion models by name.

@ -170,30 +166,26 @@ class TTS(nn.Module):
        TODO: Add tests
        """
        self.synthesizer = None
-        self.csapi = None
        self.model_name = model_name

-        if "coqui_studio" in model_name:
-            self.csapi = CS_API()
-        else:
-            model_path, config_path, vocoder_path, vocoder_config_path, model_dir = self.download_model_by_name(
-                model_name
-            )
+        model_path, config_path, vocoder_path, vocoder_config_path, model_dir = self.download_model_by_name(
+            model_name
+        )

-            # init synthesizer
-            # None values are fetch from the model
-            self.synthesizer = Synthesizer(
-                tts_checkpoint=model_path,
-                tts_config_path=config_path,
-                tts_speakers_file=None,
-                tts_languages_file=None,
-                vocoder_checkpoint=vocoder_path,
-                vocoder_config=vocoder_config_path,
-                encoder_checkpoint=None,
-                encoder_config=None,
-                model_dir=model_dir,
-                use_cuda=gpu,
-            )
+        # init synthesizer
+        # None values are fetch from the model
+        self.synthesizer = Synthesizer(
+            tts_checkpoint=model_path,
+            tts_config_path=config_path,
+            tts_speakers_file=None,
+            tts_languages_file=None,
+            vocoder_checkpoint=vocoder_path,
+            vocoder_config=vocoder_config_path,
+            encoder_checkpoint=None,
+            encoder_config=None,
+            model_dir=model_dir,
+            use_cuda=gpu,
+        )

    def load_tts_model_by_path(
        self, model_path: str, config_path: str, vocoder_path: str = None, vocoder_config: str = None, gpu: bool = False
@ -230,77 +222,17 @@ class TTS(nn.Module):
        **kwargs,
    ) -> None:
        """Check if the arguments are valid for the model."""
-        if not self.is_coqui_studio:
-            # check for the coqui tts models
-            if self.is_multi_speaker and (speaker is None and speaker_wav is None):
-                raise ValueError("Model is multi-speaker but no `speaker` is provided.")
-            if self.is_multi_lingual and language is None:
-                raise ValueError("Model is multi-lingual but no `language` is provided.")
-            if not self.is_multi_speaker and speaker is not None and "voice_dir" not in kwargs:
-                raise ValueError("Model is not multi-speaker but `speaker` is provided.")
-            if not self.is_multi_lingual and language is not None:
-                raise ValueError("Model is not multi-lingual but `language` is provided.")
-            if not emotion is None and not speed is None:
-                raise ValueError("Emotion and speed can only be used with Coqui Studio models.")
-        else:
-            if emotion is None:
-                emotion = "Neutral"
-            if speed is None:
-                speed = 1.0
-            # check for the studio models
-            if speaker_wav is not None:
-                raise ValueError("Coqui Studio models do not support `speaker_wav` argument.")
-            if speaker is not None:
-                raise ValueError("Coqui Studio models do not support `speaker` argument.")
-            if language is not None and language != "en":
-                raise ValueError("Coqui Studio models currently support only `language=en` argument.")
-            if emotion not in ["Neutral", "Happy", "Sad", "Angry", "Dull"]:
-                raise ValueError(f"Emotion - `{emotion}` - must be one of `Neutral`, `Happy`, `Sad`, `Angry`, `Dull`.")
-
-    def tts_coqui_studio(
-        self,
-        text: str,
-        speaker_name: str = None,
-        language: str = None,
-        emotion: str = None,
-        speed: float = 1.0,
-        pipe_out=None,
-        file_path: str = None,
-    ) -> Union[np.ndarray, str]:
-        """Convert text to speech using Coqui Studio models. Use `CS_API` class if you are only interested in the API.
-
-        Args:
-            text (str):
-                Input text to synthesize.
-            speaker_name (str, optional):
-                Speaker name from Coqui Studio. Defaults to None.
-            language (str): Language of the text. If None, the default language of the speaker is used. Language is only
-                supported by `XTTS` model.
-            emotion (str, optional):
-                Emotion of the speaker. One of "Neutral", "Happy", "Sad", "Angry", "Dull". Emotions are only available
-                with "V1" model. Defaults to None.
-            speed (float, optional):
-                Speed of the speech. Defaults to 1.0.
-            pipe_out (BytesIO, optional):
-                Flag to stdout the generated TTS wav file for shell pipe.
-            file_path (str, optional):
-                Path to save the output file. When None it returns the `np.ndarray` of waveform. Defaults to None.
-
-        Returns:
-            Union[np.ndarray, str]: Waveform of the synthesized speech or path to the output file.
-        """
-        speaker_name = self.model_name.split("/")[2]
-        if file_path is not None:
-            return self.csapi.tts_to_file(
-                text=text,
-                speaker_name=speaker_name,
-                language=language,
-                speed=speed,
-                pipe_out=pipe_out,
-                emotion=emotion,
-                file_path=file_path,
-            )[0]
-        return self.csapi.tts(text=text, speaker_name=speaker_name, language=language, speed=speed, emotion=emotion)[0]
+        # check for the coqui tts models
+        if self.is_multi_speaker and (speaker is None and speaker_wav is None):
+            raise ValueError("Model is multi-speaker but no `speaker` is provided.")
+        if self.is_multi_lingual and language is None:
+            raise ValueError("Model is multi-lingual but no `language` is provided.")
+        if not self.is_multi_speaker and speaker is not None and "voice_dir" not in kwargs:
+            raise ValueError("Model is not multi-speaker but `speaker` is provided.")
+        if not self.is_multi_lingual and language is not None:
+            raise ValueError("Model is not multi-lingual but `language` is provided.")
+        if not emotion is None and not speed is None:
+            raise ValueError("Emotion and speed can only be used with Coqui Studio models, which are discontinued.")

    def tts(
        self,
@ -310,6 +242,7 @@ class TTS(nn.Module):
        speaker_wav: str = None,
        emotion: str = None,
        speed: float = None,
+        split_sentences: bool = True,
        **kwargs,
    ):
        """Convert text to speech.
@ -330,14 +263,16 @@ class TTS(nn.Module):
            speed (float, optional):
                Speed factor to use for 🐸Coqui Studio models, between 0 and 2.0. If None, Studio models use 1.0.
                Defaults to None.
+            split_sentences (bool, optional):
+                Split text into sentences, synthesize them separately, and concatenate the resulting audio.
+                Setting it to False uses more VRAM and may hit model-specific text length or VRAM limits. Only
+                applicable to the 🐸TTS models. Defaults to True.
+            kwargs (dict, optional):
+                Additional arguments for the model.
        """
        self._check_arguments(
            speaker=speaker, language=language, speaker_wav=speaker_wav, emotion=emotion, speed=speed, **kwargs
        )
-        if self.csapi is not None:
-            return self.tts_coqui_studio(
-                text=text, speaker_name=speaker, language=language, emotion=emotion, speed=speed
-            )
        wav = self.synthesizer.tts(
            text=text,
            speaker_name=speaker,
@ -347,6 +282,7 @@ class TTS(nn.Module):
            style_wav=None,
            style_text=None,
            reference_speaker_name=None,
+            split_sentences=split_sentences,
            **kwargs,
        )
        return wav
@ -361,6 +297,7 @@ class TTS(nn.Module):
        speed: float = 1.0,
        pipe_out=None,
        file_path: str = "output.wav",
+        split_sentences: bool = True,
        **kwargs,
    ):
        """Convert text to speech.
@ -385,22 +322,23 @@ class TTS(nn.Module):
                Flag to stdout the generated TTS wav file for shell pipe.
            file_path (str, optional):
                Output file path. Defaults to "output.wav".
+            split_sentences (bool, optional):
+                Split text into sentences, synthesize them separately, and concatenate the resulting audio.
+                Setting it to False uses more VRAM and may hit model-specific text length or VRAM limits. Only
+                applicable to the 🐸TTS models. Defaults to True.
            kwargs (dict, optional):
                Additional arguments for the model.
        """
        self._check_arguments(speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)

-        if self.csapi is not None:
-            return self.tts_coqui_studio(
-                text=text,
-                speaker_name=speaker,
-                language=language,
-                emotion=emotion,
-                speed=speed,
-                file_path=file_path,
-                pipe_out=pipe_out,
-            )
-        wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
+        wav = self.tts(
+            text=text,
+            speaker=speaker,
+            language=language,
+            speaker_wav=speaker_wav,
+            split_sentences=split_sentences,
+            **kwargs,
+        )
        self.synthesizer.save_wav(wav=wav, path=file_path, pipe_out=pipe_out)
        return file_path

@ -440,7 +378,14 @@ class TTS(nn.Module):
        save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)
        return file_path

-    def tts_with_vc(self, text: str, language: str = None, speaker_wav: str = None, speaker: str = None):
+    def tts_with_vc(
+        self,
+        text: str,
+        language: str = None,
+        speaker_wav: str = None,
+        speaker: str = None,
+        split_sentences: bool = True,
+    ):
        """Convert text to speech with voice conversion.

        It combines tts with voice conversion to fake voice cloning.
@ -460,10 +405,16 @@ class TTS(nn.Module):
            speaker (str, optional):
                Speaker name for multi-speaker. You can check whether loaded model is multi-speaker by
                `tts.is_multi_speaker` and list speakers by `tts.speakers`. Defaults to None.
+            split_sentences (bool, optional):
+                Split text into sentences, synthesize them separately, and concatenate the resulting audio.
+                Setting it to False uses more VRAM and may hit model-specific text length or VRAM limits. Only
+                applicable to the 🐸TTS models. Defaults to True.
        """
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
            # Lazy code... save it to a temp file to resample it while reading it for VC
-            self.tts_to_file(text=text, speaker=speaker, language=language, file_path=fp.name)
+            self.tts_to_file(
+                text=text, speaker=speaker, language=language, file_path=fp.name, split_sentences=split_sentences
+            )
        if self.voice_converter is None:
            self.load_vc_model_by_name("voice_conversion_models/multilingual/vctk/freevc24")
        wav = self.voice_converter.voice_conversion(source_wav=fp.name, target_wav=speaker_wav)
@ -476,6 +427,7 @@ class TTS(nn.Module):
        speaker_wav: str = None,
        file_path: str = "output.wav",
        speaker: str = None,
+        split_sentences: bool = True,
    ):
        """Convert text to speech with voice conversion and save to file.

@ -495,6 +447,12 @@ class TTS(nn.Module):
            speaker (str, optional):
                Speaker name for multi-speaker. You can check whether loaded model is multi-speaker by
                `tts.is_multi_speaker` and list speakers by `tts.speakers`. Defaults to None.
+            split_sentences (bool, optional):
+                Split text into sentences, synthesize them separately, and concatenate the resulting audio.
+                Setting it to False uses more VRAM and may hit model-specific text length or VRAM limits. Only
+                applicable to the 🐸TTS models. Defaults to True.
        """
-        wav = self.tts_with_vc(text=text, language=language, speaker_wav=speaker_wav, speaker=speaker)
+        wav = self.tts_with_vc(
+            text=text, language=language, speaker_wav=speaker_wav, speaker=speaker, split_sentences=split_sentences
+        )
        save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)
--- a/TTS/bin/synthesize.py
+++ b/TTS/bin/synthesize.py
@ -66,12 +66,6 @@ If you don't specify any models, then it uses LJSpeech based English model.
  $ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
  ```

- Run TTS and define speed factor to use for 🐸Coqui Studio models, between 0.0 and 2.0:
-
-  ```
-  $ tts --text "Text for TTS" --model_name "coqui_studio/<language>/<dataset>/<model_name>" --speed 1.2 --out_path output/path/speech.wav
-  ```
-
 - Run a TTS model with its default vocoder model:

  ```
@ -222,25 +216,6 @@ def main():
        default=None,
    )
    parser.add_argument("--encoder_config_path", type=str, help="Path to speaker encoder config file.", default=None)
-
-    # args for coqui studio
-    parser.add_argument(
-        "--cs_model",
-        type=str,
-        help="Name of the 🐸Coqui Studio model. Available models are `XTTS`, `V1`.",
-    )
-    parser.add_argument(
-        "--emotion",
-        type=str,
-        help="Emotion to condition the model with. Only available for 🐸Coqui Studio `V1` model.",
-        default=None,
-    )
-    parser.add_argument(
-        "--language",
-        type=str,
-        help="Language to condition the model with. Only available for 🐸Coqui Studio `XTTS` model.",
-        default=None,
-    )
    parser.add_argument(
        "--pipe_out",
        help="stdout the generated TTS wav file for shell pipe.",
@ -249,13 +224,7 @@ def main():
        const=True,
        default=False,
    )
-    parser.add_argument(
-        "--speed",
-        type=float,
-        help="Speed factor to use for 🐸Coqui Studio models, between 0.0 and 2.0.",
-        default=None,
-    )
-
+    
    # args for multi-speaker synthesis
    parser.add_argument("--speakers_file_path", type=str, help="JSON file for multi-speaker model.", default=None)
    parser.add_argument("--language_ids_file_path", type=str, help="JSON file for multi-lingual model.", default=None)
@ -389,7 +358,6 @@ def main():

        # CASE1 #list : list pre-trained TTS models
        if args.list_models:
-            manager.add_cs_api_models(api.list_models())
            manager.list_models()
            sys.exit()

@ -404,29 +372,7 @@ def main():
            manager.model_info_by_full_name(model_query_full_name)
            sys.exit()

-        # CASE3: TTS with coqui studio models
-        if "coqui_studio" in args.model_name:
-            print(" > Using 🐸Coqui Studio model: ", args.model_name)
-            api = TTS(model_name=args.model_name, cs_api_model=args.cs_model)
-            api.tts_to_file(
-                text=args.text,
-                emotion=args.emotion,
-                file_path=args.out_path,
-                language=args.language,
-                speed=args.speed,
-                pipe_out=pipe_out,
-            )
-            print(" > Saving output to ", args.out_path)
-            return
-
-        if args.language_idx is None and args.language is not None:
-            msg = (
-                "--language is only supported for Coqui Studio models. "
-                "Use --language_idx to specify the target language for multilingual models."
-            )
-            raise ValueError(msg)
-
-        # CASE4: load pre-trained model paths
+        # CASE3: load pre-trained model paths
        if args.model_name is not None and not args.model_path:
            model_path, config_path, model_item = manager.download_model(args.model_name)
            # tts model
@ -454,7 +400,7 @@ def main():
        if args.vocoder_name is not None and not args.vocoder_path:
            vocoder_path, vocoder_config_path, _ = manager.download_model(args.vocoder_name)

-        # CASE5: set custom model paths
+        # CASE4: set custom model paths
        if args.model_path is not None:
            tts_path = args.model_path
            tts_config_path = args.config_path
--- a/TTS/bin/train_encoder.py
+++ b/TTS/bin/train_encoder.py
@ -125,7 +125,7 @@ def evaluation(model, criterion, data_loader, global_step):

 def train(model, optimizer, scheduler, criterion, data_loader, eval_data_loader, global_step):
    model.train()
-    best_loss = float("inf")
+    best_loss = {"train_loss": None, "eval_loss": float("inf")}
    avg_loader_time = 0
    end_time = time.time()
    for epoch in range(c.epochs):
@ -248,7 +248,7 @@ def train(model, optimizer, scheduler, criterion, data_loader, eval_data_loader,
            )
            # save the best checkpoint
            best_loss = save_best_model(
-                eval_loss,
+                {"train_loss": None, "eval_loss": eval_loss},
                best_loss,
                c,
                model,
--- a/TTS/config/init.py
+++ b/TTS/config/init.py
@ -16,12 +16,9 @@ def read_json_with_comments(json_path):
    # fallback to json
    with fsspec.open(json_path, "r", encoding="utf-8") as f:
        input_str = f.read()
-    # handle comments
-    input_str = re.sub(r"\\\n", "", input_str)
-    input_str = re.sub(r"//.*\n", "\n", input_str)
-    data = json.loads(input_str)
-    return data
-
+    # handle comments but not urls with //
+    input_str = re.sub(r"(\"(?:[^\"\\]|\\.)*\")|(/\*(?:.|[\n\r])*?\*/)|(//.*)", lambda m: m.group(1) or "", input_str)
+    return json.loads(input_str)
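
The comment-stripping approach in `read_json_with_comments` can be tried in isolation. The following is a minimal sketch, not the project's code: `strip_json_comments` and `_PATTERN` are hypothetical names, and the pattern is a simplified equivalent that matches string literals first so that a `//` inside a URL is left alone.

```python
import json
import re

# Alternation order matters: a quoted string is matched first and kept,
# so comment markers inside strings (e.g. URLs) are never treated as comments.
_PATTERN = re.compile(r'("(?:[^"\\]|\\.)*")|(/\*[\s\S]*?\*/)|(//[^\n]*)')


def strip_json_comments(text: str) -> str:
    """Remove // and /* */ comments while leaving string literals untouched."""
    # group(1) is set only when a string literal matched; comments become "".
    return _PATTERN.sub(lambda m: m.group(1) or "", text)


sample = """
{
    // line comment
    "url": "https://example.com/a//b",  /* block
    comment */
    "n": 1
}
"""
data = json.loads(strip_json_comments(sample))
print(data)
```

Returning only the string-literal group from the replacement callback is what drops both comment styles; returning a comment group as well would leave that comment in the text and `json.loads` would reject it.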

 def register_config(model_name: str) -> Coqpit:
    """Find the right config for the given model name.
--- a/TTS/cs_api.py
+++ b/TTS/cs_api.py
@ -1,317 +0,0 @@
-import http.client
-import json
-import os
-import tempfile
-import urllib.request
-from typing import Tuple
-
-import numpy as np
-import requests
-from scipy.io import wavfile
-
-from TTS.utils.audio.numpy_transforms import save_wav
-
-
-class Speaker(object):
-    """Convert dict to object."""
-
-    def __init__(self, d, is_voice=False):
-        self.is_voice = is_voice
-        for k, v in d.items():
-            if isinstance(k, (list, tuple)):
-                setattr(self, k, [Speaker(x) if isinstance(x, dict) else x for x in v])
-            else:
-                setattr(self, k, Speaker(v) if isinstance(v, dict) else v)
-
-    def __repr__(self):
-        return str(self.__dict__)
-
-
-class CS_API:
-    """🐸Coqui Studio API Wrapper.
-
-    🐸Coqui Studio is the most advanced voice generation platform. You can generate new voices by voice cloning, voice
-    interpolation, or our unique prompt to voice technology. It also provides a set of built-in voices with different
-    characteristics. You can use these voices to generate new audio files or use them in your applications.
-    You can use all the built-in and your own 🐸Coqui Studio speakers with this API with an API token.
-    You can signup to 🐸Coqui Studio from https://app.coqui.ai/auth/signup and get an API token from
-    https://app.coqui.ai/account. We can either enter the token as an environment variable as
-    `export COQUI_STUDIO_TOKEN=<token>` or pass it as `CS_API(api_token=<toke>)`.
-    Visit https://app.coqui.ai/api for more information.
-
-
-    Args:
-        api_token (str): 🐸Coqui Studio API token. If not provided, it will be read from the environment variable
-            `COQUI_STUDIO_TOKEN`.
-        model (str): 🐸Coqui Studio model. It can be either `V1`, `XTTS`. Default is `XTTS`.
-
-
-    Example listing all available speakers:
-        >>> from TTS.api import CS_API
-        >>> tts = CS_API()
-        >>> tts.speakers
-
-    Example listing all emotions:
-        >>> # emotions are only available for `V1` model
-        >>> from TTS.api import CS_API
-        >>> tts = CS_API(model="V1")
-        >>> tts.emotions
-
-    Example with a built-in 🐸 speaker:
-        >>> from TTS.api import CS_API
-        >>> tts = CS_API()
-        >>> wav, sr = api.tts("Hello world", speaker_name=tts.speakers[0].name)
-        >>> filepath = tts.tts_to_file(text="Hello world!", speaker_name=tts.speakers[0].name, file_path="output.wav")
-
-    Example with multi-language model:
-        >>> from TTS.api import CS_API
-        >>> tts = CS_API(model="XTTS")
-        >>> wav, sr = api.tts("Hello world", speaker_name=tts.speakers[0].name, language="en")
-    """
-
-    MODEL_ENDPOINTS = {
-        "V1": {
-            "list_speakers": "https://app.coqui.ai/api/v2/speakers",
-            "synthesize": "https://app.coqui.ai/api/v2/samples",
-            "list_voices": "https://app.coqui.ai/api/v2/voices",
-        },
-        "XTTS": {
-            "list_speakers": "https://app.coqui.ai/api/v2/speakers",
-            "synthesize": "https://app.coqui.ai/api/v2/samples/xtts/render/",
-            "list_voices": "https://app.coqui.ai/api/v2/voices/xtts",
-        },
-    }
-
-    SUPPORTED_LANGUAGES = ["en", "es", "de", "fr", "it", "pt", "pl", "tr", "ru", "nl", "cs", "ar", "zh-cn", "ja"]
-
-    def __init__(self, api_token=None, model="XTTS"):
-        self.api_token = api_token
-        self.model = model
-        self.headers = None
-        self._speakers = None
-        self._check_token()
-
-    @staticmethod
-    def ping_api():
-        URL = "https://coqui.gateway.scarf.sh/tts/api"
-        _ = requests.get(URL)
-
-    @property
-    def speakers(self):
-        if self._speakers is None:
-            self._speakers = self.list_all_speakers()
-        return self._speakers
-
-    @property
-    def emotions(self):
-        """Return a list of available emotions.
-
-        TODO: Get this from the API endpoint.
-        """
-        if self.model == "V1":
-            return ["Neutral", "Happy", "Sad", "Angry", "Dull"]
-        else:
-            raise ValueError(f"❗ Emotions are not available for {self.model}.")
-
-    def _check_token(self):
-        if self.api_token is None:
-            self.api_token = os.environ.get("COQUI_STUDIO_TOKEN")
-            self.headers = {"Content-Type": "application/json", "Authorization": f"Bearer {self.api_token}"}
-        if not self.api_token:
-            raise ValueError(
-                "No API token found for 🐸Coqui Studio voices - https://coqui.ai \n"
-                "Visit 🔗https://app.coqui.ai/account to get one.\n"
-                "Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`\n"
-                ""
-            )
-
-    def list_all_speakers(self):
-        """Return both built-in Coqui Studio speakers and custom voices created by the user."""
-        return self.list_speakers() + self.list_voices()
-
-    def list_speakers(self):
-        """List built-in Coqui Studio speakers."""
-        self._check_token()
-        conn = http.client.HTTPSConnection("app.coqui.ai")
-        url = self.MODEL_ENDPOINTS[self.model]["list_speakers"]
-        conn.request("GET", f"{url}?page=1&per_page=100", headers=self.headers)
-        res = conn.getresponse()
-        data = res.read()
-        return [Speaker(s) for s in json.loads(data)["result"]]
-
-    def list_voices(self):
-        """List custom voices created by the user."""
-        conn = http.client.HTTPSConnection("app.coqui.ai")
-        url = self.MODEL_ENDPOINTS[self.model]["list_voices"]
-        conn.request("GET", f"{url}?page=1&per_page=100", headers=self.headers)
-        res = conn.getresponse()
-        data = res.read()
-        return [Speaker(s, True) for s in json.loads(data)["result"]]
-
-    def list_speakers_as_tts_models(self):
-        """List speakers in ModelManager format."""
-        models = []
-        for speaker in self.speakers:
-            model = f"coqui_studio/multilingual/{speaker.name}/{self.model}"
-            models.append(model)
-        return models
-
-    def name_to_speaker(self, name):
-        for speaker in self.speakers:
-            if speaker.name == name:
-                return speaker
-        raise ValueError(f"Speaker {name} not found in {self.speakers}")
-
-    def id_to_speaker(self, speaker_id):
-        for speaker in self.speakers:
-            if speaker.id == speaker_id:
-                return speaker
-        raise ValueError(f"Speaker {speaker_id} not found.")
-
-    @staticmethod
-    def url_to_np(url):
-        tmp_file, _ = urllib.request.urlretrieve(url)
-        rate, data = wavfile.read(tmp_file)
-        return data, rate
-
-    @staticmethod
-    def _create_payload(model, text, speaker, speed, emotion, language):
-        payload = {}
-        # if speaker.is_voice:
-        payload["voice_id"] = speaker.id
-        # else:
-        payload["speaker_id"] = speaker.id
-
-        if model == "V1":
-            payload.update(
-                {
-                    "emotion": emotion,
-                    "name": speaker.name,
-                    "text": text,
-                    "speed": speed,
-                }
-            )
-        elif model == "XTTS":
-            payload.update(
-                {
-                    "name": speaker.name,
-                    "text": text,
-                    "speed": speed,
-                    "language": language,
-                }
-            )
-        else:
-            raise ValueError(f"❗ Unknown model {model}")
-        return payload
-
-    def _check_tts_args(self, text, speaker_name, speaker_id, emotion, speed, language):
-        assert text is not None, "❗ text is required for V1 model."
-        assert speaker_name is not None, "❗ speaker_name is required for V1 model."
-        if self.model == "V1":
-            if emotion is None:
-                emotion = "Neutral"
-            assert language is None, "❗ language is not supported for V1 model."
-        elif self.model == "XTTS":
-            assert emotion is None, f"❗ Emotions are not supported for XTTS model. Use V1 model."
-            assert language is not None, "❗ Language is required for XTTS model."
-            assert (
-                language in self.SUPPORTED_LANGUAGES
-            ), f"❗ Language {language} is not yet supported. Check https://docs.coqui.ai/reference/samples_xtts_create."
-        return text, speaker_name, speaker_id, emotion, speed, language
-
-    def tts(
-        self,
-        text: str,
-        speaker_name: str = None,
-        speaker_id=None,
-        emotion=None,
-        speed=1.0,
-        language=None,  # pylint: disable=unused-argument
-    ) -> Tuple[np.ndarray, int]:
-        """Synthesize speech from text.
-
-        Args:
-            text (str): Text to synthesize.
-            speaker_name (str): Name of the speaker. You can get the list of speakers with `list_speakers()` and
-                voices (user generated speakers) with `list_voices()`.
-            speaker_id (str): Speaker ID. If None, the speaker name is used.
-            emotion (str): Emotion of the speaker. One of "Neutral", "Happy", "Sad", "Angry", "Dull". Emotions are only
-                supported by `V1` model. Defaults to None.
-            speed (float): Speed of the speech. 1.0 is normal speed.
-            language (str): Language of the text. If None, the default language of the speaker is used. Language is only
-                supported by `XTTS` model. See https://docs.coqui.ai/reference/samples_xtts_create for supported languages.
-        """
-        self._check_token()
-        self.ping_api()
-
-        if speaker_name is None and speaker_id is None:
-            raise ValueError(" [!] Please provide either a `speaker_name` or a `speaker_id`.")
-        if speaker_id is None:
-            speaker = self.name_to_speaker(speaker_name)
-        else:
-            speaker = self.id_to_speaker(speaker_id)
-
-        text, speaker_name, speaker_id, emotion, speed, language = self._check_tts_args(
-            text, speaker_name, speaker_id, emotion, speed, language
-        )
-
-        conn = http.client.HTTPSConnection("app.coqui.ai")
-        payload = self._create_payload(self.model, text, speaker, speed, emotion, language)
-        url = self.MODEL_ENDPOINTS[self.model]["synthesize"]
-        conn.request("POST", url, json.dumps(payload), self.headers)
-        res = conn.getresponse()
-        data = res.read()
-        try:
-            wav, sr = self.url_to_np(json.loads(data)["audio_url"])
-        except KeyError as e:
-            raise ValueError(f" [!] 🐸 API returned error: {data}") from e
-        return wav, sr
-
-    def tts_to_file(
-        self,
-        text: str,
-        speaker_name: str,
-        speaker_id=None,
-        emotion=None,
-        speed=1.0,
-        pipe_out=None,
-        language=None,
-        file_path: str = None,
-    ) -> str:
-        """Synthesize speech from text and save it to a file.
-
-        Args:
-            text (str): Text to synthesize.
-            speaker_name (str): Name of the speaker. You can get the list of speakers with `list_speakers()` and
-                voices (user generated speakers) with `list_voices()`.
-            speaker_id (str): Speaker ID. If None, the speaker name is used.
-            emotion (str): Emotion of the speaker. One of "Neutral", "Happy", "Sad", "Angry", "Dull".
-            speed (float): Speed of the speech. 1.0 is normal speed.
-            pipe_out (BytesIO, optional): Flag to stdout the generated TTS wav file for shell pipe.
-            language (str): Language of the text. If None, the default language of the speaker is used. Language is only
-                supported by `XTTS` model. Currently supports en, de, es, fr, it, pt, pl. Defaults to "en".
-            file_path (str): Path to save the file. If None, a temporary file is created.
-        """
-        if file_path is None:
-            file_path = tempfile.mktemp(".wav")
-        wav, sr = self.tts(text, speaker_name, speaker_id, emotion, speed, language)
-        save_wav(wav=wav, path=file_path, sample_rate=sr, pipe_out=pipe_out)
-        return file_path
-
-
-if __name__ == "__main__":
-    import time
-
-    api = CS_API()
-    print(api.speakers)
-    print(api.list_speakers_as_tts_models())
-
-    ts = time.time()
-    wav, sr = api.tts(
-        "It took me quite a long time to develop a voice.", language="en", speaker_name=api.speakers[0].name
-    )
-    print(f" [i] XTTS took {time.time() - ts:.2f}s")
-
-    filepath = api.tts_to_file(
-        text="Hello world!", speaker_name=api.speakers[0].name, language="en", file_path="output.wav"
-    )
--- a/TTS/demos/xtts_ft_demo/requirements.txt
+++ b/TTS/demos/xtts_ft_demo/requirements.txt
@ -0,0 +1,2 @@
+faster_whisper==0.9.0
+gradio==4.7.1
--- a/TTS/demos/xtts_ft_demo/utils/formatter.py
+++ b/TTS/demos/xtts_ft_demo/utils/formatter.py
@ -0,0 +1,160 @@
+import os
+import gc
+import torchaudio
+import pandas
+from faster_whisper import WhisperModel
+from glob import glob
+
+from tqdm import tqdm
+
+import torch
+import torchaudio
+# torch.set_num_threads(1)
+
+from TTS.tts.layers.xtts.tokenizer import multilingual_cleaners
+
+torch.set_num_threads(16)
+
+
+import os
+
+audio_types = (".wav", ".mp3", ".flac")
+
+
+def list_audios(basePath, contains=None):
+    # return the set of files that are valid
+    return list_files(basePath, validExts=audio_types, contains=contains)
+
+def list_files(basePath, validExts=None, contains=None):
+    # loop over the directory structure
+    for (rootDir, dirNames, filenames) in os.walk(basePath):
+        # loop over the filenames in the current directory
+        for filename in filenames:
+            # if the contains string is not none and the filename does not contain
+            # the supplied string, then ignore the file
+            if contains is not None and filename.find(contains) == -1:
+                continue
+
+            # determine the file extension of the current file
+            ext = filename[filename.rfind("."):].lower()
+
+            # check to see if the file is an audio and should be processed
+            if validExts is None or ext.endswith(validExts):
+                # construct the path to the audio and yield it
+                audioPath = os.path.join(rootDir, filename)
+                yield audioPath
+
+def format_audio_list(audio_files, target_language="en", out_path=None, buffer=0.2, eval_percentage=0.15, speaker_name="coqui", gradio_progress=None):
+    audio_total_size = 0
+    # make sure that the output directory exists
+    os.makedirs(out_path, exist_ok=True)
+
+    # Loading Whisper
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    compute_type = "float16" if device == "cuda" else "float32"  # float16 needs a GPU backend
+    print("Loading Whisper Model!")
+    asr_model = WhisperModel("large-v2", device=device, compute_type=compute_type)
+
+    metadata = {"audio_file": [], "text": [], "speaker_name": []}
+
+    if gradio_progress is not None:
+        tqdm_object = gradio_progress.tqdm(audio_files, desc="Formatting...")
+    else:
+        tqdm_object = tqdm(audio_files)
+
+    for audio_path in tqdm_object:
+        wav, sr = torchaudio.load(audio_path)
+        # stereo to mono if needed
+        if wav.size(0) != 1:
+            wav = torch.mean(wav, dim=0, keepdim=True)
+
+        wav = wav.squeeze()
+        audio_total_size += (wav.size(-1) / sr)
+
+        segments, _ = asr_model.transcribe(audio_path, word_timestamps=True, language=target_language)
+        segments = list(segments)
+        i = 0
+        sentence = ""
+        sentence_start = None
+        first_word = True
+        # collect the words of all segments into a single list
+        words_list = []
+        for segment in segments:
+            words = list(segment.words)
+            words_list.extend(words)
+
+        # process each word
+        for word_idx, word in enumerate(words_list):
+            if first_word:
+                sentence_start = word.start
+                # If it is the first sentence, add buffer or get the beginning of the file
+                if word_idx == 0:
+                    sentence_start = max(sentence_start - buffer, 0)  # Add buffer to the sentence start
+                else:
+                    # get previous sentence end
+                    previous_word_end = words_list[word_idx - 1].end
+                    # add buffer or get the middle of the silence between the previous sentence and the current one
+                    sentence_start = max(sentence_start - buffer, (previous_word_end + sentence_start)/2)
+
+                sentence = word.word
+                first_word = False
+            else:
+                sentence += word.word
+
+            if word.word[-1] in ["!", ".", "?"]:
+                sentence = sentence[1:]
+                # Expand number and abbreviations plus normalization
+                sentence = multilingual_cleaners(sentence, target_language)
+                audio_file_name, _ = os.path.splitext(os.path.basename(audio_path))
+
+                audio_file = f"wavs/{audio_file_name}_{str(i).zfill(8)}.wav"
+
+                # Check for the next word's existence
+                if word_idx + 1 < len(words_list):
+                    next_word_start = words_list[word_idx + 1].start
+                else:
+                    # If there are no more words, this is the last sentence, so use the audio length as the next word start
+                    next_word_start = (wav.shape[0] - 1) / sr
+
+                # Average the current word end and next word start
+                word_end = min((word.end + next_word_start) / 2, word.end + buffer)
+
+                absolute_path = os.path.join(out_path, audio_file)
+                os.makedirs(os.path.dirname(absolute_path), exist_ok=True)
+                i += 1
+                first_word = True
+
+                audio = wav[int(sr*sentence_start):int(sr*word_end)].unsqueeze(0)
+                # if the audio is too short, ignore it (i.e., < 0.33 seconds)
+                if audio.size(-1) >= sr/3:
+                    torchaudio.save(absolute_path,
+                        audio,
+                        sr
+                    )
+                else:
+                    continue
+
+                metadata["audio_file"].append(audio_file)
+                metadata["text"].append(sentence)
+                metadata["speaker_name"].append(speaker_name)
+
+    df = pandas.DataFrame(metadata)
+    df = df.sample(frac=1)
+    num_val_samples = int(len(df)*eval_percentage)
+
+    df_eval = df[:num_val_samples]
+    df_train = df[num_val_samples:]
+
+    df_train = df_train.sort_values('audio_file')
+    train_metadata_path = os.path.join(out_path, "metadata_train.csv")
+    df_train.to_csv(train_metadata_path, sep="|", index=False)
+
+    eval_metadata_path = os.path.join(out_path, "metadata_eval.csv")
+    df_eval = df_eval.sort_values('audio_file')
+    df_eval.to_csv(eval_metadata_path, sep="|", index=False)
+
+    # deallocate VRAM and RAM
+    del asr_model, df_train, df_eval, df, metadata
+    gc.collect()
+
+    return train_metadata_path, eval_metadata_path, audio_total_size
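
The clip-boundary logic in `format_audio_list` pads each sentence by `buffer` seconds but never reaches past the midpoint of the silence on either side. A minimal sketch of that rule as a standalone function follows; `clip_bounds` and its parameter names are hypothetical and do not appear in the diff.

```python
def clip_bounds(word_start, word_end, prev_word_end, next_word_start, buffer=0.2):
    """Return (start, end) in seconds for one sentence clip.

    word_start / word_end: start of the first word and end of the last word.
    prev_word_end: end of the previous sentence, or None for the first sentence.
    next_word_start: start of the next word, or the audio length for the last sentence.
    """
    if prev_word_end is None:
        # first sentence: pad backwards but not before the start of the file
        start = max(word_start - buffer, 0.0)
    else:
        # pad backwards, but not past the midpoint of the preceding silence
        start = max(word_start - buffer, (prev_word_end + word_start) / 2)
    # pad forwards, but not past the midpoint of the following silence
    end = min(word_end + buffer, (word_end + next_word_start) / 2)
    return start, end


# Wide gaps on both sides: the full buffer is applied.
print(clip_bounds(1.0, 2.0, 0.5, 3.0, buffer=0.25))
# Narrow gaps: the midpoints of the silences win.
print(clip_bounds(1.0, 2.0, 0.75, 2.25, buffer=0.25))
```

Clamping to the midpoint keeps adjacent clips from overlapping, so no stretch of audio ends up in two training samples.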
--- a/TTS/demos/xtts_ft_demo/utils/gpt_train.py
+++ b/TTS/demos/xtts_ft_demo/utils/gpt_train.py
@ -0,0 +1,172 @@
+import os
+import gc
+
+from trainer import Trainer, TrainerArgs
+
+from TTS.config.shared_configs import BaseDatasetConfig
+from TTS.tts.datasets import load_tts_samples
+from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig
+from TTS.utils.manage import ModelManager
+
+
+def train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path, max_audio_length=255995):
+    #  Logging parameters
+    RUN_NAME = "GPT_XTTS_FT"
+    PROJECT_NAME = "XTTS_trainer"
+    DASHBOARD_LOGGER = "tensorboard"
+    LOGGER_URI = None
+
+    # Set here the path where the checkpoints will be saved. Default: ./run/training/
+    OUT_PATH = os.path.join(output_path, "run", "training")
+
+    # Training Parameters
+    OPTIMIZER_WD_ONLY_ON_WEIGHTS = True  # for multi-gpu training please make it False
+    START_WITH_EVAL = False  # if True it will start with evaluation
+    BATCH_SIZE = batch_size  # set here the batch size
+    GRAD_ACUMM_STEPS = grad_acumm  # set here the grad accumulation steps
+
+
+    # Define here the dataset that you want to use for fine-tuning.
+    config_dataset = BaseDatasetConfig(
+        formatter="coqui",
+        dataset_name="ft_dataset",
+        path=os.path.dirname(train_csv),
+        meta_file_train=train_csv,
+        meta_file_val=eval_csv,
+        language=language,
+    )
+
+    # Add here the configs of the datasets
+    DATASETS_CONFIG_LIST = [config_dataset]
+
+    # Define the path where XTTS v2.0.1 files will be downloaded
+    CHECKPOINTS_OUT_PATH = os.path.join(OUT_PATH, "XTTS_v2.0_original_model_files/")
+    os.makedirs(CHECKPOINTS_OUT_PATH, exist_ok=True)
+
+
+    # DVAE files
+    DVAE_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth"
+    MEL_NORM_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth"
+
+    # Set the path to the downloaded files
+    DVAE_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(DVAE_CHECKPOINT_LINK))
+    MEL_NORM_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(MEL_NORM_LINK))
+
+    # download DVAE files if needed
+    if not os.path.isfile(DVAE_CHECKPOINT) or not os.path.isfile(MEL_NORM_FILE):
+        print(" > Downloading DVAE files!")
+        ModelManager._download_model_files([MEL_NORM_LINK, DVAE_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True)
+
+
+    # Download XTTS v2.0 checkpoint if needed
+    TOKENIZER_FILE_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json"
+    XTTS_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth"
+    XTTS_CONFIG_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/config.json"
+
+    # XTTS transfer learning parameters: provide the paths of the XTTS model checkpoint that you want to fine-tune.
+    TOKENIZER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(TOKENIZER_FILE_LINK))  # vocab.json file
+    XTTS_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CHECKPOINT_LINK))  # model.pth file
+    XTTS_CONFIG_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CONFIG_LINK))  # config.json file
+
+    # download XTTS v2.0 files if needed
+    if not all(os.path.isfile(f) for f in (TOKENIZER_FILE, XTTS_CHECKPOINT, XTTS_CONFIG_FILE)):
+        print(" > Downloading XTTS v2.0 files!")
+        ModelManager._download_model_files(
+            [TOKENIZER_FILE_LINK, XTTS_CHECKPOINT_LINK, XTTS_CONFIG_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True
+        )
+
+    # init args and config
+    model_args = GPTArgs(
+        max_conditioning_length=132300,  # 6 secs
+        min_conditioning_length=66150,  # 3 secs
+        debug_loading_failures=False,
+        max_wav_length=max_audio_length,  # ~11.6 seconds
+        max_text_length=200,
+        mel_norm_file=MEL_NORM_FILE,
+        dvae_checkpoint=DVAE_CHECKPOINT,
+        xtts_checkpoint=XTTS_CHECKPOINT,  # checkpoint path of the model that you want to fine-tune
+        tokenizer_file=TOKENIZER_FILE,
+        gpt_num_audio_tokens=1026,
+        gpt_start_audio_token=1024,
+        gpt_stop_audio_token=1025,
+        gpt_use_masking_gt_prompt_approach=True,
+        gpt_use_perceiver_resampler=True,
+    )
+    # define audio config
+    audio_config = XttsAudioConfig(sample_rate=22050, dvae_sample_rate=22050, output_sample_rate=24000)
+    # training parameters config
+    config = GPTTrainerConfig(
+        epochs=num_epochs,
+        output_path=OUT_PATH,
+        model_args=model_args,
+        run_name=RUN_NAME,
+        project_name=PROJECT_NAME,
+        run_description="""
+            GPT XTTS training
+            """,
+        dashboard_logger=DASHBOARD_LOGGER,
+        logger_uri=LOGGER_URI,
+        audio=audio_config,
+        batch_size=BATCH_SIZE,
+        batch_group_size=48,
+        eval_batch_size=BATCH_SIZE,
+        num_loader_workers=8,
+        eval_split_max_size=256,
+        print_step=50,
+        plot_step=100,
+        log_model_step=100,
+        save_step=1000,
+        save_n_checkpoints=1,
+        save_checkpoints=True,
+        # target_loss="loss",
+        print_eval=False,
+        # Optimizer values like tortoise, pytorch implementation with modifications to not apply WD to non-weight parameters.
+        optimizer="AdamW",
+        optimizer_wd_only_on_weights=OPTIMIZER_WD_ONLY_ON_WEIGHTS,
+        optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
+        lr=5e-06,  # learning rate
+        lr_scheduler="MultiStepLR",
+        # it was adjusted accordingly for the new step scheme
+        lr_scheduler_params={"milestones": [50000 * 18, 150000 * 18, 300000 * 18], "gamma": 0.5, "last_epoch": -1},
+        test_sentences=[],
+    )
+
+    # init the model from config
+    model = GPTTrainer.init_from_config(config)
+
+    # load training samples
+    train_samples, eval_samples = load_tts_samples(
+        DATASETS_CONFIG_LIST,
+        eval_split=True,
+        eval_split_max_size=config.eval_split_max_size,
+        eval_split_size=config.eval_split_size,
+    )
+
+    # init the trainer and 🚀
+    trainer = Trainer(
+        TrainerArgs(
+            restore_path=None,  # the XTTS checkpoint is restored via the xtts_checkpoint key, so there is no need to restore it with the Trainer restore_path parameter
+            skip_train_epoch=False,
+            start_with_eval=START_WITH_EVAL,
+            grad_accum_steps=GRAD_ACUMM_STEPS,
+        ),
+        config,
+        output_path=OUT_PATH,
+        model=model,
+        train_samples=train_samples,
+        eval_samples=eval_samples,
+    )
+    trainer.fit()
+
+    # get the longest text audio file to use as speaker reference
+    samples_len = [len(item["text"].split(" ")) for item in train_samples]
+    longest_text_idx = samples_len.index(max(samples_len))
+    speaker_ref = train_samples[longest_text_idx]["audio_file"]
+
+    trainer_out_path = trainer.output_path
+
+    # deallocate VRAM and RAM
+    del model, trainer, train_samples, eval_samples
+    gc.collect()
+
+    return XTTS_CONFIG_FILE, XTTS_CHECKPOINT, TOKENIZER_FILE, trainer_out_path, speaker_ref
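
The length arguments passed to `GPTArgs` in `train_gpt` are sample counts at the 22050 Hz training rate, while their comments give durations in seconds. A small sketch of the conversion, assuming that rate, is shown here; `seconds_to_samples` and `samples_to_seconds` are hypothetical helpers that are not part of the diff.

```python
SAMPLE_RATE = 22050  # assumed from XttsAudioConfig(sample_rate=22050)


def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in seconds to a whole number of audio samples."""
    return int(seconds * sample_rate)


def samples_to_seconds(samples, sample_rate=SAMPLE_RATE):
    """Convert a sample count back to a duration in seconds."""
    return samples / sample_rate


print(seconds_to_samples(6))   # max_conditioning_length
print(seconds_to_samples(3))   # min_conditioning_length
print(round(samples_to_seconds(255995), 1))  # default max_audio_length
```

A value given in seconds, such as a command-line option, would need this conversion before being passed as `max_audio_length`.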
--- a/TTS/demos/xtts_ft_demo/xtts_demo.py
+++ b/TTS/demos/xtts_ft_demo/xtts_demo.py
@ -0,0 +1,415 @@
+import argparse
+import os
+import sys
+import tempfile
+
+import gradio as gr
+import librosa.display
+import numpy as np
+
+import os
+import torch
+import torchaudio
+import traceback
+from TTS.demos.xtts_ft_demo.utils.formatter import format_audio_list
+from TTS.demos.xtts_ft_demo.utils.gpt_train import train_gpt
+
+from TTS.tts.configs.xtts_config import XttsConfig
+from TTS.tts.models.xtts import Xtts
+
+
+def clear_gpu_cache():
+    # clear the GPU cache
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+
+XTTS_MODEL = None
+def load_model(xtts_checkpoint, xtts_config, xtts_vocab):
+    global XTTS_MODEL
+    clear_gpu_cache()
+    if not xtts_checkpoint or not xtts_config or not xtts_vocab:
+        return "You need to run the previous steps or manually set the `XTTS checkpoint path`, `XTTS config path`, and `XTTS vocab path` fields !!"
+    config = XttsConfig()
+    config.load_json(xtts_config)
+    XTTS_MODEL = Xtts.init_from_config(config)
+    print("Loading XTTS model! ")
+    XTTS_MODEL.load_checkpoint(config, checkpoint_path=xtts_checkpoint, vocab_path=xtts_vocab, use_deepspeed=False)
+    if torch.cuda.is_available():
+        XTTS_MODEL.cuda()
+
+    print("Model Loaded!")
+    return "Model Loaded!"
+
+def run_tts(lang, tts_text, speaker_audio_file):
+    if XTTS_MODEL is None or not speaker_audio_file:
+        return "You need to run the previous step to load the model !!", None, None
+
+    gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(audio_path=speaker_audio_file, gpt_cond_len=XTTS_MODEL.config.gpt_cond_len, max_ref_length=XTTS_MODEL.config.max_ref_len, sound_norm_refs=XTTS_MODEL.config.sound_norm_refs)
+    out = XTTS_MODEL.inference(
+        text=tts_text,
+        language=lang,
+        gpt_cond_latent=gpt_cond_latent,
+        speaker_embedding=speaker_embedding,
+        temperature=XTTS_MODEL.config.temperature,  # Add custom parameters here
+        length_penalty=XTTS_MODEL.config.length_penalty,
+        repetition_penalty=XTTS_MODEL.config.repetition_penalty,
+        top_k=XTTS_MODEL.config.top_k,
+        top_p=XTTS_MODEL.config.top_p,
+    )
+
+    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
+        out["wav"] = torch.tensor(out["wav"]).unsqueeze(0)
+        out_path = fp.name
+        torchaudio.save(out_path, out["wav"], 24000)
+
+    return "Speech generated !", out_path, speaker_audio_file
+
+
+
+
+# define a logger that mirrors stdout to a log file
+class Logger:
+    def __init__(self, filename="log.out"):
+        self.log_file = filename
+        self.terminal = sys.stdout
+        self.log = open(self.log_file, "w")
+
+    def write(self, message):
+        self.terminal.write(message)
+        self.log.write(message)
+
+    def flush(self):
+        self.terminal.flush()
+        self.log.flush()
+
+    def isatty(self):
+        return False
+
+# redirect stdout and stderr to a file
+sys.stdout = Logger()
+sys.stderr = sys.stdout
+
+
+# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
+import logging
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s [%(levelname)s] %(message)s",
+    handlers=[
+        logging.StreamHandler(sys.stdout)
+    ]
+)
+
+def read_logs():
+    sys.stdout.flush()
+    with open(sys.stdout.log_file, "r") as f:
+        return f.read()
+
+
+if __name__ == "__main__":
+
+    parser = argparse.ArgumentParser(
+        description="""XTTS fine-tuning demo\n\n"""
+        """
+        Example runs:
+        python3 TTS/demos/xtts_ft_demo/xtts_demo.py --port 
+        """,
+        formatter_class=argparse.RawTextHelpFormatter,
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        help="Port to run the gradio demo. Default: 5003",
+        default=5003,
+    )
+    parser.add_argument(
+        "--out_path",
+        type=str,
+        help="Output path (where data and checkpoints will be saved) Default: /tmp/xtts_ft/",
+        default="/tmp/xtts_ft/",
+    )
+
+    parser.add_argument(
+        "--num_epochs",
+        type=int,
+        help="Number of epochs to train. Default: 10",
+        default=10,
+    )
+    parser.add_argument(
+        "--batch_size",
+        type=int,
+        help="Batch size. Default: 4",
+        default=4,
+    )
+    parser.add_argument(
+        "--grad_acumm",
+        type=int,
+        help="Grad accumulation steps. Default: 1",
+        default=1,
+    )
+    parser.add_argument(
+        "--max_audio_length",
+        type=int,
+        help="Max permitted audio size in seconds. Default: 11",
+        default=11,
+    )
+
+    args = parser.parse_args()
+
+    with gr.Blocks() as demo:
+        with gr.Tab("1 - Data processing"):
+            out_path = gr.Textbox(
+                label="Output path (where data and checkpoints will be saved):",
+                value=args.out_path,
+            )
+            # upload_file = gr.Audio(
+            #     sources="upload",
+            #     label="Select here the audio files that you want to use for XTTS training!",
+            #     type="filepath",
+            # )
+            upload_file = gr.File(
+                file_count="multiple",
+                label="Select here the audio files that you want to use for XTTS training (Supported formats: wav, mp3, and flac)",
+            )
+            lang = gr.Dropdown(
+                label="Dataset Language",
+                value="en",
+                choices=[
+                    "en",
+                    "es",
+                    "fr",
+                    "de",
+                    "it",
+                    "pt",
+                    "pl",
+                    "tr",
+                    "ru",
+                    "nl",
+                    "cs",
+                    "ar",
+                    "zh",
+                    "hu",
+                    "ko",
+                    "ja"
+                ],
+            )
+            progress_data = gr.Label(
+                label="Progress:"
+            )
+            logs = gr.Textbox(
+                label="Logs:",
+                interactive=False,
+            )
+            demo.load(read_logs, None, logs, every=1)
+
+            prompt_compute_btn = gr.Button(value="Step 1 - Create dataset")
+        
+            def preprocess_dataset(audio_path, language, out_path, progress=gr.Progress(track_tqdm=True)):
+                clear_gpu_cache()
+                out_path = os.path.join(out_path, "dataset")
+                os.makedirs(out_path, exist_ok=True)
+                if audio_path is None:
+                    return "You should provide one or more audio files! If you already did, the upload has probably not finished yet!", "", ""
+                else:
+                    try:
+                        train_meta, eval_meta, audio_total_size = format_audio_list(audio_path, target_language=language, out_path=out_path, gradio_progress=progress)
+                    except Exception:
+                        traceback.print_exc()
+                        error = traceback.format_exc()
+                        return f"The data processing was interrupted due to an error! Please check the console to see the full error message! \n Error summary: {error}", "", ""
+
+                clear_gpu_cache()
+
+                # if the total audio length is less than 2 minutes, return an error message
+                if audio_total_size < 120:
+                    message = "The sum of the duration of the audios that you provided should be at least 2 minutes!"
+                    print(message)
+                    return message, "", ""
+
+                print("Dataset Processed!")
+                return "Dataset Processed!", train_meta, eval_meta
+
+        with gr.Tab("2 - Fine-tuning XTTS Encoder"):
+            train_csv = gr.Textbox(
+                label="Train CSV:",
+            )
+            eval_csv = gr.Textbox(
+                label="Eval CSV:",
+            )
+            num_epochs = gr.Slider(
+                label="Number of epochs:",
+                minimum=1,
+                maximum=100,
+                step=1,
+                value=args.num_epochs,
+            )
+            batch_size = gr.Slider(
+                label="Batch size:",
+                minimum=2,
+                maximum=512,
+                step=1,
+                value=args.batch_size,
+            )
+            grad_acumm = gr.Slider(
+                label="Grad accumulation steps:",
+                minimum=1,
+                maximum=128,
+                step=1,
+                value=args.grad_acumm,
+            )
+            max_audio_length = gr.Slider(
+                label="Max permitted audio size in seconds:",
+                minimum=2,
+                maximum=20,
+                step=1,
+                value=args.max_audio_length,
+            )
+            progress_train = gr.Label(
+                label="Progress:"
+            )
+            logs_tts_train = gr.Textbox(
+                label="Logs:",
+                interactive=False,
+            )
+            demo.load(read_logs, None, logs_tts_train, every=1)
+            train_btn = gr.Button(value="Step 2 - Run the training")
+
+            def train_model(language, train_csv, eval_csv, num_epochs, batch_size, grad_acumm, output_path, max_audio_length):
+                clear_gpu_cache()
+                if not train_csv or not eval_csv:
+                    return "You need to run the data processing step or manually set `Train CSV` and `Eval CSV` fields !", "", "", "", ""
+                try:
+                    # convert seconds to waveform frames
+                    max_audio_length = int(max_audio_length * 22050)
+                    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
+                except Exception:
+                    traceback.print_exc()
+                    error = traceback.format_exc()
+                    return f"The training was interrupted due to an error! Please check the console to see the full error message! \n Error summary: {error}", "", "", "", ""
+
+                # copy the original files to avoid issues from later parameter changes
+                os.system(f"cp {config_path} {exp_path}")
+                os.system(f"cp {vocab_file} {exp_path}")
+
+                ft_xtts_checkpoint = os.path.join(exp_path, "best_model.pth")
+                print("Model training done!")
+                clear_gpu_cache()
+                return "Model training done!", config_path, vocab_file, ft_xtts_checkpoint, speaker_wav
+
+        with gr.Tab("3 - Inference"):
+            with gr.Row():
+                with gr.Column() as col1:
+                    xtts_checkpoint = gr.Textbox(
+                        label="XTTS checkpoint path:",
+                        value="",
+                    )
+                    xtts_config = gr.Textbox(
+                        label="XTTS config path:",
+                        value="",
+                    )
+
+                    xtts_vocab = gr.Textbox(
+                        label="XTTS vocab path:",
+                        value="",
+                    )
+                    progress_load = gr.Label(
+                        label="Progress:"
+                    )
+                    load_btn = gr.Button(value="Step 3 - Load Fine-tuned XTTS model")
+
+                with gr.Column() as col2:
+                    speaker_reference_audio = gr.Textbox(
+                        label="Speaker reference audio:",
+                        value="",
+                    )
+                    tts_language = gr.Dropdown(
+                        label="Language",
+                        value="en",
+                        choices=[
+                            "en",
+                            "es",
+                            "fr",
+                            "de",
+                            "it",
+                            "pt",
+                            "pl",
+                            "tr",
+                            "ru",
+                            "nl",
+                            "cs",
+                            "ar",
+                            "zh",
+                            "hu",
+                            "ko",
+                            "ja",
+                        ]
+                    )
+                    tts_text = gr.Textbox(
+                        label="Input Text.",
+                        value="This model sounds really good and above all, it's reasonably fast.",
+                    )
+                    tts_btn = gr.Button(value="Step 4 - Inference")
+
+                with gr.Column() as col3:
+                    progress_gen = gr.Label(
+                        label="Progress:"
+                    )
+                    tts_output_audio = gr.Audio(label="Generated Audio.")
+                    reference_audio = gr.Audio(label="Reference audio used.")
+
+            prompt_compute_btn.click(
+                fn=preprocess_dataset,
+                inputs=[
+                    upload_file,
+                    lang,
+                    out_path,
+                ],
+                outputs=[
+                    progress_data,
+                    train_csv,
+                    eval_csv,
+                ],
+            )
+
+
+            train_btn.click(
+                fn=train_model,
+                inputs=[
+                    lang,
+                    train_csv,
+                    eval_csv,
+                    num_epochs,
+                    batch_size,
+                    grad_acumm,
+                    out_path,
+                    max_audio_length,
+                ],
+                outputs=[progress_train, xtts_config, xtts_vocab, xtts_checkpoint, speaker_reference_audio],
+            )
+            
+            load_btn.click(
+                fn=load_model,
+                inputs=[
+                    xtts_checkpoint,
+                    xtts_config,
+                    xtts_vocab
+                ],
+                outputs=[progress_load],
+            )
+
+            tts_btn.click(
+                fn=run_tts,
+                inputs=[
+                    tts_language,
+                    tts_text,
+                    speaker_reference_audio,
+                ],
+                outputs=[progress_gen, tts_output_audio, reference_audio],
+            )
+
+    demo.launch(
+        share=True,
+        debug=False,
+        server_port=args.port,
+        server_name="0.0.0.0"
+    )
--- a/TTS/tts/configs/xtts_config.py
+++ b/TTS/tts/configs/xtts_config.py
@ -88,6 +88,7 @@ class XttsConfig(BaseTTSConfig):
            "hu",
            "ko",
            "ja",
+            "hi",
        ]
    )

--- a/TTS/tts/datasets/dataset.py
+++ b/TTS/tts/datasets/dataset.py
@ -13,6 +13,8 @@ from TTS.tts.utils.data import prepare_data, prepare_stop_target, prepare_tensor
 from TTS.utils.audio import AudioProcessor
 from TTS.utils.audio.numpy_transforms import compute_energy as calculate_energy

+import mutagen
+
 # to prevent too many open files error as suggested here
 # https://github.com/pytorch/pytorch/issues/11201#issuecomment-421146936
 torch.multiprocessing.set_sharing_strategy("file_system")
@ -42,6 +44,15 @@ def string2filename(string):
    return filename


+def get_audio_size(audiopath):
+    extension = audiopath.rpartition(".")[-1].lower()
+    if extension not in {"mp3", "wav", "flac"}:
+        raise RuntimeError(f"The audio format {extension} is not supported, please convert the audio files to mp3, flac, or wav format!")
+
+    audio_info = mutagen.File(audiopath).info
+    return int(audio_info.length * audio_info.sample_rate)
+
+
 class TTSDataset(Dataset):
    def __init__(
        self,
@ -176,7 +187,7 @@ class TTSDataset(Dataset):
        lens = []
        for item in self.samples:
            _, wav_file, *_ = _parse_sample(item)
-            audio_len = os.path.getsize(wav_file) / 16 * 8  # assuming 16bit audio
+            audio_len = get_audio_size(wav_file)
            lens.append(audio_len)
        return lens

@ -295,7 +306,7 @@ class TTSDataset(Dataset):
    def _compute_lengths(samples):
        new_samples = []
        for item in samples:
-            audio_length = os.path.getsize(item["audio_file"]) / 16 * 8  # assuming 16bit audio
+            audio_length = get_audio_size(item["audio_file"])
            text_lenght = len(item["text"])
            item["audio_length"] = audio_length
            item["text_length"] = text_lenght
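The old estimate, `os.path.getsize(path) / 16 * 8`, is simply half the file size in bytes, so it only approximates the sample count for uncompressed 16-bit mono audio and cannot be right for compressed MP3 or FLAC files. Below is a minimal sketch of the difference. It uses the standard-library `wave` module instead of `mutagen` (an assumption made to keep the example dependency-free), and `old_estimate`, `wav_num_frames`, and `write_silence` are hypothetical helper names, not part of the patch.

```python
import os
import tempfile
import wave


def old_estimate(path):
    # Previous heuristic: half the file size in bytes ("assuming 16bit audio").
    return os.path.getsize(path) / 16 * 8


def wav_num_frames(path):
    # Read the frame count from the WAV header instead of guessing from file size.
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes()


def write_silence(path, seconds, sample_rate=22050, channels=1):
    # Write 16-bit PCM silence so the example needs no external audio file.
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * channels * sample_rate * seconds)


with tempfile.TemporaryDirectory() as tmp_dir:
    mono_path = os.path.join(tmp_dir, "mono.wav")
    stereo_path = os.path.join(tmp_dir, "stereo.wav")
    write_silence(mono_path, seconds=1, channels=1)
    write_silence(stereo_path, seconds=1, channels=2)

    mono_frames = wav_num_frames(mono_path)
    stereo_frames = wav_num_frames(stereo_path)
    mono_guess = old_estimate(mono_path)
    stereo_guess = old_estimate(stereo_path)
```

For the mono file the size-based guess is close to the true frame count (it is off only by the header), while for the stereo file it is roughly twice as large, which is the kind of mismatch a header-based length avoids.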
--- a/TTS/tts/layers/xtts/tokenizer.py
+++ b/TTS/tts/layers/xtts/tokenizer.py
@ -636,6 +636,9 @@ class VoiceBpeTokenizer:
                txt = korean_transliterate(txt)
        elif lang == "ja":
            txt = japanese_cleaners(txt, self.katsu)
+        elif lang == "hi":
+            # @manmay will implement this
+            txt = basic_cleaners(txt)
        else:
            raise NotImplementedError(f"Language '{lang}' is not supported.")
        return txt
--- a/TTS/tts/layers/xtts/trainer/gpt_trainer.py
+++ b/TTS/tts/layers/xtts/trainer/gpt_trainer.py
@ -225,11 +225,11 @@ class GPTTrainer(BaseTTS):

    @torch.no_grad()
    def test_run(self, assets) -> Tuple[Dict, Dict]:  # pylint: disable=W0613
+        test_audios = {}
        if self.config.test_sentences:
            # init gpt for inference mode
            self.xtts.gpt.init_gpt_for_inference(kv_cache=self.args.kv_cache, use_deepspeed=False)
            self.xtts.gpt.eval()
-            test_audios = {}
            print(" | > Synthesizing test sentences.")
            for idx, s_info in enumerate(self.config.test_sentences):
                wav = self.xtts.synthesize(
@ -319,9 +319,12 @@ class GPTTrainer(BaseTTS):
        return self.train_step(batch, criterion)

    def on_train_epoch_start(self, trainer):
-        trainer.model.eval() # the whole model to eval
+        trainer.model.eval()  # the whole model to eval
        # put gpt model in training mode
-        trainer.model.xtts.gpt.train()
+        if hasattr(trainer.model, "module") and hasattr(trainer.model.module, "xtts"):
+            trainer.model.module.xtts.gpt.train()
+        else:
+            trainer.model.xtts.gpt.train()

    def on_init_end(self, trainer):  # pylint: disable=W0613
        # ignore similarities.pth on clearml save/upload
@ -387,7 +390,8 @@ class GPTTrainer(BaseTTS):
            else:
                loader = DataLoader(
                    dataset,
-                    batch_sampler=sampler,
+                    sampler=sampler,
+                    batch_size=config.eval_batch_size if is_eval else config.batch_size,
                    collate_fn=dataset.collate_fn,
                    num_workers=config.num_eval_loader_workers if is_eval else config.num_loader_workers,
                    pin_memory=False,
--- a/TTS/tts/layers/xtts/xtts_manager.py
+++ b/TTS/tts/layers/xtts/xtts_manager.py
@ -0,0 +1,34 @@
+import torch
+
+class SpeakerManager():
+    def __init__(self, speaker_file_path=None):
+        self.speakers = torch.load(speaker_file_path)
+
+    @property
+    def name_to_id(self):
+        return self.speakers.keys()
+    
+    @property
+    def num_speakers(self):
+        return len(self.name_to_id)
+    
+    @property
+    def speaker_names(self):
+        return list(self.name_to_id)
+    
+
+class LanguageManager():
+    def __init__(self, config):
+        self.langs = config["languages"]
+
+    @property
+    def name_to_id(self):
+        return self.langs
+    
+    @property
+    def num_languages(self):
+        return len(self.name_to_id)
+    
+    @property
+    def language_names(self):
+        return list(self.name_to_id)
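A minimal sketch of how a speaker manager of this shape might be exercised without a checkpoint on disk. `InMemorySpeakerManager` is a hypothetical stand-in that takes a dict directly instead of calling `torch.load`; the speaker names, the inner key names, and the latent values are assumptions made up for illustration. Note that `dict.keys()` returns a view object, which supports `len()` and iteration but has no `.keys()` method of its own, so the names are materialised with `list(...)`.

```python
class InMemorySpeakerManager:
    """Hypothetical stand-in for SpeakerManager that skips torch.load."""

    def __init__(self, speakers):
        # Maps speaker name -> dict of conditioning data (placeholder values here).
        self.speakers = speakers

    @property
    def name_to_id(self):
        return self.speakers.keys()

    @property
    def num_speakers(self):
        return len(self.name_to_id)

    @property
    def speaker_names(self):
        # A dict_keys view is iterable, so list() is enough to copy the names.
        return list(self.name_to_id)


manager = InMemorySpeakerManager(
    {
        "speaker_a": {"gpt_cond_latent": [0.1], "speaker_embedding": [0.2]},
        "speaker_b": {"gpt_cond_latent": [0.3], "speaker_embedding": [0.4]},
    }
)
names = manager.speaker_names
count = manager.num_speakers
# Unpacking .values() relies on the insertion order of the inner dict.
gpt_cond_latent, speaker_embedding = manager.speakers["speaker_a"].values()
```

Unpacking `.values()` into two names only works when each speaker entry holds exactly two items in a known order, so a stored file with a different layout would need explicit key lookups instead.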
--- a/TTS/tts/layers/xtts/zh_num2words.py
+++ b/TTS/tts/layers/xtts/zh_num2words.py
@ -65,7 +65,7 @@ CN_PUNCS_NONSTOP = "＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［
 CN_PUNCS = CN_PUNCS_STOP + CN_PUNCS_NONSTOP

 PUNCS = CN_PUNCS + string.punctuation
-PUNCS_TRANSFORM = str.maketrans(PUNCS, " " * len(PUNCS), "")  # replace puncs with space
+PUNCS_TRANSFORM = str.maketrans(PUNCS, "," * len(PUNCS), "")  # replace puncs with English comma


 # https://zh.wikipedia.org/wiki/全行和半行
--- a/TTS/tts/models/forward_tts.py
+++ b/TTS/tts/models/forward_tts.py
@ -241,7 +241,7 @@ class ForwardTTS(BaseTTS):
        )

        self.duration_predictor = DurationPredictor(
-            self.args.hidden_channels + self.embedded_speaker_dim,
+            self.args.hidden_channels,
            self.args.duration_predictor_hidden_channels,
            self.args.duration_predictor_kernel_size,
            self.args.duration_predictor_dropout_p,
@ -249,7 +249,7 @@ class ForwardTTS(BaseTTS):

        if self.args.use_pitch:
            self.pitch_predictor = DurationPredictor(
-                self.args.hidden_channels + self.embedded_speaker_dim,
+                self.args.hidden_channels,
                self.args.pitch_predictor_hidden_channels,
                self.args.pitch_predictor_kernel_size,
                self.args.pitch_predictor_dropout_p,
@ -263,7 +263,7 @@ class ForwardTTS(BaseTTS):

        if self.args.use_energy:
            self.energy_predictor = DurationPredictor(
-                self.args.hidden_channels + self.embedded_speaker_dim,
+                self.args.hidden_channels,
                self.args.energy_predictor_hidden_channels,
                self.args.energy_predictor_kernel_size,
                self.args.energy_predictor_dropout_p,
@ -299,7 +299,8 @@ class ForwardTTS(BaseTTS):
        if config.use_d_vector_file:
            self.embedded_speaker_dim = config.d_vector_dim
            if self.args.d_vector_dim != self.args.hidden_channels:
-                self.proj_g = nn.Conv1d(self.args.d_vector_dim, self.args.hidden_channels, 1)
+                #self.proj_g = nn.Conv1d(self.args.d_vector_dim, self.args.hidden_channels, 1)
+                self.proj_g = nn.Linear(in_features=self.args.d_vector_dim, out_features=self.args.hidden_channels)
        # init speaker embedding layer
        if config.use_speaker_embedding and not config.use_d_vector_file:
            print(" > Init speaker_embedding layer.")
@ -403,10 +404,13 @@ class ForwardTTS(BaseTTS):
        # [B, T, C]
        x_emb = self.emb(x)
        # encoder pass
-        o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask)
+        # o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask)
+        o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask, g)
        # speaker conditioning
        # TODO: try different ways of conditioning
-        if g is not None:
+        if g is not None: 
+            if hasattr(self, "proj_g"):
+                g = self.proj_g(g.view(g.shape[0], -1)).unsqueeze(-1)            
            o_en = o_en + g
        return o_en, x_mask, g, x_emb

--- a/TTS/tts/models/xtts.py
+++ b/TTS/tts/models/xtts.py
@ -11,6 +11,7 @@ from TTS.tts.layers.xtts.gpt import GPT
 from TTS.tts.layers.xtts.hifigan_decoder import HifiDecoder
 from TTS.tts.layers.xtts.stream_generator import init_stream_support
 from TTS.tts.layers.xtts.tokenizer import VoiceBpeTokenizer, split_sentence
+from TTS.tts.layers.xtts.xtts_manager import SpeakerManager, LanguageManager
 from TTS.tts.models.base_tts import BaseTTS
 from TTS.utils.io import load_fsspec

@ -272,6 +273,11 @@ class Xtts(BaseTTS):
            style_embs = []
            for i in range(0, audio.shape[1], 22050 * chunk_length):
                audio_chunk = audio[:, i : i + 22050 * chunk_length]
+
+                # ignore chunks that are too short (under 0.33 s at 22050 Hz)
+                if audio_chunk.size(-1) < 22050 * 0.33:
+                    continue
+
                mel_chunk = wav_to_mel_cloning(
                    audio_chunk,
                    mel_norms=self.mel_stats.cpu(),
@ -373,7 +379,7 @@ class Xtts(BaseTTS):

        return gpt_cond_latents, speaker_embedding

-    def synthesize(self, text, config, speaker_wav, language, **kwargs):
+    def synthesize(self, text, config, speaker_wav, language, speaker_id=None, **kwargs):
        """Synthesize speech with the given input text.

        Args:
@ -388,12 +394,6 @@ class Xtts(BaseTTS):
            `text_input` as text token IDs after tokenizer, `voice_samples` as samples used for cloning, `conditioning_latents`
            as latents used at inference.

-        """
-        return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
-
-    def inference_with_config(self, text, config, ref_audio_path, language, **kwargs):
-        """
-        inference with config
        """
        assert (
            "zh-cn" if language == "zh" else language in self.config.languages
@ -405,13 +405,18 @@ class Xtts(BaseTTS):
            "repetition_penalty": config.repetition_penalty,
            "top_k": config.top_k,
            "top_p": config.top_p,
+        }
+        settings.update(kwargs)  # allow overriding of preset settings with kwargs
+        if speaker_id is not None:
+            gpt_cond_latent, speaker_embedding = self.speaker_manager.speakers[speaker_id].values()
+            return self.inference(text, language, gpt_cond_latent, speaker_embedding, **settings)
+        settings.update({
            "gpt_cond_len": config.gpt_cond_len,
            "gpt_cond_chunk_len": config.gpt_cond_chunk_len,
            "max_ref_len": config.max_ref_len,
            "sound_norm_refs": config.sound_norm_refs,
-        }
-        settings.update(kwargs)  # allow overriding of preset settings with kwargs
-        return self.full_inference(text, ref_audio_path, language, **settings)
+        })
+        return self.full_inference(text, speaker_wav, language, **settings)

    @torch.inference_mode()
    def full_inference(
@ -515,6 +520,8 @@ class Xtts(BaseTTS):
    ):
        language = language.split("-")[0]  # remove the country code
        length_scale = 1.0 / max(speed, 0.05)
+        gpt_cond_latent = gpt_cond_latent.to(self.device)
+        speaker_embedding = speaker_embedding.to(self.device)
        if enable_text_splitting:
            text = split_sentence(text, language, self.tokenizer.char_limits[language])
        else:
@ -623,6 +630,8 @@ class Xtts(BaseTTS):
    ):
        language = language.split("-")[0]  # remove the country code
        length_scale = 1.0 / max(speed, 0.05)
+        gpt_cond_latent = gpt_cond_latent.to(self.device)
+        speaker_embedding = speaker_embedding.to(self.device)
        if enable_text_splitting:
            text = split_sentence(text, language, self.tokenizer.char_limits[language])
        else:
@ -728,6 +737,7 @@ class Xtts(BaseTTS):
        eval=True,
        strict=True,
        use_deepspeed=False,
+        speaker_file_path=None,
    ):
        """
        Loads a checkpoint from disk and initializes the model's state and tokenizer.
@ -747,6 +757,14 @@ class Xtts(BaseTTS):
        model_path = checkpoint_path or os.path.join(checkpoint_dir, "model.pth")
        vocab_path = vocab_path or os.path.join(checkpoint_dir, "vocab.json")

+        if speaker_file_path is None and checkpoint_dir is not None:
+            speaker_file_path = os.path.join(checkpoint_dir, "speakers_xtts.pth")
+
+        self.language_manager = LanguageManager(config)
+        self.speaker_manager = None
+        if speaker_file_path is not None and os.path.exists(speaker_file_path):
+            self.speaker_manager = SpeakerManager(speaker_file_path)
+
        if os.path.exists(vocab_path):
            self.tokenizer = VoiceBpeTokenizer(vocab_file=vocab_path)

--- a/TTS/tts/utils/text/punctuation.py
+++ b/TTS/tts/utils/text/punctuation.py
@ -15,7 +15,6 @@ class PuncPosition(Enum):
    BEGIN = 0
    END = 1
    MIDDLE = 2
-    ALONE = 3


 class Punctuation:
@ -92,7 +91,7 @@ class Punctuation:
            return [text], []
        # the text is only punctuations
        if len(matches) == 1 and matches[0].group() == text:
-            return [], [_PUNC_IDX(text, PuncPosition.ALONE)]
+            return [], [_PUNC_IDX(text, PuncPosition.BEGIN)]
        # build a punctuation map to be used later to restore punctuations
        puncs = []
        for match in matches:
@ -107,11 +106,14 @@ class Punctuation:
        for idx, punc in enumerate(puncs):
            split = text.split(punc.punc)
            prefix, suffix = split[0], punc.punc.join(split[1:])
+            text = suffix
+            if prefix == "":
+                # We don't want to insert an empty string in case of initial punctuation
+                continue
            splitted_text.append(prefix)
            # if the text does not end with a punctuation, add it to the last item
            if idx == len(puncs) - 1 and len(suffix) > 0:
                splitted_text.append(suffix)
-            text = suffix
        return splitted_text, puncs

    @classmethod
@ -127,10 +129,10 @@ class Punctuation:
            ['This is', 'example'], ['.', '!'] -> "This is. example!"

        """
-        return cls._restore(text, puncs, 0)
+        return cls._restore(text, puncs)

    @classmethod
-    def _restore(cls, text, puncs, num):  # pylint: disable=too-many-return-statements
+    def _restore(cls, text, puncs):  # pylint: disable=too-many-return-statements
        """Auxiliary method for Punctuation.restore()"""
        if not puncs:
            return text
@ -142,21 +144,18 @@ class Punctuation:
        current = puncs[0]

        if current.position == PuncPosition.BEGIN:
-            return cls._restore([current.punc + text[0]] + text[1:], puncs[1:], num)
+            return cls._restore([current.punc + text[0]] + text[1:], puncs[1:])

        if current.position == PuncPosition.END:
-            return [text[0] + current.punc] + cls._restore(text[1:], puncs[1:], num + 1)
-
-        if current.position == PuncPosition.ALONE:
-            return [current.mark] + cls._restore(text, puncs[1:], num + 1)
+            return [text[0] + current.punc] + cls._restore(text[1:], puncs[1:])

        # POSITION == MIDDLE
        if len(text) == 1:  # pragma: nocover
            # a corner case where the final part of an intermediate
            # mark (I) has not been phonemized
-            return cls._restore([text[0] + current.punc], puncs[1:], num)
+            return cls._restore([text[0] + current.punc], puncs[1:])

-        return cls._restore([text[0] + current.punc + text[1]] + text[2:], puncs[1:], num)
+        return cls._restore([text[0] + current.punc + text[1]] + text[2:], puncs[1:])


 # if __name__ == "__main__":
--- a/TTS/utils/generic_utils.py
+++ b/TTS/utils/generic_utils.py
@ -36,9 +36,7 @@ def get_git_branch():
        current.replace("* ", "")
    except subprocess.CalledProcessError:
        current = "inside_docker"
-    except FileNotFoundError:
-        current = "unknown"
-    except StopIteration:
+    except (FileNotFoundError, StopIteration):
        current = "unknown"
    return current
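For context, `next()` on an exhausted generator raises `StopIteration`, which is the crash the commit message describes when `git branch` prints nothing or no line is marked with `*`. Below is a minimal sketch of an alternative that avoids the exception by passing a default to `next()`. `parse_current_branch` is a hypothetical helper, not the function in `generic_utils.py`, and it assumes the current branch is the line starting with `*`.

```python
def parse_current_branch(git_branch_output, fallback="unknown"):
    """Return the branch marked with '*' in `git branch` output, or a fallback."""
    starred = (line for line in git_branch_output.split("\n") if line.startswith("*"))
    # next() with a default returns None instead of raising StopIteration.
    current = next(starred, None)
    if current is None:
        return fallback
    # str.replace returns a new string, so the result must be kept.
    return current.replace("* ", "")


normal = parse_current_branch("  dev\n* main\n")
empty = parse_current_branch("")
no_star = parse_current_branch("  dev\n  main\n")
```

Catching `StopIteration`, as the patch does, and supplying a default to `next()` both end in the same `"unknown"` fallback; the default form just keeps the empty-output case out of the exception path.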

--- a/TTS/utils/manage.py
+++ b/TTS/utils/manage.py
@ -1,5 +1,6 @@
 import json
 import os
+import re
 import tarfile
 import zipfile
 from pathlib import Path
@ -10,7 +11,7 @@ import fsspec
 import requests
 from tqdm import tqdm

-from TTS.config import load_config
+from TTS.config import load_config, read_json_with_comments
 from TTS.utils.generic_utils import get_user_data_dir

 LICENSE_URLS = {
@ -26,7 +27,6 @@ LICENSE_URLS = {
 }


-
 class ModelManager(object):
    tqdm_progress = None
    """Manage TTS models defined in .models.json.
@ -65,30 +65,7 @@ class ModelManager(object):
        Args:
            file_path (str): path to .models.json.
        """
-        with open(file_path, "r", encoding="utf-8") as json_file:
-            self.models_dict = json.load(json_file)
-
-    def add_cs_api_models(self, model_list: List[str]):
-        """Add list of Coqui Studio model names that are returned from the api
-
-        Each has the following format `<coqui_studio_model>/en/<speaker_name>/<coqui_studio_model>`
-        """
-
-        def _add_model(model_name: str):
-            if not "coqui_studio" in model_name:
-                return
-            model_type, lang, dataset, model = model_name.split("/")
-            if model_type not in self.models_dict:
-                self.models_dict[model_type] = {}
-            if lang not in self.models_dict[model_type]:
-                self.models_dict[model_type][lang] = {}
-            if dataset not in self.models_dict[model_type][lang]:
-                self.models_dict[model_type][lang][dataset] = {}
-            if model not in self.models_dict[model_type][lang][dataset]:
-                self.models_dict[model_type][lang][dataset][model] = {}
-
-        for model_name in model_list:
-            _add_model(model_name)
+        self.models_dict = read_json_with_comments(file_path)

    def _list_models(self, model_type, model_count=0):
        if self.verbose:
@ -276,13 +253,15 @@ class ModelManager(object):
            model_item["model_url"] = model_item["hf_url"]
        elif "fairseq" in model_item["model_name"]:
            model_item["model_url"] = "https://coqui.gateway.scarf.sh/fairseq/"
+        elif "xtts" in model_item["model_name"]:
+            model_item["model_url"] = "https://coqui.gateway.scarf.sh/xtts/"
        return model_item

    def _set_model_item(self, model_name):
        # fetch model info from the dict
-        model_type, lang, dataset, model = model_name.split("/")
-        model_full_name = f"{model_type}--{lang}--{dataset}--{model}"
        if "fairseq" in model_name:
+            model_type = "tts_models"
+            lang = model_name.split("/")[1]
            model_item = {
                "model_type": "tts_models",
                "license": "CC BY-NC 4.0",
@ -291,10 +270,38 @@ class ModelManager(object):
                "description": "this model is released by Meta under Fairseq repo. Visit https://github.com/facebookresearch/fairseq/tree/main/examples/mms for more info.",
            }
            model_item["model_name"] = model_name
+        elif "xtts" in model_name and len(model_name.split("/")) != 4:
+            # loading xtts models with only model name (e.g. xtts_v2.0.2)
+            # check model name has the version number with regex
+            version_regex = r"v\d+\.\d+\.\d+"
+            if re.search(version_regex, model_name):
+                model_version = model_name.split("_")[-1]
+            else:
+                model_version = "main"
+            model_type = "tts_models"
+            lang = "multilingual"
+            dataset = "multi-dataset"
+            model = model_name
+            model_item = {
+                "default_vocoder": None,
+                "license": "CPML",
+                "contact": "info@coqui.ai",
+                "tos_required": True,
+                "hf_url": [
+                    f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/model.pth",
+                    f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/config.json",
+                    f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/vocab.json",
+                    f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/hash.md5",
+                    f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/speakers_xtts.pth",
+                ],
+            }
        else:
            # get model from models.json
+            model_type, lang, dataset, model = model_name.split("/")
            model_item = self.models_dict[model_type][lang][dataset][model]
            model_item["model_type"] = model_type
+
+        model_full_name = f"{model_type}--{lang}--{dataset}--{model}"
        md5hash = model_item["model_hash"] if "model_hash" in model_item else None
        model_item = self.set_model_url(model_item)
        return model_item, model_full_name, model, md5hash
@ -303,9 +310,9 @@ class ModelManager(object):
    def ask_tos(model_full_path):
        """Ask the user to agree to the terms of service"""
        tos_path = os.path.join(model_full_path, "tos_agreed.txt")
-        print(" > You must agree to the terms of service to use this model.")
-        print(" | > Please see the terms of service at https://coqui.ai/cpml.txt")
-        print(' | > "I have read, understood and agreed to the Terms and Conditions." - [y/n]')
+        print(" > You must confirm the following:")
+        print(' | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"')
+        print(' | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]')
        answer = input(" | | > ")
        if answer.lower() == "y":
            with open(tos_path, "w", encoding="utf-8") as f:
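
Review note: the new `xtts` branch in `_set_model_item` can be read as the standalone sketch below. It only mirrors the logic visible in this hunk; the helper names `resolve_xtts_version` and `xtts_file_urls` are hypothetical and are not part of `ModelManager`.

```python
import re
from typing import List

# Values copied from the hunk above; the constant names are illustrative only.
XTTS_BASE_URL = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2"
XTTS_FILES = ["model.pth", "config.json", "vocab.json", "hash.md5", "speakers_xtts.pth"]
VERSION_REGEX = r"v\d+\.\d+\.\d+"


def resolve_xtts_version(model_name: str) -> str:
    """Return the version suffix of a short XTTS name, or "main" if there is none."""
    if re.search(VERSION_REGEX, model_name):
        # e.g. "xtts_v2.0.2" -> "v2.0.2"
        return model_name.split("_")[-1]
    return "main"


def xtts_file_urls(model_name: str) -> List[str]:
    """Build the list of download URLs for the resolved version."""
    version = resolve_xtts_version(model_name)
    return [f"{XTTS_BASE_URL}/{version}/{file_name}" for file_name in XTTS_FILES]


if __name__ == "__main__":
    for name in ("xtts_v2.0.2", "xtts"):
        print(name, "->", resolve_xtts_version(name))
```

Note that the branch only triggers for names that do not have the four-part `<model_type>/<language>/<dataset>/<model_name>` form, so full names still go through `.models.json`.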
--- a/TTS/utils/synthesizer.py
+++ b/TTS/utils/synthesizer.py
@ -264,6 +264,7 @@ class Synthesizer(nn.Module):
        style_text=None,
        reference_wav=None,
        reference_speaker_name=None,
+        split_sentences: bool = True,
        **kwargs,
    ) -> List[int]:
        """🐸 TTS magic. Run all the models and generate speech.
@ -277,6 +278,8 @@ class Synthesizer(nn.Module):
            style_text ([type], optional): transcription of style_wav for Capacitron. Defaults to None.
            reference_wav ([type], optional): reference waveform for voice conversion. Defaults to None.
            reference_speaker_name ([type], optional): speaker id of reference waveform. Defaults to None.
+            split_sentences (bool, optional): split the input text into sentences. Defaults to True.
+            **kwargs: additional arguments to pass to the TTS model.
        Returns:
            List[int]: [description]
        """
@ -289,8 +292,10 @@ class Synthesizer(nn.Module):
            )

        if text:
-            sens = self.split_into_sentences(text)
-            print(" > Text splitted to sentences.")
+            sens = [text]
+            if split_sentences:
+                print(" > Text split into sentences.")
+                sens = self.split_into_sentences(text)
            print(sens)

        # handle multi-speaker
@ -300,7 +305,7 @@ class Synthesizer(nn.Module):
        speaker_embedding = None
        speaker_id = None
        if self.tts_speakers_file or hasattr(self.tts_model.speaker_manager, "name_to_id"):
-            if speaker_name and isinstance(speaker_name, str):
+            if speaker_name and isinstance(speaker_name, str) and self.tts_config.model != "xtts":
                if self.tts_config.use_d_vector_file:
                    # get the average speaker embedding from the saved d_vectors.
                    speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
@ -330,7 +335,9 @@ class Synthesizer(nn.Module):
        # handle multi-lingual
        language_id = None
        if self.tts_languages_file or (
-            hasattr(self.tts_model, "language_manager") and self.tts_model.language_manager is not None
+            hasattr(self.tts_model, "language_manager")
+            and self.tts_model.language_manager is not None
+            and self.tts_config.model != "xtts"
        ):
            if len(self.tts_model.language_manager.name_to_id) == 1:
                language_id = list(self.tts_model.language_manager.name_to_id.values())[0]
@ -361,6 +368,7 @@ class Synthesizer(nn.Module):
        if (
            speaker_wav is not None
            and self.tts_model.speaker_manager is not None
+            and hasattr(self.tts_model.speaker_manager, "encoder_ap")
            and self.tts_model.speaker_manager.encoder_ap is not None
        ):
            speaker_embedding = self.tts_model.speaker_manager.compute_embedding_from_clip(speaker_wav)
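
Review note: a minimal sketch of what the new `split_sentences` flag does to the input text, assuming a naive regex splitter in place of the real `Synthesizer.split_into_sentences`. Both function names below are hypothetical stand-ins.

```python
import re
from typing import List


def naive_split_into_sentences(text: str) -> List[str]:
    """Hypothetical stand-in: split on whitespace that follows ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def prepare_sentences(text: str, split_sentences: bool = True) -> List[str]:
    """Mirror the hunk above: keep the text whole unless splitting is requested."""
    sens = [text]
    if split_sentences:
        sens = naive_split_into_sentences(text)
    return sens


if __name__ == "__main__":
    sample = "Hello there. How are you?"
    print(prepare_sentences(sample))
    print(prepare_sentences(sample, split_sentences=False))
```

With `split_sentences=False` the model receives the whole text as a single item, which is why the docs changes in this PR warn about VRAM use and the context length limit.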
--- a/docs/source/configuration.md
+++ b/docs/source/configuration.md
@ -56,4 +56,4 @@ ModelConfig()

 In the example above, ```ModelConfig()``` is the final configuration that the model receives and it has all the fields necessary for the model.

-We host pre-defined model configurations under ```TTS/<model_class>/configs/```.Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.
+We host pre-defined model configurations under ```TTS/<model_class>/configs/```. Although we recommend a unified config class, you can decompose it as you like for your custom models, as long as all the fields for the trainer, model, and inference APIs are provided.
--- a/docs/source/finetuning.md
+++ b/docs/source/finetuning.md
@ -21,7 +21,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
    Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
    speech dataset and achieve reasonable results with only a couple of hours of data.

-    However, note that, fine-tuning does not ensure great results. The model performance is still depends on the
+    However, note that fine-tuning does not guarantee great results. The model performance still depends on the
    {ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore,
    it still takes a bit of tinkering.

@ -41,7 +41,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
    tts --list_models
    ```

-    The command above lists the the models in a naming format as ```<model_type>/<language>/<dataset>/<model_name>```.
+    The command above lists the models in the naming format ```<model_type>/<language>/<dataset>/<model_name>```.

    Or you can manually check the `.models.json` file in the project directory.

--- a/docs/source/formatting_your_dataset.md
+++ b/docs/source/formatting_your_dataset.md
@ -7,7 +7,7 @@ If you have a single audio file and you need to split it into clips, there are d

 It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using `wav` file format.

-Let's assume you created the audio clips and their transcription. You can collect all your clips under a folder. Let's call this folder `wavs`.
+Let's assume you created the audio clips and their transcription. You can collect all your clips in a folder. Let's call this folder `wavs`.

 ```
 /wavs
@ -17,7 +17,7 @@ Let's assume you created the audio clips and their transcription. You can collec
  ...
 ```

-You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimitered by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.
+You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimited by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.

 We recommend the following format delimited by `|`. In the following example, `audio1`, `audio2` refer to files `audio1.wav`, `audio2.wav` etc.

@ -55,7 +55,7 @@ For more info about dataset qualities and properties check our [post](https://gi

 After you collect and format your dataset, you need to check two things. Whether you need a `formatter` and a `text_cleaner`. The `formatter` loads the text file (created above) as a list and the `text_cleaner` performs a sequence of text normalization operations that converts the raw text into the spoken representation (e.g. converting numbers to text, acronyms, and symbols to the spoken format).

-If you use a different dataset format then the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.
+If you use a different dataset format than the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.

 If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`.

--- a/docs/source/implementing_a_new_language_frontend.md
+++ b/docs/source/implementing_a_new_language_frontend.md
@ -2,11 +2,11 @@

 - Language frontends are located under `TTS.tts.utils.text`
 - Each special language has a separate folder.
- Each folder containst all the utilities for processing the text input.
+- Each folder contains all the utilities for processing the text input.
 - `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities
 from the previous step and used to convert the text to phonemes or graphemes for the model.
 - After you implement your phonemizer, you need to add it to the `TTS/tts/utils/text/phonemizers/__init__.py` to be able to
 map the language code in the model config - `config.phoneme_language` - to the phonemizer class and initiate the phonemizer automatically.
 - You should also add tests to `tests/text_tests` if you want to make a PR.

-We suggest you to check the available implementations as reference. Good luck!
+We suggest you check the available implementations for reference. Good luck!
--- a/docs/source/implementing_a_new_model.md
+++ b/docs/source/implementing_a_new_model.md
@ -145,7 +145,7 @@ class MyModel(BaseTTS):
        Args:
            ap (AudioProcessor): audio processor used at training.
            batch (Dict): Model inputs used at the previous training step.
-            outputs (Dict): Model outputs generated at the previoud training step.
+            outputs (Dict): Model outputs generated at the previous training step.

        Returns:
            Tuple[Dict, np.ndarray]: training plots and output waveform.
@ -183,7 +183,7 @@ class MyModel(BaseTTS):
        ...

    def get_optimizer(self) -> Union["Optimizer", List["Optimizer"]]:
-        """Setup an return optimizer or optimizers."""
+        """Set up and return the optimizer or optimizers."""
        pass

    def get_lr(self) -> Union[float, List[float]]:
--- a/docs/source/inference.md
+++ b/docs/source/inference.md
@ -172,48 +172,6 @@ tts.tts_with_vc_to_file(
 )
 ```

-#### Example text to speech using [🐸Coqui Studio](https://coqui.ai) models.
-
-You can use all of your available speakers in the studio.
-[🐸Coqui Studio](https://coqui.ai) API token is required. You can get it from the [account page](https://coqui.ai/account).
-You should set the `COQUI_STUDIO_TOKEN` environment variable to use the API token.
-
-```python
-# If you have a valid API token set you will see the studio speakers as separate models in the list.
-# The name format is coqui_studio/en/<studio_speaker_name>/coqui_studio
-models = TTS().list_models()
-# Init TTS with the target studio speaker
-tts = TTS(model_name="coqui_studio/en/Torcull Diarmuid/coqui_studio", progress_bar=False)
-# Run TTS
-tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH)
-# Run TTS with emotion and speed control
-tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH, emotion="Happy", speed=1.5)
-```
-
-If you just need 🐸 Coqui Studio speakers, you can use `CS_API`. It is a wrapper around the 🐸 Coqui Studio API.
-
-```python
-from TTS.api import CS_API
-
-# Init 🐸 Coqui Studio API
-# you can either set the API token as an environment variable `COQUI_STUDIO_TOKEN` or pass it as an argument.
-
-# XTTS - Best quality and life-like speech in multiple languages. See https://docs.coqui.ai/reference/samples_xtts_create for supported languages.
-api = CS_API(api_token=<token>, model="XTTS")
-api.speakers  # all the speakers are available with all the models.
-api.list_speakers()
-api.list_voices()
-wav, sample_rate = api.tts(text="This is a test.", speaker=api.speakers[0].name, emotion="Happy", language="en", speed=1.5)
-
-# V1 - Fast and lightweight TTS in EN with emotion control.
-api = CS_API(api_token=<token>, model="V1")
-api.speakers
-api.emotions  # emotions are only for the V1 model.
-api.list_speakers()
-api.list_voices()
-wav, sample_rate = api.tts(text="This is a test.", speaker=api.speakers[0].name, emotion="Happy", speed=1.5)
-```
-
 #### Example text to speech using **Fairseq models in ~1100 languages** 🤯.
 For these models use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.

--- a/docs/source/marytts.md
+++ b/docs/source/marytts.md
@ -2,13 +2,13 @@

 ## What is Mary-TTS?

-[Mary (Modular Architecture for Research in sYynthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKI’s](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.
+[Mary (Modular Architecture for Research in sYnthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKI’s](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.
 MaryTTS has been around for a very! long time. Version 3.0 even dates back to 2006, long before Deep Learning was a broadly known term and the last official release was version 5.2 in 2016.
 You can check out this OpenVoice-Tech page to learn more: https://openvoice-tech.net/index.php/MaryTTS

 ## Why Mary-TTS compatibility is relevant

-Due to it's open-source nature, relatively high quality voices and fast synthetization speed Mary-TTS was a popular choice in the past and many tools implemented API support over the years like screen-readers (NVDA + SpeechHub), smart-home HUBs (openHAB, Home Assistant) or voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.
+Due to its open-source nature, relatively high-quality voices and fast synthesis speed, Mary-TTS was a popular choice in the past, and over the years many tools implemented API support for it, such as screen readers (NVDA + SpeechHub), smart-home hubs (openHAB, Home Assistant) and voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.

 ## API and code examples

@ -40,4 +40,4 @@ You can enter the same URLs in your browser and check-out the results there as w
 ### How it works and limitations

 A classic Mary-TTS server would usually show all installed locales and voices via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE` for processing. For Coqui-TTS we usually start the server with one specific locale and model and thus cannot return all available options. Instead we return the active locale and use the model name as "voice". Since we only have one active model and always want to return a WAV-file, we currently ignore all other processing parameters except `INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always return `u` (undefined).
-We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
+We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
--- a/docs/source/models/tortoise.md
+++ b/docs/source/models/tortoise.md
@ -1,6 +1,6 @@
 # 🐢 Tortoise
 Tortoise is a very expressive TTS system with impressive voice cloning capabilities. It is based on a GPT-like autoregressive acoustic model that converts input
-text to discritized acouistic tokens, a diffusion model that converts these tokens to melspeectrogram frames and a Univnet vocoder to convert the spectrograms to
+text to discretized acoustic tokens, a diffusion model that converts these tokens to melspectrogram frames and a Univnet vocoder to convert the spectrograms to
 the final audio signal. The important downside is that Tortoise is very slow compared to the parallel TTS models like VITS.

 Big thanks to 👑[@manmay-nakhashi](https://github.com/manmay-nakhashi) who helped us implement Tortoise in 🐸TTS.
--- a/docs/source/models/xtts.md
+++ b/docs/source/models/xtts.md
@ -21,7 +21,7 @@ a few tricks to make it faster and support streaming inference.
 - Across the board quality improvements.

 ### Code
-Current implementation only supports inference.
+Current implementation only supports inference and GPT encoder training.

 ### Languages
 As of now, XTTS-v2 supports 16 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko).
@ -36,35 +36,39 @@ Come and join in our 🐸Community. We're active on [Discord](https://discord.gg
 You can also mail us at info@coqui.ai.

 ### Inference
-#### 🐸TTS API
-
-##### Single reference
-```python
-from TTS.api import TTS
-tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
-
-# generate speech by cloning a voice using default settings
-tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
-                file_path="output.wav",
-                speaker_wav=["/path/to/target/speaker.wav"],
-                language="en")
-```
-
-##### Multiple references
-```python
-from TTS.api import TTS
-tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
-
-# generate speech by cloning a voice using default settings
-tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
-                file_path="output.wav",
-                speaker_wav=["/path/to/target/speaker.wav", "/path/to/target/speaker_2.wav", "/path/to/target/speaker_3.wav"],
-                language="en")
-```

 #### 🐸TTS Command line

-##### Single reference
+You can check all supported languages with the following command:
+
+```console
+ tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
+    --list_language_idx
+```
+
+You can check all available Coqui speakers with the following command:
+
+```console
+ tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
+    --list_speaker_idx
+```
+
+##### Coqui speakers
+You can run inference with one of the available speakers using the following command:
+
+```console
+ tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
+     --text "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent." \
+     --speaker_idx "Ana Florence" \
+     --language_idx en \
+     --use_cuda true
+```
+
+##### Clone a voice
+You can clone a speaker voice using a single or multiple references:
+
+###### Single reference
+
 ```console
 tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
     --text "Bugün okula gitmek istemiyorum." \
@ -73,7 +77,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
     --use_cuda true
 ```

-##### Multiple references
+###### Multiple references
 ```console
 tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
     --text "Bugün okula gitmek istemiyorum." \
@ -91,15 +95,102 @@ or for all wav files in a directory you can use:
     --use_cuda true
 ```

+#### 🐸TTS API

-#### model directly
+##### Clone a voice
+You can clone a speaker voice using a single or multiple references:

-If you want to be able to run with `use_deepspeed=True` and enjoy the speedup, you need to install deepspeed first.
+###### Single reference
+
+By default, the text is split into sentences and audio is generated for each sentence. The audio segments are then concatenated to produce the final audio.
+You can optionally disable sentence splitting for better coherence, at the cost of more VRAM and the risk of hitting the model's context length limit.
+
+```python
+from TTS.api import TTS
+tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
+
+# generate speech by cloning a voice using default settings
+tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
+                file_path="output.wav",
+                speaker_wav=["/path/to/target/speaker.wav"],
+                language="en",
+                split_sentences=True
+                )
+```
+
+###### Multiple references
+
+You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
+
+```python
+from TTS.api import TTS
+
+# using the default version set in 🐸TTS
+tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
+
+# using a specific version
+# 👀 see the branch names for versions on https://huggingface.co/coqui/XTTS-v2/tree/main
+# ❗some versions might be incompatible with the API
+tts = TTS("xtts_v2.0.2", gpu=True)
+
+# getting the latest XTTS_v2
+tts = TTS("xtts", gpu=True)
+
+# generate speech by cloning a voice using default settings
+tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
+                file_path="output.wav",
+                speaker_wav=["/path/to/target/speaker.wav", "/path/to/target/speaker_2.wav", "/path/to/target/speaker_3.wav"],
+                language="en")
+```
+
+##### Coqui speakers
+
+You can run inference with one of the available speakers using the following code:
+
+```python
+from TTS.api import TTS
+tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
+
+# generate speech with one of the available Coqui speakers
+tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
+                file_path="output.wav",
+                speaker="Ana Florence",
+                language="en",
+                split_sentences=True
+                )
+```
+
+
+#### 🐸TTS Model API
+
+To use the model API, you need to download the model files and pass config and model file paths manually.
+
+#### Manual Inference
+
+If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.

 ```console
 pip install deepspeed==0.10.3
 ```

+##### Inference parameters
+
+- `text`: The text to be synthesized.
+- `language`: The language of the text to be synthesized.
+- `gpt_cond_latent`: The latent vector you get with `get_conditioning_latents`. (You can cache it for faster inference with the same speaker.)
+- `speaker_embedding`: The speaker embedding you get with `get_conditioning_latents`. (You can cache it for faster inference with the same speaker.)
+- `temperature`: The softmax temperature of the autoregressive model. Defaults to 0.65.
+- `length_penalty`: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs. Defaults to 1.0.
+- `repetition_penalty`: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc. Defaults to 2.0.
+- `top_k`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50.
+- `top_p`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8.
+- `speed`: The speed rate of the generated audio. Defaults to 1.0. (Can produce artifacts if far from 1.0.)
+- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows effectively unlimited input length but might lose important context between sentences. Defaults to True.
+
+
+##### Inference
+
+
 ```python
 import os
 import torch
@ -129,7 +220,7 @@ torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
 ```


-#### streaming inference
+##### Streaming manually

 Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
 Streaming inference is typically slower than regular inference, but it lets you get the first chunk of audio sooner.
@ -175,6 +266,50 @@ torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)

 ### Training

+#### Easy training
+To make `XTTS_v2` GPT encoder training easier for beginners, we built a Gradio demo that implements the whole fine-tuning pipeline. The Gradio demo lets the user easily do the following steps:
+
+- Preprocess the uploaded audio file or files into the format expected by the 🐸 TTS coqui formatter
+- Train the XTTS GPT encoder with the processed data
+- Run inference using the fine-tuned model
+
+The user can run this gradio demo locally or remotely using a Colab Notebook.
+
+##### Run demo on Colab
+To make `XTTS_v2` fine-tuning more accessible for users who do not have a good GPU available, we provide a Google Colab Notebook.
+
+The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
+
+To learn how to use this Colab Notebook please check the [XTTS fine-tuning video]().
+
+If you are not able to access the video, follow these steps:
+
+1. Open the Colab notebook and start the demo by running the first two cells (ignore pip install errors in the first one).
+2. Click on the link shown after "Running on public URL:" in the output of the second cell.
+3. On the first tab (1 - Data processing), select the audio file or files, wait for the upload to finish, click the button "Step 1 - Create dataset", and wait until the dataset processing is done.
+4. As soon as the dataset processing is done, go to the second tab (2 - Fine-tuning XTTS Encoder), press the button "Step 2 - Run the training", and wait until the training is finished. Note that it can take up to 40 minutes.
+5. As soon as the training is done, go to the third tab (3 - Inference), click the button "Step 3 - Load Fine-tuned XTTS model", and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking the button "Step 4 - Inference".
+
+
+##### Run demo locally
+
+To run the demo locally you need to do the following steps:
+1. Install 🐸 TTS following the instructions available [here](https://tts.readthedocs.io/en/dev/installation.html#installation).
+2. Install the Gradio demo requirements with the command `python3 -m pip install -r TTS/demos/xtts_ft_demo/requirements.txt`
+3. Run the Gradio demo using the command `python3 TTS/demos/xtts_ft_demo/xtts_demo.py`
+4. Follow the steps presented in the [tutorial video](https://www.youtube.com/watch?v=8tpDiiouGxc&feature=youtu.be) to be able to fine-tune and test the fine-tuned model.
+
+
+If you are not able to access the video, here is what you need to do:
+
+1. On the first tab (1 - Data processing), select the audio file or files and wait for the upload to finish.
+2. Click on the button "Step 1 - Create dataset" and then wait until the dataset processing is done.
+3. Go to the second tab (2 - Fine-tuning XTTS Encoder), press the button "Step 2 - Run the training", and wait until the training is finished. It will take some time.
+4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
+5. Now you can run inference with the model by clicking on the button "Step 4 - Inference".
+
+#### Advanced training
+
 A recipe for `XTTS_v2` GPT encoder training using `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py

 You need to change the fields of the `BaseDatasetConfig` to match your dataset and then update `GPTArgs` and `GPTTrainerConfig` fields as you need. By default, it will use the same parameters that XTTS v1.1 model was trained with. To speed up the model convergence, as default, it will also download the XTTS v1.1 checkpoint and load it.
@ -222,6 +357,7 @@ torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
 ```


+
 ## References and Acknowledgements
 - VallE: https://arxiv.org/abs/2301.02111
 - Tortoise Repo: https://github.com/neonbjb/tortoise-tts
--- a/requirements.txt
+++ b/requirements.txt
@ -17,6 +17,7 @@ pyyaml>=6.0
 fsspec>=2023.6.0 # <= 2023.9.1 makes aux tests fail
 aiohttp>=3.8.1
 packaging>=23.1
+mutagen==1.47.0
 # deps for examples
 flask>=2.0.1
 # deps for inference
@ -27,7 +28,7 @@ pandas>=1.4,<2.0
 # deps for training
 matplotlib>=3.7.0
 # coqui stack
-trainer>=0.0.32
+trainer>=0.0.36
 # config management
 coqpit>=0.0.16
 # chinese g2p deps
--- a/tests/api_tests/init.py
+++ b/tests/api_tests/init.py
--- a/tests/api_tests/test_python_api.py
+++ b/tests/api_tests/test_python_api.py
@ -1,113 +0,0 @@
-import os
-import unittest
-
-from tests import get_tests_data_path, get_tests_output_path
-from TTS.api import CS_API, TTS
-
-OUTPUT_PATH = os.path.join(get_tests_output_path(), "test_python_api.wav")
-cloning_test_wav_path = os.path.join(get_tests_data_path(), "ljspeech/wavs/LJ001-0028.wav")
-
-
-is_coqui_available = os.environ.get("COQUI_STUDIO_TOKEN")
-
-
-if is_coqui_available:
-
-    class CS_APITest(unittest.TestCase):
-        def test_speakers(self):
-            tts = CS_API()
-            self.assertGreater(len(tts.speakers), 1)
-
-        def test_emotions(self):
-            tts = CS_API()
-            self.assertGreater(len(tts.emotions), 1)
-
-        def test_list_calls(self):
-            tts = CS_API()
-            self.assertGreater(len(tts.list_voices()), 1)
-            self.assertGreater(len(tts.list_speakers()), 1)
-            self.assertGreater(len(tts.list_all_speakers()), 1)
-            self.assertGreater(len(tts.list_speakers_as_tts_models()), 1)
-
-        def test_name_to_speaker(self):
-            tts = CS_API()
-            speaker_name = tts.list_speakers_as_tts_models()[0].split("/")[2]
-            speaker = tts.name_to_speaker(speaker_name)
-            self.assertEqual(speaker.name, speaker_name)
-
-        def test_tts(self):
-            tts = CS_API()
-            wav, sr = tts.tts(text="This is a test.", speaker_name=tts.list_speakers()[0].name)
-            self.assertEqual(sr, 44100)
-            self.assertGreater(len(wav), 1)
-
-    class TTSTest(unittest.TestCase):
-        def test_single_speaker_model(self):
-            tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False, gpu=False)
-
-            error_raised = False
-            try:
-                tts.tts_to_file(text="Ich bin eine Testnachricht.", speaker="Thorsten", language="de")
-            except ValueError:
-                error_raised = True
-
-            tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
-
-            self.assertTrue(error_raised)
-            self.assertFalse(tts.is_multi_speaker)
-            self.assertFalse(tts.is_multi_lingual)
-            self.assertIsNone(tts.speakers)
-            self.assertIsNone(tts.languages)
-
-        def test_studio_model(self):
-            tts = TTS(model_name="coqui_studio/en/Zacharie Aimilios/coqui_studio")
-            tts.tts_to_file(text="This is a test.")
-
-            # check speed > 2.0 raises error
-            raised_error = False
-            try:
-                _ = tts.tts(text="This is a test.", speed=4.0, emotion="Sad")  # should raise error with speed > 2.0
-            except ValueError:
-                raised_error = True
-            self.assertTrue(raised_error)
-
-            # check emotion is invalid
-            raised_error = False
-            try:
-                _ = tts.tts(text="This is a test.", speed=2.0, emotion="No Emo")  # should raise error with speed > 2.0
-            except ValueError:
-                raised_error = True
-            self.assertTrue(raised_error)
-
-            # check valid call
-            wav = tts.tts(text="This is a test.", speed=2.0, emotion="Sad")
-            self.assertGreater(len(wav), 0)
-
-        def test_fairseq_model(self):  # pylint: disable=no-self-use
-            tts = TTS(model_name="tts_models/eng/fairseq/vits")
-            tts.tts_to_file(text="This is a test.")
-
-        def test_multi_speaker_multi_lingual_model(self):
-            tts = TTS()
-            tts.load_tts_model_by_name(tts.models[0])  # YourTTS
-            tts.tts_to_file(
-                text="Hello world!", speaker=tts.speakers[0], language=tts.languages[0], file_path=OUTPUT_PATH
-            )
-
-            self.assertTrue(tts.is_multi_speaker)
-            self.assertTrue(tts.is_multi_lingual)
-            self.assertGreater(len(tts.speakers), 1)
-            self.assertGreater(len(tts.languages), 1)
-
-        def test_voice_cloning(self):  # pylint: disable=no-self-use
-            tts = TTS()
-            tts.load_tts_model_by_name("tts_models/multilingual/multi-dataset/your_tts")
-            tts.tts_to_file("Hello world!", speaker_wav=cloning_test_wav_path, language="en", file_path=OUTPUT_PATH)
-
-        def test_voice_conversion(self):  # pylint: disable=no-self-use
-            tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False, gpu=False)
-            tts.voice_conversion_to_file(
-                source_wav=cloning_test_wav_path,
-                target_wav=cloning_test_wav_path,
-                file_path=OUTPUT_PATH,
-            )
--- a/tests/api_tests/test_synthesize_api.py
+++ b/tests/api_tests/test_synthesize_api.py
@ -1,25 +0,0 @@
-import os
-
-from tests import get_tests_output_path, run_cli
-
-
-def test_synthesize():
-    """Test synthesize.py with diffent arguments."""
-    output_path = os.path.join(get_tests_output_path(), "output.wav")
-
-    # 🐸 Coqui studio model
-    run_cli(
-        'tts --model_name "coqui_studio/en/Torcull Diarmuid/coqui_studio" '
-        '--text "This is it" '
-        f'--out_path "{output_path}"'
-    )
-
-    # 🐸 Coqui studio model with speed arg.
-    run_cli(
-        'tts --model_name "coqui_studio/en/Torcull Diarmuid/coqui_studio" '
-        '--text "This is it but slow" --speed 0.1'
-        f'--out_path "{output_path}"'
-    )
-
-    # test pipe_out command
-    run_cli(f'tts --text "test." --pipe_out --out_path "{output_path}" | aplay')
--- a/tests/data/ljspeech/metadata_flac.csv
+++ b/tests/data/ljspeech/metadata_flac.csv
@ -0,0 +1,9 @@
+audio_file|text|transcription|speaker_name
+wavs/LJ001-0001.flac|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|ljspeech-0
+wavs/LJ001-0002.flac|in being comparatively modern.|in being comparatively modern.|ljspeech-0
+wavs/LJ001-0003.flac|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|ljspeech-1
+wavs/LJ001-0004.flac|produced the block books, which were the immediate predecessors of the true printed book,|produced the block books, which were the immediate predecessors of the true printed book,|ljspeech-1
+wavs/LJ001-0005.flac|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|ljspeech-2
+wavs/LJ001-0006.flac|And it is worth mention in passing that, as an example of fine typography,|And it is worth mention in passing that, as an example of fine typography,|ljspeech-2
+wavs/LJ001-0007.flac|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about 1455,|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about fourteen fifty-five,|ljspeech-3
+wavs/LJ001-0008.flac|has never been surpassed.|has never been surpassed.|ljspeech-3
--- a/tests/data/ljspeech/metadata_mp3.csv
+++ b/tests/data/ljspeech/metadata_mp3.csv
@ -0,0 +1,9 @@
+audio_file|text|transcription|speaker_name
+wavs/LJ001-0001.mp3|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|ljspeech-0
+wavs/LJ001-0002.mp3|in being comparatively modern.|in being comparatively modern.|ljspeech-0
+wavs/LJ001-0003.mp3|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|ljspeech-1
+wavs/LJ001-0004.mp3|produced the block books, which were the immediate predecessors of the true printed book,|produced the block books, which were the immediate predecessors of the true printed book,|ljspeech-1
+wavs/LJ001-0005.mp3|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|ljspeech-2
+wavs/LJ001-0006.mp3|And it is worth mention in passing that, as an example of fine typography,|And it is worth mention in passing that, as an example of fine typography,|ljspeech-2
+wavs/LJ001-0007.mp3|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about 1455,|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about fourteen fifty-five,|ljspeech-3
+wavs/LJ001-0008.mp3|has never been surpassed.|has never been surpassed.|ljspeech-3
--- a/tests/data/ljspeech/metadata_wav.csv
+++ b/tests/data/ljspeech/metadata_wav.csv
@ -0,0 +1,9 @@
+audio_file|text|transcription|speaker_name
+wavs/LJ001-0001.wav|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|ljspeech-0
+wavs/LJ001-0002.wav|in being comparatively modern.|in being comparatively modern.|ljspeech-0
+wavs/LJ001-0003.wav|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|ljspeech-1
+wavs/LJ001-0004.wav|produced the block books, which were the immediate predecessors of the true printed book,|produced the block books, which were the immediate predecessors of the true printed book,|ljspeech-1
+wavs/LJ001-0005.wav|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|ljspeech-2
+wavs/LJ001-0006.wav|And it is worth mention in passing that, as an example of fine typography,|And it is worth mention in passing that, as an example of fine typography,|ljspeech-2
+wavs/LJ001-0007.wav|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about 1455,|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about fourteen fifty-five,|ljspeech-3
+wavs/LJ001-0008.wav|has never been surpassed.|has never been surpassed.|ljspeech-3
--- a/tests/data/ljspeech/wavs/LJ001-0001.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0001.flac
--- a/tests/data/ljspeech/wavs/LJ001-0001.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0001.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0002.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0002.flac
--- a/tests/data/ljspeech/wavs/LJ001-0002.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0002.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0003.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0003.flac
--- a/tests/data/ljspeech/wavs/LJ001-0003.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0003.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0004.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0004.flac
--- a/tests/data/ljspeech/wavs/LJ001-0004.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0004.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0005.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0005.flac
--- a/tests/data/ljspeech/wavs/LJ001-0005.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0005.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0006.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0006.flac
--- a/tests/data/ljspeech/wavs/LJ001-0006.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0006.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0007.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0007.flac
--- a/tests/data/ljspeech/wavs/LJ001-0007.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0007.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0008.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0008.flac
--- a/tests/data/ljspeech/wavs/LJ001-0008.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0008.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0009.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0009.flac
--- a/tests/data/ljspeech/wavs/LJ001-0009.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0009.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0010.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0010.flac
--- a/tests/data/ljspeech/wavs/LJ001-0010.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0010.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0011.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0011.flac
--- a/tests/data/ljspeech/wavs/LJ001-0011.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0011.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0012.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0012.flac
--- a/tests/data/ljspeech/wavs/LJ001-0012.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0012.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0013.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0013.flac
--- a/tests/data/ljspeech/wavs/LJ001-0013.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0013.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0014.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0014.flac
--- a/tests/data/ljspeech/wavs/LJ001-0014.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0014.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0015.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0015.flac
--- a/tests/data/ljspeech/wavs/LJ001-0015.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0015.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0016.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0016.flac
--- a/tests/data/ljspeech/wavs/LJ001-0016.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0016.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0017.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0017.flac
--- a/tests/data/ljspeech/wavs/LJ001-0017.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0017.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0018.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0018.flac
--- a/tests/data/ljspeech/wavs/LJ001-0018.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0018.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0019.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0019.flac
--- a/tests/data/ljspeech/wavs/LJ001-0019.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0019.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0020.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0020.flac
--- a/tests/data/ljspeech/wavs/LJ001-0020.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0020.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0021.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0021.flac
--- a/tests/data/ljspeech/wavs/LJ001-0021.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0021.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0022.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0022.flac
--- a/tests/data/ljspeech/wavs/LJ001-0022.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0022.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0023.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0023.flac
--- a/tests/data/ljspeech/wavs/LJ001-0023.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0023.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0024.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0024.flac
--- a/tests/data/ljspeech/wavs/LJ001-0024.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0024.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0025.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0025.flac
--- a/tests/data/ljspeech/wavs/LJ001-0025.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0025.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0026.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0026.flac
--- a/tests/data/ljspeech/wavs/LJ001-0026.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0026.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0027.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0027.flac
--- a/tests/data/ljspeech/wavs/LJ001-0027.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0027.mp3
--- a/tests/data/ljspeech/wavs/LJ001-0028.flac
+++ b/tests/data/ljspeech/wavs/LJ001-0028.flac
--- a/tests/data/ljspeech/wavs/LJ001-0028.mp3
+++ b/tests/data/ljspeech/wavs/LJ001-0028.mp3
Author	SHA1	Message	Date
Nick Potafiy	dbf1a08a0d	Update generic_utils.py (#3561 ) Handles cases when git branch produces no output or invalid output. Right now, it just crashes with `StopIteration`	2024-02-10 11:20:58 -03:00
Edresson Casanova	5dcc16d193	Bug fix in MP3 and FLAC compute length on TTSDataset (#3092 ) * Bug Fix on XTTS load * Bug fix in MP3 length on TTSDataset * Update TTS/tts/datasets/dataset.py Co-authored-by: Aarni Koskela <akx@iki.fi> * Uses mutagen for all audio formats * Add dataloader test wit hall supported audio formats * Use mutagen.File * Update * Fix aux unit tests * Bug fixe on unit tests --------- Co-authored-by: Aarni Koskela <akx@iki.fi>	2023-12-27 13:23:43 -03:00
Eren Gölge	55c7063724	Merge pull request #3423 from idiap/fix-aux-tests Fix CI (save best model after 0 steps in tests)	2023-12-14 18:00:30 +01:00
Enno Hermann	99fee6f5ad	build: use Trainer>=0.0.36	2023-12-14 14:26:31 +01:00
Eren Gölge	186cafb34c	Merge pull request #3412 from coqui-ai/reuben/docs-studio-refs Remove Coqui Studio references	2023-12-13 08:54:57 +01:00
Eren Gölge	3991d83b2c	Merge branch 'dev' into reuben/docs-studio-refs	2023-12-13 08:53:43 +01:00
Eren Gölge	fa28f99f15	Update to v0.22.0	2023-12-12 16:10:46 +01:00
Eren Gölge	8c1a8b522b	Merge pull request #3405 from coqui-ai/studio_speakers Add studio speakers to open source XTTS!	2023-12-12 16:10:09 +01:00
Reuben Morais	0859e9f252	Remove Coqui Studio references	2023-12-12 16:09:57 +01:00
Enno Hermann	9f325b1f6c	fixup! Fix aux unit tests	2023-12-12 16:07:16 +01:00
Edresson Casanova	fc099218df	Fix aux unit tests	2023-12-12 16:07:16 +01:00
Eren Gölge	934b87bbd1	Merge pull request #3391 from aaron-lii/multi-gpu support multiple GPU training for XTTS	2023-12-12 13:51:26 +01:00
Eren Gölge	b0fe0e678d	Merge pull request #3392 from joelhoward0/fix_contributing_typo fixes a typo	2023-12-12 13:50:59 +01:00
Eren Gölge	936084be7e	Merge pull request #3404 from freds0/dev Training fastspeech2 with External Speaker Embeddings	2023-12-12 13:50:27 +01:00
Eren Gölge	8e6a7cbfbf	Update .models.json	2023-12-12 13:50:01 +01:00
Eren Gölge	8999780aff	Update test_models.py	2023-12-12 13:30:21 +01:00
Eren Gölge	4dc0722bbc	Update .models.json	2023-12-12 13:28:16 +01:00
Edresson Casanova	4b33699b41	Update docs	2023-12-12 09:22:07 -03:00
Edresson Casanova	b6e1ac66d9	Add docs	2023-12-12 09:19:56 -03:00
WeberJulian	61b67ef16f	Fix read_json_with_comments	2023-12-11 23:58:52 +01:00
WeberJulian	d47b6df4e5	Make comments in .model.json valid	2023-12-11 23:35:27 +01:00
WeberJulian	605a857add	Remove tortoise	2023-12-11 23:35:07 +01:00
WeberJulian	b40750d1f5	Remove models that require app.coqui.ai	2023-12-11 23:17:54 +01:00
WeberJulian	ecc38891fb	Fix CI readme	2023-12-11 23:01:30 +01:00
WeberJulian	5ab228dff2	Fix CI	2023-12-11 22:31:53 +01:00
WeberJulian	8c20a599d8	Remove coqui studio integration from TTS	2023-12-11 22:11:46 +01:00
WeberJulian	5cd750ac7e	Fix API and CI	2023-12-11 20:21:53 +01:00
WeberJulian	e3c9dab7a3	Make CLI work	2023-12-11 18:49:18 +01:00
WeberJulian	0a90359a42	rename speaker file	2023-12-11 18:48:49 +01:00
WeberJulian	a5c0d9780f	rename manager	2023-12-11 18:48:31 +01:00
WeberJulian	36143fee26	Add basic speaker manager	2023-12-11 15:25:46 +01:00
Frederico S. Oliveira	f9117918fe	Update .models.json	2023-12-11 10:47:31 -03:00
Frederico S. Oliveira	163f9a3fdf	Merge branch 'coqui-ai:dev' into dev	2023-12-11 10:04:07 -03:00
WeberJulian	0a136a8535	Download speaker file	2023-12-11 11:29:36 +01:00
joelhoward0	e535cfe07c	fixes a typo	2023-12-08 14:19:57 +00:00
Aaron-Li	b6e929696a	support multiple GPU training	2023-12-08 16:55:32 +08:00
Eren Gölge	c99e885cc8	Merge pull request #3373 from coqui-ai/add-doc-xtts Add inference parameters	2023-12-07 14:07:28 +01:00
Eren Gölge	4b35a1e756	Merge pull request #3381 from JRMeyer/licensing-message Print message for either commercial license or CPML	2023-12-07 13:57:39 +01:00
Josh Meyer	759d9ab3ae	Print message for either commercial license or CPML	2023-12-07 13:54:48 +01:00
Eren Gölge	6b2ba527fa	Merge pull request #3368 from omahs/patch-1 Fix typos	2023-12-06 15:10:14 +01:00
WeberJulian	7d1a6defd6	Add inference parameters	2023-12-06 11:43:31 +01:00
omahs	f659fa16bc	fix typo	2023-12-05 09:50:33 +01:00
omahs	716657c835	fix typos	2023-12-05 09:48:03 +01:00
omahs	775a9138b7	fix typo	2023-12-05 09:47:07 +01:00
omahs	cfb143b9fb	fix typos	2023-12-05 09:46:36 +01:00
omahs	c03fe7377b	fix typos	2023-12-05 09:45:00 +01:00
omahs	bba21b86c6	fix typo	2023-12-05 09:41:23 +01:00
Eren Gölge	e49c512d99	Merge pull request #3351 from aaron-lii/chinese-puncs fix pause problem of Chinese speech	2023-12-04 15:57:42 +01:00
Eren Gölge	9c7b850995	Merge pull request #3352 from VladCuciureanu/patch-1 fix: Few typos in Tortoise docs.	2023-12-04 15:56:37 +01:00
Eren Gölge	2d02015978	Update to v0.21.3	2023-12-01 23:52:57 +01:00
Edresson Casanova	5f900f156a	Add XTTS Fine tuning gradio demo (#3296 ) * Add XTTS FT demo data processing pipeline * Add training and inference columns * Uses tabs instead of columns * Fix demo freezing issue * Update demo * Convert stereo to mono * Bug fix on XTTS inference * Update gradio demo * Update gradio demo * Update gradio demo * Update gradio demo * Add parameters to be able to set then on colab demo * Add erros messages * Add intuitive error messages * Update * Add max_audio_length parameter * Add XTTS fine-tuner docs * Update XTTS finetuner docs * Delete trainer to freeze memory * Delete unused variables * Add gc.collect() * Update xtts.md --------- Co-authored-by: Eren Gölge <erogol@hotmail.com>	2023-12-01 23:52:23 +01:00
Vlad Cuciureanu	f5b41674e8	fix: Few typos in Tortoise docs.	2023-12-01 20:42:41 +02:00
Aaron-Li	7b8808186a	fix pause problem of Chinese speech	2023-12-01 23:30:03 +08:00
Frederico S. Oliveira	bcd500fa7b	Fixing bug Correction in training the Fastspeech/Fastspeech2/FastPitch/SpeedySpeech model using external speaker embedding.	2023-11-30 17:27:05 -03:00
Frederico S. Oliveira	a26e51b0b4	Merge branch 'coqui-ai:dev' into dev	2023-11-30 14:19:05 -03:00
Eren Gölge	6d1905c2b7	Update to v0.21.2	2023-11-30 13:05:10 +01:00
Hannes Krumbiegel	e40527b103	Fix link to installation instructions (#3329 )	2023-11-30 13:03:33 +01:00
Enno Hermann	39321d02be	fix: correctly strip/restore initial punctuation (#3336 ) * refactor(punctuation): remove orphan code for handling lone punctuation The case of lone punctuation is already handled at the top of restore(). The removed if statement would never be called and would in fact raise an AttributeError because the _punc_index named tuple doesn't have the attribute `mark`. * refactor(punctuation): remove unused argument * fix(punctuation): correctly handle initial punctuation Stripping and restoring initial punctuation didn't work correctly because the string-splitting caused an additional empty string to be inserted in the text list (because `".A".split(".")` => `["", "A"]`). Now, an initial empty string is skipped and relevant test cases are added. Fixes #3333	2023-11-30 13:03:16 +01:00
Eren Gölge	93283385e0	Merge pull request #3318 from coqui-ai/calling_hf_models Run XTTS models by direct name with versions	2023-11-30 13:02:26 +01:00
Frederico S. Oliveira	77c2155609	Merge pull request #1 from coqui-ai/dev Update	2023-11-29 17:24:02 -03:00
Eren Gölge	bfbaffc84a	Fixup	2023-11-28 13:47:45 +01:00
Eren Gölge	18b7d746cb	Updating XTTS docs	2023-11-27 14:54:49 +01:00
Eren Gölge	b75e90ba85	Make text splitting optional	2023-11-27 14:53:11 +01:00
Eren Gölge	3b8894a3dd	Make style	2023-11-27 14:15:50 +01:00
Eren Gölge	2fd8cf3d94	Make xtts runnable by version names	2023-11-27 14:15:16 +01:00
Eren Gölge	11ec9f7471	Add hi in config defaults	2023-11-24 15:38:36 +01:00
Eren Gölge	00a870c26a	Update to v0.21.1	2023-11-24 15:15:44 +01:00
Eren Gölge	7e575068c9	Merge branch 'dev' of https://github.com/coqui-ai/TTS into dev	2023-11-24 15:15:19 +01:00
Eren Gölge	32065139e7	Simple text cleaner for "hi"	2023-11-24 15:14:34 +01:00
Fred	f6eaa61afe	Adding checkpoint model	2023-07-02 18:55:50 -03:00
 @ -1 +1 @@
 .21.0
 .22.0
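
Commit dbf1a08a0d in the log above says that `generic_utils.py` crashed with `StopIteration` when `git branch` produced no output or invalid output. The sketch below illustrates that failure mode and one way to guard against it. It is not the code from the commit: the function name `current_branch_from_output` and the `"unknown"` fallback are assumptions made for illustration.

```python
def current_branch_from_output(git_branch_output: str, default: str = "unknown") -> str:
    """Return the branch marked with '*' in `git branch` output, or `default`.

    Hypothetical helper: the name and the fallback value are illustrative only.
    """
    # `git branch` prefixes the checked-out branch with '*'.
    marked = (line for line in git_branch_output.splitlines() if line.startswith("*"))
    # A bare next(marked) raises StopIteration when no line is marked;
    # passing a default value avoids that.
    line = next(marked, None)
    if line is None:
        return default
    # Drop the '*' marker and the surrounding whitespace.
    return line[1:].strip() or default
```

Called with an empty string, the helper returns the fallback instead of raising, so a caller that only logs the branch name can continue when the code runs outside a git checkout.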