mirror of https://github.com/coqui-ai/TTS.git
commit cd52907351

@@ -11,30 +11,25 @@ You can contribute not only with code but with bug reports, comments, questions,

If you like to contribute code, squash a bug but if you don't know where to start, here are some pointers.

- [Github Issues Tracker](https://github.com/idiap/coqui-ai-TTS/issues)

This is a place to find feature requests and bugs.

Issues with the ```good first issue``` tag are a good place for beginners to take on. Issues tagged with `help wanted` are suited for more experienced outside contributors.

- Also feel free to suggest new features, ideas and models. We're always open to new things.

## Call for sharing pretrained models

If possible, please consider sharing your pre-trained models in any language (if the licences allow you to do so). We will include them in our model catalogue for public use and give the proper attribution, whether it be your name, company, website or any other source specified.

Your model can be shared in two ways:
1. Share the model files with us and we serve them with the next 🐸 TTS release.
2. Upload your models on GDrive and share the link.

Models are served under the `.models.json` file and any model is available under the TTS CLI and Python API endpoints.
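
If you want to see which models are currently registered there, the Python API used throughout the README can list them; a minimal sketch:

```python
# Minimal sketch: print the models registered in .models.json via the Python API.
from TTS.api import TTS

print(TTS().list_models())
```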

Either way you choose, please make sure you send the models [here](https://github.com/coqui-ai/TTS/discussions/930).

@@ -135,7 +130,8 @@ curl -LsSf https://astral.sh/uv/install.sh | sh

13. Let's discuss until it is perfect. 💪

We might ask you for certain changes that would appear in the [Github ✨**PR**✨'s page](https://github.com/idiap/coqui-ai-TTS/pulls).

14. Once things look perfect, we merge it to the ```dev``` branch and make it ready for the next version.

@@ -143,9 +139,9 @@ curl -LsSf https://astral.sh/uv/install.sh | sh

If you prefer working within a Docker container as your development environment, you can do the following:

1. Fork the 🐸TTS [Github repository](https://github.com/idiap/coqui-ai-TTS) by clicking the fork button at the top right corner of the page.

2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```.

```bash
git clone git@github.com:<your Github name>/coqui-ai-TTS.git
cd coqui-ai-TTS
# add the main repo as a new remote named "upstream"
git remote add upstream https://github.com/idiap/coqui-ai-TTS.git
```

Makefile

@@ -59,9 +59,6 @@ lint: ## run linters.

system-deps: ## install linux system deps
	sudo apt-get install -y libsndfile1-dev

install: ## install 🐸 TTS
	uv sync --all-extras

@@ -70,4 +67,4 @@ install_dev: ## install 🐸 TTS for development.
	uv run pre-commit install

docs: ## build the docs
	uv run --group docs $(MAKE) -C docs clean && uv run --group docs $(MAKE) -C docs html

README.md

@@ -1,39 +1,34 @@

# <img src="https://raw.githubusercontent.com/idiap/coqui-ai-TTS/main/images/coqui-log-green-TTS.png" height="56"/>

**🐸 Coqui TTS is a library for advanced Text-to-Speech generation.**

🚀 Pretrained models in +1100 languages.

🛠️ Tools for training new models and fine-tuning existing models in any language.

📚 Utilities for dataset analysis and curation.

[Discord](https://discord.gg/5eXr5seRrv)
[Python versions](https://pypi.org/project/coqui-tts/)
[License: MPL 2.0](https://opensource.org/licenses/MPL-2.0)
[PyPI version](https://pypi.org/project/coqui-tts/)
[Downloads](https://pepy.tech/project/coqui-tts)
[DOI](https://zenodo.org/badge/latestdoi/265612440)
[Tests](https://github.com/idiap/coqui-ai-TTS/actions/workflows/tests.yml)
[Docker build](https://github.com/idiap/coqui-ai-TTS/actions/workflows/docker.yaml)
[Style check](https://github.com/idiap/coqui-ai-TTS/actions/workflows/style_check.yml)
[Docs](https://coqui-tts.readthedocs.io/en/latest/)

</div>

## 📣 News
- **Fork of the [original, unmaintained repository](https://github.com/coqui-ai/TTS). New PyPI package: [coqui-tts](https://pypi.org/project/coqui-tts)**
- 0.25.0: [OpenVoice](https://github.com/myshell-ai/OpenVoice) models now available for voice conversion.
- 0.24.2: Prebuilt wheels are now also published for Mac and Windows (in addition to Linux as before) for easier installation across platforms.
- 0.20.0: XTTSv2 is here with 17 languages and better performance across the board. XTTS can stream with <200ms latency.
- 0.19.0: XTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech).
- 0.14.1: You can use [Fairseq models in ~1100 languages](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS.

## 💬 Where to ask questions
Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly so that more people can benefit from it.

@@ -63,71 +58,67 @@ repository are also still a useful source of information.

| 🚀 **Released Models** | [Standard models](https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json) and [Fairseq models in ~1100 languages](https://github.com/idiap/coqui-ai-TTS#example-text-to-speech-using-fairseq-models-in-1100-languages-)|

## Features
- High-performance text-to-speech and voice conversion models, see list below.
- Fast and efficient model training with detailed training logs on the terminal and Tensorboard.
- Support for multi-speaker and multilingual TTS.
- Released and ready-to-use models.
- Tools to curate TTS datasets under ```dataset_analysis/```.
- Command line and Python APIs to use and test your models.
- Modular (but not too much) code base enabling easy implementation of new ideas.

## Model Implementations
### Spectrogram models
- [Tacotron](https://arxiv.org/abs/1703.10135), [Tacotron2](https://arxiv.org/abs/1712.05884)
- [Glow-TTS](https://arxiv.org/abs/2005.11129), [SC-GlowTTS](https://arxiv.org/abs/2104.05557)
- [Speedy-Speech](https://arxiv.org/abs/2008.03802)
- [Align-TTS](https://arxiv.org/abs/2003.01950)
- [FastPitch](https://arxiv.org/pdf/2006.06873.pdf)
- [FastSpeech](https://arxiv.org/abs/1905.09263), [FastSpeech2](https://arxiv.org/abs/2006.04558)
- [Capacitron](https://arxiv.org/abs/1906.03402)
- [OverFlow](https://arxiv.org/abs/2211.06892)
- [Neural HMM TTS](https://arxiv.org/abs/2108.13320)
- [Delightful TTS](https://arxiv.org/abs/2110.12612)

### End-to-End Models
- [XTTS](https://arxiv.org/abs/2406.04904)
- [VITS](https://arxiv.org/pdf/2106.06103)
- 🐸[YourTTS](https://arxiv.org/abs/2112.02418)
- 🐢[Tortoise](https://github.com/neonbjb/tortoise-tts)
- 🐶[Bark](https://github.com/suno-ai/bark)

### Vocoders
- [MelGAN](https://arxiv.org/abs/1910.06711)
- [MultiBandMelGAN](https://arxiv.org/abs/2005.05106)
- [ParallelWaveGAN](https://arxiv.org/abs/1910.11480)
- [GAN-TTS discriminators](https://arxiv.org/abs/1909.11646)
- [WaveRNN](https://github.com/fatchord/WaveRNN/)
- [WaveGrad](https://arxiv.org/abs/2009.00713)
- [HiFiGAN](https://arxiv.org/abs/2010.05646)
- [UnivNet](https://arxiv.org/abs/2106.07889)

### Voice Conversion
- [FreeVC](https://arxiv.org/abs/2210.15418)
- [OpenVoice](https://arxiv.org/abs/2312.01479)

### Others
- Attention methods: [Guided Attention](https://arxiv.org/abs/1710.08969),
  [Forward Backward Decoding](https://arxiv.org/abs/1907.09006),
  [Graves Attention](https://arxiv.org/abs/1910.10288),
  [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/),
  [Dynamic Convolutional Attention](https://arxiv.org/pdf/1910.10288.pdf),
  [Alignment Network](https://arxiv.org/abs/2108.10447)
- Speaker encoders: [GE2E](https://arxiv.org/abs/1710.10467),
  [Angular Loss](https://arxiv.org/pdf/2003.11982.pdf)

You can also help us implement more models.

<!-- start installation -->
## Installation
🐸TTS is tested on Ubuntu 24.04 with **python >= 3.9, < 3.13**, but should also work on Mac and Windows.

If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the pretrained 🐸TTS models, installing from PyPI is the easiest option.

```bash
pip install coqui-tts
```

@@ -165,21 +156,18 @@ pip install -e .[server,ja]

### Platforms

If you are on Ubuntu (Debian), you can also run the following commands for installation.

```bash
make system-deps
make install
```

<!-- end installation -->

## Docker Image
You can also try out Coqui TTS without installation with the docker image.
Simply run the following command and you will be able to run TTS:

```bash
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
```

@@ -193,10 +181,10 @@ More details about the docker images (like GPU support) can be found

## Synthesizing speech by 🐸TTS

<!-- start inference -->
### 🐍 Python API

#### Multi-speaker and multi-lingual model

```python
import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# List available 🐸TTS models
print(TTS().list_models())

# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# List speakers
print(tts.speakers)

# Run TTS
# ❗ XTTS supports both, but many models allow only one of the `speaker` and
# `speaker_wav` arguments

# TTS with list of amplitude values as output, clone the voice from `speaker_wav`
wav = tts.tts(
    text="Hello world!",
    speaker_wav="my/cloning/audio.wav",
    language="en"
)

# TTS to a file, use a preset speaker
tts.tts_to_file(
    text="Hello world!",
    speaker="Craig Gutsy",
    language="en",
    file_path="output.wav"
)
```

#### Single speaker model

```python
# Initialize TTS with the target model name
tts = TTS("tts_models/de/thorsten/tacotron2-DDC").to(device)

# Run TTS
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
```

#### Voice conversion (VC)

Converting the voice in `source_wav` to the voice of `target_wav`

```python
tts = TTS("voice_conversion_models/multilingual/vctk/freevc24").to("cuda")
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    file_path="output.wav"
)
```

Other available voice conversion models:
- `voice_conversion_models/multilingual/multi-dataset/openvoice_v1`
- `voice_conversion_models/multilingual/multi-dataset/openvoice_v2`
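
These can be used with the same call as above; for example, a minimal sketch with the OpenVoice v2 model (the audio paths are placeholders):

```python
# Sketch: voice conversion with the OpenVoice v2 model, same API as above.
tts = TTS("voice_conversion_models/multilingual/multi-dataset/openvoice_v2").to(device)
tts.voice_conversion_to_file(
    source_wav="my/source.wav",
    target_wav="my/target.wav",
    file_path="output.wav"
)
```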

#### Voice cloning by combining single speaker TTS model with the default VC model

This way, you can clone voices by using any model in 🐸TTS. The FreeVC model is used for voice conversion after synthesizing speech.
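
A minimal sketch of such a call using `tts_with_vc_to_file` (the text, reference audio path and model name are placeholders; the argument names follow the examples above):

```python
# Sketch: synthesize with a single-speaker model, then convert the result to the
# voice in `speaker_wav` via the default FreeVC model.
tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_with_vc_to_file(
    "Ich bin eine Testnachricht.",
    speaker_wav="target/speaker.wav",
    file_path="output.wav"
)
```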

#### TTS using Fairseq models in ~1100 languages 🤯
For Fairseq models, use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
You can find the language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html)
and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).
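
A minimal sketch following that name format (English, ISO code `eng`; the text and output path are placeholders):

```python
# Sketch: synthesize with a Fairseq-based VITS model, using the
# `tts_models/<lang-iso_code>/fairseq/vits` name format described above.
api = TTS("tts_models/eng/fairseq/vits")
api.tts_to_file(
    "Hello from a Fairseq model.",
    file_path="output.wav"
)
```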

### Command-line interface `tts`

<!-- begin-tts-readme -->

Synthesize speech on the command line.

You can either use your trained model or choose a model from the provided list.

- List provided models:

```sh
tts --list_models
```

- Get model information. Use the names obtained from `--list_models`.

```sh
tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```

For example:

```sh
tts --model_info_by_name tts_models/tr/common-voice/glow-tts
tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```

#### Single speaker models

- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`):

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav
```

- Run TTS and pipe out the generated TTS wav file data:

```sh
tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```

- Run a TTS model with its default vocoder model:

```sh
tts --text "Text for TTS" \
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \
    --model_name "tts_models/en/ljspeech/glow-tts" \
    --out_path output/path/speech.wav
```

- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model.

```sh
tts --text "Text for TTS" \
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \
    --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" \
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \
    --model_name "tts_models/en/ljspeech/glow-tts" \
    --vocoder_name "vocoder_models/en/ljspeech/univnet" \
    --out_path output/path/speech.wav
```

- Run your own TTS model (using Griffin-Lim Vocoder):

```sh
tts --text "Text for TTS" \
    --model_path path/to/model.pth \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav
```

- Run your own TTS and Vocoder models:

```sh
tts --text "Text for TTS" \
    --model_path path/to/model.pth \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav \
    --vocoder_path path/to/vocoder.pth \
    --vocoder_config_path path/to/vocoder_config.json
```

#### Multi-speaker models

- List the available speakers and choose a `<speaker_id>` among them:

```sh
tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```

- Run the multi-speaker TTS model with the target speaker ID:

```sh
tts --text "Text for TTS." --out_path output/path/speech.wav \
    --model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```

- Run your own multi-speaker TTS model:

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav \
    --model_path path/to/model.pth --config_path path/to/config.json \
    --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```

#### Voice conversion models

```sh
tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" \
    --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```

<!-- end-tts-readme -->

@@ -14,123 +14,122 @@ from TTS.utils.generic_utils import ConsoleFormatter, setup_logger

logger = logging.getLogger(__name__)

description = """
Synthesize speech on the command line.

You can either use your trained model or choose a model from the provided list.

- List provided models:

```sh
tts --list_models
```

- Get model information. Use the names obtained from `--list_models`.

```sh
tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```

For example:

```sh
tts --model_info_by_name tts_models/tr/common-voice/glow-tts
tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```

#### Single Speaker Models

- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`):

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav
```

- Run TTS and pipe out the generated TTS wav file data:

```sh
tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```

- Run a TTS model with its default vocoder model:

```sh
tts --text "Text for TTS" \\
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \\
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \\
    --model_name "tts_models/en/ljspeech/glow-tts" \\
    --out_path output/path/speech.wav
```

- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model.

```sh
tts --text "Text for TTS" \\
    --model_name "<model_type>/<language>/<dataset>/<model_name>" \\
    --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" \\
    --out_path output/path/speech.wav
```

For example:

```sh
tts --text "Text for TTS" \\
    --model_name "tts_models/en/ljspeech/glow-tts" \\
    --vocoder_name "vocoder_models/en/ljspeech/univnet" \\
    --out_path output/path/speech.wav
```

- Run your own TTS model (using Griffin-Lim Vocoder):

```sh
tts --text "Text for TTS" \\
    --model_path path/to/model.pth \\
    --config_path path/to/config.json \\
    --out_path output/path/speech.wav
```

- Run your own TTS and Vocoder models:

```sh
tts --text "Text for TTS" \\
    --model_path path/to/model.pth \\
    --config_path path/to/config.json \\
    --out_path output/path/speech.wav \\
    --vocoder_path path/to/vocoder.pth \\
    --vocoder_config_path path/to/vocoder_config.json
```

#### Multi-speaker Models

- List the available speakers and choose a `<speaker_id>` among them:

```sh
tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```

- Run the multi-speaker TTS model with the target speaker ID:

```sh
tts --text "Text for TTS." --out_path output/path/speech.wav \\
    --model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```

- Run your own multi-speaker TTS model:

```sh
tts --text "Text for TTS" --out_path output/path/speech.wav \\
    --model_path path/to/model.pth --config_path path/to/config.json \\
    --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```

#### Voice Conversion Models

```sh
tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" \\
    --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```
"""

@@ -12,7 +12,7 @@ from trainer import TrainerModel

class BaseTrainerModel(TrainerModel):
    """BaseTrainerModel model expanding TrainerModel with required functions by 🐸TTS.

    Every new Coqui model must inherit it.
    """

    @staticmethod

@@ -206,12 +206,14 @@ class Bark(BaseTTS):

            speaker_wav (str): Path to the speaker audio file for cloning a new voice. It is cloned and saved in
                `voice_dirs` with the name `speaker_id`. Defaults to None.
            voice_dirs (List[str]): List of paths that host reference audio files for speakers. Defaults to None.
            **kwargs: Model specific inference settings used by `generate_audio()` and
                `TTS.tts.layers.bark.inference_funcs.generate_text_semantic()`.

        Returns:
            A dictionary of the output values with `wav` as output waveform,
            `deterministic_seed` as seed used at inference, `text_input` as text token IDs
            after tokenizer, `voice_samples` as samples used for cloning, and
            `conditioning_latents` as latents used at inference.
        """
        speaker_id = "random" if speaker_id is None else speaker_id

@@ -80,12 +80,14 @@ class BaseTTS(BaseTrainerModel):

            raise ValueError("config must be either a *Config or *Args")

    def init_multispeaker(self, config: Coqpit, data: List = None):
        """Set up for multi-speaker TTS.

        Initialize a speaker embedding layer if needed and define the expected embedding
        channel size for setting the `in_channels` size of the connected layers.

        This implementation yields 3 possible outcomes:

        1. If `config.use_speaker_embedding` and `config.use_d_vector_file` are False, do nothing.
        2. If `config.use_d_vector_file` is True, set expected embedding channel size to `config.d_vector_dim` or 512.
        3. If `config.use_speaker_embedding`, initialize a speaker embedding layer with channel size of
           `config.d_vector_dim` or 512.

@@ -48,10 +48,11 @@ class Overflow(BaseTTS):

    are available at https://shivammehta25.github.io/OverFlow/.

    Note:
        - Neural HMMs use flat start initialization, i.e. they compute the means,
          std and transition probabilities of the dataset and use them to initialize
          the model. This benefits the model and helps with faster learning. If you change
          the dataset or want to regenerate the parameters, change
          `force_generate_statistics` and `mel_statistics_parameter_path` accordingly.

        - To enable multi-GPU training, set the `use_grad_checkpointing=False` in config.
          This will significantly increase the memory usage. This is because to compute

@@ -423,7 +423,9 @@ class Tortoise(BaseTTS):

        Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
        These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
        properties.

        :param voice_samples: List of arbitrary reference clips, which should be *pairs*
            of torch tensors containing arbitrary kHz waveform data.
        :param latent_averaging_mode: 0/1/2 for following modes:
            0 - latents will be generated as in original tortoise, using ~4.27s from each voice sample, averaging latent across all samples
            1 - latents will be generated using (almost) entire voice samples, averaged across all the ~4.27s chunks

@@ -671,7 +673,7 @@ class Tortoise(BaseTTS):

            As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
            diffusion_temperature: (float) Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                are the "mean" prediction of the diffusion network and will sound bland and smeared.
            hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer.
                Extra keyword args fed to this function get forwarded directly to that API. Documentation
                here: https://huggingface.co/docs/transformers/internal/generation_utils

@@ -178,7 +178,7 @@ class XttsArgs(Coqpit):

class Xtts(BaseTTS):
    """XTTS model implementation.

    ❗ Currently it only supports inference.

@@ -460,7 +460,7 @@ class Xtts(BaseTTS):

            gpt_cond_chunk_len: (int) Chunk length used for cloning. It must be <= `gpt_cond_len`.
                If gpt_cond_len == gpt_cond_chunk_len, no chunking. Defaults to 6 seconds.

            hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive
                transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation
                here: https://huggingface.co/docs/transformers/internal/generation_utils

@@ -52,6 +52,7 @@ extensions = [

    "sphinx_inline_tabs",
]

suppress_warnings = ["autosectionlabel.*"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

@@ -67,6 +68,8 @@ myst_enable_extensions = [

    "linkify",
]

myst_heading_anchors = 4

# 'sphinxcontrib.katex',
# 'sphinx.ext.autosectionlabel',

@@ -1,6 +1,6 @@

# Configuration

We use 👩✈️[Coqpit](https://github.com/idiap/coqui-ai-coqpit) for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is what a simple configuration looks like with Coqpit.

```python
from dataclasses import asdict, dataclass, field

...

class SimpleConfig(Coqpit):
    ...
    check_argument("val_c", c, restricted=True)
```

In Coqui, each model must have a configuration class that exposes all the values necessary for its lifetime.

It defines model architecture, hyper-parameters, training, and inference settings. For our models, we merge all the fields in a single configuration class for ease. It may not look like a wise practice but enables easier bookkeeping and reproducible experiments.
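
As an illustration, a minimal Coqpit-based config might look like this (the class and field names below are made up for the sketch, not taken from 🐸TTS):

```python
from dataclasses import asdict, dataclass, field

from coqpit import Coqpit


@dataclass
class MyModelConfig(Coqpit):
    # illustrative fields only
    model: str = "my_model"
    hidden_channels: int = 256
    learning_rate: float = 1e-3
    datasets: list = field(default_factory=list)


config = MyModelConfig(hidden_channels=512)
print(asdict(config))  # Coqpit configs behave like plain dataclasses and serialize easily
```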

@@ -1,7 +1,9 @@

(formatting_your_dataset)=
# Formatting your dataset

For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips and each clip needs a transcription.
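
Transcriptions are typically collected in a pipe-separated metadata file; the sketch below assumes a simple `<audio_id>|<transcription>` layout (the IDs and sentences are made up):

```python
# Sketch: parse an LJSpeech-style metadata file with "<audio_id>|<transcription>" lines.
metadata = """\
audio_0001|This is the transcription of the first clip.
audio_0002|And this is the transcription of the second clip.
"""

for line in metadata.strip().splitlines():
    audio_id, text = line.split("|", maxsplit=1)
    print(audio_id, "->", text)
```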
If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity. It is an open-source and free audio editing software.
|
If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity. It is an open-source and free audio editing software.
|
||||||
|
|
||||||
|
@ -49,7 +51,7 @@ The format above is taken from widely-used the [LJSpeech](https://keithito.com/L
|
||||||
|
|
||||||
Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is extremely important for especially non-phonemic languages like English.
|
Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is extremely important for especially non-phonemic languages like English.
|
||||||
|
|
||||||
For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset).
|
For more info about dataset qualities and properties check [this page](what_makes_a_good_dataset.md).
|
||||||
|
|
||||||
## Using Your Dataset in 🐸TTS
|
## Using Your Dataset in 🐸TTS
|
||||||
|
|
|
@ -0,0 +1,12 @@
|
||||||
|
# Datasets
|
||||||
|
|
||||||
|
For training a TTS model, you need a dataset with speech recordings and
|
||||||
|
transcriptions. See the following pages for more information on:
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
formatting_your_dataset
|
||||||
|
what_makes_a_good_dataset
|
||||||
|
tts_datasets
|
||||||
|
```
|
|
@ -1,6 +1,6 @@
|
||||||
# TTS Datasets
|
# Public TTS datasets
|
||||||
|
|
||||||
Some of the known public datasets that we successfully applied 🐸TTS:
|
Some of the known public datasets that were successfully used for 🐸TTS:
|
||||||
|
|
||||||
- [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
|
- [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
|
||||||
- [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
|
- [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
|
|
@ -1,20 +1,20 @@
|
||||||
(docker_images)=
|
(docker_images)=
|
||||||
## Docker images
|
# Docker images
|
||||||
We provide docker images to be able to test TTS without having to setup your own environment.
|
We provide docker images to be able to test TTS without having to setup your own environment.
|
||||||
|
|
||||||
### Using premade images
|
## Using premade images
|
||||||
You can use premade images built automatically from the latest TTS version.
|
You can use premade images built automatically from the latest TTS version.
|
||||||
|
|
||||||
#### CPU version
|
### CPU version
|
||||||
```bash
|
```bash
|
||||||
docker pull ghcr.io/coqui-ai/tts-cpu
|
docker pull ghcr.io/coqui-ai/tts-cpu
|
||||||
```
|
```
|
||||||
#### GPU version
|
### GPU version
|
||||||
```bash
|
```bash
|
||||||
docker pull ghcr.io/coqui-ai/tts
|
docker pull ghcr.io/coqui-ai/tts
|
||||||
```
|
```
|
||||||
|
|
||||||
### Building your own image
|
## Building your own image
|
||||||
```bash
|
```bash
|
||||||
docker build -t tts .
|
docker build -t tts .
|
||||||
```
|
```
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Implementing a New Language Frontend
|
# Implementing new language front ends
|
||||||
|
|
||||||
- Language front ends are located under `TTS.tts.utils.text`
|
- Language front ends are located under `TTS.tts.utils.text`
|
||||||
- Each special language has a separate folder.
|
- Each special language has a separate folder.
|
|
@ -1,4 +1,4 @@
|
||||||
# Implementing a Model
|
# Implementing new models
|
||||||
|
|
||||||
1. Implement layers.
|
1. Implement layers.
|
||||||
|
|
||||||
|
@ -36,7 +36,8 @@
|
||||||
There is also the `callback` interface, which lets you manipulate both the model and the `Trainer` states. Callbacks give you
|
There is also the `callback` interface, which lets you manipulate both the model and the `Trainer` states. Callbacks give you
|
||||||
virtually unlimited flexibility to add custom behaviours to your model and training routines.
|
virtually unlimited flexibility to add custom behaviours to your model and training routines.
|
||||||
|
|
||||||
For more details, see {ref}`BaseTTS <Base tts Model>` and :obj:`TTS.utils.callbacks`.
|
For more details, see [BaseTTS](../main_classes/model_api.md#base-tts-model)
|
||||||
|
and `TTS.utils.callbacks`.
|
||||||
|
|
||||||
6. Optionally, define `MyModelArgs`.
|
6. Optionally, define `MyModelArgs`.
|
||||||
|
|
||||||
|
@ -62,7 +63,7 @@
|
||||||
We love you more when you document your code. ❤️
|
We love you more when you document your code. ❤️
|
||||||
|
|
||||||
|
|
||||||
# Template 🐸TTS Model implementation
|
## Template 🐸TTS Model implementation
|
||||||
|
|
||||||
You can start implementing your model by copying the following base class.
|
You can start implementing your model by copying the following base class.
|
||||||
|
|
|
@ -0,0 +1,14 @@
|
||||||
|
# Adding models or languages
|
||||||
|
|
||||||
|
You can extend Coqui by implementing new model architectures or adding front
|
||||||
|
ends for new languages. See the pages below for more details. The [project
|
||||||
|
structure](../project_structure.md) and [contribution
|
||||||
|
guidelines](../contributing.md) may also be helpful. Please open a pull request
|
||||||
|
with your changes to share back the improvements with the community.
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
implementing_a_new_model
|
||||||
|
implementing_a_new_language_frontend
|
||||||
|
```
|
|
@ -1,4 +1,4 @@
|
||||||
# Humble FAQ
|
# FAQ
|
||||||
We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper.
|
We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper.
|
||||||
|
|
||||||
## Errors with a pre-trained model. How can I resolve this?
|
## Errors with a pre-trained model. How can I resolve this?
|
||||||
|
@ -7,7 +7,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
|
||||||
- If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny.
|
- If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny.
|
||||||
|
|
||||||
## What are the requirements of a good 🐸TTS dataset?
|
## What are the requirements of a good 🐸TTS dataset?
|
||||||
* {ref}`See this page <what_makes_a_good_dataset>`
|
- [See this page](datasets/what_makes_a_good_dataset.md)
|
||||||
|
|
||||||
## How should I choose the right model?
|
## How should I choose the right model?
|
||||||
- First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2.
|
- First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2.
|
||||||
|
@ -18,7 +18,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
|
||||||
## How can I train my own `tts` model?
|
## How can I train my own `tts` model?
|
||||||
0. Check your dataset with the notebooks in the [dataset_analysis](https://github.com/idiap/coqui-ai-TTS/tree/main/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/idiap/coqui-ai-TTS/blob/main/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in better audio synthesis.
|
0. Check your dataset with the notebooks in the [dataset_analysis](https://github.com/idiap/coqui-ai-TTS/tree/main/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/idiap/coqui-ai-TTS/blob/main/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in better audio synthesis.
|
||||||
|
|
||||||
1. Write your own dataset `formatter` in `datasets/formatters.py` or format your dataset as one of the supported datasets, like LJSpeech.
|
1. Write your own dataset `formatter` in `datasets/formatters.py` or [format](datasets/formatting_your_dataset) your dataset as one of the supported datasets, like LJSpeech.
|
||||||
A `formatter` parses the metadata file and converts it into a list of training samples (see the sketch at the end of this section).
|
A `formatter` parses the metadata file and converts it into a list of training samples (see the sketch at the end of this section).
|
||||||
|
|
||||||
2. If you have a dataset in an alphabet other than English, you need to set your own character list in the ```config.json```.
|
2. If you have a dataset in an alphabet other than English, you need to set your own character list in the ```config.json```.
|
||||||
|
@ -61,7 +61,8 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
|
||||||
- SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json```
|
- SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json```
|
||||||
- MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json```
|
- MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json```
|
||||||
|
|
||||||
**Note:** You can also train your model using pure 🐍 python. Check ```{eval-rst} :ref: 'tutorial_for_nervous_beginners'```.
|
**Note:** You can also train your model using pure 🐍 python. Check the
|
||||||
|
[tutorial](tutorial_for_nervous_beginners.md).
|
||||||
|
|
||||||
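As referenced in step 1 above, here is a rough sketch of a custom dataset `formatter` (the function name, metadata layout, and sample-dict keys are assumptions modeled on the built-in LJSpeech-style formatters; adapt them to your data and 🐸TTS version):

```python
import os


def my_dataset_formatter(root_path, meta_file, **kwargs):
    """Hypothetical formatter: parse metadata lines of the form
    `<wav_id>|<transcription>` into the sample dicts 🐸TTS expects."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            wav_id, text = line.strip().split("|", maxsplit=1)
            items.append(
                {
                    "text": text,
                    "audio_file": os.path.join(root_path, "wavs", f"{wav_id}.wav"),
                    "speaker_name": "my_speaker",
                    "root_path": root_path,
                }
            )
    return items
```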
## How can I train in a different language?
|
## How can I train in a different language?
|
||||||
- Check steps 2, 3, 4, 5 above.
|
- Check steps 2, 3, 4, 5 above.
|
||||||
|
@ -104,7 +105,7 @@ The best approach is to pick a set of promising models and run a Mean-Opinion-Sc
|
||||||
- Check the 4th step under "How can I check model performance?"
|
- Check the 4th step under "How can I check model performance?"
|
||||||
|
|
||||||
## How can I test a trained model?
|
## How can I test a trained model?
|
||||||
- The best way is to use `tts` or `tts-server` commands. For details check {ref}`here <synthesizing_speech>`.
|
- The best way is to use `tts` or `tts-server` commands. For details check [here](inference.md).
|
||||||
- If you need more control, you can use the ```TTS.utils.synthesizer.Synthesizer``` class in your own code (see the sketch below).
|
- If you need more control, you can use the ```TTS.utils.synthesizer.Synthesizer``` class in your own code (see the sketch below).
|
||||||
|
|
||||||
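A minimal sketch of using the `Synthesizer` class directly (the paths are placeholders and the keyword names are assumptions; check the class signature in your installed version):

```python
from TTS.utils.synthesizer import Synthesizer

# Hypothetical paths to a trained model and its config.
synthesizer = Synthesizer(
    tts_checkpoint="path/to/model.pth",
    tts_config_path="path/to/config.json",
    use_cuda=False,
)
wav = synthesizer.tts("This is a test of my trained model.")
synthesizer.save_wav(wav, "test_output.wav")
```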
## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps" - Stopnet does not work.
|
## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps" - Stopnet does not work.
|
||||||
|
|
|
@ -1,50 +1,56 @@
|
||||||
|
---
|
||||||
|
hide-toc: true
|
||||||
|
---
|
||||||
|
|
||||||
```{include} ../../README.md
|
```{include} ../../README.md
|
||||||
:relative-images:
|
:relative-images:
|
||||||
|
:end-before: <!-- start installation -->
|
||||||
```
|
```
|
||||||
----
|
|
||||||
|
|
||||||
# Documentation Content
|
```{toctree}
|
||||||
```{eval-rst}
|
:maxdepth: 1
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
|
||||||
:caption: Get started
|
:caption: Get started
|
||||||
|
:hidden:
|
||||||
|
|
||||||
tutorial_for_nervous_beginners
|
tutorial_for_nervous_beginners
|
||||||
installation
|
installation
|
||||||
|
docker_images
|
||||||
faq
|
faq
|
||||||
|
project_structure
|
||||||
contributing
|
contributing
|
||||||
|
```
|
||||||
|
|
||||||
.. toctree::
|
```{toctree}
|
||||||
:maxdepth: 2
|
:maxdepth: 1
|
||||||
:caption: Using 🐸TTS
|
:caption: Using Coqui
|
||||||
|
:hidden:
|
||||||
|
|
||||||
inference
|
inference
|
||||||
docker_images
|
training/index
|
||||||
implementing_a_new_model
|
extension/index
|
||||||
implementing_a_new_language_frontend
|
datasets/index
|
||||||
training_a_model
|
```
|
||||||
finetuning
|
|
||||||
configuration
|
|
||||||
formatting_your_dataset
|
|
||||||
what_makes_a_good_dataset
|
|
||||||
tts_datasets
|
|
||||||
marytts
|
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
:caption: Main Classes
|
:caption: Main Classes
|
||||||
|
:hidden:
|
||||||
|
|
||||||
|
configuration
|
||||||
main_classes/trainer_api
|
main_classes/trainer_api
|
||||||
main_classes/audio_processor
|
main_classes/audio_processor
|
||||||
main_classes/model_api
|
main_classes/model_api
|
||||||
main_classes/dataset
|
main_classes/dataset
|
||||||
main_classes/gan
|
main_classes/gan
|
||||||
main_classes/speaker_manager
|
main_classes/speaker_manager
|
||||||
|
```
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
```{toctree}
|
||||||
:caption: `tts` Models
|
:maxdepth: 1
|
||||||
|
:caption: TTS Models
|
||||||
|
:hidden:
|
||||||
|
|
||||||
models/glow_tts.md
|
models/glow_tts.md
|
||||||
models/vits.md
|
models/vits.md
|
||||||
|
@ -54,9 +60,4 @@
|
||||||
models/tortoise.md
|
models/tortoise.md
|
||||||
models/bark.md
|
models/bark.md
|
||||||
models/xtts.md
|
models/xtts.md
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
|
||||||
:caption: `vocoder` Models
|
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
|
@ -1,194 +1,21 @@
|
||||||
(synthesizing_speech)=
|
(synthesizing_speech)=
|
||||||
# Synthesizing Speech
|
# Synthesizing speech
|
||||||
|
|
||||||
First, you need to install TTS. We recommend using PyPi. You need to call the command below:
|
## Overview
|
||||||
|
|
||||||
```bash
|
Coqui TTS provides three main methods for inference:
|
||||||
$ pip install coqui-tts
|
|
||||||
|
1. 🐍Python API
|
||||||
|
2. TTS command line interface (CLI)
|
||||||
|
3. [Local demo server](server.md)
|
||||||
|
|
||||||
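For a quick orientation, a minimal Python API call looks roughly like this (the model name is just an example; the included README below covers the CLI and more complete usage):

```python
from TTS.api import TTS

# Load a released single-speaker model and synthesize to a file.
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello from Coqui TTS!", file_path="output.wav")
```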
|
```{include} ../../README.md
|
||||||
|
:start-after: <!-- start inference -->
|
||||||
```
|
```
|
||||||
|
|
||||||
After the installation, 2 terminal commands are available.
|
|
||||||
|
|
||||||
1. TTS Command Line Interface (CLI). - `tts`
|
```{toctree}
|
||||||
2. Local Demo Server. - `tts-server`
|
:hidden:
|
||||||
3. In 🐍Python. - `from TTS.api import TTS`
|
server
|
||||||
|
marytts
|
||||||
## On the Commandline - `tts`
|
|
||||||

|
|
||||||
|
|
||||||
After the installation, 🐸TTS provides a CLI interface for synthesizing speech using pre-trained models. You can either use your own model or the release models under 🐸TTS.
|
|
||||||
|
|
||||||
Listing released 🐸TTS models.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --list_models
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a TTS model, from the release models list, with its default vocoder. (Simply copy and paste the full model names from the list as arguments for the command below.)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--model_name "<type>/<language>/<dataset>/<model_name>" \
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a tts and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--model_name "tts_models/<language>/<dataset>/<model_name>" \
|
|
||||||
--vocoder_name "vocoder_models/<language>/<dataset>/<model_name>" \
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
Run your own TTS model (Using Griffin-Lim Vocoder)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--model_path path/to/model.pth \
|
|
||||||
--config_path path/to/config.json \
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
Run your own TTS and Vocoder models
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --text "Text for TTS" \
|
|
||||||
--config_path path/to/config.json \
|
|
||||||
--model_path path/to/model.pth \
|
|
||||||
--out_path folder/to/save/output.wav \
|
|
||||||
--vocoder_path path/to/vocoder.pth \
|
|
||||||
--vocoder_config_path path/to/vocoder_config.json
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a multi-speaker TTS model from the released models list.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --model_name "tts_models/<language>/<dataset>/<model_name>" --list_speaker_idxs # list the possible speaker IDs.
|
|
||||||
tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "tts_models/<language>/<dataset>/<model_name>" --speaker_idx "<speaker_id>"
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a released voice conversion model
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts --model_name "voice_conversion/<language>/<dataset>/<model_name>"
|
|
||||||
--source_wav "my/source/speaker/audio.wav"
|
|
||||||
--target_wav "my/target/speaker/audio.wav"
|
|
||||||
--out_path folder/to/save/output.wav
|
|
||||||
```
|
|
||||||
|
|
||||||
**Note:** You can use ```./TTS/bin/synthesize.py``` if you prefer running ```tts``` from the TTS project folder.
|
|
||||||
|
|
||||||
## On the Demo Server - `tts-server`
|
|
||||||
|
|
||||||
<!-- <img src="https://raw.githubusercontent.com/idiap/coqui-ai-TTS/main/images/demo_server.gif" height="56"/> -->
|
|
||||||

|
|
||||||
|
|
||||||
You can boot up a demo 🐸TTS server to run an inference with your models (make
|
|
||||||
sure to install the additional dependencies with `pip install coqui-tts[server]`).
|
|
||||||
Note that the server is not optimized for performance but gives you an easy way
|
|
||||||
to interact with the models.
|
|
||||||
|
|
||||||
The demo server provides pretty much the same interface as the CLI command.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts-server -h # see the help
|
|
||||||
tts-server --list_models # list the available models.
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a TTS model, from the release models list, with its default vocoder.
|
|
||||||
If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize
|
|
||||||
speech.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts-server --model_name "<type>/<language>/<dataset>/<model_name>"
|
|
||||||
```
|
|
||||||
|
|
||||||
Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tts-server --model_name "<type>/<language>/<dataset>/<model_name>" \
|
|
||||||
--vocoder_name "<type>/<language>/<dataset>/<model_name>"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Python 🐸TTS API
|
|
||||||
|
|
||||||
You can run a multi-speaker and multi-lingual model in Python as
|
|
||||||
|
|
||||||
```python
|
|
||||||
import torch
|
|
||||||
from TTS.api import TTS
|
|
||||||
|
|
||||||
# Get device
|
|
||||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
|
||||||
|
|
||||||
# List available 🐸TTS models
|
|
||||||
print(TTS().list_models())
|
|
||||||
|
|
||||||
# Init TTS
|
|
||||||
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
|
|
||||||
|
|
||||||
# Run TTS
|
|
||||||
# ❗ Since this model is multi-lingual voice cloning model, we must set the target speaker_wav and language
|
|
||||||
# Text to speech list of amplitude values as output
|
|
||||||
wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en")
|
|
||||||
# Text to speech to a file
|
|
||||||
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Here is an example for a single speaker model.
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Init TTS with the target model name
|
|
||||||
tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False)
|
|
||||||
# Run TTS
|
|
||||||
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example voice cloning with YourTTS in English, French and Portuguese:
|
|
||||||
|
|
||||||
```python
|
|
||||||
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to("cuda")
|
|
||||||
tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
|
|
||||||
tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr", file_path="output.wav")
|
|
||||||
tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt", file_path="output.wav")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav`
|
|
||||||
|
|
||||||
```python
|
|
||||||
tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda")
|
|
||||||
tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example voice cloning by a single speaker TTS model combining with the voice conversion model.
|
|
||||||
|
|
||||||
This way, you can clone voices by using any model in 🐸TTS.
|
|
||||||
|
|
||||||
```python
|
|
||||||
tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
|
|
||||||
tts.tts_with_vc_to_file(
|
|
||||||
"Wie sage ich auf Italienisch, dass ich dich liebe?",
|
|
||||||
speaker_wav="target/speaker.wav",
|
|
||||||
file_path="ouptut.wav"
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Example text to speech using **Fairseq models in ~1100 languages** 🤯.
|
|
||||||
For these models use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
|
|
||||||
|
|
||||||
You can find the list of language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).
|
|
||||||
|
|
||||||
```python
|
|
||||||
from TTS.api import TTS
|
|
||||||
api = TTS(model_name="tts_models/eng/fairseq/vits").to("cuda")
|
|
||||||
api.tts_to_file("This is a test.", file_path="output.wav")
|
|
||||||
|
|
||||||
# TTS with on the fly voice conversion
|
|
||||||
api = TTS("tts_models/deu/fairseq/vits")
|
|
||||||
api.tts_with_vc_to_file(
|
|
||||||
"Wie sage ich auf Italienisch, dass ich dich liebe?",
|
|
||||||
speaker_wav="target/speaker.wav",
|
|
||||||
file_path="ouptut.wav"
|
|
||||||
)
|
|
||||||
```
|
```
|
||||||
|
|
|
@ -1,40 +1,6 @@
|
||||||
# Installation
|
# Installation
|
||||||
|
|
||||||
🐸TTS supports python >=3.9 <3.13.0 and was tested on Ubuntu 22.04.
|
```{include} ../../README.md
|
||||||
|
:start-after: <!-- start installation -->
|
||||||
## Using `pip`
|
:end-before: <!-- end installation -->
|
||||||
|
|
||||||
`pip` is recommended if you want to use 🐸TTS only for inference.
|
|
||||||
|
|
||||||
You can install from PyPI as follows:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install coqui-tts # from PyPI
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Or install from Github:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install git+https://github.com/idiap/coqui-ai-TTS # from Github
|
|
||||||
```
|
|
||||||
|
|
||||||
## Installing From Source
|
|
||||||
|
|
||||||
This is recommended for development and more control over 🐸TTS.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone https://github.com/idiap/coqui-ai-TTS
|
|
||||||
cd coqui-ai-TTS
|
|
||||||
make system-deps # only on Linux systems.
|
|
||||||
|
|
||||||
# Install package and optional extras
|
|
||||||
make install
|
|
||||||
|
|
||||||
# Same as above + dev dependencies and pre-commit
|
|
||||||
make install_dev
|
|
||||||
```
|
|
||||||
|
|
||||||
## On Windows
|
|
||||||
If you are on Windows, 👑@GuyPaddock wrote installation instructions
|
|
||||||
[here](https://stackoverflow.com/questions/66726331/) (note that these are out
|
|
||||||
of date, e.g. you need to have at least Python 3.9)
|
|
||||||
|
|
|
@ -1,22 +1,22 @@
|
||||||
# Model API
|
# Model API
|
||||||
The Model API provides a set of functions that easily make your model compatible with the `Trainer`,
|
The Model API provides a set of functions that easily make your model compatible with the `Trainer`,
|
||||||
`Synthesizer` and `ModelZoo`.
|
`Synthesizer` and the Coqui Python API.
|
||||||
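As a rough sketch, a model plugged into these APIs usually subclasses `BaseTTS` and fills in the training and inference hooks below (the method set is based on the existing 🐸TTS models; exact signatures vary, so treat this as an outline rather than the definitive interface):

```python
from TTS.tts.models.base_tts import BaseTTS


class MyTTSModel(BaseTTS):
    """Hypothetical model outline showing the hooks the Trainer and Synthesizer rely on."""

    def forward(self, *args, **kwargs):
        """Training-time forward pass; returns a dict of model outputs."""
        raise NotImplementedError

    def inference(self, *args, **kwargs):
        """Inference pass used by the Synthesizer; returns the generated outputs."""
        raise NotImplementedError

    def train_step(self, batch, criterion):
        """Compute outputs and losses for a single training batch."""
        raise NotImplementedError

    def eval_step(self, batch, criterion):
        """Same as train_step, used during evaluation."""
        raise NotImplementedError

    def load_checkpoint(self, config, checkpoint_path, eval=False):
        """Restore model weights for fine-tuning or inference."""
        raise NotImplementedError
```

The autoclass sections below document the actual base classes these hooks come from.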
|
|
||||||
## Base TTS Model
|
## Base Trainer Model
|
||||||
|
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.model.BaseTrainerModel
|
.. autoclass:: TTS.model.BaseTrainerModel
|
||||||
:members:
|
:members:
|
||||||
```
|
```
|
||||||
|
|
||||||
## Base tts Model
|
## Base TTS Model
|
||||||
|
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.tts.models.base_tts.BaseTTS
|
.. autoclass:: TTS.tts.models.base_tts.BaseTTS
|
||||||
:members:
|
:members:
|
||||||
```
|
```
|
||||||
|
|
||||||
## Base vocoder Model
|
## Base Vocoder Model
|
||||||
|
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder
|
.. autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder
|
||||||
|
|
|
@ -1,3 +1,3 @@
|
||||||
# Trainer API
|
# Trainer API
|
||||||
|
|
||||||
We made the trainer a separate project on https://github.com/eginhard/coqui-trainer
|
We made the trainer a separate project: https://github.com/idiap/coqui-ai-Trainer
|
||||||
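For reference, the training scripts in the recipes typically drive that trainer roughly like this (assuming a `config`, a `model`, and the train/eval sample lists have already been prepared, e.g. as in the GlowTTS/LJSpeech recipe; treat the argument list as illustrative):

```python
from trainer import Trainer, TrainerArgs

# `config`, `model`, `train_samples` and `eval_samples` are assumed to be
# built beforehand by the recipe script.
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path="output/",
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```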
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Mary-TTS API Support for Coqui-TTS
|
# Mary-TTS API support for Coqui TTS
|
||||||
|
|
||||||
## What is Mary-TTS?
|
## What is Mary-TTS?
|
||||||
|
|
||||||
|
|
|
@ -1,25 +1,25 @@
|
||||||
# ⓍTTS
|
# XTTS
|
||||||
ⓍTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise,
|
XTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise,
|
||||||
ⓍTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy.
|
XTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy.
|
||||||
There is no need for an excessive amount of training data that spans countless hours.
|
There is no need for an excessive amount of training data that spans countless hours.
|
||||||
|
|
||||||
### Features
|
## Features
|
||||||
- Voice cloning.
|
- Voice cloning.
|
||||||
- Cross-language voice cloning.
|
- Cross-language voice cloning.
|
||||||
- Multi-lingual speech generation.
|
- Multi-lingual speech generation.
|
||||||
- 24 kHz sampling rate.
|
- 24 kHz sampling rate.
|
||||||
- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-inference))
|
- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-manually))
|
||||||
- Fine-tuning support. (See [Training](#training))
|
- Fine-tuning support. (See [Training](#training))
|
||||||
|
|
||||||
### Updates with v2
|
## Updates with v2
|
||||||
- Improved voice cloning.
|
- Improved voice cloning.
|
||||||
- Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime.
|
- Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime.
|
||||||
- Across the board quality improvements.
|
- Across the board quality improvements.
|
||||||
|
|
||||||
### Code
|
## Code
|
||||||
The current implementation only supports inference and GPT encoder training.
|
The current implementation only supports inference and GPT encoder training.
|
||||||
|
|
||||||
### Languages
|
## Languages
|
||||||
XTTS-v2 supports 17 languages:
|
XTTS-v2 supports 17 languages:
|
||||||
|
|
||||||
- Arabic (ar)
|
- Arabic (ar)
|
||||||
|
@ -40,15 +40,15 @@ XTTS-v2 supports 17 languages:
|
||||||
- Spanish (es)
|
- Spanish (es)
|
||||||
- Turkish (tr)
|
- Turkish (tr)
|
||||||
|
|
||||||
### License
|
## License
|
||||||
This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml).
|
This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml).
|
||||||
|
|
||||||
### Contact
|
## Contact
|
||||||
Come and join in our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Github](https://github.com/idiap/coqui-ai-TTS/discussions).
|
Come and join in our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Github](https://github.com/idiap/coqui-ai-TTS/discussions).
|
||||||
|
|
||||||
### Inference
|
## Inference
|
||||||
|
|
||||||
#### 🐸TTS Command line
|
### 🐸TTS Command line
|
||||||
|
|
||||||
You can check all supported languages with the following command:
|
You can check all supported languages with the following command:
|
||||||
|
|
||||||
|
@ -64,7 +64,7 @@ You can check all Coqui available speakers with the following command:
|
||||||
--list_speaker_idx
|
--list_speaker_idx
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Coqui speakers
|
#### Coqui speakers
|
||||||
You can do inference using one of the available speakers using the following command:
|
You can do inference using one of the available speakers using the following command:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
|
@ -75,10 +75,10 @@ You can do inference using one of the available speakers using the following com
|
||||||
--use_cuda
|
--use_cuda
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Clone a voice
|
#### Clone a voice
|
||||||
You can clone a speaker voice using a single or multiple references:
|
You can clone a speaker voice using a single or multiple references:
|
||||||
|
|
||||||
###### Single reference
|
##### Single reference
|
||||||
|
|
||||||
```console
|
```console
|
||||||
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
||||||
|
@ -88,7 +88,7 @@ You can clone a speaker voice using a single or multiple references:
|
||||||
--use_cuda
|
--use_cuda
|
||||||
```
|
```
|
||||||
|
|
||||||
###### Multiple references
|
##### Multiple references
|
||||||
```console
|
```console
|
||||||
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
|
||||||
--text "Bugün okula gitmek istemiyorum." \
|
--text "Bugün okula gitmek istemiyorum." \
|
||||||
|
@ -106,12 +106,12 @@ or for all wav files in a directory you can use:
|
||||||
--use_cuda
|
--use_cuda
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 🐸TTS API
|
### 🐸TTS API
|
||||||
|
|
||||||
##### Clone a voice
|
#### Clone a voice
|
||||||
You can clone a speaker voice using a single or multiple references:
|
You can clone a speaker voice using a single or multiple references:
|
||||||
|
|
||||||
###### Single reference
|
##### Single reference
|
||||||
|
|
||||||
Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio.
|
Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio.
|
||||||
You can optionally disable sentence splitting for better coherence, at the cost of higher VRAM usage and possibly exceeding the model's context length limit.
|
You can optionally disable sentence splitting for better coherence, at the cost of higher VRAM usage and possibly exceeding the model's context length limit.
|
||||||
|
@ -129,7 +129,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
###### Multiple references
|
##### Multiple references
|
||||||
|
|
||||||
You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
|
You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
|
||||||
|
|
||||||
|
@ -154,7 +154,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
|
||||||
language="en")
|
language="en")
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Coqui speakers
|
#### Coqui speakers
|
||||||
|
|
||||||
You can do inference using one of the available speakers using the following code:
|
You can do inference using one of the available speakers using the following code:
|
||||||
|
|
||||||
|
@ -172,11 +172,11 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
#### 🐸TTS Model API
|
### 🐸TTS Model API
|
||||||
|
|
||||||
To use the model API, you need to download the model files and pass config and model file paths manually.
|
To use the model API, you need to download the model files and pass config and model file paths manually.
|
||||||
|
|
||||||
#### Manual Inference
|
### Manual Inference
|
||||||
|
|
||||||
If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
|
If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
|
||||||
|
|
||||||
|
@ -184,7 +184,7 @@ If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjo
|
||||||
pip install deepspeed==0.10.3
|
pip install deepspeed==0.10.3
|
||||||
```
|
```
|
||||||
|
|
||||||
##### inference parameters
|
#### Inference parameters
|
||||||
|
|
||||||
- `text`: The text to be synthesized.
|
- `text`: The text to be synthesized.
|
||||||
- `language`: The language of the text to be synthesized.
|
- `language`: The language of the text to be synthesized.
|
||||||
|
@ -199,7 +199,7 @@ pip install deepspeed==0.10.3
|
||||||
- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows arbitrarily long inputs but might lose important context between sentences. Defaults to True.
|
- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows arbitrarily long inputs but might lose important context between sentences. Defaults to True.
|
||||||
|
|
||||||
|
|
||||||
##### Inference
|
#### Inference
|
||||||
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
|
@ -231,7 +231,7 @@ torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
##### Streaming manually
|
#### Streaming manually
|
||||||
|
|
||||||
Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
|
Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
|
||||||
Streaming inference is typically slower than regular inference, but it allows to get a first chunk of audio faster.
|
Streaming inference is typically slower than regular inference, but it allows to get a first chunk of audio faster.
|
||||||
|
@ -275,9 +275,9 @@ torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### Training
|
## Training
|
||||||
|
|
||||||
#### Easy training
|
### Easy training
|
||||||
To make `XTTS_v2` GPT encoder training easier for beginners, we provide a Gradio demo that implements the whole fine-tuning pipeline. The demo lets the user easily do the following steps:
|
To make `XTTS_v2` GPT encoder training easier for beginners, we provide a Gradio demo that implements the whole fine-tuning pipeline. The demo lets the user easily do the following steps:
|
||||||
|
|
||||||
- Preprocessing of the uploaded audio files with the 🐸TTS Coqui formatter
|
- Preprocessing of the uploaded audio files with the 🐸TTS Coqui formatter
|
||||||
|
@ -286,7 +286,7 @@ To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio
|
||||||
|
|
||||||
The user can run this gradio demo locally or remotely using a Colab Notebook.
|
The user can run this gradio demo locally or remotely using a Colab Notebook.
|
||||||
|
|
||||||
##### Run demo on Colab
|
#### Run demo on Colab
|
||||||
To make `XTTS_v2` fine-tuning more accessible for users who do not have good GPUs available, we provide a Google Colab Notebook.
|
To make `XTTS_v2` fine-tuning more accessible for users who do not have good GPUs available, we provide a Google Colab Notebook.
|
||||||
|
|
||||||
The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
|
The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
|
||||||
|
@ -302,7 +302,7 @@ If you are not able to access the video you need to follow the steps:
|
||||||
5. As soon as the training is done, you can go to the third tab (3 - Inference), click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking on the button "Step 4 - Inference".
|
5. As soon as the training is done, you can go to the third tab (3 - Inference), click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking on the button "Step 4 - Inference".
|
||||||
|
|
||||||
|
|
||||||
##### Run demo locally
|
#### Run demo locally
|
||||||
|
|
||||||
To run the demo locally you need to do the following steps:
|
To run the demo locally you need to do the following steps:
|
||||||
1. Install 🐸 TTS following the instructions available [here](https://coqui-tts.readthedocs.io/en/latest/installation.html).
|
1. Install 🐸 TTS following the instructions available [here](https://coqui-tts.readthedocs.io/en/latest/installation.html).
|
||||||
|
@ -319,7 +319,7 @@ If you are not able to access the video, here is what you need to do:
|
||||||
4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
|
4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
|
||||||
5. Now you can run inference with the model by clicking on the button "Step 4 - Inference".
|
5. Now you can run inference with the model by clicking on the button "Step 4 - Inference".
|
||||||
|
|
||||||
#### Advanced training
|
### Advanced training
|
||||||
|
|
||||||
A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py
|
A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py
|
||||||
|
|
||||||
|
@ -393,6 +393,6 @@ torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
|
||||||
|
|
||||||
## XTTS Model
|
## XTTS Model
|
||||||
```{eval-rst}
|
```{eval-rst}
|
||||||
.. autoclass:: TTS.tts.models.xtts.XTTS
|
.. autoclass:: TTS.tts.models.xtts.Xtts
|
||||||
:members:
|
:members:
|
||||||
```
|
```
|
||||||
|
|
|
@ -0,0 +1,30 @@
|
||||||
|
# Project structure
|
||||||
|
|
||||||
|
## Directory structure
|
||||||
|
|
||||||
|
A non-comprehensive overview of the Coqui source code:
|
||||||
|
|
||||||
|
| Directory | Contents |
|
||||||
|
| - | - |
|
||||||
|
| **Core** | |
|
||||||
|
| **[`TTS/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS)** | Main source code |
|
||||||
|
| **[`- .models.json`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/.models.json)** | Pretrained model list |
|
||||||
|
| **[`- api.py`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/api.py)** | Python API |
|
||||||
|
| **[`- bin/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/bin)** | Executables and CLI |
|
||||||
|
| **[`- tts/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts)** | Text-to-speech models |
|
||||||
|
| **[`- configs/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/configs)** | Model configurations |
|
||||||
|
| **[`- layers/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/layers)** | Model layer definitions |
|
||||||
|
| **[`- models/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/tts/models)** | Model definitions |
|
||||||
|
| **[`- vc/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/vc)** | Voice conversion models |
|
||||||
|
| `- (same)` | |
|
||||||
|
| **[`- vocoder/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/vocoder)** | Vocoder models |
|
||||||
|
| `- (same)` | |
|
||||||
|
| **[`- encoder/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/TTS/encoder)** | Speaker encoder models |
|
||||||
|
| `- (same)` | |
|
||||||
|
| **Recipes/notebooks** | |
|
||||||
|
| **[`notebooks/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/notebooks)** | Jupyter Notebooks for model evaluation, parameter selection and data analysis |
|
||||||
|
| **[`recipes/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes)** | Training recipes |
|
||||||
|
| **Others** | |
|
||||||
|
| **[`pyproject.toml`](https://github.com/idiap/coqui-ai-TTS/tree/dev/pyproject.toml)** | Project metadata, configuration and dependencies |
|
||||||
|
| **[`docs/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/docs)** | Documentation |
|
||||||
|
| **[`tests/`](https://github.com/idiap/coqui-ai-TTS/tree/dev/tests)** | Unit and integration tests |
|
|
@ -0,0 +1,30 @@
|
||||||
|
# Demo server
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
You can boot up a demo 🐸TTS server to run an inference with your models (make
|
||||||
|
sure to install the additional dependencies with `pip install coqui-tts[server]`).
|
||||||
|
Note that the server is not optimized for performance and does not support all
|
||||||
|
Coqui models yet.
|
||||||
|
|
||||||
|
The demo server provides pretty much the same interface as the CLI command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tts-server -h # see the help
|
||||||
|
tts-server --list_models # list the available models.
|
||||||
|
```
|
||||||
|
|
||||||
|
Run a TTS model, from the release models list, with its default vocoder.
|
||||||
|
If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize
|
||||||
|
speech.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tts-server --model_name "<type>/<language>/<dataset>/<model_name>"
|
||||||
|
```
|
||||||
|
|
||||||
|
Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tts-server --model_name "<type>/<language>/<dataset>/<model_name>" \
|
||||||
|
--vocoder_name "<type>/<language>/<dataset>/<model_name>"
|
||||||
|
```
|
|
@ -1,4 +1,4 @@
|
||||||
# Fine-tuning a 🐸 TTS model
|
# Fine-tuning a model
|
||||||
|
|
||||||
## Fine-tuning
|
## Fine-tuning
|
||||||
|
|
||||||
|
@ -21,8 +21,9 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
|
||||||
Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
|
Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
|
||||||
speech dataset and achieve reasonable results with only a couple of hours of data.
|
speech dataset and achieve reasonable results with only a couple of hours of data.
|
||||||
|
|
||||||
However, note that fine-tuning does not ensure great results. The model performance still depends on the
|
However, note that fine-tuning does not ensure great results. The model
|
||||||
{ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore,
|
performance still depends on the [dataset quality](../datasets/what_makes_a_good_dataset.md)
|
||||||
|
and the hyper-parameters you choose for fine-tuning. Therefore,
|
||||||
it still takes a bit of tinkering.
|
it still takes a bit of tinkering.
|
||||||
|
|
||||||
|
|
||||||
|
@ -31,7 +32,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
|
||||||
1. Setup your dataset.
|
1. Setup your dataset.
|
||||||
|
|
||||||
You need to format your target dataset in a certain way so that the 🐸TTS data loader will be able to load it for the
|
You need to format your target dataset in a certain way so that the 🐸TTS data loader will be able to load it for the
|
||||||
training. Please see {ref}`this page <formatting_your_dataset>` for more information about formatting.
|
training. Please see [this page](../datasets/formatting_your_dataset.md) for more information about formatting.
|
||||||
|
|
||||||
2. Choose the model you want to fine-tune.
|
2. Choose the model you want to fine-tune.
|
||||||
|
|
||||||
|
@ -47,7 +48,8 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
|
||||||
|
|
||||||
You should choose the model based on your requirements. Some models are fast and some offer better speech quality.
|
You should choose the model based on your requirements. Some models are fast and some offer better speech quality.
|
||||||
One lazy way to test a model is to run it on the hardware you want to use and see how it performs. For
|
One lazy way to test a model is to run it on the hardware you want to use and see how it performs. For
|
||||||
simple testing, you can use the `tts` command on the terminal. For more info see {ref}`here <synthesizing_speech>`.
|
simple testing, you can use the `tts` command on the terminal. For more info
|
||||||
|
see [here](../inference.md).
|
||||||
|
|
||||||
3. Download the model.
|
3. Download the model.
|
||||||
|
|
|
@ -0,0 +1,10 @@
|
||||||
|
# Training and fine-tuning
|
||||||
|
|
||||||
|
The following pages show you how to train and fine-tune Coqui models:
|
||||||
|
|
||||||
|
```{toctree}
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
training_a_model
|
||||||
|
finetuning
|
||||||
|
```
|
|
@ -1,4 +1,4 @@
|
||||||
# Training a Model
|
# Training a model
|
||||||
|
|
||||||
1. Decide the model you want to use.
|
1. Decide the model you want to use.
|
||||||
|
|
||||||
|
@ -11,11 +11,10 @@
|
||||||
|
|
||||||
3. Check the recipes.
|
3. Check the recipes.
|
||||||
|
|
||||||
Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good start point for
|
Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good starting point.
|
||||||
`Nervous Beginners`.
|
|
||||||
A recipe for `GlowTTS` using the `LJSpeech` dataset is shown below. Let's be creative and call this `train_glowtts.py`.
|
A recipe for `GlowTTS` using the `LJSpeech` dataset is shown below. Let's be creative and call this `train_glowtts.py`.
|
||||||
|
|
||||||
```{literalinclude} ../../recipes/ljspeech/glow_tts/train_glowtts.py
|
```{literalinclude} ../../../recipes/ljspeech/glow_tts/train_glowtts.py
|
||||||
```
|
```
|
||||||
|
|
||||||
You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig`
|
You need to change fields of the `BaseDatasetConfig` to match your dataset and then update `GlowTTSConfig`
|
||||||
|
@ -113,7 +112,7 @@
|
||||||
|
|
||||||
Note that different models have different metrics, visuals and outputs.
|
Note that different models have different metrics, visuals and outputs.
|
||||||
|
|
||||||
You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions
|
You should also check the [FAQ page](../faq.md) for common problems and solutions
|
||||||
that occur in a training.
|
that occur in a training.
|
||||||
|
|
||||||
7. Use your best model for inference.
|
7. Use your best model for inference.
|
||||||
|
@ -132,7 +131,7 @@
|
||||||
In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.
|
In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.
|
||||||
|
|
||||||
|
|
||||||
# Multi-speaker Training
|
## Multi-speaker Training
|
||||||
|
|
||||||
Training a multi-speaker model is mostly the same as training a single-speaker model.
|
Training a multi-speaker model is mostly the same as training a single-speaker model.
|
||||||
You need to specify a couple of configuration parameters, instantiate a `SpeakerManager` and pass it to the model (see the sketch below).
|
You need to specify a couple of configuration parameters, instantiate a `SpeakerManager` and pass it to the model (see the sketch below).
|
||||||
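A rough sketch of that wiring, based on the multi-speaker recipes (treat method and argument names as illustrative; they may differ slightly between versions):

```python
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.speakers import SpeakerManager

# `config`, `ap`, `tokenizer`, `train_samples` and `eval_samples` are assumed
# to be prepared as in the single-speaker example above.
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.num_speakers = speaker_manager.num_speakers

model = GlowTTS(config, ap, tokenizer, speaker_manager=speaker_manager)
```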
|
@ -142,5 +141,5 @@ d-vectors. For using d-vectors, you first need to compute the d-vectors using th
|
||||||
|
|
||||||
The same Glow-TTS model above can be trained on a multi-speaker VCTK dataset with the script below.
|
The same Glow-TTS model above can be trained on a multi-speaker VCTK dataset with the script below.
|
||||||
|
|
||||||
```{literalinclude} ../../recipes/vctk/glow_tts/train_glow_tts.py
|
```{literalinclude} ../../../recipes/vctk/glow_tts/train_glow_tts.py
|
||||||
```
|
```
|
|
@ -1,24 +1,37 @@
|
||||||
# Tutorial For Nervous Beginners
|
# Tutorial for nervous beginners
|
||||||
|
|
||||||
## Installation
|
First [install](installation.md) Coqui TTS.
|
||||||
|
|
||||||
User friendly installation. Recommended only for synthesizing voice.
|
## Synthesizing Speech
|
||||||
|
|
||||||
|
You can run `tts` and synthesize speech directly on the terminal.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ pip install coqui-tts
|
$ tts -h # see the help
|
||||||
|
$ tts --list_models # list the available models.
|
||||||
```
|
```
|
||||||
|
|
||||||
Developer friendly installation.
|

|
||||||
|
|
||||||
|
|
||||||
|
You can call `tts-server` to start a local demo server that you can open on
|
||||||
|
your favorite web browser and 🗣️ (make sure to install the additional
|
||||||
|
dependencies with `pip install coqui-tts[server]`).
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ git clone https://github.com/idiap/coqui-ai-TTS
|
$ tts-server -h # see the help
|
||||||
$ cd coqui-ai-TTS
|
$ tts-server --list_models # list the available models.
|
||||||
$ pip install -e .
|
|
||||||
```
|
```
|
||||||
|

|
||||||
|
|
||||||
|
See [this page](inference.md) for more details on synthesizing speech with the
|
||||||
|
CLI, server or Python API.
|
||||||
|
|
||||||
## Training a `tts` Model
|
## Training a `tts` Model
|
||||||
|
|
||||||
A breakdown of a simple script that trains a GlowTTS model on the LJSpeech dataset. See the comments for more details.
|
A breakdown of a simple script that trains a GlowTTS model on the LJSpeech
|
||||||
|
dataset. For a more in-depth guide to training and fine-tuning also see [this
|
||||||
|
page](training/index.md).
|
||||||
|
|
||||||
### Pure Python Way
|
### Pure Python Way
|
||||||
|
|
||||||
|
@ -99,25 +112,3 @@ We still support running training from CLI like in the old days. The same traini
|
||||||
```
|
```
|
||||||
|
|
||||||
❗️ Note that you can also use ```train_vocoder.py``` in the same way as the ```tts``` models above.
|
❗️ Note that you can also use ```train_vocoder.py``` in the same way as the ```tts``` models above.
|
||||||
|
|
||||||
## Synthesizing Speech
|
|
||||||
|
|
||||||
You can run `tts` and synthesize speech directly on the terminal.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ tts -h # see the help
|
|
||||||
$ tts --list_models # list the available models.
|
|
||||||
```
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
|
|
||||||
You can call `tts-server` to start a local demo server that you can open on
|
|
||||||
your favorite web browser and 🗣️ (make sure to install the additional
|
|
||||||
dependencies with `pip install coqui-tts[server]`).
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ tts-server -h # see the help
|
|
||||||
$ tts-server --list_models # list the available models.
|
|
||||||
```
|
|
||||||

|
|
||||||
|
|
|
@ -143,12 +143,12 @@ dev = [
|
||||||
]
|
]
|
||||||
# Dependencies for building the documentation
|
# Dependencies for building the documentation
|
||||||
docs = [
|
docs = [
|
||||||
"furo>=2023.5.20",
|
"furo>=2024.8.6",
|
||||||
"myst-parser==2.0.0",
|
"myst-parser==3.0.1",
|
||||||
"sphinx==7.2.5",
|
"sphinx==7.4.7",
|
||||||
"sphinx_inline_tabs>=2023.4.21",
|
"sphinx_inline_tabs>=2023.4.21",
|
||||||
"sphinx_copybutton>=0.1",
|
"sphinx_copybutton>=0.5.2",
|
||||||
"linkify-it-py>=2.0.0",
|
"linkify-it-py>=2.0.3",
|
||||||
]
|
]
|
||||||
|
|
||||||
[project.urls]
|
[project.urls]
|
||||||
|
|
|
@ -22,8 +22,12 @@ def sync_readme():
|
||||||
new_content = replace_between_markers(orig_content, "tts-readme", description.strip())
|
new_content = replace_between_markers(orig_content, "tts-readme", description.strip())
|
||||||
if args.check:
|
if args.check:
|
||||||
if orig_content != new_content:
|
if orig_content != new_content:
|
||||||
print("README.md is out of sync; please edit TTS/bin/TTS_README.md and run scripts/sync_readme.py")
|
print(
|
||||||
|
"README.md is out of sync; please reconcile README.md and TTS/bin/synthesize.py and run scripts/sync_readme.py"
|
||||||
|
)
|
||||||
exit(42)
|
exit(42)
|
||||||
|
print("All good, files in sync")
|
||||||
|
exit(0)
|
||||||
readme_path.write_text(new_content)
|
readme_path.write_text(new_content)
|
||||||
print("Updated README.md")
|
print("Updated README.md")
|
||||||
|
|
||||||
|
|