From bba21b86c654e9f90ba7cf0622162655ebb95453 Mon Sep 17 00:00:00 2001
From: omahs <73983677+omahs@users.noreply.github.com>
Date: Tue, 5 Dec 2023 09:41:23 +0100
Subject: [PATCH 1/6] fix typo

---
 docs/source/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/configuration.md b/docs/source/configuration.md
index cde7e073..ada61e16 100644
--- a/docs/source/configuration.md
+++ b/docs/source/configuration.md
@@ -56,4 +56,4 @@ ModelConfig()

 In the example above, ```ModelConfig()``` is the final configuration that the model receives and it has all the fields necessary for the model.

-We host pre-defined model configurations under ```TTS//configs/```.Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.
\ No newline at end of file
+We host pre-defined model configurations under ```TTS//configs/```. Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.

From c03fe7377b7b7eb21be6e0241b793f24ea4e0878 Mon Sep 17 00:00:00 2001
From: omahs <73983677+omahs@users.noreply.github.com>
Date: Tue, 5 Dec 2023 09:45:00 +0100
Subject: [PATCH 2/6] fix typos

---
 docs/source/finetuning.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/finetuning.md b/docs/source/finetuning.md
index c236260d..069f5651 100644
--- a/docs/source/finetuning.md
+++ b/docs/source/finetuning.md
@@ -21,7 +21,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
     Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your
     own speech dataset and achieve reasonable results with only a couple of hours of data.

-    However, note that, fine-tuning does not ensure great results. The model performance is still depends on the
+    However, note that, fine-tuning does not ensure great results. The model performance still depends on the
     {ref}`dataset quality ` and the hyper-parameters you choose for fine-tuning. Therefore,
     it still takes a bit of tinkering.

@@ -41,7 +41,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
     tts --list_models
     ```

-    The command above lists the the models in a naming format as ```///```.
+    The command above lists the models in a naming format as ```///```.

     Or you can manually check the `.models.json` file in the project directory.

From cfb143b9fbaa38e4418cd697fa2879d979b51642 Mon Sep 17 00:00:00 2001
From: omahs <73983677+omahs@users.noreply.github.com>
Date: Tue, 5 Dec 2023 09:46:36 +0100
Subject: [PATCH 3/6] fix typos

---
 docs/source/formatting_your_dataset.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/formatting_your_dataset.md b/docs/source/formatting_your_dataset.md
index 796c7b6d..23c497d0 100644
--- a/docs/source/formatting_your_dataset.md
+++ b/docs/source/formatting_your_dataset.md
@@ -7,7 +7,7 @@ If you have a single audio file and you need to split it into clips, there are d

 It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using `wav` file format.

-Let's assume you created the audio clips and their transcription. You can collect all your clips under a folder. Let's call this folder `wavs`.
+Let's assume you created the audio clips and their transcription. You can collect all your clips in a folder. Let's call this folder `wavs`.

 ```
 /wavs
   - audio1.wav
   - audio2.wav
   - audio3.wav
   ...
 ```

@@ -17,7 +17,7 @@ Let's assume you created the audio clips and their transcription. You can collec

-You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimitered by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.
+You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimited by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.

 We recommend the following format delimited by `|`. In the following example, `audio1`, `audio2` refer to files `audio1.wav`, `audio2.wav` etc.

@@ -55,7 +55,7 @@ For more info about dataset qualities and properties check our [post](https://gi

 After you collect and format your dataset, you need to check two things. Whether you need a `formatter` and a `text_cleaner`. The `formatter` loads the text file (created above) as a list and the `text_cleaner` performs a sequence of text normalization operations that converts the raw text into the spoken representation (e.g. converting numbers to text, acronyms, and symbols to the spoken format).

-If you use a different dataset format then the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.
+If you use a different dataset format than the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.

 If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`.
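
As a concrete illustration of the custom `formatter` idea discussed in the hunks above, a minimal sketch could look like the following. The list-of-dicts return shape and the keys (`text`, `audio_file`, `speaker_name`, `root_path`) follow the common 🐸TTS formatter convention but are assumptions here, not part of the patch; check the formatters shipped with the repo for the exact contract.

```python
import os

def my_dataset_formatter(root_path, meta_file, **kwargs):
    """Parse a ``|``-delimited metadata file into a list of samples.

    Assumes lines like ``audio1|This is my sentence.|this is my sentence.``
    and clips stored as ``<root_path>/wavs/<name>.wav``.
    """
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            items.append({
                "text": cols[1],
                "audio_file": wav_file,
                "speaker_name": "my_speaker",  # single-speaker dataset assumed
                "root_path": root_path,
            })
    return items

# Usage (hypothetical paths): my_dataset_formatter("/data/my_dataset", "metadata.csv")
```
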
From 775a9138b72fef55b2a17c729caf418459664238 Mon Sep 17 00:00:00 2001
From: omahs <73983677+omahs@users.noreply.github.com>
Date: Tue, 5 Dec 2023 09:47:07 +0100
Subject: [PATCH 4/6] fix typo

---
 docs/source/implementing_a_new_language_frontend.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/implementing_a_new_language_frontend.md b/docs/source/implementing_a_new_language_frontend.md
index f4f6a04a..2041352d 100644
--- a/docs/source/implementing_a_new_language_frontend.md
+++ b/docs/source/implementing_a_new_language_frontend.md
@@ -2,11 +2,11 @@

 - Language frontends are located under `TTS.tts.utils.text`
 - Each special language has a separate folder.
-- Each folder containst all the utilities for processing the text input.
+- Each folder contains all the utilities for processing the text input.
 - `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities from the previous step and used to convert the text to phonemes or graphemes for the model.
 - After you implement your phonemizer, you need to add it to the `TTS/tts/utils/text/phonemizers/__init__.py` to be able to map the language code in the model config - `config.phoneme_language` - to the phonemizer class and initiate the phonemizer automatically.
 - You should also add tests to `tests/text_tests` if you want to make a PR.

-We suggest you to check the available implementations as reference. Good luck!
\ No newline at end of file
+We suggest you to check the available implementations as reference. Good luck!
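
To make the frontend steps above more concrete, here is a toy grapheme-to-phoneme sketch. It deliberately does not subclass the real phonemizer base class; the class name, the lexicon entries, and the `phonemize` method are illustrative assumptions only, and a real implementation should follow the interface of the existing classes under `TTS.tts.utils.text.phonemizers`.

```python
class MyLangPhonemizer:
    """Toy lexicon-based grapheme-to-phoneme converter for a hypothetical language."""

    lexicon = {"hello": "h ə l oʊ", "world": "w ɜː l d"}  # hypothetical entries

    def phonemize(self, text: str, separator: str = " ") -> str:
        # Fall back to the graphemes themselves for out-of-lexicon words.
        return separator.join(self.lexicon.get(w, w) for w in text.lower().split())

print(MyLangPhonemizer().phonemize("hello world"))  # -> "h ə l oʊ w ɜː l d"
```
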
From 716657c83555a5ad917f79fd7dba3d7b2245df8f Mon Sep 17 00:00:00 2001
From: omahs <73983677+omahs@users.noreply.github.com>
Date: Tue, 5 Dec 2023 09:48:03 +0100
Subject: [PATCH 5/6] fix typos

---
 docs/source/implementing_a_new_model.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/implementing_a_new_model.md b/docs/source/implementing_a_new_model.md
index e2a0437e..1bf7a882 100644
--- a/docs/source/implementing_a_new_model.md
+++ b/docs/source/implementing_a_new_model.md
@@ -145,7 +145,7 @@ class MyModel(BaseTTS):
         Args:
             ap (AudioProcessor): audio processor used at training.
             batch (Dict): Model inputs used at the previous training step.
-            outputs (Dict): Model outputs generated at the previoud training step.
+            outputs (Dict): Model outputs generated at the previous training step.

         Returns:
             Tuple[Dict, np.ndarray]: training plots and output waveform.
@@ -183,7 +183,7 @@ class MyModel(BaseTTS):
         ...

     def get_optimizer(self) -> Union["Optimizer", List["Optimizer"]]:
-        """Setup an return optimizer or optimizers."""
+        """Setup a return optimizer or optimizers."""
         pass

     def get_lr(self) -> Union[float, List[float]]:
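
For context on the two hooks touched in the hunk above: `get_optimizer` is meant to set up and return the optimizer (or a list of optimizers) and `get_lr` the matching initial learning rate(s). A minimal single-optimizer sketch follows; it uses a plain `torch.nn.Module` so the snippet stays self-contained, and the `AdamW` choice and the `1e-4` value are placeholders rather than settings of any released 🐸TTS model.

```python
import torch

class MyModel(torch.nn.Module):
    """Stand-in for a model implementing the trainer hooks shown above."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(80, 80)  # dummy layer so parameters() is non-empty

    def get_lr(self) -> float:
        """Return the initial learning rate for the optimizer below."""
        return 1e-4

    def get_optimizer(self) -> torch.optim.Optimizer:
        """Set up and return the optimizer used during training."""
        return torch.optim.AdamW(self.parameters(), lr=self.get_lr())

optimizer = MyModel().get_optimizer()
```
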
From f659fa16bce7855c737e9360b0fa13e06bd4f4fd Mon Sep 17 00:00:00 2001
From: omahs <73983677+omahs@users.noreply.github.com>
Date: Tue, 5 Dec 2023 09:50:33 +0100
Subject: [PATCH 6/6] fix typo

---
 docs/source/marytts.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/marytts.md b/docs/source/marytts.md
index 81d54710..9091ca33 100644
--- a/docs/source/marytts.md
+++ b/docs/source/marytts.md
@@ -2,13 +2,13 @@

 ## What is Mary-TTS?

-[Mary (Modular Architecture for Research in sYynthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKI’s](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.
+[Mary (Modular Architecture for Research in sYnthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKI’s](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.

 MaryTTS has been around for a very! long time. Version 3.0 even dates back to 2006, long before Deep Learning was a broadly known term and the last official release was version 5.2 in 2016.
 You can check out this OpenVoice-Tech page to learn more: https://openvoice-tech.net/index.php/MaryTTS

 ## Why Mary-TTS compatibility is relevant

-Due to it's open-source nature, relatively high quality voices and fast synthetization speed Mary-TTS was a popular choice in the past and many tools implemented API support over the years like screen-readers (NVDA + SpeechHub), smart-home HUBs (openHAB, Home Assistant) or voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.
+Due to its open-source nature, relatively high quality voices and fast synthetization speed Mary-TTS was a popular choice in the past and many tools implemented API support over the years like screen-readers (NVDA + SpeechHub), smart-home HUBs (openHAB, Home Assistant) or voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.

 ## API and code examples

 You can enter the same URLs in your browser and check-out the results there as well.

 ### How it works and limitations

 A classic Mary-TTS server would usually show all installed locales and voices via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE` for processing. For Coqui-TTS we usually start the server with one specific locale and model and thus cannot return all available options. Instead we return the active locale and use the model name as "voice". Since we only have one active model and always want to return a WAV-file, we currently ignore all other processing parameters except `INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always return `u` (undefined).
-We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
\ No newline at end of file
+We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
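
To illustrate the behaviour described in the hunk above (only `INPUT_TEXT` is honoured and a WAV file is always returned), a client request to the compatibility layer could look like the sketch below. The host, port and `/process` path are assumptions for illustration, matching the classic MaryTTS conventions; use whatever address and endpoint your running server actually exposes.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed server address and MaryTTS-style endpoint -- adjust to your setup.
BASE_URL = "http://localhost:59125/process"

params = urlencode({"INPUT_TEXT": "Hello from the MaryTTS compatibility layer."})
with urlopen(f"{BASE_URL}?{params}") as response:
    audio = response.read()  # the server is expected to answer with WAV bytes

with open("output.wav", "wb") as f:
    f.write(audio)
```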