Compare commits

...

114 Commits
v0.20.3 ... dev

Author SHA1 Message Date
Nick Potafiy dbf1a08a0d
Update generic_utils.py (#3561)
Handles cases where `git branch` produces no output or invalid output; previously this simply crashed with `StopIteration`.
2024-02-10 11:20:58 -03:00
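For context, a minimal sketch of the defensive pattern described here: using `next()` with a default so empty `git branch` output no longer raises `StopIteration`. The function name and fallback value are illustrative rather than the exact `generic_utils.py` code.

```python
import subprocess

def current_git_branch() -> str:
    # Hedged sketch, not the exact TTS.utils.generic_utils implementation.
    try:
        out = subprocess.check_output(["git", "branch"], text=True).splitlines()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return ""
    # next() with a default never raises StopIteration when no line matches
    current = next((line for line in out if line.startswith("*")), "")
    return current.replace("* ", "")
```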
Edresson Casanova 5dcc16d193
Bug fix in MP3 and FLAC compute length on TTSDataset (#3092)
* Bug Fix on XTTS load

* Bug fix in MP3 length on TTSDataset

* Update TTS/tts/datasets/dataset.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Uses mutagen for all audio formats

* Add dataloader test with all supported audio formats

* Use mutagen.File

* Update

* Fix aux unit tests

* Bug fix on unit tests

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
2023-12-27 13:23:43 -03:00
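For context, a rough sketch of the mutagen-based duration lookup this fix switches to: `mutagen.File` auto-detects the container, so MP3, FLAC and WAV all report a length. The file path and sample rate below are placeholders.

```python
from mutagen import File as MutagenFile

audio = MutagenFile("clip.mp3")              # placeholder path; format is auto-detected
duration_seconds = audio.info.length         # duration in seconds for MP3/FLAC/WAV alike
num_frames = int(duration_seconds * 22050)   # assumed dataset sample rate
print(duration_seconds, num_frames)
```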
Eren Gölge 55c7063724
Merge pull request #3423 from idiap/fix-aux-tests
Fix CI (save best model after 0 steps in tests)
2023-12-14 18:00:30 +01:00
Enno Hermann 99fee6f5ad build: use Trainer>=0.0.36 2023-12-14 14:26:31 +01:00
Eren Gölge 186cafb34c
Merge pull request #3412 from coqui-ai/reuben/docs-studio-refs
Remove Coqui Studio references
2023-12-13 08:54:57 +01:00
Eren Gölge 3991d83b2c
Merge branch 'dev' into reuben/docs-studio-refs 2023-12-13 08:53:43 +01:00
Eren Gölge fa28f99f15
Update to v0.22.0 2023-12-12 16:10:46 +01:00
Eren Gölge 8c1a8b522b
Merge pull request #3405 from coqui-ai/studio_speakers
Add studio speakers to open source XTTS!
2023-12-12 16:10:09 +01:00
Reuben Morais 0859e9f252
Remove Coqui Studio references 2023-12-12 16:09:57 +01:00
Enno Hermann 9f325b1f6c fixup! Fix aux unit tests 2023-12-12 16:07:16 +01:00
Edresson Casanova fc099218df Fix aux unit tests 2023-12-12 16:07:16 +01:00
Eren Gölge 934b87bbd1
Merge pull request #3391 from aaron-lii/multi-gpu
support multiple GPU training for XTTS
2023-12-12 13:51:26 +01:00
Eren Gölge b0fe0e678d
Merge pull request #3392 from joelhoward0/fix_contributing_typo
fixes a typo
2023-12-12 13:50:59 +01:00
Eren Gölge 936084be7e
Merge pull request #3404 from freds0/dev
Training fastspeech2 with External Speaker Embeddings
2023-12-12 13:50:27 +01:00
Eren Gölge 8e6a7cbfbf
Update .models.json 2023-12-12 13:50:01 +01:00
Eren Gölge 8999780aff
Update test_models.py 2023-12-12 13:30:21 +01:00
Eren Gölge 4dc0722bbc
Update .models.json 2023-12-12 13:28:16 +01:00
Edresson Casanova 4b33699b41 Update docs 2023-12-12 09:22:07 -03:00
Edresson Casanova b6e1ac66d9 Add docs 2023-12-12 09:19:56 -03:00
WeberJulian 61b67ef16f Fix read_json_with_comments 2023-12-11 23:58:52 +01:00
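For illustration, the comment-handling regex that lands in `TTS/config/__init__.py` (visible in the diff near the end of this comparison) can be exercised on a made-up snippet: line comments are stripped while `//` inside quoted strings such as URLs is preserved, so the cleaned text still parses as JSON.

```python
import json
import re

input_str = '{"url": "https://coqui.gateway.scarf.sh/hf/bark/config.json", "n": 1 // trailing note\n}'
cleaned = re.sub(
    r"(\"(?:[^\"\\]|\\.)*\")|(/\*(?:.|[\\n\\r])*?\*/)|(//.*)",
    lambda m: m.group(1) or m.group(2) or "",
    input_str,
)
print(json.loads(cleaned))  # the URL survives, the trailing line comment is gone
```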
WeberJulian d47b6df4e5 Make comments in .model.json valid 2023-12-11 23:35:27 +01:00
WeberJulian 605a857add Remove tortoise 2023-12-11 23:35:07 +01:00
WeberJulian b40750d1f5 Remove models that require app.coqui.ai 2023-12-11 23:17:54 +01:00
WeberJulian ecc38891fb Fix CI readme 2023-12-11 23:01:30 +01:00
WeberJulian 5ab228dff2 Fix CI 2023-12-11 22:31:53 +01:00
WeberJulian 8c20a599d8 Remove coqui studio integration from TTS 2023-12-11 22:11:46 +01:00
WeberJulian 5cd750ac7e Fix API and CI 2023-12-11 20:21:53 +01:00
WeberJulian e3c9dab7a3 Make CLI work 2023-12-11 18:49:18 +01:00
WeberJulian 0a90359a42 rename speaker file 2023-12-11 18:48:49 +01:00
WeberJulian a5c0d9780f rename manager 2023-12-11 18:48:31 +01:00
WeberJulian 36143fee26 Add basic speaker manager 2023-12-11 15:25:46 +01:00
Frederico S. Oliveira f9117918fe
Update .models.json 2023-12-11 10:47:31 -03:00
Frederico S. Oliveira 163f9a3fdf
Merge branch 'coqui-ai:dev' into dev 2023-12-11 10:04:07 -03:00
WeberJulian 0a136a8535 Download speaker file 2023-12-11 11:29:36 +01:00
joelhoward0 e535cfe07c fixes a typo 2023-12-08 14:19:57 +00:00
Aaron-Li b6e929696a support multiple GPU training 2023-12-08 16:55:32 +08:00
Eren Gölge c99e885cc8
Merge pull request #3373 from coqui-ai/add-doc-xtts
Add inference parameters
2023-12-07 14:07:28 +01:00
Eren Gölge 4b35a1e756
Merge pull request #3381 from JRMeyer/licensing-message
Print message for either commercial license or CPML
2023-12-07 13:57:39 +01:00
Josh Meyer 759d9ab3ae
Print message for either commercial license or CPML 2023-12-07 13:54:48 +01:00
Eren Gölge 6b2ba527fa
Merge pull request #3368 from omahs/patch-1
Fix typos
2023-12-06 15:10:14 +01:00
WeberJulian 7d1a6defd6 Add inference parameters 2023-12-06 11:43:31 +01:00
omahs f659fa16bc
fix typo 2023-12-05 09:50:33 +01:00
omahs 716657c835
fix typos 2023-12-05 09:48:03 +01:00
omahs 775a9138b7
fix typo 2023-12-05 09:47:07 +01:00
omahs cfb143b9fb
fix typos 2023-12-05 09:46:36 +01:00
omahs c03fe7377b
fix typos 2023-12-05 09:45:00 +01:00
omahs bba21b86c6
fix typo 2023-12-05 09:41:23 +01:00
Eren Gölge e49c512d99
Merge pull request #3351 from aaron-lii/chinese-puncs
fix pause problem of Chinese speech
2023-12-04 15:57:42 +01:00
Eren Gölge 9c7b850995
Merge pull request #3352 from VladCuciureanu/patch-1
fix: Few typos in Tortoise docs.
2023-12-04 15:56:37 +01:00
Eren Gölge 2d02015978
Update to v0.21.3 2023-12-01 23:52:57 +01:00
Edresson Casanova 5f900f156a
Add XTTS Fine tuning gradio demo (#3296)
* Add XTTS FT demo data processing pipeline

* Add training and inference columns

* Uses tabs instead of columns

* Fix demo freezing issue

* Update demo

* Convert stereo to mono

* Bug fix on XTTS inference

* Update gradio demo

* Update gradio demo

* Update gradio demo

* Update gradio demo

* Add parameters to be able to set them on colab demo

* Add error messages

* Add intuitive error messages

* Update

* Add max_audio_length parameter

* Add XTTS fine-tuner docs

* Update XTTS finetuner docs

* Delete trainer to freeze memory

* Delete unused variables

* Add gc.collect()

* Update xtts.md

---------

Co-authored-by: Eren Gölge <erogol@hotmail.com>
2023-12-01 23:52:23 +01:00
Vlad Cuciureanu f5b41674e8
fix: Few typos in Tortoise docs. 2023-12-01 20:42:41 +02:00
Aaron-Li 7b8808186a fix pause problem of Chinese speech 2023-12-01 23:30:03 +08:00
Frederico S. Oliveira bcd500fa7b Fixing bug
Fixes training of the Fastspeech/Fastspeech2/FastPitch/SpeedySpeech models using external speaker embeddings.
2023-11-30 17:27:05 -03:00
Frederico S. Oliveira a26e51b0b4
Merge branch 'coqui-ai:dev' into dev 2023-11-30 14:19:05 -03:00
Eren Gölge 6d1905c2b7
Update to v0.21.2 2023-11-30 13:05:10 +01:00
Hannes Krumbiegel e40527b103
Fix link to installation instructions (#3329) 2023-11-30 13:03:33 +01:00
Enno Hermann 39321d02be
fix: correctly strip/restore initial punctuation (#3336)
* refactor(punctuation): remove orphan code for handling lone punctuation

The case of lone punctuation is already handled at the top of restore(). The
removed if statement would never be called and would in fact raise an
AttributeError because the _punc_index named tuple doesn't have the attribute
`mark`.

* refactor(punctuation): remove unused argument

* fix(punctuation): correctly handle initial punctuation

Stripping and restoring initial punctuation didn't work correctly because the
string-splitting caused an additional empty string to be inserted in the text
list (because `".A".split(".")` => `["", "A"]`). Now, an initial empty string is
skipped and relevant test cases are added.

Fixes #3333
2023-11-30 13:03:16 +01:00
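The failure mode is easy to reproduce; a small illustration of the split behavior and the skip described above (not the library's `restore()` code itself):

```python
text = ".A"
parts = text.split(".")   # ['', 'A'] -- the leading empty string broke restoration
if parts and parts[0] == "":
    parts = parts[1:]     # skip the spurious initial chunk, as the fix does
print(parts)              # ['A']
```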
Eren Gölge 93283385e0
Merge pull request #3318 from coqui-ai/calling_hf_models
Run XTTS models by direct name with versions
2023-11-30 13:02:26 +01:00
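A hedged usage sketch: the full registered name below comes from the `.models.json` shown later in this comparison, while the shorter version-style aliases this PR enables are not reproduced here; the reference clip and output paths are placeholders.

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello world.",
    speaker_wav="reference.wav",  # placeholder reference clip
    language="en",
    file_path="out.wav",
)
```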
Frederico S. Oliveira 77c2155609
Merge pull request #1 from coqui-ai/dev
Update
2023-11-29 17:24:02 -03:00
Eren Gölge bfbaffc84a Fixup 2023-11-28 13:47:45 +01:00
Eren Gölge 18b7d746cb Updating XTTS docs 2023-11-27 14:54:49 +01:00
Eren Gölge b75e90ba85 Make text splitting optional 2023-11-27 14:53:11 +01:00
Eren Gölge 3b8894a3dd Make style 2023-11-27 14:15:50 +01:00
Eren Gölge 2fd8cf3d94 Make xtts runnable by version names 2023-11-27 14:15:16 +01:00
Eren Gölge 11ec9f7471 Add hi in config defaults 2023-11-24 15:38:36 +01:00
Eren Gölge 00a870c26a Update to v0.21.1 2023-11-24 15:15:44 +01:00
Eren Gölge 7e575068c9 Merge branch 'dev' of https://github.com/coqui-ai/TTS into dev 2023-11-24 15:15:19 +01:00
Eren Gölge 32065139e7 Simple text cleaner for "hi" 2023-11-24 15:14:34 +01:00
Eren Gölge 1542a50c3a
Update to v0.21.0 2023-11-24 14:37:05 +01:00
Eren Gölge 6dd43b0ce2 Update to XTTS v2.0.3 2023-11-24 14:36:04 +01:00
Julian Weber a55755c8df
update deepspeed version (#3281) 2023-11-24 12:35:49 +01:00
Kaszanas 1bf5926196
Introducing Development Dockerfile (#3263)
* Moved Dockerfile, COPY at the end

This change should prevent re-installation of the dependencies upon
every change of the repository's contents. Typically if Docker detects
that something changed in a layer, all downstream layers are invalidated
and rebuilt.

* Moved Dockerfile back to main directory

Keeping the main Dockerfile in a separate directory can cause issues with the
current CI/CD setup; moving it can be a good change for later.

* Introduced Dockerfile.dev, updated CONTRIBUTING

Dockerfile.dev can be used as a separate development environment for
anyone that does not wish to install the dependencies locally.
2023-11-24 12:30:15 +01:00
TITC 4d0f53d2ee
Misjudgment of `is_multi_lingual` When Loading Multilingual Model via `model_path` (#3273)
* load multilingual model by path

* use config to assert multi lingual or not
2023-11-24 12:28:31 +01:00
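A minimal sketch of the scenario being fixed, assuming a locally downloaded multilingual (XTTS-style) checkpoint; the paths are placeholders. With the config-based check, the flag is detected even though no registered model name is set.

```python
from TTS.api import TTS

tts = TTS(
    model_path="/path/to/xtts/model.pth",     # placeholder local checkpoint
    config_path="/path/to/xtts/config.json",  # placeholder config
)
print(tts.is_multi_lingual)  # expected True for a multilingual config
```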
Enno Hermann 8c5227ed84
Fix tts_with_vc (#3275)
* Revert "fix for issue 3067"

This reverts commit 041b4b6723.

Fixes #3143. The original issue (#3067) was people trying to use
tts.tts_with_vc_to_file() with XTTS and was "fixed" in #3109. But XTTS has
integrated VC and you can just do tts.tts_to_file(..., speaker_wav="..."), there
is no point in passing it through FreeVC afterwards. So, reverting this commit
because it breaks tts.tts_with_vc_to_file() for any model that doesn't have
integrated VC, i.e. all models this method is meant for.

* fix: support multi-speaker models in tts_with_vc/tts_with_vc_to_file

* fix: only compute spk embeddings for models that support it

Fixes #1440. Passing a `speaker_wav` argument to regular Vits models failed
because they don't support voice cloning. Now that argument is simply ignored.
2023-11-24 12:26:37 +01:00
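A hedged sketch of the distinction drawn above (model names and file paths are illustrative): XTTS clones the reference voice directly, while `tts_with_vc_to_file()` stays the route for models without built-in cloning.

```python
from TTS.api import TTS

# XTTS has integrated voice cloning -- pass the reference wav directly:
xtts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
xtts.tts_to_file(text="Hello.", speaker_wav="target.wav", language="en", file_path="xtts_out.wav")

# Single-speaker models have no cloning, so chain them with FreeVC instead:
vits = TTS("tts_models/en/ljspeech/vits")
vits.tts_with_vc_to_file(text="Hello.", speaker_wav="target.wav", file_path="vits_vc_out.wav")
```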
Enno Hermann 2af0220996
fix: don't pass quotes to espeak (#3286)
Previously, the text was wrapped in an additional set of quotes that was passed
to Espeak. This could result in different phonemization in certain edge cases and
caused the insertion of an initial separator "_" that had to be removed.
Compare:
$ espeak-ng -q -b 1 -v en-us --ipa=1 '"A"'
_ˈɐ
$ espeak-ng -q -b 1 -v en-us --ipa=1 'A'
ˈeɪ

Fixes #2619
2023-11-24 12:25:37 +01:00
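The same behavior can be reproduced from Python; a hedged demonstration (requires `espeak-ng` on PATH) showing that with an argv list no extra quoting is needed, and that adding quotes changes the phonemization. This is not the phonemizer code itself, just the behavior it relies on.

```python
import subprocess

def ipa(text: str) -> str:
    cmd = ["espeak-ng", "-q", "-b", "1", "-v", "en-us", "--ipa=1", text]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

print(ipa("A"))    # ˈeɪ  -- plain text, no wrapping quotes needed with an argv list
print(ipa('"A"'))  # _ˈɐ  -- the old extra quotes change the result
```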
Enno Hermann 4a2684be34
fix(bin.synthesize): more informative error for wrong --language argument (#3294)
In multilingual models, the target language is specified via the
`--language_idx` argument. However, the `tts` CLI also accepts a `--language`
argument for use with Coqui Studio, so it is easy to choose the wrong one,
resulting in the following confusing error at synthesis time:

```
AssertionError:   Language None is not supported. Supported languages are
['en', 'es', 'fr', 'de', 'it', 'pt', 'pl', 'tr', 'ru', 'nl', 'cs', 'ar',
'zh-cn', 'hu', 'ko', 'ja']
```

This commit adds a better error message when `--language` is passed for a
non-studio model.

Fixes #3270, fixes #3291
2023-11-24 12:24:42 +01:00
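A sketch of the friendlier check, assuming the argument names used by the `tts` CLI; the exact error wording is illustrative, not the merged message.

```python
from typing import Optional

def check_language_arg(language: Optional[str], is_studio_model: bool) -> None:
    # Fail early with a pointer to --language_idx instead of the opaque
    # "Language None is not supported" assertion at synthesis time.
    if language is not None and not is_studio_model:
        raise ValueError(
            "`--language` is only valid for Coqui Studio models. "
            "For multilingual models such as XTTS, use `--language_idx` instead."
        )
```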
Tessa Painter 64f391b583
Made the tqdm `progress_bar` objects of static download methods a static class variable (#3297) 2023-11-24 12:23:59 +01:00
Eren Gölge b47d9c6e36
Merge pull request #3243 from idiap/checkpoints
Remove duplicate/unused code
2023-11-22 23:52:06 +01:00
Eren Gölge 29dede20d3
Merge pull request #3249 from coqui-ai/run_ci_for_v0.20.6
Run CI for v0.20.6
2023-11-17 15:45:26 +01:00
Eren Gölge c011ab7455 Update to v0.20.6 2023-11-17 15:16:32 +01:00
Eren Gölge 52cb1e2f68 Update model hash for v2.0.2 2023-11-17 15:16:32 +01:00
Edresson Casanova 6075fa208c Ensures that only GPT model is in training mode during XTTS GPT training (#3241)
* Ensures that only GPT model is in training mode during training

* Fix parallel wavegan unit test
2023-11-17 15:15:22 +01:00
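The intent can be sketched as below; the `gpt` attribute name is an assumption for illustration and not necessarily the trainer's exact field.

```python
import torch

def set_gpt_only_training(xtts: torch.nn.Module) -> None:
    # Hedged sketch: everything else (DVAE, decoder, ...) stays in eval mode,
    # only the GPT sub-module is switched back to training mode.
    xtts.eval()
    xtts.gpt.train()  # `gpt` attribute assumed for illustration
```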
Eren Gölge a3279f9294 Make style 2023-11-17 15:15:22 +01:00
Eren Gölge f21067a84a Make k_diffusion optional 2023-11-17 15:15:21 +01:00
Eren Gölge 44494daa27 Update CI version 2023-11-17 15:15:21 +01:00
Eren Gölge c864acf2b7 Update versions 2023-11-17 15:15:21 +01:00
Edresson Casanova 11283fce07
Ensures that only GPT model is in training mode during XTTS GPT training (#3241)
* Ensures that only GPT model is in training mode during training

* Fix parallel wavegan unit test
2023-11-17 15:13:46 +01:00
Eren Gölge 14579a4607
Merge pull request #3248 from coqui-ai/slacker_deps
Update versions
2023-11-17 15:13:19 +01:00
Eren Gölge 44880f09ed Make style 2023-11-17 13:43:34 +01:00
Eren Gölge 26efdf6ee7 Make k_diffusion optional 2023-11-17 13:42:33 +01:00
Eren Gölge 08d11e9198 Update CI version 2023-11-17 13:01:32 +01:00
Eren Gölge 63d7145647 Update versions 2023-11-17 12:10:46 +01:00
Enno Hermann 0fb0d67de7 refactor: use save_checkpoint()/save_best_model() from Trainer 2023-11-17 01:18:23 +01:00
Enno Hermann 96678c7ba2 refactor: use copy_model_files() from Trainer 2023-11-17 01:18:23 +01:00
Enno Hermann 5119e651a1 chore(utils.io): remove unused code
These are all available in Trainer.
2023-11-17 01:18:23 +01:00
Enno Hermann 39fe38bda4 refactor: use save_fsspec() from Trainer 2023-11-17 01:18:23 +01:00
Enno Hermann fdf0c8b10a chore(encoder): remove unused code 2023-11-17 01:18:23 +01:00
Eren Gölge 7e4375da2b
Update to v0.20.6 2023-11-16 17:52:13 +01:00
Julian Weber fbc18b8c34
Fix zh bug (#3238) 2023-11-16 17:51:37 +01:00
Julian Weber 675f983550
Add sentence splitting (#3227)
* Add sentence splitting

* update requirements

* update default args v2

* Add spanish

* Fix return gpt_latents

* Update requirements

* Fix requirements
2023-11-16 11:01:11 +01:00
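A hedged usage sketch of the new option, mirroring the `split_sentences` parameter documented in the `TTS/api.py` diff further down; the model name and file paths are placeholders.

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="A very long passage that would otherwise hit text-length or VRAM limits...",
    speaker_wav="reference.wav",   # placeholder reference clip
    language="en",
    file_path="out.wav",
    split_sentences=True,          # synthesize sentence by sentence, then concatenate
)
```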
Enno Hermann 3c2d5a9e03
Remove duplicate AudioProcessor code and fix ExtractTTSpectrogram.ipynb (#3230)
* chore: remove unused argument

* refactor(audio.processor): remove duplicate stft+griffin_lim

* chore(audio.processor): remove unused compute_stft_paddings

Same function available in numpy_transforms

* refactor(audio.processor): remove duplicate db_to_amp

* refactor(audio.processor): remove duplicate amp_to_db

* refactor(audio.processor): remove duplicate linear_to_mel

* refactor(audio.processor): remove duplicate mel_to_linear

* refactor(audio.processor): remove duplicate build_mel_basis

* refactor(audio.processor): remove duplicate stft_parameters

* refactor(audio.processor): use pre-/deemphasis from numpy_transforms

* refactor(audio.processor): use rms_volume_norm from numpy_transforms

* chore(audio.processor): remove duplicate assert

Already checked in numpy_transforms.compute_f0

* refactor(audio.processor): use find_endpoint from numpy_transforms

* refactor(audio.processor): use trim_silence from numpy_transforms

* refactor(audio.processor): use volume_norm from numpy_transforms

* refactor(audio.processor): use load_wav from numpy_transforms

* fix(bin.extract_tts_spectrograms): set quantization bits

* fix(ExtractTTSpectrogram.ipynb): adapt to current TTS code

Fixes #2447, #2574

* refactor(audio.processor): remove duplicate quantization methods
2023-11-16 10:57:06 +01:00
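For reference, a small sketch of the stand-alone `numpy_transforms` helpers this refactor consolidates on, mirroring the `quantize` and `save_wav` calls visible in the diffs below; the waveform is synthetic and the argument order follows those diffs.

```python
import numpy as np
from TTS.utils.audio.numpy_transforms import quantize, save_wav

wav = np.random.uniform(-1.0, 1.0, size=22050).astype(np.float32)  # 1 s of noise
wavq = quantize(wav, 10)                                           # quantize to 10 bits
save_wav(wav=wav, path="sample.wav", sample_rate=22050)
```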
Eren Gölge 88630c60e5
Update to v0.20.5 2023-11-15 14:02:51 +01:00
Edresson Casanova 73a5bd08c0
Fix XTTS GPT padding and inference issues (#3216)
* Fix end artifact for fine tuning models

* Bug fix on zh-cn inference

* Remove unused code
2023-11-15 14:02:05 +01:00
Ikko Eltociear Ashimine 15f0ac57d6
Update README.md (#3215)
Dicord -> Discord
2023-11-15 13:59:56 +01:00
Julian Weber 04901fb2e4
Add speed control for inference (#3214)
* Add speed control for inference

* Fix XTTS tests

* Add speed control tests
2023-11-14 16:07:17 +01:00
Eren Gölge d96f3885d5
Update to v0.20.4 2023-11-13 17:07:25 +01:00
Eren Gölge ac3df409a6
Merge pull request #3208 from coqui-ai/fix_max_mel_len
fix max generation length for XTTS
2023-11-13 14:32:56 +01:00
Eren Gölge f32a465711
Merge pull request #3207 from coqui-ai/update_xtts_cloning
Update XTTS cloning
2023-11-13 14:32:43 +01:00
Eren Gölge 92fa988aec Fixup 2023-11-13 13:44:06 +01:00
WeberJulian b85536b23f fix max generation length 2023-11-13 13:18:45 +01:00
Eren Gölge b2682d39c5 Make style 2023-11-13 13:01:01 +01:00
Eren Gölge a16360af85 Implement chunking gpt_cond 2023-11-13 13:00:08 +01:00
Fred f6eaa61afe Adding checkpoint model 2023-07-02 18:55:50 -03:00
137 changed files with 2028 additions and 1884 deletions

View File

@ -1,53 +0,0 @@
name: api_tests
on:
push:
branches:
- main
jobs:
check_skip:
runs-on: ubuntu-latest
if: "! contains(github.event.head_commit.message, '[ci skip]')"
steps:
- run: echo "${{ github.event.head_commit.message }}"
test:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [3.9, "3.10", "3.11"]
experimental: [false]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
architecture: x64
cache: 'pip'
cache-dependency-path: 'requirements*'
- name: check OS
run: cat /etc/os-release
- name: set ENV
run: |
export TRAINER_TELEMETRY=0
- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends git make gcc
sudo apt-get install espeak-ng
make system-deps
- name: Install/upgrade Python setup deps
run: python3 -m pip install --upgrade pip setuptools wheel
- name: Replace scarf urls
run: |
sed -i 's/https:\/\/coqui.gateway.scarf.sh\//https:\/\/github.com\/coqui-ai\/TTS\/releases\/download\//g' TTS/.models.json
- name: Install TTS
run: |
python3 -m pip install .[all]
python3 setup.py egg_info
- name: Unit tests
run: make api_tests
env:
COQUI_STUDIO_TOKEN: ${{ secrets.COQUI_STUDIO_TOKEN }}

View File

@ -10,7 +10,7 @@ jobs:
build-sdist:
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Verify tag matches version
run: |
set -ex
@ -38,7 +38,7 @@ jobs:
matrix:
python-version: ["3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}

View File

@ -1,52 +0,0 @@
name: zoo-tests-tortoise
on:
push:
branches:
- main
pull_request:
types: [opened, synchronize, reopened]
jobs:
check_skip:
runs-on: ubuntu-latest
if: "! contains(github.event.head_commit.message, '[ci skip]')"
steps:
- run: echo "${{ github.event.head_commit.message }}"
test:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [3.9, "3.10", "3.11"]
experimental: [false]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
architecture: x64
cache: 'pip'
cache-dependency-path: 'requirements*'
- name: check OS
run: cat /etc/os-release
- name: set ENV
run: export TRAINER_TELEMETRY=0
- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y git make gcc
sudo apt-get install espeak espeak-ng
make system-deps
- name: Install/upgrade Python setup deps
run: python3 -m pip install --upgrade pip setuptools wheel
- name: Replace scarf urls
run: |
sed -i 's/https:\/\/coqui.gateway.scarf.sh\//https:\/\/github.com\/coqui-ai\/TTS\/releases\/download\//g' TTS/.models.json
- name: Install TTS
run: |
python3 -m pip install .[all]
python3 setup.py egg_info
- name: Unit tests
run: nose2 -F -v -B --with-coverage --coverage TTS tests.zoo_tests.test_models.test_tortoise

View File

@ -48,7 +48,7 @@ The following steps are tested on an Ubuntu system.
1. Fork 🐸TTS[https://github.com/coqui-ai/TTS] by clicking the fork button at the top right corner of the project page.
2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```.
2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```.
```bash
$ git clone git@github.com:<your Github name>/TTS.git
@ -128,6 +128,32 @@ The following steps are tested on an Ubuntu system.
14. Once things look perfect, We merge it to the ```dev``` branch and make it ready for the next version.
## Development in Docker container
If you prefer working within a Docker container as your development environment, you can do the following:
1. Fork 🐸TTS[https://github.com/coqui-ai/TTS] by clicking the fork button at the top right corner of the project page.
2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```.
```bash
$ git clone git@github.com:<your Github name>/TTS.git
$ cd TTS
$ git remote add upstream https://github.com/coqui-ai/TTS.git
```
3. Build the Docker Image as your development environment (it installs all of the dependencies for you):
```
docker build --tag=tts-dev:latest -f .\dockerfiles\Dockerfile.dev .
```
4. Run the container with GPU support:
```
docker run -it --gpus all tts-dev:latest /bin/bash
```
Feel free to ping us at any step you need help using our communication channels.
If you are new to Github or open-source contribution, These are good resources.

View File

@ -1,13 +1,19 @@
ARG BASE=nvidia/cuda:11.8.0-base-ubuntu22.04
FROM ${BASE}
RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y --no-install-recommends gcc g++ make python3 python3-dev python3-pip python3-venv python3-wheel espeak-ng libsndfile1-dev && rm -rf /var/lib/apt/lists/*
RUN pip3 install llvmlite --ignore-installed
WORKDIR /root
COPY . /root
# Install Dependencies:
RUN pip3 install torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
RUN rm -rf /root/.cache/pip
# Copy TTS repository contents:
WORKDIR /root
COPY . /root
RUN make install
ENTRYPOINT ["tts"]
CMD ["--help"]

View File

@ -35,9 +35,6 @@ test_zoo: ## run zoo tests.
inference_tests: ## run inference tests.
nose2 -F -v -B --with-coverage --coverage TTS tests.inference_tests
api_tests: ## run api tests.
nose2 -F -v -B --with-coverage --coverage TTS tests.api_tests
data_tests: ## run data tests.
nose2 -F -v -B --with-coverage --coverage TTS tests.data_tests

View File

@ -7,11 +7,6 @@
- 📣 [🐶Bark](https://github.com/suno-ai/bark) is now available for inference with unconstrained voice cloning. [Docs](https://tts.readthedocs.io/en/dev/models/bark.html)
- 📣 You can use [~1100 Fairseq models](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS.
- 📣 🐸TTS now supports 🐢Tortoise with faster inference. [Docs](https://tts.readthedocs.io/en/dev/models/tortoise.html)
- 📣 **Coqui Studio API** is landed on 🐸TTS. - [Example](https://github.com/coqui-ai/TTS/blob/dev/README.md#-python-api)
- 📣 [**Coqui Studio API**](https://docs.coqui.ai/docs) is live.
- 📣 Voice generation with prompts - **Prompt to Voice** - is live on [**Coqui Studio**](https://app.coqui.ai/auth/signin)!! - [Blog Post](https://coqui.ai/blog/tts/prompt-to-voice)
- 📣 Voice generation with fusion - **Voice fusion** - is live on [**Coqui Studio**](https://app.coqui.ai/auth/signin).
- 📣 Voice cloning is live on [**Coqui Studio**](https://app.coqui.ai/auth/signin).
<div align="center">
<img src="https://static.scarf.sh/a.png?x-pxid=cf317fe7-2188-4721-bc01-124bb5d5dbb2" />
@ -28,7 +23,7 @@
📚 Utilities for dataset analysis and curation.
______________________________________________________________________
[![Dicord](https://img.shields.io/discord/1037326658807533628?color=%239B59B6&label=chat%20on%20discord)](https://discord.gg/5eXr5seRrv)
[![Discord](https://img.shields.io/discord/1037326658807533628?color=%239B59B6&label=chat%20on%20discord)](https://discord.gg/5eXr5seRrv)
[![License](<https://img.shields.io/badge/License-MPL%202.0-brightgreen.svg>)](https://opensource.org/licenses/MPL-2.0)
[![PyPI version](https://badge.fury.io/py/TTS.svg)](https://badge.fury.io/py/TTS)
[![Covenant](https://camo.githubusercontent.com/7d620efaa3eac1c5b060ece5d6aacfcc8b81a74a04d05cd0398689c01c4463bb/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436f6e7472696275746f72253230436f76656e616e742d76322e3025323061646f707465642d6666363962342e737667)](https://github.com/coqui-ai/TTS/blob/master/CODE_OF_CONDUCT.md)
@ -72,7 +67,7 @@ Please use our dedicated channels for questions and discussion. Help is much mor
| Type | Links |
| ------------------------------- | --------------------------------------- |
| 💼 **Documentation** | [ReadTheDocs](https://tts.readthedocs.io/en/latest/)
| 💾 **Installation** | [TTS/README.md](https://github.com/coqui-ai/TTS/tree/dev#install-tts)|
| 💾 **Installation** | [TTS/README.md](https://github.com/coqui-ai/TTS/tree/dev#installation)|
| 👩‍💻 **Contributing** | [CONTRIBUTING.md](https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md)|
| 📌 **Road Map** | [Main Development Plans](https://github.com/coqui-ai/TTS/issues/378)
| 🚀 **Released Models** | [TTS Releases](https://github.com/coqui-ai/TTS/releases) and [Experimental Models](https://github.com/coqui-ai/TTS/wiki/Experimental-Released-Models)|
@ -253,29 +248,6 @@ tts.tts_with_vc_to_file(
)
```
#### Example using [🐸Coqui Studio](https://coqui.ai) voices.
You access all of your cloned voices and built-in speakers in [🐸Coqui Studio](https://coqui.ai).
To do this, you'll need an API token, which you can obtain from the [account page](https://coqui.ai/account).
After obtaining the API token, you'll need to configure the COQUI_STUDIO_TOKEN environment variable.
Once you have a valid API token in place, the studio speakers will be displayed as distinct models within the list.
These models will follow the naming convention `coqui_studio/en/<studio_speaker_name>/coqui_studio`
```python
# XTTS model
models = TTS(cs_api_model="XTTS").list_models()
# Init TTS with the target studio speaker
tts = TTS(model_name="coqui_studio/en/Torcull Diarmuid/coqui_studio", progress_bar=False)
# Run TTS
tts.tts_to_file(text="This is a test.", language="en", file_path=OUTPUT_PATH)
# V1 model
models = TTS(cs_api_model="V1").list_models()
# Run TTS with emotion and speed control
# Emotion control only works with V1 model
tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH, emotion="Happy", speed=1.5)
```
#### Example text to speech using **Fairseq models in ~1100 languages** 🤯.
For Fairseq models, use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
You can find the language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html)
@ -351,12 +323,6 @@ If you don't specify any models, then it uses LJSpeech based English model.
$ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```
- Run TTS and define speed factor to use for 🐸Coqui Studio models, between 0.0 and 2.0:
```
$ tts --text "Text for TTS" --model_name "coqui_studio/<language>/<dataset>/<model_name>" --speed 1.2 --out_path output/path/speech.wav
```
- Run a TTS model with its default vocoder model:
```

View File

@ -3,14 +3,15 @@
"multilingual": {
"multi-dataset": {
"xtts_v2": {
"description": "XTTS-v2 by Coqui with 16 languages.",
"description": "XTTS-v2.0.3 by Coqui with 17 languages.",
"hf_url": [
"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth",
"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/config.json",
"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json",
"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/hash.md5"
"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/hash.md5",
"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/speakers_xtts.pth"
],
"model_hash": "6a09d1ad43896f06041ed8195956c9698f13b6189dc80f1c74bdc2b8e8d15324",
"model_hash": "10f92b55c512af7a8d39d650547a15a7",
"default_vocoder": null,
"commit": "480a6cdf7",
"license": "CPML",
@ -45,7 +46,7 @@
"hf_url": [
"https://coqui.gateway.scarf.sh/hf/bark/coarse_2.pt",
"https://coqui.gateway.scarf.sh/hf/bark/fine_2.pt",
"https://app.coqui.ai/tts_model/text_2.pt",
"https://coqui.gateway.scarf.sh/hf/text_2.pt",
"https://coqui.gateway.scarf.sh/hf/bark/config.json",
"https://coqui.gateway.scarf.sh/hf/bark/hubert.pt",
"https://coqui.gateway.scarf.sh/hf/bark/tokenizer.pth"
@ -270,7 +271,7 @@
"tortoise-v2": {
"description": "Tortoise tts model https://github.com/neonbjb/tortoise-tts",
"github_rls_url": [
"https://app.coqui.ai/tts_model/autoregressive.pth",
"https://coqui.gateway.scarf.sh/v0.14.1_models/autoregressive.pth",
"https://coqui.gateway.scarf.sh/v0.14.1_models/clvp2.pth",
"https://coqui.gateway.scarf.sh/v0.14.1_models/cvvp.pth",
"https://coqui.gateway.scarf.sh/v0.14.1_models/diffusion_decoder.pth",

View File

@ -1 +1 @@
0.20.3
0.22.0

View File

@ -6,10 +6,10 @@ from typing import Union
import numpy as np
from torch import nn
from TTS.cs_api import CS_API
from TTS.utils.audio.numpy_transforms import save_wav
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer
from TTS.config import load_config
class TTS(nn.Module):
@ -23,7 +23,6 @@ class TTS(nn.Module):
vocoder_path: str = None,
vocoder_config_path: str = None,
progress_bar: bool = True,
cs_api_model: str = "XTTS",
gpu=False,
):
"""🐸TTS python interface that allows to load and use the released models.
@ -59,28 +58,24 @@ class TTS(nn.Module):
vocoder_path (str, optional): Path to the vocoder checkpoint. Defaults to None.
vocoder_config_path (str, optional): Path to the vocoder config. Defaults to None.
progress_bar (bool, optional): Whether to pring a progress bar while downloading a model. Defaults to True.
cs_api_model (str, optional): Name of the model to use for the Coqui Studio API. Available models are
"XTTS", "V1". You can also use `TTS.cs_api.CS_API" for more control.
Defaults to "XTTS".
gpu (bool, optional): Enable/disable GPU. Some models might be too slow on CPU. Defaults to False.
"""
super().__init__()
self.manager = ModelManager(models_file=self.get_models_file_path(), progress_bar=progress_bar, verbose=False)
self.config = load_config(config_path) if config_path else None
self.synthesizer = None
self.voice_converter = None
self.csapi = None
self.cs_api_model = cs_api_model
self.model_name = ""
if gpu:
warnings.warn("`gpu` will be deprecated. Please use `tts.to(device)` instead.")
if model_name is not None:
if "tts_models" in model_name or "coqui_studio" in model_name:
if model_name is not None and len(model_name) > 0:
if "tts_models" in model_name:
self.load_tts_model_by_name(model_name, gpu)
elif "voice_conversion_models" in model_name:
self.load_vc_model_by_name(model_name, gpu)
else:
self.load_model_by_name(model_name, gpu)
if model_path:
self.load_tts_model_by_path(
@ -97,16 +92,15 @@ class TTS(nn.Module):
return self.synthesizer.tts_model.speaker_manager.num_speakers > 1
return False
@property
def is_coqui_studio(self):
if self.model_name is None:
return False
return "coqui_studio" in self.model_name
@property
def is_multi_lingual(self):
# Not sure what sets this to None, but applied a fix to prevent crashing.
if isinstance(self.model_name, str) and "xtts" in self.model_name:
if (
isinstance(self.model_name, str)
and "xtts" in self.model_name
or self.config
and ("xtts" in self.config.model or len(self.config.languages) > 1)
):
return True
if hasattr(self.synthesizer.tts_model, "language_manager") and self.synthesizer.tts_model.language_manager:
return self.synthesizer.tts_model.language_manager.num_languages > 1
@ -129,14 +123,7 @@ class TTS(nn.Module):
return Path(__file__).parent / ".models.json"
def list_models(self):
try:
csapi = CS_API(model=self.cs_api_model)
models = csapi.list_speakers_as_tts_models()
except ValueError as e:
print(e)
models = []
manager = ModelManager(models_file=TTS.get_models_file_path(), progress_bar=False, verbose=False)
return manager.list_tts_models() + models
return ModelManager(models_file=TTS.get_models_file_path(), progress_bar=False, verbose=False)
def download_model_by_name(self, model_name: str):
model_path, config_path, model_item = self.manager.download_model(model_name)
@ -149,6 +136,15 @@ class TTS(nn.Module):
vocoder_path, vocoder_config_path, _ = self.manager.download_model(model_item["default_vocoder"])
return model_path, config_path, vocoder_path, vocoder_config_path, None
def load_model_by_name(self, model_name: str, gpu: bool = False):
"""Load one of the 🐸TTS models by name.
Args:
model_name (str): Model name to load. You can list models by ```tts.models```.
gpu (bool, optional): Enable/disable GPU. Some models might be too slow on CPU. Defaults to False.
"""
self.load_tts_model_by_name(model_name, gpu)
def load_vc_model_by_name(self, model_name: str, gpu: bool = False):
"""Load one of the voice conversion models by name.
@ -170,30 +166,26 @@ class TTS(nn.Module):
TODO: Add tests
"""
self.synthesizer = None
self.csapi = None
self.model_name = model_name
if "coqui_studio" in model_name:
self.csapi = CS_API()
else:
model_path, config_path, vocoder_path, vocoder_config_path, model_dir = self.download_model_by_name(
model_name
)
model_path, config_path, vocoder_path, vocoder_config_path, model_dir = self.download_model_by_name(
model_name
)
# init synthesizer
# None values are fetch from the model
self.synthesizer = Synthesizer(
tts_checkpoint=model_path,
tts_config_path=config_path,
tts_speakers_file=None,
tts_languages_file=None,
vocoder_checkpoint=vocoder_path,
vocoder_config=vocoder_config_path,
encoder_checkpoint=None,
encoder_config=None,
model_dir=model_dir,
use_cuda=gpu,
)
# init synthesizer
# None values are fetch from the model
self.synthesizer = Synthesizer(
tts_checkpoint=model_path,
tts_config_path=config_path,
tts_speakers_file=None,
tts_languages_file=None,
vocoder_checkpoint=vocoder_path,
vocoder_config=vocoder_config_path,
encoder_checkpoint=None,
encoder_config=None,
model_dir=model_dir,
use_cuda=gpu,
)
def load_tts_model_by_path(
self, model_path: str, config_path: str, vocoder_path: str = None, vocoder_config: str = None, gpu: bool = False
@ -230,77 +222,17 @@ class TTS(nn.Module):
**kwargs,
) -> None:
"""Check if the arguments are valid for the model."""
if not self.is_coqui_studio:
# check for the coqui tts models
if self.is_multi_speaker and (speaker is None and speaker_wav is None):
raise ValueError("Model is multi-speaker but no `speaker` is provided.")
if self.is_multi_lingual and language is None:
raise ValueError("Model is multi-lingual but no `language` is provided.")
if not self.is_multi_speaker and speaker is not None and "voice_dir" not in kwargs:
raise ValueError("Model is not multi-speaker but `speaker` is provided.")
if not self.is_multi_lingual and language is not None:
raise ValueError("Model is not multi-lingual but `language` is provided.")
if not emotion is None and not speed is None:
raise ValueError("Emotion and speed can only be used with Coqui Studio models.")
else:
if emotion is None:
emotion = "Neutral"
if speed is None:
speed = 1.0
# check for the studio models
if speaker_wav is not None:
raise ValueError("Coqui Studio models do not support `speaker_wav` argument.")
if speaker is not None:
raise ValueError("Coqui Studio models do not support `speaker` argument.")
if language is not None and language != "en":
raise ValueError("Coqui Studio models currently support only `language=en` argument.")
if emotion not in ["Neutral", "Happy", "Sad", "Angry", "Dull"]:
raise ValueError(f"Emotion - `{emotion}` - must be one of `Neutral`, `Happy`, `Sad`, `Angry`, `Dull`.")
def tts_coqui_studio(
self,
text: str,
speaker_name: str = None,
language: str = None,
emotion: str = None,
speed: float = 1.0,
pipe_out=None,
file_path: str = None,
) -> Union[np.ndarray, str]:
"""Convert text to speech using Coqui Studio models. Use `CS_API` class if you are only interested in the API.
Args:
text (str):
Input text to synthesize.
speaker_name (str, optional):
Speaker name from Coqui Studio. Defaults to None.
language (str): Language of the text. If None, the default language of the speaker is used. Language is only
supported by `XTTS` model.
emotion (str, optional):
Emotion of the speaker. One of "Neutral", "Happy", "Sad", "Angry", "Dull". Emotions are only available
with "V1" model. Defaults to None.
speed (float, optional):
Speed of the speech. Defaults to 1.0.
pipe_out (BytesIO, optional):
Flag to stdout the generated TTS wav file for shell pipe.
file_path (str, optional):
Path to save the output file. When None it returns the `np.ndarray` of waveform. Defaults to None.
Returns:
Union[np.ndarray, str]: Waveform of the synthesized speech or path to the output file.
"""
speaker_name = self.model_name.split("/")[2]
if file_path is not None:
return self.csapi.tts_to_file(
text=text,
speaker_name=speaker_name,
language=language,
speed=speed,
pipe_out=pipe_out,
emotion=emotion,
file_path=file_path,
)[0]
return self.csapi.tts(text=text, speaker_name=speaker_name, language=language, speed=speed, emotion=emotion)[0]
# check for the coqui tts models
if self.is_multi_speaker and (speaker is None and speaker_wav is None):
raise ValueError("Model is multi-speaker but no `speaker` is provided.")
if self.is_multi_lingual and language is None:
raise ValueError("Model is multi-lingual but no `language` is provided.")
if not self.is_multi_speaker and speaker is not None and "voice_dir" not in kwargs:
raise ValueError("Model is not multi-speaker but `speaker` is provided.")
if not self.is_multi_lingual and language is not None:
raise ValueError("Model is not multi-lingual but `language` is provided.")
if not emotion is None and not speed is None:
raise ValueError("Emotion and speed can only be used with Coqui Studio models. Which is discontinued.")
def tts(
self,
@ -310,6 +242,7 @@ class TTS(nn.Module):
speaker_wav: str = None,
emotion: str = None,
speed: float = None,
split_sentences: bool = True,
**kwargs,
):
"""Convert text to speech.
@ -330,14 +263,16 @@ class TTS(nn.Module):
speed (float, optional):
Speed factor to use for 🐸Coqui Studio models, between 0 and 2.0. If None, Studio models use 1.0.
Defaults to None.
split_sentences (bool, optional):
Split text into sentences, synthesize them separately and concatenate the file audio.
Setting it False uses more VRAM and possibly hit model specific text length or VRAM limits. Only
applicable to the 🐸TTS models. Defaults to True.
kwargs (dict, optional):
Additional arguments for the model.
"""
self._check_arguments(
speaker=speaker, language=language, speaker_wav=speaker_wav, emotion=emotion, speed=speed, **kwargs
)
if self.csapi is not None:
return self.tts_coqui_studio(
text=text, speaker_name=speaker, language=language, emotion=emotion, speed=speed
)
wav = self.synthesizer.tts(
text=text,
speaker_name=speaker,
@ -347,6 +282,7 @@ class TTS(nn.Module):
style_wav=None,
style_text=None,
reference_speaker_name=None,
split_sentences=split_sentences,
**kwargs,
)
return wav
@ -361,6 +297,7 @@ class TTS(nn.Module):
speed: float = 1.0,
pipe_out=None,
file_path: str = "output.wav",
split_sentences: bool = True,
**kwargs,
):
"""Convert text to speech.
@ -385,22 +322,23 @@ class TTS(nn.Module):
Flag to stdout the generated TTS wav file for shell pipe.
file_path (str, optional):
Output file path. Defaults to "output.wav".
split_sentences (bool, optional):
Split text into sentences, synthesize them separately and concatenate the file audio.
Setting it False uses more VRAM and possibly hit model specific text length or VRAM limits. Only
applicable to the 🐸TTS models. Defaults to True.
kwargs (dict, optional):
Additional arguments for the model.
"""
self._check_arguments(speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
if self.csapi is not None:
return self.tts_coqui_studio(
text=text,
speaker_name=speaker,
language=language,
emotion=emotion,
speed=speed,
file_path=file_path,
pipe_out=pipe_out,
)
wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
wav = self.tts(
text=text,
speaker=speaker,
language=language,
speaker_wav=speaker_wav,
split_sentences=split_sentences,
**kwargs,
)
self.synthesizer.save_wav(wav=wav, path=file_path, pipe_out=pipe_out)
return file_path
@ -440,7 +378,14 @@ class TTS(nn.Module):
save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)
return file_path
def tts_with_vc(self, text: str, language: str = None, speaker_wav: str = None):
def tts_with_vc(
self,
text: str,
language: str = None,
speaker_wav: str = None,
speaker: str = None,
split_sentences: bool = True,
):
"""Convert text to speech with voice conversion.
It combines tts with voice conversion to fake voice cloning.
@ -457,17 +402,32 @@ class TTS(nn.Module):
speaker_wav (str, optional):
Path to a reference wav file to use for voice cloning with supporting models like YourTTS.
Defaults to None.
speaker (str, optional):
Speaker name for multi-speaker. You can check whether loaded model is multi-speaker by
`tts.is_multi_speaker` and list speakers by `tts.speakers`. Defaults to None.
split_sentences (bool, optional):
Split text into sentences, synthesize them separately and concatenate the file audio.
Setting it False uses more VRAM and possibly hit model specific text length or VRAM limits. Only
applicable to the 🐸TTS models. Defaults to True.
"""
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
# Lazy code... save it to a temp file to resample it while reading it for VC
self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name, speaker_wav=speaker_wav)
self.tts_to_file(
text=text, speaker=speaker, language=language, file_path=fp.name, split_sentences=split_sentences
)
if self.voice_converter is None:
self.load_vc_model_by_name("voice_conversion_models/multilingual/vctk/freevc24")
wav = self.voice_converter.voice_conversion(source_wav=fp.name, target_wav=speaker_wav)
return wav
def tts_with_vc_to_file(
self, text: str, language: str = None, speaker_wav: str = None, file_path: str = "output.wav"
self,
text: str,
language: str = None,
speaker_wav: str = None,
file_path: str = "output.wav",
speaker: str = None,
split_sentences: bool = True,
):
"""Convert text to speech with voice conversion and save to file.
@ -484,6 +444,15 @@ class TTS(nn.Module):
Defaults to None.
file_path (str, optional):
Output file path. Defaults to "output.wav".
speaker (str, optional):
Speaker name for multi-speaker. You can check whether loaded model is multi-speaker by
`tts.is_multi_speaker` and list speakers by `tts.speakers`. Defaults to None.
split_sentences (bool, optional):
Split text into sentences, synthesize them separately and concatenate the file audio.
Setting it False uses more VRAM and possibly hit model specific text length or VRAM limits. Only
applicable to the 🐸TTS models. Defaults to True.
"""
wav = self.tts_with_vc(text=text, language=language, speaker_wav=speaker_wav)
wav = self.tts_with_vc(
text=text, language=language, speaker_wav=speaker_wav, speaker=speaker, split_sentences=split_sentences
)
save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)

View File

@ -15,6 +15,7 @@ from TTS.tts.models import setup_model
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.utils.audio.numpy_transforms import quantize
from TTS.utils.generic_utils import count_parameters
use_cuda = torch.cuda.is_available()
@ -159,7 +160,7 @@ def inference(
def extract_spectrograms(
data_loader, model, ap, output_path, quantized_wav=False, save_audio=False, debug=False, metada_name="metada.txt"
data_loader, model, ap, output_path, quantize_bits=0, save_audio=False, debug=False, metada_name="metada.txt"
):
model.eval()
export_metadata = []
@ -196,8 +197,8 @@ def extract_spectrograms(
_, wavq_path, mel_path, wav_gl_path, wav_path = set_filename(wav_file_path, output_path)
# quantize and save wav
if quantized_wav:
wavq = ap.quantize(wav)
if quantize_bits > 0:
wavq = quantize(wav, quantize_bits)
np.save(wavq_path, wavq)
# save TTS mel
@ -263,7 +264,7 @@ def main(args): # pylint: disable=redefined-outer-name
model,
ap,
args.output_path,
quantized_wav=args.quantized,
quantize_bits=args.quantize_bits,
save_audio=args.save_audio,
debug=args.debug,
metada_name="metada.txt",
@ -277,7 +278,7 @@ if __name__ == "__main__":
parser.add_argument("--output_path", type=str, help="Path to save mel specs", required=True)
parser.add_argument("--debug", default=False, action="store_true", help="Save audio files for debug")
parser.add_argument("--save_audio", default=False, action="store_true", help="Save audio files")
parser.add_argument("--quantized", action="store_true", help="Save quantized audio files")
parser.add_argument("--quantize_bits", type=int, default=0, help="Save quantized audio files if non-zero")
parser.add_argument("--eval", type=bool, help="compute eval.", default=True)
args = parser.parse_args()

View File

@ -66,12 +66,6 @@ If you don't specify any models, then it uses LJSpeech based English model.
$ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```
- Run TTS and define speed factor to use for 🐸Coqui Studio models, between 0.0 and 2.0:
```
$ tts --text "Text for TTS" --model_name "coqui_studio/<language>/<dataset>/<model_name>" --speed 1.2 --out_path output/path/speech.wav
```
- Run a TTS model with its default vocoder model:
```
@ -222,25 +216,6 @@ def main():
default=None,
)
parser.add_argument("--encoder_config_path", type=str, help="Path to speaker encoder config file.", default=None)
# args for coqui studio
parser.add_argument(
"--cs_model",
type=str,
help="Name of the 🐸Coqui Studio model. Available models are `XTTS`, `V1`.",
)
parser.add_argument(
"--emotion",
type=str,
help="Emotion to condition the model with. Only available for 🐸Coqui Studio `V1` model.",
default=None,
)
parser.add_argument(
"--language",
type=str,
help="Language to condition the model with. Only available for 🐸Coqui Studio `XTTS` model.",
default=None,
)
parser.add_argument(
"--pipe_out",
help="stdout the generated TTS wav file for shell pipe.",
@ -249,12 +224,6 @@ def main():
const=True,
default=False,
)
parser.add_argument(
"--speed",
type=float,
help="Speed factor to use for 🐸Coqui Studio models, between 0.0 and 2.0.",
default=None,
)
# args for multi-speaker synthesis
parser.add_argument("--speakers_file_path", type=str, help="JSON file for multi-speaker model.", default=None)
@ -389,7 +358,6 @@ def main():
# CASE1 #list : list pre-trained TTS models
if args.list_models:
manager.add_cs_api_models(api.list_models())
manager.list_models()
sys.exit()
@ -404,22 +372,7 @@ def main():
manager.model_info_by_full_name(model_query_full_name)
sys.exit()
# CASE3: TTS with coqui studio models
if "coqui_studio" in args.model_name:
print(" > Using 🐸Coqui Studio model: ", args.model_name)
api = TTS(model_name=args.model_name, cs_api_model=args.cs_model)
api.tts_to_file(
text=args.text,
emotion=args.emotion,
file_path=args.out_path,
language=args.language,
speed=args.speed,
pipe_out=pipe_out,
)
print(" > Saving output to ", args.out_path)
return
# CASE4: load pre-trained model paths
# CASE3: load pre-trained model paths
if args.model_name is not None and not args.model_path:
model_path, config_path, model_item = manager.download_model(args.model_name)
# tts model
@ -447,7 +400,7 @@ def main():
if args.vocoder_name is not None and not args.vocoder_path:
vocoder_path, vocoder_config_path, _ = manager.download_model(args.vocoder_name)
# CASE5: set custom model paths
# CASE4: set custom model paths
if args.model_path is not None:
tts_path = args.model_path
tts_config_path = args.config_path

View File

@ -8,17 +8,17 @@ import traceback
import torch
from torch.utils.data import DataLoader
from trainer.io import copy_model_files, save_best_model, save_checkpoint
from trainer.torch import NoamLR
from trainer.trainer_utils import get_optimizer
from TTS.encoder.dataset import EncoderDataset
from TTS.encoder.utils.generic_utils import save_best_model, save_checkpoint, setup_encoder_model
from TTS.encoder.utils.generic_utils import setup_encoder_model
from TTS.encoder.utils.training import init_training
from TTS.encoder.utils.visual import plot_embeddings
from TTS.tts.datasets import load_tts_samples
from TTS.utils.audio import AudioProcessor
from TTS.utils.generic_utils import count_parameters, remove_experiment_folder
from TTS.utils.io import copy_model_files
from TTS.utils.samplers import PerfectBatchSampler
from TTS.utils.training import check_update
@ -125,7 +125,7 @@ def evaluation(model, criterion, data_loader, global_step):
def train(model, optimizer, scheduler, criterion, data_loader, eval_data_loader, global_step):
model.train()
best_loss = float("inf")
best_loss = {"train_loss": None, "eval_loss": float("inf")}
avg_loader_time = 0
end_time = time.time()
for epoch in range(c.epochs):
@ -222,7 +222,9 @@ def train(model, optimizer, scheduler, criterion, data_loader, eval_data_loader,
if global_step % c.save_step == 0:
# save model
save_checkpoint(model, optimizer, criterion, loss.item(), OUT_PATH, global_step, epoch)
save_checkpoint(
c, model, optimizer, None, global_step, epoch, OUT_PATH, criterion=criterion.state_dict()
)
end_time = time.time()
@ -245,7 +247,18 @@ def train(model, optimizer, scheduler, criterion, data_loader, eval_data_loader,
flush=True,
)
# save the best checkpoint
best_loss = save_best_model(model, optimizer, criterion, eval_loss, best_loss, OUT_PATH, global_step, epoch)
best_loss = save_best_model(
{"train_loss": None, "eval_loss": eval_loss},
best_loss,
c,
model,
optimizer,
None,
global_step,
epoch,
OUT_PATH,
criterion=criterion.state_dict(),
)
model.train()
return best_loss, global_step
@ -276,7 +289,7 @@ def main(args): # pylint: disable=redefined-outer-name
if c.loss == "softmaxproto" and c.model != "speaker_encoder":
c.map_classid_to_classname = map_classid_to_classname
copy_model_files(c, OUT_PATH)
copy_model_files(c, OUT_PATH, new_fields={})
if args.restore_path:
criterion, args.restore_step = model.load_checkpoint(

View File

@ -16,12 +16,9 @@ def read_json_with_comments(json_path):
# fallback to json
with fsspec.open(json_path, "r", encoding="utf-8") as f:
input_str = f.read()
# handle comments
input_str = re.sub(r"\\\n", "", input_str)
input_str = re.sub(r"//.*\n", "\n", input_str)
data = json.loads(input_str)
return data
# handle comments but not urls with //
input_str = re.sub(r"(\"(?:[^\"\\]|\\.)*\")|(/\*(?:.|[\\n\\r])*?\*/)|(//.*)", lambda m: m.group(1) or m.group(2) or "", input_str)
return json.loads(input_str)
def register_config(model_name: str) -> Coqpit:
"""Find the right config for the given model name.

View File

@ -1,314 +0,0 @@
import http.client
import json
import os
import tempfile
import urllib.request
from typing import Tuple
import numpy as np
import requests
from scipy.io import wavfile
from TTS.utils.audio.numpy_transforms import save_wav
class Speaker(object):
"""Convert dict to object."""
def __init__(self, d, is_voice=False):
self.is_voice = is_voice
for k, v in d.items():
if isinstance(k, (list, tuple)):
setattr(self, k, [Speaker(x) if isinstance(x, dict) else x for x in v])
else:
setattr(self, k, Speaker(v) if isinstance(v, dict) else v)
def __repr__(self):
return str(self.__dict__)
class CS_API:
"""🐸Coqui Studio API Wrapper.
🐸Coqui Studio is the most advanced voice generation platform. You can generate new voices by voice cloning, voice
interpolation, or our unique prompt to voice technology. It also provides a set of built-in voices with different
characteristics. You can use these voices to generate new audio files or use them in your applications.
You can use all the built-in and your own 🐸Coqui Studio speakers with this API with an API token.
You can signup to 🐸Coqui Studio from https://app.coqui.ai/auth/signup and get an API token from
https://app.coqui.ai/account. We can either enter the token as an environment variable as
`export COQUI_STUDIO_TOKEN=<token>` or pass it as `CS_API(api_token=<toke>)`.
Visit https://app.coqui.ai/api for more information.
Args:
api_token (str): 🐸Coqui Studio API token. If not provided, it will be read from the environment variable
`COQUI_STUDIO_TOKEN`.
model (str): 🐸Coqui Studio model. It can be either `V1`, `XTTS`. Default is `XTTS`.
Example listing all available speakers:
>>> from TTS.api import CS_API
>>> tts = CS_API()
>>> tts.speakers
Example listing all emotions:
>>> # emotions are only available for `V1` model
>>> from TTS.api import CS_API
>>> tts = CS_API(model="V1")
>>> tts.emotions
Example with a built-in 🐸 speaker:
>>> from TTS.api import CS_API
>>> tts = CS_API()
>>> wav, sr = api.tts("Hello world", speaker_name=tts.speakers[0].name)
>>> filepath = tts.tts_to_file(text="Hello world!", speaker_name=tts.speakers[0].name, file_path="output.wav")
Example with multi-language model:
>>> from TTS.api import CS_API
>>> tts = CS_API(model="XTTS")
>>> wav, sr = api.tts("Hello world", speaker_name=tts.speakers[0].name, language="en")
"""
MODEL_ENDPOINTS = {
"V1": {
"list_speakers": "https://app.coqui.ai/api/v2/speakers",
"synthesize": "https://app.coqui.ai/api/v2/samples",
"list_voices": "https://app.coqui.ai/api/v2/voices",
},
"XTTS": {
"list_speakers": "https://app.coqui.ai/api/v2/speakers",
"synthesize": "https://app.coqui.ai/api/v2/samples/xtts/render/",
"list_voices": "https://app.coqui.ai/api/v2/voices/xtts",
},
}
SUPPORTED_LANGUAGES = ["en", "es", "de", "fr", "it", "pt", "pl", "tr", "ru", "nl", "cs", "ar", "zh-cn", "ja"]
def __init__(self, api_token=None, model="XTTS"):
self.api_token = api_token
self.model = model
self.headers = None
self._speakers = None
self._check_token()
@staticmethod
def ping_api():
URL = "https://coqui.gateway.scarf.sh/tts/api"
_ = requests.get(URL)
@property
def speakers(self):
if self._speakers is None:
self._speakers = self.list_all_speakers()
return self._speakers
@property
def emotions(self):
"""Return a list of available emotions.
TODO: Get this from the API endpoint.
"""
if self.model == "V1":
return ["Neutral", "Happy", "Sad", "Angry", "Dull"]
else:
raise ValueError(f"❗ Emotions are not available for {self.model}.")
def _check_token(self):
if self.api_token is None:
self.api_token = os.environ.get("COQUI_STUDIO_TOKEN")
self.headers = {"Content-Type": "application/json", "Authorization": f"Bearer {self.api_token}"}
if not self.api_token:
raise ValueError(
"No API token found for 🐸Coqui Studio voices - https://coqui.ai \n"
"Visit 🔗https://app.coqui.ai/account to get one.\n"
"Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`\n"
""
)
def list_all_speakers(self):
"""Return both built-in Coqui Studio speakers and custom voices created by the user."""
return self.list_speakers() + self.list_voices()
def list_speakers(self):
"""List built-in Coqui Studio speakers."""
self._check_token()
conn = http.client.HTTPSConnection("app.coqui.ai")
url = self.MODEL_ENDPOINTS[self.model]["list_speakers"]
conn.request("GET", f"{url}?page=1&per_page=100", headers=self.headers)
res = conn.getresponse()
data = res.read()
return [Speaker(s) for s in json.loads(data)["result"]]
def list_voices(self):
"""List custom voices created by the user."""
conn = http.client.HTTPSConnection("app.coqui.ai")
url = self.MODEL_ENDPOINTS[self.model]["list_voices"]
conn.request("GET", f"{url}?page=1&per_page=100", headers=self.headers)
res = conn.getresponse()
data = res.read()
return [Speaker(s, True) for s in json.loads(data)["result"]]
def list_speakers_as_tts_models(self):
"""List speakers in ModelManager format."""
models = []
for speaker in self.speakers:
model = f"coqui_studio/multilingual/{speaker.name}/{self.model}"
models.append(model)
return models
def name_to_speaker(self, name):
for speaker in self.speakers:
if speaker.name == name:
return speaker
raise ValueError(f"Speaker {name} not found in {self.speakers}")
def id_to_speaker(self, speaker_id):
for speaker in self.speakers:
if speaker.id == speaker_id:
return speaker
raise ValueError(f"Speaker {speaker_id} not found.")
@staticmethod
def url_to_np(url):
tmp_file, _ = urllib.request.urlretrieve(url)
rate, data = wavfile.read(tmp_file)
return data, rate
@staticmethod
def _create_payload(model, text, speaker, speed, emotion, language):
payload = {}
# if speaker.is_voice:
payload["voice_id"] = speaker.id
# else:
payload["speaker_id"] = speaker.id
if model == "V1":
payload.update(
{
"emotion": emotion,
"name": speaker.name,
"text": text,
"speed": speed,
}
)
elif model == "XTTS":
payload.update(
{
"name": speaker.name,
"text": text,
"speed": speed,
"language": language,
}
)
else:
raise ValueError(f"❗ Unknown model {model}")
return payload
def _check_tts_args(self, text, speaker_name, speaker_id, emotion, speed, language):
assert text is not None, "❗ text is required for V1 model."
assert speaker_name is not None, "❗ speaker_name is required for V1 model."
if self.model == "V1":
if emotion is None:
emotion = "Neutral"
assert language is None, "❗ language is not supported for V1 model."
elif self.model == "XTTS":
assert emotion is None, f"❗ Emotions are not supported for XTTS model. Use V1 model."
assert language is not None, "❗ Language is required for XTTS model."
assert (
language in self.SUPPORTED_LANGUAGES
), f"❗ Language {language} is not yet supported. Check https://docs.coqui.ai/reference/samples_xtts_create."
return text, speaker_name, speaker_id, emotion, speed, language
def tts(
self,
text: str,
speaker_name: str = None,
speaker_id=None,
emotion=None,
speed=1.0,
language=None, # pylint: disable=unused-argument
) -> Tuple[np.ndarray, int]:
"""Synthesize speech from text.
Args:
text (str): Text to synthesize.
speaker_name (str): Name of the speaker. You can get the list of speakers with `list_speakers()` and
voices (user generated speakers) with `list_voices()`.
speaker_id (str): Speaker ID. If None, the speaker name is used.
emotion (str): Emotion of the speaker. One of "Neutral", "Happy", "Sad", "Angry", "Dull". Emotions are only
supported by `V1` model. Defaults to None.
speed (float): Speed of the speech. 1.0 is normal speed.
language (str): Language of the text. If None, the default language of the speaker is used. Language is only
supported by `XTTS` model. See https://docs.coqui.ai/reference/samples_xtts_create for supported languages.
"""
self._check_token()
self.ping_api()
if speaker_name is None and speaker_id is None:
raise ValueError(" [!] Please provide either a `speaker_name` or a `speaker_id`.")
if speaker_id is None:
speaker = self.name_to_speaker(speaker_name)
else:
speaker = self.id_to_speaker(speaker_id)
text, speaker_name, speaker_id, emotion, speed, language = self._check_tts_args(
text, speaker_name, speaker_id, emotion, speed, language
)
conn = http.client.HTTPSConnection("app.coqui.ai")
payload = self._create_payload(self.model, text, speaker, speed, emotion, language)
url = self.MODEL_ENDPOINTS[self.model]["synthesize"]
conn.request("POST", url, json.dumps(payload), self.headers)
res = conn.getresponse()
data = res.read()
try:
wav, sr = self.url_to_np(json.loads(data)["audio_url"])
except KeyError as e:
raise ValueError(f" [!] 🐸 API returned error: {data}") from e
return wav, sr
def tts_to_file(
self,
text: str,
speaker_name: str,
speaker_id=None,
emotion=None,
speed=1.0,
pipe_out=None,
language=None,
file_path: str = None,
) -> str:
"""Synthesize speech from text and save it to a file.
Args:
text (str): Text to synthesize.
speaker_name (str): Name of the speaker. You can get the list of speakers with `list_speakers()` and
voices (user generated speakers) with `list_voices()`.
speaker_id (str): Speaker ID. If None, the speaker name is used.
emotion (str): Emotion of the speaker. One of "Neutral", "Happy", "Sad", "Angry", "Dull".
speed (float): Speed of the speech. 1.0 is normal speed.
pipe_out (BytesIO, optional): Flag to stdout the generated TTS wav file for shell pipe.
language (str): Language of the text. If None, the default language of the speaker is used. Language is only
supported by `XTTS` model. Currently supports en, de, es, fr, it, pt, pl. Defaults to "en".
file_path (str): Path to save the file. If None, a temporary file is created.
"""
if file_path is None:
file_path = tempfile.mktemp(".wav")
wav, sr = self.tts(text, speaker_name, speaker_id, emotion, speed, language)
save_wav(wav=wav, path=file_path, sample_rate=sr, pipe_out=pipe_out)
return file_path
if __name__ == "__main__":
import time
api = CS_API()
print(api.speakers)
print(api.list_speakers_as_tts_models())
ts = time.time()
wav, sr = api.tts("It took me quite a long time to develop a voice.", language="en", speaker_name=api.speakers[0].name)
print(f" [i] XTTS took {time.time() - ts:.2f}s")
filepath = api.tts_to_file(text="Hello world!", speaker_name=api.speakers[0].name, language="en", file_path="output.wav")

View File

@ -0,0 +1,2 @@
faster_whisper==0.9.0
gradio==4.7.1
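For reference, a minimal sketch (assuming the pip distribution names are "faster-whisper" and "gradio") to confirm the two extra demo dependencies above are installed at the pinned versions before launching the fine-tuning demo:

# Sketch: verify the pinned demo dependencies at runtime.
from importlib.metadata import version

print(version("faster-whisper"), version("gradio"))  # expected: 0.9.0 and 4.7.1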

View File

@ -0,0 +1,160 @@
import os
import gc
import torchaudio
import pandas
from faster_whisper import WhisperModel
from glob import glob
from tqdm import tqdm
import torch
# torch.set_num_threads(1)
from TTS.tts.layers.xtts.tokenizer import multilingual_cleaners
torch.set_num_threads(16)
audio_types = (".wav", ".mp3", ".flac")
def list_audios(basePath, contains=None):
# return the set of files that are valid
return list_files(basePath, validExts=audio_types, contains=contains)
def list_files(basePath, validExts=None, contains=None):
# loop over the directory structure
for (rootDir, dirNames, filenames) in os.walk(basePath):
# loop over the filenames in the current directory
for filename in filenames:
# if the contains string is not none and the filename does not contain
# the supplied string, then ignore the file
if contains is not None and filename.find(contains) == -1:
continue
# determine the file extension of the current file
ext = filename[filename.rfind("."):].lower()
# check to see if the file is an audio and should be processed
if validExts is None or ext.endswith(validExts):
# construct the path to the audio and yield it
audioPath = os.path.join(rootDir, filename)
yield audioPath
def format_audio_list(audio_files, target_language="en", out_path=None, buffer=0.2, eval_percentage=0.15, speaker_name="coqui", gradio_progress=None):
audio_total_size = 0
    # make sure that the output directory exists
os.makedirs(out_path, exist_ok=True)
# Loading Whisper
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Loading Whisper Model!")
asr_model = WhisperModel("large-v2", device=device, compute_type="float16")
metadata = {"audio_file": [], "text": [], "speaker_name": []}
if gradio_progress is not None:
tqdm_object = gradio_progress.tqdm(audio_files, desc="Formatting...")
else:
tqdm_object = tqdm(audio_files)
for audio_path in tqdm_object:
wav, sr = torchaudio.load(audio_path)
# stereo to mono if needed
if wav.size(0) != 1:
wav = torch.mean(wav, dim=0, keepdim=True)
wav = wav.squeeze()
audio_total_size += (wav.size(-1) / sr)
segments, _ = asr_model.transcribe(audio_path, word_timestamps=True, language=target_language)
segments = list(segments)
i = 0
sentence = ""
sentence_start = None
first_word = True
        # add all the segments' words into a single list
words_list = []
for _, segment in enumerate(segments):
words = list(segment.words)
words_list.extend(words)
# process each word
for word_idx, word in enumerate(words_list):
if first_word:
sentence_start = word.start
                # If it is the first sentence, add a buffer or use the beginning of the file
if word_idx == 0:
sentence_start = max(sentence_start - buffer, 0) # Add buffer to the sentence start
else:
# get previous sentence end
previous_word_end = words_list[word_idx - 1].end
                    # add a buffer, or use the middle of the silence between the previous sentence and the current one
sentence_start = max(sentence_start - buffer, (previous_word_end + sentence_start)/2)
sentence = word.word
first_word = False
else:
sentence += word.word
if word.word[-1] in ["!", ".", "?"]:
sentence = sentence[1:]
# Expand number and abbreviations plus normalization
sentence = multilingual_cleaners(sentence, target_language)
audio_file_name, _ = os.path.splitext(os.path.basename(audio_path))
audio_file = f"wavs/{audio_file_name}_{str(i).zfill(8)}.wav"
# Check for the next word's existence
if word_idx + 1 < len(words_list):
next_word_start = words_list[word_idx + 1].start
else:
                        # if there are no more words, this is the last sentence; use the audio length as the next word start
next_word_start = (wav.shape[0] - 1) / sr
# Average the current word end and next word start
word_end = min((word.end + next_word_start) / 2, word.end + buffer)
absoulte_path = os.path.join(out_path, audio_file)
os.makedirs(os.path.dirname(absoulte_path), exist_ok=True)
i += 1
first_word = True
audio = wav[int(sr*sentence_start):int(sr*word_end)].unsqueeze(0)
                    # if the audio is too short, ignore it (i.e. < 0.33 seconds)
if audio.size(-1) >= sr/3:
torchaudio.save(absoulte_path,
audio,
sr
)
else:
continue
metadata["audio_file"].append(audio_file)
metadata["text"].append(sentence)
metadata["speaker_name"].append(speaker_name)
df = pandas.DataFrame(metadata)
df = df.sample(frac=1)
num_val_samples = int(len(df)*eval_percentage)
df_eval = df[:num_val_samples]
df_train = df[num_val_samples:]
df_train = df_train.sort_values('audio_file')
train_metadata_path = os.path.join(out_path, "metadata_train.csv")
df_train.to_csv(train_metadata_path, sep="|", index=False)
eval_metadata_path = os.path.join(out_path, "metadata_eval.csv")
df_eval = df_eval.sort_values('audio_file')
df_eval.to_csv(eval_metadata_path, sep="|", index=False)
# deallocate VRAM and RAM
del asr_model, df_train, df_eval, df, metadata
gc.collect()
return train_metadata_path, eval_metadata_path, audio_total_size
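A minimal usage sketch for the formatter above; the folder paths are placeholders, a CUDA device is assumed for the float16 Whisper model, and the helpers are assumed importable from TTS.demos.xtts_ft_demo.utils.formatter:

# Sketch: build train/eval CSVs for XTTS fine-tuning from a folder of recordings.
from TTS.demos.xtts_ft_demo.utils.formatter import format_audio_list, list_audios

audio_files = list(list_audios("./my_recordings/"))  # collects .wav/.mp3/.flac files recursively
train_csv, eval_csv, total_secs = format_audio_list(
    audio_files,
    target_language="en",
    out_path="./xtts_ft_dataset/",
    speaker_name="coqui",
)
print(train_csv, eval_csv, f"{total_secs:.1f}s of audio")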

View File

@ -0,0 +1,172 @@
import os
import gc
from trainer import Trainer, TrainerArgs
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig
from TTS.utils.manage import ModelManager
def train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path, max_audio_length=255995):
# Logging parameters
RUN_NAME = "GPT_XTTS_FT"
PROJECT_NAME = "XTTS_trainer"
DASHBOARD_LOGGER = "tensorboard"
LOGGER_URI = None
# Set here the path that the checkpoints will be saved. Default: ./run/training/
OUT_PATH = os.path.join(output_path, "run", "training")
# Training Parameters
OPTIMIZER_WD_ONLY_ON_WEIGHTS = True # for multi-gpu training please make it False
    START_WITH_EVAL = False  # if True, it will start with evaluation
BATCH_SIZE = batch_size # set here the batch size
GRAD_ACUMM_STEPS = grad_acumm # set here the grad accumulation steps
# Define here the dataset that you want to use for the fine-tuning on.
config_dataset = BaseDatasetConfig(
formatter="coqui",
dataset_name="ft_dataset",
path=os.path.dirname(train_csv),
meta_file_train=train_csv,
meta_file_val=eval_csv,
language=language,
)
# Add here the configs of the datasets
DATASETS_CONFIG_LIST = [config_dataset]
# Define the path where XTTS v2.0.1 files will be downloaded
CHECKPOINTS_OUT_PATH = os.path.join(OUT_PATH, "XTTS_v2.0_original_model_files/")
os.makedirs(CHECKPOINTS_OUT_PATH, exist_ok=True)
# DVAE files
DVAE_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth"
MEL_NORM_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth"
# Set the path to the downloaded files
DVAE_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(DVAE_CHECKPOINT_LINK))
MEL_NORM_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(MEL_NORM_LINK))
# download DVAE files if needed
if not os.path.isfile(DVAE_CHECKPOINT) or not os.path.isfile(MEL_NORM_FILE):
print(" > Downloading DVAE files!")
ModelManager._download_model_files([MEL_NORM_LINK, DVAE_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True)
# Download XTTS v2.0 checkpoint if needed
TOKENIZER_FILE_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json"
XTTS_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth"
XTTS_CONFIG_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/config.json"
    # XTTS transfer learning parameters: you need to provide the paths of the XTTS model checkpoint that you want to fine-tune.
TOKENIZER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(TOKENIZER_FILE_LINK)) # vocab.json file
XTTS_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CHECKPOINT_LINK)) # model.pth file
XTTS_CONFIG_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CONFIG_LINK)) # config.json file
# download XTTS v2.0 files if needed
if not os.path.isfile(TOKENIZER_FILE) or not os.path.isfile(XTTS_CHECKPOINT):
print(" > Downloading XTTS v2.0 files!")
ModelManager._download_model_files(
[TOKENIZER_FILE_LINK, XTTS_CHECKPOINT_LINK, XTTS_CONFIG_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True
)
# init args and config
model_args = GPTArgs(
max_conditioning_length=132300, # 6 secs
min_conditioning_length=66150, # 3 secs
debug_loading_failures=False,
max_wav_length=max_audio_length, # ~11.6 seconds
max_text_length=200,
mel_norm_file=MEL_NORM_FILE,
dvae_checkpoint=DVAE_CHECKPOINT,
xtts_checkpoint=XTTS_CHECKPOINT, # checkpoint path of the model that you want to fine-tune
tokenizer_file=TOKENIZER_FILE,
gpt_num_audio_tokens=1026,
gpt_start_audio_token=1024,
gpt_stop_audio_token=1025,
gpt_use_masking_gt_prompt_approach=True,
gpt_use_perceiver_resampler=True,
)
# define audio config
audio_config = XttsAudioConfig(sample_rate=22050, dvae_sample_rate=22050, output_sample_rate=24000)
# training parameters config
config = GPTTrainerConfig(
epochs=num_epochs,
output_path=OUT_PATH,
model_args=model_args,
run_name=RUN_NAME,
project_name=PROJECT_NAME,
run_description="""
GPT XTTS training
""",
dashboard_logger=DASHBOARD_LOGGER,
logger_uri=LOGGER_URI,
audio=audio_config,
batch_size=BATCH_SIZE,
batch_group_size=48,
eval_batch_size=BATCH_SIZE,
num_loader_workers=8,
eval_split_max_size=256,
print_step=50,
plot_step=100,
log_model_step=100,
save_step=1000,
save_n_checkpoints=1,
save_checkpoints=True,
# target_loss="loss",
print_eval=False,
# Optimizer values like tortoise, pytorch implementation with modifications to not apply WD to non-weight parameters.
optimizer="AdamW",
optimizer_wd_only_on_weights=OPTIMIZER_WD_ONLY_ON_WEIGHTS,
optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
lr=5e-06, # learning rate
lr_scheduler="MultiStepLR",
        # it was adjusted accordingly for the new step scheme
lr_scheduler_params={"milestones": [50000 * 18, 150000 * 18, 300000 * 18], "gamma": 0.5, "last_epoch": -1},
test_sentences=[],
)
# init the model from config
model = GPTTrainer.init_from_config(config)
# load training samples
train_samples, eval_samples = load_tts_samples(
DATASETS_CONFIG_LIST,
eval_split=True,
eval_split_max_size=config.eval_split_max_size,
eval_split_size=config.eval_split_size,
)
# init the trainer and 🚀
trainer = Trainer(
TrainerArgs(
            restore_path=None,  # the xtts checkpoint is restored via the xtts_checkpoint key, so there is no need to restore it using the Trainer restore_path parameter
skip_train_epoch=False,
start_with_eval=START_WITH_EVAL,
grad_accum_steps=GRAD_ACUMM_STEPS,
),
config,
output_path=OUT_PATH,
model=model,
train_samples=train_samples,
eval_samples=eval_samples,
)
trainer.fit()
# get the longest text audio file to use as speaker reference
samples_len = [len(item["text"].split(" ")) for item in train_samples]
longest_text_idx = samples_len.index(max(samples_len))
speaker_ref = train_samples[longest_text_idx]["audio_file"]
trainer_out_path = trainer.output_path
# deallocate VRAM and RAM
del model, trainer, train_samples, eval_samples
gc.collect()
return XTTS_CONFIG_FILE, XTTS_CHECKPOINT, TOKENIZER_FILE, trainer_out_path, speaker_ref
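A rough sketch of calling the trainer above directly; the paths are placeholders, and the XTTS v2.0 base files are downloaded into the output folder on first use:

# Sketch: fine-tune the XTTS GPT on the CSVs produced by the formatter.
from TTS.demos.xtts_ft_demo.utils.gpt_train import train_gpt

config_path, base_ckpt, vocab_file, exp_path, speaker_ref = train_gpt(
    language="en",
    num_epochs=6,
    batch_size=2,
    grad_acumm=2,
    train_csv="./xtts_ft_dataset/metadata_train.csv",
    eval_csv="./xtts_ft_dataset/metadata_eval.csv",
    output_path="./xtts_ft_run/",
)
print("run dir:", exp_path, "speaker reference:", speaker_ref)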

View File

@ -0,0 +1,415 @@
import argparse
import os
import sys
import tempfile
import gradio as gr
import librosa.display
import numpy as np
import torch
import torchaudio
import traceback
from TTS.demos.xtts_ft_demo.utils.formatter import format_audio_list
from TTS.demos.xtts_ft_demo.utils.gpt_train import train_gpt
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
def clear_gpu_cache():
# clear the GPU cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
XTTS_MODEL = None
def load_model(xtts_checkpoint, xtts_config, xtts_vocab):
global XTTS_MODEL
clear_gpu_cache()
if not xtts_checkpoint or not xtts_config or not xtts_vocab:
return "You need to run the previous steps or manually set the `XTTS checkpoint path`, `XTTS config path`, and `XTTS vocab path` fields !!"
config = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
print("Loading XTTS model! ")
XTTS_MODEL.load_checkpoint(config, checkpoint_path=xtts_checkpoint, vocab_path=xtts_vocab, use_deepspeed=False)
if torch.cuda.is_available():
XTTS_MODEL.cuda()
print("Model Loaded!")
return "Model Loaded!"
def run_tts(lang, tts_text, speaker_audio_file):
if XTTS_MODEL is None or not speaker_audio_file:
return "You need to run the previous step to load the model !!", None, None
gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(audio_path=speaker_audio_file, gpt_cond_len=XTTS_MODEL.config.gpt_cond_len, max_ref_length=XTTS_MODEL.config.max_ref_len, sound_norm_refs=XTTS_MODEL.config.sound_norm_refs)
out = XTTS_MODEL.inference(
text=tts_text,
language=lang,
gpt_cond_latent=gpt_cond_latent,
speaker_embedding=speaker_embedding,
temperature=XTTS_MODEL.config.temperature, # Add custom parameters here
length_penalty=XTTS_MODEL.config.length_penalty,
repetition_penalty=XTTS_MODEL.config.repetition_penalty,
top_k=XTTS_MODEL.config.top_k,
top_p=XTTS_MODEL.config.top_p,
)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
out["wav"] = torch.tensor(out["wav"]).unsqueeze(0)
out_path = fp.name
torchaudio.save(out_path, out["wav"], 24000)
return "Speech generated !", out_path, speaker_audio_file
# define a logger to redirect
class Logger:
def __init__(self, filename="log.out"):
self.log_file = filename
self.terminal = sys.stdout
self.log = open(self.log_file, "w")
def write(self, message):
self.terminal.write(message)
self.log.write(message)
def flush(self):
self.terminal.flush()
self.log.flush()
def isatty(self):
return False
# redirect stdout and stderr to a file
sys.stdout = Logger()
sys.stderr = sys.stdout
# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
import logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[
logging.StreamHandler(sys.stdout)
]
)
def read_logs():
sys.stdout.flush()
with open(sys.stdout.log_file, "r") as f:
return f.read()
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="""XTTS fine-tuning demo\n\n"""
"""
Example runs:
python3 TTS/demos/xtts_ft_demo/xtts_demo.py --port
""",
formatter_class=argparse.RawTextHelpFormatter,
)
parser.add_argument(
"--port",
type=int,
help="Port to run the gradio demo. Default: 5003",
default=5003,
)
parser.add_argument(
"--out_path",
type=str,
help="Output path (where data and checkpoints will be saved) Default: /tmp/xtts_ft/",
default="/tmp/xtts_ft/",
)
parser.add_argument(
"--num_epochs",
type=int,
help="Number of epochs to train. Default: 10",
default=10,
)
parser.add_argument(
"--batch_size",
type=int,
help="Batch size. Default: 4",
default=4,
)
parser.add_argument(
"--grad_acumm",
type=int,
help="Grad accumulation steps. Default: 1",
default=1,
)
parser.add_argument(
"--max_audio_length",
type=int,
help="Max permitted audio size in seconds. Default: 11",
default=11,
)
args = parser.parse_args()
with gr.Blocks() as demo:
with gr.Tab("1 - Data processing"):
out_path = gr.Textbox(
label="Output path (where data and checkpoints will be saved):",
value=args.out_path,
)
# upload_file = gr.Audio(
# sources="upload",
# label="Select here the audio files that you want to use for XTTS trainining !",
# type="filepath",
# )
upload_file = gr.File(
file_count="multiple",
label="Select here the audio files that you want to use for XTTS trainining (Supported formats: wav, mp3, and flac)",
)
lang = gr.Dropdown(
label="Dataset Language",
value="en",
choices=[
"en",
"es",
"fr",
"de",
"it",
"pt",
"pl",
"tr",
"ru",
"nl",
"cs",
"ar",
"zh",
"hu",
"ko",
"ja"
],
)
progress_data = gr.Label(
label="Progress:"
)
logs = gr.Textbox(
label="Logs:",
interactive=False,
)
demo.load(read_logs, None, logs, every=1)
prompt_compute_btn = gr.Button(value="Step 1 - Create dataset")
def preprocess_dataset(audio_path, language, out_path, progress=gr.Progress(track_tqdm=True)):
clear_gpu_cache()
out_path = os.path.join(out_path, "dataset")
os.makedirs(out_path, exist_ok=True)
if audio_path is None:
return "You should provide one or multiple audio files! If you provided it, probably the upload of the files is not finished yet!", "", ""
else:
try:
train_meta, eval_meta, audio_total_size = format_audio_list(audio_path, target_language=language, out_path=out_path, gradio_progress=progress)
except:
traceback.print_exc()
error = traceback.format_exc()
return f"The data processing was interrupted due an error !! Please check the console to verify the full error message! \n Error summary: {error}", "", ""
clear_gpu_cache()
                # if the total audio length is less than 2 minutes, raise an error
if audio_total_size < 120:
message = "The sum of the duration of the audios that you provided should be at least 2 minutes!"
print(message)
return message, "", ""
print("Dataset Processed!")
return "Dataset Processed!", train_meta, eval_meta
with gr.Tab("2 - Fine-tuning XTTS Encoder"):
train_csv = gr.Textbox(
label="Train CSV:",
)
eval_csv = gr.Textbox(
label="Eval CSV:",
)
num_epochs = gr.Slider(
label="Number of epochs:",
minimum=1,
maximum=100,
step=1,
value=args.num_epochs,
)
batch_size = gr.Slider(
label="Batch size:",
minimum=2,
maximum=512,
step=1,
value=args.batch_size,
)
grad_acumm = gr.Slider(
label="Grad accumulation steps:",
minimum=2,
maximum=128,
step=1,
value=args.grad_acumm,
)
max_audio_length = gr.Slider(
label="Max permitted audio size in seconds:",
minimum=2,
maximum=20,
step=1,
value=args.max_audio_length,
)
progress_train = gr.Label(
label="Progress:"
)
logs_tts_train = gr.Textbox(
label="Logs:",
interactive=False,
)
demo.load(read_logs, None, logs_tts_train, every=1)
train_btn = gr.Button(value="Step 2 - Run the training")
def train_model(language, train_csv, eval_csv, num_epochs, batch_size, grad_acumm, output_path, max_audio_length):
clear_gpu_cache()
if not train_csv or not eval_csv:
return "You need to run the data processing step or manually set `Train CSV` and `Eval CSV` fields !", "", "", "", ""
try:
# convert seconds to waveform frames
max_audio_length = int(max_audio_length * 22050)
config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
except:
traceback.print_exc()
error = traceback.format_exc()
return f"The training was interrupted due an error !! Please check the console to check the full error message! \n Error summary: {error}", "", "", "", ""
                # copy the original files to avoid parameter-change issues
os.system(f"cp {config_path} {exp_path}")
os.system(f"cp {vocab_file} {exp_path}")
ft_xtts_checkpoint = os.path.join(exp_path, "best_model.pth")
print("Model training done!")
clear_gpu_cache()
return "Model training done!", config_path, vocab_file, ft_xtts_checkpoint, speaker_wav
with gr.Tab("3 - Inference"):
with gr.Row():
with gr.Column() as col1:
xtts_checkpoint = gr.Textbox(
label="XTTS checkpoint path:",
value="",
)
xtts_config = gr.Textbox(
label="XTTS config path:",
value="",
)
xtts_vocab = gr.Textbox(
label="XTTS vocab path:",
value="",
)
progress_load = gr.Label(
label="Progress:"
)
load_btn = gr.Button(value="Step 3 - Load Fine-tuned XTTS model")
with gr.Column() as col2:
speaker_reference_audio = gr.Textbox(
label="Speaker reference audio:",
value="",
)
tts_language = gr.Dropdown(
label="Language",
value="en",
choices=[
"en",
"es",
"fr",
"de",
"it",
"pt",
"pl",
"tr",
"ru",
"nl",
"cs",
"ar",
"zh",
"hu",
"ko",
"ja",
]
)
tts_text = gr.Textbox(
label="Input Text.",
value="This model sounds really good and above all, it's reasonably fast.",
)
tts_btn = gr.Button(value="Step 4 - Inference")
with gr.Column() as col3:
progress_gen = gr.Label(
label="Progress:"
)
tts_output_audio = gr.Audio(label="Generated Audio.")
reference_audio = gr.Audio(label="Reference audio used.")
prompt_compute_btn.click(
fn=preprocess_dataset,
inputs=[
upload_file,
lang,
out_path,
],
outputs=[
progress_data,
train_csv,
eval_csv,
],
)
train_btn.click(
fn=train_model,
inputs=[
lang,
train_csv,
eval_csv,
num_epochs,
batch_size,
grad_acumm,
out_path,
max_audio_length,
],
outputs=[progress_train, xtts_config, xtts_vocab, xtts_checkpoint, speaker_reference_audio],
)
load_btn.click(
fn=load_model,
inputs=[
xtts_checkpoint,
xtts_config,
xtts_vocab
],
outputs=[progress_load],
)
tts_btn.click(
fn=run_tts,
inputs=[
tts_language,
tts_text,
speaker_reference_audio,
],
outputs=[progress_gen, tts_output_audio, reference_audio],
)
demo.launch(
share=True,
debug=False,
server_port=args.port,
server_name="0.0.0.0"
)

View File

@ -1,15 +1,12 @@
import datetime
import glob
import os
import random
import re
import numpy as np
from scipy import signal
from TTS.encoder.models.lstm import LSTMSpeakerEncoder
from TTS.encoder.models.resnet import ResNetSpeakerEncoder
from TTS.utils.io import save_fsspec
class AugmentWAV(object):
@ -118,11 +115,6 @@ class AugmentWAV(object):
return self.additive_noise(noise_type, audio)
def to_camel(text):
text = text.capitalize()
return re.sub(r"(?!^)_([a-zA-Z])", lambda m: m.group(1).upper(), text)
def setup_encoder_model(config: "Coqpit"):
if config.model_params["model_name"].lower() == "lstm":
model = LSTMSpeakerEncoder(
@ -142,41 +134,3 @@ def setup_encoder_model(config: "Coqpit"):
audio_config=config.audio,
)
return model
def save_checkpoint(model, optimizer, criterion, model_loss, out_path, current_step, epoch):
checkpoint_path = "checkpoint_{}.pth".format(current_step)
checkpoint_path = os.path.join(out_path, checkpoint_path)
print(" | | > Checkpoint saving : {}".format(checkpoint_path))
new_state_dict = model.state_dict()
state = {
"model": new_state_dict,
"optimizer": optimizer.state_dict() if optimizer is not None else None,
"criterion": criterion.state_dict(),
"step": current_step,
"epoch": epoch,
"loss": model_loss,
"date": datetime.date.today().strftime("%B %d, %Y"),
}
save_fsspec(state, checkpoint_path)
def save_best_model(model, optimizer, criterion, model_loss, best_loss, out_path, current_step, epoch):
if model_loss < best_loss:
new_state_dict = model.state_dict()
state = {
"model": new_state_dict,
"optimizer": optimizer.state_dict(),
"criterion": criterion.state_dict(),
"step": current_step,
"epoch": epoch,
"loss": model_loss,
"date": datetime.date.today().strftime("%B %d, %Y"),
}
best_loss = model_loss
bestmodel_path = "best_model.pth"
bestmodel_path = os.path.join(out_path, bestmodel_path)
print("\n > BEST MODEL ({0:.5f}) : {1:}".format(model_loss, bestmodel_path))
save_fsspec(state, bestmodel_path)
return best_loss

View File

@ -1,38 +0,0 @@
import datetime
import os
from TTS.utils.io import save_fsspec
def save_checkpoint(model, optimizer, model_loss, out_path, current_step):
checkpoint_path = "checkpoint_{}.pth".format(current_step)
checkpoint_path = os.path.join(out_path, checkpoint_path)
print(" | | > Checkpoint saving : {}".format(checkpoint_path))
new_state_dict = model.state_dict()
state = {
"model": new_state_dict,
"optimizer": optimizer.state_dict() if optimizer is not None else None,
"step": current_step,
"loss": model_loss,
"date": datetime.date.today().strftime("%B %d, %Y"),
}
save_fsspec(state, checkpoint_path)
def save_best_model(model, optimizer, model_loss, best_loss, out_path, current_step):
if model_loss < best_loss:
new_state_dict = model.state_dict()
state = {
"model": new_state_dict,
"optimizer": optimizer.state_dict(),
"step": current_step,
"loss": model_loss,
"date": datetime.date.today().strftime("%B %d, %Y"),
}
best_loss = model_loss
bestmodel_path = "best_model.pth"
bestmodel_path = os.path.join(out_path, bestmodel_path)
print("\n > BEST MODEL ({0:.5f}) : {1:}".format(model_loss, bestmodel_path))
save_fsspec(state, bestmodel_path)
return best_loss

View File

@ -3,13 +3,13 @@ from dataclasses import dataclass, field
from coqpit import Coqpit
from trainer import TrainerArgs, get_last_checkpoint
from trainer.io import copy_model_files
from trainer.logging import logger_factory
from trainer.logging.console_logger import ConsoleLogger
from TTS.config import load_config, register_config
from TTS.tts.utils.text.characters import parse_symbols
from TTS.utils.generic_utils import get_experiment_folder_path, get_git_branch
from TTS.utils.io import copy_model_files
@dataclass

View File

@ -43,7 +43,12 @@ class XttsConfig(BaseTTSConfig):
Defaults to `16`.
gpt_cond_len (int):
Secs audio to be used as conditioning for the autoregressive model. Defaults to `3`.
Secs audio to be used as conditioning for the autoregressive model. Defaults to `12`.
gpt_cond_chunk_len (int):
Audio chunk size in secs. Audio is split into chunks and latents are extracted for each chunk. Then the
latents are averaged. Chunking improves the stability. It must be <= gpt_cond_len.
If gpt_cond_len == gpt_cond_chunk_len, no chunking. Defaults to `4`.
max_ref_len (int):
Maximum number of seconds of audio to be used as conditioning for the decoder. Defaults to `10`.
@ -83,6 +88,7 @@ class XttsConfig(BaseTTSConfig):
"hu",
"ko",
"ja",
"hi",
]
)
@ -95,6 +101,7 @@ class XttsConfig(BaseTTSConfig):
num_gpt_outputs: int = 1
# cloning
gpt_cond_len: int = 3
gpt_cond_len: int = 12
gpt_cond_chunk_len: int = 4
max_ref_len: int = 10
sound_norm_refs: bool = False
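A small sanity-check sketch of the new cloning defaults; the values only mirror what the diff above sets:

# Sketch: the updated conditioning defaults in XttsConfig.
from TTS.tts.configs.xtts_config import XttsConfig

config = XttsConfig()
print(config.gpt_cond_len)        # 12: seconds of reference audio used for GPT conditioning
print(config.gpt_cond_chunk_len)  # 4: chunk size in seconds; chunk latents are averaged
print(config.max_ref_len)         # 10: max seconds per reference for the decoder embedding
print("hi" in config.languages)   # True: Hindi added to the supported languages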

View File

@ -13,6 +13,8 @@ from TTS.tts.utils.data import prepare_data, prepare_stop_target, prepare_tensor
from TTS.utils.audio import AudioProcessor
from TTS.utils.audio.numpy_transforms import compute_energy as calculate_energy
import mutagen
# to prevent too many open files error as suggested here
# https://github.com/pytorch/pytorch/issues/11201#issuecomment-421146936
torch.multiprocessing.set_sharing_strategy("file_system")
@ -42,6 +44,15 @@ def string2filename(string):
return filename
def get_audio_size(audiopath):
extension = audiopath.rpartition(".")[-1].lower()
if extension not in {"mp3", "wav", "flac"}:
raise RuntimeError(f"The audio format {extension} is not supported, please convert the audio files to mp3, flac, or wav format!")
audio_info = mutagen.File(audiopath).info
return int(audio_info.length * audio_info.sample_rate)
class TTSDataset(Dataset):
def __init__(
self,
@ -176,7 +187,7 @@ class TTSDataset(Dataset):
lens = []
for item in self.samples:
_, wav_file, *_ = _parse_sample(item)
audio_len = os.path.getsize(wav_file) / 16 * 8 # assuming 16bit audio
audio_len = get_audio_size(wav_file)
lens.append(audio_len)
return lens
@ -295,7 +306,7 @@ class TTSDataset(Dataset):
def _compute_lengths(samples):
new_samples = []
for item in samples:
audio_length = os.path.getsize(item["audio_file"]) / 16 * 8 # assuming 16bit audio
audio_length = get_audio_size(item["audio_file"])
text_lenght = len(item["text"])
item["audio_length"] = audio_length
item["text_length"] = text_lenght

View File

@ -13,12 +13,18 @@ import math
import numpy as np
import torch
import torch as th
from k_diffusion.sampling import sample_dpmpp_2m, sample_euler_ancestral
from tqdm import tqdm
from TTS.tts.layers.tortoise.dpm_solver import DPM_Solver, NoiseScheduleVP, model_wrapper
K_DIFFUSION_SAMPLERS = {"k_euler_a": sample_euler_ancestral, "dpm++2m": sample_dpmpp_2m}
try:
from k_diffusion.sampling import sample_dpmpp_2m, sample_euler_ancestral
K_DIFFUSION_SAMPLERS = {"k_euler_a": sample_euler_ancestral, "dpm++2m": sample_dpmpp_2m}
except ImportError:
K_DIFFUSION_SAMPLERS = None
SAMPLERS = ["dpm++2m", "p", "ddim"]
@ -531,6 +537,8 @@ class GaussianDiffusion:
if self.conditioning_free is not True:
raise RuntimeError("cond_free must be true")
with tqdm(total=self.num_timesteps) as pbar:
if K_DIFFUSION_SAMPLERS is None:
raise ModuleNotFoundError("Install k_diffusion for using k_diffusion samplers")
return self.k_diffusion_sample_loop(K_DIFFUSION_SAMPLERS[s], pbar, *args, **kwargs)
else:
raise RuntimeError("sampler not impl")

View File

@ -562,15 +562,21 @@ class DPM_Solver:
if order == 3:
K = steps // 3 + 1
if steps % 3 == 0:
orders = [3,] * (
orders = [
3,
] * (
K - 2
) + [2, 1]
elif steps % 3 == 1:
orders = [3,] * (
orders = [
3,
] * (
K - 1
) + [1]
else:
orders = [3,] * (
orders = [
3,
] * (
K - 1
) + [2]
elif order == 2:
@ -581,7 +587,9 @@ class DPM_Solver:
] * K
else:
K = steps // 2 + 1
orders = [2,] * (
orders = [
2,
] * (
K - 1
) + [1]
elif order == 1:
@ -1440,7 +1448,10 @@ class DPM_Solver:
model_prev_list[-1] = self.model_fn(x, t)
elif method in ["singlestep", "singlestep_fixed"]:
if method == "singlestep":
(timesteps_outer, orders,) = self.get_orders_and_timesteps_for_singlestep_solver(
(
timesteps_outer,
orders,
) = self.get_orders_and_timesteps_for_singlestep_solver(
steps=steps,
order=order,
skip_type=skip_type,

View File

@ -128,6 +128,7 @@ class GPT(nn.Module):
self.heads = heads
self.model_dim = model_dim
self.max_conditioning_inputs = max_conditioning_inputs
self.max_gen_mel_tokens = max_mel_tokens - self.max_conditioning_inputs - 2
self.max_mel_tokens = -1 if max_mel_tokens == -1 else max_mel_tokens + 2 + self.max_conditioning_inputs
self.max_text_tokens = -1 if max_text_tokens == -1 else max_text_tokens + 2
self.max_prompt_tokens = max_prompt_tokens
@ -425,15 +426,6 @@ class GPT(nn.Module):
if max_mel_len > audio_codes.shape[-1]:
audio_codes = F.pad(audio_codes, (0, max_mel_len - audio_codes.shape[-1]))
silence = True
for idx, l in enumerate(code_lengths):
length = l.item()
while silence:
if audio_codes[idx, length - 1] != 83:
break
length -= 1
code_lengths[idx] = length
# 💖 Lovely assertions
assert (
max_mel_len <= audio_codes.shape[-1]
@ -449,7 +441,9 @@ class GPT(nn.Module):
audio_codes = F.pad(audio_codes[:, :max_mel_len], (0, 1), value=self.stop_audio_token)
# Pad mel codes with stop_audio_token
audio_codes = self.set_mel_padding(audio_codes, code_lengths)
audio_codes = self.set_mel_padding(
audio_codes, code_lengths - 3
        )  # -3 to get the real code lengths without considering the start and stop tokens that were not added yet
# Build input and target tensors
# Prepend start token to inputs and append stop token to targets
@ -598,7 +592,7 @@ class GPT(nn.Module):
bos_token_id=self.start_audio_token,
pad_token_id=self.stop_audio_token,
eos_token_id=self.stop_audio_token,
max_length=self.max_mel_tokens,
max_length=self.max_gen_mel_tokens + gpt_inputs.shape[-1],
**hf_generate_kwargs,
)
if "return_dict_in_generate" in hf_generate_kwargs:
@ -611,7 +605,7 @@ class GPT(nn.Module):
bos_token_id=self.start_audio_token,
pad_token_id=self.stop_audio_token,
eos_token_id=self.stop_audio_token,
max_length=self.max_mel_tokens,
max_length=self.max_gen_mel_tokens + fake_inputs.shape[-1],
do_stream=True,
**hf_generate_kwargs,
)

View File

@ -1,17 +1,73 @@
import json
import os
import re
import textwrap
from functools import cached_property
import pypinyin
import torch
from hangul_romanize import Transliter
from hangul_romanize.rule import academic
from num2words import num2words
from spacy.lang.ar import Arabic
from spacy.lang.en import English
from spacy.lang.es import Spanish
from spacy.lang.ja import Japanese
from spacy.lang.zh import Chinese
from tokenizers import Tokenizer
from functools import cached_property
from TTS.tts.layers.xtts.zh_num2words import TextNorm as zh_num2words
def get_spacy_lang(lang):
if lang == "zh":
return Chinese()
elif lang == "ja":
return Japanese()
elif lang == "ar":
return Arabic()
elif lang == "es":
return Spanish()
else:
        # For most languages, English does the job
return English()
def split_sentence(text, lang, text_split_length=250):
"""Preprocess the input text"""
text_splits = []
if text_split_length is not None and len(text) >= text_split_length:
text_splits.append("")
nlp = get_spacy_lang(lang)
nlp.add_pipe("sentencizer")
doc = nlp(text)
for sentence in doc.sents:
if len(text_splits[-1]) + len(str(sentence)) <= text_split_length:
# if the last sentence + the current sentence is less than the text_split_length
# then add the current sentence to the last sentence
text_splits[-1] += " " + str(sentence)
text_splits[-1] = text_splits[-1].lstrip()
elif len(str(sentence)) > text_split_length:
# if the current sentence is greater than the text_split_length
for line in textwrap.wrap(
str(sentence),
width=text_split_length,
drop_whitespace=True,
break_on_hyphens=False,
tabsize=1,
):
text_splits.append(str(line))
else:
text_splits.append(str(sentence))
if len(text_splits) > 1:
if text_splits[0] == "":
del text_splits[0]
else:
text_splits = [text.lstrip()]
return text_splits
_whitespace_re = re.compile(r"\s+")
# List of (regular expression, replacement) pairs for abbreviations:
@ -464,7 +520,7 @@ def _expand_number(m, lang="en"):
def expand_numbers_multilingual(text, lang="en"):
if lang == "zh" or lang == "zh-cn":
if lang == "zh":
text = zh_num2words()(text)
else:
if lang in ["en", "ru"]:
@ -525,7 +581,7 @@ def japanese_cleaners(text, katsu):
return text
def korean_cleaners(text):
def korean_transliterate(text):
r = Transliter(academic)
return r.translit(text)
@ -546,7 +602,7 @@ class VoiceBpeTokenizer:
"it": 213,
"pt": 203,
"pl": 224,
"zh-cn": 82,
"zh": 82,
"ar": 166,
"cs": 186,
"ru": 182,
@ -560,29 +616,38 @@ class VoiceBpeTokenizer:
@cached_property
def katsu(self):
import cutlet
return cutlet.Cutlet()
def check_input_length(self, txt, lang):
lang = lang.split("-")[0] # remove the region
limit = self.char_limits.get(lang, 250)
if len(txt) > limit:
print(f"[!] Warning: The text length exceeds the character limit of {limit} for language '{lang}', this might cause truncated audio.")
print(
f"[!] Warning: The text length exceeds the character limit of {limit} for language '{lang}', this might cause truncated audio."
)
def preprocess_text(self, txt, lang):
if lang in {"ar", "cs", "de", "en", "es", "fr", "hu", "it", "nl", "pl", "pt", "ru", "tr", "zh", "zh-cn"}:
if lang in {"ar", "cs", "de", "en", "es", "fr", "hu", "it", "nl", "pl", "pt", "ru", "tr", "zh", "ko"}:
txt = multilingual_cleaners(txt, lang)
if lang in {"zh", "zh-cn"}:
if lang == "zh":
txt = chinese_transliterate(txt)
if lang == "ko":
txt = korean_transliterate(txt)
elif lang == "ja":
txt = japanese_cleaners(txt, self.katsu)
elif lang == "ko":
txt = korean_cleaners(txt)
elif lang == "hi":
# @manmay will implement this
txt = basic_cleaners(txt)
else:
raise NotImplementedError(f"Language '{lang}' is not supported.")
return txt
def encode(self, txt, lang):
lang = lang.split("-")[0] # remove the region
self.check_input_length(txt, lang)
txt = self.preprocess_text(txt, lang)
lang = "zh-cn" if lang == "zh" else lang
txt = f"[{lang}]{txt}"
txt = txt.replace(" ", "[SPACE]")
return self.tokenizer.encode(txt).ids
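A brief usage sketch of the split_sentence helper added above; spaCy is assumed to be installed, and the 40-character limit is arbitrary for illustration:

# Sketch: sentence-aware splitting used when enable_text_splitting=True at inference.
from TTS.tts.layers.xtts.tokenizer import split_sentence

chunks = split_sentence(
    "This is the first sentence. This is the second one, and it is a little longer.",
    lang="en",
    text_split_length=40,
)
print(chunks)  # sentences are grouped or wrapped so each chunk stays near the limit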

View File

@ -5,6 +5,7 @@ import sys
import torch
import torch.nn.functional as F
import torch.utils.data
from TTS.tts.models.xtts import load_audio
torch.set_num_threads(1)

View File

@ -225,11 +225,11 @@ class GPTTrainer(BaseTTS):
@torch.no_grad()
def test_run(self, assets) -> Tuple[Dict, Dict]: # pylint: disable=W0613
test_audios = {}
if self.config.test_sentences:
# init gpt for inference mode
self.xtts.gpt.init_gpt_for_inference(kv_cache=self.args.kv_cache, use_deepspeed=False)
self.xtts.gpt.eval()
test_audios = {}
print(" | > Synthesizing test sentences.")
for idx, s_info in enumerate(self.config.test_sentences):
wav = self.xtts.synthesize(
@ -318,9 +318,13 @@ class GPTTrainer(BaseTTS):
batch["cond_idxs"] = None
return self.train_step(batch, criterion)
def on_epoch_start(self, trainer): # pylint: disable=W0613
        # guarantee that the dvae will be in eval mode after .train() at the end of evaluation
self.dvae = self.dvae.eval()
def on_train_epoch_start(self, trainer):
trainer.model.eval() # the whole model to eval
# put gpt model in training mode
if hasattr(trainer.model, "module") and hasattr(trainer.model.module, "xtts"):
trainer.model.module.xtts.gpt.train()
else:
trainer.model.xtts.gpt.train()
def on_init_end(self, trainer): # pylint: disable=W0613
# ignore similarities.pth on clearml save/upload
@ -386,7 +390,8 @@ class GPTTrainer(BaseTTS):
else:
loader = DataLoader(
dataset,
batch_sampler=sampler,
sampler=sampler,
                    batch_size=config.eval_batch_size if is_eval else config.batch_size,
collate_fn=dataset.collate_fn,
num_workers=config.num_eval_loader_workers if is_eval else config.num_loader_workers,
pin_memory=False,

View File

@ -0,0 +1,34 @@
import torch
class SpeakerManager():
def __init__(self, speaker_file_path=None):
self.speakers = torch.load(speaker_file_path)
@property
def name_to_id(self):
return self.speakers.keys()
@property
def num_speakers(self):
return len(self.name_to_id)
@property
def speaker_names(self):
        return list(self.name_to_id)
class LanguageManager():
def __init__(self, config):
self.langs = config["languages"]
@property
def name_to_id(self):
return self.langs
@property
def num_languages(self):
return len(self.name_to_id)
@property
def language_names(self):
return list(self.name_to_id)
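A minimal sketch of the new managers; the speakers_xtts.pth path is a placeholder for the speaker file shipped with the XTTS v2 model folder:

# Sketch: SpeakerManager exposes the precomputed studio-speaker conditioning as a dict.
from TTS.tts.layers.xtts.xtts_manager import SpeakerManager

sm = SpeakerManager("/path/to/xtts_v2/speakers_xtts.pth")
print(sm.num_speakers)
print(sm.speaker_names[:3])  # each name maps to its precomputed latents (see synthesize() in xtts.py)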

View File

@ -65,7 +65,7 @@ CN_PUNCS_NONSTOP = "
CN_PUNCS = CN_PUNCS_STOP + CN_PUNCS_NONSTOP
PUNCS = CN_PUNCS + string.punctuation
PUNCS_TRANSFORM = str.maketrans(PUNCS, " " * len(PUNCS), "") # replace puncs with space
PUNCS_TRANSFORM = str.maketrans(PUNCS, "," * len(PUNCS), "") # replace puncs with English comma
# https://zh.wikipedia.org/wiki/全行和半行

View File

@ -241,7 +241,7 @@ class ForwardTTS(BaseTTS):
)
self.duration_predictor = DurationPredictor(
self.args.hidden_channels + self.embedded_speaker_dim,
self.args.hidden_channels,
self.args.duration_predictor_hidden_channels,
self.args.duration_predictor_kernel_size,
self.args.duration_predictor_dropout_p,
@ -249,7 +249,7 @@ class ForwardTTS(BaseTTS):
if self.args.use_pitch:
self.pitch_predictor = DurationPredictor(
self.args.hidden_channels + self.embedded_speaker_dim,
self.args.hidden_channels,
self.args.pitch_predictor_hidden_channels,
self.args.pitch_predictor_kernel_size,
self.args.pitch_predictor_dropout_p,
@ -263,7 +263,7 @@ class ForwardTTS(BaseTTS):
if self.args.use_energy:
self.energy_predictor = DurationPredictor(
self.args.hidden_channels + self.embedded_speaker_dim,
self.args.hidden_channels,
self.args.energy_predictor_hidden_channels,
self.args.energy_predictor_kernel_size,
self.args.energy_predictor_dropout_p,
@ -299,7 +299,8 @@ class ForwardTTS(BaseTTS):
if config.use_d_vector_file:
self.embedded_speaker_dim = config.d_vector_dim
if self.args.d_vector_dim != self.args.hidden_channels:
self.proj_g = nn.Conv1d(self.args.d_vector_dim, self.args.hidden_channels, 1)
#self.proj_g = nn.Conv1d(self.args.d_vector_dim, self.args.hidden_channels, 1)
self.proj_g = nn.Linear(in_features=self.args.d_vector_dim, out_features=self.args.hidden_channels)
# init speaker embedding layer
if config.use_speaker_embedding and not config.use_d_vector_file:
print(" > Init speaker_embedding layer.")
@ -403,10 +404,13 @@ class ForwardTTS(BaseTTS):
# [B, T, C]
x_emb = self.emb(x)
# encoder pass
o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask)
#o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask)
o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask, g)
# speaker conditioning
# TODO: try different ways of conditioning
if g is not None:
if hasattr(self, "proj_g"):
g = self.proj_g(g.view(g.shape[0], -1)).unsqueeze(-1)
o_en = o_en + g
return o_en, x_mask, g, x_emb

View File

@ -7,11 +7,11 @@ import torch.nn.functional as F
import torchaudio
from coqpit import Coqpit
from TTS.tts.layers.tortoise.audio_utils import wav_to_univnet_mel
from TTS.tts.layers.xtts.gpt import GPT
from TTS.tts.layers.xtts.hifigan_decoder import HifiDecoder
from TTS.tts.layers.xtts.stream_generator import init_stream_support
from TTS.tts.layers.xtts.tokenizer import VoiceBpeTokenizer
from TTS.tts.layers.xtts.tokenizer import VoiceBpeTokenizer, split_sentence
from TTS.tts.layers.xtts.xtts_manager import SpeakerManager, LanguageManager
from TTS.tts.models.base_tts import BaseTTS
from TTS.utils.io import load_fsspec
@ -255,60 +255,63 @@ class Xtts(BaseTTS):
return next(self.parameters()).device
@torch.inference_mode()
def get_gpt_cond_latents(self, audio, sr, length: int = 3):
def get_gpt_cond_latents(self, audio, sr, length: int = 30, chunk_length: int = 6):
"""Compute the conditioning latents for the GPT model from the given audio.
Args:
audio (tensor): audio tensor.
sr (int): Sample rate of the audio.
length (int): Length of the audio in seconds. Defaults to 3.
length (int): Length of the audio in seconds. If < 0, use the whole audio. Defaults to 30.
chunk_length (int): Length of the audio chunks in seconds. When `length == chunk_length`, the whole audio
is being used without chunking. It must be < `length`. Defaults to 6.
"""
if sr != 22050:
audio = torchaudio.functional.resample(audio, sr, 22050)
audio = audio[:, : 22050 * length]
if length > 0:
audio = audio[:, : 22050 * length]
if self.args.gpt_use_perceiver_resampler:
n_fft = 2048
hop_length = 256
win_length = 1024
style_embs = []
for i in range(0, audio.shape[1], 22050 * chunk_length):
audio_chunk = audio[:, i : i + 22050 * chunk_length]
# if the chunk is too short ignore it
if audio_chunk.size(-1) < 22050 * 0.33:
continue
mel_chunk = wav_to_mel_cloning(
audio_chunk,
mel_norms=self.mel_stats.cpu(),
n_fft=2048,
hop_length=256,
win_length=1024,
power=2,
normalized=False,
sample_rate=22050,
f_min=0,
f_max=8000,
n_mels=80,
)
style_emb = self.gpt.get_style_emb(mel_chunk.to(self.device), None)
style_embs.append(style_emb)
# mean style embedding
cond_latent = torch.stack(style_embs).mean(dim=0)
else:
n_fft = 4096
hop_length = 1024
win_length = 4096
mel = wav_to_mel_cloning(
audio,
mel_norms=self.mel_stats.cpu(),
n_fft=n_fft,
hop_length=hop_length,
win_length=win_length,
power=2,
normalized=False,
sample_rate=22050,
f_min=0,
f_max=8000,
n_mels=80,
)
cond_latent = self.gpt.get_style_emb(mel.to(self.device))
return cond_latent.transpose(1, 2)
@torch.inference_mode()
def get_diffusion_cond_latents(self, audio, sr):
from math import ceil
diffusion_conds = []
CHUNK_SIZE = 102400
audio_24k = torchaudio.functional.resample(audio, sr, 24000)
for chunk in range(ceil(audio_24k.shape[1] / CHUNK_SIZE)):
current_sample = audio_24k[:, chunk * CHUNK_SIZE : (chunk + 1) * CHUNK_SIZE]
current_sample = pad_or_truncate(current_sample, CHUNK_SIZE)
cond_mel = wav_to_univnet_mel(
current_sample.to(self.device),
do_normalization=False,
device=self.device,
mel = wav_to_mel_cloning(
audio,
mel_norms=self.mel_stats.cpu(),
n_fft=4096,
hop_length=1024,
win_length=4096,
power=2,
normalized=False,
sample_rate=22050,
f_min=0,
f_max=8000,
n_mels=80,
)
diffusion_conds.append(cond_mel)
diffusion_conds = torch.stack(diffusion_conds, dim=1)
diffusion_latent = self.diffusion_decoder.get_conditioning(diffusion_conds)
return diffusion_latent
cond_latent = self.gpt.get_style_emb(mel.to(self.device))
return cond_latent.transpose(1, 2)
@torch.inference_mode()
def get_speaker_embedding(self, audio, sr):
@ -323,12 +326,24 @@ class Xtts(BaseTTS):
def get_conditioning_latents(
self,
audio_path,
max_ref_length=30,
gpt_cond_len=6,
max_ref_length=10,
gpt_cond_chunk_len=6,
librosa_trim_db=None,
sound_norm_refs=False,
load_sr=24000,
load_sr=22050,
):
"""Get the conditioning latents for the GPT model from the given audio.
Args:
audio_path (str or List[str]): Path to reference audio file(s).
max_ref_length (int): Maximum length of each reference audio in seconds. Defaults to 30.
gpt_cond_len (int): Length of the audio used for gpt latents. Defaults to 6.
            gpt_cond_chunk_len (int): Chunk length used for gpt latents. It must be <= gpt_cond_len. Defaults to 6.
librosa_trim_db (int, optional): Trim the audio using this value. If None, not trimming. Defaults to None.
sound_norm_refs (bool, optional): Whether to normalize the audio. Defaults to False.
            load_sr (int, optional): Sample rate to load the audio. Defaults to 22050.
"""
# deal with multiples references
if not isinstance(audio_path, list):
audio_paths = [audio_path]
@ -339,24 +354,24 @@ class Xtts(BaseTTS):
audios = []
speaker_embedding = None
for file_path in audio_paths:
            # load the audio at load_sr to avoid issues with references at different sample rates
audio = load_audio(file_path, load_sr)
audio = audio[:, : load_sr * max_ref_length].to(self.device)
if audio.shape[0] > 1:
audio = audio.mean(0, keepdim=True)
if sound_norm_refs:
audio = (audio / torch.abs(audio).max()) * 0.75
if librosa_trim_db is not None:
audio = librosa.effects.trim(audio, top_db=librosa_trim_db)[0]
# compute latents for the decoder
speaker_embedding = self.get_speaker_embedding(audio, load_sr)
speaker_embeddings.append(speaker_embedding)
audios.append(audio)
# use a merge of all references for gpt cond latents
# merge all the audios and compute the latents for the gpt
full_audio = torch.cat(audios, dim=-1)
gpt_cond_latents = self.get_gpt_cond_latents(full_audio, load_sr, length=gpt_cond_len) # [1, 1024, T]
gpt_cond_latents = self.get_gpt_cond_latents(
full_audio, load_sr, length=gpt_cond_len, chunk_length=gpt_cond_chunk_len
) # [1, 1024, T]
if speaker_embeddings:
speaker_embedding = torch.stack(speaker_embeddings)
@ -364,7 +379,7 @@ class Xtts(BaseTTS):
return gpt_cond_latents, speaker_embedding
def synthesize(self, text, config, speaker_wav, language, **kwargs):
def synthesize(self, text, config, speaker_wav, language, speaker_id=None, **kwargs):
"""Synthesize speech with the given input text.
Args:
@ -379,15 +394,9 @@ class Xtts(BaseTTS):
`text_input` as text token IDs after tokenizer, `voice_samples` as samples used for cloning, `conditioning_latents`
as latents used at inference.
"""
return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
def inference_with_config(self, text, config, ref_audio_path, language, **kwargs):
"""
inference with config
"""
assert (
language in self.config.languages
"zh-cn" if language == "zh" else language in self.config.languages
), f" ❗ Language {language} is not supported. Supported languages are {self.config.languages}"
# Use generally found best tuning knobs for generation.
settings = {
@ -396,12 +405,18 @@ class Xtts(BaseTTS):
"repetition_penalty": config.repetition_penalty,
"top_k": config.top_k,
"top_p": config.top_p,
"gpt_cond_len": config.gpt_cond_len,
"max_ref_len": config.max_ref_len,
"sound_norm_refs": config.sound_norm_refs,
}
settings.update(kwargs) # allow overriding of preset settings with kwargs
return self.full_inference(text, ref_audio_path, language, **settings)
if speaker_id is not None:
gpt_cond_latent, speaker_embedding = self.speaker_manager.speakers[speaker_id].values()
return self.inference(text, language, gpt_cond_latent, speaker_embedding, **settings)
settings.update({
"gpt_cond_len": config.gpt_cond_len,
"gpt_cond_chunk_len": config.gpt_cond_chunk_len,
"max_ref_len": config.max_ref_len,
"sound_norm_refs": config.sound_norm_refs,
})
return self.full_inference(text, speaker_wav, language, **settings)
@torch.inference_mode()
def full_inference(
@ -410,14 +425,15 @@ class Xtts(BaseTTS):
ref_audio_path,
language,
# GPT inference
temperature=0.65,
length_penalty=1,
repetition_penalty=2.0,
temperature=0.75,
length_penalty=1.0,
repetition_penalty=10.0,
top_k=50,
top_p=0.85,
do_sample=True,
# Cloning
gpt_cond_len=6,
gpt_cond_len=30,
gpt_cond_chunk_len=6,
max_ref_len=10,
sound_norm_refs=False,
**hf_generate_kwargs,
@ -448,7 +464,10 @@ class Xtts(BaseTTS):
(aka boring) outputs. Defaults to 0.8.
gpt_cond_len: (int) Length of the audio used for cloning. If audio is shorter, then audio length is used
else the first `gpt_cond_len` secs is used. Defaults to 6 seconds.
else the first `gpt_cond_len` secs is used. Defaults to 30 seconds.
gpt_cond_chunk_len: (int) Chunk length used for cloning. It must be <= `gpt_cond_len`.
If gpt_cond_len == gpt_cond_chunk_len, no chunking. Defaults to 6 seconds.
hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive
transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation
@ -461,6 +480,7 @@ class Xtts(BaseTTS):
(gpt_cond_latent, speaker_embedding) = self.get_conditioning_latents(
audio_path=ref_audio_path,
gpt_cond_len=gpt_cond_len,
gpt_cond_chunk_len=gpt_cond_chunk_len,
max_ref_length=max_ref_len,
sound_norm_refs=sound_norm_refs,
)
@ -487,72 +507,78 @@ class Xtts(BaseTTS):
gpt_cond_latent,
speaker_embedding,
# GPT inference
temperature=0.65,
length_penalty=1,
repetition_penalty=2.0,
temperature=0.75,
length_penalty=1.0,
repetition_penalty=10.0,
top_k=50,
top_p=0.85,
do_sample=True,
num_beams=1,
speed=1.0,
enable_text_splitting=False,
**hf_generate_kwargs,
):
text = text.strip().lower()
text_tokens = torch.IntTensor(self.tokenizer.encode(text, lang=language)).unsqueeze(0).to(self.device)
language = language.split("-")[0] # remove the country code
length_scale = 1.0 / max(speed, 0.05)
gpt_cond_latent = gpt_cond_latent.to(self.device)
speaker_embedding = speaker_embedding.to(self.device)
if enable_text_splitting:
text = split_sentence(text, language, self.tokenizer.char_limits[language])
else:
text = [text]
# print(" > Input text: ", text)
# print(" > Input text preprocessed: ",self.tokenizer.preprocess_text(text, language))
# print(" > Input tokens: ", text_tokens)
# print(" > Decoded text: ", self.tokenizer.decode(text_tokens[0].cpu().numpy()))
assert (
text_tokens.shape[-1] < self.args.gpt_max_text_tokens
), " ❗ XTTS can only generate text with a maximum of 400 tokens."
wavs = []
gpt_latents_list = []
for sent in text:
sent = sent.strip().lower()
text_tokens = torch.IntTensor(self.tokenizer.encode(sent, lang=language)).unsqueeze(0).to(self.device)
with torch.no_grad():
gpt_codes = self.gpt.generate(
cond_latents=gpt_cond_latent,
text_inputs=text_tokens,
input_tokens=None,
do_sample=do_sample,
top_p=top_p,
top_k=top_k,
temperature=temperature,
num_return_sequences=self.gpt_batch_size,
num_beams=num_beams,
length_penalty=length_penalty,
repetition_penalty=repetition_penalty,
output_attentions=False,
**hf_generate_kwargs,
)
expected_output_len = torch.tensor(
[gpt_codes.shape[-1] * self.gpt.code_stride_len], device=text_tokens.device
)
assert (
text_tokens.shape[-1] < self.args.gpt_max_text_tokens
), " ❗ XTTS can only generate text with a maximum of 400 tokens."
text_len = torch.tensor([text_tokens.shape[-1]], device=self.device)
gpt_latents = self.gpt(
text_tokens,
text_len,
gpt_codes,
expected_output_len,
cond_latents=gpt_cond_latent,
return_attentions=False,
return_latent=True,
)
silence_token = 83
ctokens = 0
for k in range(gpt_codes.shape[-1]):
if gpt_codes[0, k] == silence_token:
ctokens += 1
else:
ctokens = 0
if ctokens > 8:
gpt_latents = gpt_latents[:, :k]
break
with torch.no_grad():
gpt_codes = self.gpt.generate(
cond_latents=gpt_cond_latent,
text_inputs=text_tokens,
input_tokens=None,
do_sample=do_sample,
top_p=top_p,
top_k=top_k,
temperature=temperature,
num_return_sequences=self.gpt_batch_size,
num_beams=num_beams,
length_penalty=length_penalty,
repetition_penalty=repetition_penalty,
output_attentions=False,
**hf_generate_kwargs,
)
expected_output_len = torch.tensor(
[gpt_codes.shape[-1] * self.gpt.code_stride_len], device=text_tokens.device
)
wav = self.hifigan_decoder(gpt_latents, g=speaker_embedding)
text_len = torch.tensor([text_tokens.shape[-1]], device=self.device)
gpt_latents = self.gpt(
text_tokens,
text_len,
gpt_codes,
expected_output_len,
cond_latents=gpt_cond_latent,
return_attentions=False,
return_latent=True,
)
if length_scale != 1.0:
gpt_latents = F.interpolate(
gpt_latents.transpose(1, 2), scale_factor=length_scale, mode="linear"
).transpose(1, 2)
gpt_latents_list.append(gpt_latents.cpu())
wavs.append(self.hifigan_decoder(gpt_latents, g=speaker_embedding).cpu().squeeze())
return {
"wav": wav.cpu().numpy().squeeze(),
"gpt_latents": gpt_latents,
"wav": torch.cat(wavs, dim=0).numpy(),
"gpt_latents": torch.cat(gpt_latents_list, dim=1).numpy(),
"speaker_embedding": speaker_embedding,
}
@ -566,7 +592,7 @@ class Xtts(BaseTTS):
if overlap_len > len(wav_chunk):
# wav_chunk is smaller than overlap_len, pass on last wav_gen
if wav_gen_prev is not None:
wav_chunk = wav_gen[(wav_gen_prev.shape[0] - overlap_len):]
wav_chunk = wav_gen[(wav_gen_prev.shape[0] - overlap_len) :]
else:
                # not expecting to hit here, as the problem happens on the last chunk
wav_chunk = wav_gen[-overlap_len:]
@ -592,58 +618,78 @@ class Xtts(BaseTTS):
stream_chunk_size=20,
overlap_wav_len=1024,
# GPT inference
temperature=0.65,
length_penalty=1,
repetition_penalty=2.0,
temperature=0.75,
length_penalty=1.0,
repetition_penalty=10.0,
top_k=50,
top_p=0.85,
do_sample=True,
speed=1.0,
enable_text_splitting=False,
**hf_generate_kwargs,
):
text = text.strip().lower()
text_tokens = torch.IntTensor(self.tokenizer.encode(text, lang=language)).unsqueeze(0).to(self.device)
language = language.split("-")[0] # remove the country code
length_scale = 1.0 / max(speed, 0.05)
gpt_cond_latent = gpt_cond_latent.to(self.device)
speaker_embedding = speaker_embedding.to(self.device)
if enable_text_splitting:
text = split_sentence(text, language, self.tokenizer.char_limits[language])
else:
text = [text]
fake_inputs = self.gpt.compute_embeddings(
gpt_cond_latent.to(self.device),
text_tokens,
)
gpt_generator = self.gpt.get_generator(
fake_inputs=fake_inputs,
top_k=top_k,
top_p=top_p,
temperature=temperature,
do_sample=do_sample,
num_beams=1,
num_return_sequences=1,
length_penalty=float(length_penalty),
repetition_penalty=float(repetition_penalty),
output_attentions=False,
output_hidden_states=True,
**hf_generate_kwargs,
)
for sent in text:
sent = sent.strip().lower()
text_tokens = torch.IntTensor(self.tokenizer.encode(sent, lang=language)).unsqueeze(0).to(self.device)
last_tokens = []
all_latents = []
wav_gen_prev = None
wav_overlap = None
is_end = False
assert (
text_tokens.shape[-1] < self.args.gpt_max_text_tokens
), " ❗ XTTS can only generate text with a maximum of 400 tokens."
while not is_end:
try:
x, latent = next(gpt_generator)
last_tokens += [x]
all_latents += [latent]
except StopIteration:
is_end = True
fake_inputs = self.gpt.compute_embeddings(
gpt_cond_latent.to(self.device),
text_tokens,
)
gpt_generator = self.gpt.get_generator(
fake_inputs=fake_inputs,
top_k=top_k,
top_p=top_p,
temperature=temperature,
do_sample=do_sample,
num_beams=1,
num_return_sequences=1,
length_penalty=float(length_penalty),
repetition_penalty=float(repetition_penalty),
output_attentions=False,
output_hidden_states=True,
**hf_generate_kwargs,
)
if is_end or (stream_chunk_size > 0 and len(last_tokens) >= stream_chunk_size):
gpt_latents = torch.cat(all_latents, dim=0)[None, :]
wav_gen = self.hifigan_decoder(gpt_latents, g=speaker_embedding.to(self.device))
wav_chunk, wav_gen_prev, wav_overlap = self.handle_chunks(
wav_gen.squeeze(), wav_gen_prev, wav_overlap, overlap_wav_len
)
last_tokens = []
yield wav_chunk
last_tokens = []
all_latents = []
wav_gen_prev = None
wav_overlap = None
is_end = False
while not is_end:
try:
x, latent = next(gpt_generator)
last_tokens += [x]
all_latents += [latent]
except StopIteration:
is_end = True
if is_end or (stream_chunk_size > 0 and len(last_tokens) >= stream_chunk_size):
gpt_latents = torch.cat(all_latents, dim=0)[None, :]
if length_scale != 1.0:
gpt_latents = F.interpolate(
gpt_latents.transpose(1, 2), scale_factor=length_scale, mode="linear"
).transpose(1, 2)
wav_gen = self.hifigan_decoder(gpt_latents, g=speaker_embedding.to(self.device))
wav_chunk, wav_gen_prev, wav_overlap = self.handle_chunks(
wav_gen.squeeze(), wav_gen_prev, wav_overlap, overlap_wav_len
)
last_tokens = []
yield wav_chunk
def forward(self):
raise NotImplementedError(
@ -691,6 +737,7 @@ class Xtts(BaseTTS):
eval=True,
strict=True,
use_deepspeed=False,
speaker_file_path=None,
):
"""
Loads a checkpoint from disk and initializes the model's state and tokenizer.
@ -710,6 +757,14 @@ class Xtts(BaseTTS):
model_path = checkpoint_path or os.path.join(checkpoint_dir, "model.pth")
vocab_path = vocab_path or os.path.join(checkpoint_dir, "vocab.json")
if speaker_file_path is None and checkpoint_dir is not None:
speaker_file_path = os.path.join(checkpoint_dir, "speakers_xtts.pth")
self.language_manager = LanguageManager(config)
self.speaker_manager = None
if speaker_file_path is not None and os.path.exists(speaker_file_path):
self.speaker_manager = SpeakerManager(speaker_file_path)
if os.path.exists(vocab_path):
self.tokenizer = VoiceBpeTokenizer(vocab_file=vocab_path)

View File

@ -185,20 +185,16 @@ class ESpeak(BasePhonemizer):
if tie:
args.append("--tie=%s" % tie)
args.append('"' + text + '"')
args.append(text)
# compute phonemes
phonemes = ""
for line in _espeak_exe(self._ESPEAK_LIB, args, sync=True):
logging.debug("line: %s", repr(line))
ph_decoded = line.decode("utf8").strip()
# espeak needs to skip the first two characters of the returned text:
# version 1.48.03: "_ p_ɹ_ˈaɪ_ɚ t_ə n_oʊ_v_ˈɛ_m_b_ɚ t_w_ˈɛ_n_t_i t_ˈuː\n"
# espeak:
# version 1.48.15: " p_ɹ_ˈaɪ_ɚ t_ə n_oʊ_v_ˈɛ_m_b_ɚ t_w_ˈɛ_n_t_i t_ˈuː\n"
# espeak-ng needs to skip the first character of the returned text:
# "_p_ɹ_ˈaɪ_ɚ t_ə n_oʊ_v_ˈɛ_m_b_ɚ t_w_ˈɛ_n_t_i t_ˈuː\n"
# dealing with the conditions described above
ph_decoded = ph_decoded[:1].replace("_", "") + ph_decoded[1:]
# espeak-ng:
# "p_ɹ_ˈaɪ_ɚ t_ə n_oʊ_v_ˈɛ_m_b_ɚ t_w_ˈɛ_n_t_i t_ˈuː\n"
# espeak-ng backend can add language flags that need to be removed:
# "sɛʁtˈɛ̃ mˈo kɔm (en)fˈʊtbɔːl(fr) ʒenˈɛʁ de- flˈaɡ də- lˈɑ̃ɡ."

View File

@ -15,7 +15,6 @@ class PuncPosition(Enum):
BEGIN = 0
END = 1
MIDDLE = 2
ALONE = 3
class Punctuation:
@ -92,7 +91,7 @@ class Punctuation:
return [text], []
# the text is only punctuations
if len(matches) == 1 and matches[0].group() == text:
return [], [_PUNC_IDX(text, PuncPosition.ALONE)]
return [], [_PUNC_IDX(text, PuncPosition.BEGIN)]
# build a punctuation map to be used later to restore punctuations
puncs = []
for match in matches:
@ -107,11 +106,14 @@ class Punctuation:
for idx, punc in enumerate(puncs):
split = text.split(punc.punc)
prefix, suffix = split[0], punc.punc.join(split[1:])
text = suffix
if prefix == "":
# We don't want to insert an empty string in case of initial punctuation
continue
splitted_text.append(prefix)
# if the text does not end with a punctuation, add it to the last item
if idx == len(puncs) - 1 and len(suffix) > 0:
splitted_text.append(suffix)
text = suffix
return splitted_text, puncs
@classmethod
@ -127,10 +129,10 @@ class Punctuation:
['This is', 'example'], ['.', '!'] -> "This is. example!"
"""
return cls._restore(text, puncs, 0)
return cls._restore(text, puncs)
@classmethod
def _restore(cls, text, puncs, num): # pylint: disable=too-many-return-statements
def _restore(cls, text, puncs): # pylint: disable=too-many-return-statements
"""Auxiliary method for Punctuation.restore()"""
if not puncs:
return text
@ -142,21 +144,18 @@ class Punctuation:
current = puncs[0]
if current.position == PuncPosition.BEGIN:
return cls._restore([current.punc + text[0]] + text[1:], puncs[1:], num)
return cls._restore([current.punc + text[0]] + text[1:], puncs[1:])
if current.position == PuncPosition.END:
return [text[0] + current.punc] + cls._restore(text[1:], puncs[1:], num + 1)
if current.position == PuncPosition.ALONE:
return [current.mark] + cls._restore(text, puncs[1:], num + 1)
return [text[0] + current.punc] + cls._restore(text[1:], puncs[1:])
# POSITION == MIDDLE
if len(text) == 1: # pragma: nocover
# a corner case where the final part of an intermediate
# mark (I) has not been phonemized
return cls._restore([text[0] + current.punc], puncs[1:], num)
return cls._restore([text[0] + current.punc], puncs[1:])
return cls._restore([text[0] + current.punc + text[1]] + text[2:], puncs[1:], num)
return cls._restore([text[0] + current.punc + text[1]] + text[2:], puncs[1:])
# if __name__ == "__main__":

View File

@ -201,7 +201,6 @@ def stft(
def istft(
*,
y: np.ndarray = None,
fft_size: int = None,
hop_length: int = None,
win_length: int = None,
window: str = "hann",

View File

@ -5,10 +5,26 @@ import librosa
import numpy as np
import scipy.io.wavfile
import scipy.signal
import soundfile as sf
from TTS.tts.utils.helpers import StandardScaler
from TTS.utils.audio.numpy_transforms import compute_f0
from TTS.utils.audio.numpy_transforms import (
amp_to_db,
build_mel_basis,
compute_f0,
db_to_amp,
deemphasis,
find_endpoint,
griffin_lim,
load_wav,
mel_to_spec,
millisec_to_length,
preemphasis,
rms_volume_norm,
spec_to_mel,
stft,
trim_silence,
volume_norm,
)
# pylint: disable=too-many-public-methods
@ -200,7 +216,9 @@ class AudioProcessor(object):
# setup stft parameters
if hop_length is None:
# compute stft parameters from given time values
self.hop_length, self.win_length = self._stft_parameters()
self.win_length, self.hop_length = millisec_to_length(
frame_length_ms=self.frame_length_ms, frame_shift_ms=self.frame_shift_ms, sample_rate=self.sample_rate
)
else:
# use stft parameters from config file
self.hop_length = hop_length
@ -215,8 +233,13 @@ class AudioProcessor(object):
for key, value in members.items():
print(" | > {}:{}".format(key, value))
# create spectrogram utils
self.mel_basis = self._build_mel_basis()
self.inv_mel_basis = np.linalg.pinv(self._build_mel_basis())
self.mel_basis = build_mel_basis(
sample_rate=self.sample_rate,
fft_size=self.fft_size,
num_mels=self.num_mels,
mel_fmax=self.mel_fmax,
mel_fmin=self.mel_fmin,
)
# setup scaler
if stats_path and signal_norm:
mel_mean, mel_std, linear_mean, linear_std, _ = self.load_stats(stats_path)
@ -232,35 +255,6 @@ class AudioProcessor(object):
return AudioProcessor(verbose=verbose, **config.audio)
return AudioProcessor(verbose=verbose, **config)
### setting up the parameters ###
def _build_mel_basis(
self,
) -> np.ndarray:
"""Build melspectrogram basis.
Returns:
np.ndarray: melspectrogram basis.
"""
if self.mel_fmax is not None:
assert self.mel_fmax <= self.sample_rate // 2
return librosa.filters.mel(
sr=self.sample_rate, n_fft=self.fft_size, n_mels=self.num_mels, fmin=self.mel_fmin, fmax=self.mel_fmax
)
def _stft_parameters(
self,
) -> Tuple[int, int]:
"""Compute the real STFT parameters from the time values.
Returns:
Tuple[int, int]: hop length and window length for STFT.
"""
factor = self.frame_length_ms / self.frame_shift_ms
assert (factor).is_integer(), " [!] frame_shift_ms should divide frame_length_ms"
hop_length = int(self.frame_shift_ms / 1000.0 * self.sample_rate)
win_length = int(hop_length * factor)
return hop_length, win_length
### normalization ###
def normalize(self, S: np.ndarray) -> np.ndarray:
"""Normalize values into `[0, self.max_norm]` or `[-self.max_norm, self.max_norm]`
@ -386,31 +380,6 @@ class AudioProcessor(object):
self.linear_scaler = StandardScaler()
self.linear_scaler.set_stats(linear_mean, linear_std)
### DB and AMP conversion ###
# pylint: disable=no-self-use
def _amp_to_db(self, x: np.ndarray) -> np.ndarray:
"""Convert amplitude values to decibels.
Args:
x (np.ndarray): Amplitude spectrogram.
Returns:
np.ndarray: Decibels spectrogram.
"""
return self.spec_gain * _log(np.maximum(1e-5, x), self.base)
# pylint: disable=no-self-use
def _db_to_amp(self, x: np.ndarray) -> np.ndarray:
"""Convert decibels spectrogram to amplitude spectrogram.
Args:
x (np.ndarray): Decibels spectrogram.
Returns:
np.ndarray: Amplitude spectrogram.
"""
return _exp(x / self.spec_gain, self.base)
### Preemphasis ###
def apply_preemphasis(self, x: np.ndarray) -> np.ndarray:
"""Apply pre-emphasis to the audio signal. Useful to reduce the correlation between neighbouring signal values.
@ -424,32 +393,13 @@ class AudioProcessor(object):
Returns:
np.ndarray: Decorrelated audio signal.
"""
if self.preemphasis == 0:
raise RuntimeError(" [!] Preemphasis is set 0.0.")
return scipy.signal.lfilter([1, -self.preemphasis], [1], x)
return preemphasis(x=x, coef=self.preemphasis)
def apply_inv_preemphasis(self, x: np.ndarray) -> np.ndarray:
"""Reverse pre-emphasis."""
if self.preemphasis == 0:
raise RuntimeError(" [!] Preemphasis is set 0.0.")
return scipy.signal.lfilter([1], [1, -self.preemphasis], x)
return deemphasis(x=x, coef=self.preemphasis)
### SPECTROGRAMs ###
def _linear_to_mel(self, spectrogram: np.ndarray) -> np.ndarray:
"""Project a full scale spectrogram to a melspectrogram.
Args:
spectrogram (np.ndarray): Full scale spectrogram.
Returns:
np.ndarray: Melspectrogram
"""
return np.dot(self.mel_basis, spectrogram)
def _mel_to_linear(self, mel_spec: np.ndarray) -> np.ndarray:
"""Convert a melspectrogram to full scale spectrogram."""
return np.maximum(1e-10, np.dot(self.inv_mel_basis, mel_spec))
def spectrogram(self, y: np.ndarray) -> np.ndarray:
"""Compute a spectrogram from a waveform.
@ -460,11 +410,16 @@ class AudioProcessor(object):
np.ndarray: Spectrogram.
"""
if self.preemphasis != 0:
D = self._stft(self.apply_preemphasis(y))
else:
D = self._stft(y)
y = self.apply_preemphasis(y)
D = stft(
y=y,
fft_size=self.fft_size,
hop_length=self.hop_length,
win_length=self.win_length,
pad_mode=self.stft_pad_mode,
)
if self.do_amp_to_db_linear:
S = self._amp_to_db(np.abs(D))
S = amp_to_db(x=np.abs(D), gain=self.spec_gain, base=self.base)
else:
S = np.abs(D)
return self.normalize(S).astype(np.float32)
@ -472,32 +427,35 @@ class AudioProcessor(object):
def melspectrogram(self, y: np.ndarray) -> np.ndarray:
"""Compute a melspectrogram from a waveform."""
if self.preemphasis != 0:
D = self._stft(self.apply_preemphasis(y))
else:
D = self._stft(y)
y = self.apply_preemphasis(y)
D = stft(
y=y,
fft_size=self.fft_size,
hop_length=self.hop_length,
win_length=self.win_length,
pad_mode=self.stft_pad_mode,
)
S = spec_to_mel(spec=np.abs(D), mel_basis=self.mel_basis)
if self.do_amp_to_db_mel:
S = self._amp_to_db(self._linear_to_mel(np.abs(D)))
else:
S = self._linear_to_mel(np.abs(D))
S = amp_to_db(x=S, gain=self.spec_gain, base=self.base)
return self.normalize(S).astype(np.float32)
def inv_spectrogram(self, spectrogram: np.ndarray) -> np.ndarray:
"""Convert a spectrogram to a waveform using Griffi-Lim vocoder."""
S = self.denormalize(spectrogram)
S = self._db_to_amp(S)
S = db_to_amp(x=S, gain=self.spec_gain, base=self.base)
# Reconstruct phase
if self.preemphasis != 0:
return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
return self._griffin_lim(S**self.power)
W = self._griffin_lim(S**self.power)
return self.apply_inv_preemphasis(W) if self.preemphasis != 0 else W
def inv_melspectrogram(self, mel_spectrogram: np.ndarray) -> np.ndarray:
"""Convert a melspectrogram to a waveform using Griffi-Lim vocoder."""
D = self.denormalize(mel_spectrogram)
S = self._db_to_amp(D)
S = self._mel_to_linear(S) # Convert back to linear
if self.preemphasis != 0:
return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
return self._griffin_lim(S**self.power)
S = db_to_amp(x=D, gain=self.spec_gain, base=self.base)
S = mel_to_spec(mel=S, mel_basis=self.mel_basis) # Convert back to linear
W = self._griffin_lim(S**self.power)
return self.apply_inv_preemphasis(W) if self.preemphasis != 0 else W
def out_linear_to_mel(self, linear_spec: np.ndarray) -> np.ndarray:
"""Convert a full scale linear spectrogram output of a network to a melspectrogram.
@ -509,60 +467,22 @@ class AudioProcessor(object):
np.ndarray: Normalized melspectrogram.
"""
S = self.denormalize(linear_spec)
S = self._db_to_amp(S)
S = self._linear_to_mel(np.abs(S))
S = self._amp_to_db(S)
S = db_to_amp(x=S, gain=self.spec_gain, base=self.base)
S = spec_to_mel(spec=np.abs(S), mel_basis=self.mel_basis)
S = amp_to_db(x=S, gain=self.spec_gain, base=self.base)
mel = self.normalize(S)
return mel
### STFT and ISTFT ###
def _stft(self, y: np.ndarray) -> np.ndarray:
"""Librosa STFT wrapper.
Args:
y (np.ndarray): Audio signal.
Returns:
np.ndarray: Complex number array.
"""
return librosa.stft(
y=y,
n_fft=self.fft_size,
def _griffin_lim(self, S):
return griffin_lim(
spec=S,
num_iter=self.griffin_lim_iters,
hop_length=self.hop_length,
win_length=self.win_length,
fft_size=self.fft_size,
pad_mode=self.stft_pad_mode,
window="hann",
center=True,
)
def _istft(self, y: np.ndarray) -> np.ndarray:
"""Librosa iSTFT wrapper."""
return librosa.istft(y, hop_length=self.hop_length, win_length=self.win_length)
def _griffin_lim(self, S):
angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
try:
S_complex = np.abs(S).astype(np.complex)
except AttributeError: # np.complex is deprecated since numpy 1.20.0
S_complex = np.abs(S).astype(complex)
y = self._istft(S_complex * angles)
if not np.isfinite(y).all():
print(" [!] Waveform is not finite everywhere. Skipping the GL.")
return np.array([0.0])
for _ in range(self.griffin_lim_iters):
angles = np.exp(1j * np.angle(self._stft(y)))
y = self._istft(S_complex * angles)
return y
def compute_stft_paddings(self, x, pad_sides=1):
"""Compute paddings used by Librosa's STFT. Compute right padding (final frame) or both sides padding
(first and final frames)"""
assert pad_sides in (1, 2)
pad = (x.shape[0] // self.hop_length + 1) * self.hop_length - x.shape[0]
if pad_sides == 1:
return 0, pad
return pad // 2, pad // 2 + pad % 2
def compute_f0(self, x: np.ndarray) -> np.ndarray:
"""Compute pitch (f0) of a waveform using the same parameters used for computing melspectrogram.
@ -581,8 +501,6 @@ class AudioProcessor(object):
>>> wav = ap.load_wav(WAV_FILE, sr=ap.sample_rate)[:5 * ap.sample_rate]
>>> pitch = ap.compute_f0(wav)
"""
assert self.pitch_fmax is not None, " [!] Set `pitch_fmax` before calling `compute_f0`."
assert self.pitch_fmin is not None, " [!] Set `pitch_fmin` before calling `compute_f0`."
# align F0 length to the spectrogram length
if len(x) % self.hop_length == 0:
x = np.pad(x, (0, self.hop_length // 2), mode=self.stft_pad_mode)
@ -612,21 +530,24 @@ class AudioProcessor(object):
Returns:
int: Last point without silence.
"""
window_length = int(self.sample_rate * min_silence_sec)
hop_length = int(window_length / 4)
threshold = self._db_to_amp(-self.trim_db)
for x in range(hop_length, len(wav) - window_length, hop_length):
if np.max(wav[x : x + window_length]) < threshold:
return x + hop_length
return len(wav)
return find_endpoint(
wav=wav,
trim_db=self.trim_db,
sample_rate=self.sample_rate,
min_silence_sec=min_silence_sec,
gain=self.spec_gain,
base=self.base,
)
def trim_silence(self, wav):
"""Trim silent parts with a threshold and 0.01 sec margin"""
margin = int(self.sample_rate * 0.01)
wav = wav[margin:-margin]
return librosa.effects.trim(wav, top_db=self.trim_db, frame_length=self.win_length, hop_length=self.hop_length)[
0
]
return trim_silence(
wav=wav,
sample_rate=self.sample_rate,
trim_db=self.trim_db,
win_length=self.win_length,
hop_length=self.hop_length,
)
@staticmethod
def sound_norm(x: np.ndarray) -> np.ndarray:
@ -638,13 +559,7 @@ class AudioProcessor(object):
Returns:
np.ndarray: Volume normalized waveform.
"""
return x / abs(x).max() * 0.95
@staticmethod
def _rms_norm(wav, db_level=-27):
r = 10 ** (db_level / 20)
a = np.sqrt((len(wav) * (r**2)) / np.sum(wav**2))
return wav * a
return volume_norm(x=x)
def rms_volume_norm(self, x: np.ndarray, db_level: float = None) -> np.ndarray:
"""Normalize the volume based on RMS of the signal.
@ -657,9 +572,7 @@ class AudioProcessor(object):
"""
if db_level is None:
db_level = self.db_level
assert -99 <= db_level <= 0, " [!] db_level should be between -99 and 0"
wav = self._rms_norm(x, db_level)
return wav
return rms_volume_norm(x=x, db_level=db_level)
### save and load ###
def load_wav(self, filename: str, sr: int = None) -> np.ndarray:
@ -674,15 +587,10 @@ class AudioProcessor(object):
Returns:
np.ndarray: Loaded waveform.
"""
if self.resample:
# loading with resampling. It is significantly slower.
x, sr = librosa.load(filename, sr=self.sample_rate)
elif sr is None:
# SF is faster than librosa for loading files
x, sr = sf.read(filename)
assert self.sample_rate == sr, "%s vs %s" % (self.sample_rate, sr)
if sr is not None:
x = load_wav(filename=filename, sample_rate=sr, resample=True)
else:
x, sr = librosa.load(filename, sr=sr)
x = load_wav(filename=filename, sample_rate=self.sample_rate, resample=self.resample)
if self.do_trim_silence:
try:
x = self.trim_silence(x)
@ -723,55 +631,3 @@ class AudioProcessor(object):
filename (str): Path to the wav file.
"""
return librosa.get_duration(filename=filename)
@staticmethod
def mulaw_encode(wav: np.ndarray, qc: int) -> np.ndarray:
mu = 2**qc - 1
# wav_abs = np.minimum(np.abs(wav), 1.0)
signal = np.sign(wav) * np.log(1 + mu * np.abs(wav)) / np.log(1.0 + mu)
# Quantize signal to the specified number of levels.
signal = (signal + 1) / 2 * mu + 0.5
return np.floor(
signal,
)
@staticmethod
def mulaw_decode(wav, qc):
"""Recovers waveform from quantized values."""
mu = 2**qc - 1
x = np.sign(wav) / mu * ((1 + mu) ** np.abs(wav) - 1)
return x
@staticmethod
def encode_16bits(x):
return np.clip(x * 2**15, -(2**15), 2**15 - 1).astype(np.int16)
@staticmethod
def quantize(x: np.ndarray, bits: int) -> np.ndarray:
"""Quantize a waveform to a given number of bits.
Args:
x (np.ndarray): Waveform to quantize. Must be normalized into the range `[-1, 1]`.
bits (int): Number of quantization bits.
Returns:
np.ndarray: Quantized waveform.
"""
return (x + 1.0) * (2**bits - 1) / 2
@staticmethod
def dequantize(x, bits):
"""Dequantize a waveform from the given number of bits."""
return 2 * x / (2**bits - 1) - 1
def _log(x, base):
if base == 10:
return np.log10(x)
return np.log(x)
def _exp(x, base):
if base == 10:
return np.power(10, x)
return np.exp(x)

View File

@ -36,9 +36,7 @@ def get_git_branch():
current.replace("* ", "")
except subprocess.CalledProcessError:
current = "inside_docker"
except FileNotFoundError:
current = "unknown"
except StopIteration:
except (FileNotFoundError, StopIteration) as e:
current = "unknown"
return current

View File

@ -1,13 +1,9 @@
import datetime
import json
import os
import pickle as pickle_tts
import shutil
from typing import Any, Callable, Dict, Union
import fsspec
import torch
from coqpit import Coqpit
from TTS.utils.generic_utils import get_user_data_dir
@ -28,34 +24,6 @@ class AttrDict(dict):
self.__dict__ = self
def copy_model_files(config: Coqpit, out_path, new_fields=None):
"""Copy config.json and other model files to training folder and add
new fields.
Args:
config (Coqpit): Coqpit config defining the training run.
out_path (str): output path to copy the file.
new_fields (dict): new fields to be added or edited
in the config file.
"""
copy_config_path = os.path.join(out_path, "config.json")
# add extra information fields
if new_fields:
config.update(new_fields, allow_new=True)
# TODO: Revert to config.save_json() once Coqpit supports arbitrary paths.
with fsspec.open(copy_config_path, "w", encoding="utf8") as f:
json.dump(config.to_dict(), f, indent=4)
# copy model stats file if available
if config.audio.stats_path is not None:
copy_stats_path = os.path.join(out_path, "scale_stats.npy")
filesystem = fsspec.get_mapper(copy_stats_path).fs
if not filesystem.exists(copy_stats_path):
with fsspec.open(config.audio.stats_path, "rb") as source_file:
with fsspec.open(copy_stats_path, "wb") as target_file:
shutil.copyfileobj(source_file, target_file)
def load_fsspec(
path: str,
map_location: Union[str, Callable, torch.device, Dict[Union[str, torch.device], Union[str, torch.device]]] = None,
@ -100,117 +68,3 @@ def load_checkpoint(
if eval:
model.eval()
return model, state
def save_fsspec(state: Any, path: str, **kwargs):
"""Like torch.save but can save to other locations (e.g. s3:// , gs://).
Args:
state: State object to save
path: Any path or url supported by fsspec.
**kwargs: Keyword arguments forwarded to torch.save.
"""
with fsspec.open(path, "wb") as f:
torch.save(state, f, **kwargs)
def save_model(config, model, optimizer, scaler, current_step, epoch, output_path, **kwargs):
if hasattr(model, "module"):
model_state = model.module.state_dict()
else:
model_state = model.state_dict()
if isinstance(optimizer, list):
optimizer_state = [optim.state_dict() for optim in optimizer]
elif optimizer.__class__.__name__ == "CapacitronOptimizer":
optimizer_state = [optimizer.primary_optimizer.state_dict(), optimizer.secondary_optimizer.state_dict()]
else:
optimizer_state = optimizer.state_dict() if optimizer is not None else None
if isinstance(scaler, list):
scaler_state = [s.state_dict() for s in scaler]
else:
scaler_state = scaler.state_dict() if scaler is not None else None
if isinstance(config, Coqpit):
config = config.to_dict()
state = {
"config": config,
"model": model_state,
"optimizer": optimizer_state,
"scaler": scaler_state,
"step": current_step,
"epoch": epoch,
"date": datetime.date.today().strftime("%B %d, %Y"),
}
state.update(kwargs)
save_fsspec(state, output_path)
def save_checkpoint(
config,
model,
optimizer,
scaler,
current_step,
epoch,
output_folder,
**kwargs,
):
file_name = "checkpoint_{}.pth".format(current_step)
checkpoint_path = os.path.join(output_folder, file_name)
print("\n > CHECKPOINT : {}".format(checkpoint_path))
save_model(
config,
model,
optimizer,
scaler,
current_step,
epoch,
checkpoint_path,
**kwargs,
)
def save_best_model(
current_loss,
best_loss,
config,
model,
optimizer,
scaler,
current_step,
epoch,
out_path,
keep_all_best=False,
keep_after=10000,
**kwargs,
):
if current_loss < best_loss:
best_model_name = f"best_model_{current_step}.pth"
checkpoint_path = os.path.join(out_path, best_model_name)
print(" > BEST MODEL : {}".format(checkpoint_path))
save_model(
config,
model,
optimizer,
scaler,
current_step,
epoch,
checkpoint_path,
model_loss=current_loss,
**kwargs,
)
fs = fsspec.get_mapper(out_path).fs
# only delete previous if current is saved successfully
if not keep_all_best or (current_step < keep_after):
model_names = fs.glob(os.path.join(out_path, "best_model*.pth"))
for model_name in model_names:
if os.path.basename(model_name) != best_model_name:
fs.rm(model_name)
# create a shortcut which always points to the currently best model
shortcut_name = "best_model.pth"
shortcut_path = os.path.join(out_path, shortcut_name)
fs.copy(checkpoint_path, shortcut_path)
best_loss = current_loss
return best_loss

View File

@ -1,5 +1,6 @@
import json
import os
import re
import tarfile
import zipfile
from pathlib import Path
@ -10,7 +11,7 @@ import fsspec
import requests
from tqdm import tqdm
from TTS.config import load_config
from TTS.config import load_config, read_json_with_comments
from TTS.utils.generic_utils import get_user_data_dir
LICENSE_URLS = {
@ -27,6 +28,7 @@ LICENSE_URLS = {
class ModelManager(object):
tqdm_progress = None
"""Manage TTS models defined in .models.json.
It provides an interface to list and download
models defined in '.models.json'
@ -63,30 +65,7 @@ class ModelManager(object):
Args:
file_path (str): path to .models.json.
"""
with open(file_path, "r", encoding="utf-8") as json_file:
self.models_dict = json.load(json_file)
def add_cs_api_models(self, model_list: List[str]):
"""Add list of Coqui Studio model names that are returned from the api
Each has the following format `<coqui_studio_model>/en/<speaker_name>/<coqui_studio_model>`
"""
def _add_model(model_name: str):
if not "coqui_studio" in model_name:
return
model_type, lang, dataset, model = model_name.split("/")
if model_type not in self.models_dict:
self.models_dict[model_type] = {}
if lang not in self.models_dict[model_type]:
self.models_dict[model_type][lang] = {}
if dataset not in self.models_dict[model_type][lang]:
self.models_dict[model_type][lang][dataset] = {}
if model not in self.models_dict[model_type][lang][dataset]:
self.models_dict[model_type][lang][dataset][model] = {}
for model_name in model_list:
_add_model(model_name)
self.models_dict = read_json_with_comments(file_path)
def _list_models(self, model_type, model_count=0):
if self.verbose:
@ -274,13 +253,15 @@ class ModelManager(object):
model_item["model_url"] = model_item["hf_url"]
elif "fairseq" in model_item["model_name"]:
model_item["model_url"] = "https://coqui.gateway.scarf.sh/fairseq/"
elif "xtts" in model_item["model_name"]:
model_item["model_url"] = "https://coqui.gateway.scarf.sh/xtts/"
return model_item
def _set_model_item(self, model_name):
# fetch model info from the dict
model_type, lang, dataset, model = model_name.split("/")
model_full_name = f"{model_type}--{lang}--{dataset}--{model}"
if "fairseq" in model_name:
model_type = "tts_models"
lang = model_name.split("/")[1]
model_item = {
"model_type": "tts_models",
"license": "CC BY-NC 4.0",
@ -289,10 +270,38 @@ class ModelManager(object):
"description": "this model is released by Meta under Fairseq repo. Visit https://github.com/facebookresearch/fairseq/tree/main/examples/mms for more info.",
}
model_item["model_name"] = model_name
elif "xtts" in model_name and len(model_name.split("/")) != 4:
# loading xtts models with only model name (e.g. xtts_v2.0.2)
# check model name has the version number with regex
version_regex = r"v\d+\.\d+\.\d+"
if re.search(version_regex, model_name):
model_version = model_name.split("_")[-1]
else:
model_version = "main"
model_type = "tts_models"
lang = "multilingual"
dataset = "multi-dataset"
model = model_name
model_item = {
"default_vocoder": None,
"license": "CPML",
"contact": "info@coqui.ai",
"tos_required": True,
"hf_url": [
f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/model.pth",
f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/config.json",
f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/vocab.json",
f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/hash.md5",
f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{model_version}/speakers_xtts.pth",
],
}
else:
# get model from models.json
model_type, lang, dataset, model = model_name.split("/")
model_item = self.models_dict[model_type][lang][dataset][model]
model_item["model_type"] = model_type
model_full_name = f"{model_type}--{lang}--{dataset}--{model}"
md5hash = model_item["model_hash"] if "model_hash" in model_item else None
model_item = self.set_model_url(model_item)
return model_item, model_full_name, model, md5hash
@ -301,9 +310,9 @@ class ModelManager(object):
def ask_tos(model_full_path):
"""Ask the user to agree to the terms of service"""
tos_path = os.path.join(model_full_path, "tos_agreed.txt")
print(" > You must agree to the terms of service to use this model.")
print(" | > Please see the terms of service at https://coqui.ai/cpml.txt")
print(' | > "I have read, understood and agreed to the Terms and Conditions." - [y/n]')
print(" > You must confirm the following:")
print(' | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"')
print(' | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]')
answer = input(" | | > ")
if answer.lower() == "y":
with open(tos_path, "w", encoding="utf-8") as f:
@ -525,12 +534,12 @@ class ModelManager(object):
total_size_in_bytes = int(r.headers.get("content-length", 0))
block_size = 1024 # 1 Kibibyte
if progress_bar:
progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
ModelManager.tqdm_progress = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
temp_zip_name = os.path.join(output_folder, file_url.split("/")[-1])
with open(temp_zip_name, "wb") as file:
for data in r.iter_content(block_size):
if progress_bar:
progress_bar.update(len(data))
ModelManager.tqdm_progress.update(len(data))
file.write(data)
with zipfile.ZipFile(temp_zip_name) as z:
z.extractall(output_folder)
@ -560,12 +569,12 @@ class ModelManager(object):
total_size_in_bytes = int(r.headers.get("content-length", 0))
block_size = 1024 # 1 Kibibyte
if progress_bar:
progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
ModelManager.tqdm_progress = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
temp_tar_name = os.path.join(output_folder, file_url.split("/")[-1])
with open(temp_tar_name, "wb") as file:
for data in r.iter_content(block_size):
if progress_bar:
progress_bar.update(len(data))
ModelManager.tqdm_progress.update(len(data))
file.write(data)
with tarfile.open(temp_tar_name) as t:
t.extractall(output_folder)
@ -596,10 +605,10 @@ class ModelManager(object):
block_size = 1024 # 1 Kibibyte
with open(temp_zip_name, "wb") as file:
if progress_bar:
progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
ModelManager.tqdm_progress = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
for data in r.iter_content(block_size):
if progress_bar:
progress_bar.update(len(data))
ModelManager.tqdm_progress.update(len(data))
file.write(data)
@staticmethod

View File

@ -264,6 +264,7 @@ class Synthesizer(nn.Module):
style_text=None,
reference_wav=None,
reference_speaker_name=None,
split_sentences: bool = True,
**kwargs,
) -> List[int]:
"""🐸 TTS magic. Run all the models and generate speech.
@ -277,6 +278,8 @@ class Synthesizer(nn.Module):
style_text ([type], optional): transcription of style_wav for Capacitron. Defaults to None.
reference_wav ([type], optional): reference waveform for voice conversion. Defaults to None.
reference_speaker_name ([type], optional): speaker id of reference waveform. Defaults to None.
split_sentences (bool, optional): split the input text into sentences. Defaults to True.
**kwargs: additional arguments to pass to the TTS model.
Returns:
List[int]: [description]
"""
@ -289,8 +292,10 @@ class Synthesizer(nn.Module):
)
if text:
sens = self.split_into_sentences(text)
print(" > Text splitted to sentences.")
sens = [text]
if split_sentences:
print(" > Text splitted to sentences.")
sens = self.split_into_sentences(text)
print(sens)
# handle multi-speaker
@ -300,7 +305,7 @@ class Synthesizer(nn.Module):
speaker_embedding = None
speaker_id = None
if self.tts_speakers_file or hasattr(self.tts_model.speaker_manager, "name_to_id"):
if speaker_name and isinstance(speaker_name, str):
if speaker_name and isinstance(speaker_name, str) and not self.tts_config.model == "xtts":
if self.tts_config.use_d_vector_file:
# get the average speaker embedding from the saved d_vectors.
speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
@ -330,7 +335,9 @@ class Synthesizer(nn.Module):
# handle multi-lingual
language_id = None
if self.tts_languages_file or (
hasattr(self.tts_model, "language_manager") and self.tts_model.language_manager is not None
hasattr(self.tts_model, "language_manager")
and self.tts_model.language_manager is not None
and not self.tts_config.model == "xtts"
):
if len(self.tts_model.language_manager.name_to_id) == 1:
language_id = list(self.tts_model.language_manager.name_to_id.values())[0]
@ -358,7 +365,12 @@ class Synthesizer(nn.Module):
)
# compute a new d_vector from the given clip.
if speaker_wav is not None and self.tts_model.speaker_manager is not None:
if (
speaker_wav is not None
and self.tts_model.speaker_manager is not None
and hasattr(self.tts_model.speaker_manager, "encoder_ap")
and self.tts_model.speaker_manager.encoder_ap is not None
):
speaker_embedding = self.tts_model.speaker_manager.compute_embedding_from_clip(speaker_wav)
vocoder_device = "cpu"

View File

@ -94,6 +94,7 @@ class ParallelWaveganConfig(BaseGANVocoderConfig):
use_noise_augment: bool = False
use_cache: bool = True
steps_to_start_discriminator: int = 200000
target_loss: str = "loss_1"
# LOSS PARAMETERS - overrides
use_stft_loss: bool = True

View File

@ -7,6 +7,7 @@ from coqpit import Coqpit
from tqdm import tqdm
from TTS.utils.audio import AudioProcessor
from TTS.utils.audio.numpy_transforms import mulaw_encode, quantize
def preprocess_wav_files(out_path: str, config: Coqpit, ap: AudioProcessor):
@ -29,7 +30,11 @@ def preprocess_wav_files(out_path: str, config: Coqpit, ap: AudioProcessor):
mel = ap.melspectrogram(y)
np.save(mel_path, mel)
if isinstance(config.mode, int):
quant = ap.mulaw_encode(y, qc=config.mode) if config.model_args.mulaw else ap.quantize(y, bits=config.mode)
quant = (
mulaw_encode(wav=y, mulaw_qc=config.mode)
if config.model_args.mulaw
else quantize(x=y, quantize_bits=config.mode)
)
np.save(quant_path, quant)

View File

@ -2,6 +2,8 @@ import numpy as np
import torch
from torch.utils.data import Dataset
from TTS.utils.audio.numpy_transforms import mulaw_encode, quantize
class WaveRNNDataset(Dataset):
"""
@ -66,7 +68,9 @@ class WaveRNNDataset(Dataset):
x_input = audio
elif isinstance(self.mode, int):
x_input = (
self.ap.mulaw_encode(audio, qc=self.mode) if self.mulaw else self.ap.quantize(audio, bits=self.mode)
mulaw_encode(wav=audio, mulaw_qc=self.mode)
if self.mulaw
else quantize(x=audio, quantize_bits=self.mode)
)
else:
raise RuntimeError("Unknown dataset mode - ", self.mode)

View File

@ -13,6 +13,7 @@ from torch.utils.data.distributed import DistributedSampler
from TTS.tts.utils.visual import plot_spectrogram
from TTS.utils.audio import AudioProcessor
from TTS.utils.audio.numpy_transforms import mulaw_decode
from TTS.utils.io import load_fsspec
from TTS.vocoder.datasets.wavernn_dataset import WaveRNNDataset
from TTS.vocoder.layers.losses import WaveRNNLoss
@ -399,7 +400,7 @@ class Wavernn(BaseVocoder):
output = output[0]
if self.args.mulaw and isinstance(self.args.mode, int):
output = AudioProcessor.mulaw_decode(output, self.args.mode)
output = mulaw_decode(wav=output, mulaw_qc=self.args.mode)
# Fade-out at the end to avoid signal cutting out suddenly
fade_out = np.linspace(1, 0, 20 * self.config.audio.hop_length)

View File

@ -0,0 +1,44 @@
ARG BASE=nvidia/cuda:11.8.0-base-ubuntu22.04
FROM ${BASE}
# Install OS dependencies:
RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y --no-install-recommends \
gcc g++ \
make \
python3 python3-dev python3-pip python3-venv python3-wheel \
espeak-ng libsndfile1-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Major Python Dependencies:
RUN pip3 install llvmlite --ignore-installed
RUN pip3 install torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
RUN rm -rf /root/.cache/pip
WORKDIR /root
# Copy Dependency Lock Files:
COPY \
Makefile \
pyproject.toml \
setup.py \
requirements.dev.txt \
requirements.ja.txt \
requirements.notebooks.txt \
requirements.txt \
/root/
# Install Project Dependencies
# Separate stage to limit re-downloading:
RUN pip install \
-r requirements.txt \
-r requirements.dev.txt \
-r requirements.ja.txt \
-r requirements.notebooks.txt
# Copy TTS repository contents:
COPY . /root
# Installing the TTS package itself:
RUN make install

View File

@ -56,4 +56,4 @@ ModelConfig()
In the example above, ```ModelConfig()``` is the final configuration that the model receives and it has all the fields necessary for the model.
We host pre-defined model configurations under ```TTS/<model_class>/configs/```.Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.
We host pre-defined model configurations under ```TTS/<model_class>/configs/```. Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.

View File

@ -21,7 +21,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
speech dataset and achieve reasonable results with only a couple of hours of data.
However, note that, fine-tuning does not ensure great results. The model performance is still depends on the
However, note that fine-tuning does not ensure great results. The model performance still depends on the
{ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore,
it still takes a bit of tinkering.
@ -41,7 +41,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
tts --list_models
```
The command above lists the the models in a naming format as ```<model_type>/<language>/<dataset>/<model_name>```.
The command above lists the models in a naming format as ```<model_type>/<language>/<dataset>/<model_name>```.
Or you can manually check the `.model.json` file in the project directory.

View File

@ -7,7 +7,7 @@ If you have a single audio file and you need to split it into clips, there are d
It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using `wav` file format.
Let's assume you created the audio clips and their transcription. You can collect all your clips under a folder. Let's call this folder `wavs`.
Let's assume you created the audio clips and their transcription. You can collect all your clips in a folder. Let's call this folder `wavs`.
```
/wavs
@ -17,7 +17,7 @@ Let's assume you created the audio clips and their transcription. You can collec
...
```
You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimitered by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.
You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimited by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.
We recommend the following format delimited by `|`. In the following example, `audio1`, `audio2` refer to files `audio1.wav`, `audio2.wav` etc.
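For illustration, a metadata file in this format could look like the snippet below (file names and sentences are placeholders; the third column holds the normalized transcription):
```
audio1|This is my sentence.|This is my sentence.
audio2|It costs $5.|It costs five dollars.
audio3|1469 and 1470.|fourteen sixty-nine and fourteen seventy.
```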
@ -55,7 +55,7 @@ For more info about dataset qualities and properties check our [post](https://gi
After you collect and format your dataset, you need to check two things: whether you need a `formatter` and whether you need a `text_cleaner`. The `formatter` loads the text file (created above) as a list, and the `text_cleaner` performs a sequence of text normalization operations that convert the raw text into the spoken representation (e.g. converting numbers to text, acronyms, and symbols to the spoken format).
If you use a different dataset format then the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.
If you use a different dataset format than the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.
If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`.
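If you do end up needing your own `formatter`, a minimal sketch could look like the following (the function name and the single-speaker assumption are illustrative, not part of 🐸TTS; like the built-in formatters, it returns a list of dicts with `text`, `audio_file` and `speaker_name` keys):
```python
import os


def my_formatter(root_path, meta_file, **kwargs):  # hypothetical name, not a built-in formatter
    """Parse a pipe-delimited metadata file: <file>|<transcription>|<normalized transcription>."""
    items = []
    speaker_name = "my_speaker"  # assuming a single-speaker dataset
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            # prefer the normalized transcription when a third column is present
            text = cols[2] if len(cols) > 2 else cols[1]
            items.append({"text": text, "audio_file": wav_file, "speaker_name": speaker_name, "root_path": root_path})
    return items
```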

View File

@ -2,7 +2,7 @@
- Language frontends are located under `TTS.tts.utils.text`
- Each special language has a separate folder.
- Each folder containst all the utilities for processing the text input.
- Each folder contains all the utilities for processing the text input.
- `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities
from the previous step and used to convert the text to phonemes or graphemes for the model.
- After you implement your phonemizer, you need to add it to the `TTS/tts/utils/text/phonemizers/__init__.py` to be able to

View File

@ -145,7 +145,7 @@ class MyModel(BaseTTS):
Args:
ap (AudioProcessor): audio processor used at training.
batch (Dict): Model inputs used at the previous training step.
outputs (Dict): Model outputs generated at the previoud training step.
outputs (Dict): Model outputs generated at the previous training step.
Returns:
Tuple[Dict, np.ndarray]: training plots and output waveform.
@ -183,7 +183,7 @@ class MyModel(BaseTTS):
...
def get_optimizer(self) -> Union["Optimizer", List["Optimizer"]]:
"""Setup an return optimizer or optimizers."""
"""Setup a return optimizer or optimizers."""
pass
def get_lr(self) -> Union[float, List[float]]:

View File

@ -172,48 +172,6 @@ tts.tts_with_vc_to_file(
)
```
#### Example text to speech using [🐸Coqui Studio](https://coqui.ai) models.
You can use all of your available speakers in the studio.
[🐸Coqui Studio](https://coqui.ai) API token is required. You can get it from the [account page](https://coqui.ai/account).
You should set the `COQUI_STUDIO_TOKEN` environment variable to use the API token.
```python
# If you have a valid API token set you will see the studio speakers as separate models in the list.
# The name format is coqui_studio/en/<studio_speaker_name>/coqui_studio
models = TTS().list_models()
# Init TTS with the target studio speaker
tts = TTS(model_name="coqui_studio/en/Torcull Diarmuid/coqui_studio", progress_bar=False)
# Run TTS
tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH)
# Run TTS with emotion and speed control
tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH, emotion="Happy", speed=1.5)
```
If you just need 🐸 Coqui Studio speakers, you can use `CS_API`. It is a wrapper around the 🐸 Coqui Studio API.
```python
from TTS.api import CS_API
# Init 🐸 Coqui Studio API
# you can either set the API token as an environment variable `COQUI_STUDIO_TOKEN` or pass it as an argument.
# XTTS - Best quality and life-like speech in multiple languages. See https://docs.coqui.ai/reference/samples_xtts_create for supported languages.
api = CS_API(api_token=<token>, model="XTTS")
api.speakers # all the speakers are available with all the models.
api.list_speakers()
api.list_voices()
wav, sample_rate = api.tts(text="This is a test.", speaker=api.speakers[0].name, emotion="Happy", language="en", speed=1.5)
# V1 - Fast and lightweight TTS in EN with emotion control.
api = CS_API(api_token=<token>, model="V1")
api.speakers
api.emotions # emotions are only for the V1 model.
api.list_speakers()
api.list_voices()
wav, sample_rate = api.tts(text="This is a test.", speaker=api.speakers[0].name, emotion="Happy", speed=1.5)
```
#### Example text to speech using **Fairseq models in ~1100 languages** 🤯.
For these models use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
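As a small, hedged illustration of that name format (the ISO code and text below are placeholders; check the MMS language list for the code of your target language):
```python
from TTS.api import TTS

# "tur" (Turkish) is only an example ISO 639-3 code; substitute your target language.
tts = TTS("tts_models/tur/fairseq/vits", progress_bar=False)
tts.tts_to_file(text="Bu bir örnek cümledir.", file_path="output.wav")
```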

View File

@ -2,13 +2,13 @@
## What is Mary-TTS?
[Mary (Modular Architecture for Research in sYynthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKIs](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.
[Mary (Modular Architecture for Research in sYnthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKIs](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.
MaryTTS has been around for a very! long time. Version 3.0 even dates back to 2006, long before Deep Learning was a broadly known term and the last official release was version 5.2 in 2016.
You can check out this OpenVoice-Tech page to learn more: https://openvoice-tech.net/index.php/MaryTTS
## Why Mary-TTS compatibility is relevant
Due to it's open-source nature, relatively high quality voices and fast synthetization speed Mary-TTS was a popular choice in the past and many tools implemented API support over the years like screen-readers (NVDA + SpeechHub), smart-home HUBs (openHAB, Home Assistant) or voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.
Due to its open-source nature, relatively high quality voices and fast synthetization speed Mary-TTS was a popular choice in the past and many tools implemented API support over the years like screen-readers (NVDA + SpeechHub), smart-home HUBs (openHAB, Home Assistant) or voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.
## API and code examples

View File

@ -1,6 +1,6 @@
# 🐢 Tortoise
Tortoise is a very expressive TTS system with impressive voice cloning capabilities. It is based on a GPT-like autoregressive acoustic model that converts input
text to discritized acouistic tokens, a diffusion model that converts these tokens to melspeectrogram frames and a Univnet vocoder to convert the spectrograms to
text to discretized acoustic tokens, a diffusion model that converts these tokens to melspectrogram frames, and a UnivNet vocoder to convert the spectrograms to
the final audio signal. The important downside is that Tortoise is very slow compared to the parallel TTS models like VITS.
Big thanks to 👑[@manmay-nakhashi](https://github.com/manmay-nakhashi) who helped us implement Tortoise in 🐸TTS.

View File

@ -21,7 +21,7 @@ a few tricks to make it faster and support streaming inference.
- Across the board quality improvements.
### Code
Current implementation only supports inference.
Current implementation only supports inference and GPT encoder training.
### Languages
As of now, XTTS-v2 supports 16 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko).
@ -36,35 +36,39 @@ Come and join in our 🐸Community. We're active on [Discord](https://discord.gg
You can also mail us at info@coqui.ai.
### Inference
#### 🐸TTS API
##### Single reference
```python
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
file_path="output.wav",
speaker_wav=["/path/to/target/speaker.wav"],
language="en")
```
##### Multiple references
```python
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
file_path="output.wav",
speaker_wav=["/path/to/target/speaker.wav", "/path/to/target/speaker_2.wav", "/path/to/target/speaker_3.wav"],
language="en")
```
#### 🐸TTS Command line
##### Single reference
You can check all supported languages with the following command:
```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--list_language_idx
```
You can check all available Coqui speakers with the following command:
```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--list_speaker_idx
```
##### Coqui speakers
You can do inference using one of the available speakers using the following command:
```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--text "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent." \
--speaker_idx "Ana Florence" \
--language_idx en \
--use_cuda true
```
##### Clone a voice
You can clone a speaker voice using a single or multiple references:
###### Single reference
```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--text "Bugün okula gitmek istemiyorum." \
@ -73,7 +77,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
--use_cuda true
```
##### Multiple references
###### Multiple references
```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--text "Bugün okula gitmek istemiyorum." \
@ -91,15 +95,102 @@ or for all wav files in a directory you can use:
--use_cuda true
```
#### 🐸TTS API
#### model directly
##### Clone a voice
You can clone a speaker voice using a single or multiple references:
If you want to be able to run with `use_deepspeed=True` and enjoy the speedup, you need to install deepspeed first.
###### Single reference
Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio.
You can optionally disable sentence splitting for better coherence, at the cost of more VRAM and possibly hitting the model's context length limit.
```python
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
file_path="output.wav",
speaker_wav=["/path/to/target/speaker.wav"],
language="en",
split_sentences=True
)
```
###### Multiple references
You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
```python
from TTS.api import TTS
# using the default version set in 🐸TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# using a specific version
# 👀 see the branch names for versions on https://huggingface.co/coqui/XTTS-v2/tree/main
# ❗some versions might be incompatible with the API
tts = TTS("xtts_v2.0.2", gpu=True)
# getting the latest XTTS_v2
tts = TTS("xtts", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
file_path="output.wav",
speaker_wav=["/path/to/target/speaker.wav", "/path/to/target/speaker_2.wav", "/path/to/target/speaker_3.wav"],
language="en")
```
##### Coqui speakers
You can do inference using one of the available speakers using the following code:
```python
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
file_path="output.wav",
speaker="Ana Florence",
language="en",
split_sentences=True
)
```
#### 🐸TTS Model API
To use the model API, you need to download the model files and pass config and model file paths manually.
#### Manual Inference
If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
```console
pip install deepspeed==0.8.3
pip install deepspeed==0.10.3
```
##### Inference parameters
- `text`: The text to be synthesized.
- `language`: The language of the text to be synthesized.
- `gpt_cond_latent`: The latent vector you get with get_conditioning_latents. (You can cache it for faster inference with the same speaker.)
- `speaker_embedding`: The speaker embedding you get with get_conditioning_latents. (You can cache it for faster inference with the same speaker.)
- `temperature`: The softmax temperature of the autoregressive model. Defaults to 0.65.
- `length_penalty`: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs. Defaults to 1.0.
- `repetition_penalty`: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc. Defaults to 2.0.
- `top_k`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50.
- `top_p`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8.
- `speed`: The speed rate of the generated audio. Defaults to 1.0. (can produce artifacts if far from 1.0)
- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might lose important context between sentences. Defaults to True.
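To make these parameters concrete, here is a minimal, non-authoritative sketch of loading the model and calling `inference` manually (the paths and the reference wav are placeholders, only a subset of the parameters above is passed, and the repository's own example follows in the next subsection):
```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=False)
model.cuda()

# Cache these two for faster repeated synthesis with the same speaker.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

out = model.inference(
    "It took me quite a long time to develop a voice.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,
)
torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```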
##### Inference
```python
import os
import torch
@ -129,7 +220,7 @@ torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
#### streaming inference
##### Streaming manually
Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
Streaming inference is typically slower than regular inference, but it lets you get the first chunk of audio faster.
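A rough sketch of the idea, assuming the model, `gpt_cond_latent`, and `speaker_embedding` were prepared as in the manual inference example above (chunk handling is simplified; the repository's full streaming example follows below):
```python
import torch

# inference_stream yields waveform chunks as soon as they are decoded
chunks = model.inference_stream(
    "It took me quite a long time to develop a voice.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    stream_chunk_size=20,  # smaller chunks lower latency at the cost of throughput
)

wav_chunks = []
for i, chunk in enumerate(chunks):
    print(f"Received chunk {i} with {chunk.shape[-1]} samples")
    wav_chunks.append(chunk)
wav = torch.cat(wav_chunks, dim=0)
```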
@ -175,6 +266,50 @@ torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
### Training
#### Easy training
To make `XTTS_v2` GPT encoder training easier for beginner users, we provide a Gradio demo that implements the whole fine-tuning pipeline. The Gradio demo enables the user to easily do the following steps:
- Preprocessing of the uploaded audio or audio files in 🐸 TTS coqui formatter
- Train the XTTS GPT encoder with the processed data
- Inference support using the fine-tuned model
The user can run this gradio demo locally or remotely using a Colab Notebook.
##### Run demo on Colab
To make `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available, we provide a Google Colab Notebook.
The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
To learn how to use this Colab Notebook please check the [XTTS fine-tuning video]().
If you are not able to access the video, follow these steps:
1. Open the Colab notebook and start the demo by running the first two cells (ignore pip install errors in the first one).
2. Click on the link "Running on public URL:" on the second cell output.
3. On the first Tab (1 - Data processing) you need to select the audio file or files, wait for upload, and then click on the button "Step 1 - Create dataset" and then wait until the dataset processing is done.
4. As soon as the dataset processing is done, go to the second Tab (2 - Fine-tuning XTTS Encoder), press the button "Step 2 - Run the training" and wait until the training is finished. Note that it can take up to 40 minutes.
5. As soon as the training is done, go to the third Tab (3 - Inference), click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking on the button "Step 4 - Inference".
##### Run demo locally
To run the demo locally you need to do the following steps:
1. Install 🐸 TTS following the instructions available [here](https://tts.readthedocs.io/en/dev/installation.html#installation).
2. Install the Gradio demo requirements with the command `python3 -m pip install -r TTS/demos/xtts_ft_demo/requirements.txt`
3. Run the Gradio demo using the command `python3 TTS/demos/xtts_ft_demo/xtts_demo.py`
4. Follow the steps presented in the [tutorial video](https://www.youtube.com/watch?v=8tpDiiouGxc&feature=youtu.be) to be able to fine-tune and test the fine-tuned model.
If you are not able to access the video, here is what you need to do:
1. On the first tab (1 - Data processing), select the audio file or files and wait for the upload to finish.
2. Click the "Step 1 - Create dataset" button and wait until the dataset processing is done.
3. Go to the second tab (2 - Fine-tuning XTTS Encoder), press the "Step 2 - Run the training" button, and wait until the training is finished. It will take some time.
4. Go to the third tab (3 - Inference), click the "Step 3 - Load Fine-tuned XTTS model" button, and wait until the fine-tuned model is loaded.
5. Now you can run inference with the model by clicking the "Step 4 - Inference" button.
#### Advanced training
A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py
You need to change the fields of the `BaseDatasetConfig` to match your dataset and then update the `GPTArgs` and `GPTTrainerConfig` fields as needed. By default, the recipe uses the same parameters that the XTTS v1.1 model was trained with. To speed up model convergence, it also downloads the XTTS v1.1 checkpoint and loads it by default.
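As a starting point, here is a minimal sketch of pointing the dataset part of that recipe at your own data, assuming the `BaseDatasetConfig` fields used elsewhere in 🐸 TTS (all values are placeholders; `GPTArgs` and `GPTTrainerConfig` are tuned directly in the recipe):

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# Placeholder values: adapt the formatter, paths and language to your dataset.
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="/path/to/your/dataset/",
    language="en",
)
```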
@ -222,6 +357,7 @@ torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
## References and Acknowledgements
- VALL-E: https://arxiv.org/abs/2301.02111
- Tortoise Repo: https://github.com/neonbjb/tortoise-tts

View File

@ -13,23 +13,28 @@
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import torch\n",
"import importlib\n",
"import numpy as np\n",
"from tqdm import tqdm\n",
"from torch.utils.data import DataLoader\n",
"import soundfile as sf\n",
"import os\n",
"import pickle\n",
"\n",
"import numpy as np\n",
"import soundfile as sf\n",
"import torch\n",
"from matplotlib import pylab as plt\n",
"from torch.utils.data import DataLoader\n",
"from tqdm import tqdm\n",
"\n",
"from TTS.config import load_config\n",
"from TTS.tts.configs.shared_configs import BaseDatasetConfig\n",
"from TTS.tts.datasets import load_tts_samples\n",
"from TTS.tts.datasets.dataset import TTSDataset\n",
"from TTS.tts.layers.losses import L1LossMasked\n",
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.config import load_config\n",
"from TTS.tts.utils.visual import plot_spectrogram\n",
"from TTS.tts.utils.helpers import sequence_mask\n",
"from TTS.tts.models import setup_model\n",
"from TTS.tts.utils.text.symbols import make_symbols, symbols, phonemes\n",
"from TTS.tts.utils.helpers import sequence_mask\n",
"from TTS.tts.utils.text.tokenizer import TTSTokenizer\n",
"from TTS.tts.utils.visual import plot_spectrogram\n",
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.utils.audio.numpy_transforms import quantize\n",
"\n",
"%matplotlib inline\n",
"\n",
@ -49,11 +54,9 @@
" file_name = wav_file.split('.')[0]\n",
" os.makedirs(os.path.join(out_path, \"quant\"), exist_ok=True)\n",
" os.makedirs(os.path.join(out_path, \"mel\"), exist_ok=True)\n",
" os.makedirs(os.path.join(out_path, \"wav_gl\"), exist_ok=True)\n",
" wavq_path = os.path.join(out_path, \"quant\", file_name)\n",
" mel_path = os.path.join(out_path, \"mel\", file_name)\n",
" wav_path = os.path.join(out_path, \"wav_gl\", file_name)\n",
" return file_name, wavq_path, mel_path, wav_path"
" return file_name, wavq_path, mel_path"
]
},
{
@ -65,14 +68,14 @@
"# Paths and configurations\n",
"OUT_PATH = \"/home/ubuntu/TTS/recipes/ljspeech/LJSpeech-1.1/specs2/\"\n",
"DATA_PATH = \"/home/ubuntu/TTS/recipes/ljspeech/LJSpeech-1.1/\"\n",
"PHONEME_CACHE_PATH = \"/home/ubuntu/TTS/recipes/ljspeech/LJSpeech-1.1/phoneme_cache\"\n",
"DATASET = \"ljspeech\"\n",
"METADATA_FILE = \"metadata.csv\"\n",
"CONFIG_PATH = \"/home/ubuntu/.local/share/tts/tts_models--en--ljspeech--tacotron2-DDC_ph/config.json\"\n",
"MODEL_FILE = \"/home/ubuntu/.local/share/tts/tts_models--en--ljspeech--tacotron2-DDC_ph/model_file.pth\"\n",
"BATCH_SIZE = 32\n",
"\n",
"QUANTIZED_WAV = False\n",
"QUANTIZE_BIT = None\n",
"QUANTIZE_BITS = 0 # if non-zero, quantize wav files with the given number of bits\n",
"DRY_RUN = False # if False, does not generate output files, only computes loss and visuals.\n",
"\n",
"# Check CUDA availability\n",
@ -80,10 +83,10 @@
"print(\" > CUDA enabled: \", use_cuda)\n",
"\n",
"# Load the configuration\n",
"dataset_config = BaseDatasetConfig(formatter=DATASET, meta_file_train=METADATA_FILE, path=DATA_PATH)\n",
"C = load_config(CONFIG_PATH)\n",
"C.audio['do_trim_silence'] = False # IMPORTANT!!!!!!!!!!!!!!! disable to align mel specs with the wav files\n",
"ap = AudioProcessor(bits=QUANTIZE_BIT, **C.audio)\n",
"print(C['r'])"
"ap = AudioProcessor(**C.audio)"
]
},
{
@ -92,12 +95,10 @@
"metadata": {},
"outputs": [],
"source": [
"# If the vocabulary was passed, replace the default\n",
"if 'characters' in C and C['characters']:\n",
" symbols, phonemes = make_symbols(**C.characters)\n",
"# Initialize the tokenizer\n",
"tokenizer, C = TTSTokenizer.init_from_config(C)\n",
"\n",
"# Load the model\n",
"num_chars = len(phonemes) if C.use_phonemes else len(symbols)\n",
"# TODO: multiple speakers\n",
"model = setup_model(C)\n",
"model.load_checkpoint(C, MODEL_FILE, eval=True)"
@ -109,42 +110,21 @@
"metadata": {},
"outputs": [],
"source": [
"# Load the preprocessor based on the dataset\n",
"preprocessor = importlib.import_module(\"TTS.tts.datasets.formatters\")\n",
"preprocessor = getattr(preprocessor, DATASET.lower())\n",
"meta_data = preprocessor(DATA_PATH, METADATA_FILE)\n",
"# Load data instances\n",
"meta_data_train, meta_data_eval = load_tts_samples(dataset_config)\n",
"meta_data = meta_data_train + meta_data_eval\n",
"\n",
"dataset = TTSDataset(\n",
" C,\n",
" C.text_cleaner,\n",
" False,\n",
" ap,\n",
" meta_data,\n",
" characters=C.get('characters', None),\n",
" use_phonemes=C.use_phonemes,\n",
" phoneme_cache_path=C.phoneme_cache_path,\n",
" enable_eos_bos=C.enable_eos_bos_chars,\n",
" outputs_per_step=C[\"r\"],\n",
" compute_linear_spec=False,\n",
" ap=ap,\n",
" samples=meta_data,\n",
" tokenizer=tokenizer,\n",
" phoneme_cache_path=PHONEME_CACHE_PATH,\n",
")\n",
"loader = DataLoader(\n",
" dataset, batch_size=BATCH_SIZE, num_workers=4, collate_fn=dataset.collate_fn, shuffle=False, drop_last=False\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize lists for storing results\n",
"file_idxs = []\n",
"metadata = []\n",
"losses = []\n",
"postnet_losses = []\n",
"criterion = L1LossMasked(seq_len_norm=C.seq_len_norm)\n",
"\n",
"# Create log file\n",
"log_file_path = os.path.join(OUT_PATH, \"log.txt\")\n",
"log_file = open(log_file_path, \"w\")"
")"
]
},
{
@ -160,26 +140,33 @@
"metadata": {},
"outputs": [],
"source": [
"# Initialize lists for storing results\n",
"file_idxs = []\n",
"metadata = []\n",
"losses = []\n",
"postnet_losses = []\n",
"criterion = L1LossMasked(seq_len_norm=C.seq_len_norm)\n",
"\n",
"# Start processing with a progress bar\n",
"with torch.no_grad():\n",
"log_file_path = os.path.join(OUT_PATH, \"log.txt\")\n",
"with torch.no_grad() and open(log_file_path, \"w\") as log_file:\n",
" for data in tqdm(loader, desc=\"Processing\"):\n",
" try:\n",
" # setup input data\n",
" text_input, text_lengths, _, linear_input, mel_input, mel_lengths, stop_targets, item_idx = data\n",
"\n",
" # dispatch data to GPU\n",
" if use_cuda:\n",
" text_input = text_input.cuda()\n",
" text_lengths = text_lengths.cuda()\n",
" mel_input = mel_input.cuda()\n",
" mel_lengths = mel_lengths.cuda()\n",
" data[\"token_id\"] = data[\"token_id\"].cuda()\n",
" data[\"token_id_lengths\"] = data[\"token_id_lengths\"].cuda()\n",
" data[\"mel\"] = data[\"mel\"].cuda()\n",
" data[\"mel_lengths\"] = data[\"mel_lengths\"].cuda()\n",
"\n",
" mask = sequence_mask(text_lengths)\n",
" mel_outputs, postnet_outputs, alignments, stop_tokens = model.forward(text_input, text_lengths, mel_input)\n",
" mask = sequence_mask(data[\"token_id_lengths\"])\n",
" outputs = model.forward(data[\"token_id\"], data[\"token_id_lengths\"], data[\"mel\"])\n",
" mel_outputs = outputs[\"decoder_outputs\"]\n",
" postnet_outputs = outputs[\"model_outputs\"]\n",
"\n",
" # compute loss\n",
" loss = criterion(mel_outputs, mel_input, mel_lengths)\n",
" loss_postnet = criterion(postnet_outputs, mel_input, mel_lengths)\n",
" loss = criterion(mel_outputs, data[\"mel\"], data[\"mel_lengths\"])\n",
" loss_postnet = criterion(postnet_outputs, data[\"mel\"], data[\"mel_lengths\"])\n",
" losses.append(loss.item())\n",
" postnet_losses.append(loss_postnet.item())\n",
"\n",
@ -193,28 +180,27 @@
" postnet_outputs = torch.stack(mel_specs)\n",
" elif C.model == \"Tacotron2\":\n",
" postnet_outputs = postnet_outputs.detach().cpu().numpy()\n",
" alignments = alignments.detach().cpu().numpy()\n",
" alignments = outputs[\"alignments\"].detach().cpu().numpy()\n",
"\n",
" if not DRY_RUN:\n",
" for idx in range(text_input.shape[0]):\n",
" wav_file_path = item_idx[idx]\n",
" for idx in range(data[\"token_id\"].shape[0]):\n",
" wav_file_path = data[\"item_idxs\"][idx]\n",
" wav = ap.load_wav(wav_file_path)\n",
" file_name, wavq_path, mel_path, wav_path = set_filename(wav_file_path, OUT_PATH)\n",
" file_name, wavq_path, mel_path = set_filename(wav_file_path, OUT_PATH)\n",
" file_idxs.append(file_name)\n",
"\n",
" # quantize and save wav\n",
" if QUANTIZED_WAV:\n",
" wavq = ap.quantize(wav)\n",
" if QUANTIZE_BITS > 0:\n",
" wavq = quantize(wav, QUANTIZE_BITS)\n",
" np.save(wavq_path, wavq)\n",
"\n",
" # save TTS mel\n",
" mel = postnet_outputs[idx]\n",
" mel_length = mel_lengths[idx]\n",
" mel_length = data[\"mel_lengths\"][idx]\n",
" mel = mel[:mel_length, :].T\n",
" np.save(mel_path, mel)\n",
"\n",
" metadata.append([wav_file_path, mel_path])\n",
"\n",
" except Exception as e:\n",
" log_file.write(f\"Error processing data: {str(e)}\\n\")\n",
"\n",
@ -224,35 +210,20 @@
" log_file.write(f\"Mean Loss: {mean_loss}\\n\")\n",
" log_file.write(f\"Mean Postnet Loss: {mean_postnet_loss}\\n\")\n",
"\n",
"# Close the log file\n",
"log_file.close()\n",
"\n",
"# For wavernn\n",
"if not DRY_RUN:\n",
" pickle.dump(file_idxs, open(os.path.join(OUT_PATH, \"dataset_ids.pkl\"), \"wb\"))\n",
"\n",
"# For pwgan\n",
"with open(os.path.join(OUT_PATH, \"metadata.txt\"), \"w\") as f:\n",
" for data in metadata:\n",
" f.write(f\"{data[0]}|{data[1]+'.npy'}\\n\")\n",
" for wav_file_path, mel_path in metadata:\n",
" f.write(f\"{wav_file_path[0]}|{mel_path[1]+'.npy'}\\n\")\n",
"\n",
"# Print mean losses\n",
"print(f\"Mean Loss: {mean_loss}\")\n",
"print(f\"Mean Postnet Loss: {mean_postnet_loss}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# for pwgan\n",
"with open(os.path.join(OUT_PATH, \"metadata.txt\"), \"w\") as f:\n",
" for data in metadata:\n",
" f.write(f\"{data[0]}|{data[1]+'.npy'}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -267,7 +238,7 @@
"outputs": [],
"source": [
"idx = 1\n",
"ap.melspectrogram(ap.load_wav(item_idx[idx])).shape"
"ap.melspectrogram(ap.load_wav(data[\"item_idxs\"][idx])).shape"
]
},
{
@ -276,10 +247,9 @@
"metadata": {},
"outputs": [],
"source": [
"import soundfile as sf\n",
"wav, sr = sf.read(item_idx[idx])\n",
"mel_postnet = postnet_outputs[idx][:mel_lengths[idx], :]\n",
"mel_decoder = mel_outputs[idx][:mel_lengths[idx], :].detach().cpu().numpy()\n",
"wav, sr = sf.read(data[\"item_idxs\"][idx])\n",
"mel_postnet = postnet_outputs[idx][:data[\"mel_lengths\"][idx], :]\n",
"mel_decoder = mel_outputs[idx][:data[\"mel_lengths\"][idx], :].detach().cpu().numpy()\n",
"mel_truth = ap.melspectrogram(wav)\n",
"print(mel_truth.shape)"
]
@ -291,7 +261,7 @@
"outputs": [],
"source": [
"# plot posnet output\n",
"print(mel_postnet[:mel_lengths[idx], :].shape)\n",
"print(mel_postnet[:data[\"mel_lengths\"][idx], :].shape)\n",
"plot_spectrogram(mel_postnet, ap)"
]
},
@ -324,10 +294,9 @@
"outputs": [],
"source": [
"# postnet, decoder diff\n",
"from matplotlib import pylab as plt\n",
"mel_diff = mel_decoder - mel_postnet\n",
"plt.figure(figsize=(16, 10))\n",
"plt.imshow(abs(mel_diff[:mel_lengths[idx],:]).T,aspect=\"auto\", origin=\"lower\");\n",
"plt.imshow(abs(mel_diff[:data[\"mel_lengths\"][idx],:]).T,aspect=\"auto\", origin=\"lower\")\n",
"plt.colorbar()\n",
"plt.tight_layout()"
]
@ -339,10 +308,9 @@
"outputs": [],
"source": [
"# PLOT GT SPECTROGRAM diff\n",
"from matplotlib import pylab as plt\n",
"mel_diff2 = mel_truth.T - mel_decoder\n",
"plt.figure(figsize=(16, 10))\n",
"plt.imshow(abs(mel_diff2).T,aspect=\"auto\", origin=\"lower\");\n",
"plt.imshow(abs(mel_diff2).T,aspect=\"auto\", origin=\"lower\")\n",
"plt.colorbar()\n",
"plt.tight_layout()"
]
@ -354,21 +322,13 @@
"outputs": [],
"source": [
"# PLOT GT SPECTROGRAM diff\n",
"from matplotlib import pylab as plt\n",
"mel = postnet_outputs[idx]\n",
"mel_diff2 = mel_truth.T - mel[:mel_truth.shape[1]]\n",
"plt.figure(figsize=(16, 10))\n",
"plt.imshow(abs(mel_diff2).T,aspect=\"auto\", origin=\"lower\");\n",
"plt.imshow(abs(mel_diff2).T,aspect=\"auto\", origin=\"lower\")\n",
"plt.colorbar()\n",
"plt.tight_layout()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View File

@ -1,33 +1,34 @@
# core deps
numpy==1.22.0;python_version<="3.10"
numpy==1.24.3;python_version>"3.10"
cython==0.29.30
numpy>=1.24.3;python_version>"3.10"
cython>=0.29.30
scipy>=1.11.2
torch>=2.1
torchaudio
soundfile==0.12.*
librosa==0.10.*
scikit-learn==1.3.0
soundfile>=0.12.0
librosa>=0.10.0
scikit-learn>=1.3.0
numba==0.55.1;python_version<"3.9"
numba==0.57.0;python_version>="3.9"
inflect==5.6.*
tqdm==4.64.*
anyascii==0.3.*
pyyaml==6.*
fsspec==2023.6.0 # <= 2023.9.1 makes aux tests fail
aiohttp==3.8.*
packaging==23.1
numba>=0.57.0;python_version>="3.9"
inflect>=5.6.0
tqdm>=4.64.1
anyascii>=0.3.0
pyyaml>=6.0
fsspec>=2023.6.0 # <= 2023.9.1 makes aux tests fail
aiohttp>=3.8.1
packaging>=23.1
mutagen==1.47.0
# deps for examples
flask==2.*
flask>=2.0.1
# deps for inference
pysbd==0.3.4
pysbd>=0.3.4
# deps for notebooks
umap-learn==0.5.*
umap-learn>=0.5.1
pandas>=1.4,<2.0
# deps for training
matplotlib==3.7.*
matplotlib>=3.7.0
# coqui stack
trainer
trainer>=0.0.36
# config management
coqpit>=0.0.16
# chinese g2p deps
@ -46,11 +47,11 @@ bangla
bnnumerizer
bnunicodenormalizer
#deps for tortoise
k_diffusion
einops==0.6.*
transformers==4.33.*
einops>=0.6.0
transformers>=4.33.0
#deps for bark
encodec==0.1.*
encodec>=0.1.1
# deps for XTTS
unidecode==1.3.*
unidecode>=1.3.2
num2words
spacy[ja]>=3

View File

@ -1,113 +0,0 @@
import os
import unittest
from tests import get_tests_data_path, get_tests_output_path
from TTS.api import CS_API, TTS
OUTPUT_PATH = os.path.join(get_tests_output_path(), "test_python_api.wav")
cloning_test_wav_path = os.path.join(get_tests_data_path(), "ljspeech/wavs/LJ001-0028.wav")
is_coqui_available = os.environ.get("COQUI_STUDIO_TOKEN")
if is_coqui_available:
class CS_APITest(unittest.TestCase):
def test_speakers(self):
tts = CS_API()
self.assertGreater(len(tts.speakers), 1)
def test_emotions(self):
tts = CS_API()
self.assertGreater(len(tts.emotions), 1)
def test_list_calls(self):
tts = CS_API()
self.assertGreater(len(tts.list_voices()), 1)
self.assertGreater(len(tts.list_speakers()), 1)
self.assertGreater(len(tts.list_all_speakers()), 1)
self.assertGreater(len(tts.list_speakers_as_tts_models()), 1)
def test_name_to_speaker(self):
tts = CS_API()
speaker_name = tts.list_speakers_as_tts_models()[0].split("/")[2]
speaker = tts.name_to_speaker(speaker_name)
self.assertEqual(speaker.name, speaker_name)
def test_tts(self):
tts = CS_API()
wav, sr = tts.tts(text="This is a test.", speaker_name=tts.list_speakers()[0].name)
self.assertEqual(sr, 44100)
self.assertGreater(len(wav), 1)
class TTSTest(unittest.TestCase):
def test_single_speaker_model(self):
tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False, gpu=False)
error_raised = False
try:
tts.tts_to_file(text="Ich bin eine Testnachricht.", speaker="Thorsten", language="de")
except ValueError:
error_raised = True
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
self.assertTrue(error_raised)
self.assertFalse(tts.is_multi_speaker)
self.assertFalse(tts.is_multi_lingual)
self.assertIsNone(tts.speakers)
self.assertIsNone(tts.languages)
def test_studio_model(self):
tts = TTS(model_name="coqui_studio/en/Zacharie Aimilios/coqui_studio")
tts.tts_to_file(text="This is a test.")
# check speed > 2.0 raises error
raised_error = False
try:
_ = tts.tts(text="This is a test.", speed=4.0, emotion="Sad") # should raise error with speed > 2.0
except ValueError:
raised_error = True
self.assertTrue(raised_error)
# check emotion is invalid
raised_error = False
try:
_ = tts.tts(text="This is a test.", speed=2.0, emotion="No Emo") # should raise error with speed > 2.0
except ValueError:
raised_error = True
self.assertTrue(raised_error)
# check valid call
wav = tts.tts(text="This is a test.", speed=2.0, emotion="Sad")
self.assertGreater(len(wav), 0)
def test_fairseq_model(self): # pylint: disable=no-self-use
tts = TTS(model_name="tts_models/eng/fairseq/vits")
tts.tts_to_file(text="This is a test.")
def test_multi_speaker_multi_lingual_model(self):
tts = TTS()
tts.load_tts_model_by_name(tts.models[0]) # YourTTS
tts.tts_to_file(
text="Hello world!", speaker=tts.speakers[0], language=tts.languages[0], file_path=OUTPUT_PATH
)
self.assertTrue(tts.is_multi_speaker)
self.assertTrue(tts.is_multi_lingual)
self.assertGreater(len(tts.speakers), 1)
self.assertGreater(len(tts.languages), 1)
def test_voice_cloning(self): # pylint: disable=no-self-use
tts = TTS()
tts.load_tts_model_by_name("tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file("Hello world!", speaker_wav=cloning_test_wav_path, language="en", file_path=OUTPUT_PATH)
def test_voice_conversion(self): # pylint: disable=no-self-use
tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False, gpu=False)
tts.voice_conversion_to_file(
source_wav=cloning_test_wav_path,
target_wav=cloning_test_wav_path,
file_path=OUTPUT_PATH,
)

View File

@ -1,25 +0,0 @@
import os
from tests import get_tests_output_path, run_cli
def test_synthesize():
"""Test synthesize.py with diffent arguments."""
output_path = os.path.join(get_tests_output_path(), "output.wav")
# 🐸 Coqui studio model
run_cli(
'tts --model_name "coqui_studio/en/Torcull Diarmuid/coqui_studio" '
'--text "This is it" '
f'--out_path "{output_path}"'
)
# 🐸 Coqui studio model with speed arg.
run_cli(
'tts --model_name "coqui_studio/en/Torcull Diarmuid/coqui_studio" '
'--text "This is it but slow" --speed 0.1'
f'--out_path "{output_path}"'
)
# test pipe_out command
run_cli(f'tts --text "test." --pipe_out --out_path "{output_path}" | aplay')

View File

@ -3,11 +3,11 @@ import unittest
import numpy as np
import torch
from trainer.io import save_checkpoint
from tests import get_tests_input_path
from TTS.config import load_config
from TTS.encoder.utils.generic_utils import setup_encoder_model
from TTS.encoder.utils.io import save_checkpoint
from TTS.tts.utils.managers import EmbeddingManager
from TTS.utils.audio import AudioProcessor
@ -31,7 +31,7 @@ class EmbeddingManagerTest(unittest.TestCase):
# create a dummy speaker encoder
model = setup_encoder_model(config)
save_checkpoint(model, None, None, get_tests_input_path(), 0)
save_checkpoint(config, model, None, None, 0, 0, get_tests_input_path())
# load audio processor and speaker encoder
manager = EmbeddingManager(encoder_model_path=encoder_model_path, encoder_config_path=encoder_config_path)

View File

@ -3,11 +3,11 @@ import unittest
import numpy as np
import torch
from trainer.io import save_checkpoint
from tests import get_tests_input_path
from TTS.config import load_config
from TTS.encoder.utils.generic_utils import setup_encoder_model
from TTS.encoder.utils.io import save_checkpoint
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.audio import AudioProcessor
@ -30,7 +30,7 @@ class SpeakerManagerTest(unittest.TestCase):
# create a dummy speaker encoder
model = setup_encoder_model(config)
save_checkpoint(model, None, None, get_tests_input_path(), 0)
save_checkpoint(config, model, None, None, 0, 0, get_tests_input_path())
# load audio processor and speaker encoder
ap = AudioProcessor(**config.audio)

View File

@ -0,0 +1,9 @@
audio_file|text|transcription|speaker_name
wavs/LJ001-0001.flac|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|ljspeech-0
wavs/LJ001-0002.flac|in being comparatively modern.|in being comparatively modern.|ljspeech-0
wavs/LJ001-0003.flac|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|ljspeech-1
wavs/LJ001-0004.flac|produced the block books, which were the immediate predecessors of the true printed book,|produced the block books, which were the immediate predecessors of the true printed book,|ljspeech-1
wavs/LJ001-0005.flac|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|ljspeech-2
wavs/LJ001-0006.flac|And it is worth mention in passing that, as an example of fine typography,|And it is worth mention in passing that, as an example of fine typography,|ljspeech-2
wavs/LJ001-0007.flac|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about 1455,|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about fourteen fifty-five,|ljspeech-3
wavs/LJ001-0008.flac|has never been surpassed.|has never been surpassed.|ljspeech-3

View File

@ -0,0 +1,9 @@
audio_file|text|transcription|speaker_name
wavs/LJ001-0001.mp3|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|ljspeech-0
wavs/LJ001-0002.mp3|in being comparatively modern.|in being comparatively modern.|ljspeech-0
wavs/LJ001-0003.mp3|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|ljspeech-1
wavs/LJ001-0004.mp3|produced the block books, which were the immediate predecessors of the true printed book,|produced the block books, which were the immediate predecessors of the true printed book,|ljspeech-1
wavs/LJ001-0005.mp3|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|ljspeech-2
wavs/LJ001-0006.mp3|And it is worth mention in passing that, as an example of fine typography,|And it is worth mention in passing that, as an example of fine typography,|ljspeech-2
wavs/LJ001-0007.mp3|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about 1455,|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about fourteen fifty-five,|ljspeech-3
wavs/LJ001-0008.mp3|has never been surpassed.|has never been surpassed.|ljspeech-3

View File

@ -0,0 +1,9 @@
audio_file|text|transcription|speaker_name
wavs/LJ001-0001.wav|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|ljspeech-0
wavs/LJ001-0002.wav|in being comparatively modern.|in being comparatively modern.|ljspeech-0
wavs/LJ001-0003.wav|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process|ljspeech-1
wavs/LJ001-0004.wav|produced the block books, which were the immediate predecessors of the true printed book,|produced the block books, which were the immediate predecessors of the true printed book,|ljspeech-1
wavs/LJ001-0005.wav|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.|ljspeech-2
wavs/LJ001-0006.wav|And it is worth mention in passing that, as an example of fine typography,|And it is worth mention in passing that, as an example of fine typography,|ljspeech-2
wavs/LJ001-0007.wav|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about 1455,|the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about fourteen fifty-five,|ljspeech-3
wavs/LJ001-0008.wav|has never been surpassed.|has never been surpassed.|ljspeech-3


Some files were not shown because too many files have changed in this diff.