From 45b2f8e42e3e81fdfe667b97314ae927b72aba63 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Eren=20G=C3=B6lge?=
Date: Fri, 10 Dec 2021 09:12:03 +0000
Subject: [PATCH] =?UTF-8?q?Add=20=F0=9F=91=91YourTTS=20docs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/source/models/vits.md | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/docs/source/models/vits.md b/docs/source/models/vits.md
index 5c0e92f6..0c303f7a 100644
--- a/docs/source/models/vits.md
+++ b/docs/source/models/vits.md
@@ -3,10 +3,37 @@
 VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
 ) is an End-to-End (encoder -> vocoder together) TTS model that takes advantage of SOTA DL techniques like GANs, VAE,
 Normalizing Flows. It does not require external alignment annotations and learns the text-to-audio alignment
-using MAS as explained in the paper. The model architecture is a combination of GlowTTS encoder and HiFiGAN vocoder.
+using MAS, as explained in the paper. The model architecture is a combination of the GlowTTS encoder and the HiFiGAN vocoder.
 It is a feed-forward model with x67.12 real-time factor on a GPU.
 
+🐸 YourTTS is a multi-speaker and multi-lingual TTS model that can perform voice conversion and zero-shot speaker adaptation.
+It can also learn a new language or voice from a ~1 minute long audio clip. This opens the door to training
+TTS models for low-resource languages. 🐸 YourTTS uses VITS as its backbone architecture, coupled with a speaker encoder model.
+
 ## Important resources & papers
+- 🐸 YourTTS: https://arxiv.org/abs/2112.02418
 - VITS: https://arxiv.org/pdf/2106.06103.pdf
 - Neural Spline Flows: https://arxiv.org/abs/1906.04032
 - Variational Autoencoder: https://arxiv.org/pdf/1312.6114.pdf
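+
+## Example use
+
+Below is a minimal sketch of zero-shot voice cloning with 🐸 YourTTS. It assumes the pre-trained
+model is released under the `tts_models/multilingual/multi-dataset/your_tts` name and that the
+high-level `TTS.api` wrapper is available; check `tts --list_models` for the exact identifier.
+
+```python
+from TTS.api import TTS
+
+# Load the pre-trained multilingual YourTTS checkpoint (model name assumed, see above).
+tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
+
+# Zero-shot speaker adaptation: clone the voice in the reference clip and
+# synthesize the given text in English.
+tts.tts_to_file(
+    text="Hello world!",
+    speaker_wav="target_speaker.wav",  # short clip of the target voice
+    language="en",
+    file_path="output.wav",
+)
+```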