🐸TTS: Text-to-Speech deep learning toolkit
PyTorch toolkit for deep learning text-to-speech synthesis.
🐸TTS is a PyTorch-based deep learning library for text-to-speech synthesis that implements multiple model architectures including Tacotron, Glow-TTS, and XTTS. The toolkit combines acoustic models for converting text to mel-spectrograms with vocoder models like HiFi-GAN and MelGAN for converting spectrograms to waveforms. It supports multi-speaker synthesis, voice cloning, voice conversion, and speaker encoding capabilities. The library is used in both research contexts and production deployments, with support for over 1100 languages through integration with Fairseq models.
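As a sketch of the high-level Python API (assuming `pip install TTS`; the `pick_model` helper below is illustrative, not part of the library), synthesizing text to a WAV file looks roughly like this:

```python
# Model names follow Coqui's "<type>/<language>/<dataset>/<model>" scheme.
def pick_model(multilingual: bool = False) -> str:
    """Return an illustrative pretrained model name (hypothetical helper)."""
    if multilingual:
        return "tts_models/multilingual/multi-dataset/xtts_v2"
    return "tts_models/en/ljspeech/tacotron2-DDC"

if __name__ == "__main__":
    # Heavy import: loading a model downloads its weights on first use.
    from TTS.api import TTS

    tts = TTS(pick_model())
    tts.tts_to_file(text="Hello from Coqui TTS!", file_path="output.wav")
```

The acoustic model and vocoder are resolved automatically from the model name, so switching architectures is a one-line change.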

Multi-architecture support
Implements various model architectures including Tacotron, Glow-TTS, XTTS, Tortoise, and Bark, allowing users to select approaches suited to their specific requirements. Integration with Fairseq models provides access to additional language coverage.
Voice cloning and conversion
Includes speaker encoder components and voice cloning capabilities that enable synthesis with new speaker characteristics. XTTS supports streaming inference with reported latency under 200ms.
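A minimal voice-cloning sketch with XTTS (assuming `pip install TTS`; `clone_kwargs` and the file names are hypothetical, used only to group the cloning-specific arguments):

```python
XTTS_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"

def clone_kwargs(reference_wav: str, language: str = "en") -> dict:
    """Build the keyword arguments that steer synthesis toward a reference voice."""
    return {"speaker_wav": reference_wav, "language": language}

if __name__ == "__main__":
    from TTS.api import TTS

    tts = TTS(XTTS_MODEL)
    # A few seconds of clean reference audio is enough to condition the speaker.
    tts.tts_to_file(text="This should sound like the reference speaker.",
                    file_path="clone.wav",
                    **clone_kwargs("reference_speaker.wav"))
```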
Training and fine-tuning tools
Provides utilities for dataset analysis, curation, and model training from scratch or fine-tuning existing models. Example recipes are available for common datasets like LJSpeech.
Adds multi-GPU training for XTTS, studio speakers to open-source XTTS, and fixes Chinese speech pause handling; no breaking changes noted.
- Enable multi-GPU training for XTTS models to scale training workloads across hardware.
- Use new studio speaker voices now available in open-source XTTS for improved voice quality.
Adds a Gradio UI for no-code XTTS fine-tuning; no breaking changes or new requirements noted.
- Use the new Gradio demo to fine-tune XTTS models without code, runnable locally, on Colab, or on a server.
- Follow the step-by-step video tutorial or XTTS docs to train custom voice models with your own audio data.
Adds versioned XTTS model loading and optional sentence splitting; fixes punctuation handling in text preprocessing.
- Load specific XTTS versions by appending version tags to model names (e.g., `xtts_v2.0.2`) or omit the tag for the latest.
- Set `split_sentences=False` in `tts_to_file()` to disable automatic sentence splitting and apply custom text logic.
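Both options can be sketched together (assuming `pip install TTS`; `versioned` is a hypothetical helper that mirrors the `xtts_v2.0.2`-style naming from the release note):

```python
from typing import Optional

def versioned(name: str, version: Optional[str] = None) -> str:
    """Append a version tag to a model name, or leave it untouched for the latest."""
    return f"{name}_{version}" if version else name

if __name__ == "__main__":
    from TTS.api import TTS

    # Pin a specific XTTS release instead of tracking the latest.
    tts = TTS(f"tts_models/multilingual/multi-dataset/{versioned('xtts', 'v2.0.2')}")
    # split_sentences=False passes the text through unsplit, so any custom
    # chunking logic can run before calling the library.
    tts.tts_to_file(text="One long passage with no automatic splitting.",
                    file_path="out.wav",
                    split_sentences=False)
```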
Related Repositories
NAFNet
Efficient PyTorch architecture for image restoration tasks.
AutoGPT
Block-based visual editor for autonomous AI agents.
stablediffusion
Text-to-image diffusion in compressed latent space.
CodeFormer
Transformer-based face restoration using vector-quantized codebook lookup.
gym
Standard API for reinforcement learning environment interfaces.