About Zonos TTS

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers.

Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44 kHz.
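
As a rough illustration of these controls, the sketch below collects them into keyword arguments for make_cond_dict, the conditioning helper used in the Usage section. The parameter names speaking_rate, pitch_std, fmax, and emotion, and the eight-slot emotion order, are assumptions taken from the upstream repository rather than from this page, so check them against the installed zonos.conditioning module before relying on them. The helper names emotion_vector and conditioning_kwargs are hypothetical.

```python
# Assumed slot order of the emotion vector; verify against zonos.conditioning.
EMOTION_ORDER = (
    "happiness", "sadness", "disgust", "fear",
    "surprise", "anger", "other", "neutral",
)


def emotion_vector(**weights: float) -> list[float]:
    """Return one weight per emotion slot; unnamed emotions default to 0.0."""
    unknown = sorted(set(weights) - set(EMOTION_ORDER))
    if unknown:
        raise ValueError(f"unknown emotions: {unknown}")
    for name, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {weight}")
    return [float(weights.get(name, 0.0)) for name in EMOTION_ORDER]


def conditioning_kwargs(text, speaker, *, language, speaking_rate,
                        pitch_std, fmax, **emotions):
    """Bundle the controls into a dict of (assumed) make_cond_dict keywords."""
    return {
        "text": text,
        "speaker": speaker,
        "language": language,
        "speaking_rate": speaking_rate,  # assumed unit: phonemes per second
        "pitch_std": pitch_std,          # assumed: pitch standard deviation
        "fmax": fmax,                    # assumed: maximum frequency in Hz
        "emotion": emotion_vector(**emotions),
    }
```

With a loaded model, the result would be passed on as make_cond_dict(**conditioning_kwargs(...)). Any values supplied this way are illustrative, not recommended settings.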

For more details and speech samples, see the Zyphra blog. A hosted version is also available at playground.zyphra.com/audio.

Architecture Overview

Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.

Usage

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
from zonos.utils import DEFAULT_DEVICE as device

# Load the pretrained transformer checkpoint onto the default device.
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)

# Build a speaker embedding from a reference clip (used for voice cloning).
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Assemble the conditioning inputs, then generate the audio token codes.
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)

# Decode the codes to waveforms and save the first one.
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

For repeated sampling we highly recommend using the Gradio interface instead, as the minimal example needs to load the model every time it is run.
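
When a script is still preferred, one way to avoid the reload cost is to load the model once and loop over the inputs. The sketch below assumes a model and make_cond_dict obtained exactly as in the minimal example. synthesize_texts is a hypothetical helper name, and it relies only on the calls already shown above.

```python
def synthesize_texts(model, make_cond_dict, texts, speaker, language="en-us"):
    """Yield one decoded waveform batch per text, reusing a loaded model."""
    for text in texts:
        # Same three steps as the minimal example, repeated per input text.
        cond_dict = make_cond_dict(text=text, speaker=speaker, language=language)
        conditioning = model.prepare_conditioning(cond_dict)
        codes = model.generate(conditioning)
        yield model.autoencoder.decode(codes).cpu()


# Possible use, with model, speaker, and torchaudio set up as in the example:
#
# texts = ["First sentence.", "Second sentence."]
# for i, wavs in enumerate(synthesize_texts(model, make_cond_dict, texts, speaker)):
#     torchaudio.save(f"sample_{i}.wav", wavs[0], model.autoencoder.sampling_rate)
```

Passing the model and make_cond_dict in as arguments keeps the helper independent of how they were created, so the expensive from_pretrained call happens once, outside the loop.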

Features

  • Zero-shot TTS with voice cloning: Input desired text and a 10-30s speaker sample to generate high-quality TTS output.
  • Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching.
  • Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German.
  • Audio quality and emotion control: Fine-grained control of speaking rate, pitch, maximum frequency, and various emotions.
  • Fast: Runs with a real-time factor of ~2x on an RTX 4090.
  • Gradio WebUI: Comes packaged with an easy-to-use Gradio interface to generate speech.
  • Simple installation and deployment: Can be installed and deployed using the Dockerfile packaged with our repository.
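
To make the speed figure concrete, the sketch below reads "~2x" as seconds of audio produced per second of compute. That reading is an assumption about how the figure is defined, and both function names are hypothetical.

```python
def audio_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of a clip, given its sample count and sampling rate in Hz."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def realtime_speedup(audio_s: float, wall_s: float) -> float:
    """Seconds of audio generated per second of wall-clock compute."""
    if wall_s <= 0:
        raise ValueError("wall_s must be positive")
    return audio_s / wall_s
```

Under this reading, 10 seconds of audio generated in 5 seconds of wall-clock time gives a factor of 2.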

Note: This is an unofficial about page for Zonos TTS. For the most accurate information, please refer to official documentation.