
Introducing OuteTTS 1.0: Open-Weight TTS with Voice Cloning

April 6, 2025 · 7 min read

Today we're releasing OuteTTS 1.0, a major update to our open-weight text-to-speech system. The release comes in two variants built for different use cases and hardware budgets: Llama-OuteTTS-1.0-1B for maximum quality and language coverage, and the Apache-2.0-licensed OuteTTS-1.0-0.6B for lighter deployments.

Both models run the same architecture, ship with the same outetts Python library, and bring the same generation quality improvements over previous versions — including automatic word alignment, a new audio encoder, and significantly improved voice cloning.

Why this release matters

Earlier versions of OuteTTS required text to be pre-processed before generation: romanization for non-Latin scripts, manual phoneme conversion for certain languages, and external word alignment pipelines. Version 1.0 eliminates all of that. You hand the model raw text — in any supported language, with numbers, punctuation, mixed scripts — and it handles the rest internally.

Voice cloning has also been rebuilt. The previous encoder needed longer reference clips and was sensitive to audio quality variations. The new DAC-based encoder produces stable embeddings from as little as 5 seconds of audio, with noticeably more accurate voice reproduction across languages.

What's new in version 1.0

Automatic word alignment

The model now performs word-level alignment internally during generation. Raw text — including languages without explicit word boundaries like Japanese and Chinese — is handled natively. No tokenization scripts, no external aligners, no pre-processing pipeline needed before calling the model.

New DAC audio encoder

OuteTTS 1.0 moves to a DAC (Descript Audio Codec) encoder from ibm-research/DAC.speech.v1.0, using two codebooks for higher-fidelity audio reconstruction. This doubles the token generation rate from 75 to 150 tokens per second, a deliberate quality-over-speed tradeoff that particularly benefits multilingual output.
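The higher token rate has a direct practical consequence: any token budget you sized for earlier versions now covers half as much audio. A minimal sketch of the arithmetic (plain Python; the function name and default are illustrative, the 150/75 tok/s figures are from this release):

```python
def token_budget(seconds: float, tokens_per_second: int = 150) -> int:
    """Estimate how many audio tokens the model must generate for a
    clip of the given duration (OuteTTS 1.0: 150 tok/s, up from 75)."""
    return int(seconds * tokens_per_second)

# A 30-second clip now needs ~4500 tokens instead of ~2250, so size
# max generation length accordingly when configuring inference.
print(token_budget(30))      # 4500
print(token_budget(30, 75))  # 2250
```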

Richer generation metadata

The updated prompt format includes additional per-word metadata: timing, energy, spectral centroid, and pitch. This gives the model more information to work with when constructing natural prosody, resulting in more consistent rhythm and intonation — especially across longer outputs.

Direct numerical input

Numbers can now be passed directly. The model handles multilingual numerical reading without requiring conversion to written-out text first. Mixed-language prompts work but may default to the dominant language for numeral pronunciation.
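In practice this means prompts with raw numerals can go straight into the generation call. A usage fragment reusing the `interface` and `speaker` objects from the quick-start section below (the example sentence is illustrative; mixed-language numeral behavior may vary as noted above):

```python
# Numerals are read out by the model directly -- no need to spell out
# "1,742.50" or "2026" as words before generation.
output = interface.generate(
    config=outetts.GenerationConfig(
        text="The order total is 1,742.50 euros, due on March 3, 2026.",
        speaker=speaker,
    )
)
output.save("numbers.wav")
```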

Two models, two use cases

Model                  Training data      Languages  License          Base model
Llama-OuteTTS-1.0-1B   ~60k hours audio   23+        CC-BY-NC-SA-4.0  Llama 3.2 1B
OuteTTS-1.0-0.6B       ~20k hours audio   14+        Apache 2.0       Qwen3 0.6B

Llama-OuteTTS-1.0-1B is the higher-quality option. Trained on roughly 60k hours of audio across 23+ languages, it offers the widest coverage in the series. If you need Arabic, Bengali, Lithuanian, Ukrainian, Persian, Swahili, Tamil, or any other language in the extended set, this is the one to use. The CC-BY-NC-SA-4.0 license permits research and non-commercial applications; commercial use requires a separate agreement.

OuteTTS-1.0-0.6B is the open-commercial option. Apache 2.0 licensed, it covers 14 languages and runs on significantly lighter hardware. It's built on Qwen3 0.6B and supports batched inference — making it well-suited for self-hosted production deployments where you're running on your own GPU and need throughput. Available in FP8, GGUF, and EXL2 quantizations for flexible deployment.

Voice cloning

Both models support one-shot voice cloning from a short audio reference. The process is the same for both variants: provide a clean audio clip, get back a speaker embedding, reuse it for any number of generations.

Practical guidelines:

  • 5–10 seconds of clean audio is sufficient. 10 seconds is the sweet spot.
  • Single speaker, minimal background noise. Recording environment matters more than clip length.
  • The clone inherits the speaker's accent — a British English reference will produce British-accented output in other languages too. For the best multilingual results, record the reference in the target language.
  • If clone quality is lower than expected, verify the encoded sample with interface.decode_and_save_speaker() before debugging generation settings.
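
The clip guidelines above can be sanity-checked before encoding. A minimal sketch using only the standard library (the 5–10 second window follows the guidance above; the mono check is an assumption for clean single-speaker references, and the function name is illustrative):

```python
import wave

def check_reference_clip(path: str) -> list[str]:
    """Return a list of warnings for a WAV reference clip: duration
    outside the 5-10 s sweet spot, or non-mono audio."""
    warnings = []
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        if duration < 5:
            warnings.append(f"clip is {duration:.1f}s; aim for 5-10s")
        elif duration > 10:
            warnings.append(f"clip is {duration:.1f}s; trim toward 10s")
        if wav.getnchannels() != 1:
            warnings.append("clip is not mono; mix down to one channel")
    return warnings
```

An empty list means the clip meets the basic criteria; any warnings are worth fixing before calling interface.create_speaker().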

Language coverage

The 1B model has two tiers of language support based on training data volume:

High-coverage languages (extensive training): English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Lithuanian, Russian, Spanish.

Moderate-coverage languages (good performance with occasional limitations): Portuguese, Belarusian, Bengali, Georgian, Hungarian, Latvian, Persian/Farsi, Polish, Swahili, Tamil, Ukrainian.

The 0.6B model supports: English, Chinese, Dutch, French, Georgian, German, Hungarian, Italian, Japanese, Korean, Latvian, Polish, Russian, Spanish.

Both models can attempt speech in languages outside their training set with varying results — worth experimenting with for closely related languages.

Quick start

pip install outetts

Basic usage (1B model)

import outetts

interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
    )
)

# Built-in speaker profiles are available out of the box
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

# Or clone any voice from a short audio sample
# speaker = interface.create_speaker("path/to/reference.wav")
# interface.save_speaker(speaker, "my_voice.json")
# speaker = interface.load_speaker("my_voice.json")

output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        generation_type=outetts.GenerationType.CHUNKED,
        speaker=speaker,
        sampler_config=outetts.SamplerConfig(temperature=0.4),
    )
)
output.save("output.wav")

Basic usage (0.6B model)

from outetts import Interface, ModelConfig, GenerationConfig, Backend, Models

interface = Interface(
    ModelConfig.auto_config(
        model=Models.VERSION_1_0_SIZE_0_6B,
        backend=Backend.HF,
    )
)

speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

output = interface.generate(
    GenerationConfig(text="Hello, how are you doing?", speaker=speaker)
)
output.save("output.wav")

Batched inference (0.6B)

The 0.6B model introduces batched generation support, tested on an NVIDIA L40S GPU. Batching is available through the VLLM, EXL2, and llama.cpp async server backends:

from outetts import Interface, ModelConfig, GenerationConfig, Backend, GenerationType

interface = Interface(
    ModelConfig(
        model_path="OuteAI/OuteTTS-1.0-0.6B-FP8",
        tokenizer_path="OuteAI/OuteTTS-1.0-0.6B",
        backend=Backend.VLLM,
    )
)

speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

output = interface.generate(
    GenerationConfig(
        text="Longer text that gets automatically split into chunks and processed in batches.",
        speaker=speaker,
        generation_type=GenerationType.BATCH,
        max_batch_size=32,
        dac_decoding_chunk=2048,
    )
)
output.save("output_batch.wav")

For EXL2, set backend=Backend.EXL2ASYNC and match exl2_cache_seq_multiply to your max_batch_size. For the llama.cpp async server, use Backend.LLAMACPP_ASYNC_SERVER and pass server_host in the generation config.
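
Put together, the EXL2 variant of the batched example looks roughly like this (a sketch: the EXL2 repo id is hypothetical, and placing exl2_cache_seq_multiply on the model config is an assumption; check the library docs for your version):

```python
# EXL2 async variant of the batched example above.
interface = Interface(
    ModelConfig(
        model_path="OuteAI/OuteTTS-1.0-0.6B-EXL2",  # hypothetical repo id
        tokenizer_path="OuteAI/OuteTTS-1.0-0.6B",
        backend=Backend.EXL2ASYNC,
        exl2_cache_seq_multiply=32,  # keep equal to max_batch_size
    )
)
```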

Sampling configuration — read this

OuteTTS 1.0 has a specific requirement for the repetition penalty that differs from most language models. The penalty must be applied to a 64-token recent window only — not the full context. Applying it across the entire context produces broken or degraded output.

Parameter            Recommended value
Temperature          0.4
Repetition Penalty   1.1
Repetition Range     64 tokens
Top-k                40
Top-p                0.9
Min-p                0.05

If you're using the outetts library, this is handled automatically. If you're running the model through a custom inference stack, you'll need to implement the windowed repetition penalty yourself. Getting this wrong is the most common cause of garbled output.
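
For a custom stack, the windowed penalty amounts to: collect the token ids seen in the last 64 generated tokens and penalize only those logits, leaving the rest of the vocabulary untouched. A minimal sketch in plain Python (the divide-positive/multiply-negative convention follows the standard repetition-penalty formulation, not an OuteTTS-specific spec):

```python
def windowed_repetition_penalty(
    logits: list[float],
    generated: list[int],
    penalty: float = 1.1,
    window: int = 64,
) -> list[float]:
    """Apply a repetition penalty only to tokens that appear in the
    most recent `window` generated tokens."""
    recent = set(generated[-window:])  # only the last 64 tokens matter
    out = list(logits)
    for tok in recent:
        # Standard convention: shrink positive logits toward zero,
        # push negative logits further down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

The key point is the `generated[-window:]` slice: a full-context penalty would iterate over `set(generated)` instead, which is exactly the failure mode described above.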

If you want to generate speech without self-hosting, the OuteAI Studio and API offer a hosted TTS service — no setup, no GPU required.

Get started

Ready to build with speech AI?

No subscription, no seats. Top up credits and spend them only when you generate audio. Credits never expire.