Teaching Language Models to Speak via Audio Tokens and Forced Alignment
We present OuteTTS, a novel approach to text-to-speech synthesis that leverages pure language modeling without the need for external adapters or complex architectures. Our 350M-parameter model demonstrates that high-quality speech synthesis is achievable through a straightforward approach built on crafted prompts and audio tokens.
Text-to-speech synthesis has traditionally relied on complex architectures and specialized models. With OuteTTS, we show that a relatively small language model can learn to generate high-quality speech through a simple yet effective approach: with just 350M parameters, the model showcases the potential of using language models directly for speech synthesis.
Our model builds upon the LLaMA architecture, specifically utilizing our Oute3-350M-DEV base model, which was pre-trained on 30 billion tokens of DCLM-baseline-1.0. What makes OuteTTS unique is its three-step approach to audio processing:

1. Audio tokenization: raw audio is converted into discrete audio tokens with WavTokenizer.
2. Forced alignment: each word in the transcription is mapped to its corresponding audio tokens and duration.
3. Prompt creation: the aligned data is assembled into structured prompts of the form:

```
[full transcription] [word] [duration token] [audio tokens]
```
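To make that layout concrete, here is a minimal sketch of how such a prompt could be assembled. The helper function, the duration-token spelling, and the audio-token markup below are illustrative assumptions, not OuteTTS internals:

```python
# Illustrative sketch only: the token formats and this helper are
# hypothetical, not the library's actual prompt-building code.
def build_prompt(transcription, aligned_words):
    """aligned_words: list of (word, duration_seconds, audio_token_ids) tuples."""
    parts = [transcription]
    for word, duration, token_ids in aligned_words:
        # Quantize the word duration into a discrete duration token,
        # e.g. 0.42 s -> <dur_0.42>
        duration_token = f"<dur_{round(duration, 2)}>"
        audio_tokens = "".join(f"<|audio_{t}|>" for t in token_ids)
        parts.append(f"{word} {duration_token} {audio_tokens}")
    return "\n".join(parts)

prompt = build_prompt(
    "hello world",
    [("hello", 0.42, [101, 7, 230]), ("world", 0.51, [88, 412])],
)
```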
During the training process, we observed distinct stages of improvement in the model's speech quality.
OuteTTS-0.1-350M offers several notable capabilities:

- A pure language modeling approach to TTS, with no external adapters
- Voice cloning from a short reference recording and its transcription
- Compatibility with quantized GGUF models (e.g. Q4_K_M) for llama.cpp-based inference
As an experimental v0.1 release, OuteTTS has several known limitations:

- Output quality degrades on longer inputs; short sentences work best
- The model may occasionally fail to follow the input text exactly
- Generation is sensitive to sampling parameters such as temperature
Implementation Note: While the model excels at shorter sentences, we recommend splitting longer text into smaller segments for optimal performance. Temperature settings may need adjustment based on specific use cases.
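One way to apply the splitting recommendation is a simple sentence-level chunker. The helper below is an illustrative sketch, not part of the outetts package:

```python
import re

def split_text(text, max_chars=200):
    """Split text into sentence-aligned chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```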
The examples below illustrate the effect of sampling settings (audio samples omitted):

| Input | Notes |
|---|---|
| Hello, I can speak pretty well, but sometimes I make some mistakes. | temperature=0.1, repetition_penalty=1.1 |
| Scientists have discovered a new planet that may be capable of supporting life! | Using the Q4_K_M quantized model. (temperature=0.7, repetition_penalty=1.1) |
| Scientists have discovered a new planet that may be capable of supporting life! | The model partially failed to follow the input text. (temperature=0.1, repetition_penalty=1.1) |
| Scientists have discovered a new planet that may be capable of supporting life! | Raising the temperature from 0.1 to 0.7 produces more consistent output. (temperature=0.7, repetition_penalty=1.1) |
Getting started with OuteTTS is straightforward.
The code is available in the GitHub Repository, and the package can be installed with pip:

```bash
pip install outetts
```
```python
from outetts.v0_1.interface import InterfaceHF, InterfaceGGUF

# Initialize the interface with the Hugging Face model
interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")

# Or initialize the interface with a GGUF model
# interface = InterfaceGGUF("path/to/model.gguf")

# Generate TTS output
# Without a speaker reference, the model generates speech
# with random speaker characteristics
output = interface.generate(
    text="Hello, am I working?",
    temperature=0.1,
    repetition_penalty=1.1,
    max_lenght=4096
)

# Play the generated audio
output.play()

# Save the generated audio to a file
output.save("output.wav")
```
To clone a voice, create a speaker profile from a reference recording and pass it to generate:

```python
# Create a custom speaker from an audio file
speaker = interface.create_speaker(
    "path/to/reference.wav",
    "reference text matching the audio"
)

# Generate TTS with the custom voice
output = interface.generate(
    text="This is a cloned voice speaking",
    speaker=speaker,
    temperature=0.1,
    repetition_penalty=1.1,
    max_lenght=4096
)
```
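Putting the pieces together, here is a sketch of long-form generation that chunks the input before synthesis. It reuses the illustrative split_text helper defined earlier along with the interface and speaker objects from the snippets above; the segment file naming is arbitrary:

```python
# Illustrative loop: generate each chunk separately and save the segments.
long_text = (
    "Scientists have discovered a new planet that may be capable of "
    "supporting life! Researchers will now study its atmosphere in detail."
)
for i, chunk in enumerate(split_text(long_text)):
    output = interface.generate(
        text=chunk,
        speaker=speaker,  # optional: omit for random speaker characteristics
        temperature=0.1,
        repetition_penalty=1.1,
        max_lenght=4096,  # parameter name as spelled in the v0.1 examples above
    )
    output.save(f"segment_{i:02d}.wav")
```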
OuteTTS-0.1-350M represents a step forward in simplifying text-to-speech synthesis, demonstrating that high-quality speech generation is possible through a pure language modeling approach.