OuteTTS 0.1 350M

Introduction

Text-to-speech synthesis has traditionally relied on complex architectures and specialized models. With OuteTTS, we demonstrate that a relatively small language model can learn to generate high-quality speech through a simple yet effective approach. Our model, with just 350M parameters, showcases the potential of using language models directly for speech synthesis.

Our model builds upon the LLaMa architecture, specifically utilizing our Oute3-350M-DEV base model, which was pre-trained on 30 billion DCLM-baseline-1.0 tokens. What makes OuteTTS unique is its three-step approach to audio processing:

Audio tokenization using WavTokenizer (processing 75 tokens per second)
CTC forced alignment to create precise word-to-audio token mapping
Structured prompt creation following the format:

[full transcription]
[word] [duration token] [audio tokens]

Training Progression

During the training process, we observed distinct stages of improvement:

100M tokens: The model began developing basic speaker tones and initial vocalization attempts
500M tokens: Clear improvement in word comprehensibility and sentence structure
1B tokens: Significant enhancement in word knowledge and overall clarity

Key Features

OuteTTS-0.1-350M offers several notable capabilities:

Pure language modeling approach to TTS
Voice cloning capabilities
LLaMa architecture
Compatible with llama.cpp and GGUF format

Known Limitations

As an experimental v0.1 release, OuteTTS has several known limitations:

Vocabulary constraints due to training data limitations
String-only input support
Potential word alterations or insertions due to the compact 350M parameter size
Variable temperature sensitivity depending on use case

Implementation Note: While the model excels at shorter sentences, we recommend splitting longer text into smaller segments for optimal performance. Temperature settings may need adjustment based on specific use cases.

Speech Samples

Input	Audio	Notes
Hello, I can speak pretty well, but sometimes I make some mistakes.		(temperature=0.1, repetition_penalty=1.1)
Hello, I can speak pretty well, but sometimes I make some mistakes.		(temperature=0.1, repetition_penalty=1.1)
Scientists have discovered a new planet that may be capable of supporting life!		Using the Q4_K_M quantized model. (temperature=0.7, repetition_penalty=1.1)
Scientists have discovered a new planet that may be capable of supporting life!		The model partially failed to follow the input text. (temperature=0.1, repetition_penalty=1.1)
Scientists have discovered a new planet that may be capable of supporting life!		In this case, changing to a higher temperature from 0.1 to 0.7 produces more consistent output. (temperature=0.7, repetition_penalty=1.1)

Code Implementation

Getting started with OuteTTS is straightforward.

GitHub Repository

pip install outetts

from outetts.v0_1.interface import InterfaceHF, InterfaceGGUF

# Initialize the interface with the Hugging Face model
interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")

# Or initialize the interface with a GGUF model
# interface = InterfaceGGUF("path/to/model.gguf")

# Generate TTS output
# Without a speaker reference, the model generates speech with random speaker characteristics
output = interface.generate(
    text="Hello, am I working?",
    temperature=0.1,
    repetition_penalty=1.1,
    max_lenght=4096
)

# Play the generated audio
output.play()

# Save the generated audio to a file
output.save("output.wav")

Voice Cloning Capabilities

# Create a custom speaker from an audio file
speaker = interface.create_speaker(
    "path/to/reference.wav",
    "reference text matching the audio"
)

# Generate TTS with the custom voice
output = interface.generate(
    text="This is a cloned voice speaking",
    speaker=speaker,
    temperature=0.1,
    repetition_penalty=1.1,
    max_lenght=4096
)

Future Directions

Increasing model parameters and expanding training data
Investigating alternative alignment methods for improved character compatibility
Exploring potential applications in speech-to-speech assistant models

Conclusion

OuteTTS-0.1-350M represents a step forward in simplifying text-to-speech synthesis. By demonstrating that high-quality speech generation is possible through a pure language modeling approach.

OuteTTS-0.1-350M

Abstract