
Voice Cloning with 10 Seconds of Audio: A Developer's Guide

March 28, 2026 · 7 min read

Traditional voice cloning required hours of recorded data, days of fine-tuning, and a team to manage the pipeline. One-shot cloning changes the equation entirely: give the model 10 seconds of clean audio, and it produces a voice representation you can reuse across any generation — in seconds, not days.

This guide covers everything you need to know to create, use, and manage voice clones via the OuteAI API.

What is one-shot voice cloning?

Voice cloning is the process of capturing the characteristics of a specific speaker — their tone, cadence, accent, and timbre — and applying those characteristics to new speech. In a traditional pipeline, this required a large dataset of that speaker's recordings and a custom training run.

One-shot cloning does it from a single short sample. The model extracts a speaker embedding — a compact numerical representation of voice identity — from your reference audio. That embedding is stored and reused as many times as you need, without any further training.

The result isn't a copy of the reference audio. It's the model generating entirely new speech that sounds like that speaker.

How OuteAI's cloning works

When you submit reference audio, OuteAI passes it through a DAC (Discrete Audio Codec) encoder that converts raw waveforms into a compact sequence of audio tokens. These tokens capture the speaker's vocal fingerprint — pitch distribution, speaking pace, resonance — without storing the actual audio.

During generation, the model conditions its output on this stored representation alongside your text prompt. The resulting speech inherits the reference speaker's characteristics while producing words they never actually said.

This approach means:

  • Cloning is fast — encoding takes seconds.
  • You don't need a large dataset, just a clean sample.
  • The clone is reusable across any language the model supports.
  • No model weights are modified — the same base model handles all voices.

Prepare your reference audio

The quality of your clone depends heavily on the quality of the reference audio. Here's what to aim for:

  • Format: WAV or MP3
  • Duration: 5–10 seconds (10 seconds is ideal)
  • Max file size: 10 MB
  • Speaker: single speaker only
  • Environment: clean audio, minimal background noise
  • Content: natural speech — avoid extreme emotions or unusual effects

A 10-second clip of someone speaking naturally in a quiet room will almost always produce a better clone than a 30-second clip recorded in a noisy environment. Compression artifacts, background music, and multiple speakers in the same file all degrade clone quality.

If the clone quality is lower than expected, the issue is usually the reference audio. The model encodes what it hears — clipping, excessive reverb, or background noise all end up in the embedding.
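Since bad reference audio is the usual culprit, it can be worth running a quick pre-flight check before uploading. The helper below is an illustrative sketch, not part of any OuteAI SDK — it uses only Python's standard wave module, assumes 16-bit PCM WAV input, and hard-codes thresholds that mirror the requirements above (5–10 seconds, under 10 MB, peaks below −3 dBFS):

```python
import math
import os
import wave

def check_reference(path: str) -> list[str]:
    """Return a list of problems found in a 16-bit PCM WAV reference clip.

    Illustrative pre-flight check; thresholds follow the requirements
    table in this guide. An empty list means the clip looks OK.
    """
    problems = []
    if os.path.getsize(path) > 10 * 1024 * 1024:
        problems.append("file larger than 10 MB")
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
        if not 5.0 <= duration <= 10.0:
            problems.append(f"duration {duration:.1f}s outside the 5-10s range")
        if wf.getsampwidth() != 2:
            problems.append("expected 16-bit PCM samples")
            return problems
        frames = wf.readframes(wf.getnframes())
    samples = memoryview(frames).cast("h")  # interleaved 16-bit samples
    peak = max(abs(s) for s in samples) if len(samples) else 0
    if peak >= 32767:
        problems.append("clipping: signal hits 0 dBFS")
    elif peak > 0:
        dbfs = 20 * math.log10(peak / 32768)
        if dbfs > -3.0:
            problems.append(f"peak {dbfs:.1f} dBFS is hotter than the -3 dBFS target")
    return problems
```

Run it on your clip and fix anything it reports before spending credits on a clone.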

Create a voice clone

Creating a clone is a single multipart form upload. Send your audio file along with a display name to /api/v1/voice-clones:

curl https://outeai.com/api/v1/voice-clones \
  -X POST \
  -H "Authorization: Bearer oute_xxxxxxxxxxxxxxxxxxxx" \
  -F "[email protected]" \
  -F "name=My Custom Voice"

A successful response returns the new clone's metadata, including its voice_id:

{
  "data": {
    "voice_id": "vc_01jq3k9m5p8n2r6x7w4y0z3a1b",
    "name": "My Custom Voice",
    "created_at": "2026-03-28T14:22:10Z"
  }
}

The clone costs 0.025 credits and is available immediately. Save the voice_id — you'll use it in every generation call.

Creating a clone in Python

import os
import requests

API_KEY = os.environ["OUTEAI_API_KEY"]

with open("reference.wav", "rb") as audio_file:
    response = requests.post(
        "https://outeai.com/api/v1/voice-clones",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": audio_file},
        data={"name": "My Custom Voice"},
    )

response.raise_for_status()
voice_id = response.json()["data"]["voice_id"]
print(f"Clone created: {voice_id}")
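Uploads can fail transiently — a dropped connection, a momentary rate limit. A small retry wrapper keeps the script robust; this is an illustrative helper, not part of any SDK, and the backoff parameters are arbitrary:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on exception, retry with exponential backoff.

    Illustrative helper -- fn is any zero-argument callable, e.g. a
    lambda wrapping the requests.post upload shown above. The last
    failure is re-raised so callers still see the real error.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

For example, wrap the upload as `response = with_retries(lambda: requests.post(...))` and keep the rest of the script unchanged.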

Generate speech with your clone

A cloned voice is used exactly like a built-in voice — just pass its voice_id. Using the Python SDK:

from outeai import OuteAI
import os

client = OuteAI(os.environ["OUTEAI_API_KEY"])

result = client.generate_speech(
    text="Hello, this is my cloned voice speaking new text.",
    voice_id="vc_01jq3k9m5p8n2r6x7w4y0z3a1b",
)

result.save("cloned_output.wav")

Or using streaming for real-time delivery:

with client.stream_speech(
    text="Streaming audio from a cloned voice in real time.",
    voice_id="vc_01jq3k9m5p8n2r6x7w4y0z3a1b",
) as stream:
    stream.save("cloned_stream.wav")

Cross-lingual cloning

You can generate speech in any supported language using a clone created from audio in a different language. The model carries the speaker's vocal characteristics across languages. Keep in mind that strong accents in the reference audio will often carry over — a British English reference will produce British-accented output in other languages too.

For the best multilingual results, create a reference recording in the target language when possible.

List and delete clones

Retrieve all clones on your account:

curl https://outeai.com/api/v1/voice-clones \
  -H "Authorization: Bearer oute_xxxxxxxxxxxxxxxxxxxx"

Delete a specific clone by its ID:

curl https://outeai.com/api/v1/voice-clones/vc_01jq3k9m5p8n2r6x7w4y0z3a1b \
  -X DELETE \
  -H "Authorization: Bearer oute_xxxxxxxxxxxxxxxxxxxx"

Deleted clones are removed immediately and can no longer be used for generation. Your account supports up to 1,000 voice clones.
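If an automated pipeline creates clones, the 1,000-clone cap eventually matters. A hypothetical housekeeping helper: given the list returned by GET /api/v1/voice-clones (assuming each entry has the voice_id and created_at fields shown in the creation response), pick the oldest clones to delete once the account nears the limit:

```python
def clones_to_prune(clones, limit=1000, headroom=50):
    """Return voice_ids of the oldest clones to delete so the account
    stays at least `headroom` clones under `limit`.

    Illustrative sketch; assumes each clone dict carries "voice_id"
    and an ISO-8601 "created_at", as in the creation response.
    """
    excess = len(clones) - (limit - headroom)
    if excess <= 0:
        return []
    oldest_first = sorted(clones, key=lambda c: c["created_at"])
    return [c["voice_id"] for c in oldest_first[:excess]]
```

Feed the returned IDs to the DELETE endpoint above. ISO-8601 timestamps sort correctly as strings, so no date parsing is needed.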

Tips for best clone quality

  • Use a condenser or USB microphone rather than a laptop mic or phone. Ambient noise is the most common quality killer.
  • Record in a small, carpeted room to reduce reverb. Hard surfaces create echo that degrades the embedding.
  • Speak naturally at a normal volume and pace. The model learns from what it hears — exaggerated speech produces an exaggerated clone.
  • Trim silence from the start and end of the clip. The model prefers audio that is mostly speech.
  • Avoid clipping: audio that hits 0 dBFS is distorted, and that distortion ends up in the clone. Aim for peaks around −3 dBFS.
  • Test after creating — generate a short sample and listen critically. If the voice sounds off, record a cleaner reference and create a new clone.
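Trimming leading and trailing silence can also be scripted. A minimal stdlib-only sketch for 16-bit mono WAV files — the amplitude threshold is an arbitrary assumption, and an audio editor or a library like pydub gives finer control:

```python
import wave

def trim_silence(src: str, dst: str, threshold: int = 500) -> None:
    """Copy a 16-bit mono WAV, dropping leading and trailing samples
    whose absolute amplitude stays below `threshold` (out of 32768).

    Illustrative sketch: a simple amplitude gate, not true silence
    detection -- quiet breaths below the threshold are cut too.
    """
    with wave.open(src, "rb") as wf:
        params = wf.getparams()
        samples = memoryview(wf.readframes(wf.getnframes())).cast("h")
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    start, end = (loud[0], loud[-1] + 1) if loud else (0, 0)
    with wave.open(dst, "wb") as out:
        out.setparams(params)
        out.writeframes(samples[start:end].tobytes())
```

Run it over your recording before uploading, then listen to the result to confirm nothing important was cut.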

A note on ethical use

Voice cloning is a powerful capability with real-world implications. Before cloning someone's voice, ensure you have their explicit consent. The OuteAI Terms of Service prohibit using voice clones to impersonate individuals without consent, to create misleading content, or for any purpose that would cause harm.

If you're cloning a voice for a business application — an AI assistant, a narrator for your product, a custom avatar — document the consent process and keep it on file.

The technology is meant to expand what's possible in creative and practical applications. Using it responsibly keeps that door open for everyone.

Get started

Ready to build with speech AI?

No subscription, no seats. Top up credits and spend them only when you generate audio. Credits never expire.