Audio Inputs

Accepted formats

Voice AI accepts:

Other formats return 415 Unsupported Media Type from the gateway. Convert with ffmpeg if needed:

ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
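
When the conversion has to run from application code, the same command can be assembled programmatically. The following is a minimal Python sketch, assuming ffmpeg is installed and on PATH. The helper names build_convert_cmd and convert_to_wav are hypothetical and are not part of the Voice AI API.

```python
import os
import shutil
import subprocess


def build_convert_cmd(src, dst, rate_hz=16000, channels=1):
    """Return the ffmpeg argument list that resamples src into dst."""
    return ["ffmpeg", "-i", src, "-ar", str(rate_hz), "-ac", str(channels), dst]


def convert_to_wav(src, dst, rate_hz=16000, channels=1):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    # Refuse to overwrite so ffmpeg never stops to ask for confirmation.
    if os.path.exists(dst):
        raise FileExistsError(dst)
    subprocess.run(build_convert_cmd(src, dst, rate_hz, channels), check=True)
```

Calling convert_to_wav("input.ogg", "output.wav") runs the same command as the line above: 16 kHz, mono.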

API	Minimum sample rate	Recommended sample rate	Notes
Speech-to-Text	8 kHz	16 kHz	Audio above 16 kHz is downsampled.
Streaming Speech-to-Text	8 kHz	16 kHz	Must be declared in the config frame.
Voice Clone	16 kHz	24 kHz	Higher rates produce better clones.
Voice Design	n/a	n/a	No audio input.
Text-to-Speech output	16 kHz	24 kHz	Configurable per request.
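
The table can be turned into a client-side check that runs before a request is sent. This is a sketch only: the rates are copied from the table above, while the dictionary, the function name check_sample_rate, and the three return labels are hypothetical conventions chosen for illustration.

```python
# Values copied from the table above, in Hz, as (minimum, recommended).
# None means the API takes no audio input.
SAMPLE_RATES_HZ = {
    "Speech-to-Text": (8000, 16000),
    "Streaming Speech-to-Text": (8000, 16000),
    "Voice Clone": (16000, 24000),
    "Voice Design": None,
    "Text-to-Speech output": (16000, 24000),
}


def check_sample_rate(api, rate_hz):
    """Return 'ok', 'below_recommended', or 'too_low' for the given API."""
    limits = SAMPLE_RATES_HZ[api]
    if limits is None:
        raise ValueError(f"{api} does not take audio input")
    minimum, recommended = limits
    if rate_hz < minimum:
        return "too_low"
    if rate_hz < recommended:
        return "below_recommended"
    return "ok"
```

For example, an 8 kHz recording passes the Speech-to-Text minimum but falls below its recommended rate, and the same recording is too low for Voice Clone.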

Files above the upload limit return 413 Payload Too Large. For longer audio, segment client-side or use the streaming endpoint.
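
Client-side segmentation of a WAV file can be sketched with Python's standard wave module. The upload limit itself is not stated here, so chunk_seconds is a caller-chosen value, and split_wav and out_prefix are hypothetical names. Cutting at fixed intervals can split a word across two chunks, so a production version would likely cut at silences instead.

```python
import wave


def split_wav(path, chunk_seconds, out_prefix):
    """Split a WAV file into consecutive chunks of at most chunk_seconds each.

    Returns the list of chunk file paths, in order.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source file
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width, and rate; the frame count in
                # the header is corrected when the chunk is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each chunk keeps the source's sample rate, channel count, and sample width, so the chunks can be uploaded one by one with the same request settings as the original file.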

For Voice Clone samples, 10 to 30 seconds of clean speech is the sweet spot. Longer samples do not improve quality.
Single speaker. Multi-speaker samples produce mixed clones.
Quiet environment. Ambient noise above the speech floor degrades the clone.
Natural prosody. A varied sentence is better than a monotone reading.
Consistent microphone. Avoid switching mics partway through the sample.
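
Of the guidelines above, only duration can be checked mechanically from the file header. The sketch below does that for a WAV sample and also checks the sample rate, assuming the 16 kHz Voice Clone minimum from the table applies. The function name clone_sample_warnings and its default thresholds are hypothetical. Speaker count, background noise, prosody, and microphone consistency still need a listen.

```python
import wave


def clone_sample_warnings(path, min_seconds=10.0, max_seconds=30.0,
                          min_rate_hz=16000):
    """Return a list of human-readable warnings for a candidate clone sample."""
    with wave.open(path, "rb") as wav:
        rate_hz = wav.getframerate()
        duration = wav.getnframes() / rate_hz
    warnings = []
    if duration < min_seconds:
        warnings.append(
            f"sample is {duration:.1f} s; aim for at least {min_seconds:.0f} s")
    elif duration > max_seconds:
        warnings.append(
            f"sample is {duration:.1f} s; beyond {max_seconds:.0f} s adds no quality")
    if rate_hz < min_rate_hz:
        warnings.append(
            f"sample rate {rate_hz} Hz is below the {min_rate_hz} Hz minimum")
    return warnings
```

An empty list means the sample passed both checks. It does not mean the sample will produce a good clone.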