Audio inputs

Accepted formats

Voice AI accepts:

  • WAV (PCM, 16-bit, mono or stereo)
  • MP3 (CBR or VBR, 64 kbps or higher)
  • M4A (AAC-LC)
  • FLAC

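Uploads in any other format are rejected by the gateway with 415 Unsupported Media Type. Convert with ffmpeg if needed; the command below resamples to 16 kHz (-ar 16000) and downmixes to mono (-ac 1):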

ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
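
As a rough pre-upload check, a client could decide from the file extension whether a conversion is needed and build the same ffmpeg invocation programmatically. This is a minimal sketch: `ACCEPTED_EXTENSIONS`, `needs_conversion`, and `ffmpeg_convert_command` are hypothetical names, not part of Voice AI, and an extension check cannot confirm the codec inside the container (for example AAC-LC in an M4A file).

```python
from pathlib import Path

# Extensions for the accepted formats listed above (hypothetical helper data).
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def needs_conversion(filename):
    """Return True if the extension is not one of the accepted formats."""
    return Path(filename).suffix.lower() not in ACCEPTED_EXTENSIONS


def ffmpeg_convert_command(src, dst="output.wav"):
    """Build the argument list for the ffmpeg command shown above."""
    # -ar 16000 sets a 16 kHz sample rate; -ac 1 downmixes to mono.
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]
```

The returned list can be passed to `subprocess.run(..., check=True)`, which avoids shell quoting problems with unusual file names.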

Sample rates

| API | Minimum | Recommended | Notes |
|---|---|---|---|
| Speech-to-Text | 8 kHz | 16 kHz | Audio above 16 kHz is downsampled. |
| Streaming Speech-to-Text | 8 kHz | 16 kHz | Must be declared in the config frame. |
| Voice Clone | 16 kHz | 24 kHz | Higher rates produce better clones. |
| Voice Design | n/a | n/a | No audio input. |
| Text-to-Speech output | 16 kHz | 24 kHz | Configurable per request. |
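
For WAV input, the sample rate can be read from the file header before uploading and compared with the table above. The sketch below uses Python's standard `wave` module; the dictionary keys and the function name are illustrative labels, not API identifiers, and `wave` only reads PCM WAV files, so other formats need a different probe.

```python
import wave

# (minimum, recommended) sample rates in Hz, copied from the table above.
# The keys are hypothetical labels chosen for this example.
SAMPLE_RATES_HZ = {
    "speech-to-text": (8_000, 16_000),
    "streaming-speech-to-text": (8_000, 16_000),
    "voice-clone": (16_000, 24_000),
}


def check_wav_sample_rate(path, api):
    """Classify a PCM WAV file as 'too_low', 'ok', or 'recommended'."""
    minimum, recommended = SAMPLE_RATES_HZ[api]
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    if rate < minimum:
        return "too_low"
    if rate < recommended:
        return "ok"
    return "recommended"
```

A file classified as `too_low` can be resampled with ffmpeg's `-ar` option, although upsampling does not restore detail that was never recorded.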

File size limits

| Endpoint | Max upload | Max duration |
|---|---|---|
| /ai/speech-to-text | 100 MB | 2 hours |
| /ai/transcriptions | 100 MB | 2 hours |
| /ai/voice-clone | 5 MB | 60 seconds |
| /ai/speech-translation | 100 MB | 2 hours |

Uploads above the size limit are rejected with 413 Payload Too Large. For audio longer than the maximum duration, split it into segments client-side or use the streaming endpoint.
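
The limits above can be checked client-side before sending a request, and over-long audio can be divided into time ranges that each fit the duration cap. The following is only a sketch under stated assumptions: it treats 1 MB as 1,000,000 bytes (the conservative reading; the gateway may count differently), and `LIMITS`, `fits_upload_limit`, and `plan_segments` are hypothetical names.

```python
import os

MB = 1_000_000  # assumption: decimal megabytes

# endpoint -> (max upload in bytes, max duration in seconds), from the table above
LIMITS = {
    "/ai/speech-to-text": (100 * MB, 2 * 60 * 60),
    "/ai/transcriptions": (100 * MB, 2 * 60 * 60),
    "/ai/voice-clone": (5 * MB, 60),
    "/ai/speech-translation": (100 * MB, 2 * 60 * 60),
}


def fits_upload_limit(path, endpoint):
    """Return True if the file is within the endpoint's upload size limit."""
    max_bytes, _ = LIMITS[endpoint]
    return os.path.getsize(path) <= max_bytes


def plan_segments(total_seconds, endpoint):
    """Split a duration into consecutive (start, end) ranges within the duration cap."""
    _, max_seconds = LIMITS[endpoint]
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments
```

For example, `plan_segments(9000, "/ai/speech-to-text")` yields two ranges, 0 to 7200 seconds and 7200 to 9000 seconds. Fixed cut points can split a word in half, and a segment that fits the duration cap can still exceed the size cap at high bitrates, so each resulting file is worth checking with `fits_upload_limit` as well.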

Recommendations for cloning samples

  • 10 to 30 seconds of clean speech is the sweet spot. Longer samples do not improve quality.
  • Single speaker. Multi-speaker samples produce mixed clones.
  • Quiet environment. Audible background noise degrades the clone.
  • Natural prosody. A varied sentence is better than a monotone reading.
  • Consistent microphone. Avoid switching mics partway through the sample.
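
Only some of these recommendations can be checked automatically. The sketch below inspects a WAV header for duration and bit depth; speaker count, background noise, prosody, and microphone consistency still need a listen. `clone_sample_warnings` is a hypothetical helper, and the thresholds are taken from this page (10 to 30 seconds suggested, 60 seconds maximum, 16-bit PCM).

```python
import wave


def clone_sample_warnings(path):
    """Return a list of warnings for a PCM WAV file intended as a cloning sample."""
    warnings = []
    with wave.open(path, "rb") as wav:
        # Duration is the frame count divided by frames per second.
        seconds = wav.getnframes() / wav.getframerate()
        sample_width = wav.getsampwidth()  # bytes per sample

    if seconds > 60:
        warnings.append("longer than the 60-second voice-clone limit")
    elif seconds > 30:
        warnings.append("longer than 30 seconds; extra length is not expected to help")
    elif seconds < 10:
        warnings.append("shorter than the suggested 10 seconds")

    if sample_width != 2:
        warnings.append("not 16-bit PCM")
    return warnings
```

An empty list means the header-level checks passed, not that the sample will clone well.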