Audio inputs
Accepted formats
Voice AI accepts:
- WAV (PCM, 16-bit, mono or stereo)
- MP3 (CBR or VBR, 64 kbps or higher)
- M4A (AAC-LC)
- FLAC
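A client can run a rough pre-check against the list above before uploading. The sketch below, in Python, looks only at the file extension. The helper name `is_accepted_format` and the extension set are illustrative assumptions, and an extension says nothing about the codec inside the file (for example, whether an M4A file really holds AAC-LC).

```python
from pathlib import Path

# Extensions for the formats listed above. This mapping is an assumption
# for illustration; the service may inspect file contents instead.
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def is_accepted_format(filename: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    return Path(filename).suffix.lower() in ACCEPTED_EXTENSIONS


if __name__ == "__main__":
    # Print the pre-check result for one accepted and one unlisted extension.
    for name in ("interview.WAV", "voicemail.ogg"):
        print(name, is_accepted_format(name))
```

A file that fails this check is a candidate for conversion before upload.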
Other formats return 415 Unsupported Media Type from the gateway. Convert with ffmpeg if needed:
ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
Sample rates
| API | Minimum | Recommended | Notes |
|---|---|---|---|
| Speech-to-Text | 8 kHz | 16 kHz | Audio above 16 kHz is downsampled. |
| Streaming Speech-to-Text | 8 kHz | 16 kHz | Must be declared in the config frame. |
| Voice Clone | 16 kHz | 24 kHz | Higher rates produce better clones. |
| Voice Design | n/a | n/a | No audio input. |
| Text-to-Speech output | 16 kHz | 24 kHz | Configurable per request. |
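The input minimums in the table can be checked locally before a request is sent. This is a minimal sketch using Python's standard `wave` module, so it reads WAV files only. The `MIN_RATE_HZ` mapping and both function names are hypothetical helpers, not part of the Voice AI API.

```python
import wave

# Minimum input sample rates from the table above, in Hz.
MIN_RATE_HZ = {
    "Speech-to-Text": 8_000,
    "Streaming Speech-to-Text": 8_000,
    "Voice Clone": 16_000,
}


def wav_sample_rate(path: str) -> int:
    """Read the sample rate from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def meets_minimum(rate_hz: int, api: str) -> bool:
    """Return True if rate_hz is at or above the minimum for the given API."""
    return rate_hz >= MIN_RATE_HZ[api]
```

For example, `meets_minimum(wav_sample_rate("sample.wav"), "Voice Clone")` reports whether a local file named `sample.wav` reaches the 16 kHz minimum for cloning.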
File size limits
| Endpoint | Max upload | Max duration |
|---|---|---|
| /ai/speech-to-text | 100 MB | 2 hours |
| /ai/transcriptions | 100 MB | 2 hours |
| /ai/voice-clone | 5 MB | 60 seconds |
| /ai/speech-translation | 100 MB | 2 hours |
Files above the upload limit return 413 Payload Too Large. For longer audio, segment client-side or use the streaming endpoint.
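Checking size and duration before upload avoids a wasted request. The sketch below, in Python, encodes the limits from the table and estimates how many segments a long recording needs. It assumes 1 MB means 1,000,000 bytes, which the text does not specify (this is the stricter reading), and the function names are hypothetical.

```python
import math

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes

# (max upload in MB, max duration in seconds) per endpoint, from the table above.
LIMITS = {
    "/ai/speech-to-text": (100, 2 * 3600),
    "/ai/transcriptions": (100, 2 * 3600),
    "/ai/voice-clone": (5, 60),
    "/ai/speech-translation": (100, 2 * 3600),
}


def fits_upload_limit(size_bytes: int, endpoint: str) -> bool:
    """Return True if a file of size_bytes is within the endpoint's upload limit."""
    max_mb, _ = LIMITS[endpoint]
    return size_bytes <= max_mb * BYTES_PER_MB


def segments_needed(duration_s: float, endpoint: str) -> int:
    """Return how many segments keep each piece within the duration limit."""
    _, max_s = LIMITS[endpoint]
    return max(1, math.ceil(duration_s / max_s))
```

A three-hour recording sent to /ai/speech-to-text would need `segments_needed(3 * 3600, "/ai/speech-to-text")` segments by duration alone; each segment still has to pass the size check separately.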
Recommendations for cloning samples
- 10 to 30 seconds of clean speech is the sweet spot. Longer samples do not improve quality.
- Single speaker. Multi-speaker samples produce mixed clones.
- Quiet environment. Ambient noise above the speech floor degrades the clone.
- Natural prosody. A varied sentence is better than a monotone reading.
- Consistent microphone. Avoid switching mics partway through the sample.
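Of the recommendations above, only duration is easy to verify automatically. A minimal sketch in Python, assuming the sample is a WAV file readable by the standard `wave` module: the thresholds come from the 10 to 30 second guidance and the 60 second limit for /ai/voice-clone, while the function names and return labels are illustrative.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Compute duration from the frame count and sample rate in the WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_clone_sample(duration_s: float) -> str:
    """Label a cloning sample by duration only."""
    if duration_s > 60:
        return "over-limit"          # exceeds the 60 second maximum
    if duration_s < 10:
        return "too-short"           # below the recommended range
    if duration_s > 30:
        return "longer-than-needed"  # accepted, but no expected quality gain
    return "ok"
```

Speaker count, background noise, and prosody are not covered by this check and still need a listen.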