Voice & Audio
VexAI provides comprehensive voice and audio capabilities: text-to-speech, speech-to-text, voice channels, meeting transcription, and an interactive voice assistant.
Overview
The voice system is built around a self-hosted-first philosophy: both TTS and STT offer a local provider option that needs only a baseUrl and no API key. Cloud providers (OpenAI, Google, ElevenLabs) are also fully supported for users who prefer managed services.
You can run the entire voice stack at zero cost using Kokoro TTS + faster-whisper STT locally. No API keys needed. Just point baseUrl at your local service.
Text-to-Speech (TTS)
Generate spoken audio from text and send it to any Discord channel. The TTS system is a standalone tool available to the bot in all contexts.
Providers
| Provider | Key Required | Models | Notes |
|---|---|---|---|
| openai | Yes | tts-1, tts-1-hd | High quality · Voices: alloy, echo, fable, onyx, nova, shimmer |
| google | Yes | - | Google Cloud Text-to-Speech |
| elevenlabs | Yes | - | Premium voice cloning and synthesis |
| local | No | Any | Kokoro, Piper, AllTalk, or any OpenAI-compatible /v1/audio/speech endpoint |
All providers support a custom baseUrl for self-hosted or proxied endpoints.
Configuration
{
  "voice": {
    "tts": {
      "provider": "local",
      "baseUrl": "http://localhost:8880",
      "voice": "af_heart",
      "model": "kokoro"
    }
  }
}
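To make the local provider concrete, here is a minimal sketch of the kind of request an OpenAI-compatible /v1/audio/speech endpoint expects. The function names (`build_speech_request`, `synthesize`) are hypothetical and not part of VexAI. The payload fields follow the OpenAI speech API, and the values mirror the configuration above. Whether VexAI builds its request exactly this way is an assumption.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str, model: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/audio/speech endpoint."""
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, voice: str, model: str) -> bytes:
    """Send the request and return the raw audio bytes (needs a running TTS server)."""
    request = build_speech_request(base_url, text, voice, model)
    with urllib.request.urlopen(request) as response:
        return response.read()


# Values taken from the sample configuration; no network call is made here.
request = build_speech_request("http://localhost:8880", "Hello from VexAI", "af_heart", "kokoro")
print(request.full_url)
```

No Authorization header is set because the local provider needs no API key. A cloud provider would add one.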
Speech-to-Text (STT)
Transcribe voice messages, audio files, and audio from any URL. Automatic format fallback converts OGG/Opus to WAV via ffmpeg when the STT provider requires it.
Providers
| Provider | Key Required | Notes |
|---|---|---|
| whisper-api | Yes | OpenAI Whisper API |
| google | Yes | Google Cloud Speech-to-Text |
| local | No | faster-whisper, whisper.cpp, or any OpenAI-compatible /v1/audio/transcriptions endpoint |
Discord voice messages use OGG/Opus. If your STT provider doesn't support Opus natively, VexAI automatically converts to WAV using ffmpeg. Make sure ffmpeg is installed in your environment.
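The conversion step can be pictured as a thin wrapper around the ffmpeg command line. The sketch below is illustrative only: the helper names are hypothetical, and the 16 kHz mono output is an assumption (a common input format for Whisper-style models), not a documented VexAI setting.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(source: Path, target: Path, sample_rate: int = 16000) -> list:
    """Return the ffmpeg argument list that converts `source` to a mono WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the target if it already exists
        "-i", str(source),        # input file, e.g. a Discord OGG/Opus voice message
        "-ac", "1",               # downmix to one channel (assumed)
        "-ar", str(sample_rate),  # resample to the given rate (assumed 16 kHz)
        str(target),
    ]


def convert_to_wav(source: Path) -> Path:
    """Convert an audio file to WAV next to the original and return the new path."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH; install it to enable conversion")
    target = source.with_suffix(".wav")
    subprocess.run(build_ffmpeg_command(source, target), check=True, capture_output=True)
    return target
```

Checking for ffmpeg up front turns a missing binary into a clear error message instead of a failed transcription.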
Voice Channels
VexAI can join and interact in Discord voice channels with full audio capabilities:
- Join & Leave: Connect to any voice channel the bot has access to
- Speak: Stream TTS audio directly into the voice channel
- Listen & Transcribe: Capture and transcribe user speech in real-time
- Always Unmuted: The bot is always unmuted in voice; deafen/undeafen controls listening mode
- Chunked Streaming: Long speech is chunked to prevent audio cutoff
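As an illustration of chunked streaming, the sketch below splits long text into sentence-aligned pieces that could each be synthesized and played in turn. It assumes chunking happens at the text level before synthesis. The `chunk_text` helper and the 200-character default are hypothetical, not VexAI's actual values.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list:
    """Split text into chunks of at most `max_chars`, breaking at sentence ends when possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            # The sentence still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is cut at hard character boundaries.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))
```

Breaking at sentence boundaries keeps pauses between chunks natural. The hard cut applies only to sentences that exceed the limit on their own and may split a word.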
Meeting Mode
Have VexAI join a voice channel and transcribe everything that's said. When the meeting ends, the bot produces a full transcript.
- Join a voice channel and transcribe all participants
- Produces a complete transcript when the meeting ends
- Useful for keeping records of voice discussions, standups, and planning sessions
Meeting transcripts include speaker attribution; each segment is tagged with the Discord user who spoke it.
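One way to picture speaker attribution is a list of segments, each carrying the speaking user, that is sorted and rendered when the meeting ends. The `Segment` fields, the timestamps, and the output layout below are assumptions made for illustration. The actual transcript format may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    user_id: str          # Discord user ID of the speaker
    display_name: str     # name shown in the transcript
    start_seconds: float  # offset from the start of the meeting (assumed field)
    text: str             # transcribed speech


def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: list) -> str:
    """Render segments in chronological order, one attributed line per segment."""
    ordered = sorted(segments, key=lambda segment: segment.start_seconds)
    return "\n".join(
        f"[{format_timestamp(s.start_seconds)}] {s.display_name}: {s.text}" for s in ordered
    )


meeting = [
    Segment("2", "Bob", 65.0, "Agreed."),
    Segment("1", "Alice", 3.2, "Let's start."),
]
print(render_transcript(meeting))
```

Sorting by start time matters because segments from different speakers can finish transcribing out of order.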
Voice Assistant
The crown jewel of the voice system: a fully interactive voice assistant that lives in your Discord voice channel.
- Wake Word Detection: Activates on "hey vex" (configurable)
- Wake Aliases: Define additional trigger phrases
- Optional Text Responses: Mirror voice replies to a text channel
- Conversation History: Maintains context across turns (configurable max)
- Idle Timeout: Automatically leaves the channel after a period of inactivity
Configuration
{
  "voice": {
    "assistant": {
      "enabled": true,
      "wakeWord": "hey vex",
      "wakeAliases": ["vex", "hey bot"],
      "respondWithText": false,
      "idleTimeout": 300,
      "maxHistoryTurns": 10
    }
  }
}
The assistant continuously listens while undeafened. When it detects the wake word in transcribed speech, it captures the rest of the utterance, sends it through the full LLM pipeline (with tools), and streams the response back as TTS audio.
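The wake-word step described above can be sketched as a search over the transcribed text. This is a simplified illustration, not VexAI's actual matcher: `extract_command` is a hypothetical helper, and the case-insensitive whole-word matching is an assumption.

```python
import re
from typing import Optional


def extract_command(transcript: str, wake_word: str, aliases: list) -> Optional[str]:
    """Return the text after the first wake phrase, or None if no phrase is present."""
    # Try longer phrases first so "hey vex" wins over its shorter alias "vex".
    phrases = sorted([wake_word, *aliases], key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b[\s,.!?:;-]*"
    match = re.search(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    # Everything after the wake phrase is treated as the user's request.
    return transcript[match.end():].strip()


print(extract_command("Hey Vex, what's the weather?", "hey vex", ["vex", "hey bot"]))
```

The word boundaries keep words such as "vexing" from triggering the assistant. A wake phrase with nothing after it yields an empty string, which a caller could treat as a prompt to keep listening.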
Self-Hosted Setup
Run the entire voice stack locally with zero API costs:
| Component | Recommended | Default Port |
|---|---|---|
| TTS | Kokoro, runs on CPU | 8880 |
| STT | faster-whisper, needs GPU for real-time | 8000 |
Kokoro TTS runs comfortably on CPU. faster-whisper, by contrast, needs a GPU (NVIDIA recommended) to transcribe in real time; CPU-only transcription works but may introduce noticeable latency.
Set voice.tts.provider and voice.stt.provider to "local", point each baseUrl at your local service, and you have a fully functional voice system with no API keys and no ongoing costs.
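As a sketch of that setup, the snippet below builds a fully local configuration and checks that every local provider has a baseUrl. The helper names are hypothetical, the ports are the defaults from the table above, and the assumption that `voice.stt` accepts the same `provider` and `baseUrl` keys as `voice.tts` comes from the description rather than a documented schema.

```python
import json


def local_voice_config(tts_url: str = "http://localhost:8880",
                       stt_url: str = "http://localhost:8000") -> dict:
    """Build a voice config that uses local providers for both TTS and STT."""
    return {
        "voice": {
            "tts": {"provider": "local", "baseUrl": tts_url},
            "stt": {"provider": "local", "baseUrl": stt_url},
        }
    }


def missing_base_urls(config: dict) -> list:
    """Return the dotted paths of local providers that lack a baseUrl."""
    problems = []
    for section in ("tts", "stt"):
        settings = config.get("voice", {}).get(section, {})
        if settings.get("provider") == "local" and not settings.get("baseUrl"):
            problems.append(f"voice.{section}.baseUrl")
    return problems


config = local_voice_config()
print(json.dumps(config, indent=2))
print(missing_base_urls(config))
```

Running a check like this at startup surfaces a missing baseUrl before the first voice request fails.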