Voice & Audio | VexAI Docs

Overview

The voice system is built around a self-hosted-first philosophy: every provider supports a local option that requires only a baseUrl and no API keys. Cloud providers (OpenAI, Google, ElevenLabs) are also fully supported for users who prefer managed services.

💡 Self-Hosted First

You can run the entire voice stack at zero cost using Kokoro TTS + faster-whisper STT locally. No API keys needed. Just point baseUrl at your local service.

Text-to-Speech (TTS)

Generate spoken audio from text and send it to any Discord channel. The TTS system is a standalone tool available to the bot in all contexts.

Providers

Provider	Key Required	Models	Notes
`openai`	Yes	tts-1, tts-1-hd	High quality · Voices: alloy, echo, fable, onyx, nova, shimmer
`google`	Yes	-	Google Cloud Text-to-Speech
`elevenlabs`	Yes	-	Premium voice cloning and synthesis
`local`	No	Any	Kokoro, Piper, AllTalk, or any OpenAI-compatible `/v1/audio/speech` endpoint

All providers support a custom baseUrl for self-hosted or proxied endpoints.

Configuration

{
  "voice": {
    "tts": {
      "provider": "local",
      "baseUrl": "http://localhost:8880",
      "voice": "af_heart",
      "model": "kokoro"
    }
  }
}

Speech-to-Text (STT)

Transcribe voice messages, audio files, and audio from any URL. Automatic format fallback converts OGG/Opus to WAV via ffmpeg when the STT provider requires it.

Providers

Provider	Key Required	Notes
`whisper-api`	Yes	OpenAI Whisper API
`google`	Yes	Google Cloud Speech-to-Text
`local`	No	faster-whisper, whisper.cpp, or any OpenAI-compatible `/v1/audio/transcriptions` endpoint

ℹ️ Format Handling

Discord voice messages use OGG/Opus. If your STT provider doesn't support Opus natively, VexAI automatically converts to WAV using ffmpeg. Make sure ffmpeg is installed in your environment.

Voice Channels

VexAI can join and interact in Discord voice channels with full audio capabilities:

Join & Leave: Connect to any voice channel the bot has access to
Speak: Stream TTS audio directly into the voice channel
Listen & Transcribe: Capture and transcribe user speech in real-time
Always Unmuted: The bot is always unmuted in voice; deafen/undeafen controls listening mode
Chunked Streaming: Long speech is chunked to prevent audio cutoff

Meeting Mode

Have VexAI join a voice channel and transcribe everything that's said. When the meeting ends, the bot produces a full transcript.

Join a voice channel and transcribe all participants
Produces a complete transcript when the meeting ends
Useful for keeping records of voice discussions, standups, and planning sessions

💡 Tip

Meeting transcripts include speaker attribution; each segment is tagged with the Discord user who spoke it.

Voice Assistant

The crown jewel of the voice system: a fully interactive voice assistant that lives in your Discord voice channel.

Wake Word Detection: Activates on "hey vex" (configurable)
Wake Aliases: Define additional trigger phrases
Optional Text Responses: Mirror voice replies to a text channel
Conversation History: Maintains context across turns (configurable max)
Idle Timeout: Automatically leaves the channel after a period of inactivity

Configuration

{
  "voice": {
    "assistant": {
      "enabled": true,
      "wakeWord": "hey vex",
      "wakeAliases": ["vex", "hey bot"],
      "respondWithText": false,
      "idleTimeout": 300,
      "maxHistoryTurns": 10
    }
  }
}

ℹ️ How It Works

The assistant continuously listens while undeafened. When it detects the wake word in transcribed speech, it captures the rest of the utterance, sends it through the full LLM pipeline (with tools), and streams the response back as TTS audio.

Self-Hosted Setup

Run the entire voice stack locally with zero API costs:

Component	Recommended	Default Port
TTS	Kokoro, runs on CPU	`8880`
STT	faster-whisper, needs GPU for real-time	`8000`

⚠️ Hardware Note

Kokoro TTS runs comfortably on CPU. However, faster-whisper requires a GPU (NVIDIA recommended) for real-time transcription performance. CPU-only transcription is possible but may introduce noticeable latency.

💡 Zero-Cost Voice Stack

Set voice.tts.provider and voice.stt.provider to "local", point each baseUrl at your local service, and you have a fully functional voice system with no API keys and no ongoing costs.