Voice & Audio

Voice & Audio

VexAI provides comprehensive voice and audio capabilities: text-to-speech, speech-to-text, voice channels, meeting transcription, and an interactive voice assistant.

Overview

The voice system is built around a self-hosted-first philosophy: every provider supports a local option that requires only a baseUrl and no API keys. Cloud providers (OpenAI, Google, ElevenLabs) are also fully supported for users who prefer managed services.

๐Ÿ’ก Self-Hosted First

You can run the entire voice stack at zero cost using Kokoro TTS + faster-whisper STT locally. No API keys needed. Just point baseUrl at your local service.

Text-to-Speech (TTS)

Generate spoken audio from text and send it to any Discord channel. The TTS system is a standalone tool available to the bot in all contexts.

Providers

ProviderKey RequiredModelsNotes
openai Yes tts-1, tts-1-hd High quality ยท Voices: alloy, echo, fable, onyx, nova, shimmer
google Yes - Google Cloud Text-to-Speech
elevenlabs Yes - Premium voice cloning and synthesis
local No Any Kokoro, Piper, AllTalk, or any OpenAI-compatible /v1/audio/speech endpoint

All providers support a custom baseUrl for self-hosted or proxied endpoints.

Configuration

{
  "voice": {
    "tts": {
      "provider": "local",
      "baseUrl": "http://localhost:8880",
      "voice": "af_heart",
      "model": "kokoro"
    }
  }
}

Speech-to-Text (STT)

Transcribe voice messages, audio files, and audio from any URL. Automatic format fallback converts OGG/Opus to WAV via ffmpeg when the STT provider requires it.

Providers

ProviderKey RequiredNotes
whisper-api Yes OpenAI Whisper API
google Yes Google Cloud Speech-to-Text
local No faster-whisper, whisper.cpp, or any OpenAI-compatible /v1/audio/transcriptions endpoint
โ„น๏ธ Format Handling

Discord voice messages use OGG/Opus. If your STT provider doesn't support Opus natively, VexAI automatically converts to WAV using ffmpeg. Make sure ffmpeg is installed in your environment.

Voice Channels

VexAI can join and interact in Discord voice channels with full audio capabilities:

  • Join & Leave: Connect to any voice channel the bot has access to
  • Speak: Stream TTS audio directly into the voice channel
  • Listen & Transcribe: Capture and transcribe user speech in real-time
  • Always Unmuted: The bot is always unmuted in voice; deafen/undeafen controls listening mode
  • Chunked Streaming: Long speech is chunked to prevent audio cutoff

Meeting Mode

Have VexAI join a voice channel and transcribe everything that's said. When the meeting ends, the bot produces a full transcript.

  • Join a voice channel and transcribe all participants
  • Produces a complete transcript when the meeting ends
  • Useful for keeping records of voice discussions, standups, and planning sessions
๐Ÿ’ก Tip

Meeting transcripts include speaker attribution; each segment is tagged with the Discord user who spoke it.

Voice Assistant

The crown jewel of the voice system: a fully interactive voice assistant that lives in your Discord voice channel.

  • Wake Word Detection: Activates on "hey vex" (configurable)
  • Wake Aliases: Define additional trigger phrases
  • Optional Text Responses: Mirror voice replies to a text channel
  • Conversation History: Maintains context across turns (configurable max)
  • Idle Timeout: Automatically leaves the channel after a period of inactivity

Configuration

{
  "voice": {
    "assistant": {
      "enabled": true,
      "wakeWord": "hey vex",
      "wakeAliases": ["vex", "hey bot"],
      "respondWithText": false,
      "idleTimeout": 300,
      "maxHistoryTurns": 10
    }
  }
}
โ„น๏ธ How It Works

The assistant continuously listens while undeafened. When it detects the wake word in transcribed speech, it captures the rest of the utterance, sends it through the full LLM pipeline (with tools), and streams the response back as TTS audio.

Self-Hosted Setup

Run the entire voice stack locally with zero API costs:

ComponentRecommendedDefault Port
TTSKokoro, runs on CPU8880
STTfaster-whisper, needs GPU for real-time8000
โš ๏ธ Hardware Note

Kokoro TTS runs comfortably on CPU. However, faster-whisper requires a GPU (NVIDIA recommended) for real-time transcription performance. CPU-only transcription is possible but may introduce noticeable latency.

๐Ÿ’ก Zero-Cost Voice Stack

Set voice.tts.provider and voice.stt.provider to "local", point each baseUrl at your local service, and you have a fully functional voice system with no API keys and no ongoing costs.