Voice AI Stack (ASR, STT, LLM, TTS)
The voice AI stack is a pipeline of four components: ASR/STT (speech to text), NLU/LLM (language understanding), TTS (text to speech), and the orchestration layer that glues them together in real time.
The voice AI stack is the pipeline that turns a spoken conversation into an AI-powered one. It has four parts: ASR/STT converts the user's speech to text; the LLM (large language model) understands the intent, decides the response, and calls tools; TTS converts the LLM's text response back to speech; and the orchestration layer glues it all together in real time, handling turn detection, interruptions, barge-in, and the sub-800ms latency budget required for natural conversation.
ASR / STT — Automatic Speech Recognition
Also called Speech-to-Text. Converts audio into text tokens as the user speaks. Leading providers for Indian languages include Sarvam Saaras and Deepgram Nova. Latency target: <200ms.
LLM — Large Language Model
The reasoning brain. Takes the transcribed text, applies your system prompt, retrieves context from your knowledge base (RAG), decides what to say, and optionally calls tools. Examples: GPT-4o, Claude Sonnet, Groq GPT-OSS, Sarvam-M. Latency target: <400ms to first token.
TTS — Text-to-Speech
Converts the LLM's text reply into audio. Leading providers: ElevenLabs, Cartesia, Sarvam Bulbul. Streaming TTS starts playing audio before the full text is generated. Latency target: <300ms to first audio byte.
Orchestration layer
This is where voice AI gets hard. The orchestration layer handles voice activity detection (is the user still speaking?), turn detection (when to start responding), interruptions (when the user cuts in mid-reply), barge-in, and the whole concurrency and state management of a real-time conversation. ThinnestAI provides this layer, built on LiveKit's real-time engine underneath.
More definitions
A voice AI agent is an AI-powered system that has real-time spoken conversations — over a phone call, a web widget or a SIP trunk — using speech recognition, a language model and speech synthesis.
Voice AI is the umbrella term for AI systems that understand and generate human speech in real time — powering voice assistants, phone agents, voice chatbots and real-time translation.
Conversational AI is the category of AI systems designed to interact with humans in natural language, across chat, voice, email and messaging — using NLU, LLMs and tool-calling to hold multi-turn conversations that actually accomplish work.
IVR is a rigid scripted decision tree (press 1 for sales). Voice AI is a natural-language agent that understands free-form speech, uses LLM reasoning, and calls tools to take real actions.
BYOK means you bring your own API keys for the LLM, STT and TTS providers, and the voice AI platform routes usage through your accounts instead of bundling the provider costs into its own pricing.
BYON means you bring your own phone number — via a Twilio, Vobiz or Exotel account — and connect it to the voice AI platform via SIP, instead of renting a number from the platform itself.
