Native speech-to-speech voice agents
One model. Audio in, audio out. No cascading STT → LLM → TTS hops. ThinnestAI ships both major S2S providers — Gemini Live and OpenAI Realtime — with Indian phone numbers, flow editor, BYOK options and INR billing on top.
Gemini Live
Google's native-audio speech-to-speech model — one network round-trip from caller audio to agent reply, 30 voices, code-switching across Indian languages.
OpenAI Realtime
OpenAI's GPT-4o-class speech-to-speech model — premium reasoning, expressive voices, and the strongest English voice agent in production today.
What changes with speech-to-speech
Cascaded pipeline (STT → LLM → TTS):
- Three separate models, three network hops
- Typical end-to-end latency: 900–1,500ms
- Best Indic language quality (Sarvam + Aero + Deepgram)
- Maximum customization per layer
- Predictable, deterministic for tool-heavy flows

Native speech-to-speech:
- Audio in → audio out, one round-trip
- Typical end-to-end latency: 600–900ms
- Better cross-language code-switching (Hinglish)
- More natural prosody, emotional consistency
- Lower per-call infra overhead
How S2S on ThinnestAI compares to Indian competitors
Sarvam and Gnani are India's strongest voice AI stacks, but both ship cascaded pipelines — they don't offer native speech-to-speech. ThinnestAI is the platform that gives you the S2S option (Gemini Live, OpenAI Realtime) and the cascaded Indic-first option (Sarvam, Aero, Deepgram, ElevenLabs) on the same flow editor.
| Capability | ThinnestAI (S2S) | Sarvam (cascaded) | Gnani (cascaded) |
|---|---|---|---|
| Native speech-to-speech | Yes — Gemini Live + OpenAI Realtime | No — STT + LLM + TTS pipeline | No — proprietary cascaded stack |
| End-to-end latency | 600–900ms | 900–1200ms | 800–1100ms |
| Hindi / Hinglish code-switching | Native via Gemini Live | Native via Sarvam-M | Native via Gnani LLM |
| Best-in-market Indic language quality | Available via cascaded option | Yes (Sarvam Bulbul + Saaras) | Yes (Gnani in-house) |
| Indian phone numbers | Yes — Vobiz, Twilio, Plivo, Exotel | BYO carrier | Yes — Gnani-managed |
| INR billing + GST invoice | Yes | Yes | Yes |
| BYO model API key | Yes — Gemini, OpenAI, Azure OpenAI | No — Sarvam-hosted only | No — Gnani-hosted only |
| No-code flow editor | Yes — drag-and-drop, branching, tools | No native flow editor | Yes — Gnani Studio |
| Pricing transparency | Public ₹/min, public model pricing | Public ₹/min | Enterprise quote |
Try speech-to-speech today
Sign up free, pick Gemini Live or OpenAI Realtime from the model dropdown, and dial out from an Indian number in under 5 minutes. Welcome credits — no card required.
Frequently asked questions
What is speech-to-speech and how is it different from a cascaded voice pipeline?
A speech-to-speech (S2S) model takes audio in and produces audio out in a single forward pass — no separate STT, LLM and TTS hops. Cascaded pipelines stitch three models together. S2S typically lands at 600–900ms end-to-end latency vs 900–1,500ms for cascaded, with better cross-language code-switching and more natural prosody, at the cost of less fine-grained per-layer control.
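The latency difference falls out of the call shape. A minimal sketch of the two turn budgets — the per-stage numbers below are illustrative values chosen to fall inside the typical ranges quoted above, not measured ThinnestAI figures:

```python
# Hypothetical per-stage latencies in ms (illustrative, within the
# typical ranges quoted above -- not measured production numbers).
STT_MS, LLM_MS, TTS_MS = 250, 400, 350   # cascaded stages
S2S_MS = 650                             # single forward pass

def cascaded_turn_latency(hop_ms: int = 100) -> int:
    # Three models, three extra network hops: STT -> LLM -> TTS
    return STT_MS + LLM_MS + TTS_MS + 3 * hop_ms

def s2s_turn_latency(hop_ms: int = 100) -> int:
    # One model, one round-trip: audio in -> audio out
    return S2S_MS + hop_ms

print(cascaded_turn_latency())  # 1300
print(s2s_turn_latency())       # 750
```

The cascaded total lands in the 900–1,500ms band and the S2S total in the 600–900ms band; the three extra network hops alone account for a few hundred milliseconds before any model inference starts.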
Which speech-to-speech models does ThinnestAI support?
Two production-grade S2S providers: Google Gemini Live (Gemini 2.5 Flash native audio + Gemini 3.1 Flash Live preview) and OpenAI Realtime (gpt-4o-realtime and gpt-realtime). You can pick either from the model dropdown in the flow editor and swap between them without re-platforming.
How much does speech-to-speech cost on ThinnestAI?
All-in pricing including the ThinnestAI platform fee (₹1.5/min), telephony and model usage: Gemini Live runs ~₹4/min, OpenAI Realtime runs ~₹22/min. Model usage is billed at the official provider rates ($3/M audio input, $12/M audio output for Gemini Live; $32/M audio input, $64/M audio output for OpenAI Realtime). No minimums, no commitments — pay only for what you use.
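The per-minute figure can be sanity-checked from the per-token rates. The audio-token rate (~32 tokens/sec), the USD/INR rate, and the 50% agent talk-time below are all assumptions for illustration, not official provider figures:

```python
def model_cost_per_min_inr(in_usd_per_m: float, out_usd_per_m: float,
                           in_tokens_per_s: float, out_tokens_per_s: float,
                           usd_to_inr: float = 88.0) -> float:
    """Estimate model cost in INR per call-minute from per-token pricing.

    Assumes caller audio streams for the full minute and the agent
    speaks roughly half of it -- both assumptions, not measured values.
    """
    in_tokens = in_tokens_per_s * 60
    out_tokens = out_tokens_per_s * 30  # agent talks ~half the minute
    usd = (in_tokens * in_usd_per_m + out_tokens * out_usd_per_m) / 1_000_000
    return usd * usd_to_inr

# Gemini Live at the rates quoted above ($3/M audio in, $12/M audio out),
# assuming ~32 audio tokens/sec (an assumption about the tokenizer):
print(round(model_cost_per_min_inr(3, 12, 32, 32), 2))  # 1.52
```

Roughly ₹1.5/min of model usage; the ₹1.5/min platform fee plus telephony brings the all-in figure to approximately the quoted ~₹4/min. Plug in current provider rates before relying on any estimate.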
Can I run speech-to-speech over real Indian phone numbers?
Yes. ThinnestAI ships Indian phone numbers via Vobiz, Twilio, Plivo and Exotel, and S2S models route through the same SIP/LiveKit telephony stack as our cascaded pipeline. Inbound, outbound and bidirectional flows all work the same way regardless of model type.
Can I bring my own Gemini or OpenAI API key (BYOK)?
Yes — BYOK is supported for Gemini (Google AI Studio + Vertex AI), OpenAI (direct API + Azure OpenAI) and most other LLM providers. When you bring your own key, the platform fee still applies (₹1.5/min) but model usage is billed to your own provider account, which lets you use enterprise credits, prepaid commitments or compliance-restricted hosting like Azure OpenAI Mumbai.
Is speech-to-speech good for Hindi and Hinglish?
Gemini Live handles Hindi and Hinglish code-switching natively and is now genuinely competitive with cascaded Indic-first stacks for everyday conversational quality. For lower-resource Indic languages (Bhojpuri, Awadhi, Maithili, Konkani) or maximum control over Indian-language quality, a cascaded path with Sarvam STT/LLM/TTS or Aero TTS still wins — and you can run both side by side on the same flow.
Do I get a free trial without a credit card?
Yes. Sign up and you get welcome credits — no card required. That's enough to test both Gemini Live and OpenAI Realtime end-to-end, including outbound calls from an Indian number. After the trial, you can stay on PAYG (top-up wallet, billed per actual usage) or move to Enterprise for committed-use discounts and GST invoicing.
How does S2S latency compare to a cascaded stack in production?
On ThinnestAI's production traffic: Gemini Live averages ~600–700ms first-audio-out, OpenAI Realtime ~800–900ms. A well-tuned cascaded stack (Deepgram nova-3 + Groq Llama + Aero TTS) lands around 900–1,100ms. S2S wins on latency by skipping the STT and TTS network hops, but cascaded wins when you need deterministic tool-call gating or per-layer model swaps.
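The "deterministic tool-call gating" point is worth unpacking: in a cascaded pipeline the transcript exists as text between the STT and LLM stages, so sensitive intents can be routed by rule before any model is consulted. A minimal sketch with hypothetical route names (not the ThinnestAI API):

```python
import re

# Hypothetical deterministic gate: because a cascaded pipeline holds the
# transcript as text between STT and LLM, a rule can intercept certain
# intents with 100% reliability instead of trusting the model to call
# the right tool. An S2S model never exposes this text checkpoint.
TRANSFER_PATTERN = re.compile(r"\b(agent|human|representative)\b", re.I)

def route_turn(transcript: str) -> str:
    if TRANSFER_PATTERN.search(transcript):
        return "tool:transfer_to_human"   # fires every time, by rule
    return "llm:generate_reply"           # otherwise let the model answer

print(route_turn("I want to talk to a human"))  # tool:transfer_to_human
print(route_turn("What's my order status?"))    # llm:generate_reply
```

With an S2S model the equivalent behaviour has to go through the model's own tool-calling, which is probabilistic — hence the trade-off between latency and determinism described above.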
