Backed by the Deepgram Startup Program
Speech-to-Speech

Native speech-to-speech
voice agents

One model. Audio in, audio out. No cascading STT → LLM → TTS hops. ThinnestAI ships both major S2S providers — Gemini Live and OpenAI Realtime — with Indian phone numbers, flow editor, BYOK options and INR billing on top.

What changes with speech-to-speech

Cascaded (STT → LLM → TTS)
  • Three separate models, three network hops
  • Typical end-to-end latency: 900–1,500ms
  • Best Indic language quality (Sarvam + Aero + Deepgram)
  • Maximum customization per layer
  • Predictable, deterministic for tool-heavy flows
Native S2S (one model)
  • Audio in → audio out, one round-trip
  • Typical end-to-end latency: 600–900ms
  • Better cross-language code-switching (Hinglish)
  • More natural prosody, emotional consistency
  • Lower per-call infra overhead
Our honest take: S2S is the right call when latency, naturalness and cross-language code-switching matter more than maximum control. For Hindi/Hinglish workloads, Gemini Live is now genuinely better than a cascaded path on most metrics. For premium English with deep tool use, OpenAI Realtime is the strongest model in production. For low-support Indic languages (Bhojpuri, Awadhi, Maithili, Konkani), a cascaded Sarvam path still wins. ThinnestAI lets you ship all three side by side without re-platforming.

How S2S on ThinnestAI compares to Indian competitors

Sarvam and Gnani are India's strongest voice AI stacks, but both ship cascaded pipelines — they don't offer native speech-to-speech. ThinnestAI is the platform that gives you the S2S option (Gemini Live, OpenAI Realtime) and the cascaded Indic-first option (Sarvam, Aero, Deepgram, ElevenLabs) on the same flow editor.

| Capability | ThinnestAI (S2S) | Sarvam (cascaded) | Gnani (cascaded) |
| --- | --- | --- | --- |
| Native speech-to-speech | Yes — Gemini Live + OpenAI Realtime | No — STT + LLM + TTS pipeline | No — proprietary cascaded stack |
| End-to-end latency | 600–800ms | 900–1,200ms | 800–1,100ms |
| Hindi / Hinglish code-switching | Native via Gemini Live | Native via Sarvam-M | Native via Gnani LLM |
| Best-in-market Indic language quality | Available via cascaded option | Yes (Sarvam Bulbul + Saaras) | Yes (Gnani in-house) |
| Indian phone numbers | Yes — Vobiz, Twilio, Plivo, Exotel | BYO carrier | Yes — Gnani-managed |
| INR billing + GST invoice | Yes | Yes | Yes |
| BYO model API key | Yes — Gemini, OpenAI, Azure OpenAI | No — Sarvam-hosted only | No — Gnani-hosted only |
| No-code flow editor | Yes — drag-and-drop, branching, tools | No native flow editor | Yes — Gnani Studio |
| Pricing transparency | Public ₹/min, public model pricing | Public ₹/min | Enterprise quote |

Try speech-to-speech today

Sign up free, pick Gemini Live or OpenAI Realtime from the model dropdown, and dial out from an Indian number in under 5 minutes. Welcome credits included; no card required.

Frequently asked questions

What is speech-to-speech and how is it different from a cascaded voice pipeline?


A speech-to-speech (S2S) model takes audio in and produces audio out in a single forward pass — no separate STT, LLM and TTS hops. Cascaded pipelines stitch three models together. S2S typically lands at 600–900ms end-to-end latency vs 900–1,500ms for cascaded, with better cross-language code-switching and more natural prosody, at the cost of less fine-grained per-layer control.

Which speech-to-speech models does ThinnestAI support?


Two production-grade S2S providers: Google Gemini Live (Gemini 2.5 Flash native audio + Gemini 3.1 Flash Live preview) and OpenAI Realtime (gpt-4o-realtime and gpt-realtime). You can pick either from the model dropdown in the flow editor and swap between them without re-platforming.
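If you later want to talk to the provider directly (for example with a BYO key outside the flow editor), a minimal sketch of the connection parameters for OpenAI's Realtime API looks like the following. `realtime_connection` is a hypothetical helper name; the WebSocket endpoint and headers follow OpenAI's published Realtime docs, and the model string is whichever realtime model you select:

```python
# Sketch: build the WebSocket URL and headers for OpenAI's Realtime API.
# Endpoint and header names per OpenAI's public docs; swap the model
# string to move between realtime model variants.
def realtime_connection(model: str, api_key: str):
    url = f"wss://api.openai.com/v1/realtime?model={model}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers
```

Pass the returned URL and headers to any WebSocket client; on ThinnestAI the platform handles this hop for you, so this only matters for custom integrations.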

How much does speech-to-speech cost on ThinnestAI?


All-in pricing including the ThinnestAI platform fee (₹1.5/min), telephony and model usage: Gemini Live runs ~₹4/min, OpenAI Realtime runs ~₹22/min. Model usage is billed at the official provider rates ($3/M audio input, $12/M audio output for Gemini Live; $32/M audio input, $64/M audio output for OpenAI Realtime). No minimums, no commitments — pay only for what you use.
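To see how the ~₹4/min figure decomposes, here is a back-of-envelope cost model. The platform fee and the $/M token rates come from the answer above; the telephony charge, audio tokens per minute, and USD→INR rate are illustrative assumptions, not published numbers:

```python
# Rough per-minute cost: platform fee + telephony + model audio tokens
# billed at provider rates. Token throughput and FX rate below are
# illustrative assumptions.
def per_minute_cost_inr(platform_fee_inr, telephony_inr,
                        in_tokens_per_min, out_tokens_per_min,
                        usd_per_m_in, usd_per_m_out, usd_to_inr=84.0):
    model_usd = (in_tokens_per_min / 1e6) * usd_per_m_in \
              + (out_tokens_per_min / 1e6) * usd_per_m_out
    return platform_fee_inr + telephony_inr + model_usd * usd_to_inr

# Gemini Live at $3/M audio in, $12/M audio out, with assumed
# ~1,600 audio tokens per minute each way and ₹0.5/min telephony:
gemini = per_minute_cost_inr(1.5, 0.5, 1600, 1600, 3, 12)  # ≈ ₹4/min
```

Swapping in OpenAI Realtime's $32/$64 per-million rates in place of $3/$12 shows why its all-in price lands an order of magnitude higher.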

Can I run speech-to-speech over real Indian phone numbers?


Yes. ThinnestAI ships Indian phone numbers via Vobiz, Twilio, Plivo and Exotel, and S2S models route through the same SIP/LiveKit telephony stack as our cascaded pipeline. Inbound, outbound and bidirectional flows all work the same way regardless of model type.

Can I bring my own Gemini or OpenAI API key (BYOK)?


Yes — BYOK is supported for Gemini (Google AI Studio + Vertex AI), OpenAI (direct API + Azure OpenAI) and most other LLM providers. When you bring your own key, the platform fee still applies (₹1.5/min) but model usage is billed to your own provider account, which lets you use enterprise credits, prepaid commitments or compliance-restricted hosting like Azure OpenAI Mumbai.

Is speech-to-speech good for Hindi and Hinglish?


Gemini Live handles Hindi and Hinglish code-switching natively and is now genuinely competitive with cascaded Indic-first stacks for everyday conversational quality. For low-support Indic languages (Bhojpuri, Awadhi, Maithili, Konkani) or maximum control over Indian-language quality, a cascaded path with Sarvam STT/LLM/TTS or Aero TTS still wins — and you can run both side by side on the same flow.

Do I get a free trial without a credit card?


Yes. Sign up and you get welcome credits — no card required. That's enough to test both Gemini Live and OpenAI Realtime end-to-end, including outbound calls from an Indian number. After the trial, you can stay on PAYG (top-up wallet, billed per actual usage) or move to Enterprise for committed-use discounts and GST invoicing.

How does S2S latency compare to a cascaded stack in production?


On ThinnestAI's production traffic: Gemini Live averages ~600–700ms first-audio-out, OpenAI Realtime ~800–900ms. A well-tuned cascaded stack (Deepgram nova-3 + Groq Llama + Aero TTS) lands around 900–1,100ms. S2S wins on latency by skipping the STT and TTS network hops, but cascaded wins when you need deterministic tool-call gating or per-layer model swaps.
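The latency gap is easiest to see as a per-hop budget. The numbers below are illustrative assumptions (not measurements) chosen to land inside the ranges quoted above; the structural point is that the cascaded path pays STT-final, LLM first-token, and TTS first-audio stages that S2S collapses into one:

```python
# Back-of-envelope latency budgets in milliseconds. Per-hop figures are
# illustrative assumptions; the cascaded path sums three model stages
# plus extra network hops, while S2S is one round-trip.
def pipeline_latency_ms(hops: dict) -> int:
    return sum(hops.values())

cascaded = {"stt_final": 250, "llm_first_token": 350,
            "tts_first_audio": 200, "extra_network_hops": 150}
s2s = {"model_first_audio": 550, "network": 100}

print(pipeline_latency_ms(cascaded), pipeline_latency_ms(s2s))
```

Tightening any one cascaded stage helps, but the S2S path removes two of the three model hops outright, which is why it wins on first-audio-out even when individual cascaded components are fast.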