Built where voice AI actually breaks.
The Indian phone call — 8 kHz, packet loss, Hinglish, accents the training data never saw — breaks most voice stacks. We rebuilt the stack around it.
Method & system for low-latency, code-switched conversational AI over low-bitrate telephony.
Filed with the Indian Patent Office. Covers end-to-end: the audio-native inference path, the code-switch recognition head, and the telephony-first acoustic conditioning that makes it work on an 8 kHz line.
Six layers. Each one rebuilt.
Where off-the-shelf pieces cost us latency, fidelity, or compliance footing, we built the layer ourselves. This is the full path from the PSTN trunk to the agent response.
Edge telephony
Real-time audio pipe
Audio-native inference (patent core)
Knowledge & memory
Orchestration
Observability & trust
Six problems we had to solve ourselves.
None of these were available off the shelf. Each one is either patented, patent-pending, or a production-only technique you cannot buy.
Audio-native inference on the phone
Most voice AIs stitch three models: speech-to-text, language model, text-to-speech. Each hop adds latency and discards paralinguistic signal (hesitation, laughter, urgency). Our model reasons directly in the audio domain — so the response carries the tone of the input, not a reconstruction of it.
Code-switching in the acoustic path
Detecting "mujhe EMI ka balance chahiye" is not a translation problem — it is an acoustic one. The language signal lives in prosody and phoneme patterns, not punctuation. We handle code-switches inside the acoustic encoder, not after transcription.
Telephony-first conditioning
Open-source voice models are trained on studio audio. Phones send 8 kHz, lossy, jitter-ridden audio over cellular. We fine-tuned on >2M hours of real Indian phone calls — so the model is robust to exactly the conditions it will meet in production.
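The conditioning idea can be illustrated with a toy channel simulator: band-limit clean studio audio to 8 kHz, then zero out random frames to mimic packet loss and jitter. Everything here (the function name, the crude decimation without a low-pass filter, the 2% drop rate) is an illustrative sketch, not the production augmentation pipeline.

```python
import numpy as np

def simulate_phone_channel(audio: np.ndarray, sr: int = 16000,
                           drop_prob: float = 0.02,
                           frame_ms: int = 20,
                           seed: int = 0) -> np.ndarray:
    """Degrade clean audio toward telephony conditions."""
    rng = np.random.default_rng(seed)
    # Crude decimation to 8 kHz (a real pipeline would low-pass filter first).
    factor = sr // 8000
    narrow = audio[::factor].copy()
    # Zero out ~2% of 20 ms frames to mimic packet loss on a cellular leg.
    frame = 8000 * frame_ms // 1000  # samples per frame at 8 kHz
    for start in range(0, len(narrow) - frame, frame):
        if rng.random() < drop_prob:
            narrow[start:start + frame] = 0.0
    return narrow

# One second of a 440 Hz tone standing in for studio speech.
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
phone = simulate_phone_channel(clean)
print(len(phone))  # 8000: one second at 8 kHz
```

Training on audio degraded like this, rather than on the clean original, is what makes a model robust to the line it will actually hear.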
Tiered RAG inside the latency budget
A 10,000-document knowledge base cannot be searched inside a 700ms budget the naive way. We use summary-first retrieval: the model queries compressed chunk summaries, then pulls full chunks only on demand. Average retrieval overhead: under 90ms.
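The summary-first shape of that retrieval can be sketched in a few lines. This is a minimal illustration with toy lexical overlap standing in for a real vector index; the chunk contents, field names, and scoring function are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    summary: str    # compressed representation, searched first (tier 1)
    full_text: str  # pulled only for the top survivors (tier 2)

def score(query: str, text: str) -> int:
    # Toy relevance: word overlap. Production would use an embedding index.
    return len(set(query.lower().split()) & set(text.lower().split()))

def tiered_retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[str]:
    # Tier 1: rank every chunk by its short summary only.
    ranked = sorted(chunks, key=lambda c: score(query, c.summary), reverse=True)
    # Tier 2: fetch full text only for the top-k, keeping the hot path small.
    return [c.full_text for c in ranked[:k]]

kb = [
    Chunk("emi",  "EMI balance and due dates",   "Your EMI balance is read from the loan ledger ..."),
    Chunk("kyc",  "KYC document requirements",   "KYC needs PAN and address proof ..."),
    Chunk("card", "credit card limit queries",   "Card limit queries route to ..."),
]
print(tiered_retrieve("what is my EMI balance", kb, k=1))
```

The latency win comes from searching only the compressed tier: full chunks never enter the ranking step, so the cost of the hot path scales with summary size, not document size.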
Interruptible generation with barge-in
A human says "stop, stop" — the AI must hear it and actually stop talking. We stream generation in sub-100ms frames with a side-channel VAD on the caller. The moment the caller speaks, the output halts and new audio enters the context.
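The control flow reduces to a frame loop with a VAD check before every emit. A minimal sketch, assuming a callable VAD on the caller's side channel; the frame strings and the toy VAD below are stand-ins, not the streaming implementation.

```python
def stream_with_barge_in(frames, caller_is_speaking):
    """Emit output frames one at a time; halt the moment the
    side-channel VAD reports caller speech (barge-in)."""
    emitted = []
    for frame in frames:
        if caller_is_speaking():   # checked before every sub-100ms frame
            break                  # stop talking; caller audio re-enters context
        emitted.append(frame)
    return emitted

# Toy VAD: the caller starts speaking after the third frame.
clock = {"t": 0}
def vad():
    clock["t"] += 1
    return clock["t"] > 3

out = stream_with_barge_in([f"frame{i}" for i in range(10)], vad)
print(out)  # ['frame0', 'frame1', 'frame2']
```

Because the check sits inside the frame loop rather than between utterances, worst-case reaction time is bounded by one frame, not by the length of the sentence being spoken.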
Compliance guardrails at runtime
Regulatory constraints (RBI Fair Practices, DPDP, DND, time-of-day rules) are enforced at the dispatch layer, not through a post-hoc audit. A call to a blocklisted number, or one placed after 8 pm, never reaches the dialer, by policy.
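A dispatch-layer gate of that kind can be sketched as a pure policy check that runs before the dialer is ever invoked. The blocklist contents, the 9 am start of the window, and the function name are illustrative assumptions; only the "never reaches the dialer" shape is the point.

```python
from datetime import time

BLOCKLIST = {"+911234567890"}             # DND / opt-out numbers (illustrative)
CALL_WINDOW = (time(9, 0), time(20, 0))   # no outbound calls after 8 pm

def may_dispatch(number: str, now: time) -> bool:
    """Policy gate evaluated before the call is handed to the dialer."""
    if number in BLOCKLIST:
        return False
    start, end = CALL_WINDOW
    return start <= now < end

print(may_dispatch("+919999999999", time(19, 59)))  # True: inside the window
print(may_dispatch("+919999999999", time(20, 1)))   # False: after 8 pm
print(may_dispatch("+911234567890", time(12, 0)))   # False: blocklisted
```

Putting the rule in the dispatch path means a violation is structurally impossible rather than merely detectable after the fact.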
Measured, not marketed.
Every number comes from production traffic or a public benchmark methodology, and we share that methodology on request.
Indian data. Indian infrastructure. Indian accountability.
Enterprises regulated by the RBI, IRDAI, or SEBI cannot send customer conversation data to foreign inference endpoints. Neither can DPDP-compliant operators. We built for that reality from day one.
Live in 5 minutes. Really.
Sign up, describe your agent in one sentence, test-call your own phone. If you're not live inside 5 minutes, we'll credit your account.