Why Arrowhead Is Building Its Own Small Language Model

Chapter 1: The constraints that forced our hand

A call to a customer in Delhi

Picture the workload we actually ship. An outbound collections call goes out to a customer in Delhi. The agent has to:

  • Speak Hinglish naturally, switching mid-clause without sounding like a translation engine.
  • Complete a turn end-to-end in under a second, every time, including the tail of the distribution — not just the median.
  • Look up the customer's outstanding balance via a tool call, never hallucinate the number, and follow the negotiation rules our BFSI client signed off on.
  • Stay inside Indian borders. Every byte. Including inference.

We build a lot of conversations like this. And when we sat down and asked, honestly, which off-the-shelf model could serve a workload with all four constraints at once — the answer kept coming back zero.

This post is the long version of that answer. Subsequent chapters cover how we train, what data we use, and what we've learned. This one is just the why: the four constraints that, in combination, force us toward a self-hosted, fine-tuned small language model for a meaningful chunk of our customers.

For the customers without these constraints — and we have plenty of them — we happily ship SOTA frontier models from OpenAI, Anthropic, or Google. The point of this post isn't that SLMs are universally better. It's that the constraint set we live inside makes them the right answer for Indian BFSI voice AI, and the math gets clearer if you actually run the numbers.


Constraint 1: Data residency that actually means something

Most BFSI compliance teams in India read "data residency" and assume it's a binary. It isn't. The interesting question isn't whether the provider claims residency — it's what specifically stays in-region. There are two separate things:

  1. Data at rest — where the provider stores prompts, outputs, training data, and logs.
  2. Inference — where the actual GPU computation happens when you call the API.

Almost every major provider now offers at-rest residency in India. Almost none offers inference residency without strings attached. The strings matter.

Here's the actual picture as of May 2026 for frontier-class models with inference in India:

| Provider | Model | Inference in India? | Pricing model required |
|---|---|---|---|
| OpenAI direct | GPT-5.x / GPT-4.1 | No (inference is US/EU only) | n/a |
| Azure OpenAI Foundry, South India | GPT-4.1 / GPT-5 | Yes, via reserved PTU only | Provisioned throughput (fixed monthly commitment) |
| Azure OpenAI Foundry, South India | GPT-4.1 / GPT-5 (Global Standard PAYG) | No (Global Standard routes anywhere) | Pay-as-you-go but residency lost |
| Azure OpenAI Foundry, South India | gpt-realtime (voice) | Not available in any India region | n/a |
| Vertex AI, asia-south1 (Mumbai) | Gemini 2.5 Flash / Flash-Lite | Yes, PAYG | Pay-as-you-go |
| Vertex AI, asia-south1 | Gemini 2.5 Pro / 3.x Flash / 3.x Pro | Not GA in asia-south1 | n/a |
| AWS Bedrock, ap-south-1 / ap-south-2 | Claude Opus/Sonnet/Haiku 4.5+ | No (Global cross-Region inference routes out) | n/a |

A few things jump off the page:

For pay-as-you-go BFSI voice in India, the ceiling is Gemini 2.5 Flash. Everything stronger than that requires either committing to reserved capacity (Azure PTU) or accepting that inference leaves the country. The "frontier models are available in India" story is true in the same way a Lamborghini is available — you can have it, but only on terms most customers can't accept.
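To make that reading of the table concrete, each row can be treated as a record and filtered on the two requirements. This is a minimal sketch: the `Offering` type, its field names, and the `in_country_options` helper are hypothetical, and the rows only restate the table above as of May 2026 rather than querying any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    provider: str
    model: str
    inference_in_india: bool
    pricing: str  # "payg", "reserved", or "n/a" (hypothetical labels)


# Rows restate the table above; nothing here is fetched from a provider API.
OFFERINGS = [
    Offering("OpenAI direct", "GPT-5.x / GPT-4.1", False, "n/a"),
    Offering("Azure OpenAI Foundry, South India", "GPT-4.1 / GPT-5 (PTU)", True, "reserved"),
    Offering("Azure OpenAI Foundry, South India", "GPT-4.1 / GPT-5 (Global Standard)", False, "payg"),
    Offering("Azure OpenAI Foundry, South India", "gpt-realtime (voice)", False, "n/a"),
    Offering("Vertex AI, asia-south1", "Gemini 2.5 Flash / Flash-Lite", True, "payg"),
    Offering("Vertex AI, asia-south1", "Gemini 2.5 Pro / 3.x Flash / 3.x Pro", False, "n/a"),
    Offering("AWS Bedrock, ap-south-1 / ap-south-2", "Claude Opus/Sonnet/Haiku 4.5+", False, "n/a"),
]


def in_country_options(offerings, allow_reserved):
    """Keep offerings whose inference stays in India under an acceptable pricing model."""
    allowed = {"payg", "reserved"} if allow_reserved else {"payg"}
    return [o for o in offerings if o.inference_in_india and o.pricing in allowed]


for option in in_country_options(OFFERINGS, allow_reserved=False):
    print(option.provider, "|", option.model)
```

With `allow_reserved=False` only the Vertex AI Flash row survives; allowing reserved capacity adds the Azure PTU row and nothing else.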

OpenAI's native voice model can't be used with Indian inference at all. gpt-realtime is in five regions worldwide as of this writing: Canada Central, Central US, East US 2, France Central, Sweden Central. There is no PTU path, no DataZone path, no regional Standard path. If your BFSI customer says "inference must stay in India" and you also want OpenAI-quality voice, the intersection is empty.

Even on Azure PTU, the economics are punishing for spiky workloads. PTU is a fixed monthly commitment regardless of utilization. BFSI traffic curves are notoriously spiky: heavy during working hours, near-zero overnight, with festival peaks and regulatory deadlines on top. Paying for 24×7 capacity to handle 8×5 traffic means the effective per-token cost is a multiple of what the published rate implies. PTU only breaks even against self-hosting if your QPS is both high and steady.
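The arithmetic behind the 24×7 versus 8×5 point can be sketched in a few lines. The simplifying assumption here is ours: capacity is fully used during 8 hours on 5 days and completely idle otherwise. Function names are illustrative.

```python
HOURS_PER_WEEK = 24 * 7


def utilization(busy_hours_per_day, busy_days_per_week):
    """Fraction of paid-for hours that actually serve traffic."""
    return busy_hours_per_day * busy_days_per_week / HOURS_PER_WEEK


def effective_cost_multiplier(util):
    """How much more each served token costs than the fully utilized rate."""
    if not 0 < util <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / util


u = utilization(8, 5)
print(round(u, 3), round(effective_cost_multiplier(u), 1))  # 0.238 4.2
```

This is an idealized figure under the stated assumption, not a measurement. Off-hours traffic that is low but nonzero pulls the multiple down.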

Anthropic Claude on Bedrock in India looks tempting but isn't residency-compliant. Bedrock's Indian access for Claude 4.5+ models is delivered via Global Cross-Region Inference, which by AWS's own documentation routes the inference request out of the source region. Storage stays in India; computation doesn't.

So if your customer's compliance team interprets "data must not leave India" the way most BFSI compliance teams do — strictly, end-to-end, including inference — the SOTA shelf is effectively empty for voice. That's the first constraint.


Constraint 2: Tail latency matters more than the median

Voice agents have a brutal latency budget. Human conversational turn-taking runs at ~200–500ms; anything above ~800ms feels unnatural; above 1.5s, the customer starts repeating themselves or hangs up. And in voice, the metric that matters isn't median latency, it's p90 and p99. A model that's fast 80% of the time and freezes for 4 seconds the other 20% is a model that loses 20% of your calls.
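A small sketch shows why the median hides this failure mode. The latency samples are made up to match the 80/20 example above, and `percentile` uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical turn latencies in seconds: 80 fast turns, 20 four-second stalls.
latencies = [0.4] * 80 + [4.0] * 20

for pct in (50, 90, 99):
    print(f"p{pct} = {percentile(latencies, pct)}s")
```

The median reports 0.4s and looks healthy, while p90 and p99 both report the 4-second stall.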

The frontier-cloud models we tested don't survive this filter consistently. Some data points:

  • Public benchmarks of Gemini 2.5 Flash with default settings show ~1.9 seconds time-to-first-token even on simple prompts (dev.to benchmark, Dec 2025).
  • In April 2026, a peer voice-AI team in India reported on Google's own developer forum ~2s TTFT with Gemini Flash versus ~150ms on Groq (Google AI Developer Forum).
  • Our own production measurements of Gemini 2.5 Flash put p90 TTFB consistently above 3 seconds during Indian business hours; this is the tail, not the average.

The natural objection is: just buy provisioned throughput. That's what GSUs (Google) and PTUs (Microsoft) exist for. And yes, dedicated capacity does flatten the latency curve substantially. But:

  • GSU pricing isn't a discount, it's a stability tax. Google's published math is that fully utilized GSU capacity costs roughly 80–95% of dynamic shared quota — so at best a 5–20% discount, and only if you sustain near-100% utilization. Anything less, and you're paying more for the same tokens.
  • Provisioned-throughput (PT) capacity is tied to a single model and region. You can't easily switch between Gemini Flash and Pro on the same commitment, and you eat the residency tradeoffs of whichever region you bought capacity in.
  • Spiky telephony workloads break the utilization assumption. Our concurrency curve isn't flat. Indian working hours plus seasonal regulatory pushes mean we'd be paying for 24×7 reserved capacity to cover 8×5 burst traffic. Effective per-token cost ends up 2–3× the headline number.
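The utilization argument in the bullets above reduces to one ratio. In this sketch, `full_util_cost_fraction` is the 80–95% figure quoted above, the utilization values are hypothetical, and the comparison assumes on-demand capacity could serve the same tokens.

```python
def reserved_vs_on_demand(full_util_cost_fraction, utilization):
    """Cost per served token on reserved capacity, relative to on-demand (1.0 = parity)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return full_util_cost_fraction / utilization


# Break-even utilization equals the cost fraction itself: below it, reserved costs more.
for fraction, util in [(0.80, 1.00), (0.80, 0.40), (0.95, 0.33)]:
    ratio = reserved_vs_on_demand(fraction, util)
    print(f"fraction={fraction:.2f} utilization={util:.2f} -> {ratio:.2f}x on-demand")
```

At roughly 33–40% utilization the ratio lands between about 2× and 3×, which is the range quoted above.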

Self-hosting an SLM on dedicated GPUs gives us something the API path can't: a flat, predictable latency line we control. Our p99 is closer to our p50 because we own the queue, the batching, and the placement. The same GPUs can serve different workloads when traffic shifts. And the cost structure scales with our hardware, not with someone else's pricing committee.


Constraint 3: Colloquialness is not on the leaderboard

This one is the most contested point we make, so let's be precise.

Indian languages aren't missing from frontier models — most handle Hindi or Tamil grammatically just fine. The problem is conversational register. Real customers don't speak in clean, single-language sentences. They code-switch mid-clause. They drop English nouns into Hindi grammar. They use politeness markers and hedges that don't translate. They say "sir, ek minute hold karo na, mai check karta hu" and expect an agent that responds in the same register, not one that translates the sentiment into formal Hindi.

The standard benchmarks don't measure this. None of MMLU, AIME, or BrowseComp tells you whether a model sounds like a human support agent or a translated press release. And the public Indian-language benchmarks (Sarvam's own benchmark suite included) are dominated by translated-from-English questions, which capture vocabulary and grammar but not turn-level register.

Where this leaves the comparison:

  • Gemini 3 Flash has genuinely strong Indian-language tokenization for a closed-source frontier model; credit where it's due. If we weren't constrained on residency and latency, it would be a serious contender.
  • Sarvam-105B and 30B explicitly target Indian languages and Hinglish, and Sarvam reports excellent performance on their own Indian-language pairwise benchmark. For broad multilingual Indian-language tasks, they're a genuinely strong open option.
  • Where our fine-tuned SLM wins is the narrow distribution we actually ship: short support-agent utterances, mid-clause code-switching at exactly the syntactic boundaries Indian speakers use, polite hedges, sector-specific BFSI vocabulary in Hinglish. We trained on this distribution. Nobody else did, because there's no public dataset that captures it at scale.

The honest framing is: we don't claim a general win on Indian languages. We claim a specific win on the distribution our customers actually produce, evaluated on transcripts our customers actually generate.

Example conversation: Arrowhead's fine-tuned model vs. Sarvam 105B, using the exact same system prompt

Arrowhead

Agent: Good evening जी, मैं Shivam बोल रहा हूँ SBI credit cards team से— क्या अभी एक minute बात कर सकते हैं?
User: ji bolein
Agent: Okay जी, thanks.
Agent: जी, मैं आपको SBI credit card के बारे में एक quick tip बताने के लिए call किया है— यह card आपके daily expenses पर extra cashback और rewards देता है, जो आपके lifestyle के साथ easily fit हो जाएगा।
Agent: Interest rate भी competitive है, और joining fees भी minimal।
Agent: क्या आप interested हैं इसमें?
User: haan boliye
Agent: Okay जी, details share करने से पहले— आपका naam क्या है?
User: Shrey
Agent: Okay Shrey, perfect.
Agent: आपकी city क्या है?
User: bengaluru
Agent: Okay Shrey, Bengaluru — got it.
Agent: आपकी details verify करके मैं call अपने senior को transfer कर रहा हूँ
Agent: वो आपको सारी details और next steps अच्छे से समझा देंगे।
Agent: एक moment hold कीजिये।
Agent: आपके time के लिए thank you, and have a great day.

Sarvam 105B

Agent: नमस्ते जी, मैं Shivam बोल रहा हूँ SBI credit cards team से। क्या अभी एक minute बात कर सकते हैं?
User: ji bolein
Agent: मैं आपको हमारे नए SBI credit card के बारे में बताना चाहता हूँ जो आपके खर्चों पर बेहतरीन rewards देता है। आपको इसमें interest है?
User: haan boliye
Agent: आपका नाम क्या है?
User: Shrey
Agent: आपका शहर क्या है?
User: bengaluru
Agent: Shrey जी, thank you. आपका नाम और शहर मिल गया है। हमारे senior team member जल्द ही आपसे बात करेंगे। Thank you.

This is what eval methodology that fits the workload looks like. Public benchmarks tell us how a model does on translated-from-English questions. Our eval tells us how a model does on the calls we'll actually take tomorrow morning. Those are different things.
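One simple signal a register-focused eval could track is how much an utterance mixes scripts and how often it switches. The sketch below is an illustration only, not the eval methodology behind the comparison above. It labels each whitespace-separated token by Unicode range, so it only detects code-switching when the Hindi is written in Devanagari. Romanized Hinglish such as "ji bolein" would read as all-Latin.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def token_script(token):
    """Classify a token as 'devanagari', 'latin', or 'other' (digits, punctuation)."""
    if DEVANAGARI.search(token):
        return "devanagari"
    if LATIN.search(token):
        return "latin"
    return "other"


def scripts_of(utterance):
    """Script label per token, skipping tokens with no letters."""
    labels = (token_script(t) for t in utterance.split())
    return [s for s in labels if s != "other"]


def latin_share(utterance):
    """Fraction of lettered tokens written in Latin script."""
    scripts = scripts_of(utterance)
    return scripts.count("latin") / len(scripts) if scripts else 0.0


def switch_points(utterance):
    """Number of adjacent token pairs where the script changes."""
    scripts = scripts_of(utterance)
    return sum(1 for a, b in zip(scripts, scripts[1:]) if a != b)


sample = "Okay जी, details share करने से पहले"
print(round(latin_share(sample), 2), switch_points(sample))
```

Script mixing is a proxy at best. It says nothing about whether the switches fall at natural syntactic boundaries, which still needs human or model-based judgment.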


Constraint 4: Reasoning is a luxury we can't afford on a phone call

The last constraint is the one most engineers underestimate until they try to ship voice in production.

Modern frontier models lean increasingly on test-time reasoning: chain-of-thought, extended thinking, multi-step deliberation. This is great for complex math and code. It's a disaster for voice. Reasoning models can spend anywhere from 2 to 30 seconds generating thinking tokens before they produce a single output token. There's no realistic way to put a reasoning model on a phone call.

Which forces us to the smaller, non-reasoning variants — Flash, Mini, Nano, Haiku. These hit the latency budget. But they have a different problem: they're more prompt-sensitive and more prone to drift. Pile on the rules a BFSI agent needs — KYC compliance, negotiation guardrails, escalation thresholds, scripting policies, multi-turn state — and the prompt grows to thousands of tokens. Long prompts on small models produce inconsistent behavior, schema-violating tool calls, and occasional hallucinations on numerical fields. Exactly what BFSI compliance teams reject in pre-prod review.

Fine-tuning collapses that complexity. Instead of stuffing 4,000 tokens of rules into every prompt and praying, we bake the behavior into the model weights. The runtime prompt becomes short, the behavior becomes consistent, tool-call schemas become deterministic, and the model gets faster because there's less input to process. We get most of the prompt-adherence benefit you'd expect from a much larger model, on hardware we control, at latency that fits inside a turn.

This is the part of the argument that holds even for customers who don't care about residency. If your workload has a strict latency budget, stable schema requirements, and high call volume, fine-tuning is the lever that lets a small model behave like a much larger one for your specific task. Cost per call drops; reliability goes up. The tradeoff is loss of generality — our model isn't writing poetry — but voice agents don't need to write poetry.
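Whatever model sits behind the agent, the two failure modes named above (schema-violating tool calls and hallucinated numbers) can be caught mechanically before anything reaches the caller. This is a minimal sketch under stated assumptions: the tool name `get_outstanding_balance`, its `customer_id` argument, and both helper functions are hypothetical, not our production schema.

```python
import json
import re

# Hypothetical schema: tool name -> required argument names and their types.
TOOL_SCHEMAS = {"get_outstanding_balance": {"customer_id": str}}


def validate_tool_call(raw):
    """Return a list of schema violations; an empty list means the call is well-formed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]
    name = call.get("name")
    schema = TOOL_SCHEMAS.get(name) if isinstance(name, str) else None
    if schema is None:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    errors = []
    for field, expected_type in schema.items():
        if field not in args:
            errors.append(f"missing argument: {field}")
        elif not isinstance(args[field], expected_type):
            errors.append(f"wrong type for {field}")
    errors += [f"unexpected argument: {key}" for key in args if key not in schema]
    return errors


def ungrounded_numbers(reply, allowed):
    """Numbers in the agent's reply that did not come from a tool result."""
    spoken = re.findall(r"\d+(?:\.\d+)?", reply.replace(",", ""))
    return [n for n in spoken if n not in allowed]


print(validate_tool_call('{"name": "get_outstanding_balance", "arguments": {"customer_id": 42}}'))
print(ungrounded_numbers("Your outstanding balance is 12,000 rupees", {"12500"}))
```

A guard like this complements fine-tuning rather than replacing it: it measures how often the model violates the contract, and it can block a reply that quotes a number the tool never returned.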


When SOTA is still the right answer

We're not zealots about this. The constraint set above is real for a chunk of our customers — Indian BFSI, voice, regulated. But plenty of our customers don't sit at that intersection, and for them the right answer is a frontier API.

The framework we actually use to decide, roughly:

flowchart TD
    A["Is end-to-end India inference residency required?"] -->|Yes| A1["Fine-tuned SLM, self-hosted in-region"]
    A -->|No| B["Is the latency budget under ~1 second p99?"]
    B -->|Yes| B1["Fine-tuned SLM or carefully tuned Flash with PTU"]
    B -->|No| C["Is this voice / phone-call style turn-taking?"]
    C -->|Yes| C1["Small non-reasoning model, fine-tuned or not"]
    C -->|No| D["Is domain logic complex enough that prompt adherence is the bottleneck? BFSI, healthcare, legal workflows"]
    D -->|Yes| D1["Fine-tuned SLM, or large model with heavy eval"]
    D -->|No| D2["SOTA frontier API. Use the best model available."]
    classDef decision fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000
    classDef outcome fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px,color:#000
    class A,B,C,D decision
    class A1,B1,C1,D1,D2 outcome
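The same decision tree can be written as a function, which makes the order of the checks explicit. This is a sketch: the function and parameter names are illustrative, and the 1.0-second threshold is a literal reading of the "~1 second" in the chart.

```python
def choose_serving_path(residency_required, p99_budget_s, voice_turn_taking, adherence_bottleneck):
    """Walk the flowchart top to bottom and return the first matching outcome."""
    if residency_required:
        return "Fine-tuned SLM, self-hosted in-region"
    if p99_budget_s < 1.0:  # assumed cutoff for "under ~1 second p99"
        return "Fine-tuned SLM or carefully tuned Flash with PTU"
    if voice_turn_taking:
        return "Small non-reasoning model, fine-tuned or not"
    if adherence_bottleneck:
        return "Fine-tuned SLM, or large model with heavy eval"
    return "SOTA frontier API"


# A nightly summarization job: no residency rule, generous latency budget, not voice.
print(choose_serving_path(False, 30.0, False, False))  # SOTA frontier API
```

Because the checks run in order, a residency requirement decides the outcome before latency or workload type is even considered.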

The leaves on the right side of that tree — non-residency-constrained, asynchronous, generative or analytical workloads — are where Claude 4.7, GPT-5.2, and Gemini 3.1 Pro absolutely earn their per-token cost. Customer success automation that runs nightly, document summarization for compliance review, generative content for marketing — we route those to the best frontier model that fits the task. There's no virtue in handicapping yourself.

The leaves on the left — where the constraints stack up — are where SLM economics dominate, and where we invest our training budget.


What's next

This chapter is the why. The harder question is the how: training a small model that's actually better than a large one at the workload you care about is not automatic. Most teams that try this end up with a worse model than the API they replaced. The next chapter covers what we learned about base model selection, data curation for Indian BFSI voice transcripts, eval methodology that catches the failures public benchmarks miss, and why we picked the parameter range we did.

We'll also share some of the embarrassing dead ends. There were a few.