The Fine-Tuning Illusion

Fine-tuning an English-first model on Indian language data feels pragmatic. It's faster, cheaper, and delivers passable benchmark results. But benchmarks lie. The moment you deploy a fine-tuned GPT-variant into a real Indian enterprise (a BFSI call centre, a government portal, a healthcare chatbot), the cracks appear.

The problem isn't vocabulary. It's world-model. A model trained primarily on English internet text has absorbed an English-centric understanding of institutions, laws, social norms, and humor. Hindi words on top of that substrate don't change what the model fundamentally knows about how the world works.

The Code-Mix Problem

Consider a typical NBFC customer service interaction: 'Mera account mein credit hua tha but balance show nahi kar raha, kya problem hai bhai?' This is Hinglish: neither Hindi nor English, but the primary communication mode of India's urban working class.
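The difficulty is visible in token-level language tagging: within a single sentence, Hindi function words and English domain terms interleave freely, so no single-language model sees a distribution it was trained on. A toy sketch (the wordlists below are hand-picked illustrations, not a real lexicon):

```python
# Toy sketch: token-level language tagging of a Hinglish utterance.
# The wordlists are tiny illustrative samples, NOT a real lexicon.

ROMAN_HINDI = {"mera", "mein", "hua", "tha", "nahi", "kar", "raha",
               "kya", "hai", "bhai"}
ENGLISH = {"account", "credit", "but", "balance", "show", "problem"}

def tag_tokens(utterance: str) -> list[tuple[str, str]]:
    """Tag each token as Hindi ('hi'), English ('en'), or unknown ('??')."""
    tags = []
    for tok in utterance.lower().replace(",", "").replace("?", "").split():
        if tok in ROMAN_HINDI:
            tags.append((tok, "hi"))
        elif tok in ENGLISH:
            tags.append((tok, "en"))
        else:
            tags.append((tok, "??"))
    return tags

utterance = ("Mera account mein credit hua tha but balance show "
             "nahi kar raha, kya problem hai bhai?")
tags = tag_tokens(utterance)
hindi = sum(1 for _, t in tags if t == "hi")
english = sum(1 for _, t in tags if t == "en")
print(f"{hindi} Hindi tokens, {english} English tokens in one sentence")
```

On this utterance the split is ten Hindi tokens to six English ones, with the languages alternating almost word by word; a model that routes the whole sentence to one language pipeline misreads nearly half of it.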

Every major multilingual model (mBERT, XLM-R, GPT-4) shows measurable accuracy degradation on code-mixed input. The degradation compounds in high-stakes domains: financial products, medical diagnoses, legal queries. This isn't an edge case. It's the default.

Jurisprudence and Cultural Context

Indian law is a living hybrid of colonial British statute, post-independence constitutional doctrine, and thousands of state-level amendments. An LLM trained on English legal corpora will hallucinate confidently about Indian legal questions, absorbing enough surface-level references to sound plausible while missing operative principles entirely.

The same gap appears across domains: Kharif and Rabi crop cycles, drug interactions involving Ayurvedic medicines, the bureaucratic workflows of Indian government portals. None of this is well-represented in English training data.

What Krishna LLM Does Differently

EngineAI's Krishna LLM is trained from scratch, not fine-tuned, on a corpus that is majority Indian-language and Indian-context from day one. The training pipeline ingests code-mixed content explicitly, treating Hinglish, Tanglish, and Banglish as first-class language variants rather than noise.

The tokenizer handles Indian scripts natively. Evaluation benchmarks are constructed from real Indian enterprise data, not translated Western benchmarks. The result is a model that doesn't just generate Indian-language text; it reasons correctly within the Indian institutional context its users actually inhabit.
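Why native script support matters comes down to encoding arithmetic. A tokenizer trained mostly on English falls back to raw bytes for unfamiliar scripts, and every Devanagari code point is 3 bytes in UTF-8, so Hindi text pays roughly triple per character before any merges are learned. A minimal sketch of that cost (an illustration of byte-level fallback in general, not a description of Krishna's actual tokenizer):

```python
# Sketch: the byte cost that a byte-level BPE fallback imposes on
# Indian scripts. Devanagari code points (U+0900-U+097F) encode to
# 3 bytes each in UTF-8; ASCII encodes to 1 byte per character.
# Illustration only; real tokenizers learn merges that shrink this.

samples = {
    "English": "My account was credited",
    "Hindi (Devanagari)": "मेरे खाते में क्रेडिट हुआ",
}

ratios = {}
for label, text in samples.items():
    chars = len(text)                      # Unicode code points
    nbytes = len(text.encode("utf-8"))     # raw bytes a fallback would see
    ratios[label] = nbytes / chars
    print(f"{label}: {chars} chars -> {nbytes} UTF-8 bytes "
          f"({ratios[label]:.2f} bytes/char)")
```

English stays at 1 byte per character while the Hindi sample runs close to 3, which is why a tokenizer whose vocabulary covers Indian scripts directly produces far shorter sequences, and therefore cheaper, lower-latency inference, on the same content.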

The Sovereignty Argument Is Practical

Setting aside geopolitical framing, there's a straightforward business case. Every query sent to a US-hosted LLM API is a potential violation of the DPDPA (India's Digital Personal Data Protection Act, 2023). Every piece of customer financial or health data leaving India creates regulatory and liability exposure.

The cost of compliance, auditing, and data residency enforcement on foreign APIs often exceeds the cost of equivalent sovereign infrastructure, especially at Indian enterprise scale. India's AI future cannot be built on foreign model dependency any more than India's telecom future could have been built on permanently imported switching equipment.