SIH 2024 Winner · $300ML · AI/ML

SHAKTI

Vision-Language Model (SHAKTI-2B) for Military Intelligence

Problem

Defense intelligence relies on dense, multi-page, mixed-format documents (scans, maps, captioned imagery) whose content traditional OCR plus keyword search misses. Analysts under time pressure need a system that retrieves semantically across both text and imagery, not just by exact-match keywords.

Approach

SHAKTI-2B is a fine-tuned mPLUG-Owl3 vision-language model (trained with MS-SWIFT) for OCR on degraded scans and cross-modal retrieval over text and imagery. Batching requests through Redis made the inference queue 2.5x faster on 4GB of VRAM, and the whole stack runs offline for air-gapped deployment. Built in a 40-hour sprint for SIH 2024 problem statement PS1604.
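The Redis batching mentioned above could look roughly like the sketch below. It is not the project's code: the queue key, the batch size, and the `run_model` callback are hypothetical, and `InMemoryQueue` is a stand-in that mimics only the two redis-py list calls used (`rpush` and `lpop` with a count) so the example runs without a server. The idea is that a worker pulls several queued requests at once and sends them through the model in one forward pass, which matters most on a small GPU.

```python
from typing import Callable, Dict, List, Optional


class InMemoryQueue:
    """Hypothetical stand-in for redis.Redis; mimics rpush and lpop(name, count)."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[str]] = {}

    def rpush(self, name: str, *values: str) -> int:
        # Append to the tail of the list, as Redis RPUSH does.
        self._lists.setdefault(name, []).extend(values)
        return len(self._lists[name])

    def lpop(self, name: str, count: Optional[int] = None):
        # Pop from the head; with a count, return up to `count` items as a list.
        items = self._lists.get(name, [])
        if not items:
            return None
        if count is None:
            return items.pop(0)
        popped, self._lists[name] = items[:count], items[count:]
        return popped


def drain_batch(client, key: str, max_batch: int) -> List[str]:
    """Pull up to max_batch queued requests in a single call."""
    # With a real redis.Redis client, pass decode_responses=True to get str
    # instead of bytes (assumption: requests are queued as strings).
    popped = client.lpop(key, max_batch)
    return list(popped) if popped else []


def serve_queue(
    client,
    key: str,
    run_model: Callable[[List[str]], List[str]],
    max_batch: int = 4,
) -> List[str]:
    """Process the queue batch by batch until it is empty."""
    results: List[str] = []
    while True:
        batch = drain_batch(client, key, max_batch)
        if not batch:
            break
        # One model call per batch instead of one call per request.
        results.extend(run_model(batch))
    return results
```

With five queued requests and `max_batch=2`, `run_model` is called three times (batches of 2, 2, and 1) rather than five. A production worker would block on the queue instead of exiting when it is empty.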

At a glance

Recognition

SIH 2024 National Winner

Benchmarks

VQAv2 80.1 · POPE 87.4 · NextQA 73.6 · NLVR2 85.6

OCR

95% printed / 65% handwritten

Base model

mPLUG-Owl3 (SHAKTI-2B)

Hardware

4GB VRAM · 2.5x faster via Redis batching

Sprint

40 hours

Tech decisions

Custom OCR over Tesseract
Military fonts and degraded scans pushed standard OCR pipelines below a usable accuracy threshold.
Vision-Language embeddings
A single index serves both text and image queries, so there is no separate image search to maintain.
RAG over fine-tuning
Analyst queries are open-ended and document corpora rotate; retrieval beats memorization.
LangChain abstraction
Makes it easy to swap the LLM provider as the program of record evolves.
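To make the single-index and RAG decisions concrete, here is a minimal sketch under stated assumptions. The three-dimensional vectors, document IDs, and contents are made up for illustration. In the real system the embeddings would come from the vision-language model, and nothing here reflects the project's actual index or prompt format. Text passages and images sit in one list, a query vector ranks them by cosine similarity, and the top hits are pasted into a prompt so the answer is grounded in the current corpus.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class IndexEntry:
    doc_id: str
    modality: str              # "text" or "image"
    content: str               # passage text, or a caption for an image
    vector: Sequence[float]    # embedding in the shared text/image space


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: List[IndexEntry], query_vector: Sequence[float], k: int = 2) -> List[IndexEntry]:
    """Rank every entry, regardless of modality, against one query vector."""
    ranked = sorted(index, key=lambda e: cosine(e.vector, query_vector), reverse=True)
    return ranked[:k]


def build_prompt(question: str, hits: List[IndexEntry]) -> str:
    """Assemble a retrieval-augmented prompt from the top hits."""
    context = "\n".join(f"[{h.doc_id} | {h.modality}] {h.content}" for h in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index with made-up vectors: two entries about a bridge, one unrelated.
TOY_INDEX = [
    IndexEntry("rpt-1", "text", "Report paragraph describing a bridge.", (0.9, 0.1, 0.0)),
    IndexEntry("img-7", "image", "Photo captioned: bridge crossing.", (0.8, 0.2, 0.1)),
    IndexEntry("rpt-2", "text", "Routine supply inventory.", (0.0, 0.1, 0.9)),
]
```

Calling `search(TOY_INDEX, (1.0, 0.0, 0.0))` returns the text report and the image together, which is the point of sharing one index. When the corpus rotates, only the index contents change, not the model weights.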

Stack

mPLUG-Owl3 · MS-SWIFT · RAG · FastAPI · Redis · PyTorch
