SHAKTI
Vision-Language RAG System for Military Intelligence

Problem
Defense intelligence relies on dense, multi-page, mixed-format documents (scans, maps, captioned imagery) whose content traditional OCR plus keyword search largely misses. Analysts under time pressure need a system that retrieves semantically across both text and imagery, not just on exact-match keywords.
Approach
Vision-Language RAG. A custom OCR pipeline (79% accuracy on military document samples) extracts text from degraded scans; a VL backbone produces multimodal embeddings that are indexed in a vector store. Retrieval is cross-modal: a text query can surface captioned imagery, and a snippet can surface its source document. The service layer is FastAPI, with LLM backends swappable via LangChain.
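A minimal sketch of the cross-modal lookup described above, assuming text chunks and images have already been embedded into one shared vector space. The hand-written three-dimensional vectors stand in for VL backbone output, and every name here (IndexedItem, search, the item and document IDs) is hypothetical, not taken from the project. A plain Python list stands in for the vector store.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IndexedItem:
    item_id: str
    modality: str          # "text" or "image"
    source_doc: str        # document the item was extracted from
    embedding: List[float]  # placeholder for a VL backbone embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index: List[IndexedItem], query: List[float],
           top_k: int = 2) -> List[Tuple[float, IndexedItem]]:
    """Rank every item, regardless of modality, against one query vector."""
    scored = [(cosine(query, item.embedding), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# One index holds both modalities (toy vectors, purely illustrative).
INDEX = [
    IndexedItem("p3-para2", "text", "report_a", [0.9, 0.1, 0.0]),
    IndexedItem("p4-fig1", "image", "report_a", [0.8, 0.2, 0.1]),
    IndexedItem("p1-map", "image", "report_b", [0.0, 0.1, 0.9]),
]

# Pretend this is the embedding of a text query.
results = search(INDEX, [1.0, 0.0, 0.0])
for score, item in results:
    print(f"{item.item_id} ({item.modality}, {item.source_doc}): {score:.3f}")
```

Because ranking ignores modality, the same text query returns both a paragraph and a figure from the same report, and each hit carries its source document for follow-up.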
At a glance
OCR accuracy: 79%
Modality: Vision + Language
Recognition: SIH 2024 Winner
Domain: Defense / Military
Tech decisions
Custom OCR over Tesseract
Military fonts and degraded scans pushed standard pipelines below a usable accuracy threshold.
Vision-Language embeddings
A single index serves both text and image queries - no separate image search to maintain.
RAG over fine-tuning
Analyst queries are open-ended and document corpora rotate; retrieval beats memorization.
LangChain abstraction
Makes it easy to swap the LLM provider as the program of record evolves.
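The last two decisions can be illustrated with a small sketch: retrieved passages are placed in the prompt at query time rather than trained into the model, and the model behind the prompt is chosen by name. This is plain Python, not LangChain's API; the backend functions, the BACKENDS registry, and the prompt layout are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# A backend is anything that maps a prompt string to a completion string.
LLMBackend = Callable[[str], str]


def echo_backend(prompt: str) -> str:
    """Stand-in backend: repeats the final prompt line (no real model)."""
    return "[echo] " + prompt.splitlines()[-1]


def shout_backend(prompt: str) -> str:
    """Second stand-in backend, to show that swapping changes only a name."""
    return "[shout] " + prompt.splitlines()[-1].upper()


# Hypothetical registry; a real deployment would register provider clients.
BACKENDS: Dict[str, LLMBackend] = {
    "echo": echo_backend,
    "shout": shout_backend,
}


def build_prompt(question: str, passages: List[str]) -> str:
    """Put retrieved passages into the prompt instead of model weights."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\nQuestion: {question}"


def answer(question: str, passages: List[str], backend_name: str) -> str:
    """Route one RAG prompt to whichever backend is configured."""
    backend = BACKENDS[backend_name]
    return backend(build_prompt(question, passages))


passages = ["Bridge at grid 41 reported damaged."]
print(answer("Where is the bridge?", passages, "echo"))
print(answer("Where is the bridge?", passages, "shout"))
```

When the corpus rotates, only the passages change; when the provider changes, only the registry entry does. The calling code stays the same in both cases.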