AI Engineering · OnGravy

Production agentic AI for Indian accounting

We built a six-layer hallucination defence, hybrid retrieval (vector + BM25 + Voyage rerank), and a multi-tenant cost-capped agent runtime — for a domain where wrong answers cost real money. End-to-end hallucination rate measured at ~0.3%.

0.3%

hallucination rate

0.91

recall@5 (hybrid+rerank)

60+

agentic tools

jurisdictions (IN/AE/SA/SG)

2,462

pure tests

₹500/mo

per-business cost cap

Six-layer hallucination defence

Standard prompt-engineering tricks (RAG, temp=0, schema validation) miss too many cases when answers cost real money. We layered them:

Retrieval grounding — every tax claim must come from an indexed government regulation chunk; the agent refuses if no chunk above similarity 0.5 is found
Schema-constrained output — Zod → JSON Schema enforced via Anthropic structured outputs; the model cannot return gst_rate: "eighteen percent"
Referential probes — every tool call’s factual claims (HSN code, GST rate, GSTIN format) checked against a master table BEFORE execution; failure → self-correct loop
Idempotent writes — every mutating tool carries a client-supplied idempotency_key; double-firing is a silent no-op
Append-only audit log — hash-chained audit_log table with chain-of-thought capture; replayable for any conversation
Bounded cost ceiling — per-business monthly cap (default ₹500); model router downgrades Opus → Sonnet → Haiku as budget tightens; hard step-cap on agent loops

Hybrid retrieval pipeline

Pure vector retrieval misses queries with rare proper nouns (Section 269ST, specific HSN codes). Pure BM25 misses paraphrases. We combine both with Reciprocal Rank Fusion + Voyage rerank.

User query
   │
   ├──► OpenAI text-embedding-3-small (1536-d)
   │        │
   │        ▼
   │    pgvector HNSW    →    top-30 vector candidates
   │
   └──► PG tsvector ts_rank →   top-30 BM25 candidates
                  │
                  ▼
              RRF fusion (k=60)
                  │
                  ▼
            Voyage rerank-2.5-lite
                  │
                  ▼
            top-5 to LLM with citations

Measured on a 30-query Indian-tax benchmark: recall@5 = 0.71 (vector only) → 0.85 (with rerank) → 0.91 (hybrid + rerank). MRR climbed from 0.57 → 0.81.

Open source

We extracted the production-grade primitives into two npm packages so other teams building agentic AI can build on the same substrate:

NPM

@ongravy/agent-kit

RRF · eval metrics · model router · prompt-cache planner · MCP format converter. Zero runtime deps.

npm install @ongravy/agent-kit

NPM

@ongravy/mcp-server

MCP server exposing 60+ accounting tools to Claude Desktop, Cursor, Zed, and other MCP clients.

npm install -g @ongravy/mcp-server

IndianTaxAgentBench

We are building the first public benchmark for AI agents on Indian tax & compliance tasks: 312 hand-labelled queries across closed-form factual recall, multi-hop reasoning, and tool-using execution. Open-source dataset + eval harness. Paper draft on arXiv soon.

Building this together

OnGravy is led by Pratik Revankar. Solo-built since early 2025; launching June 2026 with a design-partner program for chartered accountants.

If you are a CA interested in the design-partner program, or an engineer interested in working on production agentic AI: revankarpratik1995@gmail.com.