RAG Pipeline for 10M+ Legal Documents
By Emin Can Başkaya
2026-04-10
The problem
Building a retrieval system over a legal corpus that lawyers can trust. The bar for a legal product is much higher than for general-purpose chatbots: if the system cites a source, that citation has to be traceable to the exact document and passage, or the product has no value. Lawyers notice fabrication immediately and never come back.
The approach
A hybrid retrieval architecture combining BM25 lexical search and dense vector retrieval in Elasticsearch, with a reranking layer and structured citation output that ties every generated claim back to specific source passages. The ingestion layer handled more than 10 million documents across multiple Turkish legal sources, with daily updates, resilience to source format changes, and support for the heterogeneous structure of court decisions, legislation, regulations, and related materials.
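The core of hybrid retrieval is merging the two ranked lists. A common way to do that is reciprocal rank fusion (RRF); the sketch below shows the idea in plain Python, with illustrative document IDs and the conventional k=60 constant. The production system ran both queries inside Elasticsearch, so treat this as a conceptual sketch, not the actual implementation.

```python
# Minimal sketch: fusing BM25 and dense-vector result lists with
# reciprocal rank fusion (RRF). Doc IDs are illustrative.

def rrf_fuse(bm25_hits, dense_hits, k=60):
    """Merge two ranked lists of doc IDs into one RRF-scored ranking."""
    scores = {}
    for hits in (bm25_hits, dense_hits):
        for rank, doc_id in enumerate(hits):
            # Each list contributes 1 / (k + rank); ranks start at 1.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A document found by both retrievers outranks one found by only one.
fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"])
# → ["d2", "d1", "d4", "d3"]
```

RRF is attractive here because it needs no score normalization: BM25 scores and cosine similarities live on incompatible scales, but ranks are always comparable.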
On top of retrieval, I built the LLM integration layer that consumed retrieved context and produced structured outputs for downstream features: semantic search, case law lookup, and automated petition generation. The generation layer was constrained to ground every factual claim in retrieved sources, with hallucination controls at multiple stages of the pipeline.
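One of the hallucination controls described above can be sketched as a post-generation grounding filter: any claim that does not cite a passage actually present in the retrieved set is rejected. The field names and the `filter_ungrounded` helper are illustrative assumptions, not the production schema.

```python
# Illustrative grounding check: every claim the model emits must cite
# at least one passage that was actually retrieved; anything else is
# dropped before it reaches the user. Schema is an assumption.

from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    citations: list = field(default_factory=list)  # cited passage IDs

def filter_ungrounded(claims, retrieved_ids):
    """Keep only claims whose every citation is a retrieved passage."""
    valid = set(retrieved_ids)
    return [c for c in claims if c.citations and set(c.citations) <= valid]
```

Running this as one stage of a multi-stage pipeline means a fabricated citation fails loudly (the claim disappears) instead of silently reaching a lawyer.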
The result
A production RAG system that served real lawyers in a regulated domain, stress-tested to 1,000 concurrent users, with citation accuracy sufficient for professional legal use.
Stack
FastAPI, PostgreSQL, Elasticsearch (hybrid search with dense vectors and BM25), LLM providers abstracted for swappability, Celery for async ingestion, Redis for queues.
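The "LLM providers abstracted for swappability" point can be sketched as a minimal interface: generation code depends on a protocol, not a vendor SDK, so backends can be swapped without touching the pipeline. The `LLMProvider` name and `complete` signature are illustrative, not the real codebase.

```python
# Sketch of a swappable LLM-provider abstraction. Names are assumed.

from typing import Protocol

class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...

class EchoProvider:
    """Stand-in backend, used here only to show the interface."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def generate_answer(provider: LLMProvider, prompt: str) -> str:
    # The generation layer sees only the protocol, never a concrete SDK.
    return provider.complete(prompt)
```

Swapping in a real backend then means writing one adapter class per vendor; the rest of the pipeline is untouched.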