Hierarchical Parsing of Turkish Legislation into AsciiDoc
By Emin Can Başkaya
2026-04-08
The problem
Turkish legislation lives on public government sources in a format that looks structured but isn’t machine-structured in any usable way. The hierarchical relationships that matter for legal reasoning — Kanun → Kısım → Bölüm → Madde → Fıkra → Bent — are expressed visually rather than semantically. Amendments layer on top of base law and rewrite specific clauses without touching their parents. Cross-references are free text (“3. maddenin 2. fıkrasına göre”) rather than resolvable links.
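To make the cross-reference problem concrete, here is a minimal sketch of turning one free-text reference into a structured pointer. The pattern, the `Ref` type, and `find_refs` are illustrative assumptions, not the pipeline's actual rules. They cover only the lowercase "N. madde… M. fıkra…" phrasing quoted above, and real references come in many more forms.

```python
import re
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    madde: int
    fikra: Optional[int]  # None when the reference names only a Madde


# Hypothetical pattern: "<n>. madde<suffix>" optionally followed by
# "<m>. fıkra<suffix>". \w* absorbs Turkish case suffixes such as "-nin".
REF_RE = re.compile(r"(\d+)\.\s*madde\w*(?:\s+(\d+)\.\s*fıkra\w*)?")


def find_refs(text: str) -> list[Ref]:
    """Return every Madde/Fıkra reference the pattern finds in text."""
    refs = []
    for match in REF_RE.finditer(text):
        fikra = int(match.group(2)) if match.group(2) else None
        refs.append(Ref(int(match.group(1)), fikra))
    return refs


refs = find_refs("3. maddenin 2. fıkrasına göre")
```

Once a reference is a `(madde, fikra)` pair rather than a phrase, it can be resolved against whatever identifiers the parsed hierarchy exposes.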
For an LLM to reason over this corpus with any accuracy, the hierarchy has to be explicit, the text has to be verbatim to the source, and the structure has to survive chunking and retrieval.
The approach
I built a hybrid parsing pipeline that combines structural document signals with LLM-based classification, designed to recover the hierarchy while preserving the source text exactly as published. The output format is AsciiDoc, chosen because it natively handles nested numbered lists, cross-references, includes, and attributes, all of which map onto legislative structure more directly than JSON or Markdown do.
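As a rough illustration of how that mapping can look, the sketch below renders one Madde as an AsciiDoc section whose Fıkra entries are anchored ordered-list items. The `Madde` and `Fikra` classes, the `madde-N-fikra-M` ID scheme, and the sample text are assumptions made for the example. Only the AsciiDoc constructs themselves (`[[id]]` anchors, `.` list items, `<<id,label>>` cross-references) are standard syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Fikra:
    number: int
    text: str  # verbatim source text, never rewritten


@dataclass
class Madde:
    number: int
    title: str
    fikralar: list[Fikra] = field(default_factory=list)


def render_madde(madde: Madde) -> str:
    """Render one Madde as an AsciiDoc section with an anchored ordered list."""
    madde_id = f"madde-{madde.number}"  # hypothetical ID scheme
    lines = [
        f"[[{madde_id}]]",  # block anchor for the section
        f"=== Madde {madde.number}: {madde.title}",
        "",
    ]
    for fikra in madde.fikralar:
        # "." starts an ordered-list item; the inline anchor makes it linkable.
        lines.append(f". [[{madde_id}-fikra-{fikra.number}]]{fikra.text}")
    return "\n".join(lines) + "\n"


def xref(madde: int, fikra: int, label: str) -> str:
    """Build an AsciiDoc cross-reference to a Fıkra anchor."""
    return f"<<madde-{madde}-fikra-{fikra},{label}>>"


example = Madde(3, "Tanımlar", [Fikra(1, "Placeholder text A."),
                                Fikra(2, "Placeholder text B.")])
adoc = render_madde(example)
link = xref(3, 2, "3. maddenin 2. fıkrası")
```

The same free-text phrase that was unresolvable in the source becomes the label of a link that points at a specific list item.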
The pipeline turns ~1,000 core legislative documents into a structured, LLM-ingestible hierarchy where every Madde, Fıkra, and Bent is individually addressable, the original text is preserved without paraphrase, and the structural breadcrumb is available as metadata for retrieval.
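One way to picture "individually addressable with a breadcrumb" is a tree walk that emits one retrieval chunk per text-bearing node. This is a sketch under assumptions: the `Node` shape, the `/`-joined ID, the ` > ` breadcrumb separator, and the placeholder law are all invented for illustration and are not taken from the actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    kind: str    # e.g. "Kanun", "Bölüm", "Madde", "Fıkra", "Bent"
    label: str   # e.g. "3" or "a"
    text: str = ""  # verbatim text owned by this node, if any
    children: list["Node"] = field(default_factory=list)


def iter_chunks(node: Node, trail: tuple[str, ...] = ()) -> Iterator[dict]:
    """Yield one chunk per text-bearing node, carrying its full ancestry."""
    trail = trail + (f"{node.kind} {node.label}",)
    if node.text:
        yield {
            "id": "/".join(trail),            # stable address of the unit
            "breadcrumb": " > ".join(trail),  # metadata for retrieval
            "text": node.text,                # unchanged source text
        }
    for child in node.children:
        yield from iter_chunks(child, trail)


# Placeholder law, not a real statute.
law = Node("Kanun", "Örnek", children=[
    Node("Madde", "3", children=[
        Node("Fıkra", "1", "Placeholder text A."),
        Node("Fıkra", "2", "Placeholder text B.", children=[
            Node("Bent", "a", "Placeholder text C."),
        ]),
    ]),
])
chunks = list(iter_chunks(law))
```

Because the breadcrumb travels with each chunk, a retrieved Bent still knows which Fıkra, Madde, and Kanun it belongs to, even after chunking has separated it from its parents.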
Why this was the hard part
Most legal RAG projects rely either on naive chunking, which destroys the legal hierarchy, or on full LLM-based parsing, which can hallucinate edits to the source text and is therefore fatal for a legal product. The hybrid approach addresses both: rule-based signals recover the structure, the LLM handles classification, and the source text is carried through verbatim.
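The division of labour can be sketched as follows: cheap rules label the lines they recognise, a classifier callable (standing in for the LLM) returns a label for the rest, and a final check confirms that every emitted unit still appears in the source, in order. The regexes, the label names, and the `llm_classify` callable are assumptions for illustration. The key property is that the classifier returns only a label and never text, so it has no way to alter the wording.

```python
import re
from typing import Callable

# Assumed heading patterns; real documents would need a broader rule set.
RULES = [
    ("madde", re.compile(r"^\s*MADDE\s+\d+\b", re.IGNORECASE)),
    ("fikra", re.compile(r"^\s*\(\d+\)")),
    ("bent", re.compile(r"^\s*[a-zçğıöşü]\)")),
]


def classify(line: str, llm_classify: Callable[[str], str]) -> str:
    """Use a rule when one matches; otherwise ask the classifier for a label."""
    for label, pattern in RULES:
        if pattern.match(line):
            return label
    return llm_classify(line)


def assert_verbatim(source: str, units: list[str]) -> None:
    """Raise if any unit is not found in the source, in document order."""
    cursor = 0
    for unit in units:
        pos = source.find(unit, cursor)
        if pos == -1:
            raise ValueError(f"unit text not found verbatim: {unit!r}")
        cursor = pos + len(unit)


def parse(source: str, llm_classify: Callable[[str], str]) -> list[tuple[str, str]]:
    """Label each non-empty line and verify the text was not modified."""
    labelled = [
        (classify(line, llm_classify), line)
        for line in source.split("\n")
        if line.strip()
    ]
    assert_verbatim(source, [text for _, text in labelled])
    return labelled
```

In use, `llm_classify` would wrap a model call. For testing it can be any function that maps a line to a label, which keeps the structural rules and the verbatim check verifiable without a model in the loop.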
Stack
Python parsing pipeline, LLM classification layer, AsciiDoc output format, versioned storage for amendment tracking.