There’s a growing wave of legal AI tools that wrap a prompt around GPT or Claude, connect it to a database of laws, and call it a product. From the outside, the demo looks impressive — ask a legal question, get a structured answer with citations. But anyone who has used these tools on real cases knows the problem: the answer looks right, reads right, and is wrong in ways you can’t easily catch.
We built Branko AI to solve this differently. Here’s why we believe agentic workflows, built-in verification, and law versioning aren’t nice-to-have features — they’re the minimum bar for legal AI that actually works.
The single-prompt problem
When you ask an LLM a legal question in one shot, you’re asking it to do five things simultaneously: understand the facts, find the relevant law, check if the law was amended, apply the law to the facts, and structure the output. LLMs are remarkably good at making this look coherent. But coherence is not correctness.
The model might cite Article 76 of a law that was amended three years ago. It might pull the right law but miss that a bylaw overrides it for your specific case. It might extract the wrong claim amount from a dense court document and build a perfectly logical analysis on top of a wrong number.
You won’t catch these errors by reading the output. They’re buried in the reasoning chain, and the confident tone makes them invisible.
Why we chose an agentic pipeline
Instead of one prompt doing everything, Branko AI breaks legal analysis into four discrete steps, each with a specific job:
1. Retrieval — find every relevant law, bylaw, and court decision. Not mere keyword matching — a hybrid of semantic and keyword search across article-level indexed legislation, re-ranked by a model trained specifically on legal relevance.
2. Fact extraction — pull entities, amounts, dates, and legal relationships from the uploaded document. This uses a custom-trained NER model with an F1 score of 95%, not the LLM guessing from context.
3. Legal analysis — apply the retrieved law to the extracted facts using IRAC methodology. The LLM only handles the reasoning step — everything it works with has already been verified by the previous steps.
4. Document drafting — generate the final output in the correct legal format for the jurisdiction.
Each step feeds the next. Each step can be independently evaluated. When something goes wrong, we know exactly which step failed and why.
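The shape of the pipeline can be sketched in a few lines. This is an illustrative sketch, not Branko AI's actual code: every function name and data field here is an assumption standing in for the real implementation, and each step's body is a stub.

```python
from dataclasses import dataclass, field

@dataclass
class CaseAnalysis:
    # Each step writes one field, so each intermediate is inspectable.
    sources: list = field(default_factory=list)   # laws, bylaws, decisions
    facts: dict = field(default_factory=dict)     # entities, amounts, dates
    analysis: str = ""                            # IRAC reasoning
    document: str = ""                            # drafted output

def retrieve_sources(question: str) -> list:
    # Step 1 (stub): hybrid semantic + keyword search, then re-ranking.
    return [f"source relevant to: {question}"]

def extract_facts(document_text: str) -> dict:
    # Step 2 (stub): a dedicated NER model pulls entities, amounts, dates.
    return {"claim_amount": 12000, "event_date": "2021-06-15"}

def analyze(sources: list, facts: dict) -> str:
    # Step 3 (stub): the LLM applies retrieved law to verified facts (IRAC).
    return f"IRAC analysis over {len(sources)} sources and {len(facts)} facts"

def draft(analysis: str) -> str:
    # Step 4 (stub): format the analysis for the jurisdiction.
    return f"MEMORANDUM\n\n{analysis}"

def run_pipeline(question: str, document_text: str) -> CaseAnalysis:
    result = CaseAnalysis()
    result.sources = retrieve_sources(question)
    result.facts = extract_facts(document_text)
    result.analysis = analyze(result.sources, result.facts)
    result.document = draft(result.analysis)
    return result
```

The design point is the function boundaries, not the stubs: because each step has a typed input and output, each can be benchmarked and swapped independently.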
Verification isn’t optional
Here’s what most legal AI vendors won’t tell you: they don’t know how good their system is. They can show you cherry-picked demos, but they can’t give you a number.
We can. Every step in our pipeline is benchmarked against a curated gold standard — real cases with expert-validated outputs. We measure retrieval precision, entity extraction accuracy, citation correctness, and analysis quality. Our current end-to-end benchmark: 85.6% across 20 real cases.
More importantly, we know which cases score 95% and which score 70%. We know that our retrieval is the strongest link and that certain edge cases in entity extraction are the weakest. This is how we improve systematically — not by tweaking prompts based on anecdotes, but by measuring, identifying the bottleneck, fixing it, and re-running the benchmark.
When we deploy a new version, we run the full evaluation suite. If scores drop, we don’t ship. This is what separates a product from a prototype.
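A release gate like this is simple to state in code. The sketch below is a hypothetical version of such a gate: the metric names and baseline numbers are illustrative (only the 85.6% end-to-end figure comes from the text above), not our actual thresholds.

```python
# Baseline scores a candidate build must not fall below.
BASELINE = {
    "retrieval_precision": 0.90,   # illustrative
    "entity_f1": 0.95,             # illustrative
    "citation_correctness": 0.88,  # illustrative
    "end_to_end": 0.856,           # current benchmark from the text
}

def release_gate(new_scores: dict, baseline: dict = BASELINE,
                 tolerance: float = 0.0) -> tuple:
    """Return (ship?, list of regressed metrics)."""
    regressions = [
        metric for metric, base in baseline.items()
        if new_scores.get(metric, 0.0) < base - tolerance
    ]
    return (not regressions, regressions)

# A candidate build whose entity extraction regressed:
ok, failed = release_gate({
    "retrieval_precision": 0.91,
    "entity_f1": 0.93,             # below the 0.95 baseline
    "citation_correctness": 0.89,
    "end_to_end": 0.86,
})
# ok is False; failed == ["entity_f1"]
```

Keeping the gate as data plus a pure function means the same check runs in CI and in a notebook, and "don't ship" becomes a mechanical decision rather than a debate.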
The law versioning problem nobody talks about
Laws change. This sounds obvious, but the implications for legal AI are severe and almost entirely ignored.
Consider a labor dispute from 2021. The relevant provisions of the labor law were amended in 2022 and again in 2024. If your AI retrieves the current version of the law and applies it to a 2021 case, the analysis is wrong — and it will look completely right.
Most legal AI systems retrieve the latest version of every law because that’s what’s in their database. They have no concept of time.
Branko AI includes what we call the time machine. For every law in our corpus, we maintain a full version chain — every amendment, every gazette reference, every effective date. When you upload a case document, the system extracts the event date, identifies which version of each law was in force at that time, and reconstructs the law as it existed on that date. The legal analysis then runs against the temporally correct version.
This isn’t a feature. It’s a prerequisite for correct legal analysis. Any system without it is giving you answers based on the wrong law, and neither you nor the system knows it.
The compound effect
Each of these design choices — agentic decomposition, per-step verification, law versioning — matters on its own. But the real value is how they compound.
Because the pipeline is decomposed, we can train specialized models for each step. Our NER model doesn’t need to be a general-purpose entity extractor — it’s trained specifically on Macedonian legal documents. Our reranker doesn’t need to handle arbitrary queries — it’s fine-tuned on legal relevance judgments.
Because each step is verified, errors don’t propagate silently. A wrong entity extraction in Step 2 produces a measurably different output in Step 3, which our evaluation catches.
Because the law is versioned, the retrieval step doesn’t just find the right law — it finds the right version of the right law. The analysis builds on a foundation that is temporally correct by design, not by luck.
A single-prompt system has none of these properties. It might give you the right answer on a given day for a given question. But you have no way to know when it’s wrong, no way to measure how often it’s wrong, and no way to systematically make it less wrong.
What this means for legal practice
We’re not building a tool that replaces lawyers. We’re building infrastructure that makes legal analysis reproducible, measurable, and auditable.
When a junior associate spends four hours researching a question, the quality depends on their experience, their attention to detail that day, and whether they happened to check the right amendment. The output is a memo that looks authoritative but has no quality score attached.
When Branko AI analyzes the same question, the output comes with a retrieval trace showing exactly which laws were considered, an extraction log showing which facts were pulled from the document, and a citation chain linking every claim in the analysis to a specific article. The senior partner can verify the reasoning in minutes instead of re-doing the research from scratch.
This is what we mean by production-grade legal AI. Not a better chatbot — a system that earns trust through transparency, measurement, and the discipline to only ship what we can prove works.