Why you can't benchmark a legal case with a chatbot

Ask a general-purpose AI a simple question about employment tribunals, such as “how often do unfair dismissal claims succeed?”, and you will get a confident, specific, plausible answer. You will also, most of the time, get a wrong one. Not because the model is careless, but because of what it is doing when it answers. Understanding that mechanism is the difference between a number you can put in an advice letter and a number that quietly misleads.

There are, broadly, three ways a piece of software can answer a legal question. They are not variations on a theme. They are different machines doing different jobs, and only one of them can produce a statistic.

One: retrieval — the machine that reads

Most legal AI you have encountered is built on retrieval, often called RAG — retrieval-augmented generation. The mechanism is straightforward: take a large pile of documents, and when a question arrives, find the handful of passages that look most relevant, hand them to a language model, and have it write an answer grounded in what it just read.

Retrieval is genuinely good at some things, and it is worth being honest about them. It works on raw, unstructured text with no preparation. It is excellent at qualitative questions — “what reasoning did tribunals give when refusing to extend a time limit?” — where the answer lives in the prose of judgments and nowhere else. It can quote the exact passage it relied on. For reading, summarising, and surfacing the texture of how decisions are written, retrieval is the right tool.

But notice what it cannot do. When you ask “how often do these claims succeed?”, a retrieval system does not have the whole population in view. It has whatever passages its search step happened to pull — perhaps a dozen, perhaps thirty. It cannot count across a hundred thousand decisions, because it never sees a hundred thousand decisions at once. So when pressed for a rate, it does what language models do: it produces a fluent, confident number that was never computed from anything. It is an estimate wearing the costume of a statistic.
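
A minimal sketch of that gap, under loud assumptions: the corpus below is invented, the keyword-overlap `relevance` function is a crude stand-in for a real search step, and names such as `Decision` and `retrieve` are hypothetical. It only illustrates that the passages a search step pulls are ranked for relevance, not drawn to represent the population.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    case_id: str
    text: str
    succeeded: bool


def build_toy_corpus() -> list[Decision]:
    """Hypothetical corpus: 100 invented decisions, 30 of which succeeded."""
    corpus = []
    for i in range(100):
        succeeded = i < 30
        text = (
            "unfair dismissal claim succeeds and is well founded"
            if succeeded
            else "the claim is dismissed"
        )
        corpus.append(Decision(f"case-{i:03d}", text, succeeded))
    return corpus


def relevance(query: str, text: str) -> int:
    """Crude keyword overlap, standing in for a real search step."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, corpus: list[Decision], k: int) -> list[Decision]:
    """Return the k passages that look most relevant to the query."""
    return sorted(corpus, key=lambda d: relevance(query, d.text), reverse=True)[:k]


def success_rate(decisions: list[Decision]) -> float:
    return sum(d.succeeded for d in decisions) / len(decisions)


corpus = build_toy_corpus()
sample = retrieve("how often do unfair dismissal claims succeed", corpus, k=12)

sample_rate = success_rate(sample)      # all the model gets to read
population_rate = success_rate(corpus)  # what a real rate would be counted from
```

In this toy setup the twelve retrieved passages all happen to be successful claims, so the sample suggests a very different rate from the one in the full corpus. A language model handed only `sample` has no way to know that.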

The architectural limit

This is not a flaw that a better model fixes. It is the architecture. Retrieval retrieves. Ask it to count, and it guesses.

Two: rules engines — the machine that recites

The second kind of system takes the opposite approach. Rather than reading judgments, it encodes what the law says. A lawyer and an engineer sit down and translate a body of regulation — eligibility criteria, procedural steps, statutory conditions — into deterministic logic. If this, then that. The result is transparent, traceable, and free of the guessing that haunts retrieval. Every output can be traced back to the rule that produced it.

For questions of what the law requires, this is powerful. “What are the conditions for a valid claim?” is a question a well-built rules engine answers precisely, every time, with its working shown.

But a rules engine holds no memory of what actually happened. It knows the requirements for an unfair dismissal claim; it does not know that, of the claims that reached a full merits hearing last year, a particular share succeeded. It encodes the rulebook, not the record. Ask it “what usually happens to cases like this?” and it has nothing to answer with, because outcomes are not rules. They are history — and history has to be collected, not codified.
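
As a rough sketch of the "if this, then that" idea, and not a statement of any actual legal test: the two rules, the `Claim` fields, and the time-limit value below are placeholders invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Claim:
    is_employee: bool
    days_since_dismissal: int


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    test: Callable[[Claim], bool]


# Placeholder value for illustration only; not a statement of the law.
ILLUSTRATIVE_TIME_LIMIT_DAYS = 90

RULES = [
    Rule("R1", "claimant was an employee", lambda c: c.is_employee),
    Rule(
        "R2",
        "claim presented within the illustrative time limit",
        lambda c: c.days_since_dismissal <= ILLUSTRATIVE_TIME_LIMIT_DAYS,
    ),
]


def evaluate(claim: Claim) -> list[tuple[str, str, bool]]:
    """Apply every rule and keep the working: (rule id, description, passed)."""
    return [(r.rule_id, r.description, r.test(claim)) for r in RULES]


trace = evaluate(Claim(is_employee=True, days_since_dismissal=120))
eligible = all(passed for _, _, passed in trace)
failed_rules = [rule_id for rule_id, _, passed in trace if not passed]
```

Every result traces to a rule identifier, which is the strength of the approach. Nothing in `Claim` or `Rule` records how any past case ended, so there is nothing here from which a success rate could be computed.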

Three: the structured record — the machine that counts

The third approach is the one Yerty is built on, and it is neither of the above. It begins with the same public source everyone has, the tribunal decisions published on GOV.UK, but instead of reading them on demand or encoding rules alongside them, it structures them. Every published decision is read once; its outcome and key facts are extracted, verified, and held as fields in a single, consistent, queryable record. The result is more than 160,000 tribunal records, structured and organised so that the whole population can be counted, grouped, and compared. That is the Yerty Index.

Once a decision is a structured row rather than a wall of text, the questions that defeat retrieval become ordinary. How often do these claims succeed at a full hearing? That is a count across the whole record, not a guess from a sample. What is the typical award, by claim type, by region, over five years? That is a straightforward aggregation. How does representation associate with outcome? A cross-tabulation. None of these can be answered by reading a handful of passages, and none of them are questions about what the law says. They are questions about what actually happened — and they can only be answered by a machine that has structured the record so it can be counted.
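
The three questions above can be sketched in a few lines once each decision is a row. The rows and field names here (`claim_type`, `represented`, `succeeded`, `award`) are invented for illustration and are an assumption, not the actual Yerty Index schema.

```python
from collections import Counter, defaultdict
from statistics import median

# Invented rows for illustration only; not real decisions.
RECORDS = [
    {"case_id": "A1", "claim_type": "unfair dismissal", "represented": True, "succeeded": True, "award": 12000},
    {"case_id": "A2", "claim_type": "unfair dismissal", "represented": False, "succeeded": False, "award": None},
    {"case_id": "A3", "claim_type": "unfair dismissal", "represented": True, "succeeded": False, "award": None},
    {"case_id": "A4", "claim_type": "discrimination", "represented": False, "succeeded": True, "award": 800},
    {"case_id": "A5", "claim_type": "discrimination", "represented": False, "succeeded": True, "award": 1500},
    {"case_id": "A6", "claim_type": "unfair dismissal", "represented": True, "succeeded": True, "award": 8000},
]


def success_rate(records: list[dict], claim_type: str) -> float:
    """A count over every matching row, not a sample of them."""
    matching = [r for r in records if r["claim_type"] == claim_type]
    return sum(r["succeeded"] for r in matching) / len(matching)


def median_award_by_type(records: list[dict]) -> dict[str, float]:
    """Aggregate awards per claim type, skipping rows with no award."""
    awards = defaultdict(list)
    for r in records:
        if r["award"] is not None:
            awards[r["claim_type"]].append(r["award"])
    return {claim_type: median(values) for claim_type, values in awards.items()}


def representation_crosstab(records: list[dict]) -> Counter:
    """Cross-tabulate (represented, succeeded) pairs."""
    return Counter((r["represented"], r["succeeded"]) for r in records)


rate = success_rate(RECORDS, "unfair dismissal")
medians = median_award_by_type(RECORDS)
crosstab = representation_crosstab(RECORDS)
```

None of these functions is clever. The work is in having the rows at all; once they exist, the statistics are ordinary counting.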

The crucial property is this: every figure is computed from the actual population, and every figure traces back to the real, published decisions behind it. There is no estimation step. When the number is 30 percent, it is 30 percent because the record was counted, not because a model found 30 percent plausible.
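
One way to picture a figure that carries its own provenance, as a sketch under assumptions: the `Figure` type, the `count_successes` helper, and the ten rows below are hypothetical and are not a description of how Yerty stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    label: str
    numerator: int
    denominator: int
    source_ids: tuple[str, ...]  # every decision that was counted

    @property
    def rate(self) -> float:
        return self.numerator / self.denominator


def count_successes(label: str, rows: list[tuple[str, bool]]) -> Figure:
    """Build a figure from (case_id, succeeded) rows, keeping the ids counted."""
    if not rows:
        raise ValueError("cannot compute a rate over an empty population")
    return Figure(
        label=label,
        numerator=sum(succeeded for _, succeeded in rows),
        denominator=len(rows),
        source_ids=tuple(case_id for case_id, _ in rows),
    )


# Invented rows: ten decisions, three of which succeeded.
rows = [(f"case-{i}", i < 3) for i in range(10)]
figure = count_successes("success rate at full hearing", rows)
```

Here the 30 percent is `figure.numerator` over `figure.denominator`, and `figure.source_ids` lists what was counted, so the number can be checked rather than taken on trust.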

Why this matters for anyone relying on the number

The distinction is not academic. It decides whether a number is safe to act on.

A litigation funder underwriting a case needs a success rate that is a real base rate, not a language model's impression of one. A solicitor putting a figure in an advice letter needs it to trace to something. A journalist reporting on tribunal patterns needs a source they can cite and defend. A claimant deciding whether to pursue or settle deserves an honest picture, not a fluent guess. In every one of these rooms, the question is the same: where did this number come from, and can I stand behind it? Retrieval cannot answer that question about its own statistics, because it did not compute them. A structured record can, because it did.

This is also why honesty is easier for a structured system, not harder. Because Yerty counts a defined population, it can tell you exactly what that population is — and what it is not. Settlements never reach a published decision, so they are invisible in the record, and we say so. The figures describe published outcomes, a specific and contested slice of all disputes, and we say that too. You cannot state the limits of a number you never actually computed. You can state the limits of a number you did. See how the Yerty Intelligence Hub puts that record to work.

The one-line version

Retrieval AI reads a few judgments and summarises them. Rules engines encode what the law says. Neither can tell you what actually happened across the whole record, because one only ever sees a sample and the other holds no history at all.

Benchmarking is a statistical question. It needs the entire record, structured — so the answer is counted, not guessed, and cited, not estimated.

That is what the Yerty Index is: not a prediction, not a rulebook, but the structured record of what actually happened. Explore it through the Yerty Intelligence Hub.

The Yerty Index is the structured dataset behind the Yerty Intelligence Hub. Figures are derived from published Employment Tribunal decisions and are provided for benchmarking context, not as a prediction of any individual case or as legal advice. Published decisions do not include settled or withdrawn claims. See our data use policy and methodology for coverage and citation.

Built on real cases. Reviewed by practising solicitors.