People keep telling me AI is "basically fine" for simple tax. So I ran a test you can repeat yourself.

I took one taxpayer with one clean set of facts and asked five leading AI models to compute the same return. Same prompt, same numbers, no tricks. Then I graded every answer against the correct figure — the one a qualified accountant produces from current legislation.

The answers did not agree. And the disagreements weren't rounding. They were the kind of errors that get a real person a real letter from the tax authority.

The setup

Taxpayer: self-employed sole trader, UK resident, no other income.
Profit: £60,000 for the current tax year.
Question asked of each model: "Calculate my total income tax and National Insurance for this year, and show your working."
Models: five of the most widely used general-purpose AI assistants, each in a fresh session with no custom instructions.
Grading: I compared each answer to the correct computation under current rates and thresholds.

That's it. No edge cases, no obscure relief, no cross-border complication. The kind of question thousands of freelancers ask their AI every week.

A note on the results below: they show the structure of the test and the failure patterns I see repeatedly. Before treating any specific figure as gospel, re-run it yourself with the prompt above. The point isn't the exact spread on one day; it's that the spread exists at all.

What came back

The five answers did not cluster around the right number. They scattered.

Model	Result	What went wrong
A	Correct	Matched the verified computation.
B	Correct	Matched, with clear working.
C	Close but wrong	Used a prior-year personal allowance / threshold, understating the bill.
D	Wrong	Applied the personal allowance taper, a high-earner adjustment that only starts at £100,000 of income and should not apply at £60,000.
E	Correct on tax, wrong on NI	Got income tax right but muddled the National Insurance bands.

Same person. Same facts. Answers spread across a four-figure range — and three of the five contained an error a taxpayer would have filed without ever knowing.
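
One of the failure patterns in the table involves the personal allowance taper, and it is easy to check by hand. Here is a minimal sketch, assuming a £12,570 allowance that shrinks by £1 for every £2 of income above £100,000. Both figures and the function name are my assumptions for illustration, so confirm them against current legislation before relying on them.

```python
from decimal import Decimal

# Assumed figures for illustration only; confirm against current legislation.
PERSONAL_ALLOWANCE = Decimal("12570")
TAPER_THRESHOLD = Decimal("100000")


def tapered_allowance(income: Decimal) -> Decimal:
    """Reduce the allowance by 1 for every 2 of income above the threshold."""
    excess = max(income - TAPER_THRESHOLD, Decimal("0"))
    reduction = excess / 2
    # The allowance can shrink to zero but never goes negative.
    return max(PERSONAL_ALLOWANCE - reduction, Decimal("0"))


print(tapered_allowance(Decimal("60000")))   # 12570 (no taper at this income)
print(tapered_allowance(Decimal("110000")))  # 7570
```

Under those assumptions, a £60,000 profit keeps the full allowance, so any answer that reduces it has applied a rule that was never triggered.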

Why the wrong ones were wrong (and why you couldn't tell)

None of the errors looked like errors. Every model wrote in confident, well-formatted prose, "showed its working," and produced a clean final number. The broken answers were indistinguishable from the correct ones unless you already knew the right answer — in which case you wouldn't have asked.

The failures trace back to the same three root causes every time:

Stale data. Models recall thresholds from whatever year dominated their training. Tax changes annually; the memory doesn't.
Plausible-but-wrong logic. Tapers, bands, and reliefs interact in ways that look like simple arithmetic and aren't.
No source of truth. Nothing in the process was checking the answer against the actual law.

That last point is the whole game. The models weren't consulting the tax code. They were predicting what a tax answer usually looks like — and "usually" is not "correct."

The fix is boring, which is why it works

Give the model the right rules to read, and the picture changes completely.

When an AI agent references a verified tax skill — a structured document with the current rates, thresholds and computation steps, written and signed off by a qualified accountant — it stops guessing. It follows rules a human professional stands behind, and cites them. The same five models, handed the same skill, converge on the same correct answer, because they're no longer recalling tax from memory. They're applying it from a source.

That's the entire idea behind the open skill library: not smarter models, but a correct source for the ones we already have.
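
To make "applying it from a source" concrete, here is a rough sketch of rules held as data and a calculation that only reads from them. The `SKILL` layout, the function names, and the figures are all hypothetical: I have assumed the 2024/25 UK rates for illustration, ignored Class 2 National Insurance and every relief, and refused incomes above the taper threshold. It is not the format of any real skill file.

```python
from decimal import Decimal

# Hypothetical skill data. Figures are ASSUMED to be the 2024/25 UK rates;
# a real skill would be written and signed off by a qualified accountant.
SKILL = {
    "tax_year": "2024/25 (assumed)",
    "personal_allowance": Decimal("12570"),
    "taper_threshold": Decimal("100000"),
    "class4_lower_limit": Decimal("12570"),
    # Each band is (upper limit of total income, rate); None = no upper limit.
    "income_tax_bands": [
        (Decimal("50270"), Decimal("0.20")),
        (Decimal("125140"), Decimal("0.40")),
        (None, Decimal("0.45")),
    ],
    "class4_ni_bands": [
        (Decimal("50270"), Decimal("0.06")),
        (None, Decimal("0.02")),
    ],
}


def banded_charge(amount, floor, bands):
    """Charge each slice of `amount` above `floor` at its band's rate."""
    total = Decimal("0")
    lower = floor
    for upper, rate in bands:
        top = amount if upper is None else min(amount, upper)
        if top > lower:
            total += (top - lower) * rate
        if upper is None or amount <= upper:
            break
        lower = max(lower, upper)
    return total


def compute_bill(profit, skill):
    """Income tax plus Class 4 NI for a sole trader with no other income."""
    if profit > skill["taper_threshold"]:
        # The sketch deliberately stops here rather than guess at the taper.
        raise ValueError("sketch does not model the personal allowance taper")
    income_tax = banded_charge(
        profit, skill["personal_allowance"], skill["income_tax_bands"]
    )
    class4_ni = banded_charge(
        profit, skill["class4_lower_limit"], skill["class4_ni_bands"]
    )
    return {
        "income_tax": income_tax,
        "class4_ni": class4_ni,
        "total": income_tax + class4_ni,
    }


bill = compute_bill(Decimal("60000"), SKILL)
for item, amount in bill.items():
    print(f"{item}: £{amount:,.2f}")
# income_tax: £11,432.00
# class4_ni: £2,456.60
# total: £13,888.60
```

Those outputs follow from the assumed figures only. The useful property is that updating a rate means editing the data in one place, and every run then uses the same number instead of whatever a model happens to remember.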

Try it yourself

Take the prompt above. Run it through whatever AI tools you use. Compare the answers. If they disagree — and they will — that's your evidence that "basically fine" isn't fine at all.
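
If you collect the totals by hand, a few lines of code can do the comparing for you. This is a minimal sketch: the `grade` function, the £1 tolerance, and the sample totals are all made up for illustration and are not results from any real model. The reference figure has to come from a source you trust.

```python
from decimal import Decimal


def grade(answers, reference, tolerance=Decimal("1.00")):
    """Return (spread, verdicts) for a dict of model name -> total bill."""
    # Spread is the gap between the highest and lowest answer.
    spread = max(answers.values()) - min(answers.values())
    verdicts = {
        name: "match" if abs(total - reference) <= tolerance else "mismatch"
        for name, total in answers.items()
    }
    return spread, verdicts


# Placeholder totals, NOT output from any real model.
answers = {
    "A": Decimal("1000.00"),
    "B": Decimal("1000.40"),
    "C": Decimal("940.00"),
}
spread, verdicts = grade(answers, reference=Decimal("1000.00"))
print(spread)    # 60.40
print(verdicts)  # {'A': 'match', 'B': 'match', 'C': 'mismatch'}
```

The tolerance is there so that pennies of rounding do not count as a disagreement; anything beyond it is worth a second look.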

Then point your agent at a verified skill and watch the disagreement disappear. If you want a qualified human to check the result before you file, that's what the accountant network is for.

Five models. One return. Three wrong. The only reliable way to know which three is to have an accountant in the loop, so we built one in.