Enkosi Ventures

The Hidden Risk of AI Analysis

On a failure mode that doesn't get enough attention



One of the places I find AI most useful is early-stage brainstorming. Bring a messy idea, spitball it with a model, see where the exploration goes. Low stakes, easy to verify, human in the loop.

Doing this in areas where I have some familiarity also lets me stress-test the models. And occasionally, that yields disturbing results.

Where it went wrong for me

Recently I was designing a bias audit for an ML system. With limited labelled data and a tight human-evaluation budget, I needed to plan carefully how to build a baseline validation set for the assessment. A seemingly solid use case for bouncing ideas around with an LLM to help think through sampling strategies.

One of its suggestions sounded pragmatic: don't waste time on random cases. Focus human review on the most egregious historical examples of bias and build your validation set from those.

If you've done any statistics, alarm bells probably just went off.

This is textbook selection bias. You're constructing your evaluation set by conditioning on large past errors. Any metrics you compute on that set no longer describe how the system behaves in the real world. They describe how it behaves on a hand-picked tail of worst cases. Your validation set is biased by design.
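To make the flaw concrete, here's a toy simulation (all numbers invented for illustration) of what happens when you build a validation set by conditioning on known past failures instead of sampling at random:

```python
import random

random.seed(0)

# Hypothetical system with a small, uniform error rate.
# Each "case" is (true_label, predicted_label).
N = 100_000
ERROR_RATE = 0.05  # assumed true population error rate

population = []
for _ in range(N):
    truth = random.random() < 0.5
    wrong = random.random() < ERROR_RATE
    pred = (not truth) if wrong else truth
    population.append((truth, pred))

def error_rate(cases):
    return sum(t != p for t, p in cases) / len(cases)

# Sound approach: a random sample estimates the real-world error rate.
random_sample = random.sample(population, 1_000)

# Flawed approach: build the validation set from known past failures,
# i.e. condition on the very cases where the system was already wrong.
worst_cases = [c for c in population if c[0] != c[1]][:1_000]

print(f"population error rate:     {error_rate(population):.3f}")
print(f"random-sample estimate:    {error_rate(random_sample):.3f}")
print(f"worst-case-set 'estimate': {error_rate(worst_cases):.3f}")
```

The random sample lands close to the true 5% error rate; the worst-case set reports 100% error by construction. Any metric computed on the second set tells you nothing about how the system behaves on the cases it will actually see.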

The model didn't hallucinate anything. It produced confident, statistically literate reasoning whose core logic was wrong.

Why this matters

That's the part that bothers me. This isn't "the model made up a paper and I caught it." The reasoning chain is coherent. The language sounds like a competent analyst. The error lives in the statistical backbone, not in an obviously absurd statement. If you don't have statistical training, there's no clear red flag. It doesn't read like nonsense; it reads like expertise.

What makes this feel under-discussed is that almost all the public examples of AI failures focus on hallucinations: fake citations, invented sources, nonsense facts. I rarely see people sharing "here's an LLM doing serious-sounding analysis with a fundamental flaw in its reasoning", despite these systems being marketed as reasoning engines.

Now, it's fair to point out that humans make mistakes too, and that doesn't mean we're better off without analysts altogether. But the closest human analogue here isn't "no analyst." It's a weak analyst whose work sounds sophisticated enough that no one else in the room has the expertise (or even the confidence) to challenge it, but whose output still drives decisions.

The real-world stakes

For plenty of organisations, this isn't that big a deal. The reasoning behind decisions doesn't need to be particularly rigorous, just directionally useful. And if it's not, the impact is mostly confined to the company's bottom line.

Where this gets scary is when the analysis affects real people's lives: hiring, lending, education, policing, healthcare, social services. And many institutions in these areas are under-resourced and over-stretched, making them "perfect" candidates for AI-assisted productivity gains.

A well-funded policy institute can afford statisticians to vet AI-generated analysis. But what about the underfunded school district trying to allocate limited support staff? The cash-strapped hospital board optimising care under constrained budgets?

When AI suggests an approach that's bad but not nonsensical, and stakeholders lack the expertise to scrutinise it, the consequences fall on real people, very likely disproportionately on marginalised ones.


For high-stakes work, don't let vendor hype convince you these tools fill the gap left by missing expertise. AI doesn't give you skills you don't already have. It just makes it easier to act as if you do.