Why do AI-generated plans fail even when they look complete?

Because the dangerous failure mode is not hallucination — it is silence. A plan author, human or AI, inherits the limits of what it checked. If the AI audited cron jobs but never thought to check systemd services, the plan reads as complete while an entire class of dependencies is missing. The plan is plausible, well-structured, and silently incomplete.

Can an AI verify its own plan?

Not reliably. In our own testing, a model asked to re-review its own work re-confirmed its own blind spots — in one case it recomputed a number correctly and still certified its earlier wrong conclusion. The verifier must be independent: a different model, a deterministic script, or a human. Self-certification from the same model is worth very little.

How much does independent verification cost?

In our case: about 30 minutes of read-only checks — re-running the commands the plan cited, inspecting services on each machine, checking mounts and sync state. It found two real problems that would have caused confusing mid-migration failures. It is the cheapest insurance in the whole process.

How to Verify an AI-Written Plan Before It Fails: The 30-Minute Method

Last week an AI wrote a migration plan for our infrastructure. The task: rename and restructure the folder system that three synced machines, a production bot, and our own AI agents depend on — roughly 11 GB and several hundred path references.

The plan was good. Well-structured, phased, with rollback steps. It was also about 95% right — and the missing 5% would have surfaced as broken services halfway through the migration, with no obvious cause.

We caught the missing 5% before executing anything. Not with better prompting, and not by asking the AI to double-check itself. With a method any team can copy: independent verification.

The real failure mode of AI plans

The popular fear about AI output is hallucination — confident nonsense. In day-to-day technical work, that is not what bites you.

What bites you is a plan that is plausible, well-organized, and silent about what it never checked. Every claim in it can hold for the things the author actually looked at. The danger lives in what was never looked at and therefore never appears in the plan.

A plan author, human or AI, inherits the limits of its own audit. If it checked only cron jobs, the plan will confidently say "no scheduled dependencies" while a systemd service it never looked at crash-loops in the background. The document reads as complete. The gap is invisible by construction.

That is why "the AI checked it" means very little. It means one path was checked.

The method, step by step

This took us about 30 minutes. It works just as well on human-written plans.

1. Hand the plan to a different verifier

Give the complete plan to a different AI model — or a colleague — with one instruction:

"Verify everything. Trust nothing. Report what's wrong."

The key word is different. A fresh verifier does not share the author's assumptions about what was worth checking. We used a second model from a different family; a sceptical colleague works the same way.

2. Read-only checks only

The verifier's job is to inspect, not to fix. Re-run every command that a claim in the plan rests on. Look at every system the plan touches. Change nothing yet.

This boundary matters: a verifier that starts fixing things mid-audit stops being a verifier and becomes a second author — with its own blind spots.
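One way to make the read-only boundary mechanical is to screen each command before the verifier runs it. The following is a minimal sketch, not the tooling we used: the READ_ONLY allowlist and the is_read_only helper are hypothetical names, and the list of commands is an assumption you would adapt to your own machines. It is deliberately conservative and rejects anything it does not recognise.

```python
import shlex

# Hypothetical allowlist. A set lists the only first arguments we accept;
# None means the program is treated as read-only with any arguments.
READ_ONLY = {
    "systemctl": {"status", "list-units", "list-timers", "is-active", "cat"},
    "crontab": {"-l"},
    "findmnt": None,
    "ls": None,
    "du": None,
    "grep": None,
}

# Characters that could chain commands or redirect output to a file.
SHELL_META = set(";|&><`$")


def is_read_only(command: str) -> bool:
    """Return True only if the command matches the allowlist above."""
    if any(ch in SHELL_META for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    if not tokens or tokens[0] not in READ_ONLY:
        return False
    allowed = READ_ONLY[tokens[0]]
    if allowed is None:
        return True
    # Ignore long options such as --user, then check the first real argument.
    args = [t for t in tokens[1:] if not t.startswith("--")]
    return bool(args) and args[0] in allowed


for cmd in ["crontab -l", "systemctl restart bot.service"]:
    print(cmd, "->", "run" if is_read_only(cmd) else "refuse")
```

The screen errs on the side of refusing: a grep pattern containing a pipe character is rejected even though it is harmless. During an audit, a false refusal costs a minute; a false approval costs the audit.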

3. Check the class, not just the claim

This is the step that paid for everything. Our plan claimed there were no runtime dependencies on the folders being renamed, based on an audit of cron jobs. The verifier asked the class question: "Scheduled jobs were checked — what other kinds of scheduled or persistent things exist on these machines?" — and inspected systemd user services too.

It found two stale services the audit had missed. One of them had been crash-looping every 5 seconds for two weeks without anyone noticing. Both referenced paths the migration was about to change.

The lesson: when a plan says "we checked X," ask what belongs to the same category as X — and check the rest of the category.
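The class question can be written down as a small lookup. This is a sketch under stated assumptions: DEPENDENCY_CLASSES and unchecked_members are hypothetical names, and the catalogue holds only one class with a handful of members that commonly exist on Linux machines. The commands shown are standard read-only invocations, but your own category list will be longer.

```python
# Hypothetical catalogue: each class maps its members to a read-only
# command that inspects them. Extend it for your own environment.
DEPENDENCY_CLASSES = {
    "scheduled or persistent processes": {
        "cron jobs": "crontab -l",
        "systemd user services": "systemctl --user list-units --type=service --all",
        "systemd system services": "systemctl list-units --type=service --all",
        "systemd timers": "systemctl list-timers --all",
        "at jobs": "atq",
    },
}


def unchecked_members(checked):
    """Return {class: {member: command}} for members the plan never audited.

    A class is reported only if the plan checked at least one of its members,
    i.e. the plan made a claim about that class.
    """
    gaps = {}
    for cls, members in DEPENDENCY_CLASSES.items():
        if not any(member in checked for member in members):
            continue
        missing = {m: cmd for m, cmd in members.items() if m not in checked}
        if missing:
            gaps[cls] = missing
    return gaps


# The plan in this article audited cron jobs only.
for cls, members in unchecked_members({"cron jobs"}).items():
    print(f"Plan checked part of '{cls}'. Still unverified:")
    for name, cmd in members.items():
        print(f"  {name}: {cmd}")
```

The value is less in the code than in the catalogue: writing the category down once means the next verifier does not have to think of systemd services on their own.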

4. Re-verify every number

Our plan stated how many path references needed updating. By the time we verified, the real count was about 30% higher — the system had kept growing in the two days since the plan was written.

Plans are snapshots. Systems are not. Any number in a plan is stale by default; re-measure at execution time.
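Re-measuring can be as simple as counting again and comparing against the number in the plan. A minimal sketch, assuming the references are plain-text occurrences of an old path inside files under one directory: count_references and drift are hypothetical helper names, and the figures in the example call are illustrative, not our real counts.

```python
import os


def count_references(root: str, needle: str) -> int:
    """Count lines under root that contain needle, skipping unreadable files."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    total += sum(1 for line in fh if needle in line)
            except OSError:
                continue
    return total


def drift(planned: int, measured: int) -> float:
    """Relative change of the measured count against the planned count."""
    if planned == 0:
        return float("inf") if measured else 0.0
    return (measured - planned) / planned


# Illustrative numbers only: a plan that said 100 and a fresh count of 130.
print(f"drift: {drift(100, 130):.0%}")
```

In practice you would call count_references at execution time with the real root directory and the old path, then refuse to proceed if the drift exceeds a threshold you chose in advance.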

5. Never accept the author's own "verified"

We have tested this directly with AI models: asked to review their own earlier work, they re-confirm their own conclusions — in one striking case a model recomputed a figure correctly and still certified its earlier, contradictory claim as correct. A self-review from the same author mostly re-walks the same paths that produced the gap in the first place.

Independence is not optional. It is the entire mechanism.

Why this is an old idea — and why it matters now

None of this is new. Finance has run the four-eyes principle for decades: the person who prepares a payment is never the person who approves it. Aviation, medicine, and engineering all institutionalized the same insight — authors cannot see their own blind spots, so review must be structurally independent.

AI just makes the principle urgent for everyone, because AI produces author-grade output at a volume no team of humans ever did. If your business uses AI to draft plans, proposals, or analyses, the operative question has changed. It is no longer "is the AI right?" It is "who verifies — and how is that documented?"

For European businesses there is a deadline attached to that question: most of the EU AI Act's obligations apply from 2 August 2026, and human oversight is one of its central requirements for high-risk AI systems. The 30-minute method above is not just good engineering hygiene; it is the seed of the kind of documented oversight process you may be asked to show.

The takeaway

The dangerous AI failure mode is silence, not hallucination.
Verification must be independent — different model, script, or human. Self-review certifies blind spots.
Read-only first. Check the class, not the claim. Re-measure every number.
Budget ~30 minutes. In our case it prevented a broken production migration.

We build self-hosted AI agent systems for European SMEs — with this kind of human oversight designed in from day one, because that is what both reliability and the EU AI Act demand. If you are wondering what AI verification should look like in your business, talk to us.
