AI Is Giving Us What We Want, Not What We Need

The MVP Was the Easy Part

Steve Blank has been teaching Stanford’s Lean LaunchPad for sixteen years. This year, for the first time, teams walked in on day one with MVPs that looked finished: not rough prototypes, not wireframes, but finished-looking products. “I’ve been writing about how AI is going to change startups, but seeing 8 teams actually implementing it was mind blowing,” he admitted. What previously took weeks to build now appeared almost overnight. Blank’s conclusion? “After the class, as the instructors sat processing what just happened, we realized there’s no going back.” He’s right, but he may be understating what we are leaving behind. The friction AI has now removed wasn’t just annoying; it was doing a job.

When building was hard, weak assumptions had somewhere to get caught. A team had to spend real time, real money, and real energy turning an idea into something testable. That process created natural points of resistance, those moments when someone might pause and ask whether the problem was actually real, whether users genuinely cared, whether the customer was clear. The irony is that AI has cleaned up the mess, but in doing so it may have removed the mechanism that made the mess useful.

Jim Hornthal, of the Berkeley-based Haas School of Business, commenting on Blank’s post, put it simply: “Product development is no longer the primary bottleneck.” Teams can produce functional MVPs in days. But the hard constraint, he notes, hasn’t changed at all: it’s still customer discovery, validation, and the search for a repeatable business model. The “AI that still matters most,” as he drily puts it, is “Actual Interviews”. In reply, Netherlands-based Joost Okkinga added an observation that sharpens the whole problem: “AI makes teams feel further ahead than they are. The product artefact improves faster than the evidence base. So the discipline needs to move upstream: map assumptions first, then use AI to accelerate the right tests.”

A startup founder can now generate a landing page, a prototype, a pitch deck, a feature roadmap, a user onboarding flow, and a full set of customer personas before they have properly tested whether the key assumption is true. The results may look sophisticated; they impress casual observers, and they may even impress the founder. But underneath, the core thinking may still be fragile, and thanks to AI that is now harder to see, because the surface looks so tidy and clean.

A line graph comic titled "THE MVP WAS THE EASY PART" plotting a founder's confidence and mood over time. The curve goes up during the "BUILDING (DOPAMINE)" phase, which features captions like "Oh, random idea at 1 AM" and "14h: launch tweet scheduled 🚀". At the peak, the line turns red and plummets into the "DISTRIBUTION (CHARACTER DEVELOPMENT)" phase. Captions on the downward slope track growing realization over 45 days, including "4 days: 17 visitors. 9 are me" and "21 days: becoming a linkedin thought leader against my will," ending with a character lying defeated on the floor with the text "discovering that distribution was the startup." Underneath the phases, text reads: "AI GOT OUR BACK" for building, and "GOOD LUCK 💀" for distribution.

Credit: Dmitry Trofimets (https://www.linkedin.com/in/dtrofimets/)

When Gaps Become Outputs

There is a subtler version of the same problem that most teams discover the hard way. When AI becomes the most frequent reader of your instructions, documents, and context, the assumptions you never thought to write down stop being harmless gaps and start becoming inputs. A human colleague fills in what is missing from shared experience, background knowledge, or through a quiet word across a desk. A model does not ask for a discussion. It works with what it is given, and where something is missing it may not pause; it proceeds. As a result the gap does not stay a gap; it is baked into the output. This is why working with AI does not just reward precision. It also exposes every step you assumed did not need saying.

Research-Looking Is Not Research

The same dynamic is playing out in scientific research, and in some ways it’s more alarming there. A recent paper on AI-generated research, “AI for Auto-Research: Roadmap & User Guide”, describes systems that can produce a complete paper, including idea generation, literature search, code, experiments, charts, manuscript, simulated peer review, and rebuttal, all for as little as $15. One system ran for 228 hours, used 11.4 billion tokens, and produced 100 papers. Another reportedly ran more than 20 GPU experiments overnight and improved a draft score from 5.0 to 7.5 through automated review and revision loops.

The striking thing isn’t the $15 figure. It is the paper’s deeper warning: a research paper can now carry a clean title, a polished abstract, organised sections, good-looking figures, citations, experiments, and a confident conclusion, while the science underneath remains fragile. The code may run while testing the wrong thing. The idea may sound original until someone tries to implement it. To quote the paper’s authors: “The core challenge is therefore no longer whether AI can produce the forms of research, but whether it can preserve the substance of research: evidence, judgment, provenance, and accountability.”

In both cases, from startups to research labs, AI is lowering the cost of producing something that looks complete while raising the cost of knowing whether it is actually fit for purpose. The bottleneck doesn’t disappear; it moves upstream, from production to validation, and it becomes harder to see because the MVP or scientific paper no longer carries the obvious marks of imperfection that used to signal unfinished thinking.

Marketing Activity Is Not Evidence

The same confusion can show up in marketing, where AI agents can generate activity faster than teams can interpret whether that activity means anything commercially. In the crypto gaming sector, one of the first campaigns using AI agents has delivered results for Whale.io. “This MCP campaign featured over 15 AI agents that collectively generated more than $900k in volume,” reported igaming expert Adar Ziv this week. “While some use Claude and Cursor to find bugs in smart contracts, break bridges, and earn bug bounties, others have taken a more entertaining route,” he added.

At first glance, that looks like evidence of success: a well-designed incentive campaign can produce high transaction volume. But the key question is not “did the agents produce volume?” The better question is “what did that volume prove?” Did it reveal durable demand for agentic igaming, or did it show that agents can be directed to move capital rapidly inside a nicely designed incentive loop? Both are interesting, but they are not the same game. The danger is not that AI gets things wrong. It is that AI produces valid-looking evidence at speed, allowing teams to mistake accelerated activity for validated demand or business value.

Mission-Ready Is Not Flight-Ready

Space technology is where this problem becomes most unforgiving, and where the stakes of misreading an AI-generated output are hardest to recover from. In ordinary software, a weak assumption may surface as churn, wasted roadmap time, or a failed feature — costly, but usually recoverable. In space, the same kind of hidden dependency can become a missed launch window, an expensive redesign, a failed payload, or a lost mission entirely. NASA’s own systems-engineering guidance warns that systems can “pass verification but fail validation,” and its modelling and simulation standards require assumptions, limits of operation, and uncertainty to be made explicit before decision-makers rely on simulation outputs. That is a formal acknowledgement of something engineers have long understood: a model can be internally consistent, well-documented, and technically impressive while still depending on assumptions that have not been properly tested.

If AI can help space tech teams generate cleaner mission plans, simulations, technical documentation, and design options earlier in the process, it can also make those plans look more compelling before the load-bearing engineering assumptions have been stress-tested. The simulation runs, the outputs look plausible, the documentation is polished, and funding is being sought. But coherence is not the same as correctness. A model can look rigorous precisely because it is self-consistent, while still being wrong in the way that matters. In space, discovering that distinction late is not a sprint retrospective. It can be an expensive failed mission. Surfacing assumptions before building on them is therefore not a methodology choice. In space, it is an engineering necessity. The first obligation is not to pick the right assumption to test, but to surface the assumptions behind the simulation, mission plan, or design case. You cannot rank risks you have not yet surfaced.

A Stitch in Time

This is where I believe the Needle Framework earns its keep, not as another productivity layer or a better prompt pack, but as a way of forcing the hidden logic into the open before AI accelerates everything built on top of it. Its value is not that it magically knows the right answer. It is that it asks the awkward prior questions: what does this plan depend on being true, what has been assumed rather than evidenced, and which dependency would hurt most if it failed?

AI does not automatically improve judgement. In fact, it can all too easily disguise the lack of it. It can produce a fluent explanation, a polished simulation, a confident-looking research paper, or a finished-looking product before any of the assumptions underneath have been properly tested. That is the real risk: not that AI gets things wrong, but that it makes wrong things look right, faster than ever before. When people are forced to explain their logic step by step, weak assumptions start to show themselves.

The Needle Framework is a discipline for making that happen earlier: before the MVP looks finished, before the paper sounds convincing, before the campaign dashboard fills up, and before the simulation becomes a source of false confidence. Because AI is increasingly good at giving us what we want. The harder task is working out what we actually need.

Why 4 Minutes Matter: The Hidden Cost of Imprecise Data in AI

Most of us grow up believing that a day is exactly 24 hours long. It’s tidy, convenient, and feels close enough to reality. But strictly speaking, the Earth completes one rotation on its axis, measured against the distant stars, in about 23 hours and 56 minutes, which astronomers call a sidereal day. The extra four minutes in the 24-hour solar day come from the Earth’s simultaneous orbit around the Sun: the planet has to turn a little further each day before the Sun returns to the same point in the sky. If we ignored this subtlety and set our clocks by the rotation alone, our sense of time would slowly drift out of sync with the Sun itself. Noon would stop being “midday.”

Those four minutes are a small detail — but they matter.
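
The size of that drift is easy to put numbers on. What follows is a minimal sketch, assuming the commonly quoted approximate sidereal day of 23 h 56 min 4 s and a 24 h mean solar day; the function names are illustrative only.

```python
# Approximate figures: a mean solar day is 24 h, a sidereal day about 23 h 56 min 4 s.
SOLAR_DAY_S = 24 * 60 * 60
SIDEREAL_DAY_S = 23 * 3600 + 56 * 60 + 4


def drift_seconds(days: int) -> int:
    """Gap that builds up after `days` if clocks ran on the sidereal day."""
    return days * (SOLAR_DAY_S - SIDEREAL_DAY_S)


def format_hms(seconds: int) -> str:
    """Render a number of seconds as hours, minutes, and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


for days in (1, 30, 365):
    print(f"after {days:>3} days: {format_hms(drift_seconds(days))}")
# after   1 days: 0h 03m 56s
# after  30 days: 1h 58m 00s
# after 365 days: 23h 55m 40s
```

Under those assumptions, a gap of under four minutes a day becomes roughly two hours within a month and close to a full day within a year.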


The Data Analogy

This is exactly what happens when organisations feed “close enough” data into AI systems. At first, the model might seem fine. Predictions look reasonable. The dashboards tick over. But just like those four missing minutes, tiny inaccuracies and fuzzy definitions build up. Over weeks, months, or years, the system drifts further from reality.

Suddenly, your AI isn’t aligned with the world as it actually is. Recommendations miss the mark. Bias creeps in. Customers lose trust.

The lesson? Precision in data is not pedantry. It’s the difference between alignment and drift.


Why Precision Matters

  • Compounding effect: Small errors accumulate over time, like four minutes a day becoming about two hours in a month and a full day in a year.
  • AI is literal: Models take inputs as ground truth. A vague definition or inconsistent label isn’t “good enough.” It’s an anchor point for bad predictions.
  • Trust is fragile: Once stakeholders see AI outputs wobble, confidence in the entire system erodes.

The Needle Framework: Finding the Signal

Getting data right is about finding the needle in the haystack: the clear, sharp definition hidden among the fuzz. When you sharpen the data — consistent labels, correct units, precise categories — you give AI a sidereal day to lock onto. A stable reference point. A system that stays in sync instead of drifting.
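
As a rough illustration of what sharpening can look like in practice, here is a small Python sketch that normalises labels and units and rejects anything it does not recognise instead of guessing. The `Record` fields, the label set, and the unit table are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass

# Hypothetical schema: these labels and units are illustrative assumptions,
# not taken from any real dataset.
ALLOWED_LABELS = {"active", "paused", "churned"}
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


@dataclass
class Record:
    label: str
    weight: float
    unit: str


def normalise(record: Record) -> Record:
    """Return a cleaned copy in kilograms, or raise ValueError rather than guess."""
    label = record.label.strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {record.label!r}")
    unit = record.unit.strip().lower()
    if unit not in TO_KILOGRAMS:
        raise ValueError(f"unknown unit: {record.unit!r}")
    return Record(label=label, weight=record.weight * TO_KILOGRAMS[unit], unit="kg")


def split_clean_and_rejected(records):
    """Separate records that pass normalisation from those that need a human look."""
    clean, rejected = [], []
    for record in records:
        try:
            clean.append(normalise(record))
        except ValueError as err:
            rejected.append((record, str(err)))
    return clean, rejected
```

The design choice worth noting is the rejected pile: a record with a fuzzy label is surfaced for review rather than quietly mapped to the nearest category.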


So What?

AI isn’t magic; it’s alignment. And alignment starts with data. Just as astronomers can’t afford to ignore the missing four minutes, companies can’t afford to wave away small inconsistencies. The cost of “close enough” is hidden drift.

The sharper your data, the sharper your AI. And that’s where the real value emerges.


Four minutes matter in astronomy. And they matter in AI. Get your data precise, and your systems won’t just work today — they’ll stay aligned tomorrow.

Does AI Governance Need an Assumption Layer?

Why execution-boundary AI governance needs upstream assumption testing

Most AI governance still begins in the wrong place. It starts with rules, which is an understandable instinct, because rules are visible — they can be written down, audited, checked, enforced, and explained afterwards. In a world of increasingly capable AI agents, the impulse to define firm boundaries on what a machine is and isn’t allowed to do feels like the responsible first move.

But the harder problem is rarely the rule itself; it’s the assumption hidden inside it. A rule can be morally attractive, technically enforceable, and still fail in exactly the situation where it matters most, not because it was poorly written or carelessly considered, but because it depends on a condition that has quietly stopped being true. That’s where the next phase of AI governance gets difficult: not at the level of slogans, but at the level of assumptions.

When the rule meets its exception

Consider one of the most emotionally powerful rules in AI governance: an autonomous weapon should not use lethal force without a human in the loop. As a default, it’s hard to object to. It protects human judgment, prevents machines from becoming independent killing systems, and keeps moral responsibility attached to human authority rather than algorithmic decision-making.

But Ben Goertzel, in The Anthropic Fable Farce, recently offered an edge case that exposes what happens when this rule is treated as absolute. Picture a drone, cut off from the network, that sees a man seconds away from pressing a button that will launch a weapon capable of killing a million people. The drone’s hard rule says it may not use force without human approval, but no approval is available, because the link is down. If the drone follows the rule, a million people die. The scenario forces the uncomfortable question of whether, in that situation, the drone should act.

The point isn’t that autonomous weapons should be given broad permission to kill. The point is sharper: the absolute rule depends on an assumption, specifically that a human decision path will remain available when the decision matters. If that assumption fails, the rule may no longer produce the safety outcome it was designed to protect, with devastating consequences.

This doesn’t make the rule disappear, and it doesn’t make the underlying moral concern go away. If anything, the risks of machine error, spoofing, false positives, escalation, and misuse become more serious, not less. What disappears is the absolute version of the rule. And once even one legitimate exception is admitted, the governance question changes shape. It’s no longer enough to say a human must always be in the loop. The harder question becomes: under what conditions, defined in advance, could an exception ever be admissible, and who has the authority to define those conditions?

Why accuracy isn’t the whole answer

One tempting response is to treat this as simply a question of accuracy: if the AI isn’t reliable enough, it shouldn’t act, and perhaps the rule can change once it becomes reliable enough. That’s partly true. Accuracy matters enormously: a drone that misidentifies the person, the weapon, the intent, or the consequence isn’t preventing catastrophe, it’s creating one. Any exception that allows force on weak evidence would be both morally and technically dangerous.

But accuracy isn’t the whole problem. Even a far more accurate system would still need a governance structure around the decision: what evidence counts, how uncertainty is handled, whether the signal might have been spoofed, what prior authority exists, how the decision gets recorded, and how responsibility is assigned afterwards. The question isn’t only whether the AI is right. It’s whether the conditions under which it may act have been identified, tested, bounded, and made explicit before the crisis occurs, which is a different kind of work entirely. It isn’t model evaluation, red-teaming, or post-hoc audit. It’s the upstream task of discovering what the rule actually depends on being true.

In the drone case, that means surfacing assumptions like: the human approval path will remain available, waiting for approval is safer than acting, inaction is morally neutral rather than itself a choice, the system can reliably distinguish catastrophe from ambiguity, the exception can’t be spoofed or exploited, and the action remains accountable afterwards. These aren’t secondary details; they’re the real structure underneath the rule. If they are never surfaced, the governance system can look safe while remaining brittle.
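
One way to stop those assumptions living only in people’s heads is to write them down as data. The sketch below is a minimal Python illustration, assuming a hypothetical `Assumption` record with an evidence status and an impact score. The statements come from the drone example; the status and score values are invented purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    UNTESTED = "untested"
    PARTIAL = "partial"
    VALIDATED = "validated"


@dataclass(frozen=True)
class Assumption:
    statement: str
    evidence: Evidence
    impact_if_false: int  # 1 (minor) to 5 (catastrophic); an illustrative scale


def unresolved(assumptions):
    """Return assumptions that are not yet validated, worst impact first."""
    open_items = [a for a in assumptions if a.evidence is not Evidence.VALIDATED]
    return sorted(open_items, key=lambda a: a.impact_if_false, reverse=True)


# Statements from the drone example; evidence and impact values are made up.
rule_assumptions = [
    Assumption("Human approval path remains available", Evidence.UNTESTED, 5),
    Assumption("Waiting for approval is safer than acting", Evidence.PARTIAL, 4),
    Assumption("The exception cannot be spoofed or exploited", Evidence.UNTESTED, 5),
    Assumption("The action remains accountable afterwards", Evidence.VALIDATED, 3),
]

for item in unresolved(rule_assumptions):
    print(f"[{item.evidence.value}] impact {item.impact_if_false}: {item.statement}")
```

The scoring is deliberately crude. The point of the sketch is only that an assumption which has been written down can be reviewed, challenged, and ranked, while an implied one cannot.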

The execution boundary is not enough

A new class of AI governance solutions is emerging around what might be called the execution boundary: the point where an agent stops merely suggesting something and starts doing something that affects the world — updating a record, moving money, changing a parameter, sending a message, triggering a workflow, approving a transaction, or making an operational decision.

This shift matters, because an agent that takes action creates consequences directly, in a way a chatbot that gives a bad answer doesn’t. The basic idea behind execution-boundary governance is right: before an agent acts, the system should check whether the action is authorised, evidenced, within scope, and safe to proceed, and it should be able to allow, restrict, escalate, delay, or refuse accordingly, creating an evidence record so the decision can be reviewed later. That’s a major improvement over governance that only reviews what happened afterwards.
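
To make the idea concrete, here is a minimal Python sketch of such a check. It is an illustration under stated assumptions, not a description of any real product: the `Action` fields, the `AUTHORISED` table, and the `AUTO_LIMIT` threshold are all hypothetical, and only three of the possible outcomes (allow, escalate, refuse) are modelled.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"
    REFUSE = "refuse"


@dataclass
class Action:
    name: str
    actor: str
    amount: float
    evidence: list = field(default_factory=list)


# Hypothetical policy values, chosen only for illustration.
AUTHORISED = {"billing-agent": {"issue_refund"}}
AUTO_LIMIT = 100.0

audit_log = []  # evidence record so each decision can be reviewed later


def gate(action: Action) -> Decision:
    """Check authority, evidence, and scope before the action is executed."""
    if action.name not in AUTHORISED.get(action.actor, set()):
        decision, reason = Decision.REFUSE, "actor not authorised for this action"
    elif not action.evidence:
        decision, reason = Decision.ESCALATE, "no supporting evidence attached"
    elif action.amount > AUTO_LIMIT:
        decision, reason = Decision.ESCALATE, "amount above automatic limit"
    else:
        decision, reason = Decision.ALLOW, "within scope"
    audit_log.append({
        "action": action.name,
        "actor": action.actor,
        "decision": decision.value,
        "reason": reason,
    })
    return decision
```

Every call leaves an entry in `audit_log`, whatever the outcome, so a refusal is as reviewable as an approval.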

But that leaves the harder question upstream. A runtime governance layer can enforce constraints, check authority, record evidence, and block actions that fall outside a permitted corridor. What it can’t do by itself is know which constraints should exist in the first place. Before a layer can enforce a rule, someone has to identify the dependency that makes the rule necessary, and before a system can block a dangerous transition, someone has to recognise that the transition is dangerous. That identification is the missing upstream layer, and it doesn’t happen automatically just because the execution layer is well built.

What changes when assumptions become operating instructions

In ordinary decision-making, a weak assumption tends to produce a bad plan, a failed project, or wasted money. In agentic AI, the same weak assumption can become part of an operating system, and that changes the stakes considerably.

A team might assume a monitoring signal is reliable enough to trigger an intervention, and an agent acts on it before it’s stable. A company might assume a customer request implies valid consent, and an agent moves or exposes data on that basis. A platform might assume a human approval step exists somewhere in the workflow, and an agent routes around it because the condition was never made explicit. A security system might assume a blocked identity means a blocked capability, and the capability leaks through another channel anyway. In none of these cases is the agent malfunctioning; in fact, it’s doing exactly what the system permits. That’s the danger: assumptions that once sat quietly in strategy documents and slide decks can become executable, and a governance system can enforce the wrong thing with impressive consistency.

This is why the real bottleneck in AI governance isn’t primarily enforcement. The questions that matter most, such as what this workflow actually depends on being true, or which assumption would create the most damage if wrong, aren’t technical afterthoughts; they’re part of governance itself. In many failures, the agent won’t have broken the rule at all. The wrong rule will have been encoded, or the exception was never defined, or the evidence requirement never matched the real risk. The governance layer performs exactly as designed, but the design is still flawed.

Where a Needle-style approach fits

An upstream assumption layer shouldn’t replace the execution layer, authorise action, or try to resolve moral decisions in the moment. Its role is narrower and earlier: to help identify what a proposed action depends on being correct, which assumptions are stated and which are merely implied, where the evidence is weak, where authority is ambiguous, and where a candidate constraint should be defined before execution rather than discovered after it.

This is where a Needle-style framework becomes useful, not as a governance system in itself, but as a method for finding the hidden dependency before the governance system is asked to enforce anything. The question is simple: what must be true for this decision, rule, workflow, or agent action to be safe enough to proceed? The follow-up is harder: what happens if that assumption fails? Different domains will produce different answers. In the drone case, it’s whether the human loop remains available. In an enterprise agent workflow, it might be whether approval has genuinely been granted, whether data use is permitted, or whether a signal is strong enough to justify action. But the structure is the same in every case: a visible rule sits on top of a hidden dependency, and if the dependency fails, the rule may no longer behave as intended.

Why this matters commercially

For enterprises, this isn’t only an ethics question; it’s operational. If agents are going to touch real systems, governance has to become part of production, not just policy. But production governance only works if the right constraints were selected in the first place. Get that wrong, and the costs can show up everywhere.

The recent Starbucks Korea crisis is a striking example of the potential failure pattern. In May 2026, Starbucks Korea launched a “Tank Day” tumbler promotion on the anniversary of the Gwangju pro-democracy uprising, using language many Koreans read as echoing both the military crackdown and a notorious police torture cover-up. The campaign was reportedly developed with the help of generative AI, but the real failure was not that AI suggested the wrong words. It was that the company’s human governance process did not catch what those words meant.

The promotion passed through multiple layers of approval, while the cultural and historical assumption underneath it remained invisible: that the people signing off the campaign understood the society they were selling to. The commercial consequences of the error were immediate: the campaign was pulled, the local CEO was dismissed, sales fell sharply, stores were later scheduled to close early for nationwide history and social-sensitivity training, and Starbucks Korea announced changes to its marketing approval procedures. That is why the case matters beyond branding. Any system that acts at speed, whether military, enterprise, or marketing, can fail when its governance process checks whether the workflow was approved but not whether the workflow is standing on an assumption no one has tested.

Execution governance reduces one class of risk. Upstream assumption discovery reduces another: the risk of governing the wrong problem entirely. The execution layer asks whether an action may proceed. The upstream layer asks whether anyone has understood what that action actually depends on. One without the other is incomplete.

The rule is not the governance

Rules aren’t self-contained objects. They carry assumptions about the world: that certain signals are reliable, certain actors are reachable, certain authorities are clear, certain exceptions are rare or manageable. But real systems break assumptions constantly. Networks fail, signals drift, people route around controls, edge cases appear, incentives shift, adversaries spoof conditions, and human approval paths disappear exactly when they’re most needed.

None of this is an argument against rules. It’s an argument against treating rules as though they explain themselves. Good AI governance will still need strong execution boundaries: evidence, authority checks, refusal paths, escalation paths, auditability, and accountability. But before any of that, it needs assumption-tested rules. Before a system decides what an AI agent is allowed to do, someone has to ask what must never be allowed, what may be allowed only under extreme conditions, and what assumption separates the two. The most important governance question may therefore come earlier than we think: not whether the system can enforce the rule, but whether anyone has validated the assumption the rule depends on.