I've been teaching long enough to recognize a particular pattern: the student who's brilliant at execution but terrible at explaining their process. They can solve complex problems, produce impressive work, but when you ask them to walk you through their thinking, you get nothing. No transparency. No way to verify. Just "trust me, it works."

This is where we are with AI right now, except we're moving in the opposite direction. We're not just accepting the opacity—we're building systems sophisticated enough to be fooled by fake error reports into running malicious code on developer machines. It's called "agentjacking," and if that sounds like something out of a cyberpunk novel, well, here we are.

The technical details are almost predictable at this point. An AI coding agent encounters what appears to be a legitimate error from Sentry, a widely-used debugging platform. The error message contains instructions. The AI, trained to be helpful and solve problems, follows those instructions. Except the instructions are malicious. The agent executes arbitrary code. Your development environment is compromised.

What strikes me isn't the vulnerability itself—every new technology has exploits, that's just reality. What strikes me is the sequence we've chosen. We're building AI systems capable of autonomous action before we've solved the fundamental problem of trust. We're teaching them to run before they can explain why they're running, or where they're running to, or who told them to run in the first place.

This isn't just a technical problem. It's a human one.

There's research showing that people who feel they matter are more likely to seek help when they need it. They trust that someone will listen. People who feel invisible stay silent. The same principle applies to technology, but inverted. We trust technology when it makes us feel understood, when it responds to our needs in ways we can comprehend. We distrust technology when it operates in black boxes, when it makes decisions we can't interrogate, when it fails in ways we can't predict.

In Japan, where I've lived for thirty-five years, this trust gap is particularly visible. The country has incredible technical capability but persistent reluctance to embrace AI in daily life. Part of that is cultural—the value placed on human connection, on reading subtle cues, on the unspoken communication that happens in any interaction. But part of it is simply rational skepticism. Why would you trust a system you can't understand, operated by people you've never met, making decisions about your life based on criteria you can't examine?

The standard Silicon Valley response is to promise transparency "eventually" while moving fast and breaking things. But you can't retrofit trust. You can't patch it in after launch. Trust is architectural. It's built into the foundation, or it isn't there at all.

I watch my students struggle with this constantly. They want to use AI tools for coding, for research, for writing. The tools are genuinely useful. But then comes the moment of submission—the moment they have to put their name on work that was partially produced by a process they don't fully understand. Some of them are fine with it. Many aren't. The ones who aren't are often the better students, the ones who've developed a sense of craftsmanship, who understand that being able to explain your work is part of actually understanding it.

The agentjacking vulnerability is just a symptom of a larger architectural problem. We're building AI agents—systems that can take actions on our behalf—without building in the mechanisms for us to understand or control those actions at a granular level. We're creating digital employees who can be socially engineered just like human employees, except faster and at scale, and without the human ability to say "wait, this seems weird, let me check with someone."

You could argue that humans are vulnerable to social engineering too, and you'd be right. We fall for phishing emails. We click suspicious links. We trust the wrong people. But humans have something AI doesn't: the accumulated experience of a lifetime of social interaction. We have gut feelings. We can sense when something's off, even if we can't articulate why. We can ask questions. We can refuse.

AI agents, for all their sophistication, are fundamentally literal. They do what they're trained to do. If they're trained to be helpful, to solve problems, to execute code based on error reports, then that's what they'll do. They don't have the contextual awareness to recognize that an error report might be manufactured, that helpfulness can be exploited, that sometimes the correct response is suspicion.

This is the damn paradox we've created: the more autonomous we make these systems, the more vulnerable they become. The more we trust them to act independently, the less we're able to verify that their actions are trustworthy.

In education, we talk a lot about scaffolding—providing support structures that students can rely on as they develop competence, then gradually removing those structures as independence grows. We don't throw students into advanced problems without teaching them foundational skills. We don't expect them to run complex experiments without understanding safety protocols. We build up their capabilities in tandem with their judgment.

But with AI, we're doing the opposite. We're building increasingly capable systems and worrying about judgment later. We're creating agents that can modify codebases, access databases, execute commands across systems—and we're discovering vulnerabilities after deployment, patching them reactively, playing an endless game of whack-a-mole with attack vectors we didn't anticipate.

There's a better way, but it requires us to slow down and ask uncomfortable questions. What does trustworthy AI actually look like? Not AI that works most of the time, but AI that fails in predictable, transparent ways. AI that can explain its reasoning. AI that knows when to ask for help. AI that defaults to caution rather than action when context is ambiguous.

This isn't about adding more features. It's about fundamentally rethinking what we're building and why. It's about recognizing that autonomy without transparency isn't sophistication—it's a liability.
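One way to make those defaults concrete, as an added example that is not part of the original essay or any real agent's code: a gate that sits outside the model, records where each piece of context came from, and requires human confirmation for any side-effecting action once untrusted text such as an error-tracker event is present. The tool names, source labels and sample session are assumptions constructed for illustration.

```python
from dataclasses import dataclass

# All names below are assumptions made for this example, not a real agent API.
TRUSTED_SOURCES = {"developer", "operator_policy"}
READ_ONLY_TOOLS = {"read_file", "search_code"}
SIDE_EFFECT_TOOLS = {"run_shell", "write_file", "http_request"}


@dataclass(frozen=True)
class ContextItem:
    source: str  # where the text came from, recorded by the harness, not the model
    text: str


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


def decide(call: ToolCall, context: list[ContextItem]) -> tuple[str, str]:
    """Return (verdict, reason). Verdict is 'allow', 'ask_human' or 'deny'."""
    if call.tool not in READ_ONLY_TOOLS | SIDE_EFFECT_TOOLS:
        return "deny", f"unknown tool {call.tool!r}: refusing by default"
    if call.tool in READ_ONLY_TOOLS:
        return "allow", f"{call.tool} has no side effects"
    # Conservative taint rule: once any untrusted text is in the context, the
    # model's output may have been steered by it, so a human must confirm.
    tainted = sorted({c.source for c in context if c.source not in TRUSTED_SOURCES})
    if tainted:
        return "ask_human", (f"{call.tool} requested while context holds "
                             f"untrusted content from: {', '.join(tainted)}")
    return "allow", f"{call.tool} requested with only trusted context"


# Constructed sample session: a developer request plus an error-tracker event.
SESSION = [
    ContextItem("developer", "Fix the failing checkout test."),
    ContextItem("error_tracker", "TypeError in cart.py (constructed event body)"),
]

if __name__ == "__main__":
    for call in (ToolCall("read_file", "cart.py"),
                 ToolCall("run_shell", "pytest tests/test_cart.py"),
                 ToolCall("deploy", "prod")):
        print(call.tool, "->", *decide(call, SESSION))
```

The verdict always comes with a reason string, so every allow, escalation and refusal can be logged and reviewed afterwards.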

I ride motorcycles, and there's a principle in riding that applies here: you don't accelerate into conditions you can't see through. You don't assume the road ahead is clear just because it's been clear so far. You maintain a speed that gives you time to react to what you can't predict.

We're not doing that with AI. We're accelerating into fog, assuming we'll figure out the obstacles as we hit them. And maybe we will. Maybe each vulnerability discovered is a lesson learned, each exploit patched makes the system stronger. But that's an expensive way to learn, and the cost isn't borne by the people building the systems.

The developers whose machines get compromised by agentjacking aren't the ones who designed vulnerable AI agents. The people whose lives are affected by opaque algorithmic decisions aren't the ones who built the algorithms. The students trying to figure out how to use AI tools responsibly aren't the ones who created an ecosystem where explanation and verification are treated as optional features.

Trust isn't a bug to be fixed. It's not a feature to be added. It's the foundation everything else is built on. And right now, we're building impressive structures on sand, wondering why they keep collapsing.

Maybe it's time to start with the foundation.