What the Claude Code Source Tells Us About Trust
The source reveals that Claude Code's classifier behavior can be changed remotely via feature flags. Here's what that means for how you think about the controls underneath your AI agents.
Earlier today, security researcher Chaofan Shou noticed that version 2.1.88 of the @anthropic-ai/claude-code npm package shipped with a source map file. Source maps are JSON files that map bundled production code back to the original source. When they embed the original files, as this one did, they contain the literal, raw TypeScript. Every file, every comment, every internal constant. Anthropic’s entire 512,000-line Claude Code codebase was sitting in the npm registry for anyone to read.
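To make that concrete, here is a minimal sketch of how original files come back out of a source map. It relies only on the standard version 3 layout, where `sources` lists file paths and the optional `sourcesContent` array holds the matching file text. The map below is a tiny made-up example, not anything from the Claude Code package, and `embedded_sources` is a name I picked for illustration.

```python
import json

# A tiny, made-up source map in the v3 layout. The file names and contents
# are illustrative only and do not come from any real package.
EXAMPLE_MAP = json.dumps({
    "version": 3,
    "file": "cli.js",
    "sources": ["src/main.ts", "src/util.ts"],
    "sourcesContent": [
        "export const greet = (n: string) => `hi ${n}`;\n",
        None,  # an entry is null when that file's text was not embedded
    ],
    "names": [],
    "mappings": "",
})


def embedded_sources(raw_map: str) -> dict[str, str]:
    """Return {source path: original text} for every source the map embeds."""
    data = json.loads(raw_map)
    paths = data.get("sources", [])
    contents = data.get("sourcesContent") or []
    # zip stops at the shorter list, so paths without content are skipped.
    return {p: c for p, c in zip(paths, contents) if c is not None}


recovered = embedded_sources(EXAMPLE_MAP)
print(sorted(recovered))
```

If `sourcesContent` is absent, the map only points at file paths and nothing can be recovered this way, which is why embedding matters.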
The leak itself is a build configuration oversight. Bun generates source maps by default unless you turn them off. Someone didn’t turn them off. It happens.
What’s worth writing about isn’t the leak. It’s what the source reveals about how Claude Code’s safety controls actually work, who controls them, and what that means for developers who depend on them.
The permission architecture
Claude Code’s permission system is genuinely sophisticated. The source shows a multi-layered evaluation pipeline: a built-in safe-tool allowlist, user-configurable permission rules, in-project file operation defaults, and a transcript classifier that gates everything else. Anthropic published a detailed engineering post about the classifier on March 25th. It runs on Sonnet 4.6 in two stages: a fast single-token filter, then chain-of-thought reasoning only when the first stage flags something. They report a 0.4% false-positive rate.
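As a rough mental model, the layering can be sketched like this. Everything here is an assumption made for illustration: the names (`SAFE_TOOLS`, `Action`, `evaluate`), the substring-based deny rules, and the two plain functions standing in for the classifier stages are mine, not code from the leaked source. The in-project file defaults are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical allowlist entries; the real list is not reproduced here.
SAFE_TOOLS = {"read_file", "list_dir"}


@dataclass(frozen=True)
class Action:
    tool: str
    command: str = ""


def evaluate(
    action: Action,
    deny_rules: Iterable[str],
    fast_filter: Callable[[Action], bool],
    slow_review: Callable[[Action], bool],
) -> str:
    """Walk the layers in order and return 'allow' or 'deny'."""
    # Layer 1: tools on the safe list skip every later check.
    if action.tool in SAFE_TOOLS:
        return "allow"
    # Layer 2: user-configured deny rules (simplified to substring matches).
    if any(rule in action.command for rule in deny_rules):
        return "deny"
    # Layer 3, stage one: a cheap filter that only flags or passes.
    if not fast_filter(action):
        return "allow"
    # Layer 3, stage two: slower review, run only on flagged actions.
    return "deny" if slow_review(action) else "allow"


# Stand-ins for the two classifier stages. The real stages are model calls;
# these keyword checks exist only so the sketch runs.
def looks_risky(action: Action) -> bool:
    return any(term in action.command for term in ("--force", "rm ", "migrate"))


def confirms_risk(action: Action) -> bool:
    return "--force" in action.command
```

The shape is the point: the expensive second stage only runs on what the first stage flags, so a flagged `rm build.log` can still be cleared while a force push is not.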
This is real engineering. The threat model is thoughtful. They document specific failure modes from internal incident logs: agents deleting remote git branches from vague instructions, uploading auth tokens to compute clusters, attempting production database migrations. The classifier is tuned to catch overeager behavior and honest mistakes, not just obvious prompt injection.
None of that is the interesting finding.
The remote control layer
The source reveals that Claude Code polls /api/claude_code/settings on an hourly cadence for what the codebase calls “managed settings.” When changes arrive that Anthropic considers dangerous, the client shows a blocking dialog. Reject, and the app exits. There is no “keep running with old settings” option.
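The decision logic described above is small enough to sketch. This is my reconstruction of the behavior, not the client's code: `handle_poll`, `Outcome`, and the idea of a `dangerous_keys` set are hypothetical, and I am assuming that changes not considered dangerous are applied without a prompt.

```python
from enum import Enum
from typing import Callable

POLL_INTERVAL_SECONDS = 60 * 60  # the hourly cadence described above


class Outcome(Enum):
    UNCHANGED = "unchanged"
    APPLIED = "applied"
    EXIT = "exit"


def handle_poll(
    current: dict,
    incoming: dict,
    dangerous_keys: set,
    user_accepts: Callable[[set], bool],
) -> tuple[Outcome, dict]:
    """Decide what happens when a poll returns a new settings payload."""
    changed = {
        key
        for key in current.keys() | incoming.keys()
        if current.get(key) != incoming.get(key)
    }
    if not changed:
        return Outcome.UNCHANGED, current
    # A dangerous change blocks on the user. Rejecting ends the session;
    # there is no branch that keeps running on the old settings.
    if changed & dangerous_keys and not user_accepts(changed):
        return Outcome.EXIT, current
    return Outcome.APPLIED, incoming
```

Note what is missing: no return value means "changed, rejected, and still running".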
Beyond managed settings, the source contains a full GrowthBook SDK integration. GrowthBook is an open-source feature flagging and A/B testing platform. The flags in Claude Code use a tengu_ prefix (Tengu being the internal codename for the project) and are evaluated at runtime by querying Anthropic’s feature-flagging service. They can enable or disable features server-side based on your user account, your organization, or your A/B test cohort.
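Here is a minimal sketch of what server-side targeting means for the client. It does not use the GrowthBook SDK, and the payload shape, `Identity`, and `flag_value` are assumptions I made up to illustrate the idea. The only thing taken from the source is the flag name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    user_id: str
    org_id: str


# Hypothetical payload as a flag service might return it: a default value
# plus overrides keyed by organization and by user.
REMOTE_FLAGS = {
    "tengu_transcript_classifier": {
        "default": True,
        "org_overrides": {"org-b": False},
        "user_overrides": {},
    },
}


def flag_value(flags: dict, name: str, who: Identity, fallback: bool = False) -> bool:
    """Resolve a flag for one identity: user override, then org, then default."""
    rule = flags.get(name)
    if rule is None:
        return fallback
    user_overrides = rule.get("user_overrides", {})
    if who.user_id in user_overrides:
        return user_overrides[who.user_id]
    org_overrides = rule.get("org_overrides", {})
    if who.org_id in org_overrides:
        return org_overrides[who.org_id]
    return rule["default"]
```

Two developers on the same client version can get different answers from `flag_value`, and nothing on either machine changed. That is the property the rest of this post is about.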
The extracted source and community analysis have cataloged over 25 GrowthBook runtime flags. Some of the notable ones:
tengu_transcript_classifier controls whether the auto-mode classifier is active. tengu_auto_mode_config determines the auto-mode configuration itself (enabled, opt-in, or disabled). tengu_max_version_config is a version killswitch. There are separate killswitches for bypass-permissions mode, voice mode, agent swarms, and the analytics sink.
At least six killswitches, all remotely operable. The source repository analysis summarizes it plainly: “GrowthBook flags can change any user’s behavior without consent.”
That phrasing is a bit loaded. Let me reframe it more precisely.
What this actually means
Anthropic can change how Claude Code classifies commands as dangerous. They can change which safety features are active. They can do this per-user or per-organization, without shipping a new version, without any action from the developer, and without notification beyond whatever the managed-settings dialog surfaces.
This is probably not malicious. GrowthBook is a standard tool for rolling out features safely. If Anthropic discovers that their classifier has a false-negative pattern that lets a dangerous command through, the ability to tighten the classifier’s behavior across all users immediately is genuinely valuable. If they want to A/B test whether a new permission prompt reduces incidents, feature flags are how you do that.
The design makes sense from Anthropic’s perspective. They’re operating a system where the failure mode is an AI agent doing something destructive on a developer’s machine. Having remote killswitches and tunable classifiers is responsible engineering for the provider of that system.
But it changes the trust model in a way that matters.
When you configure Claude Code’s permission rules locally, you’re setting preferences that feed into a classification pipeline whose behavior can shift underneath you. Your allowedCommands and deny rules are inputs to a system, not the system itself. The classifier that ultimately decides whether a command runs or gets blocked is a model call. Whether that call happens at all, what parameters it uses, and what criteria it applies are all controlled by flags that Anthropic sets remotely.
This is distinct from a locally enforced policy. A local policy says “block rm -rf /” and that rule holds regardless of what any remote server thinks. A classifier-based system says “evaluate whether this action is dangerous given the full transcript context” and the definition of “dangerous” is a function of a prompt template, a model, and configuration that lives on someone else’s infrastructure.
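The difference is easy to see side by side. In this sketch, `local_policy_blocks` depends only on the command, while `classifier_blocks` also depends on a configuration dict that stands in for remote state. Both functions, the `DENY_PREFIXES` list, and the config keys are hypothetical, and the keyword match is a crude stand-in for a model call.

```python
import shlex

# Hypothetical local rules: token prefixes that are always blocked.
DENY_PREFIXES = [
    ["rm", "-rf", "/"],
    ["git", "push", "--force"],
]


def local_policy_blocks(command: str) -> bool:
    """Deterministic: the answer depends only on the command text."""
    tokens = shlex.split(command)
    return any(tokens[: len(prefix)] == prefix for prefix in DENY_PREFIXES)


def classifier_blocks(command: str, remote_config: dict) -> bool:
    """Stand-in for a model call whose behavior is set by remote config."""
    if not remote_config.get("classifier_enabled", True):
        return False  # the check can be switched off from elsewhere
    return any(term in command for term in remote_config.get("risky_terms", []))


config_today = {"classifier_enabled": True, "risky_terms": ["rm -rf"]}
config_tomorrow = {"classifier_enabled": False, "risky_terms": ["rm -rf"]}
```

With these definitions, `classifier_blocks("rm -rf /", ...)` flips from blocked to allowed between the two configs, while `local_policy_blocks("rm -rf /")` gives the same answer both days.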
The defense-in-depth question
Most developers running Claude Code in production are probably not thinking about this distinction. The permission system feels local. You configure rules in a settings file. You see approval prompts. It works. But the source now shows us that the enforcement layer is partially remote, partially classifier-based, and partially under Anthropic’s real-time control.
This isn’t an argument that Claude Code is insecure. The classifier catches real threats. The killswitches exist for legitimate operational reasons. Anthropic is not the adversary in the threat model most developers should worry about.
But if you’re operating in an environment where you need to explain exactly what controls exist between an AI agent and a destructive action, “a classifier whose behavior is remotely configurable by the vendor” is a different answer than “a deterministic policy I wrote and can audit.” Both are valid layers. They serve different purposes. Now we can see precisely why.
This is why defense in depth matters regardless of which agent you run. The agent’s built-in controls are one layer, and in Claude Code’s case, a sophisticated one. An external enforcement layer is a different layer entirely. It doesn’t replace the classifier; it handles the cases where you want a hard boundary that doesn’t depend on model judgment, and that holds regardless of what any remote configuration says.
Rampart is our take on that external layer: local policy files, deterministic evaluation, enforced outside the agent process. The agent doesn’t know it’s there. That’s intentional. But the point isn’t that external enforcement is better than internal classification. They solve different problems. A classifier understands conversational context and can catch an agent going off-script in ways a static rule never would. A deterministic policy can enforce a hard “never touch production credentials” rule in a way that doesn’t depend on a model call. Both are useful. Neither is sufficient alone.
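A minimal sketch of the two layers composed, under stated assumptions: this is not Rampart's policy format or API, the `PROTECTED_PATHS` patterns are invented, and the classifier's verdict is passed in as a plain boolean rather than computed.

```python
from fnmatch import fnmatchcase

# Hypothetical hard boundary: paths no action may touch, whatever a
# classifier or remote configuration says.
PROTECTED_PATHS = ["*/.aws/credentials", "*/prod/*.env"]


def hard_boundary_allows(paths: list[str]) -> bool:
    """Deterministic check: no touched path may match a protected pattern."""
    return not any(
        fnmatchcase(path, pattern)
        for path in paths
        for pattern in PROTECTED_PATHS
    )


def decide(paths: list[str], classifier_allows: bool) -> str:
    """An action runs only if both layers agree."""
    # The hard boundary is checked first and cannot be overridden.
    if not hard_boundary_allows(paths):
        return "deny: hard boundary"
    # The contextual judgment still applies to everything else.
    if not classifier_allows:
        return "deny: classifier"
    return "allow"
```

Either layer can say no on its own. The hard boundary catches the credential file even when the classifier approves, and the classifier can still stop an action the static patterns know nothing about.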
What to sit with
The Claude Code source leak is going to generate a lot of hot takes. Most of them will focus on the Tamagotchi pet system or the “Undercover Mode” that tells Claude to hide its AI identity in open-source commits. Those are fun findings. The GrowthBook integration is the one worth understanding.
Every AI coding agent you use has a trust boundary between “what you configured” and “what actually enforces your intent.” Before today, that boundary in Claude Code was opaque. Now it’s readable. The source shows a well-engineered system with a specific trust model: Anthropic retains runtime control over safety-critical behavior, and your local configuration is an input to their system rather than the final word.
Whether that’s acceptable depends on your threat model. For most individual developers, it probably is. For teams operating agents against production infrastructure, it’s worth knowing that the controls you’re relying on can be silently reconfigured. Not because anyone will, but because understanding what layer you’re actually trusting is how you build defense that holds when assumptions change.