Posts

Adversarial Distillation of American AI Models (NSTM-4)

This April 23, 2026 memorandum from the White House Office of Science and Technology Policy (OSTP) addresses the threat of industrial-scale adversarial distillation of U.S. frontier AI models by foreign entities, principally based in China. The document states that these campaigns leverage tens of thousands of proxy accounts and jailbreaking techniques to systematically extract capabilities from American AI models at a fraction of the cost, enabling foreign actors to release models that appear comparable on benchmarks while deliberately stripping security protocols and mechanisms that ensure models are "ideologically neutral and truth-seeking." While the U.S. supports legitimate AI distillation (producing smaller, lighter-weight models from advanced systems), the administration announces four actions: sharing threat information with U.S. AI companies, enabling private sector coordination, developing best practices to identify and mitigate industrial-scale distillation, and ex...

Skill Issues: How We Discovered Supply Chain Attack Vectors in an AI Agent Skills Marketplace

Orca Security's research team discovered four supply chain attack primitives in a prominent AI agent skills marketplace (where developers install reusable prompt-based extensions for AI coding agents). The primitives are: (1) install count inflation, in which unauthenticated GET requests can trivially spoof popularity metrics; (2) non-deterministic security scanning, in which skills are scanned only at creation and again only when they become popular, creating a window for malicious modifications; (3) silent skill override, in which installing a skill with the same name as an existing one silently replaces it with no warning; and (4) no fine-grained updates, in which the update command refreshes all installed skills at once with no diff or changelog. The researchers demonstrated three end-to-end attack flows (bait-and-switch, nested skill injection, and delayed weaponization via update) that achieved persistent code execution through malicious skills that passed the platform's security audits. Real-world...
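As a rough illustration of how a client could defend against primitives (3) and (4), the sketch below pins each installed skill to a SHA-256 digest of its content and refuses to replace a skill of the same name unless the caller opts in. The SkillRegistry class, SkillConflictError, and the allow_replace flag are all hypothetical names invented for this example; nothing here reflects the marketplace's actual tooling.

```python
import hashlib


class SkillConflictError(Exception):
    """Raised when an install would silently replace a pinned skill."""


class SkillRegistry:
    """Hypothetical local registry that pins skills by content hash."""

    def __init__(self) -> None:
        self._pins: dict[str, str] = {}  # skill name -> SHA-256 hex digest

    def install(self, name: str, content: bytes, allow_replace: bool = False) -> str:
        digest = hashlib.sha256(content).hexdigest()
        pinned = self._pins.get(name)
        # Same name with different content is exactly the "silent override" case:
        # fail loudly unless the caller explicitly asked to replace it.
        if pinned is not None and pinned != digest and not allow_replace:
            raise SkillConflictError(
                f"skill {name!r} is pinned to {pinned[:12]}, got {digest[:12]}"
            )
        self._pins[name] = digest
        return digest
```

With pins in place, an update that changes a skill's content surfaces as an explicit conflict per skill rather than a bulk refresh with no diff.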

Inside Claude Managed Agents: Reverse-Engineering the Security Boundaries of Anthropic's Hosted Agent Runtime

This Pluto Security blog post reverse-engineers Anthropic's Claude Managed Agents (a hosted runtime where Claude runs autonomously in cloud containers with bash, file I/O, web access, and MCP tools). Key findings include: the sandbox uses gVisor with a three-layer egress control system (the same isolation engine as Claude Cowork); all outbound traffic routes through a JWT-authenticated egress proxy with TLS inspection; the JWT is readable by any process in the sandbox and reveals organization metadata, session ID, and allowed hosts; even in "limited" networking mode, six additional Anthropic infrastructure hosts (including sentry.io and a staging endpoint) are silently injected into the egress JWT beyond user configuration. Three independent layers prevent proxy bypass (no DNS, network firewall, JWT validation). The vault credential proxy is identified as the platform's strongest security property — vault secrets never enter the sandbox, structurally preventing creden...
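The point that a JWT is readable by any process holding it follows from the format itself: the payload of a signed JWT is base64url-encoded JSON, not encrypted. A minimal sketch of reading claims without any key is shown below. The claim names in the sample (org, session_id, allowed_hosts) are placeholders chosen for illustration and are not the actual fields of the egress token described in the post.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload_b64 = parts[1]
    # JWTs strip base64 padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _b64url(obj: dict) -> str:
    """Encode a dict the way a JWT segment is encoded (unpadded base64url)."""
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hypothetical token built locally for demonstration; the signature is a dummy.
sample_claims = {
    "org": "example-org",
    "session_id": "sess-123",
    "allowed_hosts": ["api.example.com"],
}
sample_token = ".".join([_b64url({"alg": "none"}), _b64url(sample_claims), "sig"])
```

Calling decode_jwt_payload(sample_token) returns sample_claims, which is why a token of this kind should be treated as disclosing its contents to everything that can read it.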

Your AI Assistant Is Leaking Your Conversations

This research disclosure reveals structural privacy risks in four major generative AI products (Perplexity, Anthropic's Claude, xAI's Grok, and OpenAI's ChatGPT) caused by third-party trackers embedded in LLM services that leak user conversations, identities, and sensitive metadata. The researchers found 13+ third-party trackers across the four platforms, including Meta Pixel, Google Analytics, TikTok, Datadog, Intercom, and Segment. Key findings include: conversation URLs (often publicly accessible permalinks) are disclosed to advertising and tracking services; trackers can link activity to user identities via cookies and email hashes; and in Grok's case, shared conversations generate publicly accessible screenshot images with verbatim message content exposed in Open Graph metadata. The disclosure also documents that Claude forwards user events server-side to eleven ad platforms (Meta, LinkedIn, TikTok, Reddit, Google, Amplitude, Iterable, HubSpot, Pinterest, Pods...

Claude Platform documentation about Workload Identity Federation

This Claude Platform documentation page describes Workload Identity Federation (WIF), which lets workloads authenticate to the Claude API using short-lived OpenID Connect (OIDC) tokens from an identity provider (IdP) instead of long-lived static API keys. Supported IdPs include AWS IAM, Google Cloud, GitHub Actions, Kubernetes service accounts, SPIFFE, Microsoft Entra ID, and Okta. The workflow involves: the IdP issuing a JWT to the workload; the Anthropic SDK exchanging the JWT for a short-lived Anthropic access token; and the SDK sending the token on every request while automatically refreshing it before expiry. Key concepts include service accounts (non-human identities in an Anthropic organization), federation issuers (registered OIDC providers with issuer URL and JWKS source), and federation rules (which bridge issuers to service accounts with match conditions, target, and authorization scope). The page includes setup instructions, SDK client examples (Python, TypeScript, Go, Java...

Replaced all Chrome extensions with own vibe-coded ones for safety

Pieter Levels (@levelsio) posted that within 1.5 hours he replaced all his Chrome extensions with his own "vibe-coded" extension called SuperLevels, after one of his existing extensions updated and suddenly wanted to read his entire browser history (which he suspected was to sell to an ad company). The SuperLevels extension includes: Tab Cleaner (auto-closes tabs after inactivity with host-based exclusions), Cookie Editor (nuke all cookies or edit any), Redirect Tracer (view redirect chains), Dark Mode (per-site or all sites), X Dim (changes X background back to dark blue), and Music Finder (records and identifies songs). It also restores the Maps and View Image links that are hidden in the EU. He stated he deleted all other extensions except uBlock Origin, because controlling the source code is much safer. https://x.com/levelsio/status/2046271694042505451 (I completely understand this guy. Bob has brought back that lovely feeling of coding again.)

Behind the Scenes Hardening Firefox with Claude Mythos Preview

This Mozilla Hacks article details how the Firefox team used AI models, particularly Claude Mythos Preview, to identify and fix an unprecedented number of latent security bugs. The authors explain that the dynamic shifted dramatically over a few months due to more capable models and improved techniques for harnessing them — moving from AI-generated "slop" to a scalable hardening pipeline using agentic harnesses that can create and run reproducible test cases. The article provides a sample of 12 discovered bugs (from a total of 271 fixed in Firefox 150), including 15-year-old XSLT bugs, race conditions over IPC leading to sandbox escapes, JIT optimization flaws, and RLBox sandbox bypasses. The pipeline involved parallelized scanning across VMs, integration with the full security bug lifecycle, and iteration with Firefox engineers. The article notes that the models were unable to circumvent Firefox's layered defenses (e.g., frozen prototypes), demonstrating the payoff of pr...