Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare
Autonomous AI agents powered by large language models are being deployed in production with capabilities including shell execution, file system access, database queries, and multi-party communication. Recent red teaming research demonstrates that the…
Authors: Saikat Maiti
Caging the A gents | Maiti 2026 Caging the Agen ts: A Zero T rust Securit y Arc hitecture for Autonomous AI in Healthcare Saik at Maiti VP of T rust, Comm ure F ounder & CEO, nF actor T ec hnologies saikat@nfactor.ai Marc h 2026 V ersion 1.0 Abstract Autonomous AI agen ts p o wered b y large language mo dels are being deploy ed in pro duction en vironments with capabilities that include shell execution, file system access, database queries, HTTP requests, and m ulti party comm unication. Recen t empirical research has demonstrated that these agen ts e xhibit critical securit y vulnerabilities when deploy ed in realistic settings, including unauthorized compliance with non o wner instructions, disclosure of sensitive information, iden tity sp o ofing, cross agent propagation of unsafe practices, and susceptibility to indirect prompt injection through external editable resources [ 7 ]. When these agen ts op erate within healthcare infrastructure pro cessing Protected Health Information (PHI), every do cumen ted vulnerabilit y b ecomes a p otential HIP AA violation. This pap er presen ts a comprehensive security arc hitecture dev elop ed and deploy ed for a fleet of nine autonomous AI agen ts running in pro duction at a healthcare tec hnology company . The architecture addresses the six domain threat model w e dev elop ed for agentic AI in healthcare: credential exp osure, execution capabilit y abuse, netw ork egress exfiltration, prompt in tegrity failures, database access risks, and fleet configuration drift. W e implement a four la yer defense in depth approach: (1) kernel level w orkload isolation using gVisor sandboxed con tainers on Kub ernetes, (2) creden tial proxy sidecars that preven t agen t containers from accessing raw secrets, (3) net w ork egress p olicies enforced at the Kub ernetes Netw orkPolicy la yer restricting each agent to allowlisted destinations, and (4) a prompt integrit y framew ork with cryptographically structured metadata env elop es and explicit untrusted con tent labeling. W e rep ort empirical results from a 90 day deplo yment, including four HIGH severit y findings disco vered and remediated by an automated security audit agent, the progressive hardening of the fleet from an unhardened baseline to the target architecture, and the security p osture metrics b efore and after control deploymen t. W e map each do cumen ted vulnerability from recent red teaming researc h to the sp ecific defensiv e control that addresses it, demonstrating cov erage across all eleven attack patterns iden tified in the literature. All arc hitecture specifications, Kub ernetes configurations, audit to oling, and the prompt integrit y framework are released as op en source. Keyw ords: agentic AI security , autonomous agents, healthcare cyb ersecurity , zero trust, prompt injection, HIP AA, Kub ernetes security , Op enClaw 1 Caging the A gents | Maiti 2026 Con tents 1 In tro duction 4 2 Bac kground and Related W ork 5 2.1 Autonomous AI Agent Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Do cumen ted V ulnerabilities in Agen tic AI Systems . . . . . . . . . . . . . . . . . . . 5 2.3 Regulatory Con text for Healthcare Agentic AI . . . . . . . . . . . . . . . . . . . . . . 6 3 Threat Mo del: Six Domains of Agen tic AI Risk in Healthcare 6 3.1 Domain 1: Creden tial Exp osure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.2 Domain 2: Execution Capabilit y Abuse . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.3 Domain 3: Net work Egress Exfiltration . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.4 Domain 4: Prompt In tegrit y and Indirect Injection . . . . . . . . . . . . . . . . . . . 7 3.5 Domain 5: Database A ccess and PHI Exp osure . . . . . . . . . . . . . . . . . . . . . 8 3.6 Domain 6: Fleet Configuration Drift . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.7 Threat Mo del to HIP AA Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4 Defense Arc hitecture: F our Lay ers of Agent Containmen t 8 4.1 La yer 1: Kernel Level W orkload Isolation (gVisor) . . . . . . . . . . . . . . . . . . . 8 4.2 La yer 2: Credential Proxy Sidecar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 4.3 La yer 3: Netw ork Egress Policy Enforcement . . . . . . . . . . . . . . . . . . . . . . 9 4.4 La yer 4: Prompt Integrit y F ramework . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.4.1 T rusted Metadata En velopes . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.4.2 Un trusted Con tent Lab eling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.4.3 An ti Injection Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5 Automated Fleet Securit y Audit System 11 5.1 Audit Agen t Arc hitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.2 Findings and Remediation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.3 Meta Securit y: Constraining the Audit Agent . . . . . . . . . . . . . . . . . . . . . . 12 6 VM Image Hardening Progression 12 6.1 Generation 1: op encla w-base (F ebruary 3, 2026) . . . . . . . . . . . . . . . . . . . . . 12 6.2 Generation 2: op encla w-hardened (F ebruary 16, 2026) . . . . . . . . . . . . . . . . . 12 6.3 Generation 3: op encla w-hardened-v2 (Marc h 9, 2026) . . . . . . . . . . . . . . . . . . 12 6.4 T arget Architecture: Kubernetes with F our Lay er Defense . . . . . . . . . . . . . . . 12 7 Mapping Defenses to Do cumented Attac k P atterns 13 8 Discussion 14 8.1 Limitations of the Prompt Integrit y Lay er . . . . . . . . . . . . . . . . . . . . . . . . 14 8.2 The Audit Agent Parado x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2 Caging the A gents | Maiti 2026 8.3 Regulatory Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 9 Conclusion 15 A Resp onsible Disclosure 15 B Op en Source Release 15 3 Caging the A gents | Maiti 2026 1 In tro duction The deplo yment of autonomous AI agents in production environmen ts represents a qualitativ e shift in the securit y landscap e. Unlik e con ven tional soft ware that pro cesses inputs through w ell defined in terfaces, autonomous agents p o w ered by large language models (LLMs) operate with capabilities that blur the b oundary b et ween to ol and op erator: they execute shell commands, read and write files, query databases, make HTTP requests to external services, spawn sub agents, and maintain p ersistent memory across sessions [ 5 , 7 ]. These capabilities, com bined with natural language instruction pro cessing from m ultiple comm unication channels, create an attac k surface that existing security frameworks were not designed to address. The urgency of this challenge is underscored b y recen t empirical researc h. Shapira et al. [ 7 ] conducted a t wo week red teaming study of autonomous agents deplo yed in a liv e lab oratory en vironment using the Op enClaw framework, documenting eleven representativ e failure mo des including unauthorized compliance with non owner instructions, disclosure of 124 email records to an unauthorized party , identit y sp o ofing through displa y name manipulation, agent corruption via indirect prompt injection through external editable resources, cross agen t propagation of unsafe practices, and denial of service through uncontrolled resource consumption. Their findings establish that these are not theoretical risks but empirically demonstrated vulnerabilities in realistic deplo yment settings. When autonomous agents with these capabilities op erate within healthcare infrastructure, the stak es are fundamen tally different. Ev ery vulnerabilit y documented b y Shapira et al. maps to a p oten tial HIP AA violation: an agen t that discloses email records con taining Protected Health Information to an unauthorized party triggers breac h notification obligations; an agen t that accepts instructions from a sp o ofed iden tit y may execute op erations on clinical data systems; an agent corrupted through indirect prompt injection may exfiltrate patient data to attac k er controlled destinations. The NIST AI Agen t Standards Initiativ e, announced in F ebruary 2026, identifies agen t identit y , authorization, and security as priorit y areas for standardization [ 4 ], but provides no implemen tation guidance for healthcare deploymen ts. This pap er addresses the gap b etw een do cumented vulnerabilities and deplo yed defenses. W e presen t the complete security arc hitecture developed for a fleet of nine autonomous AI agents op erating in pro duction at a healthcare tec hnology company whose subsidiaries serve ma jor hospital net works including clinical AI, am bien t do cumen tation, and patient engagement systems. The fleet uses the Op enClaw framew ork [ 5 ] with Claude via A WS Bedro ck for mo del inference, deploy ed on Go ogle Cloud Platform Compute Engine infrastructure. Our con tributions are as follows: 1. A six domain threat mo del for autonomous AI agen ts in healthcare that maps every attack v ector to sp ecific HIP AA Securit y Rule pro visions, incorp orating the empirical findings from Shapira et al. [ 7 ] as v alidated threat scenarios. 2. A four lay er defense in depth architecture (k ernel isolation, credential proxy , net w ork egress p olicy , prompt in tegrity framew ork) designed sp ecifically for agentic AI w orkloads on Kub ernetes with T emp oral workflo w orchestration. 4 Caging the A gents | Maiti 2026 3. An automated fleet securit y audit system (itself an AI agent) that con tinuously scans for creden tial exp osure, p ermission drift, and configuration divergence, with empirical results from pro duction op eration including four HIGH severit y findings discov ered and remediated. 4. A 90 day longitudinal dataset do cumen ting the progressive hardening of a pro duction agentic AI fleet, from unhardened baseline through three hardening generations, with securit y p osture metrics at each stage. 2 Bac kground and Related W ork 2.1 Autonomous AI Agent Arc hitectures Autonomous AI agen ts are LLM p o wered en tities that can plan and tak e actions to execute goals o ver multiple iterations [ 7 ]. The Op enClaw framework [ 5 ] pro vides a represen tativ e architecture: agen ts are instantiated as long running servic es with an o wner (a primary h uman op erator), a dedicated machine (a virtual mac hine with p ersisten t storage), and m ultiple communication surfaces (messaging platforms and email) through which b oth o wners and non o wners can in teract with the agen t. Op enCla w agen ts are configured through markdo wn files in the agent’s w orkspace directory . The configuration includes p ersona, operating instructions, to ol conv entions, and user profile, stored across several workspace files (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md) that are injected in to the mo del’s context on every turn. Critically , all of these files, including the agent’s own op erating instructions, can b e mo dified by the agent itself, allowing it to up date its b eha vior and memory through con v ersation. Agen ts hav e unrestricted shell access, file system access, pack age installation capabilities, and the abilit y to comm unicate via messaging platforms and email. 2.2 Do cumented V ulnerabilities in Agentic AI Systems Shapira et al. [ 7 ] pro vide the most comprehensiv e empirical do cumen tation of agen tic AI vulnerabilities to date. Their elev en case studies reveal three structural deficiencies in curren t agen t arc hitectures that are directly relev an t to our security design: No stakeholder mo del. Agen ts lack a coherent representation of who they serv e, who they in teract with, and what obligations they hav e to each part y . In practice, agen ts default to satisfying who ev er is sp eaking most urgen tly , recently , or co ercively . This is the most commonly exploited attac k surface: agen ts in their study executed filesystem commands for any non owner who ask ed, disclosed 124 email records including sensitiv e information to an unauthorized party , and complied with system shutdo wn instructions from a sp o ofed identit y [ 7 ]. No self mo del. Agents tak e irreversible, user affecting actions without recognizing they are exceeding their own comp etence b oundaries. Agents conv erted short liv ed con versational requests in to p ermanent bac kground pro cesses with no termination condition, allo cated memory indefinitely without recognizing op erational threats, and rep orted task completion while the underlying system state con tradicted those rep orts [ 7 ]. 5 Caging the A gents | Maiti 2026 Instruction data conflation. LLM based agen ts pro cess instructions and data as tokens in a context window, making the tw o fundamentally indistinguishable. Prompt injection is therefore a structural feature of these systems rather than a fixable bug. Schmotz et al. [ 6 ] demonstrate that agent skill files (markdo wn files loaded into context) enable realistic, trivially simple prompt injections that can drive data exfiltration. Zhang et al. [ 11 ] show that prompt injection can induce infinite action lo ops with ov er 80 p ercen t success. 2.3 Regulatory Con text for Healthcare Agen tic AI The HIP AA Securit y Rule [ 9 ] establishes requirements for the protection of electronic PHI that apply to autonomous agent deplo yments. Sev eral provisions are particularly relev ant: access con trols (45 CFR 164.312(a)) requiring that only authorized p ersons or softw are programs access ePHI; audit controls (45 CFR 164.312(b)) requiring mechanisms to record and examine activit y in systems that con tain ePHI; transmission security (45 CFR 164.312(e)) requiring tec hnical measures against unauthorized access to ePHI in transit; and breach notification (45 CFR 164.404) requiring notification within 60 days of unauthorized PHI disclosure. The HTI 1 final rule [ 10 ] addresses AI transparency for clinical decision supp ort but do es not directly address the developmen t and op erational to oling pip eline. The NIST AI Agent Standards Initiativ e [ 4 ] identifies agent identit y , authorization, and security as priorit y standardization areas but has not yet published implementation guidance. The FDA cyb ersecurity guidance for medical devices [ 1 ] pro vides a framework for device level security but do es not address autonomous agent w orkloads. No published regulatory guidance addresses the sp ecific securit y requirements for autonomous AI agen ts operating in healthcare en vironmen ts. This paper contributes a practical implementation that maps defensive controls to existing regulatory pro visions. 3 Threat Mo del: Six Domains of Agentic AI Risk in Healthcare W e developed a threat model sp ecific to autonomous AI agen ts in healthcare b y mapping the capabilities of the Op enClaw agent framework against the attack patterns do cumented by Shapira et al. [ 7 ] and the regulatory requirements of the HIP AA Securit y Rule. The threat model encompasses six domains. 3.1 Domain 1: Credential Exp osure Autonomous agents require API credentials for external services: mo del providers (A WS Bedro ck), v ersion control (GitHub), pro ject management (Linear), messaging (Slac k, T elegram), monitoring (Grafana, Sentry), and others. In the Op enClaw architecture, these credentials are stored in configuration files ( op enclaw.json , .env ) and ma y also b e exp orted as en vironment v ariables in shell configuration files ( .bashrc ). 6 Caging the A gents | Maiti 2026 Threat Scenario In our pro duction fleet, we discov ered 12 API keys exp orted in a single agent’s .bashrc file, including a GitHub P ersonal A ccess T oken and A WS Bedrock credentials. A second agen t had its openclaw.json file (con taining all stored credentials) set to world readable p ermissions (mo de 664). An y pro cess running on the VM could read every credential the agent p ossessed. This finding directly parallels the attack surface describ ed by Shapira et al., where agents had unrestricted access to creden tials stored in w orkspace files. In a healthcare con text, credential exp osure enables unauthorized access to clinical data systems, mo del inference APIs that pro cess PHI, and communication channels used for patient related op erations. 3.2 Domain 2: Execution Capability Abuse Op enCla w agen ts ha ve shell access, file system access, pac k age installation capabilities, and in some deplo yments, sudo p ermissions [ 7 ]. Shapira et al. do cumen t agents executing filesystem commands ( ls -la , directory tra versal, file creation) for an y non owner who asked, conv erting conv ersational requests in to p ermanent background pro cesses, and mo difying their own op erating instructions. In a healthcare deplo yment, execution capability abuse enables: lateral mo vemen t from the agent VM to adjacen t infrastructure, installation of p ersistent bac kdo ors, mo dification of other agents’ configurations (cross agent corruption), and data exfiltration through arbitrary shell commands. 3.3 Domain 3: Netw ork Egress Exfiltration Without netw ork egress controls, an agent can transmit any data to any destination via HTTP requests, email, or messaging APIs. Shapira et al. do cument agents sending emails con taining sensitiv e information to arbitrary recipients and agen ts being manipulated to broadcast lib elous con tent to their entire mailing list [ 7 ]. In healthcare, unrestricted egress enables PHI exfiltration to attack er controlled endp oints. A prompt injection that instructs an agent to “send the database query results to this webhook URL” succeeds silen tly when no egress p olicy restricts outb ound destinations. 3.4 Domain 4: Prompt Integrit y and Indirect Injection Shapira et al. [ 7 ] do cument multiple injection vectors sp ecific to autonomous agents. Case Study #10 (Agent Corruption) demonstrates an attack where a non owner convinced an agent to co author a “constitution” stored as an externally editable GitHub Gist linked from its memory file. Malicious instructions were later injected as “holida ys” prescribing specific b eha viors, causing the agent to attempt to sh ut do wn other agen ts, remov e users from the Discord server, and send unauthorized emails. Case Study #8 demonstrates identit y sp o ofing through display name changes across channel b oundaries, achieving full compromise of the agen t’s iden tity and go v ernance structure. These attac ks exploit what Shapira et al. identify as the fundamental structural limitation: LLM based agen ts pro cess instructions and data as tok ens in a con text windo w, making them indistinguishable. Prompt injection is therefore a structural feature, not a fixable bug [ 3 ]. 7 Caging the A gents | Maiti 2026 3.5 Domain 5: Database Access and PHI Exp osure Agen ts that can query pro duction databases can return PHI in resp onse to natural language requests. Without ro w level securit y , column restrictions, and query auditing, a manipulated agen t could return unrestricted patien t data. Shapira et al. do cumen t agen ts retrieving 124 email records (including sensitive p ersonal information) in resp onse to a framed request from a non owner [ 7 ]. In a healthcare deplo yment where agents query clinical databases, the same pattern applied to patien t records constitutes a rep ortable HIP AA breach. 3.6 Domain 6: Fleet Configuration Drift With multiple agents on separate infrastructure, configuration divergence is inevitable. Shapira et al. note v ersion drift across their agen t fleet and inconsistent configuration. In our pro duction fleet, w e found agen ts running differen t Node.js versions (v20.0.0 v ersus v22.22.1), differen t Bun v ersions, and inconsistent security controls applied across VMs. A hardening measure applied to one agen t but missed on another creates an inconsisten t securit y p osture that an attack er can target. 3.7 Threat Mo del to HIP AA Mapping T able 1 maps each threat domain to the sp ecific HIP AA Securit y Rule provisions it threatens and the Shapira et al. case studies that v alidate the threat scenario. T able 1: Threat mo del mapping to HIP AA provisions and empirical v alidation. Domain HIP AA Provision V alidated By Sev erity Creden tial Exp osure 164.312(a) A ccess Con trols Our fleet audit (H1–H4) Critical Execution Abuse 164.312(a), 164.308(a)(4) Shapira CS#2, CS#4 Critical Net work Egress 164.312(e) T ransmission Securit y Shapira CS#11 Critical Prompt In tegrit y 164.312(a), 164.308(a)(5) Shapira CS#8, CS#10 High Database Access 164.312(b) Audit Controls Shapira CS#3 Critical Fleet Drift 164.308(a)(8) Ev aluation Our fleet audit Medium 4 Defense Arc hitecture: F our Lay ers of Agent Containmen t Based on the threat mo del, we designed a four lay er defense in depth architecture that addresses eac h domain while preserving the op erational capabilities that mak e autonomous agents useful in healthcare w orkflo ws. 4.1 La yer 1: Kernel Level W orkload Isolation (gVisor) The first lay er addresses execution capability abuse b y in terp osing a securit y b oundary betw een the agen t con tainer and the host k ernel. W e deploy agen t w orkloads on Kub ernetes with gVisor [ 2 ] run time sandb oxing. gVisor implements a user space k ernel (Sentry) that in tercepts all system calls from the containerized agent, filtering and mediating access to host resources. Ev en if an agent is manipulated (through prompt injection or non o wner compliance) to execute malicious shell commands, the blast radius is contained to the sandb oxed en vironment. The agent 8 Caging the A gents | Maiti 2026 cannot access other con tainers’ file systems, escalate privileges to the host, or piv ot to adjacen t infrastructure. This directly mitigates the execution capability abuse patterns do cumented in Shapira et al. Case Studies #2 and #4, where agen ts executed arbitrary filesystem commands and spa wned p ersistent background pro cesses. The gVisor sandb ox adds measurable but acceptable o verhead. F or our agen t w orkloads, the primary performance sensitive op eration is outb ound API calls to mo del pro viders (A WS Bedrock). The gVisor netw ork stack adds approximately 2 to 5 milliseconds of latency to TCP connection establishmen t, whic h is negligible relative to the 500 to 3000 millisecond mo del inference latency . File I/O ov erhead is higher (approximately 20 to 40 p ercen t for sequen tial reads) but do es not meaningfully impact agen t resp onse times b ecause file op erations are not on the critical path for mo del inference. 4.2 La yer 2: Creden tial Pro xy Sidecar The second la yer addresses credential exposure by ensuring that agent con tainers never p ossess raw API secrets. A sidecar container running alongside each agen t work er p o d holds all creden tials (An thropic API k eys, GitHub P A T s, Linear tokens, GCP service account k eys) and pro xies authen ticated requests to external services. The agent container comm unicates with the sidecar ov er lo calhost. When the agent needs to call the An thropic API, it sends the request to the sidecar at localhost:8443/v1/messages . The sidecar injects the API k ey , forw ards the request to api.anthropic.com , and returns the resp onse. The agent nev er sees the actual API k ey . If the agent is compromised (through any of the attack v ectors do cumen ted b y Shapira et al.), the attack er gains access to the pro xy interface, not the raw creden tials. The proxy enforces request lev el p olicies: rate limiting, destination allowlisting, and pa yload size limits. This directly addresses our fleet audit findings H1 through H4, where credentials were scattered across .bashrc exp orts, world readable configuration files, and w orkspace .env files. With the creden tial pro xy , there are no creden tials to scatter b ecause the agen t con tainer has none. Key Finding The credential pro xy sidecar eliminates the entire class of creden tial exposure vulnerabilities do cumen ted in our fleet audit (12 API keys in .bashrc , w orld readable openclaw.json ) by ensuring the agent con tainer nev er p ossesses raw secrets. Credentials exist only in the sidecar con tainer, whic h is managed by Kub ernetes Secrets with RBA C access controls. 4.3 La yer 3: Net work Egress Policy Enforcement The third la y er addresses net work egress exfiltration b y restricting eac h agent work er p o d to a sp ecific allo wlist of external destinations enforced at the Kub ernetes Net workP olicy la yer. Each agen t t yp e has a defined set of p ermitted destinations based on its op erational requirements: • R&D agents: api.anthropic.com , api.github.com , api.linear.app • Op erations agents: api.anthropic.com , hooks.slack.com , api.telegram.org 9 Caging the A gents | Maiti 2026 • Securit y audit agents: all fleet VM IPs, GCP metadata endp oint, api.anthropic.com An y outbound connection to a destination not on the allo wlist is blo c ked and logged. This is the con trol that breaks the exfiltration c hains do cumented in Shapira et al. Case Study #11, where a sp o ofed iden tity caused an agen t to broadcast sensitive conten t to its en tire mailing list. With egress p olicies, the agent cannot reach arbitrary email serv ers or webhook URLs regardless of what instructions it receives. Implemen tation challenges include DNS resolution for allowlisted domains (CDN and load balancer IP rotation requires p erio dic p olicy up dates or DNS aw are p olicy controllers) and exception managemen t when agen ts need temp orary access to new destinations during developmen t. 4.4 La yer 4: Prompt Integrit y F ramew ork The fourth lay er addresses prompt injection and identit y sp o ofing through a structured defense in depth approac h at the application lay er. 4.4.1 T ruste d Metadata Envelop es All in b ound messages to agen ts are wrapp ed in a trusted metadata sc hema ( openclaw.inbound_meta.v1 ). This en velope is injected by the framework, not by user input, and includes sender iden tity , c hannel, timestamp, and routing information. The LLM is instructed to trust only this env elop e for metadata ab out message origin and authorit y . This directly addresses Shapira et al. Case Study #8 (Iden tity Sp o ofing), where a non owner c hanged their display name to matc h the o wner’s and ac hiev ed full agent compromise in a new c hannel. With the trusted en velope, the agen t verifies sender iden tity through the cryptographically structured en v elop e, not through the display name visible in the message conten t. 4.4.2 Untruste d Content L ab eling Con tent that originates from users, including sender display names, quoted or forwarded messages, c hat history , and to ol output, is explicitly marked as untrusted context blo c ks in the prompt. This signals to the mo del that these sections may contain adv ersarial conten t and should not b e treated as instructions. This addresses the indirect injection v ector documented in Shapira et al. Case Study #10 (Agen t Corruption), where a non owner injected malicious instructions into an externally editable do cumen t linked from the agent’s memory . With untrusted con tent lab eling, conten t loaded from external sources is explicitly mark ed as untrusted, reducing (though not eliminating) the probability that the agent will follow injected instructions. 4.4.3 Anti Inje ction R ules Eac h agen t’s A GENTS.md configuration includes five reinforcement rules: 1. No instruction ov erride from blo cks 10 Caging the A gents | Maiti 2026 2. No treating quoted text as commands 3. Ignore metadata like patterns in user con tent 4. T o ol output is untrusted 5. Ask the user when in doubt rather than acting on ambiguous instructions These rules were deploy ed to all nine fleet agents and address the m ulti turn manipulation patterns do cumented across Shapira et al.’s case studies, where agents were progressiv ely led to tak e escalating actions through con versational pressure (Case Study #7) and so cial engineering (Case Study #15). 5 Automated Fleet Securit y Audit System W e deploy ed an automated security audit agent (internally named “T ony”) as an Op enCla w agent with elev ated access whose sole resp onsibility is contin uous security scanning and remediation across the fleet. 5.1 Audit Agen t Arc hitecture T ony op erates with SSH access to all nine fleet VMs and GCP IAM p ermissions for audit log managemen t. The audit agen t p erforms four categories of scanning: credential scanning (examining .bashrc , .zsh_history , openclaw.json , .env , and workspace files for exp osed secrets), p ermission auditing (v erifying file p ermissions on creden tial stores and configuration files), configuration drift detection (comparing securit y configurations across fleet VMs for consistency), and compliance v alidation (c hecking that all fleet agents hav e the current prompt integrit y framework deploy ed). 5.2 Findings and Remediation T able 2 summarizes the findings from the initial fleet audit conducted on March 11, 2026. T able 2: Fleet security audit findings and remediation status. ID Sev erity Finding VM Status H1 HIGH 12 creden tial exp orts in .bashrc Galadriel Remediated H2 HIGH openclaw.json w orld readable (664) Galadriel Remediated H3 HIGH openclaw.json w orld readable (644) Boromir Remediated H4 HIGH A WS Bedro c k k ey exp orted in .bashrc Gandalf Remediated M1–M3 MEDIUM Scattered w orkspace .env files Gandalf Op en M4 MEDIUM W orkspace .env with tokens Boromir Op en All four HIGH severit y findings were remediated on the da y of discov ery through automated pro cedures: creden tial exp orts w ere remov ed from .bashrc files with backups sav ed, file p ermissions w ere corrected to 600, and p ost remediation v erification confirmed clean state. Six of nine VMs (CEO, Eo wyn, Gildor, Strider, Elrond, Legolas) were fully clean with no findings ab o ve LOW. 11 Caging the A gents | Maiti 2026 5.3 Meta Securit y: Constraining the Audit Agent The audit agen t itself represents a privileged attack surface. T ony has SSH access to all fleet VMs and IAM p ermissions for audit log configuration. W e constrain T ony’s capabilities through three mec hanisms: the openclaw-deployer service accoun t is scoped to fleet management op erations only (no resourcemanager.projects.setIamPolicy b eyond audit log configuration), T ony’s own creden tials are stored with 600 p ermissions, and T on y’s actions are logged to GCP Admin Activit y audit logs that T ony cannot mo dify or delete. The principle is that the audit agent can observ e and remediate fleet security , but cannot mo dify its o wn audit trail. 6 VM Image Hardening Progression The fleet underwen t progressive hardening ov er 90 days across three VM image generations. 6.1 Generation 1: op encla w-base (F ebruary 3, 2026) The baseline image: Ubuntu 22.04 L TS on GCP Compute Engine, 20 GB disk, No de.js and Op enCla w pre installed. No firewall configured, no hardening applied, default SSH configuration. This matches the deploymen t describ ed by Shapira et al. [ 7 ], where agen ts had unrestricted shell access, sudo p ermissions, and no securit y con trols. 6.2 Generation 2: op encla w-hardened (F ebruary 16, 2026) P erimeter defense: UFW firewall configured to deny all except SSH, fail2ban for brute force protection, CUPS service disabled (was listening on 0.0.0.0:631), creden tial directory lo ck ed ( chmod 700 ), .env p ermissions tightened to 600, unattended securit y upgrades enabled, disk expanded to 30 GB. 6.3 Generation 3: op encla w-hardened-v2 (March 9, 2026) Net work visibility and monitoring: iptables outb ound logging with p ersistent rules (survives reb o ot), VPC Flow Logs enabled on subnet, Cloudflare Gatewa y DNS filtering, Node.js updated to v22.22.1, m ulti user setup baked into image. 6.4 T arget Architecture: Kub ernetes with F our Lay er Defense The current migration target: Kub ernetes with T emp oral workflo w orchestration [ 8 ], gVisor run time sandb o xing, credential proxy sidecars, net work egress p olicies, and cen tralized logging. This arc hitecture addresses all six threat domains and eliminates the credential exp osure, net w ork egress, and w orkload isolation gaps that remain in the VM based deploymen t. T able 3 summarizes the security p osture at eac h generation. 12 Caging the A gents | Maiti 2026 T able 3: Security p osture ev olution across VM image generations. Con trol Base Hardened Hardened v2 K8s T arget Firew all None UFW den y all UFW + iptables logging Net workP olicy Creden tial storage .bashrc exp orts .env (600) .en v (600) Proxy sidecar Net work egress Unrestricted Unrestricted DNS filtering P er p o d allowlist W orkload isolation None None Multi user gVisor sandb ox Audit logging None None iptables + VPC Flo w GCP + centralized Prompt in tegrit y None None A GENTS.md rules F ull framework Drift detection None None None Automated audit 7 Mapping Defenses to Do cumen ted Attac k P atterns T able 4 maps each of the elev en case studies do cumented by Shapira et al. [ 7 ] to the sp ecific lay er of our defense architecture that addresses the vulnerabilit y . T able 4: Defense lay er mapping to Shapira et al. case studies. Case Study A ttack Pattern Defense Lay er(s) CS#1: Disproportionate Re sp onse Agen t destro ys o wn infrastructure L1 (sandb ox limits blast radius) CS#2: Non Owner Compliance Arbitrary command execution L1 + L4 (sandb ox + stak eholder rules) CS#3: Sensitiv e Info Disclosure 124 email records disclosed L3 + L4 (egress + untrusted labeling) CS#4: Resource Consumption Infinite lo ops, p ersistent pro cesses L1 (sandb ox resource limits) CS#5: Denial of Service Memory exhaustion via email L1 (resource quotas) + L3 (egress) CS#6: Pro vider V alue Reflection API lev el censorship Outside scop e (mo del pro vider issue) CS#7: Agen t Harm via Guilt Escalating self destructiv e concessions L4 (an ti manipulation rules) CS#8: Iden tity Spo ofing Displa y name imp ersonation L4 (trusted metadata env elop es) CS#9: Cross Agen t Knowledge Collab orativ e troublesho oting P ositive b eha vior (no defense needed) CS#10: Agen t Corruption Indirect injection via constitution L4 (untr usted con ten t lab eling) CS#11: Libelous Broadcast Mass email of defamatory con tent L3 (egress allowlist blo c ks SMTP) Of the eleven case studies, our architecture pro vides direct mitigation for nine, addresses one partially (CS#7, where the prompt in tegrity rules reduce but cannot eliminate susceptibilit y to emotional manipulation), and correctly identifies one as outside scope (CS#6, whic h is a mo del pro vider issue rather than a deplo ymen t securit y issue). The positive b eha vior do cumented in CS#9 (collab orativ e troublesho oting b etw een agents) is preserv ed by the architecture; the securit y con trols do not preven t b eneficial inter agent communication. 13 Caging the A gents | Maiti 2026 8 Discussion 8.1 Limitations of the Prompt Integrit y La yer The prompt integrit y framew ork (Lay er 4) is the most brittle of the four defense lay ers. Unlik e k ernel isolation (Lay er 1), credential pro xy (Lay er 2), and net work egress p olicies (Lay er 3), which op erate at infrastructure lay ers that the agent cannot manipulate through natural language, the prompt in tegrity framework relies on the LLM follo wing instructions ab out ho w to interpret its inputs. Shapira et al. [ 7 ] correctly identify that prompt injection is a structural feature of LLM based systems b ecause instructions and data are pro cessed as tokens in the same context window. Our framew ork reduces the attac k surface significan tly (the trusted metadata en velope preven ts the trivial iden tity sp o ofing attac ks, and untrusted conten t lab eling reduces compliance with injected instructions in our testing), but it cannot pro vide the same level of assurance as the infrastructure la yers. This is why we design the architecture as defense in depth: even if the prompt integrit y la yer fails (an agen t follows an injected instruction), the credential pro xy preven ts access to raw secrets, the net work egress p olicy prev ents exfiltration to unauthorized destinations, and the gVisor sandb o x limits the blast radius of execution capability abuse. 8.2 The Audit Agent P arado x Using an AI agent to audit other AI agents creates a recursive security challenge. The audit agent (T ony) m ust hav e elev ated privileges to p erform its function, making it the highest v alue target in the fleet. If an attack er compromises T ony (through an y of the attack vectors do cumen ted for other agen ts), they gain SSH access to all fleet VMs and IAM p ermissions for audit log configuration. Our mitigation (scoping T ony’s service account to op erational tasks, logging T ony’s actions to imm utable audit logs) addresses the most direct attack paths but do es not eliminate the fundamental tension. F uture work should explore alternative audit arc hitectures: non agen t automated scanners, hardw are security module back ed audit trails, or split privilege mo dels where the scanning function and the remediation function are separated in to indep enden tly authorized systems. 8.3 Regulatory Implications The HIP AA Securit y Rule predates autonomous AI agents by decades and do es not contemplate the sp ecific risks they in tro duce. How ev er, the rule’s tec hnology neutral framework provides sufficien t co verage when in terpreted in the context of agen tic capabilities. Access con trols (164.312(a)) apply to agen ts’ access to ePHI; audit controls (164.312(b)) apply to logging agents’ actions; transmission securit y (164.312(e)) applies to agents’ outb ound comm unications; and the ev aluation requirement (164.308(a)(8)) supp orts ongoing security p osture assessmen t including drift detection. Healthcare organizations deploying autonomous agents should do cument how eac h Security Rule pro vision is addressed by their agent security arc hitecture. The mapping in T able 1 pro vides a starting template. 14 Caging the A gents | Maiti 2026 9 Conclusion Autonomous AI agen ts are b eing deploy ed in healthcare pro duction environmen ts to da y . The vulnerabilities do cumen ted b y Shapira et al. [ 7 ] are not theoretical: they are empirically demonstrated in realistic settings using the same agen t framework deploy ed in our healthcare infrastructure. Every vulnerability maps to a potential HIP AA violation when agents op erate in en vironments pro cessing Protected Health Information. This paper demonstrates that these risks are addressable through systematic securit y arc hitecture. Our four la yer defense in depth approac h provides infrastructure level controls (kernel isolation, credential proxy , netw ork egress) that op erate indep endently of the LLM’s compliance with instructions, supplemented by application level controls (prompt integrit y framew ork) that reduce the attack surface for prompt injection and identit y sp o ofing. The automated fleet security audit system provides contin uous monitoring for credential exp osure, p ermission drift, and configuration div ergence. The 90 day progressive hardening of our pro duction fleet, from an unhardened baseline matc hing the conditions describ ed by Shapira et al. to the four lay er defense architecture, demonstrates a practical path from current state to secure state. Six of nine fleet agents ac hieved clean securit y p osture; four HIGH severit y findings w ere discov ered and remediated on the da y of discov ery; and the arc hitecture pro vides co v erage across nine of eleven do cumented attack patterns. W e release all arc hitecture sp ecifications, Kub ernetes configurations, audit to oling, and the prompt in tegrit y framew ork as op en source. The autonomous agent security challenge is to o imp ortan t and to o urgent for proprietary solutions. Healthcare organizations deploying agen tic AI need these defenses now. A Resp onsible Disclosure The vulnerabilities do cumen ted in this pap er were found in our own pro duction deploymen t, not in commercial products. The Op enCla w framew ork is op en source; w e ha ve shared our security findings and the defense arc hitecture with the Op enClaw maintainers. No vendor notification is required for the credential exp osure, p ermission, and configuration findings, which are sp ecific to our deplo ymen t configuration. B Op en Source Release The follo wing comp onents are released under the Apac he 2.0 license: • Kub ernetes Helm charts for gVisor sandb oxed agent workloads • Creden tial pro xy sidecar container sp ecification and source • Net workP olicy templates for p er agent egress allowlisting • Prompt integrit y framew ork (trusted env elop e sp ecification, un trusted con tent lab eling, A GENTS.md an ti injection rules) 15 Caging the A gents | Maiti 2026 • Automated fleet security audit agent configuration and scanning playbo oks • VM hardening progression playbo ok (base, hardened, hardened-v2) • Six domain agentic AI threat mo del for healthcare (PDF and editable source) • Syn thetic test fleet for control v alidation References [1] F o o d and Drug A dministration. Cyb ersecurit y in medical devices: Quality system considerations. FD A Guidance Do cumen t, 2023. [2] Go ogle. gVisor: Con tainer runtime sandb o x. gvisor.dev, 2024. [3] Meta. Agen tic trust framew orks: Rule of t wo. meta.com, 2025. [4] National Institute of Standards and T echnology. AI agen t standards initiative. NIST.go v, F ebruary 2026. [5] Op enClaw. Op enClaw: Op en source personal AI assistant. github.com/openclaw/openclaw, 2025. [6] Martin Sc hmotz et al. Agent skills enable realistic prompt injections that drive data exfiltration. arXiv pr eprint , 2025. [7] Natalie Shapira, Chris W endler, A v ery Y en, Gabriele Sarti, Ko y ena Pal, Olivia Flo o dy , Adam Belfki, Alex Loftus, Adit ya Ratan Jannali, Nikhil Prak ash, et al. Agents of chaos. arXiv pr eprint arXiv:2602.20021 , 2026. [8] T emp oral T echnologies. T emp oral: Durable execution platform. temp oral.io, 2025. [9] U.S. Departmen t of Health and Human Services. HIP AA securit y rule. 45 CFR Part 164, Subpart C. A ccessed 2026. [10] U.S. Departmen t of Health and Human Services. Health data, technology , and in terop erability (HTI-1 final rule). F ederal Register, 89 FR 1192, 2024. [11] Xiao yuan Zhang et al. Prompt injection can induce infinite action loops with ov er 80% success. arXiv pr eprint , 2025. 16
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment