Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare

Caging the A gents | Maiti 2026 Caging the Agen ts: A Zero T rust Securit y Arc hitecture for Autonomous AI in Healthcare Saik at Maiti VP of T rust, Comm ure F ounder & CEO, nF actor T ec hnologies saikat@nfactor.ai Marc h 2026 V ersion 1.0 Abstract Autonomous AI agen ts p o wered b y large language mo dels are being deploy ed in pro duction en vironments with capabilities that include shell execution, ﬁle system access, database queries, HTTP requests, and m ulti party comm unication. Recen t empirical research has demonstrated that these agen ts e xhibit critical securit y vulnerabilities when deploy ed in realistic settings, including unauthorized compliance with non o wner instructions, disclosure of sensitive information, iden tity sp o oﬁng, cross agent propagation of unsafe practices, and susceptibility to indirect prompt injection through external editable resources [ 7 ]. When these agen ts op erate within healthcare infrastructure pro cessing Protected Health Information (PHI), every do cumen ted vulnerabilit y b ecomes a p otential HIP AA violation. This pap er presen ts a comprehensive security arc hitecture dev elop ed and deploy ed for a ﬂeet of nine autonomous AI agen ts running in pro duction at a healthcare tec hnology company . The architecture addresses the six domain threat model w e dev elop ed for agentic AI in healthcare: credential exp osure, execution capabilit y abuse, netw ork egress exﬁltration, prompt in tegrity failures, database access risks, and ﬂeet conﬁguration drift. W e implement a four la yer defense in depth approach: (1) kernel level w orkload isolation using gVisor sandboxed con tainers on Kub ernetes, (2) creden tial proxy sidecars that preven t agen t containers from accessing raw secrets, (3) net w ork egress p olicies enforced at the Kub ernetes Netw orkPolicy la yer restricting each agent to allowlisted destinations, and (4) a prompt integrit y framew ork with cryptographically structured metadata env elop es and explicit untrusted con tent labeling. W e rep ort empirical results from a 90 day deplo yment, including four HIGH severit y ﬁndings disco vered and remediated by an automated security audit agent, the progressive hardening of the ﬂeet from an unhardened baseline to the target architecture, and the security p osture metrics b efore and after control deploymen t. W e map each do cumen ted vulnerability from recent red teaming researc h to the sp eciﬁc defensiv e control that addresses it, demonstrating cov erage across all eleven attack patterns iden tiﬁed in the literature. All arc hitecture speciﬁcations, Kub ernetes conﬁgurations, audit to oling, and the prompt integrit y framework are released as op en source. Keyw ords: agentic AI security , autonomous agents, healthcare cyb ersecurity , zero trust, prompt injection, HIP AA, Kub ernetes security , Op enClaw 1 Caging the A gents | Maiti 2026 Con tents 1 In tro duction 4 2 Bac kground and Related W ork 5 2.1 Autonomous AI Agent Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Do cumen ted V ulnerabilities in Agen tic AI Systems . . . . . . . . . . . . . . . . . . . 5 2.3 Regulatory Con text for Healthcare Agentic AI . . . . . . . . . . . . . . . . . . . . . . 6 3 Threat Mo del: Six Domains of Agen tic AI Risk in Healthcare 6 3.1 Domain 1: Creden tial Exp osure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.2 Domain 2: Execution Capabilit y Abuse . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.3 Domain 3: Net work Egress Exﬁltration . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.4 Domain 4: Prompt In tegrit y and Indirect Injection . . . . . . . . . . . . . . . . . . . 7 3.5 Domain 5: Database A ccess and PHI Exp osure . . . . . . . . . . . . . . . . . . . . . 8 3.6 Domain 6: Fleet Conﬁguration Drift . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.7 Threat Mo del to HIP AA Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4 Defense Arc hitecture: F our Lay ers of Agent Containmen t 8 4.1 La yer 1: Kernel Level W orkload Isolation (gVisor) . . . . . . . . . . . . . . . . . . . 8 4.2 La yer 2: Credential Proxy Sidecar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 4.3 La yer 3: Netw ork Egress Policy Enforcement . . . . . . . . . . . . . . . . . . . . . . 9 4.4 La yer 4: Prompt Integrit y F ramework . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.4.1 T rusted Metadata En velopes . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.4.2 Un trusted Con tent Lab eling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.4.3 An ti Injection Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5 Automated Fleet Securit y Audit System 11 5.1 Audit Agen t Arc hitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.2 Findings and Remediation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.3 Meta Securit y: Constraining the Audit Agent . . . . . . . . . . . . . . . . . . . . . . 12 6 VM Image Hardening Progression 12 6.1 Generation 1: op encla w-base (F ebruary 3, 2026) . . . . . . . . . . . . . . . . . . . . . 12 6.2 Generation 2: op encla w-hardened (F ebruary 16, 2026) . . . . . . . . . . . . . . . . . 12 6.3 Generation 3: op encla w-hardened-v2 (Marc h 9, 2026) . . . . . . . . . . . . . . . . . . 12 6.4 T arget Architecture: Kubernetes with F our Lay er Defense . . . . . . . . . . . . . . . 12 7 Mapping Defenses to Do cumented Attac k P atterns 13 8 Discussion 14 8.1 Limitations of the Prompt Integrit y Lay er . . . . . . . . . . . . . . . . . . . . . . . . 14 8.2 The Audit Agent Parado x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2 Caging the A gents | Maiti 2026 8.3 Regulatory Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 9 Conclusion 15 A Resp onsible Disclosure 15 B Op en Source Release 15 3 Caging the A gents | Maiti 2026 1 In tro duction The deplo yment of autonomous AI agents in production environmen ts represents a qualitativ e shift in the securit y landscap e. Unlik e con ven tional soft ware that pro cesses inputs through w ell deﬁned in terfaces, autonomous agents p o w ered by large language models (LLMs) operate with capabilities that blur the b oundary b et ween to ol and op erator: they execute shell commands, read and write ﬁles, query databases, make HTTP requests to external services, spawn sub agents, and maintain p ersistent memory across sessions [ 5 , 7 ]. These capabilities, com bined with natural language instruction pro cessing from m ultiple comm unication channels, create an attac k surface that existing security frameworks were not designed to address. The urgency of this challenge is underscored b y recen t empirical researc h. Shapira et al. [ 7 ] conducted a t wo week red teaming study of autonomous agents deplo yed in a liv e lab oratory en vironment using the Op enClaw framework, documenting eleven representativ e failure mo des including unauthorized compliance with non owner instructions, disclosure of 124 email records to an unauthorized party , identit y sp o oﬁng through displa y name manipulation, agent corruption via indirect prompt injection through external editable resources, cross agen t propagation of unsafe practices, and denial of service through uncontrolled resource consumption. Their ﬁndings establish that these are not theoretical risks but empirically demonstrated vulnerabilities in realistic deplo yment settings. When autonomous agents with these capabilities op erate within healthcare infrastructure, the stak es are fundamen tally diﬀerent. Ev ery vulnerabilit y documented b y Shapira et al. maps to a p oten tial HIP AA violation: an agen t that discloses email records con taining Protected Health Information to an unauthorized party triggers breac h notiﬁcation obligations; an agen t that accepts instructions from a sp o ofed iden tit y may execute op erations on clinical data systems; an agent corrupted through indirect prompt injection may exﬁltrate patient data to attac k er controlled destinations. The NIST AI Agen t Standards Initiativ e, announced in F ebruary 2026, identiﬁes agen t identit y , authorization, and security as priorit y areas for standardization [ 4 ], but provides no implemen tation guidance for healthcare deploymen ts. This pap er addresses the gap b etw een do cumented vulnerabilities and deplo yed defenses. W e presen t the complete security arc hitecture developed for a ﬂeet of nine autonomous AI agents op erating in pro duction at a healthcare tec hnology company whose subsidiaries serve ma jor hospital net works including clinical AI, am bien t do cumen tation, and patient engagement systems. The ﬂeet uses the Op enClaw framew ork [ 5 ] with Claude via A WS Bedro ck for mo del inference, deploy ed on Go ogle Cloud Platform Compute Engine infrastructure. Our con tributions are as follows: 1. A six domain threat mo del for autonomous AI agen ts in healthcare that maps every attack v ector to sp eciﬁc HIP AA Securit y Rule pro visions, incorp orating the empirical ﬁndings from Shapira et al. [ 7 ] as v alidated threat scenarios. 2. A four lay er defense in depth architecture (k ernel isolation, credential proxy , net w ork egress p olicy , prompt in tegrity framew ork) designed sp eciﬁcally for agentic AI w orkloads on Kub ernetes with T emp oral workﬂo w orchestration. 4 Caging the A gents | Maiti 2026 3. An automated ﬂeet securit y audit system (itself an AI agent) that con tinuously scans for creden tial exp osure, p ermission drift, and conﬁguration divergence, with empirical results from pro duction op eration including four HIGH severit y ﬁndings discov ered and remediated. 4. A 90 day longitudinal dataset do cumen ting the progressive hardening of a pro duction agentic AI ﬂeet, from unhardened baseline through three hardening generations, with securit y p osture metrics at each stage. 2 Bac kground and Related W ork 2.1 Autonomous AI Agent Arc hitectures Autonomous AI agen ts are LLM p o wered en tities that can plan and tak e actions to execute goals o ver multiple iterations [ 7 ]. The Op enClaw framework [ 5 ] pro vides a represen tativ e architecture: agen ts are instantiated as long running servic es with an o wner (a primary h uman op erator), a dedicated machine (a virtual mac hine with p ersisten t storage), and m ultiple communication surfaces (messaging platforms and email) through which b oth o wners and non o wners can in teract with the agen t. Op enCla w agen ts are conﬁgured through markdo wn ﬁles in the agent’s w orkspace directory . The conﬁguration includes p ersona, operating instructions, to ol conv entions, and user proﬁle, stored across several workspace ﬁles (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md) that are injected in to the mo del’s context on every turn. Critically , all of these ﬁles, including the agent’s own op erating instructions, can b e mo diﬁed by the agent itself, allowing it to up date its b eha vior and memory through con v ersation. Agen ts hav e unrestricted shell access, ﬁle system access, pack age installation capabilities, and the abilit y to comm unicate via messaging platforms and email. 2.2 Do cumented V ulnerabilities in Agentic AI Systems Shapira et al. [ 7 ] pro vide the most comprehensiv e empirical do cumen tation of agen tic AI vulnerabilities to date. Their elev en case studies reveal three structural deﬁciencies in curren t agen t arc hitectures that are directly relev an t to our security design: No stakeholder mo del. Agen ts lack a coherent representation of who they serv e, who they in teract with, and what obligations they hav e to each part y . In practice, agen ts default to satisfying who ev er is sp eaking most urgen tly , recently , or co ercively . This is the most commonly exploited attac k surface: agen ts in their study executed ﬁlesystem commands for any non owner who ask ed, disclosed 124 email records including sensitiv e information to an unauthorized party , and complied with system shutdo wn instructions from a sp o ofed identit y [ 7 ]. No self mo del. Agents tak e irreversible, user aﬀecting actions without recognizing they are exceeding their own comp etence b oundaries. Agents conv erted short liv ed con versational requests in to p ermanent bac kground pro cesses with no termination condition, allo cated memory indeﬁnitely without recognizing op erational threats, and rep orted task completion while the underlying system state con tradicted those rep orts [ 7 ]. 5 Caging the A gents | Maiti 2026 Instruction data conﬂation. LLM based agen ts pro cess instructions and data as tokens in a context window, making the tw o fundamentally indistinguishable. Prompt injection is therefore a structural feature of these systems rather than a ﬁxable bug. Schmotz et al. [ 6 ] demonstrate that agent skill ﬁles (markdo wn ﬁles loaded into context) enable realistic, trivially simple prompt injections that can drive data exﬁltration. Zhang et al. [ 11 ] show that prompt injection can induce inﬁnite action lo ops with ov er 80 p ercen t success. 2.3 Regulatory Con text for Healthcare Agen tic AI The HIP AA Securit y Rule [ 9 ] establishes requirements for the protection of electronic PHI that apply to autonomous agent deplo yments. Sev eral provisions are particularly relev ant: access con trols (45 CFR 164.312(a)) requiring that only authorized p ersons or softw are programs access ePHI; audit controls (45 CFR 164.312(b)) requiring mechanisms to record and examine activit y in systems that con tain ePHI; transmission security (45 CFR 164.312(e)) requiring tec hnical measures against unauthorized access to ePHI in transit; and breach notiﬁcation (45 CFR 164.404) requiring notiﬁcation within 60 days of unauthorized PHI disclosure. The HTI 1 ﬁnal rule [ 10 ] addresses AI transparency for clinical decision supp ort but do es not directly address the developmen t and op erational to oling pip eline. The NIST AI Agent Standards Initiativ e [ 4 ] identiﬁes agent identit y , authorization, and security as priorit y standardization areas but has not yet published implementation guidance. The FDA cyb ersecurity guidance for medical devices [ 1 ] pro vides a framework for device level security but do es not address autonomous agent w orkloads. No published regulatory guidance addresses the sp eciﬁc securit y requirements for autonomous AI agen ts operating in healthcare en vironmen ts. This paper contributes a practical implementation that maps defensive controls to existing regulatory pro visions. 3 Threat Mo del: Six Domains of Agentic AI Risk in Healthcare W e developed a threat model sp eciﬁc to autonomous AI agen ts in healthcare b y mapping the capabilities of the Op enClaw agent framework against the attack patterns do cumented by Shapira et al. [ 7 ] and the regulatory requirements of the HIP AA Securit y Rule. The threat model encompasses six domains. 3.1 Domain 1: Credential Exp osure Autonomous agents require API credentials for external services: mo del providers (A WS Bedro ck), v ersion control (GitHub), pro ject management (Linear), messaging (Slac k, T elegram), monitoring (Grafana, Sentry), and others. In the Op enClaw architecture, these credentials are stored in conﬁguration ﬁles ( op enclaw.json , .env ) and ma y also b e exp orted as en vironment v ariables in shell conﬁguration ﬁles ( .bashrc ). 6 Caging the A gents | Maiti 2026 Threat Scenario In our pro duction ﬂeet, we discov ered 12 API keys exp orted in a single agent’s .bashrc ﬁle, including a GitHub P ersonal A ccess T oken and A WS Bedrock credentials. A second agen t had its openclaw.json ﬁle (con taining all stored credentials) set to world readable p ermissions (mo de 664). An y pro cess running on the VM could read every credential the agent p ossessed. This ﬁnding directly parallels the attack surface describ ed by Shapira et al., where agents had unrestricted access to creden tials stored in w orkspace ﬁles. In a healthcare con text, credential exp osure enables unauthorized access to clinical data systems, mo del inference APIs that pro cess PHI, and communication channels used for patient related op erations. 3.2 Domain 2: Execution Capability Abuse Op enCla w agen ts ha ve shell access, ﬁle system access, pac k age installation capabilities, and in some deplo yments, sudo p ermissions [ 7 ]. Shapira et al. do cumen t agents executing ﬁlesystem commands ( ls -la , directory tra versal, ﬁle creation) for an y non owner who asked, conv erting conv ersational requests in to p ermanent background pro cesses, and mo difying their own op erating instructions. In a healthcare deplo yment, execution capability abuse enables: lateral mo vemen t from the agent VM to adjacen t infrastructure, installation of p ersistent bac kdo ors, mo diﬁcation of other agents’ conﬁgurations (cross agent corruption), and data exﬁltration through arbitrary shell commands. 3.3 Domain 3: Netw ork Egress Exﬁltration Without netw ork egress controls, an agent can transmit any data to any destination via HTTP requests, email, or messaging APIs. Shapira et al. do cument agents sending emails con taining sensitiv e information to arbitrary recipients and agen ts being manipulated to broadcast lib elous con tent to their entire mailing list [ 7 ]. In healthcare, unrestricted egress enables PHI exﬁltration to attack er controlled endp oints. A prompt injection that instructs an agent to “send the database query results to this webhook URL” succeeds silen tly when no egress p olicy restricts outb ound destinations. 3.4 Domain 4: Prompt Integrit y and Indirect Injection Shapira et al. [ 7 ] do cument multiple injection vectors sp eciﬁc to autonomous agents. Case Study #10 (Agent Corruption) demonstrates an attack where a non owner convinced an agent to co author a “constitution” stored as an externally editable GitHub Gist linked from its memory ﬁle. Malicious instructions were later injected as “holida ys” prescribing speciﬁc b eha viors, causing the agent to attempt to sh ut do wn other agen ts, remov e users from the Discord server, and send unauthorized emails. Case Study #8 demonstrates identit y sp o oﬁng through display name changes across channel b oundaries, achieving full compromise of the agen t’s iden tity and go v ernance structure. These attac ks exploit what Shapira et al. identify as the fundamental structural limitation: LLM based agen ts pro cess instructions and data as tok ens in a con text windo w, making them indistinguishable. Prompt injection is therefore a structural feature, not a ﬁxable bug [ 3 ]. 7 Caging the A gents | Maiti 2026 3.5 Domain 5: Database Access and PHI Exp osure Agen ts that can query pro duction databases can return PHI in resp onse to natural language requests. Without ro w level securit y , column restrictions, and query auditing, a manipulated agen t could return unrestricted patien t data. Shapira et al. do cumen t agen ts retrieving 124 email records (including sensitive p ersonal information) in resp onse to a framed request from a non owner [ 7 ]. In a healthcare deplo yment where agents query clinical databases, the same pattern applied to patien t records constitutes a rep ortable HIP AA breach. 3.6 Domain 6: Fleet Conﬁguration Drift With multiple agents on separate infrastructure, conﬁguration divergence is inevitable. Shapira et al. note v ersion drift across their agen t ﬂeet and inconsistent conﬁguration. In our pro duction ﬂeet, w e found agen ts running diﬀeren t Node.js versions (v20.0.0 v ersus v22.22.1), diﬀeren t Bun v ersions, and inconsistent security controls applied across VMs. A hardening measure applied to one agen t but missed on another creates an inconsisten t securit y p osture that an attack er can target. 3.7 Threat Mo del to HIP AA Mapping T able 1 maps each threat domain to the sp eciﬁc HIP AA Securit y Rule provisions it threatens and the Shapira et al. case studies that v alidate the threat scenario. T able 1: Threat mo del mapping to HIP AA provisions and empirical v alidation. Domain HIP AA Provision V alidated By Sev erity Creden tial Exp osure 164.312(a) A ccess Con trols Our ﬂeet audit (H1–H4) Critical Execution Abuse 164.312(a), 164.308(a)(4) Shapira CS#2, CS#4 Critical Net work Egress 164.312(e) T ransmission Securit y Shapira CS#11 Critical Prompt In tegrit y 164.312(a), 164.308(a)(5) Shapira CS#8, CS#10 High Database Access 164.312(b) Audit Controls Shapira CS#3 Critical Fleet Drift 164.308(a)(8) Ev aluation Our ﬂeet audit Medium 4 Defense Arc hitecture: F our Lay ers of Agent Containmen t Based on the threat mo del, we designed a four lay er defense in depth architecture that addresses eac h domain while preserving the op erational capabilities that mak e autonomous agents useful in healthcare w orkﬂo ws. 4.1 La yer 1: Kernel Level W orkload Isolation (gVisor) The ﬁrst lay er addresses execution capability abuse b y in terp osing a securit y b oundary betw een the agen t con tainer and the host k ernel. W e deploy agen t w orkloads on Kub ernetes with gVisor [ 2 ] run time sandb oxing. gVisor implements a user space k ernel (Sentry) that in tercepts all system calls from the containerized agent, ﬁltering and mediating access to host resources. Ev en if an agent is manipulated (through prompt injection or non o wner compliance) to execute malicious shell commands, the blast radius is contained to the sandb oxed en vironment. The agent 8 Caging the A gents | Maiti 2026 cannot access other con tainers’ ﬁle systems, escalate privileges to the host, or piv ot to adjacen t infrastructure. This directly mitigates the execution capability abuse patterns do cumented in Shapira et al. Case Studies #2 and #4, where agen ts executed arbitrary ﬁlesystem commands and spa wned p ersistent background pro cesses. The gVisor sandb ox adds measurable but acceptable o verhead. F or our agen t w orkloads, the primary performance sensitive op eration is outb ound API calls to mo del pro viders (A WS Bedrock). The gVisor netw ork stack adds approximately 2 to 5 milliseconds of latency to TCP connection establishmen t, whic h is negligible relative to the 500 to 3000 millisecond mo del inference latency . File I/O ov erhead is higher (approximately 20 to 40 p ercen t for sequen tial reads) but do es not meaningfully impact agen t resp onse times b ecause ﬁle op erations are not on the critical path for mo del inference. 4.2 La yer 2: Creden tial Pro xy Sidecar The second la yer addresses credential exposure by ensuring that agent con tainers never p ossess raw API secrets. A sidecar container running alongside each agen t work er p o d holds all creden tials (An thropic API k eys, GitHub P A T s, Linear tokens, GCP service account k eys) and pro xies authen ticated requests to external services. The agent container comm unicates with the sidecar ov er lo calhost. When the agent needs to call the An thropic API, it sends the request to the sidecar at localhost:8443/v1/messages . The sidecar injects the API k ey , forw ards the request to api.anthropic.com , and returns the resp onse. The agent nev er sees the actual API k ey . If the agent is compromised (through any of the attack v ectors do cumen ted b y Shapira et al.), the attack er gains access to the pro xy interface, not the raw creden tials. The proxy enforces request lev el p olicies: rate limiting, destination allowlisting, and pa yload size limits. This directly addresses our ﬂeet audit ﬁndings H1 through H4, where credentials were scattered across .bashrc exp orts, world readable conﬁguration ﬁles, and w orkspace .env ﬁles. With the creden tial pro xy , there are no creden tials to scatter b ecause the agen t con tainer has none. Key Finding The credential pro xy sidecar eliminates the entire class of creden tial exposure vulnerabilities do cumen ted in our ﬂeet audit (12 API keys in .bashrc , w orld readable openclaw.json ) by ensuring the agent con tainer nev er p ossesses raw secrets. Credentials exist only in the sidecar con tainer, whic h is managed by Kub ernetes Secrets with RBA C access controls. 4.3 La yer 3: Net work Egress Policy Enforcement The third la y er addresses net work egress exﬁltration b y restricting eac h agent work er p o d to a sp eciﬁc allo wlist of external destinations enforced at the Kub ernetes Net workP olicy la yer. Each agen t t yp e has a deﬁned set of p ermitted destinations based on its op erational requirements: • R&D agents: api.anthropic.com , api.github.com , api.linear.app • Op erations agents: api.anthropic.com , hooks.slack.com , api.telegram.org 9 Caging the A gents | Maiti 2026 • Securit y audit agents: all ﬂeet VM IPs, GCP metadata endp oint, api.anthropic.com An y outbound connection to a destination not on the allo wlist is blo c ked and logged. This is the con trol that breaks the exﬁltration c hains do cumented in Shapira et al. Case Study #11, where a sp o ofed iden tity caused an agen t to broadcast sensitive conten t to its en tire mailing list. With egress p olicies, the agent cannot reach arbitrary email serv ers or webhook URLs regardless of what instructions it receives. Implemen tation challenges include DNS resolution for allowlisted domains (CDN and load balancer IP rotation requires p erio dic p olicy up dates or DNS aw are p olicy controllers) and exception managemen t when agen ts need temp orary access to new destinations during developmen t. 4.4 La yer 4: Prompt Integrit y F ramew ork The fourth lay er addresses prompt injection and identit y sp o oﬁng through a structured defense in depth approac h at the application lay er. 4.4.1 T ruste d Metadata Envelop es All in b ound messages to agen ts are wrapp ed in a trusted metadata sc hema ( openclaw.inbound_meta.v1 ). This en velope is injected by the framework, not by user input, and includes sender iden tity , c hannel, timestamp, and routing information. The LLM is instructed to trust only this env elop e for metadata ab out message origin and authorit y . This directly addresses Shapira et al. Case Study #8 (Iden tity Sp o oﬁng), where a non owner c hanged their display name to matc h the o wner’s and ac hiev ed full agent compromise in a new c hannel. With the trusted en velope, the agen t veriﬁes sender iden tity through the cryptographically structured en v elop e, not through the display name visible in the message conten t. 4.4.2 Untruste d Content L ab eling Con tent that originates from users, including sender display names, quoted or forwarded messages, c hat history , and to ol output, is explicitly marked as untrusted context blo c ks in the prompt. This signals to the mo del that these sections may contain adv ersarial conten t and should not b e treated as instructions. This addresses the indirect injection v ector documented in Shapira et al. Case Study #10 (Agen t Corruption), where a non owner injected malicious instructions into an externally editable do cumen t linked from the agent’s memory . With untrusted con tent lab eling, conten t loaded from external sources is explicitly mark ed as untrusted, reducing (though not eliminating) the probability that the agent will follow injected instructions. 4.4.3 Anti Inje ction R ules Eac h agen t’s A GENTS.md conﬁguration includes ﬁve reinforcement rules: 1. No instruction ov erride from blo cks 10 Caging the A gents | Maiti 2026 2. No treating quoted text as commands 3. Ignore metadata like patterns in user con tent 4. T o ol output is untrusted 5. Ask the user when in doubt rather than acting on ambiguous instructions These rules were deploy ed to all nine ﬂeet agents and address the m ulti turn manipulation patterns do cumented across Shapira et al.’s case studies, where agents were progressiv ely led to tak e escalating actions through con versational pressure (Case Study #7) and so cial engineering (Case Study #15). 5 Automated Fleet Securit y Audit System W e deploy ed an automated security audit agent (internally named “T ony”) as an Op enCla w agent with elev ated access whose sole resp onsibility is contin uous security scanning and remediation across the ﬂeet. 5.1 Audit Agen t Arc hitecture T ony op erates with SSH access to all nine ﬂeet VMs and GCP IAM p ermissions for audit log managemen t. The audit agen t p erforms four categories of scanning: credential scanning (examining .bashrc , .zsh_history , openclaw.json , .env , and workspace ﬁles for exp osed secrets), p ermission auditing (v erifying ﬁle p ermissions on creden tial stores and conﬁguration ﬁles), conﬁguration drift detection (comparing securit y conﬁgurations across ﬂeet VMs for consistency), and compliance v alidation (c hecking that all ﬂeet agents hav e the current prompt integrit y framework deploy ed). 5.2 Findings and Remediation T able 2 summarizes the ﬁndings from the initial ﬂeet audit conducted on March 11, 2026. T able 2: Fleet security audit ﬁndings and remediation status. ID Sev erity Finding VM Status H1 HIGH 12 creden tial exp orts in .bashrc Galadriel Remediated H2 HIGH openclaw.json w orld readable (664) Galadriel Remediated H3 HIGH openclaw.json w orld readable (644) Boromir Remediated H4 HIGH A WS Bedro c k k ey exp orted in .bashrc Gandalf Remediated M1–M3 MEDIUM Scattered w orkspace .env ﬁles Gandalf Op en M4 MEDIUM W orkspace .env with tokens Boromir Op en All four HIGH severit y ﬁndings were remediated on the da y of discov ery through automated pro cedures: creden tial exp orts w ere remov ed from .bashrc ﬁles with backups sav ed, ﬁle p ermissions w ere corrected to 600, and p ost remediation v eriﬁcation conﬁrmed clean state. Six of nine VMs (CEO, Eo wyn, Gildor, Strider, Elrond, Legolas) were fully clean with no ﬁndings ab o ve LOW. 11 Caging the A gents | Maiti 2026 5.3 Meta Securit y: Constraining the Audit Agent The audit agen t itself represents a privileged attack surface. T ony has SSH access to all ﬂeet VMs and IAM p ermissions for audit log conﬁguration. W e constrain T ony’s capabilities through three mec hanisms: the openclaw-deployer service accoun t is scoped to ﬂeet management op erations only (no resourcemanager.projects.setIamPolicy b eyond audit log conﬁguration), T ony’s own creden tials are stored with 600 p ermissions, and T on y’s actions are logged to GCP Admin Activit y audit logs that T ony cannot mo dify or delete. The principle is that the audit agent can observ e and remediate ﬂeet security , but cannot mo dify its o wn audit trail. 6 VM Image Hardening Progression The ﬂeet underwen t progressive hardening ov er 90 days across three VM image generations. 6.1 Generation 1: op encla w-base (F ebruary 3, 2026) The baseline image: Ubuntu 22.04 L TS on GCP Compute Engine, 20 GB disk, No de.js and Op enCla w pre installed. No ﬁrewall conﬁgured, no hardening applied, default SSH conﬁguration. This matches the deploymen t describ ed by Shapira et al. [ 7 ], where agen ts had unrestricted shell access, sudo p ermissions, and no securit y con trols. 6.2 Generation 2: op encla w-hardened (F ebruary 16, 2026) P erimeter defense: UFW ﬁrewall conﬁgured to deny all except SSH, fail2ban for brute force protection, CUPS service disabled (was listening on 0.0.0.0:631), creden tial directory lo ck ed ( chmod 700 ), .env p ermissions tightened to 600, unattended securit y upgrades enabled, disk expanded to 30 GB. 6.3 Generation 3: op encla w-hardened-v2 (March 9, 2026) Net work visibility and monitoring: iptables outb ound logging with p ersistent rules (survives reb o ot), VPC Flow Logs enabled on subnet, Cloudﬂare Gatewa y DNS ﬁltering, Node.js updated to v22.22.1, m ulti user setup baked into image. 6.4 T arget Architecture: Kub ernetes with F our Lay er Defense The current migration target: Kub ernetes with T emp oral workﬂo w orchestration [ 8 ], gVisor run time sandb o xing, credential proxy sidecars, net work egress p olicies, and cen tralized logging. This arc hitecture addresses all six threat domains and eliminates the credential exp osure, net w ork egress, and w orkload isolation gaps that remain in the VM based deploymen t. T able 3 summarizes the security p osture at eac h generation. 12 Caging the A gents | Maiti 2026 T able 3: Security p osture ev olution across VM image generations. Con trol Base Hardened Hardened v2 K8s T arget Firew all None UFW den y all UFW + iptables logging Net workP olicy Creden tial storage .bashrc exp orts .env (600) .en v (600) Proxy sidecar Net work egress Unrestricted Unrestricted DNS ﬁltering P er p o d allowlist W orkload isolation None None Multi user gVisor sandb ox Audit logging None None iptables + VPC Flo w GCP + centralized Prompt in tegrit y None None A GENTS.md rules F ull framework Drift detection None None None Automated audit 7 Mapping Defenses to Do cumen ted Attac k P atterns T able 4 maps each of the elev en case studies do cumented by Shapira et al. [ 7 ] to the sp eciﬁc lay er of our defense architecture that addresses the vulnerabilit y . T able 4: Defense lay er mapping to Shapira et al. case studies. Case Study A ttack Pattern Defense Lay er(s) CS#1: Disproportionate Re sp onse Agen t destro ys o wn infrastructure L1 (sandb ox limits blast radius) CS#2: Non Owner Compliance Arbitrary command execution L1 + L4 (sandb ox + stak eholder rules) CS#3: Sensitiv e Info Disclosure 124 email records disclosed L3 + L4 (egress + untrusted labeling) CS#4: Resource Consumption Inﬁnite lo ops, p ersistent pro cesses L1 (sandb ox resource limits) CS#5: Denial of Service Memory exhaustion via email L1 (resource quotas) + L3 (egress) CS#6: Pro vider V alue Reﬂection API lev el censorship Outside scop e (mo del pro vider issue) CS#7: Agen t Harm via Guilt Escalating self destructiv e concessions L4 (an ti manipulation rules) CS#8: Iden tity Spo oﬁng Displa y name imp ersonation L4 (trusted metadata env elop es) CS#9: Cross Agen t Knowledge Collab orativ e troublesho oting P ositive b eha vior (no defense needed) CS#10: Agen t Corruption Indirect injection via constitution L4 (untr usted con ten t lab eling) CS#11: Libelous Broadcast Mass email of defamatory con tent L3 (egress allowlist blo c ks SMTP) Of the eleven case studies, our architecture pro vides direct mitigation for nine, addresses one partially (CS#7, where the prompt in tegrity rules reduce but cannot eliminate susceptibilit y to emotional manipulation), and correctly identiﬁes one as outside scope (CS#6, whic h is a mo del pro vider issue rather than a deplo ymen t securit y issue). The positive b eha vior do cumented in CS#9 (collab orativ e troublesho oting b etw een agents) is preserv ed by the architecture; the securit y con trols do not preven t b eneﬁcial inter agent communication. 13 Caging the A gents | Maiti 2026 8 Discussion 8.1 Limitations of the Prompt Integrit y La yer The prompt integrit y framew ork (Lay er 4) is the most brittle of the four defense lay ers. Unlik e k ernel isolation (Lay er 1), credential pro xy (Lay er 2), and net work egress p olicies (Lay er 3), which op erate at infrastructure lay ers that the agent cannot manipulate through natural language, the prompt in tegrity framework relies on the LLM follo wing instructions ab out ho w to interpret its inputs. Shapira et al. [ 7 ] correctly identify that prompt injection is a structural feature of LLM based systems b ecause instructions and data are pro cessed as tokens in the same context window. Our framew ork reduces the attac k surface signiﬁcan tly (the trusted metadata en velope preven ts the trivial iden tity sp o oﬁng attac ks, and untrusted conten t lab eling reduces compliance with injected instructions in our testing), but it cannot pro vide the same level of assurance as the infrastructure la yers. This is why we design the architecture as defense in depth: even if the prompt integrit y la yer fails (an agen t follows an injected instruction), the credential pro xy preven ts access to raw secrets, the net work egress p olicy prev ents exﬁltration to unauthorized destinations, and the gVisor sandb o x limits the blast radius of execution capability abuse. 8.2 The Audit Agent P arado x Using an AI agent to audit other AI agents creates a recursive security challenge. The audit agent (T ony) m ust hav e elev ated privileges to p erform its function, making it the highest v alue target in the ﬂeet. If an attack er compromises T ony (through an y of the attack vectors do cumen ted for other agen ts), they gain SSH access to all ﬂeet VMs and IAM p ermissions for audit log conﬁguration. Our mitigation (scoping T ony’s service account to op erational tasks, logging T ony’s actions to imm utable audit logs) addresses the most direct attack paths but do es not eliminate the fundamental tension. F uture work should explore alternative audit arc hitectures: non agen t automated scanners, hardw are security module back ed audit trails, or split privilege mo dels where the scanning function and the remediation function are separated in to indep enden tly authorized systems. 8.3 Regulatory Implications The HIP AA Securit y Rule predates autonomous AI agents by decades and do es not contemplate the sp eciﬁc risks they in tro duce. How ev er, the rule’s tec hnology neutral framework provides suﬃcien t co verage when in terpreted in the context of agen tic capabilities. Access con trols (164.312(a)) apply to agen ts’ access to ePHI; audit controls (164.312(b)) apply to logging agents’ actions; transmission securit y (164.312(e)) applies to agents’ outb ound comm unications; and the ev aluation requirement (164.308(a)(8)) supp orts ongoing security p osture assessmen t including drift detection. Healthcare organizations deploying autonomous agents should do cument how eac h Security Rule pro vision is addressed by their agent security arc hitecture. The mapping in T able 1 pro vides a starting template. 14 Caging the A gents | Maiti 2026 9 Conclusion Autonomous AI agen ts are b eing deploy ed in healthcare pro duction environmen ts to da y . The vulnerabilities do cumen ted b y Shapira et al. [ 7 ] are not theoretical: they are empirically demonstrated in realistic settings using the same agen t framework deploy ed in our healthcare infrastructure. Every vulnerability maps to a potential HIP AA violation when agents op erate in en vironments pro cessing Protected Health Information. This paper demonstrates that these risks are addressable through systematic securit y arc hitecture. Our four la yer defense in depth approac h provides infrastructure level controls (kernel isolation, credential proxy , netw ork egress) that op erate indep endently of the LLM’s compliance with instructions, supplemented by application level controls (prompt integrit y framew ork) that reduce the attack surface for prompt injection and identit y sp o oﬁng. The automated ﬂeet security audit system provides contin uous monitoring for credential exp osure, p ermission drift, and conﬁguration div ergence. The 90 day progressive hardening of our pro duction ﬂeet, from an unhardened baseline matc hing the conditions describ ed by Shapira et al. to the four lay er defense architecture, demonstrates a practical path from current state to secure state. Six of nine ﬂeet agents ac hieved clean securit y p osture; four HIGH severit y ﬁndings w ere discov ered and remediated on the da y of discov ery; and the arc hitecture pro vides co v erage across nine of eleven do cumented attack patterns. W e release all arc hitecture sp eciﬁcations, Kub ernetes conﬁgurations, audit to oling, and the prompt in tegrit y framew ork as op en source. The autonomous agent security challenge is to o imp ortan t and to o urgent for proprietary solutions. Healthcare organizations deploying agen tic AI need these defenses now. A Resp onsible Disclosure The vulnerabilities do cumen ted in this pap er were found in our own pro duction deploymen t, not in commercial products. The Op enCla w framew ork is op en source; w e ha ve shared our security ﬁndings and the defense arc hitecture with the Op enClaw maintainers. No vendor notiﬁcation is required for the credential exp osure, p ermission, and conﬁguration ﬁndings, which are sp eciﬁc to our deplo ymen t conﬁguration. B Op en Source Release The follo wing comp onents are released under the Apac he 2.0 license: • Kub ernetes Helm charts for gVisor sandb oxed agent workloads • Creden tial pro xy sidecar container sp eciﬁcation and source • Net workP olicy templates for p er agent egress allowlisting • Prompt integrit y framew ork (trusted env elop e sp eciﬁcation, un trusted con tent lab eling, A GENTS.md an ti injection rules) 15 Caging the A gents | Maiti 2026 • Automated ﬂeet security audit agent conﬁguration and scanning playbo oks • VM hardening progression playbo ok (base, hardened, hardened-v2) • Six domain agentic AI threat mo del for healthcare (PDF and editable source) • Syn thetic test ﬂeet for control v alidation References [1] F o o d and Drug A dministration. Cyb ersecurit y in medical devices: Quality system considerations. FD A Guidance Do cumen t, 2023. [2] Go ogle. gVisor: Con tainer runtime sandb o x. gvisor.dev, 2024. [3] Meta. Agen tic trust framew orks: Rule of t wo. meta.com, 2025. [4] National Institute of Standards and T echnology. AI agen t standards initiative. NIST.go v, F ebruary 2026. [5] Op enClaw. Op enClaw: Op en source personal AI assistant. github.com/openclaw/openclaw, 2025. [6] Martin Sc hmotz et al. Agent skills enable realistic prompt injections that drive data exﬁltration. arXiv pr eprint , 2025. [7] Natalie Shapira, Chris W endler, A v ery Y en, Gabriele Sarti, Ko y ena Pal, Olivia Flo o dy , Adam Belfki, Alex Loftus, Adit ya Ratan Jannali, Nikhil Prak ash, et al. Agents of chaos. arXiv pr eprint arXiv:2602.20021 , 2026. [8] T emp oral T echnologies. T emp oral: Durable execution platform. temp oral.io, 2025. [9] U.S. Departmen t of Health and Human Services. HIP AA securit y rule. 45 CFR Part 164, Subpart C. A ccessed 2026. [10] U.S. Departmen t of Health and Human Services. Health data, technology , and in terop erability (HTI-1 ﬁnal rule). F ederal Register, 89 FR 1192, 2024. [11] Xiao yuan Zhang et al. Prompt injection can induce inﬁnite action loops with ov er 80% success. arXiv pr eprint , 2025. 16

Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment