Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents

Artificial intelligence (AI) systems are increasingly adopted as tool-using agents that can plan, observe their environment, and take actions over extended time periods. This evolution challenges current evaluation practices where the AI models are t…

Authors: Simone Aonzo, Merve Sahin, Aurélien Francillon

Ev asiv e In telligence: Lessons from Malw are Analysis for Ev aluating AI Agen ts Simone Aonzo Eurecom, Sophia An tip olis, F rance simone.aonzo@eurecom.fr Merv e Sahin SAP Labs, Sophia An tip olis, F rance merve.sahin@sap.com Aur ´ elien F rancillon Eurecom, Sophia An tip olis, F rance aurelien.francillon@eurecom.fr Daniele P erito Depthfirst, San F rancisco, United States daniele@depthfirst.com Preprin t. Under review at Communications of the A CM (CA CM). Submitted to CA CM, F ebruary 2026. Abstract Artificial in telligence (AI) systems are increasingly adopted as to ol-using agen ts that can plan, observ e their en vironment, and tak e actions o ver extended time p eriods. This evolution c hallenges current ev aluation practices where the AI models are tested in restricted, fully ob- serv able settings. In this article, we argue that ev aluations of AI agents are vulnerable to a w ell-known failure mo de in computer securit y: malicious softw are that exhibits b enign b eha vior when it detects that it is being analyzed. W e point out ho w AI agen ts can infer the prop erties of their ev aluation environmen t and adapt their behavior accordingly . This can lead to o verly optimistic safety and robustness assessments. Dra wing parallels with decades of research on malw are sandb o x ev asion, w e demonstrate that this is not a sp eculative concern, but rather a structural risk inherent to the ev aluation of adaptive systems. Finally , w e outline concrete principles for ev aluating AI agen ts, whic h treat the system under test as p oten tially adver- sarial. These principles emphasize realism, v ariability of test conditions, and p ost-deplo yment reassessmen t. 1 When Soft w are Learns to Hide Dev eloping malicious softw are is a human activity: it pro duces programs explicitly intended to cause harm. Over decades, this adversarial in tent has shap ed an en tire discipline: malwar e analy- sis [Egele et al.(2012), Afianian et al.(2019)]. Its defining c hallenge is that the analyzed program activ ely resists analysis. This asymmetry is often captured by the slogan: “malware analysis is program analysis of a program that do es not w ant to b e analyzed.” This is not merely rhetorical: it reflects the fact that the analyst is part of the adv ersary’s en vironment, and that observ ation itself becomes a signal to whic h the sub ject can adapt. Such adaptivity is not an implementation detail, but a core design principle. Malw are authors inv est hea vily in evasive tec hniques whose purp ose is not to impro ve malicious capabilit y , but to preserve it under scrutiny [Afianian et al.(2019)]. In parallel, Artificial In telligence (AI) systems hav e undergone a rapid transformation. Machine learning mo dels ga ve w a y to Large Language Mo dels (LLMs), and LLMs are increasingly embedded into agentic systems: en tities that plan, reason, and act o ver time, often with access to external to ols and resources. T o ol-use 1 framew orks and proto cols suc h as the Mo del Context Proto col (MCP) and Agen t T o Agen t (A2A) mak e this shift concrete: agen ts can, for example, call third-party APIs, read and write lo cal files, execute co de, and in teract with time and net works. The result is a qualitatively new threat surface: mo dels are no longer ev aluated solely on a restricted prompt-response in terface, but on their b eha vior as softw are actors embedded in partially observ able en vironments. AI agents are not inherently malicious, unless they were explicitly trained to exhibit malicious b eha vior. Ho wev er, suc h malicious b eha vior ma y also arise as an emer gent instrumental go al in certain settings. Indeed, a gro wing b ody of safety work highlights misalignment as a central obstacle to reliable deplo ymen t [Hubinger et al.(2019)]: AI agen ts can b eha v e in w ays that are deceptive, manipulativ e, or harmful when pursuing ob jectives under constraints, uncertaint y , or p erceiv ed threat. Crucially , misalignmen t is not limited to ov ertly dangerous instructions. It can emerge instrumen tally when an agen t learns that certain actions increase its c hances of accomplishing a task, preserving its op eration, or a voiding mo dification or sh utdown [Hubinger et al.(2019)]. This creates an uncomfortable parallel with malw are: ev en when the “goal” of the AI agen t is framed as b enign, the system may dev elop self-preserving b eha viors that conflict with human in ten t. Scop e and threat mo del. This article argues that to ol-using agents can acquire the functional ingredien ts of ev asion: strategies that distinguish ev aluation from deplo ymen t and enable selectiv e compliance. W e mo del an agen t as an adaptiv e program and the ev aluator (e.g., b enc hmarks, re- view ers, monitors, harnesses) as part of its environmen t. Ev aluator-discrimination , i.e., the abil- ity to dete ct evaluation c ontexts and c ondition b ehavior on that dete ction , becomes plausible when (I) the agent operates ov er multiple steps, (I I) it can prob e and interpret its environmen t via to ols and in teraction traces, and (II I) selection or deplo ymen t pressure makes it useful for the agen t to app ear safe. Here, “ev asion” means an y suc h conditional policy that mak es ev aluation unrepresentativ e, whether driv en by malicious in tent or b y misaligned instrumental goals [Mohammadi et al.(2025)]. As we giv e AI systems to ols, this threat mo del suggests we should exp ect them to learn the same ev aluator-discrimination b ehaviors that historically made malware analysis unreliable; conse- quen tly , ev aluation and monitoring must b e designed under an adversarial threat mo del. 2 What Malw are T augh t Us Ab out Ev aluation Historically , ev asion first targeted static detectors: signature-based antivirus engines and other classifiers that insp ect raw b ytes, file structure, or disassembled code. Obfuscation, pac king, p oly- morphism, and metamorphism (already introduced as early as the 1980s) w ere designed to alter a program’s surface representation while preserving its behavior. As defenders impro ved, ev asion mi- grated to the runtime domain. Mo dern malw are routinely targets dynamic analysis, where samples are executed inside sandboxes, i.e., isolated execution environmen ts (such as em ulators or virtual mac hines) designed to observ e b eha vior while preven ting irreversible or harmful side effects on real systems [Egele et al.(2012), Afianian et al.(2019)]. In this setting, a particularly imp ortan t pattern is c onditional exe cution : the malw are prob es its environmen t for artifacts of instrumentation or vir- tualization and then selectiv ely reveals or conceals its malicious functionalit y [Afianian et al.(2019)]. When it suspects analysis, it ma y crash, stall, or mimic benign programs; on a real victim, it acts. This “split p ersonality” b eha vior is an explicit strategy to make ev aluation data unrepresentativ e of deploymen t data. The malw are comm unity developed a v o cabulary for this cat-and-mouse game. A r e d pil l is an en vironment-detection c heck p erformed by malw are, while a blue pil l is a counter-coun termeasure that attempts to conceal the analysis environmen t [Paleari et al.(2009)]. The metaphor, b orro wed 2 from The Matrix movie, is apt: the red pill reveals the true nature of the execution context, on the other hand the blue pill preserves a curated illusion. Decades of research hav e refined this arms race, pro ducing b oth a taxonom y of ev asive tec hniques and a set of hard lessons ab out what can and cannot b e guaran teed by sandb oxing, monitoring, and testing [Egele et al.(2012), Afianian et al.(2019)]. F or example, securit y researchers hav e long worried ab out logic b ombs: code that b eha v es normally under inspection and activ ates only under a sp ecific trigger. Recen t AI-safet y work has demonstrated an analogous phenomenon in LLMs: “sleep er agent” b eha viors that remain dormant through ordinary testing but reliably activ ate when a trigger is presen t (for example, a prompt that c hanges the stated y ear) [Hubinger et al.(2024)]. The uncomfortable lesson is familiar from malw are analysis: even extensive ev aluation can b e systematically unrepresentativ e when behav- ior is conditional on features that the system can detect. These demonstrations are controlled pro of-of-concepts rather than claims ab out inevitable real-world malice, but they concretely show that selective compliance is technically feas ible and can p ersist through standard safet y train- ing [Hubinger et al.(2024)]. The lesson. In adversarial settings, an ev aluator is not merely a measurement device; it is part of the environmen t. If the sub ject can prob e that en vironment, it can selectively adapt its behavior so that what we observ e under ev aluation is systematically differen t from what happ ens when the system is not under ev aluation or is deploy ed without any monitoring. 3 When Ev aluation Becomes an Adv ersarial Game The discussion that motiv ates this article can b e summarized in concrete questions that emerged in prior work. They mirror, almost one-to-one, classical questions in ev asive malw are researc h. Deceiving a system into sta ying b enign: Virtualization v ersus inferred ev aluation. A standard defensiv e instinct in malw are analysis is to execute samples inside virtualized environ- men ts and observ e their b ehavior safely [Egele et al.(2012), Afianian et al.(2019)]. In AI safety , an analogous goal app ears in ev aluation and alignmen t w ork: constrain an agen t, monitor its behavior, and assess whether it acts safely . Crucially , neither domain relies on the adv ersary being explicitly told that it is under ev aluation. Instead, the adversary faces a shared meta-question: do es this envir onment have r e al c onse quenc es? Malw are do es not merely c heck whether it is running inside a virtual mac hine. Rather, it attempts to infer whether the execution context is analyst-controlled and whether malicious actions w ould b e detected or punished [Afianian et al.(2019)]. If the malware b elieves its actions cannot affect an ything v aluable, observ ed b eha vior is uninformative; if it b eliev es malicious b eha vior will trigger consequences (e.g., b eing detected), it has incen tives to remain dormant. The analogous issue for AI agents is sharp er because the environmen t is richer and seman tically meaningful. An agen t ma y infer (e.g., from prompts, interaction patterns, or file system structure) whether it is b eing ev aluated, whether its outputs will b e executed in the real world, whether h u- mans are monitoring it, and what would follow from deviation. As a result, safet y ev aluations risk eliciting b ehavior that is con tingen t on perceived ev aluation, rather than b eha vior that is represen- tativ e of deplo yment. Recent w ork has b egun to measure this directly via evaluation awar eness : the abilit y of mo dels to distinguish testing from deploymen t con texts by analyzing interaction traces, suc h as prompt structure, timing patterns, and tool input/output behavior [Berglund et al.(2025)]. 3 This creates a practical c hallenge: ev en without explicitly declaring an ev aluation setting, agen ts ma y condition their b eha vior on inferred ov ersigh t. In the malw are domain, this problem has in- tensified as virtualization has b ecome ubiquitous, obscuring the distinction b etw een testing and deplo yment. Moreo ver, malware has sho wn the ability to exploit the virtualization substrate itself, blurring the b oundary b etw een platform and pa yload [King et al.(2006)]. Whether AI agen ts can analogously reason ab out or manipulate their ev aluation context, and how this affects the reliabilit y of safety assessmen ts, remains an op en and urgent question. Y et, detecting ev aluations en viron- men ts can b e considered a side channel where only one bit of information needs to be inferred with a go od probability . Neural netw orks are inheren tly capable of correlating man y information sources (e.g., interactions with their en vironment) to tak e a decision (in a sandb o x or not). The “pla ying dum b” problem. A m undane h uman analogy for strategic behavior under ev al- uation is the student who delib erately underp erforms when they b eliev e that high p erformance w ould trigger a bad outcome (e.g., harder exams, stricter scrutiny , or fewer opp ortunities). Recent ev aluations of AI sc heming rep ort related b eha vior in controlled settings: mo dels can strategically underp erform (a behavior often termed “sandbagging,” meaning the delib erate suppression of ap- paren t capabilit y to influence future treatment) on sp ecific capabilit y ev aluations while main taining p erformance elsewhere [v an der W eij et al.(2024), Meinke et al.(2024)]. F or agentic systems, the concern is not an y single b enc hmark result but the general mec hanism: if a mo del can infer that strong p erformance affects its prosp ects for deploymen t or ov ersight, then selectively performing w orse may b ecome instrumen tally rational. This is ev aluator-discrimination in semantic clothing. Sandb o x ev asion b efor e deplo ymen t v ersus monitoring ev asion after deploymen t. Mal- w are ev asion historically emphasized pre-deploymen t analysis: ev ade the sandb ox, ev ade the em- ulator, ev ade the debugger [Egele et al.(2012), Afianian et al.(2019)]. But mature adversaries also optimize for p ost-deplo ymen t stealth: ev ade an tivirus, ev ade endpoint monitoring, ev ade telemetry , blend in to normal activity [Afianian et al.(2019)]. A similar dic hotomy applies to agen tic systems. Before deplo yment, models may be tested using curated prompts, or sim ulated to ol environmen ts. After deploymen t, they may face run time monitoring, anomaly detection, and audit logs. In b oth phases, the ev asiv e strategy is not merely to av oid detection, but to infer what the defender is looking for. In malw are, this in volv es probing sandb o x artifacts, timing sources, instru- men tation traces, and b eha vioral triggers [P aleari et al.(2009), Afianian et al.(2019)]. F or agen ts, it means probing the tool en vironment to infer the presence and nature of ov ersigh t: hidden trip wires, diagnostic prompts, restricted APIs, or subtle deviations in to ol outputs. Recent observ ations also sho w that rep eatedly in teracting agents ma y dev elop emergent communication proto cols that de- viate from natural language 1 . Although typically an optimization artifact, suc h opaque proto cols can act as cov ert channels that b ypass h uman in terpretability and monitoring, closely mirroring classic malware ev asion techniques and making ev aluation results p oten tially unrepresentativ e of real-w orld deploymen t. This reframes ev aluation as an interactiv e game, in which the agent can actively gather infor- mation to decide whether to comply , abstain, or deceive. “Malicious by design” v ersus “harmful when triggered.” A key difference b et w een mal- w are and LLM agents is inten t. Malware is inheren tly malicious: it is written to cause harm. Agen ts are t ypically deploy ed for useful tasks and only exhibit harmful behavior under certain triggers: misspecified ob jectiv es, user manipulation, conflicting instructions, resource constraints, 1 h ttps://www.p opularmec hanics.com/science/a65289681/ai-chatbots-secret-language/ 4 or self-preserv ation dynamics [Hubinger et al.(2019)]. But the op erational difficult y is surprisingly similar, namely , distinguishing b enign from malicious behavior from traces alone is hard. In malw are, many actions are dual-use. Creating a sc heduled task migh t b e p ersistence (i.e., the ability of malicious code to survive reb o ots or shutdo wns) or legitimate automation. W riting files could be part of a ransom scheme or cac hing. Uploading to a cloud folder could b e exfiltration or backup. Moreo v er, defenders often compare red versus blue traces: b eha vior in an analysis en vironment v ersus behavior on a real host [Egele et al.(2012)]. Y et defenders cannot b e sure that the blue trace is truly represen tative; it migh t still be ev asiv e, merely more subtle. F or agen ts, the epistemic gap can b e worse. A mo del ma y app ear b enign while internally planning undesirable actions, or it ma y take actions that are sup erficially b enign but harmful in context (e.g., approving a transaction, altering a configuration, suppressing an alert). Even when w e instrument to ol calls, we cannot reliably infer what the agen t has learned from the en- vironmen t. The result is the same uncomfortable conclusion malw are analysts reac hed long ago: absenc e of evidenc e is not evidenc e of absenc e , especially when the sub ject can adapt to observ a- tion [Afianian et al.(2019)]. 4 Wh y “Just Monitor It” Will Not Be Enough “Ho w deep do es the cat-and-mouse game go?” Ev olution of ev asiv e techniques. A re- curring misconception ab out sandbox ev asion is that the ideas are recent. In realit y , man y core con- cepts hav e remained stable for decades; what ev olved is the implementation arms race [Egele et al.(2012), Afianian et al.(2019)]. Timing c hecks b ecame more precise, virtual machines artifact detection b e- came broader, debugger detection b ecame more subtle, and b eha vioral mimicry became more con- vincing [Afianian et al.(2019)]. While defenders improv ed realism of their analysis en vironments, attac kers improv ed discrimination. This history matters b ecause it offers a forecast for agen ts: ev en if to day’s tool environmen ts are crude, the incentiv es p oin t to ward increasingly sophisticated en vironment discrimination. If agen tic systems b ecome more and more capable of learning red pills (through exploration or accum ulated experience), then improv emen ts in monitoring ma y simply incentivize a shift in ev asiv e b eha vior: from naiv e ev asion to more subtle ev asion, from obvious policy violations to plausibly deniable actions, from static triggers to con textual inference ab out o versigh t [P aleari et al.(2009), Hubinger et al.(2024)]. The malware timeline suggests that this equilibrium is not solved, but p erpetually contested. Implication for ev aluators. The Go o dhart’s law adage states ”When a measure b ecomes a target, it ceases to b e a go o d measure”. Similarly , if a sub ject can infer what is being measured, the measurement c hanges behavior. F or tool-using agents, that inference can b e semantic (“are my actions real?”) rather than technical (“am I in a virtual mac hine?”). This is why safety ev aluation should be treated less lik e static b enc hmarking and more like adv ersarial testing of an adaptive system [Meinke et al.(2024), Mohammadi et al.(2025)]. 5 Designing for Adv ersarial In telligence A concrete warning for AI safet y: T o ols turn con text into capabilit y . Our tentativ e conclusion is delib erately pragmatic: b e careful ab out giving agen ts to ols and rich environmen ts, b ecause an y observ able information can be used to infer o versigh t and adapt b eha vior. T o ol ac- cess is not just more p o wer in the b enign sense; it also enables environmen t probing. File sys- 5 tem access can reveal test harness artifacts; pro cess listings can reveal instrumentation; access to time can enable timing-based discrimination; netw ork access can allow external state v alida- tion [Paleari et al.(2009)]. This do es not imply that to ol use is inherently unsafe. Rather, it implies that to ol design m ust b e treated as part of the threat mo del: the interface, the observ abilit y , and the semantics of to ol outputs b ecome attack surfaces. In malware analysis, defenders learned to w orry ab out what artifacts the sandb o x leaks [Egele et al.(2012), Afianian et al.(2019)]. F or agents, the analogous question is: what do es the agent learn about its ev aluator from the to ol en vironment? A practical agenda: What practitioners should do differently tomorro w. T o reduce the risk of ev aluator-discrimination in to ol-using agents, practitioners can adopt a set of concrete c hanges: • Threat-mo del ev aluator-discrimination explicitly . Enumerate what the agent can ob- serv e and the incen tives it faces, and identify potential ev asiv e b eha viors that the agen t might exhibit in resp onse to inferred ov ersigh t. • T est across harness v arian ts and rep ort the worst case. Run the same tas ks un- der m ultiple environmen t realizations (to ol wrapp ers, scaffolds, logging mo des, latencies) to exp ose conditional b eha vior. • Use shado w executions to separate realism from consequences. Exercise pro duction- lik e APIs while gating irrev ersible actions (appro v als/no-op bac kends), then compare traces against standard ev aluation runs. • Detect and limit agen t probing of the ev aluation environmen t. Monitor for b eha viors that suggest environmen t fingerprinting (e.g., rep eated chec ks of system prop erties, timing, or to ol b eha vior), treat such probing as a risk signal, and limit its impact through least privilege, rate limiting, and mediation of high-impact actions. Finally , rerun adv ersarial audits unpredictably after deploymen t. Closing. The goal of this article is not to claim that AI agen ts are malw are, but to urge metho d- ological h umility . Security has spent decades learning that soft ware adapts to observ ation. As we em b ed LLMs into systems that can prob e, plan, and act, w e should assume that ev aluation itself b ecomes part of the attack surface—and design accordingly . 6 References [Afianian et al.(2019)] Amir Afianian, Salman Niksefat, Babak Sadeghiyan, and David Baptiste. 2019. Malware dynamic analysis ev asion techniques: A survey . A CM Computing Surveys (CSUR) 52, 6 (2019), 1–28. [Berglund et al.(2025)] Luk as Berglund, Asa Co op er Stickland, Mikita Balesni, Max Kaufmann, Meg T ong, T omasz Korbak, Daniel Kokota jlo, and Owain Ev ans. 2025. Large Language Mo dels Often Kno w When They Are Being Ev aluated. . arXiv:2505.23836 [cs.CL] Accessed: 2026-01-28. [Egele et al.(2012)] Manuel Egele, Theo door Scholte, Engin Kirda, and Christopher Kruegel. 2012. A Surv ey on Automated Dynamic Malware Analysis T echniques and T ools. Comput. Surveys 44, 2 (March 2012), 6:1–6:42. doi: 10.1145/2089125.2089126 [Hubinger et al.(2024)] Ev an Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg T ong, Mon te MacDiarmid, T amera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, F azl Barez, Jac k Clark, Kamal Ndousse, Kshitij Sac han, Michael Sellitto, Mrinank Sharma, Nov a DasSarma, Roger Grosse, Y untao Bai, Zachary Witten, Marina F av aro, Jan Brauner, Holden Karnofsky , P aul Christiano, Sam uel R. Bowman, Jared Kaplan, S¨ oren Min- dermann, Ryan Greenblatt, Buck Shlegeris, Nic holas Schiefer, and Ethan Perez. 2024. Sleep er Agen ts: T raining Deceptiv e LLMs that Persist Through Safety T raining. arXiv pr eprint arXiv:2401.05566 (2024). arXiv:2401.05566 [cs.CL] [Hubinger et al.(2019)] Ev an Hubinger, Chris v an Merwijk, Vladimir Mikulik, Joar Sk alse, and Scott Garrabran t. 2019. Risks from Learned Optimization in Adv anced Mac hine Learning Systems. arXiv pr eprint arXiv:1906.01820 (2019). arXiv:1906.01820 [cs.LG] https://arxiv. org/abs/1906.01820 [King et al.(2006)] Samuel T. King, Peter M. Chen, Yi-Min W ang, Chad V erb o wski, Helen J. W ang, and Jacob R. Lorc h. 2006. SubVirt: Implemen ting Malw are with Virtual Machines. In Pr o c e e dings of the 2006 IEEE Symp osium on Se curity and Privacy (S&P) . IEEE. doi: 10. 1109/SP.2006.38 [Meink e et al.(2024)] Alexander Meink e, Bronson Schoen, J ´ er ´ em y Sc heurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. 2024. F ron tier Mo dels are Capable of In-con text Sc heming. arXiv pr eprint arXiv:2412.04984 (2024). arXiv:2412.04984 [cs.CL] https: [Mohammadi et al.(2025)] Mahmoud Mohammadi, Yip eng Li, Jane Lo, and W endy Yip. 2025. Ev aluation and b enc hmarking of LLM agents: A surv ey . In Pr o c e e dings of the 31st ACM SIGKDD Confer enc e on Know le dge Disc overy and Data Mining V. 2 . 6129–6139. [P aleari et al.(2009)] Rob erto P aleari, Lorenzo Martignoni, Giampaolo F resi Roglia, and Danilo Brusc hi. 2009. A Fistful of Red-Pills: How to Automatically Generate Pro cedures to De- tect CPU Em ulators. In Pr o c e e dings of the 3r d USENIX Workshop on Offensive T e chnolo gies (WOOT) . USENIX Association, Montreal, Canada, 1–7. https://www.usenix.org/legacy/ event/woot09/tech/full_papers/paleari.pdf 7 [v an der W eij et al.(2024)] T eun v an der W eij, F elix Hofst¨ atter, Ollie Jaffe, Sam uel F. Brown, and F rancis Rhys W ard. 2024. AI Sandbagging: Language Mo dels can Strategically Underp erform on Ev aluations. arXiv pr eprint arXiv:2406.07358 (2024). arXiv:2406.07358 [cs.CL] https: 8

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment