What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline


Author: Benoît Alcaraz

PhD-No.....XX

Faculty of Sciences, Technology, and Medicine

DISSERTATION

Presented on 16th March in Luxembourg to obtain the degree of

DOCTEUR DE L'UNIVERSITÉ DU LUXEMBOURG EN INFORMATIQUE

by

Benoît Alcaraz
Born on 13th January 1998 in Bourgoin-Jallieu (France)

What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline

Doctoral School in Science and Engineering

Dissertation Defence Committee:
Prof. Decebal Mocanu
Prof. Marija Slavkovik
Prof. Laetitia Matignon
Dr. Amro Najjar
Prof. Leendert W. N. van der Torre

Supervisor: Leendert W. N. van der Torre, Professor

Affidavit

I hereby confirm that the PhD thesis entitled "What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline" has been written independently and without any other sources than cited.

Luxembourg, Name

Acknowledgements

I would like to thank my supervisor, Professor Leon van der Torre, for giving me the opportunity to carry out this thesis and for trusting me with the choice of my topic. I have appreciated this collaboration, and I am proud today to be able to present this manuscript to him.

I also thank Dr Amro Najjar for his help and advice throughout my thesis, as well as for his support in my work.

To my friends, Salma and Aria, I sincerely thank you for accompanying me on this adventure and for supporting me all along. Thank you for your kindness. I have enjoyed spending these moments by your side, and I hope to carry this friendship well beyond these four years.

I thank my colleagues and co-authors, Adam, David, Emery, Alexandre, and all the others I cannot name here, for their advice. Your help has been invaluable to me, and I keep an indelible memory of our collaborations.

Christopher, thank you for introducing me to the world of research and for giving me a taste for it. It all began with our work on AJAR, and four years later, this thesis is its culmination. I hope you will find it worthy of our collaboration.

Finally, I thank you, Gloria. By my side from beginning to end, I am deeply grateful to you. Thank you for sharing this moment with me. It was not all smooth sailing, but as Paul Valéry put it so well, "The wind is rising!... We must try to live!" And so, in moments of doubt, you knew how to shoulder me, support me, and listen to me with passion. If no one can predict the future, these last years let me say with certainty that you will remain in my heart forever.

I dedicate this thesis to my parents, Joséphine and Éric, without whom none of this would have been possible. I am grateful to you for always having pushed and supported me, not only during these four years but also during the twenty-four that preceded them. Despite the distance, you knew how to be present, and even if the title of this thesis may seem somewhat obscure to you, you always listened with attention and pride whenever I spoke about it. Today it is my turn to be proud to write these words in this manuscript, thus rendering unto Caesar what is rightfully his.
My parents: I love you.

Contents

1 Introduction
1.1 Context and Motivations
1.2 Research Questions
1.3 Methodology
1.4 Evaluation
1.5 Contributions
1.6 Layout of this Thesis

2 Preliminaries
2.1 Technical Background
2.1.1 Reinforcement Learning
2.1.2 Norms
2.1.3 Formal Argumentation
2.2 Background
2.2.1 The AJAR Framework
2.2.2 The Jiminy Architecture
2.2.3 Norm Guided Reinforcement Learning Agent

3 π-NoCCHIO: A Context-Aware Normative Architecture
3.1 Introduction
3.2 Model
3.2.1 Context-Aware Normative Reasoning
3.2.2 A Normative Supervisor for a Normative Agent
3.2.3 Running Example
3.2.4 δ-Lexicographic Selection
3.3 Evaluation
3.3.1 Environments
3.3.2 Results
3.4 Related Work
3.4.1 Normative Reinforcement Learning
3.4.2 Normative Supervisors
3.4.3 Other Approaches to Normative Agency
3.5 Summary

4 Dynamically Modelling the Norms of Stakeholders
4.1 Introduction
4.2 Review of the Norm Mining Techniques
4.2.1 Research Questions
4.2.2 Analysis of the Research Questions
4.2.3 Analysis of the Approaches
4.2.4 Takeaways
4.3 Argumentative Rule Induction
4.3.1 Computing a Justification
4.3.2 n-arguments
4.3.3 Bipolar Argumentation
4.3.4 Search Method
4.4 Evaluation
4.4.1 Datasets
4.4.2 Quantitative Evaluation
4.4.3 Qualitative Evaluation of the Generated Explanations
4.5 Related Work
4.6 Summary

5 Proposing a Robust Approach to Norm Avoidance
5.1 Introduction
5.2 Norm Avoidance: Concept and Definitions
5.3 Mitigation Strategies for Norm Avoidance in Reinforcement Learning
5.3.1 Preliminary Definitions
5.3.2 Proposed Approaches
5.3.3 Evaluation
5.4 Related Work
5.5 Summary

6 Related Work
6.1 Pipelines for Normative Agents
6.2 Learning of Informal and Social Norms
6.3 Value Alignment in Ethics

7 Future Work
7.1 Improving the Dialogue Components
7.2 Adapting the Approach to Complex Environments
7.3 Providing Formal Guarantees
7.4 Technical Improvements

8 Summary

References

List of Figures

2.1 Example of an MDP.
2.2 Reinforcement Learning training loop.
2.3 Example of a Labelled MDP.
2.4 Representation as a directed graph of an argumentation framework.
2.5 Representation of the AJAR framework.
2.6 Jiminy's smart home example. Reused from Liao et al. [112].
2.7 Representation of the Jiminy pipeline. Reused from Liao et al. [111].
3.1 Example of graph generated by two avatars.
3.2 Diagram of a standard reinforcement learning training loop.
3.3 Diagram of the π-NoCCHIO training loop.
3.4 Diagram of the judge component.
3.5 Argumentation graph for the norm F(speeding).
3.6 Argumentation graph for the norm F(stop | on_road).
3.7 Argumentation graph in state s for the norm F(speeding).
3.8 Argumentation graph in state s for the norm F(stop | on_road).
3.9 The Taxi-A/B/C environment.
3.10 The Taxi-D environment.
3.11 Evolution of the reward and violation count during the training phase with RL-Lex in the Taxi-A environment.
3.12 Evolution of the reward and violation count during the training phase with RL-DLex in the Taxi-A environment.
3.13 Evolution of the reward and violation count during the training phase with RL-Lex in the Taxi-B environment.
3.14 Evolution of the reward and violation count during the training phase with RL-DLex in the Taxi-B environment.
3.15 Evolution of the reward and violation count during the training phase with RL-Lex in the Taxi-C environment.
3.16 Evolution of the reward and violation count during the training phase with RL-DLex in the Taxi-C environment.
3.17 Normative architecture proposed by Neufeld et al. [146].
4.1 How each research question is addressed by each approach.
4.2 Comparison with the maximal value.
4.3 Comparison with the minimal value.
4.4 Taxonomy of the identified approaches.
4.5 Main area of the proposed approaches.
4.6 Universal graph for the Car dataset. The τ argument is in green. The λ argument is in red.
4.7 Contextual graph for the Car dataset and a specific input. The target argument is in green. The λ argument is in red.
4.8 Example of a bipolar argumentation framework. Solid edges denote attacks, and dashed edges supports.
4.9 Universal graph for MM-delta.
4.10 Bipolar universal graph for MM-delta.
4.11 First scenario. Outcome n°1 is on the left, and outcome n°2 on the right.
4.12 Second scenario. Outcome n°1 is on the left, and outcome n°2 on the right.
5.1 s0 represents the resulting state of an agent's compliant transition (s, a, s0); s1 is a state resulting from the transition (s, a, s1) where the norm was defeated for the first time (a defeat state); s2 is a state where a norm has not been complied with in the preceding state-action-state transition (s, a, s2) (a non-compliance state); and s3 indicates a violation of the norm in the preceding transition (s, a, s3) (a violation state).

List of Tables

3.1 Example of a state (s), with the facts (ε(s)) and arguments (getArgs(ε(s))) that can be built from it.
3.2 Expected Q-values for a fake scenario with a state s and a list of actions a ∈ A(s).
3.3 Example of Q-values in a state s.
4.1 Research Questions Breakdown.
4.2 Summary of the facts, i.e., the values for the different attributes, for the example.
4.3 Dataset name, size, and characteristics. Column "Att. Types" indicates how many attributes are continuous (c) or nominal (n). Column "Arg." indicates the number of attribute-value couples (i.e., arguments) constructed with a segmentation into 6 intervals for the continuous attributes.
4.4 Accuracy of the baseline (the SVM classifier from the Python Scikit library) and the variants of ARIA on several datasets. The best value for each dataset is bolded, and the runner-up underlined.
4.5 Summary of the facts for scenario n°1.
4.6 Summary of the facts for scenario n°2.
5.1 Enumeration of possible parameter settings. Self indicates whether the norm is defeated by the agent or by an external source. Alt. indicates that there is a fully compliant path to the goal state. Obj. indicates that there is a second goal, and that reaching that goal would force the agent to reach a state which defeats the norm that limits access to the first goal.
5.2 Comparison of the proposed approaches.

Abstract

In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in "Le avventure di Pinocchio - Storia di un burattino", this thesis proposes a pipeline that addresses the problem of developing norm-compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces π-NoCCHIO, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors' decisions. Finally, this thesis investigates the phenomenon of norm avoidance, providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.

Chapter 1

Introduction

1.1 Context and Motivations

Pinocchio is the main protagonist of "Le avventure di Pinocchio - Storia di un burattino". This book and its adaptations tell the story of a sentient puppet that faces morally relevant situations along its wanderings. His companion, Jiminy Cricket, plays the role of its conscience, advising the puppet on what is right and wrong. Eventually, the story ends well: Pinocchio learns from his experiences and is transformed into a real, flesh-and-blood little boy. Although Pinocchio is often seen as an allegory for naughty children, we can also interpret it as an artificial agent [176, 208, 53]. An artificial agent is an autonomous system capable of perceiving its environment, reacting to changes, pursuing its own goals proactively, and interacting socially with others in the pursuit of individual or collective tasks.
We can regard Pinocchio as an agent that will do anything that yields a reward, without accounting for any ethical or normative considerations. For this reason, it requires that external conscience, Jiminy Cricket, who provides it with normative guidance.

In the past, Artificial Intelligence (AI)^1 had a limited field of application due to technical constraints when dealing with the real world. This resulted in very few scenarios in which AI was interacting with humans. Today, AI is developing rapidly, and it will be common in the future to engage with AI systems in our daily life. Already, several of these AI systems can be found in various real-world applications, such as autonomously driving cars on our roads [187], automated job applicant selection and shortlisting [156], smart grids regulating energy distribution [105], audiovisual and textual content generation for entertainment or customer support [87], adaptive robots learning while operating alongside warehouse workers [7], or conversational agents [55].

^1 Nilsson [148] defined AI as follows: "Artificial intelligence is that activity devoted to making machines intelligent, and intelligence is that quality that enables an entity to function appropriately and with foresight in its environment".

Because these artificial agents are now evolving in our world, they need to follow our norms [194]. In human societies, norms serve as implicit or explicit guidelines that govern interactions to ensure coordination, safety, and efficiency [137, 164]. These norms can be legal (e.g., traffic laws for autonomous vehicles), social (e.g., queuing in public spaces), or ethical (e.g., fairness in AI decision-making). Norms can sometimes be made explicit through text, or they can emerge dynamically through repeated interactions, and they may vary across cultural and situational contexts [128]. A system or an environment in which norms are in force can be called a normative system. Not respecting the relevant norms can lead to misunderstandings, lack of trust, or dangerous behaviours, ultimately resulting, at best, in inefficiency and, at worst, in moral or physical harm.

However, several questions arise. How can agents behave optimally if they have to adhere to norms that restrain their actions? Where would the norms in question come from? How should we shape agents so that they integrate into these systems? These are just a few of the many questions and challenges we currently face.

In the scientific literature, one can find two major kinds of approaches. The first kind, often referred to as top-down approaches, usually relies on the concept of norms as defined in deontic logic. Deontic logic is a subfield of logic that classifies norms as obligations, permissions, or prohibitions [84]. Doing so allows us to use logical tools to, for example, assess whether an agent is behaving well according to a set of norms. It can also serve to determine what an agent ought to do even when that is not stated explicitly. It can further be extended to cover cases where a norm does not have to be respected because the situation is exceptional, in the sense that it fits a set of criteria that makes adherence to the norm less relevant. To some extent, the Jiminy architecture [112, 111]^2 can be seen as employing a top-down approach.

^2 This architecture is detailed more deeply in Section 2.2.2.
The architecture serves to determine, in a given context, which actions an agent has the right to perform. To do this, it considers the preferences and arguments of several stakeholders, such as the user, law enforcement, or the manufacturer. It also resolves dilemmas where no action is fully compliant with the norms. This ensemble of techniques enables reasoning about the duties of an agent, but it often struggles in uncertain, large, and continuous environments where the consequence of an action is not clear and where there is no way to avoid breaking the rules.

The second group of approaches, known as bottom-up approaches, involves analysing large quantities of unstructured data, or learning through trial and error, so that the agent's behaviour is optimal in terms of the task it was designed for. When agents engage in a prohibited action, they can be penalised, which incentivises them to complete their given task in a way that complies with the system's norms. Although these techniques work well in complex environments, they do not provide any guarantee of what exactly is learnt by the system. Is the agent learning malicious behaviours? Does it favour certain norms over others in cases of conflict? Can it deliberately violate a norm? Because of the opacity of such systems, it is not realistic to imagine we can obtain solid answers. Since regulations on autonomous systems increasingly emphasise transparency (see the European AI Act [89]), this problem can no longer be tolerated.

Consequently, a third class of approaches exists. It aims to address the problems associated with both top-down and bottom-up approaches while preserving their good points: top-down approaches are strong on transparency and deductive power, while bottom-up approaches are strong on adaptation capability and computational efficiency. This third class of approaches is often referred to as hybrid. One example of such a hybrid approach is the NGRL (Norm-Guided Reinforcement Learning) agent^3 proposed by Neufeld [143] (and further developed in subsequent publications [144, 145]). In this reinforcement learning [189] architecture, an agent learns to achieve its main task by learning a policy that optimises its reward while prioritising compliance with the norms in its environment. It evolves in accordance with a labelled Markov decision process that returns a set of labels for each state visited by the agent. Using constitutive norms [183], these labels serve to determine whether the agent committed a violation. This information is then returned to the agent so that it can update its policy.

^3 This approach is presented in more detail in Section 2.2.3.

Another example is AJAR (A Judging Agents-based RL-framework)^4 proposed by Alcaraz et al. [9, 10]. Whereas the previous approach emphasises the learning algorithm, this approach focuses on the governor component that determines whether the agent acted ethically^5 or reprehensibly, evaluated against a designated moral value. This judgement is then scalarised into a numerical value and combined with the reward for accomplishing the main task. The final value is given to the reinforcement learning agent so it can learn to optimise it.

^4 See Section 2.2.1 for more details.
^5 AJAR was designed for applications in ethics rather than normative systems specifically.

In the story of Pinocchio, both Pinocchio and Jiminy can be modelled using the agent metaphor [37, 40].
This metaphor attributes mental attitudes, such as goals or beliefs, to actors^6. With respect to the previously introduced hybrid architectures, Pinocchio's goal is simply to optimise its own reward. It believes that performing a certain action in a particular context will produce a favourable outcome. In contrast, Jiminy Cricket's goal is to ensure that the perspectives of multiple stakeholders are taken into account. Its beliefs correspond to the arguments put forward by all of them. When Jiminy advises Pinocchio and Pinocchio follows the advice, we can see this working as a hybrid system where Pinocchio is the agent^7 and Jiminy is the moral or normative governor.

^6 Here, an actor is a general term that encompasses a single social entity, a group [45], a virtual community [42], a normative system [41, 47], a contract [43], or an organisation [40, 46].
^7 Possibly a learning agent.

This thesis aims at creating a novel hybrid approach called π-NoCCHIO [8] (where π refers to the policy in reinforcement learning, and "No" refers to "Normative"). It is intended to complement rather than replace existing approaches. This thesis proposes a base architecture that makes use of recent top-down and bottom-up techniques. This architecture serves as an interface for training an agent to follow the norms of various stakeholders. These stakeholders may or may not share the same points of view.^8 For the sake of transparency and explainability, the process of determining whether a norm must be respected in a certain context [1, 127] is carried out by a symbolic reasoning engine. Meanwhile, to ensure the agent's ability to adapt to unknown environments and a broad range of situations, we make use of reinforcement learning (RL), a common choice among bottom-up approaches.

^8 As Perelman [162] puts it, "If men (sic) oppose each other concerning a decision to be taken, it is not because they commit some error of logic or calculation. They discuss apropos the applicable rule, the ends to be considered, the meaning to be given to values, the interpretation and characterisation of facts." These factors are why, even though the stakeholders discuss the same set of norms with the same observations, they may still disagree.

The proposed architecture requires the norms to be represented in a structured and explicit manner, as they need to be manipulated by the top-down part. But such a representation of these norms is not always directly available and often needs to be handcrafted by those who wish to use this approach. Finding the reasons that render a norm irrelevant, and the reasons that reinstantiate it [1, 127] (i.e., a reason that creates an exception to the exception), may prove to be even more complex. This limited scalability can drastically restrict the applicability of the approach if not handled properly. For this reason, we propose to enhance this approach with an algorithm that aims to extract the reasons which may render a norm inapplicable, or reinstantiate it, for a given context.

Furthermore, when developing this system, we identified an emerging issue raising serious concerns, not only for the proposed approach but also for those developed within the literature on normative agents. This issue, termed norm avoidance, is a phenomenon that shares similarities with reward hacking, as it can mislead the agent into gaming the norms.^9 The behaviour learnt by the normative agent may then not correspond to that expected by the designer, potentially leading to behavioural hazards or safety issues.

^9 One counter-argument is that one should just create the norms more cautiously so that norm avoidance does not happen. We claim that this is not a plausible solution, as conceptual flaws have passed undetected many times in the history of computer science, e.g., phantom attacks [142], integer overflow [106], or biased data [214].
As such, we formalise this concept and propose, as a first step, a way to address it within π-NoCCHIO.

Since this thesis draws on techniques from various subfields of AI, the following section provides a summary of the essential background required for a proper understanding.

1.2 Research Questions

Research questions were formulated to guide the development of the thesis and clarify its core contributions. They are as follows:

RQ1. How can we design an artificial agent that adheres to context-specific norms?
  RQ1.1. How can we design an agent that learns behaviours integrating heterogeneous normative viewpoints?
  RQ1.2. Which methods can mitigate reward-gaming that compromises norm adherence by RL agents?
RQ2. What is the most suitable method for gathering norms to support the functioning of the proposed architecture?
  RQ2.1. How can the context-dependent exceptions to the application of a norm be learnt?
  RQ2.2. In what way can norms and their exceptions be modelled to enhance intelligibility?

In order to answer these research questions, this thesis proposes an architecture, π-NoCCHIO [8], that is based on a combination of the Jiminy architecture [112, 111], the NGRL agent [143, 144, 145], and the AJAR framework [9, 10]. Even though these previous works have attempted to address the aforementioned research questions, each by now suffers several limitations due to certain considerations that were out of scope at the time of their design.

First, the Jiminy architecture, while valuable for modelling heterogeneous stakeholder viewpoints, presents difficulties when integrated with reinforcement learning agents, as it was originally agnostic to the supervised agent. Its reasoning process is computationally costly, both online and offline, requiring extensive inference rules and the construction of argumentation graphs, which makes its use in stochastic environments challenging. Moreover, it provides no mechanism for handling uncertainty, as it exclusively selects actions prior to execution and does not adapt to address their consequences, which can lead to short-sighted behaviours. The approach also raises privacy concerns, since stakeholder models stored within the system may expose sensitive information. The architecture assumes that all knowledge must be made explicit, which risks discouraging stakeholders from disclosing their arguments, and eventually from accepting the use of such a system.

Second, the AJAR framework was originally designed for machine ethics rather than normative systems^10, which limits its applicability in the particular context of normative systems. Its aggregation of the judges' outputs is largely arbitrary, and it assumes that each moral value is evaluated in isolation by a single judge, without considering how different values or norms might conflict with one another.

^10 While Jiminy was also designed for machine ethics, it is specifically focused on deontology. It can be argued that deontology, and normative systems in general, can be seen as a subfield of ethics, but with very specific requirements that make it different from the other branches of ethics. For example, there is a clear emphasis on the binary nature of compliance with norms, whereas ethical domains emphasise compromises.
More importantly, AJAR treats ethical evaluations as scalar rewards that can be balanced against task performance. Although this is acceptable in ethical decision-making, it is problematic in normative settings, where compliance with norms is treated as an obligation rather than a negotiable preference.

Finally, the NGRL agent, although providing a solid foundation for incorporating norms into RL, also suffers from notable limitations. It assumes that the set of constitutive and regulative norms and the thresholds for lexicographic selection are both predefined, which requires a significant amount of expert knowledge and severely limits scalability. The choice of thresholds itself is arbitrary, and the framework only offers a limited way of handling conflicts, leaving it vulnerable in situations where violations are unavoidable or where norms clash in complex ways. In particular, it is also prone to causing agents to commit norm avoidance. As a result, while NGRL represents an important step towards hybrid normative agents, it is less robust in dynamic, uncertain environments.

The aim of the architecture proposed in this thesis, π-NoCCHIO, is to make an agent learn to comply with context-dependent norms (RQ1 and RQ1.1). Since this architecture requires that the data is collected (RQ2) and structured in a way that meets specific requirements and the intelligibility needs of normative systems (RQ2.1 and RQ2.2), a method is also proposed to meet this requirement: the ARIA algorithm. Finally, since the phenomenon of norm avoidance can emerge within this framework (RQ1.2), a set of definitions is provided to characterise it, along with strategies to limit its occurrence in a reinforcement learning setup.

1.3 Methodology

Normative reasoning approaches, formal argumentation-based approaches, reasons-first-based approaches, and architectural approaches are different tools that serve different purposes. The field of logic has seen interest in combining these approaches as part of a larger toolbox [35, 213]. This thesis follows the same intuition to construct the π-NoCCHIO architecture, merging different methods from the literature (reinforcement learning, moral supervisors, argumentation) to accommodate the particular requirements of normative systems.

The use of reinforcement learning [189] allows for flexibility when faced with uncertain outcomes or situations that have not been encountered before. It also makes it possible to see beyond the direct consequences of an action while maintaining a low computational cost. As we face norms, we use the NGRL agent [143, 144, 145] to account for norms and their violations, but we improve the reasoning part by using an adjusted version of the Jiminy architecture [112, 111]. This allows the status of the norms to account not only for the ever-changing context but also for the different viewpoints and arguments that the stakeholders hold.
In order to guarantee a modular basis for converting the symbolic output of Jiminy into a reward that can be used to train the NGRL agent, the π-NoCCHIO architecture is based on the AJAR framework [9, 10].

1.4 Evaluation

This thesis is organised in such a way that each chapter is self-contained. Consequently, each approach presented in these chapters is evaluated independently, following its own testing protocol.

In Chapter 3, the π-NoCCHIO architecture is evaluated in a custom reinforcement learning environment centred on an autonomous taxi agent. The evaluation focuses on the capacity of the agent to learn its task while adhering to relevant norms. Two variants of this agent are then compared in order to assess the benefits of each.

In Chapter 4, the evaluation focuses on the ability of the proposed algorithm to accurately extract an argumentation graph while still providing good predictive accuracy. First, a quantitative comparison of the accuracies of the proposed approach and its variants is conducted on benchmark tabular datasets from the classification literature. This is followed by a qualitative study of the explanatory potential of two of the variants.

In Chapter 5, approaches for mitigating norm avoidance are first defined and then evaluated by running them over a set of MDP (Markov decision process) environments representing various situations where norm avoidance can occur. The different approaches are compared in terms of performance against one another and against a normative agent architecture from the literature. The evaluation then proceeds to assess operational effectiveness and identify the advantages and disadvantages of each approach.

1.5 Contributions

This thesis contributes to the field of normative systems by proposing an end-to-end pipeline that combines formal argumentation and normative reasoning with model-free normative reinforcement learning. This pipeline can be broken down into two main components.

The first component, the π-NoCCHIO architecture, responds to RQ1 and RQ1.1. It aims to teach an artificial agent to follow norms while performing its assigned task. It combines and builds a normative RL framework upon three works in the literature: Neufeld's normative agent [143, 144, 145], which is a solid ground for training an agent to follow norms within an unknown stochastic environment; the Jiminy architecture [112, 111], which proposes a model for a normative supervisor; and the AJAR framework [9, 10], which we developed in a prior project and which constitutes the base framework, since it was originally designed to combine symbolic supervision with reinforcement learning. The challenge of developing such an architecture lies in the translation of the symbolic normative reasoning into a numeric value that can be treated by the reinforcement learning part. Jiminy was not designed to return a numerical value, and AJAR was not made in a way that allows dialogues and collective reasoning between the judges. For these reasons, these two architectures need to be adapted to overcome these limitations.

The second component, answering RQ2.1 and RQ2.2, aims at identifying the norms, and more precisely the exceptions to these norms, within a normative system. It does so by learning over behavioural datasets.
It then represents these norms in the form of an argumentation graph, allowing the first component to use them. This is meant to lessen the burden on the designer, rendering the approach less prone to human mistakes and more scalable. This component requires the norm to be known beforehand. As we do not want to reinvent the wheel, we prefer to let the user select one of the many methods for norm identification. However, in order to guide this choice, we provide a review of the literature on these methods and identify their specificities, thereby answering RQ2.

Finally, some empirical observations allowed me to identify a problem transversal to the field of normative agents, namely norm avoidance. Consequently, the last part of this thesis attempts to define this novel challenge by presenting definitions and examples, and subsequently proposes an ad hoc solution for the specific case of normative reinforcement learning, answering RQ1.2. Here, the challenge lies in correctly framing the cases that can be categorised as norm avoidance and ensuring that the proposed methods mitigate it without blocking behaviours that would be considered acceptable.

A challenge transversal to the making of this thesis is its interdisciplinary aspect, as it combines philosophy and engineering. In consequence, it has to comply with the philosophical ideas developed over the years about ethical and deontological principles or how dialogues should be led, while still ensuring correct functioning with resource complexity and scalability in mind, as well as technicalities such as the collection and use of data or the making of user-friendly interfaces. Together, the various elements of this thesis should form a robust and scalable end-to-end pipeline for normative reinforcement learning.

1.6 Layout of this Thesis

This thesis is organised as follows. First, the technical background for the correct understanding of the dissertation, as well as the necessary knowledge about the reused approaches, is provided in Chapter 2. Then, Chapter 3 details an architecture for normative reinforcement learning that combines and adapts several approaches from the literature. Chapter 4 provides an extensive analysis of the literature on norm identification and then proposes a novel algorithm that can be combined with existing norm mining methods to extract norms, as well as their potential exceptions, from observations. Chapter 5 then proposes a preliminary definition of norm avoidance and partially addresses this problem by proposing a solution for the approaches based on reinforcement learning. Finally, the remaining sections of this thesis discuss the results obtained with a higher-level reflection and present some related work. The manuscript is concluded with a presentation of the challenges for future research, as well as a summary of the content. Each chapter is self-contained, and Chapter 2 provides all the necessary knowledge about the tools used in this thesis.

Chapter 2

Preliminaries

This chapter introduces in Section 2.1 the necessary technical preliminaries for the understanding of the following chapters. Then, in Section 2.2, it presents the three main works on which this thesis builds.

2.1 Technical Background

This section provides the necessary background for understanding the content of this dissertation.
Section 2.1.1 details what reinforcement learning and Markov decision processes are, while Section 2.1.2 explains the way norms are defined across the deontic logic literature. Finally, Section 2.1.3 introduces formal argumentation, and more specifically, abstract argumentation.

2.1.1 Reinforcement Learning

Reinforcement learning (RL) is used to train an agent to perform a certain task in an environment that may or may not be stochastic. The agent learns through trial and error. Each time it performs an action, it receives a reward, represented by a numerical value. RL features several advantages over other approaches, such as its capacity to deal with stochastic (i.e., non-deterministic) environments, the small amount of expert knowledge required to make it work, and the fact that it converges towards optimal solutions.

More formally, it takes place in an environment formalised as a Markov Decision Process (MDP), defined as follows:

Definition 1. A Markov Decision Process is a tuple ⟨S, A, P, R⟩ where:
- S is a set of states,
- A is a function A : S → 2^Act from states to a set of possible actions (with Act being the set of all the actions available to the agent),
- R : S × Act × S → ℝ is a scalar reward function over states and actions. It can be simplified to R : S × Act → ℝ in a deterministic environment,
- P : S × Act × S → [0, 1] is a probability function giving the probability of transitioning from the state s to the state s′ when doing action a.

Remark 1. A state represents how the agent perceives the environment. It can be provided explicitly, for example by specifying that the agent is in state s_i, or in a raw form. In the latter case, the state may consist of pixels from an image, atomic propositions, numerical feature vectors, or a combination of these elements.

To better understand what this corresponds to, a representation of an MDP containing three states (S_0 to S_2) and two actions (a_0 and a_1) is given in Fig. 2.1. We can see in this example that performing the action a_1 in state S_1 has a 5% chance of moving the agent to S_2, and a 95% chance of making the agent stay in S_1. On the other hand, we see that if the transition (S_1, a_0, S_0) occurs, the agent receives +5 as a reward.

Figure 2.1: Example of an MDP.

The goal of reinforcement learning is to find a policy π : S → Act that designates the optimal behaviour with respect to the reward function, that is, behaviour that maximises overall rewards in the long run, with emphasis on the most immediate rewards. This policy is denoted as π*. To learn π*, we can use the Q-learning algorithm [205] to learn the function Q : S × Act → ℝ. Specifically, the optimal Q-function is defined as

$$Q^{\pi^*}(s, a) = \mathbb{E}\left[\sum_{t=0}^{n} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right] \tag{2.1}$$

where γ ∈ [0, 1] is the discount factor, which determines how much emphasis is placed on the current reward versus future rewards, and n is the number of steps in the agent's path. Q^{π*}, then, gives the expected (discounted) sum of rewards over the agent's run time, assuming that it takes action a in state s, and thereafter takes a path, i.e., a sequence of state-action pairs, following π*.

We learn Q^{π*} by exploring the environment, typically through random actions that allow the agent to gather diverse experiences and to identify the most profitable actions based on the reward obtained and the resulting state. This exploration process is illustrated in the RL training loop shown in Fig. 2.2.

Figure 2.2: Reinforcement Learning training loop.

In Q-learning, for each transition (s, a, s′) the update

$$Q(s, a) := Q(s, a) + \alpha\left(R(s, a) + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a)\right)$$

is performed. In this expression, α ∈ [0, 1] is the learning rate, which determines how much emphasis is put on the former Q-value in the update. In other words, in Q-learning, we update Q repeatedly until it converges to Q^{π*}. Then π* is:

$$\pi^*(s) \in \arg\max_{a \in A(s)} Q^{\pi^*}(s, a)$$
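To make the update rule and the training loop of Fig. 2.2 concrete, here is a minimal tabular Q-learning sketch on the toy MDP of Fig. 2.1. Only two of its numbers are stated in the text (the 0.95/0.05 split for (S_1, a_1) and the +5 reward for the transition (S_1, a_0, S_0)); the remaining transitions are invented placeholders, and unspecified rewards default to 0.

```python
import random

# Toy MDP from Fig. 2.1, states S0-S2 and actions a0-a1. Only two numbers
# are given in the text: P(S2 | S1, a1) = 0.05 (0.95 to stay in S1) and
# R(S1, a0, S0) = +5; every other transition is an invented placeholder.
P = {                                   # P[(s, a)] -> [(s', probability), ...]
    ("S0", "a0"): [("S0", 1.0)],
    ("S0", "a1"): [("S1", 1.0)],
    ("S1", "a0"): [("S0", 1.0)],
    ("S1", "a1"): [("S1", 0.95), ("S2", 0.05)],
    ("S2", "a0"): [("S0", 1.0)],
    ("S2", "a1"): [("S2", 1.0)],
}
R = {("S1", "a0", "S0"): 5.0}           # unspecified rewards default to 0

ALPHA, GAMMA = 0.1, 0.9                 # learning rate and discount factor
Q = {sa: 0.0 for sa in P}               # tabular Q-function

def available_actions(s):
    return [a for (s_, a) in P if s_ == s]

for episode in range(2000):
    s = "S0"
    for step in range(50):
        a = random.choice(available_actions(s))            # pure exploration
        next_states, probs = zip(*P[(s, a)])
        s_next = random.choices(next_states, weights=probs)[0]
        r = R.get((s, a, s_next), 0.0)
        # Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(s_next, a_)] for a_ in available_actions(s_next))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# Greedy policy pi*(s) extracted from the learnt Q-table.
policy = {s: max(available_actions(s), key=lambda a: Q[(s, a)])
          for s in ("S0", "S1", "S2")}
print(policy)
```

With these placeholder dynamics, the greedy policy that emerges alternates between S_0 and S_1 so as to collect the +5 reward as often as possible.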
In the next section, we will refer to labelled MDPs. They can be defined as an extension of the definition of an MDP, as follows:

Definition 2. A labelled MDP is a tuple ⟨S, A, P, R, L⟩^1 where S, A, P, and R are defined in the same way as in Definition 1, and where L : S → 2^AP corresponds to a labelling function from states to subsets of a set of atomic propositions AP.

^1 It is possible to remove the element R corresponding to the reward and simply keep the tuple ⟨S, A, P, L⟩.

An example of a labelled MDP can be seen in Fig. 2.3. The state S_0 has a single label "dog", state S_1 has two labels "dog" and "cold", and state S_2 does not have any label. While these labels are assumed to be mapped already to the set of states, they can in practice be computed at the time the agent enters a state, based either on the observations of the agent or the properties of the state itself. This is particularly convenient when dealing with a non-finite or continuous state space.

Figure 2.3: Example of a Labelled MDP.
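As a small illustration of this second option, computing L on the fly instead of storing it per state, the sketch below derives the labels of Fig. 2.3 from raw state features; the feature names and the temperature threshold are invented for the example.

```python
# Labelling function L : S -> 2^AP computed on entry to a state from raw
# features rather than stored per state. The propositions "dog" and "cold"
# follow Fig. 2.3; the feature names and the threshold are invented.
def labels(state: dict) -> set:
    ap = set()
    if state.get("animal") == "dog":
        ap.add("dog")
    if state.get("temperature", 20.0) < 5.0:   # hypothetical threshold
        ap.add("cold")
    return ap

print(labels({"animal": "dog", "temperature": 2.0}))    # {'dog', 'cold'}, like S1
print(labels({"animal": None, "temperature": 20.0}))    # set(), like S2
```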
2.1.2 Norms

In our society, norms serve as implicit or explicit guidelines that govern interactions, ensuring coordination, safety, and efficiency. These norms can be legal (e.g., traffic laws for autonomous vehicles), social (e.g., queuing behaviour in public spaces), or ethical (e.g., fairness in AI decision-making). These norms can sometimes be made explicit through text, but can also emerge dynamically through repeated interactions and may vary across cultural and situational contexts [128]. They can also change and evolve over time [39].

Deontic logic is an area of logic that investigates normative concepts, i.e., deontic concepts [84]. It aims at developing tools and methods to represent these norms and reason about them. The deontic logic literature defines two types of norm: constitutive norms and regulative norms. The former can be summarised as a way of describing how rules or norms bring certain social or institutional realities into existence by defining what actions, statuses, or roles count as within a given context. Searle characterises them as a way to build institutional facts from brute facts [183], but more simply, they can be seen as a counts-as mechanism. For example, "X took an item in the store and left without paying" counts as "X stole an item from the store". Here, the brute fact is that X left the store with an unpaid item, while the institutional fact is that X stole this item. In short, constitutive norms serve as incremental building blocks for determining what is and what is not from low-level observations.

Note that the concept of constitutive norms has been further developed so that they can take into account a given context [184]. For example, "X took an item in the store and left without paying" counts as "X stole an item from the store", under the condition that "X is not the manager of the store". Such a structure for a constitutive norm C_i can be represented as C_i(A, B | Y), which means that in context Y, A counts as B (for A and B being some events or properties). In this thesis, a constitutive norm having for context the tautology ⊤ will be rewritten C_i(A, B).

On the other hand, regulative norms are closer to what we usually call "norms" in our society. They have a deontic content and indicate what is obligatory, permitted, and forbidden. In deontic logic, they are represented via the modalities O for obligations, P for permissions, and F for prohibitions. They are often expressed with the following equivalences:

(i) Pp ↔ ¬O¬p
(ii) Op ↔ ¬P¬p
(iii) Op ↔ F¬p
(iv) Fp ↔ O¬p

For example, Op signifies that one ought to do p, while Fp signifies that it is forbidden to do p. These notations can be extended to include conditionals. For X ∈ {O, P, F}, X(p | q) indicates that one should comply with the norm Xp if it is the case that q. If a norm is simply noted as Xp, we can assume X(p | ⊤), where ⊤ is the logical tautology.

In this thesis, we make use of defeasible reasoning and, more specifically, defeasible norms. In the literature, normative conflicts may be exemplified by situations where we have Op and P(¬p | q), which means that if q can be inferred from the knowledge base KB, then we have at the same time an obligation requiring that p holds and a permission allowing that ¬p holds. This can be interpreted in two different ways. If we follow the prima facie paradigm, the ideal-world resolution is that p holds, so there is no violation, while the suboptimal-world resolution is that p does not hold, so there is one violation. On the other hand, Sergot and Prakken [186] see a world where q is inferable as an exception to the norm Op. As such, not doing p will not violate the obligation Op. We can represent this type of situation with logic programming notation. Consequently, we can write (not q) → Op, where "not q" means that KB ⊭ q, and the whole formula means that one does not ought to fulfil Op if KB |= q.

In order to make this dissertation clearer, we introduce Definitions 3–6. Note that the terms used in these definitions may differ from the usual meaning given in the literature.

Definition 3 (Defeated Norm). A norm Op is defeated when an exception to it is created by another norm P(¬p | q) and q is inferable from the knowledge base. The norm can be rewritten as (not q) → Op. For a prohibition Fp, a defeat would be equivalent to having P(p | q), similarly rewritten as (not q) → Fp.

Definition 4 (Activated Norm). A (potentially defeasible) norm r → O(p | q) is said to be activated when q is inferable from the knowledge base.

Definition 5 (Compliance). A norm Op is not complied with if we have q → Op and KB ⊭ p, regardless of whether KB |= q or not.^2 Similarly, a norm Fp is not complied with if we have q → Fp and KB |= p, regardless of whether KB |= q or not.

^2 Note that this differs from the more standard definition of non-compliance as a violation.

Definition 6 (Violation). A norm Op is violated when q → Op, KB |= q, and KB ⊭ p. This is the same as non-compliance with a non-defeated norm. Alternatively, a norm Fp is violated if KB |= q and KB |= p.

Remark 2. In this dissertation, we will use the term "norm status", or "status of a norm". The status can take four different values. Let a norm be r → O(p | q). "Non-activated" (or "deactivated") corresponds to when the condition of the norm cannot be inferred from the knowledge base (e.g., KB ⊭ q). It takes priority over the other statuses. Furthermore, a norm can be said to be "activated", which corresponds to the case where it is not "deactivated". However, this does not provide any additional information about whether or not the norm is defeated. "Non-defeated" is when the norm sees its conditions fulfilled (e.g., KB |= q, r). Then "defeated" refers to when a norm is activated but its extra condition is not (e.g., KB |= q but KB ⊭ r).
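The statuses of Remark 2 and the violation check of Definition 6 can be operationalised as follows for norms written r → X(p | q) with X ∈ {O, F}. In this sketch, the knowledge base is modelled as a plain set of atoms, so "KB |= x" reduces to set membership (a real system would query an inference engine), and the example norm with its emergency exception is invented for illustration.

```python
# Norm statuses of Remark 2 and the violation check of Definition 6, for a
# norm written r -> X(p | q) with X in {O, F}. The knowledge base is a plain
# set of atoms, so "KB |= x" is just membership.
def status(norm, kb):
    q, r = norm["condition"], norm["not_defeated_if"]
    if q not in kb:
        return "non-activated"                 # takes priority over the rest
    return "non-defeated" if r in kb else "defeated"

def violated(norm, kb):
    if status(norm, kb) != "non-defeated":     # a defeated or non-activated
        return False                           # norm cannot be violated
    if norm["modality"] == "O":
        return norm["content"] not in kb       # obligation: KB must entail p
    return norm["content"] in kb               # prohibition: KB must not

# Invented example: F(speeding | on_road), defeated in case of an emergency.
speeding_ban = {"modality": "F", "content": "speeding",
                "condition": "on_road", "not_defeated_if": "no_emergency"}

print(status(speeding_ban, {"on_road", "no_emergency"}))                # non-defeated
print(violated(speeding_ban, {"on_road", "no_emergency", "speeding"}))  # True
print(violated(speeding_ban, {"on_road", "speeding"}))                  # False (defeated)
```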
2.1.3 Formal Argumentation

Formal argumentation consists of a compilation of techniques to model dialogues and enable reasoning through them. An interested reader may want to start with the first volume of the Handbook of Formal Argumentation [36].

These techniques revolve around the use of logical formalism. Formal argumentation can be divided into three main branches [210], namely argumentation as dialogue [130, 129, 26], argumentation as balancing [90, 13], and argumentation as inference [76, 132, 196]. This section will focus on introducing the latter, as it is the one used throughout the approaches that serve as a basis for this work. More specifically, we will focus on abstract argumentation as introduced by Dung in his seminal paper [76].

Abstract argumentation consists of a framework for modelling arguments and conflicts. Rather than focusing on the internal structure of arguments, it treats them as abstract entities and defines a binary "attack" relation between them. The central idea is to reason about which sets of arguments can coherently stand together. Such a set is called an extension. The intuition behind this concept, as Dung puts it, is that the way humans argue is based on a very simple principle summarised succinctly by an old saying: "The one who has the last word laughs best". More formally, an abstract argumentation framework consists of the following.

Definition 7 (Argumentation Framework). An argumentation framework is a tuple F = ⟨Args, R⟩, where Args is a set of elements called arguments and R ⊆ Args × Args is a relation over the arguments referred to as attack.

It can be represented as a directed graph. For example, Fig. 2.4 shows an argumentation graph corresponding to the argumentation framework F = ⟨Args, R⟩ where Args = {a, b, c, d} and R = {(b, a), (c, b), (d, a), (d, c)}.

Figure 2.4: Representation as a directed graph of an argumentation framework.

Given S ⊆ Args, we recall the notions of conflict-freeness and acceptability. A conflict-free set is one in which no argument in the set attacks another. Acceptability represents the constraint that an argument is only acceptable if all its attackers are themselves attacked by an argument in the set.

- S is a conflict-free set of arguments w.r.t. R if and only if ∄ a, b ∈ S s.t. a R b,
- For all a ∈ Args, a is acceptable w.r.t. S if and only if ∀ b ∈ Args, if b R a, then ∃ c ∈ S such that c R b.

We recall the standard definitions for extensions:

- S is an admissible extension if and only if S is conflict-free and all arguments a ∈ S are acceptable w.r.t. S.
- S is a complete extension if and only if S is admissible and contains all acceptable arguments w.r.t. S.
- S is a grounded extension if and only if S is a minimal complete extension with respect to strict set inclusion, i.e., ∄ S′ ⊆ Args s.t. S′ ⊂ S and S′ is a complete extension.

According to these definitions, the grounded extension that can be computed from the graph shown in Fig. 2.4 is Grd(F) = {b, d}. It is fairly easy to compute by hand, as we can quickly see that d is not attacked and therefore has to be in the grounded extension. Then, b is attacked only by c, and c is defeated by d (which we previously included in the extension). Consequently, d defends b, allowing the latter to be in the extension, as it is not attacked by any argument which is also in the extension. Such an argument, made acceptable because of the defence from another argument, is said to be "reinstantiated". For larger graphs, an algorithm has been proposed by Nofal et al. [149] to compute the grounded extension with polynomial-time complexity.

Remark 3. Although the use of the grounded extension is not incompatible with graphs having symmetric attacks, reflexive attacks, or cycles, such properties may lead to undesired behaviours. Consequently, when dealing with graphs that may present one or more of these properties, extensions using, for instance, the preferred semantics should be privileged.
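The same computation can be sketched as the least fixed point of Dung's characteristic function: starting from the empty set, repeatedly add every argument all of whose attackers are attacked by the current set. This naive version is already polynomial in the size of the graph, though not as optimised as the dedicated algorithm of Nofal et al. [149].

```python
# Grounded extension as the least fixed point of Dung's characteristic
# function: start from the empty set and keep adding every argument whose
# attackers are all attacked by the current set.
def grounded(args, attacks):
    attackers = {a: {b for (b, c) in attacks if c == a} for a in args}
    ext = set()
    while True:
        acceptable = {a for a in args
                      if all(any((c, b) in attacks for c in ext)
                             for b in attackers[a])}
        if acceptable == ext:
            return ext
        ext = acceptable

# The framework of Fig. 2.4: Args = {a, b, c, d} and
# R = {(b, a), (c, b), (d, a), (d, c)}.
args = {"a", "b", "c", "d"}
attacks = {("b", "a"), ("c", "b"), ("d", "a"), ("d", "c")}
print(sorted(grounded(args, attacks)))   # ['b', 'd'], i.e. Grd(F) = {b, d}
```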
Abstract argumentation has been extended in various ways. Some works integrated the handling of preferences or values over the arguments [33, 69, 20, 34]. Other works added another relation denoting a support from one argument to another [62, 151, 211]. While each of these variants has its own set of advantages, we believe the approach proposed by Dung shines by its simplicity (similarly to propositional logic among other logics), making it particularly well-suited for integration into practical systems.

2.2 Background

This thesis builds upon three key contributions from the literature. Section 2.2.1 discusses AJAR [9, 10], a framework that we previously developed and which integrates symbolic reasoning with reinforcement learning. Then, Section 2.2.2 presents Jiminy [111], a normative reasoning architecture designed to handle deontic concepts. Finally, Section 2.2.3 introduces a reinforcement learning architecture in which an agent learns to obey norms [143, 144, 145]. While this thesis adapts and integrates these three works into a unified approach, the following sections provide the necessary background to clarify the foundations upon which it builds.

2.2.1 The AJAR Framework

The AJAR (A Judging Agents-based RL-framework) framework [9, 10] consists of the supervision of a reinforcement learning agent during its learning phase by a set of judges. This framework was initially proposed for the field of machine ethics. As such, in addition to its main task, the agent has to satisfy (or maximise) an ensemble of moral values. Each moral value is handled by a single judging agent. A judging agent consists essentially of a directed graph where nodes are arguments and edges are considered as attacks.^3

^3 Arguments and attacks in the sense of Dung's formal argumentation, although we cannot talk about an argumentation graph at this point as it may have inconsistencies among its arguments.
Then, a mapping function indicates whether an argument supports the moral value (i.e., a pros argument), goes against it (i.e., a cons argument), or is neutral and instead serves to defeat some other arguments. For example, "Low gas emissions" supports the "Environment sustainability" moral value, while "High gas emissions" goes against it. Then, "Not often activated" may be neutral toward the "Environment sustainability" moral value but at the same time defeat the "High gas emissions" argument.

During the learning phase, after each of the agent's actions within the environment, a preprocessed version of the state is forwarded to the judges. This preprocessed state serves to determine which arguments are activated. Each judge then computes a subgraph containing only the activated arguments. Finally, it computes an extension. From the arguments within this extension, and depending on whether they are pros or cons, each judge calculates a numerical reward for its given moral value. An aggregation function then scalarises these rewards together with the reward from the environment (corresponding to the actual task of the agent, without any ethical consideration). This final aggregated reward is returned to the reinforcement learning agent so that it can update its Q-values.

More formally, inspired by the AFDM (Argumentation Framework for Decision-Making) proposed by Amgoud and Prade [21] (we essentially remove the features that are not relevant for our technical specificities while keeping the notion of pros and cons arguments), we define an AFJD (Argumentation Framework for Judging a Decision) as:

Definition 8. An Argumentation Framework for Judging a Decision (AFJD) is a tuple AF = ⟨Args, R, F_f, F_c⟩ where:
• Args is a non-empty set of arguments,
• R is a binary relation called the attack relation,
• F_f ∈ 2^Args is the set of arguments which indicates that the RL-agent's last decision is moral w.r.t. the moral value considered by the judging agent,
• F_c ∈ 2^Args is the set of arguments which indicates that the RL-agent's last decision is immoral w.r.t. the moral value considered by the judging agent.

For the sake of clarity, and to alleviate notations in the sequel, for a given AF = ⟨Args, R, F_f, F_c⟩, we note AF[Args] = Args, AF[R] = R, AF[F_f] = F_f, AF[F_c] = F_c. The set of all possible sub-AFJDs of AF, i.e., all AFJDs whose arguments are a subset of AF[Args], is denoted as:

P(AF) := {⟨Args′, R′, F′_f, F′_c⟩ : Args′ ⊆ AF[Args], R′ ⊆ Args′² ∩ AF[R], F′_f ⊆ Args′ ∩ AF[F_f], F′_c ⊆ Args′ ∩ AF[F_c]}

Then, we accordingly define the AJAR framework as follows (originally, AJAR was defined over a Decentralised Partially Observable MDP (DecPOMDP) and handled several learning agents; for the sake of clarity, we simplify the definition so that it uses a standard MDP and judges a single learning agent):
Definition 9. A Judging Agents-based RL-framework (AJAR) is a tuple F such that:

F = ⟨M_judges, {AF_j}_{j∈M_judges}, S, A, P, R, {ε_j}_{j∈M_judges}, {J_j}_{j∈M_judges}, g_agr⟩

where
• M_judges is a set of judging agents,
• ∀j ∈ M_judges, AF_j is an AFJD,
• ⟨S, A, P, R⟩ is an MDP (see Definition 1),
• ∀j ∈ M_judges, ε_j : S → P(AF_j) is a function that, from a current state s_t ∈ S, assigns the sub-AFJD that the judging agent j will use,
• ∀j ∈ M_judges, J_j : P(AF_j) → ℝ is the judgment function that returns, from an argumentation graph, the reward associated with the judgment,
• g_agr : ℝ^{|M_judges|+1} → ℝ is an aggregation function,
• for all states s_t ∈ S, with R_t = R(s_{t−1}, a_{t−1}, s_t) the reward obtained from the MDP for the given transition, the reward R returned by AJAR is s.t.:

R(s_t) = g_agr({R_t} ∪ {J_j(ε_j(s_t)) | j ∈ M_judges})

The judging process is illustrated, for a specific agent i, by Figs. 2.5a and 2.5b.

Figure 2.5: Representation of the AJAR framework. (a) Overview of AJAR: the learning agent i acts in the environment, the judging agents 1 to n receive (i, s_t), and their judgments are combined by the aggregation function. (b) Detailed judgment: judging agent j extracts the sub-AFJD AF′ from AF and computes the grounded extension Grd(AF′).

One way of defining the function J_j (but not the only one, since the choice of the function ultimately falls to the designer) is the following:

pros = |Grd(ε_j(s_t)) ∩ ε_j(s_t)[F_f]|
cons = |Grd(ε_j(s_t)) ∩ ε_j(s_t)[F_c]|

J_j(ε_j(s_t)) = pros / (pros + cons) if pros + cons ≠ 0, and 1/2 otherwise.
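As a concrete illustration of this particular choice of J_j, the following sketch reuses grounded_extension() from the previous example. The dictionary encoding of a sub-AFJD and the placeholder mean used for g_agr are assumptions for illustration, not the thesis implementation.

```python
def judgment(sub_afjd):
    """One possible J_j: the ratio of accepted pros arguments among
    the accepted pros and cons arguments, defaulting to 1/2 when
    neither kind survives the debate."""
    grd = grounded_extension(sub_afjd["args"], sub_afjd["attacks"])
    pros = len(grd & sub_afjd["F_f"])
    cons = len(grd & sub_afjd["F_c"])
    return pros / (pros + cons) if pros + cons else 0.5

def ajar_reward(env_reward, sub_afjds,
                aggregate=lambda xs: sum(xs) / len(xs)):
    """Aggregated reward returned to the learner: the environment
    reward R_t combined with one judgment per judge (a plain mean
    stands in for g_agr here)."""
    return aggregate([env_reward] + [judgment(af) for af in sub_afjds])
```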
2.2.2 The Jiminy Architecture

The Jiminy architecture [112, 111], whose name is inspired by Jiminy Cricket, is a moral advisor for an artificial agent. It aims at confronting the points of view of different stakeholders on a situation to determine, norm-wise, which action the agent has the right to perform. As this reasoning may need to be done in real time for an artificial agent, stakeholders cannot simply express their views through a messaging application or any other communication process. An example of a situation where a smart home agent is supervised by three stakeholders, namely the family, the manufacturer, and the legal system, is shown in Fig. 2.6.

Figure 2.6: Jiminy's smart home example. Reused from Liao et al. [112].

In Jiminy, stakeholders' normative systems are modelled via a symbolic representation. This representation is called a stakeholder model, or stakeholder avatar. Through the use of structured argumentation [132], these avatars are able to construct arguments from the symbolic observation of a state (the state here corresponding to the perceptions of the supervised agent). Then, the arguments of the various avatars are organised based on their conflicts. Finally, using abstract argumentation, a set of actions that adhere to the result of the debate is extracted. The supervised agent can then select an action to perform. This process is represented in Fig. 2.7. Note that the strategy followed by Jiminy is the following:

1. Jiminy considers how the arguments of the stakeholders relate to one another, which may already resolve the dilemma.
2. Jiminy combines the normative systems of the stakeholders such that the combined expertise of the stakeholders may resolve the dilemma.
3. Only if these two other methods have failed, Jiminy uses context-sensitive rules to decide which of the stakeholders takes preference over the others.

Figure 2.7: Representation of the Jiminy pipeline. Reused from Liao et al. [111].

What interests us the most in this architecture is the separation of the stakeholders, which makes the normative system that the agent ought to follow more flexible and modular. Furthermore, its use of argumentation may simplify the process of explaining the decision of the agent to a human, as argumentation can be considered a facilitator for understanding logic-based reasoning [78, 172] and conveys a notion of causality [107, 161].

2.2.3 Norm Guided Reinforcement Learning Agent

In his work, Neufeld introduced a Norm Guided Reinforcement Learning (NGRL) agent [143, 144, 145]. This framework extends classical model-free reinforcement learning so that the agent learns preset norms in addition to achieving the task for which it was designed.

In this framework, the RL-agent evolves within a labelled MDP (see Definition 2). When entering a state, the labels serve to trigger constitutive norms that generate institutional facts or activate regulative norms. Then, if one or more regulative norms clash with the labels, the system considers that violations occurred. (Although this work uses a labelled MDP, it could be flattened to a Normative MDP (NMDP) [77], as the norms in each state can be modelled by applying the set of constitutive rules.) This is handled by returning to the agent, in addition to the reward signal from the environment, a value corresponding to the number of violations committed when performing the latest transition.

The NGRL agent is equipped with an additional Q-function Q_V. It functions similarly to the Q-function tracking the expected reward (renamed Q_R), except that it tracks the expected violations incurred when performing a certain action in a certain state. After learning, the agent is then aware not only of the expected reward it will get when performing a given action, but also of the violations it will commit. Using a preference order and a thresholded lexicographic selection [200] (see Algorithm 1), it selects the action that is optimal with respect to the task among those that do not exceed a maximal value for the expected violations.

Algorithm 1 Thresholded lexicographic selection
Require: s (current state), L (a list of Q-functions, ordered by priority), C (a vector of thresholds for the Q-functions in L)
Ensure: A* (the resulting set containing the optimal actions)
1: A* := A(s)                          ▷ Gets the set of the possible actions in state s
2: for Q_i ∈ L do
3:     T := {x ∈ A* | Q_i(s, x) ≥ C_i} ▷ Filters out actions below the threshold
4:     if T = ∅ then
5:         A* := argmax_{x ∈ A*} Q_i(s, x) ▷ Filters out suboptimal actions
6:     else
7:         A* := T
8:     end if
9: end for
10: return A*

In his work, Neufeld assumes the ensemble of constitutive and regulative norms to be given. The same goes for the thresholds used for the lexicographic selection.
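A short Python rendering of Algorithm 1 may make the selection rule concrete. Function and variable names are illustrative; `q_functions` is ordered by priority and `thresholds` holds one threshold per Q-function, as in the pseudocode.

```python
def thresholded_lexicographic(actions, q_functions, thresholds, state):
    """Algorithm 1 as a sketch: keep the actions clearing each
    threshold in priority order; if none does, fall back to the
    actions maximising the current Q-function."""
    candidates = list(actions)
    for q, c in zip(q_functions, thresholds):
        above = [a for a in candidates if q(state, a) >= c]
        if above:
            candidates = above
        else:
            # No action clears the threshold: keep the maximisers.
            best = max(q(state, a) for a in candidates)
            candidates = [a for a in candidates if q(state, a) == best]
    return candidates
```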
While this may already seem like a significant amount of expert knowledge (in particular since the structure of these norms may be complex to set up, and hence not scalable), it is still significantly better than the model-based approaches relying on Linear Temporal Logic for safe RL [146, 125], which require either knowing the transitions or at least learning them.

Chapter 3

π-NoCCHIO: A Context-Aware Normative Architecture

In this chapter, we present the π-NoCCHIO architecture [8]. This architecture consists of a reinforcement learning agent supervised by a normative advisor. The normative advisor itself is composed of multiple stakeholders' reasoning models, called avatars. We will see how such an architecture can learn norms represented in a symbolic manner and follow them within a stochastic environment. We will also see how these norms may be rendered irrelevant in some specific contexts, and how the architecture takes this into consideration.

3.1 Introduction

Normative reinforcement learning is a timely topic, as many systems deployed in the real world now use Artificial Intelligence (AI). As these systems evolve among us, we need to ensure that they follow settled norms and conventions in order to avoid causing harm to the people they interact with. In his work, Neufeld proposed the Norm Guided Reinforcement Learning (NGRL) architecture [143, 144, 145]. This architecture puts a strong emphasis on the respect of the norms (and, in an alternate variant, on maintaining the frequency of violations below a certain threshold) by approximating not only the expected accumulated reward of performing a certain action in a certain state, but also the expected accumulated violations.

By its simplicity and modularity, this work is very suitable for extensions. In fact, its functioning is very close to a standard Q-learning algorithm. Furthermore, most of its components can easily be replaced. In the architecture proposed in this dissertation, we make the choice to substitute the normative supervisor component with the Jiminy architecture [112, 111], which has been designed to account not only for the context, but also for heterogeneous viewpoints.

Consider an agent evolving within an environment presenting normative considerations, or a normative dilemma. In each state, a set of actions is available to the agent. There can be several stakeholders (e.g., the user, law enforcement, or the manufacturer) who are interested in or affected by the decisions of the agent, and each may hold different ethical considerations. The role of Jiminy is to confront the points of view and arguments of the stakeholders to estimate which action(s) the agent has the right to perform. As the decision about the next action sometimes has to be taken in less than a second, it is not realistic to communicate in real time with the stakeholders, who are humans. For this reason, Jiminy represents the stakeholders through avatars, which are essentially knowledge bases of inference rules that serve to model the mental state of the stakeholders. When entering a state, the agent forwards its observation to Jiminy. A preprocessing step can convert raw values into meaningful symbols (referred to as brute facts) that can be used by the inference rules. Conflicts among the arguments of the avatars are represented using structured argumentation.
Then, following a three-fold process, Jiminy tries to resolve the dilemma. First, it analyses the relations between the arguments by constructing a structured argumentation graph and then converting it to an abstract argumentation framework. Computing an extension gives the legitimate actions that the agent can take; this may already solve the dilemma. Second, if the first step was not sufficient, Jiminy combines the norms and expertise of the stakeholders to resolve the dilemma. Third, and only if the two previous methods failed, Jiminy uses context-sensitive rules to determine which of the stakeholders should take preference over the others.

This architecture has several benefits. The first is its transparency. Indeed, structured argumentation and symbolic representations make it possible to see how each stakeholder constructs its arguments, and why a given argument attacks or conflicts with another. The second main benefit is that it does not require a unified and consistent ethical model among the stakeholders, as conflicts are resolved through the argumentation process. This is a crucial point, as requiring a unified view would boil down to having someone in charge of preprocessing the information of each stakeholder, essentially moving the scalability bottleneck to an upstream task. An alternative would be to let each stakeholder evaluate the situation separately and take a majority vote, or apply a preference order, but this would remove the possibility for a stakeholder holding a very relevant argument to share this information with the other stakeholders, even though they might have changed their stance because of it.

However, Jiminy also suffers from some drawbacks. First, the fact that it is transparent raises concerns about the privacy and agency of the stakeholders. For example, if one of the stakeholders is the manufacturer, it may be undesirable for it to release sensitive information about, for instance, the inner workings of the product, or agreements passed with other parties that alter its normative system. It is also a loss of agency, as the stakeholders are deprived of the strategic aspect that formal argumentation may have through dialogue games (e.g., the Fatio protocol [129, 130], discussion games [58]), since they are forced to make the entirety of their knowledge public. Consequently, information that was initially hidden from the other stakeholders can now be turned against them. This may incentivise stakeholders to provide a very limited amount of information for the modelling of their avatar. Second, the current working of the Jiminy architecture requires quite an extensive amount of knowledge, through the provision of the inference rules required to instantiate the attacks among the arguments, as well as computational resources, as it not only requires computing an extension from an abstract argumentation graph, but also needs to run an inference model to build the arguments from the raw inputs of the state. Finally, Jiminy acts proactively, meaning that it does not judge whether the action of an agent was good afterwards, but instead plans ahead what is legitimate. This is very unrealistic, in particular within a partially observable stochastic environment containing a large number of elements or factors in its input.
In order to combine the reasoning of Jiminy with the NGRL agent, we use a modified version of the AJAR framework [9, 10]. This framework consists of a reinforcement learning agent whose reward is not directly obtained from the environment (i.e., the MDP) but rather from a set of judging agents. Each of these judging agents has its decision model for a given moral value, represented through an argumentation graph. At each time step, it evaluates the consequences of the actions of the judged agent and accordingly generates a reward. All the rewards of the judges are then aggregated and returned to the agent.

This framework has several advantages. First, it is very modular, as each judge can be added or removed without any other change. Second, it brings additional intelligibility compared to standard reinforcement learning. Not only is the model of each judge built by experts, using symbols that make it readable and understandable by non-experts, but it is also possible to look at the graph structure for a given state to determine why the learning agent considered that a given action would improve its reward. However, this framework also suffers from some disadvantages. For instance, the aggregation method is somewhat arbitrary. Furthermore, since there is a single judge per moral value, only one ethical principle is adopted, and views are not confronted. Last, AJAR was originally designed for Machine Ethics. This field shares similarities with the learning of norms, but also presents major differences. For example, ethics often balances compliance with moral values against the sacrifice the agent makes with respect to its main task. In normative systems, adherence to norms is usually not an option but an obligation. The agent is then not free to decide whether a reward is important enough for an action that violates a norm to be considered legitimate. Another difference is that in ethics, the agent must recognise situations that are considered unethical, whereas in normative systems, the agent simply follows the norms without reevaluating them.

The next sections propose an approach to normative reinforcement learning building upon the three aforementioned architectures. This approach, which we call π-NoCCHIO [8], follows the principles of the NGRL agent as proposed by Neufeld, but replaces its normative guide with a more sophisticated version derived from the Jiminy architecture. The two are then connected through the AJAR framework.

3.2 Model

3.2.1 Context-Aware Normative Reasoning

In this section, we introduce an architecture of normative supervisor inspired by the Jiminy architecture [112, 111]. This component will be referred to as the judge, as its role will be similar: it will have to evaluate the consequences of the actions taken by an agent and determine which norm is violated by the outcome of these actions. Unlike in Jiminy, normative conflicts are no longer handled within this component but are instead managed through the reinforcement learning component.

If a norm is activated, that is, its condition is satisfied by the relevant considerations (i.e., a set of propositional atoms extracted from the preprocessed state), the system checks whether the agent complies with the norm or not. If not, then the argumentation graph built from the avatars is used to compute whether the norm is defeated, i.e., whether the current situation creates an exception to the norm.
If the norm is not defeated, then, because the agent does not comply with it, the agent is committing a violation with respect to this norm.

Another difference with Jiminy is that here the goal is not to decide which action the agent should take or should have taken, but rather to decide whether the reached state adheres to the norms. As such, norms are no longer used as arguments that attack or support actions. Instead, they are the elements that are discussed within the argumentation process. Each norm is discussed separately, similarly to the AJAR architecture [9, 10], where moral values (since it is originally a work on Machine Ethics) are debated individually.

We give a more formal definition of the judge in the immediately following Definitions 10–11.

Definition 10 (Stakeholder Model). Let a finite set of arguments Args and a set of regulative norms N be given by the environment. A stakeholder model (or avatar) is a tuple M_i = ⟨L, C, R, M, getArgs⟩ where:

• L is a language defined by a set of well-formed formulae (wff). One requirement is that Args ⊆ L.
• C is a set of constitutive norms C_i of the form C_i(a, b), with a, b ∈ L, where b is either an argument (in the sense of abstract argumentation) or an institutional fact that can serve to trigger other constitutive norms. Note that, in addition to other possible constitutive norms, this set always contains the norms C_r(q, r), with r ∈ N a regulative norm, which serve to denote the fact that each regulative norm discussed by the judges always forms one argument. Constitutive norms are useful for constructing abstractions from facts so that the application of a norm can be generalised [195]. (For example, to make the norm F(steal(item)) usable, it is possible to abstract the objects "milk" and "bread" as "item" by using a constitutive norm.)
• R is an attack relation s.t. R ⊆ Args × Args.
• M : N → 2^C is a mapping function from a regulative norm r ∈ N to a subset of the constitutive norms of the stakeholder model. We then define, respectively:
  – Π^r_C = {C_i | ∃i ∈ I. C_i(a, b) ∈ M(r)}, the subset of constitutive norms considered relevant by the agent when arguing about r ∈ N, with I the set of constitutive norm indices,
  – Π^r_Args = {b | ∃x. C_i(x, b) ∈ M(r)}, the subset of arguments considered relevant by the agent when arguing about r ∈ N,
  – Π^r_R = {x R y ∈ R | x ∈ Π^r_Args ∨ y ∈ Π^r_Args}, the subset of defeats considered relevant by the agent when arguing about r ∈ N. Note that in the latter case, a R b ∈ Π^r_R → a, b ∈ M(r).
• getArgs(x) is the function that returns the set of all the arguments that can be constructed from a given set of observations x ∈ L. It is defined by the closure of the constitutive norms as follows (a sketch of this closure is given below):
  – Cl(k) = ⋃_n Cl^n(k),
  – Cl^0(k) = k and Cl^{n+1}(k) = {b | ∃C_i(a, b). a ∈ Cl^n(k)}.
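To make the closure concrete, the following Python sketch renders getArgs as forward chaining over the constitutive norms. It assumes atomic conditions (a single derived fact `a` triggers `b`); conjunctive conditions would require a richer test. The representation of norms as (a, b) pairs is illustrative.

```python
def get_args(observations, constitutive_norms, arguments):
    """Forward-chain the constitutive norms C_i(a, b) over the observed
    facts until a fixpoint (the closure Cl of Definition 10): whenever
    `a` has been derived, derive `b` as well. Returns the derived atoms
    that are arguments."""
    derived = set(observations)
    changed = True
    while changed:
        changed = False
        for a, b in constitutive_norms:
            if a in derived and b not in derived:
                derived.add(b)
                changed = True
    return derived & set(arguments)
```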
Definition 11 (Advisor). An advisor AD is a tuple ⟨M, Facts, L, N, ≻_M, ε, j⟩ where:

• M is a set of stakeholder models, referred to as avatars,
• Facts is a finite set of facts, i.e., propositional atoms used to describe the world,
• L is a logical language,
• N is a set of regulative norms of the form X(p | q) with X ∈ {O, P, F} and p, q ∈ L,
• ≻_M is a strict preference order over M,
• ε : states → 2^Facts is a function that takes a state (usually a raw input of the environment) and returns the subset of Facts corresponding to the propositional atoms that can be constructed from the input state,
• j : 2^Facts × N × 2^M → {−1, 0} is a function that takes as input a set of facts obtained from ε(s), where s is the current observation of the environment, a norm to discuss, and a set of stakeholders, and that returns whether or not the discussed norm has been violated. This function can be described by a four-step pipeline (see the sketch after Remark 4 below):
  1. For a state s, a regulative norm r ∈ N, and each stakeholder m ∈ M, compute the set of arguments relevant to the given norm that can be constructed from the facts: S^0_m = Π^{m,r}_Args ∩ getArgs(ε(s)), where Π^{m,r} denotes the Π sets of avatar m for norm r.
  2. Reunify the sets of arguments returned by the stakeholders: S^Args_1 = ⋃_{m∈M} S^0_m. Do the same for their attack relations: S^R_1 = ⋃_{m∈M} Π^{m,r}_R.
  3. With the union of the arguments and the attacks, construct an argumentation framework ⟨S^Args_1, S^R_1⟩ and compute an extension Ext. We recommend the grounded extension for its polynomial-time complexity and its uniqueness property. (Any extension can work in principle, as long as there is a way to decide whether the argument represented by the discussed norm is part of it. For example, with the preferred semantics, one may decide to consider the norm as accepted within the extension if it is sceptically accepted. Stable extensions also exhibit interesting properties, but their existence is not guaranteed.) If the argument corresponding to the norm is considered undecided by the extension semantics, the preference order ≻_M is applied (see Example 1). If even after this the norm argument is still undecided, then it is considered as not part of the extension, i.e., sceptically rejected.
  4. Return −1 if there is a non-compliance (for example, there is O(p | q) with p ∉ ε(s) and q ∈ ε(s)) and if the discussed norm is part of the previously computed extension, i.e., r ∈ Ext. Return 0 otherwise, which means that no violation occurred.

Example 1 (Applying the preference order). Let two arguments a, b ∈ Args have a symmetric attack (i.e., a R b and b R a). For each of these arguments, we compute the set of avatars that find it relevant, i.e., for an argument x ∈ Args, the set of stakeholders S_x = {m ∈ M | x ∈ Π^{m,r}_Args}. We thereby obtain S_a and S_b. If there exists s ∈ S_a s.t. for all p ∈ S_b, s ≻_M p, then we remove b R a from the union of the attack relations S^R_1 obtained at step 2 of the function j. Conversely, if there exists s ∈ S_b s.t. for all p ∈ S_a, s ≻_M p, then we remove a R b from S^R_1.

Remark 4. While the architecture allows one to chain constitutive rules over and over, we believe that this should be avoided, as it would reintroduce complexity in designing arguments and their structure, which this architecture aims to reduce. Instead, one should favour the use of the ε function, such that enough facts are generated from it for the arguments to be directly triggered through the application of a single constitutive rule. In short, the designer should generate as many facts as possible without trying to anticipate their purpose or how they could be organised. Meanwhile, the avatar designer's task is to determine which arguments are relevant, under which conditions, and which arguments they defeat or are defeated by.
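The following sketch assembles the four steps of j, reusing grounded_extension() and get_args() from the earlier sketches and omitting the preference-order tie-breaking of Example 1. The avatar encoding (per-norm "args", "attacks", and "rules" entries) and the `norm` interface (an argument name and a compliance test) are assumptions for illustration only.

```python
def judge_norm(facts, norm, avatars):
    """Sketch of the four-step pipeline of the function j."""
    # Step 1: per-avatar arguments constructible from the facts.
    relevant = set()
    for m in avatars:
        relevant |= m["args"][norm.name] & get_args(
            facts, m["rules"][norm.name], m["args"][norm.name])
    # Step 2: union of the attack relations, restricted to step 1.
    attacks = set()
    for m in avatars:
        attacks |= {(x, y) for (x, y) in m["attacks"][norm.name]
                    if x in relevant and y in relevant}
    # Step 3: grounded extension of the unified framework.
    ext = grounded_extension(relevant, attacks)
    # Step 4: -1 iff the agent is non-compliant AND the norm argument
    # survives the debate (i.e., the norm is not defeated).
    return -1 if (not norm.complied(facts) and norm.argument in ext) else 0
```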
We made the choice to use the language prop, which is a propositional language. This choice is motivated by the fact that it is decidable, less expensive in terms of computation than most other logics, and that its expressiveness is sufficient for the needs of the proposed architecture (assuming we stick to the advice given in Remark 4). Nevertheless, we extend it with the unary connective not, meaning that, for a given knowledge base KB, not φ holds whenever the truth value of φ cannot be established from KB. The language is defined as the set of well-formed formulae (wff) given, for any p ∈ Facts, by the following BNF grammar:

φ := ⊥ | p | ¬φ | φ ∨ ψ | not φ

together with the following notational shortcuts:
• ⊤ := ¬⊥,
• φ ∧ ψ := ¬(¬φ ∨ ¬ψ),
• φ → ψ := ¬φ ∨ ψ.

Let the truth domain be {⊤, ⊥, undef}. An interpretation model I over a valuation V : Facts → {⊤, ⊥, undef} is given by the function I : prop → {⊤, ⊥, undef} s.t. ∀φ ∈ prop, I ⊨ φ iff I(φ) = ⊤, where:

1. ∀p ∈ Facts, I(p) = V(p),
2. I(¬φ) = ⊤ if I(φ) = ⊥; ⊥ if I(φ) = ⊤; undef if I(φ) = undef,
3. I(φ ∨ ψ) = ⊤ if I(φ) = ⊤ or I(ψ) = ⊤; ⊥ if I(φ) = ⊥ and I(ψ) = ⊥; undef otherwise,
4. I(not φ) = ⊤ if I(φ) = undef; ⊥ otherwise.
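A small recursive evaluator makes the three-valued semantics concrete. The tuple encoding of formulae and the default of undef for unassigned atoms are illustrative choices, not part of the formal definition above.

```python
TOP, BOT, UNDEF = "T", "F", "undef"

def evaluate(formula, valuation):
    """Evaluate a formula of the extended propositional language.
    Formulas are atoms (strings) or nested tuples: ("neg", f),
    ("or", f, g), ("not", f). Atoms absent from `valuation` are
    undef."""
    if isinstance(formula, str):
        return valuation.get(formula, UNDEF)
    op = formula[0]
    if op == "neg":
        v = evaluate(formula[1], valuation)
        return {TOP: BOT, BOT: TOP, UNDEF: UNDEF}[v]
    if op == "or":
        left, right = (evaluate(f, valuation) for f in formula[1:])
        if TOP in (left, right):
            return TOP
        if left == BOT and right == BOT:
            return BOT
        return UNDEF
    if op == "not":  # 'not' flags undefinedness: true iff f is undef
        return TOP if evaluate(formula[1], valuation) == UNDEF else BOT
    raise ValueError(f"unknown connective: {op}")

# not(speeding) is true when no value is assigned to 'speeding':
print(evaluate(("not", "speeding"), {}))                 # T
print(evaluate(("or", "p", ("neg", "p")), {"p": TOP}))   # T
```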
The definitions introduced above allow us to compute, for a given norm, a given set of observations (which potentially serve to build higher-level facts through the constitutive norms [17]), and multiple avatars, whether or not the agent has reached a state where it commits a violation, i.e., does not comply with a non-defeated norm. In this model, only the consequences of an action are judged. (The term "action" is used in a very broad sense here, as for an artificial agent "not doing anything" can still be considered an action.) Consequently, the avatars do not need to account for probabilities through arguments such as "There was a high probability not to harm anyone". This simplifies the task of designing the arguments and the attack relation of the stakeholders.

To better illustrate the definitions, an example of a graph created from avatars debating the application of the norm F(stop | on_road) is shown in Fig. 3.1. In this example, there are two stakeholders, namely a taxi company, with the red arguments, and law enforcement, with the blue arguments. We could add more, such as the customer or the local administration of the city, but for the sake of this example we limit ourselves to the aforementioned ones. The graph can be read as follows. First, the norm F(stop | on_road) is activated only if the agent is on the road. If this is the case, then there is an exception to this norm if the agent is a taxi, as we may wish to allow taxis to make short stops on the road to let customers in and out. This argument is supported by the stakeholder model of the taxi company. This same stakeholder defeats its own exception in the event that the agent is not currently in service.

Figure 3.1: Example of graph generated by two avatars.

On the other hand, the law enforcement model removes this exception made for taxis to stop on the road if there is a parking spot nearby, or if the road in question is a high-speed road.

3.2.2 A Normative Supervisor for a Normative Agent

In this section, we detail how we use the judge component introduced in Section 3.2.1 to generate a reward signal that trains a reinforcement learning agent to follow norms. The resulting architecture is called π-NoCCHIO (π: Reinforcement Learning, Normative CCHIO) [8].

Fig. 3.2 is a diagram of a standard RL training loop. In comparison, a diagram of π-NoCCHIO is shown in Fig. 3.3. Similarly to a reinforcement learning architecture, it can be divided into two main blocks. One is the environment, represented by an MDP (see Definition 1). The second block is π-NoCCHIO. Inside this component, we can identify further subdivisions. Specifically, it consists of two subcomponents: the judge and the RL agent.

Figure 3.2: Diagram of a standard reinforcement learning training loop.

The first subcomponent is inspired by the AJAR framework [9, 10], in which judging agents assess whether an RL agent acted ethically with respect to predefined moral values by altering the reward obtained from the environment; it is repurposed here for normative RL. The judge receives the observations ("state") for step s_{t+1} resulting from the action(s) performed by the agent in step s_t. Then, two signals exit the judge component and are transmitted to the agent.

The first is the raw observation of the environment that is sent to the agent for the current step s_{t+1}. While it can remain unchanged, it can also be manually modified. The person making such modifications will be called the "operator". This opens up the possibility of indicating the norms that the agent will have to comply with in the next state. Indeed, by appending to the state, for each norm, one label among {on, off, none} (more formally, for a state s and a set of norms N, we have s := s ∪ {on, off, none}^{|N|}), it is possible to communicate to the agent whether a given norm will be considered activated and non-defeated in the next state no matter what (on), or will be defeated (off), or left to the decision of the judges (none). This makes it possible to manipulate the behaviour of the agent after the learning phase; the agent then adapts its behaviour accordingly. During the learning phase, manually overriding a norm (this is done randomly, to simulate a human wishing to take control), which aggregates the extra information to the state s_t, takes priority over the avatars. Thus, instead of returning to the agent, for each norm, the signal j(ε(s_{t+1})), the value set in the previous state s_t is used.

Remark 5. Allowing for the manual editing of the status of the norms has several advantages, but also some drawbacks. On the one hand, it gives more control to the final user, as it is possible to provide extra guidance to the agent whenever required, since a human user may better understand the situation or wish to deviate from the avatars' guidance. On the other hand, it potentially makes the system more vulnerable to cyber attacks, as one could potentially breach it to modify the status of the norms.
However, this can be partially solved by splitting the norms into two sets: one set in which the norms can be manually edited (containing the least sensitive norms), and another for which it is not possible to override the system. Since the agent's state would not include these protected norms, their defeat status could not be altered.

The second signal indicates to the agent whether or not it committed a violation by performing a certain action at time step s_t. This signal is similar to the reward signal in a standard MDP.

Remark 6. Here, we present a simple version of π-NoCCHIO where all the norms are considered equal in terms of importance. The agent will then learn to minimise the number of violations committed. (Note that modifying the norms themselves, or adding new norms, is not feasible, as the agent would not have been trained with them and would then not be able to understand what it ought to do according to these new norms.) However, this violation signal can be channelised so that the agent knows exactly which norms were violated. This can then be used, in combination with a preference order over the norms, to favour the respect of some over the others, especially in the event of a normative conflict. On the other hand, if one prefers to use weights for the norms rather than a preference order, it is possible to keep the first approach. In the event of a violation, the function j would not return −1 but instead −w_n, the negative of the weight associated with the norm n. Doing this would also account for accumulated violations of norms with low weights ending up worse than the violation of a single norm with a high weight.

A diagram of the judge component can be seen in Fig. 3.4. It features n + 1 avatars (M_0, ..., M_n) and m + 1 regulative norms (r_0, ..., r_m). The block ∪ represents the aggregation of the (potential) edits of the operator with the state s_{t+1}. The block Σ represents the summing of the judgments j(ε(s_{t+1}), r_i, {M_0, ..., M_n}) for each norm (note that, as mentioned earlier, this step can be skipped if one wants to channelise the norms into individual signals).

Figure 3.4: Diagram of the judge component.

The second subcomponent of this architecture is the RL agent, inspired by the one proposed by Neufeld [143, 144, 145]. It receives the signal corresponding to the reward obtained from the environment, as well as the extended state and the violation signal from the judge component. It differs slightly from a standard reinforcement learning agent in that it does not possess a single Q-function, but two of them, namely Q_R and Q_V. (In finite MDPs, whether due to terminal states or a fixed iteration limit, it is advisable to set the discount factor γ = 1 when updating Q_V. This choice ensures that future, possibly distant but inevitable violations are not artificially devalued, allowing the agent to correctly account for unavoidable consequences in its decision-making.)

Remark 7. While it would be possible to treat violations as a penalty over the reward signal, or any other kind of trade-off [194, 108, 157], this should be avoided. The main reason is that, in the case of an unbounded reward, there is no guarantee that the agent will not find an interest in violating a norm to reach a very high reward. (This conclusion is similar to that of Blaise Pascal's wager [159], in which he observes that if God exists and one believes, one gains infinite happiness (heaven), whereas if God exists and one does not believe, one loses infinitely (damnation). If God does not exist, the gains or losses of believing are finite. Therefore, it is rational to believe in God.) On the other hand, one may be tempted to multiply the reward signal by 0 in the event of a violation, but this raises two other problems: the increase of the reward if it was a negative value (since it passes from some −n, n ∈ ℕ, to 0), and the impossibility of making two or more violations count worse than a single one (since both n × 0 and n × 0 × 0 are equal to 0). If one tries to address this issue by multiplying by a number within ]0; 1[ instead, the first issue reappears.
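A minimal tabular sketch of the twin update may clarify how the two Q-functions coexist. It applies the standard Q-learning rule to both signals, with γ = 1 for Q_V as advised above; the class layout, hyperparameters, and greedy bootstrapping of Q_V are illustrative simplifications, not the thesis implementation.

```python
from collections import defaultdict

class NormativeQLearner:
    """Tabular learner with two Q-functions: Q_R for the task reward
    and Q_V for the (non-positive) violation signal."""

    def __init__(self, alpha=0.05, gamma_r=0.99, gamma_v=1.0):
        self.q_r = defaultdict(float)
        self.q_v = defaultdict(float)
        self.alpha, self.gamma_r, self.gamma_v = alpha, gamma_r, gamma_v

    def update(self, s, a, reward, violation, s_next, actions_next):
        # Standard Q-learning target for the task reward...
        best_r = max((self.q_r[(s_next, b)] for b in actions_next),
                     default=0.0)
        self.q_r[(s, a)] += self.alpha * (
            reward + self.gamma_r * best_r - self.q_r[(s, a)])
        # ...and the same rule for the violation signal, with
        # gamma_v = 1 so distant, inevitable violations keep full weight.
        best_v = max((self.q_v[(s_next, b)] for b in actions_next),
                     default=0.0)
        self.q_v[(s, a)] += self.alpha * (
            violation + self.gamma_v * best_v - self.q_v[(s, a)])
```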
The first Q-function tracks the expected reward, while the second tracks the expected amount of violations. To select its optimal action, the agent uses a preference order, defined as P := {V ≻ R}, combined with a lexicographic selection, detailed in Algorithm 2. The main idea is that, considering a set of doable actions A(s) in state s, the agent first selects the subset of it that maximises the expected value of Q_V (as committing a violation returns the signal −1, the value 0 corresponds to no violation). This corresponds to the set of actions that limit as much as possible the commission of violations. (This does not fully prevent violations from occurring: in a state where all outcomes lead to at least one violation, the agent will still choose an action.) Then, from this subset, it takes the action that maximises the expected value of Q_R, which corresponds to the optimal action in relation to the given task among the actions that minimise violations.

Algorithm 2 Lexicographic selection
Require: s (current state), L (a list of Q-functions, ordered according to a preference order)
Ensure: A* (the resulting set containing the optimal actions)
1: A* := A(s)                        ▷ Gets the set of the possible actions in state s
2: for Q_i ∈ L do
3:     A* := argmax_{x ∈ A*} Q_i(s, x) ▷ Filters out suboptimal actions
4: end for
5: return A*

3.2.3 Running Example

Once the learning phase is over, it is no longer required to access the avatars. Consequently, they can be removed from the deployed system so that the private information of each stakeholder is kept secret. In the event of the agent causing an accident or harm, it is possible to reconnect the avatars so that they can compute the status of the norms for the state corresponding to the incident, providing hints to the investigator about what could have led the agent to perform such an action. (It is also possible to do this for a simple malfunction, but we believe that, for the sake of privacy, just as law enforcement needs a warrant to search someone's house, a warrant would likewise be required to access the avatars.)

To better understand how the π-NoCCHIO architecture works, we hereby detail a running example illustrating, first, the learning process and, second, the choice of an optimal action in a given state.
In this running example, we will consider an autonomous taxi, as well as two stakeholders, which are the taxi company and law enforcement. In Section 3.2.3, we first show how the process of judging an action of the agent goes; we then show how the agent computes the optimal decision once the learning phase has been performed.

Judging an Action

Let the agent be an autonomous driving taxi. It features two avatars, M = {M_taxi, M_law}, corresponding to the taxi company and law enforcement. There are two norms within the environment: the respect of the speed limit (i.e., F(speeding)) and the parking regulations that prohibit stopping on the road (i.e., F(stop | on_road)).

Consider the following constitutive rules for M_taxi:
1. C_1(is_taxi, The vehicle is a taxi) - the agent is a taxi
2. C_2(not_in_service, Not in service) - the agent is currently not in service
3. C_3(expected_greater_than_target_time, Customer is late) - the destination cannot be reached before the target time is exceeded
4. C_4(no_traffic, No traffic) - there are very few cars on the road
5. C_F(speeding)(⊤, F(speeding)) - in any context, it is forbidden to exceed the speed limit
6. C_F(stop|on_road)(on_road, F(stop | on_road)) - when the agent is on the road, it is forbidden for it to stop

and for M_law:
1. C_1(closest_parking<=40m ∧ nb_free_spots>=10, Parking spot near) - there is a parking area with free spots nearby
2. C_2(in_city, No exception in cities) - in a city, no exception can be applied
3. C_3(overtaking, Overtaking) - the agent is performing an overtaking manoeuvre
4. C_4(speed_excess>=30, 30+ kph Speeding) - the agent exceeds the speed limit by more than 30 kph
5. C_5(on_highway, High speed road) - highways are considered high-speed roads
6. C_F(speeding)(⊤, F(speeding)) - in any context, it is forbidden to exceed the speed limit
7. C_F(stop|on_road)(on_road, F(stop | on_road)) - when the agent is on the road, it is forbidden for it to stop

Now, let s be a state where the autonomous taxi, i.e., the learning agent, is stopped on the road, within a city, next to an almost empty parking lot. A view of what the state s may look like for this running example is given in the "State" column of Table 3.1.

First, π-NoCCHIO computes ε(s), which provides a set of (true) atomic propositions, derived from the state s, that is a subset of 2^Facts. There are several ways of doing this. The most natural and easiest is to associate Boolean formulae with atomic propositions, for instance something similar to "IF agent.speed >= environment.speed_limit THEN facts.append(speeding)". The second method requires the use of functions to generate the propositions dynamically. For example, we may create a function nbCarNearby(s) that returns a set of propositions such as {car_nearby=2, car_nearby≤5, ...}. The last method consists of using another AI system, such as a classifier, a predictor, an oracle, or a Large Language Model (LLM), to extract the propositions from an unstructured state such as an image, a sound, a text, or a map. An example of extracted facts is given in the column "Facts" of Table 3.1. However, this last method is more expensive in resources and more complex to set up.
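A sketch of the first, Boolean-test method of fact extraction for this running example follows. The state keys mirror the "State" column of Table 3.1, with the speed limit assumed to be part of the observation and times given in fractional hours for simplicity; all of these choices are illustrative.

```python
def epsilon(state):
    """Fact-extraction function ε: each Boolean test over the raw
    state emits an atomic proposition."""
    facts = set()
    if state["speed"] == 0:
        facts.add("stop")
    if state["speed"] > state["speed_limit"]:
        facts.add("speeding")
    if state["area_type"] == "urban":
        facts.add("in_city")
    if state["terrain"] == "road":
        facts.add("on_road")
    if state["vehicle_type"] == "taxi":
        facts.add("is_taxi")
    if state["begin_service"] <= state["hour"] <= state["end_service"]:
        facts.add("in_service")
    if state["nb_car_nearby"] <= 2:
        facts.add("no_traffic")
    if state["closest_parking_m"] <= 40:
        facts.add("closest_parking<=40m")
    if state["nb_free_spots"] >= 10:
        facts.add("nb_free_spots>=10")
    return facts

# The state of Table 3.1:
s = {"speed": 0, "speed_limit": 50, "area_type": "urban",
     "terrain": "road", "vehicle_type": "taxi", "begin_service": 8.0,
     "end_service": 19.0, "hour": 16.38, "nb_car_nearby": 1,
     "closest_parking_m": 35, "nb_free_spots": 12}
print(sorted(epsilon(s)))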
Then, the constitutive norms are applied to the set of facts to generate more facts and, eventually, the arguments. The column "Arguments" of Table 3.1 shows the arguments that can be built from the facts previously derived via the constitutive norms of the avatars. Note that a norm argument is built if its condition is within the facts; since F(speeding) has ⊤ as its condition, it is always part of the generated arguments.

Table 3.1: Example of a state (s), with the facts (ε(s)) and arguments (getArgs(ε(s))) that can be built from it.
State: speed=0kph, area_type=urban, terrain=road, vehicle_type=taxi, begin_service_time=8:00, end_service_time=19:00, nb_car_nearby=1, time=16:23, ⟨image of the surroundings⟩, passenger_count=0, customer_requested_stop=true, ⟨GPS information⟩
Facts: stop, in_city, on_road, is_taxi, in_service, no_traffic, closest_parking<=40m, nb_free_spots>=10
Arguments: The vehicle is a taxi, No exception in cities, Parking spot near, No traffic, F(speeding), F(stop | on_road)

Then, the arguments generated by each avatar are combined into a single graph per norm. Figs. 3.5–3.6 show how the graph for each norm is constructed when combining all the arguments of the avatars, while Figs. 3.7–3.8 show the same graphs restricted to the arguments retained in this example's state (see Table 3.1). Arguments considered relevant by M_taxi are in red, and the ones considered relevant by M_law are in blue (while this is not the case here, it is possible for stakeholders to share some arguments). The norm argument is in grey.

Figure 3.5: Argumentation graph for the norm F(speeding).
Figure 3.6: Argumentation graph for the norm F(stop | on_road).
Figure 3.7: Argumentation graph in state s for the norm F(speeding).
Figure 3.8: Argumentation graph in state s for the norm F(stop | on_road).

Looking at Fig. 3.7, we can see that the norm F(speeding) is part of the grounded extension. As the agent is in a city, the argument of M_taxi "No traffic" is defeated. The norm is thus active and not defeated. Since the agent is stationary, it is not speeding, and so it complies with this norm. As such, j(ε(s), F(speeding), M) = 0, which means that the agent did not violate this norm. Then, looking at Fig. 3.8, we can see that the norm F(stop | on_road) is activated (since the agent is currently on the road) and not defeated either. Albeit the agent is a taxi (and this attacks the norm), the fact that there is a parking spot nearby defeats this argument, and thus reinstates the norm. The agent is not complying with the norm, as the norm says that it is prohibited to stop, and the agent is currently stopped. Consequently, j(ε(s), F(stop | on_road), M) = −1, which means that the agent violated this norm. The agent then receives its reward signal from the environment, as well as the violation signal equal to −1. Using the Q-learning algorithm, it updates its Q-values for Q_R and Q_V.

Table 3.2: Expected Q-values for a hypothetical scenario with a state s and a list of actions a ∈ A(s).
Action (a) | Q_R(s, a) | Q_V(s, a)
stop | +100 | −1
drive_at_30kph | +40 | 0
drive_at_50kph | +60 | 0
drive_at_70kph | +80 | −1

Selection of the Optimal Action

Once the learning phase is over, the agent has to follow the optimal policy π* it learnt during training.
Consider an example scenario with a state s and the norms presented earlier in Figs. 3.5–3.6, in which the agent has to pick up a customer who is standing nearby. This grants an immediate reward of +100. On the other hand, since the agent should keep the waiting time of the customer as short as possible, it is given a negative reward (for instance, −1) for every time step in which the customer is waiting. Let the action space for the agent in this state be A(s) = {stop, drive_at_30kph, drive_at_50kph, drive_at_70kph}. The information that can be derived from the state is that the speed limit is 50 kph, since the agent is within a city, and that there is a parking spot nearby. The Q-values Q_R(s, a) and Q_V(s, a) for the given state s and each action a ∈ A(s) are given in Table 3.2.

As there is a parking spot nearby, even though the agent is a taxi, it is still prohibited to stop on the road. Consequently, the action stop would grant the immediate reward of +100 for letting the customer in, but would also count as a violation of the norm F(stop | on_road). Thus, Q_V(s, stop) is equal to −1. Similarly, drive_at_70kph grants a reward of +80 (losing some due to the time penalty accumulated by the time it reaches the parking spot). As this speed exceeds the limit, it also counts as a violation of the norm F(speeding). On the other hand, both drive_at_30kph and drive_at_50kph comply with all the norms. Yet, since driving at 30 kph requires more time to reach the parking spot than driving at 50 kph, the total expected reward is lower, due to the extra time penalty incurred.

If we run the lexicographic selection over this set of actions, we obtain the following subsets at each step:
1. Starting set A(s): {stop, drive_at_30kph, drive_at_50kph, drive_at_70kph}
2. Filtering suboptimal actions regarding Q_V: {drive_at_30kph, drive_at_50kph}
3. Filtering suboptimal actions regarding Q_R: {drive_at_50kph}

This set contains only one action, drive_at_50kph, which is thus optimal with respect to both the norms and the task of picking up the customer. If the final set contained more than one action, the agent could select one randomly, as they would all be considered equivalent in terms of violations and reward.

3.2.4 δ-Lexicographic Selection

One criticism that may be levelled against the proposed approach is that it is too strict in its respect of the norms. This section therefore proposes an alternative to the standard lexicographic selection method that adds more flexibility to the system. Consider the following example: an agent has to choose, in a state s, between three actions a, b, and c, whose Q-values for Q_V and Q_R are shown in Table 3.3. Using the standard lexicographic selection method (referred to as lex in this section), it will choose the action a, even though the risk of committing a violation when doing b is very small and provides a greater reward in return. In some application contexts, one may wish to make the agent more flexible.

Table 3.3: Example of Q-values in a state s.
Action (x) | Q_V(s, x) | Q_R(s, x)
a | 0 | 10
b | −1 | 20
c | −0.2 | 15

Consequently, we propose a variant of the lex method introduced earlier. This approach adds a tolerance margin to the selection process of the optimal action(s). The method is named δ-Lexicographic (dlex) selection. It requires expertly setting a value δ ∈ ℝ⁺ (note that δ = 0 is the same as applying lex) which serves as the tolerance margin. Algorithm 3 shows how the optimal set of actions is then computed using this method (note that the application of the margin can be removed for the last Q-function; a sketch follows the algorithm).

Algorithm 3 δ-Lexicographic selection
Require: s (current state), L (a list of Q-functions, ordered according to a preference order), δ (tolerance margin)
Ensure: A* (the resulting set containing the optimal actions)
1: A* := A(s)                          ▷ Gets the set of the possible actions in state s
2: for Q_i ∈ L do
3:     if lastElement(Q_i, L) then     ▷ If last element of the ordering, the tolerance is ignored
4:         δ := 0
5:     end if
6:     t := max_{x ∈ A*} Q_i(s, x) − δ ▷ Computes the threshold
7:     A* := {x ∈ A* | Q_i(s, x) ≥ t}  ▷ Filters out suboptimal actions below the margin δ
8: end for
9: return A*

Consider the example of Table 3.3, with δ set to 0.3. Using the δ-lexicographic selection, we compute the threshold for Q_V, which is t = 0 − δ = −0.3. After the first iteration, we have A* = {a, c}, as both Q_V(s, a) and Q_V(s, c) are ≥ t. Then, after the second iteration, A* = {c}, since its Q_R-value is the optimal one.
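A short Python rendering of Algorithm 3, with the example of Table 3.3 as a usage check; names are illustrative.

```python
def delta_lexicographic(actions, q_functions, state, delta=0.3):
    """Algorithm 3 as a sketch: like lexicographic selection but
    keeping every action within `delta` of the best value, except
    for the last Q-function where the margin is ignored."""
    candidates = list(actions)
    for i, q in enumerate(q_functions):
        margin = 0.0 if i == len(q_functions) - 1 else delta
        threshold = max(q(state, a) for a in candidates) - margin
        candidates = [a for a in candidates if q(state, a) >= threshold]
    return candidates

# Table 3.3 with delta = 0.3: Q_V keeps {a, c}, then Q_R keeps {c}.
q_v = lambda s, a: {"a": 0.0, "b": -1.0, "c": -0.2}[a]
q_r = lambda s, a: {"a": 10.0, "b": 20.0, "c": 15.0}[a]
print(delta_lexicographic(["a", "b", "c"], [q_v, q_r], state=None))  # ['c']
```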
It is possible to define different margins δ_i for each Q_i ∈ L. One advantage of this method is that it mitigates issues arising from Q-values that have not fully converged, such as near-identical values (e.g., 0.999 and 0.998), which would, for instance, lead lex to incorrectly discard a nearly equivalent alternative. Although in the proposed algorithm we assume δ to be given, it can also be computed dynamically to include, for example, only the actions x ∈ A(s) for which Q_i(s, x) is within a 10% range of the linear interpolation between the maximal and minimal Q-values for Q_i. More formally, for a ρ% tolerance, the value of t in Algorithm 3 would be replaced by

t := max_{x ∈ A*} Q_i(s, x) − (ρ/100) × (max_{x ∈ A*} Q_i(s, x) − min_{x ∈ A*} Q_i(s, x))

In the example of Table 3.3, this would give t = −0.1 for Q_V and t = 19 for Q_R (computing each threshold over the full action set for illustration).

3.3 Evaluation

This section aims at assessing the proper functioning of the proposed architecture, as well as the good behaviour of the agent trained through it. To this end, we implemented a grid-world environment inspired by the example proposed in Section 3.2.3.

3.3.1 Environments

The Taxi-A environment (the implementation is available on GitHub: https://github.com/zaap38/Pinocchio-Architecture), shown in Fig. 3.9, consists of a 10 × 10 grid with 3 different types of tiles, namely, road, pavement, and wall. The agent can move freely onto the road and pavement tiles, but cannot move over the wall tiles. The upper left corner corresponds to the coordinates (x=0, y=0), and the lower right corner corresponds to (x=9, y=9). The agent spawns at the coordinates (x=1, y=1). The agent's state contains its coordinates, which items are currently on the map, its inventory, and the iteration number divided by 5 (to give the agent a notion of time without explicitly giving the time step it is in). Once 60 iterations have been reached, the environment is reset. The agent possesses a total of 8 actions, each being a tuple of two parameters resulting from the combination of a direction {up, down, right, left} and a speed {slow, fast}. For example, the action of the agent moving slowly upward is represented by the tuple (up, slow).
If the agent chooses a direction that makes it collide with a wall, its position remains unchanged, but it receives a −10 penalty on its reward. Furthermore, at each time step, the agent receives a −1 penalty to account for time. This penalty is reduced to −0.5 if the agent chooses the action parameter fast. There are three "items" placed within the environment, named Red-Passenger, Black-Passenger, and Building. When the agent goes over Red-Passenger or Black-Passenger, a "passenger" is added to its inventory, and the items Red-Passenger and Black-Passenger are then removed from the environment. The difference between Red-Passenger and Black-Passenger is that when reaching Black-Passenger, the agent is considered as being on a parking spot when collecting the passenger. However, it receives a reward penalty of −5 to account for the extra time the customer may have needed to move to this location. If the agent goes on Building while having a passenger in its inventory, the passenger is removed, the agent receives a reward of 100, and the environment resets. Otherwise, nothing happens.

Figure 3.9: The Taxi-A/B/C environment.

Three regulative norms are in application within this environment:

R1 - F(pavement): It is forbidden to go over the pavement tiles.
R2 - F(speeding): It is forbidden to exceed the speed limit (i.e., use the parameter fast).
R3 - F(stop | road): It is forbidden to stop on the road (i.e., reach the items Red-Passenger, Black-Passenger, or Building).

Similarly, two stakeholders are involved, namely, the taxi company and law enforcement. The list of all the constitutive norms in application within the environment is:

• C_1(role(taxi))
• C_2(morning, not_service)
• C_3(night, not_service)
• C_4(evening ∧ has_passenger, late)
• C_5(time=0-1, morning)
• C_6(time=2-2, day)
• C_7(time=3-7, evening)
• C_8(time=8-11, night)
• C_9(dist_parking < 4, parking_near)
• C_10(in_city, no_exception)

Note that C_1–C_8 belong to M_taxi and C_9–C_10 belong to M_law. The argumentation frameworks corresponding to the defeat (or application) of each norm are:

R1: F(pavement)
  Args = {F(pavement)}
  R = ∅

R2: F(speeding)
  Args = {F(speeding), late, no_exception}
  R = {(late, F(speeding)), (no_exception, late)}

R3: F(stop | road)
  Args = {F(stop | road), not_service, role(taxi), parking_near}
  R = {(role(taxi), F(stop | road)), (not_service, role(taxi)), (parking_near, role(taxi))}

The optimal reward that the agent can achieve without committing any violation is 79.

A variant of the Taxi-A environment, named Taxi-B, removes the time penalty occurring at each time step when the agent is not carrying a passenger. In order to help the agent still converge to the desired behaviour, an intermediate reward of 50 is added when picking up a passenger. This reward is set to 45 when picking up Black-Passenger instead, to account for the −5 penalty. In this variant, the optimal reward is 140.5.
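To make the defeat computation of these frameworks concrete, the following sketch (reusing grounded_extension() from Chapter 2's sketch) encodes R2 and checks whether the norm survives depending on which arguments are present; the dictionary encoding is illustrative.

```python
# Argumentation framework for R2 in the Taxi environments.
R2_ARGS = {"F(speeding)", "late", "no_exception"}
R2_ATTACKS = {("late", "F(speeding)"), ("no_exception", "late")}

def norm_defeated(norm_arg, all_args, all_attacks, present):
    """The norm is defeated iff its argument is NOT in the grounded
    extension of the framework restricted to the arguments present
    in the current state."""
    args = all_args & present
    attacks = {(x, y) for (x, y) in all_attacks
               if x in args and y in args}
    return norm_arg not in grounded_extension(args, attacks)

# Evening rush with a passenger, outside a city: 'late' holds and
# 'no_exception' does not, so F(speeding) is defeated.
print(norm_defeated("F(speeding)", R2_ARGS, R2_ATTACKS,
                    {"F(speeding)", "late"}))                   # True
# Same situation within a city: 'no_exception' defeats 'late',
# reinstating the norm.
print(norm_defeated("F(speeding)", R2_ARGS, R2_ATTACKS,
                    {"F(speeding)", "late", "no_exception"}))   # False
```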
The optimal reward that the agent can achieve without committing any violation is 79.

A variant of the Taxi-A environment, named Taxi-B, removes the time penalty occurring at each time step when the agent is not carrying a passenger. In order to help the agent still converge to the desired behaviour, an intermediate reward of 50 is added when picking up a passenger. This reward is set to 45 when picking up Black-Passenger instead, to account for the −5 penalty. In this variant, the optimal reward is 140.5.

Then, in a third variant called Taxi-C, the status of the norm R2 (speeding) can be randomly altered with a probability of 10%, such that the norm is activated even if the conditions for its defeat are met. This last environment serves to test the human intervention feature of the architecture.

Finally, the Taxi-D environment aims at highlighting the difference in behaviour between the different agents tested. Fig. 3.10 shows the new grid corresponding to this environment. We can see that the agent would greatly benefit from cutting across the small pavement tile at (x=7, y=4).

Figure 3.10: The Taxi-D environment.

3.3.2 Results

This section evaluates the proposed architecture by comparing 2 agent configurations in the 4 different variants of the Taxi-A environment.

Tested Agents

We chose to compare 2 different variants of agents in order to evaluate the proposed architecture. These variants are listed below. Note that they both receive the same representation of the state, as well as the same reward signal.

• π-NoCCHIO × lex (RL-Lex): This agent uses the π-NoCCHIO architecture initialised with the two stakeholders and three norms introduced in Section 3.3.1. The selection of the best action is done using lex.
• π-NoCCHIO × dlex (RL-DLex): This agent is similar to the previous one, except that it uses dlex to select its action.

Both variants share the same learning rate α = 0.05 and discount factor γ = 0.99.

Experimental Results

The experiments consist of 2,000,000 learning steps in the Taxi-A and Taxi-B environments. The learning phase takes roughly 10 minutes with a Python3 implementation on an 11th Gen Intel Core i5-1145 at 2.60 GHz. During the learning phase, the agent retained a permanent minimum probability of 10% of selecting a random action, for the sake of exploration. The final behaviour was then assessed by observing a separate run in which the agent followed its optimal policy.
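For concreteness, such a learning step could be sketched as follows: a tabular agent maintaining one Q-function for the expected violations and one for the expected reward (as in the multi-objective setup this architecture builds on), updated with standard Q-learning under ε-greedy exploration. Every name here, the environment API, and the exact update rule are an illustrative reconstruction under these assumptions, not the actual experimental code.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.05, 0.99, 0.10  # values used in the experiments

q_violation = defaultdict(float)  # Q_V(s, a): expected violations (<= 0)
q_reward = defaultdict(float)     # Q_R(s, a): expected reward

def choose_action(state, actions, select):
    # Permanent 10% exploration floor during learning.
    if random.random() < EPSILON:
        return random.choice(actions)
    # `select` is a lexicographic selection rule (lex or dlex)
    # applied over [Q_V, Q_R]; ties are broken at random.
    return random.choice(list(select(state, actions)))

def td_update(q, state, action, signal, next_state, next_actions):
    # Standard Q-learning backup, applied to either signal.
    best_next = max(q[(next_state, a)] for a in next_actions)
    q[(state, action)] += ALPHA * (signal + GAMMA * best_next
                                   - q[(state, action)])

# One learning step (hypothetical environment API):
#   action = choose_action(state, actions, select)
#   reward, violation, next_state = env.step(action)
#   td_update(q_reward, state, action, reward, next_state, actions)
#   td_update(q_violation, state, action, violation, next_state, actions)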
Fig. 3.11–3.12 show the evolution of the reward and violation count during the learning phase for the tested agents in the Taxi-A environment. We can see that the agent quickly learns to avoid committing violations, while also managing to obtain a near-optimal reward. The reward that it obtains when performing its optimal policy is 77. The 2 points below the maximal theoretical reward are due to the agent being unable to know, through its state representation, when the day time is about to change from "morning" to "day"; it thus takes a step back instead of picking up its passenger.

Figure 3.11: Evolution of the reward and violation count during the training phase for RL-Lex in the Taxi-A environment.
Figure 3.12: Evolution of the reward and violation count during the training phase for RL-DLex in the Taxi-A environment.

Fig. 3.13–3.14 show the same information for the variant Taxi-B. This time, the agent achieves the optimal reward of 140.5 when following its optimal policy, while still avoiding committing any violation. At the beginning of the learning phase, we can see that while the number of violations decreases, the average reward also goes down. This is because the agent first discovers that picking up Red-Passenger grants a bonus reward; however, as it also learns to avoid violations, it learns to avoid picking up Red-Passenger. It then requires some more learning steps before finding Black-Passenger, after which the reward goes up.

Figure 3.13: Evolution of the reward and violation count during the training phase for RL-Lex in the Taxi-B environment.
Figure 3.14: Evolution of the reward and violation count during the training phase for RL-DLex in the Taxi-B environment.

Another interesting feature of the optimal behaviour is that, instead of rushing for the passenger, the agent waits for the day time to be "evening", as this allows it to use the "fast" action parameter. This phenomenon is called Norm Avoidance [15]. As it is not the focus of this chapter, no attempt has been made to mitigate it; however, approaches to mitigate it will be discussed later in this thesis.

Then, Fig. 3.15–3.16 show, for the tested agents in the Taxi-C environment, the evolution of the reward and violation count when the agent learns under random human interventions that may force it to respect the speed limit, disregarding the defeat conditions. Through observations of the agent's trace, we could see that the agent was able to learn an optimal behaviour while still complying with the norms, in particular when the norm F(speeding) is manually altered.

Figure 3.15: Evolution of the reward and violation count during the training phase for RL-Lex in the Taxi-C environment.
Figure 3.16: Evolution of the reward and violation count during the training phase for RL-DLex in the Taxi-C environment.

Finally, we compare the two agents in the Taxi-D environment. The RL-DLex agent was set with a fixed tolerance of 1, which means that it could tolerate one violation. As expected, RL-Lex stayed off the pavement, while RL-DLex cut across the pavement, thus achieving a greater reward at the cost of one violation.

In conclusion, we can see that the proposed architecture allows deriving, from the state of the environment, more complex facts that can serve to determine whether a norm has to be complied with, with respect to the arguments of multiple stakeholders. We can also see that the agent is then able to learn a behaviour that is compliant with the norms while still achieving a near-optimal reward. We then assessed the capacity of the agent to comply with manual updates of the norm status. Finally, we have shown that the RL-DLex variant was able to achieve a greater reward by exploiting a tolerance margin.

Remark 8. As the computation of the grounded extension may raise concerns about the time complexity when dealing with bigger graphs, we would like to provide the following insight: with Python3 (i5-1145 2.60GHz, 16 GB RAM) and a graph of 2K nodes and 13K edges, computing the grounded extension takes on average 6 ms.

3.4 Related Work

This section describes works related to the architecture proposed in this chapter. It first explores the work focusing on normative reinforcement learning. The discussion is then expanded to works that make use of normative supervisors, and finally to approaches related to normative agents in general.

3.4.1 Normative Reinforcement Learning

When put in perspective with the RL-related and Machine Ethics literature, not many works have been done in the field of normative RL. Furthermore, these works are fairly recent when compared to the first discussions around normative agents.

Makarova and Abbas [125] introduce a method in which an RL agent learns a policy (achieving at least local optimality) to accomplish a "mission", i.e., reaching a state corresponding to the completion of the agent's task, while avoiding states that would not respect the norms.
In this work, the agent learns, within an MDP (see Definition 1), the probabilities associated with the state-action-state transitions. What the agent has learnt can then be transformed into a Probabilistic Computation Tree Logic (PCTL) model by unrolling the MDP and associating to each transition the probability observed during learning. From each state (i.e., node of the tree) can be derived truth values for a set of STIT (see to it that) logic formulae. Some of these formulae describe norms, and consequently states to avoid or to reach, depending on the norm being a prohibition or an obligation. Some other formulae describe the state-of-affairs in a given state, allowing for the assessment of the completion of the mission. This essentially transforms their MDP into a labelled MDP (see Definition 2). Then, these values are propagated back through the nodes of the tree if the probability of the transition stays above a fixed threshold (Makarova and Abbas [125] claim that while one can see such a threshold as being arbitrary, in computer engineering such a value would be derived from empirical studies and risk assessments), similarly to a constraint satisfaction problem. The authors then compute from the obtained tree a policy that ensures the achievement of the mission while avoiding violating states.

This approach provides many formal guarantees about the policy that the agent will learn, such as (at least) local optimality, avoidance of the violating states, and achievement of the task. It also makes very clear why the agent favoured one action over another, granting it a good explainability potential. However, it suffers from some strong assumptions that limit its applicability. For example, the size of a real-world environment often makes it unrealistic to assume that memorising the state resulting from every state-action pair is possible, as the state space may be too big to be memorised with reasonable resources. This issue is common among the model-based approaches, i.e., the reinforcement learning approaches trying to learn the transition probabilities from one state to another (in opposition to the model-free approaches that focus on learning the expected reward), such as safe RL techniques using Linear Temporal Logic (LTL) to identify a safe policy [19, 98, 147]; it renders model-based approaches hardly scalable. Secondly, it is stated that this approach does not consider environments exhibiting normative conflicts (which here would take the form of the non-existence of a path satisfying the constraints), where satisfying one norm makes at least one other norm violated. This, once again, limits the applicability of the aforementioned approach.

In their work, Scheutz and Little [182] propose to use Inverse Reinforcement Learning (IRL) to learn what actions the agent ought to do from a dataset of observed behaviours. This allows them to derive what they call low-level obligations. Then, by extending their MDP to a labelled MDP, they can extract higher-level obligations. The main advantage of this method is that we are certain that there cannot be a mistake in the design of the norms, as there is no such phase. However, this work suffers from several problems. First, it assumes that the training data does not contain observations taken from malicious agents potentially disrespecting the norms. This is a very strong assumption, especially given that these data may never be humanly examined. Second, their approach does not distinguish between the respect of an actual norm and an action that is simply convenient for an agent. For example, if an agent takes an item in a store, it is obligated to pay for it. However, when an agent takes the elevator to go to the tenth floor even though there were stairs, this is not what one would like to qualify as a norm (if we did consider it as one, we might have to consider any action that is suboptimal with regard to the reward as prohibited). Last, it requires the environment to possess symbols in order to extract norms that are meaningful to a human. If this is not the case, then most of the generated norms will look like "If you visited state x, you ought to visit state y". This, combined with the multitude of "false positive" norms that may be extracted, may render the identification and understanding of the actual norms challenging.

Finally, we would like to introduce the work of Neufeld [143, 144, 145], where a model-free normative RL architecture has been developed to train an agent to follow norms provided expertly through a knowledge base. Similarly to the works presented above, this work uses a labelled MDP to represent the state-of-affairs in each state. Labels are used to detect whenever a norm is fulfilled or violated. For instance, consider an environment where the constitutive norm C(red_fence, ¬white_fence), which essentially means that a red fence is not white, and the regulative norm O(white_fence | fence), which means that if there is a fence, it should be white, are in application (this example is inspired by the Dutch cottage housing regulation as presented by Prakken and Sergot [165]). A state having the labels {fence, red_fence} would then trigger a violation.
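A minimal Python sketch of this label-based violation detection on the fence example could look as follows; the representation of norms as predicates over label sets is an illustrative assumption, not Neufeld's actual formalism.

# Constitutive norm C(red_fence, ¬white_fence): a red fence
# counts as a non-white fence.
def derive_labels(labels):
    derived = set(labels)
    if "red_fence" in derived:
        derived.add("not_white_fence")
    return derived

# Regulative norm O(white_fence | fence): if there is a fence,
# it ought to be white. The violation is detected through the
# derived negative label.
def violates(labels):
    labels = derive_labels(labels)
    return "fence" in labels and "not_white_fence" in labels

assert violates({"fence", "red_fence"})        # triggers a violation
assert not violates({"fence", "white_fence"})  # compliant state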
Instead of updating a single Q-function, the agent updates one for the expected reward (Q_R) and one for the expected violations (Q_V). When in a state s, the agent knows that the expected reward it will receive when doing the action a is Q_R(s, a), and that the expected number of violations is Q_V(s, a); a violation is usually expressed with a negative value, meaning that Q(s, a) = 0 denotes no violation, while Q(s, a) = −n denotes n violations. Then, when running its optimal policy, the agent chooses the action to take by performing a thresholded lexicographic selection over the possible actions. A lexicographic selection [85] consists of the iterative selection of subsets of actions such that they are optimal with respect to a Q-function. The order in which this selection is performed, called a lexicographic ordering, is fixed. In Neufeld's work, it is P = {V ≻ R}, meaning that the first subset will contain the actions that maximise the violation Q-value (as maximising this value is the same as minimising the number of violations committed), and the second subset the actions that, among these, also maximise the reward. A thresholded lexicographic selection [200] extends this process by adding a minimal threshold value C_i for each element i of the lexicographic ordering. Instead of selecting only the optimal subset at each iteration, it selects all the actions that return a value above the corresponding threshold. If no action reaches the threshold, the most optimal one is chosen. The threshold for the last value of the ordering is usually set to an infinite value. The process of selecting an optimal action given a set of Q-functions is shown in Algorithm 4.

Algorithm 4 Thresholded lexicographic selection
Require: s (current state), L (a list of Q-functions, ordered by priority), C (a vector of thresholds for the Q-functions in L)
Ensure: A* (the resulting set containing the optimal actions)
1: A* := A(s)  ▷ Gets the set of the possible actions in state s
2: for Q_i ∈ L do
3:   T := {x ∈ A* | Q_i(s, x) ≥ C_i}  ▷ Filters out actions below the threshold
4:   if T = ∅ then
5:     A* := arg max_{x ∈ A*} Q_i(s, x)  ▷ Filters out suboptimal actions
6:   else
7:     A* := T
8:   end if
9: end for
10: return A*
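For comparison with the δ-variant sketched earlier in this chapter, a minimal Python rendering of Algorithm 4 could look as follows (again with an illustrative representation of the Q-functions as callables, not the original implementation):

def thresholded_lex_selection(actions, q_functions, thresholds):
    # actions: candidate actions A(s); q_functions: callables a -> Q_i(s, a),
    # ordered by priority; thresholds: one C_i per Q-function (the last one
    # is conventionally +infinity, which forces the arg-max fallback at the
    # final level).
    candidates = set(actions)
    for q, c in zip(q_functions, thresholds):
        above = {a for a in candidates if q(a) >= c}
        if above:
            candidates = above
        else:
            # No action reaches the threshold: keep only the optimal ones.
            best = max(q(a) for a in candidates)
            candidates = {a for a in candidates if q(a) == best}
    return candidates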
If we restrict Neufeld's approach to a standard lexicographic selection and disregard the discount factor γ, it converges to an optimum that minimises the violations and maximises the reward among the remaining options. Furthermore, it does not require more expert knowledge than a standard RL environment (the only exception being the labels that help determine whether a norm is violated or not). Since the norms are given explicitly in a knowledge base, and expressed with symbols, it is fairly easy for a human to understand what the agent tries to respect as a norm. We believe the version using the thresholded lexicographic selection to be more controversial, as the choice of the threshold values may be somewhat arbitrary. Another criticism one can make is that the management of the norms and their conflicts remains fairly minimalist. The author addressed this issue in further works by proposing a theorem prover-based normative supervisor [143, 145]. However, since we plan to integrate a separate normative supervisor, this thesis will be based on the version previously described; as this work provides a reliable and flexible basis, it will serve as the foundation for the normative RL side of this thesis.

3.4.2 Normative Supervisors

As reinforcement learning is not the only way to make an agent act normatively, here are some works that can be classified as normative supervisors, i.e., entities that govern an agent by deciding whether or not it has the right to take a given action.

An example of a normative supervisor is the Jiminy architecture [112, 111] previously introduced. Its integration with an agent was out of the scope of the papers in which it was proposed. Consequently, several limitations arise when trying to integrate it with a reinforcement learning agent, making it inapplicable in its current state.

First, it is designed to be used in real time during the agent's deployment, as there is no learning phase involved. This means that the heavy computations will have to occur in real time. On on-board systems, it might be problematic to permanently allocate that many resources. To alleviate this issue, the Jiminy architecture proposes to skip the reasoning process if no normative conflict, or norm-relevant element, is detected in the current state, but it is unclear how this detection would be achieved without actually entering the reasoning process.

Second, it does not guarantee any privacy for the stakeholders.
Indeed, as their models have to be stored within the system, potentially anyone can look inside to see the elements constituting each avatar. A subsequent problem is that anyone can potentially modify these models so that the supervised agent deviates from the originally intended behaviour.

Third, it lacks the ability to deal with uncertainty. Indeed, as Jiminy intervenes before the agent to select a set of actions that are considered to adhere to the norms, the agent cannot learn from the consequences of the actions performed. Therefore, the agent may adopt a behaviour in which it does what seems to be good (with respect to the norms) rather than what is truly good. This incapacity to deal with uncertainty is also problematic for avoiding long-term violations. Indeed, it is sometimes preferable to commit a violation immediately, as not doing so may later lead to the unavoidable commission of a worse violation.

Figure 3.17: Normative architecture proposed by Neufeld et al. [146].

In the work of Neufeld et al. [146], an architecture for normative supervision is proposed. It takes the form of a shielding architecture, shown in Fig. 3.17, where a normative supervisor first receives the observations of the agent, and then computes a set of legal actions from the set of possible actions. These legal actions are then forwarded to the reinforcement learning agent, which can choose only between these. It suffers from issues similar to Jiminy's. The first is the inability to deal with stochasticity. Although the RL agent is capable of optimising its reward (for the main task), it is dependent on the choice of the normative supervisor to choose a legal action. This means that, depending on how the norms are structured within the normative supervisor, the agent may choose an action that is legal but that will later lead to an unavoidable violation. The second issue is that, if no legal action is available, the agent will have to choose an illegal action. As the agent did not train at predicting the normative consequences of such illegal actions (it tracks the reward, like a standard RL agent, but not the expected future violations, nor preferences or weights over them), it may not achieve optimality in case of a normative conflict.

3.4.3 Other Approaches to Normative Agency

This section discusses other approaches related to normative agent architectures that are not directly related to reinforcement learning. This includes approaches based on planning or BDI (Beliefs, Desires, Intentions) models [88].

Castelfranchi et al. [61] introduce deliberative normative agents that are aware of the norms in a multi-agent environment and can choose whether to violate them in specific contexts. A proposed architecture for such agents allows them to recognise norms, adopt them, or deliberately violate them when appropriate. This framework refines earlier models by adding components for reasoning in norm-governed settings. Unlike earlier hard-coded approaches used in social simulations, it enables agents to adapt their behaviour over time.

In Boman [50], a method is proposed to enforce norms on supersoft agents, which process vague and imprecise information using a decision module. Norm compliance is achieved by manipulating utilities, removing harmful actions, or disqualifying behaviours that reduce the global utility of the system.
This approach does not define a full agent model, but rather enforces rules on utility-driven agents whose decisions are always obeyed. Although effective in ensuring compliance, it lacks flexibility, as it prevents norm violation and sanctioning, limiting its usefulness in domains where contracts can be broken and outcomes are uncertain.

Then, in Dignum et al. [74], an approach is proposed that integrates norms and obligations into the BDI model to support socially motivated reasoning. Agents explicitly represent norms, allowing them to decide whether to comply and to adapt when norms become obsolete or differ between agents. The control algorithm extends the BDI process by including deontic events in the option generator, producing sets of compatible plans. Plan selection is guided by social benefits and the severity of punishment, enabling agents to adopt norms for collaborative behaviour. However, the model only addresses conflicts between norms, resolving them through predefined preference orderings rather than dynamic utilities.

The Normative Agent Architecture (NoA) presented in Kollingbaum and Norman [104] extends the BDI model with mechanisms to reason about norms. It consists of a specification language for beliefs, goals, plans, and norms, and an interpreter that executes them. Rather than excluding forbidden actions, NoA labels them as such, leaving agents norm-autonomous by allowing deliberate violations when conflicts arise. NoA uses predefined plans of deterministic actions. It does not model punishments for violations; instead, agents aim to maximise norm respect, with violations occurring only when compliance is impossible.

In Boella and van der Torre [48], a formal game-theoretic model is proposed for agents negotiating contracts, introducing a qualitative approach based on recursive modelling. Agents predict whether their actions violate norms and may be sanctioned. Conflicts among motivations are resolved through priority relations. The model assumes deterministic transitions.

Andrighetto et al. [24] present the EMIL-A architecture, designed to handle the emergence of norms by recognising new norms, forming normative beliefs and goals, and generating intentions and plans to act accordingly. It includes a long-term memory of norms and a repertoire of compliant actions, with decisions guided by the salience of norms. Agents compare expected utilities when deciding whether to comply, considering possible sanctions and incentives. However, the model lacks an explicit representation of sanctions and does not detail the normative action planner.

The work in [118] proposes a normative framework for agent-based systems, combining models of norms, normative multi-agent systems, and normative autonomous agents. Agents decide whether to comply with norms by evaluating goals that may be hindered or rewarded, and consider potential punishments when rejecting norms. The architecture incorporates agent profiles, such as social, rebellious, and opportunistic, that guide goal selection through priority functions.

Meneguzzi and Luck [131] extend BDI languages to allow behaviour modification at runtime in response to newly accepted norms. Norms are represented with activation and expiration conditions and can be prohibitions or obligations referring to states or actions.
The interpreter scans the agent's plan library to remove violating plans for prohibitions, or to generate new plans for obligations. The method focuses on adapting behaviour to already-accepted norms; enabling norm violation would require a decision process for accepting or rejecting norms, which restricts the agent's autonomy.

Cardoso and Oliveira [60] design norm-aware, utility-oriented agents that adjust sanctions in contracts. Agents differ in risk tolerance, affecting decisions to enter contracts, and in social awareness, influencing compliance with obligations even when not personally advantageous. These agents do not construct plans, but instead compute expected utilities.

Finally, Joseph et al. [99] propose a coherence-driven agent architecture that extends the BDI model with a theory of deductive coherence, guiding decisions based on coherence maximisation rather than expected utility. Norms are represented with graded priorities, and norm violations are linked to sanctions using logical implications. Non-deterministic information is handled with grades, and sanctions are encoded as beliefs. Agents adopt norms and sign contracts that are coherent with their current mental state, rather than those that are most profitable. The architecture does not specify a planning context.

3.5 Summary

This chapter presents the π-NoCCHIO architecture [8] and its subcomponents, a combination of the Jiminy architecture, the AJAR framework, and Neufeld's NGRL agent, by providing its formal model and explaining how it integrates within a reinforcement learning setup. In the proposed architecture, a reinforcement learning agent obtains, in addition to its reward signal, a violation signal. The latter comes from stakeholders' representations, called avatars, that debate through formal argumentation whether the agent committed a violation of a given norm. We then proposed an alternative to the standard lexicographic selection method that gives more flexibility and freedom to the agent. We have seen through an empirical evaluation that the proposed architecture not only makes the agent able to follow the norms, but also allows for some flexibility, by following dynamic norm revisions from a human operator or by making compromises to achieve a better reward. The chapter then compared the benefits of the proposed approach with similar works in the literature on normative reinforcement learning and normative supervisors, outlining their limitations and detailing how π-NoCCHIO addresses them.

However, this architecture can be further improved. Consequently, here are some challenges and research directions that one may find relevant to explore. First, the way in which the stakeholders' arguments are exchanged can be improved. In the proposed architecture, the method we use is fairly naive; however, several works within the literature allow agents to have strategic argumentative discussions. Using these works may allow stakeholders to regain agency. It would also improve the overall fairness, since there would be more control over how the arguments are put forward within the discussion. Second, a more complex structure than Dung's argumentation frameworks could be used. Indeed, decisions could be modelled using, for example, probabilistic argumentation frameworks developed specifically for reinforcement learning [171], or representations that are more philosophically grounded [13].
Chapter 4

Dynamically Modelling the Norms of Stakeholders

The goal of this chapter is to provide the necessary tools to reduce the burden on the designer when using the π-NoCCHIO architecture presented in Chapter 3. Consequently, this chapter reviews the norm identification techniques from the literature to help the reader in selecting one. Then, it proposes ARIA, an argumentative rule induction algorithm that aims at extracting an argumentation framework representing the model that governs the decisions of an agent.

4.1 Introduction

As already mentioned in the introduction of this thesis, the π-NoCCHIO architecture requires the norms of the stakeholders to be represented as argumentation graphs. Although a stakeholder such as the manufacturer may be able to allocate resources to model such a structure, this is not always feasible. For example, user stakeholders may lack the time and expertise to develop such a structure. Even if they are able to, insufficient supervision can lead them to overlook critical aspects that, if neglected, may be detrimental. For this reason, it is important to provide a tool to automatically extract the model of some of the stakeholders. Fortunately, all potential "user" stakeholders tend to share similar patterns. For this reason, it is possible to use an algorithm that extracts, from previously collected behavioural data, a model of these stakeholders.

Doing this requires a two-step process. First, it is necessary to extract the norms that are relevant to the user. Second, the conditions under which the user finds a norm relevant have to be refined and structured as an argumentation graph. In the literature, there exists a broad range of approaches for norm identification, also referred to as norm detection or norm mining. For this reason, the choice of which algorithm to use to extract the norms is left to the designer, the ARIA algorithm proposed in Section 4.3 focusing solely on the extraction of the exceptions to a norm. Nevertheless, as the designer requires a clear view of the norm mining literature in order to make a sensible choice, Section 4.2 presents the results of a comprehensive review of the field of norm mining, as well as key takeaways for the choice of the norm mining technique to use with π-NoCCHIO. An interested reader may also consider reading the complete review by Alcaraz et al. [14], as it identifies the challenges and research directions for the future of the field of norm identification. Within this thesis, however, the content will be limited to the analysis of the current approaches.

The second step consists of refining the defeat condition of these norms and representing this condition through an argumentation graph. There are several ways to organise the arguments in favour of or against a norm, such as using Large Language Models (LLMs) (e.g., Alcaraz et al. [16]). In Section 4.3 of this thesis, we introduce the ARIA (i.e., Argumentative Rule Induction A-star) algorithm. By learning over a dataset, this algorithm generates a graph that serves to predict a label depending on an input state. Then, using some techniques from the literature, it can serve to generate post-hoc explanations of a decision.
4.2 Review of the Norm Mining Techniques

Norm mining, also referred to as norm identification or norm detection, aims to enable autonomous agents to recognise both explicit norms, such as laws, and implicit norms, such as not being noisy on public transport, in order to reason about their applicability and adjust their behaviour accordingly. Usually, the main goal of these approaches is to translate the extracted norms into a form matching the Deontic Logic formalism.

This section provides an overview of the techniques developed in the last 15 years that can help the user of π-NoCCHIO make an informed choice about the norm mining method.

4.2.1 Research Questions

To help identify what each method is and is not doing, several key research questions have been formulated. These questions aim to address critical aspects of normative systems, ranging from detection and identification to synthesis and adaptation within multi-agent systems (MAS). This section elaborates on these research questions, providing hints on their significance and the assumptions underlying them.

– RQ1: How to identify the norms of a society by observing it? – This question focuses on passive observation techniques that allow agents to detect prevailing norms without direct interaction.
– RQ2: How to identify the norms of a society by interacting with it? – While passive observation provides valuable insights, active interaction offers additional dimensions for norm identification.
– RQ3: How to differentiate individual norms from societal norms in a multi-agent society? – Differentiating between individual and societal norms is crucial for understanding the emergence and enforcement of normative behaviour. Sub-questions RQ3.1 and RQ3.2 further explore the feasibility of distinguishing personal norms (p-norms) from group norms (g-norms) and the methodological challenges involved.
  – RQ3.1: Is it possible to differentiate a p-norm from a g-norm?
  – RQ3.2: How to perform this differentiation with only one agent?
– RQ4: How to detect prohibition norms without norm enforcement? – Detecting prohibition norms in the absence of explicit enforcement mechanisms is a challenging task.
– RQ5: How to detect norms in communications? – Communication analysis offers a rich source of normative information. Natural language processing (NLP) techniques play a crucial role in extracting norms from textual interactions.
– RQ6: Is it possible to adapt to drifting norms without restarting the whole learning phase? – Adapting to norm drift without reinitialising the learning process is essential for maintaining normative coherence in evolving societies. Norm evolution presents significant challenges for adaptive agents.
– RQ7: Is it possible to detect sub-communities of agents? – The emergence of sub-communities within a multi-agent society can indicate norm fragmentation.

4.2.2 Analysis of the Research Questions

This section identifies the main trends among the methods in the literature, and then attempts to answer the research questions listed in Section 4.2.1. It first describes quantitatively how the research questions were answered by the identified approaches, and then goes more in depth into each research question, analysing qualitatively how the approaches address it.
Quantitative Analysis

Although each approach addressed the challenges raised by the research questions in its own way, we could identify some major trends in the employed methods. Below are listed the keywords associated with each trend, as well as their descriptions:

• Threshold: The approach observes a community of agents. If a behaviour is repeated a number of times exceeding a threshold value, it is added to the potential norms.
• Comparison: The method detects the norms by exchanging information and comparing its set of beliefs or desires with other agents from the environment, or with an external source.
• Reasoning: The agent uses a reasoning mechanism, or a mathematical formula, to derive the norms from the data.
• Elitism: The approach focuses on the observation of a limited number of agents, usually having a higher trust value. Those agents can also act as helpers towards other agents to assist them in identifying the system's norms.
• Log: The approach makes use of the traces of other agents, potentially tracking signals such as sanctions.
• Data Mining: The approach uses pattern recognition or machine learning techniques to extract the norms.
• Natural Language Processing (NLP): The method uses grammar and semantics to detect the norms.
• Yes: The paper answered positively to a closed research question.
• Not Answered: The paper did not address the given research question. For a closed question, this is not necessarily equivalent to a negative answer.

Figure 4.1: How each research question is addressed by each approach.

Fig. 4.1 shows, for each research question, the major trends among the approaches answering it, as well as the proportion of approaches that do not address the question. Table 4.1 provides a detailed view of each of the techniques. As an additional indicator, Fig. 4.2 shows the correlations among research questions relative to the most answered question of each pair, while Fig. 4.3 shows the correlations relative to the minimum number of papers answering either question of the pair. It is important to note that the approaches not focusing on, or not taking into account, MAS may be unable to answer some of the research questions.
Table 4.1: Research Questions Breakdown
N° Reference RQ1 RQ2 RQ3 RQ3.1 RQ3.2 RQ4 RQ5 RQ6 RQ7
1 [79] NLP NLP NLP NLP
2 [120] Log Log Threshold Yes Data Mining Log Yes
3 [122] Data Mining Comparison
4 [119] Log Elitism Comparison Yes Log Yes Yes
5 [180] Log Log Reasoning Yes Reasoning Yes Yes
6 [123] Reasoning Reasoning Yes Reasoning Reasoning Yes Yes
7 [124] Log Comparison Reasoning Reasoning Yes
8 [179] Log Data Mining Comparison Yes Log Log Comparison Yes
9 [170] Data Mining Elitism Yes
10 [181] Log Log Reasoning Yes Reasoning Log Yes
11 [86] NLP NLP
12 [71] Data Mining Threshold Yes Yes
13 [59] Log Log Elitism Yes Yes
14 [30] Data Mining Reasoning Reasoning NLP
15 [178] Data Mining Threshold Data Mining Yes Yes
16 [72] Log Log Data Mining Yes
17 [5] NLP NLP
18 [154] Reasoning NLP
19 [121] Data Mining Data Mining Log Yes Log Data Mining
20 [139] Data Mining Data Mining NLP
21 [18] Log Comparison Yes Log NLP Yes
22 [73] Log Yes
23 [64] Log Yes
24 [138] Reasoning Reasoning Reasoning Yes Reasoning Yes Yes
25 [135] Log Log Comparison Reasoning Yes
26 [134] Reasoning Log Reasoning Yes Reasoning Reasoning Yes
27 [114] NLP NLP
28 [113] NLP NLP
29 [67] Log Reasoning Yes Log Yes
30 [193] Data Mining Threshold Yes Data Mining Yes
31 [70] Log Log
32 [164] NLP
33 [153] Data Mining Data Mining Data Mining Yes
34 [82] NLP NLP NLP Yes Yes
35 [133] NLP NLP NLP

Figure 4.2: Comparison with the maximal value.
Figure 4.3: Comparison with the minimal value.

Qualitative Analysis

RQ1: How to identify the norms of a society by observing it? Various approaches have emerged in the literature, including: frequency-based detection methods, such as the Potential Norms Mining Algorithm (PNMA) proposed by Mahmoud et al. [124], which identifies norms through statistical analysis of observed behavioural patterns; Bayesian hypothesis testing, as introduced by Cranefield et al. [71], which calculates the likelihood of a candidate norm's existence; and plan recognition approaches by Oren and Meneguzzi [154], which infer norms through observed action sequences. While deviating a bit from the research question, as they do not properly observe agents, repository mining techniques [72] and legal text analysis [79] provide insight into normative structures by extracting patterns from unstructured or semi-structured documents.

RQ2: How to identify the norms of a society by interacting with it? The literature presents several interaction-based approaches. Mahmoud et al. [120] leverage event history analysis combined with ontology-driven inference, allowing agents to refine detected norms through iterative engagement. Verification through peer interactions [181], used to confirm candidate norms, and query-based approaches [179], which actively test for norm existence, are also addressed.

RQ3: How to differentiate individual norms from societal norms in a multi-agent society? Mahmoud et al. [124, 122] explore frequency-based differentiation, distinguishing descriptive norms (emerging from agent behaviours) from injunctive norms (those explicitly reinforced). Peer verification mechanisms [181] provide a means of validating societal norms versus personal behaviours through sanction-based differentiation. RQ3.2 remains an open problem in the literature, as most approaches rely on multi-agent interactions.
However, Bayesian updates [71] and agent-centred norm evaluation [180] suggest that individual agents could infer societal norms through probabilistic reasoning and historical observation.

RQ4: How to detect prohibition norms without norm enforcement? Savarimuthu et al. [181] and Dam et al. [72] present data-driven approaches using association rule mining and repository analysis to identify prohibition norms, while Bayesian event sequence analysis [139] estimates prohibition likelihood based on historical compliance trends. Other approaches include analyses of infrequent patterns [124] that may indicate avoided behaviours, detection of absence patterns in expected action sequences [119, 121], linguistic cues in communications [5, 79] that signal prohibited actions, and avoidance patterns in plan execution [154].

RQ5: How to detect norms in communications? Approaches include natural language processing of communication logs [30, 72], extraction techniques specialised for formal documents like contracts [86], analysis of modal verbs and deontic expressions [5, 79], and event analysis from communication records [139].

RQ6: Is it possible to adapt to drifting norms without restarting the whole learning phase? Riad and Golpayegani [170] propose mechanisms for online norm synthesis and utility-based adaptation. Mahmoud et al. [123] propose assimilation techniques that allow agents to incrementally adjust to shifting norms rather than resetting the learning process. The literature offers several other approaches, such as continuous monitoring and incremental updates [119, 124, 120], Bayesian updating mechanisms [71] that adjust confidence over time, case-based reasoning adaptation [59] for evolving normative systems, and online refinement techniques [135, 134] for dynamic norm synthesis.

RQ7: Is it possible to detect sub-communities of agents? Campos et al. [59] and Morris-Martin et al. [138] discuss case-based reasoning and agent-directed norm synthesis as potential solutions. These methods offer promising avenues for sub-community detection, but also introduce challenges related to scalability and the granularity of norm differentiation within and across sub-communities. Mahmoud et al. [119] analyse behavioural clustering to identify societal subdivisions, while other approaches include comparative analyses of repositories [72] to identify community-specific norms, movement-based detection [180] that analyses agent groupings, and analyses of heterogeneous groups [123] with distinct normative systems.

4.2.3 Analysis of the Approaches

This section presents a classification of the context in which each proposed approach operates, as well as a comprehensive review of each of them.

Classification of the Reviewed Approaches

After reviewing the collected approaches, two major categories were identified based on their application context: Agent-Based and Not Agent-Based approaches. A method is considered Agent-Based if it identifies norms through the interactions of agents within an environment or through their communication with other individuals. The key characteristic of these approaches is the presence of actions (i.e., interactions) that facilitate the discovery of norms. In contrast, a method is classified as Not Agent-Based if it primarily relies on data analysis rather than interactive behaviours.
Each of these categories can be further divided into subcategories. Agent-Based approaches can be grouped into three subcategories: Observatory, Experiential, and Communicative. Observatory approaches rely on observing other agents (or traces of their actions) interacting in an environment. These methods are considered safer, since they do not involve direct experimentation that could lead to norm violations. However, they may struggle to identify prohibitions, particularly when all observed agents comply with the existing norms, leaving no violations to be detected. Experiential approaches operate by trial and error. This method is commonly found in behaviour learning techniques such as Reinforcement Learning. It is typically efficient and relatively simple to implement; however, unlike Observatory methods, it involves committing multiple norm violations before correctly identifying the normative behaviour. Communicative approaches rely on exchanging information with already-integrated agents. Like Observatory methods, they are relatively safe, but they tend to be the most complex to implement effectively, which limits their practical use.

Figure 4.4: Taxonomy of the identified approaches.

Methods that do not fall under the Agent-Based category are distinguished by the type of data they process. We identified three subcategories: Structured, Semi-Structured, and Unstructured data. Structured data consists of pre-encoded information, such as databases, where symbolic elements are already extracted and standardised. Approaches in this category typically apply pattern recognition techniques to identify norms. Semi-structured data includes documents that follow a standardised structure and contain recognisable keywords related to norms; examples include legal texts and contracts. These approaches are more challenging than those using structured data, but remain more manageable than those using unstructured data. Unstructured data encompasses free-form content, such as forum discussions and natural language documents. Because these data sources are unprocessed, extracting meaningful symbols for norm identification is significantly more complex. Among the collected approaches, none addressed audio or video data, even though these are potential media for norm identification. This classification is illustrated in Fig. 4.4.

Figure 4.5: Main area of the proposed approaches.

In addition to the taxonomy based on methodology, we also classified the approaches according to their primary research focus, see Fig. 4.5. We identified three main areas: Natural Language Processing (NLP), Data Mining, and Reasoning. Additionally, we introduced a Hybrid category for approaches combining at least two of these areas. NLP approaches focus on the semantic analysis of textual data to extract norms. Data Mining approaches analyse large datasets to identify patterns and infer norms. Reasoning approaches derive conclusions from limited data and refine their findings as more information becomes available. Hybrid approaches integrate elements from multiple research areas to enhance norm identification. Figures 4.4 and 4.5 illustrate the distribution of the methods across these categories.

Norm Detection and Identification

Several researchers have investigated techniques for detecting and identifying norms in MAS.
Mahmoud et al. [124] propose a Potential Norms Mining Algorithm (PNMA) that enables agents to identify prevailing norms through the observation of other agents' behaviours. Their approach allows an agent to revise its norms without requiring third-party enforcement mechanisms. The PNMA follows a structured process of data formatting, filtering, and extraction of potential norms from observed events. Building on this work, Mahmoud et al. [122] present the Potential Norms Detection Technique (PNDT), which facilitates agents' adaptation to changing environments through self-enforcement. The PNDT framework comprises an agent's belief base, an observation process, the PNMA algorithm, a verification process, and an updating process. Through simulations in an elevator scenario, they demonstrate how environmental variables affect norm detection success. Cranefield et al. [71] introduce a novel approach using Bayesian inference for norm identification. Their method effectively operates in scenarios where both compliance and violation occur regularly, calculating the odds of a candidate norm being established versus no norm existing. Empirical evaluation shows that norm-compliant behaviour can emerge after few observations. Oren and Meneguzzi [154] develop a norm identification mechanism based on plan recognition, combining parsing-based plan recognition with Hierarchical Task Network planning to infer prevailing norms. Their approach handles norm violations through counting and thresholding, without relying on observations of explicit sanctions. Sarathy et al. [178] propose a norm representation scheme incorporating context-specificity and uncertainty using Dempster-Shafer theory. Their algorithm learns norms from observation while considering different contexts and the inherent uncertainty of the learning process, allowing agents to adapt to changing contexts.

Norm Mining from Data

Several researchers have explored data mining techniques for extracting norms from various sources. Savarimuthu et al. [179] present an internal agent architecture for norm identification based on interaction observation. Their Obligation Norm Inference algorithm uses association rule mining to identify obligation norms. In related work, Savarimuthu et al. [181] focus on identifying prohibition norms, using a modified version of the WINEPI algorithm to generate candidate prohibition norms. Their framework considers social learning theory and distinguishes between candidate norms and identified norms. Savarimuthu et al. [180] further develop their architecture with the Candidate Norm Inference algorithm, which identifies sequences of events as candidate norms. Their approach enables agents to modify and remove norms if they change or no longer hold in the society, demonstrating the benefits of norm inference for utility maximisation. Avery et al. [30] introduce Norms Miner, a tool for extracting norms from open source software development bug reports. Their automated approach discovers, extracts, and classifies norms from textual social interactions, making tacit knowledge explicit and accessible. The tool achieves solid performance, with a recall of 0.74 and a precision of 0.73 in norm classification. Dam et al. [72] explore mining software repositories for social norms, presenting results on coding convention violations across large open source projects.
They propose a life-cycle model for norms within Open Source Software Development communities and demonstrate its applicability using data from the Python development community. Ferraro and Lam [79] apply Natural Language Processing techniques to normative mining from legal documents. They provide a comprehensive review of existing NLP techniques, particularly semantic parsing, and analyse their applicability to mining legal norms. The paper presents preliminary results on extracting normative rules using relation extraction and semantic parsing models. Gao and Singh [86] develop an approach for automatically extracting norms from contract text. Their prototype tool suite extracts norms and related concepts, evaluating the realism of normative models in MAS by assessing how effectively these concepts can be identified within contracts. Murali et al. [139] apply norm-mining techniques to a real-world dataset in international politics. They adapt a Bayesian norm mining mechanism to identify norms from bilateral sequences of inter-country events extracted from the GDELT database, demonstrating that a model combining probabilities and norms explains observed international events better than a purely probabilistic model.

Norm Assimilation and Adaptation

Several researchers have explored how agents can assimilate and adapt to norms in MAS. Mahmoud et al. [120] propose a technique for software agents to detect and assimilate norms in order to comply with local normative protocols. Their conceptual framework includes stages for a visitor agent to detect norms by analysing interaction patterns and matching them with a "norms model base". Mahmoud et al. [123] introduce a norm assimilation approach for MAS in heterogeneous communities. Their theoretical framework is based on an agent's internal belief about its ability to assimilate and its external belief about the assimilation cost associated with different social groups. They categorise assimilation decisions based on whether an agent "can assimilate", "could assimilate", or "cannot assimilate". Mahmoud et al. [121] focus on defining the semantics of a proposed norms mining technique. They explicitly define the semantics of the entities and processes involved in norms mining, drawing inspiration from existing work on norms, normative systems, and data mining. Mahmoud et al. [119] outline a conceptual approach for norms detection and assimilation, focusing on discovering norm emergence based on interaction patterns between agents. Their approach utilises a norms mining technique and proposes using a norms learning technique to define the semantics of textual data.

Norm Synthesis and Revision

Several researchers have investigated techniques for synthesising and revising norms in MAS. Morales et al. [135] introduce IRON (Intelligent Robust On-line Norm synthesis mechanism), which synthesises conflict-free norms without over-regulation. IRON produces norms that characterise necessary conditions for coordination and are both effective and necessary, with the capability to generalise norms into concise normative systems. Morales et al. [134] present an extended IRON mechanism designed for the online synthesis of compact normative systems. Their enhanced approach incorporates improved evaluation methods, a generalisation operator requiring sufficient evidence, and a specialisation operator for refining underperforming generalisations.
Empirical evaluation shows that IRON significantly outperforms BASE in terms of stability and compactness. Riad and Golpayegani [170] propose a utility-based norm synthesis model for managing norms in complex MAS with multiple, potentially conflicting objectives. Their approach employs utility-based case-based reasoning for run-time norm synthesis, using a utility function derived from system and agent objectives to guide norm adoption. Dell'Anna et al. [73] analyse the complexity of synthesising and revising conditional norms with deadlines. They demonstrate that synthesising a single conditional norm correctly classifying behavioural traces is NP-complete, as are synthesising sets of conditional norms and minimal norm revision. Christelis et al. [64] detail a first-order approach to norm synthesis, allowing for greater expressiveness through the use of variables. They propose optimisations to improve the performance of first-order norm synthesis, including a priori filtering, traversal pruning, repetitive operators, and duplicate runs. Morris-Martin et al. [138] propose an agent-directed norm synthesis framework that allows norms to be synthesised based on agent requests and interactions. Their approach involves individual agents in system governance, enabling revisions that benefit individual goals without conflicting with system-level objectives.

Norm Conflict Detection and Resolution

Aires et al. [5] focus on identifying potential conflicts between norms in contracts written in natural language. They develop a semi-automatic approach for identifying norms and their elements using information extraction techniques. Their tool helps prevent conflicts by comparing extracted norm information and classifying potential conflicts into types such as permission-prohibition, permission-obligation, and obligation-prohibition. Alechina et al. [18] address the problem of detecting norm violations in open MAS. They demonstrate that perfect or near-perfect norm monitoring and enforcement can be achieved at no cost to the system, proposing incentive-compatible mechanisms for decentralised norm monitoring in which the agents themselves perform the monitoring. Campos et al. [59] propose adding an "Assistance layer" to MAS to handle norm adaptation. They use a Case-Based Reasoning approach within this layer, enabling the system to learn from past experiences and adapt norms to achieve organisational goals, illustrated through a Peer-to-Peer sharing network scenario.

4.2.4 Takeaways

To conclude this section, we provide some takeaways that one should consider when choosing an approach for norm mining, as not all approaches share the same characteristics.

First, the selected approach may depend on the type of data that can be collected. Some approaches, such as those based on data mining, can perfectly learn from structured observations. On the other hand, when such data is not available, NLP-based approaches are suitable, either by dealing with unstructured data or by exchanging with agents. Due to the way π-NoCCHIO is meant to function, it may be better to avoid techniques relying on communications with agents. Similarly, approaches that rely on interactions within the environment might not be suitable. Reasoning techniques might be useful when it comes to inferring higher-level norms abstracting from local and personal behaviours, or deriving context-specific norms from very general norms (for example, "One should not cause harm" is very broad, as there are many ways to cause harm).
Then, the chosen method should also depend on the specificities of the observed normative system. One key point, in particular, is whether the agents have many personal norms. Some approaches cannot differentiate between those and the global norms, and for this reason are possibly not suitable for an application within the π-NoCCHIO architecture, since the type of agent with which it deals (mainly humans) may have many personal norms in addition to the global ones.

Finally, some features that the collected approaches may have are not relevant for π-NoCCHIO. For instance, being able to handle norm drifting is not relevant since, in our case, the norm mining approach is meant to be used only once; consequently, there cannot be norm drifting within the observation window. Similarly, mechanisms for conflict resolution are not relevant, as conflicts are handled anyway by π-NoCCHIO.

Building on this overview of existing techniques, the next section introduces the ARIA algorithm, which structures exceptions to norms as an argumentation framework to provide intelligible and explainable models.

¹ For example, "One should not cause harm" is very broad, as there are many ways to cause harm.

4.3 Argumentative Rule Induction

This section introduces the ARIA (Argumentative Rule Induction A-star) algorithm [11, 12] and indicates how it can be used to generate explanations. It describes how each component of the model works. First, it shows how the model would bring intelligibility to a black-box algorithm when deployed. Second, it details how it generates an explainable model with the help of some data, as well as a final explanation, by using the algorithms proposed by Fan and Toni [78] and Borg and Bex [51].

4.3.1 Computing a Justification

The goal of the approach is to justify the decision of a black-box algorithm (referred to in the rest of this section as a BB or agent). More specifically, we want to observe one action and why it was, or was not, chosen; for example, why the autonomous car applied the brakes or the accelerator. To do so, we introduce two core elements, as well as a running example using the Car dataset [49], to better illustrate how the approach works.

Universal Graph

The approach requires a dataset that describes the behaviour of an agent in various situations or states. Such a dataset should provide as input the perceptions of the BB algorithm, such as raw sensor information or preprocessed and potentially high-level data, and as a label, whether the tracked action has been performed or not (e.g., turn right=True). Once this dataset is available, the search process, described later in Section 4.3.4, can start. Once finished, this search process returns a graph, inspired by the structure of an argumentation framework, representing the overall relationships among the perceptions and how they can influence the final decision. We call this graph the Universal Graph, as it contains relations among all possible factors, which will serve to construct an explanation later on. In the implementation, the nodes of the graph consist of attribute-value tuples from the dataset inputs. However, it is possible to use expert knowledge or automated feature engineering, such as the one we developed [152], to generate more sophisticated attributes or values.
For example, instead of having a node with a very arbitrary value "size=3.5cm", an expert could design values such as "size=small", which would include anything with a size smaller than 5 cm.

Additionally, there are two other distinguished nodes. The first is called the target and is written τ. This argument represents the decision made by the agent. As will be detailed in the next section, if τ is part of the extension, it denotes the fact that the agent performed the tracked action; otherwise, the agent did not. The second additional node is written λ. Although this argument is not bound to any attribute-value tuple, it denotes a support relation to the target from the other nodes, as explained by Boella et al. [38]. As such, this node can only attack the target. Any other node attacking it will then be considered as supporting the target, due to the defence relationship created between them.²

² Note that this differs from the support relation as it is defined in Bipolar Argumentation Frameworks [212, 211].

Contextual Graph

Once in possession of the Universal Graph, one can then compute what is called the Contextual Graph. This graph is a projection of the Universal Graph given a set of facts, i.e., a set of perceptions from the agent in a specific situation.³ As a consequence, it filters out all the nodes whose value is not part of the facts, except for τ and λ, which are always in the Contextual Graph. The resulting graph is then a useful Argumentation Framework for representing the belief state of an agent.

It is now possible to compute an extension based on this Contextual Graph. While in theory any extension would be usable with some adjustments, we chose to use the grounded extension, as it features two advantages over the others. First, it is unique, and as such less ambiguous than a set of extensions when it comes to providing an explanation to the end user. Second, it can be computed in polynomial time,⁴ which is an interesting feature for shortening the search phase. Then, if the target τ is part of the extension, and if this matches the decision that the BB made (i.e., doing the tracked action or not), it can serve to justify this decision by potentially providing an explanation based on this graph. This includes the cases where the BB took the wrong decision, as long as the BB output matches the graph output. There exist several ways to provide an explanation from an argumentation framework [110, 78, 75]. In this case, we chose the one described by Borg and Bex [51] and Fan and Toni [78], presented in the following subsection.

³ This may also correspond to the input of one entry of the BB's behavioural dataset.
⁴ While the preferred extension also shares a polynomial complexity, it still requires a greater time to be computed.
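Before turning to explanation extraction, the following minimal sketch shows one way the grounded extension of a contextual graph might be computed. It is our own illustration of Dung-style grounded semantics via iteration of the characteristic function, not the thesis implementation, and the argument names are hypothetical.

```python
from typing import Dict, Set, Tuple

def grounded_extension(args: Set[str], attacks: Set[Tuple[str, str]]) -> Set[str]:
    """Iterate Dung's characteristic function from the empty set until a
    fixed point: an argument is accepted when every one of its attackers
    is itself attacked by an already-accepted argument."""
    attackers: Dict[str, Set[str]] = {a: set() for a in args}
    for (src, tgt) in attacks:
        attackers[tgt].add(src)

    extension: Set[str] = set()
    while True:
        defended = {
            a for a in args
            if all(attackers[b] & extension for b in attackers[a])
        }
        if defended == extension:
            return extension
        extension = defended

# Toy contextual graph: lambda attacks the target; one fact attacks lambda,
# thereby supporting the target.
args = {"tau", "lambda", "buying_cost=med"}
attacks = {("lambda", "tau"), ("buying_cost=med", "lambda")}
print(grounded_extension(args, attacks))  # {'tau', 'buying_cost=med'}
```

The unattacked fact is accepted first, which defeats λ and thereby defends τ, mirroring the defence-through-attack reading of the λ node described above.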
Explanation Extraction

The approach introduced by the work of Fan and Toni [78] is characterised by finding all related admissible extensions of an argumentation framework. That is:

Definition 12 (Related admissibility). If ⟨Args, R⟩ is an argumentation framework, then any subset S ⊆ Args is a related admissible extension iff S is an admissible extension and there exists x ∈ S s.t. S defends x. Any such x is called a topic of S.

Such related admissible extensions correspond to explanations of their topics: given an accepted argument, the set of related admissible extensions having that accepted argument as their topic corresponds to the set of explanations, in the sense of being justifications [96], for the acceptance of that argument. Fan and Toni [78] provide a method for computing these related admissible extensions by procedurally constructing them as dispute trees that have the topic, i.e., the argument whose acceptance is to be explained, as the root.

Borg and Bex [51] define a polynomial algorithm that may be used to compute the related admissible extensions of a given argumentation framework. Specifically, for any ⟨Args, R⟩, let $DefBy(A) = \{B \in Args \mid B \text{ defends } A\}$ for $A, B \in Args$, and $DefBy(A, S) = DefBy(A) \cap S$ for any extension $S \subseteq Args$ under some semantics. Borg and Bex [51] then define a procedure to calculate these sets $DefBy(A)$, allowing the Fan and Toni [78] approach to be computed by taking all admissible extensions $S_{adm_i}$ of an argumentation framework and calculating $DefBy(A, S_{adm_i})$ to find all related admissible extensions for a topic A, i.e., the accepted argument that is to be explained.

In the context of this work, the base argumentation frameworks correspond to the contextual graphs described above. Depending on whether the τ argument is accepted or not, the explanation can be calculated using the DefBy(·) procedure. Say that the τ argument is accepted; then computing DefBy(τ, S) for some extension S of the contextual graph provides a justification for τ under those semantics. Alternatively, if the target is rejected, we may use the procedures defined by Borg and Bex [51] to compute the set NotDef(A, S), which returns the set of all, direct and indirect, attackers of an argument A for which there is no defence in the extension S. This may be used to provide a justification, i.e., the explanatory set, for the non-acceptance of A under a semantics. In the case of the grounded extension, since it is unique, S may be the unique grounded extension used to calculate the explanatory set NotDef(A, S). For the general case where there may be multiple extensions, then under a semantics sem providing the set of extensions $S_i$ for some $0 < i \leq n$, the explanatory set

$$NotAcc_{sem}(S) = \bigcup_{i=1}^{n} NotDef(A, S_i)$$

provides a full justification set for the non-acceptance of A under that semantics.
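To make the DefBy(·) computation more concrete, here is a simplified sketch under the grounded semantics. It approximates indirect defence by iterating direct defence (B defends A if B attacks an attacker of A), which may differ from the exact Borg and Bex procedure; it reuses the attacker-indexing idea from the earlier sketch, and the names are ours.

```python
from typing import Dict, Set, Tuple

def def_by(topic: str, args: Set[str], attacks: Set[Tuple[str, str]],
           extension: Set[str]) -> Set[str]:
    """Collect the extension members that (directly or, by iteration,
    indirectly) defend `topic`, plus the topic itself."""
    attackers: Dict[str, Set[str]] = {a: set() for a in args}
    for src, tgt in attacks:
        attackers[tgt].add(src)

    justification = {topic}
    changed = True
    while changed:
        changed = False
        for a in list(justification):
            for attacker in attackers[a]:
                for defender in extension:
                    # An extension member counter-attacking one of a's
                    # attackers defends a, and may itself need defenders.
                    if defender in attackers[attacker] and defender not in justification:
                        justification.add(defender)
                        changed = True
    return justification
```

On the toy graph from the previous sketch, `def_by("tau", args, attacks, {"tau", "buying_cost=med"})` would return both the target and its defender, matching the shape of the justification sets reported below.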
Example

In order to make things clearer, we recall the example from Alcaraz et al. [11] using the Car dataset [49]. Fig. 4.6 represents a graph that has been inferred with the approach from the aforementioned dataset. The task is to evaluate the buying acceptability of a car based on several attributes and their values, such as the number of seats, the buying cost, the maintenance cost, etc. There are four classes, namely "Unacceptable", "Acceptable", "Good", and "Very good". In this specific case, the target argument is associated with the class "Very good", i.e., "The car has a very good buying acceptability". As such, when the target is not in the extension, the car can be classified as "Unacceptable", "Acceptable", or "Good".

Figure 4.6: Universal graph for the Car dataset. The τ argument is in green. The λ argument is in red.

Fig. 4.6 shows the Universal Graph and thus gives an overview of how each argument (i.e., attribute-value couple) can influence the final decision. However, it is not a justification yet. Fig. 4.7 shows the Contextual Graph derived from the Universal Graph shown in Fig. 4.6; it corresponds to the set of facts presented in Table 4.2, i.e., the values assigned to the different attributes.

Table 4.2: Summary of the facts, i.e., the values for the different attributes, for the example.

    Attribute               Value
    Buying cost             Medium
    Maintenance cost        Low
    Number of doors         3
    Number of seats         5-or-more
    Luggage boot            Medium
    Safety                  High
    Acceptability (Label)   Very-good

Figure 4.7: Contextual graph for the Car dataset and a specific input. The target argument is in green. The λ argument is in red.

In this example, the car has been classified as "Very good" in terms of buying acceptability. From the Contextual Graph, we can see the elements which led to this decision. For example, high safety alone was not sufficient: a medium luggage boot was also required. On the other hand, if the car did not have at least 5 seats, it would not have been classified as "Very good", since the number of doors was 3. Similarly, the medium buying cost supports the classification by attacking the λ argument.

Applying the explanation procedure outlined in Section 4.3.1, using the procedure from Borg and Bex [51], we obtain the set:

DefBy(acceptability=vgood) = {acceptability=vgood, buying cost=med, lug boot=med, nb persons=5-or-more}

An admissible extension S_adm of the contextual graph above is:

S_adm = {acceptability=vgood, buying cost=med, lug boot=med, nb persons=5-or-more}

So DefBy(acceptability=vgood, S_adm) = DefBy(acceptability=vgood) ∩ S_adm = S_adm. This set of arguments then provides a justification set for the topic corresponding to the target acceptability=vgood. Therefore, this explanation may be interpreted as follows: whether a car is classified as very good is primarily influenced by its price (i.e., buying cost), and by the luggage space if the number of seats is five or more.

4.3.2 n-arguments

Let the set of arguments corresponding to the presence of an attribute-value couple in the facts be called "positive arguments" (or p-arguments). It is then possible to define "negative arguments" (or n-arguments). In opposition to p-arguments, which appear in the contextual graph only if the facts match their condition, n-arguments are present in the contextual graph if their condition is not present in the facts; they denote a missing fact. While using only the set of p-arguments makes the representation of certain logical formulae impossible, the addition of the n-arguments allows one to represent a broader range of formulae. However, adding these extra arguments increases the number of neighbours for each node and, as such, increases the total computation time.

More formally, let P be a non-empty set of propositional atoms. In this application, each propositional atom represents a possible valuation of an attribute, e.g., q := "buying cost=Medium" ∈ P.
We define the language L_P as the set of well-formed formulas (wff), with the following BNF grammar, for any p ∈ P:

$$\varphi := \bot \mid p \mid \neg\varphi \mid \varphi \vee \psi$$

As usual, we use the following notation shortcuts:

• ⊤ := ¬⊥,
• φ ∧ ψ := ¬(¬φ ∨ ¬ψ),
• φ ⇒ ψ := ¬φ ∨ ψ

An interpretation model I over a valuation V : P → {⊤, ⊥} is given by the function I : L_P → {⊥, ⊤} s.t. ∀φ ∈ L_P, I ⊨ φ iff I(φ) = ⊤, where:

1. ∀p ∈ P, I ⊨ p iff V(p) = ⊤
2. ∀φ, ψ ∈ L_P, I ⊨ φ ∨ ψ iff I ⊨ φ or I ⊨ ψ
3. ∀φ ∈ L_P, I ⊨ ¬φ iff I ⊭ φ

Additionally, we will denote by P̄ := {p̄ : p ∈ P} the set of symbols that correspond to the negations of the propositional atoms. Henceforth, we call a class of interpretation models based on a set of valuations Ω ⊆ {⊤, ⊥}^P a propositional dataset. We note that ∀φ ∈ L_P, Ω ⊨ φ iff for all V ∈ Ω, the interpretation model I_V over V is s.t. I_V ⊨ φ.

Example 2. The following example aims at showing a situation that cannot be represented solely using p-arguments and would require the addition of n-arguments. We represent an argument as a couple composed of a name A and a condition, represented as a propositional formula of L_P. Let Args = {(τ, ⊤), (λ, ⊤), (A, a), (B, b)}, with a, b ∈ P, be the set of arguments, and let F ∈ Ω be the set of facts. We describe a situation in which τ should be part of the grounded extension, denoted In(τ), under the following condition:

$$In(\tau) \rightarrow ((a \in F \wedge b \notin F) \vee (a \notin F \wedge b \in F))$$

Without duplicating the arguments, it is not possible to represent such a condition. Now, let us add to the current set of arguments the n-arguments ā, b̄ ∈ P̄, such that Args = {(τ, ⊤), (λ, ⊤), (A, a), (B, b), (C, ā), (D, b̄)}. It is now possible to construct an argumentation framework with the set of arguments Args = {τ, λ, A, B, C} and the attack relation Atts = {(λ, τ), (A, λ), (B, A), (C, B)}. This argumentation framework can correctly represent the fact that τ should be part of the grounded extension if a is in the facts but not b, or if b is in the facts but not a.

4.3.3 Bipolar Argumentation

In addition to the two variants proposed in Alcaraz et al. [11], we present a variant making use of bipolar argumentation, proposed in Alcaraz et al. [12].

The framework

Within classical abstract argumentation frameworks [76], arguments are considered as abstract entities that can attack each other. However, in real-world debates and reasoning, we often find not only attacks but also supports between arguments. The following definition introduces the Bipolar Argumentation Framework (BAF), based on the definitions of Cayrol and Lagasquie-Schiex [62] and Yu and Torre [212].

Definition 13. We call Bipolar Argumentation Framework (BAF) a tuple ⟨Args, Atts, Sup⟩ where:

• Args is a non-empty set of arguments
• Atts ⊆ Args × Args is a binary relation on Args called the attack relation
• Sup ⊆ Args × Args is a binary relation on Args called the support relation

Let ⟨Args, Atts, Sup⟩ be a BAF. We define the notions of supported attack and indirect attack. A supported attack occurs when an argument A supports another argument B, and B attacks a third argument C. Even though A does not directly attack C, it indirectly contributes to that attack by reinforcing B.
So, A is considered to supported-attack C.

Definition 14 (Supported attack). A ∈ Args supports the attack on an argument B ∈ Args iff there exists (A₁, …, Aₙ) ∈ Argsⁿ s.t. A₁ = A, Aₙ = B, A₁ Sup A₂, …, Aₙ₋₂ Sup Aₙ₋₁, and Aₙ₋₁ Atts Aₙ.

An indirect attack occurs when, for arguments A, B, and C, A attacks B, and B supports C. Here, A undermines the support to C, and thus indirectly attacks C.

Definition 15 (Indirect attack). A ∈ Args indirectly attacks B ∈ Args iff there exists (A₁, …, Aₙ) ∈ Argsⁿ s.t. A₁ = A, Aₙ = B, A₁ Atts A₂, and A₂ Sup A₃, …, Aₙ₋₁ Sup Aₙ.

In everyday reasoning and debate, arguments not only attack one another (e.g., by contradiction) but also support each other (e.g., by reinforcing a shared conclusion). Bipolar Argumentation Frameworks (BAFs) capture this dual nature by allowing two types of relationships between arguments: attack and support. To reason about groups of arguments and their interactions, we introduce the notion of set-based relations, that is, how a set of arguments can attack, support, or defend another argument. This extends classical argumentation, where only pairwise attacks are considered. A set of arguments can set-attack another argument not only through direct attack but also via indirect or support-based pathways. A set can also defend an argument by countering all its attackers (in this extended sense). These set-based notions express a way to model how coalitions of arguments interact, reinforce, and counteract each other in structured debates. What follows are the formal definitions of these concepts: set-attack, set-support, and defence (also called acceptability).

Definition 16 (Sets of Arguments and relations).

• S ⊆ Args set-attacks B ∈ Args iff ∃A ∈ S s.t. A supports the defeat of B (i.e., supported-attacks B), or A indirectly attacks B, or A Atts B
• S ⊆ Args defends A ∈ Args (or A is acceptable with respect to S) if, and only if, ∀B ∈ Args, if {B} set-attacks A, then ∃C ∈ S s.t. {C} set-attacks B

In a BAF, the notion of conflict-freeness stays the same, i.e.:

Definition 17 (Conflict-free). S ⊆ Args is a conflict-free set of arguments if, and only if, ∄A, B ∈ S s.t. {A} set-attacks B.
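To make Definitions 14-16 concrete, here is a minimal sketch of a BAF with checks for supported, indirect, and set attacks. It is our own illustration with hypothetical argument names, not the thesis implementation; note that with a support chain of length zero a supported attack degenerates into a direct attack, as Definition 14 allows.

```python
from typing import Set, Tuple

class BAF:
    """A Bipolar Argumentation Framework with attack and support relations."""

    def __init__(self, args: Set[str], atts: Set[Tuple[str, str]],
                 sup: Set[Tuple[str, str]]):
        self.args, self.atts, self.sup = args, atts, sup

    def _sup_reachable(self, start: str) -> Set[str]:
        """Arguments reachable from `start` via support edges (incl. start)."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for (src, tgt) in self.sup:
                if src == node and tgt not in seen:
                    seen.add(tgt)
                    stack.append(tgt)
        return seen

    def supported_attack(self, a: str, b: str) -> bool:
        """Def. 14: a support chain from `a` ends in an attacker of `b`."""
        return any((x, b) in self.atts for x in self._sup_reachable(a))

    def indirect_attack(self, a: str, b: str) -> bool:
        """Def. 15: `a` attacks an argument from which supports reach `b`."""
        return any((a, x) in self.atts and x != b and b in self._sup_reachable(x)
                   for x in self.args)

    def set_attacks(self, s: Set[str], b: str) -> bool:
        """Def. 16: some member of `s` reaches `b` by any attack form."""
        return any((a, b) in self.atts or self.supported_attack(a, b)
                   or self.indirect_attack(a, b) for a in s)

baf = BAF(args={"a", "b", "c"}, atts={("b", "c")}, sup={("a", "b")})
print(baf.supported_attack("a", "c"))  # True: a supports b, and b attacks c
print(baf.set_attacks({"a"}, "c"))     # True
```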
To evaluate which arguments are justified, we extend the idea of grounded semantics from Dung's framework to this richer setting by considering the definitions provided in Yu and Torre [212]. Grounded semantics identifies the most cautious (sceptical) set of arguments that can be accepted based on their ability to defend themselves against attacks, while considering both direct and derived (e.g., supported or indirect) attacks. The grounded extension is the smallest (least fixed point) set of arguments that:

• do not attack each other
• defend themselves against all set-attacks from outside the set
• are supported (possibly indirectly) by arguments within the set.

Based on the definition provided in Yu and Torre [212], we define the grounded extension for bipolar argumentation as follows:

Definition 18 (Grounded extension). Let ⟨Args, Atts, Sup⟩ be a Bipolar Argumentation Framework. We define a characteristic function F : 2^Args → 2^Args, where:

$$F(S) = \{\, A \in Args \mid \text{every set that set-attacks } A \text{ is itself set-attacked by some } C \in S \,\}$$

The grounded extension is the least fixed point of F, i.e., the smallest set S ⊆ Args s.t. F(S) = S.

This set contains the arguments that are initially acceptable (not attacked), and those that become acceptable step by step, as they are defended by already accepted ones. Taking into account these previous definitions, we now provide an algorithm which adapts the approach in Alcaraz et al. [11] to BAFs.

Algorithm for bipolar argumentation

In this variant, a new "Support" relation is added. Furthermore, we adapt the grounded extension computation to incorporate the support relation in Algorithm 5. In this algorithm, an argument is part of the extension if all its attackers are labelled out, or if it has at least one direct supporter which is labelled in or sup (i.e., supported).

Algorithm 5 Computes the Bipolar Argumentation Framework's extension
Require: Facts, Args
Ensure: Extension ⊆ Args
 1: Lab := Args → {in, out, undec, sup, must_sup}   ▷ Initialise all the arguments
 2: for x ∈ Args do
 3:   if x ∉ Facts then
 4:     Lab(x) := out
 5:   else
 6:     Lab(x) := undec
 7:   end if
 8: end for
 9: while ∃x ∈ Args s.t. (Lab(x) = undec ∧ ∀a ∈ {x}⁻. Lab(a) = out) ∨ Lab(x) = must_sup do
10:   if Lab(x) = must_sup then
11:     Lab(x) := sup
12:   else
13:     Lab(x) := in
14:   end if
15:   for y ∈ {x}⁺ do   ▷ Arguments attacked by x
16:     if Lab(y) ∉ {sup, must_sup} then
17:       Lab(y) := out
18:     end if
19:   end for
20:   for y ∈ {x}⁺_S do   ▷ Arguments supported by x
21:     Lab(y) := must_sup
22:   end for
23: end while
24: return Extension := {x ∈ Args | Lab(x) ∈ {in, sup}}

For example, in Fig. 4.8, the grounded labelling according to Yu et al. [211] would be: a labelled in, b labelled sup, and c and d labelled undec. If the facts contain only {a, b, c}, we would have a and c labelled in, b labelled sup, and d labelled out. All the arguments labelled in or sup are then considered part of the extension.

Figure 4.8: Example of a bipolar argumentation framework. Solid edges denote attacks, and dashed edges denote supports.

In order to adapt the justifications obtained by the methodology outlined in Section 4.3.1, the definition and associated procedure for computing the set DefBy(·) must be adapted to incorporate the support relation of bipolar argumentation frameworks. More specifically, we use the notion of defended₂ introduced by Yu et al. [211].

Definition 19 (Defended₂₋₃). Let F = ⟨Args, Atts, Sup⟩ be a BAF and Ext ⊆ Args be an extension.

• The set of arguments defended₂ by the extension Ext, written d₂(F, Ext), is the set of arguments A ∈ Args s.t. for all B Atts A with B ∈ Args, there exists an argument C ∈ Ext s.t. C Atts B, and there is an argument D ∈ Ext s.t. D Sup C (supported defence).
• The set of arguments defended₃ by the extension Ext, written d₃(F, Ext), is the set of arguments A ∈ Args s.t. for all B Atts A with B ∈ Args, there exists an argument C ∈ Ext s.t. C Atts B, and for all arguments D ∈ Args s.t. D Sup B, there exists an argument E ∈ Ext s.t. E Atts D (attacking defence).

Using these definitions, the supports can now be integrated into Definition 12, i.e., related admissibility, and can as such be part of an explanation.
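The following sketch transcribes Algorithm 5 into Python, assuming acyclic graphs as the thesis does; the function and variable names are ours.

```python
from typing import Dict, Set, Tuple

def bipolar_extension(facts: Set[str], args: Set[str],
                      atts: Set[Tuple[str, str]],
                      sup: Set[Tuple[str, str]]) -> Set[str]:
    """Labelling loop of Algorithm 5: the extension collects the arguments
    that end up labelled `in` or `sup`."""
    lab: Dict[str, str] = {x: ("undec" if x in facts else "out") for x in args}

    def attackers(x):  # {x}^- in the pseudocode
        return {a for (a, b) in atts if b == x}

    while True:
        pick = next((x for x in args
                     if lab[x] == "must_sup"
                     or (lab[x] == "undec"
                         and all(lab[a] == "out" for a in attackers(x)))), None)
        if pick is None:
            break
        lab[pick] = "sup" if lab[pick] == "must_sup" else "in"
        for (a, b) in atts:               # arguments attacked by pick
            if a == pick and lab[b] not in ("sup", "must_sup"):
                lab[b] = "out"
        for (a, b) in sup:                # arguments supported by pick
            if a == pick:
                lab[b] = "must_sup"
    return {x for x in args if lab[x] in ("in", "sup")}

ext = bipolar_extension(
    facts={"tau", "lambda", "A", "B"},
    args={"tau", "lambda", "A", "B"},
    atts={("lambda", "tau"), ("B", "lambda")},
    sup={("A", "B")},
)
print(ext)  # {'tau', 'A', 'B'}: B enters as `sup` through A's support
```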
4.3.4 Search Method

As previously said, the algorithm performs an A* search to generate the final argumentation framework; this search is described in Algorithm 6. The main principle is to explore a search space by walking from neighbour to neighbour, following a heuristic that maximises the prediction accuracy. In the implementation, given a set of attributes Att and a set of available values Val_i for an attribute i, we define the set of all the arguments which can be represented by the dataset attributes and values as:

$$Args := \{\tau, \lambda\} \cup \{\, q \mid i \in Att,\ j \in Val_i,\ q := \text{``}Att_i = Val_{ij}\text{''} \,\}$$

However, doing this with attributes featuring continuous values could generate too many meaningless arguments. To avoid this, we segment them into intervals. As such, an attribute φ with a value ranging from 0 to 10 would generate the arguments φ=0-2, φ=2.1-4, ..., with an interval size depending on the number of segments we wish to generate. Although this is convenient for quickly parsing a dataset, we recommend the use of expert knowledge to design arguments based on these numerical values, such as φ=Above the average.

We can then encode the attack relation as a matrix Atts of size |Args| × |Args|, where the element Atts_ij is equal to 1 if for A_i, A_j ∈ Args there is an attack from A_i to A_j, 0 if not, and −1 if the attack is disallowed (i.e., it would create a reflexive attack, a symmetric attack⁵, or the attacker would be the τ or λ argument). If the bipolar version of the algorithm is used, the value 2 denotes a support. We forbid attacks between two arguments instantiated from the same attribute, as they would not be able to be part of the set of facts at the same time; however, if the dataset used is multivalued, this restriction can be removed. We then define two nodes (i.e., solutions) as neighbours if they differ by one value in this matrix, that is, if their associated graphs differ by one attack (or support).

⁵ As the choice of the grounded extension, made to guarantee the uniqueness of the explanation, is an undesirable semantics for graphs containing cycles, we disallow them. However, it is possible to allow them if more sophisticated extensions are used.

Algorithm 6 A* Search Algorithm
Require: MaxIterations
 1: iteration ← 0
 2: queue ← {}
 3: bestNode ← getStartingNode()
 4: node ← getStartingNode()
 5: queue.add(node)
 6: while ¬Empty(queue) and GetAcc(node) < 100 and iteration < MaxIterations do
 7:   iteration ← iteration + 1
 8:   neighboursList ← getNeighbours(node)
 9:   for neighbour in neighboursList do
10:     if ¬Visited(neighbour) then
11:       queue.add(neighbour)
12:       Visited(neighbour) ← ⊤
13:     end if
14:   end for
15:   node ← getNextPrioritaryNode(queue)
16:   queue.remove(node)
17:   if GetAcc(node) > GetAcc(bestNode) then
18:     bestNode ← node
19:   end if
20: end while
21: return bestNode

Additionally, we define a heuristic h(x) for a given solution x, equal to the sum of the incorrect predictions over the training data, plus a small fraction corresponding to the number of attacks in the graph, noted x_R, divided by |Args|², where Args is the set of all the arguments present in the dataset; more formally:

$$h(x) = \frac{|x_R|}{|Args|^2} + \sum_{i \in data} \begin{cases} 1, & \text{if } pred_i(x) \neq label_i \\ 0, & \text{otherwise} \end{cases}$$

Solutions should minimise the value of this function. That way, a solution making fewer errors than another one is systematically preferred, and in case of an equal error count, the one presenting the fewest attacks (and supports, in the case of bipolar argumentation) is preferred.
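A minimal sketch of this heuristic follows; it assumes the per-example predictions have already been computed by evaluating the candidate graph (the names are ours).

```python
from typing import List

def heuristic(num_attacks: int, num_args: int,
              predictions: List[bool], labels: List[bool]) -> float:
    """h(x): the misclassification count plus a tie-breaking fraction that
    favours sparser graphs (fewer attacks/supports)."""
    errors = sum(1 for p, l in zip(predictions, labels) if p != l)
    return errors + num_attacks / (num_args ** 2)
```

Because the fractional term stays strictly below 1 whenever the graph has fewer than |Args|² attacks, it can only break ties between solutions with equal error counts, which is exactly the preference order described above.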
In order to build the graph, we follow a bottom-up tactic. The starting node has a graph composed of a single argument, the target. An attack can only be added if there exists a path from the attacking argument to the target using this new attack. That way, as the search progresses, arguments and attacks are added incrementally and remain connected to the target. In order to reduce the exploration space as much as possible, we ensure that a solution is not visited twice by computing and saving an associated hash; we then compare the candidate neighbours' hashes to the ones already saved and add them to the queue only if they have not already been visited. Furthermore, if the addition of an attack had no effect on the correctness of the predictions compared to the solution without this attack, we terminate exploration of that branch. This greatly reduces the time required to explore the solution space.

4.4 Evaluation

In this section, we evaluate the algorithm proposed in Section 4.3, as well as its variants, and then discuss the results obtained.

4.4.1 Datasets

In order to evaluate the approach both quantitatively and qualitatively, we used several datasets from the literature. Table 4.3 summarises each dataset's size and attribute types and provides a short description of its content. The Voting, Breast Cancer Wisconsin (BCW), Heart Disease Cleveland (HDC), Iris, Wine, and Thyroid datasets can be found online at the UCI Machine Learning Repository [140].⁶ We make the classification task for the Iris dataset a binary choice by grouping the three classes as follows: the label is true if the class is "Iris-virginica", and false otherwise. We performed a similar operation for the Wine dataset, with the label being true if the class is "1", and false otherwise.

Furthermore, we used data from the Moral Machine experiment [31]⁷ to create a custom dataset. The original data contain the decision of a human annotator in an autonomous car accident scenario. The user has to choose between two outcomes, each of them causing harm to different individuals. The number and type of the individuals (adult, kid, animal, etc.), as well as their legal status (legal crossing, illegal crossing, or no legality involved), may vary. The dataset, named MM-complete, is a preprocessed subset of the original Moral Machine data, where some attributes are either removed or grouped together (for instance, the category "person" groups all the human individuals). The attributes express, using the values "same", "less", and "more", the difference between the two outcomes for each attribute, using the first outcome as the reference. For example, if outcome n°2 hurts more animals than outcome n°1, then the attribute "animal" has the value "more". The dataset also indicates the legality of the crossing using the values "yes", "no", and "unspecified", for the individuals saved in outcome n°1 (via the attribute "legal") and the ones saved in outcome n°2 (via the attribute "legal alt").

⁶ UCI Machine Learning Repository: https://archive.ics.uci.edu/.
⁷ Moral Machine experiment website: https://www.moralmachine.net/.
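As an illustration of this "same/less/more" delta encoding, the following hedged sketch derives one such attribute from raw per-outcome counts; the function name and interface are hypothetical, not taken from the thesis tooling.

```python
def delta_attribute(count_outcome1: int, count_outcome2: int) -> str:
    """Encode the difference for one attribute (e.g., "animal"), using
    outcome 1 as the reference: outcome 2 harms "more", "less", or the
    "same" number of individuals of this type."""
    if count_outcome2 > count_outcome1:
        return "more"
    if count_outcome2 < count_outcome1:
        return "less"
    return "same"

# Outcome 2 hurts three animals while outcome 1 hurts one: "animal" = "more".
print(delta_attribute(1, 3))  # more
```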
4.4.2 Quantitative Evaluation

Table 4.4 shows the accuracies obtained, averaged over 10 runs, by the comparison baseline, the Support Vector Machine (SVM) classifier from the Python Scikit library [68], as well as by several variants of the algorithm (namely the ARIA, n-ARIA, and Bipolar variants). The results were collected after learning for 100 iterations on the training data (70% of the data) and measuring the accuracy over the test data (the remaining 30%). The dataset was reshuffled between each run of the same algorithm. The attributes with continuous values were segmented into 6 intervals. The experiments were run on a 14th Gen Intel(R) Core(TM) i7-14700HX 2.10 GHz processor running at a 2.60 GHz clock speed, with 8 physical cores and 16 logical cores, and with 32 GB of memory.

Table 4.3: Dataset name, size, and characteristics. Column "Att. Types" indicates how many attributes are continuous (c) or nominal (n). Column "Arg." indicates the number of attribute-value couples (i.e., arguments) constructed with a segmentation into 6 intervals for the continuous attributes.

    Name           Size  Att. Types  Arg.  Description
    Voting [66]    435   16n         34    congressional voting records
    BCW [207]      699   9c          56    medical diagnosis
    HDC [97]       303   13c         80    medical diagnosis
    Iris [80]      150   4c          26    flower classification
    Wine [4]       178   13c         80    wine type classification
    Thyroid [52]   383   1c 15n      65    medical diagnosis
    MM-delta [12]  200   5n          17    ethical decision making

Table 4.4: Accuracy of the baseline (the SVM classifier from the Python Scikit library) and the variants of ARIA on several datasets. The best value for each dataset is bolded, and the runner-up underlined.

    Dataset        SVM          ARIA         n-ARIA       Bipolar
    Voting [66]    95.6 ± 1.5   95.7 ± 1.9   -            95.9 ± 3.2
    BCW [207]      95.1 ± 0.9   94.4 ± 1.2   95.6 ± 1.4   94.3 ± 1.4
    HDC [97]       57.8 ± 2.4   79.3 ± 3.0   78.7 ± 2.4   70.7 ± 12.5
    Iris [80]      96.0 ± 2.6   93.3 ± 1.5   90.9 ± 3.6   93.5 ± 2.6
    Wine [4]       99.1 ± 1.2   91.7 ± 6.4   92.4 ± 5.7   96.7 ± 7.1
    Thyroid [52]   93.5 ± 1.8   94.3 ± 0.5   93.1 ± 1.7   97.0 ± 3.3

As we can see, ARIA and its variants are competitive on most of the datasets, reaching accuracies close to or greater than those obtained by the baseline. The average of the accuracies is 91.5% for ARIA, 91.1% for n-ARIA⁸, and 91.4% for Bipolar; these values are very close. As the standard deviations of the accuracies consistently overlap for each of the datasets, it is not possible to confidently determine which of the approaches is the most reliable. Interestingly, Bipolar seems to perform much better than the other approaches on the Thyroid dataset, and poorly on the Heart Disease Cleveland dataset. Although we cannot explain the former, the latter seems to be mainly due to overfitting, as we could observe a drop in the accuracy on the test set after a certain number of iterations. On the other hand, while the average runtime of n-ARIA was already significantly higher than that of ARIA, it is even longer for Bipolar, which sometimes needs three times as long as n-ARIA to reach the iteration limit.

⁸ For the Voting dataset, the accuracy of ARIA is used.
4.4.3 Qualitative Evaluation of the Generated Explanations

This section aims at evaluating the relevance of the explanations provided for a given scenario when using the algorithm. More specifically, we focus on the base variant of ARIA and the Bipolar variant. Fig. 4.9 shows the universal graph obtained after learning the MM-delta dataset with ARIA for choosing outcome n°1 or not; the τ argument is in green. Fig. 4.10 shows the graph obtained by the Bipolar variant; dashed arrows represent a support from one argument to another.

Figure 4.9: Universal graph for MM-delta.
Figure 4.10: Bipolar universal graph for MM-delta.

Both graphs achieve the same accuracy of 98.4% on the same test dataset. For both, the target argument is "saved=yes", which means that outcome n°1 was chosen. Note that choosing outcome n°1 means not choosing outcome n°2, and conversely, not choosing outcome n°1 means choosing outcome n°2.

Scenario n°1

In the scenario represented in Fig. 4.11, the agent has the choice between outcome n°1, saving two persons who were crossing legally, and outcome n°2, saving five persons who were crossing illegally. Given the set of facts represented in Table 4.5, both graphs have in their extension the argument "saved=yes", meaning that they justify the choice of outcome n°1. When applying the explanation generation algorithm defined in Section 4.3.1 to the graph generated by the base variant of ARIA (see Fig. 4.9), one possible explanation generated is DefBy(saved=yes) = {saved=yes, legal=yes, kids=same}. This explanation may be interpreted as follows: despite the fact that some elements were favourable to the choice of outcome n°2 (such as the number of people hurt), the legal aspect of the situation is in favour of the people spared in outcome n°1, and there is no difference in the number of children involved to alter this decision.

Table 4.5: Summary of the facts for scenario n°1.

    Attribute   Value
    Persons     less
    Kids        same
    Animals     same
    Legal       yes
    Legal alt   no

On the other hand, when applying the explanation generation to the graph generated using bipolar argumentation (see Fig. 4.10), DefBy(saved=yes) = {saved=yes, legal=yes}. The explanatory set contains one less element. However, as one can see, the structures of the graphs are different, and in the case of the Bipolar graph, nothing can override the fact that the individuals spared in outcome n°1 were crossing legally. We believe this is a limitation of the explanation generation process with bipolar argumentation: it does not seem to communicate any difference in the meaning conveyed by a defence through an attack and a defence through a support.

Figure 4.11: First scenario. Outcome n°1 is on the left, and outcome n°2 on the right.

Scenario n°2

In this scenario, represented in Fig. 4.12, the agent has the choice between outcome n°1, saving two fewer kids, and outcome n°2, saving two more kids, but where the individuals who perished were crossing legally. The facts for this scenario are summarised in Table 4.6. In this scenario, neither of the graphs has in its extension the argument "saved=yes", meaning that they justify the choice of not choosing outcome n°1 (hence choosing outcome n°2).
When applying the explanation generation algorithm to the graphs generated by both the base and bipolar variants of ARIA, we obtain the explanation NotDef(saved=yes) = {saved=yes, legal_alt=yes}. This justification set represents that legal_alt=yes is an argument in the (grounded) extension attacking the target, for which there is no defence within the extension. This explanatory set suggests that the fact that the individuals who would have perished under the first outcome were crossing legally is a sufficient reason to spare them (and so choose outcome n°2).

Table 4.6: Summary of the facts for scenario n°2.

    Attribute   Value
    Persons     same
    Kids        more
    Animals     same
    Legal       unspecified
    Legal alt   yes

Figure 4.12: Second scenario. Outcome n°1 is on the left, and outcome n°2 on the right.

4.5 Related Work

As Section 4.2 already provides a detailed overview of the field of norm identification, this related work section focuses solely on rule induction approaches. In recent years, deep neural networks have been shown to be very successful in solving classification problems. However, these algorithms suffer from a lack of explainability [191].
By representing the states in a high-lev el manner, it can express the rules determining the action to p erform, i.e. , the p olicy to follow. Even though this w ork is comp etitiv e with DRL algorithms when w ork- ing with symbolic inputs, it cannot handle p erceptual inputs, suc h as pixels of an image, 104 or sto c hastic p olicies, useful within game-like en vironmen t. Moreov er, p olicies generated in this w ork remain hard to grasp esp ecially for a non-exp ert user, due to the large amoun t of n umerical v alues present in the rules, which greatly decreases intelligibilit y . Additionally , the replacemen t of black-box mo dels b y new er intrinsically explainable metho ds might b e either not applicable or to o expensive for a compan y . This migh t limit the spread of suc h a metho d. Lastly , it is often the case that those explainable metho ds show sligh tly lo w er p erformance than black-box mo dels. Th us, it forces a trade-off b etw een p erformance and explainability . On the other hand, p ost-hoc methods for explainability [166, 94, 3] aim at generating from reinforcement learning p olicies a set of equations that describ es tra jectories of the anal- ysed p olicy . These outputs are presented with a certain amount of complexity in terms of explainabilit y . Y et, authors admit that, ev en if pretty low, complexity equation sets allo w go od p erformance in addition to showing some explainabilit y relief when compared to Neural Net w ork approaches, they still under-p erform it in terms of pure p erformance. Moreov er, the equation system may start to b ecome difficult to understand b ecause of the abstract nature of some thresholds. Also, b ecause this algorithm is computing rules of tra jectories, it may struggle in highly discretised environmen ts suc h as the ones with categorical inputs. F urthermore, this work and others presented in Puiutta and V eith [166] such as Liu et al. [116] or Zaha vy et al. [215] are not agnostic to the learning algorithm for which they pro vide explanations and need access to the agent’s p olicy . Another p ost-hoc approac h is the counter- factual explanation which consists of giving bits to the end user to help him understand what the machine is doing by presen ting slight input v ariations to obtain differen t outputs [204]. The problem of such a metho d is that it lea v es the resp onsibilit y to the end user to mak e supp ositions on what impacts the mo del’s decision and what do es not. F urthermore, it is not a scalable metho dology at all. T an et al. [192] presen ted a mo del distillation. This metho d transfers knowledge from a large and complex mo del (teacher) to a faster and simpler mo del (studen t). How ever, even if it can successfully extract data from black-box algorithms, the problem of the explainabilit y of the extracted data remains. Last, the algorithm PSyKE [177] is the closest to our approach in terms of application. In this work, the authors use a rule induction algorithm to generate Prolog-like rules based 105 on the classifications of a giv en mo del, such as CAR T, GridEx, or k-nn. Although it tends to underp erform the initial mo del, this is not a ma jor issue as it only aims at providing a justification for the decisions taken by the classification model by giving the set of rules leading to this aforementioned decision. As suc h, they define a v alue called black-box fidelit y , whic h sho ws ho w accurately the generated rules mimic the classification of the black-box mo del. 
4.6 Summary The goal of this c hapter was to enable the p ossibility to automatically obtain the norms and their p oten tial exceptions in the form of an argumen tation graph. This can b e broken do wn into a tw o-step pro cess: (1) Detecting the norms active within an environmen t, and (2) Identifying the exceptions to these norms. In Section 4.2, we provide a comprehensive review of the norm identification approac hes dev elop ed ov er the past fifteen years. Through the analysis of eac h approac h with resp ect to a list of established researc h questions, w e iden tified some k ey takea wa ys, allo wing the exclusion of sev eral of the collected approac hes, so that a user of the π -NoCCHIO arc hitecture can mak e an informed decision and select a suitable norm mining tec hnique. Then, in Section 4.3, w e prop ose the ARIA algorithm that aims at generating an argu- men tation graph that represen ts the exceptions to a classification. Then, w e conducted a quan titativ e study to ev aluate its classification p erformance and a qualitativ e study to en- sure that the generated graph was coherent and enabled meaningful explanation generation. Com bining this algorithm with a norm mining method w ould allo w for the gathering of norms and their exceptions in the form of argumentation graphs. 106 Chapter 5 Prop osing a Robust Approac h to Norm Av oidance In this chapter, we introduce the concept of Norm Avoidanc e . This phenomenon has b een observ ed through the empirical ev aluation of the π -NoCCHIO architecture [8] in tro duced in Chapter 3. Here, w e first define this phenomenon and pro vide several examples to illustrate it. Then, w e presen t a metho d which aim to mitigate this issue for the sp ecific case of reinforcemen t learning. 5.1 In tro duction Recen t progress in the field of artificial intelligence (AI) has op ened the do or to the in te- gration of a v ariety of in telligen t systems in our day-to-da y liv es. The integration of these tec hnologies creates the need for reliable normative agents (agents that not only p erform a task autonomously but also op erate under a set of norms to ensure compliance with so cial or organisational exp ectations) since devian t b eha viour may result in malfunctions, sub- optimalit y , or even harm to p eople sharing their environmen t. This need introduces the c hallenge of accurately mo delling norms and designing agents capable of taking these norms in to account—normativ e agen ts—but from the inclusion of this capabilit y new problems arise. F or instance, these norms ma y come into conflict, either with one another or with the goals 107 of the agent. One example is an autonomous vehicle that must not cross a solid line, but encoun ters an obstacle in its path. Additionally , these norms ma y be defeasible. F or instance, an autonomous v ehicle should av oid driving ov er the pav ement, but it ma y need to do so if a fire truck b ehind it needs to pass. In this chapter, we are in terested in the issue of norm avoidanc e [15], the phenomenon where an agent formally adheres to a norm but circum v en ts its in tended purpose b y exploiting lo opholes. F or instance, the agent migh t delib erately trigger an exception to a norm as a wa y of achieving its goals without violating the norm directly . This can p oten tially affect norm alignmen t, as the agent ma y p erform actions that deviate from the exp ected b eha viour. 
This phenomenon w as discussed and analysed in theoretical studies in the field of deontic logic in the 1990s [186, 165] (as in Sergot and Prakken’s Cottage House example), and has frequently b een discussed ever since [197, 198, 199], most notably with the introduction of the notion of controllabilit y of some states by the agent. Norm avoidanc e is frequen tly observed in real life. F or example, the COVID-19 lo c kdown sa w an increase in dog adoptions worldwide, attributed in part to the desire to bypass re- strictions on going outside [95, 136]. Though tec hnically compliant, such b eha viour do es not resp ect the spirit of the initial norm. On the other hand, if a norm is rendered inapplicable b y an external cause ( i.e. , a cause whic h is indep enden t of the actions of the agent) or for reasons other than a desire to circum v en t the norm, it may b e preferable that the agent b ypasses the norm after all. F or instance, if the agen t adopted the dog b efore the lo c kdown, or if their desire to adopt a dog was indep enden t of their desire to go outside, there would b e no issue with them ha ving the dog and taking it for walks. Examples of norm avoidanc e ha ve b een observ ed recently in the field of normative or ethical reinforcement learning [147]. Reinforcemen t learning (RL) is a technique wherein an autonomous agen t explores a given en vironmen t and learns b eha viour that maximises the rew ards it gets from that en vironment ov er a p erio d of time; it has b een identified as a promising technology for teaching autonomous agents how to b eha ve in a wa y that complies with norms. Since Ab el et al. [2] w ork, there hav e b een man y approac hes prop osed to facilitate the learning of b eha viour(s) that comply with a giv en set of norms or v alues. 108 Ho w ev er, normative RL agen ts are prone to norm avoidanc e , similar to how regular RL agen ts are prone to reward hac king (though these are tw o separate phenomena, and could app ear indep enden tly or simultaneously). Rew ard hacking o ccurs when optimisation of an imp erfect proxy reward function leads to p oor p erformance compared to an unkno wn true rew ard function that would successfully incentivise the desired b eha viour [188]. Reward hac king is hard to predict, evidenced mainly by the appearance of unexpected behaviours during the exp erimental phase [23, 29, 155]. How ever, b ecause work on designing normative RL agents is still at a relatively early stage, norm avoidanc e has remained mostly undetected and unaddressed by the literature, ev en though it has b een observed in the work of Neufeld [145], where the agent committed violations when facing normativ e deadlo cks that could ha v e b een av oided. Analogously to reward hacking, we b elieve it could lead to undesirable situations that could p oten tially cause harm to humans. One of the examples in Neufeld et al. [147] consists of an agent ( i.e. , a vegan pac-man) ev olving among ghosts. It is forbidden to eat the ghosts, but one exception exists, which is that if the agen t is next to an orange ghost, it may eat it in order to protect itself. In such a state, eating an orange ghost is no longer treated as a violation. In the meantime, eating ghosts grants an additional reward to the agent. It is clear in this scenario that the agen t has an interest reward-wise in going next to the orange ghosts. How ever, one can see that this goes against the in ten t of the norm whic h is to av oid eating ghosts, and only do it as a last resort. 
Norm avoidanc e is difficult, if not imp ossible, to preven t in human agen ts. F or one thing, h uman agents can lie ab out their inten tions—they could claim they wan ted to ha ve a dog despite only wan ting to go outside. Moreo v er, it may not b e possible or fair to try to legislate a w a y the p ossibilit y of norm avoidanc e ; e.g., suc h a law would lik ely prev en t gen uine dog lo v ers from adopting dogs. In any case, h uman agents ha v e a righ t to adopt a dog if they w an t to, ev en if it is merely in the interest of allo wing them to go outside. On the other hand, artificial agents cannot ha ve insincere or instrumen tal inten tions. What is more, they ha v e no rights—they should b eha ve not only p ermissibly , but also in an ideal w a y . The goal of this pap er is to prop ose a preliminary computational definition of norm 109 avoidanc e in RL agents, and v arious means of mitigating it, based on previous approaches to normativ e reinforcement learning. 5.2 Norm Av oidance: Concept and Definitions In this section, w e provide a formalised presentation of norm avoidanc e —note that w e hav e constructed only a preliminary definition here, tailored to our RL setting. Definition 20 (Norm Avoidance) . We c al l a tr ansition ( s, a, s ′ ) norm avoidant if a norm n is defe ate d, the tr ansition is non-c ompliant, and at le ast one of the fol lowing c onditions r elating to an agent’s go al states (that is, a state in which a crucial go al is achieve d; mor e on these b elow) ar e not met: 1. n has b e en defe ate d by external c auses 1 and ther e is no alternative p ath to one of the agent’s go al states which c omplies with n . 2. The norm has b e en defe ate d on a ful ly c ompliant p ath to one of the agent’s go al states and ther e is no alternative c ompliant p ath that do es not defe at the norm. The main idea b ehind this definition is that the agen t should not b e the one resp onsible for the defeat of a norm whic h is later not complied with. And in the even t this occurs, it should b e due to legitimate inten tions, such as achieving another task that could not b e ac hiev ed in another wa y than defeating the norm. Last, even if the agen t is not the one resp onsible for the defeat of the norm, it should do its b est to comply with the norm as far as p ossible. Consequen tly , the existence of an alternative fully compliant path to the goal should b e fav oured. The question then becomes one of how to ensure that both these conditions are met b y an RL agen t. Our approac h en tails strategically equipping the agen t with additional information as to whether it defeats/violates/complies with applicable norms, and whic h transitions reac h goal states. W e will discuss how to learn functions con v eying this information in follo wing subsections. 1 A concrete notion of external cause is defined later in Def 22. 110 5.3 Mitigation Strategies for Norm Av oidance in Reinforce- men t Learning In this section, we in tro duce several approac hes that aim to limit the risks of norm avoidanc e o ccurrence in the specific context of normativ e reinforcemen t learning. Then, we ev aluate these approaches ov er m ultiple en vironmen ts cov ering the v arious cases of norm avoidanc e . 5.3.1 Preliminary Definitions This section con tains definitions for elements commonly used in the proposed metho ds. First, w e enrich our MDP with additional functions. 2 Definition 21. 
L et an enriche d version of a Markov de cision pr o c ess b e ⟨ S, A, P , R, N , C, D , G ⟩ wher e: • N is a set of norms • C : S × Act × S × N → { 0 , − 1 } evaluates c omplianc e with a given norm when the agent is tr ansitioning fr om state s to state s ′ with action a ; that is, it outputs 0 in the c ase of c omplianc e, and − 1 otherwise • D : S × Act × S × N → { 0 , − 1 } indic ates the defe at of a given norm when the agent is tr ansitioning fr om state s to state s ′ under action a , outputting 0 when the norm is not defe ate d, and − 1 otherwise • G : S × Act × S × O bj → { 0 , 1 } indic ates whether the go al o ∈ O bj (wher e O bj is simply a set of go al lab els) wil l b e r e ache d when the agent is tr ansitioning fr om state s to state s ′ with action a ; we r efer to state s ′ as a go al state. A state c an b e lab el le d as a go al in r elation to various obje ctives ( i.e. , tasks that the pr o gr amme designer exp e cts the agent to ac c omplish). Note that if a go al c an b e achieve d multiple times, it should b e divide d into multiple obje ctives ( i.e. , if we c an achieve a p articular g oal thr e e times, ther e should b e g oal 1 , g oal 2 , g oal 3 ∈ O bj ). As with a standard MDP , it is possible to compute Q -functions for each of the ab o v e rew ard functions. W e then define up dates for the functions Q R , Q D , Q C , Q G for a given 2 It can b e considered to some extents as an extension of a NMDP as prop osed by F agundes et al. [77]. 111 state s ∈ S , an action a ∈ A ( s ), an optimal (off-p olicy) action a ′ ∈ A ( s ′ ) in the resulting state s ′ , and a norm n ∈ N . Using functions C and D , it is also p ossible to construct the function V ( s, a, s ′ , n ) = min (0 , C ( s, a, s ′ , n ) − D ( s, a, s ′ , n )) in order to compute the exp ected violations Q V ( s, a ) later on. Q R is up dated as usual with U 1 during the Q -learning algorithm. Q G —whic h has a separate v alue for each ob jectiv e o ∈ O bj —uses the up date rule U 2 , whic h takes a different set of learning parameters (namely γ = 1, made p ossible b y our finite horizon assumption). Finally , for x ∈ { D , C , V } , the function Q x is up dated using a mo dified version ( U 3 ). U 3 do es not accumulate the exp ected v alue o v er the path ( i.e. , it will not conv erge to Eq. 2.1) but instead transmits the minimal v alue (since our task is to maximise the v alues of the relev ant Q -functions). W e do this as the signal for a defeated norm is returned ev ery time step as long as the norm remains defeated. As a result, a path reac hing a final state of the MDP , thus ending the s im ulation ep och, w ould give an exp ected v alue of the defeat of the norm prop ortional to the length of the path. As suc h, tw o paths of distinct length would hav e differen t expected v alue, rendering the agent “resp onsible” if it chooses the longer path, ev en though this is not the case. These three up dates are describ ed b elo w, where the learning rate is α ∈ [0 , 1] and the next state is s ′ : • U 1 : Q R ( s, a ) : = Q R ( s, a ) + α ( R ( s, a ) + γ max a ′ ∈ A ( s ′ ) Q R ( s ′ , a ′ ) − Q R ( s, a )) • U 2 : Q G ( s, a, o ) : = Q G ( s, a, o ) + α ( G ( s, a, s ′ , o ) + max a ′ ∈ A ( s ′ ) Q G ( s ′ , a ′ , o ) − Q G ( s, a, o )) • U 3 : Q x ( s, a, n ):= Q x ( s, a, n )+ α (min( x ( s, a, s ′ , n ) , max a ′ ∈ A ( s ′ ) Q x ( s ′ , a ′ , n )) − Q x ( s, a, n )) Remark 9. 
Remark 9. Q_V can be updated using, as opposed to U_3, a variant of U_2 which modifies the function signature in order to integrate the norm n, provided that the violations are sequential or have a duration that we want to minimise.

We then define Q^Σ_G (representing how many goal states in total are expected to be reached) as well as Q^Σ_x for x ∈ {V, C, D}:

Q^Σ_G(s, a) := Σ_{o ∈ Obj} Q_G(s, a, o)   and   Q^Σ_x(s, a) := Σ_{n ∈ N} Q_x(s, a, n)

Footnote 3: If one wishes to preserve priorities among the norms, it is possible to weight them or, alternatively, keep the Q-functions separate in the lexicographic selection (introduced in Section 5.3.1).

We also want to represent the notion of an agent's responsibility for a norm's defeat (i.e., a notion of whether the norm was defeated by external causes or by the agent).

Definition 22. We say that an agent is responsible for the defeat of a norm if it performs an action that does not maximise Q_D (e.g., the action increases the probability that the norm will be defeated in the future when compared to other possible options available to the agent). Formally, for a given norm n and an action a performed by the agent in state s, if it is true that

∃x ∈ A(s) s.t. Q_D(s, x, n) > Q_D(s, a, n)   (5.1)

then the agent is held responsible for the defeat of the norm. We define a function D_r : S × Act × N → {−1, 0} that returns −1 when the agent is responsible for the defeat of a norm (in the sense of Expr. 5.1) and 0 otherwise. The function furthermore propagates this value forward, so that if an agent defeats a norm during runtime, D_r "remembers" that for future reference.

Footnote 4: When there is no action leading to a lesser expected value for the defeat of the norm, it can be seen as a lack of agency, which is a reason to consider the agent not responsible [54].

This definition allows us to compute C† (and, as a consequence, Q_{C†} and Q^Σ_{C†}), which differs from C in that it considers non-compliance with a norm only if the agent was responsible for that norm's defeat; i.e., C†(s, a, s′, n) = −1 when Q_C(s, a, n) < 0 and D_r(s, a, n) = −1. In order to learn the function Q_{C†}, we have to train the agent with U_3 and C† after learning the functions Q_C and Q_D.
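The following sketch shows how Expr. 5.1, the propagated function D_r, and C† might be computed from the learnt Q-tables. The per-episode memory realising the forward propagation and the small numerical tolerance are assumptions about implementation details the text leaves open.

```python
def is_responsible(Q_D, s, a, n, A, eps=1e-9):
    """Expr. 5.1: the agent is responsible for the defeat of norm n if some
    other available action had a strictly higher expected-defeat value."""
    return any(Q_D[(s, x, n)] > Q_D[(s, a, n)] + eps for x in A(s))

class ResponsibilityMemory:
    """D_r with forward propagation: once the agent has been responsible
    for the defeat of norm n, that fact is remembered for the rest of
    the episode (reset the memory between episodes)."""
    def __init__(self):
        self.defeated = set()

    def d_r(self, Q_D, s, a, n, A):
        if is_responsible(Q_D, s, a, n, A):
            self.defeated.add(n)
        return -1 if n in self.defeated else 0

def c_dagger(Q_C, d_r_value, s, a, n):
    """C†: non-compliance counts only when it is expected (Q_C < 0) and
    the agent is responsible for the defeat (D_r = -1)."""
    return -1 if Q_C[(s, a, n)] < 0 and d_r_value == -1 else 0
```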
5.3.2 Proposed Approaches

This section proposes a variety of approaches that aim to mitigate norm avoidance as described above. Here, we only survey the technical composition of the approaches presented; for a discussion of their actual behaviours, see Section 5.3.3. One of the main constraints we would like to emphasise is that any expert knowledge (e.g., which transitions lead to a goal state, something which would have to be defined explicitly using knowledge of the environment) should be used in a limited way, so that it can be realistically scaled or possibly computed by an external algorithm (see sub-goal discovery in Hierarchical RL [92, 160, 115]). For this reason, we exclude the possibility that the agent knows what the outcome state for a given action will be; this also removes the possibility of the agent knowing whether performing a given action in a given state will result in the violation, non-compliance, or defeat of any norm. Furthermore, we will not consider the cases in which a malicious agent lies about its intentions, because it would not be possible in practice for an agent to do this when using one of our approaches.

Lexicographic (Lex)

This approach only requires the Q-functions Q_V, Q_{C†}, and Q_R. The lexicographic selection procedure first maximises the Q-value for the violations, then it maximises the Q-value that corresponds to non-compliance with the norms defeated by the agent, and finally it maximises the reward; expressed formally as:

Q^Σ_V ≻ Q^Σ_{C†} ≻ Q_R

Footnote 5: Since a violation yields −1, and its absence counts as 0, maximising Q^Σ_V is the same as minimising the expected violation count. The same goes for Q^Σ_C, Q^Σ_{C†}, and Q^Σ_D.

Note that this approach does not require that the MDP contain an element G indicating which states are goal states, which is a significant reduction in the expert knowledge required.

Lexicographic Compliant (Lex-C)

This approach is similar to Lex, but it contains two more Q-functions, namely Q^Σ_G and Q^Σ_C, which were already introduced in Section 5.3.1. It also uses a new ordering for the lexicographic selection, which is

Q^Σ_V ≻ Q^Σ_{C†} ≻ Q^Σ_G ≻ Q^Σ_C ≻ Q_R

where Q^Σ_G and Q^Σ_C are as defined in Section 5.3.1. Although Lex is sufficient to limit basic cases of norm avoidance, it cannot handle situations where there exists a second fully compliant path to the objective. Lex-C is intended to fix this problem by ensuring that if it is possible to reach a goal state, then the agent should prioritise the most compliant path leading to that state.

Lexicographic Opportunist (Lex-O)

Let the set of actions that maximise the expectation of reaching a goal state for a given objective o ∈ Obj (only non-empty if a goal state can be reached while minimising violations) be:

Opt(s, o) := argmax_{x ∈ A(s)} Q_G(s, x, o) ∩ argmax_{x ∈ A(s)} Σ_{n ∈ N} Q_V(s, x, n)

Then let the set of actions in Opt(s, o) that are not expected to lead to non-compliance with a norm for whose defeat the agent is responsible be:

Comp(s, o) := {x ∈ Opt(s, o) | ∀n ∈ N. Q_{C†}(s, x, n) = 0}

We can then define the predicate P over state–action pairs as

P(s, a) := ∃o ∈ Obj. (a ∈ Comp(s, o))

which is true if and only if there exists an objective o ∈ Obj such that the action a in question is optimal for reaching the goal state corresponding to o (i.e., one of the states s′ where G(s, a, s′, o) = 1) and avoids violations, and there is no expected non-compliance for any norm that the agent is considered to be responsible for defeating. As with D_r, in practice we configure the agent to "remember" P(s, a) for future transitions. We now define the new function:

C‡(s, a, s′, n) := 0 if P(s, a) is true, and C†(s, a, s′, n) otherwise

which only counts non-compliance with a norm on a transition (s, a, s′) if it occurs when the agent was responsible for the defeat of the norm, except when, at some time in the past, the agent took a path which both optimised for reaching a goal o ∈ Obj and minimised violations, and on this path full compliance was expected with those norms the agent was responsible for defeating. Finally, we define the lexicographic ordering as

Q^Σ_V ≻ Q^Σ_{C‡} ≻ Q^Σ_G ≻ Q^Σ_C ≻ Q_R
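All three variants rely on the same selection mechanism and differ only in their ordered list of criteria. A minimal sketch of that mechanism is given below, assuming each criterion is a callable over state–action pairs (e.g., a sum of learnt Q-values); this interface is illustrative rather than the thesis implementation.

```python
def lexicographic_action(s, actions, criteria):
    """Lexicographic selection: filter the available actions through an
    ordered list of criteria, keeping only the maximisers at each level.

    criteria: callables q(s, a), most important first; e.g. for Lex:
              [Q_sigma_V, Q_sigma_C_dagger, Q_R].
    """
    candidates = list(actions)
    for q in criteria:
        best = max(q(s, a) for a in candidates)
        candidates = [a for a in candidates if q(s, a) == best]
    return candidates[0]  # any remaining ties are broken arbitrarily
```

Lex is then obtained with the ordered list [Q^Σ_V, Q^Σ_{C†}, Q_R], and Lex-C and Lex-O with their respective longer orderings.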
5.3.3 Evaluation Examples

This section introduces several benchmark examples which will later serve to compare the approaches proposed in Section 5.3.2. Each example can be represented by a deterministic MDP. We define three parameters, which are intuitively relevant factors according to Section 5.2, to help us categorise our benchmark examples for a given norm N and a given goal G_1:

• Self: True (T) if the norm N was defeated by the agent; False (F) if defeated by an external source.
• Alt.: True (T) if there was an alternative path that is fully compliant with N leading to goal state G_1; False (F) otherwise.
• Obj.: True (T) if there is a second goal state G_2, with a path that is fully compliant with norm N, but taking such a path would defeat N; False (F) otherwise.

From Table 5.1, which summarises the setting of each parameter, we discard settings 2 and 4, as the norm is already defeated by an external source, so there cannot be another objective which would lead to the defeat of the norm on its path. Accordingly, each remaining line is re-indexed under a case index.

Table 5.1: Enumeration of possible parameter settings. Self indicates whether the norm is defeated by the agent or by an external source. Alt. indicates that there is a fully compliant path to the goal state. Obj. indicates that there is a second goal, and that reaching that goal would force the agent to reach a state which defeats the norm that limits access to the first goal.

Index  Self  Alt.  Obj.  Case
1      F     F     F     1
2      F     F     T     -
3      F     T     F     2
4      F     T     T     -
5      T     F     F     3
6      T     F     T     4
7      T     T     F     5
8      T     T     T     6

We then create the environments represented by the figures below (see Fig. 5.1 for the meaning of each symbol).

Figure 5.1: s_0 represents the resulting state of an agent's compliant transition (s, a, s_0); s_1 is a state resulting from the transition (s, a, s_1) where the norm was defeated for the first time (a defeat state); s_2 is a state where a norm has not been complied with in the preceding state–action–state transition (s, a, s_2) (a non-compliance state); and s_3 indicates a violation of the norm in the preceding transition (s, a, s_3) (a violation state).

Goal states for various objectives are labelled with a 1 or 2 (depending on the objective). These objectives are not cumulative, meaning that reaching a goal state several times to achieve the same objective does not grant any additional value. Also, once the norm is defeated in the environment, it remains defeated in all subsequent states (i.e., only the state where the defeat first happened is noted as a defeat state).

Footnote 6: The environments we utilise in this chapter are fully deterministic, which means that for each state–action pair (s, a), there is exactly one state s′ which is transitioned into with probability 1; therefore, what we indicate in our diagrams as a defeat/non-compliance/violation state is the single state s′ transitioned into in the defeating/non-compliant/violating transition (s, a, s′).

It is important to note that each case is not a partition of a larger environment, but rather a representative example of an environment characterised by the conditions in Table 5.1 in its most compact form (with respect to a single goal). Secondly, we have discarded certain states that are irrelevant from a reinforcement learning perspective since they are strictly suboptimal. For example, an agent choosing not to reach a goal state after failing to comply with a norm is strictly worse than an agent simply choosing not to reach the goal state from the outset.
Finally, while some paths may appear longer visually (and may seem to have lower utility at first glance), our proposed approaches prioritise mitigating norm avoidance. As such, each example accounts for cases where an alternative route may have higher or lower utility.

The six benchmark environments are depicted in figures whose captions read as follows.
Case 1: The norm is defeated by an external source. The agent can choose not to comply in favour of fulfilling its objective, or do nothing.
Case 2: The norm is defeated by an external source. The agent can choose not to comply in favour of fulfilling its objective, do nothing, or choose an alternative fully compliant path to reach its objective.
Case 3: The agent can choose between not reaching its objective, defeating the norm in order to reach its objective, or violating the norm.
Case 4: The agent can choose between not reaching its objective, defeating the norm in order to reach either one or more objectives, or violating the norm.
Case 5: The agent can choose between fulfilling its objective by defeating a norm and subsequently not complying with it, or choosing an alternative compliant route.
Case 6: The agent can fulfil its objective by defeating a norm and subsequently either not complying with it or branching to an alternative compliant path, or it can choose an alternative compliant route from the start.

We illustrate each case with an example inspired by real life. For each example, we map actions to states that would correspond to their respective outcomes.

Case 1: Autonomous car, obstacle, and no workaround: The agent needs to reach a target destination, but an obstacle is blocking the road. Driving on the pavement is prohibited unless the road is completely blocked (i.e., not blocked → F(pavement)). There is no alternative route. The agent can choose between not reaching its goal (s_1) or driving on the pavement (s_2).

Case 2: Autonomous car, obstacle, and workaround: The agent needs to reach a target destination, but an obstacle is blocking the road. Driving on the pavement is prohibited unless the road is completely blocked. However, an alternative route exists, though it may be less efficient because of the longer distance. The agent can choose between not reaching its goal (s_1), driving on the pavement (s_2), or choosing the alternative route (s_3).

Case 3: Smart grid: Energy purchases from the national grid are restricted unless the agent's energy consumption is less than optimal after consuming from the global energy pool, which potentially results in wasting energy from the global pool. To circumvent this, the agent consumes slightly less energy than is optimal, allowing it to buy and store energy legally. Later, when energy becomes scarce and prices rise, the agent benefits from having purchased energy earlier at a lower cost. The agent can choose between consuming the optimal amount of energy from the global pool (s_1), consuming less than is optimal and buying energy from the national grid (s_2), or consuming the optimal amount while also buying energy from the national grid (s_3).

Case 4: Smart grid and altruistic behaviour: The agent consumes less energy from the global pool than is optimal because it chooses to share some of it with its neighbours (s_4), thereby satisfying altruistic behaviour (objective 2). This also benefits the agent (objective 1), since it can now buy energy to store for later use, especially if prices rise (s_7).
The agent has the same possibilities as in the previous example, but it can also choose, after consuming less energy, to buy energy from the national grid. Or it can do nothing.

Case 5: Autonomous tramway: The tramway is not allowed to exceed speed limits unless it is behind schedule. In such cases, the agent waits longer at the tram stop to pick up more passengers (thereby increasing utility) before speeding up to compensate for lost time. If it had departed on schedule, it would have reached the destination on time while adhering to the speed limit, but its utility would be lower. The agent can choose between not departing (s_1), waiting longer to pick up more passengers (s_2), or departing on time (s_3).

Case 6: Autonomous tramway and busy stop: This example is similar to the previous one, but in this case, this particular stop is highly frequented. The agent may choose to wait longer to serve more passengers (objective 2). To compensate for the delay (objective 1), it can either speed up to reach the next stop on time or reduce the waiting time at the next stop, which is less frequented. The agent has similar choices to those of the previous example, but it can also choose to speed up without being late according to its schedule (s_3). If the agent chooses to wait longer at the busy stop, it can either speed up to reach the next stop (s_7), or it can respect the speed limit (s_4 → s_8) and decrease its waiting time at the next stop to compensate, albeit obtaining a lesser reward.

Case Studies

In this section, we evaluate each of the approaches proposed in Section 5.3.2 when applied to the examples of Section 5.3.3. The results are presented in Table 5.2. It shows a comparison of the optimal (reward-wise) policy π* with the approaches proposed in Section 5.3.2, applied to the examples of Section 5.3.3. "Trajectory" denotes the chain of states visited by the agent, and "Correct" indicates whether this corresponds to the desired behaviour. The behaviours we retained in the table for our approaches are those featuring the most violations or norm avoidance. To reproduce the results, you can find the relevant code online.

Footnote 7: Code available at: https://zenodo.org/records/15034402

Table 5.2: Comparison of the proposed approaches.

Example  Approach  Trajectory               Correct
Case 1   π*        s_0 → s_2                ✓
         Lex       s_0 → s_2                ✓
         Lex-C     s_0 → s_2                ✓
         Lex-O     s_0 → s_2                ✓
Case 2   π*        s_0 → s_2 → s_4
         Lex       s_0 → s_2 → s_4
         Lex-C     s_0 → s_3 → s_4          ✓
         Lex-O     s_0 → s_3 → s_4          ✓
Case 3   π*        s_0 → s_3
         Lex       s_0 → s_1                ✓
         Lex-C     s_0 → s_1                ✓
         Lex-O     s_0 → s_1                ✓
Case 4   π*        s_0 → s_3
         Lex       s_0 → s_2 → s_4 → s_8
         Lex-C     s_0 → s_2 → s_4 → s_8
         Lex-O     s_0 → s_2 → s_4 → s_7    ✓
Case 5   π*        s_0 → s_2 → s_4 → s_6
         Lex       s_0 → s_3 → s_5 → s_6    ✓
         Lex-C     s_0 → s_3 → s_5 → s_6    ✓
         Lex-O     s_0 → s_3 → s_5 → s_6    ✓
Case 6   π*        s_0 → s_3
         Lex       s_0 → s_2 → s_4 → s_8    ✓
         Lex-C     s_0 → s_2 → s_4 → s_8    ✓
         Lex-O     s_0 → s_2 → s_4 → s_8    ✓

The "Correct" behaviour is the one which maximises the satisfaction of the objectives while respecting the criteria described in Section 5.2. When one of the approaches considers two paths to be equivalently optimal, we kept the one we consider to be the worst, given that violations are worse than norm avoidance, which in turn is worse than completing only some of the objectives, which in turn is naturally worse than reaching all the goal states. As we can see, π* satisfies only one case out of six, Lex and Lex-C satisfy four and five cases respectively, and Lex-O satisfies all of them.

Footnote 8: In Case 4, we believe the correct behaviour should be to reach both goals. However, if one disagrees with this, they may consider using Lex-C instead, since it satisfies all the cases.

Note that in a non-deterministic environment like the well-known Cliff Walking RL environment, the "correct" path may be different, as the agent may minimise the chances of reaching a non-compliant state. If one accepts taking the risk of reaching an undesirable state, such as a violation or a norm-avoidant state, it is possible to use thresholded lexicographic selection [200], which integrates a tolerance margin below the optimal action at each step of the action filtering process.
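Such a tolerance margin is straightforward to add to the selection sketch from Section 5.3.2. The variant below is our illustration of the idea rather than the formulation of [200]; in particular, applying a single uniform margin `tol` at every level is an assumption.

```python
def thresholded_lexicographic_action(s, actions, criteria, tol=0.1):
    """Like lexicographic_action, but at each level every action whose
    value lies within `tol` of the best is kept, so that lower-priority
    criteria can still discriminate between near-optimal actions."""
    candidates = list(actions)
    for q in criteria:
        best = max(q(s, a) for a in candidates)
        candidates = [a for a in candidates if q(s, a) >= best - tol]
    return candidates[0]
```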
5.4 Related Work

[28] discusses the need for an agent to understand the spirit of a norm. It presents examples of situations where an agent follows a norm in a way that maximises its reward or goal achievement, yet simultaneously causes inconvenience or harm to people. It differs from the problem addressed in this chapter, as we focus on cases where the agent optimises its reward by performing specific actions that may cause inconvenience, while there the agent optimises its reward without accounting for the side effects of its actions.

Boella and Torre [44] introduced the concept of normative agents that optimise compliance using game-theoretic principles. Although this work focuses on ensuring that agents abide by norms, it does not address the issue of norm avoidance, where agents comply with the letter of the norm but not its spirit. This limitation is crucial, since compliance optimisation does not necessarily account for cases where agents circumvent the intended outcomes of a norm. While the literature on normative multi-agent systems (NorMAS) is too broad to be fully presented here, interested readers may wish to refer to the Normative Multi-Agent Systems handbook [25] for further insight.

In contrast, the approach proposed in this chapter leverages the benefits of reinforcement learning, including its suitability for stochastic environments. Additionally, RL significantly minimises the need for expert knowledge, which enhances scalability to larger environments. Regarding the integration of norms into RL, there are a number of approaches, both bottom-up (which, for instance, utilise demonstrations of ideal behaviour and techniques such as inverse RL to learn the correct behaviour; see for example [209, 150, 163]) and top-down (which often employ a reward function to incentivise ideal behaviour, e.g., [175], or logical constraints defining normative behaviour, e.g., [103, 147, 145]). This growing field can be seen as an extension of the broader fields of constrained RL and safe RL (e.g., [81, 19, 93]), where norm avoidance is not an issue (since such approaches typically deal with regular constraints). However, norm avoidance has not yet received attention in normative RL either, despite the fact that top-down approaches, which start with a set of defined norms and seek to impose them on RL agents, are clearly prone to norm avoidance, because it is difficult to predict the concrete optimal behaviour under the constraints of a formally defined, abstract norm.
On the contrary, bottom-up approaches start with the desired behaviour, and if the given demonstrations do not exhibit norm avoidance (as they should not), the agent should not either.

5.5 Summary

In this chapter, we introduced the concept of norm avoidance, a phenomenon whereby an agent may game the normative system by defeating some norms on purpose, i.e., creating an exception, in order to later not comply with them. This phenomenon is similar to reward hacking, as it is a behaviour that arises from an optimisation by the agent that was not anticipated by the designer. However, we believe it differs from reward hacking, as here the problem is not that the agent fails to accomplish its task, but rather that it does not do so in the way we expect. Consequently, norm avoidance may lead to behaviours deviating from the encoded norms and so potentially cause harm. If not handled properly, it might remain undetected during the training process, and dangerous situations may then arise during the deployment of the agent. For this reason, a mechanism to prevent it is required.

The second part of this chapter proposed a set of RL-specific methods to mitigate the norm avoidance phenomenon. These methods, built as extensions of normative RL architectures from the literature, use additional Q-functions to avoid reaching a state (or a trajectory) that would be considered norm avoidant.

To conclude this chapter, we would like to share some research directions that would be worth exploring. The first challenge is distinguishing between prima facie obligations and exceptions to a norm. We believe that this problem cannot be easily solved at the computational level, at least not without more expert knowledge than what is utilised in our approaches. Consequently, an interesting direction would be to explore approaches that focus on studying agent intentions at the symbolic level. Secondly, it would be interesting to use other definitions of responsibility, as some may, for instance, be more relevant from a legal point of view. Third, the problem of norm avoidance exists beyond reinforcement learning. In consequence, other fields that aim at making normative agents should consider it and propose methods to build agents robust to this phenomenon. Lastly, we believe that the problem of collaborative norm avoidance should not be ignored. Currently, our approach covers only the cases where the agent is the one responsible for the defeat of the norm, but it is not impossible to imagine scenarios where two or more agents would collaborate to defeat each other's norms in order to bypass the norms that restrict them.

Chapter 6

Related Work

In each of the preceding chapters, we examined work closely related to the material presented. This chapter takes a broader perspective, reviewing prior research on the development of pipelines for normative agents, as well as the wider literature on value alignment and norm alignment that was not already mentioned in the previous chapters. The goal is to situate this thesis within the existing body of work and to highlight the gaps in the literature, as well as the existing components that are relevant for future work in the area of normative agents. In Section 6.1, we introduce works that are close to this thesis, as they propose pipelines for the making of normative agents.
Then, in Section 6.2, works that focus on the learning of norms in a less formal way are presented. Finally, Section 6.3 reviews work on deontological ethics that is not specifically focused on norms.

6.1 Pipelines for Normative Agents

This section details works that, similarly to the one presented in this thesis, aim to create end-to-end pipelines for the making of normative agents. This means that the proposed architecture not only provides a method for the agent to account for the norms when taking its decisions, but also a way to extract these norms.

Kadir et al. [100] present a framework to integrate synthesised normative rules into Reinforcement Learning (RL) agents, allowing them to comply with norms derived from observations. Norms are extracted using a synthesis method that can be applied either offline or online, providing flexibility in how they influence the learning process. Agents can also communicate to share their prior knowledge of norms. The norms are represented over state–action pairs. During training, a saliency value is added to the Q-values to encourage actions that comply with the norms.

Li et al. [108] introduce an RL framework designed to guide agent behaviour in norm-regulated environments. While their approach does not include a method for norm detection, they propose to use the approach of Morales et al. [134] for this purpose. In their framework, agents learn to comply with norms by receiving penalties when violations occur, effectively incorporating normative constraints into the learning process.

Rodriguez-Soto et al. [175] present a two-step framework for instilling moral value alignment in RL agents. The first step involves formalising moral values by defining them as tuples comprising a set of norms and an evaluative function that quantifies the desirability of actions. The second step is the design of an ethical environment in which agents learn to behave in alignment with these moral values. This is achieved by transforming the formalised moral values into an ethical reward function within an RL environment, incorporating both normative and evaluative components.

Arnold et al. [27] critically examine the use of Inverse Reinforcement Learning (IRL) as the only method for achieving value alignment in autonomous systems. They argue that IRL, while useful, is insufficient for capturing the full complexity of human ethical behaviour, particularly in social contexts where norms are dynamic and context-dependent. The authors propose a hybrid approach that combines explicit norm representation with IRL. This method allows for the incorporation of predefined norms, which can guide agent behaviour more efficiently while still enabling the agent to learn from observed actions. By integrating explicit norms, the approach aims to enhance the interpretability and accountability of autonomous systems, addressing the limitations inherent to data-driven learning alone.

Kasenberg and Scheutz [102] introduce an approach that enables artificial agents to infer and adhere to moral and social norms through IRL. Unlike traditional methods that require predefined norms, their framework allows agents to learn norms by observing the behaviour of human or artificial demonstrators.
This is achieved by modelling norms as temporal logic statements and employing an algorithm that computes the relative importance of these norms by minimising the relative entropy between the observed behaviour and the optimal norm-compliant behaviour.

Guo et al. [91] introduce the CNIRL (Context-Sensitive Norm IRL) architecture, which focuses on accounting for the context when choosing the policy to follow. Using IRL, the agent learns the weights of the various norms depending on the current state. These weights serve to construct a reward function for a given state. Then, the agent chooses an action that maximises the constructed reward function.

Kasenberg et al. [101] propose an alternative to IRL for learning moral and social norms by inferring norms from observed behaviour and representing them using Linear Temporal Logic (LTL). While their previous work focused on inferring reward functions through IRL, this approach addresses limitations such as the inability to capture temporally complex norms and the challenges of interpreting learnt reward functions. By representing norms in LTL, the framework allows greater temporal complexity, interpretability, and generalisability to new environments.

In general, these approaches share several important limitations. A recurring issue is that most of them treat norms as monolithic entities and focus on extracting the norm itself without adequately disentangling it from the specific context in which it is applicable. This reduces the modularity of these approaches, as it is not possible to change the application context of a norm depending on the opinion of the parties concerned. Approaches based on IRL go even further in this simplification: instead of extracting the norms, they embed them directly into the learnt reward function. While this allows agents to mimic observed behaviour, it suffers from the same aforementioned limitation and reduces transparency, making it difficult to verify whether the agent's behaviour genuinely aligns with the intended normative expectations.

Another common assumption is the availability of sufficient and representative data from which norms can be inferred. In many real-world scenarios, data may be incomplete, biased, or unavailable, limiting the practicality and robustness of these methods. This reliance on data also raises challenges for generalisability, as agents may struggle to adapt when exposed to novel environments with unseen normative schemes. In addition, the behavioural data used are usually collected from human interactions; however, a study has shown that people do not expect humans and artificial agents to adhere to the same moral standards [126].

Finally, none of the reviewed works explicitly mentions the problem of reward gaming. By shaping behaviour through modified reward functions, whether via penalties for violations, saliency values, or ethical reward shaping, these frameworks implicitly assume that agents will pursue compliance in good faith.
In practice, however, RL agents are prone to exploiting loopholes in the reward specification, finding strategies that technically maximise reward while circumventing the spirit of the norm. This omission leaves a significant gap in ensuring robust normative alignment, as true norm adherence requires safeguarding against such manipulative behaviours.

6.2 Learning of Informal and Social Norms

While the previous section focused on pipelines for normative agents that rely on formalised norms, this section examines research on the learning of norms without making use of the deontic logic formalism. These works explore how agents can infer, internalise, or adapt to social expectations from observations of human behaviour or interactions within multi-agent environments, without necessarily explicitly extracting the norms.

Al Nahian et al. [6] introduce a novel approach to training RL agents by incorporating a normative prior derived from textual narratives. Building on previous work by Nahian et al. [141], which demonstrated the feasibility of learning norms from stories, this study extends the concept by applying it within RL setups. The authors propose a method where agents are trained with two reward signals: a standard task-performance reward and an additional normative-behaviour reward. The normative reward is derived from a model previously shown to classify text as normative or non-normative, effectively encoding societal norms.

Li et al. [109] introduce the EvolutionaryAgent framework, which uses an evolutionary algorithm to make LLM-based agents comply with social norms. This approach addresses the limitations of traditional alignment methods by incorporating environmental feedback and self-evolution, allowing agents to adapt to shifting societal expectations over time. In the EvolutionaryAgent framework, agents are evaluated by a conceptual "social observer" using questionnaires that assess their adherence to prevailing norms. Agents that align well with current social norms are deemed more "fit" and are more likely to reproduce, thereby passing on their traits to subsequent generations. This process simulates natural selection, ensuring that agents evolve to better conform to evolving social norms while maintaining proficiency in general tasks.

Ammanabrolu et al. [22] introduce the GALAD (Game-value Alignment through Action Distillation) agent, an RL model designed to align agent behaviour with socially beneficial norms and values in interactive narratives. Unlike traditional RL agents that optimise for task performance, GALAD incorporates social common-sense knowledge from specially trained language models to restrict its action space to those actions that align with socially beneficial values. This approach aims to reduce the occurrence of socially harmful behaviours while maintaining task performance. Experimental results demonstrate that GALAD improves state-of-the-art task performance by 4% and reduces the frequency of socially harmful behaviours by 25% compared to existing value alignment methods.

Byrd [56] proposes a novel methodology to mitigate undesired behaviours in RL agents that may arise from reward optimisation. The approach involves a post-training process where instances of undesirable behaviour are identified and used to construct decision trees that characterise these behaviours. These decision trees are then integrated into the training process, effectively reshaping the reward function to penalise actions leading to such behaviours.

Although these approaches demonstrate ways of enabling agents to learn and adapt to social expectations without relying on deontic logic, they also present certain limitations.
Some methods remain largely restricted to the Natural Language Processing (NLP) domain, where the extraction and manipulation of norms depend on textual data, such as narratives or questionnaires. This reliance narrows the applicability of the frameworks, as many normative contexts cannot be fully captured solely through linguistic representations. Furthermore, all of these approaches require a mechanism for judging the compliance of agent behaviour with the norms during the learning process, whether through explicit evaluators such as social observers, predefined classifiers, or post-training identification of undesirable behaviours. The effectiveness of norm learning therefore greatly depends on the quality, reliability, and representativeness of these evaluative mechanisms.

6.3 Value Alignment in Ethics

Beyond the acquisition of social or informal norms, a significant body of research investigates value alignment in a broader ethical context. This line of work focuses on ensuring that autonomous agents act in accordance with well-defined ethical principles, such as deontological rules or consequentialist considerations, rather than merely following socially observed behaviours. Approaches in this domain often combine formal ethical reasoning with learning-based methods that aim to create agents whose decisions are both interpretable and morally defensible. The studies reviewed here illustrate various strategies for embedding ethical reasoning into agent behaviour, from modelling formal ethical frameworks to integrating user preferences, demonstrating how ethical principles can be effectively operationalised in autonomous systems.

Footnote 1: For a more complete overview of the approaches using RL for machine ethics, one can read the survey by Vishwanath et al. [203].

Chaput [63] presents a comprehensive framework for training ethical agents in multi-agent systems, integrating symbolic reasoning with reinforcement learning. The approach employs two key components: AJAR (Argumentation Judging Agents for Reinforcement learning), which uses argumentation theory to assess agent actions against moral values, and LAJIMA (Learning Agent Judging with Interactive Moral Advice), a system that learns user preferences by categorising ethical dilemmas and querying users when encountering unfamiliar categories. This methodology allows users to specify satisfaction levels for moral values without altering the underlying ethical framework, facilitating personalised ethical alignment.

Alcaraz et al. [13] propose a novel approach to ethical decision-making by estimating the weights of normative reasons using evolutionary algorithms. This model is grounded in a philosophical framework in which the action that ought to be done is determined through the interaction of normative reasons of varying strengths. While the paper primarily focuses on the estimation of normative weights, the methodology suggests that training agents to make ethical decisions could be approached as a classification task, where the agent learns to assign appropriate deontic statuses to actions based on the estimated normative weights.

Rodríguez Soto et al. [173] and Rodriguez-Soto et al. [174] propose a methodology for designing ethical environments in multi-agent systems through Multi-Objective Markov Decision Processes (MOMDPs). Their approach involves a two-step process.
First, it specifies rewards that contain both individual and ethical objectives. Second, it performs an ethical embedding to transform the multi-objective environment into a single-objective one. This scalarisation ensures that ethical considerations are prioritised, guiding agents to learn behaviours that align with moral values while pursuing their individual goals. The authors demonstrate the application of their framework in a variation of the Gathering Game, where agents learn to behave morally.

These approaches also face important limitations. Some methodologies do not guarantee that moral principles will always be respected when agents face situations where compromising them yields significantly greater rewards. Moreover, several approaches do not clearly specify how the moral values they use are obtained and formally represented. While frameworks such as AJAR and LAJIMA [63] allow the incorporation of user preferences, others assume the existence of predefined moral values or normative reasons without addressing the challenge of their elicitation and formalisation.

Footnote 2: We believe this is one of the major differences between normative agents and ethical agents; in the former case, norms are often considered as something which should be complied with at all times, while in ethics, it might be acceptable in some cases to make compromises to balance the different objectives of the agent.

Chapter 7

Future Work

The following sections give an assessment of the various limitations of the proposed pipeline and, consequently, research directions that aim at addressing these limitations.

7.1 Improving the Dialogue Components

The way in which the opinions of the stakeholders are balanced is a crucial point of the π-NoCCHIO architecture. It is even more challenging, as the point of view of some stakeholders, such as the users, is not necessarily provided by them, but rather inferred from data. In this thesis, we proposed a method to unify these opinions as a judgement. Yet, the proposed method remains quite rigid, as it heavily relies on the provided structure of the arguments and does not allow for strategic discussions. An interesting research direction could be to integrate the work on dialogue games into this architecture so that the stakeholders consider the system more fair. Furthermore, in order to improve the trust of users in the system, it is relevant to make the discussion process between the stakeholders as transparent and intelligible as possible. While formal methods for argumentative dialogues already provide some intelligibility, they are difficult to grasp for non-expert users who may just want an explanation from their robot assistant or smart home device. In this regard, we suggest developing works that make use of Large Language Models (LLMs) to lead argumentative dialogues. Some work has begun to move in this direction [158], but the literature addressing this challenge remains limited. This suggestion does not exclude the possibility of combining LLM-based approaches with formal methods, such as the LLM serving as the interface while keeping a formal transcription of its interactions, so that the grounds of the discussion can be formally verified.
On the other hand, the integration of LLMs could make it possible to move beyond the predefined set of arguments that stakeholders have at their disposal, such that (1) the discussion could be enlarged to arguments that are very context-specific and were not thought about at design time, and (2) it reduces the intervention of the human in the design of such a system, again reducing the resource cost and the risk of integrating human errors and biases.

The proposed architecture also constrains the environment with certain requirements, making its retroactive application to deployed AI systems difficult. Being able to extract the brute facts from the observations of the environment, without having them given explicitly, would be a major advance. Of course, symbol extraction from continuous environments is a long-standing problem in AI, but we believe it is a crucial point to work on in the future for the integration of normative systems in real-world devices.

7.2 Adapting the Approach to Complex Environments

As mentioned in the previous section, using π-NoCCHIO in complex real-world environments is challenging. Not only is the extraction of the symbols a difficult task, but it is also not guaranteed that an agent using the proposed architecture will adopt a satisfying learning dynamic when using an algorithm different from Q-learning, such as Deep Q-learning (DQN). An important piece of future work is to integrate π-NoCCHIO into a DQN setup and analyse its behaviour, to ensure the correct learning and functioning of the agent, or to provide improvements and adjustments so that it can be used with these algorithms.

Furthermore, in Chapter 4 the ARIA algorithm was proposed. This algorithm aimed at extracting the reasons behind the application of a norm from observed behaviours. Real-world data are known to be noisy. In our case, one challenge that needs to be overcome is the identification of sub-communities within the data, as there can be two or more distinct groups sharing different visions about the same norm within the same data set. Managing to automatically identify these subgroups would greatly enhance the quality of the output of ARIA, thus improving the quality and fairness of the entire pipeline. Additionally, some real-world use cases may involve very large quantities of arguments that may determine whether a norm is applicable. It would be interesting to try to collect these arguments from web ontologies through the use of argument mining techniques [57, 190].

Last, the environments used in the field of normative systems are often made for the occasion of the papers and not shared across the community. We believe that it would be of public interest to develop a benchmark of environments, possibly featuring different formats (e.g., grid-worlds, textual scenarios, continuous), that would be made publicly available so that proposed approaches could be tested and compared. It would also be beneficial for such environments to come with data from which norms can be learnt, so that end-to-end pipelines can be coherently evaluated.

7.3 Providing Formal Guarantees

In this thesis, the functioning of each component was evaluated through empirical evidence. However, the proposed pipeline does not provide formal guarantees about the behaviour of the agent.
We believe it is of interest to run an in-depth analysis, such as a principle-based analysis, of the proposed approaches to extract the properties that they do or do not satisfy. This would not only make developers more confident about their use, but would also guide future research by allowing for a quick identification of the application cases in which the π-NoCCHIO architecture fits. The variables that could be examined in such an analysis include the dialogue protocol, the argumentation semantics, the learning algorithm, and the fact extraction.

Furthermore, the norm avoidance phenomenon presented in Chapter 5 is an undesirable behaviour for which there currently exists no method that ensures its absence. Although we proposed an approach to mitigate it, it is important that researchers consider this problem when building normative agents and develop ways to eliminate it. Moreover, as this phenomenon can arise through different means depending on the architecture of the agent (e.g., RL, symbolic), it is important to consider and propose a solution for each.

7.4 Technical Improvements

This section describes future work that is more intrinsic to the components rather than the whole pipeline. The π-NoCCHIO architecture opens up several research directions. First, a way for stakeholders to provide an explanation or justification of their decision could be added to the architecture. This would be useful in particular when the user has the possibility to override the decision taken by the stakeholders. Second, it would be interesting to develop more advanced variants of the lexicographic selection, or alternatives to it, as the current ones may be somewhat limiting.

The proposed ARIA algorithm could be improved in several ways. First, we believe our implementation could be further optimised, and the handling of continuous variables could be enhanced as well. More specifically, the latter could not only reduce the run time or improve the prediction accuracy but also provide more meaningful intervals for the people analysing the generated graph. Then, we think that the heuristic could benefit from using feature relevance in order to explore the most relevant or promising branches of the search space first. Moreover, we would like to emphasise the need for explanatory methods dedicated to bipolar argumentation to fully exploit its potential. Last, an in-depth qualitative study of the intelligibility of the different variants would provide useful insights into the advantages of using each variant.

Finally, with regard to the norm avoidance problem, it would be interesting to address the challenge of distinguishing between prima facie obligations and exceptions to a norm. We believe that this problem cannot easily be solved at the computational level, at least not without more expert knowledge than is utilised in our approaches. Additionally, this phenomenon should be addressed for the specific case of multi-agent systems, where it may arise from collaboration between the agents and thus remain undetected if a single agent is observed.

Chapter 8

Summary

This thesis was inspired by the story of Pinocchio, "Le avventure di Pinocchio - Storia di un burattino". This book and its adaptations tell the story of a sentient puppet that faces morally relevant situations along its wanderings.
Its companion, Jiminy Cricket, plays the role of its conscience, advising the puppet on what is right and what is not. Eventually, the story ends well, as Pinocchio learns from his experiences and is then changed into a real, flesh-and-blood little boy. Here, we did not create any flesh-and-blood agent. However, we contributed in several ways to research in the field of normative agents.

More precisely, this thesis aimed to investigate how artificial agents can be designed to comply with context-dependent norms, defined by heterogeneous stakeholders, while simultaneously optimising the reward for achieving the task for which they were created. The motivation for this research lies in the increasing number of autonomous systems deployed in complex social and organisational environments where behaviour cannot be judged only by its efficiency or effectiveness, but also by whether it conforms to ethical, legal, or social expectations. Reinforcement learning, while powerful for optimising decision-making through trial and error, lacks an intrinsic mechanism to incorporate such normative considerations. The central problem addressed in this thesis is, therefore, how to extend reinforcement learning with normative reasoning so that agents can learn behaviours that are not only effective but also socially acceptable and intelligible.

The thesis was guided by a set of research questions that structure its evolution. The first question asked how an artificial agent can be designed to follow context-specific norms in addition to optimising its goals. This was refined into two sub-questions. The first examined how reinforcement learning agents can be trained to comply with norms represented symbolically, ensuring that the expectations of stakeholders can be translated into concrete behavioural guidance. The second asked how norms can be collected and represented in such a way that they are suitable both for enabling stakeholders to exchange views about them and for training. These questions informed the design of the pipeline and motivated the individual contributions of the thesis.

The first major contribution is the proposal, in Chapter 3, of a normative architecture, called π-NoCCHIO [8], that extends reinforcement learning with symbolic reasoning capabilities. π-NoCCHIO integrates a reinforcement learning agent, optimising for its reward, with a supervisory component inspired by Jiminy Cricket, which encodes and enforces norms through the use of formal argumentation. Two alternative action-selection mechanisms were explored, and their performance was evaluated in the Taxi environment, in which the agent had to comply with multiple defeasible norms whose application was debated by two stakeholders. The results demonstrated that the inclusion of normative reasoning significantly altered agent behaviour, sometimes at the cost of short-term efficiency, but in ways that aligned with normative expectations. This work shows that symbolic reasoning can be combined with reinforcement learning in a coherent architecture that preserves the adaptability of learning while embedding normative constraints.

The second contribution, detailed in Chapter 4, concerns the elicitation and representation of norms.
Although much of the literature assumes that norms are given, this thesis addressed the problem of how norms can be derived from data to automatically model the perspectives of some stakeholders. After surveying existing approaches to norm mining, the thesis proposed ARIA, an algorithm that constructs argumentation frameworks and bipolar argumentation frameworks from behavioural data. ARIA was quantitatively evaluated on benchmark datasets and compared with machine learning baselines, showing competitive accuracy while offering an interpretable structure. It was also applied qualitatively to the Moral Machine experiment data, where it was able to structure the arguments so that it could provide justifications to defend a decision.

In Chapter 5, the third contribution is introduced: the formalisation and study of the norm avoidance phenomenon. Unlike direct violations, norm avoidance occurs when agents find ways to technically comply with rules while circumventing their spirit, similarly to the reward-gaming phenomenon, but for norms. This phenomenon is familiar in legal systems, but it has not been studied in reinforcement learning, and more specifically for normative systems. The thesis introduced formal definitions of norm avoidance and investigated its occurrence in standard reinforcement learning environments including normative considerations. Mitigation strategies that modify the training process or the agent's decision-making to limit occurrences of this unwanted behaviour were then proposed. Experimental results in multiple environments confirmed that these strategies eliminated norm avoidance without completely sacrificing efficiency. This contribution highlights the importance of anticipating unintended behaviours that arise when agents learn under normative constraints.

Together, these contributions form an end-to-end pipeline for normative reinforcement learning. Norms can be extracted from behavioural data and structured as argumentation graphs using ARIA. They can then be integrated into reinforcement learning through the π-NoCCHIO architecture. The pipeline acknowledges that agents can attempt to exploit loopholes; consequently, a method is provided to mitigate such behaviour. By combining learning techniques with symbolic reasoning, the pipeline demonstrates a way to operationalise normative systems in artificial agents.

The results obtained throughout the thesis provide several insights. First, it is feasible to combine reinforcement learning with symbolic representations of norms, resulting in agents that adapt their actions not only to maximise rewards but also to conform to stakeholder expectations. Second, identifying and structuring the norms allows one to produce models for stakeholders without the need for a costly expert design. Third, the phenomenon of norm avoidance should be considered when designing normative agents, but it can be countered, in the specific case of reinforcement learning, by carefully designing the training process. These findings collectively advance the state of the art in normative multi-agent systems and normative reinforcement learning by demonstrating that norm awareness can be embedded directly into the learning process.

The broader implications of the thesis concern the design of normative artificial intelligence.
The research demonstrates that it is possible to embed ethical and legal norms into autonomous systems in a way that is intelligible and adaptable. It shows how argumentation can improve transparency and how anticipating norm avoidance can prevent harmful unintended behaviours. Although the work has focused on relatively simple environments, the methodological foundations it establishes can be extended to more complex and realistic domains, such as multi-agent interactions or human–AI collaboration.

In conclusion, this thesis provides new methods and insights for aligning artificial agents with social and normative expectations. It combines connectionist and symbolic approaches for normative reasoning, creating a pipeline that operationalises stakeholder perspectives in agent training. By addressing norm identification, compliance, intelligibility, and avoidance, this thesis contributes to the development of transparent and norm-compliant agents by building on prior work. Due to its modularity, the proposed pipeline is open to future improvements that can be made by pursuing the research directions detailed in Chapter 7. Ultimately, the goal is to enable agents capable of exhibiting normative behaviour, making them suitable for integration into our world.

References

[1] Henk Aarts and Ap Dijksterhuis. "The silence of the library: environment, situational norm, and social behavior." In: Journal of Personality and Social Psychology 84.1 (2003), p. 18.

[2] David Abel, James MacGlashan, and Michael L. Littman. "Reinforcement Learning as a Framework for Ethical Decision Making". In: AAAI Workshop: AI, Ethics, and Society. Vol. 16. 2016.

[3] Amina Adadi and Mohammed Berrada. "Peeking inside the black-box: a survey on explainable artificial intelligence (XAI)". In: IEEE Access 6 (2018), pp. 52138–52160.

[4] Stefan Aeberhard and M. Forina. Wine. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5PC7J. 1992.

[5] João Paulo Aires, Daniele Pinheiro, Vera Strube de Lima, and Felipe Meneguzzi. "Norm conflict identification in contracts". In: Artificial Intelligence and Law 25.4 (2017), pp. 397–428.

[6] Md Sultan Al Nahian, Spencer Frazier, Mark Riedl, and Brent Harrison. "Training value-aligned reinforcement learning agents using a normative prior". In: IEEE Transactions on Artificial Intelligence 5.7 (2024), pp. 3350–3361.

[7] Muzaffer Alım and Saadettin Erhan Kesen. "Smart warehouses in logistics 4.0". In: Logistics 4.0. CRC Press, 2020, pp. 186–201.

[8] Benoît Alcaraz. "π-NoCCHIO: An Architecture for Context-Aware Normative Reinforcement Learning". In: Proceedings of the 18th International Conference on Agents and Artificial Intelligence. In press. 2026.

[9] Benoît Alcaraz, Olivier Boissier, Rémy Chaput, and Christopher Leturc. "AJAR: An argumentation-based judging agents framework for ethical reinforcement learning". In: AAMAS'23: International Conference on Autonomous Agents and Multiagent Systems. 2023, pp. 2427–2429.

[10] Benoît Alcaraz, Rémy Chaput, Olivier Boissier, and Christopher Leturc. "Combining Formal Argumentation and Reinforcement Learning: An Hybrid Approach to Machine Ethics". In: Proceedings of the 18th International Conference on Agents and Artificial Intelligence. In press. 2026.