Synchronizing Objectives for Markov Decision Processes

Johannes Reich and Bernd Finkbeiner (Eds.): International Workshop on Interactions, Games and Protocols (iWIGP), EPTCS 50, 2011, pp. 61–75, doi:10.4204/EPTCS.50.5. © L. Doyen, T. Massart & M. Shirmohammadi. This work is licensed under the Creative Commons Attribution License.

Laurent Doyen (LSV, ENS Cachan & CNRS, France; doyen@lsv.ens-cachan.fr)
Thierry Massart, Mahsa Shirmohammadi (Université Libre de Bruxelles, Brussels, Belgium*; thierry.massart@ulb.ac.be, mahsa.shirmohammadi@ulb.ac.be)

Abstract. We introduce synchronizing objectives for Markov decision processes (MDP). Intuitively, a synchronizing objective requires that eventually, at every step there is a state which concentrates almost all the probability mass. In particular, it implies that the probabilistic system behaves in the long run like a deterministic system: eventually, the current state of the MDP can be identified with almost certainty. We study the problem of deciding the existence of a strategy to enforce a synchronizing objective in MDPs. We show that the problem is decidable for general strategies, as well as for blind strategies where the player cannot observe the current state of the MDP. We also show that pure strategies are sufficient, but memory may be necessary.

1 Introduction

A Markov decision process (MDP) is a model for systems that exhibit both probabilistic and nondeterministic behavior. MDPs have been used to model and solve control problems for stochastic systems, where the nondeterminism represents the freedom of the controller to choose a control action, while the probabilistic component of the behavior describes the system response to control actions. MDPs have also been adopted as models for concurrent probabilistic systems, for probabilistic systems operating in open environments [7], and for under-specified probabilistic systems [4].
Traditional objectives for MDPs specify a set S of paths, where a path is an infinite sequence of states through the underlying graph of the MDP. The value of interest is the probability that an execution of the MDP under a given strategy belongs to S. For example, a reachability objective specifies all paths that visit a given target state ℓ. A typical qualitative question is to decide whether there exists a strategy such that a given state ℓ is reached with probability 1.

In this paper, we consider a different type of objective, which specifies a set of infinite sequences X̄ = X_0, X_1, ... of probability distributions over the states [6]. Intuitively, the distribution X_i in the sequence gives for each state ℓ the probability X_i(ℓ) to be in state ℓ at step i ≥ 0. We introduce synchronizing objectives, which specify sequences of distributions in which the probability tends to accumulate in a single state. We use the infinity norm as a measure of the highest peak in a probability distribution X_i (i.e., ‖X_i‖ = max_{ℓ∈L} X_i(ℓ)) and we require that the limit of this measure in the sequence is 1 (since the limit may not exist in general, we actually consider either lim inf or lim sup). Intuitively, this requires that in the long run, the MDP behaves like a deterministic system: from some point on, at every step i there is a state ℓ_i which accumulates almost all the probability. Note that satisfying such an objective implies that there exists a state ℓ which is reached with probability 1. The converse does not hold, because reachability objectives do not require the visits to the target state to occur after the same number of steps in (almost) all executions of the MDP.

* This work has been done in the MoVES project (P6/39), which is part of the IAP-Phase VI Interuniversity Attraction Poles Programme funded by the Belgian State, Belgian Science Policy.
We consider the problem of deciding if a given MDP is synchronizing for some strategy. We consider the general case where memoryful randomized strategies are allowed, as well as the special case of blind strategies, which are not allowed to observe the current state of the MDP.

Defining objectives as a sequence of probability distributions over states, rather than a distribution over sequences of states, is a change of standpoint in the traditional approach to MDP verification. To our knowledge, there are very few works in this setting. We are aware of the work in [6], which studies MDPs as generators of probability distributions, with applications in sensor networks and dynamical systems, and shows that the resulting objectives are not expressible in known logics such as PCTL* [1, 4]. In their definition, probability distributions over states are assigned a vector v ∈ {0,1}^k of truth values for a finite set of predicates φ_1, ..., φ_k (which are linear constraints on the probabilities, such as φ(X) ≡ X(ℓ) + X(ℓ') ≤ 1/2, for example). This can be viewed as a coloring of the probability distributions using a finite number of colors, and then objectives are languages of infinite words over the finite alphabet of colors. It is shown that reachability of a given color is undecidable for MDPs if arbitrary linear predicates are allowed [6]. A decidability result is obtained if only predicates of the form ∑_{ℓ∈T} X(ℓ) > 0 are allowed. Synchronizing objectives cannot be expressed in the framework of [6] using finite colorings, as they require a real-valued measure (namely, the infinity norm) to be assigned to the probability distributions.

In [2], the monadic logic of probabilities is introduced as a predicate logic which can express properties of sequences of probability distributions.
But because it allows comparison of probabilities only with constants, it cannot express synchronizing objectives, which would require a quantification over probability thresholds, such as φ(X̄) ≡ ∀ε > 0 · ∃N · ∀i ≥ N · ∃ℓ ∈ L : X_i(ℓ) ≥ 1 − ε, where X_i is the probability distribution in position i in the sequence X̄.

Synchronizing objectives generalize the notion of synchronizing words. In a deterministic finite automaton, a word w is synchronizing if reading w from any state of the automaton always leads to the same state. It is sufficient to consider finite words, and it is conjectured that if a synchronizing word exists, then there exists one of length at most (n − 1)², where n is the number of states of the automaton; this is known as Černý's conjecture. Several works have studied this conjecture and related problems (see the survey in [8]). Viewing deterministic automata as a special case of MDPs where all transitions have only one successor, a synchronizing word can be seen as a blind strategy to ensure a synchronizing objective. Note that we do not present a generalization of Černý's conjecture, since in our case strategies for MDPs are infinite objects. However, synchronizing objectives provide an extension of the design framework for the many applications of the theory of synchronizing words, such as control of discrete event systems, planning, biocomputing, and robotics [8]. For example, in probabilistic models of DNA transcription, one may ask which molecules to introduce in a cell in order to bring it to a single possible state [3, 8].

We prove that it is decidable to determine if a given MDP is synchronizing for some strategy, either blind or general. We use variants of the subset construction in the underlying graph of MDPs to obtain a decidable characterization of synchronizing strategies.
Our results imply that pure strategies are sufficient to satisfy a synchronizing objective, but we provide an example showing that memory may be necessary, both with blind and general strategies.

2 Definitions

A probability distribution over a finite set S is a function d : S → [0, 1] such that ∑_{s∈S} d(s) = 1. The support of d is the set Supp(d) = {s ∈ S | d(s) > 0}. D(S) denotes the set of all probability distributions on S, and P(S) the power set of S.

Markov decision processes. A Markov decision process (MDP) is a tuple M = ⟨L, μ_0, Σ, δ⟩ where L is a finite set of states, μ_0 ∈ D(L) is an initial probability distribution over states, Σ is a finite set of actions, and δ : L × Σ → D(L) is a probabilistic transition function that assigns to each pair of a state and an action a probability distribution over successor states. A Markov chain is a special case of MDP with only one action (|Σ| = 1). Markov chains are therefore generally viewed as a tuple M = ⟨L, μ_0, δ⟩ where δ : L → D(L). For an action σ ∈ Σ and a state ℓ ∈ L, let Post_σ(ℓ) = Supp(δ(ℓ, σ)), and for a set s ⊆ L, let Post_σ(s) = ∪_{ℓ∈s} Post_σ(ℓ).

Example. Figure 1(a) shows an MDP with four states and alphabet Σ = {σ1, σ2}. The initial probability distribution is μ_0(1) = 1 and μ_0(i) = 0 for i ∈ {2, 3, 4}, and the probabilistic transition function δ in state 1 is such that δ(1, σ1)(2) = δ(1, σ1)(3) = 1/2 and δ(1, σ2)(1) = 1.

We describe the behavior of an MDP as a one-player stochastic game played for infinitely many rounds. In the first round, the game starts in state ℓ with probability μ_0(ℓ). In each round, if the game is in state ℓ and the player chooses the action σ ∈ Σ, then the game moves to a successor state ℓ' chosen with probability δ(ℓ, σ)(ℓ'), and the next round starts. We consider two versions of this game.
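To make the definitions concrete, an MDP can be encoded directly as data. A minimal Python sketch follows; only the transitions of state 1 are taken from the example above, and the remaining transitions are hypothetical placeholders since the figure is not fully reproduced in the text:

```python
from typing import Dict, Set, Tuple

State = int
Action = str

# M = <L, mu0, Sigma, delta>, following the definition above.
L: Set[State] = {1, 2, 3, 4}
Sigma: Set[Action] = {"s1", "s2"}          # s1, s2 stand for sigma1, sigma2
mu0: Dict[State, float] = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0}

delta: Dict[Tuple[State, Action], Dict[State, float]] = {
    (1, "s1"): {2: 0.5, 3: 0.5},   # delta(1, sigma1)(2) = delta(1, sigma1)(3) = 1/2
    (1, "s2"): {1: 1.0},           # delta(1, sigma2)(1) = 1
    # hypothetical completions for states 2, 3, 4 (not given in the text):
    (2, "s1"): {4: 1.0}, (2, "s2"): {2: 1.0},
    (3, "s1"): {3: 1.0}, (3, "s2"): {4: 1.0},
    (4, "s1"): {4: 1.0}, (4, "s2"): {4: 1.0},
}

def post(sigma: Action, states: Set[State]) -> Set[State]:
    """Post_sigma(s): union over l in s of Supp(delta(l, sigma))."""
    return {l2 for l in states
               for l2, p in delta[(l, sigma)].items() if p > 0}

print(post("s1", {1}))   # support of delta(1, sigma1), i.e. {2, 3}
```

Each distribution `delta[(l, sigma)]` sums to 1, matching the definition of D(L); `post` is the Post_σ operator used throughout the subset constructions below.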
In both versions, the player knows the structure of the MDP. In the first version the player has perfect information: he can see the current state of the game. In the second version the player is blind: he is not allowed to observe the current state of the game, and only knows the number of rounds that have been played so far.

A play of the game is an infinite sequence of interleaved states and actions π = ℓ_0 σ_0 ℓ_1 ⋯ such that ℓ_{i+1} ∈ Post_{σ_i}(ℓ_i) for all i ≥ 0. The set of all plays over M is denoted by Plays(M). A finite prefix h = ℓ_0 σ_0 ℓ_1 ⋯ σ_{n−1} ℓ_n of a play π is called a history; the last state of h is Last(h) = ℓ_n, the i-th action and state of h are Action(h, i) = σ_i and State(h, i) = ℓ_i, and its length is |h| = n. The set of all histories of plays is denoted by Hists(M).

Strategies and outcome. In the game, the choice of the action is made by the player according to a strategy. Depending on what the player can observe and record, he can use various classes of strategies. A randomized strategy (or simply a strategy) over an MDP M is a function α : Hists(M) → D(Σ). A pure (deterministic) strategy is a special case of randomized strategy where for all h ∈ Hists(M), there exists an action σ ∈ Σ such that α(h)(σ) = 1. A memoryless strategy is a randomized strategy α such that α(h_1) = α(h_2) for all h_1, h_2 ∈ Hists(M) with Last(h_1) = Last(h_2). In this last case, the player cannot record the history of the play and makes a choice according to the current state only. For convenience, we view pure strategies as functions α : Hists(M) → Σ, and memoryless strategies as functions α : L → D(Σ). Hence, a pure memoryless strategy is a function α : L → Σ. A strategy α is blind if α(h_1) = α(h_2) for all h_1, h_2 ∈ Hists(M) such that |h_1| = |h_2|.
Blind strategies can be viewed as functions α : N → D(Σ) (or α : N → Σ for pure blind strategies) which assign to each round a probability distribution over actions. Sometimes we talk about perfect-information strategies to emphasize when we consider strategies that are not necessarily blind.

The outcome of the game played on an MDP M = ⟨L, μ_0, Σ, δ⟩ using a strategy α is the infinite sequence X^α_0 X^α_1 ... of probability distributions over the set of states L, where X^α_0 = μ_0 and for all n > 0,

    X^α_n(ℓ) = ∑_{h ∈ Hists(M) : Last(h)=ℓ, |h|=n} Pr^α(h),

where the probability Pr^α(h) of a history h = ℓ_0 σ_0 ℓ_1 ⋯ σ_{n−1} ℓ_n under strategy α is

    Pr^α(h) = μ_0(ℓ_0) · ∏_{j=1}^{n} α(ℓ_0 σ_0 ... ℓ_{j−1})(σ_{j−1}) · δ(ℓ_{j−1}, σ_{j−1})(ℓ_j).

Synchronizing objectives. The norm of a probability distribution X over L is ‖X‖ = max_{ℓ∈L} X(ℓ). We say that the MDP M with strategy α is strongly synchronizing if

    lim inf_{n→∞} ‖X^α_n‖ = 1,    (1)

and that it is weakly synchronizing if

    lim sup_{n→∞} ‖X^α_n‖ = 1.    (2)

Intuitively, an MDP is synchronizing if the probability mass tends to concentrate in a single state, either at every step from some point on (for strongly synchronizing), or at infinitely many steps (for weakly synchronizing). Note that, equivalently, M with strategy α is strongly synchronizing if the limit lim_{n→∞} ‖X^α_n‖ exists and equals 1. In this paper, we are interested in the problem of deciding if a given MDP is synchronizing for some strategy. We consider the problem for both perfect-information and blind strategies.

Recurrent and transient states. A state ℓ' ∈ L is accessible from a state ℓ ∈ L (denoted ℓ → ℓ'), if there is a history h = ℓ_0 σ_0 ℓ_1 ⋯ σ_{n−1} ℓ_n with ℓ_0 = ℓ and ℓ_n = ℓ'. If both ℓ → ℓ' and ℓ' → ℓ hold, then we say that ℓ and ℓ' are strongly connected (denoted ℓ ↔ ℓ').
This induces an equivalence relation called the accessibility relation. An MDP is strongly connected if all pairs of states ℓ, ℓ' ∈ L are strongly connected. A state accessible from a state of Supp(μ_0) is simply called an accessible state. For a Markov chain M, the state ℓ is recurrent if all states accessible from ℓ can access ℓ (i.e., ℓ and ℓ' are strongly connected for all ℓ' such that ℓ → ℓ'), and the state ℓ is transient if there exists some state ℓ' such that ℓ' is accessible from ℓ, but ℓ is not accessible from ℓ'. The next proposition follows from standard results [5].

Proposition 1. Given a Markov chain M, let X_0, X_1, ... be the sequence of probability distributions of M. Then lim sup_{n→∞} X_n(ℓ) = 0 for all transient states ℓ ∈ L, and lim sup_{n→∞} X_n(ℓ) > 0 for all recurrent states ℓ ∈ L.

Subset constructions. We define two important constructions based on the subset construction idea. The subset construction is a standard technique to compute, from a nondeterministic finite automaton N, an equivalent deterministic automaton D (for language equivalence), where one state of D corresponds to the set of possible states (called a cell) in which N can be. We define two kinds of subset constructions on MDPs: the perfect-information subset construction and the blind subset construction. As usual, each state of the subset constructions is a subset of states of the MDP (i.e., a cell). In our case, the main difference lies in the alphabet. In the perfect-information subset construction, the selection of the next action depends on the current state (each state of a cell can independently choose an action), while in the blind subset construction the next action is independent of the state (all states of a cell have to choose the same action). Thus, an action in the perfect-information subset construction is a function σ̂ : L → Σ which assigns to each state ℓ ∈ L its choice among the actions in Σ.
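The difference between the two constructions amounts to how a cell makes one transition step. A minimal Python sketch over a small hypothetical MDP (only the supports of δ matter here, so successor sets are given directly):

```python
# Hypothetical three-state MDP, given by the supports Post_sigma(l).
delta = {
    ("a", "s1"): {"b", "c"}, ("a", "s2"): {"a"},
    ("b", "s1"): {"c"},      ("b", "s2"): {"b"},
    ("c", "s1"): {"c"},      ("c", "s2"): {"b"},
}

def blind_step(cell, sigma):
    """delta_B(cell, sigma): every state of the cell plays the same action."""
    return frozenset(l2 for l in cell for l2 in delta[(l, sigma)])

def perfect_info_step(cell, sigma_hat):
    """delta_P(cell, sigma_hat): each state l plays its own action sigma_hat(l)."""
    return frozenset(l2 for l in cell for l2 in delta[(l, sigma_hat[l])])

cell = frozenset({"b", "c"})
print(blind_step(cell, "s1"))                           # collapses to the cell {'c'}
print(perfect_info_step(cell, {"b": "s1", "c": "s2"}))  # the cell {'b', 'c'}
```

In this toy example the blind step with σ1 collapses the cell {b, c} to the singleton {c}, while a perfect-information action function can keep the cell spread out; this per-state freedom is exactly what the alphabet Σ̂ of functions σ̂ : L → Σ captures.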
Figure 1: (a) shows an MDP, and (b) shows the accessible states of its perfect-information subset construction.

Definition 1 (Perfect-information subset construction of an MDP). For an MDP M = ⟨L, μ_0, Σ, δ⟩, the perfect-information subset construction is an automaton M_P = ⟨𝓛, L_I, Σ̂, δ_P⟩ where 𝓛 = P(L) \ {∅}, L_I = Supp(μ_0), Σ̂ = {σ̂ | σ̂ : L → Σ} is the alphabet, and δ_P : 𝓛 × Σ̂ → 𝓛 is such that for all s_1, s_2 ∈ 𝓛 and σ̂ ∈ Σ̂, we have δ_P(s_1, σ̂) = s_2 where s_2 = ∪_{ℓ∈s_1} Post_{σ̂(ℓ)}(ℓ).

Example. Figure 1(b) shows the perfect-information subset construction M_P of the MDP drawn in Figure 1(a) (presented in the first example). The alphabet Σ̂ is given in the table below. Each row is labeled by a function σ̂_i (i ∈ {1, ..., 11}), each column by a state ℓ, and each entry shows the value of σ̂_i(ℓ).

            1          2          3          4
    σ̂1      σ2        {σ1,σ2}    {σ1,σ2}    {σ1,σ2}
    σ̂2      σ1        {σ1,σ2}    {σ1,σ2}    {σ1,σ2}
    σ̂3     {σ1,σ2}     σ2         σ1        {σ1,σ2}
    σ̂4     {σ1,σ2}     σ1         σ2        {σ1,σ2}
    σ̂5     {σ1,σ2}     σ2         σ2        {σ1,σ2}
    σ̂6     {σ1,σ2}     σ1         σ1        {σ1,σ2}
    σ̂7     {σ1,σ2}    {σ1,σ2}    {σ1,σ2}    {σ1,σ2}
    σ̂8     {σ1,σ2}     σ1        {σ1,σ2}    {σ1,σ2}
    σ̂9     {σ1,σ2}    {σ1,σ2}     σ2        {σ1,σ2}
    σ̂10    {σ1,σ2}    {σ1,σ2}     σ1        {σ1,σ2}
    σ̂11    {σ1,σ2}     σ2        {σ1,σ2}    {σ1,σ2}

Note that an entry σ̂_i(ℓ) = {σ1,σ2} abbreviates two different functions, one with value σ1 at ℓ and one with value σ2 at ℓ; these two functions behave similarly.

A cycle of M_P is a finite sequence C_P = s_0 σ̂_0 s_1 ...
s_{d−1} σ̂_{d−1} s_d of interleaved cells and symbols such that δ_P(s_j, σ̂_j) = s_{j+1} for all 0 ≤ j < d, and s_0 = s_d. Note that, in this definition, d is the length of the cycle C_P. We write s ∈ C_P if s is one of the cells s_j (0 ≤ j < d) of the finite sequence of the cycle C_P. A simple cycle is a cycle where all cells s_0, ..., s_{d−1} are different. We are interested in defining a property on cycles of the perfect-information subset construction for a given MDP.

Definition 2 (Recurrent cyclic sets). Let C_P = s_0 σ̂_0 ... s_{d−1} σ̂_{d−1} s_d be a cycle of the perfect-information subset construction M_P for a given MDP M. A recurrent cyclic set for the cycle C_P is a sequence G = g_0 g_1 ... g_d such that g_0 = g_d, and ∅ ≠ g_i ⊆ s_i and ∪_{ℓ∈g_i} Post_{σ̂_i(ℓ)}(ℓ) = g_{i+1} for all 0 ≤ i < d.

Figure 2: (a) shows an MDP, and (b) shows some part of its perfect-information subset construction.

A cycle C_P might have several recurrent cyclic sets. A recurrent cyclic set G for a given cycle C_P is said to be minimal if there is no other recurrent cyclic set G' (G ≠ G') such that for 0 ≤ i < d, and for g_i ∈ G, g'_i ∈ G', we have g'_i ⊆ g_i. We denote the set of all minimal recurrent cyclic sets of the cycle C_P by ∆(C_P) = {G | G is a minimal recurrent cyclic set for the cycle C_P}.

Example. Consider the MDP M in Figure 2 (the initial distribution is μ_0(1) = 1 and μ_0(i) = 0 for i ∈ {2, ..., 9}). Figure 2(b) shows one cycle of the perfect-information subset construction M_P. The relevant part of Σ̂ is given in the table below. Each row is labeled by a function σ̂_i (i ∈ {1, ...
, 4}), each column by a state ℓ, and each entry shows the value of σ̂_i(ℓ).

           1         2         3         4         5         6         7         8         9
    σ̂1  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}
    σ̂2  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}    σ1     {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}
    σ̂3  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}    σ2     {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}
    σ̂4  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}  {σ1,σ2}

For the cycle C_P = {2,5,8} σ̂2 {3,5,6} σ̂3 {4,7,9} σ̂4 {2,5,8}, the set of minimal recurrent cyclic sets is ∆(C_P) = {{{2},{3},{4}}, {{5},{6},{7}}}. The elements of ∆(C_P) are not comparable.

The blind subset construction for an MDP is a special case of its perfect-information subset construction where the action functions σ̂ ∈ Σ̂ are restricted to constant functions. In each cell, all states have to choose the same action.

Definition 3 (Blind subset construction of an MDP). The blind subset construction for a given MDP M = ⟨L, μ_0, Σ, δ⟩ is an automaton M_B = ⟨𝓛, L_I, Σ, δ_B⟩ where 𝓛 = P(L) \ {∅}, L_I = Supp(μ_0), and for all s_1, s_2 ∈ 𝓛 and σ ∈ Σ, we have δ_B(s_1, σ) = s_2 where s_2 = Post_σ(s_1). We denote cycles in the blind subset construction by C_B.

3 Synchronizing Objectives for Perfect-Information Strategies

We have defined a perfect-information one-player stochastic game in which the player can see the current state of the game and record the sequence of visited states.

Figure 3: An MDP where memory is necessary to win the strongly synchronizing objective.

We show that synchronizing strategies can be
characterized in the perfect-information subset construction, giving a decidability result. We also show in the next example that memory may be necessary.

Example. Consider the MDP M in Figure 3 (the initial distribution is μ_0(1) = 1 and μ_0(i) = 0 for i ∈ {2, ..., 5}), and let α be the strategy defined as follows: α((L × Σ)* ℓ)(σ) = 1/2 for all σ ∈ Σ and ℓ ∈ {1, 3, 4, 5}, and for the histories ending in state 2 (where ℓ denotes the state visited just before 2),

    α((L × Σ)* ℓ Σ 2)(σ) =  1 if ℓ = 1 and σ = σ2,
                            1 if ℓ ≠ 1 and σ = σ1,
                            0 otherwise.

In this example, it is easy to check that the strategy α is strongly synchronizing. In state 2, it plays σ1 and σ2 in alternation in order to ensure synchronization with the cycle 3, 4, 5 of length 3. However, no memoryless strategy is strongly synchronizing, showing that memory is necessary. This example also shows that memory is necessary for the weakly synchronizing objective, as well as for blind strategies.

Proposition 2. For both strongly and weakly synchronizing objectives, memoryless strategies are not sufficient in MDPs.

Theorem 1. For a perfect-information game over an MDP M, there exists a strategy α such that M with strategy α is strongly synchronizing, if and only if the perfect-information subset construction M_P for M has an accessible cycle C_P such that |∆(C_P)| = 1, and for G ∈ ∆(C_P) and for all g ∈ G, |g| = 1.

Proof. Sufficient condition. We suppose that the perfect-information subset construction M_P for M has an accessible cycle C_P = s_0 σ̂_0 ... s_d such that |∆(C_P)| = 1, and for G ∈ ∆(C_P) and for all g ∈ G, we have |g| = 1. Since this cycle is accessible, there exists a finite path P = p_0 σ̂'_0 p_1 ... p_{m−1} σ̂'_{m−1} p_m in M_P from p_0 = L_I to p_m = s_0 = s_d (see Figure 4). Consider the pure strategy α defined as follows:

    α((L × Σ)^k ℓ) =  σ̂'_k(ℓ)             if 0 ≤ k < m,
                      σ̂_{(k−m) mod d}(ℓ)   if m ≤ k.
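The strategy α above depends only on the round number k and the current state ℓ: follow the action functions of the path P for the first m rounds, then loop forever through those of the cycle C_P. A minimal Python sketch, with hypothetical action functions as placeholders:

```python
# Hypothetical action functions over a two-state MDP {a, b}:
# path_actions holds sigma-hat'_0 ... sigma-hat'_{m-1} (here m = 1),
# cycle_actions holds sigma-hat_0 ... sigma-hat_{d-1} (here d = 2).
path_actions  = [{"a": "s1", "b": "s1"}]
cycle_actions = [{"a": "s2", "b": "s1"},
                 {"a": "s1", "b": "s2"}]
m, d = len(path_actions), len(cycle_actions)

def alpha(round_k, state):
    """The pure strategy alpha((L x Sigma)^k l): a function of (k, l) only."""
    if round_k < m:
        return path_actions[round_k][state]
    return cycle_actions[(round_k - m) % d][state]

# From round m on, the choices in a fixed state repeat with period d.
print([alpha(k, "a") for k in range(5)])
```

The periodicity from round m on is what lets the proof replace the MDP under α by the finite Markov chain M' constructed next.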
Let us construct a finite Markov chain M' in such a way that its long-term behavior simulates, with respect to synchronizing objectives, the long-term behavior of the MDP M under the strategy α.

Figure 4: An accessible cycle C_P of M_P which is reachable by a finite path p_0, ..., p_m.

This Markov chain is M' = ⟨L', μ'_0, δ'⟩ where L' = {(i, ℓ) | 0 ≤ i < m + d and ℓ ∈ L}, the initial distribution μ'_0 is defined by

    μ'_0((i, ℓ)) =  μ_0(ℓ)  if i = 0,
                    0       otherwise,

and the probability transition function δ' is defined by

    δ'((i, ℓ))((i', ℓ')) =  δ(ℓ, σ̂'_i(ℓ))(ℓ')     if 0 ≤ i < m, i' = i + 1, ℓ ∈ p_i and ℓ' ∈ p_{i'},
                            δ(ℓ, σ̂_{i−m}(ℓ))(ℓ')   if m ≤ i < m + d, i' = m + (i − m + 1) mod d, ℓ ∈ s_{i−m} and ℓ' ∈ s_{i'−m},
                            0                      otherwise.

The idea is that each cell p_i (0 ≤ i < m) of the path P and, similarly, each cell s_i (m ≤ i < m + d) of the cycle C_P corresponds to |L| states in the Markov chain M' (one for each state of the MDP M). The value of δ'((i, ℓ))((i', ℓ')) is the probability to reach, in one step, the state (i', ℓ') from the state (i, ℓ); semantically, it gives the probability to go from ℓ to ℓ' at step i. We show that (a) if the Markov chain M' is strongly synchronizing, then so is the MDP M under the strategy α, and that (b) M' is strongly synchronizing. Proving (a) is straightforward from the definition of the Markov chain M'. Each state of the MDP M corresponds to m + d states of M'. Then if, from some point on, the probability mass accumulates in one state of M' and afterwards moves entirely to another one, the same happens in M.
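The limit behavior that part (b) below relies on (Proposition 1: mass on transient states vanishes) can be illustrated numerically. A minimal sketch with a hypothetical two-state chain, where state 0 is transient and state 1 is absorbing, so ‖X_n‖ tends to 1:

```python
# Hypothetical chain: state 0 leaks probability 1/2 per step to the
# absorbing state 1; state 1 is the only recurrent state.
delta = [[0.5, 0.5],   # row l: the distribution delta(l) over successors
         [0.0, 1.0]]

X = [1.0, 0.0]         # X_0 = mu_0, all mass on the transient state
norms = []
for n in range(60):
    X = [sum(X[l] * delta[l][l2] for l in range(2)) for l2 in range(2)]
    norms.append(max(X))   # ||X_n|| = max_l X_n(l)

# The transient mass X_n(0) = (1/2)^n vanishes, so ||X_n|| -> 1.
print(norms[0], norms[-1])
```

Here the recurrent part consists of a single state, so the vanishing of the transient mass directly forces lim ‖X_n‖ = 1, which is the shape of the argument for M' when every g ∈ G is a singleton.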
In detail, let the sequence X^α_i (i ∈ N) denote the outcome of the MDP M under the strategy α, and let X'_i (i ∈ N) denote the probability distribution at step i generated by the Markov chain M'. Note that X^α_i is a distribution over |L| entries, while X'_i is over |L| · (m + d) entries, of which at most |L| are non-zero. Let us compute and compare the non-zero entries of these two sequences of distributions. For ℓ ∈ L:

    X^α_0(ℓ) = μ_0(ℓ) = X'_0((0, ℓ)),

and we have X'_0((j, ℓ)) = 0 for all j ≠ 0. Next,

    X^α_1(ℓ) = ∑_{ℓ'∈L} μ_0(ℓ') · δ(ℓ', α(ℓ'))(ℓ)
             = ∑_{ℓ'∈L} μ_0(ℓ') · δ(ℓ', σ̂'_0(ℓ'))(ℓ)
             = ∑_{ℓ'∈L} μ_0(ℓ') · δ'((0, ℓ'))((1, ℓ)) = X'_1((1, ℓ)),

and we have X'_1((j, ℓ)) = 0 for all j ≠ 1. In the next step, let us compute these distributions for i < m:

    X^α_i(ℓ) = ∑_{ℓ_0,ℓ_1,...,ℓ_{i−1}∈L} μ_0(ℓ_0) · δ(ℓ_0, α(ℓ_0))(ℓ_1) · δ(ℓ_1, α(ℓ_0 α(ℓ_0) ℓ_1))(ℓ_2) ⋯ δ(ℓ_{i−1}, α(ℓ_0 α(ℓ_0) ℓ_1 ... ℓ_{i−1}))(ℓ)
             = ∑_{ℓ_0,ℓ_1,...,ℓ_{i−1}∈L} μ_0(ℓ_0) · δ(ℓ_0, σ̂'_0(ℓ_0))(ℓ_1) · δ(ℓ_1, σ̂'_1(ℓ_1))(ℓ_2) ⋯ δ(ℓ_{i−1}, σ̂'_{i−1}(ℓ_{i−1}))(ℓ)
             = ∑_{ℓ_0,ℓ_1,...,ℓ_{i−1}∈L} μ_0(ℓ_0) · δ'((0, ℓ_0))((1, ℓ_1)) · δ'((1, ℓ_1))((2, ℓ_2)) ⋯ δ'((i − 1, ℓ_{i−1}))((i, ℓ))
             = X'_i((i, ℓ)).

We also have X'_i((j, ℓ)) = 0 for all j ≠ i; these results give ‖X^α_i‖ = ‖X'_i‖ for i < m. Finally, consider i ≥ m:

    X^α_i(ℓ) = ∑_{ℓ_0,ℓ_1,...,ℓ_{i−1}∈L} μ_0(ℓ_0) · δ(ℓ_0, α(ℓ_0))(ℓ_1) · δ(ℓ_1, α(ℓ_0 α(ℓ_0) ℓ_1))(ℓ_2) ⋯ δ(ℓ_{i−1}, α(ℓ_0 α(ℓ_0) ℓ_1 ...
ℓ_{i−1}))(ℓ)
             = ∑_{ℓ_0,ℓ_1,...,ℓ_{i−1}∈L} μ_0(ℓ_0) · δ(ℓ_0, σ̂'_0(ℓ_0))(ℓ_1) ⋯ δ(ℓ_{m−1}, σ̂'_{m−1}(ℓ_{m−1}))(ℓ_m) · δ(ℓ_m, σ̂_0(ℓ_m))(ℓ_{m+1}) ⋯ δ(ℓ_{i−1}, σ̂_{(i−1−m) mod d}(ℓ_{i−1}))(ℓ)
             = ∑_{ℓ_0,ℓ_1,...,ℓ_{i−1}∈L} μ_0(ℓ_0) · δ'((0, ℓ_0))((1, ℓ_1)) ⋯ δ'((m − 1, ℓ_{m−1}))((m, ℓ_m)) ⋯ δ'((m + (i − 1 − m) mod d, ℓ_{i−1}))((m + (i − m) mod d, ℓ))
             = X'_i((m + (i − m) mod d, ℓ)).

We also have X'_i((j, ℓ)) = 0 for all j ≠ m + (i − m) mod d; this gives ‖X^α_i‖ = ‖X'_i‖ for i ≥ m. We have shown that X^α_i(ℓ) = X'_i((j, ℓ)) where j = i for 0 ≤ i < m, and j = m + (i − m) mod d for i ≥ m. This gives ‖X^α_i‖ = ‖X'_i‖ for all i ∈ N, meaning that if the Markov chain M' is synchronizing, so is the MDP M under the strategy α.

To show (b), we study the transient and recurrent states of the Markov chain M'. Suppose that G ∈ ∆(C_P) is the only recurrent cyclic set of the cycle, consisting of d elements g_0, ..., g_{d−1}. Let R be the set of states (m + i, ℓ) such that ℓ ∈ g_i, for 0 ≤ i < d. We claim that the states of R are the only recurrent states in the Markov chain M'.

• First, we can see that the states of R are recurrent. By construction, the states of R are strongly connected. In addition, we have to prove that if (m + i, ℓ) ∈ R and (m + i, ℓ) → (m + j, ℓ'), then (m + j, ℓ') ∈ R. This holds by induction using the equality ∪_{ℓ∈g_i} Post_{σ̂_i(ℓ)}(ℓ) = g_{i+1}. Note that (m + i, ℓ) ∈ R implies that ℓ ∈ g_i; and if (m + i, ℓ) → (m + j, ℓ'), then ℓ' has to lie in g_j.

• Now, we show that the states of R are the only recurrent states. Towards contradiction, suppose that there is another set R' of recurrent states in the Markov chain M'.
By Proposition 1, and since the states (i, ℓ) (0 ≤ i < m) are visited at most once, these states cannot be recurrent; therefore we only discuss the states (m + i, ℓ) with 0 ≤ i < d of the Markov chain M'. Let g'_i denote the set {ℓ | (m + i, ℓ) ∈ R'} ∩ s_i for 0 ≤ i < d. The construction of the Markov chain implies that a state (m + i, ℓ) can only have outgoing edges toward states (m + (i + 1) mod d, ℓ'); hence g'_i ≠ ∅ for all 0 ≤ i < d. On the other hand, the definition of recurrent states requires that each state accessible from (m + i, ℓ) ∈ R' can access (m + i, ℓ); therefore ∪_{ℓ∈g'_i} Post_{σ̂_i(ℓ)}(ℓ) = g'_{i+1}. This yields another recurrent cyclic set, a contradiction with |∆(C_P)| = 1.

By Proposition 1, for the transient states (k, ℓ), the probability X'_n((k, ℓ)) vanishes as n → ∞. Since for all g ∈ G we have |g| = 1, the support of X'_n (n > m) contains only one recurrent state. Thus, the probability mass accumulates in that state: for all ε > 0 there exists n_0 such that for all n > n_0 there is a state (i, ℓ) with X'_n((i, ℓ)) > 1 − ε, that is, ‖X'_n‖ > 1 − ε. Hence, lim_{n→∞} ‖X'_n‖ = 1 and M' is strongly synchronizing. Therefore, so is the MDP M under the strategy α.

Necessary condition. Assume that the MDP M with strategy α is strongly synchronizing. Then ∀ε > 0 · ∃n_0 ∈ N · ∀n ≥ n_0 · ∃q_n such that X^α_n(q_n) > 1 − ε. Moreover, the state q_n is unique, and we show below that it is independent of ε (assuming ε < 1/2). Let ν be the smallest transition probability of the MDP M (i.e., ν = min_{ℓ∈L, σ∈Σ, ℓ'∈Supp(δ(ℓ,σ))} δ(ℓ, σ)(ℓ')). Let ε < ν/(1 + ν). We claim that for all n ≥ n_0, there exists some action σ ∈ Σ such that Post_σ(q_n) = {q_{n+1}} is a singleton. Towards contradiction, assume that for all σ ∈ Σ, we have Post_σ(q_n) ≠ {q_{n+1}}.
Then the probability mass that does not move from q_n into q_{n+1} is at least ν · (1 − ε), and since M is strongly synchronizing, we have

    1 − ε ≤ ‖X^α_{n+1}‖ ≤ 1 − ν · (1 − ε).

This gives ε ≥ ν/(1 + ν), which is a contradiction. Therefore, for all n ≥ n_0, there exists σ ∈ Σ such that Post_σ(q_n) = {q_{n+1}}. This implies that the infinite sequence of states I = q_{n_0} q_{n_0+1} ... is uniquely defined.

The sequence I is used to define a pure synchronizing strategy β from the randomized synchronizing strategy α. This construction implies that pure strategies are sufficient for strongly synchronizing objectives. We define the pure strategy β as follows:

• for h ∈ Hists(M) with |h| = i and Last(h) = q_i, we define β(h) = σ where Post_σ(Last(h)) = {q_{i+1}};

• for h ∈ Hists(M) with |h| = i and Last(h) ≠ q_i, we define β(h) = Action(h', i) where h' ∈ Hists(M) is the shortest possible history such that (1) State(h', i) = Last(h), (2) Pr^α(h') > 0, and (3) Last(h') = q_j with |h'| = j.

One might notice that a reachable state Last(h) with strictly positive probability Pr^α(h) > 0 has to access a state of I (such as Last(h') = q_j where |h'| = j); otherwise the MDP M with strategy α would not be strongly synchronizing. Consequently, the history h' defined above always exists. As a result, we can define SizePath(h) = |h'| − |h| to be the length of the shortest path from Last(h) to the infinite sequence I. Note that for h with |h| = i and Last(h) = q_i, we define SizePath(h) = 1. It is easy to see that the MDP M with pure strategy β is also strongly synchronizing. In the following, we show that there exists a cycle C_P of M_P which has only one recurrent cyclic set G, and all g ∈ G are singletons.
By construction, we have β(h) = β(h′) for all histories h, h′ ∈ Hists(M) with Last(h) = Last(h′) and |h| = |h′|. Therefore the pure strategy β induces an infinite path P_β in the perfect-information subset construction M_P. Since the state space of M_P is finite, some cell S must be visited infinitely many times along P_β. The path between two visits to S along P_β is a cycle (not necessarily a simple cycle) of M_P. We study one of these cycles (starting at S and coming back to it), and prove that it satisfies the conditions of the theorem. Let Inf(I) denote the set of all states visited infinitely often along I. Hence there exists N_Inf ≥ n_0 such that ∀i ≥ N_Inf: q_i ∈ I ⇒ q_i ∈ Inf(I). Let K_1 be the first step after N_Inf at which the path P_β visits S. Let MaxPath = max_{h ∈ Hists(M), Pr^β(h) > 0, |h| = K_1} SizePath(h) be the length of the longest of the shortest paths from a reachable state at step K_1 to the infinite sequence I. Let C_P be the cycle starting in S at step K_1 and coming back to this cell at some step K_2 > K_1 + MaxPath. We claim that this cycle C_P has only one recurrent cyclic set G, and that all subsets g ∈ G are singletons:

1. G = {{q_i} | q_i ∈ I for K_1 ≤ i ≤ K_2} is a recurrent cyclic set. We have already proved that there exists σ ∈ Σ such that Post_σ(q_n) = {q_{n+1}} for n ≥ n_0; note that for the state q_n, the action σ is the one chosen along the cycle.

2. G is the only recurrent cyclic set. Each state included in S reaches, in at most MaxPath steps, a state of I. Hence the cell S, as the first element of C_P, cannot contain another subset g′ generating another recurrent cyclic set.

We have thus proved that for a strongly synchronizing MDP M, the perfect-information subset construction for M has a cycle C_P such that |∆(C_P)| = 1 and, for G ∈ ∆(C_P), all g ∈ G satisfy |g| = 1.
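The cycle conditions in these characterizations can be checked mechanically on a subset-construction graph. The sketch below (Python; the MDP encoding `delta`, the example transition table, and the helper names `post`, `reachable_cells`, `has_singleton_cycle` are all illustrative assumptions, not from the paper) implements the simplest variant, the blind subset construction used in Section 4: the successor of a cell under an action is the union of the corresponding Post sets, and a reachable cycle passing only through singleton cells is a simple sufficient witness of a cycle C with |∆(C)| = 1 whose recurrent cyclic set consists of singletons.

```python
# Blind subset construction (illustrative; the MDP below is hypothetical).
# Encoding assumption: delta[state][action] = {successor: probability}.
delta = {
    1: {"a": {2: 1.0}, "b": {3: 1.0}},
    2: {"a": {2: 0.5, 3: 0.5}, "b": {1: 1.0}},
    3: {"a": {3: 1.0}, "b": {3: 1.0}},  # state 3 is absorbing
}
actions = ["a", "b"]

def post(cell, action):
    """Successor cell: union of the supports Post_action(l) for l in cell."""
    return frozenset(s for l in cell for s in delta[l][action])

def reachable_cells(initial):
    """All cells accessible from the initial cell in the blind subset construction."""
    seen, stack = {initial}, [initial]
    while stack:
        cell = stack.pop()
        for a in actions:
            nxt = post(cell, a)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def has_singleton_cycle(initial_states):
    """Sufficient check: some accessible singleton cell can return to itself
    through singleton cells only. Such a cycle trivially has a unique
    recurrent cyclic set whose elements are all singletons."""
    initial = frozenset(initial_states)
    singletons = [c for c in reachable_cells(initial) if len(c) == 1]
    for start in singletons:
        seen, stack = set(), [start]
        while stack:
            cell = stack.pop()
            for a in actions:
                nxt = post(cell, a)
                if nxt == start:
                    return True      # closed a cycle of singleton cells
                if len(nxt) == 1 and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return False

print(has_singleton_cycle({1}))  # True: all mass can be driven into state 3
```

This check is only sufficient: the characterization also admits cycles through larger cells whose recurrent cyclic set is nonetheless made of singletons.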
∎

Through the proof of Theorem 1, we have seen that for every strategy α such that the MDP M with strategy α is strongly synchronizing, there is a pure strategy that also satisfies the strongly synchronizing condition. We will see that this is also the case for weakly synchronizing objectives (see the proof of Theorem 2).

Corollary 1. For both strongly and weakly synchronizing objectives, pure strategies are sufficient in MDPs.

Theorem 2. For a perfect-information game over an MDP M, there exists a strategy α such that M with strategy α is weakly synchronizing if and only if the perfect-information subset construction M_P for M has an accessible cycle C_P such that |∆(C_P)| = 1 and, for G ∈ ∆(C_P), there exists g ∈ G such that |g| = 1.

Proof. Sufficient condition. Suppose that the perfect-information subset construction M_P for M has an accessible cycle C_P such that |∆(C_P)| = 1 and, for G ∈ ∆(C_P), there exists g ∈ G such that |g| = 1. Consider a pure strategy similar to the one presented in the proof of Theorem 1. As before, we construct the Markov chain M′ and reason about its transient and recurrent states. Suppose that G ∈ ∆(C_P) is the only recurrent cyclic set of the cycle and that it consists of d elements g_0, ..., g_{d−1}. Let R be the set of states (m + i, ℓ) such that ℓ ∈ g_i, for 0 ≤ i < d. As shown in the proof of Theorem 1, the states of R are the only recurrent states of the Markov chain M′. Let p_n be the probability of being in a state of R at step n. Based on Proposition 1, for the transient states (i, ℓ) the probability X_n((i, ℓ)) vanishes as n → ∞, which yields lim_{n→∞} p_n = 1. On the other hand, by hypothesis, for G ∈ ∆(C_P) there exists g_j ∈ G (0 ≤ j < d) such that |g_j| = 1. Then, at least once every d steps, the probability p_{m+k·d+j} gathers in the single state (m + j, ℓ) where ℓ ∈ g_j.
As a result, for all k ∈ ℕ, max(‖X^α_{m+d·k}‖, ‖X^α_{m+d·k+1}‖, ..., ‖X^α_{m+d·k+d−1}‖) ≥ p_{m+k·d+j}. We have shown that lim_{n→∞} p_n = 1; hence limsup_{n→∞} ‖X^α_n‖ = 1.

Necessary condition. Assume that the MDP M with strategy α is weakly synchronizing, i.e., limsup_{n→∞} ‖X^α_n‖ = 1. Then there exists a subsequence ‖X^α_{i_k}‖ of ‖X^α_i‖ that converges to 1 (i.e., lim_{k→∞} ‖X^α_{i_k}‖ = 1), where i_0 < i_1 < i_2 < ... is an increasing sequence of indices. Then, for ε < 1/2 there exists n_0 ∈ ℕ such that for all n ≥ n_0 there exists a (unique) state ℓ such that X^α_{i_n}(ℓ) > 1/2. Let (ℓ, i_n) refer to this unique state at position i_n. Let Inf be the set of all states ℓ such that X^α_{i_n}((ℓ, i_n)) > 1/2 for infinitely many n ∈ ℕ. Hence there exists N_Inf ≥ n_0 such that ∀n ≥ N_Inf: X^α_{i_n}((ℓ, i_n)) > 1/2 ⇒ ℓ ∈ Inf. Since the state space of the MDP is finite, for some specific q ∈ Inf we can define a subsequence (j_k)_{k∈ℕ} of (i_k)_{k∈ℕ} such that

1. j_0 ≥ N_Inf,

2. X^α_{j_k}((q, j_k)) > 1/2, and

3. Supp(X^α_{j_k}) = Supp(X^α_{j_{k+1}}); in the sequel, we denote this set by S.

Let (q, j_k) refer to the state q at the specific step j_k, and let J be the sequence of these states. Note that since (j_k) is a subsequence of (i_k), we have lim_{k→∞} ‖X^α_{j_k}‖ = 1 as well. We use the infinite sequence J to construct a winning pure strategy from the winning randomized strategy α. Consider the pure strategy β defined as follows: for h ∈ Hists(M) with |h| = i, we define β(h) = Action(h′, i) where h′ ∈ Hists(M) is the shortest history such that (1) Pr^α(h′) > 0, (2) Last(h′) = (q, j_k) where |h′| = j_k for some k ∈ ℕ, and (3) State(h′, i) = Last(h).
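The norms ‖X^α_n‖ manipulated throughout these proofs can be observed numerically by propagating the distribution step by step. A minimal sketch (Python; the MDP, its `delta` encoding, and the always-play-one-action blind strategy are hypothetical illustrations, not one of the paper's figures):

```python
# Propagate the distribution X_n of an MDP under a blind pure strategy and
# track ||X_n|| = max_l X_n(l). Hypothetical example MDP, not from the paper.
delta = {
    1: {"a": {1: 0.5, 2: 0.5}},   # stay with prob 1/2, or move on
    2: {"a": {3: 1.0}},
    3: {"a": {3: 1.0}},           # absorbing state
}

def step(dist, action):
    """One step: X_{n+1}(l') = sum_l X_n(l) * delta(l, action)(l')."""
    out = {}
    for l, p in dist.items():
        for l2, q in delta[l][action].items():
            out[l2] = out.get(l2, 0.0) + p * q
    return out

dist = {1: 1.0}                   # Dirac initial distribution
norms = []
for _ in range(50):
    dist = step(dist, "a")        # the blind strategy always plays "a"
    norms.append(max(dist.values()))

# The mass accumulates in the absorbing state 3, so ||X_n|| tends to 1.
print(norms[0], norms[-1] > 0.999)
```

Strong synchronization requires lim_{n→∞} ‖X_n‖ = 1, as in this example; for weak synchronization it suffices that ‖X_n‖ approaches 1 along a subsequence, such as the sequence (j_k) above.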
Note that a reachable state Last(h) with strictly positive probability Pr^α(h) > 0 must access the infinite sequence J; otherwise the MDP M with strategy α would not be weakly synchronizing. Consequently the history h′ defined above always exists. As in the strongly synchronizing case, we can define SizePath(h) = |h′| − |h| as the length of the shortest path from Last(h) to the infinite sequence J.

In the following, we show that for a weakly synchronizing MDP M there exists a cycle C_P of M_P which has only one recurrent cyclic set G, and there exists g ∈ G which is a singleton. By construction, we have β(h) = β(h′) for all histories h, h′ ∈ Hists(M) with Last(h) = Last(h′) and |h| = |h′|. Therefore the pure strategy β induces an infinite path P_β in the perfect-information subset construction M_P. The construction of β also implies that the cell S is visited infinitely many times along P_β. The path taken between two visits to S along P_β is a cycle (not necessarily a simple cycle) of M_P. We study one of these cycles (starting at S and coming back to it), and prove that it satisfies the conditions of the theorem. Let K_1 be the first step after N_Inf at which the path P_β visits S. Define MaxPath = max_{h ∈ Hists(M), Pr(h) > 0, |h| = K_1} SizePath(h), the length of the longest of the shortest paths from a reachable state at step K_1 to the infinite sequence J. Let C_P be the cycle starting in S at step K_1 and coming back to this cell at some step K_2 > K_1 + MaxPath. For convenience, let d = K_2 − K_1 denote the length of the cycle C_P. We define the winning pure strategy β′ from the strategy β as follows:

• for h ∈ Hists(M) with |h| < K_1 + K_2, we define β′(h) = β(h);
• for h ∈ Hists(M) with |h| ≥ K_1 + K_2, we define β′(h) = β(h′) where |h| = d·m + |h′| for some m ∈ ℕ, and h′ is a history with K_1 ≤ |h′| ≤ K_1 + K_2 and Last(h) = Last(h′).

In fact, the path corresponding to the strategy β′ first reaches the cycle C_P and then follows this cycle forever. The strategy β′, like the strategy β, is weakly synchronizing. We claim that this cycle C_P = s_0 σ̂_0 ··· s_d (with s_0 = s_d) has only one recurrent cyclic set G, and that there exists g ∈ G which is a singleton:

1. First we prove that this cycle has a recurrent cyclic set. The length of the cycle is more than MaxPath, which shows that some elements of the infinite sequence J are visited along the cycle. Suppose that (q, j_{k_0}) is the last visited state of J along the cycle, and let K_0 = j_{k_0} − K_1 be the index of the cell s_{K_0} containing this state. We construct the singleton subset g_{K_0} = {(q, K_0)}. By induction, let g_{(K_0+i+1) mod d} = ∪_{ℓ ∈ g_{(K_0+i) mod d}} Post_{σ̂_{(K_0+i) mod d}}(ℓ) for all 0 ≤ i < d; note that for i = d − 1, the set g_{K_0} is recomputed. By definition, the set G = {g_0, g_1, ···, g_{d−1}} is a recurrent cyclic set if, after this computation, we still have g_{K_0} = {(q, K_0)}. We claim that g_{K_0} = {(q, K_0)}. By contradiction, suppose that g_{K_0} ≠ {(q, K_0)}. We have limsup_{n→∞} ‖X^{β′}_n‖ = 1, so ∀ε > 0 · ∃n_0 ∈ ℕ · ∀n ≥ n_0 · ∃ℓ such that X^{β′}_n(ℓ) > 1 − ε. On the other hand, by definition of J, the probability mass in the states of J is more than 1/2, and moreover all states of the cycle inject probability into J; hence the visited states of J along the cycle are the candidates to concentrate the probability mass. Let ν be the smallest probability among all probability distributions of the MDP M (i.e., ν = min_{ℓ ∈ L, σ ∈ Σ, ℓ′ ∈ Supp(δ(ℓ, σ))} δ(ℓ, σ)(ℓ′)).
Let ε < ν^d/(1 + ν^d), and let n > n_0 be such that X^{β′}_n((q, K_0)) > 1 − ε. The probability that does not return from (q, K_0) to (q, K_0) after d steps is at least ν^d · (1 − ε). We have

1 − ε ≤ X^{β′}_{n+d}((q, K_0)) ≤ 1 − ν^d · (1 − ε),

which gives ε ≥ ν^d/(1 + ν^d), a contradiction. Therefore we have constructed a recurrent cyclic set G for the cycle, and shown that one element of G is a singleton.

2. G is the only recurrent cyclic set. Each state included in S reaches, in at most MaxPath steps, a state of J. Hence the cell S, as the first element of C_P, cannot contain another subset g′ generating another recurrent cyclic set.

[Figure 5: A weakly synchronizing MDP.]

We have proved that for a weakly synchronizing MDP M, the perfect-information subset construction for M has a cycle C_P such that |∆(C_P)| = 1 and, for G ∈ ∆(C_P), there exists g ∈ G such that |g| = 1. ∎

Example. The MDP depicted in Figure 5 (the initial distribution is μ_0(1) = 1 and μ_0(i) = 0 for i ∈ {2, ..., 5}), with the strategy α defined by α((L × Σ)* L)(σ) = 1/2 for σ ∈ Σ, is weakly synchronizing. Here L = {1, ..., 9} and Σ = {σ_1, σ_2}.

4 Synchronizing objectives for blind strategies

We have defined a blind one-player stochastic game where the player is not allowed to observe the current state of the game. We use a characterization of synchronizing blind strategies to show that the existence of a synchronizing blind strategy can be decided. We first present an example where the player is blind and has a strategy to make the game synchronizing.

Example. The MDP depicted in Figure 6 (the initial distribution is μ_0(1) = 1 and μ_0(i) = 0 for i ∈ {2, ...
, 5}), with the blind strategy α defined by α((L × Σ)* L)(σ) = 1/2 for σ ∈ Σ, is strongly synchronizing. Here L = {1, ..., 8} and Σ = {σ_1, σ_2}.

Theorem 3. For a blind game over an MDP M, there exists a strategy α such that M with strategy α is strongly synchronizing if and only if the blind subset construction M_B for M has an accessible cycle C_B such that |∆(C_B)| = 1 and, for G ∈ ∆(C_B), all g ∈ G satisfy |g| = 1.

[Figure 6: An MDP that is strongly synchronizing under some blind strategy.]

Proof. Sufficient condition. Suppose that the blind subset construction M_B = ⟨L, L_I, Σ, δ_B⟩ for M has an accessible cycle C_B such that |∆(C_B)| = 1 and, for G ∈ ∆(C_B), all g ∈ G satisfy |g| = 1. Since this cycle is accessible, there exists a finite path P = p_0 σ′_0 ... σ′_{m−2} p_{m−1} σ′_{m−1} p_m in M_B from p_0 = L_I to p_m = s_0. Consider the pure blind strategy α defined by

α(k) = σ′_k if 0 ≤ k < m, and α(k) = σ_{(k−m) mod d} if m ≤ k.

Let us construct a Markov chain M′ similar to the one presented in the proof of Theorem 1, with the probability transition function δ′ defined as follows:

δ′((i, ℓ))((i′, ℓ′)) =
  δ(ℓ, σ′_i)(ℓ′)      if 0 ≤ i < m, i′ = i + 1, ℓ ∈ p_i and ℓ′ ∈ p_{i′};
  δ(ℓ, σ_{i−m})(ℓ′)   if m ≤ i < m + d, i′ = m + (i − m + 1) mod d, ℓ ∈ s_{i−m} and ℓ′ ∈ s_{i′−m};
  0                   otherwise.

Suppose that G ∈ ∆(C_B) is the only recurrent cyclic set of the cycle and that it consists of d elements g_0, ..., g_{d−1}. Let R be the set of states (m + i, ℓ) such that ℓ ∈ g_i, for 0 ≤ i < d. As shown in the proof of Theorem 1, the states of R are the only recurrent states of the Markov chain M′.
Based on Proposition 1, for the transient states (k, ℓ), the probability X_n((k, ℓ)) vanishes as n → ∞. Since |g| = 1 for all g ∈ G, the support of X_n (for n > m) contains only one recurrent state. Thus the probability mass accumulates in that state: for all ε > 0 there exists n_0 such that for all n > n_0 there is a state (i, ℓ) with X_n((i, ℓ)) > 1 − ε, that is, ‖X^α_n‖ > 1 − ε. Hence lim_{n→∞} ‖X^α_n‖ = 1 and M′ is strongly synchronizing; therefore so is the MDP M under the blind strategy α.

Necessary condition. We reuse the arguments presented in the proof of Theorem 1; but here, since the winning strategy is blind, we use the blind subset construction. ∎

Theorem 4. For a blind game over an MDP M, there exists a strategy α such that M with strategy α is weakly synchronizing if and only if the blind subset construction M_B for M has an accessible cycle C_B such that |∆(C_B)| = 1 and, for G ∈ ∆(C_B), there exists g ∈ G such that |g| = 1.

Proof. Sufficient condition. Suppose that the blind subset construction M_B = ⟨L, L_I, Σ, δ_B⟩ for M has an accessible cycle C_B such that |∆(C_B)| = 1 and, for G ∈ ∆(C_B), there exists g ∈ G such that |g| = 1. Consider a pure strategy similar to the one presented in the proof of Theorem 1. As before, we construct the Markov chain M′ and reason about its transient and recurrent states. Suppose that G ∈ ∆(C_B) is the only recurrent cyclic set of the cycle and that it consists of d elements g_0, ..., g_{d−1}. Let R be the set of states (m + i, ℓ) such that ℓ ∈ g_i, for 0 ≤ i < d. As shown in the proof of Theorem 1, the states of R are the only recurrent states of the Markov chain M′. Let p_n be the probability of being in a state of R at step n. Based on Proposition 1, for the transient states (i, ℓ), the probability X_n((i, ℓ)) vanishes as n → ∞, which yields lim_{n→∞} p_n = 1.
On the other hand, by hypothesis, for G ∈ ∆(C_B) there exists g_j ∈ G (0 ≤ j < d) such that |g_j| = 1. Then, at least once every d steps, the whole probability p_n gathers in the single state (m + j, ℓ) where ℓ ∈ g_j. As a result, for all k ∈ ℕ, max(‖X^α_{m+d·k}‖, ‖X^α_{m+d·k+1}‖, ..., ‖X^α_{m+d·k+d−1}‖) ≥ p_{m+k·d+j}. We have shown that lim_{n→∞} p_n = 1; hence limsup_{n→∞} ‖X^α_n‖ = 1.

Necessary condition. We reuse the arguments presented in the proof of Theorem 2; but here, since the winning strategy is blind, we use the blind subset construction. ∎

From the four previous theorems, we obtain the following result.

Theorem 5. The problem of deciding the existence of a {perfect-information, blind} strategy in MDPs for a {strongly, weakly} synchronizing objective is decidable.

We have defined a new class of objectives for Markov decision processes, and we have given a decidable characterization of winning strategies for these objectives. Further investigations will be devoted to studying the precise complexity of the problem, establishing memory bounds, and extending this framework to partially observable MDPs and stochastic two-player games.

References

[1] A. Aziz, V. Singhal & F. Balarin (1995): It Usually Works: The Temporal Logic of Stochastic Systems. In: Proc. of CAV: Computer Aided Verification. LNCS 939, Springer, pp. 155–165. Available at http://dx.doi.org/10.1007/3-540-60045-0_48.

[2] D. Beauquier, A. M. Rabinovich & A. Slissenko (2002): A Logic of Probability with Decidable Model-Checking. In: Proc. of CSL: Computer Science Logic. LNCS 2471, Springer, pp. 371–402. Available at http://dx.doi.org/10.1007/3-540-45793-3_21.

[3] Y. Benenson, R. Adar, T. Paz-Elizur, Z. Livneh & E. Shapiro (2003): DNA molecule provides a computing machine with both data and fuel. Proc. National Acad. Sci. USA 100, pp. 2191–2196.
Available at http://dx.doi.org/10.1073/pnas.0535624100.

[4] A. Bianco & L. de Alfaro (1995): Model Checking of Probabilistic and Nondeterministic Systems. In: Proc. of FSTTCS: Foundations of Software Technology and Theoretical Computer Science. LNCS 1026, Springer-Verlag, pp. 499–513. Available at http://dx.doi.org/10.1007/3-540-60692-0_70.

[5] J. Filar & K. Vrieze (1997): Competitive Markov Decision Processes. Springer-Verlag. Available at http://www.springer.com/engineering/mathematical/book/978-0-387-94805-8.

[6] V. A. Korthikanti, M. Viswanathan, Y. Kwon & G. Agha (2009): Reasoning about MDPs as transformers of probability distributions. In: Proc. of QEST: Quantitative Evaluation of Systems. IEEE Computer Society, pp. 199–208. Available at http://dx.doi.org/10.1109/QEST.2010.35.

[7] R. Segala (1995): Modeling and Verification of Randomized Distributed Real-Time Systems. Ph.D. thesis, MIT. Available at http://profs.sci.univr.it/~segala/www/phd.html. Technical Report MIT/LCS/TR-676.

[8] M. V. Volkov (2008): Synchronizing Automata and the Cerny Conjecture. In: Proc. of LATA: Language and Automata Theory and Applications. LNCS 5196, Springer, pp. 11–27. Available at http://dx.doi.org/10.1007/978-3-540-88282-4_4.
