Optimality of Myopic Sensing in Multi-Channel Opportunistic Access



Sahand Haji Ali Ahmad♯, Mingyan Liu♯, Tara Javidi†, Qing Zhao‡, Bhaskar Krishnamachari§

shajiali@eecs.umich.edu, mingyan@eecs.umich.edu, tara@ece.ucsd.edu, qzhao@ece.ucdavis.edu, bkrishna@usc.edu

♯ Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109
† Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093
‡ Department of Electrical and Computer Engineering, University of California, Davis, CA 95616
§ Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089

Abstract: We consider opportunistic communication over multiple channels where the state ("good" or "bad") of each channel evolves as an independent, identically distributed Markov process. A user, with limited channel sensing and access capability, chooses one channel to sense and subsequently access (based on the sensed channel state) in each time slot. A reward is obtained whenever the user senses and accesses a "good" channel. The objective is to design an optimal channel selection policy that maximizes the expected total (discounted or average) reward accrued over a finite or infinite horizon. This problem can be cast as a Partially Observable Markov Decision Process (POMDP) or a restless multi-armed bandit process, to which optimal solutions are often intractable. We show in this paper that a myopic policy that maximizes the immediate one-step reward is always optimal when the state transitions are positively correlated over time. When the state transitions are negatively correlated, we show that the same policy is optimal when the number of channels is limited to 2 or 3, while presenting a counterexample for the case of 4 channels. This result finds applications in opportunistic transmission scheduling in a fading environment, cognitive radio networks for spectrum overlay, and resource-constrained jamming and anti-jamming.

A preliminary version of this work was presented at the IEEE International Conference on Communications (ICC), May 2008, Beijing, China.

Index Terms: Opportunistic access, cognitive radio, POMDP, multi-armed bandit, restless bandit, Gittins index, Whittle's index, myopic policy.

I. INTRODUCTION

We consider a communication system in which a sender has access to multiple channels, but is limited to sensing and transmitting only on one at a given time. We explore how a smart sender should exploit past observations and the knowledge of the stochastic state evolution of these channels to maximize its transmission rate by switching opportunistically across channels.

We model this problem in the following manner. As shown in Figure 1, there are $n$ channels, each of which evolves as an independent, identically-distributed, two-state discrete-time Markov chain. The two states for each channel, "good" (or state 1) and "bad" (or state 0), indicate the desirability of transmitting over that channel at a given time slot. The state transition probabilities are given by $p_{ij}$, $i, j = 0, 1$. In each time slot the sender picks one of the channels to sense based on its prior observations, and obtains some fixed reward if it is in the good state.
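For concreteness, these channel dynamics are straightforward to simulate. The following is a minimal Python sketch (our own construction and naming, not part of the paper) that evolves $n$ such i.i.d. two-state channels:

```python
import random

def simulate_channels(n, T, p01, p11, init_states):
    """Evolve n i.i.d. two-state Markov channels (0 = bad, 1 = good) for T
    slots; p01 is the bad-to-good and p11 the good-to-good transition
    probability. Returns a list of T state vectors, one per slot."""
    states, history = list(init_states), []
    for _ in range(T):
        states = [1 if random.random() < (p11 if s == 1 else p01) else 0
                  for s in states]
        history.append(states)
    return history
```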
The basic objective of the sender is to maximize the reward that it can gain over a given finite time horizon. This problem can be described as a partially observable Markov decision process (POMDP) [1] since the states of the underlying Markov chains are not fully observed. It can also be cast as a special case of the class of restless multi-armed bandit problems [2]; more discussion on this is given in Section VII.

Fig. 1. The Markov channel model (states 0 = bad and 1 = good, with transition probabilities $p_{00}, p_{01}, p_{10}, p_{11}$).

This formulation is broadly applicable to several domains. It arises naturally in opportunistic spectrum access (OSA) [3], [4], where the sender is a secondary user, and the channel states describe the occupancy by primary users. In the OSA problem, the secondary sender may send on a given channel only when there is no primary user occupying it. It pertains to communication over parallel fading channels as well, if a two-state Markovian fading model is employed. Another interesting application of this formulation is in the domain of communication security, where it can be used to develop bounds on the performance of resource-constrained jamming. A jammer that has access to only one channel at a time could also use the same stochastic dynamic decision making process to maximize the number of times that it can successfully jam communications that occur on these channels. In this application, the "good" state for the jammer is precisely when the channel is being utilized by other senders (in contrast with the OSA problem).

In this paper we examine the optimality of a simple myopic policy for the opportunistic access problem outlined above. Specifically, we show that the myopic policy is optimal for arbitrary $n$ when $p_{11} \ge p_{01}$. We also show that it is optimal for $n = 3$ when $p_{11} < p_{01}$, while presenting a finite horizon counterexample showing that it is in general not optimal for $n \ge 4$. We also generalize these results to related formulations involving discounted and average rewards over an infinite horizon.

These results extend and complement those reported in prior work [5]. Specifically, it has been shown in [5] that for all $n$ the myopic policy has an elegant and robust structure that obviates the need to know the channel state transition probabilities and reduces channel selection to a simple round robin procedure. Based on this structure, the optimality of the myopic policy for $n = 2$ was established, and the performance of the myopic policy, in particular its scaling property with respect to $n$, was analyzed in [5]. It was conjectured in [5] that the myopic policy is optimal for any $n$. This conjecture was partially addressed in a preliminary conference version [6], where the optimality was established under certain restrictive conditions on the channel parameters and the discount factor. In the present paper, we significantly relax these conditions and formally prove this conjecture under the condition $p_{11} \ge p_{01}$. We also provide a counterexample for $p_{11} < p_{01}$. We would like to emphasize that compared to earlier work [5], [6], the approach used in this paper relies on a coupling argument, which is the key to extending the optimality result to the arbitrary $n$ case.
Earlier techniques were largely based on exploiting the convex analytic properties of the value function, and were shown to have difficulty in overcoming the $n = 2$ barrier without further conditions on the discount factor or transition probabilities. This observation is somewhat reminiscent of the results reported in [7], where a coupling argument was also used to solve an $n$-queue problem while earlier versions [8] using value function properties were limited to a 2-queue case. We invite the interested reader to refer to [9], an important manuscript on monotonicity in MDPs which explores the power as well as the limitations of working with analytic properties of value functions and dynamic programming operators as we had done in our earlier work. In particular, [9, Section 9.5] explores the difficulty of using such techniques for multi-dimensional problems where the number of queues is more than $n = 2$; [9, Chapter 12] contrasts this proof technique with the stochastic coupling arguments, which our present work uses.

The remainder of this paper is organized as follows. We formulate the problem in Section II and illustrate the myopic policy in Section III. In Section IV, we prove that the myopic policy is optimal in the case of $p_{11} \ge p_{01}$, and show in Section V that it is in general not optimal when this condition does not hold. Section VI extends the results from finite horizon to infinite horizon. We discuss our work within the context of the class of restless bandit problems as well as some related work in this area in Section VII. Section VIII concludes the paper.

II. PROBLEM FORMULATION

We consider the scenario where a user is trying to access the wireless spectrum to maximize its throughput or data rate. The spectrum consists of $n$ independent and statistically identical channels. The state of a channel is given by the two-state discrete time Markov chain shown in Figure 1.

The system operates in discrete time steps indexed by $t$, $t = 1, 2, \cdots, T$, where $T$ is the time horizon of interest. At time $t^-$, the channels (i.e., the Markov chains representing them) go through state transitions, and at time $t$ the user makes the channel sensing and access decision. Specifically, at time $t$ the user selects one of the $n$ channels to sense, say channel $i$. If the channel is sensed to be in the "good" state (state 1), the user transmits and collects one unit of reward. Otherwise the user does not transmit (or transmits at a lower rate), collects no reward, and waits until $t + 1$ to make another choice. This process repeats sequentially until the time horizon expires.

As mentioned earlier, this abstraction is primarily motivated by the following multi-channel access scenario where a secondary user seeks spectrum opportunity in between a primary user's activities. Specifically, time is divided into frames, and at the beginning of each frame there is a designated time slot for the primary user to reserve that frame and for secondary users to perform channel sensing. If a primary user intends to use a frame it will simply remain active in a channel (or multiple channels) during that sensing time slot (i.e., reservation is by default for a primary user in use of the channel), in which case a secondary user will find the channel(s) busy and will not attempt to use it for the duration of that frame.
If the primary user is inactive during this sensing time slot, then the remainder of the frame is open to secondary users. Such a structure provides the necessary protection for the primary user, as channel sensing (in particular active channel sensing that involves communication between a pair of users) conducted at arbitrary times can cause undesirable interference. Within such a structure, a secondary user has a limited amount of time and capability to perform channel sensing, and may only be able to sense one or a subset of the channels before the sensing time slot ends; if all these channels are unavailable, it will have to wait till the next sensing time slot. In this paper we will limit our attention to the special case where the secondary user only has the resources to sense one channel within this slot. Conceptually our formulation is easily extended to the case where the secondary user can sense multiple channels at a time within this structure, although the corresponding results differ; see e.g., [10].

Note that in this formulation we do not explicitly model the cost of channel sensing; it is implicit in the fact that the user is limited in how many channels it can sense at a time. Alternative formulations have been studied where sensing costs are explicitly taken into consideration in a user's sensing and access decision; see e.g., a sequential channel sensing scheme in [11]. In this formulation we have also assumed that sensing errors are negligible. Techniques used in this paper may be applicable in proving the optimality of the myopic policy under imperfect sensing and for a general number of channels. The reason is that our proof exploits the simple structure of the myopic policy, which remains in place when sensing is subject to errors, as shown in [12].

Note that the system is not fully observable to the user, i.e., the user does not know the exact state of the system when making the sensing decision. Specifically, channels go through state transitions at time $t^-$ (or anytime between $(t-1, t)$); thus when the user makes the channel sensing decision at time $t$, it does not have the true state of the system at time $t$, which we denote by $s(t) = [s_1(t), s_2(t), \cdots, s_n(t)] \in \{0, 1\}^n$. Furthermore, even after its action (at time $t^+$) it only gets to observe the true state of one channel, which goes through another transition at or before time $(t+1)^-$. The user's action space at time $t$ is given by the finite set $\{1, 2, \cdots, n\}$, and we will use $a(t) = i$ to denote that the user selects channel $i$ to sense at time $t$. For clarity, we will denote the outcome/observation of channel sensing at time $t$ following the action $a(t)$ by $h_{a(t)}(t)$, which is essentially the true state $s_{a(t)}(t)$ of channel $a(t)$ at time $t$, since we assume channel sensing to be error-free.

It can be shown (see e.g., [1], [13], [14]) that a sufficient statistic of such a system for optimal decision making, or the information state of the system [13], [14], is given by the conditional probabilities of the state each channel is in given all past actions and observations.
Since each channel can be in one of two states, we denote this information state or belief vector by $\bar\omega(t) = [\omega_1(t), \cdots, \omega_n(t)] \in [0, 1]^n$, where $\omega_i(t)$ is the conditional probability that channel $i$ is in state 1 at time $t$ given all past states, actions and observations¹. Throughout the paper $\omega_i(t)$ will be referred to as the information state of channel $i$ at time $t$, or simply the channel probability of $i$ at time $t$.

Due to the Markovian nature of the channel model, the future information state is only a function of the current information state and the current action; i.e., it is independent of past history given the current information state and action. It follows that the information state of the system evolves as follows. Given that the state at time $t$ is $\bar\omega(t)$ and action $a(t) = i$ is taken, $\omega_i(t+1)$ can take on two values: (1) $p_{11}$ if the observation is that channel $i$ is in a "good" state ($h_i(t) = 1$), which occurs with probability $P\{h_i(t) = 1 \mid \bar\omega(t)\} = \omega_i(t)$; (2) $p_{01}$ if the observation is that channel $i$ is in a "bad" state ($h_i(t) = 0$), which occurs with probability $P\{h_i(t) = 0 \mid \bar\omega(t)\} = 1 - \omega_i(t)$. For any other channel $j \ne i$, the corresponding $\omega_j(t+1)$ can only take on one value (i.e., with probability 1): $\omega_j(t+1) = \tau(\omega_j(t))$, where the operator $\tau : [0,1] \to [0,1]$ is defined as

$$\tau(\omega) := \omega p_{11} + (1 - \omega) p_{01}, \quad 0 \le \omega \le 1. \tag{1}$$

These transition probabilities are summarized in the following equation for $t = 1, 2, \cdots, T-1$ and $i = 1, 2, \cdots, n$:

$$\omega_i(t+1) \mid \{\bar\omega(t), a(t)\} = \begin{cases} p_{11} & \text{with prob. } \omega_i(t), & \text{if } a(t) = i \\ p_{01} & \text{with prob. } 1 - \omega_i(t), & \text{if } a(t) = i \\ \tau(\omega_i(t)) & \text{with prob. } 1, & \text{if } a(t) \ne i. \end{cases} \tag{2}$$

Also note that $\bar\omega(1) \in [0,1]^n$ denotes the initial condition (information state) of the system, which may be interpreted as the user's initial belief about how likely each channel is in the good state before sensing starts at time $t = 1$. For the purpose of the optimization problems formulated below, this initial condition is considered given, and can be any probability vector².

It is important to note that although in general a POMDP problem has an uncountable state space (information states are probability distributions), in our problem the state space is countable for any given initial condition $\bar\omega(1)$. This is because, as shown above, the information state of any channel with an initial probability of $\omega$ can only take on the values $\{\omega, \tau^k(\omega), p_{01}, \tau^k(p_{01}), p_{11}, \tau^k(p_{11})\}$, where $k = 1, 2, \cdots$ and $\tau^k(\omega) := \tau(\tau^{k-1}(\omega))$, which is a countable set.

For compactness of presentation we will further use the operator $\mathcal{T}$ to denote the above probability distribution of the information state (the entire vector):

$$\bar\omega(t+1) = \mathcal{T}(\bar\omega(t), a(t)), \tag{3}$$

by noting that the operation given in (2) is applied to $\bar\omega(t)$ element-by-element.

¹ Note that this is a standard way of turning a POMDP problem into a classic MDP (Markov decision process) problem by means of the information state, the main implication being that the state space is now uncountable.

² That is, the optimal solutions are functions of the initial condition. A reasonable choice, if the user has no special information other than the transition probabilities of these channels, is to simply use the steady-state probability of each channel being in state "1" as the initial condition (i.e., setting $\omega_i(1) = \frac{p_{01}}{p_{01} + p_{10}}$).
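In code, the update rules (1)-(2) amount to the following sketch (function names are our own, consistent with the simulator sketched in Section I; not part of the paper):

```python
def tau(omega, p01, p11):
    # One-step belief update (1) for a channel that is not sensed.
    return omega * p11 + (1 - omega) * p01

def update_beliefs(beliefs, a, observation, p01, p11):
    """Information-state transition (2): every unsensed channel's belief
    evolves through tau(); the sensed channel a jumps to p11 (seen good)
    or p01 (seen bad)."""
    new = [tau(w, p01, p11) for w in beliefs]
    new[a] = p11 if observation == 1 else p01
    return new
```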
We will also use the following to denote the information state given the observation outcome:

$$\mathcal{T}(\bar\omega(t), a(t) \mid h_{a(t)}(t) = 1) = (\tau(\omega_1(t)), \cdots, \tau(\omega_{a(t)-1}(t)), p_{11}, \tau(\omega_{a(t)+1}(t)), \cdots, \tau(\omega_n(t))) \tag{4}$$

$$\mathcal{T}(\bar\omega(t), a(t) \mid h_{a(t)}(t) = 0) = (\tau(\omega_1(t)), \cdots, \tau(\omega_{a(t)-1}(t)), p_{01}, \tau(\omega_{a(t)+1}(t)), \cdots, \tau(\omega_n(t))) \tag{5}$$

The objective of the user is to maximize its total (discounted or average) expected reward over a finite (or infinite) horizon. Let $J^\pi_T(\bar\omega)$, $J^\pi_\beta(\bar\omega)$, and $J^\pi_\infty(\bar\omega)$ denote, respectively, these reward criteria (namely, finite horizon, infinite horizon with discount, and infinite horizon average reward) under policy $\pi$ starting in state $\bar\omega = [\omega_1, \cdots, \omega_n]$. The associated optimization problems (P1)-(P3) are formally defined as follows:

$$\text{(P1):} \quad \max_\pi J^\pi_T(\bar\omega) = \max_\pi \mathbb{E}^\pi \left[ \sum_{t=1}^{T} \beta^{t-1} R_{\pi_t}(\bar\omega(t)) \,\middle|\, \bar\omega(1) = \bar\omega \right]$$

$$\text{(P2):} \quad \max_\pi J^\pi_\beta(\bar\omega) = \max_\pi \mathbb{E}^\pi \left[ \sum_{t=1}^{\infty} \beta^{t-1} R_{\pi_t}(\bar\omega(t)) \,\middle|\, \bar\omega(1) = \bar\omega \right]$$

$$\text{(P3):} \quad \max_\pi J^\pi_\infty(\bar\omega) = \max_\pi \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^\pi \left[ \sum_{t=1}^{T} R_{\pi_t}(\bar\omega(t)) \,\middle|\, \bar\omega(1) = \bar\omega \right]$$

where $\beta$ ($0 \le \beta \le 1$ for (P1) and $0 \le \beta < 1$ for (P2)) is the discount factor, and $R_{\pi_t}(\bar\omega(t))$ is the reward collected under state $\bar\omega(t)$ when channel $a(t) = \pi_t(\bar\omega(t))$ is selected and $h_{a(t)}(t)$ is observed. This reward is given by $R_{\pi_t}(\bar\omega(t)) = 1$ with probability $\omega_{a(t)}(t)$ (when $h_{a(t)}(t) = 1$), and 0 otherwise.

The maximization in (P1) is over the class of deterministic Markov policies³. An admissible policy $\pi$, given by the vector $\pi = [\pi_1, \pi_2, \cdots, \pi_T]$, is thus such that $\pi_t$ specifies a mapping from the current information state $\bar\omega(t)$ to a channel selection action $a(t) = \pi_t(\bar\omega(t)) \in \{1, 2, \cdots, n\}$. This is done without loss of optimality due to the Markovian nature of the underlying system and due to known results on POMDPs. Note that the class of Markov policies in terms of information state are also known as separated policies (see [14]). Due to the finiteness of the (unobservable) state space and action space in problem (P1), it is known that an optimal policy (over all random and deterministic, history-dependent and history-independent policies) may be found within the class of separated (i.e., deterministic Markov) policies (see e.g., [14, Theorem 7.1, Chapter 6]), thus justifying the maximization and the admissible policy space.

³ A Markov policy is a policy that derives its action only from the current (information) state, rather than the entire history of states; see e.g., [14].

In Section VI we establish the existence of a stationary separated policy $\pi^*$ under which the supremum of the expected discounted reward as well as the supremum of the expected average reward are achieved, hence justifying our use of maximization in (P2) and (P3). Furthermore, it is shown that under this policy the limit in (P3) exists and is greater than the limsup of the average performance of any other policy (in general history-dependent and randomized).
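The finite-horizon objective of (P1) can also be estimated for any given policy by Monte Carlo simulation. Below is an illustrative sketch (our own naming; it reuses the hypothetical `update_beliefs` helper from the sketch above and is not part of the paper's development):

```python
import random

def estimate_J_T(policy, T, beta, p01, p11, init_beliefs, runs=100000):
    """Monte Carlo estimate of J_T^pi in (P1). `policy` maps a belief
    vector to a channel index; initial states are drawn from the
    initial beliefs."""
    total = 0.0
    for _ in range(runs):
        states = [1 if random.random() < w else 0 for w in init_beliefs]
        beliefs = list(init_beliefs)
        for t in range(T):
            a = policy(beliefs)
            total += (beta ** t) * states[a]   # unit reward if sensed "good"
            beliefs = update_beliefs(beliefs, a, states[a], p01, p11)
            # channels transition between slots, per the model of Section II
            states = [1 if random.random() < (p11 if s else p01) else 0
                      for s in states]
    return total / runs
```

For instance, passing `policy = lambda b: max(range(len(b)), key=b.__getitem__)` estimates the value of the greedy rule defined in Section III-B.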
This is a strong notion of optimality; the interpretation is that the most "pessimistic" average performance under policy $\pi^*$ ($\liminf \frac{1}{T} J^{\pi^*}_T(\cdot) = \lim \frac{1}{T} J^{\pi^*}_T(\cdot)$) is greater than the most "optimistic" performance under any other policy $\pi$ ($\limsup \frac{1}{T} J^\pi_T(\cdot)$). In much of the literature on MDPs, this is referred to as the strong optimality for an expected average cost (reward) problem; for a discussion, see [15, Page 344].

III. OPTIMAL POLICY AND THE MYOPIC POLICY

A. Dynamic Programming Representations

Problems (P1)-(P3) defined in the previous section may be solved using their respective dynamic programming (DP) representations. Specifically, for problem (P1), we have the following recursive equations:

$$V_T(\bar\omega) = \max_{a = 1, 2, \cdots, n} \mathbb{E}[R_a(\bar\omega)]$$

$$V_t(\bar\omega) = \max_{a = 1, 2, \cdots, n} \mathbb{E}[R_a(\bar\omega) + \beta V_{t+1}(\mathcal{T}(\bar\omega, a))] = \max_{a = 1, \cdots, n} \left( \omega_a + \beta \omega_a V_{t+1}(\mathcal{T}(\bar\omega, a \mid 1)) + \beta (1 - \omega_a) V_{t+1}(\mathcal{T}(\bar\omega, a \mid 0)) \right), \tag{6}$$

for $t = 1, 2, \cdots, T-1$, where $V_t(\bar\omega)$ is known as the value function, i.e., the maximum expected future reward that can be accrued starting from time $t$ when the information state is $\bar\omega$. In particular, we have $V_1(\bar\omega) = \max_\pi J^\pi_T(\bar\omega)$, and an optimal deterministic Markov policy exists such that $a = \pi^*_t(\bar\omega)$ achieves the maximum in (6) (see e.g., [15, Chapter 4]). Note that since $\mathcal{T}$ is a conditional probability distribution (given in (3)), $V_{t+1}(\mathcal{T}(\bar\omega, a))$ is taken to be the expectation over this distribution when its argument is $\mathcal{T}$, with a slight abuse of notation, as expressed in (6).

Similar dynamic programming representations hold for (P2) and (P3), as given below. For problem (P2) there exists a unique function $V_\beta(\cdot)$ satisfying the following fixed point equation:

$$V_\beta(\bar\omega) = \max_{a = 1, \cdots, n} \mathbb{E}[R_a(\bar\omega) + \beta V_\beta(\mathcal{T}(\bar\omega, a))] = \max_{a = 1, \cdots, n} \left( \omega_a + \beta \omega_a V_\beta(\mathcal{T}(\bar\omega, a \mid 1)) + \beta (1 - \omega_a) V_\beta(\mathcal{T}(\bar\omega, a \mid 0)) \right). \tag{7}$$

We have that $V_\beta(\bar\omega) = \max_\pi J^\pi_\beta(\bar\omega)$, and that a stationary separated policy $\pi^*$ is optimal if and only if $a = \pi^*(\bar\omega)$ achieves the maximum in (7) [16, Theorem 7.1]. For problem (P3), we will show that there exist a bounded function $h_\infty(\cdot)$ and a constant scalar $J$ satisfying the following equation:

$$J + h_\infty(\bar\omega) = \max_{a = 1, 2, \cdots, n} \mathbb{E}[R_a(\bar\omega) + h_\infty(\mathcal{T}(\bar\omega, a))] = \max_{a = 1, \cdots, n} \left( \omega_a + \omega_a h_\infty(\mathcal{T}(\bar\omega, a \mid 1)) + (1 - \omega_a) h_\infty(\mathcal{T}(\bar\omega, a \mid 0)) \right). \tag{8}$$

The boundedness of $h_\infty$ and of the immediate reward implies that $J = \max_\pi J^\pi_\infty(\bar\omega)$, and that a stationary separated policy $\pi^*$ is optimal in the context of (P3) if and only if $a = \pi^*(\bar\omega)$ achieves the maximum in (8) [16, Theorems 6.1-6.3].

Solving (P1)-(P3) using the above recursive equations is in general computationally heavy.
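To make that computational burden concrete, the recursion (6) can be evaluated by brute force for small instances by recursing directly on belief vectors; the cost grows exponentially in the horizon. A minimal sketch, with our own function names and under the error-free sensing model above:

```python
from functools import lru_cache

def make_value_function(p01, p11, beta, T):
    """Return V(t, beliefs) implementing the DP recursion (6), with
    terminal condition V(T, .) = max_a omega_a."""
    def tau(w):
        return w * p11 + (1 - w) * p01

    @lru_cache(maxsize=None)
    def V(t, beliefs):
        if t == T:
            return max(beliefs)
        best = 0.0
        for a, w in enumerate(beliefs):
            rest = tuple(tau(x) for x in beliefs)
            good = rest[:a] + (p11,) + rest[a + 1:]   # T(omega, a | 1), Eq. (4)
            bad = rest[:a] + (p01,) + rest[a + 1:]    # T(omega, a | 0), Eq. (5)
            best = max(best, w + beta * (w * V(t + 1, good)
                                         + (1 - w) * V(t + 1, bad)))
        return best

    return V
```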
Therefore, instead of directly using the DP equations, the focus of this paper is on examining the optimality properties of a simple greedy algorithm. We define this algorithm next and show its simplicity in structure and implementation.

B. The Myopic Policy

A myopic or greedy policy ignores the impact of the current action on the future reward, focusing solely on maximizing the expected immediate reward. Myopic policies are thus stationary. For (P1), the myopic policy under state $\bar\omega = [\omega_1, \omega_2, \cdots, \omega_n]$ is given by

$$a^*(\bar\omega) = \arg\max_{a = 1, \cdots, n} \mathbb{E}[R_a(\bar\omega)] = \arg\max_{a = 1, \cdots, n} \omega_a. \tag{9}$$

In general, obtaining the myopic action in each time slot requires the successive update of the information state as given in (2), which explicitly relies on the knowledge of the transition probabilities $\{p_{ij}\}$ as well as the initial condition $\bar\omega(1)$. Interestingly, it has been shown in [5] that the implementation of the myopic policy requires only the knowledge of the initial condition and the order of $p_{11}$ and $p_{01}$, but not the precise values of these transition probabilities. To make the present paper self-contained, below we briefly describe how this policy works; more details may be found in [5].

Specifically, when $p_{11} \ge p_{01}$ the conditional probability updating function $\tau(\omega)$ is monotonically increasing, i.e., $\tau(\omega_1) \ge \tau(\omega_2)$ for $\omega_1 \ge \omega_2$. Therefore the ordering of information states among channels is preserved when they are not observed. If a channel has been observed to be in state "1" (respectively "0"), its probability at the next step becomes $p_{11} \ge \tau(\omega)$ (respectively $p_{01} \le \tau(\omega)$) for any $\omega \in [0, 1]$. In other words, a channel observed to be in state "1" (respectively "0") will have the highest (respectively lowest) possible information state among all channels.

These observations lead to the following implementation of the myopic policy, also sketched in code below. We take the initial information state $\bar\omega(1)$, order the channels according to their probabilities $\omega_i(1)$, and probe the highest one (top of the ordered list), with ties broken randomly. In subsequent steps we stay on the same channel if it was sensed to be in state "1" (good) in the previous slot; otherwise, this channel is moved to the bottom of the ordered list, and we probe the channel currently at the top of the list. This in effect creates a round robin style of probing, where the channels are cycled through in a fixed order. This circular structure is exploited in Section IV to prove the optimality of the myopic policy in the case of $p_{11} \ge p_{01}$.
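The following is a minimal sketch of this round-robin rule (the `observe` callback abstracting the actual sensing step is our own construction, not the paper's):

```python
from collections import deque

def myopic_round_robin(init_beliefs, observe, T):
    """Myopic policy for p11 >= p01: order channels by initial belief once,
    then stay on a channel after seeing "1" and rotate it to the bottom
    after seeing "0". Needs no knowledge of the {p_ij} values."""
    order = deque(sorted(range(len(init_beliefs)),
                         key=lambda i: -init_beliefs[i]))
    reward = 0
    for _ in range(T):
        channel = order[0]                # probe the top of the ordered list
        if observe(channel) == 1:
            reward += 1                   # good: stay on this channel
        else:
            order.rotate(-1)              # bad: move it to the bottom
    return reward
```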
When $p_{11} < p_{01}$, we have an analogous but opposite situation. The conditional probability updating function $\tau(\omega)$ is now monotonically decreasing, i.e., $\tau(\omega_1) \le \tau(\omega_2)$ for $\omega_1 \ge \omega_2$. Therefore the ordering of information states among channels is reversed at each time step when they are not observed. If a channel has been observed to be in state "1" (respectively "0"), its probability at the next step becomes $p_{11} \le \tau(\omega)$ (respectively $p_{01} \ge \tau(\omega)$) for any $\omega \in [0, 1]$. In other words, a channel observed to be in state "1" (respectively "0") will have the lowest (respectively highest) possible information state among all channels.

As in the previous case, these observations lead to the following implementation. We take the initial information state $\bar\omega(1)$, order the channels according to their probabilities $\omega_i(1)$, and probe the highest one (top of the ordered list), with ties broken randomly. In each subsequent step, if the channel sensed in the previous step was in state "0" (bad), we keep this channel at the top of the list but completely reverse the order of the remaining list, and we probe this channel. If the channel sensed in the previous step was in state "1" (good), then we completely reverse the order of the entire list (including dropping this channel to the bottom of the list), and probe the channel currently at the top of the list. This alternating circular structure is exploited in Section V to examine the optimality of the myopic policy in the case of $p_{11} < p_{01}$.

IV. OPTIMALITY OF THE MYOPIC POLICY IN THE CASE OF $p_{11} \ge p_{01}$

In this section we show that the myopic policy, with its simple and robust structure, is optimal when $p_{11} \ge p_{01}$. We will first show this for the finite horizon discounted cost case, and then extend the result to the infinite horizon case under both discounted and average cost criteria in Section VI. The main assumption is formally stated as follows.

Assumption 1: The transition probabilities $p_{01}$ and $p_{11}$ are such that

$$p_{11} - p_{01} \ge 0. \tag{10}$$

The main theorem of this section is as follows.

Theorem 1: Consider Problem (P1). Define $V_t(\bar\omega; a) := \mathbb{E}[R_a(\bar\omega) + \beta V_{t+1}(\mathcal{T}(\bar\omega, a))]$, i.e., the value of the value function given in Eqn (6) when action $a$ is taken at time $t$ followed by an optimal policy. Under Assumption 1, the myopic policy is optimal, i.e., for all $t$, $1 \le t < T$, and all $\bar\omega = [\omega_1, \cdots, \omega_n] \in [0, 1]^n$,

$$V_t(\bar\omega; a = j) - V_t(\bar\omega; a = i) \ge 0, \tag{11}$$

if $\omega_j \ge \omega_i$, for $i = 1, \cdots, n$.

The proof of this theorem is based on backward induction on $t$: given the optimality of the myopic policy at times $t+1, t+2, \cdots, T$, we want to show that it is also optimal at time $t$. This relies on a number of lemmas introduced below. The first lemma introduces a notation that allows us to express the expected future reward under the myopic policy.

Lemma 1: There exist $T$ $n$-variable functions, denoted by $W_t()$, $t = 1, 2, \cdots, T$, each of which is a polynomial of order 1⁴ and can be represented recursively in the following form:

$$W_t(\bar\omega) = \omega_n + \omega_n \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-1}), p_{11}) + (1 - \omega_n) \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-1})), \tag{12}$$

where $\bar\omega = [\omega_1, \omega_2, \cdots, \omega_n]$ and $W_T(\bar\omega) = \omega_n$.

Proof: The proof is easily obtained using backward induction on $t$, given the above recursive equation and noting that $W_T()$ is one such polynomial and the mapping $\tau()$ is a linear operation.

⁴ Each function $W_t$ is affine in each variable, when all other variables are held constant.

Corollary 1: When $\bar\omega$ represents the ordered list of information states $[\omega_1, \omega_2, \cdots, \omega_n]$ with $\omega_1 \le \omega_2 \le \cdots \le \omega_n$, then $W_t(\bar\omega)$ is the expected total reward obtained by the myopic policy from time $t$ on.

This result follows directly from the description of the policy given in Section III-B.

Proposition 1: The fact that $W_t$ is a polynomial of order 1 and affine in each of its elements implies that

$$W_t(\omega_1, \cdots, \omega_{n-2}, y, x) - W_t(\omega_1, \cdots, \omega_{n-2}, x, y) = (x - y)\left[ W_t(\omega_1, \cdots, \omega_{n-2}, 0, 1) - W_t(\omega_1, \cdots, \omega_{n-2}, 1, 0) \right]. \tag{13}$$

Similar results hold when we change the positions of $x$ and $y$. To see this, consider $W_t(\omega_1, \cdots, \omega_{n-2}, x, y)$ and $W_t(\omega_1, \cdots, \omega_{n-2}, y, x)$, as functions of $x$ and $y$, each having an $x$ term, a $y$ term, an $xy$ term and a constant term. Since we are just swapping the positions of $x$ and $y$ in these two functions, the constant term remains the same, and so does the $xy$ term. Thus the only difference is the $x$ term and the $y$ term, as given in the above equation. This linearity result will be used later in our proofs.
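For completeness, (12) translates directly into a recursive computation of the myopic policy's expected reward; the following is a sketch with our own naming (the paper itself uses $W_t$ purely as a proof device):

```python
def W(t, beliefs, p01, p11, beta, T):
    """Expected myopic reward from slot t per Eq. (12) and Corollary 1;
    `beliefs` must be sorted in increasing order, so the last entry is
    the channel probed."""
    w = beliefs[-1]
    if t == T:
        return w
    rest = tuple(x * p11 + (1 - x) * p01 for x in beliefs[:-1])
    return (w + w * beta * W(t + 1, rest + (p11,), p01, p11, beta, T)
            + (1 - w) * beta * W(t + 1, (p01,) + rest, p01, p11, beta, T))
```

Note how the recursion mirrors the round-robin structure: a channel seen "good" re-enters at the top of the order (belief $p_{11}$, the last argument), and one seen "bad" drops to the bottom (belief $p_{01}$, the first argument).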
The next lemma establishes a necessary and sufficient condition for the optimality of the myopic policy.

Lemma 2: Consider Problem (P1) and Assumption 1. Given the optimality of the myopic policy at times $t+1, t+2, \cdots, T$, its optimality at time $t$ is equivalent to:

$$W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i) \le W_t(\omega_1, \ldots, \omega_n), \quad \text{for all } \omega_1 \le \cdots \le \omega_i \le \cdots \le \omega_n.$$

Proof: Since the myopic policy is optimal from $t+1$ on, it is sufficient to show that probing $\omega_n$ followed by myopic probing is better than probing any other channel followed by myopic probing. The former is precisely given by the RHS of the above equation; the latter by the LHS, thus completing the proof.

Having established that $W_t(\bar\omega)$ is the total expected reward of the myopic policy for an increasingly-ordered vector $\bar\omega = [\omega_1, \cdots, \omega_n]$, we next proceed to show that we do not decrease this total expected reward $W_t(\bar\omega)$ by switching the order of two neighboring elements $\omega_i$ and $\omega_{i+1}$ if $\omega_i \ge \omega_{i+1}$. This is done in two separate cases: when $i+1 < n$ (given in Lemma 4) and when $i+1 = n$ (given in Lemma 5). The first case is quite straightforward, while proving the second case turned out to be significantly more difficult. Our proof of the second case (Lemma 5) relies on a separate lemma (Lemma 3) that establishes a bound between the greedy use of two identical vectors with different starting positions. The proof of Lemma 3 is based on a coupling argument and is quite instructive. Below we present and prove Lemmas 3, 4 and 5.

Lemma 3: For $0 < \omega_1 \le \omega_2 \le \ldots \le \omega_n < 1$, we have the following inequality for all $t = 1, 2, \cdots, T$:

$$1 + W_t(\omega_2, \ldots, \omega_n, \omega_1) \ge W_t(\omega_1, \ldots, \omega_n). \tag{14}$$

Proof: We prove this lemma using a coupling argument along any sample path. The LHS of the above inequality represents the expected reward of a policy (referred to as L below) that probes the channels in the sequence 1, then $n, n-1, \cdots$, then 1 again, and so on, plus an extra reward of 1; the RHS represents the expected reward of a policy (referred to as R below) that probes the channels in the sequence $n$, then $n-1, \cdots, 1$, then $n$ again, and so on. It helps to imagine lining up the $n$ channels along a circle in the sequence $n, n-1, \cdots, 1$, clockwise; L's starting position is channel 1, and R's starting position is channel $n$, exactly one spot ahead of L clockwise. Each cycles around the circle till time $T$. Now, for any realization of the channel conditions (any sample path of the system), consider the sequence of "0"s and "1"s that these two policies see, and the positions they occupy on the circle. The reward policy L collects along a given sample path is $R_l = \sum_{j=t}^{T} \beta^{j} x^l_j$, where $x^l_j = 1$ if L sees a "1" at time $j$ and $x^l_j = 0$ otherwise; the reward $R_r$ for policy R is similarly defined. There are two cases. Case (1): the two eventually catch up with each other at some time $K \le T$, i.e., at some point they start probing exactly the same channel.
From this point on the two policies behave exactly the same way along the same sample path, and the reward they obtain from this point on is exactly the same. Therefore in this case we only need to compare the rewards (L has an extra 1) leading up to this point. Case (2): the two never manage to meet within the horizon $T$. In this case we need to compare the rewards over the entire horizon (from $t$ to $T$).

We consider Case (1) first. There are only two possibilities for the two policies to meet: (Case 1.a) either L has seen exactly one more "0" than R in its sequence, or (Case 1.b) R has seen exactly $n-1$ more "0"s than L. This is because the moment a policy sees a "0" it moves to the next channel on the circle. L is only one position behind R, so one more "0" puts it at exactly the same position as R; likewise, R catches up with L by moving $n-1$ more positions ahead.

Case (1.a): L sees exactly one more "0" than R in its sequence. The extra "0" necessarily occurs at exactly time $K$, $t \le K \le T$, meaning that at $K$, L sees a "0" while R sees a "1". From $t$ to $K-1$, if we write out the sequences of rewards (zeros and ones) under L and R, we observe the following: L and R see equal numbers of zeros, while for every $t' = t, t+1, \ldots, K-1$, the number of zeros up to time $t'$ is no more for L than for R. In other words, L and R see the same number of "0"s, but L's always lag behind (or occur no earlier): for every "0" R sees, L has a matching "0" that occurs no earlier than R's. This means that if we denote by $R_l(t_1, t_2)$ the rewards accumulated between $t_1$ and $t_2$, then for the rewards in $[t, K-1]$ we have $R_l(t, t') \ge R_r(t, t')$ for all $t' \le K-1$, while at time $K$ itself $R_l(K, K) = 0$ and $R_r(K, K) = \beta^K \le 1$. Finally, by definition we have $R_l(K+1, T) = R_r(K+1, T)$. Therefore overall we have $1 + R_l(t, T) \ge R_r(t, T)$, proving the above inequality.

Case (1.b): R sees $n-1$ more "0"s than L does. The comparison is simpler. We only need to note that R's "0"s must in this case precede (or occur no later than) L's, since otherwise we would return to Case (1.a). Therefore we have $R_l \ge R_r$, and thus $1 + R_l \ge R_r$ also holds.

We now consider Case (2). The argument is essentially the same. In this case the two never meet, but they are on their way: either L has exactly the same "0"s as R with positions occurring no earlier (corresponding to Case (1.a)), or R has more "0"s than L (but fewer than $n-1$ more) with positions occurring no later than L's (corresponding to Case (1.b)). Either way we have $1 + R_l \ge R_r$. The proof is thus complete.
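Since the proof establishes the inequality along every sample path, it can be probed numerically: fix one realization of all channel states, run the two circle policies on it, and check $1 + R_l \ge R_r$. A small self-contained sketch (our own construction, with channels indexed $0, \ldots, n-1$ and "clockwise" taken as increasing index):

```python
import random

def circle_reward(path, start, beta):
    """Reward of a circle policy on a fixed sample path: probe the current
    channel; stay on a "1", advance one position on a "0"."""
    n = len(path[0])
    pos, reward = start, 0.0
    for t, states in enumerate(path):
        if states[pos] == 1:
            reward += beta ** t
        else:
            pos = (pos + 1) % n
    return reward

# L starts at position 0, one spot behind R at position 1 (Lemma 3's setup).
n, T, beta = 4, 12, 1.0
for _ in range(10000):
    path = [[random.randint(0, 1) for _ in range(n)] for _ in range(T)]
    assert 1 + circle_reward(path, 0, beta) >= circle_reward(path, 1, beta)
```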
Lemma 4: For all $j$, $1 \le j \le n-3$, and all $x \ge y$, we have

$$W_t(\omega_1, \ldots, \omega_j, x, y, \ldots, \omega_n) \le W_t(\omega_1, \ldots, \omega_j, y, x, \ldots, \omega_n). \tag{15}$$

Proof: We prove this by induction over $t$. The claim is obviously true for $t = T$, since both sides equal $\omega_n$, thereby establishing the induction basis. Now suppose the claim is true for $t+1, \cdots, T-1$. We have

$$\begin{aligned} W_t(\omega_1, \cdots, \omega_j, x, y, \cdots, \omega_n) &= \omega_n \left(1 + \beta W_{t+1}(\tau(\omega_1), \cdots, \tau(x), \tau(y), \cdots, \tau(\omega_{n-1}), p_{11})\right) + (1 - \omega_n) \beta W_{t+1}(p_{01}, \tau(\omega_1), \cdots, \tau(x), \tau(y), \cdots, \tau(\omega_{n-1})) \\ &\le \omega_n \left(1 + \beta W_{t+1}(\tau(\omega_1), \cdots, \tau(y), \tau(x), \cdots, \tau(\omega_{n-1}), p_{11})\right) + (1 - \omega_n) \beta W_{t+1}(p_{01}, \tau(\omega_1), \cdots, \tau(y), \tau(x), \cdots, \tau(\omega_{n-1})) \\ &= W_t(\omega_1, \cdots, \omega_j, y, x, \cdots, \omega_n), \end{aligned} \tag{16}$$

where the inequality is due to the induction hypothesis, noting that $\tau()$ is a monotone increasing mapping in the case of $p_{11} \ge p_{01}$.

Lemma 5: For all $x \ge y$, we have

$$W_t(\omega_1, \ldots, \omega_{n-2}, x, y) \le W_t(\omega_1, \ldots, \omega_{n-2}, y, x). \tag{17}$$

Proof: This lemma is proved inductively. The claim is obviously true for $t = T$. Assume it also holds for times $t+1, \cdots, T-1$. By the definition of $W_t()$ and due to its linearity property, we have

$$\begin{aligned} W_t(\omega_1, \ldots, \omega_{n-2}, y, x) - W_t(\omega_1, \ldots, \omega_{n-2}, x, y) &= (x - y)\left( W_t(\omega_1, \ldots, \omega_{n-2}, 0, 1) - W_t(\omega_1, \ldots, \omega_{n-2}, 1, 0) \right) \\ &= (x - y)\left( 1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{01}, p_{11}) - \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}) \right). \end{aligned}$$

But from the induction hypothesis we know that

$$W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{01}, p_{11}) \ge W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}, p_{01}). \tag{18}$$

This means that

$$1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{01}, p_{11}) - \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}) \ge 1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}, p_{01}) - \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}) \ge 0,$$

where the last inequality is due to Lemma 3 (note that in that lemma we proved $1 + A \ge B$, which obviously implies $1 + \beta A \ge \beta B$ for $0 \le \beta \le 1$, as used above). This, together with the condition $x \ge y$, completes the proof.

We are now ready to prove the main theorem.

Proof of Theorem 1: The basic approach is induction on $t$. The optimality of the myopic policy at time $t = T$ is obvious, establishing the induction basis. Now assume that the myopic policy is optimal for all times $t+1, t+2, \cdots, T-1$; we will show that it is also optimal at time $t$. By Lemma 2 this is equivalent to establishing

$$W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i) \le W_t(\omega_1, \ldots, \omega_n). \tag{19}$$

But we know from Lemmas 4 and 5 that

$$W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i) \le W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_i, \omega_n) \le W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_i, \omega_{n-1}, \omega_n) \le \cdots \le W_t(\omega_1, \ldots, \omega_n),$$

where the first inequality is the result of Lemma 5, while the remaining inequalities are repeated applications of Lemma 4, completing the proof.

We would like to emphasize that from a technical point of view, Lemma 3 is the key to the whole proof: it leads to Lemma 5, which in turn leads to Theorem 1. While Lemma 5 was easy to conceptualize as a sufficient condition to prove the main theorem, Lemma 3 was much more elusive to construct and prove. This, indeed, marks the main difference between the proof techniques used here and those used in our earlier work [6]: Lemma 3 relies on a coupling argument instead of the convex analytic properties of the value function.
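As a numerical spot-check of Theorem 1, the optimal value from the dynamic programming sketch of Section III can be compared against the myopic value $W_1$; the snippet below reuses the hypothetical `make_value_function` and `W` helpers sketched earlier and is illustrative only:

```python
import random

p01, p11, beta, T = 0.2, 0.7, 0.9, 5      # positively correlated: p11 >= p01
V = make_value_function(p01, p11, beta, T)
for _ in range(50):
    beliefs = tuple(sorted(random.random() for _ in range(3)))
    # Theorem 1: myopic reward equals the optimal value when p11 >= p01.
    assert abs(V(1, beliefs) - W(1, beliefs, p01, p11, beta, T)) < 1e-9
```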
V. THE CASE OF $p_{11} < p_{01}$

In the previous section we showed that the myopic policy is optimal if $p_{11} \ge p_{01}$. In this section we examine what happens when $p_{11} < p_{01}$, which corresponds to the case where the Markovian channel state process exhibits a negative auto-correlation over a unit time. This is perhaps a case of less practical interest and relevance. However, as we shall see, it presents a greater degree of technical complexity and richness than the previous case. Specifically, we first show that when the number of channels is three ($n = 3$) or when the discount factor $\beta \le \frac{1}{2}$, the myopic policy remains optimal even for the case of $p_{11} < p_{01}$ (the proof for two channels in this case was given earlier in [5]). We thus conclude that the myopic policy is optimal for $n \le 3$ or $\beta \le 1/2$ regardless of the transition probabilities. We then present a counterexample showing that the myopic policy is not optimal in general when $n \ge 4$ and $\beta > 1/2$. In particular, our counterexample is for a finite horizon with $n = 4$ and $\beta = 1$.

A. $n = 3$ or $\beta \le \frac{1}{2}$

We start by developing some results parallel to those presented in the previous section for the case of $p_{11} \ge p_{01}$.

Lemma 6: There exist $T$ $n$-variable polynomial functions of order 1, denoted by $Z_t()$, $t = 1, 2, \cdots, T$, i.e., each function is linear in all the elements, that can be represented recursively in the following form:

$$Z_t(\bar\omega) := \omega_n \left( 1 + \beta Z_{t+1}(p_{11}, \tau(\omega_{n-1}), \ldots, \tau(\omega_1)) \right) + (1 - \omega_n) \beta Z_{t+1}(\tau(\omega_{n-1}), \ldots, \tau(\omega_1), p_{01}), \tag{20}$$

where $Z_T(\bar\omega) = \omega_n$.

Corollary 2: $Z_t(\bar\omega)$ given in (20) represents the expected total reward of the myopic policy when $\bar\omega$ is ordered in increasing order of $\omega_i$.

Similar to Corollary 1, the above result follows directly from the policy description given in Section III-B. It follows that the function $Z_t$ also has the same linearity property presented earlier, i.e.,

$$Z_t(\omega_1, \cdots, \omega_{n-2}, y, x) - Z_t(\omega_1, \cdots, \omega_{n-2}, x, y) = (x - y)\left( Z_t(\omega_1, \cdots, \omega_{n-2}, 0, 1) - Z_t(\omega_1, \cdots, \omega_{n-2}, 1, 0) \right). \tag{21}$$

Similar results hold when we change the positions of $x$ and $y$. In the next lemma and theorem we prove that the myopic policy is still optimal when $p_{11} < p_{01}$ if $n = 3$ or $\beta \le 1/2$. In particular, Lemma 7 below is the analogue of Lemmas 4 and 5 combined.
Lemma 7: At time $t$ ($t = 1, 2, \cdots, T$), for all $j \le n-2$, we have the following inequality for all $1 \ge x \ge y \ge 0$ if either $n = 3$ or $\beta \le 1/2$:

$$Z_t(\omega_1, \ldots, \omega_j, y, x, \omega_{j+3}, \ldots, \omega_n) \ge Z_t(\omega_1, \ldots, \omega_j, x, y, \omega_{j+3}, \ldots, \omega_n). \tag{22}$$

Proof: We prove this by induction on $t$. The claim is obviously true for $t = T$. Now suppose it is true for $t+1, \cdots, T-1$. Due to the linearity property of $Z_t$,

$$Z_t(\omega_1, \ldots, \omega_j, y, x, \omega_{j+3}, \ldots, \omega_n) - Z_t(\omega_1, \ldots, \omega_j, x, y, \omega_{j+3}, \ldots, \omega_n) = (x - y)\left( Z_t(\omega_1, \ldots, \omega_j, 0, 1, \omega_{j+3}, \ldots, \omega_n) - Z_t(\omega_1, \ldots, \omega_j, 1, 0, \omega_{j+3}, \ldots, \omega_n) \right). \tag{23}$$

Thus it suffices to show that $Z_t(\omega_1, \ldots, \omega_j, 0, 1, \omega_{j+3}, \ldots, \omega_n) \ge Z_t(\omega_1, \ldots, \omega_j, 1, 0, \omega_{j+3}, \ldots, \omega_n)$. We treat the cases $j < n-2$ and $j = n-2$ separately. For the first case, without loss of generality let $j = n-3$ (the proof follows exactly for all $j \le n-3$ with lengthier notation). At time $t$ we have

$$\begin{aligned} &Z_t(\omega_1, \ldots, \omega_{n-3}, 0, 1, \omega_n) - Z_t(\omega_1, \ldots, \omega_{n-3}, 1, 0, \omega_n) \\ &= \omega_n \beta \left( Z_{t+1}(p_{11}, p_{11}, p_{01}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1)) - Z_{t+1}(p_{11}, p_{01}, p_{11}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1)) \right) \\ &\quad + (1 - \omega_n) \beta \left( Z_{t+1}(p_{11}, p_{01}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1), p_{01}) - Z_{t+1}(p_{01}, p_{11}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1), p_{01}) \right) \ge 0, \end{aligned}$$

where the last inequality is due to the induction hypothesis. Now consider the case $j = n-2$:

$$Z_t(\omega_1, \ldots, \omega_{n-2}, 0, 1) - Z_t(\omega_1, \ldots, \omega_{n-2}, 1, 0) = 1 + \beta Z_{t+1}(p_{11}, p_{01}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1)) - \beta Z_{t+1}(p_{11}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1), p_{01}). \tag{24}$$

Next we show that if $\beta \le 1/2$ or $n = 3$, the right hand side of (24) is non-negative. If $\beta \le 1/2$, then, since $0 \le Z_{t+1}(\cdot) \le \frac{1}{1-\beta}$,

$$1 + \beta Z_{t+1}(p_{11}, p_{01}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1)) - \beta Z_{t+1}(p_{11}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1), p_{01}) \ge 1 - \frac{\beta}{1 - \beta} \ge 0.$$

If $n = 3$, then

$$1 + \beta Z_{t+1}(p_{11}, p_{01}, \tau(\omega_1)) - \beta Z_{t+1}(p_{11}, \tau(\omega_1), p_{01}) = 1 + \beta (\tau(\omega_1) - p_{01}) \left( Z_{t+1}(p_{11}, 0, 1) - Z_{t+1}(p_{11}, 1, 0) \right) \ge 1 - \beta \left( Z_{t+1}(p_{11}, 0, 1) - Z_{t+1}(p_{11}, 1, 0) \right) \ge 0,$$

where the first inequality is due to the fact that $-1 \le \tau(\omega_1) - p_{01} \le 0$, and the last inequality is given by the induction hypothesis.

Theorem 2: Consider Problem (P1) and assume $p_{11} < p_{01}$. The myopic policy is optimal for the case of $n = 3$ and for the case of $\beta \le 1/2$ with arbitrary $n$. More precisely, for these two cases, for all $t$, $1 \le t \le T$, we have

$$V_t(\bar\omega; a = j) - V_t(\bar\omega; a = i) \ge 0, \tag{25}$$

if $\omega_j \ge \omega_i$ for $i = 1, \cdots, n$.

Proof: We prove this by induction on $t$. The optimality of the myopic policy at time $t = T$ is obvious. Now assume that the myopic policy is optimal for all times $t+1, t+2, \cdots, T-1$; we want to show that it is also optimal at time $t$. Suppose at time $t$ the channel probabilities are such that $\omega_n \ge \omega_i$ for $i = 1, \cdots, n-1$. The myopic policy is optimal at time $t$ if and only if probing $\omega_n$ followed by myopic probing is better than probing any other channel followed by myopic probing. Mathematically, this means

$$Z_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i) \le Z_t(\omega_1, \ldots, \omega_n), \quad \text{for all } \omega_1 \le \cdots \le \omega_i \le \cdots \le \omega_n.$$

But this is a direct consequence of Lemma 7, completing the proof.

B. A 4-Channel Counterexample

The following example shows that the myopic policy is not, in general, optimal for $n \ge 4$ when $p_{11} < p_{01}$.

Example 1: Consider an example with the following parameters: $p_{01} = 0.9$, $p_{11} = 0.1$, $\beta = 1$, and $\bar\omega = [0.97, 0.97, 0.98, 0.99]$. Now compare the following two policies at time $T-3$: play myopically (I), or play the 0.98 channel first, followed by the myopic policy (II). Computation reveals that

$$V^{I}_{T-3}(0.97, 0.97, 0.98, 0.99) = 2.401863 < V^{II}_{T-3}(0.97, 0.97, 0.98, 0.99) = 2.402968,$$

which shows that the myopic policy is not optimal in this case. It remains an interesting question whether such counterexamples exist when the initial condition is such that all channels are in the good state with the stationary probability.
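Example 1 can be reproduced by direct computation. The sketch below (our own function names, not the paper's) recurses on the belief vector with $\beta = 1$ over the four remaining slots $T-3, \ldots, T$ and recovers the two values quoted above:

```python
def myopic_value(beliefs, steps, p01, p11):
    """Expected total reward (beta = 1) of acting myopically for `steps`
    slots, computed by recursing on the belief vector."""
    if steps == 0:
        return 0.0
    a = max(range(len(beliefs)), key=lambda i: beliefs[i])
    return probe_then_myopic(beliefs, a, steps, p01, p11)

def probe_then_myopic(beliefs, a, steps, p01, p11):
    # Probe channel a now, update beliefs per Eq. (2), continue myopically.
    w = beliefs[a]
    rest = [x * p11 + (1 - x) * p01 for x in beliefs]
    good, bad = list(rest), list(rest)
    good[a], bad[a] = p11, p01
    return (w + w * myopic_value(good, steps - 1, p01, p11)
            + (1 - w) * myopic_value(bad, steps - 1, p01, p11))

p01, p11 = 0.9, 0.1
beliefs = [0.97, 0.97, 0.98, 0.99]
v_myopic = myopic_value(beliefs, 4, p01, p11)       # policy I:  2.401863...
v_alt = probe_then_myopic(beliefs, 2, 4, p01, p11)  # policy II: 2.402968...
```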
VI. INFINITE HORIZON

We now consider extensions of the results in Sections IV and V to (P2) and (P3), i.e., we show that the myopic policy is also optimal for (P2) and (P3) under the same conditions. Intuitively, this holds because the stationary optimal policy of the finite horizon problem is independent of the horizon as well as the discount factor. Theorems 3 and 4 below concretely establish this. We point out that the proofs of Theorems 3 and 4 do not rely on any additional assumptions other than the optimality of the myopic policy for (P1). Indeed, if the optimality of the myopic policy for (P1) can be established under weaker conditions, Theorems 3 and 4 can be readily invoked to establish its optimality under the same weaker conditions for (P2) and (P3), respectively.

Theorem 3: If the myopic policy is optimal for (P1), it is also optimal for (P2) for $0 \le \beta < 1$. Furthermore, its value function is the limiting value function of (P1) as the time horizon goes to infinity, i.e., we have $\max_\pi J^\pi_\beta(\bar\omega) = \lim_{T \to \infty} \max_\pi J^\pi_T(\bar\omega)$.

Proof: We first use the bounded convergence theorem (BCT) to establish the fact that under any deterministic stationary Markov policy $\pi$, we have $J^\pi_\beta(\bar\omega) = \lim_{T \to \infty} J^\pi_T(\bar\omega)$. We prove this by noting that

$$J^\pi_\beta(\bar\omega) = \mathbb{E}^\pi \left[ \lim_{T \to \infty} \sum_{t=1}^{T} \beta^{t-1} R_{\pi(t)}(\bar\omega(t)) \,\middle|\, \bar\omega(1) = \bar\omega \right] = \lim_{T \to \infty} \mathbb{E}^\pi \left[ \sum_{t=1}^{T} \beta^{t-1} R_{\pi(t)}(\bar\omega(t)) \,\middle|\, \bar\omega(1) = \bar\omega \right] = \lim_{T \to \infty} J^\pi_T(\bar\omega), \tag{26}$$

where the second equality is due to the BCT, since $\sum_{t=1}^{T} \beta^{t-1} R_{\pi(t)}(\bar\omega(t)) \le \frac{1}{1-\beta}$. This proves the second part of the theorem, by noting that due to the finiteness of the action space we can interchange maximization and limit.

Let $\pi^*$ denote the myopic policy. We now establish the optimality of $\pi^*$ for (P2). From Theorem 1, we know:

$$J^{\pi^*}_T(\bar\omega) = \max_{a=i} \left( \omega_i + \beta \omega_i J^{\pi^*}_{T-1}(\mathcal{T}(\bar\omega, i \mid 1)) + \beta (1 - \omega_i) J^{\pi^*}_{T-1}(\mathcal{T}(\bar\omega, i \mid 0)) \right).$$

Taking the limit on both sides, we have

$$J^{\pi^*}_\beta(\bar\omega) = \max_{a=i} \left( \omega_i + \beta \omega_i J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, i \mid 1)) + \beta (1 - \omega_i) J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, i \mid 0)) \right). \tag{27}$$

Note that (27) is nothing but the dynamic programming equation for the infinite horizon discounted reward problem given in (7). From the uniqueness of the dynamic programming solution, then, we have $J^{\pi^*}_\beta(\bar\omega) = V_\beta(\bar\omega) = \max_\pi J^\pi_\beta(\bar\omega)$; hence the myopic policy is optimal.

Theorem 4: Consider (P3) with the expected average reward and under the ergodicity assumption $|p_{11} - p_{00}| < 1$. The myopic policy is optimal for problem (P3) if it is optimal for (P1).

Proof: We consider the infinite horizon discounted reward for $\beta < 1$ under the optimal policy denoted by $\pi^*$:

$$J^{\pi^*}_\beta(\bar\omega) = \max_{a=i} \left( \omega_i + \beta \omega_i J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, i \mid 1)) + \beta (1 - \omega_i) J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, i \mid 0)) \right). \tag{28}$$

This can be written as

$$(1 - \beta) J^{\pi^*}_\beta(\bar\omega) = \max_{a=i} \left( \omega_i + \beta \omega_i \left[ J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, i \mid 1)) - J^{\pi^*}_\beta(\bar\omega) \right] + \beta (1 - \omega_i) \left[ J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, i \mid 0)) - J^{\pi^*}_\beta(\bar\omega) \right] \right).$$

Notice that the boundedness of the reward function and the compactness of the information state space imply that the sequence $\{(1 - \beta) J^{\pi^*}_\beta(\bar\omega)\}$ is bounded, i.e., for all $0 \le \beta < 1$,

$$(1 - \beta) J^{\pi^*}_\beta(\bar\omega) \le 1. \tag{29}$$

Also, applying Lemma 2 from [6] (which provides an upper bound on the difference in value functions between taking two different actions followed by the optimal policy) and noting that $-1 < p_{11} - p_{00} < 1$, we have that there exists some positive constant $K := \frac{1}{1 - |p_{11} - p_{01}|}$ such that

$$\left| J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, i \mid 0)) - J^{\pi^*}_\beta(\bar\omega) \right| \le K. \tag{30}$$
By the Bolzano-Weierstrass theorem, (29) and (30) guarantee the existence of a converging sequence $\beta_k \to 1$ such that

$$\lim_{k \to \infty} (1 - \beta_k) J^{\pi^*}_{\beta_k}(\bar\omega^*) := J^*, \tag{31}$$

and

$$\lim_{k \to \infty} \left( J^{\pi^*}_{\beta_k}(\bar\omega) - J^{\pi^*}_{\beta_k}(\bar\omega^*) \right) := h^{\pi^*}(\bar\omega), \tag{32}$$

where $\omega^*_i := \frac{p_{01}}{1 - p_{11} + p_{01}}$ is the steady-state belief (the limiting belief when channel $i$ is not sensed for a long time). As a result, (31) can be written as

$$J^* = \lim_{k \to \infty} \left[ (1 - \beta_k) J^{\pi^*}_{\beta_k}(\bar\omega^*) + (1 - \beta_k) \left( J^{\pi^*}_{\beta_k}(\bar\omega) - J^{\pi^*}_{\beta_k}(\bar\omega^*) \right) \right].$$

In other words,

$$J^* = \lim_{k \to \infty} \max_{a=i} \left( \omega_i + \beta_k \omega_i \left[ J^{\pi^*}_{\beta_k}(\mathcal{T}(\bar\omega, i \mid 1)) - J^{\pi^*}_{\beta_k}(\bar\omega) \right] + \beta_k (1 - \omega_i) \left[ J^{\pi^*}_{\beta_k}(\mathcal{T}(\bar\omega, i \mid 0)) - J^{\pi^*}_{\beta_k}(\bar\omega) \right] \right).$$

From (32), we can write this as

$$J^* + h^{\pi^*}(\bar\omega) = \max_{a=i} \left( \omega_i + \omega_i h^{\pi^*}(\mathcal{T}(\bar\omega, i \mid 1)) + (1 - \omega_i) h^{\pi^*}(\mathcal{T}(\bar\omega, i \mid 0)) \right). \tag{33}$$

Note that (33) is nothing but the DP equation given by (8). In addition, we know that the immediate reward as well as the function $h$ are both bounded by $\max(1, K)$. This implies that $J^*$ is the maximum average reward, i.e., $J^* = \max_\pi J^\pi_\infty(\bar\omega(t))$ (see [16, Theorems 6.1-6.3]).

On the other hand, we know from Theorem 3 that the myopic policy is optimal for (P2) if it is for (P1), and thus we can take $\pi^*$ in (28) to be the myopic policy. Rewriting (28) gives the following:

$$J^{\pi^*}_\beta(\bar\omega) = \omega_{\pi^*(\bar\omega)} + \beta \omega_{\pi^*(\bar\omega)} J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, \pi^*(\bar\omega) \mid 1)) + \beta (1 - \omega_{\pi^*(\bar\omega)}) J^{\pi^*}_\beta(\mathcal{T}(\bar\omega, \pi^*(\bar\omega) \mid 0)).$$

Repeating steps (31)-(33) we arrive at the following:

$$J^* + h^{\pi^*}(\bar\omega) = \omega_{\pi^*(\bar\omega)} + \omega_{\pi^*(\bar\omega)} h^{\pi^*}(\mathcal{T}(\bar\omega, \pi^*(\bar\omega) \mid 1)) + (1 - \omega_{\pi^*(\bar\omega)}) h^{\pi^*}(\mathcal{T}(\bar\omega, \pi^*(\bar\omega) \mid 0)), \tag{34}$$

which shows that $(J^*, h^{\pi^*}, \pi^*)$ is a canonical triplet [16, Theorem 6.2]. This, together with the boundedness of $h^{\pi^*}$ and of the immediate reward, implies that the myopic policy $\pi^*$ is optimal for (P3) [16, Theorem 6.3].

VII. DISCUSSION AND RELATED WORK

The problem studied in this paper may be viewed as a special case of a class of MDPs known as restless bandit problems [2]. In this class of problems, $N$ controlled Markov chains (also called projects or machines) are activated (or played) one at a time. A machine, when activated, generates a state-dependent reward and transits to the next state according to a Markov rule. A machine not activated transits to the next state according to a (potentially different) Markov rule. The problem is to decide the sequence in which these machines are activated so as to maximize the expected (discounted or average) reward over an infinite horizon. To put our problem in this context, each channel corresponds to a machine; a channel is activated when it is probed, and its information state then goes through a transition depending on the observation and the underlying channel model. When a channel is not probed, its information state goes through a transition based solely on the underlying channel model⁵.

In the case where a machine stays frozen in its current state when not played, the problem reduces to the multi-armed bandit problem, a class of problems solved by Gittins in his seminal work in the early 1970s [17].

⁵ The standard definition of bandit problems typically assumes finite or countably infinite state spaces. While our problem can potentially have an uncountable state space, it is nevertheless countable for a given initial state. This view has been taken throughout the paper.
Gittins showed that there exists an index associated with each machine that is solely a function of that individual machine and its state, and that playing the machine currently with the highest index is optimal. This index has since been referred to as the Gittins index, following Whittle [18]. The remarkable nature of this result lies in the fact that it essentially decomposes the $N$-dimensional problem into $N$ one-dimensional problems, as an index is defined for each machine independent of the others.

The basic multi-armed bandit model has been used previously in the context of channel access and cognitive radio networks. For example, in [19], Bayesian learning was used to estimate the probability of a channel being available, and the Gittins indices, calculated based on such estimates (which were only updated when a channel is observed and used, thus giving rise to a multi-armed bandit formulation rather than a restless bandit formulation), were used for channel selection.

On the other hand, relatively little is known about the structure of the optimal policies for restless bandit problems in general. It has been shown that the Gittins index policy is not in general optimal in this case [2], and that this class of problems is PSPACE-hard in general [20]. Whittle, in [2], proposed a Gittins-like index (referred to as Whittle's index policy), shown to be optimal under a constraint on the average number of machines that can be played at a given time, and asymptotically optimal under certain limiting regimes [21]. There has been a large volume of literature in this area, including various approximation algorithms; see for example [22] and [23] for near-optimal heuristics, and [24], [25] for conditions under which certain policies are optimal for special cases of the restless bandit problem. The nature of the results derived in the present paper is similar in spirit to that of [24], [25]: we have shown that for this special case of the restless bandit problem an index policy is optimal under certain conditions. For the indexability (as defined by Whittle [2]) of this problem, see [26].

Recently, Guha and Munagala [27], [28] studied a class of problems referred to as feedback multi-armed bandit problems. This class is very similar to the restless bandit problem studied in the present paper, with the difference that channels may have different transition probabilities (thus it is a slight generalization of the one studied here). While we identified conditions under which a simple greedy index policy is optimal, Guha and Munagala in [27], [28] looked for provably good approximation algorithms. In particular, they derived a $(2 + \epsilon)$-approximate policy using a duality-based technique.

VIII. CONCLUSION

The general problem of opportunistic sensing and access arises in many multi-channel communication contexts. For cases where the stochastic evolution of channels can be modelled as i.i.d. two-state Markov chains, we showed that a simple and robust myopic policy is optimal for the finite and infinite horizon discounted reward criteria as well as the infinite horizon average reward criterion, when the state transitions are positively correlated over time.
VIII. CONCLUSION

The general problem of opportunistic sensing and access arises in many multi-channel communication contexts. For cases where the stochastic evolution of channels can be modelled as i.i.d. two-state Markov chains, we showed that a simple and robust myopic policy is optimal under the finite-horizon and infinite-horizon discounted reward criteria, as well as the infinite-horizon average reward criterion, when the state transitions are positively correlated over time. When the state transitions are negatively correlated, we showed that the same policy is optimal when the number of channels is limited to 2 or 3, and presented a counterexample for the case of 4 channels.

REFERENCES

[1] R. Smallwood and E. Sondik, “The optimal control of partially observable Markov processes over a finite horizon,” Operations Research, pp. 1071–1088, 1971.
[2] P. Whittle, “Restless bandits: Activity allocation in a changing world,” in A Celebration of Applied Probability, J. Gani, Ed., Journal of Applied Probability, vol. 25A, pp. 287–298, 1988.
[3] Q. Zhao and B. Sadler, “A survey of dynamic spectrum access,” IEEE Signal Processing Magazine, vol. 24, pp. 79–89, May 2007.
[4] Q. Zhao, L. Tong, A. Swami, and Y. Chen, “Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework,” IEEE Journal on Selected Areas in Communications: Special Issue on Adaptive, Spectrum Agile and Cognitive Wireless Networks, April 2007.
[5] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance,” IEEE Trans. Wireless Communications, vol. 7, pp. 5431–5440, December 2008.
[6] T. Javidi, B. Krishnamachari, Q. Zhao, and M. Liu, “Optimality of myopic sensing in multi-channel opportunistic access,” in IEEE International Conference on Communications (ICC), Beijing, China, May 2008.
[7] A. Ganti, E. Modiano, and J. N. Tsitsiklis, “Optimal transmission scheduling in symmetric communication models with intermittent connectivity,” IEEE Trans. Information Theory, vol. 53, pp. 998–1008, March 2007.
[8] A. Ganti, Transmission Scheduling for Multi-Beam Satellite Systems. Ph.D. dissertation, Dept. of EECS, MIT, Cambridge, MA, 2003.
[9] G. Koole, “Monotonicity in Markov reward and decision chains: Theory and applications,” Foundations and Trends in Stochastic Systems, 2006.
[10] K. Liu and Q. Zhao, “Channel probing for opportunistic access with multi-channel sensing,” in IEEE Asilomar Conference on Signals, Systems, and Computers, October 2008.
[11] N. Chang and M. Liu, “Optimal channel probing and transmission scheduling for opportunistic spectrum access,” in International Conference on Mobile Computing and Networking (MOBICOM), Montreal, Canada, September 2007.
[12] Q. Zhao, B. Krishnamachari, and K. Liu, “Low-complexity approaches to spectrum opportunity tracking,” in 2nd International Conference on Cognitive Radio Oriented Wireless Networks and Communications (CrownCom), August 2007.
[13] E. Fernandez-Gaucherand, A. Arapostathis, and S. I. Marcus, “On the average cost optimality equation and the structure of optimal policies for partially observable Markov decision processes,” Annals of Operations Research, vol. 29, December 1991.
[14] P. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. Englewood Cliffs, NJ: Prentice-Hall, 1986.
[15] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics, Wiley-Interscience, 1994.
[16] A. Arapostathis, V. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus, “Discrete-time controlled Markov processes with average cost criterion: A survey,” SIAM Journal on Control and Optimization, vol. 31, March 1993.
[17] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society, Series B, vol. 41, no. 2, pp. 148–177, 1979.
[18] P. Whittle, “Multi-armed bandits and the Gittins index,” Journal of the Royal Statistical Society, Series B, vol. 42, no. 2, pp. 143–149, 1980.
[19] A. Motamedi and A. Bahai, “MAC protocol design for spectrum-agile wireless networks: Stochastic control approach,” in IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks (DySPAN), 2007.
[20] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queueing network control,” Mathematics of Operations Research, vol. 24, pp. 293–305, May 1999.
[21] R. R. Weber and G. Weiss, “On an index policy for restless bandits,” Journal of Applied Probability, vol. 27, pp. 637–648, 1990.
[22] D. Bertsimas and J. E. Niño-Mora, “Restless bandits, linear programming relaxations, and a primal-dual heuristic,” Operations Research, vol. 48, January–February 2000.
[23] J. E. Niño-Mora, “Restless bandits, partial conservation laws and indexability,” Advances in Applied Probability, vol. 33, pp. 76–98, 2001.
[24] C. Lott and D. Teneketzis, “On the optimality of an index rule in multi-channel allocation for single-hop mobile networks with multiple service classes,” Probability in the Engineering and Informational Sciences, vol. 14, pp. 259–297, July 2000.
[25] N. Ehsan and M. Liu, “Server allocation with delayed state observation: Sufficient conditions for the optimality of an index policy,” IEEE Transactions on Wireless Communications, 2008, to appear.
[26] K. Liu and Q. Zhao, “A restless multiarmed bandit formulation of opportunistic access: Indexability and index policy,” in 5th IEEE Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), June 2008; a complete version submitted to IEEE Transactions on Information Theory and available at http://arxiv.org/abs/0810.4658.
[27] S. Guha and K. Munagala, “Approximation algorithms for partial-information based stochastic control with Markovian rewards,” in 48th IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
[28] S. Guha, K. Munagala, and P. Shi, “Approximation algorithms for restless bandit problems,” in ACM-SIAM Symposium on Discrete Algorithms (SODA), 2009.
