Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs

Output-Sensitiv e Adaptiv e Metrop olis-Hastings for Probabilistic Programs Da vid T olpin, Jan Willem v an de Meent, Brooks Paige, and F rank W oo d Univ ersity of Oxford Departmen t of Engineering Science {dtolpin, jwvdm, brooks, fwood}@robots.ox.ac.uk Abstract. W e in tro duce an adaptive output-sensitive Metrop olis-Hast- ings algorithm for probabilistic models expressed as programs, Adap- tiv e Light w eight Metrop olis-Hastings (AdLMH). The algorithm extends Ligh tweigh t Metrop olis-Hastings (LMH) by adjusting the probabilities of proposing random v ariables for modiﬁcation to improv e con vergence of the program output. W e sho w that AdLMH conv erges to the correct equilibrium distribution and compare conv ergence of AdLMH to that of LMH on several test problems to highlight diﬀerent asp ects of the adap- tation sc heme. W e observ e consisten t impro vemen t in conv ergence on the test problems. Keyw ords: Probabilistic programming, adaptiv e MCMC 1 In tro duction One strategy for improving con v ergence of Mark ov Chain Monte Carlo (MCMC) samplers is through online adaptation of the proposal distribution [1,2,14]. An adaptation sc heme must ensure that the sample sequence conv erges to the cor- rect equilibrium distribution. In a comp onent wise up dating Metropolis-Hastings MCMC, i.e. Metrop olis-within-Gibbs [5,8,10], the prop osal distribution can be decomp osed in to t wo comp onen ts: 1. A sto c hastic schedule (probabilit y distribution) for selecting the next random v ariable for mo diﬁcation. 2. The kernels from which new v alues for each of the v ariables are prop osed. In this pap er w e concentrate on the ﬁrst comp onen t—adapting the schedule for selecting a v ariable for mo diﬁcation. Our primary interest in this work is to impro ve MCMC metho ds for proba- bilistic programming [6,7,11,16]. Probabilistic programming languages facilitate dev elopment of general probabilistic models using the expressiv e pow er of gen- eral programming languages. The goal of inference in such programs is to reason ab out the posterior distribution o ver random v ariates that are sampled during execution, conditioned on observ ed v alues that constrain a subset of program expressions. 2 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od Ligh tw eight Metrop olis-Hastings (LMH) samplers [15] propose a change to a single random v ariable at eac h iteration. The program is then rerun, reusing previous v alues and computation where p ossible, after whic h the new set of sam- ple v alues is accepted or rejected. While re-running the program eac h time ma y w aste some computation, the simplicit y of LMH makes developing probabilistic v arian ts of arbitrary languages relatively straigh tforward. Designing robust Adaptive MCMC metho ds for probabilistic programming is complicated because of diversit y of mo dels expressed b y probabilistic programs. The same adaption sc heme should perform w ell with diﬀeren t programs without man ual tuning. Here we presen t an adaptive v ariant of LMH, whic h dynamically adjusts the sc hedule for selecting v ariables for mo diﬁcation. First, we review the general structure of a probabilistic program. W e discuss con vergence crite- ria with resp ect to the program output and prop ose a sc heme for tracking the “inﬂuence” of eac h random v ariable on the output. W e then adapt the selection probabilit y for each v ariable, b orro wing techniques from the upp er conﬁdence b ound (UCB) family of algorithms for multi-armed bandits [3]. W e sho w that the proposed adaptation sc heme preserves conv ergence to the target distribution under reasonable assumptions. Finally , we compare original and Adaptiv e LMH on several test problems to show how con vergence is improv ed b y adaptation. 2 Preliminaries 2.1 Probabilistic Program A probabilistic program is a stateful deterministic computation P with the fol- lo wing prop erties: – Initially , P exp ects no arguments. – On ev ery call, P returns either a distribution F , a distribution and a v alue ( G, y ), a v alue z , or ⊥ . – Up on returning F , P exp ects a v alue x drawn from F as the argumen t to the next call. – Up on returning ( G, y ) or z , P is in vok ed again without arguments. – Up on returning ⊥ , P terminates. A program is run b y calling P rep eatedly un til termination. Ev ery run of the program implicitly pro duces a sequence of pairs ( F i , x i ) of distributions and v alues of latent random v ariables. W e call this sequence a tr ac e and denote it b y x x x . A trace induces a sequence of pairs ( G j , y j ) of distributions and v alues of observe d random v ariables. W e call this sequence an image and denote it b y y y y . W e call a sequence of v alues z k an output of the program and denote it b y z z z . Program output is deterministic given the trace. The probability of a trace is prop ortional to the pro duct of the probability of all random choices x x x and the likelihoo d of all observ ations y y y p ( x x x ) ∝ | x x x | Y i =1 p F i ( x i ) | y y y | Y j =1 p G j ( y j ) . (1) Output-Sensitiv e Adaptive Metrop olis-Hastings for Probabilistic Programs 3 Algorithm 1 Adaptiv e comp onen t wise MH 1: Select initial p oin t x x x 1 . 2: Set initial selection probabilities α α α 1 . 3: for t = 1 . . . ∞ do 4: α α α t ← f t ( α α α t , x x x 0 , x x x 1 , . . . , x x x t ). 5: Cho ose k ∈ { 1 , . . . , N } with probability α t k . 6: Generate x x x 0 ∼ q k θ ( x x x | x x x t ). 7: ρ ← min  π ( x x x 0 ) q k θ ( x x x t | x x x 0 ) π ( x x x t ) q k θ ( x x x 0 | x x x t ) , 1  8: x x x t +1 ← x x x 0 with probability ρ , x x x t otherwise. 9: end for The ob jectiv e of inference in probabilistic program P is to disco ver the distribu- tion of z z z . 2.2 Adaptiv e Mark ov Chain Mon te Carlo MCMC methods generate a sequence of samples { x x x t } ∞ t =1 b y sim ulating a Mark o v c hain using a transition op erator that leav es a target densit y π ( x x x ) inv ariant. In MH the transition op erator is implemen ted b y dra wing a new sample x x x 0 from a parameterized prop osal distribution q θ ( x x x 0 | x x x t ) that is conditioned on the curren t sample x x x t . The prop osed sample is then accepted with probability ρ = min  π ( x x x 0 ) q θ ( x x x t | x x x 0 ) π ( x x x t ) q θ ( x x x 0 | x x x t ) , 1  . (2) If x x x 0 is rejected, x x x t is re-used as the next sample. The conv ergence rate of MH dep ends on parameters θ of the prop osal distri- bution q θ . The parameters can b e set either oﬄine or online. V ariants of MCMC in which the parameters are con tinuously adjusted based on the features of the sample sequence are called adaptive. Challenges in design and analysis of Adap- tiv e MCMC methods include optimization criteria and algorithms for the pa- rameter adaptation, as well as conditions of conv ergence of Adaptive MCMC to the correct equilibrium distribution [13]. Contin uous adaptation of parameters of the prop osal distribution is a well-kno wn research sub ject [1,2,14]. In a comp onen twise MH algorithm [10] whic h targets a densit y π ( x x x ) de- ﬁned on an N -dimensional space X , the components of a random sample x x x = { x 1 , . . . , x N } are updated individually , in either random or systematic order. Assuming the comp onen t i is selected at the step t for mo diﬁcation, the pro- p osal x x x 0 sampled from q i θ ( x x x | x x x t ) ma y diﬀer from x x x t only in that comp onent, and x 0 j = x t j for all j 6 = i . Adaptiv e comp onent wise Metrop olis-Hastings (Algo- rithm 1) chooses diﬀerent probabilities for selecting a comp onent for modiﬁcation at each iteration. Parameters of this sc heduling distribution may b e view ed as a subset of parameters θ of the prop osal distribution q θ , and adjusted according to optimization criteria of the sampling algorithm. 4 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od V arying selection probabilities based on past samples violates the Marko v prop ert y of { x x x t } ∞ 1 . Ho wev er, provided the adaptation of the selection probabili- ties diminishes, with || α α α t − α α α t − 1 || → 0, then under suitable regularity conditions for the target density (see Section 4) an adaptive comp onent wise MH algorithm will still b e ergo dic [8], and the distribution on x x x induced by Algorithm 1 con- v erges to π . 2.3 Ligh tw eigh t Metropolis-Hastings LMH [15] is a sampling sc heme for probabilistic programs where a single random v ariable drawn in the course of a particular execution of a probabilistic program is mo diﬁed via a standard MH prop osal, and this mo diﬁcation is accepted by comparing the v alues of the joint probabilit y of old and new program traces. LMH diﬀers from comp onen t wise MH algorithms in that other random v ariables ma y also hav e to b e mo diﬁed, dep ending on the structural dep endencies in the probabilistic program. LMH initializes a prop osal b y selecting a single v ariable x k from an execution trace x x x and resampling its v alue x 0 k either using a reversible kernel κ ( x 0 k | x k ) or from the conditional prior distribution. Starting from this initialization, the program is rerun to generate a new trace x x x 0 . F or each m > k , the previous v alue x m is reused, provided it still lies in the support of the distribution on x 0 m , rescoring its log probabilit y relative to the new random choices { x 0 1 , . . . , x 0 m − 1 } . When x 0 m cannot b e rescored, a new v alue is sampled from the prior on x 0 m , conditioned on preceding choices. The acceptance probability ρ LMH is obtained b y substituting (1) in to (2): ρ LMH = min  1 , p ( y y y 0 | x x x 0 ) p ( x x x 0 ) q ( x x x | x x x 0 ) p ( y y y | x x x ) p ( x x x ) q ( x x x 0 | x x x )  . (3) W e here further simplify LMH by assuming x 0 k is sampled from the prior condi- tioned on earlier choices and that all v ariables are selected for mo diﬁcation with equal probability . In this case, ρ LMH tak es the form [16] ρ LMH = min  1 , p ( y y y 0 | x x x 0 ) p ( x x x 0 ) | x x x | p ( x x x \ x x x 0 | x x x ∩ x x x 0 ) p ( y y y | x x x ) p ( x x x ) | x x x 0 | p ( x x x 0 \ x x x | x x x 0 ∩ x x x )  , (4) where x x x 0 \ x x x denotes the resampled v ariables, and x x x 0 ∩ x x x denotes the v ariables whic h ha ve the same v alues in both traces. 3 Adaptiv e Ligh t w eight Metrop olis-Hastings W e develop an adaptiv e v arian t of LMH, which dynamically adjusts the proba- bilities of selecting v ariables for mo diﬁcation (Algorithm 2). Let x x x t b e the trace at iteration t of Adaptive LMH. W e deﬁne the probability distribution of se- lecting v ariables for mo diﬁcation in terms of a weigh t v ector W W W t that w e adapt, Output-Sensitiv e Adaptive Metrop olis-Hastings for Probabilistic Programs 5 Algorithm 2 Adaptiv e LMH 1: Initialize W W W 0 to a constant. 2: Run the program. 3: for t = 1 . . . ∞ do 4: Randomly select a v ariable x t k according to W W W t . 5: Prop ose a v alue for x t k . 6: Run the program, accept or reject the trace. 7: if accepted then 8: Compute W W W t +1 based on the program output. 9: else 10: W W W t +1 ← W W W t 11: end if 12: end for suc h that the probabilit y α t i of selecting the i th v ariable for mo diﬁcation is α t i = W t i | x x x t | P k =1 W t k . (5) Just like LMH, Adaptive LMH runs the probabilistic program once and then selects v ariables for mo diﬁcation randomly . Ho wev er, acceptance ratio ρ AdLMH m ust now include selection probabilities α k and α 0 k of the resampled v ariable in the current and the prop osed sample ρ AdLMH = min  1 , p ( y y y 0 | x x x 0 ) p ( x x x 0 ) α 0 k p ( x x x \ x x x 0 | x x x ∩ x x x 0 ) p ( y y y | x x x ) p ( x x x ) α k p ( x x x 0 \ x x x | x x x 0 ∩ x x x )  . (6) This high-lev el description of the algorithm do es not detail ho w W W W t is com- puted for each iteration. Indeed, this is the most essential part of the algorithm. There are tw o diﬀerent asp ects here — on one hand, the inﬂuence of a given c hoice on the output sequence m ust be quan tiﬁed in terms of con vergence of the sequence to the target distribution. On the other hand, the inﬂuence of the c hoice m ust b e translated in to re-computation of weigh ts of random v ariables in the trace. Both parts of re-computation of W W W t are explained b elo w. 3.1 Quan tifying Inﬂuence Extensiv e research literature is av ailable on criteria for tuning parameters of Adaptiv e MCMC [1,2,14]. The case of inference in probabilistic programs is diﬀeren t: the user of a probabilistic program is interested in fast conv ergence of the program output { z z z t } rather than of the v ector of the program’s random v ariables { x x x t } . In adaptive MCMC v ariants the acceptance rate can b e eﬃcien tly used as the optimization ob jective [14]. Ho wev er, for con vergence of the output sequence an accepted trace that pro duces the same output is indistinguishable from a 6 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od rejected trace. Additionally , while optimal v alues of the acceptance rate [1,14] can b e used to tune parameters in adaptive MCMC, in Adaptiv e LMH w e do not c hange the parameters of prop osal distributions of individual v ariables, and assume that they are ﬁxed. How ever, proposing a new v alue for a random v ariable ma y or may not c hange the output ev en if the new trace is accepted. By c hanging v ariable selection probabilities w e attempt to maximize the change in the output sequence so that it con verges faster. In the p edagogical example x 1 ∼ Bernoulli(0 . 5) , x 2 ∼ N ( x 1 , 1) , z 1 ← ( x 1 , x 2 ) , selecting the Bernoulli random choice for mo diﬁcation changes the output only when a diﬀeren t v alue is sampled, while selecting the normal random choice will c hange the output almost alwa ys. Based on these considerations, w e quantif y the inﬂuence of sampling on the output sequence by measuring the c hange in the output z z z of the probabilistic program. Since probabilistic programs ma y produce output of an y t ype, w e c hose to discern b et ween identical and diﬀeren t outputs only , rather than to quantify the distance by introducing a type-dep enden t norm. In addition, when | z z z | > 1, w e quantify the diﬀerence by the fraction of comp onen ts of z z z with changed v alues. F ormally , let { z z z t } ∞ 1 = { z z z 1 , . . . , z z z t − 1 , z z z t , . . . } b e the output sequence of a probabilistic program. Then the inﬂuenc e of a choice that pro duced z z z t is deﬁned b y the total rew ard R t , computed as normalized Hamming distance b et ween the outputs R t = 1 | z z z t | | z z z t | X i = k 1 1 ( z t k 6 = z t − 1 k ) . (7) In the case of a scalar z z z , the rew ard is 1 when outputs of subsequent samples are diﬀerent, 0 otherwise. The reward is used to adjust the v ariable selection probabilities for the sub- sequen t steps of Adaptive LMH b y computing W W W t +1 (line 8 of Algorithm 2). It ma y seem suﬃcient to assign the reward to the last choice and use a v erage c hoice rewards as their weigh ts, but this approac h will not w ork for Adaptiv e LMH. Consider the generative mo del x 1 ∼N (1 , 10) , x 2 ∼ N ( x 1 , 1) , y 1 ∼N ( x 2 , 1) , z 1 ← x 1 , where we observ e the v alue y 1 = 2. Mo difying x 2 ma y result in an accepted trace, but the v alue of z 1 = x 1 , predicted by the program, will remain the same as in the previous trace. Only when x 1 is also mo diﬁed, and a new trace with the updated v alues for both x 1 and x 2 is accepted, the earlier change in x 2 is indirectly reﬂected in the output of the program. In the next section, we discuss propagation of rew ards to v ariable selection probabilities in detail. Output-Sensitiv e Adaptive Metrop olis-Hastings for Probabilistic Programs 7 3.2 Propagating Rew ards to V ariables Both LMH and Adaptiv e LMH mo dify a single v ariable p er trace, and either re-use or recompute the probabilities of v alues of all other v ariables (except those absent from the previous trace or ha ving an incompatible distribution, for whic h new v alues are also sampled). Due to this up dating sc heme, the inﬂuence of mo difying a v ariable on the output can b e delay ed by sev eral iterations. W e prop ose the following propagation sc heme: for eac h random v ariable x i , the re- w ard r i and count c i are kept in a data structure used to compute W W W . The set of v ariables selected for modiﬁcation, called here the history , is main tained for eac h comp onen t z k of output z z z . When the v alue of z k c hanges, the rew ard is distributed betw een all of the v ariables in the history , and the history is emp- tied. When z k do es not c hange, the selected v ariable is p enalized b y zero rew ard. This scheme, for the case of scalar output for simplicity , is shown in Algorithm 3 whic h expands line 8 of Algorithm 2. When z z z has multiple components, histories for each comp onen t are maintained independently . Algorithm 3 Propagating Rewards to V ariables 1: App end x k to the history of v ariables selected for mo diﬁcation. 2: if z z z t +1 6 = z z z t then 3: w ← 1 | history | 4: for x m in history do 5: r m ← r m + w, c m ← c m + w 6: end for 7: Flush the history . 8: else 9: c k ← c k + 1 10: end if Rew arding all of the v ariables in the history ensures that while v ariables whic h cause changes in the output more often get a greater reward, v ariables with low er inﬂuence on the output are still selected for mo diﬁcation suﬃciently often. This, in turn, ensures ergo dicity of sampling sequence, and helps establish conditions for conv ergence to the target distribution, as we discuss in Section 4. Let us show that under certain assumptions the prop osed reward propaga- tion sc heme has a non-degenerate equilibrium for v ariable selection probabilities. Indeed, assume that for a program with t wo v ariables, x 1 , and x 2 , pr ob ability matching , or selecting a choice with the probabilit y prop ortional to the unit rew ard ρ i = r i c i , is used to compute the weigh ts, that is, W i = ρ i . Then, the follo wing lemma holds: Lemma 1 Assume that for variables x i , wher e i ∈ { 1 , 2 } : – α i is the sele ction pr ob ability; – β i is the pr ob ability that the new tr ac e is ac c epte d given that the variable was sele cte d for mo diﬁc ation; 8 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od – γ i is the pr ob ability that the output change d given that the tr ac e was ac c epte d. Assume further that α i , β i , and γ i ar e c onstant. Then for the c ase γ 1 = 1 , γ 2 = 0 : 0 < α 2 α 1 ≤ 1 3 (8) Pr o of. W e shall pro of the lemma in three steps. First, w e shall analyse a sequence of samples betw een tw o subsequen t arriv als of x 1 . Then, w e shall deriv e a form ula for the exp ected unit reward of x 2 . Finally , w e shall b ound the ratio α 2 α 1 . Consider a sequence of k samples, for some k , b et ween t wo subsequen t arriv als of x 1 , including the sample corresp onding to the second arriv al of x 1 . Since a new v alue of x 1 alw ays ( γ 1 = 1) and x 2 nev er ( γ 2 = 0) causes a change in the output, at the end of the sequence the history will contain k occurrences of x 2 . Let us denote b y ∆r i , ∆c i the increase of rew ard r i and coun t c i b et w een the b eginning and the end of the sequence. Noting that x 2 is p enalized eac h time it is added to the history (line 9 of Algorithm 3), and k o ccurrences of x 2 are rew arded when x 1 is added to the history (line 5 of Algorithm 3), w e obtain ∆r 1 = 1 k + 1 , ∆c 1 = 1 k + 1 ∆r 2 = k k + 1 , ∆c 2 = k + k k + 1 (9) Consider now a sequence of M such sequences. When M → ∞ , r iM c iM approac hes the exp ected unit reward ρ i , where r iM and c iM are the rew ard and the coun t of x i at the end of the sequence. ρ i = lim M →∞ r iM c iM = lim M →∞ r iM M c iM M = lim M →∞ P M m =1 ∆r im M P M m =1 ∆c im M = ∆r i ∆c i (10) Eac h v ariable x i is selected randomly and indep enden tly and pro duces an accepted trace with probability p i = α i β i α 1 β 1 + α 2 β 2 . (11) Acceptances of x 1 form a Poisson process with rate 1 p 1 = α 1 β 1 + α 2 β 2 α 1 β 1 . k is dis- tributed according to the geometric distribution with probabilit y p 1 , Pr[ k ] = (1 − p 1 ) k p 1 . Since ∆r 1 = ∆c 1 for any k , the expected unit reward ρ 1 of x 1 is 1. W e shall substitute ∆r i and ∆c i in to (10) to obtain the exp ected unit reward ρ 2 of x 2 : ∆r 2 = ∞ X k =0 k k + 1 (1 − p 1 ) k p 1 ∆c 2 = ∞ X k =0  k + k k + 1  (1 − p 1 ) k p 1 = 1 − p 1 p 1 | {z } k + ∞ X k =0 k k + 1 (1 − p 1 ) k p 1 (12) Output-Sensitiv e Adaptive Metrop olis-Hastings for Probabilistic Programs 9 ρ 2 = ∆r 2 ∆c 2 = ∞ P k =0 k k +1 (1 − p 1 ) k p 1 1 − p 1 p 1 + ∞ P k =0 k k +1 (1 − p 1 ) k p 1 = 1 − A z }| { ∞ X k =0 1 k + 1 (1 − p 1 ) k p 1 1 p 1 − ∞ X k =0 1 k + 1 (1 − p 1 ) k p 1 | {z } A (13) F or probability matching, selection probabilities are prop ortional to exp ected unit rewards: α 2 α 1 = ρ 2 ρ 1 (14) T o prov e the inequalit y , we shall deriv e a closed-form representation for ρ 2 , and analyse solutions of (14) for α 2 α 1 . W e shall eliminate the summation A in (13): A = ∞ X k =0 k k + 1 (1 − p 1 ) k p 1 = p 1 1 − p 1 ∞ X k =0 1 k + 1 (1 − p 1 ) k +1 = p 1 1 − p 1 ∞ X k =0 Z 1 p 1 (1 − ξ ) k dξ = p 1 1 − p 1 Z 1 p 1 ∞ X k =0 (1 − ξ ) k dξ = − p 1 1 − p 1 log p 1 (15) By substituting A into (13), and then ρ 1 and ρ 2 in to (14), we obtain α 2 α 1 = ρ 2 ρ 1 = ρ 2 = 1 + p 1 log p 1 1 − p 1 1 p 1 + p 1 log p 1 1 − p 1 ) B ( p 1 ) (16) The righ t-hand side B ( p 1 ) of (16) is a monotonic function for p 1 ∈ [0 , 1], and B (0) = 0, B (1) = 1 3 (see App endix for the analysis of B ( p 1 )). According to (11), α 2 α 1 = 0 implies p 1 = 1, hence α 2 α 1 6 = 0, and 0 < α 2 α 1 ≤ 1 3 . u t By noting that an y subset of v ariables in a probabilistic program can b e considered a single random v ariable drawn from a m ulti-dimensional distribution, Lemma 1 is generalized to any num ber of v ariables b y Corollary 1: Corollary 1 F or any p artitioning of the set x x x of r andom variables of a pr ob- abilistic pr o gr am, A dLMH with weights pr op ortional to exp e cte d unit r ewar ds sele cts variables fr om e ach of the p artitions with non-zer o pr ob ability. T o ensure conv ergence of W W W t to exp ected unit rewards in the stationary distribution, w e use upp er conﬁdence b ounds on unit rew ards to compute the v ariable selection probabilities, an idea whic h w e b orro wed from the UCB family of algorithms for multi-armed bandits [3]. F ollowing UCB1 [3], w e compute the upp er conﬁdence b ound ˆ ρ i as the sum of the unit rew ard and the exploration term ˆ ρ i = ρ i + C s log P | x x x | i =1 c i c i , (17) 10 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od where C is an exploration factor. The default v alue for C is √ 2 in UCB1; in practice, a lo wer v alue of C is preferable. Note that v ariable selection in Adaptiv e LMH is diﬀeren t from arm selection in multi-armed bandits: unlike in bandits, where we wan t to sample the b est arm at an increasing rate, in Adaptive LMH w e expect W W W t to conv erge to an equilibrium in which selection probabilities are prop ortional to exp ected unit rew ards. 4 Con v ergence of Adaptive LMH As adaptiv e MCMC algorithms ma y dep end arbitrarily on the history at eac h step, showing that a given sampler correctly dra ws from the target distribution can b e non-trivial. General conditions under which adaptiv e MCMC sc hemes are still ergo dic, in the sense that the distribution of samples con verges to the target π in total v ariation, are established in [13]. The fundamen tal criteria for v alidit y of an adaptiv e algorithm are diminishing adaptation , which (informally) requires that the amount whic h the transition op erator c hanges each iteration m ust asymptotically decrease to zero; and c ontainment , a technical condition whic h requires that the time un til con v ergence to the target distribution m ust b e b ounded in probability [4]. The class of mo dels representable b y probabilistic programs is very broad, allo wing sp eciﬁcation of completely arbitrary target densities; ho w ever, for man y mo dels the adaptiv e LMH algorithm reduces to an adaptiv e random scan Metro- p olis-within-Gibbs in Algorithm 1. T o discuss when this is the case, w e in v oke the concept of structur al v ersus structur e-pr eserving random c hoices [17]. Crucially , a structur e-pr eserving random c hoice x k do es not aﬀect the existence of other x m in the trace. Supp ose w e w ere to restrict the expressiveness of our language to admit only programs with no structural random c hoices: in such a language, the LMH al- gorithm in Algorithm 2 reduces to the adaptive comp onen twise MH algorithm. Conditions under which such an adaptiv e algorithm is ergo dic hav e b een estab- lished explicitly in [8, Theorems 4.10 and 5.5]. Giv en suitable assumptions on the target density deﬁned by the program, it is necessary for the probabilit y v ector || α t − α t − 1 || → 0, and that for any particular comp onent k we ha ve probability α t k >  > 0. Both of these are satisﬁed by our approach: from Corollary 1, we ensure that the unit rew ard across each x i con verges to a p ositiv e ﬁxed p oin t. While any theoretical result will require language restrictions such that pro- grams only induce distributions satisfying regularit y conditions, we conjecture that this sc heme is broadly applicable across most non-pathological programs. W e leav e a precise theoretical analysis of the space of probabilistic programs in whic h adaptive MCMC schemes (with inﬁnite adaptation) may b e ergo dic to future w ork. Empirical ev aluation presented in the next section demonstrates practical conv ergence of Adaptive LMH on a range of inference examples, in- cluding programs con taining structural random c hoices. Output-Sensitiv e Adaptive Metrop olis-Hastings for Probabilistic Programs 11 5 Empirical Ev aluation W e ev aluated Adaptiv e LMH on many probabilistic programs and observ ed con- sisten t improv emen t of conv ergence rate compared to LMH. W e also veriﬁed on a num b er of tests that the algorithm conv erges to the correct distribution obtained by indep enden t exact methods. In this section, w e compare Adaptiv e LMH to LMH on several represen tative examples of probabilistic programs. The rates in the comparisons are presen ted with resp ect to the n umber of samples, or sim ulations, of the probabilistic programs. The additional computation required for adaptation takes negligible time, and the computational eﬀort per sample is appro ximately the same for all algorithms. Our implementation of the inference engine is a v ailable at https://bitbucket.org/dtolpin/embang . In the following case studies diﬀerences b et ween program outputs and target distributions are presen ted using KullBack-Leibler (KL) div ergence, Kolmogorov- Smirno v (KS) distance, or L2 distance, as appropriate. In cases where target distributions cannot b e up dated exactly , they w ere approximated by running a non-adaptiv e inference algorithm for a long enough time and with a suﬃcient n umber of restarts. In each of the ev aluations, all of the algorithms w ere run with 25 random restarts and 500 000 simulations of the probabilistic program p er restart. The diﬀerence plots use the logarithmic scale for both axes. In the plots, the solid lines corresp ond to the median, and the dashed lines to 25% and 75% p ercen tiles, taken o ver all runs of the corresp onding inference algorithm. The exploration factor for computing upp er conﬁdence b ounds on unit rewards (Equation 17) w as ﬁxed at C = 0 . 5 for all tests and ev aluations. The ﬁrst example is a laten t state inference problem in an HMM with three states, one-dimensional normal observ ations (0.9, 0.8, 0.7, 0, -0.025, 5, 2, 0.1, 0, 0.13, 0.45, 6, 0.2, 0.3, -1, -1) with v ariance 1.0, a known transition matrix, and kno wn initial state distribution. There are 18 distinct random choices in all traces of the program, and the 0th and the 17th state of the mo del are predicted. The results of ev aluation are shown in Figure 1 as KL divergences b etw een the inference output and the ground truth obtained using the forw ard-backw ard algorithm. In addition, bar plots of unit reward and sample coun t distributions among random choices in Adaptiv e LMH are sho wn for 1000, 10 000, and 100 000 samples. As can b e seen in the plots, Adaptive LMH (blac k) exhibits faster conv ergence o ver the whole range of ev aluation, requiring half as many samples as LMH (cy an) to achiev e the same approximation, with the median of LMH ab o ve the 75% quantile of Adaptive LMH. In addition, the bar plots show unit rewards and sample counts for diﬀerent random choices, providing an insight on the adaptiv e b eha vior of AdLMH. On the left-hand bar plots, red bars are normalized unit rew ards, and blue bars are normalized sample counts. On the righ t-hand bar plots, the total heigh t of a bar is the total sample coun t, with green section corresp onding to the accepted, and y ellow to the rejected samples. A t 1 000 samples, the unit rewards hav e not yet con verged, and exploration supersedes exploitation: random choices with low er acceptance rate are selected more often (choices 7, 8 and 13 corresp onding to 12 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od 5e+04 1e+05 2e+05 5e+05 1e−04 5e−04 2e−03 samples KL divergence LMH AdLMH 1 3 5 7 9 11 13 15 17 0.00 0.06 0.12 rewards selected 1 3 5 7 9 11 13 15 17 0 40 80 accepted rejected 1000 samples 1 3 5 7 9 11 13 15 17 0.00 0.06 1 3 5 7 9 11 13 15 17 0 400 10000 samples 1 3 5 7 9 11 13 15 17 0.00 0.06 0.12 1 3 5 7 9 11 13 15 17 0 4000 8000 100000 samples Fig. 1. HMM, predicting the 0th and 17th state states 6, 7 and 12). A t 10 000 samples, the unit rew ards b ecome close to their ﬁnal v alues, and choices 1 and 18, immediately aﬀecting the predicted states, are selected more often. At 100 000 samples, the unit rewards con verge, and the sample coun ts corresp ond closely to the equilibrium state outlined in Lemma 1. The second case study is estimation of h yp erparameters of a Gaussian Pro- cess. W e deﬁne a Gaussian Pro cess of the form f ∼G P ( m, k ) , where m ( x ) = ax 2 + bx + c, k ( x, x ) = de ( x − x 0 ) 2 2 g . The pro cess has ﬁve hyperparameters, a, b, c, d, g . The program infers the p oste- rior v alues of the h yp erparameters by maximizing marginal likelihoo d of 6 ob- serv ations (0 . 0 , 0 . 5), (1 . 0 , 0 . 4), (2 . 0 , 0 . 2), (3 . 0 , − 0 . 05), (4 . 0 , − 0 . 2), and (5 . 0 , 0 . 1). P arameters a, b, c of the mean function are predicted. Maxim um of KS distances b et w een inferred distributions of eac h of the predicted parameters and an ap- pro ximation of the target distributions is sho wn in Figure 2. The approximation w as obtained b y running LMH with 2 000 000 samples p er restart and 50 restarts, and then taking eac h 100th sample from the last 10 000 samples of eac h restart, 5000 samples total. Just as for the previous case study , bar plots of unit rewards and sample coun ts are shown for 1000, 10 000, and 100 000 samples. Here as w ell, Adaptive LMH (black) con v erges faster ov er the whole range of ev aluation, outperforming LMH by a factor 2 o v er the ﬁrst 50 000 samples. Bar plots of unit rewards and sample counts for diﬀerent num ber of c hoices, again, show the dynamics of sample allocation among random choices. Choices a , b , and c are predicted, while c hoices d and g are required for inference but not predicted. Choice a has the low est acceptance rate (ratio b etw een the total heigh t of the bar and the green part on the righ t-hand bar plot), but the unit rew ard is close the unit reward of c hoices b and c . At 1 000 samples, c hoice a is Output-Sensitiv e Adaptive Metrop olis-Hastings for Probabilistic Programs 13 5e+03 2e+04 5e+04 2e+05 5e+05 0.1 0.2 0.5 1.0 samples KS distance LMH AdLMH a b c d g 0.00 0.15 0.30 rewards selected a b c d g 0 100 250 accepted rejected 1000 samples a b c d g 0.00 0.15 0.30 a b c d g 0 1000 2500 10000 samples a b c d g 0.00 0.15 a b c d g 0 10000 25000 100000 samples Fig. 2. Gaussian process hyperparameter estimation selected with the highest probabilit y . Ho w ever, close to the con v erged state, at 100 000 samples, choices a , b , and c are selected with similar probabilities. At the same time, c hoices 4 and 5 are selected with a lo wer probability . Both the exploration-exploitation dynamics for c hoices a – b and probability matc hing of selection probabilities among all c hoices secure improv ed conv ergence. The third case study in volv es a larger amount of data observed during each sim ulation of a probabilistic program. W e use the well-kno wn Iris dataset [9] to ﬁt a mo del of classifying a given ﬂow er as of the species Iris setosa, as opp osite to either Iris virginica or Iris v ersicolor. Eac h record in the dataset corresponds to an observ ation. F or each observ ation, we deﬁne a feature vector x and an indicator v ariable z i , which is 1 if and only if the observ ation is of an Iris setosa. W e ﬁt the mo del with ﬁv e regression co eﬃcien ts β 1 , . . . , β 5 , deﬁned as σ 2 ∼ In vGamma(1 , 1) , β j ∼ Normal(0 , σ ) , p ( z i = 1) = 1 1 + e − β T x . T o assess the con v ergence, w e perform sh uﬄe split leav e-2-out cross v alidation on the dataset, selecting one instance b elonging to the sp ecies Iris setosa and one b elonging to a diﬀeren t sp ecies for eac h run of the inference algorithm. The classiﬁcation error is shown in Figure 3 ov er 100 runs of LMH and Adaptive LMH. The results are consistent with other case studies: Adaptiv e LMH exhibits a faster con vergence rate, requiring half as many samples to ac hiev e the same classiﬁcation accuracy as LMH. 14 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od 5e+03 1e+04 2e+04 5e+04 1e+05 2e+05 5e+05 0.03 0.04 0.05 0.07 samples Classification error LMH AdLMH Fig. 3. Logistic regression on Iris dataset. 0.05 0.10 0.15 0.20 0.25 0.00 0.10 0.20 0.30 LMH q ω 0.05 0.10 0.15 0.20 0.25 0.00 0.10 0.20 0.30 AdLMH q ω Fig. 4. Kalman ﬁlter, 500 samples after 10 000 samples of burn-in. As a ﬁnal case study w e consider a linear dynamical system (i.e. a Kalman smo othing problem) that w as previously describ ed in [12] x t ∼ Norm( A · x t − 1 , Q ) , y t ∼ Norm( C · x t , R ) . In this problem w e assume that 16-dimensional observ ations y t are conditioned on 2-dimensional latent states z t . W e imp ose additional structure by assuming that the transition matrix A is a simple rotation with angular velocity ω , whereas the transition co v ariance Q is a diagonal matrix with co eﬃcient q , A =  cos ω − sin ω sin ω cos ω  , Q =  q 0 0 q  . W e predict p osterior v alues for ω , and q in a setting where C and R are as- sumed known, under mildly informativ e priors ω ∼ Gamma(10 , 2 . 5) and q ∼ Gamma(10 , 100). Posterior inference is p erformed conditioned on a sim ulated Output-Sensitiv e Adaptive Metrop olis-Hastings for Probabilistic Programs 15 sequence y 1: T of T = 100 observ ations, with ω ∗ = 4 π /T , and q ∗ = 0 . 1. The observ ation matrix C and co v ariance R are sampled ro w-wise from symmetric Diric hlet distributions with parameters c = 0 . 1, and r = 0 . 01 resp ectiv ely . Figure 4 sho ws a qualitativ e ev aluation the mixing rate in the form of 500 consecutiv e samples ( ω , q ) from an LMH and AdLMH c hain after 10 000 samples of burn-in. The LMH sequence exhibits goo d mixing ov er ω but is strongly correlated in q , whereas the AdLMH sequence obtains a muc h better co verage of the space. T o summarize, Adaptiv e LMH consistently attained faster con vergence than LMH, measured by diﬀerences b etw een the ongoing output distribution of the random program and the target indep enden tly obtained distribution, assessed using v arious metrics. V ariable selection probabilities computed b y Adaptiv e LMH are dynamically adapted during the inference, com bining exploration of the model represented b y the probabilistic program and exploitation of inﬂuence of random v ariables on program output. 6 Con tribution and F uture W ork In this pap er we in tro duced a new algorithm, Adaptiv e LMH, for appro ximate inference in probabilistic programs. This algorithm adjusts sampling parame- ters based on the output of the probabilistic program in which the inference is p erformed. Con tributions of the pap er include – A scheme of rewarding random c hoice based on program output. – An approac h to propagation of c hoice rewards to MH proposal scheduling parameters. – An application of this approach to LMH, where the probabilities of selecting eac h v ariable for mo diﬁcation are adjusted. Adaptiv e LMH w as compared to LMH, its non-adaptive counterpart, and was found to consistently outperform LMH on several probabilistic programs, while still being almost as easy to implemen t. The time cost of additional computation due to adaptation was negligible. Although presented in the context of a particular sampling algorithm, the adaptation approach can be extended to other sampling metho ds. W e b eliev e that v arious sampling algorithms for probabilistic programming can b eneﬁt from output-sensitiv e adaptation. Additional p oten tial for impro vemen t lies in ac- quisition of dependencies betw een predicted expressions and random v ariables. Exploring alternative approaches for guiding exploration-exploitation compro- mise, in particular, based on Ba y esian inference, is another promising researc h direction. Ov erall, output-sensitiv e appro ximate inference app ears to bring clear adv an- tages and should be further explored in the con text of probabilistic programming mo dels and algorithms. 16 Da vid T olpin, Jan Willem v an de Meent, Bro oks Paige, and F rank W o od 7 Ac kno wledgments This work is supp orted under DARP A PP AML through the U.S. AFRL under Co operative Agreemen t n umber F A8750-14-2-0004. References 1. Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Statistics and Computing 18(4), 343–373 (2008) 2. A tchad ´ e, Y., F ort, G., Moulines, E., Priouret, P .: Adaptive Marko v c hain Monte Carlo: theory and metho ds. In: Barber, D., Cemgil, A.T., Chiappa, S. (eds.) Ba yesian Time Series Mo dels, pp. 32–51. Cam bridge Universit y Press (2011), cam- bridge Bo oks Online 3. Auer, P ., Cesa-Bianchi, N., Fischer, P .: Finite-time analysis of the Multiarmed Bandit problem. Machine Learning 47(2–3), 235–256 (2002) 4. Bai, Y., Roberts, G.O., Rosen thal, J.S.: On the con tainment condition for adaptiv e Mark ov chain Mon te Carlo algorithms. Adv ances and Applications in Statistics 21(1), 1–54 (2011) 5. Gamerman, D., Lop es, H.F.: Marko v Chain Monte Carlo: Stochastic Simulation for Bay esian Inference. Chapman and Hall/CRC (2006) 6. Go odman, N.D., Mansinghk a, V.K., Roy , D.M., Bonawitz, K., T enen baum, J.B.: Ch urch: a language for generative models. In: Pro c. of Uncertaint y in Artiﬁcial In telligence (2008) 7. Gordon, A.D., Henzinger, T.A., Nori, A.V., Ra jamani, S.K.: Probabilistic program- ming. In: International Conference on Softw are Engineering (ICSE, F OSE track) (2014) 8. Latuszy ´ nski, K., Rob erts, G.O., Rosenthal, J.S.: Adaptiv e Gibbs samplers and related MCMC metho ds. Annals of Applied Probabilit y 23(1), 66–98 (2013) 9. Lauritzen, S.: Graphical Mo dels. Clarendon Press (1996) 10. Levine, R.A., Y u, Z., Hanley , W.G., Nitao, J.J.: Implementing component wise hast- ings algorithms. Computational Stastistics & Data Analysis 48(2), 363–389 (2005) 11. Mansinghk a, V.K., Selsam, D., Pero v, Y.N.: V enture: a higher-order probabilistic programming platform with programmable inference. CoRR abs/1404.0099 (2014) 12. v an de Meen t, J.W., Y ang, H., Mansinghk a, V., W o o d, F.: P article Gibbs with An- cestor Sampling for Probabilistic Programs. In: Artiﬁcial In telligence and Statistics (2015), 13. Rob erts, G.O., Rosen thal, J.S.: Coupling and ergo dicit y of adaptiv e MCMC. Jour- nal of Applied Probability 44, 458–475 (2007) 14. Rob erts, G.O., Rosen thal, J.S.: Examples of adaptiv e MCMC. Journal of Compu- tational and Graphical Statistics 18(2), 349–367 (2009) 15. Wingate, D., Stuhlm ¨ uller, A., Go odman, N.D.: Light weigh t implemen tations of probabilistic programming languages via transformational compilation. In: Pro c. of the 14th Artiﬁcial Intellig ence and Statistics (2011) 16. W o od, F., v an de Meent, J.W., Mansinghk a, V.: A new approac h to probabilistic programming inference. In: Artiﬁcial In telligence and Statistics (2014) 17. Y ang, L., Hanrahan, P ., Go odman, N.D.: Generating eﬃcient MCMC kernels from probabilistic programs. In: Pro ceedings of the Seven teenth International Confer- ence on Artiﬁcial Intelligence and Statistics. pp. 1068–1076 (2014)

Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment