Fault-tolerant Algorithms for Tick-Generation in Asynchronous Logic: Robust Pulse Generation


Authors: Danny Dolev, Matthias Függer, Christoph Lenzen, and Ulrich Schmid

Abstract

Today's hardware technology presents a new challenge in designing robust systems. Deep submicron VLSI technology introduced transient and permanent faults that were never considered in low-level system designs in the past. Still, robustness of that part of the system is crucial and needs to be guaranteed for any successful product. Distributed systems, on the other hand, have been dealing with similar issues for decades. However, neither the basic abstractions nor the complexity of contemporary fault-tolerant distributed algorithms match the peculiarities of hardware implementations. This paper is intended to be part of an attempt to overcome this gap between theory and practice for the clock synchronization problem. Solving this task sufficiently well will make it possible to build a very robust high-precision clocking system for hardware designs like systems-on-chip in critical applications. As our first building block, we describe and prove correct a novel Byzantine fault-tolerant self-stabilizing pulse synchronization protocol, which can be implemented using standard asynchronous digital logic. Despite the strict limitations introduced by hardware designs, it offers optimal resilience and smaller complexity than all existing protocols.

1 Introduction & Related Work

With today's deep submicron technology running at GHz clock speeds [20], disseminating the high-speed clock throughout a very large scale integrated (VLSI) circuit with negligible skew is difficult and costly [2, 3, 12, 24, 29]. Systems-on-chip are hence increasingly designed globally asynchronous locally synchronous (GALS) [4], where different parts of the chip use different local clock signals.
Two main types of clocking schemes for GALS systems exist, namely, (i) those where the local clock signals are unrelated, and (ii) multi-synchronous ones that provide a certain degree of synchrony between local clock signals [30, 34]. GALS systems clocked by type (i) permanently bear the risk of metastable upsets when conveying information from one clock domain to another. To explain the issue, consider a physical implementation of a bistable storage element, like a register cell, which can be accessed by read and write operations concurrently. It can be shown that two operations (like two writes with different values) occurring very closely to each other can cause the storage cell to attain neither of its two stable states for an unbounded time [23], and thereby, during an unbounded time afterwards, successive reads may return none of the stable states. Although the probability of a single upset is very small, one has to take into account that every bit of information transmitted across clock domains is a candidate for an upset. Elaborate synchronizers [8, 21, 28] are the only means for achieving an acceptably low probability of metastable upsets here. This problem can be circumvented in clocking schemes of type (ii). Common synchrony properties offered by multi-synchronous clocking systems are:

• bounded skew, i.e., a bounded maximum time between the occurrence of any two matching clock transitions of any two local clock signals. In classic clock synchronization, two clock transitions are matching iff they are both the k-th, k ≥ 1, clock transition of their respective local clocks.

• bounded accuracy, i.e., bounded minimum and maximum times between the occurrence of any two successive clock transitions of any local clock signal.
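The two synchrony properties above amount to simple checks over recorded clock-transition times. The following is a minimal sketch (function and variable names are ours, not the paper's), where `ticks[i][k]` is the reference time of the k-th transition of local clock i:

```python
# Hedged sketch of the two multi-synchronous properties. The k-th transitions
# of two clocks are the "matching" ones in the sense of classic clock
# synchronization.

def bounded_skew(ticks, sigma):
    """True iff matching transitions of any two clocks are at most sigma apart."""
    n_matching = min(len(t) for t in ticks)
    return all(
        abs(ticks[i][k] - ticks[j][k]) <= sigma
        for k in range(n_matching)
        for i in range(len(ticks))
        for j in range(i + 1, len(ticks))
    )

def bounded_accuracy(ticks, t_min, t_max):
    """True iff successive transitions of each clock are t_min to t_max apart."""
    return all(
        t_min <= t[k + 1] - t[k] <= t_max
        for t in ticks
        for k in range(len(t) - 1)
    )
```

For example, two clocks ticking once per time unit with a constant offset of 0.1 satisfy skew bound 0.2 and accuracy bounds [0.9, 1.1].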
Type (ii) clocking schemes are particularly beneficial from a designer's point of view, since they combine the convenient local synchrony of a GALS system with a global time base across the whole chip. It has been shown in [27] that these properties indeed facilitate metastability-free high-speed communication across clock domains. The decreasing structure sizes of deep submicron technology have also resulted in an increased likelihood of chip components failing during operation: reduced voltage swing and smaller critical charges make circuits more susceptible to ionized particle hits, crosstalk, and electromagnetic interference [5, 18]. Fault-tolerance hence becomes an increasingly pressing issue in chip design. Unfortunately, faulty components may behave in non-benign ways: they may perform signal transitions at arbitrary times and even convey inconsistent information to their successor components if their outgoing communication channels are affected by a failure. This forces one to model faulty components as unrestricted, i.e., Byzantine, if a high fault coverage is to be guaranteed. The DARTS fault-tolerant clock generation approach [15, 17], developed by some of the authors of this paper, is a Byzantine fault-tolerant multi-synchronous clocking scheme. DARTS comprises a set of modules, each of which generates a local clock signal for a single clock domain. The DARTS modules (nodes) are synchronized to each other to within a few clock cycles. This is achieved by exchanging binary clock signals only, via single wires. The basic idea behind DARTS is to employ a simple fault-tolerant distributed algorithm [35], based on Srikanth & Toueg's consistent broadcasting primitive [31], implemented in asynchronous digital logic. An important property of the DARTS clocking scheme is that it guarantees that no metastable upsets occur during fault-free executions.
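For intuition, the threshold logic underlying consistent broadcasting can be sketched in a few lines. This is a common textbook formulation of the thresholds only (all names are ours), not the DARTS gate-level design: with n > 3f nodes, support from f+1 nodes guarantees at least one correct supporter, and support from 2f+1 nodes guarantees that every correct node will eventually see f+1 supporters.

```python
# Hedged sketch of the f+1 / 2f+1 thresholds behind consistent broadcasting
# (illustrative only; DARTS realizes such thresholds in asynchronous logic).

def threshold_step(support, f, relayed, accepted):
    """support[k]: number of distinct nodes observed supporting tick k."""
    for k, count in support.items():
        if count >= f + 1:
            relayed.add(k)    # backed by at least one correct node: safe to relay
        if count >= 2 * f + 1:
            accepted.add(k)   # every correct node will also reach the relay threshold
    return relayed, accepted
```

With f = 1, a tick supported by 2 nodes is relayed but not yet accepted, while a tick supported by 4 nodes is both relayed and accepted.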
For executions with faults, metastable upsets cannot be ruled out: since Byzantine faulty components are allowed to issue unrelated read and write accesses by definition, the same arguments as for clocking schemes of type (i) apply. However, in [13], it was shown that by proper chip design the probability of a Byzantine component leading to a metastable upset of DARTS can be made arbitrarily small. Although both theoretical analysis and experimental evaluation revealed many attractive additional features of DARTS, like guaranteed startup, automatic adaption to current operating conditions, etc., there is room for improvement. The most obvious drawback of DARTS is its inability to support late joining and restarting of nodes and, more generally, its lack of self-stabilization properties. If, for some reason, more than a third of the DARTS nodes ever become faulty, the system cannot be guaranteed to resume normal operation even if all failures cease. Even worse, simple transient faults such as radiation- or crosstalk-induced additional (or omitted) clock ticks accumulate over time to arbitrarily large skews in an otherwise benign execution.

Byzantine-tolerant self-stabilization, on the other hand, is the major strength of a number of protocols [1, 6, 9, 19, 22] primarily devised for distributed systems. Of particular interest in the above context is the work on self-stabilizing pulse synchronization, where the purpose is to generate well-separated anonymous pulses that are synchronized at all correct nodes. This facilitates self-stabilizing clock synchronization, as agreement on a time window permits simulating a synchronous protocol in a bounded-delay system. Beyond optimal resilience (i.e., ⌈n/3⌉ − 1, cf. [26]), an attractive feature of these protocols is a small stabilization time [1, 6, 19, 22], which is crucial for applications with stringent availability requirements. In particular, [1] synchronizes clocks in expected constant time in a synchronous system. Given any pulse synchronization protocol stabilizing in a bounded-delay system in expected time T, this implies an expected (T + O(1))-stabilizing clock synchronization protocol. Nonetheless, it remains open whether a convergence time sublinear in the number of nodes n can be achieved: while the classical consensus lower bound of f + 1 rounds for synchronous, deterministic algorithms in a system with f < n/3 faults [11] proves that exact agreement on a clock value requires at least f + 1 ∈ Ω(n) deterministic rounds, one has to face the fact that only approximate agreement on the current time is achievable in a bounded-delay system anyway. However, no non-trivial lower bounds on approximate deterministic synchronization, or on the exact problem with randomization, are known to date. Note that existing synchronization algorithms, in particular those that do not rely on pulse synchronization, have deficiencies rendering them unsuitable in our context. For example, they have exponential convergence time [9], require the relative drift of the nodes' local clocks to be very small [7, 22],¹ provide larger skew only [22], or make use of linear-sized messages [6]. Furthermore, standard models used by the distributed systems community do not account for metastability, with the result that the same is true for the existing solutions. It is hence natural to explore ways of combining and extending the above lines of research. The present paper is the first step towards this goal.

Detailed contributions.
We describe and prove correct the novel FATAL pulse synchronization protocol, which facilitates a direct implementation in standard asynchronous digital logic. It self-stabilizes within O(n) time with probability 1 − 2^{−(n−f)},² in the presence of up to ⌈n/3⌉ − 1 Byzantine faulty nodes, and is metastability-free by construction after stabilization in failure-free runs. While executing the protocol, non-faulty nodes broadcast a constant number of bits in constant time. In terms of distributed message complexity, this implies that stabilization is achieved after broadcasting O(n) messages of size O(1), improving by a factor of Ω(n) on the number of bits transmitted by previous algorithms.³ The protocol can sustain large relative clock drifts of more than 10%, which is crucial if the local clock sources are simple ring oscillators (uncompensated ring oscillators suffer from clock drifts of up to 9% [32]). If the number of faults is not overwhelming, i.e., a majority of at least n − f nodes continues to execute the protocol in an orderly fashion, recovering nodes and late joiners (re)synchronize in constant time. This property is highly desirable in practical systems, in particular in combination with Byzantine fault-tolerance: even if nodes randomly experience transient faults on a regular basis, quick recovery ensures that the mean time until failure of the system as a whole is substantially increased. All this is achieved against a powerful adversary that, at time t, knows the whole history of the system up to time t + ε (where ε > 0 is infinitesimally small) and does not need to choose the set of faulty nodes in advance.

Footnote 1: It is too costly and space-consuming to equip each node with a quartz oscillator. Simple digital oscillators, like inverters with feedback, in turn exhibit drifts of at least several percent, which heavily vary with operating conditions.
Apart from bounded drifts and communication delays, our solution solely requires that receivers can unambiguously identify the sender of a message, a property that arises naturally in hardware designs. We also describe how the pulse synchronization protocol can be implemented using asynchronous digital logic. Moreover, we sketch how the pulse synchronization protocol will be integrated with DARTS clocks to build a high-precision self-stabilizing clocking system for multi-synchronous GALS. The basic idea of our integration is to let the pulse synchronization protocol non-intrusively monitor the operation of DARTS clocks and to recover DARTS clocks that run abnormally. Like the original DARTS, the joint system is metastability-free in failure-free runs after stabilization. During stabilization, the fact that nodes merely undergo a constant number of state transitions in constant time ensures a very small probability of metastable upsets.

2 Model

Our formal framework is tied to the peculiarities of hardware designs, which consist of modules that continuously⁴ compute their output signals based on their input signals. Following [14, 16], we define (the trace of) a signal to be a timed event trace over a finite alphabet S of possible signal states; formally, a signal is a subset σ ⊆ R⁺₀ × S, i.e., σ ⊆ S × R⁺₀. All times and time intervals refer to a global reference time taken from R⁺₀, that is, signals describe the system's behaviour from time 0 on. The elements of σ are called events, and for each event (s, t) we call s the state and t the time of event (s, t). In general, a signal σ is required to fulfill the following conditions: (i) for each time interval [t⁻, t⁺] ⊆ R⁺₀ of finite length, the number of events in σ with times within [t⁻, t⁺] is finite; (ii) (s, t) ∈ σ and (s′, t) ∈ σ imply s = s′; and (iii) there exists an event at time 0 in σ.
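The signal definition above can be mirrored directly in code. The sketch below (our naming, not the paper's) models a signal as a set of (state, time) events, enforces conditions (ii) and (iii), and also anticipates the state function σ(t) and the removal of idempotent events discussed next:

```python
# Minimal model of a signal as a timed event trace. An event is a (state, time)
# pair; condition (ii) forbids two different states at the same time, and
# condition (iii) demands an event at time 0.

class Signal:
    def __init__(self, events):
        self.events = sorted(set(events), key=lambda e: e[1])
        assert any(t == 0 for _, t in self.events), "condition (iii): event at time 0"
        times = [t for _, t in self.events]
        assert len(times) == len(set(times)), "condition (ii): one state per time"

    def state(self, t):
        """sigma(t): state of the event with maximum time not greater than t."""
        return max((e for e in self.events if e[1] <= t), key=lambda e: e[1])[0]

    def canonical(self):
        """The equivalent signal of [sigma] without idempotent events."""
        out = []
        for s, t in self.events:
            if not out or out[-1][0] != s:
                out.append((s, t))
        return Signal(out)

    def switches_to(self, s, t):
        """sigma switches to s at t iff (s, t) is in the canonical signal."""
        return (s, t) in self.canonical().events
```

For instance, `Signal([("a", 0.0), ("a", 1.0), ("b", 2.0)])` has state "a" until time 2.0; the event ("a", 1.0) is idempotent, so the signal does not switch at time 1.0.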
Footnote 2: The algorithm from [1], achieving an expected constant stabilization time in a synchronous model, needs to run for Ω(n) rounds to ensure the same probability of stabilization.

Footnote 3: We remark that [22] achieves the same complexity, but considers a much simpler model. In particular, all communication is restricted to broadcasts, i.e., all nodes observe the same behaviour of a given other node, even if it is faulty.

Footnote 4: In sharp contrast to classic distributed computing models, there is no computationally complex discrete zero-time state transition here.

Note that our definition allows for events (s, t) and (s, t′) ∈ σ, where t < t′, without an event (s′, t″) ∈ σ with s′ ≠ s and t < t″ < t′. In this case, we call the event (s, t′) idempotent. Two signals σ and σ′ are equivalent iff they differ in idempotent events only. We identify all signals of an equivalence class, as they describe the same physical signal. Each equivalence class [σ] of signals contains a unique signal σ′ having no idempotent events. We say that signal σ switches to s at time t iff event (s, t) ∈ σ′. The state of signal σ at time t ∈ R⁺₀, denoted by σ(t), is given by the state of the event with the maximum time not greater than t.⁵ Because of (i), (ii), and (iii), σ(t) is well defined for each time t ∈ R⁺₀. Note that σ's state function in fact depends on [σ] only, i.e., we may add or remove idempotent events at will without changing the state function.

Distributed System

On the topmost level of abstraction, we see the system as a set V = {1, ..., n} of physically remote nodes that communicate by means of channels. In the context of a VLSI circuit, "physically remote" actually refers to quite small distances (centimeters or even less).
However, at gigahertz frequencies, a local state transition will not be observed remotely within a time that is negligible compared to clock speeds. We stress this point, since it is crucial that different clocks (and their attached logic) are not too close to each other, as otherwise they might fail due to the same event, such as a particle hit. This would render it pointless to devise a system that is resilient to a certain fraction of the nodes failing. Each node i comprises a number of input ports, namely S_{i,j} for each node j, an output port S_i, and a set of local ports, introduced later on. An execution of the distributed system assigns to each port of each node a signal. For convenience of notation, for any port p, we refer to the signal assigned to port p simply as signal p. We say that node i is in state s at time t iff S_i(t) = s. We further say that node i switches to state s at time t iff signal S_i switches to s at time t. Nodes exchange their states via the channels between them: for each pair of nodes i, j, output port S_i is connected to input port S_{j,i} by a FIFO channel from i to j. Note that this includes a channel from i to i itself. Intuitively, S_i being connected to S_{j,i} by a (non-faulty) channel means that S_{j,i}(·) should mimic S_i(·), however with a slight delay accounting for the time it takes the signal to propagate. In contrast to an asynchronous system, this delay is bounded by the maximum delay d > 0.⁶ Formally, we define: the channel from node i to node j is said to be correct during [t⁻, t⁺] iff there exists a function τ_{i,j}: R⁺₀ → R⁺₀, called the channel's delay function, such that (i) τ_{i,j} is continuous and strictly increasing, (ii) ∀t ∈ [t⁻, t⁺]: 0 ≤ τ_{i,j}(t) − t < d, and (iii) for each t ∈ [t⁻, t⁺], (s, τ_{i,j}(t)) ∈ S_{j,i} ⇔ (s, t) ∈ S_i.
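The channel-correctness condition lends itself to a direct check on finite traces. A minimal sketch (names are ours), verifying the delay function pointwise on the sender's event times only:

```python
# Hedged check of channel correctness: the delay function tau must be strictly
# increasing, delay each event by a nonnegative amount strictly below d, and
# map the sender's event trace one-to-one onto the receiver's trace.

def channel_correct(sender_events, receiver_events, tau, d):
    times = sorted(t for _, t in sender_events)
    # (i) strictly increasing on the observed event times
    increasing = all(tau(a) < tau(b) for a, b in zip(times, times[1:]))
    # (ii) nonnegative delay bounded by d
    bounded = all(0 <= tau(t) - t < d for t in times)
    # (iii) the receiver trace is exactly the delayed sender trace
    mimic = {(s, tau(t)) for s, t in sender_events} == set(receiver_events)
    return increasing and bounded and mimic
```

A constant delay of 0.25 is correct for any bound d > 0.25, but not for d = 0.25, since the delay must stay strictly below d.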
We say that node i observes node j in state s at time t iff S_{i,j}(t) = s.

Clocks and Timeouts

Nodes are never aware of the current reference time, and we also do not require the reference time to resemble Newtonian "real" time. Rather, we allow for physical clocks that run arbitrarily fast or slow, as long as their speeds are close to each other in comparison. One may hence think of the reference time as progressing at the speed of the currently slowest correct clock. In this framework, nodes essentially make use of bounded clocks with bounded drift. Formally, clock rates are within [1, ϑ] (with respect to reference time), where ϑ > 1 is constant and ϑ − 1 is the (maximum) clock drift. A clock C is a continuous, strictly increasing function C: R⁺₀ → R⁺₀ mapping reference time to some local time. Clock C is said to be correct during [t⁻, t⁺] ⊆ R⁺₀ iff for any t, t′ ∈ [t⁻, t⁺] with t < t′ we have t′ − t ≤ C(t′) − C(t) ≤ ϑ(t′ − t). Each node comprises a set of clocks assigned to it, which allow the node to estimate the progress of reference time. Instead of directly accessing the values of their clocks, nodes have access to so-called timeout ports of watchdog timers. A timeout is a triple (T, s, C), where T ∈ R⁺ is a duration, s ∈ S is a state, and C is a clock, say of node i. Each timeout (T, s, C) has a corresponding timeout port Time_{T,s,C}, which is part of node i's local ports. Signal Time_{T,s,C} is Boolean, that is, its possible states are from the set {0, 1}.

Footnote 5: To facilitate intuition, we here slightly abuse notation, as this way σ denotes both a function of time and the signal (trace), which is a subset of S × R⁺₀. Whenever referring to σ, we will talk of the signal, not the state function.

Footnote 6: With respect to O-notation, we normalize d ∈ O(1), as all time bounds simply depend linearly on d.
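A watchdog timeout (T, s, C) can be sketched as follows (all names are ours): it is reset whenever the node switches to state s, and its port reads 1 once the driving clock has advanced by T local time units since the last reset, which is the port-state reading of the correctness conditions formalized next.

```python
# Illustrative sketch of a watchdog timeout. The duration T is measured on the
# local clock C, not on reference time, so a fast clock expires early.

class Timeout:
    def __init__(self, T, clock):
        self.T = T           # duration, in local time units
        self.clock = clock   # clock C: reference time -> local time
        self.reset_at = 0.0  # reference time of the most recent reset

    def reset(self, t):
        self.reset_at = t

    def expired(self, t):
        """Port state at reference time t: 1 iff C(t) - C(t0) >= T."""
        return self.clock(t) - self.clock(self.reset_at) >= self.T

clock = lambda t: 1.5 * t       # a legal clock whenever the drift bound satisfies theta >= 1.5
wd = Timeout(T=3.0, clock=clock)
wd.reset(2.0)                   # the node switched to s at reference time 2.0
```

Here wd expires at reference time 4.0, when exactly 3.0 local time units have passed since the reset.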
We say that timeout (T, s, C) is correct during [t⁻, t⁺] ⊆ R⁺₀ iff clock C is correct during [t⁻, t⁺] and the following holds:

1. For each time t_s ∈ [t⁻, t⁺] when node i switches to state s, there is a time t ∈ [t_s, τ_{i,i}(t_s)] such that (T, s, C) is reset, i.e., (0, t) ∈ Time_{T,s,C}. This is a one-to-one correspondence, i.e., (T, s, C) is not reset at any other times.

2. For a time t ∈ [t⁻, t⁺], denote by t₀ the supremum of all times from [t⁻, t] when (T, s, C) is reset. Then (1, t) ∈ Time_{T,s,C} iff C(t) − C(t₀) = T. Again, this is a one-to-one correspondence.

We say that timeout (T, s, C) expires at time t iff Time_{T,s,C} switches to 1 at time t, and that it is expired at time t iff Time_{T,s,C}(t) = 1. For notational convenience, we will omit the clock C and simply write (T, s) for both the timeout and its signal.

A randomized timeout is a triple (D, s, C), where D is a bounded random distribution on R⁺₀, s ∈ S is a state, and C is a clock. Its corresponding timeout port Time_{D,s,C} behaves very similarly to that of an ordinary timeout, except that whenever it is reset, the local time that passes until it next expires (provided that it is not reset again before that happens) follows the distribution D. Formally, (D, s, C) is correct during [t⁻, t⁺] ⊆ R⁺₀ iff C is correct during [t⁻, t⁺] and the following holds:

1. For each time t_s ∈ [t⁻, t⁺] when node i switches to state s, there is a time t ∈ [t_s, τ_{i,i}(t_s)] such that (D, s, C) is reset, i.e., (0, t) ∈ Time_{D,s,C}. This is a one-to-one correspondence, i.e., (D, s, C) is not reset at any other times.

2. For a time t ∈ [t⁻, t⁺], denote by t₀ the supremum of all times from [t⁻, t] when (D, s, C) is reset, and let µ: R⁺₀ → R⁺₀ denote the density of D.
Then (1, t) ∈ Time_{D,s,C} "with probability µ(C(t) − C(t₀))"; we require that the probability of (1, t) ∈ Time_{D,s,C}, conditional on t₀ and C on [t₀, t] being given, is independent of the system's state at times smaller than t. More precisely, if superscript E identifies variables in execution E and t₀′ is the infimum of all times from (t₀, t⁺] when node i switches to state s, then we demand for any [τ⁻, τ⁺] ⊆ [t₀, t₀′] that

P[∃t′ ∈ [τ⁻, τ⁺]: (1, t′) ∈ Time_{D,s,C} | t₀^E = t₀ ∧ C^E|_{[t₀,t₀′]} = C|_{[t₀,t₀′]}] = ∫_{τ⁻}^{τ⁺} µ(C(τ) − C(t₀)) dτ,

independently of E|_{[0,τ⁻)}. We will apply the same notational conventions to randomized timeouts as we do for regular timeouts.

Note that, strictly speaking, this definition does not induce a random variable describing the time t′ ∈ [t₀, t₀′) satisfying (1, t′) ∈ Time_{D,s,C}. However, for the state of the timeout port, we get the meaningful statement that for any t′ ∈ [t₀, t₀′),

P[Time_{D,s,C} switches to 1 during [t₀, t′]] = ∫_{t₀}^{t′} µ(C(τ) − C(t₀)) dτ.

The reason for phrasing the definition in the above, more cumbersome way is that we want to guarantee that an adversary knowing the full present state of the system and memorizing its whole history cannot reliably predict when the timeout will expire.⁷ We remark that these definitions allow different timeouts to be driven by the same clock, implying that an adversary may derive some information on the state of a randomized timeout before it expires from the node's behaviour, even if it cannot directly access the values of the clock driving the timeout. This is crucial for implementability, as it might be very difficult to guarantee that the behaviour of a dedicated clock driving a randomized timeout is indeed independent of the execution of the algorithm.
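The following sketch (all names are ours) illustrates the essential point that the duration until expiry is drawn from D in local time. Caution: the sketch predraws the expiry at reset time, which is exactly the naive implementation the formal definition rules out, since an adversary able to read the node's memory could then predict the expiry; it only illustrates how the expiry is distributed.

```python
# Hedged sketch of a randomized timeout (D, s, C): on each reset, a fresh
# *local-time* duration is drawn from D, here uniform on [1.0, 2.0].

import random

class RandomizedTimeout:
    def __init__(self, sample_D, clock, rng):
        self.sample_D = sample_D  # draws a local-time duration from D
        self.clock = clock        # clock C: reference time -> local time
        self.rng = rng
        self.reset(0.0)

    def reset(self, t):
        # Deadline in local time; redrawn on every reset.
        self.local_deadline = self.clock(t) + self.sample_D(self.rng)

    def expired(self, t):
        return self.clock(t) >= self.local_deadline

rng = random.Random(42)
rt = RandomizedTimeout(sample_D=lambda r: r.uniform(1.0, 2.0),
                       clock=lambda t: 2.0 * t, rng=rng)
rt.reset(5.0)  # expiry follows after 1.0 to 2.0 further *local* time units
```

With a clock of rate 2.0, the drawn duration of 1.0 to 2.0 local units corresponds to 0.5 to 1.0 reference time units after the reset.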
Memory Flags

Besides timeout and randomized timeout ports, another kind of local port of node i are memory flags. For each state s ∈ S and each node j ∈ V, Mem_{i,j,s} is a local port of node i. It is used to memorize whether node i has observed node j in state s since the last reset of the flag. We say that node i memorizes node j in state s at time t iff Mem_{i,j,s}(t) = 1. Formally, we require that signal Mem_{i,j,s} switches to 1 at time t iff node i observes node j in state s at time t and Mem_{i,j,s} is not already in state 1. The times t when Mem_{i,j,s} is reset, i.e., (0, t) ∈ Mem_{i,j,s}, are specified by node i's state machine, which is introduced next.

State Machine

It remains to specify how nodes switch states and when they reset memory flags. We do this by means of state machines that may attain states from the finite alphabet S. A node's state machine is specified by (i) the set S, (ii) a function tr, called the transition function, from T ⊆ S² to the set of Boolean predicates over the alphabet consisting of expressions "p = s" (used for expressing guards), where p is from the node's input and local ports and s is from the set of possible states of signal p, and (iii) a function re, called the reset function, from T to the power set of the node's memory flags. Intuitively, the transition function specifies the conditions (guards) under which a node switches states, and the reset function determines which memory flags to reset upon the state change. Formally, let P be a predicate on node i's input and local ports. We define "P holds at time t" by structural induction: if P is equal to p = s, where p is one of node i's input and local ports and s is one of the states signal p can attain, then P holds at time t iff p(t) = s. Otherwise, if P is of the form ¬P₁, P₁ ∧ P₂, or P₁ ∨ P₂, we define "P holds at time t" in the straightforward manner.
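A toy instance of such a specification (S, tr, re), with invented state, port, and flag names, can make the guard and reset machinery concrete:

```python
# Toy state machine in the (S, tr, re) format above; every name is invented
# for illustration. A guard atom "p = s" becomes a lookup in a snapshot of the
# node's port states at time t.

S = {"waiting", "ready"}

# tr: transition guards as predicates over the port snapshot.
tr = {
    ("waiting", "ready"): lambda p: p["Time_T_waiting"] == 1 and p["Mem_1_ready"] == 1,
    ("ready", "waiting"): lambda p: p["Time_T_ready"] == 1,
}

# re: memory flags to reset when the transition is taken.
re = {
    ("waiting", "ready"): {"Mem_1_ready"},
    ("ready", "waiting"): set(),
}

def step(state, ports):
    """Return (next_state, flags_to_reset) for the first enabled transition."""
    for (s, s2), guard in tr.items():
        if s == state and s2 != state and guard(ports):
            return s2, re[(s, s2)]
    return state, set()
```

Evaluating `step("waiting", ...)` with both the timeout expired and the flag set takes the transition and requests the flag reset; with the timeout still at 0, the machine stays put.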
We say node i follows its state machine during [t⁻, t⁺] iff the following holds: assume node i observes itself in state s ∈ S at time t ∈ [t⁻, t⁺], i.e., S_{i,i}(t) = s. Then, for each (s, s′) ∈ T, both:

1. Node i switches to state s′ at time t iff tr(s, s′) holds at time t and i is not already in state s′.⁸

2. Node i resets memory flag m at some time in the interval [t, τ_{i,i}(t)] iff m ∈ re(s, s′) and i switches from state s to state s′ at time t. This correspondence is one-to-one.

A node is defined to be non-faulty during [t⁻, t⁺] iff during [t⁻, t⁺] all its timeouts and randomized timeouts are correct and it follows its state machine. If it employs multiple state machines (see below), it needs to follow all of them. In contrast, a faulty node may change states arbitrarily. Note that while a faulty node may be forced to send consistent output state signals to all other nodes if its channels remain correct, there is no way to guarantee that this still holds true if channels are faulty.⁹

Footnote 7: This is a non-trivial property. For instance, nodes could just determine, by drawing from the desired random distribution at time t₀, at which local clock value the timeout shall expire next. This would, however, essentially give away early when the timeout will expire, greatly reducing the power of randomization!

Footnote 8: In case more than one guard tr(s, s′) can be true at the same time, we assume that an arbitrary tie-breaking ordering exists among the transition guards that specifies to which state to switch.

Metastability

In our discrete system model, the effect of metastability is captured by the lacking capability of state machines to instantaneously take on new states: node i decides on state transitions based on the delayed status of port S_{i,i} instead of its "true" current state S_i.
This non-zero delay from S_i to S_{i,i} bears the potential for metastability, as a successful state transition can only be guaranteed if, after a transition guard from some state s to some state s′ becomes true, all other transition guards from s to s″ ≠ s′ remain false at least for the duration of this delay. This is exemplified in the following scenario: assume node i is in state s at some time t. However, since it switched to s only very recently, it still observes itself in state s′ ≠ s at time t via S_{i,i}. Given that there is a transition (s′, s″) in T, s″ ≠ s, whose condition is fulfilled at time t, it will switch to state s″ at time t (although state s has not even stabilized yet). That is, due to the discrepancy between S_{i,i} and S_i, node i switches from state s to state s″ at time t even if (s, s″) is not in T at all.¹⁰ In a physical chip design, this premature change of state might even result in inconsistent operations on the local memory, up to the point where it cannot be properly described in terms of S, and thus in terms of our discrete model, anymore. Even worse, the state of i is part of the local memory, and the node's state signal may attain an undefined value that is propagated to other nodes and their memory. While avoiding the latter is the task of the input ports of a non-faulty node, our goal is to prevent this erroneous behaviour in situations where input ports attain legitimate values only. Therefore, we define node i to be metastability-free if the situation described above does not occur.

Definition 2.1 (Metastability-Freedom). Node i ∈ V is called metastability-free during [t⁻, t⁺] iff for each time t ∈ [t⁻, t⁺] when i switches to some state s ∈ S, it holds that τ_{i,i}(t) < t′, where t′ is the infimum of all times in (t, t⁺] when i switches to some state s′ ∈ S.
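Definition 2.1 translates directly into a check on a recorded trace of switching times; a minimal sketch (names are ours):

```python
# Check of Definition 2.1 on a finite trace: node i is metastability-free if
# each switch at time t is observed on the delayed self-channel, at tau_ii(t),
# strictly before the next switch occurs.

def metastability_free(switch_times, tau_ii):
    """switch_times: sorted reference times at which node i switches states."""
    return all(
        tau_ii(t) < t_next
        for t, t_next in zip(switch_times, switch_times[1:])
    )
```

With an internal observation delay of 0.1, switches spaced more than 0.1 apart are safe, while two switches within 0.05 of each other violate the definition.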
Multiple State Machines

In some situations the previous definitions are too stringent, as there might be different "components" of a node's state machine that act concurrently and independently, mostly relying on signals from disjoint input ports or orthogonal components of a signal. We model this by permitting nodes to run several state machines in parallel. All these state machines share the input and local ports of the respective node and are required to have disjoint state spaces. If node i runs state machines M₁, ..., M_k, node i's output signal is the product of the output signals of the individual machines. Formally, we define: each of the state machines M_j, 1 ≤ j ≤ k, has an additional own output port s_j. The state of node i's output port S_i at any time t is given by S_i(t) := (s₁(t), ..., s_k(t)), where the signals of ports s₁, ..., s_k are each defined analogously to the signals of the output ports of state machines in the single-state-machine case. Note that by this definition, the only (local) means for node i's state machines to interact with each other is by reading the delayed state signal S_{i,i}. We say that node i's state machine M_j is in state s at time t iff s_j(t) = s, where S_i(t) = (s₁(t), ..., s_k(t)), and that node i's state machine M_j switches to state s at time t iff signal s_j switches to s at time t. Since the state spaces of the machines M_j are disjoint, we will omit the phrase "state machine M_j" from the notation, i.e., we write "node i is in state s" or "node i switched to state s", respectively. Recall that the various state machines of node i are as loosely coupled as remote nodes, namely via the delayed status signal on channel S_{i,i} only. Therefore, it makes sense to consider them independently also when it comes to metastability.

Footnote 9: A single physical fault may cause this behaviour, as at some point a node's output port must be connected to the remote nodes' input ports. Even if one places bifurcations at different physical locations striving to mitigate this effect, if the voltage at the output port drops below specifications, the values of the corresponding input channels may deviate in unpredictable ways.

Footnote 10: Note that while the "internal" delay τ_{i,i}(t) − t can be made quite small, it cannot be reduced to zero if the model is meant to reflect physical implementations.

Definition 2.2 (Metastability-Freedom, Multiple State Machines). State machine M of node i ∈ V is called metastability-free during [t⁻, t⁺] iff for each time t ∈ [t⁻, t⁺] when M switches to some state s ∈ S, it holds that τ_{i,i}(t) < t′, where t′ is the infimum of all times in (t, t⁺] when M switches to some state s′ ∈ S.

Note that by this definition the different state machines may switch states concurrently without suffering from metastability.¹¹ It is even possible that some state machine suffers metastability while another is not affected by this at all.¹²

Problem Statement

The purpose of the pulse synchronization protocol is that nodes generate synchronized, well-separated pulses by switching to a distinguished state accept. Self-stabilization requires that they start to do so within bounded time, from any possible initial state. However, as our protocol makes use of randomization, there are executions where this does not happen at all; instead, we will show that the protocol stabilizes with probability one in finite time. To give a precise meaning to this statement, we need to define appropriate probability spaces.

Definition 2.3 (Adversarial Spaces). Denote for i ∈ V by C_i = {C_{i,k} | k ∈ {1, ..., c_i}} the set of clocks of node i.
An adversarial space is a probabilistic space that is defined by subsets of nodes and channels W ⊆ V and E ⊆ V × V, a time interval [t^−, t^+], a protocol P (nodes' ports, state machines, etc.) as previously defined, sets of clock and delay functions C = ∪_{i∈V} C_i and Θ = {τ_{i,j}: R_0^+ → R_0^+ | (i, j) ∈ V²}, an initial state E_0 of all ports, and an adversarial function A. Here A is a function that maps a partial execution E|[0,t] until time t (i.e., all ports' values until time t), W, E, [t^−, t^+], P, C, and Θ to the states of all faulty ports during the time interval (t, t'], where t' is the infimum of all times greater than t when a non-faulty node or channel switches states.

[11] However, care has to be taken when implementing the inter-node communication of the state components in a metastability-free manner, cf. Section 6.

[12] This is crucial for the algorithm we are going to present. For stabilization purposes, nodes comprise a state machine that is prone to metastability. However, the state machine generating pulses (i.e., having the state accept, cf. Definition 2.4) does not take its output signal into account once stabilization is achieved. Thus, the algorithm is metastability-free after stabilization in the sense that we guarantee a metastability-free signal indicating when pulses occur.

The adversarial space AS(W, E, [t^−, t^+], P, C, Θ, E_0, A) is now defined on the set of all executions E satisfying that (i) the initial state of all ports is given by E|[0,0] = E_0, (ii) for all i ∈ V and k ∈ {1, ..., c_i}: C^E_{i,k} = C_{i,k}, (iii) for all (i, j) ∈ V², τ^E_{i,j} = τ_{i,j}, (iv) nodes in W are non-faulty during [t^−, t^+] with respect to the protocol P, (v) all channels in E are correct during [t^−, t^+], and (vi) given E|[0,t] for any time t, E|(t, t'] is given by A, where t' is the infimum of times greater than t when a non-faulty node switches states. Thus, except for when randomized timeouts expire, E is fully predetermined by the parameters of AS.[13] The probability measure on AS is induced by the random distributions of the randomized timeouts specified by P.

To avoid confusion, observe that if the clock functions and delays do not follow the model constraints during [t^−, t^+], the respective adversarial space is empty and thus of no concern. This cumbersome definition provides the means to formalize a notion of stabilization that accounts for worst-case drifts and delays and an adversary that knows the full state of the system up to the current time.

We are now in the position to formally state the pulse synchronization problem in our framework. Intuitively, the goal is that after transient faults cease, nodes should with probability one eventually start to issue well-separated, synchronized pulses by switching to a dedicated state accept. Thus, as the initial state of the system is arbitrary, specifying an algorithm[14] is equivalent to defining the state machines that run at each node, one of which has a state accept.

Definition 2.4 (Self-Stabilizing Pulse Synchronization). Given a set of nodes W ⊆ V and a set E ⊆ V × V of channels, we say that protocol P is a (W, E)-stabilizing pulse synchronization protocol with skew Σ and accuracy bounds T^−, T^+ that stabilizes within time T with probability p iff the following holds.
Choose any time interval [t^−, t^+] ⊇ [t^−, t^− + T + Σ] and any adversarial space AS(W, E, [t^−, t^+], P, ·, ·, ·, ·) (i.e., C, Θ, E_0, and A are arbitrary). Then executions from AS satisfy with probability at least p that there exists a time t_s ∈ [t^−, t^− + T] so that, denoting by t_i(k) the time when node i switches to a distinguished state accept for the k-th time after t_s (t_i(k) = ∞ if no such time exists), (i) t_i(1) ∈ (t_s, t_s + Σ), (ii) |t_i(k) − t_j(k)| ≤ Σ if max{t_i(k), t_j(k)} ≤ t^+, and (iii) T^− ≤ |t_i(k+1) − t_i(k)| ≤ T^+ if t_i(k) + T^+ ≤ t^+.

Note that the fact that A is a deterministic function and, more generally, that we consider each space AS individually, is no restriction: as P succeeds for any adversarial space with probability at least p in achieving stabilization, the same holds true for randomized adversarial strategies A and worst-case drifts and delays.

3 The FATAL Pulse Synchronization Protocol

In this section, we present our self-stabilizing pulse generation algorithm. In order to be suitable for implementation in hardware, it needs to utilize very simple rules only. It is stated in terms of a state machine as introduced in the previous section. Since the ultimate goal of the pulse generation algorithm is to stabilize a system of DARTS clocks, we introduce an additional port darts_i for each node i, which is driven by node i's DARTS instance. As for other state signals, its output raises the flag Mem_{i,darts}, to which for simplicity we refer as darts_i as well. Note that the darts signals are of no concern to the liveness or stabilization of the pulse algorithm itself; rather, darts_i is a control signal from the DARTS component that helps in adjusting the frequency of pulses to the speed of the DARTS clocks once the system as a whole (including the DARTS component) is stable. The pulse algorithm will stabilize independently of the darts signal, and the DARTS component will stabilize once the pulse component did so. Therefore we can partition the algorithm's analysis into two parts. When proving the correctness of the algorithm in Section 4, we assume that for each node i, darts_i is arbitrary. In Section 7, we will outline how the pulse algorithm and DARTS interact.

[13] This follows by induction starting from the initial configuration E_0. Using A, we can always extend E to the next time when a correct node switches states, and when correct nodes switch states is fully determined by the parameters of AS except for when randomized timeouts expire. Note that the induction reaches any finite time within a finite number of steps, as signals switch states finitely often in finite time.

[14] We use the terms "algorithm" and "protocol" interchangeably throughout this work.

[Figure 1: Basic cycle of node i once the algorithm has stabilized.]

3.1 Basic Cycle

The full algorithm makes use of a rather involved interplay between conditions on timeouts, states, and thresholds to converge to a safe state despite a limited number of faulty components. As our approach is thus difficult to present in bulk, we break it down into pieces. Moreover, to facilitate giving intuition about the key ideas of the algorithm, in this section we assume that there are f < n/3 faulty nodes, and the remaining n − f nodes are non-faulty within [0, ∞) (where of course the time 0 is unknown to the nodes). We further assume that channels between non-faulty nodes (including loopback channels) are correct within [0, ∞).
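The three conditions of Definition 2.4 can be checked mechanically on a recorded trace of accept times. The sketch below does so; the function name and the trace layout are our own illustration, not part of the protocol specification.

```python
import math

def check_pulse_sync(accepts, sigma, t_lo, t_hi, t_s, t_plus):
    """Check conditions (i)-(iii) of Definition 2.4 on a finite trace.

    accepts maps each node to the sorted list of times it switched to
    state accept after t_s; sigma is the skew bound Sigma, and
    [t_lo, t_hi] are the accuracy bounds T^-, T^+.
    """
    def t(i, k):  # t_i(k), with infinity if node i has no k-th pulse
        ts = accepts[i]
        return ts[k] if k < len(ts) else math.inf

    nodes = list(accepts)
    k_max = max(len(ts) for ts in accepts.values())

    # (i) every node's first pulse falls strictly inside (t_s, t_s + sigma)
    if not all(t_s < t(i, 0) < t_s + sigma for i in nodes):
        return False
    for k in range(k_max):
        # (ii) k-th pulses are pairwise at most sigma apart, whenever
        # both occur no later than t_plus
        done = [t(i, k) for i in nodes if t(i, k) <= t_plus]
        if done and max(done) - min(done) > sigma:
            return False
        # (iii) consecutive pulses of one node are T^- to T^+ apart,
        # whenever the follow-up pulse is due before t_plus
        for i in nodes:
            if t(i, k) + t_hi <= t_plus and not t_lo <= t(i, k + 1) - t(i, k) <= t_hi:
                return False
    return True

# Example trace: three nodes pulsing twice, skew bound 1, accuracy [80, 120]
trace = {0: [10.0, 110.0], 1: [10.5, 110.4], 2: [10.2, 110.9]}
print(check_pulse_sync(trace, 1.0, 80.0, 120.0, 9.6, 200.0))  # -> True
```

Condition (iii) deliberately treats a missing follow-up pulse as a violation whenever one is due before t_plus, matching the convention t_i(k) = ∞ in the definition.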
We start by presenting the basic cycle that is repeated every pulse once a safe configuration is reached (see Figure 1). We employ graphical representations of the state machine of each node i ∈ V. States are represented by circles containing their names, while a transition (s, s') ∈ T is depicted as an arrow from s to s'. The guard tr(s, s') is written as a label next to the arrow, and the reset function's value re(s, s') is depicted in a rectangular box on the arrow. To keep labels simple, we make use of some abbreviations. We write T instead of (T, s) if s is the state which node i leaves if the condition involving (T, s) is satisfied. Threshold conditions like "≥ f + 1 s", where s ∈ S, abbreviate Boolean predicates that range over all of node i's memory flags Mem_{i,j,s}, where j ∈ V, and are defined in a straightforward manner. If in such an expression we connect two states by "or", e.g., "≥ n − f s or s'" for s, s' ∈ S, the summation considers flags of both types s and s'; this example expression is thus equivalent to Σ_{j∈V} max{Mem_{i,j,s}, Mem_{i,j,s'}} ≥ n − f. For any state s ∈ S, the condition S_{i,j} = s (respectively, ¬(S_{i,j} = s)) is written in short as "j in s" (respectively, "j not in s"). If j = i, we simply write "(not) in s". We write "true" instead of a condition that is always true (like, e.g., "(in s) or (not in s)" for an arbitrary state s ∈ S). Finally, re(·, ·) always requires to reset all memory flags of certain types, hence we write, e.g., propose if all flags Mem_{i,j,propose} are to be reset.

We now briefly introduce the basic flow of the algorithm once it stabilizes, i.e., once all n − f non-faulty nodes are well-synchronized. Recall that the remaining up to f < n/3 faulty nodes may produce arbitrary signals on their outgoing channels. A pulse is locally triggered by switching to state accept.
Thus, assume that at some time all non-faulty nodes switch to state accept within a time window of 2d, i.e., a valid pulse is generated. Supposing that T_1 ≥ 3ϑd, these nodes will observe, and thus memorize, each other and themselves in state accept before T_1 expires. This makes timeout T_1 the critical condition for switching to state sleep. From state sleep, they will switch to states sleep→waking, waking, and finally ready, where the timeout (T_2, accept) determines the time this takes, as it is considerably larger than ϑ(ϑ + 2)T_1. The intermediate states serve the purpose of achieving stabilization, hence we leave them out for the moment. Note that upon switching to state ready, nodes reset their propose flags and darts_i. Thus, they essentially ignore these signals between the most recent time they switched to propose before switching to accept and the subsequent time when they switch to ready. This ensures that nodes do not take into account outdated information for the decision when to switch to state propose. Hence, it is guaranteed that the first node switching from state ready to state propose again does so because T_4 expired or because T_3 expired and its darts memory flag is true. Due to the constraint min{T_3, T_4} ≥ ϑ(T_2 + 4d), we are sure that all non-faulty nodes observe themselves in state ready before the first one switches to propose. Hence, no node deletes information about nodes that switch to propose again after the previous pulse. The first non-faulty node that switches to state accept again cannot do so before it memorizes at least n − f nodes in state propose, as the accept flags are reset upon switching to state propose. Therefore, at this time at least n − 2f ≥ f + 1 non-faulty nodes are in state propose.
Hence, the rule that nodes switch to propose if they memorize f + 1 nodes in state propose will take effect, i.e., the remaining non-faulty nodes in state ready switch to propose after less than d time. Another d time later, all non-faulty nodes in state propose will have become aware of this and switch to state accept as well, as the threshold of n − f nodes in states propose or accept is reached. Thus the cycle is complete and the reasoning can be repeated inductively.

Clearly, for this line of argumentation to be valid, the algorithm could be simpler than stated in Figure 1. We already mentioned that the motivation for having three intermediate states between accept and ready is to facilitate stabilization. Similarly, there is no need to make use of the accept flags in the basic cycle at all; in fact, it adversely affects the constraints the timeouts need to satisfy for the above reasoning to be valid. However, the accept flags are much better suited for diagnostic purposes than the propose flags, since nodes are expected to switch to accept in a small time window and remain in state accept for a small period of time only (for all our results, it is sufficient if T_1 = 4ϑd). Moreover, two different timeout conditions for switching from ready to propose are unnecessary for correct operation of the pulse synchronization routine. As discussed before, they are introduced in order to allow for a seamless coupling to the DARTS system. We elaborate on this in Section 7.
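The threshold guards used in these rules reduce to sums over the memory flags, with "or" between two states taken as a per-node maximum so that no node is counted twice. A minimal sketch (the dictionary layout is our own; the protocol realizes these guards in digital logic):

```python
def threshold(mem, states, bound):
    """True iff at least `bound` distinct nodes are memorized in one of
    `states`; mem[j][s] plays the role of the flag Mem_{i,j,s}."""
    return sum(max(mem[j][s] for s in states) for j in mem) >= bound

n, f = 4, 1
mem = {j: {"propose": 0, "accept": 0} for j in range(n)}
mem[0]["propose"] = 1
mem[1]["accept"] = 1
mem[2]["propose"] = 1
mem[2]["accept"] = 1  # node 2 memorized in both states: still counted once

print(threshold(mem, ("propose", "accept"), n - f))  # guard ">= n-f propose or accept" -> True
print(threshold(mem, ("propose",), f + 1))           # guard ">= f+1 propose" -> True
```

Because the combination takes a maximum per remote node, a faulty node oscillating between propose and accept contributes at most one to any threshold, which is exactly what the Σ_j max{...} formulation above expresses.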
[Figure 2: Overview of the core routine of node i's self-stabilizing pulse algorithm.]

3.2 Main Algorithm

We proceed by describing the main routine of the pulse algorithm in full. Alongside the main routine, several other state machines run concurrently and provide additional information to be used during recovery. The main routine is graphically presented in Figure 2, together with a very simple second component whose sole purpose is to simplify the otherwise overloaded description of the main routine. Except for the states recover and join and additional resets of memory flags, the main routine is identical to the basic cycle. The purpose of the two additional states is the following: Nodes switch to state recover once they detect that something is wrong, that is, that non-faulty nodes do not execute the basic cycle as outlined in Section 3.1. This way, non-faulty nodes will not continue to confuse others by sending, for example, state signals propose or accept despite clearly being out-of-sync. There are various consistency checks that nodes perform during each execution of the basic cycle. The first one is that in order to switch from state accept to state sleep, non-faulty nodes need to memorize at least n − f nodes in state accept.
If this does not happen within T_1 time after switching to state accept, by the arguments given in Section 3.1, they could not have entered state accept within 2d of each other. Therefore, something must be wrong and it is feasible to switch to state recover. Next, whenever a non-faulty node is in state waking, there should be no non-faulty nodes in states accept or recover. Considering that the node resets its accept and recover flags upon switching to waking, it should not memorize f + 1 or more nodes in states accept or recover at a time when it observes itself in state waking. If it does, however, it again switches to state recover. Similarly, when in state ready, nodes expect others not to be in state accept for more than a short period of time, as a non-faulty node switching to accept should imply that every non-faulty node switches to propose and then to accept shortly thereafter. This is expressed by the second state machine, comprising two states only. If a node is in state ready and memorizes f + 1 nodes in state accept, it switches to suspect. Subsequently, if it remains in state ready until a timeout of 2ϑd expires, it will switch to state recover. Last but not least, during a synchronized execution of the basic cycle, no non-faulty node may be in state propose for more than a certain amount of time before switching to state accept. Therefore, nodes will switch from propose to recover when timeout T_5 expires.

Nodes can join the basic cycle again via the second new state, called join. Since the Byzantine nodes may "play nice" towards n − 2f or more nodes still executing the basic cycle, making them believe that system operation continues as usual, it must be possible to join the basic cycle again without having a majority of nodes in state recover.
On the other hand, it is crucial that this happens in a sufficiently well-synchronized manner, as otherwise nodes could drop out again because the various consistency checks detect an erroneous execution of the basic cycle. In part, this issue is solved by an additional agreement step. In order to enter the basic cycle again, nodes need to memorize n − f nodes in states join (the respective nodes detected an inconsistency), propose (these nodes continued to execute the basic cycle), or accept (there are executions where nodes reset their propose flags because of switching to join when other nodes already switched to accept). Since there are thresholds of f + 1 nodes memorized in state join both for leaving state recover and for switching from ready to join, all nodes will follow the first one switching from join to propose quickly, just as with the switch from propose to accept in an ordinary execution of the basic cycle. However, it is decisive that all nodes are in states that permit participation in this agreement step in order to guarantee the success of this approach. As a result, a certain degree of synchronization still needs to be established beforehand, both among nodes that still execute the basic cycle and those that do not. For instance, if at the point in time when a majority of nodes and channels become non-faulty, some nodes already memorize nodes in join that are not, they may switch to state join and subsequently propose prematurely, causing others to have inconsistent memory flags as well. Again, Byzantine faults may sustain this amiss configuration of the system indefinitely. So why did we put so much effort into "shifting" the focus to this part of the algorithm? The key advantage is that nodes outside the basic cycle may take into account less reliable information for stabilization purposes.
They may take the risk of metastable upsets (as we know it is impossible to avoid these during the stabilization process, anyway) and make use of randomization. In fact, to make the above scheme work, it is sufficient that all non-faulty nodes agree on a so-called resynchronization point (formally defined later on), that is, a point in time at which nodes reset the memory flags for states join and sleep→waking as well as certain timeouts, while guaranteeing that no node is in these states close to the respective reset times. Except for state sleep→waking, all of these timeouts, memory flags, etc. are not part of the basic cycle at all, thus nodes may enforce consistent values for them when they agree on such a resynchronization point. Conveniently, the use of randomization also ensures that it is quite unlikely that nodes are in state sleep→waking close to a resynchronization point, as the consistency check of having to memorize n − f nodes in state accept in order to switch to state sleep guarantees that the time windows during which non-faulty nodes may switch to sleep make up a small fraction of all times only.

[Figure 3: Extension of node i's core routine.]

Consequently, the remaining components of the algorithm deal with agreeing on resynchronization points and utilizing this information in an appropriate way to ensure stabilization of the main routine. We describe this connection to the main routine first. It is handled by another, quite simple state machine, which runs in parallel alongside the core routine. It is depicted in Figure 3. Its purpose is to reset memory flags in a consistent way and to determine when a node is permitted to switch to join.
In general, a resynchronization point (locally observed by switching to state resync, which is introduced later) triggers the reset of the join and sleep→waking flags. If there are still nodes executing the basic cycle, a node may become aware of it by observing f + 1 nodes in state sleep→waking at some time. In this case it switches from the state passive, which it entered at the point in time when it locally observed the resynchronization point, to the state active, which enables an earlier transition to state join. This is expressed by the rather involved transition rule tr(recover, join): T_6 is much smaller than T_7, but T_6 is of no concern until the node switches to state active and resets T_6.[15]

It remains to explain how nodes agree on resynchronization points.

3.3 Resynchronization Algorithm

The resynchronization routine is specified in Figure 4 as well. It is a lower layer that the core routine uses for stabilization purposes only. It provides some synchronization that is very similar to that of a pulse, except that such "weak pulses" occur at random times, and may be generated inconsistently after the algorithm as a whole has stabilized. Since the main routine operates independently of the resynchronization routine once the system has stabilized, we can afford the weaker guarantees of the routine: if it succeeds in generating a "good" resynchronization point merely once, the main routine will stabilize deterministically.

Definition 3.1 (Resynchronization Points). Given W ⊆ V, time t is a W-resynchronization point iff each node in W switches to state supp→resync in the time interval (t, t + 2d).

Definition 3.2 (Good Resynchronization Points).
A W-resynchronization point is called good if no node from W switches to state sleep during (t − (ϑ + 3)T_1, t) and no node is in state join during [t − T_1 − d, t + 4d).

[15] The condition "not in dormant" here ensures that the transition is not performed because the node has been in state resync a long time ago, but there was no recent switching to resync.

[Figure 4: Resynchronization algorithm, comprising two state machines executed in parallel at node i.]

In order to clarify that despite having a linear number of states (supp_1, ..., supp_n, i.e., one for each node), this part of the algorithm can be implemented using 2-bit communication channels between state machines only, we generalize our description of state machines as follows. If a state is depicted as a circle separated into an upper and a lower part, the upper part denotes the local state, while the lower part indicates the signal state to which it is mapped. A node's memory flags then store the respective signal states only, i.e., remote nodes do not distinguish between states that share the same signal. Clearly, such a machine can be simulated by a machine as introduced in the model section. The advantage is that such a mapping can be used to reduce the number of transmitted state bits; for the resynchronization routine given in Figure 4, we merely need two bits (init/wait and none/supp) instead of ⌈log(n + 3)⌉ + 1 bits. The basic idea behind the resynchronization algorithm is the following: every now and then, nodes will try to initiate agreement on a resynchronization point.
This is the purpose of the small state machine on the left in Figure 4. Recalling that the transition condition "true" simply means that the node switches to state wait again as soon as it observes itself in state init, it is easy to see that it does nothing else than creating an init signal as soon as R_3 expires and resetting R_3 again as quickly as possible. As the time when a node switches to init is determined solely by the randomized timeout R_3, which is distributed over a large interval (cf. Equality (11)), it is impossible to predict when it will expire, even with full knowledge of the execution up to the current point in time. Note that the complete independence of this part of node i's state from the remaining protocol implies that faulty nodes are not able to influence the respective times by any means.

Consider now the state machine displayed on the right of Figure 4. To understand how the routine is intended to work, assume that at the time t when a non-faulty node i switches to state init, no non-faulty node is in any of the states supp→resync, resync, or supp_i, and at all non-faulty nodes the timeout (R_2, supp_i) has expired. Then, no matter what the signals from faulty nodes or on faulty channels are, all non-faulty nodes will be in one of the states supp_j, j ∈ V, or supp→resync at time t + d. Hence, they will observe each other (and themselves) in one of these states at some time smaller than t + 2d. These statements follow from the various timeout conditions of at least 2ϑd and the fact that observing node i in state init will make nodes switch to state supp_i if in none or supp_j, j ≠ i. Hence, all of them will switch to state supp→resync during (t, t + 2d), i.e., t is a resynchronization point.
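The randomized timeout R_3 driving these init events is, per Equality (11) of Condition 3.3, a uniform draw on an interval of length 8(1 − λ)R_2 starting at ϑ(R_2 + 3d). The following sketch uses placeholder parameters (ϑ, d, R_2 chosen arbitrarily for illustration; only the shape of the distribution matters):

```python
import math
import random

# Placeholder parameters; any theta > 1 and d, R2 > 0 illustrate the shape.
theta, d, R2 = 1.1, 1.0, 1000.0

# lambda from Condition 3.3, Equality (1); it always lies in (4/5, 1)
lam = math.sqrt((25 * theta - 9) / (25 * theta))

# R_3 per Equality (11): uniform on [theta*(R2+3d), theta*(R2+3d) + 8*(1-lam)*R2]
lo = theta * (R2 + 3 * d)
hi = lo + 8 * (1 - lam) * R2

draws = [random.uniform(lo, hi) for _ in range(1000)]
assert 0.8 < lam < 1 and hi > lo
assert all(lo <= x <= hi for x in draws)
```

Because the support has positive length and the draw is independent of everything else in the protocol, even an adversary knowing the full execution so far cannot predict when R_3 expires; worst-case clock drift merely distorts the density by a factor of at most ϑ, which the interval length absorbs.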
Since t follows a random distribution that is independent of the remaining algorithm and, as mentioned earlier, most of the time nodes cannot switch to state sleep and it is easy to deal with the condition on join states, there is a large probability that t is a good resynchronization point. Note that timeout R_1 makes sure that no non-faulty node will switch to supp→resync again anytime soon, leaving sufficient time for the main routine to stabilize.

The scenario we just described relies on the fact that at time t no node is in state supp→resync or state resync. We will choose R_2 ≫ R_1, implying that R_2 + 3d time after a node switched to state init, all nodes have "forgotten" about this, i.e., (R_2, supp_i) is expired and they have switched back to state none (unless other init signals interfered). Thus, in the absence of Byzantine faults, the above requirement is easily achieved with a large probability by choosing R_3 as a uniform distribution over some interval [R_2 + 3d, R_2 + Θ(nR_1)]: other nodes will switch to init O(n) times during this interval, each time "blocking" other nodes for at most O(R_1) time. If the random choice picks any other point in time during this interval, a resynchronization point occurs. Even if the clock speed of the clock driving R_3 is manipulated in a worst-case manner (affecting the density of the probability distribution with respect to real time by a factor of at most ϑ), we can just increase the size of the interval to account for this.

However, what happens if only some of the nodes receive an init signal due to faulty channels or nodes? If the same holds for some of the subsequent supp signals, it might happen that only a fraction of the nodes reaches the threshold for switching to state supp→resync, resulting in an inconsistent reset of flags and timeouts across the system.
Until the respective nodes switch to state none again, they will not support a resynchronization point again, i.e., about R_1 time is "lost". This issue is the reason for the agreement step and the timeouts (R_2, supp_j). In order for any node to switch to state supp→resync, there must be at least n − 2f ≥ f + 1 non-faulty nodes supporting this. Hence, all of these nodes recently switched to a state supp_j for some j ∈ V, resetting (R_2, supp_j). Until these timeouts expire, f + 1 ∈ Ω(n) non-faulty nodes will ignore init signals on the respective channels. Since there are O(n²) channels, it is possible to choose R_2 ∈ O(nR_1) such that this may happen at most O(n) times in O(n) time. Playing with constants, we can pick R_3 ∈ O(n) while maintaining that still a constant fraction of the times are "good" in the sense that R_3 expiring at a non-faulty node will result in a good resynchronization point.

3.4 Timeout Constraints

Condition 3.3 summarizes the constraints we require on the timeouts for the core routine and the resynchronization algorithm to act and interact as intended.

Condition 3.3 (Timeout Constraints). Define

  λ := √((25ϑ − 9)/(25ϑ)) ∈ (4/5, 1),   (1)

∆_g := (ϑ + 3)T_1, ∆_s := T_2/ϑ − 2T_1 − d, δ_s := 2T_1 + 3d, and δ̃_s := (ϑ + 2 − 1/ϑ)T_1 + 4d.
The timeouts need to satisfy the constraints

  T_1 ≥ 4ϑd   (2)
  T_2 ≥ ϑ max{ T_1 + ∆_g − (4ϑ² + 16ϑ + 5)d, (3ϑ + 1 − 1/ϑ)T_1 + T_5 }   (3)
  T_3 ≥ max{ (ϑ − 1)T_2 + ϑ(2T_1 + (2ϑ + 4)d), (2ϑ² + 3ϑ − 1)T_1 − T_2 + ϑ(T_6 + 5d) }   (4)
  T_4 ≥ T_3   (5)
  T_5 ≥ max{ ϑ(T_4 + 7d) − T_3 + (ϑ − 1)T_2, (ϑ² + ϑ − 2)T_1 + ϑ(T_2 + T_4 + 9d) − T_6 }   (6)
  T_6 ≥ ϑ( δ̃_s − (1 − 1/ϑ)T_1 + T_2 + 2d ) > ϑ∆_s   (7)
  T_7 ≥ ϑ( T_2 + T_4 + T_5 + ∆_s + δ̃_s − ∆_g + d ) + T_6 − 4d   (8)
  R_1 ≥ ϑ max{ T_7 + (4ϑ + 8)d, (2ϑ + 4 − 3/ϑ)T_1 + 2T_4 + T_5 − ∆_s − ∆_g + 17d }   (9)
  R_2 ≥ 2ϑ( R_1 + (ϑ + 2)T_1 + T_2/ϑ + (8ϑ + 9)d )(n − f)/(1 − λ)   (10)
  R_3 = a uniformly distributed random variable on [ϑ(R_2 + 3d), ϑ(R_2 + 3d) + 8(1 − λ)R_2]   (11)
  λ ≤ (∆_s − ∆_g − δ_s)/∆_s.   (12)

We need to show for which values of ϑ this system can be solved. Furthermore, we would like to allow for the largest possible drift of the DARTS clocks, which necessitates maximizing the ratio (T_2 + T_4)/(ϑ(T_2 + T_3 + 4d)), that is, the minimal gap between pulses provided that the states of the darts signals are zero, divided by the maximal time it takes nodes to observe themselves in state ready with T_3 expired after a pulse (as then they will respond to darts_i switching to one).

Lemma 3.4. Define ϑ_max ≈ 1.247 as the positive solution of 2ϑ + 1 = ϑ³ + ϑ². Given that ϑ < ϑ_max, Condition 3.3 can be satisfied with T_1, ..., T_7, R_1 ∈ O(1) and R_2 ∈ O(n). The ratio (T_2 + T_4)/(ϑ(T_2 + T_3 + 4d)) can be made larger than any constant smaller than (ϑ³ + 2ϑ + 1)/(2ϑ⁴ + ϑ³).

Proof. First, we identify several redundant inequalities in the system. We have that

  (2ϑ + 2 − 1/ϑ)T_1 + T_5 > 3ϑT_1 + T_2 + T_4 − T_6   (by (6))
    > 7ϑT_1   (by (4), (5))
    > T_1 + ∆_g − (4ϑ² + 16ϑ + 5)d,

i.e., the left term in the maximum in Inequality (3) is redundant.
The same holds true for the left terms in the maxima in Inequality (4) and Inequality (6), since

  (2ϑ² + 3ϑ − 1)T_1 − T_2 + ϑ(T_6 + 5d) > 3ϑT_1 + (ϑ − 1)T_2 + 4d   (by (7))
    > (ϑ − 1)T_2 + ϑ(2T_1 + (2ϑ + 4)d)   (by (2))

and

  ϑ(T_4 + 7d) − T_3 + (ϑ − 1)T_2 < ϑ(T_2 + T_4 − T_6 + 7d)   (by (4))
    < (ϑ² + ϑ − 2)T_1 + ϑ(T_2 + T_4 + 9d) − T_6.

Finally, we can eliminate the right term in the maximum in Inequality (9) from the system, as

  T_7 + (4ϑ + 8)d > T_2 + T_4 + T_5 + T_6 + 2δ̃_s − ∆_g + 13d   (by (8))
    > (2ϑ + 4 − 3/ϑ)T_1 + T_4 + 2T_5 + T_6 − ∆_g + 17d   (by (3))
    > (2ϑ + 4 − 3/ϑ)T_1 + T_2 + 2T_4 + T_5 − ∆_g + 17d   (by (6)).

Next, it is not difficult to see that the right hand sides of all inequalities are strictly increasing in T_1 (except for Inequality (12), whose right hand side decreases with T_1), implying that w.l.o.g. we may set T_1 := 4ϑd. Similarly, we demand that Inequality (8), Inequality (9), and Inequality (10) are satisfied with equality, i.e.,

  T_7 = ϑ(T_2 + T_4 + T_5) + T_6 − (4ϑ² + 4)d
  R_1 = ϑT_7 + (4ϑ² + 8ϑ)d
  R_2 = 2ϑ( R_1 + T_2/ϑ + (4ϑ² + 16ϑ + 9)d )(n − f)/(1 − λ)
  R_3 = a uniformly distributed random variable on [ϑ(R_2 + 3d), ϑ(R_2 + 3d) + 8(1 − λ)R_2].

We set T_4 := αT_3 for a parameter α ∈ [1, (2ϑ + 1)/(ϑ³ + ϑ²)), implying that Inequality (5) holds by definition. The remaining, simpler system is as follows:

  T_2 ≥ (8ϑ³ + 8ϑ² − 4ϑ)d + ϑT_5   (13)
  T_3 ≥ (8ϑ³ + 12ϑ² + ϑ)d − T_2 + ϑT_6   (14)
  T_5 ≥ (4ϑ³ + 4ϑ² + ϑ)d + ϑ(T_2 + αT_3) − T_6   (15)
  T_6 ≥ (4ϑ² + 6ϑ − 4)d + T_2   (16)
  √((25ϑ − 9)/(25ϑ)) ≤ ( T_2/ϑ − (4ϑ² + 28ϑ + 4)d ) / ( T_2/ϑ − (8ϑ + 1)d ).

Note that the above equalities do not affect this system and can be resolved iteratively once the other variables are fixed.
We observe that the right-hand side of Inequality (13) is increasing in $T_5$, the right-hand side of Inequality (15) is increasing in $T_3$, and neither $T_3$ nor $T_5$ is present in any further inequalities. Hence, we rule that Inequality (14) and Inequality (15) shall be satisfied with equality, i.e.,
\begin{align*}
T_3 &= (8\vartheta^3+12\vartheta^2+\vartheta)d - T_2 + \vartheta T_6 \\
T_5 &= \left( \alpha(8\vartheta^4+12\vartheta^3+\vartheta^2) + (4\vartheta^3+4\vartheta^2+\vartheta) \right)d - (\vartheta\alpha-1)T_2 + (\vartheta^2\alpha-1)T_6,
\end{align*}
and arrive at the subsystem
\[ T_2 \geq \frac{\left( \alpha(8\vartheta^5+12\vartheta^4+\vartheta^3) + (4\vartheta^4+12\vartheta^3+9\vartheta^2-4\vartheta) \right)d + (\vartheta^3\alpha-\vartheta)T_6}{1+\vartheta-\vartheta^2\alpha} \quad (17) \]
\[ T_6 \geq (4\vartheta^2+6\vartheta-4)d + T_2 \]
\[ T_2 \geq \frac{(4\vartheta^3+20\vartheta^2+3\vartheta)d}{1-\sqrt{(25\vartheta-9)/(25\vartheta)}}, \]
where we used that $1+\vartheta-\vartheta^2\alpha > 0$. Now we can see that Inequality (17) is also increasing in $T_6$, set $T_6 := (4\vartheta^2+6\vartheta-4)d + T_2$, and obtain
\[ T_2 \geq \frac{\left( \alpha(12\vartheta^5+18\vartheta^4-3\vartheta^3) + (4\vartheta^4+8\vartheta^3+3\vartheta^2) \right)d}{1+2\vartheta-(\vartheta^3+\vartheta^2)\alpha} \quad (18) \]
\[ T_2 \geq \frac{25\left(1+\sqrt{(25\vartheta-9)/(25\vartheta)}\right)(4\vartheta^4+20\vartheta^3+3\vartheta^2)d}{9}, \quad (19) \]
exploiting that $1+2\vartheta-(\vartheta^3+\vartheta^2)\alpha > 0$. Since $\alpha$ and thus $\vartheta$ are constantly bounded (and we treat $d$ as a constant as well), we have a feasible solution with $T_2 \in O(1)$ (considering asymptotics with respect to $n$). Resolving the equalities we derived for the other variables, we see that $T_1,\ldots,T_7,R_1 \in O(1)$ and $R_2 \in O(n)$ as claimed.

It remains to determine the maximal ratio $(T_2+T_4)/(\vartheta(T_2+T_3+4d)) = (T_2+\alpha T_3)/(\vartheta(T_2+T_3+4d))$ we can ensure. Obviously, for any value of $\alpha$, fixing either $T_2$ or $T_3$ implies that we want to minimize $T_2$ or maximize $T_3$, respectively. Have a look at Inequalities (13)-(16) again. The solution we constructed minimized $T_3$ and subsequently $T_2$, parametrized by feasible values of $\alpha$. Increase now $T_3$ by $x \in \mathbb{R}^+$ in Inequality (14). Consequently, we may increase $T_6$ in Inequality (16) by $x/\vartheta$ compared to our previous solution (where we minimized all inequalities).
Hence, we need to increase $T_5$ by $(\vartheta\alpha-1/\vartheta)x$ according to Inequality (15), and finally $T_2$ by $\vartheta(\vartheta\alpha-1/\vartheta)x$. Thus, for any feasible $\alpha$ and any $\varepsilon > 0$, we can achieve that $T_2 \leq (\vartheta^2\alpha-1+\varepsilon)T_3$ if we just choose $x$ large enough. We conclude that we can get arbitrarily close to the ratio
\[ \frac{\left(\alpha+(\vartheta^2\alpha-1)\right)T_3}{\vartheta\left(1+(\vartheta^2\alpha-1)\right)T_3} = \frac{\vartheta^2\alpha+\alpha-1}{\vartheta^3\alpha}. \]
Inserting the supremum of admissible values for $\alpha$, this expression becomes
\[ \frac{(2\vartheta+1)(\vartheta^2+1)-(\vartheta^3+\vartheta^2)}{\vartheta^3(2\vartheta+1)} = \frac{\vartheta^3+2\vartheta+1}{2\vartheta^4+\vartheta^3}. \]
This shows the last claim of the lemma, concluding the proof.

4 Analysis

In this section, we derive a skew bound $\Sigma$ as well as accuracy bounds $T^-$, $T^+$ such that the presented protocol is a $(W,E)$-stabilizing pulse synchronization protocol, for proper choices of the set of nodes $W$ and the set of channels $E$, with skew $\Sigma$ and accuracy bounds $T^-$, $T^+$, that stabilizes within time $T(k) \in O(kn)$ with probability $1-1/2^{k(n-f)}$, for any $k \in \mathbb{N}$.

To start our analysis, we need to define the basic requirements for stabilization. Essentially, we need that a majority of nodes is non-faulty and that the channels between them are correct. However, the first part of the stabilization process is simply that nodes "forget" about past events that are captured by their timeouts. Therefore, we demand that these nodes indeed have been non-faulty for a time period that is sufficiently large to ensure that all timeouts have been reset at least once after the considered set of nodes became non-faulty.

Definition 4.1 (Coherent States). The subset of nodes $W \subseteq V$ is called coherent during the time interval $[t^-, t^+]$, iff during $[t^- - (\vartheta(R_2+3d)+8(1-\lambda)R_2) - d,\ t^+]$ all nodes $i \in W$ are non-faulty and all channels $S_{i,j}$, $i,j \in W$, are correct.
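The threshold $\vartheta_{\max}$ and the limiting drift ratio from Lemma 3.4 above can be checked numerically. The following sketch (ours, not part of the original analysis) finds the positive root of $2\vartheta+1 = \vartheta^3+\vartheta^2$ by bisection and evaluates the ratio bound $(\vartheta^3+2\vartheta+1)/(2\vartheta^4+\vartheta^3)$ at that root:

```python
# Numerical sanity check for Lemma 3.4 (illustration only).

def f(v):
    # f(v) = 0 exactly when 2v + 1 = v^3 + v^2
    return v**3 + v**2 - 2*v - 1

def bisect(lo, hi, eps=1e-12):
    # plain bisection; assumes f(lo) < 0 < f(hi)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# f(1) = -1 < 0 and f(2) = 7 > 0, so the positive root lies in (1, 2).
vartheta_max = bisect(1.0, 2.0)   # ~1.2470, matching the lemma

def ratio(v):
    # limiting value of (T2 + T4) / (vartheta * (T2 + T3 + 4d)) per the lemma
    return (v**3 + 2*v + 1) / (2*v**4 + v**3)

limit_ratio = ratio(vartheta_max)
```

As expected, the feasible clock-drift bound is modest: the root sits just below 1.25, and the achievable pulse-gap ratio at $\vartheta_{\max}$ evaluates to roughly 0.8.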
We will show that if a coherent set of at least $n-f$ nodes fires a pulse, i.e., switches to accept in tight synchrony, this set will generate pulses deterministically and with controlled frequency, as long as the set remains coherent. This motivates the following definitions.

Definition 4.2 (Stabilization Points). We call $t$ a $W$-stabilization point (quasi-stabilization point) iff all nodes $i \in W$ switch to accept during $[t, t+2d)$ ($[t, t+3d)$).

Throughout this section, we assume the set of coherent nodes $W$ with $|W| \geq n-f$ to be fixed, and we consider all nodes in, and channels originating from, $V \setminus W$ as (potentially) faulty. As all our statements refer to nodes in $W$, we will typically omit the word "non-faulty" when referring to the behaviour or states of nodes in $W$, and "all nodes" is short for "all nodes in $W$". Note, however, that we will still clearly distinguish between channels to nodes in $W$ originating at faulty and at non-faulty nodes, respectively.

As a first step, we observe that at times when $W$ is coherent, indeed all nodes reset their timeouts, basing the respective state transitions on proper perception of nodes in $W$.

Lemma 4.3. If the system is coherent during the time interval $[t^-, t^+]$, any (randomized) timeout $(T, s)$ of any node $i \in W$ expiring at a time $t \in [t^-, t^+]$ has been reset at least once since time $t^- - (\vartheta(R_2+3d)+8(1-\lambda)R_2)$. If $t_0$ denotes the time when such a reset occurred, for any $j \in W$ it holds that $S_{i,j}(t_0) = S_j(\tau_{j,i}^{-1}(t_0))$, i.e., at time $t_0$, $i$ observes $j$ in a state that $j$ attained while it was non-faulty.

Proof. According to Condition 3.3, the largest possible value of any (randomized) timeout is $\vartheta(R_2+3d)+8(1-\lambda)R_2$. Hence, any timeout that is in state 1 at a time smaller than $t^- - (\vartheta(R_2+3d)+8(1-\lambda)R_2)$ expires before time $t^-$ or is reset at least once.
As by the definition of coherency all nodes in $W$ are non-faulty and all channels between such nodes are correct during $[t^- - (\vartheta(R_2+3d)+8(1-\lambda)R_2) - d,\ t^+]$, this implies the statement of the lemma.

Phrased informally, any corruption of timeout and channel states eventually ceases, as correct timeouts expire and correct links remember no events that lie $d$ or more time in the past. Proper cleaning of the memory flags is more complicated and will be explained further down the road.

Throughout this section, we will assume for the sake of simplicity that the system is coherent at all times and use this lemma implicitly; e.g., we will always assume that nodes from $W$ observe all other nodes from $W$ in states that they indeed had less than $d$ time ago, that the expiration of randomized timeouts at non-faulty nodes cannot be predicted accurately, etc. We will discuss more general settings in Section 5.

We proceed by showing that once all nodes in $W$ switch to accept within a short period of time, i.e., a $W$-quasi-stabilization point is reached, the algorithm guarantees that synchronized pulses are generated deterministically with a frequency that is bounded both from above and from below.

Theorem 4.4. Suppose $t$ is a $W$-quasi-stabilization point. Then
(i) all nodes in $W$ switch to accept exactly once within $[t, t+3d)$ and do not leave accept until $t+4d$, and
(ii) there will be a $W$-stabilization point $t' \in (t+(T_2+T_3)/\vartheta,\ t+T_2+T_4+5d)$ satisfying that no node in $W$ switches to accept in the time interval $[t+3d, t')$, and
(iii) each node $i$'s, $i \in W$, core state machine (Figure 1) is metastability-free during $[t+4d, t'+4d)$.

Proof. Proof of (i): Due to Inequality (2), a node does not leave the state accept earlier than $T_1/\vartheta \geq 4d$ time after switching to it. Thus, no node can switch to accept twice during $[t, t+3d)$.
By definition of a quasi-stabilization point, every node does switch to accept in the interval $[t, t+3d) \subset [t, t+T_1/\vartheta)$. This proves Statement (i).

Proof of (ii): For each $i \in W$, let $t_i \in [t, t+3d)$ be the time when $i$ switches to accept. By (i), $t_i$ is well-defined. Further, let $t'_i$ be the infimum of times in $(t_i, \infty)$ when $i$ switches to recover, join, or propose.^16 In the following, denote by $i \in W$ a node with minimal $t'_i$. We will show that all nodes switch to propose via states sleep, sleep→waking, waking, and ready, in the presented order.

By (i), nodes do not leave accept before $t+4d$. Thus, at time $t+4d$, each node in $W$ is in state accept and observes each other node in $W$ in accept. Hence, each node in $W$ memorizes each other node in $W$ in accept at time $t+4d$. For each node $j \in W$, let $t_{j,s}$ be the time node $j$'s timeout $T_1$ expires first after $t_j$. Then $t_{j,s} \in (t_j+T_1/\vartheta,\ t_j+T_1+d)$.^17 Since $|W| \geq n-f$, each node $j$ switches to state sleep at time $t_{j,s}$. Hence, by time $t+T_1+4d$, no node will be observed in state accept anymore (until the time when it switches to accept again).

When a node $j \in W$ switches to state waking at the minimal time $t_w$ larger than $t_j$, it does not do so earlier than at time $t + T_1/\vartheta + (1+1/\vartheta)T_1 = t + (1+2/\vartheta)T_1 > t + T_1 + 5d$. This implies that all nodes in $W$ have already left accept at least $d$ time ago, since they switched to it at their respective times $t_j < t+T_1+4d$. Moreover, they cannot switch to accept again until $t'_i$, as it is minimal and nodes need to switch to propose before switching to accept. Hence, nodes in $W$ are not observed in state accept during $(t+T_1+5d, t'_i]$, in particular not by node $j$. Furthermore, nodes in $W$ are not observed in state recover during $(t_w-d, t'_i]$.
As it resets its accept and recover flags upon switching to waking, $j$ will hence neither switch from waking to recover nor from trust to suspect during $(t_w, t'_i]$, and thus also not from ready to recover.

Now consider node $i$. By the previous observation, it will not switch from waking to recover, but to ready, following the basic cycle. Consequently, it must wait for timeout $T_2$ to expire, i.e., it cannot switch to ready earlier than at time $t+T_2/\vartheta$. As nodes in $W$ clear their join flags upon switching to state ready, by definition of $t'_i$ node $i$ cannot switch from ready to join, but has to switch to propose. Again, by definition of $i$, it cannot do so before timeout $T_3$ or $T_4$ expires, i.e., before time
\[ t + \frac{T_2}{\vartheta} + \frac{\min\{T_3,T_4\}}{\vartheta} \overset{(5)}{=} t + \frac{T_2+T_3}{\vartheta} \overset{(4)}{>} t + T_2 + 5d. \quad (20) \]

All other nodes in $W$ will switch to waking and, for the first time after $t_j$, observe themselves in state waking at a time within $(t+T_1+4d,\ t+T_1(2+\vartheta)+7d)$. Recall that unless they memorize at least $f+1$ nodes in accept or recover while being in state waking, they will all switch to state ready by time
\[ \max\{t+T_2+4d,\ t+(\vartheta+2)T_1+7d\} \overset{(3)}{=} t+T_2+4d. \quad (21) \]
As we just showed that $t'_i > t+T_2+5d$, this implies that at time $t+T_2+5d$ all nodes are observed in state ready, and none of them leaves before time $t'_i$. Now choose $t'$ to be the infimum of times from $(t+(T_2+T_3)/\vartheta,\ t+T_2+T_4+4d]$ when a node in $W$ switches to state accept.^18 Because of Inequality (20), $t'$ is the first time any node $j \in W$ may switch to accept again after its respective time $t_j$.

^16 Note that we follow the convention that $\inf\emptyset = \infty$ if the infimum is taken with respect to a subset of $\mathbb{R}_0^+$ that is unbounded from above.
^17 The upper bound comprises an additive term of $d$, since $T_1$ is reset at some time in $(t_j, t_j+d)$.
We will next show that no node $j \in W$ can switch to recover within $[t_j, t'+2d]$. Since at time $t'_i$ node $j$ does not memorize other nodes from $W$ in state accept, it will also not do so during $[t'_i, t']$. Hence, it cannot switch from ready to recover during $[t'_i, t'+2d]$, since it cannot be in state suspect during $[t'_i, t']$. By Inequality (20), $j$ cannot switch to propose within $[t_j,\ t+(T_2+T_3)/\vartheta)$, and thus its timeout $T_5$ cannot expire until time
\[ t + \frac{T_2+T_3+T_5}{\vartheta} \overset{(6)}{\geq} t + T_2 + T_4 + 7d \geq t' + 3d, \quad (22) \]
making it impossible for $j$ to switch from propose to recover at a time within $[t_j, t'+3d]$. What is more, a node from $W$ that switches to accept must stay there for at least $T_1/\vartheta > 3d$ time. Thus, by definition of $t'$, no node $j \in W$ can switch from accept to recover at a time within $[t_j, t'+3d]$. Hence, no node $j \in W$ can switch to state recover after $t_j$, but earlier than time $t'+2d$.

As nodes reset their join flags upon switching to state ready, it follows that no node in $W$ can switch to states other than propose or accept during $[t+T_2+4d,\ t'+2d]$. In particular, no node in $W$ resets its propose flags during $[t+T_2+5d,\ t'+2d] \supset [t'_i, t'+2d]$. If at time $t'$ a node in $W$ switches to state accept, $n-2f \geq f+1$ of its propose flags corresponding to nodes in $W$ are true, i.e., in state 1. As the node reset its propose flags at the most recent time when it switched to ready, and no nodes from $W$ have been observed in propose between this time and $t'_i$, it holds that $f+1$ nodes in $W$ switched to state propose during $[t'_i, t')$. Since we established that no node resets its propose flags during $[t'_i, t'+2d]$, it follows that all nodes are in state propose by time $t'+d$.
Consequently, all nodes in $W$ will observe all nodes in $W$ in state propose before time $t'+2d$ and switch to accept, i.e., $t' \in (t+(T_2+T_3)/\vartheta,\ t+T_2+T_4+4d)$ is a stabilization point. Statement (ii) follows.

On the other hand, if at time $t'$ no node in $W$ switches to state accept, it follows that $t' = t+T_2+T_4+4d$. As all nodes observe themselves in state ready by time $t+T_2+5d$, they switch to propose before time $t+T_2+T_4+5d = t'+d$ because $T_4$ expired. By the same reasoning as in the previous case, they switch to accept before time $t'+2d$, i.e., Statement (ii) holds as well.

^18 Note that since we take the infimum over $(t+(T_2+T_3)/\vartheta,\ t+T_2+T_4+4d]$, we have that $t' \leq t+T_2+T_4+4d$.

Proof of (iii): We have shown that within $[t_j, t'+2d]$, any node $j \in W$ switches to states along the basic cycle only. Moreover, such nodes switch to accept at some time in $[t', t'+2d]$. Since $T_1 \geq 4\vartheta d$, this implies that no node observing itself in accept after time $t'$ will leave this state before time $t'+4d$. To show the correctness of Statement (iii), it is thus sufficient to prove that, whenever $j$ switches from a state $s$ of the basic cycle to a state $s'$ of the basic cycle during $[t_j+d, t'+2d] \supset [t+4d, t'+2d]$, the transition from $s$ to join or recover is disabled from the time it switches to $s'$ until it observes itself in this state. We consider the transitions tr(accept, recover), tr(waking, recover), tr(ready, recover), tr(ready, join), and tr(propose, recover) one after the other:

1. tr(accept, recover): We showed that node $j$'s tr(accept, sleep) is satisfied before time $t+4d \leq t+T_1/\vartheta$, i.e., before tr(accept, recover) can hold, and no node resets its accept flags less than $d$ time after switching to state sleep.
When $j$ switches to state accept again at or after time $t'$, $T_1$ will not expire earlier than at time $t'+4d$.

2. tr(waking, recover): As part of the reasoning in (ii), we derived that tr(waking, recover) does not hold at nodes from $W$ observing themselves in state waking.

3. tr(ready, recover) and tr(ready, join): Similarly, we proved that at no node in $W$ can the condition tr(ready, recover) or tr(ready, join) hold during $(t+(T_2+T_3)/\vartheta,\ t'+2d)$, and nodes in $W$ are in state ready during $(t+(T_2+T_3)/\vartheta,\ t'+d)$ only.

4. tr(propose, recover): Finally, the additional slack of $d$ in Inequality (22) ensures that $T_5$ does not expire at any node in $W$ switching to state accept during $(t', t'+2d)$ earlier than at time $t'+3d$.

Since $[t_j, t'+4d) \supset [t+3d, t'+4d)$, Statement (iii) follows.

Inductive application of Theorem 4.4 shows that, by construction of our algorithm, nodes in $W$ provably do not suffer from metastability upsets once a $W$-quasi-stabilization point is reached, as long as all nodes in $W$ remain non-faulty and the channels connecting them remain correct. Unfortunately, it can be shown that it is impossible to ensure this property during the stabilization period, thus rendering a formal treatment infeasible. This is not a peculiarity of our system model, but a threat in any model that allows for the possibility of metastable upsets as encountered in physical chip designs. However, it was shown that, by proper chip design, the probability of metastable upsets can be made arbitrarily small [13]. In the remainder of this work, we will therefore assume that all non-faulty nodes are metastability-free in all executions.

The next lemma reveals a very basic property of the main algorithm that is satisfied if no nodes may switch to state join in a given period of time.
It states that in order for any non-faulty node to switch to state sleep, there need to be $f+1$ non-faulty nodes supporting this by switching to state accept. Subsequently, these nodes cannot do so again for a certain time window. In particular, this implies that during the respective time window no node may switch to sleep.

Lemma 4.5. Assume that at time $t_s$ some node from $W$ switches to sleep, and no node from $W$ is in state join during $[t_s-T_1-d,\ t^+]$. Then there is a subset $A \subseteq W$ of at least $n-2f$ nodes such that
(i) each node from $A$ has been in state accept at some time in the interval $(t_s-T_1-d,\ t_s)$, and
(ii) no node from $A$ is in state propose or switches to state accept during the time interval $[t_s,\ \min\{t_s+\Delta_s,\ t^+\}]$.

Proof. In order to switch to sleep at time $t_s$, a node must have observed $n-2f$ non-faulty nodes in state accept at times from $(t_s-T_1, t_s]$, since it resets its accept flags at the time $t_a \geq t_s-T_1$ (that is minimal with this property) when it switched to state accept. Each of these nodes must have been in state accept at some time from $(t_s-T_1-d,\ t_s)$, showing the existence of a set $A \subseteq W$ satisfying Statement (i).

We next prove Statement (ii). Consider a node $i \in A$. In order to switch to propose, or again to accept, $i$ must first switch to join or wait for $T_2$ to expire after switching to state accept at some time after $t_s-2T_1-d$. However, by assumption, the first option is impossible until time $t^+$, since no nodes are in state join during $[t_s-T_1-d,\ t^+]$. Therefore, $i$ will not be in state propose or switch to state accept again until time $t_s-2T_1+T_2/\vartheta-d = t_s+\Delta_s$ or $t^+$, respectively, whichever is smaller. This proves Statement (ii).
Granted that nodes are not in state join, this implies that the time windows during which nodes may switch to sleep and to sleep→waking, respectively, are well-separated.

Corollary 4.6. Assume that during $[t^- - T_1 - d,\ t^+]$ no node from $W$ is in state join, where $t^+ - t^- \leq \Delta_s$. Then
(i) any time interval $[t_a, t_b] \subseteq [t^-, t^+]$ of minimum length containing all switches of nodes in $W$ from accept to sleep during $[t^-, t^+]$ has length at most $2T_1+3d$, and
(ii) granted that no node from $W$ switches to state sleep during $(t^- - (\vartheta+1)T_1 - d,\ t^-)$, any time interval $[t_a, t_b] \subseteq [t^-,\ t^+ + (1+1/\vartheta)T_1]$ of minimum length containing all times in $[t^-,\ t^+ + (1+1/\vartheta)T_1]$ when a node from $W$ switches to sleep→waking has length at most $\tilde\delta_s$.

Proof. Consider Statement (i) first. If there is no node from $W$ that switches from accept to sleep during $[t^-, t^+]$, the statement is trivially satisfied. Otherwise, choose any such interval $[t_a, t_b]$. Since $[t_a, t_b] \neq \emptyset$ is minimal, both at time $t_a$ and at time $t_b$ some nodes from $W$ switch to sleep. Assume by means of contradiction that $t_b - t_a > 2T_1+3d$. Due to the constraints on $t^-$ and $t^+$, we have that $t_b \leq t_a+\Delta_s$. Moreover, during $[t_a-T_1-d,\ t_b] \subseteq [t^- - T_1 - d,\ t^+]$ no node from $W$ is in state join. Thus, we can apply Lemma 4.5 to $t_a$ and see that at least $n-2f \geq f+1$ nodes from $W$ do not switch to accept in the time interval $(t_a,\ t_a+\Delta_s) \supset (t_b-(2T_1+3d),\ t_b]$. As nodes from $W$ leave state accept as soon as $T_1$ expires, these nodes are not in state accept during $[t_b-(T_1+2d),\ t_b]$, implying that they are not observed in this state during $[t_b-(T_1+d),\ t_b]$. It follows that no node in $W$ can observe more than $n-f-1$ different nodes in state accept during $[t_b-(T_1+d),\ t_b]$.
As nodes from $W$ clear their accept flags upon switching to accept and leave state accept after less than $T_1+d$ time, we conclude that no node from $W$ switches to state sleep at time $t_b$. This is a contradiction, implying that the assumption $t_b - t_a > 2T_1+3d$ must be wrong; therefore, Statement (i) must be true.

To obtain Statement (ii), observe first that any node from $W$ switching to state sleep at some time $t \leq t^- - (\vartheta+1)T_1 - d$ switches to state sleep→waking before time $t^-$. Subsequently, it needs to switch to state sleep again in order to be in state sleep→waking at or later than time $t^-$. On the other hand, every node that switches to sleep after time $t^+$ will not switch to sleep→waking again before time $t^+ + (1+1/\vartheta)T_1$. Hence, any node switching to state sleep→waking during the considered interval must switch to sleep during $[t^-, t^+]$. Applying Statement (i) to $[t^-, t^+]$ yields that nodes from $W$ can only switch to sleep within a time interval of length at most $2T_1+3d$. Considering the fastest and slowest possible transitions from sleep to sleep→waking, we obtain that nodes from $W$ can switch to sleep→waking within a time interval of length at most
\[ 2T_1+3d+(\vartheta+1)T_1+d-(1+1/\vartheta)T_1 = \tilde\delta_s. \]
Statement (ii) follows.

We are now ready to advance to proving that good resynchronization points are likely to occur within bounded time, no matter what the strategy of the Byzantine faulty nodes and channels is. To this end, we first establish that, at most times, a node switching to state init will result in a good resynchronization point. This is formalized by the following definition.

Definition 4.7 (Good Times). Given an execution $E$ of the system, denote by $E'$ any execution satisfying that $E'|_{[0,t)} = E|_{[0,t)}$, where at time $t$ a node $i \in W$ switches to state init in $E'$.
Time $t$ is good in $E$ with respect to $W$ provided that for any such $E'$ it holds that $t$ is a good $W$-resynchronization point in $E'$.

The previous statement thus boils down to showing that in any execution, the majority of the times is good.

Lemma 4.8. Given any execution $E$ and any time interval $[t^-, t^+]$, the volume of good times in $E$ during $[t^-, t^+]$ is at least
\[ \lambda^2(t^+-t^-) - \frac{11(1-\lambda)R_2}{10\vartheta}. \]

Proof. Assume w.l.o.g. that $|W| = n-f$ (otherwise, consider a subset of size $n-f$) and abbreviate
\begin{align*}
N &:= \left( \frac{\vartheta(t^+-t^-)}{R_2} + \frac{11}{10} \right)(n-f) \geq \left( \frac{\vartheta(t^+-t^-) + R_2/10}{R_2} \right)(n-f) \\
&\overset{(10)}{\geq} \left( \frac{\vartheta(t^+-t^-) + \vartheta(R_1+(\vartheta+2)T_1+T_2/\vartheta+(8\vartheta+9)d)/(5(1-\lambda))}{R_2} \right)(n-f) \\
&\overset{(1)}{\geq} \left( \frac{\vartheta\left(t^+-t^- + R_1+(\vartheta+2)T_1+T_2/\vartheta+(8\vartheta+9)d\right)}{R_2} \right)(n-f) \\
&\overset{(3)}{\geq} \left( \frac{\vartheta\left(t^+-t^- + R_1+T_1+4d+\Delta_g\right)}{R_2} \right)(n-f).
\end{align*}
The proof is in two steps: first, we construct a measurable subset of $[t^-, t^+]$ that comprises good times only; in a second step, a lower bound on the volume of this set is derived.

Constructing the set: Consider an arbitrary time $t \in [t^-, t^+]$ and assume that a node $i \in W$ switches to state init at time $t$. When it does so, its timeout $R_3$ expires. By Lemma 4.3, all timeouts of node $i$ that expire at times within $[t^-, t^+]$ have been reset at least once until time $t^-$. Let $t_{E_3}$ be the maximum time not later than $t$ when $R_3$ was reset. Due to the distribution of $R_3$, we know that
\[ t_{E_3} \overset{(11)}{\leq} t - (R_2+3d). \]
Thus, node $i$ is not in state init during $[t-(R_2+2d),\ t)$, and no node $j \in W$ observes $i$ in state init during $[t-(R_2+d),\ t)$. Thereby, any node $j$'s, $j \in W$, timeout $(R_2, supp_i)$ corresponding to node $i$ is expired at time $t$. We claim that the condition that no node from $W$ is in, or observed in, one of the states resync or supp→resync at time $t$ is sufficient for $t$ being a $W$-resynchronization point.
To see this, assume that the condition is satisfied. Thus, all nodes $j \in W$ are in states none or supp$_k$ for some $k \in \{1,\ldots,n\}$ at time $t$. By the algorithm, they all will switch to state supp$_i$ or state supp→resync during $(t, t+d)$. It might happen that they subsequently switch to another state supp$_{k'}$ for some $k' \in V$, but all of them will be in one of the states with signal supp during $(t+d, t+2d]$. Consequently, all nodes will observe at least $n-f$ nodes in state supp during $(t', t+2d)$ for some time $t' < t+2d$. Hence, those nodes in $W$ that were still in state supp$_i$ (or supp$_{k'}$ for some $k'$) at time $t+d$ switch to state supp→resync before time $t+2d$, i.e., $t$ is a $W$-resynchronization point.

We proceed by analyzing under which conditions $t$ is a good $W$-resynchronization point. Recall that in order for $t$ to be good, it has to hold that no node from $W$ switches to state sleep during $(t-\Delta_g, t)$ or is in state join during $(t-T_1-d,\ t+4d)$. We begin by characterizing subsets of good times within $(t_r, t'_r) \subseteq [t^-, t^+]$, where $t_r$ and $t'_r$ are times such that during $(t_r, t'_r)$ no node from $W$ switches to state supp→resync. Due to timeout
\[ R_1 \overset{(9)}{\geq} (4\vartheta+2)d, \]
we know that during $(t_r+R_1+2d,\ t'_r)$, no node from $W$ will be in, or be observed in, states supp→resync or resync. Thus, if a node from $W$ switches to init at a time within $(t_r+R_1+2d,\ t'_r)$, it is a $W$-resynchronization point. Further, all nodes in $W$ will be in state dormant during $(t_r+R_1+2d,\ t'_r+4d)$. Thus, all nodes in $W$ will be observed in state dormant during $(t_r+R_1+3d,\ t'_r+4d)$, implying that they are not in state join during $(t_r+R_1+3d,\ t'_r+4d)$. In particular, any time $t \in (t_r+R_1+T_1+4d,\ t'_r)$ satisfies that no node in $W$ is in state join during $(t-T_1-d,\ t+4d)$.
Further, define $t_a$ to be the infimum of times from $(t_r+R_1+T_1+4d,\ t'_r]$ when a node from $W$ switches to state sleep. By Corollary 4.6, no node from $W$ switches to state sleep during $(t_a+\delta_s,\ \min\{t_a+\Delta_s,\ t'_r\})$. Hence, if $t_a < \infty$, all times in both $(t_r+R_1+T_1+4d+\Delta_g,\ t_a)$ and $(t_a+\delta_s+\Delta_g,\ \min\{t_a+\Delta_s,\ t'_r\})$ are good. In case $t_a < t'_r-\Delta_s$, we can repeat the reasoning, defining $t'_a$ as the infimum of times from $[t_a+\Delta_s,\ t'_r]$ when a node switches to state sleep. By arguments analogous to the ones before, we see that all times in the sets $[t_a+\Delta_s,\ t'_a)$ and $(t'_a+\delta_s+\Delta_g,\ \min\{t'_a+\Delta_s,\ t'_r\})$ are good. By induction on the times $t_a, t'_a, \ldots, t_a^{(k)}$ (halting once $t_a^{(k)} \geq t'_r-\Delta_s$), we infer that the total volume of times from $(t_r, t'_r)$, and indeed from $(t_r+R_1+T_1+4d+\Delta_g,\ t'_r)$, that is good is at least
\[ \left\lfloor \frac{t'_r-(t_r+R_1+T_1+4d+\Delta_g)}{\Delta_s} \right\rfloor (\Delta_s-\Delta_g-\delta_s) > \frac{t'_r-(t_r+R_1+T_1+4d+\Delta_g+\Delta_s)}{\Delta_s}(\Delta_s-\Delta_g-\delta_s). \quad (23) \]
In other words, up to a constant loss in each interval $(t_r, t'_r)$, a constant fraction of the times is good.

Volume of the set: In order to infer a lower bound on the volume of good times during $[t^-, t^+]$, we subtract from $[t^-, t^+]$ all intervals $[t_r,\ t_r+R_1+T_1+4d+\Delta_g]$, where a node from $W$ switches to supp→resync at a time $t_r$ within $[t^- - (R_1+T_1+4d+\Delta_g),\ t^+]$. Formally, define
\[ \bar G = \bigcup_{\substack{t_r \in [t^- - (R_1+T_1+4d+\Delta_g),\ t^+] \\ \exists i \in W:\ i \text{ switches to supp→resync at } t_r}} [t_r,\ t_r+R_1+T_1+4d+\Delta_g]. \]
What remains is the set $[t^-, t^+] \setminus \bar G$, which has as a subset the union of the intervals $(t_r+R_1+T_1+4d+\Delta_g,\ t'_r) \subseteq [t^-, t^+]$, where $t_r$ and $t'_r$ are times at which a node from $W$ switches to supp→resync and no node from $W$ switches to supp→resync within $(t_r, t'_r)$.
Note that for each such interval, we already know that it contains a certain volume of good times because of Inequality (23). In order to lower bound the good times in $[t^-, t^+]$, it is thus feasible to lower bound the volume and the number of connected components (i.e., maximal intervals) of any subset of $[t^-, t^+] \setminus \bar G$.

Observe that any node in $W$ does not switch to state init more than
\[ \left\lceil \frac{t^+-t^-+R_1+T_1+4d+\Delta_g}{R_3} \right\rceil \overset{(11)}{\leq} \left\lceil \frac{t^+-t^-+R_1+T_1+4d+\Delta_g}{R_2+d} \right\rceil \leq \frac{N}{n-f} \quad (24) \]
times during $[t^- - (R_1+T_1+4d+\Delta_g),\ t^+]$. Now consider the case that a node in $W$ switches to state supp→resync at a time $t$ satisfying that no node in $W$ switched to state init during $(t-(8\vartheta+6)d,\ t)$. This necessitates that this node observes $n-f$ of its channels in state supp during $(t-(2\vartheta+1)d,\ t)$, at least $n-2f \geq f+1$ of which originate from nodes in $W$. As no node from $W$ switched to init during $(t-(8\vartheta+6)d,\ t)$, every node that has not observed a node $i \in V \setminus W$ in state init at a time from $(t-(8\vartheta+4)d,\ t)$ when $(R_2, supp_i)$ is expired must be in a state whose signal is none during $(t-(2\vartheta+2)d,\ t)$ due to timeouts. Therefore, its outgoing channels are not in state supp during $(t-(2\vartheta+1)d,\ t)$. By means of contradiction, it thus follows that for each node $j$ of the at least $f+1$ nodes (which are all from $W$), there exists a node $i \in V \setminus W$ such that node $j$ resets timeout $(R_2, supp_i)$ during the time interval $(t-(8\vartheta+4)d,\ t)$. The same reasoning applies to any time $t' \notin (t-(8\vartheta+6)d,\ t)$ satisfying that some node in $W$ switches to state supp→resync at time $t'$ and no node in $W$ switched to state init during $(t'-(8\vartheta+6)d,\ t')$.
Note that the set of the respective at least $f+1$ events (corresponding to the at least $f+1$ nodes from $W$) where timeouts $(R_2, supp_i)$ with $i \in V \setminus W$ are reset, and the set of the events corresponding to $t$, are disjoint. However, the total number of events where such a timeout can be reset during $[t^- - (R_1+T_1+4d+\Delta_g),\ t^+]$ is upper bounded by
\[ |V \setminus W|\,|W| \left\lceil \frac{t^+-t^-+R_1+T_1+4d+\Delta_g}{R_2/\vartheta} \right\rceil < (f+1)N, \quad (25) \]
i.e., the total number of channels from nodes not in $W$ ($|V \setminus W|$ many) to nodes in $W$, multiplied by the number of times the associated timeout can expire at the receiving node in $W$ during $[t^- - (R_1+T_1+4d+\Delta_g),\ t^+]$.

With the help of Inequalities (24) and (25), we can show that $\bar G$ can be covered by fewer than $2N$ intervals of size $(R_1+T_1+4d+\Delta_g)+(8\vartheta+6)d$ each. By Inequality (24), there are no more than $N$ times $t \in [t^- - (R_1+T_1+4d+\Delta_g),\ t^+]$ when a non-faulty node switches to init and thus may cause others to switch to state supp→resync at times in $[t,\ t+(8\vartheta+6)d]$. Similarly, Inequality (25) shows that the channels from $V \setminus W$ to $W$ may cause at most $N-1$ such times $t \in [t^- - (R_1+T_1+4d+\Delta_g),\ t^+]$, since any such time requires the existence of at least $f+1$ events where timeouts $(R_2, supp_i)$, $i \in V \setminus W$, are reset at nodes in $W$, and the respective events are disjoint. Thus, all times $t_r \in [t^- - (R_1+T_1+4d+\Delta_g),\ t^+]$ when some node $i \in W$ switches to supp→resync are covered by at most $2N-1$ intervals of length $(8\vartheta+6)d$. This results in a cover $\bar G' \supseteq \bar G$ consisting of at most $2N-1$ intervals that satisfies
\[ \mathrm{vol}(\bar G) \leq \mathrm{vol}(\bar G') < 2N(R_1+T_1+\Delta_g+(8\vartheta+10)d). \]
Summing over the at most 2N intervals that remain in [t⁻, t⁺] \ Ḡ′ and using Inequality (23), we conclude that the volume of good times during [t⁻, t⁺] is at least

⌊(t⁺ − t⁻ − 2N(R_1 + T_1 + (8ϑ+10)d + ∆_g + ∆_s))/∆_s⌋ (∆_s − ∆_g − δ_s)
= ⌊(t⁺ − t⁻ − 2N(R_1 + (ϑ+2)T_1 + T_2/ϑ + (8ϑ+9)d))/∆_s⌋ (∆_s − ∆_g − δ_s)   (using (12))
≥ λ (t⁺ − t⁻ − 2(ϑ(t⁺ − t⁻)/R_2 + 11/10)(n − f)(R_1 + (ϑ+2)T_1 + T_2/ϑ + (8ϑ+9)d))
= λ (1 − 2ϑ(R_1 + (ϑ+2)T_1 + T_2/ϑ + (8ϑ+9)d)(n − f)/R_2)(t⁺ − t⁻) − 11λ(R_1 + (ϑ+2)T_1 + T_2/ϑ + (8ϑ+9)d)(n − f)/5
≥ λ(t⁺ − t⁻)/2 − 11(1 − λ)R_2/(10ϑ),   (using (10))

as claimed. The lemma follows.

We are now in the position to prove our second main theorem, which states that a good resynchronization point occurs within O(R_2) time with overwhelming probability.

Theorem 4.9. Denote by Ê_3 := ϑ(R_2 + 3d) + 8(1 − λ)R_2 + d the maximal value the distribution R_3 can attain plus the at most d time until R_3 is reset whenever it expires. For any k ∈ ℕ and any time t, with probability at least 1 − (1/2)^{k(n−f)} there will be a good W-resynchronization point during [t, t + (k+1)Ê_3].

Proof. Assume w.l.o.g. that |W| = n − f (otherwise consider a subset of size n − f). Fix some node i ∈ W and denote by t_0 the infimum of times from [t, t + (k+1)Ê_3] when node i switches to init. We have that t_0 < t + Ê_3. By induction, it follows that node i will switch to state init at least another k times during [t, t + (k+1)Ê_3], at times t_1 < t_2 < . . . < t_k. We claim that each such time t_j, j ∈ {1, . . . , k}, is, independently, good — and therefore a good W-resynchronization point — with probability at least 1/2. We prove this by induction on j: As induction hypothesis, suppose for some j ∈ {1, . . .
, k − 1}, we have shown the statement for all j′ ∈ {1, . . . , j − 1} and the execution of the system is fixed until time t_{j−1}, i.e., E|_{[0, t_{j−1}]} is given. Now consider the set of executions that are extensions of E|_{[0, t_{j−1}]} and have the same clock functions as E. For each such execution E′ it holds that E′|_{[0, t_{j−1}]} = E|_{[0, t_{j−1}]}, and all nodes' clocks make progress in E′ as in E. Clearly, each such E′ has its own time t_j < t + (j+1)Ê_3 when R_3 expires next after t_{j−1} at node i, and i switches to init. We next characterize the distribution of the times t_j. As the rate of the clock driving node i's R_3 is between 1 and ϑ, t_j > t_{j−1} is contained in an interval, call it [t⁻, t⁺], of size at most t⁺ − t⁻ ≤ 8(1 − λ)R_2, regardless of the progress that i's clock C makes in any execution E′. Certainly we can apply Lemma 4.8 also to each of the E′, showing that the volume of times from [t⁻, t⁺] that are not good in E′ is at most (1 − λ/2)(t⁺ − t⁻) + 11(1 − λ)R_2/(10ϑ). Since clock C can make progress no faster than at rate ϑ and the probability density of R_3 is constantly 1/(8(1 − λ)R_2) (with respect to the clock function C), we obtain that the probability of t_j not being a good time is upper bounded by

((1 − λ/2)(t⁺ − t⁻) + 11(1 − λ)R_2/(10ϑ)) / (8(1 − λ)R_2/ϑ) ≤ ϑ(1 − λ/2) + 11/80 < 9/25 + 7/50 = 1/2,

where the strict inequality uses (1). Here we use that the time when R_3 expires is independent of E′|_{[0, t_{j−1}]}. We complete our reasoning as follows. Given E|_{[0, t_{j−1}]}, we permit an adversary to choose E′, including the random bits of all nodes and full knowledge of the future, with the exception that we deny it control or knowledge of the time t_j when R_3 expires at node i, i.e., E′ is an imaginary execution in which R_3 does not expire at i at any time greater than t_{j−1}.
Note that for the good W-resynchronization points we considered, the choice of E′ does not affect the probability that t_1, . . . , t_{j−1} are good W-resynchronization points: The conditions referring to times greater than a W-resynchronization point t, i.e., that all nodes in W switch to state supp→resync during (t, t + 2d) and no node in W shall be in state join during (t − T_1 − d, t + 4d), are already fully determined by the history of the system until time t. As we fixed E′, the behaviour of the clock driving R_3 is fixed as well. Next, we determine the time t_j when R_3 expires according to its distribution, given the behaviour of node i's clock. The above reasoning shows that time t_j is good in E′ with probability at least 1/2, independently of E′|_{[0, t_{j−1}]} = E|_{[0, t_{j−1}]}. We define that E|_{[0, t_j)} = E′|_{[0, t_j)} and that in E node i switches to state init at time t_j (because R_3 expired). As — conditional on the clock driving R_3 and t_{j−1} being specified — t_j is independent of E|_{[0, t_j)}, E is indistinguishable from E′ until time t_j. Because t_j is good with probability at least 1/2 independently of E′|_{[0, t_{j−1}]} = E|_{[0, t_{j−1}]}, so it is in E. Hence, in E, t_j is a good W-resynchronization point with probability at least 1/2, independently of E|_{[0, t_{j−1}]}. Since E′ was chosen in an adversarial manner, this completes the induction step. In summary, we showed that for any node in W and any execution (in which we do not manipulate the times when R_3 expires at the respective node), starting from the second time during [t, t + (k+1)Ê_3] when R_3 expires at the respective node, there is a probability of at least 1/2 that the respective time is a good W-resynchronization point.
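Each expiry of R_3 thus behaves like an independent trial succeeding with probability at least 1/2, so the failure probability decays geometrically in the number of trials. As a minimal numerical sketch (the concrete values of n, f, and k below are hypothetical, chosen only for illustration):

```python
# Minimal sketch (illustrative only): if each of k*(n-f) expiries of R_3
# independently yields a good W-resynchronization point with probability at
# least 1/2, then missing all of them is at most as likely as k*(n-f) fair
# coins all showing tails.

def failure_bound(k, n, f):
    """Upper bound (1/2)^(k*(n-f)) on having no good point in [t, t+(k+1)*E3_hat]."""
    return 0.5 ** (k * (n - f))

# Hypothetical parameters: n = 4, f = 1, k = 10 gives a bound of 2^-30,
# i.e., below one in a billion.
assert failure_bound(10, 4, 1) == 2.0 ** -30
assert failure_bound(10, 4, 1) < 1e-9
```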
Since we assumed that |W| = n − f and there are at least k such times for each node in W, this implies that having no good W-resynchronization point during [t, t + (k+1)Ê_3] is at least as unlikely as k(n − f) unbiased and independent coin flips all showing tails, i.e., it has probability at most (1/2)^{k(n−f)}. This concludes the proof.

Having established that eventually a good W-resynchronization point t_g will occur, we turn to proving the convergence of the main routine. We start with a few helper statements wrapping up that a good resynchronization point guarantees proper reset of the flags and timeouts involved in the stabilization process of the main routine.

Lemma 4.10. Suppose t_g is a good W-resynchronization point. Then

(i) each node i ∈ W switches to passive at a time t_i ∈ (t_g + 4d, t_g + (4ϑ+3)d) and observes itself in state dormant during [t_g + 4d, τ_{i,i}(t_i)),
(ii) Mem_{i,j,join}|_{[τ_{i,i}(t_i), t_join]} ≡ 0 for all i, j ∈ W, where t_join ≥ t_g + 4d is the infimum of all times greater than t_g − T_1 − d when a node from W switches to join,
(iii) Mem_{i,j,sleep→waking}|_{[τ_{i,i}(t_i), t_s]} ≡ 0 for all i, j ∈ W, where t_s ≥ t_g + (1 + 1/ϑ)T_1 is the infimum of all times greater than or equal to t_g when a node from W switches to sleep→waking,
(iv) no node from W resets its sleep→waking flags during [t_g + (1 + 1/ϑ)T_1, t_g + R_1/ϑ], and
(v) no node from W resets its join flags due to switching to passive during [t_g + (1 + 1/ϑ)T_1, t_g + R_1/ϑ].

Proof. All nodes in W switch to state supp→resync during (t_g, t_g + 2d) and switch to state resync when their timeout of 4ϑd expires, which does not happen until time t_g + 4d. Once this timeout has expired, they switch to state passive as soon as they observe themselves in state resync, i.e., by time t_g + (4ϑ+3)d.
Hence, every node i ∈ W does not observe itself in state resync within [t_g + 3d, τ_{i,i}(t_i)), and therefore is in state dormant during [t_g + 3d, τ_{i,i}(t_i)]. This implies that it observes itself in state dormant during [t_g + 4d, τ_{i,i}(t_i)), completing the proof of Statement (i). Moreover, from the definition of a good W-resynchronization point we have that no nodes from W are in state join at times in [t_g − T_1 − d, t_join). Statement (ii) follows, as every node from W resets its join flags upon switching to state passive at time t_i. Regarding Statement (iii), observe first that no nodes from W are in state sleep→waking during (t_g − d, t_g + (1 + 1/ϑ)T_1) for the following reason: By the definition of a good W-resynchronization point, no node from W switches to sleep during (t_g − ∆_g, t_g) ⊇ (t_g − (ϑ+1)T_1 − 3d, t_g). Any node in W that is in states sleep or sleep→waking at time t_g − (ϑ+1)T_1 − 3d switches to state waking before time t_g − d due to timeouts. Finally, any node in W switching to sleep at or after time t_g will not switch to state sleep→waking before time t_g + (1 + 1/ϑ)T_1. The observation follows. Since nodes in W reset their sleep→waking flags at some time from

[t_i, τ_{i,i}(t_i)] ⊂ (t_g + 3d, t_g + (4ϑ+4)d) ⊆ (t_g + 3d, t_g + (1 + 1/ϑ)T_1)   (using (2)),

Statement (iii) follows. Statements (iv) and (v) follow from the fact that all nodes in W switch to state passive by time

t_g + (3 + 4ϑ)d ≤ t_g + (1 + 1/ϑ)T_1 − d   (using (2)),

while timeout (R_1, supp→resync) must expire first in order to switch to dormant and subsequently passive again.
Before we proceed, in the next lemma we make the basic yet crucial observation that after a good W-resynchronization point t_g, no node from W will switch to state join until either time t_g + T_7/ϑ + 4d or until T_6/ϑ time after the first non-faulty node switched to sleep→waking again after t_g. By proper choice of T_6 and T_7 > T_6, this will guarantee that nodes from W do not switch to join prematurely during the final steps of the stabilization process.

Lemma 4.11. Suppose t_g is a good W-resynchronization point. Denote by t_s the infimum of times greater than t_g when a node in W switches to state sleep→waking and by t_join the infimum of times greater than t_g − T_1 − d when a node in W switches to state join. Define t⁺ := t_g + ∆_s − ∆_g + δ̃_s + T_2 + T_4 + T_5 + d. Then, starting from time t_g + 4d, tr(recover, join) is not satisfied at any node in W until time

min{t_s + T_6/ϑ, t_g + T_7/ϑ + 4d} ≥ min{t_s + ∆_s, t⁺},

and t_join is larger than this time.

Proof. By Statements (ii) and (iii) of Lemma 4.10 and Inequality (2), we have that t_s ≥ t_g + T_1 + 4d ≥ t_g + (4ϑ+4)d and t_join ≥ t_g + 4d. Consider a node i ∈ W not observing itself in state dormant at some time t ∈ [t_g + 4d, t_join]. According to Statements (i) and (ii) of Lemma 4.10, the threshold condition of f + 1 nodes memorized in state join cannot be satisfied at such a node. By Statements (i) and (iii) of the lemma, the threshold condition of f + 1 nodes memorized in state sleep→waking cannot be satisfied unless t > t_s. Hence, if at time t a node from W satisfies that it observes itself in state active and T_6 expired, we have that t > t_s + T_6/ϑ. Moreover, by Statement (i) of Lemma 4.10, we have that if T_7 is expired at any node in W at time t, it holds that t > t_g + T_7/ϑ + 4d.
Altogether, we conclude that tr(recover, join) is not satisfied at any node in W during

[t_g + 4d, min{t_s + T_6/ϑ, t_g + T_7/ϑ + 4d}] ⊇ [t_g + 4d, min{t_s + ∆_s, t⁺}]   (using (7) and (8)).

In particular, t_join must be larger than the upper boundary of this interval, concluding the proof.

Before we can move on to proving eventual stabilization, we need one last key lemma. Essentially, it states that after a good W-resynchronization point, any node in W switches to recover or to sleep→waking within bounded time, and all nodes in W doing the latter will do so in rough synchrony, i.e., within a time window of δ̃_s. Using the previous lemma, we can show that this happens before the transition to join is enabled for any node.

Lemma 4.12. Suppose t_g is a good W-resynchronization point and use the notation of Lemma 4.11. Then either

(i) t_s < t⁺ − ∆_s and any node in W switches to state sleep→waking at some time in [t_s, t_s + δ̃_s] or is observed in state recover during [t_s + T_1 + T_5, t_join], or
(ii) all nodes in W are observed in state recover during [t⁺, t_join].

Proof. By Lemma 4.11, it holds that

t_join > min{t_s + ∆_s, t⁺}.   (26)

For any node in W, consider the supremum t of all times smaller than or equal to t_g − ∆_g when it switched to sleep. After that, it observed itself in state waking before time

t + (ϑ+1)T_1 + 3d ≤ t_g − T_1 − d   (27)

(w.l.o.g. assuming that the node has ever been in state sleep since it became non-faulty). By the definition of a good W-resynchronization point, nodes in W are not in state join during (t_g − T_1 − d, t_join) and do not switch to state sleep during (t_g − ∆_g, t_s). Continuing to execute the basic cycle after time t_g − d > t_g − T_1 − d thus necessitates that the node is in one of the states waking, ready, propose, or accept at time t_g − d.
Assume that it is in state waking (we just showed that if not, it already was in waking by time t_g − d). As timeout T_2 cannot have been reset later than time t − T_1/ϑ + d ≤ t_g − ∆_g − T_1/ϑ + d at the respective node, it observes itself in state ready by time t_g − ∆_g − T_1/ϑ + T_2 + 2d, in state propose by time t_g − ∆_g − T_1/ϑ + T_2 + T_4 + 3d, in state accept by time t_g − ∆_g − T_1/ϑ + T_2 + T_4 + T_5 + 4d, in state sleep by time t_g − ∆_g + (1 − 1/ϑ)T_1 + T_2 + T_4 + T_5 + 5d, and must switch to sleep→waking before time t⁺ − ∆_s. We next distinguish between two cases:

Case 1: Assume that t_s < t⁺ − ∆_s. We already established that no node in W observes itself in states sleep or sleep→waking at time t_s, and by Inequality (27), any node in W observing itself in states waking or ready reset its accept flags after time t_g − T_1 − d. Denote by t′_s ∈ (t_s − (ϑ+1)T_1 − d, t_s − (1 + 1/ϑ)T_1) the minimal time greater than or equal to t_g when a node from W switches to state sleep; by the timeout condition for switching from sleep to sleep→waking and the definitions of t_s and good W-resynchronization points, such a time exists. According to Lemma 4.5, at least f + 1 nodes have been in state accept at times in (t′_s − T_1 − d, t′_s). By Statements (i) and (iii) of Lemma 4.10, all nodes are in state passive until at least time t_s. Hence, any nodes from W observing themselves in state waking or ready at time t′_s + d satisfy tr(waking, recover) or tr(unsuspect, suspect), respectively. Consequently, they will leave these states no later than time

t′_s + (2ϑ+2)d ≤ t_s − (1 + 1/ϑ)T_1 + (2ϑ+2)d ≤ t_s − 4d   (using (2)).

It follows that any nodes from W that are in state propose at time t_s have observed themselves in this state since at least time t_s − 3d, implying that they switch to states accept or recover by time t_s + T_5 − 3d.
After switching to accept, a node from W switches to sleep and subsequently to sleep→waking within another (2ϑ+1)T_1 + 2d time, or it is observed in state recover after less than T_1 + 2d time. Thus, as

t_join > t_s + ∆_s − (ϑ − 1/ϑ)T_1 − d > t_s + T_1 + T_5 − d   (using (3)),

all nodes in W that do not switch to state sleep→waking during

[t_s, t_s + (2ϑ+1)T_1 + T_5 − d] ⊆ [t_s, t_s + ∆_s − (ϑ − 1/ϑ)T_1 − d] ⊆ [t_s, t′_s + ∆_s + (1 + 1/ϑ)T_1]   (using (3))

are observed in state recover at time t_s + T_1 + T_5. Because t_join > t_s + ∆_s − (ϑ − 1/ϑ)T_1 − d and no nodes from W switch to state sleep during (t_g − ∆_g, t_s), we can apply Statement (ii) of Corollary 4.6 to conclude that no nodes from W switch to state sleep→waking during

[t_s + δ̃_s, t′_s + ∆_s + (1 + 1/ϑ)T_1],

i.e., any node from W that does not switch to state sleep→waking during [t_s, t_s + δ̃_s] is observed in state recover during [t_s + T_1 + T_5, t_join]. Statement (i) follows.

Case 2: Assume t_s ≥ t⁺ − ∆_s. Then by Inequality (26), t_join ≥ t⁺ holds. By the definition of t_s, the first node in W switching to sleep→waking after t_g does so at time t_s, and by the arguments given above, no node from W executing the basic cycle does so later than t⁺ − ∆_s < t⁺ − d. Hence, it is observed in state recover during [t⁺, t_join], as it cannot leave recover through join before time t_join. Hence Statement (ii) holds and the proof concludes.

We have everything in place for proving that a good resynchronization point leads to stabilization within R_1/ϑ − 3d time.

Theorem 4.13. Suppose t_g is a good W-resynchronization point. Then there is a quasi-stabilization point during (t_g, t_g + R_1/ϑ − 3d].

Proof.
For simplicity, assume during this proof that R_1 = ∞, i.e., by Statement (i) of Lemma 4.10, all nodes in W observe themselves in states passive or active at times greater than or equal to t_g + (4ϑ+4)d. We will establish the existence of a quasi-stabilization point at a time larger than t_g and show that it is upper bounded by t_g + R_1/ϑ − 3d. Hence this assumption can be made w.l.o.g., as the existence of the quasi-stabilization point depends on the execution up to time t_g + R_1/ϑ only, and R_1 cannot expire before this time at any node in W. We use the notation of Lemma 4.11. By Statement (ii) of Lemma 4.10 and Inequality (2), we have that t_s ≥ t_g + T_1 + 4d ≥ t_g + (4ϑ+4)d. By Lemma 4.11, it holds that t_join > min{t_s + ∆_s, t⁺}. We differentiate several cases.

Case 1: Assume t_s ≥ t⁺ − ∆_s. According to Lemma 4.10, all nodes in W switched to state passive during (t_g + 4d, t_g + (3 + 4ϑ)d), implying that at any node in W, T_7 will expire at some time from (t_g + T_7/ϑ + 4d, t_g + T_7 + (4ϑ+4)d). By Lemma 4.12, we have that all non-faulty nodes are observed in state recover during [t⁺, t_join]. By Statement (v) of Lemma 4.10, no node in W resets its join flags after time t⁺ before it switches to state propose, returning to the basic cycle. Thus, any node from W will switch to state join before time t_g + T_7 + (4ϑ+4)d and switch to propose as soon as it memorizes all non-faulty nodes in state join. Denote by t_p ∈ (t_g + T_7/ϑ + 4d, t_g + T_7 + (4ϑ+5)d) the minimal time when a node from W switches from join to propose. Certainly, nodes in W do not switch from waking to ready during (t_p, t_p + 2d) and therefore also do not reset their join flags before time t_p + 3d. As nodes in W reset their propose and accept flags upon switching to state join, some node in W must memorize n − 2f ≥ f + 1 non-faulty nodes in state join at time t_p.
According to Statement (ii) of Lemma 4.10, these nodes must have switched to state join at or after time t_join. Hence, all nodes in W will memorize them in state join by time t_p + d and thus have switched to state join. Hence, all nodes in W will switch to state propose before time t_p + 2d and subsequently to state accept before time t_p + 3d, i.e., t_p ≤ t_g + T_7 + (4ϑ+5)d is a quasi-stabilization point.

Case 2a: Assume t_s < t⁺ − ∆_s and fewer than f + 1 nodes in W switch to sleep→waking during [t_s, t_s + δ̃_s]. We then have that t⁺ > t_s + T_1 + T_5 (using (3)). According to Lemma 4.12, any node in W that does not switch to state sleep→waking is observed in state recover during [t_s + T_1 + T_5, t_join]. Thus, any node in W will observe at least n − 2f ≥ f + 1 nodes from W in state recover during [t_s + T_1 + T_5, t_join]. As nodes in W reset their propose flags when switching to state ready and

t_s + T_1 + T_5 ≤ t_s + (T_2 + T_3)/ϑ − (ϑ+2)T_1 − (2ϑ+4)d   (using (3) and (4)),

a node from W switching to state sleep→waking at or after time t_s cannot switch to propose via states waking and ready before time t_s + T_1 + T_5 + (2ϑ+1)d. Any node in W switching to state sleep→waking during [t_s, t_s + δ̃_s] will observe itself in state waking before time

t_s + δ̃_s + 2d ≤ t_s + ∆_s − (2ϑ+2)d ≤ t⁺ − (2ϑ+2)d   (using (2) and (3)).

By Lemma 4.11, tr(recover, join) cannot be satisfied at any node in W until time min{t_s + ∆_s, t⁺}. Thus, we have that no node from W switches from ready to join during [t_s, t_join) by the definition of t_join, and any node in W that observes itself in states ready and suspect will switch to state recover once (2ϑd, suspect) expires.
In summary, any node in W switching to state sleep→waking at some time in [t_s, t_s + δ̃_s] will switch from waking to recover or from unsuspect to suspect by time t_s + ∆_s − (2ϑ+2)d, and in the latter case it cannot leave state ready before switching to state recover due to tr(ready, recover) being satisfied. As the latter happens before time t_s + ∆_s − d < t_join − d, all nodes in W are observed in state recover during [t_s + ∆_s, t_join]. From here we can argue analogously to the first case, i.e., there exists a quasi-stabilization point t_p ≤ t_g + T_7 + (4ϑ+5)d.

Case 2b: Assume t_s < t⁺ − ∆_s and at least f + 1 nodes in W switch to sleep→waking during [t_s, t_s + δ̃_s]. By Statements (ii) and (iv) of Lemma 4.10, no node from W resets its sleep→waking flags at or after time t_s ≥ t_g + (1 + 1/ϑ)T_1. Hence, by Statement (i) of the lemma, all nodes in W switch to active during (t_s, t_s + δ̃_s + d). Between T_6/ϑ and T_6 + d time later, T_6 will expire. We have that

t_s + T_6/ϑ < t⁺ − ∆_s + T_6/ϑ ≤ t_g + T_7/ϑ + 4d   (using (8)).

Thus, according to Lemma 4.11, t_join > t_s + T_6/ϑ. On the other hand, at the latest once T_6 expires, tr(recover, join) holds at every node. By time

t_s + T_6/ϑ ≥ t_s + δ̃_s − (1 − 1/ϑ)T_1 + T_2 + 2d   (using (7)),

the nodes in W that switched to state sleep→waking observe themselves in state ready because of timeouts or are in state recover. By Statement (v) of Lemma 4.10, after this time no node in W resets its join flags again before it runs through the basic cycle again and switches to state ready.
Hence, all nodes in W will switch to states join or propose by time

max{t_s + δ̃_s − (1 + 1/ϑ)T_1 + T_2 + T_4 + 2d, t_s + δ̃_s + T_6 + 3d} + d = t_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 7d   (using (4) and (5)),

where we accounted for an additional delay of d due to a possible transition from ready to recover just before time t_s + δ̃_s + T_6 + 2d and, if no node from W switches from ready to join, all nodes in W needing to be observed in state join for a node in W to switch to state propose. It follows that a minimal time t_p ∈ (t_s + T_6/ϑ, t_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 7d) exists when a node from W switches to state propose. Again, we distinguish two cases.

Case 2b-I: Assume that some node in W switches from state join to state propose at time t_p. Thus, there must be at least n − 2f ≥ f + 1 non-faulty nodes in state join at time t_p − ε (for some arbitrarily small ε > 0), as any propose or accept flag corresponding to a non-faulty node has been reset at a time t satisfying that the respective node has not been observed in one of these states during [t, t_p]. Thus, all nodes in W will switch to states join or propose before time t_p + d. At time t_p + 2d, they will observe all non-faulty nodes in one of the states join, propose, or accept, i.e., they switch to state propose before time t_p + 2d. Finally, they will observe all non-faulty nodes in states propose or accept before time t_p + 3d < t_p + T_1/ϑ and switch to state accept. As t_p is minimal, we conclude that all nodes in W switched to state accept during (t_p, t_p + 3d), i.e., t_p is a quasi-stabilization point.

Case 2b-II: Otherwise, some node in W switched from state ready to state propose at time t_p.
As we have that

t_s + δ̃_s + T_6 + 4d ≤ t_s − (ϑ+1)T_1 + (T_2 + T_3)/ϑ − d   (using (4)),

T_6 is expired at all nodes in W since time t_p − 2d, i.e., tr(recover, join) is satisfied at all nodes in W since time t_p − 2d. Hence, all nodes in W are observed in states ready or join at time t_p, and no node from W may switch to state recover again or reset its propose flags before first switching to resync or accept after time t_p. Denote by t_a the infimum of times greater than t_p when a node from W switches to accept, and assume for the moment that no node from W may switch from propose to recover before first switching to accept after time t_p. As nodes in W reset their propose flags upon switching to states ready or join, there must be n − 2f ≥ f + 1 non-faulty nodes that switched to state propose during [t_p, t_a) (unless t_a = ∞, which will be ruled out shortly). Thus, all nodes in W leave state ready before time t_a + d, and are observed in states propose or join before time t_a + 2d. Recalling that all nodes in W switch to states join or propose by time t_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 7d, we get that indeed all nodes in W are observed in one of these states after time t_p and before time

min{t_a + 2d, t_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 8d}.

Thus, at any node from W, tr(join, propose) will be satisfied before this time, and it will be observed in state propose less than d time later. It follows that all nodes in W switch to state accept before time

t_q + 3d := min{t_a + 3d, t_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 9d},

i.e., t_q is a quasi-stabilization point. As we made the assumption that no node from W switches from propose to recover before switching to accept, we need to show that T_5 does not expire at any node from W in state propose until time t_q + 3d.
This holds true because

t_p + T_5/ϑ > t_s + (T_5 + T_6)/ϑ ≥ t_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 9d ≥ t_q + 3d   (using (6)).

It remains to check that in all cases, the obtained quasi-stabilization point t_q occurs no later than time t_g + R_1/ϑ − 3d. In Cases 1 and 2a, we have that

t_q ≤ t_g + T_7 + (4ϑ+5)d ≤ t_g + R_1/ϑ − 3d   (using (9)).

In Case 2b, it holds that

t_q ≤ t_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 9d ≤ t⁺ − ∆_s + (ϑ + 1 − 2/ϑ)T_1 + T_2 + T_4 + 9d ≤ t_g + R_1/ϑ − 3d   (using (9)).

We conclude that indeed all nodes in W switch to accept within a window of less than 3d time before, at any node in W, R_1 expires and it leaves state resync, concluding the proof.

Finally, putting together our main theorems and Lemma 3.4, we deduce that the system will stabilize from an arbitrary initial state provided that a subset of n − f nodes remains coherent for a sufficiently large period of time.

Corollary 4.14. Suppose that ϑ < ϑ_max ≈ 1.247 as given in Lemma 3.4. Let W ⊆ V, where |W| ≥ n − f, and define for any k ∈ ℕ

T(k) := (k+2)(ϑ(R_2 + 3d) + 8(1 − λ)R_2 + d) + R_1/ϑ.

Then, for any k ∈ ℕ, the proposed algorithm is a (W, W²)-stabilizing pulse synchronization protocol with skew 2d and accuracy bounds (T_2 + T_3)/ϑ − 2d and T_2 + T_4 + 7d, stabilizing within time T(k) with probability at least 1 − 1/2^{k(n−f)}. It is feasible to pick timeouts such that T(k) ∈ O(kn) and T_2 + T_4 + 7d ∈ O(1).

Proof. The satisfiability of Condition 3.3 with T(k) ∈ O(kn) and T_2 + T_4 + 7d ∈ O(1) follows from Lemma 3.4. Assume that t⁺ is sufficiently large for [t⁻ + T(k) + 2d, t⁺] to be non-empty, as otherwise nothing is to show. By definition, W will be coherent during [t⁻_c, t⁺], with t⁻_c = t⁻ + ϑ(R_2 + 3d) + 8(1 − λ)R_2 + d.
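The bound T(k) of Corollary 4.14 is a simple affine function of k. As a minimal sketch (all concrete parameter values below are made up for illustration; the paper only requires timeouts satisfying Condition 3.3), it can be evaluated numerically:

```python
# Minimal sketch (illustrative only): the stabilization-time bound
# T(k) = (k+2)*(theta*(R2 + 3d) + 8*(1-lam)*R2 + d) + R1/theta
# from Corollary 4.14. All parameter values below are hypothetical.

def T(k, theta, R1, R2, d, lam):
    """Evaluate the stabilization-time bound T(k)."""
    E3_hat = theta * (R2 + 3 * d) + 8 * (1 - lam) * R2 + d  # E_3-hat
    return (k + 2) * E3_hat + R1 / theta

# Hypothetical parameters with theta = 1.2 < theta_max ~ 1.247.
t5 = T(5, theta=1.2, R1=1000.0, R2=400.0, d=1.0, lam=0.9)
t6 = T(6, theta=1.2, R1=1000.0, R2=400.0, d=1.0, lam=0.9)
assert t6 > t5  # the bound grows linearly in k, matching T(k) in O(k n)
```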
According to Theorem 4.9, there will be some good W-resynchronization point t_g ∈ [t⁻_c, t⁻_c + (k+1)(ϑ(R_2 + 3d) + 8(1 − λ)R_2 + d)] with probability at least 1 − 1/2^{k(n−f)}. If this is the case, Theorem 4.13 shows that there is a W-stabilization point t ∈ [t_g, t⁻ + T(k)]. Applying Theorem 4.4 inductively, we derive that the algorithm is a (W, E)-stabilizing pulse synchronization protocol with the bounds as stated in the corollary that stabilizes within time T(k) with probability at least 1 − 1/2^{k(n−f)}.

5 Generalizations and Extensions

5.1 Synchronization Despite Faulty Channels

Theorem 4.13 and our notion of coherency require that all involved nodes are connected by correct channels only. However, it is desirable that non-faulty nodes synchronize even if they are not connected by correct channels. To capture this, the notions of coherency and stability can be generalized as follows.

Definition 5.1 (Weak Coherency). We call the set C ⊆ V weakly coherent during [t⁻, t⁺], iff for any node i ∈ C there is a subset C′ ⊆ C that contains i, has size n − f, and is coherent during [t⁻, t⁺].

In particular, if there are in total at most f nodes that are faulty or have faulty outgoing channels, then the set of non-faulty nodes is (after some amount of time) weakly coherent.

Corollary 5.2. For each k ∈ ℕ let T′(k) := T(k) − (ϑ(R_2 + 3d) + 8(1 − λ)R_2 + d), where T(k) is defined as in Corollary 4.14. Suppose the subset of nodes C ⊆ V is weakly coherent during the time interval [t⁻, t⁺] ⊇ [t⁻ + T′(k) + T_2 + T_4 + 8d, t⁺] ≠ ∅. Then, with probability at least 1 − (f+1)/2^{k(n−f)}, there is a C-quasi-stabilization point t ≤ t⁻ + T′(k) + T_2 + T_4 + 5d such that the system is weakly C-coherent during [t, t⁺].

Proof.
By the definition of weak coherency, every node in C is in some coherent set C′ ⊆ C of size n − f. Hence, for any such C′, we can cover all nodes in C by at most 1 + |V \ C′| ≤ f + 1 coherent sets C_1, . . . , C_{f+1} ⊆ C. By Corollary 4.14 and the union bound, with probability at least 1 − (f+1)/2^{k(n−f)}, for each of these sets there will be at least one stabilization point during [t⁻, t⁻ + T′(k) − (T_2 + T_4 + 5d)]. Assuming that this is indeed true, denote by t_{i_0} ∈ [t⁻, t⁻ + T′(k) − (T_2 + T_4 + 5d)] the time

max_{i ∈ {1,...,f+1}} {max{t ≤ t⁻ + T′(k) − (T_2 + T_4 + 5d) | t is a C_i-stabilization point}},

where i_0 ∈ {1, . . . , f+1} is an index for which the first maximum is attained and t_{i_0} is the respective maximal time, i.e., t_{i_0} is a C_{i_0}-stabilization point. Define t′_{i_0} ∈ (t_{i_0}, t⁻ + T′(k)] to be minimal such that it is another C_{i_0}-stabilization point. Such a time must exist by Theorem 4.4. Since the theorem also states that no node from C_{i_0} switches to state accept during [t_{i_0} + 2d, t′_{i_0}) and C_i ∩ C_{i_0} ≠ ∅, there can be no C_i-stabilization point during (t_{i_0} + 2d, t′_{i_0} − 2d) for any i ∈ {1, . . . , f+1}. Applying the theorem once more, we see that there are also no C_i-stabilization points during (t′_{i_0} + 2d, t′_{i_0} + (T_2 + T_3)/ϑ − 2d) for any i ∈ {1, . . . , f+1}. On the other hand, the maximality of t_{i_0} implies that every C_i had a stabilization point by time t_{i_0}. Applying Theorem 4.4 to the latest stabilization point until time t_{i_0} for each C_i, we see that it must have another stabilization point before time t_{i_0} + T_2 + T_4 + 5d. We have that

2(T_2 + T_3)/ϑ − 2d > (T_2 + T_3 + T_5)/ϑ > T_2 + T_4 + 5d   (using (3) and (6)),

i.e., all C_i have stabilization points within a short time interval of (t′_{i_0} − 2d, t′_{i_0} + 2d).
Arguing analogously about the previous stabilization points of the sets C_i (which exist because t_{i_0} is maximal), we infer that all C_i had their previous stabilization point during (t_{i_0} − 2d, t_{i_0} + 2d). Now suppose t_a is the minimal time in (t′_{i_0} − 2d, t′_{i_0} + 2d) when a node from C switches to accept, and this node is in set C_i for some i ∈ {1, . . . , f+1}. As usual, there must be at least f + 1 non-faulty nodes from C_i in state propose at time t_a, and by time t_a + d all nodes from C_i will be in either of the states propose or accept. As |C_i ∩ C_j| ≥ f + 1 for any j ∈ {1, . . . , f+1}, all nodes in C_j will observe f + 1 nodes in state propose at times in (t_a, t_a + 2d). We have that t_a ≥ t_{i_0} + (T_2 + T_3)/ϑ − 2d according to Theorem 4.4. As no nodes switched to state accept during (t_{i_0} + 2d, t_a) and none of them switch to state recover (cf. Theorem 4.4), it follows from the inequality

(T_2 + T_3)/ϑ − 4d > T_2 + 2T_1 > T_2 + 5d   (using (4) and (2))

that all nodes from C_j observe themselves in one of the states ready or propose at time t_a. Hence, they will switch from ready to propose, if they still are in ready, before time t_a + 2d. Less than d time later, all nodes in C_j will memorize C_j in state propose and therefore switch to accept if they have not done so yet. Since j was arbitrary, it follows that t_a is a C-quasi-stabilization point.

Corollary 5.3. Suppose C is weakly coherent during [t⁻, t⁺] and t ∈ [t⁻, t⁺ − (T_2 + T_4 + 8d)] is a C-quasi-stabilization point.
Then
(i) all nodes from C switch to accept exactly once within [t, t + 3d),
(ii) there will be a C-quasi-stabilization point t' ∈ [t + (T_2 + T_3)/ϑ, t + T_2 + T_4 + 5d) such that no node switches to accept in the time interval [t + 3d, t'), and
(iii) each node i's state of the basic cycle (Figure 1), i ∈ W, is metastability-free during [t + 4d, t' + 4d).

Proof. Analogously to the proofs of Theorem 4.4 and Corollary 5.2.

We point out that one cannot obtain stronger results by the proposed technique. Even if there are merely f + 1 failing channels, this can, e.g., effectively render a node faulty (as it may never see n − f nodes in states propose or accept) or exclude the existence of a coherent set of size n − f (if the channels connect f + 1 disjoint pairs of nodes, there can be no subset of n − f nodes whose induced subgraph contains correct channels only). Stronger resilience to channel faults would necessitate propagating information over several hops in a fault-tolerant manner, imposing larger bounds on timeouts and weaker synchronization guarantees.

Combining Corollary 5.2 and Corollary 5.3 finally yields:

Corollary 5.4. Suppose that ϑ < ϑ_max ≈ 1.247 as given in Lemma 3.4. Let C ⊆ V be such that, for each i ∈ C, there is a set C_i ⊆ C with |C_i| = n − f, and let E = ∪_{i ∈ C} C_i². Then the proposed algorithm is a (C, E)-stabilizing pulse synchronization protocol with skew 3d and accuracy bounds (T_2 + T_3)/ϑ − 3d and T_2 + T_4 + 8d, stabilizing within time T(k) + T_2 + T_4 + 5d with probability at least 1 − (f + 1)/2^{k(n−f)}, for any k ∈ N.

Proof. Analogously to the proof of Corollary 4.14.

5.2 Late Joining and Fast Recovery

An important aspect of combining self-stabilization with Byzantine fault-tolerance is that the system can remain operational when facing a limited number of transient faults.
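As a quick numerical illustration (our own, not part of the paper's analysis), the stabilization probability bound of Corollary 5.4 can be evaluated for concrete parameter choices; the helper name below is hypothetical:

```python
# Illustrative sketch: the lower bound 1 - (f+1)/2^(k(n-f)) of Corollary 5.4
# on the probability of stabilizing within time T(k) + T2 + T4 + 5d.

def stabilization_prob_bound(n: int, f: int, k: int) -> float:
    """Lower bound on the stabilization probability (Corollary 5.4)."""
    return 1.0 - (f + 1) / 2 ** (k * (n - f))

# Even small systems stabilize with overwhelming probability for small k:
for n, f in [(4, 1), (7, 2), (10, 3)]:
    for k in (1, 2, 3):
        print(f"n={n} f={f} k={k}: P >= {stabilization_prob_bound(n, f, k):.8f}")
```

Note how the bound improves doubly exponentially in k while degrading only linearly in f, which is why small values of k already suffice in practice.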
If the affected components stabilize quickly enough, this can prevent future faults from causing system failure. In an environment where transient faults occur according to a random distribution that is not too far from uniform (i.e., one deals not primarily with bursts), the mean time until failure is therefore determined by the time it takes to recover from transient faults. Thus, it is of significant interest that a node that starts functioning according to the specifications again synchronizes as quickly as possible to an existing subset of correct nodes generating quasi-stabilization points. Moreover, it is of interest that a node that has been shut down temporarily, e.g. for maintenance, can join the operational system again quickly.

In the presented form, the algorithm suffers from the drawback that a node in state recover may be caught there until the next good resynchronization point. Since Byzantine faults of a certain pattern may deterministically delay this for Ω(n) time, we would like to modify the algorithm in a way that ensures that a non-faulty node can synchronize to the others more quickly once a quasi-stabilization point is reached.

This can be done in a simple manner. Whenever a node switches to state none, it stays there until a new timeout (R_1, none) expires. When switching to none, it also switches to passive, resets its join and sleep→waking flags, and resets its sleep→waking flags again whenever a timeout of (ϑ − 1)(ϑ + 2)T_1 + 5ϑd expires. Thus, it will not switch to state active because of outdated information. On the other hand, it will not miss the nodes of a set C that is weakly coherent since a C-quasi-stabilization point switching to state sleep→waking within a time window of (1 − 1/ϑ)(ϑ + 2)T_1 + 5ϑd, as it resets its sleep→waking flags at most once in this window, whereas |C| ≥ n − f ≥ 2f.
Subsequently, it will switch to state join at an appropriate time, so as to enter the basic cycle again at the occurrence of the next C-stabilization point. Since nodes refrain from leaving state none for a constant period of time only, the stabilization time in the face of severe failures can still be kept linear in this way, while in a stable system, nodes recovering from faults or joining late stabilize in constant time.

Corollary 5.5. The pulse synchronization routine can be modified such that it retains all shown properties, Ê_3 increases by a constant factor, and it holds that, for any node i in V: if there is a C-quasi-stabilization point at some time t < t^−, so that C is weakly coherent during [t, t^+] and (C ∪ {i})-coherent during [t^−, t^+], then there exists a (C ∪ {i})-quasi-stabilization point at some time t' ≤ t^− + O(1), so that C ∪ {i} is weakly coherent during [t', t^+].

Proof Sketch. Essentially, the fact that n − f nodes continue to execute the basic cycle narrows down the possibilities in the proof of Theorem 4.13 to Case 2b-II, where the threshold for leaving state join will be reached close to the next C-stabilization point due to the involved threshold conditions. Since the nodes in C execute the basic cycle, they are not affected by the resynchronization subroutine at all. Thus, i stabilizes independently of this subroutine, provided that it resets its join and sleep→waking flags in an appropriate fashion. We explained above how this is done and why a consistent reset of the sleep→waking flags is achieved. The join flags are not an issue, since at most n − |C| ≤ f channels can attain state join. As a node switches to state none again in constant time whenever it leaves it, the node will stabilize in constant time provided that there is a C-quasi-stabilization point from which on C is weakly coherent until time t^+.
On the other hand, we can easily adapt the resynchronization subroutine, Lemma 4.8, Theorem 4.9, and Condition 3.3 to account for the additional time during which nodes are non-responsive with respect to the resynchronization subroutine, increasing Ê_3 by a constant factor only.

5.3 Stronger Adversary

So far, our analysis considered a fixed set C of coherent (or weakly coherent) nodes. But what happens if whether a node becomes faulty or not is not determined upfront, but depends on the execution? Phrased differently, does the algorithm still stabilize quickly with large probability if an adversary may "corrupt" up to f nodes, but may decide on its choices as time progresses, fully aware of what has happened so far? Since we operate in a system where all operations take positive time, it might even be the case that a node fails just when it is about to perform a certain state transition, and would not have done so had the execution proceeded differently. Due to the way we use randomization, however, this makes little difference for the stabilization properties of the algorithm.

Corollary 5.6. Suppose that at every time t, an adversary has full knowledge of the state of the system up to and including time t, and that it may decide on, in total, up to f nodes (or all channels originating from a node) becoming faulty at arbitrary times. If it picks a node at time t, it fully controls its actions after and including time t. Furthermore, it controls delays and clock drifts of non-faulty components within the system specifications, and it initializes the system in an arbitrary state at time 0. For any k ∈ N, define

t_k := 2(k + 2)(ϑ(R_2 + 3d) + 8(1 − λ)R_2 + d) + R_1/ϑ + T_2 + T_4 + 5d.

Then the set of all non-faulty nodes has reached a quasi-stabilization point by time t_k, from which on they are weakly coherent, with probability at least 1 − (f + 1)e^{−k(n−f)/2}.
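The failure probability in this bound stems from comparing f to the number of tails among (k + 1)(n − f) fair coin flips (see the proof below). As an illustration outside the paper, the exact binomial tail can be checked numerically against the Chernoff estimate e^{−k(n−f)/2}:

```python
# Illustrative sanity check (not from the paper): exact binomial tail
# P[X <= f] for m = (k+1)(n-f) fair coin flips vs. the Chernoff-type
# estimate e^(-k(n-f)/2) used in the proof of Corollary 5.6.
import math

def exact_tail(n: int, f: int, k: int) -> float:
    m = (k + 1) * (n - f)  # number of independent unbiased coin flips
    return sum(math.comb(m, j) for j in range(f + 1)) / 2 ** m

def chernoff_estimate(n: int, f: int, k: int) -> float:
    return math.exp(-k * (n - f) / 2)

for n, f, k in [(4, 1, 2), (7, 2, 3), (10, 3, 4)]:
    assert exact_tail(n, f, k) <= chernoff_estimate(n, f, k)
    print(n, f, k, exact_tail(n, f, k), chernoff_estimate(n, f, k))
```

For these parameter choices the exact tail is a factor of 2 to 4 below the estimate, i.e., the bound is conservative but of the right order.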
Proof. We need to show that Theorem 4.9 holds for the modified time interval [t, t + (k + 2)Ê_3] with the modified probability of at least 1 − e^{−k(n−f)/2}. If this is the case, we can proceed as in Corollaries 5.2 and 5.3.

We start tracking the execution from time 0. Whenever a node switches to state init at a good time, the adversary must corrupt it in order to prevent subsequent deterministic stabilization. In the proof of Theorem 4.9, we showed that for any non-faulty node, there are at least k + 1 different times until 2ϑ(k + 2)Ê_3 at which it switches to init that have an independently by 1/2 lower bounded probability of being good. Since Lemma 4.8 holds for any execution with at most f faults, the adversary corrupting some node at time t affects only the current and future trials of that node, while the statement still holds true for the non-corrupted nodes. Thus, the probability that the adversary can prevent the system from stabilizing until time t_k is upper bounded by the probability that (k + 1)(n − f) independent and unbiased coin flips show tails at most f times. Chernoff's bound states for the random variable X counting the number of tails in this random experiment that, for any δ ∈ (0, 1),

P[X < (1 − δ)E[X]] < (e^{−δ}/(1 − δ)^{1−δ})^{E[X]} < e^{−δE[X]}.

Inserting δ = k/(k + 1) and E[X] = (k + 1)(n − f)/2, we see that

P[X ≤ f] ≤ P[X < (n − f)/2] < e^{−k(n−f)/2},

as claimed.

6 Implementation Issues

In this section, we briefly survey some core aspects of the VLSI implementation of the pulse synchronization algorithm, which is currently being developed. We focus on the three major building blocks: (1) asynchronous state machines, (2) memory flags with thresholds, and (3) watchdog timers.
Asynchronous State Machines

The pulse synchronization algorithm at every node consists of several simple state machines that execute asynchronously and concurrently. There are several types of conditions that can trigger state transitions:
(i) The state machines of a certain number (1, ≥ f + 1, or ≥ n − f) of remote nodes have reached some particular state, indicated by memory flags.
(ii) Some local state machine has reached a particular state.
(iii) A watchdog timer expires.
These conditions may also be combined (using AND or OR).

We will employ standard Huffman-type asynchronous state machines [25] for implementing our state machines, as they fit nicely to the Θ-Model already used in DARTS.^19 Analyzing the transition conditions of all five state machines (Figures 2, 3 and 4) of a single node reveals that we need to communicate six different states (recover, accept, join, propose, sleep→waking and "other") of the core state machine (Figure 2), and two states each (supp, none and init, wait) for the two state machines making up the resynchronization algorithm, from every node to every node.

There are several possibilities for implementing this communication. For example, both a simple high-speed serial protocol and a parallel five-bit bundled-data bus with a strobe signal are viable alternatives, each offering different trade-offs between implementation complexity, speed, area consumption, etc. We note, however, that any method for communicating states is complicated by the fact that state occupancy times may be very short in an asynchronous state machine: reaching a state must always be faithfully conveyed to all remote nodes even if it is almost immediately left again. In addition, the core state machine may undergo various sequences of state transitions, implying that we cannot use a state encoding where only a single bit changes between successive states.
Care must hence be taken not to trigger hazardous intermediate state occupancies at the receiver when communicating some multi-bit state change. Both problems can be handled using suitable bounded-delay conditions.

Remote Memory Flags and Thresholds

Figure 5 shows the principle of implementing remote memory flags, which are the basic mechanism required for type (i) state transition conditions at node i. For every remote node j, it consists of a hazard-free demultiplexer that decodes the communicated state of node j's state machines, a resettable memory flag per state that remembers whether node j has ever reached the respective state since the most recent flag reset, and optionally a threshold module that combines the corresponding flag outputs of all remote nodes. Note that every memory flag is implemented as a (resettable) Muller C-Gate^20 here, but could also be built using a flip-flop.

Implementing local state transition conditions (ii) is quite straightforward, as one simply needs to incorporate (single) state signals from local state machines. Note that every transition condition comprises the node observing itself in a particular state, which also falls into this category. To avoid metastable upsets in the asynchronous state machine (see below), it may be necessary to add memory flags for local signals as well.

Watchdog Timers

Our implementation of the watchdog timers, which are required for realizing state transition conditions (iii), will rest upon a single local clock generator per node (we will use a simple ring oscillator, i.e., a single inverter with feedback, plus a prescaler) that drives all watchdog timers. We prefer this over a crystal oscillator because of the possibility to integrate it on-chip.
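The behaviour of the remote memory flags and a threshold module can be sketched in software (a hypothetical model of ours, not the hardware itself; class and method names are assumptions):

```python
# Behavioural sketch of Figure 5's mechanism: one set-dominant, resettable
# flag per (remote node, state) records whether that node was ever observed
# in the state since the last reset; a threshold module counts set flags.

class MemoryFlags:
    def __init__(self, n: int):
        self.n = n
        self.flags = {}                     # (node, state) -> True once seen

    def observe(self, node: int, state: str) -> None:
        self.flags[(node, state)] = True    # set-dominant, like a C-gate flag

    def reset(self, state: str) -> None:
        for node in range(self.n):          # flag-reset wire for one state
            self.flags.pop((node, state), None)

    def count(self, state: str) -> int:
        return sum(1 for node in range(self.n)
                   if self.flags.get((node, state), False))

n, f = 4, 1
mem = MemoryFlags(n)
for node in (0, 2):                         # nodes 0 and 2 seen in propose
    mem.observe(node, "propose")
assert mem.count("propose") >= f + 1        # type-(i) condition satisfied
mem.reset("propose")
assert mem.count("propose") == 0
```

The set-dominant flag captures the key property discussed above: a state that is occupied only briefly is still latched until the next explicit reset.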
However, the oscillator frequency of such ring oscillators varies heavily with operating conditions, in particular with supply voltage and temperature, as well as with process conditions. The resulting (two-sided) clock drift ξ (with respect to supply voltage, temperature and process variation) is typically in the range of 7% to 9% for uncompensated ring oscillators and can be lowered to 1% to 2% by proper compensation techniques [32]. These two-sided clock drifts map to bounds on ϑ = (1 + ξ)/(1 − ξ) of 1.15 to 1.19 and 1.02 to 1.04, respectively. Recalling from Lemma 3.4 that ϑ_max ≈ 1.247, one sees that both uncompensated and compensated ring oscillators are suitable for implementing the pulse synchronization protocol's watchdog timers. However, care must be taken when the protocol is used to stabilize DARTS: to compensate a typical drift of 15% of DARTS clocks, one must ensure that ϑ is smaller than roughly 1.064 (cf. Section 7). Thus, here, only compensated ring oscillators are sufficiently accurate. Note, however, that these are conservative bounds, assuming that the synchronization protocol and DARTS drift in different directions.

Figure 5: Implementation principle of remote memory flags and thresholds.

^19 The Θ-Model assumes that we can enforce a certain ratio between slowest and fastest end-to-end delay along critical signaling paths.
^20 A Muller C-Gate retains its current output value when its inputs differ, and sets its output to the common input value otherwise.
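The mapping from drift to ϑ quoted above is easily checked; a small illustration of ours:

```python
# Illustrative computation: a two-sided relative clock drift xi maps to the
# drift ratio theta = (1 + xi) / (1 - xi), to be compared against
# theta_max ~ 1.247 (pulse synchronization) and ~1.064 (DARTS coupling).

def theta(xi: float) -> float:
    return (1 + xi) / (1 - xi)

for xi in (0.07, 0.09, 0.01, 0.02):
    print(f"xi = {xi:.2f}  ->  theta = {theta(xi):.4f}")

# Uncompensated oscillators (7%-9%) suffice for pulse synchronization ...
assert theta(0.09) < 1.247
# ... but only compensated ones (1%-2%) meet the DARTS coupling bound:
assert theta(0.02) < 1.064 < theta(0.09)
```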
Considering that a large share of the drift in both systems is due to variations in temperature, it seems reasonable to assume that, in the long term, both drift in the same direction.

As shown in Figure 6, every watchdog timer consists of a resettable up-counter and a timeout register, which holds the timeout value. A comparator compares the counter value with the timeout register after every clock tick, and raises a stable expiration output signal if the counter value is greater than or equal to the register value. The asynchronous reset of the counter, which also resets the timeout output signal, is used to re-trigger the watchdog.

Figure 6: Implementation principle of watchdog timers.

As for the watchdog timer with random timeout R_3 in the resynchronization algorithm, the simplest implementation would load a uniformly distributed random value into the timeout register whenever the watchdog is re-triggered. Depending on the implementation technology, such random values can be generated either via true random sources (thermal noise) or pseudo-random sources (LFSRs) clocked by another ring oscillator. If we could guarantee that the contents of the timeout register and the random source can by no means be read or probed by anybody, such an implementation satisfies the model requirements.^21 Alternatively, one could use random sampling per clock tick, which avoids storing the future timeout value and also converges to uniformly distributed timeouts for sufficiently large values of R_3.

Combined State Transition Conditions

Combining different state transition conditions (i)–(iii) via AND/OR requires some care, since an asynchronous state machine requires stable input signals in order not to become metastable during a state transition.
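The watchdog timer of Figure 6 admits a simple behavioural model (our own sketch for illustration; names are assumptions, and the hardware is of course not software):

```python
# Behavioural sketch of a watchdog timer: up-counter, timeout register, and
# comparator raising the expiration signal once counter >= timeout.

class Watchdog:
    def __init__(self, timeout):
        self.timeout = timeout        # contents of the timeout register
        self.counter = 0              # resettable up-counter

    def retrigger(self, timeout=None):
        """Asynchronous counter reset; also clears the expiration signal."""
        self.counter = 0
        if timeout is not None:       # e.g. load a fresh random value for R3
            self.timeout = timeout

    def tick(self):
        """One local clock tick; returns the expiration signal."""
        self.counter += 1
        return self.expired()

    def expired(self):
        return self.counter >= self.timeout

wd = Watchdog(timeout=3)
assert [wd.tick() for _ in range(4)] == [False, False, True, True]
wd.retrigger(timeout=2)               # re-trigger with a new timeout value
assert not wd.expired()
assert [wd.tick() for _ in range(2)] == [False, True]
```

Passing a fresh `timeout` on re-trigger corresponds to the random-timeout variant for R_3 described above.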
Combining several conditions of type (i) does not cause any problems, since the memory flags ensure that all outputs are stable. Non-stable signals, like "T_1 AND < n − f accept", require sampling via a flip-flop clocked by a stable signal. For example, the status of < n − f (= ¬(≥ n − f)) is sampled when the signal reporting the expiration of T_1 is issued. Similarly, it might happen that conditions requiring conflicting state transitions are satisfied at the same time; e.g., (T_2, accept) might expire simultaneously with the threshold of "≥ f + 1 recover or accept" being reached.

Obviously, both of the above situations could create a metastable upset, either of the sampling flip-flop, or directly of the register(s) holding the node's state. Fortunately, Theorem 4.4 revealed that this can happen during stabilization only. In regular operation, e.g., the critical threshold of ≥ n − f accept is always reached before T_1 expires. Thus, the former is acceptable, as metastable upsets occur rarely and only increase the convergence time. Moreover, to further decrease the probability of a metastable upset that might affect stabilization time, it is perfectly feasible to insert a synchronizer or an elastic pipeline after the sampling flip-flop to capture metastability [13]. This additional precaution merely increases the latency by a constant delay, which, being restricted to the pulse synchronization component, will not adversely affect the final precision and accuracy of the stabilized DARTS clocks.

7 Coupling of DARTS and Pulse Synchronization Algorithm

In this section, we describe how the self-stabilizing pulse synchronization protocol can be coupled with DARTS clocks. As this requires certain implementation details, we also sketch some ideas that might be used in a prototype implementation.
The joint system provides a high-precision self-stabilizing Byzantine fault-tolerant clocking system for multi-synchronous GALS.

The coupling between the pulse synchronization protocol and DARTS clocks involves two directions:
1. The pulse synchronization protocol primarily monitors the operation of the DARTS clocks. As long as DARTS ticks are generated correctly, it must not interfere with the DARTS tick generation rules at all.
2. If DARTS clocks become inconsistent w.r.t. the behavior of the pulse synchronization protocol, the latter must interfere with the regular DARTS tick generation, possibly up to resetting DARTS clocks.

To assist the reader, we first provide a very brief overview of the original DARTS and its implementation.

7.1 DARTS Overview

DARTS clocks (called TG-Algs in the sequel) are instances of a simple synchronizer [35] for the Θ-Model based on consistent broadcasting [31]. They generate ticks Tick(0), Tick(1), Tick(2), ... approximately simultaneously at all correct nodes. Since actual DARTS ticks are just binary clock signal transitions, which cannot carry tick numbers, the original algorithm had to be modified significantly in order to be implementable in asynchronous digital logic. Figure 7 shows a schematic of a single TG-Alg for a 5-node system.

^21 Note that in practice this is a reasonable assumption, as even the node itself does not access this value except for checking whether the timer has expired, and the computational power of the system is very limited.
Figure 7: Schematic of the DARTS TG-Alg implementation.

Key components of a TG-Alg are the counter modules, one per remote TG-Alg, which simply count the difference between the number of ticks generated locally and remotely. They are implemented using a pair of elastic pipelines [33], which implement FIFO buffers for signal transitions. Matching ticks in both pipelines, which are obviously irrelevant for the difference, are removed by the connecting Diff-Gate. The status (> 0, ≥ 0) of all counter modules is fed into two threshold modules, whose outputs trigger the generation of the next local tick. A detailed discussion of the implementation can be found in [10].

The correctness proof and performance analysis in [16, 17, 15] revealed that correct TG-Algs indeed generate synchronized clock ticks in the presence of up to f Byzantine faulty TG-Algs in a system with n ≥ 3f + 2 nodes: for any two correct nodes p, q, the numbers of clock ticks generated by p and q by time t do not differ by more than a (very small) constant π, and the frequency of any correct clock (and thus the maximum drift ρ) is within a certain range. In addition, expressions (in the order of π) for the maximum size of the elastic pipelines in the counter modules were established, which guarantee overflow-free operation. Experiments with both FPGA and ASIC prototype implementations demonstrated that DARTS clocks indeed offer close to perfect synchronization and very reliable operation. Nevertheless, as already mentioned, (almost) simultaneous start-up of all TG-Algs and at most f failures during the whole lifetime of the system are mandatory preconditions for these results to hold.
DARTS neither supports late joining or recovery of TG-Algs, nor recovery from more than f failures.

7.2 Required Extensions for Coupling DARTS and Pulse Synchronization

The major obstacle to supporting late joining of TG-Algs, removing spuriously generated ticks in the pipelines, etc., are the anonymous clock ticks used in DARTS: since they are just signal transitions on a single wire, they cannot encode any information except their occurrence time. The most important extension of DARTS is hence to add an additional bundled data wire to the clock signal, which carries 1 bit of data. This way, single ticks can be marked with a 1, distinguishing them from ordinary non-marked ticks that carry a 0.

We will actually mark every T-th DARTS tick, for some suitably chosen T. Such a marked tick kT, k ≥ 0, is to be understood as the start of the (k + 1)-st DARTS round, which consists of the marked DARTS tick kT and the T − 1 subsequent unmarked ticks kT + 1, kT + 2, ..., (k + 1)T − 1; the marked tick (k + 1)T starts the next DARTS round. Note that the resulting DARTS ticks can be interpreted as a discrete, bounded clock operating modulo T. As DARTS rounds at any two correct TG-Algs are synchronous, marked ticks must always match in the pipelines of every counter, i.e., the Diff-Gate must always remove pairs of matching marked (or non-marked) ticks, and can hence detect and remove any inconsistency.

The actual coupling between the instance of the pulse synchronization protocol and the DARTS clock running at node i is accomplished by means of two signals, namely darts_i and pulse_i:
• darts_i reports DARTS rounds to the pulse synchronization protocol. The rising edge of the darts_i signal, which may trigger a switch from ready to propose, is issued when the DARTS clock of node i generates tick kT − X, for some fixed X < T.
The falling edge of darts_i reports the occurrence of the marked tick kT.
• pulse_i reports the generation of a pulse to the DARTS clock. Its rising edge is issued on the transition to accept, and its falling edge signals the expiration of a fixed timeout T_y that is reset at the time the rising edge is transmitted.

The basic idea underlying the coupling of the pulse synchronization protocol and DARTS is to align marked ticks and pulses as follows: if the system operates normally, every correct node i first reaches some DARTS tick kT − X and issues darts_i = 1. Next, a pulse is generated at node i by the pulse synchronization protocol, which thus sets pulse_i = 1. Subsequently, the marked DARTS tick kT occurs, which is signaled by darts_i = 0. Finally, the pulse timeout T_y expires and hence pulse_i = 0 occurs. Normal operation thus expects the marked DARTS tick (= the falling edge of darts_i) to occur within the time window where pulse_i is 1. Provided that the timeout used for generating this window^22 is chosen sufficiently large, namely ϑρ(π + 2d + 1), this interleaving can indeed be guaranteed in normal operation. As long as this is the case, we just let DARTS generate its ticks using its standard rules. Should a DARTS clock fail, however, such that pulse_i and darts_i are not properly interleaved, then we will force marking the next DARTS tick (and possibly reset the TG-Alg, if needed) upon the falling edge of pulse_i. DARTS ticks and pulses (as well as marked DARTS ticks at different nodes) will hence only be re-aligned in case of errors or desynchronization: as long as DARTS clocks work correctly, any two correct TG-Algs will mark tick kT within the DARTS synchronization precision.
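The expected edge ordering within one round can be captured by a tiny checker (a hypothetical helper of ours, not part of the design):

```python
# Sketch: normal-operation interleaving of the coupling signals at node i
# within one round: darts_i rises (tick kT - X), pulse_i rises, darts_i
# falls (marked tick kT), pulse_i falls (timeout T_y). The marked tick must
# fall into the window where pulse_i is 1.

EXPECTED = ["darts_rise", "pulse_rise", "darts_fall", "pulse_fall"]

def properly_interleaved(events):
    """events: chronologically ordered signal edges of one round."""
    return list(events) == EXPECTED

assert properly_interleaved(
    ["darts_rise", "pulse_rise", "darts_fall", "pulse_fall"])
# Marked tick outside the pulse window: the coupling forces a marked tick.
assert not properly_interleaved(
    ["darts_rise", "darts_fall", "pulse_rise", "pulse_fall"])
```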
Provided that X and T_y are suitably chosen, it is not difficult to prove that the joint system consisting of the pulse synchronization protocol and the DARTS clocks will stabilize: after some unstable period, the pulse synchronization algorithm will stabilize, which we have proved to happen independently of the (non-)operation of the DARTS clocks. When the pulse synchronization protocol eventually starts to generate synchronous pulses, the DARTS clocks will start to recover in a guided (synchronized) manner. When all correct DARTS clocks are eventually synchronized to within the intrinsic DARTS precision, the system will perpetually ensure the above interleaving at all correct nodes.

Some additional observations:
(1) Since the DARTS precision is typically considerably smaller than the worst-case pulse synchronization precision, the underlying DARTS clocks may be viewed as a "precision amplifier" (as well as a clock multiplier, see Section 7.2).
(2) There is no need to specify properties possibly achieved by DARTS clocks during their own recovery. We only require that they eventually reach full synchronization in the presence of synchronous pulses at all correct nodes. In practice, DARTS clocks will typically also gradually improve their synchronization precision during this interval.
(3) Although the pulse synchronization algorithm stabilizes even when the DARTS clocks behave arbitrarily, it nevertheless achieves better pulse synchronization precision when the DARTS clocks are fully synchronized.
(4) One might ask why we did not just use the k-th rising edge of pulse_i to mark the very next DARTS tick generated by the TG-Alg at node i. This simple solution has several major drawbacks. First, the pulse synchronization precision is typically worse than the synchronization precision provided by DARTS.
Thus, every pulse would result in a temporary deterioration of the DARTS synchronization quality. Second, marked ticks would not necessarily be generated exactly every T DARTS ticks. And last but not least, since DARTS clocks and pulse synchronization execute completely asynchronously, marking DARTS ticks at pulse occurrence times would create the potential for metastability every kT DARTS ticks, even if there were no failure at all.

7.3 DARTS ⇒ Pulse Synchronization

To implement this part of the coupling, every DARTS clock signals the upcoming occurrence of marked tick kT to its local instance of the pulse synchronization protocol. This is accomplished by the rising edge of darts_i, the dedicated DARTS signal, which is generated upon DARTS tick kT − X. If all correct nodes happen to do this within some time window when they are (w.r.t. the pulse algorithm) in state ready with T_3 < T_4 already expired, all correct nodes will switch to state propose within π time.^23 Subsequently, they will all switch to state accept within d time. To make sure that indeed all correct nodes are in state ready with T_3 already expired, up to small additional terms of O(d), we must choose the minimal duration of a DARTS round to be larger than T_2 + T_3 + 4d, while (T_2 + T_4)/ϑ has to exceed its maximal duration. Assuming that ρ < 1.15, Lemma 3.4 shows that this is feasible up to ϑ ≈ 1.064, which is clearly within reach of ring oscillators [32].^24

^22 We remark that it is vital not to rely on the DARTS clock here.
^23 In contrast to the model we employed for our analysis, we neglect the local signaling delay here, as it is smaller than the time to generate a single tick.
^24 This is true regardless of the additional term of O(d), as the bound is derived from an asymptotic statement.

7.4 Pulse Synchronization ⇒ DARTS

This part of the coupling between DARTS and the pulse synchronization protocol requires two mechanisms:
(1) A way to force a marked DARTS tick at node i upon the occurrence of the falling edge of pulse_i, provided that no marked tick (i.e., the falling edge of darts_i) has been generated while pulse_i was 1.
This may also include recovering from a complete stall of the DARTS tick generation.
(2) A way to recover accurate DARTS synchronization after forcing marked ticks, which may also include the need to get rid of any information from the preceding unstable period.

To achieve (1), we use a simple asynchronous circuit that supervises the interleaving of pulse_i and darts_i, and generates a "force marking" signal if darts_i does not occur in time. Note that this device can be built in a way that entirely avoids metastability during normal operation. In an unstable period, however, it may happen that the force marking occurs exactly at the time when DARTS generates its marked tick, so a metastable upset or two very close marked ticks (a forced and a regularly generated one) are possible. There are several variants for implementing the forced marking itself, including the simplest variant of just resetting the TG-Alg in order to generate marked tick 0.

The need for possibly resetting a TG-Alg originates from the fact that stateful TG-Alg components may deadlock due to earlier failures. For example, a deadlocked pipe will never propagate ticks from its input to the Diff-Gate. Unfortunately, resetting TG-Algs complicates DARTS recovery considerably: if a TG-Alg reset also reset the remote pipes of its counters, it might lose "fresh" marked ticks generated by remote TG-Algs. Hence, remote pipes should only be reset when the remote node is reset. However, since a remote node might never observe a discrepancy between DARTS rounds and pulses, this approach might end up in the pipe not being reset at all. This is problematic, as it might effectively render the node faulty despite all its components being operational.
Luckily, we may utilize the fact that solving (2) under the assumption that correct pipes are not deadlocked yields a trivial means to distinguish a locked pipe from an operational one: If the Diff-Gate cannot remove any ticks within a certain time interval after a (correct) pulse, the pipe must have deadlocked and can safely be reset. At the next pulse, all pipes will have recovered from previous deadlocks, and a solution to (2) assuming deadlock-free pipes will succeed. (Two remarks on the analysis in Section 7.3: in contrast to the model we employed for our analysis, we neglect the local signaling delay there, as it is smaller than the time to generate a single tick; moreover, the feasibility of ϑ ≈ 1.064 holds regardless of the additional term of O(d), as the bound is derived from an asymptotic statement.)

To explain how we achieve (2), we start with the observation that our way of marking every T-th tick implies that, for any two correct DARTS TG-Algs, it will always be a marked tick kT from a remote node that is matched by the local marked tick kT in every counter of Figure 7 when the Diff-Gate removes it. That is, if a marked tick ever matches a non-marked tick in a counter, ticks have been lost or spuriously generated somewhere, or local and remote node are severely out of synchrony. Assume for the moment that we could generate exactly one marked tick at every correct node, that we made sure that no such tick is in the system before this happens, and that we have elastic pipelines of infinite size. The following simple strategy would then eventually establish matching DARTS ticks: Whenever a Diff-Gate encounters a marked tick in one pipe matched by an ordinary tick in the other, it removes the ordinary tick only. At the pipe level, this rule implies that whatever the state of the pipes was before the marked ticks were generated, they will be cleared before the matching pair of marked ticks is removed.
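The removal rule just described can be illustrated with a small software model. This is an idealized sketch under the stated assumptions (a single marked tick per node, unbounded pipes); ticks are modeled as 'M' (marked) and 'o' (ordinary), whereas real Diff-Gates of course operate on hardware elastic pipelines.

```python
from collections import deque

def diff_gate_step(local_pipe, remote_pipe):
    """One removal step of the idealized rule: a matching pair at the pipe
    heads is removed together, but when a marked tick faces an ordinary
    one, only the ordinary tick is removed. Returns False if no step is
    possible (one of the pipes is empty)."""
    if not local_pipe or not remote_pipe:
        return False
    a, b = local_pipe[0], remote_pipe[0]
    if a == b:                  # matching pair: remove both ticks
        local_pipe.popleft()
        remote_pipe.popleft()
    elif a == 'M':              # marked vs. ordinary: drop the ordinary tick
        remote_pipe.popleft()
    else:                       # ordinary vs. marked: drop the ordinary tick
        local_pipe.popleft()
    return True

# Whatever the pipes held before the marked ticks were generated, the
# stale ordinary ticks are cleared before the matching marked pair is
# removed; afterwards both pipes are empty.
local, remote = deque(['o', 'o', 'M']), deque(['o', 'M'])
while diff_gate_step(local, remote):
    pass
assert not local and not remote
```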
Since all DARTS tick generation rules ensure that no TG-Alg generates any tick kT + 1, kT + 2, . . . based on information from the previous DARTS round k − 1 (consisting of DARTS ticks up to kT − 1), all counter states will be valid as soon as the matching marked ticks kT have been removed. As DARTS essentially generates ticks based on comparing the number of locally and remotely generated ticks, this is enough to ensure stabilization of the DARTS system; full DARTS precision will be achieved quickly because nodes "catch up", i.e., generate tick numbers that at least f + 1 correct DARTS clocks already reached, faster than "new" ticks, i.e., ones that no correct node generated yet, may occur.

The issue of finite-size pipes is (largely) solved by the pulse synchronization protocol: Pulses, and hence marked ticks, are generated close to each other, in a time window of at most 2d + T_y ∈ O(d) (provided that T_y is not unnecessarily large). Hence, apart from implementation issues, pipes that can accommodate all ticks that may be generated within this time window are sufficient for not losing any valid DARTS tick.

In reality, however, we cannot always expect the "single marked tick" setting described above: Elastic pipelines may initially be populated with arbitrarily many marked ticks from the unstable period. We must hence make sure that all these marked ticks (and the white ticks in between) are eventually removed, and that we do not generate new marked ticks close to each other. The pulse synchronization protocol will ensure that forced ticks are separated by T DARTS ticks, and our implementations of (1) and (2) will ensure with large probability that a forced marked tick will not be generated close to a marked tick generated regularly by DARTS. Under these conditions, it is a relatively easy task to clear all superfluous marked ticks between pulses.
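Returning briefly to the pipe-sizing remark above, the sufficient pipe depth follows from simple arithmetic. The sketch below uses a hypothetical minimal separation between consecutive ticks (this parameter, and the example values, are not from the paper) to bound the number of ticks that can arrive within the 2d + T_y window.

```python
import math

def min_pipe_depth(d, T_y, min_tick_separation):
    """All correct marked ticks fall within a window of at most 2*d + T_y,
    so a pipe that can buffer every tick a node may emit within that
    window never loses a valid DARTS tick. The emission rate is bounded
    by the minimal separation of consecutive ticks (an assumed parameter
    of the underlying oscillator)."""
    window = 2 * d + T_y
    return math.ceil(window / min_tick_separation)

# Example with hypothetical values: d = 1, T_y = 2, ticks at least 0.5
# time units apart, so a depth-8 pipe suffices.
print(min_pipe_depth(d=1.0, T_y=2.0, min_tick_separation=0.5))
```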
For example, we could asynchronously reset the whole data flip-flop chain that holds the markings of the ticks (not the ticks themselves!) currently in a pipe shortly after the rising edge of pulse_i. Enlarging X and T_y slightly, we can be sure that all TG-Algs will remove spurious markings from their pipes before any marked tick associated with the respective correct pulse is generated. Although this could generate metastability in the Diff-Gate, namely when the tick at the head of the pipe is a marked tick and the Diff-Gate is about to act on it when the pipe is reset upon the arrival of a new marked tick, this cannot happen during normal operation.

References

[1] M. Ben-Or, D. Dolev, and E. N. Hoch. Fast Self-Stabilizing Byzantine Tolerant Digital Clock Synchronization. In Proc. 27th Symposium on Principles of Distributed Computing (PODC), pages 385–394, 2008.
[2] A. Berman and I. Keidar. Low-Overhead Error Detection for Networks-on-Chip. In Proc. 27th International Conference on Computer Design (ICCD), 2009.
[3] R. Bhamidipati, A. Zaidi, S. Makineni, K. Low, R. Chen, K.-Y. Liu, and J. Dalgrehn. Challenges and Methodologies for Implementing High-Performance Network Processors. Intel Technology Journal, 6(3):83–92, 2002.
[4] D. M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, 1984.
[5] C. Constantinescu. Trends and Challenges in VLSI Circuit Reliability. IEEE Micro, 23(4):14–19, 2003.
[6] A. Daliot and D. Dolev. Self-Stabilizing Byzantine Pulse Synchronization. CoRR, abs/cs/0608092, 2006.
[7] A. Daliot, D. Dolev, and H. Parnas. Self-Stabilizing Pulse Synchronization Inspired by Biological Pacemaker Networks. In Proc. 6th Symposium on Self-Stabilizing Systems (SSS), 2003.
[8] C. Dike and E. Burton. Miller and Noise Effects in a Synchronizing Flip-Flop. IEEE Journal of Solid-State Circuits, SC-34(6):849–855, 1999.
[9] S. Dolev and J. L.
Welch. Self-Stabilizing Clock Synchronization in the Presence of Byzantine Faults. Journal of the ACM, 51(5):780–799, 2004.
[10] M. Ferringer, G. Fuchs, A. Steininger, and G. Kempf. VLSI Implementation of a Fault-Tolerant Distributed Clock Generation. In Proc. IEEE Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT), pages 563–571, 2006.
[11] M. J. Fischer and N. A. Lynch. A Lower Bound for the Time to Assure Interactive Consistency. Information Processing Letters, 14:183–186, 1982.
[12] E. G. Friedman. Clock Distribution Networks in Synchronous Digital Integrated Circuits. Proceedings of the IEEE, 89(5):665–692, 2001.
[13] G. Fuchs, M. Függer, and A. Steininger. On the Threat of Metastability in an Asynchronous Fault-Tolerant Clock Generation Scheme. In Proc. 15th Symposium on Asynchronous Circuits and Systems (ASYNC), pages 127–136, Chapel Hill, North Carolina, USA, 2009.
[14] M. Függer. Analysis of On-Chip Fault-Tolerant Distributed Algorithms. PhD thesis, Technische Universität Wien, Institut für Technische Informatik, 2010.
[15] M. Függer, A. Dielacher, and U. Schmid. How to Speed-Up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining. In Proc. 8th European Dependable Computing Conference (EDCC), pages 230–239, 2010.
[16] M. Függer and U. Schmid. Reconciling Fault-Tolerant Distributed Computing and Systems-on-Chip. Research Report 13/2010, Technische Universität Wien, Institut für Technische Informatik, 2010.
[17] M. Függer, U. Schmid, G. Fuchs, and G. Kempf. Fault-Tolerant Distributed Clock Generation in VLSI Systems-on-Chip. In Proc. 6th European Dependable Computing Conference (EDCC), pages 87–96, 2006.
[18] M. J. Gadlage, P. H. Eaton, J. M. Benedetto, M. Carts, V. Zhu, and T. L. Turflinger. Digital Device Error Rate Trends in Advanced CMOS Technologies.
IEEE Transactions on Nuclear Science, 53(6):3466–3471, 2006.
[19] E. Hoch, D. Dolev, and A. Daliot. Self-Stabilizing Byzantine Digital Clock Synchronization. In Proc. 8th Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), volume 4280, pages 350–362, 2006.
[20] International Technology Roadmap for Semiconductors, 2007.
[21] D. J. Kinniment, A. Bystrov, and A. V. Yakovlev. Synchronization Circuit Performance. IEEE Journal of Solid-State Circuits, SC-37(2):202–209, 2002.
[22] M. Malekpour. A Byzantine-Fault Tolerant Self-Stabilizing Protocol for Distributed Clock Synchronization Systems. In Proc. 9th Conference on Stabilization, Safety, and Security of Distributed Systems (SSS), pages 411–427, 2006.
[23] L. Marino. General Theory of Metastable Operation. IEEE Transactions on Computers, C-30(2):107–115, 1981.
[24] C. Metra, S. Francescantonio, and T. Mak. Implications of Clock Distribution Faults and Issues with Screening them During Manufacturing Testing. IEEE Transactions on Computers, 53(5):531–546, 2004.
[25] C. J. Myers. Asynchronous Circuit Design. John Wiley & Sons, Inc., 2001.
[26] M. Pease, R. Shostak, and L. Lamport. Reaching Agreement in the Presence of Faults. Journal of the ACM, 27:228–234, 1980.
[27] T. Polzer, T. Handl, and A. Steininger. A Metastability-Free Multi-Synchronous Communication Scheme for SoCs. In Proc. 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), pages 578–592, 2009.
[28] C. L. Portmann and T. H. Y. Meng. Supply Noise and CMOS Synchronization Errors. IEEE Journal of Solid-State Circuits, SC-30(9):1015–1017, 1995.
[29] P. J. Restle et al. A Clock Distribution Network for Microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792–799, 2001.
[30] Y. Semiat and R. Ginosar. Timing Measurements of Synchronization Circuits. In Proc.
9th Symposium on Asynchronous Circuits and Systems (ASYNC), 2003.
[31] T. K. Srikanth and S. Toueg. Optimal Clock Synchronization. Journal of the ACM, 34(3):626–645, 1987.
[32] K. Sundaresan, P. Allen, and F. Ayazi. Process and Temperature Compensation in a 7-MHz CMOS Clock Oscillator. IEEE Journal of Solid-State Circuits, 41(2):433–442, 2006.
[33] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, 1989.
[34] P. Teehan, M. Greenstreet, and G. Lemieux. A Survey and Taxonomy of GALS Design Styles. IEEE Design and Test of Computers, 24(5):418–428, 2007.
[35] J. Widder and U. Schmid. The Theta-Model: Achieving Synchrony without Clocks. Distributed Computing, 22(1):29–47, 2009.
