Discriminated Belief Propagation
Author: Uli Sorger
7th November 2018

* Technical Report TR-CSC-07-01 of the D_Max Project funded by the University of Luxembourg.

Abstract: Near optimal decoding of good error control codes is generally a difficult task. However, for a certain type of (sufficiently) good codes an efficient decoding algorithm with near optimal performance exists. These codes are defined via a combination of constituent codes with low complexity trellis representations. Their decoding algorithm is an instance of (loopy) belief propagation and is based on an iterative transfer of constituent beliefs. The beliefs are thereby given by the symbol probabilities computed in the constituent trellises. Even though weak constituent codes are employed, close to optimal performance is obtained, i.e., the encoder/decoder pair (almost) achieves the information theoretic capacity. However, (loopy) belief propagation only performs well for a rather specific set of codes, which limits its applicability. In this paper a generalisation of iterative decoding is presented. It is proposed to transfer more values than just the constituent beliefs. This is achieved by the transfer of beliefs obtained by independently investigating parts of the code space. This leads to the concept of discriminators, which are used to improve the decoder resolution within certain areas and define discriminated symbol beliefs. It is shown that these beliefs approximate the overall symbol probabilities. This leads to an iteration rule that (below channel capacity) typically only admits the solution of the overall decoding problem. Via a GAUSS approximation a low complexity version of this algorithm is derived. Moreover, the approach may then be applied to a wide range of channel maps without significant complexity increase.

Keywords: Iterative Decoding, Coupled Codes, Information Theory, Complexity, Belief Propagation, Typical Decoding, Set Representations, Central Limit Theorem, Equalisation, Estimation, Trellis Algorithms

DECODING error control codes is the inversion of the encoding map in the presence of errors. An optimal decoder finds the codeword with the least number of errors. However, optimal decoding is generally computationally infeasible due to the intrinsic non-linearity of the inversion operation. Up to now only simple codes can be optimally decoded, e.g., by a simple trellis representation. These codes generally exhibit poor performance or rate [11]. On the other hand, good codes can be constructed by a combination of simple constituent codes (see e.g., [14, pp. 567ff]). This construction is interesting as then a trellis based inversion may perform almost optimally: BERROU et al. [2] showed that iterative turbo decoding leads to near capacity performance. The same holds true for iterative decoding of Low Density Parity Check (LDPC) codes [6]. Both decoders are conceptually similar and based on the (loopy) propagation of beliefs [16] computed in the constituent trellises. However, (loopy) belief propagation is often limited to idealistic situations. E.g., turbo decoding generally performs poorly for multiple constituent codes, complex channels, good constituent codes, and/or relatively short overall code lengths. In this paper a concept called discrimination is used to generalise iterative decoding by (loopy) belief propagation.
The generalisation is based on an uncertainty or distance discriminated investigation of the code space. The overall results of the approach are linked to basic principles in information theory such as typical sets and channel capacity [18, 15, 13].

Overview: The paper is organised as follows: First the combination of codes together with the decoding problem and its relation to belief propagation are reviewed. Then the concept of discriminators together with the notion of a common belief is introduced. In the second section local discriminators are discussed. By a local discriminator a controllable amount of parameters (or generalised beliefs) is transferred. It is shown that this leads to a practically computable common belief that may be used in an iteration. Moreover, a fixed point of the obtained iteration is typically the optimal decoding decision. Section 3 finally considers a low complexity approximation and the application to more complex channel maps.

1. Code Coupling

To review the combination of constituent codes we here consider only binary linear codes $C$ given by the encoding map
$$C: x = (x_1, \ldots, x_k) \mapsto c = (c_1, \ldots, c_n) = x G \bmod 2$$
with $G$ the $(k \times n)$ generator matrix and $x_i, c_i, G_{i,j} \in \mathbb{Z}_2 = \{0, 1\}$. The map defines for $\mathrm{rank}(G) = k$ the event set $E(C)$ of $2^k$ code words $c$. The rate of the code is given by $R = k/n$ and it is for an error correcting code smaller than one. The event set $E(C)$ is by linear algebra equivalently defined by an $((n-k) \times n)$ parity check matrix $H$ with $H G^T = 0 \bmod 2$ and thus $E(C) = \{c : H c^T = 0 \bmod 2\}$. Note that the modulo operation is in the sequel not explicitly stated.

$E(C)$ is a subset of the set $S$ of all $2^n$ binary vectors of length $n$. The restriction to a subset is interesting as this leads to the possibility to correct corrupted words. However, the correction is a difficult operation and can usually only be practically performed for simple or short codes. On the other hand, long codes can be constructed by the use of such simple constituent codes. Such constructions are reviewed in this section.

Definition 1 (Direct Coupling) The two constituent linear systematic coding maps
$$C^{(l)}: x \mapsto c^{(l)} = x \cdot G^{(l)} = x \cdot [\, I \;\; P^{(l)} \,] \quad \text{with } l = 1, 2$$
and a direct coupling give the overall code $E(C^{(a)})$ with $c^{(a)} = x \cdot [\, I \;\; P^{(1)} \;\; P^{(2)} \,]$.

Example 1 The constituent codes used for turbo decoding [2] are two systematic convolutional codes [10] with low trellis decoding complexity (see Appendix A.1). The overall code is obtained by a direct coupling as depicted in the figure. The encoding of the non-systematic part $P^{(l)}$ can be done by a recursive encoder. The $\Pi$ describes a permutation of the input vector $x$, which significantly improves the overall code properties but does not affect the complexity of the constituent decoders. If the two codes have rate $1/2$ then the overall code will have rate $1/3$.

[Figure: direct coupling of the systematic part $x$ with the two redundant parts $c^{(r1)}$ and $c^{(r2)}$, where the second encoder $P^{(2)}$ is preceded by the interleaver $\Pi$.]

Another possibility is to concatenate two constituent codes as defined below.

Definition 2 (Concatenated Codes) By $c^{(1)} = x G^{(1)}$ and $c^{(a)} = c^{(1)} G^{(2)} = x G^{(1)} G^{(2)}$ (provided matching dimensions, i.e., a $(k \times n^{(1)})$ generator matrix $G^{(1)}$ and an $(n^{(1)} \times n)$ generator matrix $G^{(2)}$) a concatenated code is given.
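The two coupling maps of Definitions 1 and 2 can be made concrete with a small numerical sketch. The following Python fragment uses hypothetical toy matrices (chosen only for illustration, not taken from the paper) to build the direct coupling generator $[\, I \; P^{(1)} \; P^{(2)} \,]$ and a concatenation $G^{(1)} G^{(2)}$, and lists the resulting code words.

```python
import numpy as np
from itertools import product

# Toy systematic constituent generators G(l) = [I | P(l)] over GF(2)
# (hypothetical matrices chosen only for illustration).
I2 = np.eye(2, dtype=int)
P1 = np.array([[1, 0], [1, 1]])           # redundant part of code 1
P2 = np.array([[0, 1], [1, 1]])           # redundant part of code 2

G1 = np.hstack([I2, P1])                  # (k x n1) = (2 x 4)
G_direct = np.hstack([I2, P1, P2])        # direct coupling: c = x [I P1 P2]

# Concatenation needs matching dimensions: G2 is (n1 x n) = (4 x 6) here.
G2 = np.hstack([np.eye(4, dtype=int), np.array([[1, 0], [0, 1], [1, 1], [1, 0]])])
G_concat = (G1 @ G2) % 2                  # c = x G1 G2

def codewords(G):
    """All code words of the linear code generated by G (brute force)."""
    k = G.shape[0]
    return sorted(tuple((np.array(x) @ G) % 2) for x in product([0, 1], repeat=k))

print("direct coupling, rate", G_direct.shape[0] / G_direct.shape[1])
for c in codewords(G_direct):
    print(c)
print("concatenated code, rate", G_concat.shape[0] / G_concat.shape[1])
```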
Remark 1 (Generalised Concatenation) A concatenation can be used to construct codes with defined properties, usually a large minimum HAMMING distance. Note that generalised concatenated [3, 4] codes exhibit the same basic concatenation map. There the distance properties are investigated under an additional partitioning of the code $G^{(2)}$.

Another possibility to couple codes is given in the following definition. This method will turn out to be very general, albeit rather non-intuitive as the description is based on parity check matrices $H$.

Definition 3 (Dual Coupling) The overall code
$$C^{(a)} := E(C^{(a)}) = \left\{ c : \begin{bmatrix} H^{(1)} \\ H^{(2)} \end{bmatrix} c^T = 0 \right\}$$
is obtained by a dual coupling of the constituent codes $C^{(l)} := E(C^{(l)}) = \{ c : H^{(l)} c^T = 0 \}$ for $l = 1, 2$.

By a dual coupling the overall code space is given by the intersection $C^{(a)} = C^{(1)} \cap C^{(2)}$ of the constituent code spaces.

Example 2 A dually coupled code construction similar to turbo codes is to use two mutually permuted rate $2/3$ convolutional codes. The intersection of these two codes gives a code with rate at least $1/3$. To obtain a larger rate one may employ puncturing (not transmitting certain symbols). However, the encoding of the overall code is not as simple as for directly coupled codes. A straightforward way is to just use the generator matrix representation of the overall code.

Remark 2 (LDPC Codes) LDPC codes are originally defined by a single parity check matrix with low weight rows (and columns). An equivalent representation is via a graph of check nodes (one for each row) and variable nodes (one for each column). This leads to a third equivalent representation with two dually coupled constituent codes and a subsequent puncturing [12]. The first constituent code is thereby given by a juxtaposition of repetition codes that represent the variable nodes (all node inputs need to be equal). The second one is defined by single parity check codes representing the check nodes. The puncturing at the end has to be done such that only one symbol per repetition code (code column) remains.

Theorem 1 Both direct coupling and concatenated codes are special cases of dual coupling codes.

Proof: The direct coupling code is equivalently described in the parity check form $H^{(a)} G^{(a)T} = 0$ by the parity check matrix
$$H^{(a)} = \begin{bmatrix} H^{(s1)} & H^{(r1)} & 0 \\ H^{(s2)} & 0 & H^{(r2)} \end{bmatrix}$$
where $H^{(l)} = [\, H^{(sl)} \;\; H^{(rl)} \,]$ for $l = 1, 2$ is the parity check matrix of $G^{(l)}$ consisting of the systematic part $H^{(sl)}$ and the redundant part $H^{(rl)}$. This is obviously a dual coupling. For a concatenated code with systematic code $G^{(2)} = [\, I \;\; P^{(2)} \,]$ the equivalent description by a parity check matrix is
$$H^{(a)} = \begin{bmatrix} H^{(1)} & 0 \\ H^{(s2)} & H^{(r2)} \end{bmatrix}$$
with $H^{(1)}$ and $H^{(2)} = [\, H^{(s2)} \;\; H^{(r2)} \,]$ the parity check matrices of $G^{(1)}$ respectively $G^{(2)}$. For non-systematic concatenated codes a virtual systematic extension (punctured prior to the transmission) is needed [12]. Hence, a representation by a dual coupling is again possible.

It is thus sufficient to consider only dual code couplings. The "dual" is therefore mostly omitted in the sequel.

Remark 3 (Multiple Dual Codes) More than two codes can be dually coupled as described above: By $C^{(a)} = C^{(1)} \cap C^{(2)} \cap C^{(3)}$ a coupling of three codes is given. The overall parity check matrix is there given by the juxtaposition of the three constituent parity check matrices. Multiple dual couplings are produced by multiple intersections. In the sequel mostly dual couplings with two constituent codes are considered.
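Definition 3 can be illustrated by brute force: with two (hypothetical) low weight parity check matrices, the dually coupled code is exactly the set of words that satisfy both checks, i.e., the intersection of the constituent code spaces. A minimal sketch:

```python
import numpy as np
from itertools import product

# Hypothetical low-weight parity check matrices of two length-6 constituent codes.
H1 = np.array([[1, 1, 0, 1, 0, 0],
               [0, 1, 1, 0, 1, 0]])
H2 = np.array([[1, 0, 1, 0, 0, 1],
               [0, 0, 1, 1, 1, 0]])

def code(H):
    """All binary words c with H c^T = 0 (mod 2), by brute force."""
    n = H.shape[1]
    return {c for c in product([0, 1], repeat=n) if not (H @ np.array(c) % 2).any()}

C1, C2 = code(H1), code(H2)
C_a = code(np.vstack([H1, H2]))            # dual coupling: stacked parity checks

# The coupled code is exactly the intersection of the constituent code spaces.
assert C_a == C1 & C2
print(len(C1), len(C2), len(C_a))           # here: 16, 16, 4
```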
1.1. Optimal Decoding

As stated above, the main difficulty is not the encoding but the decoding of a corrupted word. This corruption is usually the result of a transmission of the code word over a channel.

Remark 4 (Channels) In the sequel we assume that the code symbols $C_i$ are in $B = \{-1, +1\}$. This is achieved by the use of the "BPSK" map
$$B: x \mapsto y = \begin{cases} +1 & \text{for } x = 0 \\ -1 & \text{for } x = 1 \end{cases}$$
prior to the transmission. As channel we assume either a Binary Symmetric Channel (BSC) with channel error probability $p$ and
$$P(r|s) = \prod_{i=1}^{n} (1-p)^{\langle s_i = r_i \rangle} \, p^{\langle s_i \neq r_i \rangle} \propto \prod_{i=1}^{n} \left( \frac{1-p}{p} \right)^{s_i r_i} = \prod_{i=1}^{n} \exp_2\!\left( s_i r_i \log_2 \frac{1-p}{p} \right) = \exp_2\!\left( K \sum_{i=1}^{n} s_i r_i \right)$$
with $K = \log_2 \frac{1-p}{p}$, $s_i, r_i \in B = \{-1, +1\}$, and
$$\langle b \rangle = \begin{cases} 0 & \text{if } b \text{ false} \\ 1 & \text{if } b \text{ true,} \end{cases}$$
or a channel with Additive White GAUSS Noise (AWGN) given by
$$P(r|s) \propto \prod_{i=1}^{n} 2^{-(r_i - s_i)^2} \propto \exp_2\!\left( \sum_{i=1}^{n} r_i s_i \right)$$
with $s_i \in B$ (this actually is the GAUSS probability density) and the noise variance normalised by $2\sigma_E^2 = \log_2(e)$. The received elements $r_i$ are in the AWGN case real valued, i.e., $r_i \in \mathbb{R}$. Note that the normalised noise variance is obtained by $r_i \leftarrow K r_i$ with an appropriate constant $K$. Moreover, both cases then coincide.

Overall this gives that decoding is based on 1) the knowledge of the code space $E(C)$, 2) the knowledge of the channel map given by $P(r|c)$, and 3) the received information represented by $r$. A decoding can be performed by a decision for some word $\hat{c}$, which is in the Maximum Likelihood (ML) word decoding case
$$\hat{c} = \arg\max_{c \in E(C)} P(r|c),$$
or by decisions on the code symbols via ML symbol by symbol decoding
$$\bar{c}_i = \arg\max_{x \in B} P^{(c)}_{C_i}(x|r) = \arg\max_{x \in B} \sum_{c \in E(C),\, c_i = x} P(r|c).$$
Here $P^{(c)}_{C_i}(x|r)$ is the probability that $c_i = x$ under the knowledge of the code space $E(C)$. If no further prior knowledge about the code map or other additional information is available then these decisions are obviously optimal, i.e., the decisions exhibit the smallest word respectively bit error probability.

Remark 5 (Dominating ML Word) If by $P^{(a)}(\hat{c}^{(a)}|r) \to 1$ a dominating ML word decision exists then it necessarily holds that $\hat{c}^{(a)} = \bar{c}^{(a)}$. The decoding problem is then equivalent to solving either of the ML decisions.

ML word decoding is for the BSC equivalent to finding the code word with the smallest number of errors $c_i \neq r_i$, respectively the smallest HAMMING distance $d_H(c, r)$. For the AWGN channel the word $c$ that minimises EUCLID's quadratic distance $d_E^2(c, r) = \|r - c\|^2$ needs to be found.

For the independent channels of Remark 4 the ML decisions can be computed (see Appendix A.1) in the code trellis by the VITERBI or the BCJR algorithm. However, due to the generally large trellis complexity of the overall code these algorithms are there practically not applicable.
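As a side note, the channel normalisation of Remark 4 can be sketched numerically. The parameters $p$ and $\sigma^2$ below are arbitrary examples, and the AWGN scaling constant is only one plausible choice of the constant $K$ mentioned in the remark; after scaling, the received values act as symbol metrics with $P(r|s) \propto \exp_2(\sum_i r_i s_i)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 8, 0.1, 0.5                     # example parameters (assumptions)
c = rng.integers(0, 2, n)                      # a binary word
s = 1 - 2 * c                                  # BPSK map: 0 -> +1, 1 -> -1

# BSC: flip each symbol with probability p; scaling the hard output by
# K = log2((1-p)/p) gives P(r|s) proportional to 2^(sum_i r_i s_i), as in Remark 4.
flips = rng.random(n) < p
y_bsc = s * np.where(flips, -1, 1)
r_bsc = np.log2((1 - p) / p) * y_bsc

# AWGN: y_i = s_i + noise; one plausible scaling constant (an assumption here)
# brings the Gaussian likelihood into the same exponential form.
y_awgn = s + rng.normal(0.0, np.sqrt(sigma2), n)
r_awgn = (np.log2(np.e) / sigma2) * y_awgn

print(np.sign(r_bsc).astype(int) == s)         # hard decisions on the BSC output
print(r_awgn.round(2))                         # real valued received metrics
```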
On the other hand, one may compute the "uncoded" word probabilities
$$P(s|r) \propto P(r|s) \, \langle s \in S \rangle, \qquad (1)$$
and for small constituent trellis complexities the constituent code word probabilities
$$P^{(l)}(s|r) := P_{C^{(l)}|R}(s|r) = \frac{P(r|s) \cdot \langle s \in C^{(l)} \rangle}{\sum_{s' \in S} P(r|s') \cdot \langle s' \in C^{(l)} \rangle} \propto P(r|s) \cdot \langle s \in C^{(l)} \rangle$$
for $l = 1, 2$ with $S := E(S)$ the set of all words. This is interesting as the overall code word distribution
$$P^{(a)}(s|r) := P_{C^{(a)}|R}(s|r) \propto P(r|s) \cdot \langle s \in C^{(a)} \rangle$$
can be computed out of $P^{(l)}(s|r)$ and $P(s|r)$: It holds with Definition 3 that $C^{(a)} = C^{(1)} \cap C^{(2)}$ and thus
$$P^{(1)}(s|r) \cdot P^{(2)}(s|r) \propto (P(r|s))^2 \cdot \langle s \in C^{(1)} \rangle \cdot \langle s \in C^{(2)} \rangle = (P(r|s))^2 \cdot \langle s \in C^{(a)} \rangle,$$
which gives with (1) that
$$P^{(a)}(s|r) \propto \frac{P^{(1)}(s|r) \, P^{(2)}(s|r)}{P(s|r)}. \qquad (2)$$
If the constituent word probabilities are all known then optimal decoding decisions can be taken. I.e., one can compute the ML word decision by
$$\hat{c}^{(a)} = \arg\max_{s \in S} \frac{P^{(1)}(s|r) \, P^{(2)}(s|r)}{P(s|r)} \qquad (3)$$
or the ML symbol decisions by
$$\bar{c}^{(a)}_i = \arg\max_{x \in B} P^{(a)}_{C_i}(x|r) = \arg\max_{x \in B} \sum_{s \in S,\, s_i = x} P^{(a)}(s|r) = \arg\max_{x \in B} \sum_{s \in S_i(x)} \frac{P^{(1)}(s|r) \, P^{(2)}(s|r)}{P(s|r)} \qquad (4)$$
with $S_i(x) := \{ s \in S : s_i = x \}$.

Decoding decisions may therefore be taken by the constituent probabilities. However, by (2) one may only compute a value proportional to each single word probability. The representation complexity of the constituent word probability distributions remains prohibitively large. I.e., the decoding decisions by (3) and (4) do not reduce the overall complexity as all word probabilities have to be jointly considered, which is equivalent to investigating the complete code constraint.

1.2. Belief Propagation

The probabilities of the two constituent codes thus contain the complete knowledge about the decoding problem. However, the constituent decoders may not use this knowledge (with reasonable complexity) as then $2^n$ values would need to be transferred. I.e., a realistic algorithm based on the constituent probabilities should transfer only a small number of parameters.

In (loopy) belief propagation algorithms this is done by transmitting only the constituently "believed" symbol probabilities, but repeating this several times. This algorithm is here shortly reviewed: One first uses a transfer vector $w^{(1)}$ to represent the believed $P^{(1)}_{C_i}(x|r)$ of code 1. This belief representing transfer vector is then used together with $r$ in the decoder of the other constituent code. I.e., a transfer vector $w^{(2)}$ is computed out of $P^{(2)}_{C_i}(x|r, w^{(1)})$ that will then be reused for a new $w^{(1)}$ by $P^{(1)}_{C_i}(x|r, w^{(2)})$ and so forth. The algorithm is stopped if the beliefs do not change any further and a decoding decision is emitted.

The beliefs $P^{(h)}_{C_i}(x|r, w^{(l)})$ for $l, h \in \{1, 2\}$ and $l \neq h$ are obtained by $P(r, w^{(l)}|s) = P(w^{(l)}|s) \, P(r|s)$, i.e., a representation in which $w^{(l)}$ and $r$ are independent. Moreover, it is assumed that $s_i \in B = \{-1, +1\}$ and that
$$P(w^{(l)}|s) = \prod_{i=1}^{n} P(w^{(l)}_i|s_i) \propto \exp_2\!\left( \sum_{i=1}^{n} w^{(l)}_i s_i \right) = \exp_2(w^{(l)} s^T) \qquad (5)$$
is of the form of $P(r|s)$ in Remark 4.

Remark 6 (Distributions and Trellis) Obviously many other choices for $P(w^{(l)}|s)$ exist. However, the again independent description of the symbols $C_i = S_i$ in (5) leads (see Appendix A.1) to the possibility of trellis based computations, i.e., the symbol probabilities $P^{(l)}_{C_i}(x|r, w^{(h)})$ can be computed in the same way as $P^{(l)}_{C_i}(x|r)$ before.
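Equations (1)–(4) can be checked by brute force on a toy dually coupled code. The sketch below reuses the hypothetical parity check matrices from the earlier fragment and an arbitrary example vector $r$; it is only feasible for very small $n$.

```python
import numpy as np
from itertools import product

H1 = np.array([[1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0]])
H2 = np.array([[1, 0, 1, 0, 0, 1], [0, 0, 1, 1, 1, 0]])
n = 6
S = [np.array(s) for s in product([-1, 1], repeat=n)]        # all 2^n words in B^n

def in_code(s, H):
    c = (1 - s) // 2                                           # undo the BPSK map
    return not (H @ c % 2).any()

r = np.array([0.9, -1.2, 0.3, 1.1, -0.8, 0.7])                # example received vector
P_r = np.array([2.0 ** float(r @ s) for s in S])               # P(r|s) ∝ 2^(r s^T)

P0 = P_r / P_r.sum()                                           # "uncoded" P(s|r), eq. (1)
P1 = P_r * np.array([in_code(s, H1) for s in S]); P1 /= P1.sum()
P2 = P_r * np.array([in_code(s, H2) for s in S]); P2 /= P2.sum()

# Eq. (2): P(a)(s|r) ∝ P(1)(s|r) P(2)(s|r) / P(s|r)
Pa = P1 * P2 / P0
Pa /= Pa.sum()

# Eq. (4): ML symbol decisions from the combined distribution.
for i in range(n):
    p_plus = sum(p for p, s in zip(Pa, S) if s[i] == +1)
    print(i, +1 if p_plus > 0.5 else -1)
```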
The transfer vector $w^{(h)}$ for belief propagation for given $r$ and $w^{(l)}$ with $l, h \in \{1, 2\}$ and $h \neq l$ is defined by
$$P_{C_i}(x|r, w^{(1)}, w^{(2)}) = P^{(h)}_{C_i}(x|r, w^{(l)}) \quad \text{for all } i. \qquad (6)$$
I.e., the beliefs under $r$, $w^{(1)}$, $w^{(2)}$, and no further set restriction are set such that they are equal to the beliefs under $w^{(l)}$, $r$, and the knowledge of the set restriction of the $h$-th constituent code. This is always possible as shown below.

Remark 7 (Notation) To simplify the notation we set in the sequel $m = (r, w^{(1)}, w^{(2)})$, $m^{(1)} = (r, w^{(2)})$, $m^{(2)} = (r, w^{(1)})$, and often $w^{(0)} := r$.

For the uncoded beliefs $P_{C_i}(x|m)$ it is again assumed that the information and belief carrying $r$, $w^{(1)}$ and $w^{(2)}$ are independent, i.e.,
$$P(m|c) = P(r, w^{(1)}, w^{(2)}|c) = P(r|c) \, P(w^{(1)}|c) \, P(w^{(2)}|c). \qquad (7)$$
The computation of $w^{(h)}$ for given $w^{(l)}$ is then simple as the independence assumptions (5) and (7) give that $P_{C_i}(x|m) = P_{C_i|R}(x|r_i + w^{(1)}_i + w^{(2)}_i)$. Moreover, the definition of the $w^{(l)}$ is simplified by the use of logarithmic probability ratios
$$L_i(m) = \frac{1}{2} \log_2 \frac{P_{C_i}(+1|m)}{P_{C_i}(-1|m)} \quad \text{and} \quad L^{(l)}_i(m^{(l)}) = \frac{1}{2} \log_2 \frac{P^{(l)}_{C_i}(+1|m^{(l)})}{P^{(l)}_{C_i}(-1|m^{(l)})} \quad \text{for } l = 1, 2.$$
This representation is handy for the computations as (5) directly gives $L_i(m) = r_i + w^{(1)}_i + w^{(2)}_i$ and thus that Equation (6) is equivalent to
$$w^{(l)}_i = L^{(l)}_i(m^{(l)}) - r_i - w^{(h)}_i \quad \text{for } l \neq h \text{ and all } i. \qquad (8)$$
This equation can be used as an iteration rule such that the uncoded beliefs are subsequently updated by the constituent beliefs. The transfer vectors $w^{(1)}$ and $w^{(2)}$ are thereby iteratively updated via (8). The following definition further simplifies the notation.

Definition 4 (Extrinsic Symbol Probability) The extrinsic symbol probability of code $l$ is
$$\breve{P}^{(l)}_{C_i}(x|m^{(l)}) \propto P^{(l)}_{C_i}(x|m^{(l)}) \, \exp_2\!\left( -x (w^{(h)}_i + r_i) \right) \quad \text{for } h \neq l.$$
The extrinsic symbol probabilities are by (5) independent of $w^{(l)}_i$ for $l = 1, 2$ and of $r_i$, i.e., they depend only on the belief and information carrying $w^{(l)}_j$ and $r_j$ from other or "extrinsic" symbol positions $j \neq i$. Moreover, one directly obtains the extrinsic logarithmic probability ratios
$$\breve{L}^{(l)}_i(m^{(l)}) := \frac{1}{2} \log_2 \frac{P^{(l)}_{C_i}(+1|m^{(l)})}{P^{(l)}_{C_i}(-1|m^{(l)})} - r_i - w^{(h)}_i = L^{(l)}_i(m^{(l)}) - r_i - w^{(h)}_i \quad \text{for } l \neq h. \qquad (9)$$
With Equation (8) this gives the iteration rule $w^{(l)}_i = \breve{L}^{(l)}_i(m^{(l)})$ and thus Algorithm 1.

Algorithm 1 Loopy Belief Propagation
1. Set $w^{(1)} = w^{(2)} = 0$, $l = 1$, and $h = 2$.
2. Swap $l$ and $h$.
3. Set $w^{(l)} = \breve{L}^{(l)}(m^{(l)})$.
4. If $w^{(h)} \neq \breve{L}^{(h)}(m^{(h)})$ then go to Step 2.
5. Set $\hat{c}_i = \mathrm{sign}(r_i + w^{(1)}_i + w^{(2)}_i)$ for all $i$.

Note that one generally uses an alternative, less stringent stopping criterion in Step 4 of the algorithm. If the algorithm converges then one obtains that $r_i + w^{(1)}_i + w^{(2)}_i = L^{(2)}_i(r, w^{(1)}) = L^{(1)}_i(r, w^{(2)})$ and
$$\hat{c}_i = \mathrm{sign}\!\left( L_i(r) + \breve{L}^{(1)}_i(m^{(1)}) + \breve{L}^{(2)}_i(m^{(2)}) \right) \qquad (10)$$
with $L_i(r) = r_i$. This is a rather intuitive form of the fixed point of iterative belief propagation: the decoding decision $\hat{c}_i$ is defined by the sum of the (representations of the) channel information $r_i$ and the extrinsic constituent code beliefs $\breve{L}^{(l)}_i(m^{(l)})$.
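A compact sketch of Algorithm 1 is given below. To keep it self-contained, the constituent symbol ratios $L^{(l)}_i$ are computed by brute force over the toy constituent codes instead of in the constituent trellises (BCJR); this substitution is an assumption made purely for illustration.

```python
import numpy as np
from itertools import product

def constituent_L(H, m):
    """Symbol log2-ratios L_i under the code {c : H c^T = 0} and channel values m."""
    n = H.shape[1]
    words = [np.array(s) for s in product([-1, 1], repeat=n)
             if not (H @ ((1 - np.array(s)) // 2) % 2).any()]
    P = np.array([2.0 ** float(m @ s) for s in words])
    L = np.zeros(n)
    for i in range(n):
        p_plus = P[[s[i] == +1 for s in words]].sum()
        p_minus = P[[s[i] == -1 for s in words]].sum()
        L[i] = 0.5 * np.log2(p_plus / p_minus)
    return L

H = {1: np.array([[1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0]]),
     2: np.array([[1, 0, 1, 0, 0, 1], [0, 0, 1, 1, 1, 0]])}
r = np.array([0.9, -1.2, 0.3, 1.1, -0.8, 0.7])
w = {1: np.zeros(6), 2: np.zeros(6)}                   # Step 1
l, h = 1, 2

for _ in range(50):                                    # bounded loop, Step 4's test below
    l, h = h, l                                        # Step 2: swap the roles
    # Step 3: extrinsic belief of code l given r and the other transfer vector, eq. (9)
    w[l] = constituent_L(H[l], r + w[h]) - r - w[h]
    if np.allclose(w[h], constituent_L(H[h], r + w[l]) - r - w[l]):
        break                                          # Step 4: beliefs are stable

c_hat = np.sign(r + w[1] + w[2])                        # Step 5
print(c_hat)
```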
Remark 8 (Performance) If the algorithm converges then simulations show that the decoding decision is usually good. By density evolution [17] or extrinsic information transfer charts [19] the convergence of iterative belief propagation is further investigated. These approaches evaluate which constituent codes are suitable for iterative belief propagation. They, and simulations, show that only rather weak codes should be employed for good convergence properties. This indicates that the chosen transfer is often too optimistic about its believed decisions.

1.3. Discrimination

The belief propagation algorithm uses only knowledge about the constituent codes represented by $w^{(l)}$. In this section we aim at increasing the transfer complexity by adding more variables, hoping to thereby obtain a better representation of the overall information and thus an improvement over the propagation of only symbol beliefs.

Reconsider first the additional belief representation $w^{(l)}$ given by the distributions $P(s|w^{(1)})$ and $P(s|w^{(2)})$ used for belief propagation. The overall distributions are
$$\begin{aligned} P(s|m) = P(s|r, w^{(1)}, w^{(2)}) &\propto P(r|s) \, P(w^{(1)}|s) \, P(w^{(2)}|s) \\ P^{(1)}(s|m^{(1)}) = P^{(1)}(s|r, w^{(2)}) &\propto P(r|s) \, P(w^{(2)}|s) \\ P^{(2)}(s|m^{(2)}) = P^{(2)}(s|r, w^{(1)}) &\propto P(r|s) \, P(w^{(1)}|s). \end{aligned} \qquad (11)$$
The following lemma first gives that these additional beliefs do not change the computation of the overall word probabilities.

Lemma 1 It holds for all $w^{(1)}, w^{(2)}$ that
$$P^{(a)}(s|r) \propto \frac{P^{(1)}(s|m^{(1)}) \, P^{(2)}(s|m^{(2)})}{P(s|m)}.$$
Proof: A direct computation with (11) gives equality as for (2). The terms that depend on the $w^{(l)}$ vanish by the independence assumption (5).

To increase the transfer complexity, additional parameters are now added to $s$. This first seems counter-intuitive as no new knowledge is added. However, with Lemma 1 the same holds true for the belief carrying $w^{(l)}$ and optimal decoding.

Definition 5 (Word Uncertainty) The uncertainty augmented word probability $P^{(h)}(s, u|m^{(h)})$ is
$$P^{(h)}(s, u|m^{(h)}) := P^{(h)}(s|m^{(h)}) \prod_{l=0}^{2} \left\langle u_l = w^{(l)} s^T \right\rangle$$
with $u = u(s) = (u_0, u_1, u_2)$. This definition naturally extends to $P(s, u|m)$ and to $P^{(a)}(s, u|r)$.

Remark 9 (Notation) The notation of $P^{(a)}(s, u|r)$ does not reflect the dependency on $m$. The same holds true for $P^{(l)}(s, u|m^{(l)})$ etc. A complete notation is for example $P(s, u|m \,\|\, r)$ or $P^{(l)}(s, u|m \,\|\, m^{(l)})$. To maintain readability this dependency will not be explicitly stated in the sequel.

Under the assumption that code words with the same $u$ do not need to be distinguished one obtains the following definition.

Definition 6 (Discriminated Distribution) The distribution of $u$ discriminated by $m$ is
$$P^{\otimes}(u|m) \propto \frac{P^{(1)}(u|m^{(1)}) \, P^{(2)}(u|m^{(2)})}{P(u|m)}$$
with $\sum_{u \in U} P^{\otimes}(u|m) = 1$, $P^{(l)}(u|m^{(l)}) = \sum_{s \in S} P^{(l)}(s, u|m^{(l)})$, and $U = E(U) = \{ u : u_l = w^{(l)} s^T \; \forall l \text{ and } s \in S \}$.

Remark 10 (Discrimination) Words $s$ with the same $u$ are not distinguished.
As $m$ and $s$ define $u$, the discrimination of words is steered by $m$. The variables $u_l$ are then used to relate to the distances $\|c - w^{(l)}\|^2$ (see Remark 4). Words that do not share the same distances are discriminated. The choice of $u$ and (5) is natural as all code words with the same $u$ have the same probability, i.e.,
$$P^{(l)}(s, u|m^{(l)}) \propto \exp_2\!\left( \sum_{k=0,\, k \neq l}^{2} u_k \right) \cdot \prod_{j=0}^{2} \left\langle u_j = w^{(j)} s^T \right\rangle \cdot \left\langle s \in C^{(l)} \right\rangle \qquad (12)$$
and similarly for $P(s, u|m)$ and $P^{(a)}(s, u|r)$. Generally it holds that $u$ is via
$$\sum_{k=0}^{2} u_k = K + \log_2 P(s|m) = K - H(s|m)$$
(with $K$ some constant) related to the uncertainty $H(s|m)$. Note that any map of $s$ on some $u$ will define some discrimination. However, we will here only consider the correlation map, respectively the discrimination of the information theoretic word uncertainties.

In the same way one obtains the much more interesting (uncertainty) discriminated symbol probabilities.

Definition 7 (Discriminated Symbol Probabilities) The symbol probabilities discriminated by $m$ are
$$P^{\otimes}_{C_i}(x|m) = \sum_{u \in U} P^{\otimes}_{C_i}(x, u|m) \propto \sum_{u \in U} \frac{P^{(1)}_{C_i}(x, u|m^{(1)}) \, P^{(2)}_{C_i}(x, u|m^{(2)})}{P_{C_i}(x, u|m)} \qquad (13)$$
with
$$P^{(l)}_{C_i}(x, u|m^{(l)}) = \sum_{s \in S_i(x)} P^{(l)}(s, u|m^{(l)}) \propto \sum_{s \in C^{(l)},\, s_i = x} P(s, u|m^{(l)}).$$

Remark 11 (Independence) Note that $P^{\otimes}_{C_i}(x|m)$ is by (5) independent of both $w^{(l)}_i$.

The discriminated symbol probabilities may be considered as commonly believed symbol probabilities under discriminated word uncertainties. To obtain a first intuitive understanding of this fact we relate $P^{\otimes}_{C_i}(x|m)$ to the more accessible constituent symbol probabilities $P^{(l)}_{C_i}(x|m^{(l)})$. It holds by BAYES' theorem that
$$P^{\otimes}_{C_i}(x|m) \propto \frac{P^{(1)}_{C_i}(x|m^{(1)}) \, P^{(2)}_{C_i}(x|m^{(2)})}{P_{C_i}(x|m)} \, P^{\boxtimes}_{C_i}(x|m)$$
with (abusing notation as this is not a probability)
$$P^{\boxtimes}_{C_i}(x|m) \propto \sum_{u \in U} \frac{P^{(1)}_{C_i}(u|x, m^{(1)}) \, P^{(2)}_{C_i}(u|x, m^{(2)})}{P_{C_i}(u|x, m)}. \qquad (14)$$
In the logarithmic notation this gives $L^{\otimes}_i(m) = L^{(1)}_i(m^{(1)}) + L^{(2)}_i(m^{(2)}) - L_i(m) + L^{\boxtimes}_i(m)$, or in the extrinsic notation of (9),
$$L^{\otimes}_i(m) = \breve{L}^{(1)}_i(m^{(1)}) + \breve{L}^{(2)}_i(m^{(2)}) + L_i(r) + L^{\boxtimes}_i(m). \qquad (15)$$
Note first the similarity with (10). One has again a sum of the extrinsic beliefs; however, an additional value $L^{\boxtimes}_i(m)$ is added, which is by Remark 11 necessarily independent of $w^{(l)}_i$ for $l = 1, 2$. Overall the common belief joins the two constituent beliefs together with a "distance" correction term. Below we show that this new common belief is (under again practically prohibitively high complexity) just the real overall "belief", i.e., the correct symbol probabilities obtained by optimal symbol decoding.

Definition 8 (Globally Maximal Discriminator) The discriminator $m$ is globally maximal (for $S$) if $|S(u|m)| = 1$ for all $u \in U$. I.e., for globally maximal discriminators there exists a one-to-one correspondence between $s$ and $u$ and thus $|S| = |U|$.

Lemma 2 For a globally maximal discriminator $m$ it holds that $P^{\otimes}(u|m) = P^{(a)}(u|r)$ and $P^{\otimes}_{C_i}(x|m) = P^{(a)}_{C_i}(x|r)$. I.e., the symbol probabilities discriminated by $m$ are correct.

Proof: Lemma 1 and Definition 5 give
$$P^{(a)}(s, u|r) \propto \frac{P^{(1)}(s, u|m^{(1)}) \, P^{(2)}(s, u|m^{(2)})}{P(s, u|m)}$$
as $u$ follows directly from $s$.
For a globally maximal discriminator $m$ there exists a one-to-one correspondence between $s$ and $u$. This gives that one can omit for any probability either $u$ or $s$, which proves the optimality of the discriminated distribution. For the overall symbol probabilities holds
$$P^{(a)}_{C_i}(x|r) = \sum_{s \in S_i(x)} P^{(a)}(s|r) = \sum_{s \in S_i(x)} \frac{P^{(1)}(s, u|m^{(1)}) \, P^{(2)}(s, u|m^{(2)})}{P(s, u|m)}.$$
With $P^{(l)}_{C_i}(x, s, u|m^{(l)}) = P^{(l)}(s, u|m^{(l)})$ for $s \in S_i(x)$ and $P^{(l)}_{C_i}(x, s, u|m^{(l)}) = 0$ for $s \notin S_i(x)$ the right hand side becomes
$$\sum_{s \in S_i(x)} \frac{P^{(1)}(s, u|m^{(1)}) \, P^{(2)}(s, u|m^{(2)})}{P(s, u|m)} = \sum_{s \in S} \frac{P^{(1)}_{C_i}(x, s, u|m^{(1)}) \, P^{(2)}_{C_i}(x, s, u|m^{(2)})}{P_{C_i}(x, s, u|m)}.$$
By the one-to-one correspondence one can replace the sum over $s$ by a sum over $u$ to obtain
$$P^{(a)}_{C_i}(x|r) = \sum_{u \in U} \frac{P^{(1)}_{C_i}(x, s, u|m^{(1)}) \, P^{(2)}_{C_i}(x, s, u|m^{(2)})}{P_{C_i}(x, s, u|m)},$$
which is ($s$ can be omitted due to the one-to-one correspondence) the optimality of the discriminated symbol probabilities.

A globally maximal discriminator $m$ thus solves the problem of ML symbol by symbol decoding. Likewise, by
$$\arg\max_{u \in U} P^{\otimes}(u|m) = \arg\max_{u \in U} P^{(a)}(u|r) = \arg\max_{s \in S} P^{(a)}(s|r)$$
the problem of ML word decoding is solved (provided the one-to-one correspondence of $u$ and $s$ can be easily inverted). This is not surprising as a globally maximal discriminator has by the one-to-one correspondence of $s$ and $u$ the discriminator complexity $|U| = |S|$. The transfer complexity is then just the complexity of the optimal decoder based on constituent probabilities.

Remark 12 (Globally Maximal Discriminators) The vector $m = (r, w^{(1)}, 0)$ with $w^{(1)}_i = 2^i$ is an example of a globally maximal discriminator as $u_1(s) = \sum_{i=1}^{n} s_i 2^i$ is different for all values of $s$. I.e., there exists a one-to-one correspondence between $s$ and $u$. Generally it is rather simple to construct a globally maximal discriminator. E.g., the $r$ received via an AWGN channel is usually already maximally discriminating: The probability that two words $s^{(1)}, s^{(2)} \in S$ share the same real valued distance to the received word is generally zero.

2. Local Discriminators

In the last section the coupling of error correcting codes was reviewed and different decoders were discussed. It was shown that an optimal decoding is, due to the large representation complexity, practically not feasible, but that a transfer of beliefs may lead to a good decoding algorithm. A generalisation of this approach led to the concept of discriminators and therewith to a new overall belief. The complexity of the computation of this belief depends on $|U|$, i.e., the number of different outcomes $u$ of the discrimination. Finally it was shown that the obtained overall belief leads to the optimal overall decoding decision if the set is with $|U| = |S|$ maximally large. (However, then the overall decoding complexity is not reduced.)

In this section we consider local discriminators with $|U| \ll |S|$. Then only a limited number of values need to be transferred to compute by (13) a new overall belief $P^{\otimes}_{C_i}(x|m)$. These discriminated beliefs $P^{\otimes}_{C_i}(x|m)$ may then be practically employed to improve iterative decoding. To do so we first show that local discriminators exist.
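The correlation map $u(s) = (w^{(0)} s^T, w^{(1)} s^T, w^{(2)} s^T)$ and the resulting discriminator complexity $|U|$ can be made concrete with a small brute-force sketch (hypothetical vectors; compare Definition 8 and Remark 12 with the local discriminator of the following example):

```python
import numpy as np
from itertools import product

n = 6
S = [np.array(s) for s in product([-1, 1], repeat=n)]

def U_of(w0, w1, w2):
    """All discriminator outcomes u = (w0 s^T, w1 s^T, w2 s^T) over s in S."""
    return {(int(w0 @ s), int(w1 @ s), int(w2 @ s)) for s in S}

r_hard = np.array([1, -1, 1, 1, -1, 1])               # hard decision channel output
w_pow = np.array([2 ** i for i in range(1, n + 1)])   # w_i = 2^i as in Remark 12

U_local = U_of(r_hard, np.zeros(n), np.zeros(n))       # BSC discriminator (local)
U_global = U_of(r_hard, w_pow, np.zeros(n))            # globally maximal discriminator

print(len(S), len(U_local), len(U_global))             # 64, at most n+1 = 7, 64
```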
Example 3 The $r$ obtained by a transmission over a BSC is generally a local discriminator. The map $U(r): s \mapsto u = (u_0, 0, 0)$ is then only dependent on the HAMMING distance $d_H(r, s)$, i.e.,
$$U_0(r): s \mapsto u_0 = r s^T = n - 2 d_H(r, s)$$
and thus $U = E(U_0) \subseteq \{-n, -n+2, \ldots, n-2, n\}$, which gives $|U| \leq n + 1$. This furthermore gives that an additional "hard decision" choice of the $w^{(l)}$ will continue to yield a local "HAMMING" discriminator $m$.

To investigate local discrimination now reconsider the discriminated distributions. With Remark 10 one obtains the following lemma.

Lemma 3 The distributions of $u$ given $m$ are
$$P(u|m) \propto |S(u|m)| \, \exp_2(u_0 + u_1 + u_2) \quad \text{and} \quad P^{(l)}(u|m^{(l)}) \propto |C^{(l)}(u|m)| \, \exp_2\!\left( \sum_{k=0,\, k \neq l}^{2} u_k \right)$$
where the sets $S(u|m)$ and $C^{(l)}(u|m)$ are defined by $M(u|m) := \{ s \in M : u_l = w^{(l)} s^T \; \forall l \}$.

Proof: By (12) it follows that the probability of all words $s \in S(u|m)$ with the same $u$ is equal and proportional to $\exp_2(\sum_{k=0}^{2} u_k)$. As $|S(u|m)|$ words are in $S(u|m)$ this gives the first equation. The second equation is obtained by adding the code constraint.

Remark 13 (Overall Distribution) In the same way it follows (see Remark 9) that
$$P^{(a)}(u|r) \propto |C^{(a)}(u|m)| \, \exp_2(u_0). \qquad (16)$$
More general restrictions (see below) can always be handled by imposing restrictions on the considered sets. One thus generally obtains for the distributions of $u$ a description via set sizes that depend on $u$.

Example 4 With the concept of set sizes Example 3 is continued. Assume again that the discriminator is given by $m = (r, 0, 0)$. In this case no discrimination takes place on $u_1$ and $u_2$ as one obtains $u_1 = u_2 = 0$ for all $s$. One first obtains with Remark 13 the overall distribution $P^{(a)}(u|r)$ as the multiplication of $\exp_2(u_0)$ with the distribution of the correlation $c r^T$ with $c \in C^{(a)}$ given by $|C^{(a)}(u|m)|$. Assume furthermore that the overall maximum likelihood decision $\hat{c}^{(a)}$ is distinguished with $P^{(a)}(\hat{c}^{(a)}|r) \to 1$. This assumption gives that
$$P^{(a)}(u|r) = P^{(a)}(u_0|r) \approx \begin{cases} 1 & \text{for } u_0 = u_0(\hat{c}^{(a)}) = n - 2 d_H(r, \hat{c}^{(a)}) \\ 0 & \text{else.} \end{cases}$$
I.e., $P^{(a)}(u|r)$ consists of one peak. For the other probabilities $P(u_0|m)$ and $P^{(l)}(u_0|m^{(l)})$ with Lemma 3 again a multiplication of correlation distributions with $\exp_2(u_0)$ is obtained. These distributions will, however, due to the much larger spaces $|S| \gg |C^{(l)}| \gg |C^{(a)}|$ usually not be in the form of a single peak. Other words with $u_0 \geq \hat{c}^{(a)} r^T$ may appear. The same then holds true for $P^{\otimes}(u_0|m)$. These considerations are exemplarily depicted in Figure 1. Note that the distributions can all be computed (see Appendix A.1) in the constituent trellises.

[Figure 1: Hard Decisions. The distributions $P(u_0|m)$, $P^{(1)}(u_0|m^{(1)})$, $P^{(2)}(u_0|m^{(2)})$, $P^{\otimes}(u_0|m)$, and $P^{(a)}(u_0|r)$ over $u_0$.]

For a local discrimination a computation in the constituent trellises produces by (13) symbol probabilities $P^{\otimes}_{C_i}(x|m)$. In equivalence to (loopy) belief propagation these probabilities should lead to the definition of some $w$ and thus to some iteration rule.
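A brute-force sketch of the discriminated distribution of Definition 6 and Lemma 3 for a hard decision discriminator $m = (r, w, w)$ (so that $u$ reduces to $(u_0, u_1)$) is given below; the parity check matrices and the vectors $r$, $w$ are hypothetical toy choices.

```python
import numpy as np
from itertools import product
from collections import defaultdict

# Toy setting (assumed): the dually coupled code from the earlier sketches,
# a hard-decision discriminator m = (r, w, w) and the correlation map u(s).
H1 = np.array([[1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0]])
H2 = np.array([[1, 0, 1, 0, 0, 1], [0, 0, 1, 1, 1, 0]])
n = 6
r = np.array([1, -1, 1, 1, -1, 1])        # hard decisions, BSC-like
w = np.array([1, -1, -1, 1, -1, 1])       # some hard-decision transfer vector

def member(s, H):
    return not (H @ ((1 - s) // 2) % 2).any()

# Collect the set sizes |S(u|m)|, |C(l)(u|m)| per outcome u = (r s^T, w s^T).
sizes = defaultdict(lambda: np.zeros(3))
for s in (np.array(t) for t in product([-1, 1], repeat=n)):
    u = (int(r @ s), int(w @ s))
    sizes[u] += [1, member(s, H1), member(s, H2)]

# Definition 6 with Lemma 3: for w(1) = w(2) = w the exp2 factors cancel down to
# exp2(u_0), so P^x(u|m) is proportional to |C1(u)| |C2(u)| / |S(u)| * 2^(u_0).
P_disc = {}
for u, (nS, nC1, nC2) in sizes.items():
    if nC1 > 0 and nC2 > 0:
        P_disc[u] = (nC1 * nC2 / nS) * 2.0 ** u[0]
Z = sum(P_disc.values())
for u in sorted(P_disc):
    print(u, round(P_disc[u] / Z, 4))
```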
Before considering this approach we evaluate the quality of the discriminated symbol probabilities.

2.1. Typicality

With Lemma 3 one obtains that the discriminated symbol probabilities defined by (13) are
$$P^{\otimes}_{C_i}(x|m) \propto \sum_{u \in U} \frac{|C^{(1)}_i(x, u|m)| \, \exp_2(u_0 + u_2) \cdot |C^{(2)}_i(x, u|m)| \, \exp_2(u_0 + u_1)}{|S_i(x, u|m)| \, \exp_2(u_0 + u_1 + u_2)} = \sum_{u \in U} \frac{|C^{(1)}_i(x, u|m)| \, |C^{(2)}_i(x, u|m)|}{|S_i(x, u|m)|} \, \exp_2(u_0) \qquad (17)$$
with the sets $C^{(l)}_i(x, u|m)$ defined by $s \in C^{(l)}$ and $s_i = x$. Hence, $P^{\otimes}_{C_i}(x|m)$ only depends on the discriminated set sizes $C^{(l)}_i(x, u|m)$, $S_i(x, u|m)$, and the word probabilities $P(s|r) \propto \exp_2(u_0(s))$.

The discriminated symbol probabilities should approximate the overall probabilities, i.e., $P^{\otimes}_{C_i}(x|m) \approx P^{(a)}_{C_i}(x|r)$. With Remark 13 and (17) this approximation is surely good if
$$\frac{|C^{(1)}_i(x, u|m)| \, |C^{(2)}_i(x, u|m)|}{|S_i(x, u|m)|} \approx |C^{(a)}_i(x, u|m)|. \qquad (18)$$
Intuitively, the approximation thus uses the knowledge of how many words with the same correlation values $u$ and decision $c_i = x$ are in both codes simultaneously. Moreover, depending on the discriminator $m$ the quality of this approximation will change. An average consideration of the approximation (18) is related to the following lemma.

Lemma 4 If the duals of the (linear) constituent codes do not share common words but the zero word then
$$|C^{(1)}| \, |C^{(2)}| = |S| \, |C^{(a)}|. \qquad (19)$$
Proof: With Definition 3 and the by assumption linearly independent $H^{(1)}$ and $H^{(2)}$ it holds that the dual code dimension of the coupled code is just the sum of the dual code dimensions of the constituent codes, i.e., $n - k = (n - k^{(1)}) + (n - k^{(2)})$. This is equivalent to $k^{(1)} + k^{(2)} = n + k$ and thus the statement of the lemma.

This lemma extends to the constrained set sizes $|C^{(l)}_i(x)|$ as used in (18). The approximations are thus in the mean correct. For random coding and independently chosen $m = (r, w^{(1)}, w^{(2)})$ this consideration can be put into a more precise form.

Lemma 5 For random (long) codes $C^{(1)}$ and $C^{(2)}$ and independently chosen $m$ holds the asymptotic equality
$$|C^{(a)}_i(x, u|m)| \asymp \frac{|C^{(1)}_i(x, u|m)| \, |C^{(2)}_i(x, u|m)|}{|S_i(x, u|m)|}. \qquad (20)$$
Proof: The probability of a random choice in $S$ to be in $S(u|m)$ is just the fraction of the set sizes $|S(u|m)|$ and $|S|$. For a random coupled code $C^{(a)}$ the codewords are a random subset of the set $S$. For $|C^{(a)}| \gg 1$ the law of large numbers thus gives the asymptotic equality
$$\frac{|C^{(a)}_i(x, u|m)|}{|C^{(a)}|} \asymp \frac{|S_i(x, u|m)|}{|S|}. \qquad (21)$$
The same holds true for the constituent codes:
$$\frac{|C^{(l)}_i(x, u|m)|}{|C^{(l)}|} \asymp \frac{|S_i(x, u|m)|}{|S|}.$$
A multiplication of the equality of code 1 with the one of code 2 gives the asymptotic equivalence
$$\frac{|S|}{|C^{(1)}| \, |C^{(2)}|} \cdot \frac{|C^{(1)}_i(x, u|m)| \, |C^{(2)}_i(x, u|m)|}{|S_i(x, u|m)|} \asymp \frac{|S_i(x, u|m)|}{|S|}. \qquad (22)$$
Combining (21) and (22) then leads to
$$\frac{|S|}{|C^{(1)}| \, |C^{(2)}|} \cdot \frac{|C^{(1)}_i(x, u|m)| \, |C^{(2)}_i(x, u|m)|}{|S_i(x, u|m)|} \asymp \frac{|C^{(a)}_i(x, u|m)|}{|C^{(a)}|}.$$
With (19) this is the statement of the lemma.

Remark 14 (Randomness) The proof of the lemma indicates that the approximation is rather good for code choices that are independent of $m$. I.e., perfect randomness of the codes is generally not needed. This can be understood by the concept of random codes in information theory. A random code is generally a good code. Conversely, a good code should not exhibit any structure, i.e., it behaves as a random code.
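The counting identity (19) of Lemma 4 can be verified directly on the toy matrices used above (a sketch; the assumption is that the rows of $H^{(1)}$ and $H^{(2)}$ are jointly linearly independent, i.e., the duals share only the zero word):

```python
import numpy as np
from itertools import product

H1 = np.array([[1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0]])
H2 = np.array([[1, 0, 1, 0, 0, 1], [0, 0, 1, 1, 1, 0]])
n = 6

def code_size(H):
    """Number of binary words satisfying H c^T = 0 (mod 2), by brute force."""
    return sum(1 for c in product([0, 1], repeat=n)
               if not (H @ np.array(c) % 2).any())

S_size = 2 ** n
C1, C2 = code_size(H1), code_size(H2)
Ca = code_size(np.vstack([H1, H2]))

# Lemma 4 / eq. (19): |C(1)| |C(2)| = |S| |C(a)| if the duals only share 0.
print(C1 * C2, S_size * Ca)        # here: 16*16 = 256 and 64*4 = 256
```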
2.2. Distinguished Words

The received vector $r$ is obtained from the channel and the encoding. Due to this dependence of $r$, the discriminator $m$ is thus generally not independent of the encoding. This becomes directly clear by reconsidering Example 4 and the assumption that a distinguished word $\hat{c}^{(a)}$ with $P^{(a)}(\hat{c}^{(a)}|r) \to 1$ exists. In this case the constituent distributions, and thus likewise the discriminator distribution $P^{\otimes}(u|r)$, will be large in a region where a "typical" number of errors $\hat{t}$ occurred, i.e., $u_0 = r c^T \approx n - 2\hat{t}$. For an independent $m$, however, this would not be the case: Then $P^{\otimes}(u|m)$ would with Lemma 5 be large in the vicinity of a typical minimal overall code word distance. This distance is generally larger than the expected number of errors $\hat{t}$ under a distinguished word. Hence, $P^{\otimes}(u|m)$ would then be large at a smaller $u_0$ than under a dependent $m$.

Remark 15 (Channel Capacity and Typical Sets) The existence of a distinguished word is equivalent to assuming a long random code of rate below capacity [18]. The word sent is then the only one in the typical set, i.e., it has a small distance to $r$. The other words of a random code will typically exhibit a large distance to $r$.

To describe single words one needs to describe how well certain environments in $u$ given $m$ are discriminated. The precision of the approximation of $C^{(a)}_i(x, u|m)$ by (18) hereby obviously depends on the set size $|S_i(x, u|m)|$. This leads to the following definition.

Definition 9 (Maximally Discriminated Region) The region maximally discriminated by $m$,
$$D(m) := \bigcup_{|S(u|m)| = 1} S(u|m),$$
consists of all words $s$ that uniquely define $u$ with $u_l = s w^{(l)T}$ for $l = 0, 1, 2$.

Theorem 2 For independent constituent codes and a distinguished event maximally discriminated by $\hat{c}^{(a)} \in D(m)$ it holds that
$$P^{\otimes}_{C_i}(x|m) \asymp P^{(a)}_{C_i}(x|r) \quad \text{and} \quad P^{\otimes}(u|m) \asymp P^{(a)}(u|r).$$
Proof: It holds with (17) that
$$P^{\otimes}_{C_i}(x|m) \propto \sum_{u \in U} \frac{|C^{(1)}_i(x, u|m)| \, |C^{(2)}_i(x, u|m)|}{|S_i(x, u|m)|} \, \exp_2(u_0). \qquad (23)$$
For the distinguished event $\hat{c}^{(a)} \in C^{(a)}$ it follows that
$$\frac{|C^{(1)}_i(x, u(\hat{c}^{(a)})|m)| \, |C^{(2)}_i(x, u(\hat{c}^{(a)})|m)|}{|S_i(x, u(\hat{c}^{(a)})|m)|} = |C^{(a)}_i(x, u(\hat{c}^{(a)})|m)| = 1 \quad \text{for } \hat{c}^{(a)}_i = x,$$
as by assumption $\hat{c}^{(a)} \in D(m)$ is maximally discriminated, which gives by definition and $\hat{c}^{(a)} \in C^{(l)}$ for $l = 1, 2$ that
$$|S_i(x, u(\hat{c}^{(a)})|m)| = |C^{(1)}_i(x, u(\hat{c}^{(a)})|m)| = |C^{(2)}_i(x, u(\hat{c}^{(a)})|m)| = 1.$$
I.e., the term with $u = u(\hat{c}^{(a)})$ in (23) is correctly estimated. The other terms in (23) represent non distinguished words and can (with the assumption of independent constituent codes) be considered to be independent of $m$. This gives that they can be assumed to be obtained by random coding. I.e., for $u \neq u(\hat{c}^{(a)})$ with $u_l(\hat{c}^{(a)}) = w^{(l)} \hat{c}^{(a)T}$,
the asymptotic equality
$$\frac{|C^{(1)}_i(x, u|m)| \, |C^{(2)}_i(x, u|m)|}{|S_i(x, u|m)|} \asymp |C^{(a)}_i(x, u|m)|$$
of Lemma 5 holds. Hence the other words are (asymptotically) correctly estimated, too. Moreover, with (23) one obtains for $\hat{c}^{(a)}$ a probability value proportional to $\exp_2(r \hat{c}^{(a)T})$. The other terms of (23) are much smaller: An independent random code typically does not exhibit code words of small distance to $r$. As the code rate is below capacity, $P^{\otimes}(u(\hat{c}^{(a)})|m)$ then exceeds the sum of the probabilities of the other words. Asymptotically by (17) both the overall symbol probabilities and the overall distribution of correlations follow.

Remark 16 (Distance) Note that the multiplication with $\exp_2(u_0)$ in (23) excludes elements that are not in the distinguished set ($\equiv$ with large distance to $r$). These words can, as shown by information theory, not dominate (a random code) in probability. I.e., a maximal discrimination of non typical words will not significantly change the discriminated symbol probabilities $P^{\otimes}_{C_i}(x|m)$. This indicates that a random choice of the $w^{(l)}$ for $l = 1, 2$ will typically lead to similar beliefs $P^{\otimes}_{C_i}(x|m)$ as under $w^{(1)} = w^{(2)} = 0$. Conversely it holds that if one code word at a small distance is maximally discriminated then its probability typically dominates the probabilities of the other terms in (23).

Example 5 We continue the example above. The discriminator $m = (r, \hat{c}^{(a)}, 0)$ maximally discriminates the distinguished word $\hat{c}^{(a)}$ at $u = u(\hat{c}^{(a)}) = (n - 2 d_H(r, \hat{c}^{(a)}), n, 0)$. The discriminator complexity $|U|$ is maximally $(n+1)^2$ as only this many different values of $u = (n - 2 d_H(r, c), n - 2 d_H(\hat{c}^{(a)}, c), 0)$ exist. The complexity is then given by the computation of maximally $(n+1)^2$ elements. As this has to be done $n$ times in the trellis (see Appendix A.1) the asymptotic complexity becomes $O(n^3)$ (for fixed trellis state complexity). The computation will give by Theorem 2 that $P^{\otimes}_{C_i}(x|m) \asymp P^{(a)}_{C_i}(x|r)$ with $\hat{c}^{(a)}_i = \mathrm{sign}(L^{\otimes}_i(m))$, as $\hat{c}^{(a)}$ is distinguished and as all other words can be assumed to be chosen independently. I.e., $P^{\otimes}(u|m)$ exhibits a peak of height 1 and the $P^{\otimes}_{C_i}(x|m)$ give the asymptotically correct symbol probabilities.

2.3. Well Defined Discriminators

Example 5 shows that for the distinguished event $\hat{c}^{(a)}$ the hard decision discriminator $m = (r, \hat{c}^{(a)}, 0)$ with $\hat{c}^{(a)}_i = \mathrm{sign}(L^{\otimes}_i(m))$ produces discriminated symbol beliefs close to the overall symbol probabilities. The discriminator complexity $|U| \leq (n+1)^2$ is thus sufficient to obtain the asymptotically correct decoding decision.

Remark 17 (Equivalent Hard Decision Discriminators) By (23) the hard decision discriminators $m = (r, w, 0)$, $m = (r, 0, w)$, and $m = (r, w, w)$ are equivalent: For the three cases the same $C^{(l)}_i(x|m)$ and $S_i(x|m)$ and thus $P^{\otimes}_{C_i}(x|m)$ follow. In the sequel of this section we will (for symmetry reasons) only consider the discriminators $m = (r, w, w)$.

The discussion above shows that a discriminator with randomly chosen $w$ should give almost the same $L^{\otimes}_i(m)$ as $L^{\otimes}_i(r, 0, 0)$. If, however, the discriminator is strongly dependent on the distinguished solution, i.e., $w = \hat{c}^{(a)}$, then the correct solution is found via $L^{\otimes}_i(m)$. This gives the following definition and lemma.

Definition 10 (Well Defined Discriminator) A well defined discriminator $m = (r, w, w)$ fulfils
$$w_i = \mathrm{sign}(L^{\otimes}_i(m)) \quad \text{for all } i. \qquad (24)$$
Lemma 6 For a BSC and distinguished $\hat{c}^{(a)}$ there exists a well defined discriminator $m = (r, w, w)$ with $w_i, r_i \in B$ such that $\hat{c}^{(a)}_i = w_i$.

Proof: Set $m = (r, \hat{c}^{(a)}, \hat{c}^{(a)})$. For this choice holds $\hat{c}^{(a)} \in D(m)$ and thus with Theorem 2 asymptotic equality. Moreover, for a distinguished element holds that $P^{\otimes}_{C_i}(\hat{c}^{(a)}_i|m) \asymp P^{(a)}_{C_i}(\hat{c}^{(a)}_i|r) \asymp 1$ and thus $\hat{c}^{(a)}_i = \mathrm{sign}(L^{(a)}_i(r)) = \mathrm{sign}(L^{\otimes}_i(m))$.

The definition of a well defined discriminator (24) can be used as an iteration rule, which gives Algorithm 2. The iteration thereby exhibits by Lemma 6 a fixed point, which provably represents the distinguished solution. Note that the employment of $w^{(1)} = w^{(2)} = w$ is here handy as by $L^{\otimes}_i(m)$ only one common belief is available. This is in contrast to Algorithm 1 where the employment of the two constituent beliefs generally gives that $w^{(1)} \neq w^{(2)}$.

Algorithm 2 Iterative Hard Decision Discrimination
1. Set $m = (r, 0, 0)$ and $w = 0$.
2. Set $v = w$ and $w_i \leftarrow \mathrm{sign}(L^{\otimes}_i(m))$ for all $i$.
3. If $v \neq w$ then set $m = (r, w, w)$ and go to Step 2.
4. Set $\hat{c} = w$.

To understand the overall properties of the algorithm one needs to consider its convergence properties and the existence of other fixed points. A first intuitive assessment of the algorithm is as follows. The decisions taken by $w_i = \mathrm{sign}(L^{\otimes}_i(r, 0, 0))$ should by (15) lead to a smaller symbol error probability than the one over $r$. Overall these decisions are based on $P^{\otimes}(u|r, 0, 0)$. This distribution is necessarily large in the vicinity of $\hat{u}_0 = n - 2\hat{t}$ with $\hat{t}$ the expected number of errors. The subsequent discrimination with $w$ and $r$ will consider the vicinity of $c$ more precisely if $w c^T$ is larger than $r c^T$: In this vicinity fewer words exist, which gives that the $|S(u|m)|$ are smaller there. Smaller error probability in $w$ is thus with (17) typically equivalent to a better discrimination in the vicinity of $\hat{c}^{(a)}$. This indicates that the discriminator $(r, w, w)$ is better than $(r, 0, 0)$. Hence, the new $w_i \leftarrow \mathrm{sign}(L^{\otimes}_i(r, w, w))$ should again exhibit smaller error probability, and so forth. If the iteration ends then a stable solution is found. Finally, the solution $w = \hat{c}^{(a)}$ is stable. This behaviour is exemplarily depicted in Figure 2, where the density of the squares represents the probability $P^{\otimes}(u|m)$ of $u = (u_0, u_1)$.

[Figure 2: Hard Decision Discrimination. $P^{\otimes}(u|m)$ over $(u_0, u_1)$ at (a) initialisation, (b) an intermediate step, and (c) the stable solution, concentrating around $u_0 = n - 2\hat{t}$.]

2.4. Cross Entropy

To obtain a quantitative assessment of Algorithm 2 we use the following definition.

Definition 11 (Cross Entropy) The cross entropy
$$H(C|w \,\|\, r) := E_C[H(s|w)\,|\,r] = -\sum_{s \in C} P_C(s|r) \log_2(P(s|w))$$
is the expectation of the uncertainty $H(s|w) = -\log_2 P(s|w)$ under $r$ and $c \in E(C)$.

The cross entropy measures, as does the KULLBACK-LEIBLER distance
$$D(C|w \,\|\, r) := H(C|w \,\|\, r) - H(C|r) \quad \text{with} \quad H(C|r) := E_C[H_C(s|r)\,|\,r] = -\sum_{s \in C} P_C(s|r) \log_2(P_C(s|r)),$$
the similarity between the distributions $P(c|r)$ and $P(s|w)$. By JENSEN's inequality it is easy to show [15] that $D(C|w \,\|\, r) \geq 0$ and thus $H(C|w \,\|\, r) \geq H(C|r) \geq 0$.
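Definition 11 can be evaluated by brute force for the toy coupled code (a sketch; the transfer vector $w$ is an arbitrary example and $P(s|w)$ uses the symbol-wise independent form (5), normalised over $S$):

```python
import numpy as np
from itertools import product

H1 = np.array([[1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0]])
H2 = np.array([[1, 0, 1, 0, 0, 1], [0, 0, 1, 1, 1, 0]])
Ha = np.vstack([H1, H2])
n = 6
r = np.array([0.9, -1.2, 0.3, 1.1, -0.8, 0.7])
w = np.array([0.5, -0.5, 0.5, 0.5, -0.5, 0.5])         # example transfer vector (assumed)

C = [np.array(s) for s in product([-1, 1], repeat=n)
     if not (Ha @ ((1 - np.array(s)) // 2) % 2).any()]

P_C = np.array([2.0 ** float(r @ s) for s in C])
P_C /= P_C.sum()                                        # P_C(s|r) over the coupled code

def H_word(s, w):
    """Word uncertainty -log2 P(s|w) for the symbol-wise independent P(s|w)."""
    return float(np.sum(np.log2(2.0 ** w + 2.0 ** -w) - s * w))

H_cross = sum(p * H_word(s, w) for p, s in zip(P_C, C))     # H(C|w||r)
H_cond = -sum(p * np.log2(p) for p in P_C if p > 0)         # H(C|r)
print(round(H_cross, 3), round(H_cond, 3), round(H_cross - H_cond, 3))  # D >= 0
```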
The entropy $H(C|r)$ is an information theoretic measure of the number of probable words in $E(C)$ under $r$. To better explain the cross entropy $H(C|w \,\|\, r)$ we shortly review some results regarding the entropy. The typical set $A_{n\varepsilon}(C|r)$ is given by the typical region
$$A_{n\varepsilon}(C|r) = \{ c \in E(C) : |H(c|r) - H(C|r)| \leq n\varepsilon \}$$
of word uncertainties $H(c|r) = -\log_2 P(c|r)$. This definition directly gives
$$1 \geq \sum_{A_{n\varepsilon}(C|r)} P(c|r) = \sum_{A_{n\varepsilon}(C|r)} \exp_2(-H(c|r)) \geq \exp_2(-H(C|r) - n\varepsilon) \sum_{A_{n\varepsilon}(C|r)} 1,$$
respectively,
$$P_C(A_{n\varepsilon}(C|r)\,|\,r) = \sum_{A_{n\varepsilon}(C|r)} P(c|r) = \sum_{A_{n\varepsilon}(C|r)} \exp_2(-H(c|r)) \leq \exp_2(-H(C|r) + n\varepsilon) \sum_{A_{n\varepsilon}(C|r)} 1.$$
With $\sum_{A_{n\varepsilon}(C|r)} 1 = |A_{n\varepsilon}(C|r)|$ this leads to the bounds on the logarithmic set sizes
$$H(C|r) + n\varepsilon \geq \log_2 |A_{n\varepsilon}(C|r)| \geq H(C|r) + \log_2(P_C(A_{n\varepsilon}(C|r)\,|\,r)) - n\varepsilon$$
by the entropy. For many independent events in $r$ the law of large numbers gives for $\varepsilon > 0$ that $P_C(A_{n\varepsilon}(C|r)\,|\,r) \approx 1$ and thus $H(C|r) \approx \log_2(|A_{n\varepsilon}(C|r)|)$.

We investigate if a similar statement can be made for the cross entropy. To do so, first a cross typical set $A_{n\varepsilon}(C|w \,\|\, r)$ is defined by the region of typical word uncertainties:
$$\min(H(S|w), H(C|w \,\|\, r)) - n\varepsilon \leq H(s|w) \leq \max(H(S|w), H(C|w \,\|\, r)) + n\varepsilon. \qquad (25)$$
I.e., the region spans the typical set in $w$ but includes more words if $H(S|w) \neq H(C|w \,\|\, r)$. As the typical set in $w$ is included this gives for large $n$ that $P_S(A_{n\varepsilon}(C|w \,\|\, r)\,|\,w) \approx 1$ and then in the same way as above the bounds on the logarithmic set size
$$\max(H(S|w), H(C|w \,\|\, r)) + n\varepsilon \geq \log_2 |A_{n\varepsilon}(C|w \,\|\, r)| \geq \min(H(S|w), H(C|w \,\|\, r)) - n\varepsilon.$$
Moreover, by the definition of the cross entropy and the law of large numbers it holds that typically $P_C(A_{n\varepsilon}(C|w \,\|\, r)\,|\,r) \approx 1$ is true, too. This gives that the cross typical set includes the typical sets $A_{n\varepsilon}(S|w)$ and $A_{n\varepsilon}(C|r)$, i.e.,
$$A_{n\varepsilon}(C|w \,\|\, r) \supseteq A_{n\varepsilon}(S|w) \quad \text{and} \quad A_{n\varepsilon}(C|w \,\|\, r) \supseteq A_{n\varepsilon}(C|r). \qquad (26)$$
If one wants to define a transfer vector $w$ based on $r$ one is thus interested in a representation in $w$ such that the logarithmic set size
$$\log_2 |A_{n\varepsilon}(C|w \,\|\, r)| \leq \max(H(S|w), H(C|w \,\|\, r)) + n\varepsilon$$
is as small as possible. In the sequel we consider $P(s|w) \propto P(w|s)$ defined by (5). This probability is given by
$$P(s|w) = \frac{\prod_{i=1}^{n} P(w_i|s_i)}{\sum_{s' \in S} P(w|s')} = \prod_{i=1}^{n} \frac{P(w_i|s_i)}{P(w_i|{+1}) + P(w_i|{-1})} = \prod_{i=1}^{n} \frac{2^{s_i w_i}}{2^{w_i} + 2^{-w_i}}. \qquad (27)$$
The cross entropy thus becomes
$$H(C|w \,\|\, r) = \sum_{s \in C} P_C(s|r) \sum_{i=1}^{n} \left( \log_2(2^{w_i} + 2^{-w_i}) - s_i w_i \right) = \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - \sum_{i=1}^{n} \sum_{s \in C} P_C(s|r) \, s_i w_i \qquad (28)$$
with
$$\sum_{i=1}^{n} \sum_{s \in C} P_C(s|r) \, s_i w_i = \sum_{i=1}^{n} E_{C, C_i}[x|r] \, w_i = \sum_{i=1}^{n} w_i \left( P^{(c)}_{C_i}(+1|r) - P^{(c)}_{C_i}(-1|r) \right).$$
This definition almost directly defines an optimal transfer.

Lemma 7 Equal logarithmic symbol probability ratios
$$w_i = L_i(w) = L^{(c)}_i(r) = \frac{1}{2} \log_2 \frac{P^{(c)}_{C_i}(+1|r)}{P^{(c)}_{C_i}(-1|r)}$$
and $P(s|w) \propto P(w|s)$ defined by (5) minimise the cross entropy $H(C|w \,\|\, r)$ and the KULLBACK-LEIBLER distance $D(C|w \,\|\, r)$.

Proof: First it holds by (5) that $w_i = L_i(w) = \frac{1}{2} \log_2 \frac{\exp_2(+w_i)}{\exp_2(-w_i)}$.
A differentiation of (28) leads to
$$\frac{\partial}{\partial w_i} H(C|w \,\|\, r) = \sum_{s \in C} P_C(s|r) \left( \tanh_2(w_i) - s_i \right) = \tanh_2(w_i) - \sum_{s \in C} s_i P_C(s|r) \overset{!}{=} 0$$
with $\tanh_2(x) = (2^x - 2^{-x})/(2^x + 2^{-x})$. This directly gives that
$$\tanh_2(w_i) = \sum_{s \in C} s_i P_C(s|r) = P^{(c)}_{C_i}(+1|r) - P^{(c)}_{C_i}(-1|r).$$
As $\tanh_2(L^{(c)}_i(r)) = P^{(c)}_{C_i}(+1|r) - P^{(c)}_{C_i}(-1|r)$ and $\frac{\partial}{\partial w_i} H(C|r) = 0$ this is equivalent to the statement of the lemma.

I.e., the definition of $w$ by $L_i(w) = L^{(c)}_i(r)$ is a consequence of the independence assumption (5). Especially interesting is that $L_i(w) = L^{(c)}_i(r)$ directly implies that $H(S|w) = H(C|w \,\|\, r)$, which gives with (26) that $A_{n\varepsilon}(C|w \,\|\, r) = A_{n\varepsilon}(S|w)$ and thus $A_{n\varepsilon}(S|w) \supseteq A_{n\varepsilon}(C|r)$. A belief representing transfer vector $w$ thus typically describes all probable code words. By reconsidering the definition of the cross typical set in (25), the set $A_{n\varepsilon}(C|r)$, typical in $r$ and $C$, is (in the mean) contained in the set of words $s \in S$ probable in $w$ if $H(S|w) \geq H(C|w \,\|\, r)$. Hereby the set of probable words is defined by only considering the right hand side inequality of (25).

2.5. Discriminator Entropy

In this section the considerations are extended to the discrimination. To do so we use, in equivalence to (28), the following definition.

Definition 12 (Discriminated Cross Entropy) The discriminated cross entropy is
$$H(C^{\otimes}|w \,\|\, m) := -\sum_{u \in U} P^{\otimes}(u|m) \log_2 P(s|w) := \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - \sum_{i=1}^{n} \sum_{u \in U} \sum_{s_i \in B} w_i s_i \, P^{\otimes}_{C_i}(s_i, u|m) = \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - w_i \, E^{\otimes}_U[c_i|m]$$
with (27) and $E^{\otimes}_U[c_i|m] = P^{\otimes}_{C_i}(+1|m) - P^{\otimes}_{C_i}(-1|m)$.

Note that this definition again uses the correspondence of $u$ and $s$. Even though by a discrimination not all words are independently considered, a word uncertainty consideration is still possible by attributing appropriate probabilities. Lemma 7 directly gives that the discriminated cross entropy is always larger than or equal to the discriminated symbol entropy $H(C^{\otimes} \,\|\, m)$, i.e.,
$$H(C^{\otimes}|w \,\|\, m) \geq -\sum_{u \in U} P^{\otimes}(u|m) \log_2 P(s|L^{\otimes}(m)) =: H(C^{\otimes} \,\|\, m).$$
The discriminator entropy measures the uncertainty of the discriminated decoding decision, i.e., the number of words in $S$ that need to be considered. This directly gives the following theorem.

Theorem 3 The decoding problem for a distinguished word is equivalent to the solution of
$$w_i = \mathrm{sign}(L^{\otimes}_i(m)) \qquad (29)$$
with the discriminated symbol entropy $H(C^{\otimes} \,\|\, m) < 1$ and $m = (r, w, w)$.

Proof: For $w_i = \hat{c}^{(a)}_i$ it holds that $\hat{c}^{(a)} \in D(m)$. This gives with Lemma 6 for the discriminated distribution that $P^{\otimes}(u|m) \asymp P^{(a)}(u|r)$. As $\hat{c}^{(a)}$ is a distinguished solution this gives $P^{(a)}(u(\hat{c}^{(a)})|r) \approx 1$ or equivalently $H(C^{\otimes} \,\|\, m) \approx 0$.

The discriminated symbol entropy $H(C^{\otimes} \,\|\, m)$ estimates via $\exp_2 H(C^{\otimes} \,\|\, m)$ the number of elements in the set of probable words in $S$. Any solution $m$ with $H(C^{\otimes} \,\|\, m) < 1$ thus exhibits one word with $P^{\otimes}(\hat{u}|m) \approx 1$. I.e., one has a discriminated distribution $P^{\otimes}(u|m)$ that contains just one peak of height almost one at $\hat{u}$. As only one word contributes, the decisions by (29) give this word, or equivalently $\hat{u} = u(w)$. Hence, $\hat{c} = w$ is maximally discriminated.
This directly implies that the obtained $\hat{c}$ needs to be a codeword of the coupled code: Both distributions $P^{(l)}(u|m)$ are used for the single word description $P^{\otimes}(u|m) \neq 0$. Hence, both codes contain the word $\hat{c}$ maximally discriminated in $u$, which gives (by the definition of the dually coupled code) that this word is an overall codeword.

Assume that $\hat{c} \neq \hat{c}^{(a)}$ represents a non distinguished word. With Remark 16 this word needs to exhibit a large distance to $r$. Typically many words $c \in C^{(a)}$ exist at such a large distance. By (21) these words are considered in the computation of $P^{\otimes}(u|m)$. Thus $P^{\otimes}(u|m)$ is not in the form of a peak, which gives that $H(C^{\otimes} \,\|\, m) > 1$. As this is a contradiction, no other solution $w$ of (29) but $w = \hat{c}^{(a)}$ may exhibit a discriminated symbol entropy $H(C^{\otimes} \,\|\, m) < 1$.

Remark 18 (Typical Decoding) The proof of the theorem indicates that any code word $c \in C^{(a)}$ with small distance to $r$ may give rise to a well defined discriminator $m$ with $H(C^{\otimes} \,\|\, m) < 1$ and $w = c$. Hence, a low entropy solution of the equation is not equivalent to ML decoding. However, if the code rate is below capacity and a long code is employed only one distinguished word exists.

Theorem 3 gives that Algorithm 2 fails in finding the distinguished word if either the stopping criterion is never fulfilled (it runs infinitely long) or the solution exhibits a large discriminated symbol entropy. To investigate these cases consider the following lemma.

Lemma 8 It holds that
$$w_i \leftarrow \mathrm{sign}(L^{\otimes}_i(m)) \quad \text{for all } i \qquad (30)$$
minimises the cross entropy $H(C^{\otimes}|w \,\|\, m)$ under the constraint $w_i \in B$.

Proof: The cross entropy $H(C^{\otimes}|w \,\|\, m)$ is given by
$$H(C^{\otimes}|w \,\|\, m) = \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - w_i \cdot \tanh_2(L^{\otimes}_i(m)).$$
The cross entropy is under constant $|w_i|$ or $w_i \in B$ obviously minimal for $\mathrm{sign}(w_i \cdot \tanh_2(L^{\otimes}_i(m))) = 1$, which is the statement of the lemma.

The algorithm fails if the iteration does not converge. However, the lemma gives that (30) minimises in each step of the iteration the cross entropy towards $w$. This is equivalent to $H(C^{\otimes}|m \,\|\, m) \geq H(C^{\otimes}|w \,\|\, m)$. This cross entropy is with
$$H(C^{\otimes}|w \,\|\, m) \geq \min_{v} H(C^{\otimes}|v \,\|\, m) = H(C^{\otimes} \,\|\, m)$$
always larger than the overall discriminated symbol entropy. Furthermore, it holds by the optimisation rule that
$$H(S|w) \geq H(C^{\otimes}|w \,\|\, m) \geq H(C^{\otimes} \,\|\, m),$$
which gives that the typical set under the discrimination remains included. The subsequent step will therefore continue to consider this set. If the discriminated cross entropy does not further decrease one thus obtains the same $w$, which is a fixed point. This observation is similar to the discussion above. A discriminator $m$ describes environments with words close to $r$. A minimisation of the cross entropy can be considered as an optimal description of this environment under the independence assumption (and the imposed hard decision constraint). If this knowledge is processed iteratively then these environments should be investigated better and better. The discriminated symbol entropy $H(C^{\otimes} \,\|\, m)$ will thus typically decrease. For an infinite loop this is not fulfilled, i.e., such a loop is unlikely or non typical. Moreover, the iterative algorithm fails if a stable solution $w_i = \mathrm{sign}(L^{\otimes}_i(m))$ with $w \neq \hat{c}$ is found.
These solutions exhibit, by the proof of Theorem 3, a large discriminated symbol entropy $H(C^{\otimes} \| m)$ (many words are probable) and thus small $|L^{\otimes}_i(m)|$. However, solutions with small $|L^{\otimes}_i(m)|$ seem unlikely, as these values are usually already relatively large for $w = 0$, and Lemma 8 indicates that they become larger in each step.

Remark 19 (Improvements) If the algorithm fails due to a well defined discriminator of large cross entropy then an appropriately chosen increase of the discriminator complexity should improve the algorithm. To increase the discrimination complexity under hard decisions one may use discriminators $w^{(1)} \neq w^{(2)}$. One possibility is hereby to reuse the old transfer vector by $w^{(2)} = w^{(1)}$ and $w^{(1)}_i = \mathrm{sign}(L^{\otimes}_i(m))$ in Step 2 of the iterative algorithm. The complexity of the algorithm will then, however, increase to $O(n^4)$.

On the other hand the complexity can be strongly decreased without losing the possibility to maximally discriminate the distinguished word. First, only (distinguished) words up to some distance $t$ from the received word contribute to $L^{\otimes}(m)$. One may thus decrease the complexity to $|U| \leq t \cdot (n+1)$ (if a full discrimination of the distance to $w$ is used) by computing only those values. A further reduction is obtained by the use of erasures in $w$, i.e., by $w_i \in \{-1, 0, +1\}$ and
$$w_i = \frac{\mathrm{sign}(L^{\otimes}_i(m)) - r_i}{2}$$
in Step 2 of the algorithm. Note that this discrimination has only complexity $|U| \leq t^2$, as $u_1(\hat{c}) \leq w w^T \leq t$ is typically fulfilled. It remains to show that the distinguished solution is stable. We do this here with an informal proof: for $\hat{c}_i = \mathrm{sign}(L^{\otimes}_i(m))$ one obtains $w_i = \hat{c}_i$ if $\hat{c}_i \neq r_i$ and $w_i = 0$ else. Hence one obtains the values $u_0(\hat{c}) = r \hat{c}^T$ and $u_1(\hat{c}) = w w^T$. If only one word exists for these values $u_l$ then this discriminator $(r, w, w)$ is surely maximally discriminating: if $u_1(\hat{c}) = w w^T$ then $s_i = w_i$ is uniquely defined for $w_i \neq 0$; given $u_0(\hat{c})$, the other symbols are then uniquely defined to $s_i = r_i$, as $u_0(\hat{c}) = r \hat{c}^T$ is the unique maximum of $u_0(s)$ under $s_i = w_i$ for $w_i \neq 0$.

The overall complexity of this algorithm is thus smaller than $O(n \cdot t^2)$, respectively $O(n \cdot t^3)$ for a discrimination with $w^{(1)} \neq w^{(2)}$.

3. Approximations

The last section indicates that an iterative algorithm with discriminated symbol probabilities should outperform the iterative propagation of only the constituent beliefs. However, the discriminator approach was restricted to problems with small discriminator complexity $|U|$. In this form and with Remark 12 the algorithm does not apply, for example, to AWGN channels. In this section discriminator based decoding is generalised to real valued $w^{(l)}$ and $r$, and hence generally $|U| = |S|$.

For a prohibitively large discriminator complexity $|U|$ the distributions $P^{(l)}_{C_i}(x, u \mid m^{(l)})$ cannot be practically computed; only an approximation is feasible. This approximation is usually done via a probability density, i.e.,
$$p^{(l)}_{C_i}(x, u \mid m^{(l)})\, \mathrm{d}u \approx P^{(l)}_{C_i}(x, u \mid m^{(l)}),$$
where $p^{(l)}_{C_i}(x, u \mid m^{(l)})$ is described by a small number of parameters.

Remark 20 (Representation and Approximation) The use of an approximation changes the premise compared to the last section.
There we assumed that the representation complexity of the discriminator is limited but that the computation is perfect. In this section we assume that the discriminator is generally globally maximal but that an approximation is sufficient.

An estimation of a distribution may be performed by a histogram given by the rule
$$U_{\varepsilon}(u \mid m) := \bigcup_{|v - u| < \varepsilon} U(v \mid m)$$
and the quantisation $\varepsilon$. These values can be approximated (see Appendix A.1) with an algorithm that exhibits a complexity comparable to the one for the computation of the hard decision values. For a sufficiently small $\varepsilon$ one obviously obtains a sufficient approximation. Here the complexity remains of the order $O(n^3)$. It may, however, be reduced as in Remark 19.

Remark 21 (Uncertainty and Distance) The approach with histograms is equivalent to assuming that words with similar $u$ do not need to be distinguished; a discrimination of $s^{(1)}$ and $s^{(2)}$ is assumed to be unnecessary if the "uncertainty distance"
$$d_H(u(s^{(1)}), u(s^{(2)})) = d_H(u^{(1)}, u^{(2)}) = \sum_{l=0}^{2} \big\| H(s^{(1)} \mid w^{(l)}) - H(s^{(2)} \mid w^{(l)}) \big\| = \sum_{l=0}^{2} \big\| u^{(1)}_l - u^{(2)}_l \big\|$$
of $s^{(1)}$ and $s^{(2)}$ is smaller than some $\varepsilon$. The error that occurs by
$$P^{\otimes}_{C_i}(x \mid m) \propto \int_{U} \frac{p^{(1)}_{C_i}(x, u \mid m^{(1)})\, p^{(2)}_{C_i}(x, u \mid m^{(2)})}{p_{C_i}(x, u \mid m)}\, \mathrm{d}u \qquad (31)$$
can for sufficiently small $\varepsilon$ usually be neglected.

Note that another approach is to approximate only in $u_0$ and to continue to use a limited discriminator complexity in $w$ (by, for example, hard decisions $w_i \in \mathbb{B}$), which gives an exact discrimination in $u_1$.

3.1. Gauss Discriminators

Distributions are usually represented via parameters defined by expectations. This is done as the law of large numbers shows that these expectations can be computed out of a statistic. Given these values, the unknown distributions may then be approximated by maximum entropy [9] densities.

Example 6 The simplest method to approximate distributions by probability densities is to assume that no extra knowledge is available over $u$. This leads to the maximal entropy "distributions" (in Bayes' estimation theory this is equivalent to a non proper prior) with stripped $u$
$$P^{(l)}_{C_i}(x \mid m^{(l)}) \approx P^{(l)}_{C_i}(x, u \mid m^{(l)}) \quad \text{and} \quad P_{C_i}(x \mid m) \approx P_{C_i}(x, u \mid m),$$
which is equivalent to $L^{\boxtimes}_i(m) = 0$, as then $P^{\boxtimes}_{C_i}(u \mid x, m) = 1$, or
$$L^{\otimes}_i(m) = r_i + \breve{L}^{(1)}_i(m^{(1)}) + \breve{L}^{(2)}_i(m^{(2)}),$$
and thus implicitly to Algorithm 1. Note that the derived tools do not give rise to a further evaluation of this approach: a discrimination in the sense defined above does not take place.

The additional expectations considered here are the mean values $\mu_l$ and the correlations $\phi_{l,k}$. These are for the given correlation map
$$u_l(s) = \sum_{i=1}^{n} w^{(l)}_i s_i \qquad (32)$$
and $|s_i| = 1$ defined by
$$\mu^{(h)}_l = \mathrm{E}_{C^{(h)}}[u_l \mid m^{(h)}] = \sum_{i=1}^{n} \sum_{c \in C^{(h)}} w^{(l)}_i c_i\, P^{(h)}(c \mid m^{(h)}) = \sum_{i=1}^{n} w^{(l)}_i \big( P^{(h)}_{C_i}(+1 \mid m^{(h)}) - P^{(h)}_{C_i}(-1 \mid m^{(h)}) \big) = \sum_{i=1}^{n} w^{(l)}_i\, \mathrm{E}_{C^{(h)}}[c_i \mid m^{(h)}]$$
and
$$\big[\phi^{(h)}_{l,k}\big]^2 + \mu^{(h)}_l \mu^{(h)}_k = \mathrm{E}_{C^{(h)}}[u_l u_k \mid m^{(h)}] = \sum_{i=1}^{n} \sum_{j=1}^{n} w^{(l)}_i w^{(k)}_j \sum_{c \in C^{(h)}} c_i c_j\, P^{(h)}(c \mid m^{(h)}) = \sum_{i=1}^{n} \sum_{j=1}^{n} w^{(l)}_i w^{(k)}_j\, \mathrm{E}_{C^{(h)}}[c_i c_j \mid m^{(h)}].$$
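The relation between the symbol level expectations and the moments of the correlations $u_l$ can be illustrated with a small sketch. The following Python fragment only spells out the defining sums above; it assumes that the expectations $\mathrm{E}[c_i \mid m^{(h)}]$ and $\mathrm{E}[c_i c_j \mid m^{(h)}]$ are already available (in the paper they are obtained by the two way trellis recursions of Appendix A.1), and the brute force matrix products shown here do not reflect the $O(n)$ trellis complexity.

    import numpy as np

    def correlation_moments(w, e_c, e_cc):
        """Moments of the correlations u_l(s) = sum_i w[l, i] * s_i.

        w    : (L, n) discriminator vectors w^(l)
        e_c  : (n,)   symbol expectations   E[c_i | m^(h)]
        e_cc : (n, n) pairwise expectations E[c_i c_j | m^(h)], diagonal one since |c_i| = 1
        Returns the mean values mu_l and the central second moments [phi_{l,k}]^2.
        """
        mu = w @ e_c                       # mu_l = sum_i w^(l)_i E[c_i]
        second = w @ e_cc @ w.T            # E[u_l u_k] = sum_{i,j} w^(l)_i w^(k)_j E[c_i c_j]
        phi2 = second - np.outer(mu, mu)   # [phi_{l,k}]^2 = E[u_l u_k] - mu_l mu_k
        return mu, phi2

    # toy usage with independent symbols (E[c_i c_j] = E[c_i] E[c_j] off the diagonal)
    n = 8
    e_c = np.tanh(np.random.randn(n))
    e_cc = np.outer(e_c, e_c)
    np.fill_diagonal(e_cc, 1.0)
    w = np.vstack([np.ones(n), np.sign(np.random.randn(n))])
    mu, phi2 = correlation_moments(w, e_c, e_cc)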
The complexity of the computation of each value $\mu^{(h)}_l$ and $\phi^{(h)}_{l,k}$ is (see Appendix A.1) comparable to the complexity of the BCJR algorithm, i.e., $O(n)$ for fixed trellis state complexity.

For known mean values and variances the maximum entropy density is the Gauss density. By the following lemma this density is especially suited for discriminator based decoding.

Lemma 9 For long codes with small trellis complexity one obtains asymptotically a Gauss density for $P^{(l)}(u \mid m^{(l)})$ and $P(u \mid m)$.

Proof: The values $u_l(c)$ are obtained by the correlation given in (32). For $P(u \mid m)$ this is equivalent to a sum of independent random values. I.e., $P(u \mid m)$ is by the central limit theorem Gauss distributed. For long codes with small trellis state complexity and many considered words the same holds true for $P^{(h)}(u \mid m^{(h)})$: in this case the limited code memory gives sufficiently many independent regions of subsequent code symbols, i.e., the correlation again leads to a sum of many independent random values.

Remark 22 (Notation) The Gauss approximated symbol probability distributions are here denoted by a hat, i.e., $\hat{p}^{(l)}_{C_i}(x, u \mid m)$ and $\hat{p}_{C_i}(x, u \mid m)$. The same is done for the approximated logarithmic symbol probability ratios.

The constituent Gauss approximations then imply the approximation of $P^{\otimes}(u \mid m)$ by
$$\hat{p}^{\otimes}(u \mid m) \propto \frac{\hat{p}^{(1)}(u \mid m^{(1)})\, \hat{p}^{(2)}(u \mid m^{(2)})}{\hat{p}(u \mid m)}$$
and thus approximated discriminated symbol probabilities (for the computation see Appendix A.2)
$$\hat{P}^{\otimes}_{C_i}(x \mid m) \propto \int_{U} \frac{\hat{p}^{(1)}_{C_i}(x, u \mid m^{(1)})\, \hat{p}^{(2)}_{C_i}(x, u \mid m^{(2)})}{\hat{p}_{C_i}(x, u \mid m)}\, \mathrm{d}u. \qquad (33)$$
This approximation is obtained via other approximations. Its quality can thus not be guaranteed as before. To use the approximated discriminated symbol probabilities in an iteration one therefore first has to check the validity of (33).

By some choice of $w^{(1)}$ and $w^{(2)}$ the approximations of the constituent distributions are performed in an environment $A_{n\varepsilon}(C^{(l)} \mid m^{(l)})$ where $\hat{p}^{(l)}(u \mid m^{(l)})$ is large. The overall considered region is given by $A_{n\varepsilon}(S \mid m)$, defined by $\hat{p}(u \mid m)$. This overall region represents the possible overall words. The approximation is surely valid if the possible code words of the $l$-th constituent code under $m^{(l)}$ are included in this region. I.e., the conditions
$$A_{n\varepsilon}(C^{(l)} \mid m^{(l)}) \subseteq A_{n\varepsilon}(S \mid m) \qquad (34)$$
for $l = 1, 2$ have to be fulfilled. In this case the description of the last section applies, as then the approximation is typically good.

Remark 23 That this consideration is necessary becomes clear under the assumption that the constituent Gauss approximations do not consider the same environments. In this case their mean values strongly differ. The approximation of the discriminated distribution will then consider regions with a large distance to the means. The obtained results are not predictable, as a Gauss approximation is only good for the words assumed to be probable, i.e., close to its mean value. Under (34) this cannot happen.

The condition (34) is, with respect to the set sizes, fulfilled if $H(S \mid m) \geq H(C^{(l)} \mid m \,\|\, m^{(l)})$, as this is equivalent to $A_{n\varepsilon}(S \mid m) \supseteq A_{n\varepsilon}(C^{(l)} \mid m \,\|\, m^{(l)})$, which gives with (26) that
$$A_{n\varepsilon}(S \mid m) \supseteq \big( A_{n\varepsilon}(C^{(l)} \mid m \,\|\, m^{(l)}) \cap C^{(l)} \big) \supseteq A_{n\varepsilon}(C^{(l)} \mid m^{(l)}).$$
However, by (34) not only the set sizes but also the words need to match. With
$$H(C^{(l)} \mid m \,\|\, m^{(l)}) = \sum_{i=1}^{n} H(C^{(l)}_i \mid r_i + w^{(1)}_i + w^{(2)}_i \,\|\, m^{(l)})$$
we therefore employ the symbol wise conditions
$$H(C_i \mid r_i + w^{(1)}_i + w^{(2)}_i) \geq H(C^{(l)}_i \mid r_i + w^{(1)}_i + w^{(2)}_i \,\|\, m^{(l)}). \qquad (35)$$
As all symbols are independently considered, the conditions (34) are then typically fulfilled.

A decoding decision is again found if $\hat{H}(C^{\otimes} \| m) < 1$ under (35). To find such a solution we propose to minimise in each step $\hat{H}(C^{\otimes} \mid v \,\|\, m)$ under the condition (35) of code $l$. As then (35) is fulfilled, the obtained set of probable words remains in the region of common beliefs, which guarantees the validity of the subsequent approximation. This optimised $v$ is then used to update $w^{(l)}$ under fixed $w^{(h)}$, $h \neq l$. This gives Algorithm 3.

Algorithm 3 Iteration with Approximated Discrimination
1. Set $w^{(1)} = w^{(2)} = 0$. Set $l = 2$ and $h = 1$.
2. Swap $l$ and $h$. Set $z = w^{(l)}$.
3. Set $v$ such that $\hat{H}(C^{\otimes} \mid v \,\|\, m) \to \min$ under $H(C_i \mid v_i) \geq H(C^{(l)}_i \mid v_i \,\|\, m^{(l)})$ for all $i$.
4. Set $w^{(l)} = v - w^{(h)} - r$.
5. If $w^{(l)} \neq z$ then go to Step 2.
6. Set $\hat{c}_i = \mathrm{sign}(v_i)$ for all $i$.

Consider first the constrained optimisation in Step 3 of Algorithm 3. The definition of the cross entropy
$$H(C^{(l)}_i \mid v_i \,\|\, m^{(l)}) = \log_2(2^{v_i} + 2^{-v_i}) - v_i \tanh_2(L^{(l)}_i(m^{(l)}))$$
transforms the constraint to
$$v_i \tanh_2(v_i) \leq v_i \tanh_2(L^{(l)}_i(m^{(l)})),$$
which is equivalent to $|v_i| \leq |L^{(l)}_i(m^{(l)})|$ and $\mathrm{sign}(v_i) = \mathrm{sign}(L^{(l)}_i(m^{(l)}))$. Moreover, the unconstrained optimisation $\hat{H}(C^{\otimes} \mid v \,\|\, m) \to \min$ gives $v_i = \hat{L}^{\otimes}_i(m)$.

This consideration directly gives the following cases:
• If this $v_i$ does not violate the constraint then it is already optimal.
• It violates the constraint if $\mathrm{sign}(L^{(l)}_i(m^{(l)})) \neq \mathrm{sign}(\hat{L}^{\otimes}_i(m))$. In this case one has to set $v_i = 0$ to fulfil the constraint.
• For the remaining case that $\mathrm{sign}(L^{(l)}_i(m^{(l)})) = \mathrm{sign}(\hat{L}^{\otimes}_i(m))$ but the constraint is violated by $|L^{(l)}_i(m^{(l)})| < |\hat{L}^{\otimes}_i(m)|$, the optimal solution is $v_i = L^{(l)}_i(m^{(l)})$, as the cross entropy $\hat{H}(C^{\otimes}_i \mid v_i \,\|\, m)$ is a strictly monotonous function between $v_i = 0$ and $v_i = \hat{L}^{\otimes}_i(m)$.

The obtained $v_i$ are thus given by either $\hat{L}^{\otimes}_i(m)$, $L^{(l)}_i(m^{(l)})$, or zero (see the sketch below). The zero value is hereby obtained if the two estimated symbol decisions mutually contradict each other, which is a rather intuitive result. Moreover, note that the constrained optimisation is symmetric, i.e., it is equivalent to
$$H(C^{(l)}_i \mid v_i \,\|\, m^{(l)}) \to \min \quad \text{under} \quad H(C_i \mid v_i) \geq \hat{H}(C^{\otimes}_i \| m) \quad \text{for all } i. \qquad (36)$$

Remark 24 (Higher Order Moments) By the central limit theorem higher order moments do not significantly improve the approximation. This statement is surprising as the knowledge of all moments leads to perfect knowledge of the distribution and thus to globally maximal discrimination. However, the statement just indicates that one would need a large number of higher order moments to obtain additional useful information about the distributions.
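The per-symbol outcome of this constrained minimisation can be made concrete with a small sketch. It assumes that the approximated log ratios $\hat{L}^{\otimes}_i(m)$ and the constituent values $L^{(l)}_i(m^{(l)})$ have already been computed (in the paper via the Gauss approximation of Appendix A.2) and only applies the three cases derived above; it illustrates Step 3 of Algorithm 3, not the complete iteration.

    import numpy as np

    def step3_update(L_disc, L_constituent):
        """Per-symbol solution of Step 3 of Algorithm 3.

        L_disc        : approximated discriminated log ratios  L^x_i(m)  (hat L-otimes)
        L_constituent : constituent log ratios                 L^(l)_i(m^(l))
        Returns v with v_i in { L_disc_i, L_constituent_i, 0 }.
        """
        L_disc = np.asarray(L_disc, dtype=float)
        L_constituent = np.asarray(L_constituent, dtype=float)
        v = L_disc.copy()                                   # unconstrained optimum
        same_sign = np.sign(L_disc) == np.sign(L_constituent)
        v[~same_sign] = 0.0                                 # contradicting decisions -> erasure
        clip = same_sign & (np.abs(L_disc) > np.abs(L_constituent))
        v[clip] = L_constituent[clip]                       # same sign, too large -> clip
        return v

    v = step3_update([1.2, -0.3, 0.8], [0.5, 0.4, 1.5])
    # -> [0.5, 0.0, 0.8]

Step 4 of Algorithm 3 would then subtract $w^{(h)}$ and $r$ from this $v$ to obtain the new transfer vector $w^{(l)}$.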
3.1.1. Convergence

At the beginning of the algorithm many words are considered and a Gauss approximation surely suffices; in this case an approximation by histograms would not produce significantly different results. The convergence properties at the beginning should thus be comparable to those of an algorithm that uses a discrimination via histograms. There, however, a sufficiently small $\varepsilon$ should give good convergence properties.

At the end of the algorithm typically only few words remain to be considered. For this case the Gauss approximation is surely outperformed by the use of histograms. Note that this observation does not contradict the statement of Lemma 9, as we there implicitly assumed "enough" entropy. Intuitively, however, this case is simpler to solve, which implies that the Gauss approximation should remain sufficient.

This becomes clear by reconsidering the region $A_{n\varepsilon}(S \mid m)$ that is employed in each step of the algorithm. An algorithm that uses histograms will outperform an algorithm with a Gauss approximation if different independent regions in $A_{n\varepsilon}(S \mid m)$ become probable. A Gauss approximation expects a connected region and will thus span over these regions. I.e., the error of the approximation will lead to a larger number of words that need to be considered. However, this should not have a significant impact on the convergence properties.

Typically the number of words to be considered will thus become smaller in every step: the iterative algorithm gives that in every step the discriminated cross entropy (see Definition 12)
$$\hat{H}(C^{\otimes} \mid m^{(n)} \,\|\, m^{(o)}) = -\int_{U} \hat{p}^{\otimes}(u \mid m^{(o)}) \log_2 P(s \mid m^{(n)})\, \mathrm{d}u$$
is smaller than the discriminated symbol entropy $\hat{H}(C^{\otimes} \| m^{(o)})$ under the assumed prior discrimination $m^{(o)}$. Hence the algorithm should converge to some fixed point.

3.1.2. Fixed Points

The considerations above give that the algorithm will typically not stay in an infinite loop and thus ends at a fixed point. Moreover, at this fixed point the additional constraints will be fulfilled. It remains to consider whether the additional constraints introduce solutions of large discriminated symbol entropy $\hat{H}(C^{\otimes} \| m)$.

Intuitively the additionally imposed constraints seem no less restrictive than the use of histograms and $w^{(1)} = w^{(2)}$, as $w^{(1)} \neq w^{(2)}$ implies a better discrimination. However, solutions with large discriminated symbol entropy $\hat{H}(C^{\otimes} \| m)$ will even for the second case typically not exist. Moreover, the discrimination uses continuous values, which should be better than the again sufficient hard decision discrimination. I.e., the constraint should have only a small (negative) impact on the intermediate steps of the algorithm.

Usually $\hat{H}(C^{\otimes} \| m)$ is already relatively small at the start ($w^{(1)} = w^{(2)} = 0$). The subsequent step will despite the constraint typically exhibit a smaller discriminated symbol entropy. This is equivalent to a smaller error probability and hence typically a better discrimination of the distinguished word. If the process stalls for $\hat{H}(C^{\otimes} \| m) > 1$, then the region investigated by $m$ exhibits either no or multiple typical words. As (typically) the distinguished word is the only code word in the typical set and as the typical set is (typically) included, this typically does not occur. Finally, for $\hat{H}(C^{\otimes} \| m) \approx 0$ the distinguished solution is found.
At the end of the algorithm (and under the assumption of a distinguished solution) the obtained Gauss discriminated distribution then mimics a Gauss approximation of the overall distribution, i.e.,
$$\hat{p}^{\otimes}(u \mid m) \propto \frac{\hat{p}^{(1)}(u \mid m^{(1)})\, \hat{p}^{(2)}(u \mid m^{(2)})}{\hat{p}(u \mid m)} \asymp \hat{p}^{(a)}(u \mid r). \qquad (37)$$
Without the constraints many solutions $m$ exist; it only has to be guaranteed that the constituent sets intersect at the distinguished word. By the addition of the constraints the solution becomes unique and is defined such that the number of words considered by $\hat{p}(u \mid m)$ is as small as possible. This generally implies that both constituent approximated distributions then need to be rather similar. This is the desired behaviour, as the considered environment is defined by a narrow peak of $\hat{p}^{(a)}(u \mid r)$ around $u(\hat{c})$.

Hence, the additional constraints seem needed for a well defined fixed point and because of the limitations of the Gauss approximation. This emphasises the statement above: without the constraint non predictable behaviour may occur.

Remark 25 (Optimality) The values $w^{(l)}_i$ are continuous. Thus, one can search for the optimum of $\hat{H}(C^{\otimes} \| m) \to \min$ by a differentiation of $\hat{H}(C^{\otimes} \| m)$. For the differentiation holds
$$2\, \frac{\partial \hat{H}(C^{\otimes} \| m)}{\partial w^{(l)}_i} = \tanh_2 L_i(m) - \tanh_2 \hat{L}^{\otimes}_i(m) - 2 \int_{U} \frac{\partial \hat{p}^{\otimes}(u \mid m)}{\partial w^{(l)}_i} \log_2 P(s \mid m)\, \mathrm{d}u.$$
For the first term see Lemma 7 and the definition of the discriminated symbol probabilities. The second term is the derivative of the discriminated probability density. For the case of a maximal discrimination of the distinguished word it will consist of this word with probability of almost one. A differential variation of the discriminator should remain maximally discriminating, which gives that the second term should be almost zero. Hence, one obtains that
$$L_i(m) \approx \hat{L}^{\otimes}_i(m) \qquad (38)$$
holds at the absolute minimum of $\hat{H}(C^{\otimes} \| m)$, which is a (soft decision) well defined discriminator. Note, furthermore, that for (37) and similar constituent distributions the distribution $\hat{p}(u \mid m)$ will necessarily be similar to $\hat{p}^{(a)}(u \mid m)$, which is a statement similar to (38).

Remark 26 (Complexity) Under the assumption of fast convergence the decoding complexity is of the order $O(n)$, i.e., the complexity only depends on the BCJR decoding complexity of the constituent codes. Moreover, Algorithm 3 can still be considered as an algorithm where parameters are transferred between the codes. Hereby the number of parameters is increased by a factor of nineteen (for each $i$, additionally to $w^{(l)}_i$: three means and six correlations for each $x = \pm 1$).

Note, finally, that the original iterative (constituent) belief propagation algorithm is rather close to the proposed algorithm. Only by (36) an additional constraint is introduced. Without the constraint apparently too strong beliefs are transmitted; Algorithm 3 cuts off excess constituent code belief.

3.2. Multiple Coupling

Dually coupled codes constructed by just two constituent codes (with simple trellises) are not necessarily good codes. This can be understood by the necessity of simple constituent trellises: the left-right (minimal row span) [11] forms of the (permuted) parity check matrices then have short effective lengths. This gives that the codes cannot be considered as purely random, as this condition strongly limits the choice of codes.
However, to obtain asymptotically good codes one generally needs that the codes can be considered as random. If, as in Remark 3, more constituent codes are considered, then the dual codes will have smaller rate and thus a larger effective length. This is best understood in the limit, i.e., the case of $n - k$ constituent codes with $n - k^{(l)} = 1$. These codes can then be freely chosen without changing the complexity, which leaves no restriction on the choice of the overall code.

For a setup of a dual coupling with $N$ codes the discriminated distribution of correlations is generalised to
$$P^{\otimes}(u \mid m) \propto \frac{\prod_{l=1}^{N} P^{(l)}(u \mid m^{(l)})}{\big(P(u \mid m)\big)^{N-1}},$$
with $m = (r, w^{(1)}, \ldots, w^{(N)})$, $m^{(l)} = (r, w^{(1)}, \ldots, w^{(l-1)}, w^{(l+1)}, \ldots, w^{(N)})$, and an independence assumption as in (11) and (5). The definition of the discriminated symbol probabilities then becomes
$$P^{\otimes}_{C_i}(x \mid m) \propto \sum_{u} \frac{\prod_{l=1}^{N} P^{(l)}_{C_i}(x, u \mid m^{(l)})}{\big(P_{C_i}(x, u \mid m)\big)^{N-1}}.$$
Moreover, for globally maximal discriminators
$$P^{\otimes}_{C_i}(x \mid m) = P^{(a)}_{C_i}(x \mid r) \quad \text{and} \quad P^{\otimes}(u \mid m) = P^{(a)}(u \mid r)$$
remain true. The other lemmas and theorems above can likewise be generalised. Hence, discriminator decoding by Gauss approximations applies to multiple dually coupled codes, too.

Remark 27 (Iterative Algorithm) The generalisation of Algorithm 3 may be done by using
$$v_i = \arg\min_{v} H(C^{\otimes}_i \mid v_i \,\|\, m) \quad \text{under} \quad H(C_i \mid v_i) \geq H(C^{(l)}_i \mid v_i \,\|\, m^{(l)}) \quad \text{for all } i$$
and
$$w^{(l)} \leftarrow v - \sum_{h \neq l}^{N} w^{(h)}$$
as constituent code dependent update. Overall this gives, provided the distinguished well defined solution is found, that discriminator decoding asymptotically performs as typical decoding for a random code. I.e., with dually coupled codes and (to the distinguished solution convergent) Gauss approximated discriminator decoding the capacity is attained.

Remark 28 (Complexity) The complexity of decoding is of the order of the sum of the constituent trellis complexities and thus generally increases with the number of codes employed. For a fixed number of constituent codes of fixed trellis state complexity and Gauss approximated discriminators the complexity thus remains of the order $O(n)$.

Remark 29 (Number of Solutions) For a coupling with many constituent codes one obtains a large number of non linear optimisations that have to be performed simultaneously. The non linearity of the common problem should thus increase with the number of codes. Another explanation is that typicality is then assumed many times, and the probability of some non typical event increases. This may increase the number of stable solutions of the algorithm or introduce instability.

This behaviour may be mitigated by the use of punctured codes. The punctured positions define beliefs, too, which gives that the transfer vector $w$ is generally longer than $n$. The transfer complexity is thus increased, which should lead to better performance. Note that this approach is implicitly used for LDPC codes.

3.3. Channel Maps

In the last sections only memory-less channel maps as given in Remark 4 were considered. A general channel is given by a stochastic map $K: S \to R$ defined by $P_{R|S}(r \mid s)$. We will here only consider channels where signal and "noise" are independent. In particular we assume that the channel $K$ is given by some known deterministic map $H: s \mapsto v = (v_1, \ldots, v_n)$
and $r = v + e$ with the additive noise $E$ defined by $P_E(e)$.

A code map $C$ prior to the transmission together with the map $H$ may then be considered as a concatenated map. The concatenation is hereby (for the formal representation by dually coupled codes see the proof of Theorem 1) equally represented by the dual coupling of the "codes"
$$C^{(1)} := \{ c^{(1)} = (c, z) : c \in C \} \quad \text{and} \quad C^{(2)} := \{ c^{(2)} = (s, v) : s \in S \text{ and } H: s \mapsto v \},$$
where $z = (z_1, \ldots, z_n)$ is undefined, i.e., no restriction is imposed on $z$. Moreover, $c$ is punctured prior to transmission and only $v + e$ is received. Discriminator based decoding thus applies and one obtains
$$P^{\otimes}_{C_i}(x \mid m) \propto \sum_{u \in U} \frac{P^{(1)}_{C_i}(x, u \mid w^{(2)})\, P^{(2)}_{C_i}(x, u \mid r, w^{(1)})}{P_{C_i}(x, u \mid w^{(1)}, w^{(2)})},$$
as by the definition of the dually coupled code
$$P^{(1)}_{C_i}(x, u \mid m^{(1)}) = P^{(1)}_{C_i}(x, u \mid w^{(2)})\, P(u \mid r)$$
and by the independence assumption
$$P_{C_i}(x, u \mid m) = P_{C_i}(x, u \mid w^{(1)}, w^{(2)})\, P(u \mid r)$$
are independent of the channel.

Remark 30 (Trellis) If a trellis algorithm exists to compute $P^{(2)}_{C_i}(x \mid r)$, then one may compute the symbol probabilities $P^{(2)}_{C_i}(x \mid r, w^{(1)})$ as well as the mean values and variances of $u$ under $P^{(2)}_{C_i}(x, u \mid r, w^{(1)})$ with similar complexity.

Example 7 A linear time invariant channel with additive white Gaussian noise $e(t)$ is given by the map
$$r(t) = \int_{-\infty}^{\infty} s(t - \tau)\, h(\tau)\, \mathrm{d}\tau + e(t).$$
Here we assume a description in the equivalent base band, i.e., the signals $r(t)$ and $s(t)$ as well as the noise may be complex valued. The noise is assumed to be white and thus exhibits the (stationary) correlation function $\mathrm{E}_{E(t)}[e(t)\, e^*(t+\tau)] = \sigma^2_{E(t)} \cdot \delta(\tau)$. For amplitude shift keying modulation one employs the signal
$$s(t) = \sum_{i=-\infty}^{\infty} s_i\, w(t - iT)$$
with $w(\tau)$ being the waveformer. With a matched filter and a well chosen whitening filter one obtains an equivalent (generally complex valued) discrete channel $Q: s \mapsto r$ with
$$r_i = \sum_{j=0}^{M} s_{i-j}\, q_j + e_i \qquad (39)$$
defined by $q = (q_0, \ldots, q_M)$ and independent Gauss noise $\mathrm{E}_{E}[e_i e^*_j] = \sigma^2_{E}\, \delta_{i-j}$. For binary phase shift keying one has $s_i = A\, x_i$ with $x_i \in \mathbb{B}$. For quaternary phase shift keying the map is given by $s_i = \frac{A}{\sqrt{2}}(x_{2i} + \mathrm{j}\, x_{2i+1})$, $\mathrm{j}^2 = -1$, and $x_i \in \mathbb{B}$. In both cases a trellis for $S$ may be constructed with logarithmic complexity proportional to the memory $M$ of the channel $q = (q_0, q_1, \ldots, q_M)$ times the number of information bits per channel symbol $s_i$. Note, moreover, that a time variance of the channel does not change the trellis complexity.

3.4. Channel Detached Discrimination

Overall one obtains for a linear modulation and linear channels with additive noise the discrete probabilistic channel map
$$K: s \mapsto r = sQ + e. \qquad (40)$$
For uncorrelated Gauss noise $E$ this gives the probabilities (without prior knowledge about the code words)
$$P(c \mid r) \propto \exp_2\Big( -\frac{\log_2(e)}{2\sigma^2_E}\, \| r - cQ \|^2 \Big).$$
If the channel has large memory $M$ and/or if a modulation scheme with many bits per symbol $s_i$ is used, then the trellis complexity of a trellis equalisation becomes prohibitively large. To use the channel map as a constituent code will then not give a practical algorithm.
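To make the discrete map (40) concrete, the following small sketch builds the convolution matrix $Q$ from the taps $q$, simulates $r = sQ + e$, and evaluates the word metric $\|r - cQ\|^2$ that defines $P(c \mid r)$ above. The tap values, noise variance, and BPSK mapping are arbitrary choices for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)

    def convolution_matrix(q, n):
        """n x (n + M) matrix Q with (s Q)_k = sum_j s_{k-j} q_j (zero padded tail)."""
        M = len(q) - 1
        Q = np.zeros((n, n + M))
        for i in range(n):
            Q[i, i:i + M + 1] = q
        return Q

    q = np.array([0.8, 0.5, 0.3])        # example channel taps (memory M = 2)
    sigma2 = 0.25                         # example noise variance
    n = 16
    Q = convolution_matrix(q, n)

    s = rng.choice([-1.0, 1.0], size=n)   # BPSK word
    r = s @ Q + np.sqrt(sigma2) * rng.standard_normal(n + len(q) - 1)

    def log2_channel_metric(c, r, Q, sigma2):
        """log_2 P(c | r) up to a constant, i.e. -log2(e)/(2 sigma^2) * ||r - cQ||^2."""
        return -np.log2(np.e) / (2.0 * sigma2) * np.sum((r - c @ Q) ** 2)

    # the transmitted word typically scores higher than a perturbed one
    c_bad = s.copy(); c_bad[0] *= -1
    print(log2_channel_metric(s, r, Q, sigma2), log2_channel_metric(c_bad, r, Q, sigma2))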
Given this complexity problem, reconsider the computation of the discriminated symbol probabilities $P^{\otimes}_{C_i}(x \mid m)$ under the assumption that the employed code is already a dually coupled code with, to again simplify the notation, only two constituent codes. To apply the discriminator based approach one thus needs to compute
$$P^{\otimes}_{C_i}(x \mid m) \propto \sum_{u \in U} \frac{P^{(1)}_{C_i}(x, u \mid m^{(1)})\, P^{(2)}_{C_i}(x, u \mid m^{(2)})}{P_{C_i}(x, u \mid m)}.$$
Obviously, the complexity of the computation of the symbol probabilities $P^{(l)}_{C_i}(x, u \mid m^{(l)})$ of the constituent codes under the channel maps is generally prohibitively large. However, one may equivalently (see (17) on Page 12) compute
$$P^{\otimes}_{C_i}(x \mid m) \propto \sum_{u \in U} \frac{P^{(1)}_{C_i}(x, u \mid w^{(2)})\, P^{(2)}_{C_i}(x, u \mid w^{(1)})}{P_{C_i}(x, u \mid w^{(1)}, w^{(2)})}\, \exp_2(u_0), \qquad (41)$$
where $u_0$ represents the channel probabilities. An alternative method to compute $P^{\otimes}_{C_i}(x \mid m)$ is thus to first compute
$$P^{\otimes}_{C_i}(x, u \mid w^{(1)}, w^{(2)}) \propto \frac{P^{(1)}_{C_i}(x, u \mid w^{(2)})\, P^{(2)}_{C_i}(x, u \mid w^{(1)})}{P_{C_i}(x, u \mid w^{(1)}, w^{(2)})}$$
by the constituent distributions $P^{(l)}_{C_i}(x, u \mid w^{(h)})$ for $h \neq l$ and $P_{C_i}(x, u \mid w^{(1)}, w^{(2)})$, and to then sum the distributions $P^{\otimes}_{C_i}(x, u \mid w^{(1)}, w^{(2)})$ multiplied by $\exp_2(u_0)$. In the distributions $P^{\otimes}_{C_i}(x, u \mid w^{(1)}, w^{(2)})$ the variable $u_0 = \log_2(P(c \mid r))$ thereby relates to the channel probabilities. The discrimination itself is detached from the channel information, i.e., it is done only by the $w^{(l)}$.

This approach gives for linear channels and a Gauss approximation a surprisingly small complexity. This is the case as for linear channel maps the computation of the "channel moments", i.e., the moments depending on $u_0$, is not considerably more difficult than the computation of the code moments above. To illustrate, consider the channel dependent means, i.e., the expectation $\mathrm{E}^{(l)}_{C_i}[u_0 \mid x, w^{(h)}]$, where one obtains
$$\mathrm{E}^{(l)}_{C_i}[u_0 \mid x, w^{(h)}] := \sum_{c \in C^{(l)}_i(x)} u_0\, P(c \mid w^{(h)}) = \sum_{c \in C^{(l)}_i(x)} \log_2(P(c \mid r))\, P(c \mid w^{(h)}) = \mathrm{E}^{(l)}_{C_i}[\log_2(P(c \mid r)) \mid x, w^{(h)}] = \mathrm{const} - \frac{\log_2(e)}{2\sigma^2_E}\, \mathrm{E}^{(l)}_{C_i}[\| r - cQ \|^2 \mid x, w^{(h)}]. \qquad (42)$$
This is similar to the computation of the variances on Page 23. Generally, the means and correlations can be computed for linear channels with a complexity that increases only linearly with the channel memory $M$. This result follows as the expectations for the channels remain computations of moments, but now with vector operations. The computation of the variance of $u_0$ (for a channel with memory) is, e.g., equivalent to the computation of a fourth moment in the independent case.

Generally, the mean values $\mathrm{E}^{(l)}_{C_i}[u_0 \mid x, w^{(h)}]$ are only computable up to a constant. Under a Gauss assumption and (41) this is equivalent to a shift of $u_0$ in $\exp_2(u_0)$ by this constant. However, this only leads to a proportionality factor, which vanishes in the computation of $P^{\otimes}_{C_i}(x \mid m)$. This unknown constant may thus be disregarded.

Remark 31 (Constituent Code) This approach applies by
$$P^{(l)}_{C_i}(x \mid m^{(l)}) \propto \sum_{u \in U} P^{(l)}_{C_i}(x, u \mid w^{(h)})\, \exp_2(u_0)$$
for $l \neq h$ to the constituent codes $C^{(l)}$, too: one may likewise compute the constituent beliefs via the moments and a Gauss approximation and thus apply Algorithm 3.
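The channel dependent mean (42) only requires first and second order symbol expectations once the channel is linear. The following sketch evaluates $\mathrm{E}[\|r - cQ\|^2]$ from assumed per-symbol expectations $\mathrm{E}[c_i]$ and pairwise expectations $\mathrm{E}[c_i c_j]$ (real valued signalling assumed); in the paper these expectations come from the constituent trellises (Appendix A.1), here they are simply taken as given.

    import numpy as np

    def expected_channel_metric(r, Q, e_c, e_cc, sigma2):
        """E[ ||r - cQ||^2 ] and the channel mean E[u_0] (up to a constant) of (42).

        r    : received vector, length n + M
        Q    : (n, n + M) convolution matrix of the channel taps
        e_c  : (n,)   symbol expectations   E[c_i]
        e_cc : (n, n) pairwise expectations E[c_i c_j], diagonal one
        """
        quad = np.sum((Q @ Q.T) * e_cc)      # E[c Q Q^T c^T] = sum_{i,j} (Q Q^T)_{ij} E[c_i c_j]
        cross = r @ (e_c @ Q)                # E[r (cQ)^T]
        metric = r @ r - 2.0 * cross + quad
        e_u0 = -np.log2(np.e) / (2.0 * sigma2) * metric   # E[u_0] up to the irrelevant constant
        return metric, e_u0

For independent symbols one would simply set e_cc = np.outer(e_c, e_c) with ones on the diagonal.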
The Gauss approximation for $u_0$ surely holds true if the channel is short compared to the overall length, as then many independent parts contribute. With (41) one can thus apply the iterative decoder based on Gauss approximated discriminators to linear channels with memory without much extra complexity.

Remark 32 (Matched Filter) Note that one obtains by (42) for the initialisation $w^{(l)} = 0$, $l = 1, 2$, that $\hat{L}^{\boxtimes}_i(m)$ is proportional to the "matched filter output" given by $q_i r^H$. Moreover, in all steps of the algorithm only $\hat{L}^{\boxtimes}(m)$ is directly affected by the channel map.

3.5. Estimation

In many cases the transmission channel is unknown at the receiver. This problem is usually mitigated by a channel estimation prior to the decoding. However, an independent estimation needs, especially for time varying channels [7], considerable excess redundancy. The optimal approach would be to perform decoding, estimation, and equalisation simultaneously.

Example 8 Assume that it is known that the channel is given as in (39), but that the channel parameters $q = (q_0, \ldots, q_M)$ are unknown. Moreover, assume that the transmission is in the base band, which gives that the $q_i$ are real valued. The aim is to determine these values together with the code symbol decisions. To consider them in the same way, i.e., by decisions, one needs to reduce the (infinite) description entropy. We therefore assume a quantisation of $q$ by a binary vector $b$. This may, e.g., be done by
$$q_i = \mathrm{q} \sum_{j=0}^{B_i - 1} b_{l(i)+j}\, \exp_2(j), \qquad l(i) = l(i-1) + B_{i-1}, \quad l(0) = 0,$$
and $b_i \in \mathbb{B}$. Note that one uses under this quantisation the additional knowledge $|q_i| < \mathrm{q} \exp_2(B_i)$. Moreover, the quantisation error tends to zero with the quantisation step size $\mathrm{q}$. Finally, surely a better quantisation can be found via rate distortion theory.

The example shows that one obtains with an appropriate quantisation additional binary unknowns $b_j$. Thus one needs additional parameters $w^{(l)}_{n+j}$ that discriminate these bits. Moreover, again a probability distribution is needed for these $w^{(l)}_{n+j}$. Here it is assumed that the distribution given in (5) is just extended to these parameters. Note that this is equivalent to assuming that code bits $c_i$ and "channel bits" $b_j$ are independent.

The code symbol discriminated probabilities remain, under the now longer $w$, as in (41). Additionally one obtains discriminated channel symbol probabilities given by
$$P^{\otimes}_{B_i}(x \mid m) \propto \sum_{u \in U} \frac{P^{(1)}_{B_i}(x, u \mid w^{(2)})\, P^{(2)}_{B_i}(x, u \mid w^{(1)})}{P_{B_i}(x, u \mid w^{(1)}, w^{(2)})}\, \exp_2(u_0).$$
A Gauss approximated discrimination is thus as before; however, one needs to compute new and more general expectations. E.g., for the general linear channel of (40) one needs to compute the expectation given by
$$\mathrm{E}^{(l)}_{C_i}[u_0 \mid x, w^{(h)}] = \mathrm{const} - \frac{\log_2(e)}{2\sigma^2_E}\, \mathrm{E}^{(l)}_{C_i}[\| r - cQ(b) \|^2 \mid x, w^{(h)}]$$
and equivalently for $\mathrm{E}^{(l)}_{B_i}[u_0 \mid x, w^{(h)}]$. The expectations are generalised because $Q$ is a map of the random variables $b_j$. With the quantisation of the example above this map is linear in $b$. This first gives that $cQ(b)$ can be considered as a quadratic function in the binary random variables $x$ and $b$. The computation of the means is thus akin to the computation of fourth moments for a known channel and independent symbols.
Overall this gives that the complexity for the computation of the means and variances is for unknown channels "only" twice as large as for a known channel (of the same memory). It may, however, still be computed with reasonable complexity. Hence, again an iteration based on Gauss approximated discriminators can be performed.

Remark 33 (Miscellaneous) Note that without some known "training" sequence in the code word the iteration will, by the symmetry, usually stay at $w^{(l)} = 0$. Note, moreover, that this approach is easily extended to time variant channels as considered in [7], or even to more complex, i.e., non linear channel maps. The complexity then remains dominated by the complexity of the computation of the means and correlations.

4. Summary

In this paper first (dually) coupled codes were discussed. A dually coupled code is given by a juxtaposition of the constituent parity check matrices. Dually coupled codes provide a straightforward albeit prohibitively complex computation of the overall word probabilities $P^{(a)}(s \mid r)$ by the constituent probabilities $P^{(l)}(s \mid r)$. However, for these codes a decoding by belief propagation applies.

The then introduced concept of discriminators is summarised by augmenting the probabilities by additional (virtual) parameters $w^{(l)}$ and $u$ to $P(s, u \mid r, w^{(1)}, w^{(2)})$. This is similar to the procedure used for belief propagation, but there the parameter $u$ is not considered. Such carefully chosen probabilities led (in a globally maximal form) again to optimum decoding decisions of the coupled code. However, the complexity of decoding with globally maximal discriminators remains in the order of a brute force computation of the ML decisions.

It was then shown that local discriminators may perform almost optimally but with much smaller complexity. This observation gave rise to the definition of well defined discriminators and therewith again an iteration rule. It was then shown that this iteration theoretically admits any element of the typical set of the decoding problem as fixed point.

In the last chapter the central limit theorem then led to a Gauss approximation and a low complexity decoder. Finally, (linear) channel maps with memory were considered. It was shown that under additional approximations equalisation and estimation may be accommodated into the iterative algorithm with only little impact on the complexity.

A. Appendix

A.1. Trellis Based Algorithms

The trellis is a layered graph representation of the code space such that every code word $c = (c_1, \ldots, c_n)$ corresponds to a unique path through the trellis from left to right. For a binary code every layer of edges is labelled by one code symbol $c_i \in \mathbb{Z}_2 = \{0, 1\}$. The complexity of the trellis is generally given by the maximum number of edges per layer. As an example, the trellis of a "single parity check" code of length 5 with $H = (11111)$ is depicted in Figure 3; each of the $2^4$ paths in the trellis defines $c_1$ to $c_5$ of a code word $c$ of even weight.

[Figure 3: Trellis of the (5,4,2) code]

Here only the basic ideas needed to perform the computations in the trellis are presented. A formal description will be given in another paper [8]. The description here reflects the operations performed in the trellis.
I.e., only the lengthening (extending one path) and the junction (combining two incoming paths at one trellis node) are considered. This is first explained for the Viterbi [5] algorithm that finds the code word with minimal distance. The "lengthening" is given by an addition of the path correlations, as depicted in Figure 4 (a). For the combination, the "join" operation, only the path of maximum value is kept. This is equivalent to a minimisation operation for the distances, which is reflected in the name min-sum algorithm that is often used. On the other hand, the BCJR [1] algorithm (to compute $P^{(c)}_{C_i}(x \mid r)$) is often called sum-product algorithm, as the lengthening is performed by the product of the path probabilities and the combination of two paths is given by a sum. These operations are summarised in Figure 4 (b).

[Figure 4: Basic operations in the Viterbi and BCJR algorithms. (a) Viterbi: lengthening $X^{(L)}_e = X_s + x_i \cdot r_i$, join $X^{(J)}_e = \max(X^{(1)}_s, X^{(2)}_s)$. (b) BCJR: lengthening $P^{(L)}_e = P_s \cdot P(x_i \mid r_i)$, join $P^{(J)}_e = P^{(1)}_s + P^{(2)}_s$.]

Remark 34 (Forward-Backward Algorithm) For the Viterbi algorithm the ML code word is found by following the selected paths (starting from the end node) in backward direction. The operations of the BCJR algorithm (in forward direction) give at the end directly the probabilities $P^{(c)}_{C_n}(x \mid r)$. To compute all $P^{(c)}_{C_i}(x \mid r)$ the BCJR algorithm has to be performed in both directions. The same holds true for the algorithms below. This is here not considered any further, but keep in mind that only by this two way approach symbol based distributions or moments can be computed with low complexity.

In the following we shall reuse the notation of Figure 4 and use the indices $s$ and $e$ before respectively after the lengthening or join operation.

A.1.1. Discrete Sets

To compute a hard decision distribution one can just count the number of words at a certain distance to $r$. Let this number be denoted $D(t)$ for weight $t \in \mathbb{Z}$. This can be done in the trellis by using for the lengthening operation from $D_s(t)$ to $D^{(L)}_e(t)$
$$D^{(L)}_e(t) = \begin{cases} D_s(t-1) & \text{for } c_i \neq r_i \\ D_s(t) & \text{for } c_i = r_i. \end{cases}$$
The junction of paths becomes just $D^{(J)}_e(t) = D^{(1)}_s(t) + D^{(2)}_s(t)$. Given $D(t)$ and a BSC with error probability $p$, one may compute the probability of having words of distance $t$ by $P(t \mid r) \propto D(t) \cdot p^t (1-p)^{n-t}$. This can also be done directly in the trellis by $p^{(J)}_e(t) = p^{(1)}_s(t) + p^{(2)}_s(t)$ and
$$p^{(L)}_e(t) = \begin{cases} p \cdot p_s(t-1) & \text{for } c_i \neq r_i \\ (1-p) \cdot p_s(t) & \text{for } c_i = r_i. \end{cases}$$

A.1.2. Moments

For the mean value $\mu = \mathrm{E}[r c^T] = \sum_{i=1}^{n} \mathrm{E}[r_i c_i]$ holds
$$\mathrm{E}\Big[\sum_{j=1}^{i} c_j r_j \,\Big|\, c_i\Big] = \mathrm{E}\Big[\sum_{j=1}^{i-1} c_j r_j\Big] + c_i r_i.$$
This directly gives that one obtains for the lengthening
$$P^{(L)}_e = P_s \cdot 2^{r_i c_i} \quad \text{and} \quad \mu^{(L)}_e = \mu_s + r_i c_i.$$
The junction is just the probability weighted sum of the previously computed input means, given by
$$P^{(J)}_e = P^{(1)}_s + P^{(2)}_s \quad \text{and} \quad \mu^{(J)}_e = \frac{P^{(1)}_s}{P^{(J)}_e}\, \mu^{(1)}_s + \frac{P^{(2)}_s}{P^{(J)}_e}\, \mu^{(2)}_s.$$
Hence, the BCJR algorithm for the probabilities needs to be computed at the same time. Note that the obtained mean values are then readily normalised.
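As an illustration of these lengthen and join rules, the following sketch runs a forward pass over the two state trellis of a single parity check code and propagates, per state, the path probability together with the normalised mean of the correlation $\sum_j c_j r_j$. The bit metric $2^{r_i c_i}$ and the $\pm 1$ symbol mapping follow the description above; this is only a toy illustration of one forward direction, not of the full two way recursion.

    import numpy as np

    def forward_means(r):
        """Forward pass over the single parity check trellis (states = running parity).

        For every state keep (P, mu): the summed path weight and the probability
        weighted mean of the partial correlation sum_j c_j r_j.
        """
        states = {0: (1.0, 0.0)}                      # start in parity 0 with empty correlation
        for ri in r:
            new = {}
            for parity, (P_s, mu_s) in states.items():
                for bit in (0, 1):                     # trellis edge labels
                    c = 1.0 - 2.0 * bit                # map {0, 1} -> {+1, -1}
                    P_len = P_s * 2.0 ** (ri * c)      # lengthening of the path weight
                    mu_len = mu_s + ri * c             # lengthening of the mean
                    key = parity ^ bit
                    if key in new:                     # junction: probability weighted combination
                        P_old, mu_old = new[key]
                        P_join = P_old + P_len
                        mu_join = (P_old * mu_old + P_len * mu_len) / P_join
                        new[key] = (P_join, mu_join)
                    else:
                        new[key] = (P_len, mu_len)
            states = new
        return states[0]                               # even parity = valid code words

    P_end, mu_end = forward_means(np.array([0.9, -0.4, 1.3, 0.2, -0.7]))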
To compute the "energies" $S = \mathrm{E}[(\sum_{j=1}^{i} c_j r_j)^2]$ one uses in the same way that
$$\mathrm{E}\Big[\Big(\sum_{j=1}^{i} c_j r_j\Big)^2 \,\Big|\, c_i\Big] = \mathrm{E}\Big[\Big(\sum_{j=1}^{i-1} c_j r_j\Big)^2\Big] + 2 c_i r_i \cdot \mathrm{E}\Big[\sum_{j=1}^{i-1} c_j r_j\Big] + (c_i r_i)^2.$$
This gives, in addition to the then necessary computation of means and probabilities, that lengthening and junction are now given by
$$S^{(L)}_e = S_s + 2 r_i c_i \cdot \mu_s + (r_i c_i)^2 \quad \text{and} \quad S^{(J)}_e = \frac{P^{(1)}_s}{P^{(J)}_e}\, S^{(1)}_s + \frac{P^{(2)}_s}{P^{(J)}_e}\, S^{(2)}_s.$$
Here again the normalisation is already included. Correlation and higher order moment trellis computations are derived in the same way. However, for an $l$-th moment all $l-1$ lower moments and the probability need to be additionally computed. Moreover, the description gives that these moment computations may be performed likewise for any linear operation $cQ$ (defined over the field of real or complex numbers), then using vector operations.

A.1.3. Continuous Sets

Another possibility to use the trellis is to compute (approximated) histograms for $u = w c^T$ with $w_i \in \mathbb{R}$ and $c_i \in \mathbb{B}$. It is here proposed (other possibilities surely exist) to use, as in the hard decision case above, a vector function $(h(t), \mu)$ with $t \in \mathbb{Z}$, $|t| \leq Q$, and the mean value $\mu$. I.e., the values of $u$ with non vanishing probability are assumed to be in a vicinity of the mean value $\mu$ (computed above), or $p(u \mid m) = 0$ for $|u - \mu| > Q\varepsilon$. Thus $(h(t), \mu)$ is defined to be the approximation of
$$h(t) \approx \int_{t\varepsilon}^{(t+1)\varepsilon} p(u - \mu \mid m)\, \mathrm{d}u.$$
Here, densities are used to simplify the notation. It is now assumed that the mean values are computed as above, which gives that the lengthening is the trivial operation
$$(h^{(L)}_e(t), \mu^{(L)}) = (h_s(t), \mu_s + c_i w_i).$$
The junction, however, cannot be performed as easily, as usually the mean values do not fit onto each other. Here it is assumed that the density has for any interval the form of a rectangle. Note that this is again a maximum entropy assumption. This gives the approximation of the histogram $h^{(J)}_e(t)$ by the junction operation to be
$$(h^{(J)}_e(t), \mu^{(J)}) = \Big( \frac{P^{(1)}_s}{P^{(J)}_e}\, \breve{h}^{(1)}_s(t) + \frac{P^{(2)}_s}{P^{(J)}_e}\, \breve{h}^{(2)}_s(t),\; \frac{P^{(1)}_s}{P^{(J)}_e}\, \mu^{(1)}_s + \frac{P^{(2)}_s}{P^{(J)}_e}\, \mu^{(2)}_s \Big)$$
and
$$\breve{h}^{(j)}_s\Big(t - \Big\lfloor \frac{\mu^{(j)}_s - \mu^{(J)}}{\varepsilon} \Big\rfloor\Big) = a(\mu^{(j)}_s, \mu^{(J)}) \cdot h^{(j)}_s(t) + b(\mu^{(j)}_s, \mu^{(J)}) \cdot h^{(j)}_s(t+1),$$
with $\lfloor z \rfloor$ the integer part, $\mathrm{trunc}(z) := z - \lfloor z \rfloor$, $a(\mu^{(j)}_s, \mu^{(J)}) + b(\mu^{(j)}_s, \mu^{(J)}) = 1$, and $b(\mu^{(j)}_s, \mu^{(J)}) = \mathrm{trunc}\big( (\mu^{(j)}_s - \mu^{(J)})/\varepsilon \big)$.
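A small sketch of one way to realise this junction follows. Two input histograms, each stored relative to its own mean, are first shifted onto the common probability weighted mean, with the non integer part of the shift split linearly over two adjacent bins (the rectangle assumption above), and are then combined with the path probabilities as weights. The bin indexing starts at zero here and the numbers are arbitrary; this only illustrates the bookkeeping, not a full trellis pass, and the sign convention of the shift may differ from the formula above.

    import numpy as np

    def shift_histogram(h, delta_bins):
        """Shift a histogram by a real number of bins, splitting mass over two neighbours."""
        k = int(np.floor(delta_bins))
        b = delta_bins - k                 # fractional part -> weight of the farther bin
        a = 1.0 - b
        out = np.zeros_like(h)
        for t, mass in enumerate(h):
            if mass == 0.0:
                continue
            if 0 <= t + k < len(out):
                out[t + k] += a * mass
            if 0 <= t + k + 1 < len(out):
                out[t + k + 1] += b * mass
        return out

    def join_histograms(P1, h1, mu1, P2, h2, mu2, eps):
        """Junction of two (histogram, mean) pairs with the rectangle redistribution."""
        P = P1 + P2
        mu = (P1 * mu1 + P2 * mu2) / P                   # new probability weighted mean
        h1s = shift_histogram(h1, (mu1 - mu) / eps)      # re-centre both inputs on mu
        h2s = shift_histogram(h2, (mu2 - mu) / eps)
        return P, (P1 / P) * h1s + (P2 / P) * h2s, mu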
A.2. Computation of $\hat{L}^{\otimes}_i(m)$

Equation (14) gives the logarithmic probability ratio
$$\hat{L}^{\otimes}_i(m) = r_i + \breve{L}^{(1)}_i(m^{(1)}) + \breve{L}^{(2)}_i(m^{(2)}) + \hat{L}^{\boxtimes}_i(m).$$
The first three terms can be computed as before. For the computation of $\hat{L}^{\boxtimes}_i(m)$ use that
$$\hat{P}^{\boxtimes}_{C_i}(x \mid m) \propto \int_{U} \frac{\hat{p}^{(1)}_{C_i}(u \mid x, m^{(1)}) \cdot \hat{p}^{(2)}_{C_i}(u \mid x, m^{(2)})}{\hat{p}_{C_i}(u \mid x, m)}\, \mathrm{d}u =: \int_{U} \hat{p}^{\boxtimes}_{C_i}(x, u \mid m)\, \mathrm{d}u. \qquad (43)$$
To compute (43) a multiplication of multivariate Gauss distributions has to be performed. The moments of the multivariate distributions $\hat{p}^{(l)}_{C_i}(u \mid x, m^{(l)})$ and $\hat{p}_{C_i}(u \mid x, m)$ are defined by
$$\mu^{(l)}_{i,j}(x) = \mathrm{E}_{C^{(l)} \mid C_i}[u_j \mid x, m^{(l)}] \quad \text{and} \quad A^{(l)}_{i,j,k}(x) = \mathrm{E}_{C^{(l)} \mid C_i}[(u_j - \mu^{(l)}_{i,j})(u_k - \mu^{(l)}_{i,k}) \mid x, m^{(l)}]$$
and likewise for $\mu_{i,j}(x)$ and $A_{i,j,k}(x)$.

The multivariate Gauss distributions are of the form
$$\hat{p}_{C_i}(u \mid x, m) = \frac{1}{\sqrt{|2\pi A_i(x)|}} \exp\Big( -(u - \mu_i(x))\,[2 A_i(x)]^{-1}\,(u - \mu_i(x))^T \Big).$$
Set $B^{(l)}_i(x) = \big[A^{(l)}_i(x)\big]^{-1}$ and $B_i(x) = [A_i(x)]^{-1}$. The operation in (43) then leads to
$$\hat{p}^{\boxtimes}_{C_i}(x, u \mid m) = \frac{\exp\Big( \hat{C}^{\boxtimes}_i(x, m) - (u - \hat{\mu}^{\boxtimes}_i(x))\,\big[2 \hat{A}^{\boxtimes}_i(x)\big]^{-1}\,(u - \hat{\mu}^{\boxtimes}_i(x))^T \Big)}{\sqrt{|2\pi \hat{A}^{\boxtimes}_i(x)|}}$$
with
$$\big[\hat{A}^{\boxtimes}_i(x)\big]^{-1} = B^{(1)}_i(x) + B^{(2)}_i(x) - B_i(x)$$
by a comparison of the terms $u(.)u^T$,
$$\hat{\mu}^{\boxtimes}_i(x) = \big( \mu^{(1)}_i(x) B^{(1)}_i(x) + \mu^{(2)}_i(x) B^{(2)}_i(x) - \mu_i(x) B_i(x) \big)\, \hat{A}^{\boxtimes}_i(x)$$
by a comparison of the in $u$ linear terms, and
$$2 \hat{C}^{\boxtimes}_i(x, m) = \hat{\mu}^{\boxtimes}_i(x) \big[\hat{A}^{\boxtimes}_i(x)\big]^{-1} \hat{\mu}^{\boxtimes T}_i(x) + \mu_i(x) B_i(x) \mu^T_i(x) - \mu^{(1)}_i(x) B^{(1)}_i(x) \mu^{(1)T}_i(x) - \mu^{(2)}_i(x) B^{(2)}_i(x) \mu^{(2)T}_i(x) - \log \frac{|A^{(1)}_i(x)|\, |A^{(2)}_i(x)|}{|\hat{A}^{\boxtimes}_i(x)|\, |A_i(x)|}$$
by a consideration of the remaining constant. From the definition of the multivariate distributions then follows that
$$\hat{P}^{\boxtimes}_{C_i}(x \mid m) \propto \int_{U} \hat{p}^{\boxtimes}_{C_i}(x, u \mid m)\, \mathrm{d}u = \exp\big( \hat{C}^{\boxtimes}_i(x \mid m) \big),$$
respectively
$$\hat{L}^{\boxtimes}_i(m) = \tfrac{1}{2} \log_2(e) \cdot \big( \hat{C}^{\boxtimes}_i(+1 \mid m) - \hat{C}^{\boxtimes}_i(-1 \mid m) \big).$$

References

[1] L.R. Bahl, J. Cocke, F. Jelinek, and J. Raviv. Optimal decoding of linear codes for minimizing the symbol error rate. IEEE Transactions on Information Theory, IT-20:284-287, March 1974.
[2] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo codes. In IEEE Int. Conf. on Communication, Geneva, May 1993.
[3] E.L. Blokh and V.V. Zyablov. Coding of generalized concatenated codes. Problems of Information Transmission, 10(3):218-222, 1974. (In Russian.)
[4] I. Dumer. Handbook of Coding Theory, Volume 2, chapter Concatenated Codes and their multilevel generalizations, pages 1911-1989. North Holland, 1998.
[5] G.D. Forney, Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268-278, March 1973.
[6] R. Gallager. Low-density parity-check codes. IEEE Transactions on Information Theory, 8(1):21-28, 1962.
[7] S. Gligorevic. Joint channel estimation and equalisation of fast time-variant frequency-selective channels. European Transactions on Telecommunications, Communications Theory, 2006.
[8] A. Heim, V.R. Sidorenko, and U. Sorger. Trellis computations. To be submitted, IEEE Transactions on Information Theory, 2007.
[9] E.T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, 2003.
[10] R. Johannesson and K. Zigangirov. Fundamentals of Convolutional Coding. IEEE Press, New York, 1999.
[11] F.R. Kschischang and V. Sorokine. On the trellis structure of block codes. IEEE Transactions on Information Theory, 41(6):1924-1937, 1995.
[12] I. Land. Reliability Information in Channel Decoding. PhD thesis, Christian-Albrechts-University of Kiel, 2005.
[13] D.J.C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, June 2002.
[14] F.J. MacWilliams and N.J. Sloane. The Theory of Error Correcting Codes. New York: North-Holland, 1983.
[15] J.L. Massey. Applied Digital Information Theory, I+II. Lecture Notes, ETH Zurich, Switzerland.
[16] R.J. McEliece, D.J.C. MacKay, and J.-F. Cheng. Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140-152, 1998.
[17] T. Richardson and R. Urbanke. The capacity of low-density parity-check codes under message-passing decoding. IEEE Transactions on Information Theory, 47, 2001.
[18] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.
[19] S. ten Brink. Convergence of iterative decoding. Electronics Letters, 35(10):806-808, May 1999.