Counting Distinctions: On the Conceptual Foundations of Shannon's Information Theory
David Ellerman*
Department of Philosophy
University of California at Riverside
October 29, 2018

* This paper is dedicated to the memory of Gian-Carlo Rota: mathematician, philosopher, mentor, and friend.

Contents

1 Towards a Logic of Partitions
2 Logical Information Theory
  2.1 The Closure Space $U \times U$
  2.2 Some Set Structure Theorems
  2.3 Logical Information Theory on Finite Sets
  2.4 Using General Finite Probability Distributions
  2.5 A Brief History of the Logical Entropy Formula: $h(p) = 1 - \sum_i p_i^2$
3 Relationship between the Logical and Shannon Entropies
  3.1 The Search Approach to Find the "Sent Message"
  3.2 Distinction-based Treatment of Shannon's Entropy
  3.3 Relationships Between the Block Entropies
  3.4 A Coin-Weighing Example
  3.5 Block-count Entropy
4 Analogous Concepts for Shannon and Logical Entropies
  4.1 Independent Partitions
  4.2 Mutual Information
  4.3 Cross Entropy and Divergence
  4.4 Summary of Analogous Concepts and Results
5 Concluding Remarks

Abstract

Categorical logic has shown that modern logic is essentially the logic of subsets (or "subobjects"). Partitions are dual to subsets, so there is a dual logic of partitions where a "distinction" [an ordered pair of distinct elements $(u, u')$ from the universe $U$] is dual to an "element." An element being in a subset is analogous to a partition $\pi$ on $U$ making a distinction, i.e., to $u$ and $u'$ being in different blocks of $\pi$. Subset logic leads to finite probability theory by taking the (Laplacian) probability as the normalized size of each subset-event of a finite universe. The analogous step in the logic of partitions is to assign to a partition the number of distinctions made by the partition, normalized by the total number $|U|^2$ of ordered pairs from the finite universe. That yields a notion of "logical entropy" for partitions and a "logical information theory." The logical theory directly counts the (normalized) number of distinctions in a partition while Shannon's theory gives the average number of binary partitions needed to make those same distinctions. Thus the logical theory is seen as providing a conceptual underpinning for Shannon's theory based on the logical notion of "distinctions."

1 Towards a Logic of Partitions

In ordinary logic, a statement $P(a)$ is formed by a predicate $P(x)$ applying to an individual name "a" (which could be an $n$-tuple in the case of relations).
The predicate is modeled by a subset $S_P$ of a universe set $U$, and an individual name such as "a" would be assigned an individual $u_a \in U$ (an $n$-tuple in the case of relations). The statement $P(a)$ would hold in the model if $u_a \in S_P$. In short, logic is modeled as the logic of subsets of a set. Largely due to the efforts of William Lawvere, the modern treatment of logic was reformulated and vastly generalized using category theory in what is now called categorical logic. Subsets were generalized to subobjects or "parts" (equivalence classes of monomorphisms) so that logic has become the logic of subobjects.¹

There is a duality between subsets of a set and partitions² on a set. "The dual notion (obtained by reversing the arrows) of 'part' is the notion of partition." [23, p. 85] In category theory, this emerges as the reverse-the-arrows duality between monomorphisms (monos), e.g., injective set functions, and epimorphisms (epis), e.g., surjective set functions, and between subobjects and quotient objects. If modern logic is formulated as the logic of subsets, or more generally, subobjects or "parts," then the question naturally arises of a dual logic that might play the analogous role for partitions and their generalizations.

Quite aside from category theory duality, it has long been noted in combinatorial mathematics, e.g., in Gian-Carlo Rota's work in combinatorial theory and probability theory [3], that there is a type of duality between subsets of a set and partitions on a set. Just as subsets of a set are partially ordered by inclusion, so partitions on a set are partially ordered by refinement.³ Moreover, both partial orderings are in fact lattices (i.e., have meets and joins) with a top element $\hat{1}$ and a bottom element $\hat{0}$. In the lattice of all subsets $\mathcal{P}(U)$ (the power set) of a set $U$, the meet and join are, of course, intersection and union, while the top element is the universe $U$ and the bottom element is the null set $\emptyset$. In the lattice of all partitions $\Pi(U)$ on a non-empty set $U$, there are also meet and join operations (defined later), while the bottom element is the indiscrete partition (the "blob") where all of $U$ is one block, and the top element is the discrete partition where each element of $U$ is a singleton block.⁴

This paper is part of a research programme to develop the general dual logic of partitions. The principal novelty in this paper is an analogy between the usual semantics for subset logic and a suggested semantics for partition logic; the themes of the paper unfold from that starting point.

Footnote 1: See [23] Appendix A for a good treatment.
Footnote 2: A partition $\pi$ on a set $U$ is usually defined as a mutually exclusive and jointly exhaustive set $\{B\}_{B \in \pi}$ of subsets or "blocks" $B \subseteq U$. Every equivalence relation on a set $U$ determines a partition on $U$ (with the equivalence classes as the blocks) and vice-versa. For our purposes, it is useful to think of partitions as binary relations defined as the complement to an equivalence relation in the set of ordered pairs $U \times U$. Intuitively, they have complementary functions in the sense that equivalence relations identify while partitions distinguish elements of $U$.
Footnote 3: A partition $\pi$ is more refined than a partition $\sigma$, written $\sigma \preceq \pi$, if each block of $\pi$ is contained in some block of $\sigma$. Much of the older literature (e.g., [5, Example 6, p. 2])
writes this relationship the other way around but, for reasons that will become clear, we are adopting a newer way of writing refinement (e.g., [14]) so that the more refined partition is higher in the refinement ordering.
Footnote 4: Rota and his students have developed a logic for a special type of equivalence relation (which is rather ubiquitous in mathematics) using join and meet as the only connectives. [7]

Starting with the analogy between a subset of a set and a partition on the set, the analogue to the notion of an element of a subset is the notion of a distinction of a partition, which is simply an ordered pair $(u, u') \in U \times U$ of elements in distinct blocks of the partition.⁵ The logic of subsets leads to finite probability theory, where events are subsets $S$ of a finite sample space $U$ and probabilities $\mathrm{Prob}(S)$ are assigned to subsets (e.g., the Laplacian equiprobable distribution where $\mathrm{Prob}(S) = |S|/|U|$). Following the suggested analogies, the logic of partitions similarly leads to a "logical" information theory where the numerical value naturally assigned to a partition can be seen as the logical information content or logical entropy $h(\pi)$ of the partition. It is initially defined in a Laplacian manner as the number of distinctions that a partition makes, normalized by the number of ordered pairs of the universe set $U$. The probability interpretation of $h(\pi)$ is the probability that a random pair from $U \times U$ is distinguished by $\pi$, just as $\mathrm{Prob}(S)$ is the probability that a random choice from $U$ is an element of $S$.

This logical entropy is precisely related to Shannon's entropy measure [32], so the development of logical information theory can be seen as providing a new conceptual basis for information theory at the basic level of logic, using "distinctions" as the conceptual atoms.

Historically and conceptually, probability theory started with the simple logical operations on subsets (e.g., union, intersection, and complementation) and assigned a numerical measure to subsets of a finite set of outcomes (the number of favorable outcomes divided by the total number of outcomes). Then probability theory "took off" from these simple beginnings to become a major branch of pure and applied mathematics. The research programme for partition logic that underlies this paper sees Shannon's information theory as "taking off" from the simple notions of partition logic, in analogy with the conceptual development of probability theory that starts with the simple notions of subset logic. But historically, Shannon's information theory appeared "as a bolt out of the blue" in a rather sophisticated and axiomatic form. Moreover, partition logic is still in its infancy today, not to mention over half a century ago when Shannon's theory was published.⁶ But starting with the suggested semantics for partition logic (i.e., the subset-to-partition and element-to-distinction analogies), we develop the partition analogue ("counting distinctions") of the beginnings of finite probability theory ("counting outcomes"), and then we show how it is related to the already-developed information theory of Shannon. It is in that sense that the developments in the paper provide a logical or conceptual foundation ("foundation" in the sense of a basic conceptual starting point) for information theory.⁷
The following table sets out some of the analogies in a concise form (where the diagonal in $U \times U$ is $\Delta U = \{(u, u) \mid u \in U\}$).

Table of Analogies

| | Subsets | Partitions |
|---|---|---|
| "Atoms" | Elements | Distinctions |
| All atoms | Universe $U$ (all $u \in U$) $= \hat{1}$ | Discrete partition $\hat{1}$ (all dits) |
| No atoms | Null set $\emptyset$ (no $u \in U$) $= \hat{0}$ | Indiscrete partition $\hat{0}$ (no dits) |
| Model of proposition or event | Subset $S \subseteq U$ | Partition $\pi$ on $U$ |
| Model of individual or outcome | Element $u$ in $U$ | Distinction $(u, u')$ in $U \times U - \Delta U$ |
| Prop. holds or event occurs | Element $u$ in subset $S$ | Partition $\pi$ distinguishes $(u, u')$ |
| Lattice of propositions/events | Lattice of all subsets $\mathcal{P}(U)$ | Lattice of all partitions $\Pi(U)$ |
| Counting measure ($U$ finite) | Number of elements in $S$ | Number of dits (as ordered pairs) in $\pi$ |
| Normalized count ($U$ finite) | $\mathrm{Prob}(S) = \frac{\#\text{ elements in } S}{\lvert U \rvert}$ | $h(\pi) = \frac{\#\text{ distinctions in } \pi}{\lvert U \times U \rvert}$ |
| Prob. interpretation ($U$ finite) | $\mathrm{Prob}(S)$ = probability that a random element $u$ is in $S$ | $h(\pi)$ = probability that a random pair $(u, u')$ is distinguished by $\pi$ |

Footnote 5: Intuitively, we might think of an element of a set as an "it." We will argue that a distinction or "dit" is the corresponding logical atom of information. In economics, there is a basic distinction between rivalrous goods (where more for one means less for another), such as material things ("its"), in contrast to non-rivalrous goods (where what one person acquires does not take away from another), such as ideas, knowledge, and information ("bits" or "dits"). In that spirit, an element of a set represents a material thing, an "it," while the dual notion of a distinction or "dit" represents the immaterial notion of two "its" being distinct. The distinction between $u$ and $u'$ is the fact that $u \neq u'$, not a new "thing" or "it." But for mathematical purposes we may represent a distinction by a pair of distinct elements, such as the ordered pair $(u, u')$, which is a higher-level "it," i.e., an element in the Cartesian product of a set with itself (see the next section).
Footnote 6: For instance, the conceptual beginnings of probability theory in subset logic are shown by the role of Boolean algebras in probability theory, but what is the corresponding algebra for partition logic?
Footnote 7: Perhaps an analogy will be helpful. It is as if the axioms for probability theory had first emerged full-blown from Kolmogorov [21] and then one realized belatedly that the discipline could be seen as growing out of the starting point of operations on subsets of a finite space of outcomes, where the logic was the logic of subsets.

These analogies show one set of reasons why the lattice of partitions $\Pi(U)$ should be written with the discrete partition as the top element and the indiscrete partition (blob) as the bottom element of the lattice, in spite of the usual convention of writing the "refinement" ordering the other way around as what Gian-Carlo Rota called the "unrefinement ordering." With this motivation, we turn to the development of this conceptual basis for information theory.

2 Logical Information Theory

2.1 The Closure Space $U \times U$

Claude Shannon's classic 1948 articles [32] developed a statistical theory of communications that is ordinarily called "information theory." Shannon built upon the work of Ralph Hartley [15] twenty years earlier.
After Shannon's information theory was presented axiomatically, there was a spate of new definitions of "entropy" with various axiomatic properties but without concrete (never mind logical) interpretations [20]. Here we take the approach of starting with a notion that arises naturally in the logic of partitions, dual to the usual logic of subsets. The notion of a distinction or "dit" is taken as the logical atom of information, and a "logical information theory" is developed based on that interpretation. When the universe set $U$ is finite, then we have a numerical notion of "information" or "entropy" $h(\pi)$ of a partition $\pi$ in the number of distinctions normalized by the number of ordered pairs. This logical "counting distinctions" notion of information or entropy can then be related to Shannon's measure of information or entropy.

The basic conceptual unit in logical information theory is the distinction or dit (from "DIsTinction" but motivated by "bit"). A pair $(u, u')$ of distinct elements of $U$ is distinguished by $\pi$, i.e., forms a dit of $\pi$, if $u$ and $u'$ are in different blocks of $\pi$.⁸ A pair $(u, u')$ is identified by $\pi$ and forms an indit (from INDIsTinction or "identification") of the partition if $u$ and $u'$ are contained in the same block of $\pi$. A partition on $U$ can be characterized by either its dits or its indits (just as a subset $S$ of $U$ can be characterized by the elements added to the null set to arrive at $S$ or by the elements of $U$ thrown out to arrive at $S$). When a partition $\pi$ is thought of as determining an equivalence relation, then the equivalence relation, as a set of ordered pairs contained in $U \times U = U^2$, is the indit set $\mathrm{indit}(\pi)$ of indits of the partition. But from the viewpoint of logical information theory, the focus is on the distinctions, so the partition $\pi$ qua binary relation is given by the complementary dit set $\mathrm{dit}(\pi)$ of dits, where $\mathrm{dit}(\pi) = (U \times U) - \mathrm{indit}(\pi) = \mathrm{indit}(\pi)^c$. Rather than think of the partition as resulting from identifications made to the elements of $U$ (i.e., distinctions excluded from the discrete partition), we think of it as being formed by making distinctions starting with the blob. This is analogous to a subset $S$ being thought of as the set of elements that must be added to the null set to obtain $S$, rather than the complementary approach of giving the elements excluded from $U$ to arrive at $S$. From this viewpoint, the natural ordering $\sigma \preceq \pi$ of partitions would be given by the inclusion ordering of dit-sets, $\mathrm{dit}(\sigma) \subseteq \mathrm{dit}(\pi)$, and that is exactly the new way of writing the refinement relation that we are using, i.e., $\sigma \preceq \pi$ iff $\mathrm{dit}(\sigma) \subseteq \mathrm{dit}(\pi)$.

Footnote 8: One might also develop the theory using unordered pairs $\{u, u'\}$, but the later development of the theory using probabilistic methods is much facilitated by using ordered pairs $(u, u')$. Thus for $u \neq u'$, $(u, u')$ and $(u', u)$ count as two distinctions. This means that the count of distinctions in a partition must be normalized by $|U \times U|$. Note that $U \times U$ includes the diagonal self-pairs $(u, u)$, which can never be distinctions.

There is a natural ("built-in") closure operation on $U \times U$ so that the equivalence relations on $U$ are given (as binary relations) by the closed sets.
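For a finite $U$, the dit and indit sets are directly computable. The following Python sketch (the five-element universe and three-block partition are hypothetical choices for illustration) builds both sets as sets of ordered pairs and checks that they are complementary in $U \times U$:

```python
from itertools import product

U = {0, 1, 2, 3, 4}
pi = [{0, 1}, {2, 3}, {4}]  # a sample partition on U with three blocks

def indit(partition):
    """Indit set: ordered pairs (u, u') with u and u' in the same block."""
    return {(u, v) for B in partition for u in B for v in B}

def dit(partition, U):
    """Dit set: ordered pairs (u, u') with u and u' in distinct blocks."""
    return set(product(U, U)) - indit(partition)

assert dit(pi, U) | indit(pi) == set(product(U, U))  # complementary in U x U
assert all((u, u) not in dit(pi, U) for u in U)      # diagonal pairs are never dits
print(len(dit(pi, U)))                               # 16 of the 25 ordered pairs are dits
```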
A subset $C \subseteq U^2$ is closed if it contains the diagonal $\{(u, u) \mid u \in U\}$, if $(u, u') \in C$ implies $(u', u) \in C$, and if $(u, u')$ and $(u', u'')$ are in $C$, then $(u, u'')$ is in $C$. Thus the closed sets of $U^2$ are the reflexive, symmetric, and transitive relations, i.e., the equivalence relations on $U$. The intersection of closed sets is closed, and the intersection of all closed sets containing a subset $S \subseteq U^2$ is the closure $\overline{S}$ of $S$. It should be carefully noted that the closure operation on the closure space $U^2$ is not a topological closure operation, in the sense that the union of two closed sets is not necessarily closed. In spite of the closure operation not being topological, we may still refer to the complements of closed sets as open sets, i.e., the dit sets of partitions on $U$. As usual, the interior $\mathrm{int}(S)$ of any subset $S$ is defined as the complement of the closure of its complement: $\mathrm{int}(S) = \left(\overline{S^c}\right)^c$.

The open sets of $U \times U$ ordered by inclusion form a lattice isomorphic to the lattice $\Pi(U)$ of partitions on $U$. The closed sets of $U \times U$ ordered by inclusion form a lattice isomorphic to $\Pi(U)^{op}$, the opposite of the lattice of partitions on $U$ (formed by turning around the partial order). The motivation for writing the refinement relation in the old way was probably that equivalence relations were thought of as binary relations $\mathrm{indit}(\pi) \subseteq U \times U$, so the ordering of equivalence relations was written to reflect the inclusion ordering between indit-sets. But since a partition and an equivalence relation were then taken as essentially the "same thing," i.e., a set $\{B\}_{B \in \pi}$ of mutually exclusive and jointly exhaustive subsets ("blocks" or "equivalence classes") of $U$, that way of writing the ordering carried over to partitions. But we identify a partition $\pi$ as a binary relation with its dit-set $\mathrm{dit}(\pi) = U \times U - \mathrm{indit}(\pi)$, so our refinement ordering is the inclusion ordering between dit-sets (the opposite of the inclusion ordering of indit-sets).⁹

Given two partitions $\pi$ and $\sigma$ on $U$, the open set corresponding to the join $\pi \vee \sigma$ of the partitions is the partition whose dit-set is the union of their dit-sets:¹⁰

$\mathrm{dit}(\pi \vee \sigma) = \mathrm{dit}(\pi) \cup \mathrm{dit}(\sigma)$.

The open set corresponding to the meet $\pi \wedge \sigma$ of partitions is the interior of the intersection of their dit-sets:¹¹

$\mathrm{dit}(\pi \wedge \sigma) = \mathrm{int}(\mathrm{dit}(\pi) \cap \mathrm{dit}(\sigma))$.

The open set corresponding to the bottom or blob $\hat{0}$ is the null set $\emptyset \subseteq U \times U$ (no distinctions), and the open set corresponding to the discrete partition or top $\hat{1}$ is the complement of the diagonal, $U \times U - \Delta U$ (all distinctions).

Footnote 9: One way to establish the duality between elements of subsets and distinctions in a partition is to start with the refinement relation as the partial order in the lattice of partitions $\Pi(U)$, analogous to the inclusion partial order in the lattice of subsets $\mathcal{P}(U)$. Then the mapping $\pi \mapsto \mathrm{dit}(\pi)$ represents the lattice of partitions as the lattice of open subsets of the closure space $U \times U$ with inclusion as the partial order. Then the analogue of the elements in the subsets of $\mathcal{P}(U)$ would be the elements in the subsets $\mathrm{dit}(\pi)$ representing the partitions, namely, the distinctions.
Footnote 10: Note that this union of dit sets gives the dit set of the "meet" in the old reversed way of writing the refinement ordering.
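The closure and interior operators can be computed by brute force for a small $U$. A minimal sketch (the two crossed partitions are hypothetical; the naive loop is only suitable at this scale) checks the two displayed formulas: the union of two dit-sets is already open, so it is the dit-set of the join, while the meet requires taking the interior of the intersection:

```python
from itertools import product

U = {0, 1, 2, 3}
full = set(product(U, U))

def closure(S):
    """Smallest reflexive, symmetric, transitive relation containing S."""
    C = set(S) | {(u, u) for u in U} | {(v, u) for (u, v) in S}
    while True:
        extra = {(a, d) for (a, b) in C for (c, d) in C if b == c} - C
        if not extra:
            return C
        C |= extra

def interior(S):
    """Complement of the closure of the complement."""
    return full - closure(full - S)

def dit(partition):
    return full - {(u, v) for B in partition for u in B for v in B}

pi, sigma = [{0, 1}, {2, 3}], [{0, 2}, {1, 3}]
union = dit(pi) | dit(sigma)
assert interior(union) == union           # the union is open: dit-set of the join
meet = interior(dit(pi) & dit(sigma))     # dit-set of the meet
print(len(union), len(meet))              # 12 and 0: here the meet is the blob
```

In this example the intersection of the dit-sets is nonempty, yet its interior is empty, illustrating the point made in the next subsection that the meet can be the blob even when the two partitions share distinctions.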
Footnote 11: Note that this is the "join" in the old reversed way of writing the refinement ordering. This operation, defined by the interior operator of the non-topological closure operation, leads to "anomalous" results such as the non-distributivity of the partition lattice, in contrast to the distributivity of the lattice of open sets of a topological space.

2.2 Some Set Structure Theorems

Before restricting ourselves to finite $U$ to use the counting measure $|\mathrm{dit}(\pi)|$, there are a few structure theorems that are independent of cardinality. If the "atom" of information is the dit, then the atomic information in a partition $\pi$ "is" its dit set, $\mathrm{dit}(\pi)$. The information common to two partitions $\pi$ and $\sigma$, their mutual information set, would naturally be the intersection of their dit sets (which is not necessarily the dit set of a partition):

$\mathrm{Mut}(\pi, \sigma) = \mathrm{dit}(\pi) \cap \mathrm{dit}(\sigma)$.

Shannon deliberately defined his measure of information so that it would be "additive" in the sense that the measure of the information in two independent probability distributions would be the sum of the information measures of the two separate distributions, and there would be zero mutual information between the independent distributions. But this is not true at the logical level with information defined as distinctions. There is always mutual information between two non-blob partitions, even though the interior of $\mathrm{Mut}(\pi, \sigma)$ might be empty, i.e., $\mathrm{int}(\mathrm{Mut}(\pi, \sigma)) = \mathrm{int}(\mathrm{dit}(\pi) \cap \mathrm{dit}(\sigma)) = \mathrm{dit}(\pi \wedge \sigma)$ might be empty so that $\pi \wedge \sigma = \hat{0}$.

Proposition 1: Given two partitions $\pi$ and $\sigma$ on $U$ with $\pi \neq \hat{0} \neq \sigma$, $\mathrm{Mut}(\pi, \sigma) \neq \emptyset$.¹²

Proof: Since $\pi$ is not the blob, consider two elements $u$ and $u'$ distinguished by $\pi$ but identified by $\sigma$ [otherwise $(u, u') \in \mathrm{Mut}(\pi, \sigma)$ and we are done]. Since $\sigma$ is also not the blob, there must be a third element $u''$ not in the same block of $\sigma$ as $u$ and $u'$. But since $u$ and $u'$ are in different blocks of $\pi$, the third element $u''$ must be distinguished from one or the other or both in $\pi$. Hence $(u, u'')$ or $(u', u'')$ must be distinguished by both partitions and thus must be in their mutual information set $\mathrm{Mut}(\pi, \sigma)$. □

The closed and open subsets of $U^2$ can be characterized using the usual notion of the blocks of a partition. Given a partition $\pi$ on $U$ as a set of blocks $\pi = \{B\}_{B \in \pi}$, let $B \times B'$ be the Cartesian product of $B$ and $B'$. Then

$\mathrm{indit}(\pi) = \bigcup_{B \in \pi} B \times B$
$\mathrm{dit}(\pi) = \bigcup_{B \neq B';\; B, B' \in \pi} B \times B' = U \times U - \mathrm{indit}(\pi) = \mathrm{indit}(\pi)^c$.

The mutual information set can also be characterized in this manner.

Proposition 2: Given partitions $\pi$ and $\sigma$ with blocks $\{B\}_{B \in \pi}$ and $\{C\}_{C \in \sigma}$, then

$\mathrm{Mut}(\pi, \sigma) = \bigcup_{B \in \pi,\, C \in \sigma} (B - (B \cap C)) \times (C - (B \cap C)) = \bigcup_{B \in \pi,\, C \in \sigma} (B - C) \times (C - B)$.

Proof: The union (which is a disjoint union) will include the pairs $(u, u')$ where for some $B \in \pi$ and $C \in \sigma$, $u \in B - (B \cap C)$ and $u' \in C - (B \cap C)$. Since $u'$ is in $C$ but not in the intersection $B \cap C$, it must be in a different block of $\pi$ than $B$, so $(u, u') \in \mathrm{dit}(\pi)$. Symmetrically, $(u, u') \in \mathrm{dit}(\sigma)$, so $(u, u') \in \mathrm{Mut}(\pi, \sigma) = \mathrm{dit}(\pi) \cap \mathrm{dit}(\sigma)$. Conversely, if $(u, u') \in \mathrm{Mut}(\pi, \sigma)$, then take the $B$ containing $u$ and the $C$ containing $u'$. Since $(u, u')$ is distinguished by both partitions, $u \notin C$ and $u' \notin B$, so that $(u, u') \in (B - (B \cap C)) \times (C - (B \cap C))$. □
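Both propositions are easy to confirm computationally. A small sketch (with two hypothetical crossed partitions on a six-element universe) verifies that the block-by-block union of products $(B - C) \times (C - B)$ reproduces the intersection of the dit-sets, and that the intersection is nonempty:

```python
from itertools import product

U = set(range(6))
pi = [{0, 1, 2}, {3, 4, 5}]
sigma = [{0, 3}, {1, 4}, {2, 5}]

def dit(partition):
    return set(product(U, U)) - {(u, v) for B in partition for u in B for v in B}

mut = dit(pi) & dit(sigma)  # Mut(pi, sigma)
blockwise = {(u, v) for B in pi for C in sigma for u in B - C for v in C - B}
assert mut == blockwise     # Proposition 2
assert mut                  # Proposition 1: nonempty since neither partition is the blob
print(len(mut))             # 12 ordered pairs distinguished by both partitions
```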
Footnote 12: The contrapositive of this proposition is interesting. Given two equivalence relations $E_1, E_2 \subseteq U^2$, if every pair of elements $u, u' \in U$ is identified by one or the other of the relations, i.e., $E_1 \cup E_2 = U^2$, then either $E_1 = U^2$ or $E_2 = U^2$.

2.3 Logical Information Theory on Finite Sets

For a finite set $U$, the (normalized) "counting distinctions" measure of information can be defined and compared to Shannon's measure for finite probability distributions. Since the information set of a partition $\pi$ on $U$ is its set of distinctions $\mathrm{dit}(\pi)$, the un-normalized numerical measure of the information of a partition is simply the count of that set, $|\mathrm{dit}(\pi)|$ (the "dit count"). But to account for the total number of ordered pairs of elements from $U$, we normalize by $|U \times U| = |U|^2$ to obtain the logical information content or logical entropy of a partition $\pi$ as its normalized dit count:

$h(\pi) = \frac{|\mathrm{dit}(\pi)|}{|U \times U|}$.

Probability theory started with the finite case, where there was a finite set $U$ of possibilities (the finite sample space) and an event was a subset $S \subseteq U$. Under the Laplacian assumption that each outcome was equiprobable, the probability of the event $S$ was the similar normalized counting measure of the set: $\mathrm{Prob}(S) = \frac{|S|}{|U|}$. This is the probability that any randomly chosen element of $U$ is an element of the subset $S$. In view of the dual relationship between being in a subset and being distinguished by a partition, the analogous concept would be the probability that an ordered pair $(u, u')$ of elements of $U$ chosen independently (i.e., with replacement¹³) would be distinguished by a partition $\pi$, and that is precisely the logical entropy $h(\pi) = |\mathrm{dit}(\pi)| / |U \times U|$ (since each pair randomly chosen from $U \times U$ is equiprobable).

Probabilistic interpretation: $h(\pi)$ = probability that a random pair is distinguished by $\pi$.

Footnote 13: Drawing with replacement would allow diagonal pairs $(u, u)$ to be drawn and requires $|U \times U|$ as the normalizing factor.

In finite probability theory, when a point is sampled from the sample space $U$, we say the event $S$ occurs if the point $u$ was an element in $S \subseteq U$. When a random pair $(u, u')$ is sampled from the sample space $U \times U$, we say the partition $\pi$ distinguishes¹⁴ if the pair is distinguished by the partition, i.e., if $(u, u') \in \mathrm{dit}(\pi) \subseteq U \times U$. Then just as we take $\mathrm{Prob}(S)$ as the probability that the event $S$ occurs, so the logical entropy $h(\pi)$ is the probability that the partition $\pi$ distinguishes. Since $\mathrm{dit}(\pi \vee \sigma) = \mathrm{dit}(\pi) \cup \mathrm{dit}(\sigma)$,

probability that $\pi \vee \sigma$ distinguishes $= h(\pi \vee \sigma) =$ probability that $\pi$ or $\sigma$ distinguishes.

The probability that a randomly chosen pair would be distinguished by both $\pi$ and $\sigma$ would be given by the relative cardinality of the mutual information set, which is called the mutual information of the partitions:

Mutual logical information: $m(\pi, \sigma) = \frac{|\mathrm{Mut}(\pi, \sigma)|}{|U|^2}$ = probability that $\pi$ and $\sigma$ both distinguish.

Since the cardinality of intersections of sets can be analyzed using the inclusion-exclusion principle, we have: $|\mathrm{Mut}(\pi, \sigma)| = |\mathrm{dit}(\pi) \cap \mathrm{dit}(\sigma)| = |\mathrm{dit}(\pi)| + |\mathrm{dit}(\sigma)| - |\mathrm{dit}(\pi) \cup \mathrm{dit}(\sigma)|$.
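The probability interpretation lends itself to a simulation check. A minimal sketch (partition chosen arbitrarily for illustration): the frequency with which a random ordered pair, drawn with replacement, lands in $\mathrm{dit}(\pi)$ converges on $h(\pi)$:

```python
import random
from itertools import product

U = list(range(8))
pi = [{0, 1, 2, 3}, {4, 5}, {6, 7}]

dits = set(product(U, U)) - {(u, v) for B in pi for u in B for v in B}
h = len(dits) / len(U) ** 2          # logical entropy: normalized dit count

trials = 100_000
hits = sum((random.choice(U), random.choice(U)) in dits for _ in range(trials))
print(h, hits / trials)              # 0.625 versus the simulated frequency
```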
Footnote 14: Equivalent terminology would be "differentiates" or "discriminates."

Normalizing, the probability that a random pair is distinguished by both partitions is given by the modular law:

$m(\pi, \sigma) = \frac{|\mathrm{dit}(\pi) \cap \mathrm{dit}(\sigma)|}{|U|^2} = \frac{|\mathrm{dit}(\pi)|}{|U|^2} + \frac{|\mathrm{dit}(\sigma)|}{|U|^2} - \frac{|\mathrm{dit}(\pi) \cup \mathrm{dit}(\sigma)|}{|U|^2} = h(\pi) + h(\sigma) - h(\pi \vee \sigma)$.

This can be extended by the inclusion-exclusion principle to any number of partitions. The mutual information set $\mathrm{Mut}(\pi, \sigma)$ is not the dit-set of a partition, but its interior is the dit-set of the meet, so the logical entropies of the join and meet satisfy the:

Submodular inequality: $h(\pi \wedge \sigma) + h(\pi \vee \sigma) \leq h(\pi) + h(\sigma)$.

2.4 Using General Finite Probability Distributions

Since the logical entropy of a partition on a finite set can be given a simple probabilistic interpretation, it is not surprising that many methods of probability theory can be harnessed to develop the theory. The theory for the finite case can be developed at two different levels of generality: using the specific Laplacian equiprobability distribution on the finite set $U$, or using an arbitrary finite probability distribution. Correctly formulated, all the formulas concerning logical entropy and the related concepts will work for the general case, but our purpose is not mathematical generality. Our purpose is to give the basic motivating example of logical entropy based on "counting distinctions" and to show its relationship to Shannon's notion of entropy, thereby clarifying the logical foundations of the latter concept.

Every probability distribution on a finite set $U$ gives a probability $p_B$ for each block $B$ in a partition $\pi$, but for the Laplacian distribution it is just the relative cardinality of the block: $p_B = \frac{|B|}{|U|}$ for blocks $B \in \pi$. Since there are no empty blocks, $p_B > 0$ and $\sum_{B \in \pi} p_B = 1$. Since the dit set of a partition is $\mathrm{dit}(\pi) = \bigcup_{B \neq B'} B \times B'$, its size is $|\mathrm{dit}(\pi)| = \sum_{B \neq B'} |B||B'| = \sum_{B \in \pi} |B||U - B|$. Thus the logical information or entropy in a partition, as the normalized size of the dit set, can be developed as follows:

$h(\pi) = \frac{\sum_{B \neq B'} |B||B'|}{|U| \times |U|} = \sum_{B \neq B'} p_B p_{B'} = \sum_{B \in \pi} p_B (1 - p_B) = 1 - \sum_{B \in \pi} p_B^2$.

Having defined and interpreted logical entropy in terms of the distinctions of a set partition, we may, if desired, "kick away the ladder" and define the logical entropy of any finite probability distribution $p = \{p_1, \ldots, p_n\}$ as:

$h(p) = \sum_{i=1}^n p_i (1 - p_i) = 1 - \sum_{i=1}^n p_i^2$.

The probabilistic interpretation is that $h(p)$ is the probability that two independent draws (from the sample space of $n$ points with these probabilities) will give distinct points.¹⁵

Footnote 15: Note that we can always rephrase in terms of partitions by taking $h(p)$ as the entropy $h(\hat{1})$ of the discrete partition on $U = \{u_1, \ldots, u_n\}$ with the $p_i$'s as the probabilities of the singleton blocks $\{u_i\}$ of the discrete partition.
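Both the block-probability formula and the modular law can be verified numerically. The sketch below (reusing the same style of hypothetical partitions as above) computes logical entropies from dit-set counts and confirms $m(\pi, \sigma) = h(\pi) + h(\sigma) - h(\pi \vee \sigma)$ as well as $h(\pi) = 1 - \sum_B p_B^2$:

```python
from itertools import product

U = set(range(6))
pi = [{0, 1, 2}, {3, 4, 5}]
sigma = [{0, 3}, {1, 4}, {2, 5}]
n2 = len(U) ** 2

def dit(partition):
    return set(product(U, U)) - {(u, v) for B in partition for u in B for v in B}

h_pi = len(dit(pi)) / n2
h_sig = len(dit(sigma)) / n2
h_join = len(dit(pi) | dit(sigma)) / n2   # dit-set of the join is the union
m = len(dit(pi) & dit(sigma)) / n2        # mutual logical information
assert abs(m - (h_pi + h_sig - h_join)) < 1e-12                        # modular law
assert abs(h_pi - (1 - sum((len(B) / len(U)) ** 2 for B in pi))) < 1e-12
```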
2.5 A Brief History of the Logical Entropy Formula: $h(p) = 1 - \sum_i p_i^2$

The logical entropy formula $h(p) = 1 - \sum_i p_i^2$ was motivated as the normalized count of the distinctions made by a partition, $|\mathrm{dit}(\pi)| / |U|^2$, when the probabilities are the block probabilities $p_B = \frac{|B|}{|U|}$ of a partition on a set $U$ (under a Laplacian assumption). The complementary measure $1 - h(p) = \sum_i p_i^2$ would be motivated as the normalized count of the identifications made by a partition, $|\mathrm{indit}(\pi)| / |U|^2$, thought of as an equivalence relation. Thus $1 - \sum_i p_i^2$, motivated by distinctions, is a measure of heterogeneity or diversity, while the complementary measure $\sum_i p_i^2$, motivated by identifications, is a measure of homogeneity or concentration.

Historically, the formula can be found in either form depending on the particular context. The $p_i$'s might be relative shares, such as the relative share of organisms of the $i$th species in some population of organisms, and then the interpretation of $p_i$ as a probability arises by considering the random choice of an organism from the population. According to I. J. Good, the formula has a certain naturalness: "If $p_1, \ldots, p_t$ are the probabilities of $t$ mutually exclusive and exhaustive events, any statistician of this century who wanted a measure of homogeneity would have taken about two seconds to suggest $\sum p_i^2$ which I shall call $\rho$." [13, p. 561] As noted by Bhargava and Uppuluri [4], the formula $1 - \sum p_i^2$ was used by Gini in 1912 ([10] reprinted in [11, p. 369]) as a measure of "mutability" or diversity.

But another development of the formula (in the complementary form) in the early twentieth century was in cryptography. The American cryptologist William F. Friedman devoted a 1922 book ([8]) to the "index of coincidence" (i.e., $\sum p_i^2$). Solomon Kullback (see the Kullback-Leibler divergence treated later) worked as an assistant to Friedman and wrote a book on cryptology which used the index. [22] During World War II, Alan M. Turing worked for a time in the Government Code and Cypher School at the Bletchley Park facility in England. Probably unaware of the earlier work, Turing used $\rho = \sum p_i^2$ in his cryptanalysis work and called it the repeat rate, since it is the probability of a repeat in a pair of independent draws from a population with those probabilities (i.e., the identification probability $1 - h(p)$). Polish cryptanalysts had independently used the repeat rate in their work on the Enigma [27].

After the war, Edward H. Simpson, a British statistician, proposed $\sum_{B \in \pi} p_B^2$ as a measure of species concentration (the opposite of diversity), where $\pi$ is the partition of animals or plants according to species and where each animal or plant is considered as equiprobable. And Simpson gave the interpretation of this homogeneity measure as "the probability that two individuals chosen at random and independently from the population will be found to belong to the same group." [33, p. 688] Hence $1 - \sum_{B \in \pi} p_B^2$ is the probability that a random ordered pair will belong to different species, i.e., will be distinguished by the species partition. In the biodiversity literature [31], the formula is known as "Simpson's index of diversity" or sometimes the "Gini-Simpson diversity index." However, Simpson along with I. J. Good worked at Bletchley during WWII, and, according to Good, "E. H. Simpson and I both obtained the notion [the repeat rate] from Turing." [12, p. 395] When Simpson published the index in 1948, he (again, according to Good) did not acknowledge Turing, "fearing that to acknowledge him would be regarded as a breach of security." [13, p. 562]

In 1945, Albert O. Hirschman ([18, p. 159] and [19])
suggested using $\sqrt{\sum_i p_i^2}$ as an index of trade concentration (where $p_i$ is the relative share of trade in a certain commodity or with a certain partner). A few years later, Orris Herfindahl [17] independently suggested using $\sum_i p_i^2$ as an index of industrial concentration (where $p_i$ is the relative share of the $i$th firm in an industry). In the industrial economics literature, the index $H = \sum_i p_i^2$ is variously called the Hirschman-Herfindahl index, the HH index, or just the H index of concentration. If all the relative shares were equal (i.e., $p_i = 1/n$), then the identification or repeat probability is just the probability of drawing any element, i.e., $H = 1/n$, so $\frac{1}{H} = n$ is the number of equal elements. This led to the "numbers equivalent" interpretation of the reciprocal of the H index [2]. In general, given an event with probability $p_0$, the "numbers-equivalent" interpretation of the event is that it is 'as if' an element was drawn out of a set of $\frac{1}{p_0}$ equiprobable elements (it is 'as if' since $1/p_0$ need not be an integer). This numbers-equivalent idea is related to the "block-count" notion of entropy defined later.

In view of the frequent and independent discovery and rediscovery of the formula $\rho = \sum_i p_i^2$ or its complement $1 - \sum_i p_i^2$ by Gini, Friedman, Turing, Hirschman, Herfindahl, and no doubt others, I. J. Good wisely advises that "it is unjust to associate $\rho$ with any one person." [13, p. 562]¹⁶

Footnote 16: The name "logical entropy" for $1 - \sum_i p_i^2$ not only denotes the basic status of the formula, it avoids "Stigler's Law of Eponymy": "No scientific discovery is named after its original discoverer." [34, p. 277]

After Shannon's axiomatic introduction of his entropy [32], there was a proliferation of axiomatic entropies with a variable parameter.¹⁷ The formula $1 - \sum_i p_i^2$ for logical entropy appeared as a special case for a specific parameter value in several cases. During the 1960's, Aczél and Daróczy [1] developed the generalized entropies of degree $\alpha$:

$H_n^\alpha(p_1, \ldots, p_n) = \frac{\sum_i p_i^\alpha - 1}{2^{1-\alpha} - 1}$

and the logical entropy occurred as half the value for $\alpha = 2$. That formula also appeared as Havrda-Charvát's structural $\alpha$-entropy [16]:

$S(p_1, \ldots, p_n; \alpha) = \frac{2^{\alpha-1}}{2^{\alpha-1} - 1}\left(1 - \sum_i p_i^\alpha\right)$

and the special case of $\alpha = 2$ was considered by Vajda [36]. Patil and Taillie [25] defined the diversity index of degree $\beta$ in 1982:

$\Delta_\beta = \frac{1 - \sum_i p_i^{\beta+1}}{\beta}$

and Tsallis [35] independently gave the same formula as an entropy formula in 1988:

$S_q(p_1, \ldots, p_n) = \frac{1 - \sum_i p_i^q}{q - 1}$

where the logical entropy formula occurs as a special case ($\beta = 1$ or $q = 2$). While the generalized parametric entropies may be interesting as axiomatic exercises, our purpose is to emphasize the specific logical interpretation of the logical entropy formula (or its complement).

Footnote 17: There was no need for Shannon to present his entropy concept axiomatically since it was based on a standard concrete interpretation (the expected number of binary partitions needed to distinguish a designated element), which could then be generalized. The axiomatic development encouraged the presentation of other "entropies" as if the axioms eliminated or, at least, relaxed any need for an interpretation of the "entropy" concept.
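Each of these parametric families specializes to the logical entropy formula, which is easy to confirm numerically. In the sketch below the distribution $p$ is an arbitrary example; the assertions check the stated special cases ($\alpha = 2$ giving twice the logical entropy for Aczél-Daróczy, $\beta = 1$ for Patil-Taillie, $q = 2$ for Tsallis):

```python
def logical_entropy(p):
    return 1 - sum(pi ** 2 for pi in p)

def aczel_daroczy(p, alpha):
    return (sum(pi ** alpha for pi in p) - 1) / (2 ** (1 - alpha) - 1)

def patil_taillie(p, beta):
    return (1 - sum(pi ** (beta + 1) for pi in p)) / beta

def tsallis(p, q):
    return (1 - sum(pi ** q for pi in p)) / (q - 1)

p = [0.5, 0.25, 0.125, 0.125]
assert abs(aczel_daroczy(p, 2) / 2 - logical_entropy(p)) < 1e-12  # half the value
assert abs(patil_taillie(p, 1) - logical_entropy(p)) < 1e-12
assert abs(tsallis(p, 2) - logical_entropy(p)) < 1e-12
```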
From the logical viewpoint, two elements from $U = \{u_1, \ldots, u_n\}$ are either identical or distinct. Gini [10] introduced $d_{ij}$ as the "distance" between the $i$th and $j$th elements, where $d_{ij} = 1$ for $i \neq j$ and $d_{ii} = 0$. Since $1 = (p_1 + \cdots + p_n)(p_1 + \cdots + p_n) = \sum_i p_i^2 + \sum_{i \neq j} p_i p_j$, the logical entropy, i.e., Gini's index of mutability, $h(p) = 1 - \sum_i p_i^2 = \sum_{i \neq j} p_i p_j$, is the average logical distance between a pair of independently drawn elements. But one might generalize by allowing other distances $d_{ij} = d_{ji}$ for $i \neq j$ (but always $d_{ii} = 0$), so that $Q = \sum_{i \neq j} d_{ij} p_i p_j$ would be the average distance between a pair of independently drawn elements from $U$. In 1982, C. R. (Calyampudi Radhakrishna) Rao introduced precisely this concept as quadratic entropy [26] (which was later rediscovered in the biodiversity literature as the "Avalanche Index" by Ganeshaish et al. [9]). In many domains, it is quite reasonable to move beyond the bare-bones logical distance of $d_{ij} = 1$ for $i \neq j$, so that Rao's quadratic entropy is a useful and easily interpreted generalization of logical entropy.

3 Relationship between the Logical and Shannon Entropies

3.1 The Search Approach to Find the "Sent Message"

The logical entropy $h(\pi) = \sum_{B \in \pi} p_B (1 - p_B)$, in this form as an average over blocks, allows a direct comparison with Shannon's entropy $H(\pi) = \sum_{B \in \pi} p_B \log_2\left(\frac{1}{p_B}\right)$ of the partition, which is also an average over the blocks. What is the connection between the block entropies $h(B) = 1 - p_B$ and $H(B) = \log_2 \frac{1}{p_B}$? Shannon uses reasoning (shared with Hartley) to arrive at a notion of entropy or information content for an element out of a subset (e.g., a block in a partition as a set of blocks). Then for a partition $\pi$, Shannon averages the block values to get the partition value $H(\pi)$.

Hartley and Shannon start with the question of the information required to single an element $u$ out of a set $U$, e.g., to single out the sent message from the set of possible messages. Alfréd Rényi also emphasized this "search-theoretic" approach to information theory (see [28], [29], or numerous papers in [30]).¹⁸

One intuitive measure of the information obtained by determining the designated element in a set $U$ of equiprobable elements would just be the cardinality $|U|$ of the set, and, as we will see, that leads to a multiplicative "block-count" version of Shannon's entropy. But Hartley and Shannon wanted the additivity that comes from taking the logarithm of the set size $|U|$. If $|U| = 2^n$, then this allows the crucial Shannon interpretation of $\log_2(|U|) = n$ as the minimum number of yes-or-no questions (binary partitions) it takes to single out any designated element (the "sent message") of the set. In a mathematical version of the game of twenty questions (like Rényi's Hungarian game of "Bar-Kochba"), think of each element of $U$ as being assigned a unique binary number with $n$ digits. Then the minimum $n$ questions can just be the questions asking for the $i$th binary digit of the hidden designated element. Each answer gives one bit (short for "binary digit") of information.
With this motivation for the case of $|U| = 2^n$, Shannon and Hartley take $\log(|U|)$ as the measure of the information required to single out a hidden element in a set with $|U|$ equiprobable elements.¹⁹ That extends the "minimum number of yes-or-no questions" motivation from $|U| = 2^n$ to any finite set $U$ with $|U|$ equiprobable elements. If a partition $\pi$ had equiprobable blocks, then the Shannon entropy would be $H(B) = \log(|\pi|)$, where $|\pi|$ is the number of blocks.

To extend this basic idea to sets of elements which are not equiprobable (e.g., partitions with unequal blocks), it is useful to use an old device to restate any positive probability as a chance among equiprobable elements. If $p_i = 0.02$, then there is a 1 in $50 = \frac{1}{p_i}$ chance of the $i$th outcome occurring in any trial. It is "as if" the outcome was one among $1/p_i$ equiprobable outcomes.²⁰ Thus each positive probability $p_i$ has an associated equivalent number $1/p_i$, which is the size of the hypothetical set of equiprobable elements such that the probability of drawing any given element is $p_i$.²¹ Given a partition $\{B\}_{B \in \pi}$ with unequal blocks, we motivate the block entropy $H(B)$ for a block with probability $p_B$ by taking it as the entropy for a hypothetical numbers-equivalent partition $\pi_B$ with $\frac{1}{p_B}$ equiprobable blocks, i.e., $H(B) = \log(|\pi_B|) = \log \frac{1}{p_B}$. With this motivation, the Shannon entropy of the partition is then defined as the arithmetical average of the block entropies:

Shannon's entropy: $H(\pi) = \sum_{B \in \pi} p_B H(B) = \sum_{B \in \pi} p_B \log \frac{1}{p_B}$.

This can be directly compared to the logical entropy $h(\pi) = \sum_{B \in \pi} p_B (1 - p_B)$, which arose from quite different distinction-based reasoning (e.g., where the search for a single designated element played no role). Nevertheless, the formula $\sum_{B \in \pi} p_B (1 - p_B)$ can be viewed as an average over the quantities which play the role of "block entropies" $h(B) = 1 - p_B$. But this "block entropy" cannot be directly interpreted as a (normalized) dit count, since there is no such thing as the dit count for a single block; the dits are the pairs of elements in distinct blocks.

Footnote 18: In Gian-Carlo Rota's teaching, he supposed that the Devil had picked an element out of $U$ and would not reveal its identity. But when given a binary partition (i.e., a yes-or-no question), the Devil had to truthfully tell which block contained the hidden element. Hence the problem was to find the minimum number of binary partitions needed to force the Devil to reveal the hidden element.
Footnote 19: Hartley used logs to the base 10, but here all logs are to base 2 unless otherwise indicated. Instead of considering whether the base should be 2, 10, or $e$, it is perhaps more important to see that there is a natural base-free variation $H_m(\pi)$ on Shannon's entropy (see the "block-count entropy" defined below).
Footnote 20: Since $1/p_i$ need not be an integer (or even rational), one could interpret the equiprobable "number of elements" as being heuristic, or one could restate it in continuous terms. The continuous version is the uniform distribution on the real interval $[0, 1/p_i]$, where the probability of an outcome in the unit interval $[0, 1]$ is $1/(1/p_i) = p_i$.
Footnote 21: In continuous terms, the numbers-equivalent is the length of the interval $[0, 1/p_i]$ with the uniform distribution on it.
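Both entropies are averages of their respective block entropies, and for the logical entropy the average coincides with the actual normalized dit count, a point developed in the next passage. A small sketch with a hypothetical partition makes the comparison concrete:

```python
from itertools import product
from math import log2

U = list(range(8))
pi = [{0, 1, 2, 3}, {4, 5}, {6, 7}]
ps = [len(B) / len(U) for B in pi]          # block probabilities 1/2, 1/4, 1/4

H = sum(p * log2(1 / p) for p in ps)        # Shannon: average of H(B) = log2(1/p_B)
h = sum(p * (1 - p) for p in ps)            # logical: average of h(B) = 1 - p_B

dits = set(product(U, U)) - {(u, v) for B in pi for u in B for v in B}
assert abs(h - len(dits) / len(U) ** 2) < 1e-12  # the average is the actual dit count
print(H, h)                                 # 1.5 bits versus 0.625
```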
For comparison purposes, we may nevertheless carry over the heuristic reasoning to the case of logical entropy. For each block $B$, we take the same hypothetical numbers-equivalent partition $\pi_B$ with $\frac{|U|}{|B|} = \frac{1}{p_B}$ equal blocks of size $|B|$ and then take the desired block entropy $h(B)$ as the normalized dit count $h(\pi_B)$ for that partition. Each block contributes $p_B(1 - p_B)$ to the normalized dit count, and there are $|U|/|B| = 1/p_B$ blocks in $\pi_B$, so the total normalized dit count simplifies to:

$h(\pi_B) = \frac{1}{p_B} p_B (1 - p_B) = 1 - p_B = h(B)$,

which we could take as the logical block entropy. Then the average of these logical block entropies gives the logical entropy $h(\pi) = \sum_{B \in \pi} p_B h(B) = \sum_{B \in \pi} p_B (1 - p_B)$ of the partition $\pi$, all in the manner of the heuristic development of Shannon's $H(\pi) = \sum_{B \in \pi} p_B \log \frac{1}{p_B}$.

There is, however, no need to go through this reasoning to arrive at the logical entropy of a partition as the average of block entropies. The interpretation of the logical entropy as the normalized dit count survives the averaging even though all the blocks of $\pi$ might have different sizes, i.e., the interpretation "commutes" with the averaging of block entropies. Thus $h(\pi)$ is the actual dit count (normalized) for a partition $\pi$, not just the average of block entropies $h(B)$ that could be interpreted as the normalized dit counts for hypothetical partitions $\pi_B$. The interpretation of the Shannon measure of information as the minimum number of binary questions it takes to single out a designated block does not commute with the averaging over the set of different-sized blocks in a partition. Hence the Shannon entropy of a partition is the expected number of bits it takes to single out the designated block, while the logical entropy of a partition on a set is the actual number of dits (normalized) distinguished by the partition.

The last step in connecting Shannon entropy and logical entropy is to rephrase the heuristics behind Shannon entropy in terms of "making all the distinctions" rather than "singling out the designated element."

3.2 Distinction-based Treatment of Shannon's Entropy

The search-theoretic approach was the heritage of the original application of information theory to communications, where the focus was on singling out a designated element, the sent message. In the "twenty questions" version, one person picks a hidden element and the other person seeks the minimum number of binary partitions on the set of possible answers to single out the answer. But it is simple to see that the focus on the single designated element was unnecessary. The essential point was to make all the distinctions to separate the elements, since any element could have been the designated one. If the join of the minimum number of binary partitions did not distinguish all the elements into singleton blocks, then one could not have picked out the hidden element if it was in a non-singleton block.
Hence the distinction-based treatment of Shannon's entropy amounts to rephrasing the above heuristic argument in terms of "making all the distinctions" rather than "making the distinctions necessary to single out any designated element." In the basic example of $|U| = 2^n$, where we may think of the $2^n$ like or equiprobable elements as being encoded with $n$-binary-digit numbers, $n = \log \frac{1}{1/2^n}$ is the minimum number of binary partitions (each partitioning according to one of the $n$ digits) necessary to make all the distinctions between the elements, i.e., the minimum number of binary partitions whose join is the discrete partition with singleton blocks (each block probability being $p_B = 1/2^n$). Generalizing to any set $U$ of equiprobable elements, the minimum number of bits necessary to distinguish all the elements from each other is $\log \frac{1}{1/|U|} = \log(|U|)$. Given a partition $\pi = \{B\}_{B \in \pi}$ on $U$, the block entropy $H(B) = \log \frac{1}{p_B}$ is the minimum number of bits necessary to distinguish all the blocks in the numbers-equivalent partition $\pi_B$, and the average of those block entropies gives the Shannon entropy: $H(\pi) = \sum_{B \in \pi} p_B \log \frac{1}{p_B}$.

The point of rephrasing the heuristics behind Shannon's definition of entropy in terms of the average bits needed to "make all the distinctions" is that it can then be directly compared with the logical definition of entropy, which is simply the total number of distinctions normalized by $|U|^2$. Thus the two definitions of entropy boil down to two different ways of measuring the totality of distinctions. A third way to measure the totality of distinctions, called the "block-count entropy," is defined below. Hence we have our overall theme that these three notions of entropy boil down to three ways of "counting distinctions."

3.3 Relationships Between the Block Entropies

Since the logical and Shannon entropies have formulas presenting them as averages of block entropies, $h(\pi) = \sum_{B \in \pi} p_B (1 - p_B)$ and $H(\pi) = \sum_{B \in \pi} p_B \log \frac{1}{p_B}$, the two notions are precisely related by their respective block entropies, $h(B) = 1 - p_B$ and $H(B) = \log \frac{1}{p_B}$. Solving each for $p_B$ and then eliminating it yields the:

Block entropy relationship: $h(B) = 1 - \frac{1}{2^{H(B)}}$ and $H(B) = \log \frac{1}{1 - h(B)}$.

The block entropy relation $h(B) = 1 - \frac{1}{2^{H(B)}}$ has a simple probabilistic interpretation. Thinking of $H(B)$ as an integer, $H(B)$ is the Shannon entropy of the discrete partition on $U$ with $|U| = 2^{H(B)}$ elements, while $h(B) = 1 - \frac{1}{2^{H(B)}} = 1 - p_B$ is the logical entropy of that partition, since $1/2^{H(B)}$ is the probability of each block in that discrete partition. The probability that a random pair is distinguished by a discrete partition is just the probability that the second draw is distinct from the first draw. Given the first draw from a set of $2^{H(B)}$ individuals, the probability that the second draw (with replacement) is different is $1 - \frac{1}{2^{H(B)}} = h(B)$.

To summarize the comparison up to this point: the logical theory and Shannon's theory start by posing different questions which then turn out to be precisely related. Shannon's statistical theory of communications is concerned with determining the sent message out of a set of possible messages.
In the basic case, the messages are equiprobable, so it is abstractly the problem of determining the hidden designated element out of a set of equiprobable elements which, for simplicity, we can assume has $2^n$ elements. The process of determining the hidden element can be conceptualized as the process of asking binary questions which split the set of possibilities into equiprobable parts. The answer to the first question determines which subset of $2^{n-1}$ elements contains the hidden element, and that provides 1 bit of information. An independent equal-blocked binary partition would split each of the $2^{n-1}$-element blocks into equal blocks with $2^{n-2}$ elements each. Thus 2 bits of information would determine which of those $2^2$ blocks contained the hidden element, and so forth. Thus $n$ independent equal-blocked binary partitions would determine which of the resulting $2^n$ blocks contains the hidden element. Since there are $2^n$ elements, each of those blocks is a singleton, so the hidden element has been determined. Hence the problem of finding a designated element among $2^n$ equiprobable elements requires $\log(2^n) = n$ bits of information.

The logical theory starts with the basic notion of a distinction between elements and defines the logical information in a set of $2^n$ distinct elements as the (normalized) number of distinctions that need to be made to distinguish the $2^n$ elements. The distinctions are counted as ordered rather than unordered pairs (in order to better apply the machinery of probability theory), and the number of distinctions or dits is normalized by the number of all ordered pairs. Hence a set of $2^n$ distinct elements would involve $|U \times U - \Delta U| = 2^n \times 2^n - 2^n = 2^{2n} - 2^n = 2^n(2^n - 1)$ distinctions, which normalizes to $\frac{2^{2n} - 2^n}{2^{2n}} = 1 - \frac{1}{2^n}$.

There is, however, no need to motivate Shannon's entropy by focusing on the search for a designated element. The task can equivalently be taken as distinguishing all elements from each other rather than distinguishing a designated element from all the other elements. The connection between the two approaches can be seen by computing the total number of distinctions made by intersecting the $n$ independent equal-blocked binary partitions in Shannon's approach.

Example of counting distinctions: Doing the computation, the first partition, which creates two sets of $2^{n-1}$ elements each, thereby creates $2^{n-1} \times 2^{n-1} = 2^{2n-2}$ distinctions as unordered pairs and $2 \times 2^{2n-2} = 2^{2n-1}$ distinctions as ordered pairs. The next binary partition splits each of those blocks into equal blocks of $2^{n-2}$ elements. Each split block creates $2^{n-2} \times 2^{n-2} = 2^{2n-4}$ new distinctions as unordered pairs, and there were two such splits, so there are $2 \times 2^{2n-4} = 2^{2n-3}$ additional unordered pairs of distinct elements created, or $2^{2n-2}$ new ordered pair distinctions. In a similar manner, the third partition creates $2^{2n-3}$ new dits, and so forth down to the $n$th partition, which adds $2^{2n-n}$ new dits. Thus in total, the intersection of the $n$ independent equal-blocked binary partitions has created

$2^{2n-1} + 2^{2n-2} + \cdots + 2^{2n-n} = 2^n\left(2^{n-1} + 2^{n-2} + \cdots + 2^0\right) = 2^n \frac{2^n - 1}{2 - 1} = 2^n(2^n - 1)$

(ordered pair) distinctions, which are all the dits on a set with $2^n$ elements.
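The arithmetic in the example can be re-run mechanically. The sketch below joins the $n$ binary-digit partitions of a $2^n$-element set, accumulating dit-sets by union, and checks that the per-step increments are $2^{2n-1}, 2^{2n-2}, \ldots, 2^n$ and that the total is $2^n(2^n - 1)$:

```python
from itertools import product

n = 4
U = list(range(2 ** n))
full = set(product(U, U))

def digit_partition(i):
    """Binary partition of U by the i-th binary digit."""
    return [{u for u in U if (u >> i) & 1 == b} for b in (0, 1)]

def dit(partition):
    return full - {(u, v) for B in partition for u in B for v in B}

dits, increments = set(), []
for i in range(n):
    before = len(dits)
    dits |= dit(digit_partition(i))       # joining partitions unions their dit-sets
    increments.append(len(dits) - before)

assert increments == [2 ** (2 * n - k) for k in range(1, n + 1)]
assert len(dits) == 2 ** n * (2 ** n - 1)          # all ordered-pair distinctions
assert len(dits) / len(U) ** 2 == 1 - 1 / 2 ** n   # h = 1 - 1/2^H for H = n
```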
This is the instance of the block entropy relationship $h(B) = 1 - \frac{1}{2^{H(B)}}$ when the block $B$ is a singleton in a $2^n$-element set, so that $H(B) = \log \frac{1}{1/2^n} = \log(2^n) = n$ and $h(B) = 1 - \frac{1}{2^{H(B)}} = 1 - \frac{1}{2^n}$. Thus the Shannon entropy, as the number of independent equal-blocked binary partitions it takes to single out a hidden designated element in a $2^n$-element set, is also the number of independent equal-blocked binary partitions it takes to distinguish all the elements of a $2^n$-element set from each other.

The connection between Shannon entropy and logical entropy boils down to two points.

1. The first point is the basic fact that for binary partitions to single out a hidden element ("sent message") in a set is the same as the partitions distinguishing any pair of distinct elements (since if a pair were left undistinguished, the hidden element could not be singled out if it were one of the elements in that undistinguished pair). This gives what might be called the distinction interpretation of Shannon entropy, as a count of the binary partitions necessary to distinguish between all the distinct messages in the set of possible messages, in contrast to the usual search interpretation, as the binary partition count necessary to find the hidden designated element such as the sent message.

2. The second point is that in addition to the Shannon count of the binary partitions necessary to make all the distinctions, we may use the logical measure that is simply the (normalized) count of the distinctions themselves.

3.4 A Coin-Weighing Example

The logic of the connection between joining independent equal-blocked partitions and efficiently creating dits is not dependent on the choice of base 2. Consider the coin-weighing problem where one has a balance scale and a set of $3^n$ coins, all of which look alike, but one is counterfeit (the hidden designated element) and is lighter than the others. The coins might be numbered using the $n$-digit numbers in mod 3 arithmetic, where the three digits are 0, 1, and 2. The $n$ independent ternary partitions are arrived at by dividing the coins into three piles according to the $i$th digit for $i = 1, \ldots, n$. To use the $n$ partitions to find the false coin, two of the piles are put on the balance scale. If one side is lighter, then the counterfeit coin is in that block. If the two sides balance, then the light coin is in the third block of coins not on the scale. Thus $n$ weighings (i.e., the join of $n$ independent equal-blocked ternary partitions) will determine the $n$ ternary digits of the false coin, and thus the ternary Shannon entropy is $\log_3 \frac{1}{1/3^n} = \log_3(3^n) = n$ trits.

As before, we can interpret the joining of independent partitions not only as the most efficient way to find the hidden element (e.g., the false coin or the sent message) but as the most efficient way to make all the distinctions between the elements of the set. The first partition (separating by the first ternary digit) creates 3 equal blocks of $3^{n-1}$ elements each, so it creates $3 \times 3^{n-1} \times 3^{n-1} = 3^{2n-1}$ unordered pairs of distinct elements, or $2 \times 3^{2n-1}$ ordered pair distinctions.
The partition according to the second ternary digit divides each of these three blocks into three equal blocks of $3^{n-2}$ elements each, so the additional unordered pairs created are $3 \times 3 \times 3^{n-2} \times 3^{n-2} = 3^{2n-2}$, or $2 \times 3^{2n-2}$ ordered-pair distinctions. Continuing in this fashion, the $n$th ternary partition adds $2 \times 3^{2n-n} = 2 \times 3^n$ dits. Hence the total number of dits created by joining the $n$ independent partitions is:

$$2 \times \left(3^{2n-1} + 3^{2n-2} + \cdots + 3^n\right) = 2 \times 3^n\left(3^{n-1} + 3^{n-2} + \cdots + 1\right) = 2 \times \left[\frac{3^n(3^n - 1)}{3 - 1}\right] = 3^n(3^n - 1)$$

which is the total number of ordered-pair distinctions between the elements of the $3^n$-element set. Thus the Shannon measure in trits is the minimum number of ternary partitions needed to create all the distinctions between the elements of a set.

The base-3 Shannon entropy is $H_3(\pi) = \sum_{B \in \pi} p_B \log_3\frac{1}{p_B}$, which for this example of the discrete partition on a $3^n$-element set $U$ is $H_3(\hat{1}) = \sum_{u \in U} \frac{1}{3^n}\log_3\frac{1}{1/3^n} = \log_3(3^n) = n$, which can also be thought of as the block value entropy for a singleton block, so that we may apply the block value relationship. The logical entropy of the discrete partition on this set is:

$$h(\hat{1}) = \frac{3^n(3^n - 1)}{3^{2n}} = 1 - \frac{1}{3^n}$$

which could also be thought of as the block value of the logical entropy for a singleton block. Thus the entropies for the discrete partition stand in the block value relationship, which for base 3 is: $h(B) = 1 - \frac{1}{3^{H_3(B)}}$.

The example helps to show how the logical notion of a distinction underlies the Shannon measure of information, and how a complete procedure for finding the hidden element (e.g., the sent message) is equivalent to being able to make all the distinctions in a set of elements. But this should not be interpreted as showing that Shannon's information theory "reduces" to the logical theory. The Shannon theory addresses an additional question: finding the unknown element. One can have all the distinctions between elements, e.g., the assignment of distinct base-3 numbers to the $3^n$ coins, without knowing which element is the designated one. Information theory becomes a theory of the transmission of information, i.e., a theory of communication, when that second question of "receiving the message" as to which element is the designated one is the focus of analysis. In the coin example, we might say that the information about the light coin was always there in the nature of the situation (i.e., taking "nature" as the sender) but was unknown to an observer (i.e., on the receiver side). The coin-weighing scheme was a way for the observer to elicit the information out of the situation. Similarly, the game of twenty questions is about finding a way to uncover the hidden answer, which was all along distinct from the other possible answers (on the sender side). It is this question of the transmission of information (and the noise that might interfere with the process) that carries Shannon's statistical theory of communications well beyond the bare-bones logical analysis of information in terms of distinctions.
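As with the binary case, the ternary count above can be confirmed by brute force. The following short Python sketch (again my illustration of the setup just described, not code from the paper) labels the $3^n$ coins by their $n$ ternary digits and joins the $n$ digit partitions.

```python
from itertools import product

def ternary_dit_count(n):
    """Ordered-pair distinctions made by joining the n independent
    equal-blocked ternary (digit) partitions on 3**n coins."""
    U = range(3 ** n)
    # Coins u, v are distinguished iff some ternary digit differs.
    return sum(1 for u, v in product(U, U)
               if any((u // 3**k) % 3 != (v // 3**k) % 3 for k in range(n)))

for n in range(1, 5):
    # all 3^n(3^n - 1) ordered-pair dits on the set of coins
    assert ternary_dit_count(n) == 3**n * (3**n - 1)
```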
3.5 Block-count Entropy

The fact that the Shannon motivation works for bases other than 2 suggests that there might be a base-free version of the Shannon measure (the logical measure is already base-free). Sometimes the reciprocal $\frac{1}{p_B}$ of the probability of an event $B$ is interpreted as the "surprise-value information" conveyed by the occurrence of $B$. But there is a better concept to use than the vague notion of "surprise-value information." For any positive probability $p_0$, we defined the reciprocal $\frac{1}{p_0}$ as the equivalent number of (equiprobable) elements (always "as it were," since it need not be an integer), since that is the number of equiprobable elements in a set such that the probability of choosing any particular element is $p_0$. The "big surprise" when a small-probability event occurs means it is "as if" a particular element were picked from a big set of elements. For instance, for a block probability $p_B = \frac{|B|}{|U|}$, its numbers-equivalent is the number of blocks $\frac{|U|}{|B|} = \frac{1}{p_B}$ in the hypothetical equal-blocked partition $\pi_B$ with each block equiprobable with $B$. Our task is to develop this number-of-blocks or block-count measure of information for partitions.

The block-count block entropy $H_m(B)$ is just the number of blocks in the hypothetical number-of-equivalent-blocks partition $\pi_B$ where $B$ is one of $\frac{|U|}{|B|} = \frac{1}{p_B}$ associated similar blocks, so that $H_m(B) = \frac{1}{p_B}$. If events $B$ and $C$ were independent, then $p_{B \cap C} = p_B p_C$, so the equivalent number of elements associated with the occurrence of both events is the product $\frac{1}{p_{B \cap C}} = \frac{1}{p_B}\frac{1}{p_C}$ of the numbers of elements associated with the separate events. This suggests that the average of the block entropies $H_m(B) = \frac{1}{p_B}$ should be the multiplicative average (or geometric mean) rather than the arithmetical average. Hence we define the number-of-equivalent-blocks entropy or, in short, block-count entropy of a partition $\pi$ (which does not involve any choice of a base for logs) as the geometric mean of the block entropies:

Block-count entropy: $H_m(\pi) = \prod_{B \in \pi} H_m(B)^{p_B} = \prod_{B \in \pi} \left(\frac{1}{p_B}\right)^{p_B}$ blocks.

Finding the designated block in $\pi$ is the same on average as finding the designated block in a partition with $H_m(\pi)$ equal blocks. But since $H_m(\pi)$ need not be an integer, one might take the reciprocal to obtain the probability interpretation: finding the designated block in $\pi$ is the same on average as the occurrence of an event with probability $1/H_m(\pi)$.

Given a finite-valued random variable $X$ with the values $\{x_1, \ldots, x_n\}$ and the probabilities $\{p_1, \ldots, p_n\}$, the additive expectation is $E[X] = \sum_{i=1}^n p_i x_i$ and the multiplicative expectation is $E_m[X] = \prod_{i=1}^n x_i^{p_i}$. Treating the block probability as a random variable defined on the blocks of a partition, all three entropies can be expressed as expectations:

$$H_m(\pi) = E_m\left[\frac{1}{p_B}\right], \quad H(\pi) = E\left[\log\frac{1}{p_B}\right], \quad h(\pi) = E[1 - p_B] = 1 - E[p_B].$$

The usual (additive) Shannon entropy is then obtained as the $\log_2$ version of this "log-free" block-count entropy:

$$\log_2(H_m(\pi)) = \log\prod_{B \in \pi}\left(\frac{1}{p_B}\right)^{p_B} = \sum_{B \in \pi}\log\left(\frac{1}{p_B}\right)^{p_B} = \sum_{B \in \pi} p_B\log\frac{1}{p_B} = H(\pi).$$

Or, viewed the other way around, $H_m(\pi) = 2^{H(\pi)}$. (Thus we expect the number-of-blocks entropy to be multiplicative where the usual Shannon entropy is additive, e.g., for stochastically independent partitions, and hence the subscript on $H_m(\pi)$.) The base-3 entropy encountered in the coin-weighing example is obtained by taking logs to that base, $H_3(\pi) = \log_3(H_m(\pi))$, and similarly for the Shannon entropy with natural logs, $H_e(\pi) = \log_e(H_m(\pi))$, or with common logs, $H_{10}(\pi) = \log_{10}(H_m(\pi))$.
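To make the three expectations concrete, here is a small Python sketch (my illustration, assuming an arbitrary list of block probabilities) computing the block-count, Shannon, and logical entropies and checking the relation $H_m(\pi) = 2^{H(\pi)}$.

```python
import math

def H_m(p):  # block-count entropy: multiplicative expectation of 1/p_B
    return math.prod((1 / pb) ** pb for pb in p)

def H(p):    # Shannon entropy in bits: additive expectation of log2(1/p_B)
    return sum(pb * math.log2(1 / pb) for pb in p)

def h(p):    # logical entropy: expectation of 1 - p_B
    return 1 - sum(pb ** 2 for pb in p)

p = [1/2, 1/4, 1/8, 1/8]                # block probabilities of a partition
assert math.isclose(H_m(p), 2 ** H(p))  # H_m(pi) = 2^H(pi)
print(H_m(p), H(p), h(p))               # about 3.3636 blocks, 1.75 bits, 0.65625
```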
Note that this relation $H_m(\pi) = 2^{H(\pi)}$ is a result, not a definition. The block-count entropy was defined from scratch in a manner similar to the usual Shannon entropy (which thus might be called the "$\log_2$-of-block-count entropy" or "binary-partition-count entropy"). In a partition of individual organisms by species, the interpretation of $2^{H(\pi)}$ (or $e^{H_e(\pi)}$ when natural logs are used) is the "number of equally common species" [24, p. 514]. MacArthur argued that this block-count entropy (where a block is a species) will "accord much more closely with our intuition..." (than the usual Shannon entropy).

The block-count entropy is the information measure that takes the count of a set (of like elements) as the measure of the information in the set. That is, for the discrete partition on $U$, each $p_B$ is $\frac{1}{|U|}$, so the block-count entropy of the discrete partition is

$$H_m(\hat{1}) = \prod_{u \in U} |U|^{1/|U|} = |U|$$

which could also be obtained as $2^{H(\hat{1})}$ since $H(\hat{1}) = \log(|U|)$ is the $\log_2$-of-block-count Shannon entropy of $\hat{1}$. Hence the natural choice of unit for the block-count entropy is "blocks" (as in $H_m(\hat{1}) = |U|$ blocks for the discrete partition on $U$). The block-count entropy of the discrete partition on an equiprobable $3^n$-element set is $3^n$ blocks. Hence the Shannon entropy with base 3 would be the $\log_3$-of-block-count entropy, $\log_3(H_m(\hat{1})) = \log_3(3^n) = n$ trits, as in the coin-weighing example above.

The block value relationship between the block-count entropy and the logical entropy in general is:

$$h(B) = 1 - p_B = 1 - \frac{1}{1/p_B} = 1 - \frac{1}{H_m(B)} = 1 - \frac{1}{2^{H(B)}} = 1 - \frac{1}{3^{H_3(B)}} = 1 - \frac{1}{e^{H_e(B)}} = 1 - \frac{1}{10^{H_{10}(B)}}$$

where $H_m(B) = 1/p_B = 2^{H(B)} = 3^{H_3(B)} = e^{H_e(B)} = 10^{H_{10}(B)}$.

4 Analogous Concepts for Shannon and Logical Entropies

4.1 Independent Partitions

It is sometimes asserted that "information" should be additive for independent partitions (recall that "independent" means stochastic independence, so that partitions $\pi$ and $\sigma$ are independent if for all $B \in \pi$ and $C \in \sigma$, $p_{B \cap C} = p_B p_C$), but the underlying mathematical fact is that the block-count is multiplicative for independent partitions, and Shannon chose to use the logarithm of the block-count as his measure of information. If two partitions $\pi = \{B\}_{B \in \pi}$ and $\sigma = \{C\}_{C \in \sigma}$ are independent, then the block counts (i.e., the block entropies for the block-count entropy) multiply, i.e., $H_m(B \cap C) = \frac{1}{p_{B \cap C}} = \frac{1}{p_B}\frac{1}{p_C} = H_m(B)H_m(C)$. Hence for the multiplicative expectations we have:

$$H_m(\pi \vee \sigma) = \prod_{B,C} H_m(B \cap C)^{p_{B \cap C}} = \prod_{B,C} \left[H_m(B)H_m(C)\right]^{p_B p_C} = \prod_{B \in \pi} H_m(B)^{p_B} \prod_{C \in \sigma} H_m(C)^{p_C} = H_m(\pi)H_m(\sigma)$$

or, taking logs to any desired base such as 2:

$$H(\pi \vee \sigma) = \log_2(H_m(\pi \vee \sigma)) = \log_2(H_m(\pi)H_m(\sigma)) = \log_2(H_m(\pi)) + \log_2(H_m(\sigma)) = H(\pi) + H(\sigma).$$

Thus for independent partitions, the block-count entropies multiply and the log-of-block-count entropies add.
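These two facts can be spot-checked on a product universe, where the two coordinate partitions are stochastically independent by construction. The sketch below is my illustration of the claim, not code from the paper.

```python
import math
from itertools import product

# Equiprobable universe U = A x B; pi groups by the first coordinate,
# sigma by the second, so pi and sigma are stochastically independent.
U = list(product(range(3), range(4)))

def block_probs(label):
    """Block probabilities of the partition induced by a labeling of U."""
    values = {label(u) for u in U}
    return [sum(1 for u in U if label(u) == v) / len(U) for v in values]

def H_m(p):  # block-count entropy (geometric mean of the 1/p_B)
    return math.prod((1 / pb) ** pb for pb in p)

pi    = block_probs(lambda u: u[0])
sigma = block_probs(lambda u: u[1])
join  = block_probs(lambda u: u)    # blocks of pi v sigma are the pairs

assert math.isclose(H_m(join), H_m(pi) * H_m(sigma))   # multiplicative
assert math.isclose(math.log2(H_m(join)),
                    math.log2(H_m(pi)) + math.log2(H_m(sigma)))  # additive
```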
What happens to the logical entropies? We have seen that when the information in a partition is represented by its dit set $\operatorname{dit}(\pi)$, the overlap in the dit sets of any two non-blob partitions is always non-empty. The dit set of the join of two partitions is just the union, $\operatorname{dit}(\pi \vee \sigma) = \operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)$, so that union is never a disjoint union (when the dit sets are non-empty). We have used the motivation of thinking of a partition-as-dit-set $\operatorname{dit}(\pi)$ as an "event" in a sample space $U \times U$, with the probability of that event being the logical entropy of the partition. The following proposition shows that this motivation extends to the notion of independence.

Proposition 3. If $\pi$ and $\sigma$ are (stochastically) independent partitions, then their dit sets $\operatorname{dit}(\pi)$ and $\operatorname{dit}(\sigma)$ are independent as events in the sample space $U \times U$ (with equiprobable points).

For independent partitions $\pi$ and $\sigma$, we need to show that the probability $m(\pi, \sigma)$ of the event $\operatorname{Mut}(\pi, \sigma) = \operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)$ is equal to the product of the probabilities $h(\pi)$ and $h(\sigma)$ of the events $\operatorname{dit}(\pi)$ and $\operatorname{dit}(\sigma)$ in the sample space $U \times U$. By the assumption of independence, we have $\frac{|B \cap C|}{|U|} = p_{B \cap C} = p_B p_C = \frac{|B||C|}{|U|^2}$, so that $|B \cap C| = |B||C|/|U|$. By the previous structure theorem for the mutual information set, $\operatorname{Mut}(\pi, \sigma) = \bigcup_{B \in \pi, C \in \sigma} (B - (B \cap C)) \times (C - (B \cap C))$, where the union is disjoint, so that:

$$|\operatorname{Mut}(\pi, \sigma)| = \sum_{B \in \pi, C \in \sigma} (|B| - |B \cap C|)(|C| - |B \cap C|) = \sum_{B \in \pi, C \in \sigma} \left(|B| - \frac{|B||C|}{|U|}\right)\left(|C| - \frac{|B||C|}{|U|}\right)$$
$$= \frac{1}{|U|^2} \sum_{B \in \pi, C \in \sigma} |B|\,(|U| - |C|)\,|C|\,(|U| - |B|) = \frac{1}{|U|^2} \sum_{B \in \pi} |B||U - B| \sum_{C \in \sigma} |C||U - C| = \frac{1}{|U|^2}\,|\operatorname{dit}(\pi)|\,|\operatorname{dit}(\sigma)|.$$

Hence under independence, the normalized dit count

$$m(\pi, \sigma) = \frac{|\operatorname{Mut}(\pi, \sigma)|}{|U|^2} = \frac{|\operatorname{dit}(\pi)|}{|U|^2}\,\frac{|\operatorname{dit}(\sigma)|}{|U|^2} = h(\pi)h(\sigma)$$

of the mutual information set $\operatorname{Mut}(\pi, \sigma) = \operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)$ is equal to the product of the normalized dit counts of the partitions: $m(\pi, \sigma) = h(\pi)h(\sigma)$ if $\pi$ and $\sigma$ are independent.
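Proposition 3 is also easy to check by brute force with explicit dit sets. In the following sketch (my construction, not the paper's), the two coordinate partitions on a product universe are independent, and their dit sets come out independent as events in $U \times U$.

```python
from itertools import product

# Equiprobable universe U = {0,1,2} x {0,1,2,3}; pi splits by the first
# coordinate, sigma by the second, so they are independent partitions.
U = list(product(range(3), range(4)))
N = len(U)

def dit(label):
    """Dit set: ordered pairs of elements placed in different blocks."""
    return {(u, v) for u in U for v in U if label(u) != label(v)}

dit_pi, dit_sigma = dit(lambda u: u[0]), dit(lambda u: u[1])
h_pi, h_sigma = len(dit_pi) / N**2, len(dit_sigma) / N**2
m = len(dit_pi & dit_sigma) / N**2   # normalized count of Mut(pi, sigma)

assert abs(m - h_pi * h_sigma) < 1e-12   # dit sets independent as events
```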
4.2 Mutual Information

For each of the major concepts in the information theory based on the usual Shannon measure, there should be a corresponding concept based on the normalized dit counts of logical entropy (see Cover and Thomas's book [6] for more background on the standard concepts; the corresponding notions for the block-count entropy are obtained from the usual Shannon entropy notions by taking antilogs). In the following sections, we give some of these corresponding concepts and results.

The logical mutual information of two partitions, $m(\pi, \sigma)$, is the normalized dit count of the intersection of their dit sets:

$$m(\pi, \sigma) = \frac{|\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)|}{|U \times U|}.$$

For Shannon's notion of mutual information, we might apply the Venn diagram heuristics using a block $B \in \pi$ and a block $C \in \sigma$. We saw before that the information contained in a block $B$ was $H(B) = \log\frac{1}{p_B}$, and similarly for $C$, while $H(B \cap C) = \log\frac{1}{p_{B \cap C}}$ would correspond to the union of the information in $B$ and in $C$. Hence the overlap or "mutual information" in $B$ and $C$ could be motivated as the sum of the two informations minus the union:

$$I(B; C) = \log\frac{1}{p_B} + \log\frac{1}{p_C} - \log\frac{1}{p_{B \cap C}} = \log\frac{1}{p_B p_C} + \log(p_{B \cap C}) = \log\frac{p_{B \cap C}}{p_B p_C}.$$

Then the (Shannon) mutual information in the two partitions is obtained by averaging over the mutual information for each pair of blocks from the two partitions:

$$I(\pi; \sigma) = \sum_{B,C} p_{B \cap C}\log\frac{p_{B \cap C}}{p_B p_C}.$$

The mutual information can be expanded to verify the Venn diagram heuristics:

$$I(\pi; \sigma) = \sum_{B \in \pi, C \in \sigma} p_{B \cap C}\log\frac{p_{B \cap C}}{p_B p_C} = \sum_{B,C} p_{B \cap C}\log(p_{B \cap C}) + \sum_{B,C} p_{B \cap C}\log\frac{1}{p_B} + \sum_{B,C} p_{B \cap C}\log\frac{1}{p_C}$$
$$= -H(\pi \vee \sigma) + \sum_{B \in \pi} p_B\log\frac{1}{p_B} + \sum_{C \in \sigma} p_C\log\frac{1}{p_C} = H(\pi) + H(\sigma) - H(\pi \vee \sigma).$$

We will later see an important inequality, $I(\pi; \sigma) \geq 0$ (with equality under independence), and its logical version. In the logical theory, the corresponding "modular law" follows from the inclusion-exclusion principle applied to dit sets: $|\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)| = |\operatorname{dit}(\pi)| + |\operatorname{dit}(\sigma)| - |\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)|$. Normalizing yields:

$$m(\pi, \sigma) = \frac{|\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)|}{|U|^2} = \frac{|\operatorname{dit}(\pi)|}{|U|^2} + \frac{|\operatorname{dit}(\sigma)|}{|U|^2} - \frac{|\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)|}{|U|^2} = h(\pi) + h(\sigma) - h(\pi \vee \sigma).$$

Since the formulas concerning the logical and Shannon entropies often have similar relationships, e.g., $I(\pi; \sigma) = H(\pi) + H(\sigma) - H(\pi \vee \sigma)$ and $m(\pi, \sigma) = h(\pi) + h(\sigma) - h(\pi \vee \sigma)$, it is useful to also emphasize some crucial differences. One of the most important special cases is for two partitions that are (stochastically) independent. For independent partitions, it is immediate that $I(\pi; \sigma) = \sum_{B,C} p_{B \cap C}\log\frac{p_{B \cap C}}{p_B p_C} = 0$, but we have already seen that for the logical mutual information, $m(\pi, \sigma) > 0$ so long as neither partition is the blob $\hat{0}$. However, for independent partitions we have $m(\pi, \sigma) = h(\pi)h(\sigma)$, so the logical mutual information behaves like the probability of both events occurring in the case of independence (as it must, since logical entropy concepts have direct probabilistic interpretations). For independent partitions, the relation $m(\pi, \sigma) = h(\pi)h(\sigma)$ means that the probability that a random pair is distinguished by both partitions is the same as the probability that it is distinguished by one partition times the probability that it is distinguished by the other partition. In simpler terms, for independent $\pi$ and $\sigma$, the probability that both $\pi$ and $\sigma$ distinguish is the probability that $\pi$ distinguishes times the probability that $\sigma$ distinguishes.

It is sometimes convenient to think in the complementary terms of an equivalence relation "identifying" rather than a partition distinguishing. Since $h(\pi)$ can be interpreted as the probability that a random pair of elements from $U$ is distinguished by $\pi$, i.e., as a distinction probability, its complement $1 - h(\pi)$ can be interpreted as an identification probability, i.e., the probability that a random pair is identified by $\pi$ (thinking of $\pi$ as an equivalence relation on $U$). In general,

$$[1 - h(\pi)][1 - h(\sigma)] = 1 - h(\pi) - h(\sigma) + h(\pi)h(\sigma) = [1 - h(\pi \vee \sigma)] + [h(\pi)h(\sigma) - m(\pi, \sigma)]$$

which could also be rewritten as:

$$[1 - h(\pi \vee \sigma)] - [1 - h(\pi)][1 - h(\sigma)] = m(\pi, \sigma) - h(\pi)h(\sigma).$$

Hence, if $\pi$ and $\sigma$ are independent, $[1 - h(\pi)][1 - h(\sigma)] = [1 - h(\pi \vee \sigma)]$. Thus if $\pi$ and $\sigma$ are independent, then the probability that the join partition $\pi \vee \sigma$ identifies is the probability that $\pi$ identifies times the probability that $\sigma$ identifies.
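Both Venn-style relations can be checked numerically. The sketch below (my example on a hypothetical six-element universe, not from the paper) uses two overlapping, non-independent partitions, so that $I(\pi; \sigma) > 0$ and $m(\pi, \sigma) > h(\pi)h(\sigma)$.

```python
import math

# Equiprobable universe U = {0,...,5} with two non-independent partitions:
# pi is the parity partition, sigma splits U into halves {0,1,2}, {3,4,5}.
U = range(6)
pi    = lambda u: u % 2
sigma = lambda u: u // 3

def probs(label):
    values = {label(u) for u in U}
    return [sum(1 for u in U if label(u) == v) / len(U) for v in values]

def H(p): return sum(pb * math.log2(1 / pb) for pb in p)  # Shannon, bits
def h(p): return 1 - sum(pb ** 2 for pb in p)             # logical

p_pi, p_sig = probs(pi), probs(sigma)
p_join = probs(lambda u: (pi(u), sigma(u)))

I = H(p_pi) + H(p_sig) - H(p_join)   # Shannon mutual information
m = h(p_pi) + h(p_sig) - h(p_join)   # logical mutual information
assert I > 0 and m > h(p_pi) * h(p_sig)   # partitions are not independent
```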
In summary, if $\pi$ and $\sigma$ are independent, then:

Binary-partition-count (Shannon) entropy: $H(\pi \vee \sigma) = H(\pi) + H(\sigma)$;
Block-count entropy: $H_m(\pi \vee \sigma) = H_m(\pi)H_m(\sigma)$;
Normalized-dit-count (logical) entropy: $h(\pi \vee \sigma) = 1 - [1 - h(\pi)][1 - h(\sigma)]$.

4.3 Cross Entropy and Divergence

Given a set partition $\pi = \{B\}_{B \in \pi}$ on a set $U$, the "natural" or Laplacian probability distribution on the blocks of the partition was $p_B = \frac{|B|}{|U|}$. The set partition $\pi$ also determines the set of distinctions $\operatorname{dit}(\pi) \subseteq U \times U$, and the logical entropy of the partition was the Laplacian probability of the dit set as an event, i.e., $h(\pi) = \frac{|\operatorname{dit}(\pi)|}{|U \times U|} = \sum_B p_B(1 - p_B)$. But we may also "kick away the ladder" and generalize all the definitions to any finite probability distribution $p = \{p_1, \ldots, p_n\}$. A probability distribution $p$ might be given by a finite-valued random variable $X$ on a sample space $U$, where $p_i = \operatorname{Prob}(X = x_i)$ for the finite set of distinct values $x_i$, $i = 1, \ldots, n$. Thus the logical entropy of the random variable $X$ is: $h(X) = \sum_{i=1}^n p_i(1 - p_i) = 1 - \sum_i p_i^2$. The entropy is only a function of the probability distribution of the random variable, not of its values, so we could also take it simply as a function of the probability distribution $p$: $h(p) = 1 - \sum_i p_i^2$. Taking the sample space as $\{1, \ldots, n\}$, the logical entropy is still interpreted as the probability that two independent draws will draw distinct points from $\{1, \ldots, n\}$. The further generalizations replacing probabilities by probability density functions and sums by integrals are straightforward but beyond the scope of this paper (which is focused on conceptual foundations rather than mathematical developments).

Given two probability distributions $p = \{p_1, \ldots, p_n\}$ and $q = \{q_1, \ldots, q_n\}$ on the same sample space $\{1, \ldots, n\}$, we can again consider the drawing of a pair of points, but where the first draw is according to $p$ and the second draw is according to $q$. The probability that the pair of points is distinct would be a natural and more general notion of logical entropy, which we will call the

logical cross entropy: $h(p \| q) = \sum_i p_i(1 - q_i) = 1 - \sum_i p_i q_i = \sum_i q_i(1 - p_i) = h(q \| p)$

which is symmetric. The logical cross entropy is the same as the logical entropy when the distributions are the same, i.e., if $p = q$, then $h(p \| q) = h(p)$.

The notion of cross entropy in conventional information theory is $H(p \| q) = \sum_i p_i\log\frac{1}{q_i}$, which is not symmetrical due to the asymmetric role of the logarithm, although if $p = q$, then $H(p \| q) = H(p)$. The Kullback-Leibler divergence $D(p \| q) = \sum_i p_i\log\frac{p_i}{q_i}$ is then defined as a measure of the distance or divergence between the two distributions, where $D(p \| q) = H(p \| q) - H(p)$. The information inequality is: $D(p \| q) \geq 0$ with equality if and only if $p_i = q_i$ for $i = 1, \ldots, n$ [6, p. 26]. Given two partitions $\pi$ and $\sigma$, the inequality $I(\pi; \sigma) \geq 0$ is obtained by applying the information inequality to the two distributions $\{p_{B \cap C}\}$ and $\{p_B p_C\}$ on the sample space $\{(B, C) : B \in \pi, C \in \sigma\} = \pi \times \sigma$:

$$I(\pi; \sigma) = \sum_{B,C} p_{B \cap C}\log\frac{p_{B \cap C}}{p_B p_C} = D(\{p_{B \cap C}\} \| \{p_B p_C\}) \geq 0$$

with equality under independence.
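As a quick numerical companion to these definitions (a sketch with my own choice of distributions, not the paper's), the following computes both cross entropies and the Kullback-Leibler divergence, and spot-checks the information inequality and the symmetry of the logical cross entropy.

```python
import math

def H_cross(p, q):  # Shannon cross entropy in bits
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))

def h_cross(p, q):  # logical cross entropy: prob. that the two draws differ
    return 1 - sum(pi * qi for pi, qi in zip(p, q))

def D(p, q):        # Kullback-Leibler divergence
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p, q = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
assert math.isclose(D(p, q), H_cross(p, q) - H_cross(p, p))  # D = H(p||q) - H(p)
assert D(p, q) > 0                                           # information inequality
assert math.isclose(h_cross(p, q), h_cross(q, p))            # symmetry
```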
But starting afresh, one might ask: "What is the natural measure of the difference or distance between two probability distributions $p = \{p_1, \ldots, p_n\}$ and $q = \{q_1, \ldots, q_n\}$ that would always be non-negative and would be zero if and only if they are equal?" The (Euclidean) distance between the two points in $\mathbb{R}^n$ would seem to be the "logical" answer, so we take that distance (squared) as the definition of the

logical divergence (or logical relative entropy): $d(p \| q) = \sum_i (p_i - q_i)^2$

which is symmetric and non-negative. We have component-wise:

$$0 \leq (p_i - q_i)^2 = p_i^2 - 2p_iq_i + q_i^2 = 2\left[\frac{1}{n} - p_iq_i\right] - \left[\frac{1}{n} - p_i^2\right] - \left[\frac{1}{n} - q_i^2\right]$$

so that taking the sum for $i = 1, \ldots, n$ gives:

$$0 \leq d(p \| q) = \sum_i (p_i - q_i)^2 = 2\left[1 - \sum_i p_iq_i\right] - \left[1 - \sum_i p_i^2\right] - \left[1 - \sum_i q_i^2\right] = 2h(p \| q) - h(p) - h(q).$$

Thus we have the

logical information inequality: $0 \leq d(p \| q) = 2h(p \| q) - h(p) - h(q)$ with equality if and only if $p_i = q_i$ for $i = 1, \ldots, n$.

If we take $h(p \| q) - \frac{1}{2}[h(p) + h(q)]$ as the Jensen difference [26, p. 25] between the two distributions, then the logical divergence is twice the Jensen difference. The half-and-half probability distribution $\frac{p+q}{2}$ that mixes $p$ and $q$ has the logical entropy $h\left(\frac{p+q}{2}\right) = \frac{h(p \| q)}{2} + \frac{h(p) + h(q)}{4}$, so that:

$$d(p \| q) = 4\left[h\left(\frac{p+q}{2}\right) - \frac{1}{2}\{h(p) + h(q)\}\right] \geq 0.$$

The logical information inequality tells us that "mixing increases logical entropy" (or, to be precise, mixing does not decrease logical entropy), which also follows from the fact that logical entropy $h(p) = 1 - \sum_i p_i^2$ is a concave function.

An important special case of the logical information inequality is when $p = \{p_1, \ldots, p_n\}$ is the uniform distribution with all $p_i = \frac{1}{n}$. Then $h(p) = 1 - \frac{1}{n}$, where the probability that a random pair is distinguished (i.e., that the random variable $X$ with $\operatorname{Prob}(X = x_i) = p_i$ has different values in two independent samples) takes the specific form of the probability $1 - \frac{1}{n}$ that the second draw gets a different value than the first. It may at first seem counterintuitive that in this case the cross entropy is

$$h(p \| q) = h(p) + \sum_i p_i(p_i - q_i) = h(p) + \sum_i \frac{1}{n}\left[\frac{1}{n} - q_i\right] = h(p) = 1 - \frac{1}{n}$$

for any $q = \{q_1, \ldots, q_n\}$. But $h(p \| q)$ is the probability that the two points, say $i$ and $i'$, in the sample space $\{1, \ldots, n\}$ are distinct when one draw was according to $p$ and the other according to $q$. Taking the first draw according to $q$, the probability that the second draw is distinct from whatever point was determined in the first draw is indeed $1 - \frac{1}{n}$ (regardless of the probability $q_i$ of the point drawn on the first draw). Then the divergence

$$d(p \| q) = 2h(p \| q) - h(p) - h(q) = 1 - \frac{1}{n} - h(q)$$

is a non-negative measure of how much the probability distribution $q$ diverges from the uniform distribution. It is simply the difference between the probability that a random pair will be distinguished by the uniform distribution and the probability that it will be distinguished by $q$. Also, since $0 \leq d(p \| q)$, this shows that among all probability distributions on $\{1, \ldots, n\}$, the uniform distribution has the maximum logical entropy. In terms of partitions, the $n$-block partition with all $p_B = \frac{1}{n}$ has maximum logical entropy among all $n$-block partitions.
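The identity $d(p \| q) = 2h(p \| q) - h(p) - h(q)$, the mixing inequality, and the maximality of the uniform distribution can all be spot-checked on random distributions; the following Python sketch (mine, not the paper's) does so.

```python
import math, random

def h(p):            # logical entropy
    return 1 - sum(x * x for x in p)

def h_cross(p, q):   # logical cross entropy
    return 1 - sum(x * y for x, y in zip(p, q))

def d(p, q):         # logical divergence: squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(p, q))

random.seed(0)
for _ in range(1000):
    n = random.randint(2, 6)
    w = [random.random() for _ in range(n)]
    q = [x / sum(w) for x in w]           # a random distribution on n points
    u = [1 / n] * n                       # the uniform distribution
    mix = [(a + b) / 2 for a, b in zip(u, q)]
    assert abs(d(u, q) - (2 * h_cross(u, q) - h(u) - h(q))) < 1e-12
    assert h(mix) >= (h(u) + h(q)) / 2 - 1e-12   # mixing does not decrease h
    assert h(q) <= h(u) + 1e-12                  # uniform maximizes h
```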
In the case of $|U|$ divisible by $n$, the equal $n$-block partitions make more distinctions than any of the unequal $n$-block partitions on $U$. For any partition $\pi$ with the $n$ block probabilities $\{p_B\}_{B \in \pi} = \{p_1, \ldots, p_n\}$:

$$h(\pi) \leq 1 - \frac{1}{n} \text{ with equality if and only if } p_1 = \cdots = p_n = \frac{1}{n}.$$

For the corresponding results in Shannon's information theory, we can apply the information inequality $D(p \| q) = H(p \| q) - H(p) \geq 0$ with $q$ as the uniform distribution $q_1 = \cdots = q_n = \frac{1}{n}$. Then $H(p \| q) = \sum_i p_i\log\frac{1}{1/n} = \log(n)$, so that $H(p) \leq \log(n)$, or in terms of partitions:

$$H(\pi) \leq \log_2(|\pi|) \text{ with equality if and only if the probabilities are equal}$$

or, in base-free terms, $H_m(\pi) \leq |\pi|$ with equality if and only if the probabilities are equal. The three entropies take their maximum values (for a fixed number of blocks $|\pi|$) at the partitions with equiprobable blocks.

In information theory texts, it is customary to graph the case of $n = 2$, where the entropy is graphed as a function of $p_1 = p$ with $p_2 = 1 - p$. The Shannon entropy function $H(p) = -p\log(p) - (1-p)\log(1-p)$ looks somewhat like an inverted parabola with its maximum value of $\log(n) = \log(2) = 1$ at $p = 0.5$. The logical entropy function $h(p) = 1 - p^2 - (1-p)^2 = 2p - 2p^2 = 2p(1-p)$ is an inverted parabola with its maximum value of $1 - \frac{1}{n} = 1 - \frac{1}{2} = 0.5$ at $p = 0.5$. The block-count entropy $H_m(p) = \left(\frac{1}{p}\right)^p\left(\frac{1}{1-p}\right)^{1-p} = 2^{H(p)}$ is an inverted-U-shaped curve that starts and ends at $1 = 2^{H(0)} = 2^{H(1)}$ and has its maximum of $2 = 2^{H(0.5)}$ at $p = 0.5$.

4.4 Summary of Analogous Concepts and Results

| Concept | Shannon Entropy | Logical Entropy |
|---|---|---|
| Block entropy | $H(B) = \log(1/p_B)$ | $h(B) = 1 - p_B$ |
| Relationship | $H(B) = \log\frac{1}{1 - h(B)}$ | $h(B) = 1 - \frac{1}{2^{H(B)}}$ |
| Entropy | $H(\pi) = \sum p_B\log(1/p_B)$ | $h(\pi) = \sum p_B(1 - p_B)$ |
| Mutual information | $I(\pi;\sigma) = H(\pi) + H(\sigma) - H(\pi \vee \sigma)$ | $m(\pi,\sigma) = h(\pi) + h(\sigma) - h(\pi \vee \sigma)$ |
| Independence | $I(\pi;\sigma) = 0$ | $m(\pi,\sigma) = h(\pi)h(\sigma)$ |
| Independence and joins | $H(\pi \vee \sigma) = H(\pi) + H(\sigma)$ | $h(\pi \vee \sigma) = 1 - [1-h(\pi)][1-h(\sigma)]$ |
| Cross entropy | $H(p \Vert q) = \sum p_i\log(1/q_i)$ | $h(p \Vert q) = \sum p_i(1-q_i)$ |
| Divergence | $D(p \Vert q) = H(p \Vert q) - H(p)$ | $d(p \Vert q) = 2h(p \Vert q) - h(p) - h(q)$ |
| Information inequality | $D(p \Vert q) \geq 0$, equality iff $p_i = q_i$ for all $i$ | $d(p \Vert q) \geq 0$, equality iff $p_i = q_i$ for all $i$ |
| Info. ineq. special case | $I(\pi;\sigma) = D(\{p_{B\cap C}\} \Vert \{p_Bp_C\}) \geq 0$, equality under independence | $d(\{p_{B\cap C}\} \Vert \{p_Bp_C\}) \geq 0$, equality under independence |

5 Concluding Remarks

In the duality of subsets of a set with partitions on a set, we found that the elements of a subset were dual to the distinctions (dits) of a partition. Just as finite probability theory for events started by taking the size of a subset ("event") $S$ normalized by the size of the finite universe $U$ as the probability $\operatorname{Prob}(S) = \frac{|S|}{|U|}$, so it would be natural to consider the corresponding theory that would associate with a partition $\pi$ on a finite $U$ the size $|\operatorname{dit}(\pi)|$ of the set of distinctions of the partition normalized by the total number of ordered pairs $|U \times U|$.
This number $h(\pi) = \frac{|\operatorname{dit}(\pi)|}{|U \times U|}$ was called the logical entropy of $\pi$ and could be interpreted as the probability that a randomly picked (with replacement) pair of elements from $U$ is distinguished by the partition $\pi$, just as $\operatorname{Prob}(S) = \frac{|S|}{|U|}$ is the probability that a randomly picked element from $U$ is an element of the subset $S$. Hence this notion of logical entropy arises naturally out of the logic of partitions that is dual to the usual logic of subsets.

The question immediately arises of the relationship with Shannon's concept of entropy. Following Shannon's definition of entropy, there has been a veritable plethora of suggested alternative entropy concepts [20]. Logical entropy is not an alternative entropy concept intended to displace Shannon's concept, any more than is the block-count entropy concept. Instead, I have argued that the dit-count, block-count, and binary-partition-count concepts of entropy should be seen as three ways to measure that same "information" expressed in its most atomic terms as distinctions. The block-count entropy, although it can be independently defined, is trivially related to Shannon's binary-partition-count concept: just take antilogs. The relationship of the logical concept of entropy to the Shannon concept is a little more subtle but is quite simple at the level of blocks $B \in \pi$: $h(B) = 1 - p_B$, $H_m(B) = \frac{1}{p_B}$, and $H(B) = \log\frac{1}{p_B}$, so that eliminating the probability, we have:

$$h(B) = 1 - \frac{1}{H_m(B)} = 1 - \frac{1}{2^{H(B)}}.$$

Then the logical and additive entropies for the whole partition are obtained by taking the (additive) expectation of the block entropies, while the block-count entropy is the multiplicative expectation of the block entropies:

$$H_m(\pi) = \prod_{B \in \pi}\left(\frac{1}{p_B}\right)^{p_B}, \quad H(\pi) = \sum_{B \in \pi} p_B\log\frac{1}{p_B}, \quad h(\pi) = \sum_{B \in \pi} p_B(1 - p_B).$$

In conclusion, the simple root of the matter is three different ways to "measure" the distinctions that generate an $n$-element set. Consider a 4-element set. One measure of the distinctions that distinguish a set of 4 elements is its cardinality 4, and that measure leads to the block-count entropy. Another measure of that set is $\log_2(4) = 2$, which can be interpreted as the minimal number of binary partitions necessary: (a) to single out any designated element as a singleton (search interpretation) or, equivalently, (b) to distinguish all the elements from each other (distinction interpretation). That measure leads to Shannon's entropy formula. And the third measure is the (normalized) count of distinctions (counted as ordered pairs) necessary to distinguish all the elements from each other, i.e., $\frac{4 \times 4 - 4}{4 \times 4} = \frac{12}{16} = \frac{3}{4}$, which yields the logical entropy formula. These measures stand in the block value relationship: $\frac{3}{4} = 1 - \frac{1}{4} = 1 - \frac{1}{2^2}$. It is just a matter of:

1. counting the elements distinguished (block-count entropy),
2. counting the binary partitions needed to distinguish the elements (Shannon entropy), or
3. counting the (normalized) distinctions themselves (logical entropy).

References

[1] Aczél, J. and Z. Daróczy 1975. On Measures of Information and Their Characterization. New York: Academic Press.

[2] Adelman, M. A. 1969. Comment on the H Concentration Measure as a Numbers-Equivalent. Review of Economics and Statistics. 51: 99-101.
[3] Baclawski, Kenneth and Gian-Carlo Rota 1979. An Introduction to Probability and Random Processes. Unpublished typescript. 467 pages. Download available at: http://www.ellerman.org.

[4] Bhargava, T. N. and V. R. R. Uppuluri 1975. On an Axiomatic Derivation of Gini Diversity, With Applications. Metron. 33: 41-53.

[5] Birkhoff, Garrett 1948. Lattice Theory. New York: American Mathematical Society.

[6] Cover, Thomas and Joy Thomas 1991. Elements of Information Theory. New York: John Wiley.

[7] Finberg, David, Matteo Mainetti and Gian-Carlo Rota 1996. The Logic of Commuting Equivalence Relations. In Logic and Algebra. Aldo Ursini and Paolo Agliano eds., New York: Marcel Dekker: 69-96.

[8] Friedman, William F. 1922. The Index of Coincidence and Its Applications in Cryptography. Geneva IL: Riverbank Laboratories.

[9] Ganeshaiah, K. N., K. Chandrashekara and A. R. V. Kumar 1997. Avalanche Index: A new measure of biodiversity based on biological heterogeneity of communities. Current Science. 73: 128-33.

[10] Gini, Corrado 1912. Variabilità e mutabilità. Bologna: Tipografia di Paolo Cuppini.

[11] Gini, Corrado 1955. Variabilità e mutabilità. In Memorie di metodologica statistica. E. Pizetti and T. Salvemini eds., Rome: Libreria Eredi Virgilio Veschi.

[12] Good, I. J. 1979. A. M. Turing's statistical work in World War II. Biometrika. 66 (2): 393-6.

[13] Good, I. J. 1982. Comment (on Patil and Taillie: Diversity as a Concept and its Measurement). Journal of the American Statistical Association. 77 (379): 561-3.

[14] Gray, Robert M. 1990. Entropy and Information Theory. New York: Springer-Verlag.

[15] Hartley, Ralph V. L. 1928. Transmission of information. Bell System Technical Journal. 7 (3, July): 535-63.

[16] Havrda, J. H. and F. Charvat 1967. Quantification Methods of Classification Processes: Concept of Structural α-Entropy. Kybernetika (Prague). 3: 30-35.

[17] Herfindahl, Orris C. 1950. Concentration in the U.S. Steel Industry. Unpublished doctoral dissertation, Columbia University.

[18] Hirschman, Albert O. 1945. National power and the structure of foreign trade. Berkeley: University of California Press.

[19] Hirschman, Albert O. 1964. The Paternity of an Index. American Economic Review. 54 (5): 761-2.

[20] Kapur, J. N. 1994. Measures of Information and Their Applications. New Delhi: Wiley Eastern.

[21] Kolmogorov, A. N. 1956. Foundations of the Theory of Probability. New York: Chelsea.

[22] Kullback, Solomon 1976. Statistical Methods in Cryptanalysis. Walnut Creek CA: Aegean Park Press.

[23] Lawvere, F. William and Robert Rosebrugh 2003. Sets for Mathematics. Cambridge: Cambridge University Press.

[24] MacArthur, Robert H. 1965. Patterns of Species Diversity. Biol. Rev. 40: 510-33.

[25] Patil, G. P. and C. Taillie 1982. Diversity as a Concept and its Measurement. Journal of the American Statistical Association. 77 (379): 548-61.

[26] Rao, C. Radhakrishna 1982. Diversity and Dissimilarity Coefficients: A Unified Approach. Theoretical Population Biology. 21: 24-43.

[27] Rejewski, M. 1981. How Polish Mathematicians Deciphered the Enigma. Annals of the History of Computing. 3: 213-34.

[28] Rényi, Alfréd 1965. On the Theory of Random Search. Bull. Am. Math. Soc. 71: 809-28.

[29] Rényi, Alfréd 1970. Probability Theory. Laszlo Vekerdi (trans.), Amsterdam: North-Holland.
[30] Rényi, Alfréd 1976. Selected Papers of Alfréd Rényi: Volumes 1, 2, and 3. Pal Turan (editor), Budapest: Akademiai Kiado.

[31] Ricotta, Carlo and Laszlo Szeidl 2006. Towards a unifying approach to diversity measures: Bridging the gap between the Shannon entropy and Rao's quadratic index. Theoretical Population Biology. 70: 237-43.

[32] Shannon, Claude E. 1948. A Mathematical Theory of Communication. Bell System Technical Journal. 27: 379-423; 623-56.

[33] Simpson, Edward Hugh 1949. Measurement of Diversity. Nature. 163: 688.

[34] Stigler, Stephen M. 1999. Statistics on the Table. Cambridge: Harvard University Press.

[35] Tsallis, C. 1988. Possible Generalization for Boltzmann-Gibbs statistics. J. Stat. Physics. 52: 479-87.

[36] Vajda, I. 1969. A Contribution to Informational Analysis of Patterns. In Methodologies of Pattern Recognition. Satosi Watanabe ed., New York: Academic Press: 509-519.