InSilicoVA: A Method to Automate Cause of Death Assignment for Verbal Autopsy

Samuel J. Clark (1,4,5,6,7,*), Tyler McCormick (1,2), Zehang Li (2), and Jon Wakefield (2,3)

1 Department of Sociology, University of Washington
2 Department of Statistics, University of Washington
3 Department of Biostatistics, University of Washington
4 Institute of Behavioral Science (IBS), University of Colorado at Boulder
5 MRC/Wits Rural Public Health and Health Transitions Research Unit (Agincourt), School of Public Health, Faculty of Health Sciences, University of the Witwatersrand
6 ALPHA Network, London School of Hygiene and Tropical Medicine, London, UK
7 INDEPTH Network, Accra, Ghana
* Correspondence to: work@samclark.net

August 24, 2013

Abstract

Verbal autopsies (VA) are widely used to provide cause-specific mortality estimates in developing world settings where vital registration does not function well. VAs assign cause(s) to a death by using information describing the events leading up to the death, provided by caregivers. Typically physicians read VA interviews and assign causes using their expert knowledge. Physician coding is often slow, and individual physicians bring bias to the coding process that results in non-comparable cause assignments. These problems significantly limit the utility of physician-coded VAs. A solution to both is to use an algorithmic approach that formalizes the cause-assignment process. This ensures that assigned causes are comparable and requires many fewer person-hours so that cause assignment can be conducted quickly without disrupting the normal work of physicians. Peter Byass’ InterVA method (Byass et al., 2012) is the most widely used algorithmic approach to VA coding and is aligned with the WHO 2012 standard VA questionnaire (Leitao et al., 2013).
The statistical model underpinning InterVA can be improved; uncertainty needs to be quantified, and the link between the population-level cause-specific mortality fractions (CSMFs) and the individual-level cause assignments needs to be statistically rigorous. Addressing these theoretical concerns provides an opportunity to create new software using modern languages that can run on multiple platforms and will be widely shared. Building on the overall framework pioneered by InterVA, our work creates a statistical model for automated VA cause assignment.

Acknowledgments

Preparation of this manuscript was partially supported by the Bill and Melinda Gates Foundation. The authors are grateful to Peter Byass, Basia Zaba, Kathleen Kahn, Stephen Tollman, Adrian Raftery, Philip Setel and Osman Sankoh for helpful discussions.

Contents

1 Introduction
2 InterVA
2.1 Byass’ Description of InterVA
2.2 Our Notation for InterVA:
2.3 Our Description of InterVA Data Requirements:
2.4 Our Presentation of the InterVA Model and Algorithm
2.5 Evaluation of InterVA Model & Algorithm
3 InSilicoVA
3.1 InSilicoVA Notation:
3.2 InSilicoVA Data:
3.3 InSilicoVA Algorithm
4 Testing & Comparing InSilicoVA and InterVA
4.1 Simulation Study
4.2 Application to Real Data - Agincourt HDSS
4.3 Results
5 Discussion
5.1 Summary of Findings
5.2 Future Work

1 Introduction

Verbal autopsy (VA) is a common approach for determining cause of death in regions where deaths are not recorded routinely. A VA is a standardized questionnaire administered to caregivers, family members, or others knowledgeable about the circumstances of a recent death, with the goal of using these data to infer the likely causes of death (Byass et al., 2012). The VA survey instrument asks questions related to the deceased individual’s medical condition (did the person have diarrhea, for example) and related to other factors surrounding the death (did the person die in an automotive accident, for example). VA has been widely used by researchers in Health and Demographic Surveillance Sites (HDSS), such as the INDEPTH Network (Sankoh and Byass, 2012), and has recently received renewed attention from the World Health Organization through the release of an update to the widely used standard VA questionnaire (see http://www.who.int/healthinfo/statistics/verbalautopsystandards/). The main statistical challenge with VA data is to ascertain patterns in responses that correspond to a pre-defined set of causes of death. Typically the nature of such patterns is not known a priori and measurements are subject to various types of measurement error, which are discussed in further detail below.

There are two credible methods for automating the assignment of cause of death from VA data. InterVA (see for example: Byass et al., 2003, 2006, 2012) is a proprietary algorithm developed and maintained by Peter Byass, and it has been used extensively by both the ALPHA (Maher et al., 2010) and INDEPTH networks of HDSS sites and a wide variety of other research groups. At this time InterVA appears to be the de facto standard method.
The Institute for Health Metrics and Evaluation (IHME) has proposed a number of additional methods (for example: Flaxman et al., 2011; James et al., 2011; Murray et al., 2011), some of which build on earlier work by King and Lu (King et al., 2010; King and Lu, 2008). Among their methods, the Simplified Symptom Pattern Recognition (SSPR) method (Murray et al., 2011) is most directly comparable to InterVA and appears to have the best chance of becoming widely used. The SSPR and related methods require a so-called ‘gold standard’ database, a database consisting of a large number of deaths where the cause has been certified by medical professionals and is considered reliable, and further, where the symptoms for those deaths are also verifiable by medical professionals. Deaths recorded in a gold standard database are typically in-hospital deaths. Given regional variation in the prevalence of certain causes of death and in medical professionals’ familiarity with them, region-specific gold standard databases are also necessary. The majority of public health and epidemiology researchers and officials do not have access to such a gold standard database, motivating development of methods to infer cause of death using VA data without access to gold standard databases.

Aside from the challenges related to obtaining useful gold standard training data, using VA data to infer cause assignment is also statistically challenging because there are multiple sources of variation and error present in VA data. We identify three sources of variation and propose a novel method called InSilicoVA to address these sources of variation. First, though responses on the VA instrument may be the respondent’s recollection or best guess about what has happened, they are not necessarily accurate.
This type of error is present in all survey data but may be especially pronounced in VA data because respondents have varying levels of familiarity with the circumstances surrounding a death. Also, the definition of an event of interest may be different for each respondent. A question about diarrhea, for example, requires the respondent to be sufficiently involved in the decedent’s medical care to know this information and to have a definition of diarrhea that maps relatively well to accepted clinical standards.

A second source of variability arises from individual variation in presentation of diseases. Statistical methods must be sufficiently robust to overfitting to appreciate that two individuals with the same cause of death may have slightly different presentations, and thus, different VA reports.

Third, like InterVA, InSilicoVA will use physician elicited conditional probabilities. These probabilities, representing the consensus estimate of the likelihood that a person will experience a given symptom if they died from a certain cause, are unknown. Simulation and results with data from the Agincourt MRC/Wits Rural Public Health and Health Transitions Research Unit (Agincourt) indicate that results obtained with conditional probabilities assumed fixed and known (as is done in both InterVA and the SSPR method) can underestimate uncertainty in population cause of death distributions. We evaluate the sensitivity to these prior probabilities.

InSilicoVA incorporates uncertainty from these sources and propagates the uncertainty between individual cause assignments and population distributions. Accounting for these sources of error produces confidence intervals for both individual cause assignments and population distributions. These confidence intervals reflect more realistically the complexity of cause assignment with VA data.
A unified framework for both individual cause and population distribution estimation also means that additional information about individual causes, such as physician coded VAs, can easily be used in InSilicoVA, even if physician coded records are only available for a subset of cases. Further, we can exactly quantify the contribution of each VA questionnaire item in classifying a case into a specific cause, affording the possibility for ‘item reduction’ by identifying which inputs are most useful for discriminating between causes. This feature could lead to a more efficient, streamlined survey mechanism and set of conditional probabilities for elicitation.

In Section 2 we describe InterVA and present open challenges which we address with InSilicoVA. In Section 3 we present the InSilicoVA model. Section 4 describes applications of both methods to simulated and real data, and Section 4.3 provides the results. We conclude with a very brief discussion in Section 5.

2 InterVA

2.1 Byass’ Description of InterVA

Below is Byass’ summary of the InterVA method presented in Byass et al. (2012).

Bayes’ theorem links the probability of an event happening given a particular circumstance with the unconditional probability of the same event and the conditional probability of the circumstance given the event. If the event of interest is a particular cause of death, and the circumstance is part of the events leading to death, then Bayes’ theorem can be applied in terms of circumstances and causes of death. Specifically, if there are a predetermined set of possible causes of death $C_1 \ldots C_m$ and another set of indicators $I_1 \ldots I_n$ representing various signs, symptoms and circumstances leading to death, then Bayes’ general theorem for any particular $C_i$ and $I_j$ can be stated as:

$$
P(C_i \mid I_j) = \frac{P(I_j \mid C_i) \times P(C_i)}{P(I_j \mid C_i) \times P(C_i) + P(I_j \mid\, !C_i) \times P(!C_i)} \tag{1}
$$

where $P(!C_i)$ is $(1 - P(C_i))$. Over the whole set of causes of death $C_1 \ldots C_m$ a set of probabilities for each $C_i$ can be calculated using a normalising assumption so that the total conditional probability over all causes totals unity:

$$
P(C_i \mid I_j) = \frac{P(I_j \mid C_i) \times P(C_i)}{\sum_{i=1}^{m} P(C_i)} \tag{2}
$$

Using an initial set of unconditional probabilities for causes of death $C_1 \ldots C_m$ (which can be thought of as $P(C_i \mid I_0)$) and a matrix of conditional probabilities $P(I_j \mid C_i)$ for indicators $I_1 \ldots I_n$ and causes $C_1 \ldots C_m$, it is possible to repeatedly apply the same calculation process for each $I_1 \ldots I_n$ that applies to a particular death:

$$
P(C_i \mid I_{1 \ldots n}) = \frac{P(I_j \mid C_i) \times P(C_i \mid I_{0 \ldots n-1})}{\sum_{i=1}^{m} P(C_i \mid I_{0 \ldots n-1})} \tag{3}
$$

This process typically results in the probabilities of most causes reducing, while a few likely causes are characterised by their increasing probabilities as successive indicators are processed.

In the same article Byass describes the process of defining the conditional probabilities $P(I_j \mid C_i)$.

Apart from the mathematics, the major challenge in building a probabilistic model covering all causes of death to a reasonable level of detail lies in populating the matrix of conditional probabilities $P(I_j \mid C_i)$. There is no overall source of data available which systematically quantifies probabilities of various signs, symptoms and circumstances leading to death in terms of their associations with particular causes. Therefore, these conditional probabilities have to be estimated from a diversity of incomplete sources (including previous InterVA models) and modulated by expert opinion. In the various versions of InterVA that have been developed, expert panels have been convened to capture clinical expertise on the relationships between indicators and causes.
In this case, an expert panel convened in Geneva in December 2011 and continued to deliberate subsequently, particularly considering issues that built on previous InterVA versions. Experience has shown that gradations in levels of perceived probabilities correspond more to a logarithmic than linear scale, and in the expert consultation for InterVA-4, we used a perceived probability scale that was subsequently converted to numbers on a logarithmic scale as shown below.

Table 1: InterVA Conditional Probability Letter-Value Correspondences

Label   Value     Interpretation
I       1.0       Always
A+      0.8       Almost always
A       0.5       Common
A-      0.2
B+      0.1       Often
B       0.05
B-      0.02
C+      0.01      Unusual
C       0.005
C-      0.002
D+      0.001     Rare
D       0.0005
D-      0.0001
E       0.00001   Hardly ever
N       0.0       Never

The physician-derived conditional probabilities that are supplied with the InterVA software (Byass, 2013) are coded using the letter codes in the leftmost column of Table 1.

We rewrite, interpret and discuss the InterVA model below.

2.2 Our Notation for InterVA:

• Deaths: $y_j$, $j \in \{1, \ldots, J\}$, $\vec{Y} = [y_1, \ldots, y_J]$
• Signs/symptoms: $s_k \in \{0, 1\}$, $k \in \{1, \ldots, K\}$, $\vec{S} = [s_1, \ldots, s_K]$
• Causes: $c_n$, $n \in \{1, \ldots, N\}$, $\sum_{n=1}^{N} c_{j,n} = 1$
• Fraction of all deaths that are cause $n$, the ‘cause-specific mortality fraction’ (CSMF): $f_n$, $n \in \{1, \ldots, N\}$, $\vec{F} = [f_1, \ldots, f_N]$, $\sum_{n=1}^{N} f_n = 1$

2.3 Our Description of InterVA Data Requirements:

1. For each death $y_j$, a VA interview produces a binary-valued vector of signs/symptoms:

$$
\vec{S}_j = \{s_{j,1}, s_{j,2}, \ldots, s_{j,K}\} \tag{4}
$$

$S$ is the $J \times K$ matrix whose rows are the $\vec{S}_j$ for each death.

2. A $K \times N$ matrix of conditional probabilities reflecting physicians’ opinions about how likely a given sign/symptom is for a death resulting from a given cause:

$$
P = \begin{bmatrix}
\Pr(s_1 \mid c_1) & \Pr(s_1 \mid c_2) & \cdots & \Pr(s_1 \mid c_N) \\
\Pr(s_2 \mid c_1) & \Pr(s_2 \mid c_2) & \cdots & \Pr(s_2 \mid c_N) \\
\vdots & \vdots & \ddots & \vdots \\
\Pr(s_K \mid c_1) & \Pr(s_K \mid c_2) & \cdots & \Pr(s_K \mid c_N)
\end{bmatrix} \tag{5}
$$

As supplied with the InterVA software (Byass, 2013), $P$ does not contain internally consistent probabilities (see footnote 1). This is easy to understand by noting that these probabilities are not derived from a well-defined event space that would constrain them to be consistent with one another. As described by Byass above in Section 2.1, the physicians provide a ‘letter grade’ for each conditional probability, and these correspond to a ranking of perceived likelihood of a given sign/symptom if the death is due to a given cause. These letter grades are then turned into numbers in the range $[0, 1]$ (NB: 0.0 and 1.0 are included) using Table 1. Consequently it is not possible to assume that the members of $P$ will behave as expected when one attempts to calculate complements or to combine more than one of them in an expression in a way that should be consistent.

3. An initial guess of $\vec{F}$, $\vec{F}^0 = [f^0_1, \ldots, f^0_N]$

2.4 Our Presentation of the InterVA Model and Algorithm

For a specific death $y_j$ we can imagine and examine the two-dimensional joint distribution $(c_{j,n}, s_{j,k})$:

$$
\Pr(c_{j,n} \mid s_{j,k}) = \frac{\Pr(s_{j,k} \mid c_{j,n}) \cdot \Pr(c_{j,n})}{\Pr(s_{j,k})} = \frac{\Pr(s_{j,k} \mid c_{j,n}) \cdot \Pr(c_{j,n})}{\Pr(s_{j,k} \mid c_{j,n}) \cdot \Pr(c_{j,n}) + x} \tag{6}
$$

where

$$
x = \Pr(s_{j,k} \mid \neg c_{j,n}) \cdot \Pr(\neg c_{j,n}) \tag{7}
$$

On the right-hand side of (6), we have $\Pr(s_{j,k} \mid c_{j,n})$ from the conditional probabilities supplied by physicians and $\Pr(c_{j,n}) \approx f^0_n$.
If the conditional probabilities $P$ were well-behaved, then

$$
x = \sum_{n'=1,\, n' \neq n}^{N} \Pr(s_{j,k} \mid c_{j,n'}) \cdot \Pr(c_{j,n'}) \tag{8}
$$

Footnote 1. The $P$ supplied with InterVA has many logical inconsistencies, for example situations where conditional probabilities should add up to equal another: Pr(fast breathing for 2 weeks or longer | HIV) + Pr(fast breathing for less than 2 weeks | HIV) $\neq$ Pr(fast breathing | HIV), or where they just do not make sense: Pr(fast breathing for 2 weeks or longer | sepsis) > Pr(fast breathing | sepsis). The $P$ supplied with InterVA-4 (Byass, 2013) is a 254 × 69 matrix with 17,526 entries. We have investigated automated ways of correcting the inconsistencies, but with every attempt we discover more, so we have concluded that the entries in $P$ need to be re-elicited from physicians using an approach that ensures that they are consistent.

However, since the entries of the $P$ supplied with the InterVA software (Byass, 2013) are not consistent with one another, this calculation does not produce useful results. InterVA solves this with an arbitrary reformulation of the relationship. For each death $y_j$ and over all signs/symptoms $s_k$ associated with $y_j$:

$$
\forall (j, n): \quad \mathrm{Propensity}(c_{j,n} \mid \vec{S}_j) = f^0_n \cdot \prod_{k=1}^{K} \left[\Pr(s_{j,k} \mid c_n)\right]^{s_{j,k}} \tag{9}
$$

For each death $y_j$ these ‘Propensities’ do not add to 1.0, so they need to be normalized to produce well-behaved probabilities:

$$
\forall (j, n): \quad \Pr(c_{j,n}) = \frac{\mathrm{Propensity}(c_{j,n} \mid \vec{S}_j)}{\sum_{n'=1}^{N} \mathrm{Propensity}(c_{j,n'} \mid \vec{S}_j)} \tag{10}
$$

The population-level CSMFs $\vec{F}$ are calculated by adding up the results of calculating (10) for all causes for all deaths:

$$
\forall n: \quad f_n = \sum_{j=1}^{J} \Pr(c_{j,n}) \tag{11}
$$

2.5 Evaluation of InterVA Model & Algorithm

In effect what InterVA does is distribute a given death among a number of predefined causes.
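The calculation in (9) through (11) is short enough to sketch directly. The following is a minimal illustration written for this presentation, not the InterVA software itself; the function name `interva_sketch`, the array layout, and the toy inputs are assumptions made here, and NumPy is used only for convenience.

```python
import numpy as np


def interva_sketch(S, P, f0):
    """Illustrative version of equations (9) to (11); names are hypothetical.

    S  : (J, K) array of 0/1 signs/symptoms, one row per death
    P  : (K, N) array of conditional probabilities Pr(s_k | c_n)
    f0 : (N,) initial guess of the CSMFs
    Returns (probs, totals): probs has shape (J, N), totals has shape (N,).
    """
    S = np.asarray(S)
    P = np.asarray(P, dtype=float)
    f0 = np.asarray(f0, dtype=float)

    # Equation (9): only signs/symptoms that are present (s = 1) contribute;
    # absent ones are replaced by a factor of 1 and are therefore ignored.
    factors = np.where(S[:, :, None] == 1, P[None, :, :], 1.0)  # (J, K, N)
    propensity = f0[None, :] * factors.prod(axis=1)             # (J, N)

    # Equation (10): normalize across causes so each death sums to 1.
    probs = propensity / propensity.sum(axis=1, keepdims=True)

    # Equation (11): add the per-death probabilities for each cause.
    totals = probs.sum(axis=0)
    return probs, totals


# Toy inputs, made up for illustration: 2 deaths, 2 signs, 2 causes.
S_toy = [[1, 0], [0, 1]]
P_toy = [[0.8, 0.2], [0.1, 0.5]]
probs_toy, totals_toy = interva_sketch(S_toy, P_toy, [0.5, 0.5])
```

In this sketch the totals from (11) add up to the number of deaths J rather than to one; dividing by J would turn them into fractions, a step that (11) as written does not include. A death whose propensities are all zero would produce a division by zero in the normalization, which the sketch does not guard against.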
The cause with the largest fraction is assumed to be the primary cause, followed with decreasing significance by the remaining causes in order from largest to smallest. The conceptual construct of a ‘partial’ death is central to InterVA and is interchanged freely with the probability of dying from a given cause. This equivalence is not real and is at the heart of the theoretical problems with InterVA.

At a high level InterVA proposes a very useful solution to the fundamental challenge that all automated VA coding algorithms face: how to characterize and use the relationship that exists between signs/symptoms and causes of death. In a perfect world we would have medically certified patient records that include the results of ‘real’ autopsies, and we could use those to model this relationship and use the results of that model in our cause-assignment algorithms. But in that perfect world there is no use for VA at all. So by definition we live in the world where that type of ‘gold standard’ data do not and will not exist most of the time for most of the developing world where VAs are conducted. Byass’ solution to this is to accept the limitations on the expert knowledge that physicians can provide to substitute for gold standard data, and further, to elicit and then organize that information in a very useful format, the conditional probabilities matrix $P$ above in (5). In sum, Byass has sorted through a variety of possible general strategies and settled on one that is both doable and produces useful results.

Where we contribute is to help refine the statistical and computational methods used to conduct the cause assignments. In order to do that we have evaluated InterVA thoroughly, and we have identified a number of weaknesses that we feel can be addressed.
The brief list below is by necessity a blunt description of those weaknesses, which nevertheless do not reduce the importance of InterVA as described just above.

There are several theoretical problems with the InterVA model:

1. The derivation presented in (1) through (3) is incorrect; in particular (2) does not follow from (1), and (3) does not follow from (2). As described just above, (11) requires that probabilities be equated with fractional deaths, which is conceptually difficult.

2. InterVA’s statistical model is not ‘probabilistic’ in the recognized sense because it does not include elements that can vary unpredictably, and hence there is no randomness. Although $P$ contains ‘probabilities’ (see discussion with (5) above), these are not allowed to vary in the estimation procedure used to assign causes of death; $P$ is effectively a fixed input to the model.

3. Because the model does not contain features that are allowed to vary unpredictably, it is not possible to quantify uncertainty to produce probabilistic error bounds.

4. If we ignore the errors in the derivation and work with (9) - (11) as if they were correct, there are additional problems. Equation (9) is at the core of InterVA and demonstrates two undesirable features:

(a) For a specific individual the propensity for each cause is deterministically affected by $f^0_n$, what Byass terms the ‘prior’ probability of cause $n$, effectively a user-defined parameter of the algorithm. This means that the final individual-level cause assignments are a deterministic transformation of the so-called ‘prior’, i.e. the results are not only sensitive to but depend directly on the ‘prior’.

(b) The expression in (9) captures only one valence in the relationship between signs/symptoms and a cause of death. It acknowledges and reacts only to the presence of a sign/symptom but not to its absence, effectively throwing away half of the information in the dataset.
To include information conveyed by the absence of a sign/symptom, (9) needs a term that involves something like ‘$1 - \Pr(s_{j,k} \mid c_n)$’. [The components of InSilicoVA in (22) and (26) below include this term.] Reacting only to the presence of signs/symptoms is undesirable for two reasons: (1) signs/symptoms are not selecting causes that fit and de-selecting causes that don’t, but rather just de-selecting causes, and (2) the final probabilities are typically the product of a large number of very small numbers, and hence their numeric values can become extremely small, small enough to interact badly and unpredictably with the numerical storage/arithmetic capacity of the software and computers used to calculate them. A simple log transformation would solve this problem.

5. Finally, (11) is a deterministic transformation of the individual-level cause assignments to produce a population-level CSMF; it is simply another way of stating the individual-level cause assignments. Because the individual-level cause assignments are not probabilistic, neither are the resulting CSMFs.

In addition there are idiosyncrasies that affect the current implementation of InterVA (Byass, 2013) and some oddities having to do with the matrix of conditional probabilities provided with Byass’ InterVA. We will not describe those here.

InSilicoVA is designed to overcome these problems and provide a valid statistical framework on which further refinements can be built.

3 InSilicoVA

InSilicoVA is a statistical model and computational algorithm to automate assignment of cause of death from data obtained by VA interviews. Broadly the method aims to:

• Follow in the footsteps of InterVA, building on its strengths and addressing its weaknesses.
• Produce consistent, comparable cause assignments and CSMFs.
• Be statistically and computationally valid and extensible.
• Provide a means to quantify uncertainty.
• Be able to function to assign causes to a single death.
The name ‘InSilicoVA’ is inspired by ‘in-vitro’ studies that mimic real biology but in more controlled circumstances (often on ‘glass’ petri dishes). In this case we are assigning causes to deaths using a computer that performs the required calculations using a silicon chip. Further, we owe a great debt to the InterVA (interpret VA) method that provides useful philosophical and practical frameworks on which to build the new method, so we have stuck to the structure of InterVA’s name.

3.1 InSilicoVA Notation:

• Deaths: $y_j$, $j \in \{1, \ldots, J\}$, $\vec{Y} = [y_1, \ldots, y_J]$
• Signs/symptoms: $s_k \in \{0, 1\}$, $k \in \{1, \ldots, K\}$, $\vec{S} = [s_1, \ldots, s_K]$
• Causes of death: $c_n$, $n \in \{1, \ldots, N\}$, $\vec{C} = [c_1, \ldots, c_N]$
• For individual $j$, probability of cause $n$ given $\vec{S}_j$: $\ell_{j,n} = \Pr(y_j = c_n \mid \vec{S}_j)$, $j \in \{1, \ldots, J\}$, $n \in \{1, \ldots, N\}$, $\vec{L}_j = [\ell_{j,1}, \ldots, \ell_{j,N}]$, $\sum_{n=1}^{N} \ell_{j,n} = 1$
• Count of all deaths that are cause $n$, the ‘cause-specific death count’ (CSDC): $m_n$, $n \in \{1, \ldots, N\}$, $\vec{M} = [m_1, \ldots, m_N]$, $\sum_{n=1}^{N} m_n = J$
• Fraction of all deaths that are cause $n$, the ‘cause-specific mortality fraction’ (CSMF): $f_n$, $n \in \{1, \ldots, N\}$, $\vec{F} = [f_1, \ldots, f_N]$, $\sum_{n=1}^{N} f_n = 1$

3.2 InSilicoVA Data:

1. For each death $y_j$, the VA interview produces a binary-valued vector of signs/symptoms:

$$
\vec{S}_j = \{s_{j,1}, s_{j,2}, \ldots, s_{j,K}\} \tag{12}
$$

$S$ is the $J \times K$ matrix whose rows are the $\vec{S}_j$ for each death. The columns of $S$ are assumed to be independent given $\vec{C}$, i.e. there is no systematic relationship between the signs/symptoms for a given cause. This is very obviously not a justifiable assumption. Signs and symptoms come in characteristic sets depending on the cause of death, so there is some correlation between them, conditional on a given cause.
Nonetheless we assume independence in order to facilitate initial construction and testing of our model, and most pragmatically, so that we can utilize the matrix of conditional probabilities supplied by Byass with the InterVA software (Byass, 2013). It is impossible to either regenerate or significantly improve upon these without significant resources with which to organize meetings of physicians with the relevant experience who can provide this information. This $\vec{S}_j$ is the same as (4) used by InterVA.

2. A $K \times N$ matrix of conditional probabilities reflecting physicians’ opinions about how likely a given sign/symptom is for a death resulting from a given cause:

$$
P = \begin{bmatrix}
\Pr(s_1 \mid c_1) & \Pr(s_1 \mid c_2) & \cdots & \Pr(s_1 \mid c_N) \\
\Pr(s_2 \mid c_1) & \Pr(s_2 \mid c_2) & \cdots & \Pr(s_2 \mid c_N) \\
\vdots & \vdots & \ddots & \vdots \\
\Pr(s_K \mid c_1) & \Pr(s_K \mid c_2) & \cdots & \Pr(s_K \mid c_N)
\end{bmatrix} \tag{13}
$$

InSilicoVA assumes that the components of $P$ are consistent with one another. In the simulation study described in Section 4.1, we construct consistent values for $P$, but when we test the model on real data in Section 4.2, we have no option other than using the inconsistent $P$ supplied with the InterVA software (Byass, 2013).

3. An initial guess of $\vec{F}$, $\vec{F}^0 = [f^0_1, \ldots, f^0_N]$

3.3 InSilicoVA Algorithm

We are interested in the joint distribution $(\vec{F}, \vec{Y})$ given the set of observed signs/symptoms $S$.
The posterior distribution is:

$$
\Pr(\vec{F}, \vec{Y} \mid S) = \frac{\Pr(S \mid \vec{Y}, \vec{F}) \Pr(\vec{Y} \mid \vec{F}) \Pr(\vec{F})}{\Pr(S)} \propto \Pr(S \mid \vec{Y}, \vec{F}) \Pr(\vec{Y} \mid \vec{F}) \Pr(\vec{F}) \tag{14}
$$

$$
= \prod_{j=1}^{J} \Pr(S \mid y_j, \vec{F}) \Pr(y_j \mid \vec{F}) \Pr(\vec{F}) \tag{15}
$$

Because individual cause assignments are independent, individual sign/symptom vectors $\vec{S}_j$ are independent from $\vec{F}$ (the CSMFs), and we have:

$$
\Pr(\vec{F}, \vec{Y} \mid S) \propto \prod_{j=1}^{J} \Pr(S \mid y_j) \Pr(y_j \mid \vec{F}) \Pr(\vec{F}) \tag{16}
$$

We will use a Gibbs sampler to sample from this posterior as follows:

G.1 start with an initial guess of $\vec{F}$, $\vec{F}^0$
G.2 sample $\vec{Y} \mid \vec{F}, S$
G.3 sample $\vec{F} \mid \vec{Y}, S$
G.4 repeat steps G.2 and G.3 until $\vec{F}$ and $\vec{Y}$ converge

This algorithm is generic and allows a rich range of models. For the moment the InSilicoVA model is:

$$
s_{j,k} \mid c_n \sim \mathrm{Bernoulli}(\Pr(s_{j,k} \mid c_n)) \tag{17}
$$

$$
y_j = c_n \mid \vec{F} \sim \mathrm{Multinomial}_N(1, \vec{F}) \tag{18}
$$

$$
\vec{F} \sim \mathrm{Dirichlet}(\vec{\alpha}), \quad \vec{\alpha} \text{ is } N\text{-dimensional and constant} \tag{19}
$$

Then the posterior in (16) is:

$$
\Pr(\vec{F}, \vec{Y} \mid S) \propto \prod_{j=1}^{J} \prod_{k=1}^{K} \Pr(s_{j,k} \mid y_j = c_n) \Pr(y_j = c_n \mid \vec{F}) \Pr(\vec{F}) \tag{20}
$$

This formulation is computationally efficient because of Multinomial/Dirichlet conjugacy, and because using Bayes’ rule we have, for step G.2:

$$
y_j = c_n \mid \vec{F}, \vec{S}_j \sim \mathrm{Multinomial}_N(1, \vec{L}_j) \tag{21}
$$

where the $\ell_{j,n}$ that compose $\vec{L}_j$ are:

$$
\ell_{j,n} = \Pr(y_j = c_n \mid \vec{S}_j) = \frac{\Pr(y_j = c_n) \cdot \Pr(\vec{S}_j \mid y_j = c_n)}{\Pr(\vec{S}_j)}
$$

Substituting $f_n = \Pr(y_j = c_n)$ and using the data $\vec{S}_j$ and $P$ to calculate the probability of a specific $\vec{S}_j$ given the cause assignment $y_j = c_n$ gives:

$$
\ell_{j,n} = \frac{f_n \cdot \prod_{k=1}^{K} \left( \Pr(s_{j,k} \mid y_j = c_n)^{s_{j,k}} \cdot \left[1 - \Pr(s_{j,k} \mid y_j = c_n)\right]^{(1 - s_{j,k})} \right)}{\sum_{n'=1}^{N} f_{n'} \cdot \prod_{k=1}^{K} \left( \Pr(s_{j,k} \mid y_j = c_{n'})^{s_{j,k}} \cdot \left[1 - \Pr(s_{j,k} \mid y_j = c_{n'})\right]^{(1 - s_{j,k})} \right)} \tag{22}
$$

We can also derive from (20) the distribution of $\vec{F}$ conditional on $\vec{Y}$, for step G.3:

$$
\vec{F} \mid \vec{Y}, S \sim \mathrm{Dirichlet}(\vec{M} + \vec{\alpha}) \tag{23}
$$

where

$$
m_n = \sum_{j=1}^{J} [y_j = c_n], \quad \text{using Iverson’s bracket notation: } [z] = \begin{cases} 1 & \text{if } z \text{ true} \\ 0 & \text{if } z \text{ false} \end{cases} \tag{24}
$$

In summary, the Gibbs sampler proceeds, given a suitable initialization $\vec{F}^0$, by:

G.2 sampling a cause for each death to generate a new $\vec{Y} \mid \vec{F}, S$:

$$
y_j = c_n \mid \vec{F}, \vec{S}_j \sim \mathrm{Multinomial}_N(1, \vec{L}_j) \tag{25}
$$

where

$$
\ell_{j,n} = \frac{f_n \cdot \prod_{k=1}^{K} \left( \Pr(s_{j,k} \mid y_j = c_n)^{s_{j,k}} \cdot \left[1 - \Pr(s_{j,k} \mid y_j = c_n)\right]^{(1 - s_{j,k})} \right)}{\sum_{n'=1}^{N} f_{n'} \cdot \prod_{k=1}^{K} \left( \Pr(s_{j,k} \mid y_j = c_{n'})^{s_{j,k}} \cdot \left[1 - \Pr(s_{j,k} \mid y_j = c_{n'})\right]^{(1 - s_{j,k})} \right)} \tag{26}
$$

G.3 sampling a new $\vec{F} \mid \vec{Y}, S$:

$$
\vec{F} \mid \vec{Y}, S \sim \mathrm{Dirichlet}(\vec{M} + \vec{\alpha}) \tag{27}
$$

The resulting sample of $(\vec{F}, \vec{Y})$ and the $\vec{L}_j$ that go with it form the output of the method. These are distributions of CSMFs at the population level and probabilities of dying from each cause at the individual level. These distributions can be summarized as required to produce point values and measures of uncertainty in $\vec{F}$ and $\vec{L}_j$.

Deaths often result from more than one cause. InSilicoVA accommodates this possibility by producing a separate distribution of the probabilities of being assigned to each cause; that is $N$ distributions, one for each cause. In contrast, InterVA reports one value for each cause, and those values sum to unity across causes for a single death.

Finally, with a suitable $\vec{F}$, InSilicoVA can be used to assign causes (and their associated $\ell_{j,n}$) to a single death by repeatedly drawing causes using (25). This requires no more information than InterVA to accomplish the same objective, and it produces uncertainty bounds around the probabilities of being assigned to each cause.

4 Testing & Comparing InSilicoVA and InterVA

To evaluate both InSilicoVA and InterVA we fit them to simulated and real data. We have created R code that implements both methods. The R code for InterVA matches the results produced by Peter Byass’ implementation (Byass, 2013).

4.1 Simulation Study

Our simulated data are generated using this procedure:

1. Draw a set of $\Pr(s_k \mid c_n)$ so that they have the same distribution and range as those provided with Byass’ InterVA software (Byass, 2013).
2. Draw a set of simulated deaths from a made-up distribution of deaths by cause.
3. For each simulated death, assign a set of signs/symptoms by applying the conditional probabilities simulated in step 1.

These simulated data have the same overall features as the data required for either InterVA or InSilicoVA, and we know both the real population distribution of deaths by cause and the true individual cause assignments.

Our simulation study poses three questions:

1. Fair comparison of InSilicoVA and InterVA. To make this comparison we generate 100 simulated datasets and apply both methods to each dataset. We summarize the results with individual-level and population-level error measures. We refer to this as ‘fair’ because we apply both methods in their simplest form to data that fulfill all the requirements of both methods. Since the data are effectively ‘perfect’ we expect both methods to perform well.

2. Influence of numeric values of $\Pr(s_k \mid c_n)$. Given the structure of InterVA, we are concerned that the results of InterVA may be sensitive to the exact numerical values taken by the conditional probabilities supplied by physicians. To test this we rescale the simulated $\Pr(s_k \mid c_n)$ so that their range is restricted to $[0.25, 0.75]$. We again simulate 100 datasets, apply both methods to each dataset, and summarize the resulting errors for each method.

3. Reporting errors. As we mentioned in the introduction, we are concerned about reporting errors for any algorithmic approach to VA cause assignment. To investigate this we randomly recode our simulated signs/symptoms so that a small fraction are ‘wrong’, i.e. coded 0 when the sign/symptom exists (15%) or 1 when there is no sign/symptom (10%).
W e do this for 100 sim ulated datasets and summarize errors resulting from application of b oth metho ds. 4.2 Application to Real Data - Agincourt HDSS T o in v estigate the b ehavior of InSilicoV A and InterV A on real data, we apply b oth metho ds to the V A data generated b y the Agincourt health and demographic surveillance system (HDSS) in South Africa (Kahn et al., 2012) from roughly 1993 to the present. The Agincourt site con tinuously monitors the p opulation of 21 villages lo cated in the Bush buckridge District of Mpumalanga Pro vince in northeast South Africa. This is a rural p opulation living in what w as during Apartheid a black ‘homeland’. The Agincourt HDSS was established in the early 1990s with the purp ose of guiding the reorganization of South Africa’s health system. 12 Since then the goals of the HDSS hav e evolv ed and no w it con tributes to ev aluation of national p olicy at p opulation, household and individual levels. The p opulation co vered by the Agincourt site is approximately eight y-thousand. F or this test w e us the ph ysician-generated conditional probabilities P and initial guess of the CSMFs ~ F 0 pro vided by Byass with the In terV A softw are (Byass, 2013). 4.3 Results The results of our sim ulation study are summarized graphically as a set of Figures 1 – 3. The Agincourt results are presented in Figure 4. 5 Discussion 5.1 Summary of Findings InSilicoV A b egins to solv e most of the critical problems with In terV A. The results of ap- plying b oth metho ds to sim ulated data indicate that InSilicoV A p erforms well under all circumstances except ‘rep orting errors’, but even in this situation InSilicoV A p erforms far b etter than InterV A. InSilicoV A and In terV A b oth p erform relativ ely well when the simu- lated data are p erfect. InSilicoV A’s performance is not aﬀected b y changing the magnitudes and ranges of the conditional probabilit y inputs, whereas In terV A’s p erformance suﬀers dra- matically . 
With reporting errors both methods' performance is negatively impacted, but InterVA becomes effectively useless. Applied to one specific real dataset, both methods produce qualitatively similar results, but InSilicoVA is far more conservative and produces confidence bounds, whereas InterVA does not.

For Agincourt, Figure 4.A shows the causes with the largest difference between the InSilicoVA and InterVA estimates of the CSMF. InSilicoVA classifies a larger portion of deaths as due to causes labeled as 'other.' This indicates that these causes are related to either communicable or non-communicable diseases, but there is not enough information to make a more specific classification. This feature of InSilicoVA identifies cases that are difficult to classify using available data and may, for example, be good candidates for physician review.

We view this behavior as a strength of InSilicoVA because it is consistent with the fundamental weakness of the VA approach, namely that both the information obtained from a VA interview and the expert knowledge and/or gold standard used to characterize the relationship between signs/symptoms and causes are inherently weak and incomplete, and consequently it is very difficult or impossible to make highly specific cause assignments using VA. Given this, we do not want a method that is artificially precise, i.e. one that forces fine-tuned classification when there is insufficient information. Hence we view InSilicoVA's behavior as reasonable, 'honest' (in that it does not over-interpret the data) and useful. It is useful in the sense that it identifies where our information is particularly weak and therefore where we need to apply more effort either to data or to interpretation, such as additional physician reviews.

5.2 Future Work

We plan a variety of additional work on InSilicoVA:

1. Explore the possibility of replacing the Dirichlet distribution in (19), (23) and (27) with a mixture of Normals on the baseline-logit-transformed set of $f_n$'s. This provides additional flexible parameters to allow each CSMF to have its own mean and variance.
2. Embed InSilicoVA in a spatio-temporal model that allows $\vec{F}$ to vary smoothly through space and time. This would provide a parsimonious way of exploring spatio-temporal variation in the CSMFs while using the data as efficiently as possible.
3. Create the ability to add physician cause assignments to (22) and (26) so that information in that form can be utilized when available. The physician codes will require pre-processing to remove physician-specific bias in cause assignment, perhaps using a 'rater reliability method' (for example: Salter-Townshend and Murphy, 2012).
4. Most importantly, address the obviously invalid assumption that the signs/symptoms are independent given a specific cause. This will require modeling of the signs/symptoms and the physician-provided conditional probabilities so that important dependencies can be accommodated. Further, this will require additional consultation with physicians and acquisition of new expert knowledge to characterize these dependencies. All of this will require a generous grant and the collaboration of a large number of experts. This will very likely greatly improve the performance and robustness of the method.
5. Critically, re-elicit the conditional probabilities $P$ from physicians so that they are logically well-behaved, i.e. fully consistent with one another and their complements.
6. Focus and sharpen the VA questionnaire. Quantify the influence of each sign/symptom to: (1) potentially eliminate low-value signs/symptoms and thereby make the VA interview more efficient, and/or (2) suggest sign/symptom 'types' that appear particularly useful, and potentially suggest augmenting VA interviews based on that information.
7. Explore new possibilities for refining the conditional probabilities $P$ and potentially for entirely new models.

[Figure 1 plot. Panel A: 'Individual accuracy' (axis: Accuracy) for InSilicoVA and InterVA. Panel B: 'CSMF estimation' (axis: Mean Absolute Error) for InSilicoVA and InterVA.]

Figure 1: Simulation setup 1: 'Fair Comparison'. A: InSilicoVA assigns cause of death correctly effectively 100% of the time. InterVA is less accurate in assigning individual causes of death. B: InSilicoVA's errors in identifying CSMFs are consistently very small. InterVA's errors are also generally small, but the distribution has a long tail in the direction of large errors; sometimes InterVA's errors are large.

[Figure 2 plot. Panel A: 'Individual accuracy' (axis: Accuracy) for InSilicoVA and InterVA. Panel B: 'CSMF estimation' (axis: Mean Absolute Error) for InSilicoVA and InterVA.]

Figure 2: Simulation setup 2: 'Conditional Probabilities in the range [0.25, 0.75]'. A: InSilicoVA assigns cause of death correctly effectively 100% of the time. InterVA assigns cause of death correctly about 80% of the time, with wide variation: as low as 60% and never above about 90%. B: InSilicoVA's errors in identifying CSMFs are consistently very small. InterVA's errors in identifying the CSMFs are larger and more variable.
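The simulation setups shown in these figures all build on the three-step procedure of Section 4.1. A rough sketch of that procedure in Python follows; it is an illustration under stated assumptions, not the authors' R code. The function names are hypothetical, the uniform draw in step 1 is only a placeholder for the distribution shipped with InterVA, the linear map used for setup 2 is one possible way to restrict the range to [0.25, 0.75], and the 15% and 10% figures of setup 3 are read here as per-entry flip probabilities.

```python
import random


def draw_conditional_probs(n_causes, n_symptoms, rng):
    """Step 1 (placeholder): draw Pr(s_k | c_n) uniformly on [0, 1).

    Assumption: the paper matches the distribution supplied with InterVA;
    a uniform draw is used here only to keep the sketch self-contained.
    """
    return [[rng.random() for _ in range(n_symptoms)] for _ in range(n_causes)]


def rescale(P, low=0.25, high=0.75):
    """Setup 2 (assumed linear map): squeeze every probability into [low, high]."""
    return [[low + (high - low) * p for p in row] for row in P]


def simulate_deaths(P, csmf, n_deaths, rng):
    """Steps 2 and 3: draw a true cause per death, then its 0/1 symptoms."""
    causes = rng.choices(range(len(csmf)), weights=csmf, k=n_deaths)
    symptoms = [[1 if rng.random() < p else 0 for p in P[c]] for c in causes]
    return causes, symptoms


def add_reporting_errors(symptoms, rng, miss=0.15, false_pos=0.10):
    """Setup 3: recode 1 -> 0 with probability `miss`, 0 -> 1 with `false_pos`."""
    noisy = []
    for row in symptoms:
        new_row = []
        for s in row:
            if s == 1:
                new_row.append(0 if rng.random() < miss else 1)
            else:
                new_row.append(1 if rng.random() < false_pos else 0)
        noisy.append(new_row)
    return noisy


# Toy run: 3 causes, 5 signs/symptoms, 50 deaths (all sizes are illustrative).
rng = random.Random(1)
P = draw_conditional_probs(3, 5, rng)
P_narrow = rescale(P)
true_causes, S = simulate_deaths(P, [0.5, 0.3, 0.2], 50, rng)
S_noisy = add_reporting_errors(S, rng)
```

Because the true causes and the true CSMFs are retained, any cause-assignment method applied to `S` or `S_noisy` can be scored at both the individual and the population level.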
[Figure 3 plot. Panel A: 'Individual accuracy' (axis: Accuracy) for InSilicoVA and InterVA. Panel B: 'CSMF estimation' (axis: Mean Absolute Error) for InSilicoVA and InterVA.]

Figure 3: Simulation setup 3: 'Reporting Errors'. Both methods suffer, but InterVA suffers a lot more. A: InSilicoVA assigns cause of death correctly about 70% of the time. InterVA assigns cause of death correctly about 40% of the time. B: InSilicoVA's errors in identifying CSMFs are still consistently small. InterVA's errors in identifying the CSMFs are larger.

[Figure 4 plot. Panel A: 'Most discrepancy', InSilicoVA minus InterVA (axis from −20% to 20%), for: Other and unspecified neoplasms; Diabetes mellitus; Diarrhoeal diseases; Stroke; Other and unspecified infect dis; Chronic obstructive pulmonary dis; Digestive neoplasms; Pulmonary tuberculosis; Sepsis (non-obstetric); Other and unspecified NCD. Panel B: 'Within 4%', InSilicoVA minus InterVA (axis from −4% to 4%), for: Intentional self-harm; Road traffic accident; Asthma; Breast neoplasms; Assault; Meningitis and encephalitis; Other and unspecified cardiac dis; Reproductive neoplasms MF; Other and unspecified external CoD; Acute abdomen. Panel C: 'Estimated cause-specific mortality fractions' (axis: Estimated CSMF, 0.00 to 0.20), InterVA and InSilicoVA, for the same causes as Panel A.]

Figure 4: Agincourt Application. A: Differences in CSMFs (InSilicoVA minus InterVA), displaying the 10 specific causes that differ most. InSilicoVA is less willing to make highly specific classifications and thus produces larger CSMFs associated with less-specific causes and smaller CSMFs associated with more-specific causes. B: Same as (A) but for the 10 largest differences in CSMFs that are still less than 4%, which as a group includes HIV even though it is not in the top 10 that are plotted here. C: The CSMFs produced by both models on their natural scale, for the same 10 causes as in A, for which the differences are greatest.

References

Byass, P. (2013). InterVA software. www.interva.org.

Byass, P., D. Chandramohan, S. J. Clark, L. D'Ambruoso, E. Fottrell, W. J. Graham, A. J. Herbst, A. Hodgson, S. Hounton, K. Kahn, et al. (2012). Strengthening standardised interpretation of verbal autopsy data: the new InterVA-4 tool. Global Health Action 5.

Byass, P., E. Fottrell, D. L. Huong, Y. Berhane, T. Corrah, K. Kahn, L. Muhe, et al. (2006). Refining a probabilistic model for interpreting verbal autopsy data. Scandinavian Journal of Public Health 34(1), 26–31.

Byass, P., D. L. Huong, and H. Van Minh (2003). A probabilistic approach to interpreting verbal autopsies: methodology and preliminary validation in Vietnam. Scandinavian Journal of Public Health 31(62 suppl), 32–37.

Flaxman, A. D., A. Vahdatpour, S. Green, S. L. James, C. J. Murray, and Population Health Metrics Research Consortium (2011). Random forests for verbal autopsy analysis: multisite validation study using clinical diagnostic gold standards. Popul Health Metr 9(29).

James, S. L., A. D. Flaxman, C. J. Murray, and Population Health Metrics Research Consortium (2011). Performance of the tariff method: validation of a simple additive algorithm for analysis of verbal autopsies. Popul Health Metr 9(31).

Kahn, K., M. A. Collinson, F. X. Gómez-Olivé, O. Mokoena, R. Twine, P. Mee, S. A. Afolabi, B. D. Clark, C. W. Kabudula, A. Khosa, et al. (2012). Profile: Agincourt health and socio-demographic surveillance system. International Journal of Epidemiology 41(4), 988–1001.

King, G. and Y. Lu (2008). Verbal autopsy methods with multiple causes of death. Statistical Science 100(469).

King, G., Y. Lu, and K. Shibuya (2010). Designing verbal autopsy studies. Popul Health Metr 8(19).

Leitao, J., D. Chandramohan, P. Byass, R. Jakob, K. Bundhamcharoen, and C. Choprapowan (2013). Revising the WHO verbal autopsy instrument to facilitate routine cause-of-death monitoring. Global Health Action 6(21518).

Maher, D., S. Biraro, V. Hosegood, R. Isingo, T. Lutalo, P. Mushati, B. Ngwira, M. Nyirenda, J. Todd, and B. Zaba (2010). Translating global health research aims into action: the example of the ALPHA network*. Tropical Medicine & International Health 15(3), 321–328.

Murray, C. J., S. L. James, J. K. Birnbaum, M. K. Freeman, R. Lozano, A. D. Lopez, and Population Health Metrics Research Consortium (2011). Simplified symptom pattern method for verbal autopsy analysis: multisite validation study using clinical diagnostic gold standards. Popul Health Metr 9(30).

Salter-Townshend, M. and T. B. Murphy (2012). Sentiment analysis of online media. In Lausen, B., van den Poel, D. and Ultsch, A. (eds.), Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization.

Sankoh, O. and P. Byass (2012). The INDEPTH Network: filling vital gaps in global epidemiology. International Journal of Epidemiology 41(3), 579–588.
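As a supplementary illustration, the Gibbs sampler summarized in steps G.2 and G.3 (equations 25 to 27) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' R implementation: the function names, the toy inputs, and the fixed iteration count used in place of the convergence check in G.4 are all hypothetical choices made for the example. The conditional probabilities are assumed to lie strictly between 0 and 1.

```python
import math
import random


def cause_posterior(f, P, s):
    """Equation (26): normalized l_{j,n} for one death.

    f[n]    : current CSMF for cause n
    P[n][k] : Pr(s_k = 1 | cause n), assumed strictly between 0 and 1
    s[k]    : 0/1 indicator for sign/symptom k
    """
    log_w = []
    for n in range(len(f)):
        # Guard against log(0) if a sampled CSMF underflows to zero.
        lw = math.log(max(f[n], 1e-300))
        for k, s_k in enumerate(s):
            p = P[n][k]
            lw += math.log(p) if s_k else math.log(1.0 - p)
        log_w.append(lw)
    top = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [x / total for x in w]


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def gibbs(S, P, alpha, f0, n_iter, rng):
    """Alternate steps G.2 and G.3 for n_iter sweeps; return sampled (F, Y)."""
    N = len(f0)
    f = list(f0)
    draws = []
    for _ in range(n_iter):
        # G.2: sample a cause for each death, equation (25).
        y = [rng.choices(range(N), weights=cause_posterior(f, P, s))[0]
             for s in S]
        # G.3: count deaths per cause (vector M), then F ~ Dirichlet(M + alpha).
        m = [0] * N
        for c in y:
            m[c] += 1
        f = sample_dirichlet([m_n + a_n for m_n, a_n in zip(m, alpha)], rng)
        draws.append((f, y))
    return draws


# Toy inputs (illustrative only): 2 causes, 2 signs/symptoms, 3 deaths.
rng = random.Random(0)
P = [[0.9, 0.1], [0.2, 0.8]]
S = [[1, 0], [1, 0], [0, 1]]
draws = gibbs(S, P, alpha=[1.0, 1.0], f0=[0.5, 0.5], n_iter=200, rng=rng)
mean_f = [sum(f[n] for f, _ in draws) / len(draws) for n in range(2)]
```

The list `draws` holds one sampled $(\vec{F}, \vec{Y})$ pair per sweep; summaries such as `mean_f` or quantiles of each $f_n$ give point values and uncertainty measures. A real analysis would also discard early sweeps and check convergence, which this sketch omits.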
