Bayesian Inference for Epidemic Final Size Datasets with Hidden Underlying Household Structure

Joseph Brooks¹, Thomas House¹, Lorenzo Pellis¹, and Joe Hilton¹,²
¹ Department of Mathematics, University of Manchester, Manchester, United Kingdom
² Manchester Centre for Health Economics, University of Manchester, Manchester, United Kingdom

Abstract

Households represent a key unit of interest in infectious disease epidemiology, in both empirical studies and mathematical modelling. The within-household transmission potential of an infectious disease is often summarised in terms of a secondary attack ratio, the proportion of an index case's household contacts who become infected during the index case's infectious period. Despite its widespread use, the secondary attack ratio depends on the distribution of household compositions seen during the study period, making the information it conveys difficult to generalise to new populations and different contexts. Extending estimates of transmission potential to new populations instead requires estimates of person-to-person transmission rates which can be combined with data on population structure to parametrise mechanistic transmission models.

In this study we present a new Bayesian inference method which uses an MCMC algorithm to infer the transmission intensity by imputing the unreported household structure underlying the epidemic dynamics. This method can be run on household epidemiological data reported at varying levels of resolution. For synthetic data from a realistic underlying household size distribution, we were able to achieve over 95% coverage in our estimates of transmission rate over a range of transmission intensity scenarios.
The method was also able to consistently achieve over 95% coverage for data generated with a pathological underlying household size distribution, given strong information about the household size distribution. Using an existing dataset which recorded micro-scale household epidemiological outcomes during the COVID-19 pandemic, we show that at least stratifying observed secondary attack ratios by household size substantially reduces the uncertainty in transmission parameter estimates. Our findings suggest that researchers conducting household epidemiological studies can drastically improve the utility of their results to infectious disease modellers by reporting household-stratified estimates, without needing to report full micro-level transmission outcomes. These results aim to encourage the reporting of higher resolution outputs in future epidemiological field work as, in the absence of strong priors, the transmission parameters were not easily identifiable from low resolution datasets, which are commonly reported.

Introduction

Households are important units of study in infectious disease epidemiology, both as sites of prolonged intense social contact and as convenient population units for empirical studies [1, 2]. Studies that investigate within-household transmission are common and provide essential data on emerging pathogens, while household structure plays an extremely important role in determining the onward spread of infection. For example, during the COVID-19 pandemic contact-tracing studies showed household contacts to be much more likely to be infected than other types of contacts (e.g. workplace, social) [3–5]. Household structure also plays a key role in interventions against infection, with non-pharmaceutical interventions (NPIs) such as stay-at-home orders acting at the household level, and uptake of vaccines showing correlations within families.
In field studies of household transmission, the secondary attack rate (SAR), defined as the number of secondary cases divided by the number of contacts, is ubiquitous as a measure of the transmissibility of a pathogen in enclosed settings such as households. As a simple ratio, it can be calculated from relatively coarse epidemiological data, requiring field researchers only to identify cohabitants of an index case and record their infection status over a sufficiently long period. Despite its widespread use, its simplicity makes the SAR poorly suited to capture the complex dynamics of person-to-person transmission.

Infectious disease dynamics arise from a combination of biological and sociological factors which the SAR summarises as a single number, meaning that an estimate of SAR from an outbreak in one population provides very limited insight into the behaviour of future outbreaks in other populations with differing demographic and social contact structures. Instead, mechanistic models that are parameterised to reflect the social structure of specific populations are necessary to analyse infectious disease outbreaks and to make predictions. Mechanistic mathematical models play an important role in providing quantitative evidence, and models have been developed that take into account the geographical, gender/sex, age, workplace and household structured complexity in social mixing patterns [6–8].

Despite the considerable literature on household-structured epidemic modelling, and due to the ubiquity of the SAR, empirical household transmission studies most often report their data with SAR and logistic regression in mind. These studies often only report the number of cases and the number of contacts, giving little to no information on the household size distribution and only reporting the mean of the household final size distribution.
Because each secondary infectious case increases the infectious pressure on any remaining susceptible household members, and thus increases their likelihood of being infected, the household final size distribution is often bimodal and so particularly poorly described by its mean (see, e.g. House et al. [9], Figure 2). While likelihood-based methods for parameterising mechanistic models exist and are relatively efficient in populations of the small scale of a household [9, 10], they require higher resolution data than just the reported overall SAR. Therefore, in order to estimate person-to-person transmission rates from these studies we must augment the datasets, imputing the household structure that underlies them.

Data augmentation and data imputation have a long and successful history of use in inference for a variety of applications in infectious disease modelling [11–14]. The central problem these methods seek to address is that epidemics are rarely fully observed, meaning that epidemiological datasets are often incomplete and so we cannot directly evaluate the model likelihood based on these data alone. Data imputation approaches address this by augmenting these datasets with the missing information and treating the unobserved data as parameters. Standard parameter inference methods, often using MCMC, can then be used with additional steps which propose values for the missing data and identify scenarios which are most consistent with the outcomes we do observe.
In this study we present a novel mass imputation MCMC method for estimating transmission rates from household-stratified datasets at three different levels of detail: low-information datasets, which report the total number of contacts, the total number of secondary cases, and the number of households, with no information on how infection stratifies by household size; medium-information datasets, which report these same outcomes stratified by household size; and high-information datasets, which report the number of contacts and cases for each individual household included in the study. In the high-information setting our method resembles a standard likelihood-based MCMC approach, while in the low- and medium-information cases we impute the number of contacts and cases in each household and evaluate the joint likelihoods of specific combinations of parameters and distributions of contacts and cases by household. Our method can incorporate existing knowledge of background demography by using the household size distribution of the ambient population as a basis for its proposals of household-specific contact and case numbers.

We validate our method by demonstrating that it is able to successfully recover transmission rates from synthetic data generated with an underlying household size distribution taken from the UK Labour Force Survey [15] over a range of parameter values. For data generated from a pathological "split" household distribution, results varied and the algorithm tended to overestimate transmission, particularly when density dependent mixing was assumed. However, in the majority of cases this could be corrected by incorporating information about this pathological household size distribution. We demonstrate the practical applicability of our approach by estimating parameters from household-based COVID-19 studies.
We use our high-information method to estimate the transmission rate and density mixing parameter from Carazo et al. [16], which provides a high-information dataset, and compare our estimates when we aggregate the data and apply our medium- and low-information methods. We find that in the low-information case, the most common level of detail in household studies, there are identifiability issues. We find that contacts and cases stratified by household size are necessary to identify both parameters and that uncertainty could be reduced with datasets which report the size and number of secondary cases of each household in the study. We further estimate transmission rates from a number of low-information datasets taken from a systematic review of household transmission of SARS-CoV-2 [2].

These results show that low-information datasets, which are commonly reported in household transmission studies, are insufficient to parametrise even the simplest of mechanistic models. Contacts and cases stratified by household size can resolve identifiability issues, and datasets reporting the full outcomes of household outbreaks can further reduce uncertainty in parameter estimates.

Methods

Transmission model

The datasets of interest in this analysis come from household transmission studies, with our transmission model simulating the processes which generated the data in a given study. We make the assumption that each household in the study experiences an outbreak seeded by a single primary case, with all subsequent infections coming as a result of within-household transmission, and that the maximum household size is $m + 1$. The infectious period of an individual, denoted $I$, is a random variable which follows a specified distribution which is the same for all individuals.
During this infectious period, individuals make infectious contact with each other individual in their household according to a Poisson process with a rate given by the parametric form, dependent on the size of the household,

$$\beta_n = \frac{\beta}{n^{\eta}} \quad \text{for } n = 1, \dots, m, \tag{1}$$

where $n$ is the number of initially susceptible individuals (one less than the size of the household), hereafter referred to as contacts, $\beta > 0$ is the base transmission rate and $\eta \in [0, 1]$ is a mixing parameter. This particular parametric form for a transmission rate dependent on the size of the household is often considered [17, 18] and allows us to move between two common mixing assumptions: density dependence ($\eta = 0$) and frequency dependence ($\eta = 1$) [19–21]. Throughout this paper we assume that

$$I \sim \mathrm{Gamma}(a, a) \tag{2}$$

so that $\mathbb{E}[I] = 1$ and the expected number of secondary cases produced by a primary case in a household with $n$ initial susceptible contacts is $\beta_n$. In particular we will consider three cases: $a = 1$, so that $I \sim \mathrm{Exponential}(1)$ and the infectious disease dynamics are Markovian; the limit $a \to \infty$, so that we have a constant infectious period $I \equiv 1$; and an intermediate case, $a = 2$.

Datasets

Depending on how results from such studies are reported, datasets can vary in the amount of information they contain. In this paper we only consider "final size" data, i.e. we ignore information about when individuals become infected and focus on how many initially susceptible individuals, named contacts, are recorded as having been infected by the end of the period during which the household was observed, hence having become cases. We consider three categories of final size datasets distinguished by the level of information reported, with this information level denoted by a letter $J \in \{L, M, H\}$. Datasets of a given information level are denoted $D_J$.
Low-information datasets ($J = L$) report the total number of contacts, cases and households in a study, so that the dataset consists of just three integers. Medium-information datasets ($J = M$) report the number of contacts and cases, stratified by the size of the household, i.e. three integers for each household size observed in the data, including the household size itself. High-information datasets ($J = H$) report the number of households with each observed combination of the number of initially susceptible contacts and the final number of cases. In this case, data may be reported in an upper (or lower) triangular table. Given the focus is on estimating within-household transmission, we do not consider fully susceptible households (generally the situation in case-ascertainment studies) or households with only one member.

Likelihood

Given a dataset $D$, we aim to estimate the posterior of the transmission parameters $\theta = (\beta, \eta)$, $\pi(\theta \mid D)$, using a Markov chain Monte Carlo method. Throughout the fitting process we assume a fixed value of $a$ (either $a = 1$, $a = 2$ or $a \to \infty$) and so all dependence on $a$ is suppressed, including in the above posterior. We first construct a likelihood that can be evaluated using a high-information dataset. Let $Z$ denote the final size of a household outbreak. We can use results from Ball [22] to construct a set of triangular equations which can be solved to determine the exact distribution of $Z$ conditional on $n$ and $\theta$:

$$\sum_{z=0}^{j} \binom{n-z}{j-z} \frac{P(Z = z \mid n, \theta)}{\phi(\beta(j-n))^{1+z}} = \binom{n}{j}, \quad j = 0, 1, \dots, n, \tag{3}$$

where $\phi(t) = \mathbb{E}[\exp(tI)] = \left(1 - \frac{t}{a}\right)^{-a}$ is the moment generating function of the infectious period. Alternative methods for the computation of the final size distribution can be found in House, Ross, and Sirl [23].
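As an illustration of how the triangular system in Equation (3) can be solved, the following is a minimal Python sketch using forward substitution. It is not the authors' implementation. It assumes that $\phi$ is the moment generating function $\mathbb{E}[\exp(tI)]$ of $I \sim \mathrm{Gamma}(a, a)$ with shape and rate $a$, that the term in $\phi$ divides the probability, and that the rate entering Equation (3) is the size-specific rate $\beta_n$ of Equation (1). The function names and the example parameter values are hypothetical.

```python
from math import comb, exp, isinf


def infectious_period_mgf(t, a):
    """MGF E[exp(t*I)] of I ~ Gamma(shape=a, rate=a), valid for t < a.

    a = float("inf") is treated as the constant infectious period I = 1.
    """
    if isinf(a):
        return exp(t)
    return (1.0 - t / a) ** (-a)


def final_size_distribution(n, beta, eta, a=2.0):
    """Return [P(Z = z | n, theta) for z = 0..n] by forward substitution.

    Row j of the triangular system involves only P(Z = 0), ..., P(Z = j),
    so each probability follows from the ones already computed.
    """
    # Equation (1); assumed here to be the rate that enters Equation (3).
    beta_n = beta / n ** eta
    probs = []
    for j in range(n + 1):
        phi = infectious_period_mgf(beta_n * (j - n), a)
        # Contribution of the already-known probabilities to row j.
        known = sum(
            comb(n - z, j - z) * probs[z] / phi ** (1 + z) for z in range(j)
        )
        probs.append((comb(n, j) - known) * phi ** (1 + j))
    return probs


# Hypothetical values: 3 contacts, beta = 0.5, eta = 0.7, I ~ Gamma(2, 2).
distribution = final_size_distribution(3, beta=0.5, eta=0.7, a=2.0)
print([round(p, 4) for p in distribution])
```

Because the last row ($j = n$) has $\phi(0) = 1$, the returned probabilities sum to one by construction, which gives a quick sanity check on any implementation.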
In order to derive our likelihood more easily, we define an enumeration map $f$, so that we can write out a high-information dataset as a vector of counts:

$$f(n, z) = \sum_{i=1}^{n-1} (i + 1) + z = \frac{(n-1)(n+2)}{2} + z. \tag{4}$$

This enumeration orders household outcomes first in terms of the number of contacts and secondly in terms of the number of cases among these contacts. Given households of size one are not included in the dataset, the enumeration starts counting from $f(1, 0) = 0$, $f(1, 1) = 1$ and $f(2, 0) = 2$, which are the indices corresponding to households with 1 contact and 0 cases, 1 contact and 1 case, and 2 contacts and 0 cases respectively. The counting stops at $K = f(m, m)$, which corresponds to a household of size $m + 1$ in which each of the $m$ contacts is infected. We also define the elementwise inverses of $f$, $f_n$ and $f_z$, such that $f(f_n(k), f_z(k)) = k$. We can therefore represent a high-information dataset by a flat vector, $C \in \mathbb{N}^K$, where $C_k$ is the number of households with $f_n(k)$ initially susceptible contacts and $f_z(k)$ non-primary cases.

Our aim is now to estimate the posterior for $\theta$ given a high-information dataset $C$, which we denote as $\pi(\theta \mid C)$. For some $\theta$ we can define a vector of final size distributions $P(\theta) \in [0, 1]^K$ where $P_k(\theta) = P(Z = f_z(k) \mid f_n(k), \theta)$. For a given $\theta$ and relatively small $m$ ($\leq 20$, the scale of a household), $P(\theta)$ is straightforward and efficient to calculate by solving the linear system in Equation (3). We assume that our study population samples household sizes from an underlying household size distribution, $p = (p_1, \dots, p_m)$, where $p_n$ is the probability that a randomly selected household is of size $n + 1$. We assume a hierarchical framework for the final size outcomes $C$:

$$p \mid \alpha \sim \mathrm{Dir}(\alpha), \tag{5}$$

$$C \mid p, \theta \sim \mathrm{Multi}\Big(\sum_k C_k,\; P \odot_{f_n} p\Big), \tag{6}$$

where $\alpha = (\alpha_1, \dots, \alpha_m)$ are chosen parameters for the Dirichlet distribution and $\odot_{f_n}$ denotes an elementwise product where the $k$th entry of the first vector is multiplied by the $f_n(k)$th entry of the second. Note that if $p = (p_1, \dots, p_m) \sim \mathrm{Dir}(\alpha)$ then $\mathbb{E}(p_i) = \alpha_i / \alpha_0$, where $\alpha_0 = \sum_{i=1}^{m} \alpha_i$. Also, larger values of $\alpha_0$ correspond to smaller variance of $p$, and so in some sense larger $\alpha_0$ represents higher confidence that the household size distribution in the study is close to $\mathbb{E}(p)$. Note that, although this is similar to a Dirichlet-Multinomial distribution, it is not the same because $C$ and $p$ are of length $K$ and $m$ respectively. We can derive a likelihood similar to that of a Dirichlet-Multinomial as follows:

$$L(\theta \mid C, \alpha) = \pi(C \mid \theta, \alpha) \tag{7}$$

$$= \int_{S^{m-1}} \pi(C \mid p, \theta)\, \pi(p \mid \alpha)\, \mathrm{d}p \tag{8}$$

$$= \int_{S^{m-1}} \frac{1}{\mathrm{B}(C)} \prod_{k=0}^{K} \big(P_k(\theta)\, p_{f_n(k)}\big)^{C_k}\, \frac{1}{\mathrm{B}(\alpha)} \prod_{n=1}^{m} p_n^{\alpha_n - 1}\, \mathrm{d}p \tag{9}$$

$$= \frac{1}{\mathrm{B}(C)\,\mathrm{B}(\alpha)} \prod_{k=0}^{K} P_k(\theta)^{C_k} \int_{S^{m-1}} \prod_{k=0}^{K} p_{f_n(k)}^{C_k} \prod_{n=1}^{m} p_n^{\alpha_n - 1}\, \mathrm{d}p, \tag{10}$$

where $S^{m-1}$ is the $(m-1)$-simplex. If we let $\gamma \in \mathbb{R}^m$ be defined such that $\gamma_n := \sum_{k=f(n,0)}^{f(n,n)} C_k + \alpha_n$, so that the integrand is an unnormalised Dirichlet density with parameter $\gamma$, then we can reduce the likelihood to the following:

$$\pi(C \mid \theta, \alpha) = \frac{1}{\mathrm{B}(C)\,\mathrm{B}(\alpha)} \prod_{k=0}^{K} P_k(\theta)^{C_k} \int_{S^{m-1}} \prod_{n=1}^{m} p_n^{\gamma_n - 1}\, \mathrm{d}p \tag{11}$$

$$= \frac{\mathrm{B}(\gamma)}{\mathrm{B}(C)\,\mathrm{B}(\alpha)} \prod_{k=0}^{K} P_k(\theta)^{C_k}. \tag{12}$$

All components of this likelihood can be efficiently calculated, making it suitable for inference methods that require many evaluations of the likelihood, such as MCMC.

Data Augmentation and MCMC algorithm

In the cases where we only have a medium- or low-information dataset $D_J$ ($J \in \{L, M\}$) we would like to derive the posterior distribution of $\theta$, $\pi(\theta \mid D_J, \alpha)$. However, we cannot directly evaluate the corresponding likelihood $L(\theta \mid D_J, \alpha) = \pi(D_J \mid \theta, \alpha)$. Instead, we must augment the dataset, adding the household structure necessary to use the likelihood in Equation (12).
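For concreteness, the enumeration map of Equation (4) and the closed-form likelihood of Equation (12) can be sketched in Python as below. This is an illustrative sketch rather than the authors' code. It takes $\gamma_n$ to be $\alpha_n$ plus the number of households with $n$ contacts, so that the Dirichlet integral equals $\mathrm{B}(\gamma)$, and it drops the factor $1/\mathrm{B}(C)$, which does not depend on $\theta$ and therefore does not affect inference on $\theta$ for a fixed $C$. The function names are hypothetical, and the vector of final size probabilities `P` is assumed to be supplied by a separate solver for Equation (3).

```python
from math import lgamma, log


def f(n, z):
    """Enumeration map of Equation (4): index of (n contacts, z cases)."""
    return (n - 1) * (n + 2) // 2 + z


def f_inverse(k):
    """Return (f_n(k), f_z(k)), the contacts and cases encoded by index k."""
    n = 1
    while f(n + 1, 0) <= k:
        n += 1
    return n, k - f(n, 0)


def log_multivariate_beta(x):
    """log B(x) = sum(log Gamma(x_i)) - log Gamma(sum(x_i))."""
    return sum(lgamma(v) for v in x) - lgamma(sum(x))


def log_likelihood(counts, P, alpha):
    """Log of Equation (12) without the theta-independent term -log B(C).

    counts[k] is C_k, P[k] is P_k(theta), alpha has one entry per n = 1..m.
    """
    gamma = list(alpha)
    total = 0.0
    for k, c in enumerate(counts):
        n, _ = f_inverse(k)
        gamma[n - 1] += c  # gamma_n = alpha_n + households with n contacts
        if c > 0:
            total += c * log(P[k])
    return total + log_multivariate_beta(gamma) - log_multivariate_beta(alpha)


# With m = 5 the outcomes are indexed 0..K, where K = f(5, 5).
print(f(5, 5), f_inverse(f(3, 2)))
```

Working on the log scale with `lgamma` avoids overflow in the Beta functions when the counts or $\alpha_0$ are in the hundreds or thousands.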
Therefore, we target

$$\pi(\theta, C \mid D_J, \alpha) \propto \pi(D_J, C \mid \theta, \alpha)\, \pi(\theta) = \pi(C \mid \theta, \alpha)\, \pi(\theta),$$

where the equality comes from the fact that $D_J$ is fully determined by $C$. If we can specify an appropriate proposal distribution $q_J((\theta', C') \mid (\theta, C))$ then we can use a Metropolis-Hastings algorithm with acceptance probability

$$A((\theta', C') \mid (\theta, C)) = \min\left(1,\; \frac{\pi(C' \mid \theta', \alpha)\, \pi(\theta')\, q_J((\theta, C) \mid (\theta', C'))}{\pi(C \mid \theta, \alpha)\, \pi(\theta)\, q_J((\theta', C') \mid (\theta, C))}\right) \tag{13}$$

to sample from $\pi(\theta, C \mid D_J, \alpha)$, from which we can get the marginal distribution $\pi(\theta \mid D_J, \alpha)$. Note that so long as the proposal distribution results in an aperiodic and irreducible chain, the algorithm will target the required distribution.

Proposal Distribution

We now need to define a proposal distribution $q_J : (\mathbb{R}^2 \times g_J(D_J)) \to (\mathbb{R}^2 \times g_J(D_J))$, where $g_J(D_J)$ denotes the set of compatible high-information datasets for a given medium- or low-information dataset; see Appendix A for a formal definition of $g_J$. Note that $\mathbb{R}^2 \times g_J(D_J)$ is not a standard space, being the product of a discrete space $g_J(D_J)$ with specific restrictions and a more standard space $\mathbb{R}^2$. In order to efficiently navigate this space (avoiding proposing incompatible high-information datasets), we cannot use standard choices for the proposal distribution and so need to specify our own.

Suppose we start with $(\theta_0, C_0)$. We split our proposal into two cases. On each iteration of the MCMC algorithm either a new set of transmission parameters is proposed (and $C_1 = C_0$) or a new household structure is proposed (and $\theta_1 = \theta_0$), with probability $s$ and $1 - s$ respectively, where $s$ is a chosen hyper-parameter. In the case where we propose a new set of transmission parameters, we use a 2-dimensional Gaussian proposal distribution centred on $\theta_0$ such that $\theta_1 \sim N(\theta_0, \Sigma)$.
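A parameter-update step of this kind can be sketched as follows. This is a generic Metropolis-Hastings sketch in Python under stated assumptions, not the authors' algorithm: `sigma` is treated as the standard deviation of each Gaussian component, the household structure $C$ is held fixed, and `toy_log_target` is a hypothetical stand-in for $\log \pi(C \mid \theta, \alpha) + \log \pi(\theta)$. The acceptance rule is written with the proposal density of the reverse move in the numerator, as in the standard Metropolis-Hastings ratio. For the symmetric Gaussian move both proposal terms cancel and can be passed as zero.

```python
import math
import random


def propose_theta(theta, sigma, fit_eta, rng):
    """Gaussian random-walk proposal; eta is left unchanged when it is fixed."""
    beta, eta = theta
    new_beta = rng.gauss(beta, sigma)
    new_eta = rng.gauss(eta, sigma) if fit_eta else eta
    return (new_beta, new_eta)


def accept(log_target_current, log_target_proposed,
           log_q_forward, log_q_reverse, rng):
    """Metropolis-Hastings accept/reject decision on the log scale.

    log_q_forward is log q(proposed | current) and log_q_reverse is
    log q(current | proposed).
    """
    log_ratio = (log_target_proposed + log_q_reverse) - (
        log_target_current + log_q_forward
    )
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return math.log(u) <= min(0.0, log_ratio)


def toy_log_target(theta):
    """Hypothetical stand-in for log pi(C | theta, alpha) + log pi(theta)."""
    beta, eta = theta
    if beta < 0.0 or not 0.0 <= eta <= 1.0:
        return float("-inf")  # outside the support of the prior
    return -0.5 * ((beta - 0.6) / 0.1) ** 2


rng = random.Random(1)
theta = (0.5, 0.7)
for _ in range(1000):
    candidate = propose_theta(theta, sigma=0.05, fit_eta=False, rng=rng)
    if accept(toy_log_target(theta), toy_log_target(candidate), 0.0, 0.0, rng):
        theta = candidate
print(theta)
```

Proposals that fall outside the prior support receive a log target of minus infinity and are always rejected, which is one simple way to respect the constraints $\beta \geq 0$ and $\eta \in [0, 1]$.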
In this paper we either consider the case where $\eta$ is known or assumed and we only fit $\beta$, in which case $\Sigma = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}$ for some $\sigma > 0$, or the case where we fit both parameters with independent Gaussian proposals with identical standard deviation, so $\Sigma = \begin{pmatrix} \sigma & 0 \\ 0 & \sigma \end{pmatrix}$.

The case in which we propose a new household structure is more complicated. Since the space $g_J(D_J)$ is dependent on the information level of the dataset, we specify different proposal distributions for $J = L$ and $J = M$. The intuition for $q_L$ is that individual contacts, either infected or not, are selected at random from the current household structure $C_0$ and then moved from their household to another. The resulting household structure, $C_1$, is still compatible with $D_L$ (i.e., $C_1 \in g_L(D_L)$). This process is represented pictorially in Figure 1 and the full algorithm is presented in Algorithm 1. The algorithm for $q_M$ is shown in Algorithm 2. Once a new pair $(\theta_1, C_1)$ is proposed, it is accepted with the probability given in Equation (13).

Figure 1: Visual representation of the low-information proposal algorithm (Algorithm 1). Three hypothetical proposal steps are shown for a low-information dataset with $(N, y, n) = (24, 25, 45)$ and maximum number of household contacts $m = 3$. On the left hand side the indices of the outcomes are shown with pictorial representations of these outcomes (primary cases, secondary cases and non-cases shown by red Ps, red positive signs and blue negative signs respectively). Each proposal move is represented by a black arrow starting at an entry in the previous pseudo-dataset and ending at an entry in the new pseudo-dataset, with their respective indices being $k_1$ and $k_2$ in the notation of Algorithm 1. The infectious status of the contact being moved is shown by the symbol on the arrow. Data that have changed in a step are shown as white digits with a black outline.

Priors

The prior on $\theta$ in Equation (13), $\pi(\theta)$, is the product of the two independent priors for $\beta$ and $\eta$, i.e. $\pi(\theta) = \pi_1(\beta)\, \pi_2(\eta)$. There are no restrictions on the distributions of $\pi_1$ or $\pi_2$. However, throughout this paper we do constrain $\pi_1$ and $\pi_2$ based on the meaning of the parameters: $\beta$ is a rate, so is assumed $\geq 0$, and although its range might be larger (even involving negative values), we assume $\eta \in [0, 1]$.

Synthetic data generation

In order to validate our method, we generated synthetic high-information datasets with known parameters. We try to re-estimate $\beta$ (with $\eta$ known and fixed) from the corresponding low-information dataset. Each of these datasets had $N = 1000$ households with sizes sampled from a given household size distribution. We used two household distributions, one informed by the 2023 UK Labour Force Survey (LFS) [15] and another one constructed to be an extreme example in which most of the households are either of the smallest or largest possible size, i.e. 2 or 6 (Figure S1). In both cases, the maximum number of contacts is $m = 5$. For details of these distributions see Appendix C. The final sizes of each outbreak are then sampled from the probabilities in Equation (3).

Results

Validation on synthetic data

We first validated our method, estimating $\beta$ from synthetic datasets generated from known parameters and Equation (3). We used 12 different parameter sets and two different household distributions, generating 100 datasets of 1000 households for each of these 24 combinations. When estimating $\beta$ from each synthetic dataset we fixed $\eta$ to the value used to generate the dataset. In each case, the parameters for the Dirichlet distribution ($\alpha$) were chosen so that they were proportional to (i.e. the expectation of the Dirichlet was equal to) the household size distribution used to generate the datasets.
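The role of the total concentration $\alpha_0 = \sum_n \alpha_n$ can be illustrated with a short Python sketch. It scales a household size distribution to a chosen $\alpha_0$ and uses the standard Dirichlet result $\mathrm{Var}(p_i) = \frac{(\alpha_i/\alpha_0)(1 - \alpha_i/\alpha_0)}{\alpha_0 + 1}$ to show how the prior spread narrows as $\alpha_0$ grows. The distribution below is hypothetical and is not the UK LFS or "split" distribution used in the paper, and the function names are illustrative.

```python
from math import sqrt


def dirichlet_parameters(mean_distribution, alpha_0):
    """Scale a household size distribution so that sum(alpha) == alpha_0."""
    total = sum(mean_distribution)
    return [alpha_0 * p / total for p in mean_distribution]


def dirichlet_standard_deviations(alpha):
    """Marginal standard deviation of each p_i under Dir(alpha)."""
    alpha_0 = sum(alpha)
    return [
        sqrt((a / alpha_0) * (1.0 - a / alpha_0) / (alpha_0 + 1.0))
        for a in alpha
    ]


# Hypothetical household size distribution for sizes 2..6 (m = 5 contacts).
mean_distribution = [0.40, 0.25, 0.20, 0.10, 0.05]
for alpha_0 in (100, 1000):
    alpha = dirichlet_parameters(mean_distribution, alpha_0)
    sds = dirichlet_standard_deviations(alpha)
    print(alpha_0, [round(s, 4) for s in sds])
```

Since the variance scales roughly as $1/\alpha_0$, a tenfold increase in $\alpha_0$ shrinks each marginal standard deviation by a factor of about $\sqrt{10}$.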
The value of $\alpha_0 = \sum_{n=1}^{m} \alpha_n$ determines how "concentrated" the Dirichlet distribution is, reflecting how confident we are that the household size distribution in the study is similar to what we have specified as the mean of the Dirichlet. The smaller the value of $\alpha_0$, the further and more easily the household distribution can wander from the mean of the Dirichlet during the inference. We fit the model for $\alpha_0 = 100$ (less constrained) and $\alpha_0 = 1000$ (very constrained).

Each panel in Figure 2 shows the 95% confidence intervals for $\beta$ estimated from 100 synthetic datasets for a given parameter set, with $\beta$ determined by the column and $\eta$ determined by the row. Within each panel are results for synthetic datasets generated from the UK household size distribution and the extreme "split" distribution (with fixed probabilities, i.e. corresponding to the limit $\alpha_0 \to \infty$), both fitted with either $\alpha_0 = 100$ or $\alpha_0 = 1000$.

For data generated from the UK household distribution the correct parameters are recovered by the MCMC algorithm at least 95% of the time for all 12 values of $\theta$. However, when data are generated from the "split" distribution, the MCMC systematically overestimates $\beta$ when $\eta = 0$ or $0.5$ and $\alpha_0 = 100$. For $\theta = (0.2, 0)$, $(0.5, 0)$, $(1.5, 0)$, $(2, 0)$, $(0.5, 0.5)$, $(1.5, 0.5)$ and $(2.0, 0.5)$ only 17%, 1%, 44%, 83%, 85%, 42% and 63% of the datasets, respectively, result in posteriors with 95% intervals that overlap with the actual values, with most datasets resulting in overestimates. For $\eta = 1$, there was a tendency for $\beta$ to be underestimated when data were generated from the split distribution with $\alpha_0 = 100$. However, if $\alpha_0$ is increased to 1000, more strongly restricting the MCMC not to wander too far from the split household size distribution, then the value of $\beta$ can be recovered successfully in all cases (minimum of 93% coverage with the 95% confidence intervals).

Figure 2: In each subplot, 95% confidence intervals for the posteriors of $\beta$ are shown for 100 low-information synthetic datasets per household size distribution. Each dataset is generated using the Ball model ($I \sim \mathrm{Gamma}(2, 2)$) and 1000 household sizes sampled either from the UK LFS (2023) or the "split" household size distribution (see Appendix C for details), and $\beta$ is re-estimated separately using $\alpha_0 = 100$ and $1000$. Each subplot row shows results for synthetic data generated with a different base transmission rate ($\beta$) and each column shows results for a different density mixing parameter ($\eta$). The value of $\beta$ used to generate the data is shown by the vertical dotted line in each plot, and confidence intervals are plotted in green if this value is contained in them and red otherwise. The percentage of synthetic datasets for which the real value is contained within the 95% confidence interval is shown above each set of confidence intervals. All fits were done with $I \sim \mathrm{Gamma}(2, 2)$ for a known $\eta$ and so only $\beta$ was inferred.

Fitting to real data

Carazo et al. [16] report final size data from outbreaks in households of health care workers in Québec, Canada during the COVID-19 pandemic. In that paper, final size data are reported in sufficient detail for a high-information dataset. This allowed us to compare the performance of the MCMC method on a real dataset at each of the three information levels. For the choice of $\alpha$ we used data from the 2021 Census in Canada, which reported the number of households of each size in Québec. We use this distribution in the same way as the UK LFS (see Appendix C); however, given the census aggregated all households with 5 or more individuals, we arbitrarily distributed these households between sizes 5 and 6 at a 2:1 ratio.
When estimating parameters, we used an $\alpha$ centred around such a distribution, with a value of $\alpha_0 = 200$.

Panels A, B and C in Figure 3 show the posteriors from simultaneously fitting both $\beta$ and $\eta$ to the low- (red), medium- (green) and high-information (blue) datasets. For each of these we had a $\mathrm{Uniform}(0, 1)$ prior on $\eta$ and an improper (positive reals) prior on $\beta$. These results are for $I \sim \mathrm{Gamma}(2, 2)$, though the same plots for $I \sim \mathrm{Exponential}(1)$ and $I = 1$ are shown in Figures S3 and S4 respectively. In Panel A of Figure 3 we can see that there were issues of identifiability between $\beta$ and $\eta$, with the MCMC exploring all values of $\eta$ (limited by assumption to $[0, 1]$), and so we end up with a strip of values compatible with the observed SAR. With the medium-information dataset, instead, the parameters were clearly identifiable and the posterior was similar to, though slightly less certain than, the one obtained from the high-information dataset. The means in these two cases (shown by the black cross) are 0.57 (0.51, 0.63) for $\beta$ and 0.70 (0.59, 0.80) for $\eta$ for the medium-information dataset, and 0.57 (0.52, 0.62) for $\beta$ and 0.70 (0.61, 0.78) for $\eta$ for the high-information one.

The error bars in Panel D of Figure 3 show the estimated overall and size-stratified SAR for each of the information levels and the observed values from Carazo et al. [16]. The low-information confidence intervals do appear to cover the observed value for all sizes but are very wide, particularly for households of size 2 or 6. Adding higher levels of information incrementally narrows the confidence intervals, and the medium- and high-information SAR estimates are close to those observed in the data. The bar chart in Panel D of Figure 3 shows the number of households of each size from the low-information fit (red) and the observed data (black). The error bars cover the observed household size distribution, showing that the algorithm proposed sensible household structures.

Figure 3: Panels A, B, and C show the posterior distributions of $\theta$ obtained by fitting the low- (red), medium- (green), and high-information (blue) versions of the data from Carazo et al. [16], respectively. Panel D displays the secondary attack rate (SAR) implied by each posterior alongside the observed SAR (black) for each household size and overall; error bars represent 95% confidence intervals, with those for the observed SAR estimated by bootstrapping. Bars indicate the number of households of each size in the low-information fit (red) compared with the observed distribution (black). The infectious period is assumed to be $I \sim \mathrm{Gamma}(2, 2)$.

Next we estimated the transmission rate for SARS-CoV-2 from 23 papers [16, 24–41] (25 datasets) that reported at least a low-information dataset. These were chosen from the ones considered by a series of literature reviews by Madewell et al. [2, 42, 43]. For each of these low-information datasets we fit the base transmission rate ($\beta$) using the posterior for $\eta$ obtained from Carazo et al. [16], combined with an improper prior on $\beta$, as the prior, i.e. $\pi(\theta) = \pi_2(\eta)$, and again assuming $I \sim \mathrm{Gamma}(2, 2)$. Using this prior was necessary given the issues with identifiability found for the low-information case in Figure 3, but the resulting MCMC chains mixed well, as shown in the trace plots in Figures S5, S6 and S7. For each study we found census or similar data to inform $\alpha$ (in all cases, the data were size-weighted as discussed in Appendix C and we used $\alpha_0 = 200$). Each low-information dataset and source for $\alpha$ can be found in the supplementary CSV file.
Figure 4 shows the estimates and 95% confidence intervals for each of these studies, split by variant (where specified) and colour coded accordingly. The table on the right-hand side also shows the reported SAR, mean household size (including index cases) and number of households (N) for each study.

We successfully inferred transmission parameters from low-information datasets, combining prior information on η (in this case obtained from the high-information dataset in Carazo et al. [16]) and information on the household size distribution of the relevant populations in α. We were able to properly account for differences in household size and to better quantify the uncertainty compared to the binomial confidence intervals often used for SARs.

Figure 4: Base transmission rates (β) for SARS-CoV-2 estimated from low-information datasets from studies included in Madewell et al. [2, 42, 43], colour coded and grouped by strain where it was specified. A table on the right-hand side lists the SAR, average household size and number of households in each of these studies.

Discussion

In this paper we present a novel MCMC algorithm for estimating epidemiological parameters from household final size data of different levels of coarseness. This includes very coarse “low information” datasets which only report the total number of households, the total number of household contacts and the total number of secondary cases. Without using an inference method to impute a household size distribution, like the MCMC approach we present here, it would not be possible to fit a mechanistic model to datasets of this kind. We first fitted to low-information datasets derived from synthetically generated high-information datasets in order to assess the algorithm’s ability to recover the transmission rate, before applying the method to real data.
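For comparison, the binomial interval mentioned above treats the n household contacts of a low-information dataset (N, n, z) as independent trials and ignores household structure altogether. The short Python sketch below shows one common version of such an interval, the Wilson score interval; the choice of the Wilson form, the function name and the example counts are assumptions made for illustration and are not taken from any of the studies considered here.

```python
import math


def sar_with_wilson_interval(n_contacts, n_cases, z_score=1.959964):
    """Return the SAR and an approximate 95% Wilson score interval.

    Contacts are treated as independent Bernoulli trials, so clustering of
    cases within households is ignored.
    """
    sar = n_cases / n_contacts
    denom = 1 + z_score ** 2 / n_contacts
    centre = (sar + z_score ** 2 / (2 * n_contacts)) / denom
    half_width = (
        z_score
        * math.sqrt(
            sar * (1 - sar) / n_contacts
            + z_score ** 2 / (4 * n_contacts ** 2)
        )
        / denom
    )
    return sar, centre - half_width, centre + half_width


# Hypothetical low-information dataset: N households, n contacts, z cases.
N, n, z = 100, 230, 46
sar, lower, upper = sar_with_wilson_interval(n, z)
print(f"SAR = {sar:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

Note that N does not enter this calculation at all, which is exactly the information about household structure that the imputation approach is designed to use.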
Insights

For data generated from a realistic household size distribution (in this paper, from the UK Labour Force Survey, UK LFS) the algorithm reliably recovered the transmission rate β for datasets generated from a range of values of both β and η. For all 12 parameter combinations we considered, the true value of β was covered by the 95% confidence interval. However, the performance on data from an unrealistic “split” household size distribution, in which the majority of households had either 2 or 6 individuals (but the mean size of ≈ 3 was the same), varied significantly for different parameters.

The performance was particularly poor when α₀ = 100 (a relatively weak constraint on the household size distribution of the study from which a low-information dataset is collected). This is because the synthetic data were generated from the “split” household size distribution, but the MCMC was not strongly constrained to preserve that distribution and settled on a distribution that includes a significant number of households with sizes closer to the mean. Given the strong non-linear relationship between the transmission rate and the SAR for different household sizes (see Figure S2), the same value of β leads to a different SAR under the “split” distribution than under the distribution the MCMC settles on. In particular, for η = 0 (bottom left panel of Figure S2), the former is larger than the latter, so to fit the same SAR, a significantly larger β is required with the new distribution than with the original one, causing the observed overestimation. This discrepancy is strongest for values of β around 0.5, in line with what can be seen in Figure 2. The same qualitative discrepancy is seen for η = 0.5, though to a much lesser extent and reaching its maximum around β = 1.5, resulting in consistent overestimation for β = 1.5 and β = 2.0.
The discrepancy, instead, is minimal for η = 1 and, interestingly, is slightly reversed, which is compatible with the slight underestimation observed in this case in Figure 2.

Note that in all cases increasing the value of α₀ constrained the MCMC not to drift away from the “split” distribution, leading to much better results, with a minimum of 93% of confidence intervals containing the actual values of β.

These results demonstrate the importance of the household size distribution of the study, which is often not reported in low-information datasets. In the absence of that information, assumptions need to be made about such a distribution and about the confidence we have in it. However, such assumptions interact in non-trivial ways with the amount of person-to-person transmission and the assumed way in which transmission scales with household size. Nevertheless, although discrepancies emerge particularly strongly when mixing is close to being density-dependent (η close to 0), for respiratory diseases such as SARS-CoV-2, which is the example used in this paper, it is common to assume frequency-dependent mixing (η = 1). In this case, estimates appear to be much less affected by deviations from the true household size distribution in the studies the low-information dataset comes from.

We then investigated the difference in performance when estimating both β and η simultaneously from low-, medium- and high-information datasets from Carazo et al. [16]. The resulting posteriors on (β, η) demonstrate that at least medium-information datasets are necessary to confidently identify both parameters simultaneously, and that simply reporting low-information datasets is insufficient for transmission parameters to be estimated with any confidence in the absence of informative priors.
However, if detailed data on the final size of outbreaks in each household cannot be reported in full, then total contacts and cases stratified by household size could still be used to identify both the transmission rate and the density mixing parameter to almost the same level of accuracy.

Finally, we estimated β for a number of household studies which provided at least low-information datasets, taken from the Madewell et al. systematic reviews [2, 42, 43]. Only 25 of the 144 studies included in these reviews a) tried to identify co-primary cases, b) dealt with suspected co-primaries in a way that preserved the household sizes and c) reported sufficient detail so that a low-information dataset could be extracted or derived. Using existing inference methods, we would not be able to estimate transmission rates from these datasets. However, the framework presented in this paper allows us to consolidate these low-information final size datasets with household size data and prior information on the density mixing parameter to estimate transmission rates. Compared to the binomial confidence intervals often used for estimates of SAR, the error bars in Figure 4 more accurately show the uncertainty in our estimates.

Limitations

In order to fit a transmission model to low-information data we were required to keep the model very simple. One limitation comes from our assumption of a single primary case. Many household studies limit the period of follow-up in order to limit the risk of multiple independent introductions. However, multiple primary cases are not uncommon, e.g. when multiple residents attend social events together and so would be at risk of contracting a disease from a shared source.

One of the common uses of household-stratified transmission surveys is to estimate the relative susceptibility and infectivity associated with characteristics such as age, sex and ethnicity.
In this work, however, we limited ourselves to identical individuals, all mixing homogeneously within each household, both to simplify the model for testing and to ensure that g_J(D_J) was not too large. Future work could relax this assumption and perform a similar investigation to estimate the effects of risk factors on transmission.

Conclusion

In this paper we presented an MCMC algorithm that can estimate transmission rates from very coarse datasets of the type commonly reported in household transmission studies. We successfully recovered parameters from synthetic datasets and estimated transmission rates for COVID-19 under an assumption of frequency-dependent mixing from a number of household studies. However, we also demonstrated that, in the absence of prior information on the parameter controlling how person-to-person transmission varies with household size, low-information datasets are generally insufficient to resolve both this mixing parameter and the transmission rate. This result suggests that researchers conducting household transmission studies can greatly improve the usefulness of their results for parametrising mechanistic models by reporting contacts and cases stratified by household size and, if possible, full information on the frequency of each outcome.

Acknowledgments

TH, LP, and JH are supported by the Wellcome Trust (Grant Number 227438/Z/23/Z). TH and LP are also supported by the Medical Research Council (Grant Number UKRI483). JB is an affiliate, JH is a fellow and TH and LP are members of the JUNIPER partnership, which is supported by the Medical Research Council (grant number MR/X018598/1).

Code availability

All code used to implement the methods described in this paper and to generate all figures can be found in this GitHub repository.
A Formal Definitions of $g_J(D_J)$

Medium-Information Datasets

Let a medium-information dataset be encoded as a tuple of two vectors $D_M = (\mathbf{n}, \mathbf{z}) \in \mathbb{N}^m \times \mathbb{N}^m$, where $\mathbf{n} \geq \mathbf{z}$ element-wise. We can then define a function $g_M : \mathbb{N}^m \times \mathbb{N}^m \to \mathcal{P}(\mathbb{N}^K)$ (where $\mathcal{P}(S)$ denotes the power set of a set $S$) such that $g_M(D_M)$ is the set of high-information datasets compatible with the medium-information dataset $D_M$. For the purpose of a formal definition of $g_M$ we define two matrices $A, B \in \mathbb{N}^{m \times K}$, where

\[
A_{i,k} =
\begin{cases}
i & \text{if } f(i,0) \leq k \leq f(i,i) \\
0 & \text{otherwise}
\end{cases}
\tag{14}
\]

\[
B_{i,k} =
\begin{cases}
k - f(i,0) & \text{if } f(i,0) \leq k \leq f(i,i) \\
0 & \text{otherwise}
\end{cases}
\tag{15}
\]

Then

\[
g_M((\mathbf{n}, \mathbf{z})) = \{ C \in \mathbb{N}^K : (AC = \mathbf{n}) \wedge (BC = \mathbf{z}) \}
\tag{16}
\]

Similar to the high-information case, since $\sum_{k=f(n,0)}^{f(n,n)} C_k$ is the same for every $C \in g_M((\mathbf{n}, \mathbf{z}))$, for each $n = 1, \dots, m$, the choice of $\alpha$ does not affect the likelihood.

Low-information Datasets

Let a low-information dataset be encoded by three integers in a 3-tuple, denoted $D_L = (N, n, z) \in \mathbb{N}^3$, where $z \leq n$ and $N \leq n \leq mN$. Here $N$ denotes the number of households, $n$ the total number of contacts and $z$ the total number of cases. We can then again define a function $g_L : \mathbb{N}^3 \to \mathcal{P}(\mathbb{N}^K)$, where $g_L(D_L)$ is the set of high-information datasets compatible with the low-information dataset. We define $g_L(D_L)$ similarly to how we defined $g_M(D_M)$. We define two vectors, $\mathbf{w}$ and $\mathbf{v} \in \mathbb{N}^K$, such that $w_k = f_n(k)$ and $v_k = f_z(k)$, and so

\[
g_L((N, n, z)) = \{ C \in \mathbb{N}^K : (\langle C, \mathbf{w} \rangle = n) \wedge (\langle C, \mathbf{v} \rangle = z) \wedge (\langle C, \mathbf{1} \rangle = N) \}
\tag{17}
\]

B Proposal algorithms for $q_M$ and $q_L$

Algorithm 1 Low-information new frequency count pseudo-dataset proposal algorithm

Require: $C = (C_0, \dots, C_K) \in \mathbb{N}^{K-1}$

1: Calculate the total number of households with at least 2 contacts,
\[ S_1 = \sum_{\{k \,:\, f_n(k) \geq 2\}} C_k . \]
2: Calculate a probability distribution proportional to the number of households with each outcome and conditioned on there being at least 2 contacts,
\[ P^1 = \left(0, 0, \frac{C_2}{S_1}, \dots, \frac{C_K}{S_1}\right) . \]
3: Sample an index $k_1$ from $P^1$ and denote $f_n(k_1)$ and $f_z(k_1)$ by $n_1$ and $z_1$ respectively.
4: Uniformly draw a random number $u_{\mathrm{inf}} \in [0, 1]$.
5: If $u_{\mathrm{inf}} \leq z_1 / n_1$ then an infected contact is chosen and we set $s = 1$; else set $s = 0$.
6: Calculate the total number of households with fewer than $m$ contacts,
\[ S_2 = \sum_{\{k \,:\, f_n(k) \leq (m-1)\}} c_k . \]
7: Calculate another probability vector $P^2$ conditioned on there being fewer than the maximum number of contacts such that, for $k = 0, \dots, K$:
\[
P^2_k =
\begin{cases}
\dfrac{c_k}{S_2 - 1} & \text{if } f_n(k) < m \text{ and } k \neq k_1 \\[2ex]
\dfrac{c_k - 1}{S_2 - 1} & \text{if } k = k_1 \\[2ex]
0 & \text{otherwise}
\end{cases}
\tag{18}
\]
8: Draw a second index, $k_2$, from $P^2$ and denote $f_n(k_2)$ and $f_z(k_2)$ by $n_2$ and $z_2$ respectively.
9: Define a new frequency count partition $C^* = (c^*_0, \dots, c^*_K)$ by starting with a copy of the original frequency count partition, $C$.
10: Update the entries $c^*_{k_1}$ and $c^*_{k_2}$ in turn by subtracting 1. Note it could be that $k_1 = k_2$, in which case we subtract 2 from $c^*_{k_1}$ in this step.
11: Define $k_3 = f(n_1 - 1, z_1 - s)$ and $k_4 = f(n_2 + 1, z_2 + s)$.
12: Update the entries $c^*_{k_3}$ and $c^*_{k_4}$ in turn by adding 1. Note that it is possible that $k_3 = k_2$ and/or $k_4 = k_1$, in which case the entries at those indices are unchanged.

Algorithm 2 Medium-information new partition proposal algorithm

Require: $C = (c_0, \dots, c_K)$

1: Calculate the total number of cases in households with at least 2 contacts,
\[ S_1 = \sum_{k=2}^{K} c_k f_z(k) . \]
2: Calculate a probability distribution proportional to the number of cases in households with each outcome and conditioned on there being at least 2 contacts,
\[ P^1 = \left(0, 0, 0, \frac{c_3}{S_1}, \frac{2 c_4}{S_1}, \dots, \frac{m c_K}{S_1}\right) . \]
3: Sample an index $k_1$ from $P^1$. Denote $f_n(k_1)$ and $f_z(k_1)$ by $n_1$ and $z_1$ respectively.
4: Calculate the number of non-case contacts in households with $n_1$ contacts,
\[ S_2 = \sum_{k=f(n_1,0)}^{f(n_1,n_1)} (n_1 - f_z(k)) \, c_k . \]
5: Calculate another probability vector $P^2$ conditioned on there being $n_1$ contacts and weighted by the number of non-case contacts. For $k = 0, \dots, K$:
\[
P^2_k =
\begin{cases}
\dfrac{c_k (n_1 - f_z(k))}{S_2 - 1} & \text{if } f_n(k) = n_1 \text{ and } k \neq k_1 \\[2ex]
\dfrac{(c_k - 1)(n_1 - f_z(k))}{S_2 - 1} & \text{if } k = k_1 \\[2ex]
0 & \text{otherwise}
\end{cases}
\tag{19}
\]
6: Draw a second index, $k_2$, from $P^2$.
7: Define a new frequency count partition $C^* = (c^*_0, \dots, c^*_K)$ by starting with a copy of the original frequency count partition, $C$.
8: Update the entries $c^*_{k_1}$ and $c^*_{k_2}$ in turn by subtracting 1. Note it could be that $k_1 = k_2$, in which case we subtract 2 from $c^*_{k_1}$ in this step.
9: Update the entries $c^*_{k_1 - 1}$ and $c^*_{k_2 + 1}$ in turn by adding 1.

C Household size distributions

We generated synthetic data for two very different household size distributions: that of the UK Labour Force Survey (LFS) from 2023 and an unrealistic “split” distribution in which most households are composed of either 2 or the maximum number (6) of individuals. In the former case, we took the UK LFS household size distribution, removed the probability of selecting a household of size 1 and re-normalised the distribution, obtaining the probability $P_k^{\mathrm{UK}}$ that a randomly selected household in our population is of size $k$ ($k = 2, \dots, 6$). We then derived the size-weighted distribution, i.e. we made the probability of a household included in the dataset having $k$ contacts proportional to $k P_k^{\mathrm{UK}}$.
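The size-weighting step just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the paper: the function name is hypothetical and the example probabilities are made-up values, not the UK LFS figures.

```python
def size_weighted_distribution(prob_by_size):
    """Drop size-1 households, renormalise, then weight each size k by k."""
    # Remove single-person households and renormalise the remainder.
    kept = {k: p for k, p in prob_by_size.items() if k >= 2}
    kept_total = sum(kept.values())
    renormalised = {k: p / kept_total for k, p in kept.items()}

    # Make the probability of size k proportional to k * P_k and normalise.
    weighted_total = sum(k * p for k, p in renormalised.items())
    return {k: k * p / weighted_total for k, p in renormalised.items()}


# Hypothetical household size distribution (illustrative values only).
example = {1: 0.30, 2: 0.35, 3: 0.15, 4: 0.13, 5: 0.05, 6: 0.02}
weighted = size_weighted_distribution(example)
mean_size = sum(k * p for k, p in weighted.items())
print({k: round(p, 3) for k, p in weighted.items()}, round(mean_size, 2))
```

Size-weighting shifts probability mass towards larger households, so the mean of the weighted distribution is larger than that of the renormalised one.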
The size-weighted distribution is the household distribution we would expect to see in our study if an infection spreading between households could be described well by a homogeneously mixing model in the general population, so that each individual in the population is equally likely to be infected, and further assuming the same probability for each case to be detected and have their household included in the study. The split distribution was selected so that the mean of its size-weighted distribution was the same as that of the size-weighted distribution based on the UK LFS. See Figure S1 for the size-weighted distributions used.

D Procedure for selecting appropriate papers from Madewell et al. [2]

From the full list of papers that estimated secondary attack rate in Madewell et al. [2], we first filtered out papers so that the remaining ones were compatible with the model assumptions and allowed us to extract a low-information dataset, according to the following criteria:

1. In each household, all contacts must have been tested or screened for symptoms;
2. There was a procedure for identifying the primary and co-primary cases, e.g. based on symptom onset or epidemiological investigation;
3. If co-primary cases were suspected, those individuals were not removed without the rest of the household being removed, as this would have left the household partially depleted;
4. There was sufficient information reported to deduce a total number of contacts, cases and households.

Papers that received ad-hoc consideration were the following:

• Hsu et al. [27]: SAR was reported for family clusters, not households. Here, the work was included, with clusters treated as households.
• Singanayagam et al. [44]: SARs were reported for 127 households and 138 index cases. It was not possible to disentangle the contacts in households with a single primary case from the information provided and so this paper was not included.
• Dub et al.
[38]: Different SARs, based on PCR testing and symptoms, were reported. We included the data from PCR testing.
• Tanaka et al. [45]: The number of households with a single primary case in the study was not clear and so this paper was not included.
• Park et al. [46]: The number of households was not explicitly stated and so was estimated from the mean household size and total number of contacts as 225/2.3 ≈ 98 households. The paper was included.

References

[1] Joël Mossong et al. “Social contacts and mixing patterns relevant to the spread of infectious diseases”. In: PLoS medicine 5.3 (2008), e74.
[2] Zachary J. Madewell et al. “Household Secondary Attack Rates of SARS-CoV-2 by Variant and Vaccination Status: An Updated Systematic Review and Meta-analysis”. In: JAMA Network Open 5.4 (Apr. 2022), e229317. issn: 2574-3805. doi: 10.1001/jamanetworkopen.2022.9317. url: https://doi.org/10.1001/jamanetworkopen.2022.9317 (visited on 12/08/2023).
[3] Camino Trobajo-Sanmartín et al. “Differences in Transmission between SARS-CoV-2 Alpha (B.1.1.7) and Delta (B.1.617.2) Variants”. In: Microbiology Spectrum 10.2 (Apr. 2022), e00008–22. doi: 10.1128/spectrum.00008-22. url: https://journals.asm.org/doi/full/10.1128/spectrum.00008-22 (visited on 03/06/2025).
[4] Binu Areekal et al. “Risk Factors, Epidemiological and Clinical Outcome of Close Contacts of COVID-19 Cases in a Tertiary Hospital in Southern India”. In: Journal of Clinical and Diagnostic Research (2021). issn: 2249782X. doi: 10.7860/JCDR/2021/48059.14664. url: https://jcdr.net/article_fulltext.asp?issn=0973-709x&year=2021&volume=15&issue=3&page=LC34&issn=0973-709x&id=14664 (visited on 09/02/2024).
[5] Hafizuddin Awang et al.
“A case-control study of determinants for COVID-19 infection based on contact tracing in Dungun district, Terengganu state of Malaysia”. In: Infectious Diseases (London, England) 53.3 (Mar. 2021), pp. 222–225. issn: 2374-4243. doi: 10.1080/23744235.2020.1857829.
[6] Frank Ball, Lorenzo Pellis, and Pieter Trapman. “Reproduction numbers for epidemic models with households and other social structures II: Comparisons and implications for vaccination”. In: Mathematical Biosciences 274 (Apr. 2016), pp. 108–139. issn: 0025-5564. doi: 10.1016/j.mbs.2016.01.006. url: https://www.sciencedirect.com/science/article/pii/S0025556416000171 (visited on 03/11/2026).
[7] Lorenzo Pellis, Neil M. Ferguson, and Christophe Fraser. “Epidemic growth rate and household reproduction number in communities of households, schools and workplaces”. In: Journal of Mathematical Biology 63.4 (Oct. 2011), pp. 691–734. issn: 1432-1416. doi: 10.1007/s00285-010-0386-0. url: https://doi.org/10.1007/s00285-010-0386-0 (visited on 03/11/2026).
[8] Joe Hilton et al. “A computational framework for modelling infectious disease policy based on age and household structure with applications to the COVID-19 pandemic”. In: PLOS Computational Biology 18.9 (Sept. 2022), e1010390. issn: 1553-7358. doi: 10.1371/journal.pcbi.1010390. url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010390 (visited on 03/11/2026).
[9] Thomas House et al. “Inferring risks of coronavirus transmission from community household data”. In: Statistical Methods in Medical Research 31.9 (Sept. 2022), pp. 1738–1756. issn: 0962-2802. doi: 10.1177/09622802211055853. url: https://doi.org/10.1177/09622802211055853 (visited on 09/22/2023).
[10] Cheryl Lynn Addy. “The final size distribution for a generalized stochastic epidemic”. Ph.D. United States – Georgia: Emory University.
isbn: 979-8-206-78084-0. url: https://www.proquest.com/docview/303645195/abstract/5DC01521970942ACPQ/1 (visited on 04/02/2024).
[11] Philip D. O’Neill and G. O. Roberts. “Bayesian Inference for Partially Observed Stochastic Epidemics”. In: Journal of the Royal Statistical Society Series A: Statistics in Society 162.1 (Jan. 1999), pp. 121–129. issn: 0964-1998. doi: 10.1111/1467-985X.00125. url: https://doi.org/10.1111/1467-985X.00125 (visited on 03/11/2026).
[12] Chris P. Jewell et al. “Bayesian analysis for emerging infectious diseases”. In: Bayesian Analysis 4.3 (Sept. 2009), pp. 465–496. issn: 1936-0975, 1931-6690. doi: 10.1214/09-BA417. url: https://projecteuclid.org/journals/bayesian-analysis/volume-4/issue-3/Bayesian-analysis-for-emerging-infectious-diseases/10.1214/09-BA417.full (visited on 03/11/2026).
[13] Colin J. Worby et al. “Reconstructing transmission trees for communicable diseases using densely sampled genetic data”. In: The Annals of Applied Statistics 10.1 (Mar. 2016), pp. 395–417. issn: 1932-6157, 1941-7330. doi: 10.1214/15-AOAS898. url: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-10/issue-1/Reconstructing-transmission-trees-for-communicable-diseases-using-densely-sampled-genetic/10.1214/15-AOAS898.full (visited on 03/11/2026).
[14] Panayiota Touloupou et al. “Bayesian inference for multistrain epidemics with application to ESCHERICHIA COLI O157:H7 in feedlot cattle”. In: The Annals of Applied Statistics 14.4 (Dec. 2020), pp. 1925–1944. issn: 1932-6157, 1941-7330. doi: 10.1214/20-AOAS1366. url: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-4/Bayesian-inference-for-multistrain-epidemics-with-application-to-ESCHERICHIA-COLI/10.1214/20-AOAS1366.full (visited on 12/16/2024).
[15] ONS.
Families and households statistics explained - Office for National Statistics. May 2024. url: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/families/articles/familiesandhouseholdsstatisticsexplained/2021-03-02 (visited on 06/18/2024).
[16] Sara Carazo et al. “Characterization and evolution of infection control practices among severe acute respiratory coronavirus virus 2 (SARS-CoV-2)–infected healthcare workers in acute-care hospitals and long-term care facilities in Québec, Canada, Spring 2020”. In: Infection Control & Hospital Epidemiology 43.4 (Apr. 2021), pp. 481–489. issn: 0899-823X, 1559-6834. doi: 10.1017/ice.2021.160. url: https://www.cambridge.org/core/journals/infection-control-and-hospital-epidemiology/article/characterization-and-evolution-of-infection-control-practices-among-severe-acute-respiratory-coronavirus-virus-2-sarscov2infected-healthcare-workers-in-acutecare-hospitals-and-longterm-care-facilities-in-quebec-canada-spring-2020/895664B9DFCE12450BEBB56958336393 (visited on 02/02/2024).
[17] S. Cauchemez et al. “A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data”. In: Statistics in Medicine 23.22 (2004). eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.1912, pp. 3469–3487. issn: 1097-0258. doi: 10.1002/sim.1912. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.1912 (visited on 09/17/2025).
[18] Janis Antonovics, Yoh Iwasa, and Michael P. Hassell. “A Generalized Model of Parasitoid, Venereal, and Vector-Based Transmission Processes”. In: The American Naturalist 145.5 (1995), pp. 661–675. issn: 0003-0147. url: https://www.jstor.org/stable/2462994 (visited on 03/24/2026).
[19] Mart C. M de Jong, Odo Diekmann, and Hans Heesterbeek.
“How Does Transmission of Infection Depend on Population Size?” In: Epidemic models: their structure and relation to data. Cambridge University Press, 1995, pp. 84–94.
[20] Hamish McCallum, Nigel Barlow, and Jim Hone. “How should pathogen transmission be modelled?” In: Trends in Ecology & Evolution 16.6 (June 2001), pp. 295–300. issn: 0169-5347. doi: 10.1016/S0169-5347(01)02144-9. url: https://www.cell.com/trends/ecology-evolution/abstract/S0169-5347(01)02144-9 (visited on 03/24/2026).
[21] C. J. Rhodes and R. M. Anderson. “Contact rate calculation for a basic epidemic model”. In: Mathematical Biosciences 216.1 (Nov. 2008), pp. 56–62. issn: 0025-5564. doi: 10.1016/j.mbs.2008.08.007. url: https://www.sciencedirect.com/science/article/pii/S0025556408001260 (visited on 03/24/2026).
[22] Frank Ball. “A Unified Approach to the Distribution of Total Size and Total Area under the Trajectory of Infectives in Epidemic Models”. In: Advances in Applied Probability 18.2 (1986), pp. 289–310. issn: 0001-8678. doi: 10.2307/1427301. url: https://www.jstor.org/stable/1427301 (visited on 12/07/2023).
[23] Thomas House, Joshua V. Ross, and David Sirl. “How big is an outbreak likely to be? Methods for epidemic final-size calculation”. In: Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 469.2150 (Feb. 2013), p. 20120436. doi: 10.1098/rspa.2012.0436. url: https://royalsocietypublishing.org/doi/full/10.1098/rspa.2012.0436 (visited on 04/11/2024).
[24] Mokhtar R. Gomaa et al. “Incidence, household transmission, and neutralizing antibody seroprevalence of Coronavirus Disease 2019 in Egypt: Results of a community-based cohort”. In: PLOS Pathogens 17.3 (Mar. 2021), e1009413. issn: 1553-7374. doi: 10.1371/journal.ppat.1009413.
url: https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1009413 (visited on 09/03/2024).
[25] David Chun-Ern Ng et al. “Risk factors associated with household transmission of SARS-CoV-2 in Negeri Sembilan, Malaysia”. In: Journal of Paediatrics and Child Health 58.5 (May 2022), pp. 769–773. issn: 1440-1754. doi: 10.1111/jpc.15821.
[26] Natcha Watanapokasin et al. “Transmissibility of SARS-CoV-2 Variants as a Secondary Attack in Thai Households: a Retrospective Study”. In: IJID regions 1 (Dec. 2021), pp. 1–2. issn: 2772-7076. doi: 10.1016/j.ijregi.2021.09.001.
[27] Chen-Yang Hsu et al. “Household transmission but without the community-acquired outbreak of COVID-19 in Taiwan”. In: Journal of the Formosan Medical Association. COVID-19 Pandemic: Challenges and Opportunities 120 (June 2021), S38–S45. issn: 0929-6646. doi: 10.1016/j.jfma.2021.04.021. url: https://www.sciencedirect.com/science/article/pii/S0929664621001789 (visited on 02/09/2024).
[28] Nathaniel M Lewis et al. “Household Transmission of SARS-CoV-2 in the United States”. In: Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America (Aug. 2020), ciaa1166. issn: 1058-4838. doi: 10.1093/cid/ciaa1166. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7454394/ (visited on 09/03/2024).
[29] Janneke D. M. Verberk et al. “Transmission of SARS-CoV-2 within households: a remote prospective cohort study in European countries”. In: European Journal of Epidemiology 37.5 (May 2022), pp. 549–561. issn: 1573-7284. doi: 10.1007/s10654-022-00870-9. url: https://doi.org/10.1007/s10654-022-00870-9 (visited on 03/06/2025).
[30] Julia M. Baker et al. “SARS-CoV-2 B.1.1.529 (Omicron) Variant Transmission Within Households - Four U.S. Jurisdictions, November 2021–February 2022”.
In: Morbidity and Mortality Weekly Report 71.9 (Mar. 2022), pp. 341–346. issn: 0149-2195. doi: 10.15585/mmwr.mm7109e1. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8893332/ (visited on 12/11/2023).
[31] Neda Jalali et al. “Increased household transmission and immune escape of the SARS-CoV-2 Omicron compared to Delta variants”. In: Nature Communications 13.1 (Sept. 2022), p. 5706. issn: 2041-1723. doi: 10.1038/s41467-022-33233-9. url: https://www.nature.com/articles/s41467-022-33233-9 (visited on 03/07/2025).
[32] Oon Tek Ng et al. “Impact of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Vaccination and Pediatric Age on Delta Variant Household Transmission”. In: Clinical Infectious Diseases 75.1 (July 2022), e35–e43. issn: 1058-4838. doi: 10.1093/cid/ciac219. url: https://doi.org/10.1093/cid/ciac219 (visited on 03/06/2025).
[33] Koen M. F. Gorgels et al. “Increased transmissibility of SARS-CoV-2 alpha variant (B.1.1.7) in children: three large primary school outbreaks revealed by whole genome sequencing in the Netherlands”. In: BMC Infectious Diseases 22.1 (Aug. 2022), p. 713. issn: 1471-2334. doi: 10.1186/s12879-022-07623-9.
[34] Hideo Tanaka et al. “Increased Transmissibility of the SARS-CoV-2 Alpha Variant in a Japanese Population”. In: International Journal of Environmental Research and Public Health 18.15 (July 2021), p. 7752. issn: 1660-4601. doi: 10.3390/ijerph18157752.
[35] Mehgan F Teherani et al. “Burden of Illness in Households With Severe Acute Respiratory Syndrome Coronavirus 2–Infected Children”. In: Journal of the Pediatric Infectious Diseases Society 9.5 (Nov. 2020), pp. 613–616. issn: 2048-7207. doi: 10.1093/jpids/piaa097. url: https://doi.org/10.1093/jpids/piaa097 (visited on 03/07/2025).
[36] Varun Sundar and Emmanuel Bhaskar.
“Low secondary transmission rates of SARS-CoV-2 infection among contacts of construction laborers at open air environment”. In: Germs 11.1 (Mar. 2021), pp. 128–131. issn: 2248-2997. doi: 10.18683/germs.2021.1250.
[37] Xiaoli Wang et al. “Basic epidemiological parameter values from data of real-world in mega-cities: the characteristics of COVID-19 in Beijing, China”. In: BMC Infectious Diseases 20.1 (July 2020), p. 526. issn: 1471-2334. doi: 10.1186/s12879-020-05251-9. url: https://doi.org/10.1186/s12879-020-05251-9 (visited on 09/03/2024).
[38] Timothée Dub et al. “High secondary attack rate and persistence of SARS-CoV-2 antibodies in household transmission study participants, Finland 2020-2021”. In: Frontiers in Medicine 9 (2022), p. 876532. issn: 2296-858X. doi: 10.3389/fmed.2022.876532.
[39] Rayhaneh Jashaninejad et al. “Transmission of COVID-19 and its Determinants among Close Contacts of COVID-19 Patients Running title”. In: Journal of Research in Health Sciences 21.2 (Apr. 2021), e00514. issn: 2228-7809. doi: 10.34172/jrhs.2021.48.
[40] Tsuyoshi Ogata et al. “Increased Secondary Attack Rates among the Household Contacts of Patients with the Omicron Variant of the Coronavirus Disease 2019 in Japan”. In: International Journal of Environmental Research and Public Health 19.13 (Jan. 2022). Number: 13, p. 8068. issn: 1660-4601. doi: 10.3390/ijerph19138068. url: https://www.mdpi.com/1660-4601/19/13/8068 (visited on 03/06/2025).
[41] Miren Remón-Berrade et al. “Risk of Secondary Household Transmission of COVID-19 from Health Care Workers in a Hospital in Spain”. In: Epidemiologia (Basel, Switzerland) 3.1 (Dec. 2021), pp. 1–10. issn: 2673-3986. doi: 10.3390/epidemiologia3010001.
[42] Zachary J. Madewell et al. “Household Transmission of SARS-CoV-2: A Systematic Review and Meta-analysis”. In: JAMA Network Open 3.12 (Dec.
2020), e2031756. issn: 2574-3805. doi: 10.1001/jamanetworkopen.2020.31756. url: https://doi.org/10.1001/jamanetworkopen.2020.31756 (visited on 12/08/2023).

[43] Zachary J. Madewell et al. “Factors Associated With Household Transmission of SARS-CoV-2: An Updated Systematic Review and Meta-analysis”. In: JAMA Network Open 4.8 (Aug. 2021), e2122240. issn: 2574-3805. doi: 10.1001/jamanetworkopen.2021.22240. url: https://doi.org/10.1001/jamanetworkopen.2021.22240 (visited on 12/08/2023).

[44] Anika Singanayagam et al. “Community transmission and viral load kinetics of the SARS-CoV-2 delta (B.1.617.2) variant in vaccinated and unvaccinated individuals in the UK: a prospective, longitudinal, cohort study”. English. In: The Lancet Infectious Diseases 22.2 (Feb. 2022), pp. 183–195. issn: 1473-3099, 1474-4457. doi: 10.1016/S1473-3099(21)00648-4. url: https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(21)00648-4/fulltext (visited on 12/31/2023).

[45] Melissa Lucero Tanaka et al. “SARS-CoV-2 Transmission Dynamics in Households With Children, Los Angeles, California”. In: Frontiers in Pediatrics 9 (2022). issn: 2296-2360. url: https://www.frontiersin.org/articles/10.3389/fped.2021.752993 (visited on 12/31/2023).

[46] Shin Young Park et al. “Coronavirus Disease Outbreak in Call Center, South Korea”. In: Emerging Infectious Diseases 26.8 (Aug. 2020), pp. 1666–1670. issn: 1080-6040. doi: 10.3201/eid2608.201274. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7392450/ (visited on 09/03/2024).

Supplementary plots

Figure S1: Size-weighted household size distributions that were used to generate synthetic data. The distribution for the UK LFS (2023) [15] and the “split” distribution are shown in red and blue, respectively.
Both distributions share a mean final size, after size weighting, of 3.32, which is shown by the vertical black dotted line. For the purpose of this paper, households reported to have 6+ individuals in the UK LFS are assumed to have size exactly 6.

β      η      UK                  Split
0.2    0      22.2(19.9,24.1)%    25.3(22.4,27.9)%
0.2    0.5    13.2(11.7,14.6)%    13.3(11.9,14.9)%
0.2    1      8.6(7.3,10.0)%      8.2(7.0,9.4)%
0.5    0      50.2(47.3,52.7)%    56.9(54.1,60.0)%
0.5    0.5    32.6(29.9,34.7)%    33.4(31.0,36.2)%
0.5    1      21.0(19.3,22.9)%    19.6(17.6,21.5)%
1.5    0      84.1(81.9,86.2)%    86.7(84.6,88.4)%
1.5    0.5    71.6(68.6,73.9)%    74.3(71.6,76.8)%
1.5    1      52.5(49.8,55.1)%    50.8(47.5,54.4)%
2.0    0      80.3(77.6,82.6)%    90.5(88.8,91.9)%
2.0    0.5    80.3(77.6,82.6)%    82.5(80.1,85.0)%
2.0    1      63.3(60.7,66.5)%    62.1(59.0,66.0)%

Table S1: The secondary attack rates (SAR) from the synthetic datasets for each value of β, η and household size distribution. These results are obtained for an infectious period distribution I ∼ Gamma(2, 2).

Figure S2: In this figure we compare the mean final size (top row) and mean SAR (bottom row) for households of different sizes (2–6) for different transmission rates (β, shown on the x-axis). Vertical dashed lines are shown at the values of β considered in Figure 2. Each row shows results for a different value of η: η = 0 (density-dependent mixing), η = 0.5, and η = 1 (frequency-dependent mixing). We also show the mean final size/SAR for two different household size distributions: the “split” distribution shown in Figure S1, and the household size distribution that the MCMC chain resulting from the data augmentation process settles on when fitted to the low-information data. The latter was derived by averaging over the chains obtained from fitting the datasets generated from η = 1, β = 0.5, the “split” distribution and α_0 = 100. This mean household size distribution is similar for other values of η and β.
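The size weighting referred to in the caption of Figure S1 can be illustrated with a minimal sketch. It assumes the usual convention that the size-weighted probability of size n is proportional to n times the unweighted probability, i.e. the distribution of the household size of a randomly chosen individual. The function name `size_weight` and the input probabilities are hypothetical and purely illustrative; they are not the UK LFS or “split” distributions used in the paper.

```python
def size_weight(probs_by_size):
    """Return the size-weighted distribution q(n) proportional to n * p(n).

    probs_by_size maps household size n to the probability p(n) that a
    randomly chosen household has size n.
    """
    total = sum(n * p for n, p in probs_by_size.items())
    return {n: n * p / total for n, p in probs_by_size.items()}


# Hypothetical household size distribution for sizes 1 to 6
# (illustrative values only, not taken from the paper).
probs = {1: 0.30, 2: 0.35, 3: 0.15, 4: 0.13, 5: 0.05, 6: 0.02}

weighted = size_weight(probs)

# Mean household size over households, and over individuals (size-weighted).
mean_unweighted = sum(n * p for n, p in probs.items())
mean_weighted = sum(n * q for n, q in weighted.items())

print(f"mean household size (per household):  {mean_unweighted:.2f}")
print(f"mean household size (per individual): {mean_weighted:.2f}")
```

Under this convention the size-weighted mean equals E[n²]/E[n], so it is never smaller than the unweighted mean.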
Figure S3: Panels A, B, and C show the posterior distributions of θ obtained by fitting the low- (red), medium- (green), and high-information (blue) versions of the data from Carazo et al. [16], respectively. Panel D displays the secondary attack rate (SAR) implied by each posterior alongside the observed SAR (black) for each household size and overall; error bars represent 95% confidence intervals, with those for the observed SAR estimated by bootstrapping. Bars indicate the number of households of each size in the low-information fit (red) compared with the observed distribution (black). The infectious period is assumed to be I ∼ Exponential(1).

Figure S4: Panels A, B, and C show the posterior distributions of θ obtained by fitting the low- (red), medium- (green), and high-information (blue) versions of the data from Carazo et al. [16], respectively. Panel D displays the secondary attack rate (SAR) implied by each posterior alongside the observed SAR (black) for each household size and overall; error bars represent 95% confidence intervals, with those for the observed SAR estimated by bootstrapping. Bars indicate the number of households of each size in the low-information fit (red) compared with the observed distribution (black). The infectious period is assumed to be fixed to I = 1.

Figure S5: Trace plots for β for each of the papers reporting low-information datasets taken from Madewell et al. (datasets 1-10). Each plot is colour-coded by the variant, as in Figure 4: blue for pre-Alpha, green for Alpha, orange for Delta, red for Omicron and black for Unspecified/Other. Note that the chain lengths (x-axis ranges) vary significantly between panels and that all chains mix well: those that appear not to mix well are just much shorter than the others and correspond to studies with small numbers of households.
Figure S6: Trace plots for β for each of the papers reporting low-information datasets taken from Madewell et al. (datasets 11-20). Each plot is colour-coded by the variant, as in Figure 4: blue for pre-Alpha, green for Alpha, orange for Delta, red for Omicron and black for Unspecified/Other. Note that the chain lengths (x-axis ranges) vary significantly between panels and that all chains mix well: those that appear not to mix well are just much shorter than the others and correspond to studies with small numbers of households.

Figure S7: Trace plots for β for each of the papers reporting low-information datasets taken from Madewell et al. (datasets 21-28). Each plot is colour-coded by the variant, as in Figure 4: blue for pre-Alpha, green for Alpha, orange for Delta, red for Omicron and black for Unspecified/Other. Note that the chain lengths (x-axis ranges) vary significantly between panels and that all chains mix well: those that appear not to mix well are just much shorter than the others and correspond to studies with small numbers of households.
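The captions of Figures S3 and S4 state that confidence intervals for the observed SAR were estimated by bootstrapping, without giving the procedure. The following is a minimal sketch of one common approach, not necessarily the one used in the paper. It assumes a single index case per household, a pooled SAR (secondary cases divided by household contacts), resampling at the household level, and a percentile interval. The function names and the example data are hypothetical.

```python
import random


def pooled_sar(households):
    """Pooled SAR: secondary cases / household contacts of the index case.

    households is a list of (size, n_infected) pairs with size >= 2 and
    n_infected >= 1; one infected member is assumed to be the index case.
    """
    contacts = sum(size - 1 for size, _ in households)
    secondary = sum(infected - 1 for _, infected in households)
    return secondary / contacts


def bootstrap_sar_ci(households, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling whole households."""
    rng = random.Random(seed)
    stats = sorted(
        pooled_sar([rng.choice(households) for _ in households])
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical final size data: (household size, number infected).
data = (
    [(2, 1)] * 10 + [(2, 2)] * 5
    + [(3, 1)] * 6 + [(3, 2)] * 4
    + [(4, 2)] * 5 + [(4, 4)] * 3
    + [(5, 3)] * 2
)

sar = pooled_sar(data)
ci_low, ci_high = bootstrap_sar_ci(data)
print(f"SAR = {sar:.3f}, 95% bootstrap interval = ({ci_low:.3f}, {ci_high:.3f})")
```

Resampling households rather than individuals keeps the within-household correlation of outcomes intact in each bootstrap replicate, which is the reason for that design choice here.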
