Bayesian Inference for Epidemic Final Size Datasets with Hidden Underlying Household Structure
Households represent a key unit of interest in infectious disease epidemiology, in both empirical studies and mathematical modelling. The within-household transmission potential of a disease is often summarised by a secondary attack ratio (SAR). Desp…
Authors: Joseph Brooks, Thomas House, Lorenzo Pellis
Ba y esian Inference for Epidemic Final Size Datasets with Hidden Underlying Household Structure Joseph Bro oks 1 , Thomas House 1 , Lorenzo P ellis 1 , and Jo e Hilton 1,2 1 Departmen t of Mathematics, Universit y of Manc hester, Manc hester, United Kingdom 2 Manc hester Centre for Health Economics, Univ ersit y of Manc hester, Manc hester, United Kingdom 1 Abstract Households represen t a k ey unit of in terest in infectious disease epi- demiology , in b oth empirical studies and mathematical modelling. The within-household transmission p oten tial of an infectious disease is often summarised in terms of a secondary attac k ratio, the prop ortion of an index case’s household contacts who become infected during the index case’s infectious p eriod. Despite its widespread use, the secondary at- tac k ratio dep ends on the distribution of household comp ositions seen during the study perio d, making the information it con v eys difficult to generalise to new p opulations and differen t contexts. Extending es- timates of transmission p oten tial to new p opulations instead requires estimates of person-to-p erson transmission rates which can be con vo- luted with data on p opulation structure to parametrise mechanistic transmission mo dels. In this study w e present a new Bay esian inference metho d which uses an MCMC algorithm to infer the transmission in tensit y b y im- puting the unrep orted household structure underlying the epidemic dynamics. This metho d can be run on household epidemiological data rep orted at v arying levels of resolution. F or synthetic data from a re- alistic underlying household size distribution, we were able to achiev e o ver 95% co verage in our estimates of transmission rate o v er a range of transmission in tensity scenarios. The metho d w as also able to consis- ten tly ac hiev e o v er 95% cov erage for data generated with a patholog- ical underlying household size distribution, giv en strong information ab out the household size distribution. Using an existing dataset which recorded micro-scale household epidemiological outcomes during the CO VID-19 pandemic, we sho w that at least stratifying observed sec- ondary attac k ratios b y household size substantially reduces the uncer- tain ty in transmission parameter estimates. Our findings suggest that researc hers conducting household epidemiological studies can drasti- cally impro ve the utilit y of their results to infectious disease mo dellers b y rep orting household-stratified estimates, without needing to rep ort full micro-level transmission outcomes. These results aim to encour- age the reporting of higher resolution outputs in future epidemiological field work as, in the absence of strong priors, the transmission param- eters were not easily identifiable from low resolution datasets, whic h are commonly rep orted. 2 In tro duction Households are imp ortan t units of study in infectious disease epidemiol- ogy , b oth as sites of prolonged intense so cial contact and as conv enien t p opulation units for empirical studies[ 1 , 2 ]. Studies that in vestigate within- household transmission studies are common and provide essential data on emerging pathogens, while household structure pla ys an extremely impor- tan t role in determining the onw ard spread of infection. F or example, during the COVID-19 pandemic con tact-tracing studies show ed household contacts to be muc h more likely to b e infected than other t yp es of con tacts (e.g. w orkplace, so cial) [ 3 – 5 ]. Household structure also pla ys a key role in in- terv entions against infection, with non-pharmaceutical in terven tions (NPIs) suc h as stay-at-home orders acting at the household level, and uptak e of v accines sho wing correlations within families. In field studies of household transmission, secondary attack rate (SAR), defined as the n umber of secondary cases divided b y the n umber of con- tacts, is ubiquitous as a measure of transmissibilit y of a pathogen in en- closed settings suc h as households. As a simple ratio, it can be calculated from relativ ely coarse epidemiological data, requiring field researchers only to iden tify cohabitants of an index case and record their infection status o v er a sufficiently long perio d. Despite its widespread use, its simplicit y mak es the SAR po orly suited to capture the complex dynamics of person-to-p erson transmission. Infectious disease dynamics arise from a com bination of biological and so ciological factors whic h the SAR summarises as a single num b er, mean- ing that an estimate of SAR from an outbreak in one p opulation provides v ery limited insigh t in to the b ehaviour of future outbreaks in other p opu- lations with differing demographic and so cial contact structures. Instead, mec hanistic models that are parameterised to reflect the so cial structure of sp ecific p opulations are necessary to analyse infectious disease outbreaks and to make predictions. Mec hanistic mathematical mo dels pla y an important role providing quantitativ e evidence and mo dels ha v e been dev elop ed that tak e account the geographical, gender/sex, age, w orkplace and household structured complexity in social mixing patterns [ 6 – 8 ]. Despite the considerable literature on household-structured epidemic mo delling and due to the ubiquity of the SAR, empirical household transmis- sion studies most often rep ort their data with SAR and logistic regression in 3 mind. These studies often only report the n umber of cases and the n um b er of con tacts, giving little to no information on the household size distribution and only rep orting the mean of the household final size distribution. Because eac h secondary infectious cases increases the infectious pressure on an y re- maining susceptible household members and thus increases their lik eliho o d of b eing infected, the household final size distribution is often bimo dal and so particularly p o orly-describ ed b y its mean (see, e.g. House et al. [ 9 ], Figure 2). While efficient likelihoo d-based metho ds for parameterising mec hanistic mo dels exist and are relatively efficien t in p opulations of the small scale of a household [ 9 , 10 ], they require higher resolution data than just the rep orted o verall SAR. Therefore, in order to estimate p erson-to-p erson transmission rates from these studies we m ust augment the datasets, imputing the house- hold structure that underlies them. Data augmentation and data imputation hav e a long and successful his- tory of use in inference for a v ariety of applications in infectious disease mod- elling [ 11 – 14 ]. The cen tral problem these methods seek to address is that epidemics are rarely fully observ ed, meaning that epidemiological datasets are often incomplete and so we cannot directly ev aluate the mo del likeli- ho o d based on these data alone. Data imputation approaches address this b y augmenting these datasets with the missing information and treating the unobserv ed data as parameters. Standard parameter inference metho ds, of- ten using MCMC, can then b e used with additional steps which prop ose v alues for the missing data and identify scenarios which are most consistent with the outcomes we do observe. In this study we present a nov el mass imputation MCMC metho d for estimating transmission rates from household-stratified datasets at three differen t lev els of detail: lo w information datasets whic h report the total n umber of contacts, the total n umber of secondary cases, and the n umber of households, with no information on ho w infection stratifies b y house- hold size; medium information datasets which report these same outcomes stratified by household size; and high information datasets which rep ort the n umber of contacts and cases for each individual household included in the study . In the high information setting our method resem bles a standard lik eliho o d-based MCMC approac h, while in the lo w and medium informa- tion cases w e impute the n umber of contacts and cases in each household and ev aluate the joint likelihoo ds of sp ecific combinations of parameter and distribution of contacts and cases b y household. Our metho d can incorpo- rate existing kno wledge of background demograph y b y using the household 4 size distribution of the am bient p opulation as a basis for its prop osals of household-sp ecific contact and case n umbers. W e v alidate our metho d b y demonstrating that it is able to successfully reco ver transmission rates from syn thetic data generated with an underly- ing household size distribution tak en from the UK lab our force surv ey[ 15 ] o ver a range of parameter v alues. F or data generated from a pathologi- cal “split” household distribution, results v aried and the algorithm tended to ov erestimated transmission, particularly when density dep endent mixing w as assumed. Ho wev er, in the ma jorit y of cases this could b e corrected b y incorp orating information ab out this pathological household size dis- tribution. W e demonstrate the practical applicability of our approach by estimating parameters from household-based CO VID-19 studies. W e use our high-information method to estimate the transmission rate and density mixing parameter from Carazo et al. [ 16 ], which pro vides a high information set, and compare our estimates when w e aggregate the data and apply our medium- and lo w-information metho ds. W e find that in the lo w informa- tion case, the most common lev el of detail in household studies, there are iden tifiability issues. W e find that con tacts and cases stratified b y household size are necessary to iden tify b oth parameters and that uncertain ty could b e reduced with datasets whic h rep ort the size and n um b er of secondary cases of each household in the study . W e further estimate transmission rates from a n um b er of lo w information datasets taken from a systematic review of household transmission of SARS-CoV-2 [ 2 ]. These results sho w that low-information datasets, whic h are commonly rep orted in household transmission studies, are insufficien t to b e able to parametrise the simplest of mec hanistic mo dels. Contacts and cases strati- fied by household size can resolv e identifiabilit y issues and datasets rep orting the full outcomes of household outbreaks can further reduce uncertain ty in parameter estimates. Metho ds T ransmission mo del The datasets of interest in this analysis come from household transmission studies, with our transmission mo del sim ulating the processes whic h gener- ated the data in a given study . W e make the assumption that each household in the study exp eriences an outbreak seeded by a single primary case with 5 all subsequent infections coming as a result of within-household transmis- sion and that the maximum household size is m + 1. The infectious p erio d of an individual, denoted I , is a random v ariable which follows a sp ecified distribution whic h is the same for all individuals. During this infectious p erio d, individuals mak e infectious con tact with each other individual in their household according to a P oisson pro cess with a rate given by the parametric, form dependent on the size of the household, β n = β n η for n = 1 , . . . , m, (1) where n is the n umber of initially s usceptible individuals (one less than the size of the household), hereafter referred to as c ontacts , β > 0 is the base transmission rate and η ∈ [0 , 1] is a mixing parameter. This particular para- metric form for a transmission rate dep endent on the size of the household is often considered [ 17 , 18 ] and allo ws us to mov e b etw een t wo common mixing assumptions: density dependence ( η = 0) and frequency dep endence ( η = 1) [ 19 – 21 ]. Throughout this paper w e assume that I ∼ Gamma( a, a ) (2) so that E [ I ] = 1 and the exp ected num b er of secondary cases pro duced by a primary case in a household with n initial susceptible contacts is β n . In particular w e will consider the cases when a = 1, so that I ∼ Exponential(1) and the infectious disease dynamics are Mark ovian, the limit for a → ∞ , so that w e hav e a constan t infectious p eriod I ≡ 1, and an in termediate case, namely when a = 2. Datasets Dep ending on ho w results from such studies are rep orted, datasets can v ary in the amoun t of information that is rep orted. In this pap er we only consider “final size” data, i.e. w e ignore information ab out when individuals b ecome infected and fo cus on how man y initially susceptible individuals, named c on- tacts , are recorded as ha ving been infected by the end of the p erio d during whic h the household was observ ed, hence having b ecome c ases . W e consider three categories of final size datasets distinguished b y the lev el of informa- tion reported, with this information lev el denoted by a letter J ∈ { L, M , H } . Datasets of a given information lev el are denoted D J . Lo w-information datasets ( J = L ) rep ort the total num b er of contacts, cases and households in a study , so that the dataset consists of just three integers. Medium- information datasets ( J = M ) rep ort the num b er of con tacts and cases, 6 stratified by the size of the household, i.e. three integers for eac h house- hold size observ ed in the data, including the household size itself. High information datasets ( J = H ) rep ort the num b er of households with each observ ed com bination of the n umber of initially susceptible contacts and the final num b er of cases. In this case, data ma y b e rep orted in a upp er (or lo wer) triangular table. Given the fo cus is on estimating within-household transmission, w e do not consider fully susceptible households (generally the situation in case-ascertainmen t studies) or households with only one mem- b er. Lik eliho o d Giv en a dataset D , we aim to estimate the p osterior of the transmission parameters θ = ( β , η ), π ( θ | D ), using a Marko v chain Monte Carlo metho d. Throughout the fitting pro cess we assume a fixed v alue of a (either a = 1 , 2 or a → ∞ ) and so all dep endence on a is suppressed, including in the ab ov e p osterior. W e first construct a lik eliho o d that can b e ev aluated using a high-information dataset. Let Z denote the final size of a household outbreak. W e can use results from Ball [ 22 ] to construct a set of triangular equations which can b e solved to determine the exact distribution of Z conditional on n and θ : j X z =0 n − z j − z P ( Z = z | n, θ ) ϕ ( β ( j − n )) 1+ z = n j j = 0 , 1 , . . . , n, (3) where ϕ ( t ) = E [exp ( I )] = 1 − t a a is the moment generating function of the infectious p erio d. Alternativ e methods for the computation of the final size distribution could b e found in House, Ross, and Sirl [ 23 ]. In order to derive our lik eliho o d more easily , w e define an enumeration map f , so that we can write out a high-information dataset as a vector of coun ts: f ( n, z ) = n − 1 X i =1 ( i + 1) + z = ( n − 1)( n + 2) 2 + z (4) This en umeration orders household outcomes first in terms of the num b er of con tacts and secondly in terms of the num b er of cases among these con tacts. Giv en households of size one are not included in the dataset, the enumer- ation starts counting from f (1 , 0) = 0 , f (1 , 1) = 1 and f (2 , 0) = 2, which are the indices corresp onding to households with 1 contact and 0 cases, 1 7 con tact and 1 case and 2 con tacts and 0 cases resp ectively . The counting stops at K = f ( m, m ), which corresp onds to a household of size m + 1 in whic h eac h of the m contacts are infected. W e also define the element wise in verses of f , f n and f z , such that f ( f n ( k ) , f z ( k )) = k . W e can therefore represen t a high-information dataset by a flat v ector, C ∈ N K where C k is the n umber of households with f n ( k ) initially susceptible contacts and f z ( k ) non-primary cases. Our aim is no w to estimate the posterior for θ giv en a high-information dataset C , which w e denote as π ( θ | C ). F or some θ we can define a v ector of final size distributions P ( θ ) ∈ [0 , 1] K where P k ( θ ) = P ( Z = f z ( k ) | f n ( k ) , θ ). F or a giv en θ and relatively small m ( ≤ 20, the scale of a household), P ( θ ) is straightforw ard and efficient to calculate by solving the linear system in Equation ( 3 ). W e assume that our study population samples household sizes from an underlying household size distribution, p = ( p 1 , . . . , p m ), where p n is the probability a randomly selected household is size n + 1. W e assume a hierarc hical framew ork for the final size outcomes C : p | α ∼ Dir( α ) (5) C | p , θ ∼ Multi( X C k , P ⊙ f n p ) (6) where α = ( α 1 , . . . , α m ) are c hosen parameters for the Dirichlet distribution and ⊙ f n denotes an element wise pro duct where the k th en try of the first v ector is m ultiplied b y the f n ( k ) th . Note that if p = ( p 1 , . . . , p m ) ∼ Dir( α ) then E ( p i ) = α i α 0 where α 0 = P m i =1 α i . Also, larger v alues of α 0 corresp ond to smaller v ariance of p and so in some sense larger α 0 represen ts higher confidence that the household size distribution in the study is close to E ( p ). Note that, although this is similar to a Dirichlet-Multinomial distribution, it is not the same b ecause C and p are of length K and m resp ectively . W e can deriv e a likelihoo d similar to that of a Dirichlet-Multinomial as follows: L ( θ | C , α ) = π ( C | θ , α ) (7) = Z S m − 1 π ( C | p , θ ) π ( p | α )d p (8) = Z S m − 1 1 B( C ) K Y k =0 P k ( θ ) p f n ( k ) C k 1 B( α ) m Y n =1 p α n − 1 n d p (9) = 1 B( C ) B ( α ) K Y k =0 P k ( θ ) C k Z S m − 1 K Y k =0 p C k f n ( k ) m Y n =1 p α n − 1 n d p , (10) 8 where S m − 1 is the m − 1 simplex. If we let γ ∈ R m b e defined such that γ n := P f ( n,n ) k = f ( n, 0) C k + α n − 1, then we can reduce the lik eliho o d to the follo wing: π ( C | θ, α ) = 1 B( C )B( α ) K Y k =0 P k ( θ ) C k Z S n − 1 m Y n =1 p γ n n d q (11) = B( γ ) B( C )B( α ) K Y k =0 P k ( θ ) C k . (12) All comp onen ts of this lik eliho o d can b e efficiently calculated, making it suit- able for inference methods that require man y ev aluations of the likelihoo d, suc h as MCMC. Data Augmen tation and MCMC algorithm In the cases that w e only ha ve a medium/low-information dataset D J ( J ∈ { L, M } ) we would lik e to deriv e the p osterior distribution of θ , π ( θ | D J , α ). Ho wev er, we cannot directly ev aluate the corresp onding lik eliho o d L ( θ | D J , α ) = π ( D J | θ , α ). Instead, w e must augmen t the dataset, adding the household structure necessary to use the likelihoo d in Equation ( 12 ). Therefore, w e target π ( θ , C | D J , α ) ∝ π ( D J , C | θ , α ) π ( θ ) = π ( C | θ , α ) π ( θ ) where the equalit y comes from the fact that D J is fully determined b y C . If we can sp ecify an appropriate prop osal distribution q J (( θ ′ , C ′ ) | ( θ , C )) then w e can use a Metropolis-Hastings algorithm with acceptance probability A (( θ ′ , C ′ ) | ( θ, C )) = min 1 , π ( C ′ | θ ′ , α ) π ( θ ′ ) q J (( θ ′ , C ′ ) | ( θ , C )) π ( C | θ , α ) π ( θ ) q J (( θ , C ) | ( θ ′ , C ′ )) (13) to sample from π ( θ , C | D J , α ) from which we can get the marginal distri- bution π ( θ | D J , α ). Note that so long as the prop osal distribution results in an ap erio dic and irreducible chain, the algorithm will target the required distribution. Prop osal Distribution W e no w need to define a prop osal distribution q J : ( R 2 × g J ( D J )) → ( R 2 × g J ( D J )) where g J ( D J ) denotes the set of compatible high-information datasets for a given medium/lo w-information dataset; see App endix A for a formal definition of g J . Note that R 2 × g J ( D J ) is not a standard space, 9 b eing the product of a discrete space g J ( D J ) with specific restrictions, and a more standard space R 2 . In order to efficiently na vigate this space (a voiding prop osing incompatible high-information datasets), we cannot use standard c hoices for the proposal distribution and so need to specify our own. Supp ose w e start with ( θ 0 , C 0 ). W e split our prop osal into tw o cases. On each iteration of the MCMC algorithm either a new set of transmission parameters is prop osed (and C 1 = C 0 ) or a new household structure is pro- p osed (and θ 1 = θ 0 ) with probabilit y s and 1 − s resp ectively , where s is a c hosen h yp er-parameter. In the case where w e prop ose a new set of transmission parameters, w e use a 2-dimensional Gaussian prop osal distribution centred on θ suc h that θ 1 ∼ N ( θ 0 , Σ). In this paper we either consider the case when η is known/assumed and we only fit β in which case Σ = ( σ 0 0 0 ) for some σ > 0 or the case where w e fit b oth parameters with indep endent Gaussian proposals with iden tical standard deviation so Σ = ( σ 0 0 σ ). The case in which we prop ose a new household structure is more compli- cated. Since the space g J ( D J ) is dep endent on the information level of the dataset, we sp ecify different prop osal distributions for J = L and J = M . The intuition for q L is that individual contacts, either infected or not, are selected at random from the current household structure C 0 and then mo ved from their household to another. The resulting household structure, C 1 , is still compatible with D L (i.e., C 1 ∈ g L ( D L )). This pro cess is represented pictorially in Figure 1 and the full algorithm is presen ted in Algorithm 1 . The algorithm for q M is shown in Algorithm 2 . Once a new pair ( θ 1 , C 1 ) is prop osed, it is accepted with the probability given in Equation ( 13 ). Priors The prior on θ in Equation ( 13 ), π ( θ ), is the pro duct of the tw o independent priors for β and η , i.e. π ( θ ) = π 1 ( β ) π 2 ( η ). There are no restrictions on the distributions of π 1 or π 2 . How ev er, throughout this pap er we do constrain π 1 and π 2 based on the meaning of the parameters: β is a rate, so is assumed ≥ 0, and although its range might b e larger (ev en in v olving negativ e v alues), w e assume η ∈ [0 , 1]. Syn thetic data generation In order to v alidate our metho d, we generated synthetic high-information datasets with known parameters. W e try to re-estimate β (with η known 10 Figure 1: Visual representation of the low information prop osal algorithm (Algorithm 1 ). Three hy p othetical prop osal steps are shown for a low in- formation dataset with ( N , y , n ) = (24 , 25 , 45) and maximum n um b er of household contacts m = 3. On the left hand side the indices of the out- comes are shown with pictorial represen tations of these outcomes (primary cases, secondary cases and non-cases sho wn b y red Ps, red p ositive signs and blue negative signs resp ectively). Each prop osal mov e is represen ted by a blac k arro w starting at an en try in the previous pseudo-dataset and ending at an en try in the new pseudo-dataset with their resp ective indices b eing k 1 and k 2 in the notation of Algorithm 1 . The infectious status of the contact b eing mov ed is shown by the sym b ol on the arrow. Data that ha ve changed in a step are sho wn as white digits with a black outline. and fixed) from the corresp onding low-information dataset. Each of these datasets had N = 1000 households with sizes sampled from a given house- hold size distribution. W e used t wo household distributions, one informed b y the 2023 UK Lab our F orce Surv ey (LFS) [ 15 ] and another one constructed to b e an extreme example in which most of the households are either of the smallest or largest p ossible size, i.e. 2 or 6 (Figure S1 ). In b oth cases, the 11 maxim um n umber of contacts is m = 5. F or details of these distributions see App endix C . The final sizes of each outbreak are then sampled from the probabilities in Equation ( 3 ). Results V alidation on synthetic data W e first v alidated our metho d, estimating β from syn thetic datasets gener- ated from known parameters and Equation ( 3 ). W e used 12 differen t param- eter sets and t w o differen t household distributions, generating 100 datasets of 1000 households for eac h of these 24 combinations. When estimating β from each synthetic dataset we fixed η to the v alue used to generate the dataset. In each case, the parameters for the Dirichlet distribution ( α ) were c hosen so that they w ere prop ortional to (i.e. the exp ectation of the Diric hlet w as equal to) the household size distribution used to generate the datasets. The v alue of α 0 = P m n =1 α n determines how “concentrated” the Diric hlet distribution is, reflecting ho w confident w e are that the household size dis- tribution in the study is similar to what we hav e sp ecified as the mean of the Dirichlet. The smaller the v alue of α 0 the further and more easily the household distribution can w ander from the mean of the Diric hlet during the inference. W e fit the mo del for α 0 = 100 (less constrained) and α 0 = 1000 (v ery constrained). Eac h panel in Figure 2 sho ws the 95% confidence in terv als for β estimated from 100 synthetic datasets for a giv en parameter set with β determined by the column and η determined by the ro w. Within each panel are results for syn thetic datasets generated from the UK household size distribution and the extreme “split” distribution (with fixed probabilities, i.e. corresp onding to the limit for α 0 → ∞ ) and then b oth fitted with either α 0 = 100 or 1000. F or data generated from the UK household distribution the correct pa- rameters are recov ered b y the MCMC algorithm at least 95% of the time for all 12 v alues of θ . Ho wev er, when data is generated from the “split” distribu- tion, the MCMC systematically o v erestimates β when η = 0 or 0 . 5 and α 0 = 100. F or θ = (0 . 2 , 0) , (0 . 5 , 0) , (1 . 5 , 0) , (2 , 0) , (0 . 5 , 0 . 5) , (1 . 5 , 0 . 5) and (2 . 0 , 0 . 5) only 17% , 1% , 44% , 83% , 85% , 42% and 63% of the datasets, resp ectively , re- sult in p osteriors with 95% in terv als that ov erlap with the actual v alues, with most datasets resulting in ov erestimate. F or η = 1, there was a ten- dency for β to b e underestimated when data was generated from the split 12 Figure 2: In each subplot, 95% confidence interv als for the p osteriors of β are sho wn for 100 low information synthetic datasets p er household size distri- bution. Eac h dataset is generated using the Ball model ( I ∼ Gamma(2 , 2)) and 1000 household sizes sampled either from the UK LFS (2023) or the “split” household size distribution (see App endix C for details), and β is re-estimated separately using α 0 = 100 and 1000. Eac h subplot row sho ws results for synthetic data generated with differen t base transmission rate ( β ) and eac h column shows results for different density mixing parameter ( η ). The v alue of β used to generate the data is sho wn b y the v ertical dot- ted line in eac h plot and confidence interv als are plotted in green if this v alue is contained in them and red otherwise. The p ercentage of syn thetic datasets for which the real v alue is contained within the 95% confidence in terv al is sho wn ab ov e each set of confidence interv als. All fits were done with I ∼ Gamma(2 , 2) for a kno wn η and so only β was inferred. 13 distribution with α 0 = 100. How ev er, if an α 0 is increased to 1000, more strongly restricting the MCMC not to w ander too far from the split house- hold size distribution, then the v alue of β can b e recov ered successfully in all cases (minim um of 93% cov erage with the 95% confidence interv als). Fitting to real data Carazo et al. [ 16 ] rep orts final size data from outbreaks in households of health care work ers in Qu´ eb ec, Canada during the COVID-19 pandemic. In that pap er, final size data is rep orted in sufficient detail for a high informa- tion dataset. This allo wed us to compare the p erformance of the MCMC metho d on a real dataset at eac h of the three information levels. F or the c hoice of α we used data from the 2021 Census in Canada which rep orted the n um b er of households of each size in Qu´ ebec. W e use this distribu- tion in the same w a y as the UK LFS (see Appendix C ); how ev er, giv en the census aggregated all households with 5 or more individuals, w e arbitrarily distributed these households b et ween sizes 5 and 6 at a 2:1 ratio. When estimating parameters, w e used an α centred around such a distribution, with a v alue of α 0 = 200. P anels A, B and C in Figure 3 show the p osteriors of simultaneously fit- ting b oth β and η to the lo w- (red), medium- (green) and high-information datasets (blue). F or each of these we had a Uniform(0 , 1) prior on η and an improp er (p ositiv e reals) prior on β . These results are for I ∼ Gamma(2 , 2), though the same plots for I ∼ Exp onen tial(1) and I = 1 are sho wn in Fig- ures S3 and S4 resp ectively . In Panel A of Figure 3 we can see that there w ere issues of identifiabilit y b etw een β and η , with the MCMC exploring all v alues of η (limited by assumption to [0,1]) and so w e end up with a strip of v alues compatible with the observed SAR. With the medium-information dataset, instead, the parameters were clearly iden tifiable and the p osterior w as similar to, though slightly less certain than, the one obtained from the high-information dataset. The means in these t wo cases – shown by the blac k cross – are 0.57 (0.51, 0.63) for β and 0.70 (0.59, 0.80) for η for the medium-information dataset, and 0.57 (0.52, 0.62) for β and 0.70 (0.61, 0.78) for η for the high information one. The error bars in P anel D of Figure 3 show the estimated ov erall and size-stratified SAR for eac h of the information lev els and the observed v alues from Carazo et al. [ 16 ]. The low-information confidence interv als do app ear to cov er the observed v alue for all sizes but are v ery wide, particularly for 14 Figure 3: Panels A, B, and C show the p osterior distributions of θ obtained b y fitting the low- (red), medium- (green), and high-information (blue) v er- sions of the data from Carazo et al. [ 16 ], resp ectiv ely . Panel D displa ys the secondary attack rate (SAR) implied by each p osterior alongside the ob- serv ed SAR (black) for each household size and ov erall; error bars represent 95% confidence in terv als, with those for the observ ed SAR estimated by b o otstrapping. Bars indicate the num b er of households of each size in the lo w-information fit (red) compared with the observed distribution (black). The infectious perio d is assumed to b e I ∼ Gamma(2 , 2). 15 households of size 2 or 6. Adding higher levels of information incrementally narro ws the confidence interv als and the medium- and high-information SAR estimates are close to those observed in the data. The bar chart in Panel D of Figure 3 shows the num ber of households of each size from the low in- formation fit (red) and the observed data (black). The error bars cov er the observ ed household size distribution showing that the algorithm prop osed sensible household structures. Next we estimated the transmission rate for SARS-CoV-2 from 23 pap ers [ 16 , 24 – 41 ] (25 datasets) that rep orted at least a low-information dataset. These were chosen from the ones considered b y a series of literature reviews b y Madewell et al. [ 2 , 42 , 43 ]. F or each of these low information datasets w e fit the base transmission rate ( β ) using the p osterior for η obtained from Carazo et al. [ 16 ], com bined with an improper prior on β , as the prior, i.e. π ( θ ) = π 2 ( η ), and again assuming I ∼ Gamma(2 , 2). Using this prior was necessary given the issues with iden tifiability found for the low information case in Figure 3 , but the resulting MCMC chains mixed well, as shown in the trace plots in Figures S5 , S6 and S7 .. F or each study we found census or similar data to inform α (in all cases, the data w as size-w eighted as discussed in App endix C and we used α 0 = 200). Each lo w information dataset and source for α can b e found in the supplementary CSV file. Figure 4 shows the estimates and 95% confidence interv als for each of these studies, split by v ariant – where specified – and colour co ded accordingly . The table on the righ t hand size also sho ws the rep orted SAR, mean household size (including index cases) and num b er of households ( N ) for each study . W e successfully inferred transmission parameters from low-information datasets, combining prior information on η (in this case obtained from the high-information dataset in Carazo et al. [ 16 ]) and information on the house- hold size distribution of the relev an t populations in α . W e w ere able to prop erly account for differences in household size and b etter quan tify the uncertain ty compared to binomial confidence in terv als often used for SARs. Discussion In this pap er we present a no vel MCMC algorithm for estimating epidemio- logical parameters from household final size data of different lev els of coarse- ness. This includes very coarse “low information” datasets whic h only report the total n umber of household, total n um b er of household contacts and total 16 Figure 4: Base transmission rates ( β ) for SARS-CoV-2 estimated from lo w-information datasets from studies included in Madewell et al. [ 2 , 42 , 43 ], colour coded and grouped by strain where it was specified. A table on the right hand side lists the SAR, a v erage household size and n umber of households in eac h of these studies. 17 n umber of secondary cases. Without using an inference metho d to impute a household size distribution, like the MCMC approach we presen t here, it w ould not b e p ossible to fit a mechanistic mo del to datasets of this kind. W e first fit to low-information datasets deriv ed from synthetically generated high-information dataset in order to assess the algorithm’s ability to reco ver the transmission rate, b efore applying the metho d to real data. Insigh ts F or data generated from a realistic household size distribution – in this pa- p er, from the UK Labour F orce Surv ey (UK LFS) – the algorithm reliably reco vered the transmission rate β for datasets generated from a range of parameter v alues of b oth β and η . F or all 12 parameter combinations we considered, the true v alue of β w as co vered b y the 95% confidence in terv al. Ho wev er, the p erformance on data from an unrealistic “split” household size distribution, in whic h the ma jority of households had either 2 or 6 individu- als (but the mean size of ≈ 3 was the same), v aried significantly for differen t parameters. The p erformance was particularly p o or in the case that α 0 = 100 (rel- ativ ely w eak constrain t on the household size distributions of a study from whic h a lo w-information dataset is collected). This is because the syn thetic data w as generated from the “split” household size distribution, but the MCMC w as not strongly constrained to preserv e that distribution and w as settling on a distribution that include a significant num b er of households with sizes closer to the mean. Giv en the strong non-linear relationship b et ween the transmission rate and the SAR for different household sizes (see Figure S2 ), the same v alue of β leads to a different SAR from the “split” distribution compared to the distribution the MCMC is settling on. In particular, for η = 0 (bottom left panel of Figure S2 ), the former is larger than the latter, so to fit the same SAR, a significan tly larger β is required with the new distribution compared to the original one, causing the observed o verestimation. This discrepancy is strongest for v alues of β around 0.5, in line with what can be seen in Figure 2 . The same qualitative discrepancy is seen for η = 0 . 5, though to a m uch less extent and reaching its maximum around 1.5, resulting in consistent o verestimation for β = 1 . 5 and β = 2 . 0. The discrepancy , instead, is minimal for η = 1, and in terestingly is slightly reversed, which is compatible with 18 the slight underestimation observ ed in this case in Figure 2 . Note that in all cases increasing the v alue of α 0 constrained the MCMC to not drift aw a y from the “split” distribution, leading to muc h b etter re- sults, with a minimum of 93% of confidence in terv als con taining the actual v alues of β . These results demonstrate the imp ortance of the size distribution of the household study , which are often not reported in lo w-information datasets. In the absence of that information, assumptions need to be made about suc h a distribution and the confidence we hav e in it. Ho w ever, such assumptions in teract in non-trivial wa ys with the amoun t of person-to-p erson transmis- sion and the assumed w ay in which transmission scales with the household size. Ho wev er, although discrepancies emerge particularly strongly when mixing is close to b eing densit y-dep enden t ( η close to 0), for respiratory diseases such as SARS-CoV-2, which is the example used in this pap er, it is common to assume frequency dep endent mixing ( η = 1). In this case, estimates app ear to b e muc h less affected by deviations from the true dis- tribution of the household size in the studies the low-information dataset comes from. W e then in v estigated the difference in p erformance when estimating b oth β and η simultaneously from lo w-, medium- and high-information datasets from Carazo et al. [ 16 ]. The resulting p osteriors on ( β , η ) demonstrate that at least medium-information datasets are necessary to confidently identify b oth simultaneously and that simply rep orting low-information datasets is insufficien t for transmission parameters to b e estimated with any confidence in the absence of informativ e priors. How ev er, if detailed data on the final size of outbreaks in each household cannot b e rep orted in full, then total con tacts and cases stratified by household size could still b e used to iden tify b oth the transmission rate and the densit y mixing parameter almost to the same level of accuracy . Finally , w e estimated β for a n um b er of household studies whic h provided at least lo w-information datasets taken from the Madew ell et al. systematic reviews [ 2 , 42 , 43 ]. Only 25 of the 144 studies included in these reviews a) tried to iden tify co-primary cases, b) dealt with susp ected co-primaries in a w ay that preserved the household sizes and c) rep orted sufficient detail so that a lo w-information dataset could b e extracted or deriv ed. Using existing inference metho ds, we w ould not b e able to estimate trans- 19 mission rates from these datasets. How ev er, the framework presen ted in this pap er allows us to consolidate these low-information final size datasets with household size data and prior information on the densit y mixing parame- ter to estimate transmission rates. Compared to the Binomial confidence in terv als often used for estimates of SAR, the error bars in Figure 4 more accurately show the uncertain t y in our estimates. Limitations In order to fit a transmission mo del to low-information data w e were required to k eep the model very simple. One limitation comes from our assumption of a single primary case. Man y household studies limit the perio d of follow up in order to limit the risk of m ultiple indep endent introductions. Ho wev er, m ultiple primary cases are not uncommon, e.g. when m ultiple residen ts at- tend so cial ev ents together and so would b e at risk of con tracting a disease from a shared source. One of the common uses of household-stratified transmission surveys is to estimate the relative susceptibility and infectivity asso ciated with char- acteristics such as age, sex and ethnicity . In this work, how ev er, we lim- ited ourselv es to identical individuals, all mixing homogeneously within eac h household, b oth to simplify the mo del for testing and to ensure that g J ( D J ) w as not to o large. How ev er, future work could relax this assumption and p erform a similar inv estigation to estimate the effects of risk factors on transmission. Conclusion In this pap er w e presented an MCMC algorithm that can estimate trans- mission rates from v ery coarse datasets of the type commonly rep orted in household transmission studies. W e successfully reco vered parameters from syn thetic datasets and estimated transmission rates for COVID-19 under an assumption of frequency-dependent mixing from a num b er of household studies. Ho wev er, we also demonstrated that, in the absence of prior information on the parameter controlling how p erson-to-p erson transmission v aries with household size, low information datasets are generally insufficient to resolve b oth this mixing parameter and the transmission rate. This result suggests researc hers conducting household transmission studies can greatly improv e 20 the usefulness of their results for parametrising mechanistic mo dels by re- p orting con tacts and cases stratified by household size and, if possible, full information on the frequency of eac h outcome. Ac kno wledgmen ts TH, LP , and JH are supp orted by the W ellcome T rust (Grant Number 227438/Z/23/Z). TH and LP are also supp orted b y the Medical Research Council (Grant Number UKRI483). JB is an affiliate, JH is a fellow and TH and LP are mem b ers of the JUNIPER partnership , which is supp orted b y the Medical Researc h Council (gran t num b er MR/X018598/1). Co de a v ailabilit y All co de used to implemen t the methods describ ed in this pap er and to generate all figures can be found in this GitHub rep ository . 21 A F ormal Definitions of g J ( D J ) Medium-Information Datasets Let a medium information dataset b e enco ded in a tuple of tw o v ectors D M = ( n , z ) ∈ N m × N m where n ≥ z element-wise. W e can then define a function g M : N m × N m → P ( N K ) (where P ( S ) denotes the p ow er set of of a set S ) suc h that g M ( D M ) is the set of high information datasets compatible with the medium information dataset D M . F or the purp ose of a formal definition of g M w e define tw o matrices A , B ∈ N m × K where A i,k = ( i if f ( i, 0) ≤ k ≤ f ( i, i ) 0 otherwise (14) B i,k = ( k − f ( i, 0) if f ( i, 0) ≤ k ≤ f ( i, i ) 0 otherwise (15) Then g M (( n , z )) = { C ∈ N K : ( A C = n ) ∧ ( BC = z ) } (16) Similar to the high-information case, since P f ( n,n ) k = f ( n, 0) C k is the same for all n = 1 , . . . , m and C ∈ g M (( n , z )) the choice of α do es not effect the lik eliho o d. Lo w-information Datasets Let a low-information dataset b e enco ded b y three in tegers in a 3-tuple, denoted D L = ( N , n, z ) ∈ N 3 where z ≤ n and N ≤ n ≤ mN . N denotes the num ber of households, n the total num ber of contacts and z the total n umber of cases. W e can then again define a function g L : N 3 → P ( N K ) where g L ( D L ) is the set of high information datasets compatible with the low information datasets. F or low information datasets we will define g L ( D L ) similarly to ho w we did for g M ( D M ). W e define tw o v ectors, w and v ∈ N K , suc h that w k = f n ( k ) and v k = f z ( k ) and so g L (( N , n, z )) = { C ∈ N K : ( ⟨ C , w ⟩ = n ) ∧ ( ⟨ C , v ⟩ = z ) ∧ ( ⟨ C , 1 ⟩ = N ) } (17) 22 B Prop osal algorithms for q M and q L 23 Algorithm 1 Lo w information new frequency coun t pseudo-dataset pro- p osal algorithm Require: C = ( C 0 , . . . , C K ) ∈ N K − 1 1: Calculate the total num b er of households with at least 2 con tacts, S 1 = X { k : f n ( k ) ≥ 2 } C k . 2: Calculate a probabilit y distribution prop ortional to the num b er of house- holds with each outcome and conditioned on there b eing at least 2 con- tacts P 1 = (0 , 0 , C 2 S 1 , . . . , C K S 1 ) 3: Sample an index k 1 from P 1 and denote f n ( k 1 ) and f z ( k 1 ) b y n 1 and z 1 resp ectiv ely . 4: Uniformly draw a random n umber u inf ∈ [0 , 1] 5: if u inf ≤ z 1 n 1 then an infected contact is chosen and we set s = 1 else set s = 0 6: Calculate the total num b er of households with few er than m contacts, S 2 = X { k : f n ( k ) ≤ ( m − 1) } c k 7: Calculate another probabilit y vector P 2 conditioned on there b eing fewer than the maxim um n um b er of con tacts suc h that for k = 0 , . . . , K : P 2 k = c k S 2 − 1 if f n ( k ) < m and k = k 1 c k − 1 S 2 − 1 if k = k 1 0 otherwise (18) 8: Dra w a second index, k 2 , from P 2 and denote f n ( k 2 ) and f z ( k 2 ) by n 2 and z 2 resp ectiv ely . 9: Define a new frequency count partition C ∗ = ( c ∗ 0 , . . . , c ∗ K ) by starting with a cop y of the original frequency count partition, C . 10: Up date the en tries c ∗ k 1 and c ∗ k 2 in turn by subtracting 1. Note it could b e that k 1 = k 2 in which case w e subtract 2 from c ∗ k 1 in this step. 11: Define k 3 = f ( n 1 − 1 , z 1 − s ) and k 4 = f ( n 2 + 1 , z 2 + s ) . 12: Up date the en tries c ∗ k 3 and c ∗ k 4 in turn b y adding 1. Note that it is p ossible that k 3 = k 2 and/or k 4 = k 1 and the entries at those indices are unchanged. 24 Algorithm 2 Medium information new partition prop osal algorithm Require: C = ( c 0 , . . . , c K ) 1: Calculate the total num b er of cases in households with at least 2 con- tacts, S 1 = P K k =2 c k f z ( k ) 2: Calculate a probabilit y distribution prop ortional to the num b er of house- holds with each outcome and conditioned on there b eing at least 2 con- tacts P 1 = (0 , 0 , 0 , c 3 S 1 , 2 c 4 S 1 , . . . , mc K S 1 ) 3: Sample an index k 1 from P 1 . Denote f n ( k 1 ) and f z ( k 1 ) b y n 1 and z 1 resp ectiv ely . 4: Calculate the num b er of non-case contacts in households with n 1 con- tacts S 2 = f ( n 1 ,n 1 ) X k = f ( n 1 , 0) ( n 1 − f z ( k )) c k 5: Calculate another probability v ector P 2 conditioned on their b eing n 1 con tacts and w eighted b y the n umber of non-case contacts. F or k = 0 , . . . , K : P 2 k = c k ( n 1 − f z ( k )) S 2 − 1 if f n ( k ) = n 1 and k = k 1 ( c k − 1)( n 1 − f z ( k )) S 2 − 1 if k = k 1 0 otherwise (19) 6: Dra w a second index, k 2 , from P 2 7: Define a new frequency count partition C ∗ = ( c ∗ 0 , . . . , c ∗ K ) by starting with a cop y of the original frequency count partition, C . 8: Up date the en tries c ∗ k 1 and c ∗ k 2 in turn by subtracting 1. Note it could b e that k 1 = k 2 in which case w e subtract 2 from c ∗ k 1 in this step. 9: Up date the en tries c ∗ k 1 − 1 and c ∗ k 2 +1 in turn b y adding 1. C Household size distributions W e generated synthetic data for t wo v ery differen t household distributions: the UK Labour F orce Surv ey (LFS) from 2023 and an unrealistic “split” dis- tribution in whic h most households are either comp osed of 2 or the maximum (6) n um b er of individuals. In the former case, we to ok the UK LFS house- hold size distribution, remov ed the probability of selecting a household of size 1 and re-normalised the distribution, obtaining the probability P k UK that 25 a randomly selected household in our population is of size k ( k = 2 , . . . , 6). W e then deriv ed the size-weigh ted distribution, i.e. we made the probability of a household included dataset ha ving k con tacts prop ortional to k P k U K . This is the household distribution we would exp ect to see in our study if an infection spreading betw een household could be describ ed w ell b y a homoge- neously mixing model in the general population, so that eac h individual in the population is equally likely b e infected, and further assuming the same probabilit y for eac h case to b e detected and hav e their household included in the study . The split distribution was selected so that the mean of the size- w eighted distribution was the same as the size- w eighted distribution based on the UK LFS. See Figure S1 for the size-w eigh ted distributions used. 26 D Pro cedure for selecting appropriate pap ers from Madew ell et al. [ 2 ] F rom the full list of pap ers that estimated secondary attac k rate in Madew ell et al. [ 2 ], we first filtered out papers, so that the remaining ones w ere compat- ible with the mo del assumptions and that w e could extract a lo w-information dataset, according to the follo wing criteria: 1. In eac h household, all con tacts m ust ha ve b een tested or screened for symptoms; 2. There was a pro cedure for iden tifying the primary and co-primary cases, e.g. based on symptom onset or epidemiological inv estigation; 3. If co-primary cases w ere suspected, those individuals were not remo ved without the rest of the household b eing remov ed, as this w ould ha ve left the household partially depleted; 4. There was sufficient information rep orted to deduce a total n umber of con tacts, cases and households. P ap ers that received ad-ho c consideration were the following: • Hsu et al. [ 27 ]: SAR w as rep orted for family clusters, not households. Here, the w ork w as included, with clusters treated as households. • Singana yagam et al. [ 44 ]: SARs were rep orted for 127 households and 138 index cases. It was not p ossible to disen tangle the contacts in households with a single primary cases from the information provided and so this pap er w as not included. • Dub et al. [ 38 ]: Different SARs, based on PCR testing and symptoms, w ere reported. W e included the data from PCR testing. • T anak a et al. [ 45 ]: The num b er of households with a single primary case in the study w as not clear and so this pap er w as not included. • P ark et al. [ 46 ]: The num b er of households was not explicitly stated and so was estimated from the mean household size and total n umber of contacts as 225 / 2 . 3 ≈ 98 households. The pap er w as included. 27 References [1] Jo ¨ el Mossong et al. “So cial con tacts and mixing patterns relev an t to the spread of infectious diseases”. In: PL oS me dicine 5.3 (2008), e74. [2] Zac hary J. Madewell et al. “Household Secondary Attac k Rates of SARS-CoV-2 by V ariant and V accination Status: An Up dated System- atic Review and Meta-analysis”. In: JAMA Network Op en 5.4 (Apr. 2022), e229317. issn : 2574-3805. doi : 10 . 1001 / jamanetworkopen . 2022 . 9317 . url : https : / / doi . org / 10 . 1001 / jamanetworkopen . 2022.9317 (visited on 12/08/2023). [3] Camino T roba jo-Sanmart ´ ın et al. “Differences in T ransmission be- t ween SARS-CoV-2 Alpha (B.1.1.7) and Delta (B.1.617.2) V ariants”. In: Micr obiolo gy Sp e ctrum 10.2 (Apr. 2022), e00008–22. doi : 10.1128/ spectrum . 00008 - 22 . url : https : / / journals . asm . org / doi / full / 10.1128/spectrum.00008- 22 (visited on 03/06/2025). [4] Bin u Areek al et al. “Risk F actors, Epidemiological and Clinical Out- come of Close Con tacts of CO VID-19 Cases in a T ertiary Hospital in Southern India”. en. In: JOURNAL OF CLINICAL AND DIAGNOS- TIC RESEARCH (2021). issn : 2249782X. doi : 10.7860/JCDR/2021/ 48059 . 14664 . url : https : / / jcdr . net / article _ fulltext . asp ? issn = 0973 - 709x & year = 2021 & volume = 15 & issue = 3 & page = LC34 & issn=0973- 709x&id=14664 (visited on 09/02/2024). [5] Hafizuddin Aw ang et al. “A case-con trol study of determinan ts for CO VID-19 infection based on contact tracing in Dungun district, T ereng- gan u state of Malaysia”. eng. In: Infe ctious Dise ases (L ondon, Eng- land) 53.3 (Mar. 2021), pp. 222–225. issn : 2374-4243. doi : 10.1080 / 23744235.2020.1857829 . [6] F rank Ball, Lorenzo Pellis, and Pieter T rapman. “Repro duction num- b ers for epidemic mo dels with households and other so cial structures I I: Comparisons and implications for v accination”. In: Mathematic al Bioscienc es 274 (Apr. 2016), pp. 108–139. issn : 0025-5564. doi : 10 . 1016/j.mbs.2016 . 01.006 . url : https://www. sciencedirect .com/ science/article/pii/S0025556416000171 (visited on 03/11/2026). [7] Lorenzo Pellis, Neil M. F erguson, and Christophe F raser. “Epidemic gro wth rate and household reproduction n umber in communities of households, schools and w orkplaces”. en. In: Journal of Mathematic al Biolo gy 63.4 (Oct. 2011), pp. 691–734. issn : 1432-1416. doi : 10.1007/ 28 s00285- 010- 0386- 0 . url : https://doi.org/10.1007/s00285- 010- 0386- 0 (visited on 03/11/2026). [8] Jo e Hilton et al. “A computational framew ork for mo delling infectious disease p olicy based on age and household structure with applications to the CO VID-19 pandemic”. en. In: PLOS Computational Biolo gy 18.9 (Sept. 2022), e1010390. issn : 1553-7358. doi : 10.1371/journal. pcbi . 1010390 . url : https : / / journals . plos . org / ploscompbiol / article?id=10.1371/journal.pcbi.1010390 (visited on 03/11/2026). [9] Thomas House et al. “Inferring risks of coronavirus transmission from comm unity household data”. en. In: Statistic al Metho ds in Me dic al R ese ar ch 31.9 (Sept. 2022), pp. 1738–1756. issn : 0962-2802. doi : 10. 1177/09622802211055853 . url : https://doi.org/10.1177/09622802211055853 (visited on 09/22/2023). [10] Cheryl Lynn Addy. “The final size distribution for a generalized sto c has- tic epidemic”. English. Ph.D. United States – Georgia: Emory Univer- sit y . isbn : 979-8-206-78084-0. url : https : / / www . proquest . com / docview / 303645195 / abstract / 5DC01521970942ACPQ / 1 (visited on 04/02/2024). [11] Philip D. O’Neill and G. O. Rob erts. “Bay esian Inference for Partially Observ ed Sto c hastic Epidemics”. In: Journal of the R oyal Statistic al So ciety Series A: Statistics in So ciety 162.1 (Jan. 1999), pp. 121– 129. issn : 0964-1998. doi : 10 . 1111 / 1467 - 985X. 00125 . url : https : //doi.org/10.1111/1467- 985X.00125 (visited on 03/11/2026). [12] Chris P . Jewell et al. “Bay esian analysis for emerging infectious dis- eases”. en. In: Bayesian Analysis 4.3 (Sept. 2009), pp. 465–496. issn : 1936-0975, 1931-6690. doi : 10 . 1214 / 09 - BA417 . url : https : / / projecteuclid . org / journals / bayesian - analysis / volume - 4 / issue- 3/Bayesian- analysis- for- emerging- infectious- diseases/ 10.1214/09- BA417.full (visited on 03/11/2026). [13] Colin J. W orby et al. “Reconstructing transmission trees for comm uni- cable diseases using densely sampled genetic data”. en. In: The A nnals of Applie d Statistics 10.1 (Mar. 2016), pp. 395–417. issn : 1932-6157, 1941-7330. doi : 10.1214/15- AOAS898 . url : https://projecteuclid. org / journals / annals - of - applied - statistics / volume - 10 / issue- 1/Reconstructing- transmission- trees- for- communicable- diseases- using- densely- sampled- genetic/10.1214/15- AOAS898. full (visited on 03/11/2026). 29 [14] P anayiota T ouloup ou et al. “Ba y esian inference for m ultistrain epi- demics with application to ESCHERICHIA COLI O157:H7 in feed- lot cattle”. In: The Annals of Applie d Statistics 14.4 (Dec. 2020), pp. 1925–1944. issn : 1932-6157, 1941-7330. doi : 10.1214/20- AOAS1366 . url : https:// projecteuclid .org/journals/annals- of- applied- statistics / volume - 14 / issue - 4 / Bayesian - inference - for - multistrain - epidemics - with - application - to - ESCHERICHIA - COLI/10.1214/20- AOAS1366.full (visited on 12/16/2024). [15] ONS. F amilies and households statistics explaine d - Offic e for National Statistics . Ma y 2024. url : https://www.ons.gov.uk/peoplepopulationandcommunity/ birthsdeathsandmarriages/families/articles/familiesandhouseholdsstatisticsexplai ned/ 2021- 03- 02 (visited on 06/18/2024). [16] Sara Carazo et al. “Characterization and ev olution of infection con trol practices among severe acute respiratory corona virus virus 2 (SARS- CoV-2)–infected healthcare w orkers in acute-care hospitals and long- term care facilities in Qu ´ eb ec, Canada, Spring 2020”. en. In: Infe ction Contr ol & Hospital Epidemiolo gy 43.4 (Apr. 2021), pp. 481–489. issn : 0899-823X, 1559-6834. doi : 10 . 1017 / ice . 2021 . 160 . url : https : / / www . cambridge . org / core / journals / infection - control - and - hospital - epidemiology / article / characterization - and - evolution - of - infection - control - practices - among - severe - acute - respiratory - coronavirus - virus - 2 - sarscov2infected - healthcare - workers - in - acutecare - hospitals - and - longterm - care- facilities- in- quebec- canada- spring- 2020/895664B9DFCE12450BEBB56958336393 (visited on 02/02/2024). [17] S. Cauchemez et al. “A Bay esian MCMC approac h to study transmis- sion of influenza: application to household longitudinal data”. en. In: Statistics in Me dicine 23.22 (2004). eprint: h ttps://onlinelibrary .wiley .com/doi/p df/10.1002/sim. 1912, pp. 3469–3487. issn : 1097-0258. doi : 10.1002/sim.1912 . url : https: / /onlinelibrary . wiley . com / doi/ abs / 10 . 1002 /sim . 1912 (visited on 09/17/2025). [18] Janis Antono vics, Y oh Iwasa, and Michael P . Hassell. “A Generalized Mo del of Parasitoid, V enereal, and V ector-Based T ransmission Pro- cesses”. In: The A meric an Natur alist 145.5 (1995), pp. 661–675. issn : 0003-0147. url : https : / / www . jstor . org / stable / 2462994 (visited on 03/24/2026). [19] Mart C. M de Jong, Odo Diekmann, and Hans Heesterb eek. “Ho w Do es T ransmission of Infection Depend on Population Size?” In: Epi- 30 demic mo dels: their structur e and r elation to data . Cambridge Univ er- sit y Press, 1995, pp. 84–94. [20] Hamish McCallum, Nigel Barlow, and Jim Hone. “Ho w should pathogen transmission b e mo delled?” English. In: T r ends in Ec olo gy & Evolu- tion 16.6 (June 2001), pp. 295–300. issn : 0169-5347. doi : 10 . 1016 / S0169 - 5347(01 ) 02144 - 9 . url : https : / / www . cell . com / trends / ecology- evolution/abstract/S0169- 5347(01)02144- 9 (visited on 03/24/2026). [21] C. J. Rho des and R. M. Anderson. “Contact rate calculation for a ba- sic epidemic mo del”. In: Mathematic al Bioscienc es 216.1 (No v. 2008), pp. 56–62. issn : 0025-5564. doi : 10 . 1016 / j . mbs . 2008 . 08 . 007 . url : https : / / www . sciencedirect . com / science / article / pii / S0025556408001260 (visited on 03/24/2026). [22] F rank Ball. “A Unified Approach to the Distribution of T otal Size and T otal Area under the T ra jectory of Infectiv es in Epidemic Mo dels”. In: A dvanc es in Applie d Pr ob ability 18.2 (1986), pp. 289–310. issn : 0001-8678. doi : 10 . 2307 / 1427301 . url : https : / / www . jstor . org / stable/1427301 (visited on 12/07/2023). [23] Thomas House, Josh ua V. Ross, and David Sirl. “Ho w big is an out- break likely to b e? Methods for epidemic final-size calculation”. In: Pr o c e e dings of the R oyal So ciety A: Mathematic al, Physic al and En- gine ering Scienc es 469.2150 (F eb. 2013), p. 20120436. doi : 10. 1098/ rspa . 2012 . 0436 . url : https : / / royalsocietypublishing . org / doi/full/10.1098/rspa.2012.0436 (visited on 04/11/2024). [24] Mokh tar R. Gomaa et al. “Incidence, household transmission, and neutralizing an tib o dy seroprev alence of Coronavirus Disease 2019 in Egypt: Results of a comm unity-based cohort”. en. In: PLOS Patho gens 17.3 (Mar. 2021), e1009413. issn : 1553-7374. doi : 10.1371/journal. ppat.1009413 . url : https:/ /journals.plos.org /plospathogens/ article?id=10.1371/journal.ppat.1009413 (visited on 09/03/2024). [25] Da vid Chun-Ern Ng et al. “Risk factors asso ciated with household transmission of SARS-CoV-2 in Negeri Sembilan, Malaysia”. eng. In: Journal of Pae diatrics and Child He alth 58.5 (May 2022), pp. 769–773. issn : 1440-1754. doi : 10.1111/jpc.15821 . [26] Natc ha W atanap ok asin et al. “T ransmissibility of SARS-CoV-2 V ari- an ts as a Secondary Attac k in Thai Households: a Retrosp ective Study”. eng. In: IJID r e gions 1 (Dec. 2021), pp. 1–2. issn : 2772-7076. doi : 10.1016/j.ijregi.2021.09.001 . 31 [27] Chen-Y ang Hsu et al. “Household transmission but without the comm unit y- acquired outbreak of COVID-19 in T aiw an”. In: Journal of the F or- mosan Me dic al Asso ciation . COVID-19 P andemic: Challenges and Op- p ortunities 120 (June 2021), S38–S45. issn : 0929-6646. doi : 10.1016/ j . jfma . 2021 . 04 . 021 . url : https : / / www . sciencedirect . com / science/article/pii/S0929664621001789 (visited on 02/09/2024). [28] Nathaniel M Lewis et al. “Household T ransmission of SARS-CoV-2 in the United States”. In: Clinic al Infe ctious Dise ases: A n Official Public ation of the Infe ctious Dise ases So ciety of A meric a (Aug. 2020), ciaa1166. issn : 1058-4838. doi : 10.1093/cid/ciaa1166 . url : https: / / www . ncbi . nlm . nih . gov / pmc / articles / PMC7454394/ (visited on 09/03/2024). [29] Jannek e D. M. V erb erk et al. “T ransmission of SARS-CoV-2 within households: a remote prosp ective cohort study in European countries”. en. In: Eur op e an Journal of Epidemiolo gy 37.5 (Ma y 2022), pp. 549– 561. issn : 1573-7284. doi : 10 . 1007 / s10654 - 022 - 00870 - 9 . url : https : / / doi . org / 10 . 1007 / s10654 - 022 - 00870 - 9 (visited on 03/06/2025). [30] Julia M. Bak er et al. “SARS-CoV-2 B.1.1.529 (Omicron) V ariant T rans- mission Within Households — F our U.S. Jurisdictions, Nov em b er 2021–F ebru- ary 2022”. In: Morbidity and Mortality We ekly R ep ort 71.9 (Mar. 2022), pp. 341–346. issn : 0149-2195. doi : 10.15585/mmwr .mm7109e1 . url : https://www .ncbi.nlm .nih. gov /pmc/ articles/PMC8893332/ (visited on 12/11/2023). [31] Neda Jalali et al. “Increased household transmission and immune es- cap e of the SARS-CoV-2 Omicron compared to Delta v ariants”. en. In: Natur e Communic ations 13.1 (Sept. 2022), p. 5706. issn : 2041-1723. doi : 10 . 1038 / s41467 - 022 - 33233 - 9 . url : https : / / www . nature . com/articles/s41467- 022- 33233- 9 (visited on 03/07/2025). [32] Oon T ek Ng et al. “Impact of Sev ere Acute Respiratory Syndrome Corona virus 2 (SARS-CoV-2) V accination and Pediatric Age on Delta V ariant Household T ransmission”. In: Clinic al Infe ctious Dise ases 75.1 (July 2022), e35–e43. issn : 1058-4838. doi : 10 . 1093 / cid / ciac219 . url : https://doi.org/10.1093/cid/ciac219 (visited on 03/06/2025). [33] Ko en M. F. Gorgels et al. “Increased transmissibility of SARS-CoV- 2 alpha v arian t (B.1.1.7) in children: three large primary school out- breaks rev ealed b y whole genome sequencing in the Netherlands”. eng. 32 In: BMC infe ctious dise ases 22.1 (Aug. 2022), p. 713. issn : 1471-2334. doi : 10.1186/s12879- 022- 07623- 9 . [34] Hideo T anak a et al. “Increased T ransmissibility of the SARS-CoV-2 Alpha V arian t in a Japanese Population”. eng. In: International Jour- nal of Envir onmental R ese ar ch and Public He alth 18.15 (July 2021), p. 7752. issn : 1660-4601. doi : 10.3390/ijerph18157752 . [35] Mehgan F T eherani et al. “Burden of Illness in Households With Severe Acute Respiratory Syndrome Corona virus 2–Infected Children”. In: Journal of the Pe diatric Infe ctious Dise ases So ciety 9.5 (Nov. 2020), pp. 613–616. issn : 2048-7207. doi : 10 . 1093 / jpids / piaa097 . url : https://doi.org/10.1093/jpids/piaa097 (visited on 03/07/2025). [36] V arun Sundar and Emmanuel Bhask ar. “Lo w secondary transmission rates of SARS-CoV-2 infection among con tacts of construction la- b orers at open air environmen t”. eng. In: Germs 11.1 (Mar. 2021), pp. 128–131. issn : 2248-2997. doi : 10.18683/germs.2021.1250 . [37] Xiaoli W ang et al. “Basic epidemiological parameter v alues from data of real-world in mega-cities: the c haracteristics of COVID-19 in Bei- jing, China”. In: BMC Infe ctious Dise ases 20.1 (July 2020), p. 526. issn : 1471-2334. doi : 10 . 1186 / s12879 - 020 - 05251 - 9 . url : https : //doi.org/10.1186/s12879- 020- 05251- 9 (visited on 09/03/2024). [38] Timoth ´ ee Dub et al. “High secondary attack rate and p ersistence of SARS-CoV-2 an tib odies in household transmission study participan ts, Finland 2020-2021”. eng. In: F r ontiers in Me dicine 9 (2022), p. 876532. issn : 2296-858X. doi : 10.3389/fmed.2022.876532 . [39] Ra yhaneh Jashaninejad et al. “T ransmission of COVID-19 and its De- terminan ts among Close Contacts of CO VID-19 P atien ts Running ti- tle”. eng. In: Journal of R ese ar ch in He alth Scienc es 21.2 (Apr. 2021), e00514. issn : 2228-7809. doi : 10.34172/jrhs.2021.48 . [40] Tsuy oshi Ogata et al. “Increased Secondary A ttac k Rates among the Household Con tacts of Patien ts with the Omicron V ariant of the Coro- na virus Disease 2019 in Japan”. en. In: International Journal of Envi- r onmental R ese ar ch and Public He alth 19.13 (Jan. 2022). Num b er: 13, p. 8068. issn : 1660-4601. doi : 10.3390/ijerph19138068 . url : https: //www.mdpi.com/1660- 4601/19/13/8068 (visited on 03/06/2025). 33 [41] Miren Rem´ on-Berrade et al. “Risk of Secondary Household T ransmis- sion of CO VID-19 from He alth Care W ork ers in a Hospital in Spain”. eng. In: Epidemiolo gia (Basel, Switzerland) 3.1 (Dec. 2021), pp. 1–10. issn : 2673-3986. doi : 10.3390/epidemiologia3010001 . [42] Zac hary J. Madewell et al. “Household T ransmission of SARS-CoV- 2: A Systematic Review and Meta-analysis”. In: JAMA Network Op en 3.12 (Dec. 2020), e2031756. issn : 2574-3805. doi : 10.1001/jamanetworkopen. 2020 . 31756 . url : https : / / doi . org / 10 . 1001 / jamanetworkopen . 2020.31756 (visited on 12/08/2023). [43] Zac hary J. Madewell et al. “F actors Asso ciated With Household T rans- mission of SARS-CoV-2: An Up dated Systematic Review and Meta- analysis”. In: JAMA Network Op en 4.8 (Aug. 2021), e2122240. issn : 2574-3805. doi : 10.1001/jamanetworkopen.2021.22240 . url : https: / / doi . org / 10 . 1001 / jamanetworkopen . 2021 . 22240 (visited on 12/08/2023). [44] Anik a Singana yagam et al. “Communit y transmission and viral load kinetics of the SARS-CoV-2 delta (B.1.617.2) v ariant in v accinated and un v accinated individuals in the UK: a prosp ective, longitudinal, cohort study”. English. In: The L anc et Infe ctious Dise ases 22.2 (F eb. 2022), pp. 183–195. issn : 1473-3099, 1474-4457. doi : 10.1016/S1473- 3099(21 ) 00648 - 4 . url : https : / / www . thelancet . com / journals / laninf/article/PIIS1473- 3099(21)00648 - 4/fulltext (visited on 12/31/2023). [45] Melissa Lucero T anak a et al. “SARS-CoV-2 T ransmission Dynam- ics in Households With Children, Los Angeles, California”. In: F r on- tiers in Pe diatrics 9 (2022). issn : 2296-2360. url : https : / / www . frontiersin . org / articles / 10 . 3389 / fped . 2021 . 752993 (visited on 12/31/2023). [46] Shin Y oung Park et al. “Coronavirus Disease Outbreak in Call Cen- ter, South Korea”. In: Emer ging Infe ctious Dise ases 26.8 (Aug. 2020), pp. 1666–1670. issn : 1080-6040. doi : 10.3201/eid2608.201274 . url : https : / / www . ncbi . nlm . nih . gov / pmc / articles / PMC7392450/ (visited on 09/03/2024). 34 Supplemen tary plots Figure S1: Size-weigh ted household size distributions that were used to gen- erate syn thetic data. The distribution for the UK LFS (2023) [ 15 ] and the “split” distribution are sho wn in red and blue, re spectively . Both distribu- tions share a mean final size, after size weigh ting, of 3.32 which is shown b y the v ertical black dotted line. F or the purpose of this pap er households rep orted to ha ve 6+ individuals in the UK LFS are assumed to hav e size exactly 6. 35 β = 0 . 2 β = 0 . 5 β = 1 . 5 UK Split UK Split UK Split η = 0 22.2(19.9,24.1)% 25.3(22.4,27.9)% 50.2(47.3,52.7)% 56.9(54.1,60.0)% 84.1(81.9,86.2)% 86.7(84.6,88.4)% η = 0 . 5 13.2(11.7,14.6)% 13.3(11.9,14.9)% 32.6(29.9,34.7)% 33.4(31.0,36.2)% 71.6(68.6,73.9)% 74.3(71.6,76.8)% η = 1 8.6(7.3,10.0)% 8.2(7.0,9.4)% 21.0(19.3,22.9)% 19.6(17.6,21.5)% 52.5(49.8,55.1)% 50.8(47.5,54.4)% β = 2 . 0 UK Split η = 0 80.3(77.6,82.6)% 90.5(88.8,91.9)% η = 0 . 5 80.3(77.6,82.6)% 82.5(80.1,85.0)% η = 1 63.3(60.7,66.5)% 62.1(59.0,66.0)% T able S1: The secondary attack rates (SAR) from the synthetic datasets for each v alue of β , η and household size distribution. These results are obtained for an infectious perio d distribution I ∼ Gamma(2 , 2). Figure S2: In this figure we compare the mean final size (top ro w) and mean SAR (b ottom row) for households of different sizes (2–6) for differen t transmission rates ( β – sho wn on the x-axis). V ertical dashed lines are sho wn at the v alues of β considered in Figure 2 . Eac h row shows results for a differen t v alue of η , ranging from η = 0 (densit y dep endent mixing), η = 0 . 5 and η = 1 (frequency dep endent mixing). W e also show the mean final size/SAR for t wo differen t household size distributions: the “split” distribution sho wn in Figure S1 and the household size distribution that the MCMC chain resulting from the data augmentation pro cess from the lo w- information data settles on: this was derived b y av eraging ov er the c hains obtained from fitting the datasets generated from η = 1, β = 0 . 5, the “split” distribution and α 0 = 100. This mean household size distribution is similar for other v alues of η and β . 36 Figure S3: Panels A, B, and C sho w the posterior distributions of θ obtained b y fitting the low- (red), medium- (green), and high-information (blue) v er- sions of the data from Carazo et al. [ 16 ], resp ectiv ely . Panel D displa ys the secondary attack rate (SAR) implied by each p osterior alongside the ob- serv ed SAR (black) for each household size and ov erall; error bars represent 95% confidence in terv als, with those for the observ ed SAR estimated by b o otstrapping. Bars indicate the num b er of households of each size in the lo w-information fit (red) compared with the observed distribution (black). The infectious perio d is assumed to b e I ∼ Exp onential(1). 37 Figure S4: Panels A, B, and C sho w the posterior distributions of θ obtained b y fitting the low- (red), medium- (green), and high-information (blue) v er- sions of the data from Carazo et al. [ 16 ], resp ectiv ely . Panel D displa ys the secondary attack rate (SAR) implied by each p osterior alongside the ob- serv ed SAR (black) for each household size and ov erall; error bars represent 95% confidence in terv als, with those for the observ ed SAR estimated by b o otstrapping. Bars indicate the num b er of households of each size in the lo w-information fit (red) compared with the observed distribution (black). The infectious perio d is assumed to b e fixed to I = 1. 38 Figure S5: T race plots for β for each of the pap ers reporting low information datasets taken from Madewell et al. (datasets 1-10). Each plot is colour- co ded by the v arian t, as in Figure 4 : blue for pre-Alpha, green for Alpha, orange for Delta, red for Omicron and black for Unsp ecified/Other. Note that the chain lengths ( x -axis ranges) v ary significantly b etw een panels and that all chains mix w ell: those that app ear not to mix well are just muc h shorter than the others and corresp ond to studies with small num b ers of households. 39 Figure S6: T race plots for β for each of the pap ers reporting low information datasets taken from Madew ell et al. (datasets 11-20). Eac h plot is colour- co ded by the v arian t, as in Figure 4 : blue for pre-Alpha, green for Alpha, orange for Delta, red for Omicron and black for Unsp ecified/Other. Note that the chain lengths ( x -axis ranges) v ary significantly b etw een panels and that all chains mix w ell: those that app ear not to mix well are just muc h shorter than the others and corresp ond to studies with small num b ers of households. 40 Figure S7: T race plots for β for each of the pap ers reporting low information datasets taken from Madew ell et al. (datasets 21-28). Eac h plot is colour- co ded by the v arian t, as in Figure 4 : blue for pre-Alpha, green for Alpha, orange for Delta, red for Omicron and black for Unsp ecified/Other. Note that the chain lengths ( x -axis ranges) v ary significantly b etw een panels and that all chains mix w ell: those that app ear not to mix well are just muc h shorter than the others and corresp ond to studies with small num b ers of households. 41
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment